Lecture Notes: Quantifying Data Variation
Recap of Previous Lecture
- Discussed three metrics for quantifying data:
Quantifying Data Variation
Arithmetic Mean
- Formula: ( \bar{x} = \frac{\Sigma x_i}{n} ), where ( n ) is the number of observations.
- For a population, replace ( \bar{x} ) with ( \mu ) and ( n ) with ( N ).
- Transformation examples:
- ( y = ax ) leads to ( \bar{y} = a\bar{x} )
- ( y = c + x ) results in ( \bar{y} = c + \bar{x} )
- ( y = c + ax ) provides ( \bar{y} = c + a\bar{x} )
- Caveat: sensitive to outliers. For example, the data 1, 1, 1, 2, 2, 20 yield ( \bar{x} = 4.5 ), which is larger than every observation except the outlier.
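The transformation rule and the outlier caveat can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the constants `a` and `c` are hypothetical values chosen only for illustration.

```python
from statistics import mean

x = [1, 1, 1, 2, 2, 20]       # data from the outlier example above
a, c = 3.0, 10.0              # hypothetical constants, for illustration only

x_bar = mean(x)
y = [c + a * xi for xi in x]  # apply y = c + a*x to every observation
y_bar = mean(y)

print(x_bar)                  # 4.5
print(y_bar, c + a * x_bar)   # 23.5 23.5

# Dropping the single outlier shows how strongly it pulls the mean.
x_without_outlier = [xi for xi in x if xi != 20]
print(mean(x_without_outlier))  # 1.4
```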
Geometric Mean
- Formula: ( \sqrt[n]{\Pi x_i} ), defined for positive values ( x_i )
- Example values: 15, 10, 5, 8, 17, 100
- Arithmetic mean = 25.8
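- Geometric mean = 14.7 (closer to the bulk of the data, because the outlier 100 has less influence)
- Property: for positive data, geometric mean ( \leq ) arithmetic mean, with equality only when all values are equal.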
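The two means for the example values can be reproduced as follows. This is a sketch assuming Python 3.8 or later, where `statistics.geometric_mean` and `math.prod` are available; the manual line simply applies the n-th root formula as a cross-check.

```python
import math
from statistics import mean, geometric_mean

values = [15, 10, 5, 8, 17, 100]   # example values from the notes

am = mean(values)
gm = geometric_mean(values)

# Cross-check: n-th root of the product of the values.
gm_manual = math.prod(values) ** (1 / len(values))

print(round(am, 1))   # 25.8
print(round(gm, 1))   # 14.7
print(gm <= am)       # True
```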
Median
- Middle value of an ordered data set.
- Position of the median in the ordered data: ( 0.5(n + 1) )
- If ( n ) is even, this position falls between two values; average the two middle values.
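The rule above can be written as a short function. This is one possible sketch; the function name `median_of` and the two test lists are hypothetical and not part of the lecture.

```python
def median_of(values):
    """Return the median: the middle value of the ordered data,
    or the average of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position 0.5 * (n + 1) is index mid (0-based).
        return ordered[mid]
    # Even n: average the two values on either side of the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([5, 1, 3]))      # 3
print(median_of([4, 1, 3, 2]))   # 2.5
```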
Mode
- Most frequently occurring value in the data set.
- Useful for large datasets; less so for small datasets.
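A small sketch of finding the mode by counting occurrences is given below. The helper `modes` and both sample lists are hypothetical; the second list shows why the mode can be uninformative for small data sets, where ties are common.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency, sorted."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, k in counts.items() if k == top)

print(modes([2, 3, 2, 5, 2, 3]))   # [2]
print(modes([1, 1, 4, 4, 7]))      # [1, 4]  (a tie: no single mode)
```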
Comparing Mean, Median, and Mode
- Mode is most informative for large datasets, where values repeat often.
- Mean is sensitive to outliers.
- Median is less sensitive to outliers.
Examples
- Example data set: 1, 2, 3, 2, 4, 2, 8, 3, 6, 3, 2, 5, 45, 36, 89
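- Median = 3 (the 8th of the 15 ordered values)
- Mode = 2 (occurs four times)
- Mean ( \approx 14.1 ), so Mean > Median > Mode: the large values 36, 45, and 89 pull the mean upward.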
- When the data come from two subpopulations, none of the mean, median, or mode is a meaningful summary of the whole data set.
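The three measures for the example data set can be recomputed directly from the listed values with the standard `statistics` module. The second list, `two_groups`, is a hypothetical data set added only to illustrate the two-subpopulation remark.

```python
from statistics import mean, median, mode

# Example data set from the notes.
data = [1, 2, 3, 2, 4, 2, 8, 3, 6, 3, 2, 5, 45, 36, 89]

print(sorted(data))
print("mean  :", round(mean(data), 2))
print("median:", median(data))
print("mode  :", mode(data))

# Hypothetical data with two well-separated subpopulations.
two_groups = [1, 2, 2, 3, 21, 22, 22, 23]

# Mean and median both come out at 12, a value near which
# no observation actually lies.
print(mean(two_groups), median(two_groups))
```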
Variability Measures
Range
- Formula: Maximum - Minimum
- Sensitive to outliers, since it depends only on the two extreme values; it may not represent the spread of the rest of the data well.
Mean Absolute Deviation
- Formula: ( \frac{\Sigma |x_i - \bar{x}|}{n} )
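Both measures are straightforward to compute. The sketch below reuses the data 1, 1, 1, 2, 2, 20 from the arithmetic mean section; the function names are hypothetical.

```python
def value_range(values):
    """Range: maximum minus minimum."""
    return max(values) - min(values)

def mean_absolute_deviation(values):
    """Average absolute distance of the observations from their mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(abs(x - x_bar) for x in values) / n

x = [1, 1, 1, 2, 2, 20]

print(value_range(x))                         # 19
print(round(mean_absolute_deviation(x), 3))   # 5.167
```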
Standard Deviation
- More commonly used than the mean absolute deviation as a measure of variability.
- Formula for population: ( \sigma = \sqrt{\frac{\Sigma (x_i - \mu)^2}{N}} )
- Formula for sample: ( s = \sqrt{\frac{\Sigma (x_i - \bar{x})^2}{n-1}} )
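- Dividing by ( n-1 ) instead of ( n ) corrects the tendency of the sample formula to underestimate the population variance; the difference matters most for small sample sizes.
- Variance: the square of the standard deviation, ( \sigma^2 ) for a population and ( s^2 ) for a sample.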
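The difference between the two formulas is only the divisor, as the following sketch shows. The data values are hypothetical, and the results are compared against `statistics.pstdev` and `statistics.stdev` from the standard library.

```python
import math
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, not from the lecture

n = len(data)
x_bar = sum(data) / n
ss = sum((x - x_bar) ** 2 for x in data)   # sum of squared deviations

sigma = math.sqrt(ss / n)         # population formula: divide by N
s = math.sqrt(ss / (n - 1))       # sample formula: divide by n - 1

print(round(sigma, 3), round(s, 3))   # 2.0 2.138

# The library functions implement the same two formulas.
print(math.isclose(sigma, pstdev(data)), math.isclose(s, stdev(data)))  # True True
```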
Transformation of Standard Deviation
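- ( y = ax ) results in ( s_y = |a| \times s_x ) (a standard deviation is never negative, so the absolute value of ( a ) is needed)
- ( y = c + x ) results in ( s_y = s_x ): shifting the data does not change its spread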
- For grouped data, weight each value by its frequency when calculating the mean and variance.
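The scaling and shifting rules, and the frequency-weighted calculation for grouped data, can be checked with a short sketch. All data values, the constants `a` and `c`, and the frequency table are hypothetical. The grouped formulas used here are mean = ( \Sigma f_i x_i / n ) and sample variance = ( \Sigma f_i (x_i - \bar{x})^2 / (n-1) ) with ( n = \Sigma f_i ), which is one common way to apply the frequencies.

```python
import math
from statistics import mean, stdev, variance

x = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
a, c = -3, 10                  # hypothetical constants; a is negative on purpose

scaled = [a * xi for xi in x]
shifted = [c + xi for xi in x]

# Scaling multiplies the standard deviation by |a|; shifting leaves it unchanged.
print(math.isclose(stdev(scaled), abs(a) * stdev(x)))   # True
print(math.isclose(stdev(shifted), stdev(x)))           # True

# Grouped data: each distinct value with its frequency (hypothetical table).
values = [1, 2, 3, 4]
freqs = [2, 5, 2, 1]

n = sum(freqs)
g_mean = sum(f * v for v, f in zip(values, freqs)) / n
g_var = sum(f * (v - g_mean) ** 2 for v, f in zip(values, freqs)) / (n - 1)

print(g_mean)            # 2.2
print(round(g_var, 3))   # 0.844

# Cross-check by expanding the table back into raw observations.
expanded = [v for v, f in zip(values, freqs) for _ in range(f)]
print(math.isclose(g_mean, mean(expanded)), math.isclose(g_var, variance(expanded)))  # True True
```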
Conclusion
- Arithmetic mean, median, and mode have specific use cases.
- Standard deviation is vital for understanding data dispersion.
- The population and sample standard deviations differ in the divisor: ( N ) for a population, ( n-1 ) for a sample.
- Adding a constant to the data does not change the standard deviation; multiplying the data by a factor ( a ) scales it by ( |a| ).
Thank you for your attention. We will continue in the next lecture.