Understanding Data Variability and Spread

Mar 26, 2025

Lecture Notes on Measures of Spread in Data

Introduction

  • Purpose: Understand how to use numbers to describe the spread of data.
  • Variability: How much the values within a data sample differ from one another.
    • Low variability: data points are similar.
    • High variability: data points are different.
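
As a small illustration of low versus high variability, the sketch below compares two made-up samples that share the same mean. The sample values and the helper names `mean` and `value_range` are assumptions for illustration only.

```python
# Two hypothetical samples with the same mean but very different variability.
low_variability = [49, 50, 50, 51, 50]
high_variability = [10, 90, 30, 70, 50]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def value_range(values):
    """Distance between the largest and smallest value."""
    return max(values) - min(values)

for name, sample in [("low", low_variability), ("high", high_variability)]:
    print(name, "mean:", mean(sample), "range:", value_range(sample))
# low mean: 50.0 range: 2
# high mean: 50.0 range: 80
```

Both samples are centered on 50, so a measure of central tendency alone cannot tell them apart; a measure of spread can.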

Measures of Spread

  1. Maximum and Minimum

    • Simplest measure of spread.
    • Together they give the range of the data (maximum minus minimum).
    • Sensitive to outliers (e.g., a single extreme value changes the maximum or minimum, and therefore the range, substantially).
  2. Quartiles

    • Median (Q2): Middle value of the ordered data.
    • First Quartile (Q1): Median of the lower half of the ordered data; about 25% of values fall below it.
    • Third Quartile (Q3): Median of the upper half of the ordered data; about 75% of values fall below it.
    • Quartiles are robust to outliers.
    • The quartiles split the ordered data into quarters, each containing about 25% of the data.
  3. Variance

    • The mean of the squared deviations from the mean.
    • Population variance divides the sum of squared deviations by 'N'; sample variance divides by 'N-1' (the degrees of freedom).
      • N-1: Corrects for the bias that arises when estimating a population's variance from a sample.
    • Expressed in squared units of the data, which makes it less intuitive to interpret.
  4. Standard Deviation (SD)

    • Square root of variance.
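    • Easier to interpret: it is in the same units as the data and describes the typical deviation from the mean.
    • For roughly bell-shaped data, most values fall within one SD of the mean; values more than two or three SDs away are often treated as outliers.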
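
The four measures above can be sketched with Python's standard `statistics` module. The data set and the function name `spread_summary` are hypothetical, and `method="inclusive"` is only one of several quartile conventions, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics

# Hypothetical sample of eight values (an assumption for illustration).
data = [2, 4, 4, 4, 5, 5, 7, 9]

def spread_summary(values):
    """Return the measures of spread from the list above for `values`."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "population_variance": statistics.pvariance(values),  # divides by N
        "sample_variance": statistics.variance(values),       # divides by N - 1
        "population_sd": statistics.pstdev(values),           # sqrt of pvariance
        "sample_sd": statistics.stdev(values),                # sqrt of variance
    }

summary = spread_summary(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For this sample the mean is 5 and the squared deviations sum to 32, so the population variance is 32 / 8 = 4 while the sample variance is 32 / 7, which is slightly larger because of the N-1 divisor.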

Identifying Outliers

  • 1.5 IQR Rule: Identifies potential outliers.
    • Calculate the interquartile range, IQR = Q3 - Q1, and multiply it by 1.5.
    • Maximum cutoff: Q3 + (1.5 * IQR).
    • Minimum cutoff: Q1 - (1.5 * IQR).
    • Values outside these cutoffs are potential outliers.
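
The rule above can be sketched in a few lines. The function name `iqr_outliers` and the reaction-time values are hypothetical, and the quartiles again use the `method="inclusive"` convention, which is an assumption; a different quartile convention can move the cutoffs slightly.

```python
import statistics

def iqr_outliers(values):
    """Return (lower cutoff, upper cutoff, flagged values) under the 1.5 IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # minimum cutoff
    upper = q3 + 1.5 * iqr   # maximum cutoff
    flagged = [x for x in values if x < lower or x > upper]
    return lower, upper, flagged

# Hypothetical reaction times in milliseconds, with one suspicious value.
reaction_times = [310, 325, 330, 340, 345, 350, 360, 375, 900]

lower, upper, flagged = iqr_outliers(reaction_times)
print(lower, upper, flagged)  # 285.0 405.0 [900]
```

Here Q1 = 330 and Q3 = 360, so IQR = 30 and the cutoffs are 330 - 45 = 285 and 360 + 45 = 405. The rule only flags 900 as a potential outlier; it does not decide what to do with it.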

Handling Outliers

  • Possible Causes:
    • Typographical errors.
    • Measurement errors.
    • Incorrect identification of the sample.
    • Legitimate extreme values.
  • Deciding on Outliers:
    • Inclusion: Keep the data point if it is a valid observation from the population of interest.
    • Exclusion: Remove the data point if it is not representative of the population or was caused by an error.
    • The decision should be case-specific and explained.
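
One way to support and explain such a decision is to compare summary statistics with and without the flagged value. The sketch below is an illustration under assumed data, not a prescribed procedure; the `summarize` helper and the reaction times are hypothetical.

```python
import statistics

def summarize(values):
    """Central tendency and spread for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": median,
        "sd": statistics.stdev(values),  # sample SD (N - 1)
        "iqr": q3 - q1,
    }

# Hypothetical reaction times in milliseconds; 900 is the flagged value.
reaction_times = [310, 325, 330, 340, 345, 350, 360, 375, 900]

with_outlier = summarize(reaction_times)
without_outlier = summarize([x for x in reaction_times if x != 900])

for key in with_outlier:
    print(f"{key}: with={with_outlier[key]:.2f} without={without_outlier[key]:.2f}")
```

In this sample the mean and SD change much more than the median and IQR when the value is dropped, which reflects the robustness of quartile-based measures and gives concrete numbers to cite when justifying inclusion or exclusion.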

Summary

  • Numbers can describe the spread of data, just as they describe central tendency.
  • Measures of spread show how much participants (data points) vary.
  • They are important tools for identifying and handling outliers.