Extrapolation and Outliers: Key Points from Lecture

Jun 20, 2024

Extrapolation and Outliers

Extrapolation

  • Definition: Making predictions outside the range of the observed data.
  • Example: Predicting a student’s GPA based on study hours per week.
  • Using X to predict Y:
    • Within Range: For a student who studies 7 hours/week, the predicted GPA is ~3.6. This is valid because 7 hours falls within the observed range (1 to 10 hours).
    • Outside Range: For 15 hours/week, the predicted GPA is ~6.1, which is impossible (the maximum GPA is 4.5).
  • Caution: Avoid extrapolation. The fitted relationship is only supported within the data range, so predictions outside it are unreliable.
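
The GPA example can be sketched as a linear prediction wrapped in a range check. The lecture does not give the regression equation, so the slope and intercept below are an assumption, back-solved from the two quoted predictions (about 3.6 at 7 hours and about 6.1 at 15 hours). The name `predict_gpa` is hypothetical.

```python
# Assumed line through the two predictions quoted in the notes:
# (7 h, ~3.6) and (15 h, ~6.1). These coefficients are illustrative only.
SLOPE = (6.1 - 3.6) / (15 - 7)
INTERCEPT = 3.6 - SLOPE * 7

X_MIN, X_MAX = 1.0, 10.0   # observed range of study hours (from the notes)


def predict_gpa(hours):
    """Return (prediction, is_extrapolation) for a given number of study hours."""
    prediction = INTERCEPT + SLOPE * hours
    is_extrapolation = not (X_MIN <= hours <= X_MAX)
    return prediction, is_extrapolation


for hours in (7, 15):
    gpa, extrapolated = predict_gpa(hours)
    label = "extrapolation, do not trust" if extrapolated else "within data range"
    print(f"{hours} h/week -> predicted GPA {gpa:.2f} ({label})")
```

Returning the flag alongside the prediction, rather than refusing to predict, leaves the decision to the caller while making the extrapolation visible.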

Outliers

  • Definition: Data points that lie far from the other data points in a dataset.
  • Types of Outliers:
    • In the Y-direction: Far from the main cluster of data vertically.
    • In the X-direction: Far from the main cluster of data horizontally.

Examples of Outliers

  • Mass of Data Points: The central cluster that contains most of the data.
  • X-direction Range: Minimum value = 0.3, maximum value = 4.2.
    • Outlier: X-value outside 0.3 to 4.2.
    • Point A: X = 2 (not an X-outlier).
    • Points B & C: Outside 0.3 to 4.2 (X-outliers).
  • Y-direction Range: Minimum value = 0.4, maximum value = 4.5.
    • Outlier: Y-value outside 0.4 to 4.5.
    • Point C: Y = 3 (not a Y-outlier).
    • Points A & B: Outside 0.4 to 4.5 (Y-outliers).
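
The range checks above can be sketched as follows. Only the two ranges, X = 2 for Point A, and Y = 3 for Point C come from the notes. Every other coordinate is a made-up placeholder chosen to match the described classification, and `classify` is a hypothetical helper.

```python
X_RANGE = (0.3, 4.2)   # X range of the main cluster (from the notes)
Y_RANGE = (0.4, 4.5)   # Y range of the main cluster (from the notes)

# Coordinates marked "assumed" are placeholders, not values from the lecture.
points = {
    "A": (2.0, 7.0),   # X = 2 from the notes; Y assumed
    "B": (8.0, 8.0),   # both coordinates assumed
    "C": (8.0, 3.0),   # X assumed; Y = 3 from the notes
}


def classify(x, y):
    """Return the directions ('X', 'Y') in which a point lies outside the cluster's range."""
    directions = []
    if not (X_RANGE[0] <= x <= X_RANGE[1]):
        directions.append("X")
    if not (Y_RANGE[0] <= y <= Y_RANGE[1]):
        directions.append("Y")
    return directions


for name, (x, y) in points.items():
    print(name, classify(x, y) or "not an X- or Y-outlier")
```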

Summary of Points

  • Point A: Outlier in the Y-direction.
  • Point B: Outlier in both X and Y directions.
  • Point C: Outlier in the X-direction.
  • Point D: Not an outlier in X or Y, but a bivariate outlier (it falls outside the overall pattern of the data points).
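
A point like D passes both range checks, so spotting it requires comparing it with the trend of the cluster. The sketch below is one possible way to do that, not the lecture's method: it uses a synthetic cluster, a hypothetical coordinate for D, and an assumed rule that flags a point when its vertical distance from the fitted line exceeds three times the spread of the cluster's residuals.

```python
import numpy as np

# Synthetic cluster (not the lecture's data) scattered around y = x.
x = np.arange(0.5, 4.01, 0.25)
y = x + 0.1 * (-1.0) ** np.arange(x.size)

slope, intercept = np.polyfit(x, y, 1)              # trend of the main cluster
residual_sd = np.std(y - (intercept + slope * x))   # typical vertical scatter


def is_bivariate_outlier(px, py, threshold=3.0):
    """Flag a point whose distance from the trend exceeds `threshold` residual spreads."""
    residual = py - (intercept + slope * px)
    return abs(residual) > threshold * residual_sd


d = (2.5, 0.5)   # hypothetical Point D: inside both ranges, off the pattern
in_x_range = 0.3 <= d[0] <= 4.2
in_y_range = 0.4 <= d[1] <= 4.5
print(in_x_range, in_y_range, is_bivariate_outlier(*d))
```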

Impact of Outliers on Regression

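  • X-Outliers: Can greatly influence the regression line because they have high leverage, especially when they fall away from the trend of the other points.
  • Y-Outliers: Typically affect the regression line much less, since their X-values lie within the range of the other data.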

Examples:

  1. Without Outliers: Regression line follows the general trend.
  2. Including Point A (Y-Outlier): Slight shift in line.
  3. Including Point C (X-Outlier): Drastic change in the regression line (influential outlier).
  4. Including Point B (X and Y Outlier): Minimal change if it falls along the original trend.
  5. Including Point D (Bivariate Outlier): Similarly small effect, since it is not an outlier in either X or Y.
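
The five cases above can be reproduced in a small experiment. This is a sketch on synthetic data: the cluster and the coordinates of A, B, C, and D are hypothetical, chosen only to mimic the lecture's description, and the least-squares fit uses NumPy's `polyfit`.

```python
import numpy as np

# Synthetic cluster (not the lecture's data): 15 points scattered around y = x.
x = np.arange(0.5, 4.01, 0.25)
y = x + 0.1 * (-1.0) ** np.arange(x.size)

# Hypothetical coordinates chosen to mimic points A-D in the notes.
extra_points = {
    "A (Y-outlier)": (2.0, 7.0),
    "C (X-outlier, off trend)": (8.0, 3.0),
    "B (X- and Y-outlier, on trend)": (8.0, 8.0),
    "D (bivariate outlier)": (2.5, 0.5),
}


def fit_line(xs, ys):
    """Least-squares slope and intercept of y on x."""
    slope, intercept = np.polyfit(xs, ys, 1)
    return slope, intercept


base_slope, base_intercept = fit_line(x, y)
print(f"no outliers: slope={base_slope:.3f}, intercept={base_intercept:.3f}")

# Refit with each extra point added on its own and record the slope change.
slope_shift = {}
for label, (px, py) in extra_points.items():
    slope, intercept = fit_line(np.append(x, px), np.append(y, py))
    slope_shift[label] = abs(slope - base_slope)
    print(f"with {label}: slope={slope:.3f}, intercept={intercept:.3f}")
```

Under these assumed coordinates, C pulls the slope well below the baseline, while B, which lies on the trend, leaves it almost unchanged. Different coordinates would give different shift sizes.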