
Understanding Two-Variable Data Analysis

May 6, 2025

Notes: Unit 2 - Exploring Two-Variable Data

Two Categorical Variables

  • Categorical (qualitative) data: Studies often record two categorical variables, which may or may not be associated.
  • Two-way Table/Contingency Table: Displays the counts for every combination of the two variables.

Example 2.1: The Cuteness Factor

  • Study in which 250 volunteers viewed pictures from different categories (baby animals, adult animals, tasty foods).
  • Row Variable: Pictures viewed.
  • Column Variable: Level of focus.
  • Table Total: Sum of all cell values.
  • Marginal Frequencies: Totals for each row and column, used to form proportions/percentages.
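A two-way table with its marginal frequencies can be built directly with pandas. The data below are invented for illustration and are not the actual counts from the 250-volunteer study:

```python
import pandas as pd

# Hypothetical records: each row is one volunteer's picture category
# and observed level of focus (values invented for illustration).
data = pd.DataFrame({
    "picture": ["baby animals"] * 3 + ["adult animals"] * 3 + ["tasty foods"] * 2,
    "focus":   ["high", "high", "low", "high", "low", "low", "low", "high"],
})

# margins=True appends the row and column totals (the marginal frequencies).
table = pd.crosstab(data["picture"], data["focus"], margins=True)
print(table)

# Conditional proportions within each row (picture category):
print(pd.crosstab(data["picture"], data["focus"], normalize="index"))
```

The `normalize="index"` call gives the row-conditional distributions used to compare focus levels across picture categories.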

Two Quantitative Variables

  • Bivariate Quantitative Data Sets: Concerned with relationships between two numerical variables.
  • Scatterplot: Provides visual representation of the potential relationship.
  • Correlation Coefficient: Measures strength of linear relationship.

Example 2.2: Comic Books

  • Scatterplot comparing speed and strength of comic characters.
  • Positive Association: Larger values of one variable tend to accompany larger values of the other.
  • Negative Association: Larger values of one variable tend to accompany smaller values of the other.

Correlation

  • Measures only the strength and direction of a linear relationship; it says nothing about curved patterns.
  • Designated by r: Computed from means and standard deviations as the average product of z-scores, r = (1/(n-1)) * Σ(z_x · z_y).
  • Unit-Free: Unchanged by a change of measurement units, and also unchanged if x and y are swapped.
  • Range: -1 ≤ r ≤ +1.
  • Coefficient of Determination (r²): The proportion of the variation in the observed y-values explained by the regression line; equivalently, the ratio of the variance of the predicted values to the variance of the observed values.
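A minimal sketch of the z-score formula for r, using invented data and checking the result against numpy's built-in correlation:

```python
import numpy as np

# Small illustrative dataset (invented values, not from the notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# r as the average product of z-scores: r = (1/(n-1)) * sum(z_x * z_y).
n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r = np.sum(zx * zy) / (n - 1)

# Same value via numpy's built-in correlation matrix.
r_builtin = np.corrcoef(x, y)[0, 1]
assert abs(r - r_builtin) < 1e-12
```

Because the formula is symmetric in the z-scores, swapping x and y leaves r unchanged.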

Example 2.3: Football Statistics

  • Correlation between Total Points and Yards Gained: r = 0.84, so r² = 0.7056; about 70.6% of the variation in total points is accounted for by the linear relationship with yards gained.

Least Squares Regression

  • Best-fitting Line: Minimizes the sum of squared vertical distances (residuals) between observed and predicted values.
  • Always passes through the point (x̄, ȳ), the means of x and y; its slope is b = r(s_y/s_x).
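The least-squares line can be computed from summary statistics alone: slope b = r(s_y/s_x) and intercept a = ȳ - b·x̄. A sketch with invented data:

```python
import numpy as np

# Invented illustration data; any bivariate sample works.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 4.0, 6.0, 7.5])

r = np.corrcoef(x, y)[0, 1]

# Slope and intercept from the summary-statistic formulas:
#   b = r * (s_y / s_x),  a = ybar - b * xbar
b = r * y.std(ddof=1) / x.std(ddof=1)
a = y.mean() - b * x.mean()

# The fitted line passes through (xbar, ybar) by construction.
assert abs((a + b * x.mean()) - y.mean()) < 1e-12
```

The same line comes out of `np.polyfit(x, y, 1)`, which solves the least-squares problem directly.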

Example 2.4: Teen Survey

  • Regression for relationship between number of friends and evening Facebook checks.
  • Slope interpretation: For each additional friend, the model predicts about 0.5492 more evening Facebook checks; this describes an association, not a causal effect.

Residuals

  • Residual: Observed value minus predicted value, y - ŷ.
  • Positive Residual: The point lies above the line; the model underestimated.
  • Negative Residual: The point lies below the line; the model overestimated.
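Residuals fall out of any fitted line as observed minus predicted. A small sketch with invented data:

```python
import numpy as np

# Invented data; fit a least-squares line, then inspect residuals.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 4.5, 5.0])

b, a = np.polyfit(x, y, 1)       # slope, intercept
predicted = a + b * x
residuals = y - predicted        # observed minus predicted

# Positive residual -> point above the line (model underestimated);
# negative residual -> point below the line (model overestimated).
print(residuals)
```

A useful check: least-squares residuals always sum to (numerically) zero, which is why a residual plot is centered on the horizontal axis.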

Outliers, Influential Points, Leverage

  • Outliers: Points that deviate markedly from the overall pattern of the data.
  • Influential Points: Points whose removal changes the regression line substantially.
  • High Leverage: Points whose x-values lie far from the mean of x; these have the greatest potential to be influential.
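A quick demonstration, with invented points, of how a single high-leverage point can be influential:

```python
import numpy as np

# Four points lying exactly on y = x, plus one point with an x-value
# far from the mean (high leverage) that breaks the pattern.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 1.0])

slope_all, _ = np.polyfit(x, y, 1)          # line fit to all five points
slope_without, _ = np.polyfit(x[:-1], y[:-1], 1)  # line without the extreme point

# Removing the high-leverage point changes the slope substantially
# (here it even flips sign), which is what makes it influential.
print(slope_all, slope_without)
```

With the extreme point removed the slope is exactly 1; including it drags the slope negative, so one point rewrites the whole story the line tells.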

Example 2.5: GPA vs. TV Time

  • Identification of outliers based on regression context.

More on Regression

  • How the regression equation connects to correlation: the slope is b = r(s_y/s_x), so the slope and r always share the same sign, and r² reports how well the line fits.

Example 2.8 & 2.9

  • Calculations using attendance and popcorn sales data for predictions.

Transformations to Achieve Linearity

  • Transformation: When a scatterplot is curved, taking the logarithm of one or both variables (e.g., for exponential or power relationships) can straighten it so that linear regression applies.
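A sketch of how a log transformation straightens exponential growth; the data are constructed so that y grows exponentially in x:

```python
import numpy as np

# Exponential-growth data (invented): y = 3 * 2**x, so log(y) is linear in x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * 2.0 ** x

# Correlation of x with y, versus x with log(y):
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]

# log(y) = log(3) + x*log(2) is exactly linear in x, so r_log is 1
# (up to rounding), while r_raw on the curved raw data is smaller.
print(r_raw, r_log)
```

Fitting a line to (x, log y) and exponentiating back gives the underlying exponential model.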

Example 2.10

  • Population data appears roughly linear at first glance, but a transformed (nonlinear) model may capture the pattern more strongly.

This summary captures key concepts and examples from Unit 2, providing a comprehensive guide to understanding two-variable data analysis.