Notes: Unit 2 - Exploring Two-Variable Data
Two Categorical Variables
- Qualitative (categorical) data: Often involves two categorical variables that may or may not be associated.
- Two-way Table/Contingency Table: Used to display such data.
Example 2.1: The Cuteness Factor
- Study in which 250 volunteers viewed pictures from different categories (baby animals, adult animals, tasty foods) while their level of focus was recorded.
- Row Variable: Pictures viewed.
- Column Variable: Level of focus.
- Table Total: Sum of all cell values.
- Marginal Frequencies: Totals for each row and column, used to form proportions/percentages.
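As a sketch of how marginal frequencies come out of a two-way table, here is a small Python example. The counts below are illustrative placeholders, not the actual Example 2.1 data (though they are chosen to total 250):

```python
# Hypothetical two-way table: rows = picture type, columns = level of focus.
# These counts are made up for illustration; they are not the book's data.
table = {
    "baby animals":  {"high": 60, "low": 23},
    "adult animals": {"high": 38, "low": 45},
    "tasty foods":   {"high": 41, "low": 43},
}

# Marginal frequencies: row totals, column totals, and the table total.
row_totals = {row: sum(cols.values()) for row, cols in table.items()}
col_totals = {}
for cols in table.values():
    for col, count in cols.items():
        col_totals[col] = col_totals.get(col, 0) + count
table_total = sum(row_totals.values())

# Marginal proportions divide each marginal total by the table total.
row_props = {row: t / table_total for row, t in row_totals.items()}
```

Each row or column total sits in the "margin" of the table, which is where the name comes from.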
Two Quantitative Variables
- Bivariate Quantitative Data Sets: Concerned with relationships between two numerical variables.
- Scatterplot: Provides visual representation of the potential relationship.
- Correlation Coefficient: Measures strength of linear relationship.
Example 2.2: Comic Books
- Scatterplot comparing speed and strength of comic characters.
- Positive Association: Larger values of one variable correlate with larger values of another.
- Negative Association: Larger values of one correlate with smaller values of another.
Correlation
- Only measures linear relationships.
- Designated by r; computed from the means and standard deviations of x and y.
- Unit-Free: r has no units and is unchanged if x and y are switched.
- Range: -1 to +1.
- Coefficient of Determination (r²): Proportion of the variation in the observed y-values explained by the regression (equivalently, the ratio of the variance of the predicted values to the variance of the observed values).
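The definition of r as an average of products of z-scores can be sketched in a few lines of Python. The data values here are arbitrary, chosen only to exercise the formula:

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (1/(n-1)) * sum of products of z-scores, using sample SDs."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - xbar) / sx * (y - ybar) / sy
               for x, y in zip(xs, ys)) / (n - 1)

# Arbitrary illustrative data:
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
r_squared = r ** 2
```

Note that `correlation(ys, xs)` gives the same value as `correlation(xs, ys)`, matching the "unaffected by switching x and y" property above.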
Example 2.3: Football Statistics
- Correlation between Total Points and Yards Gained: r = 0.84, r² = 0.7056.
Least Squares Regression
- Best-fitting Line: Minimizes the sum of squares of vertical differences between observed and predicted values.
- Passes through the point (x̄, ȳ), the means of x and y.
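The least-squares line can be computed directly from the summary statistics: slope b = r·(sy/sx) and intercept a = ȳ − b·x̄. A minimal sketch, using the same arbitrary data as above:

```python
from statistics import mean, stdev

def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    r = sum((x - xbar) / sx * (y - ybar) / sy
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * sy / sx          # slope
    a = ybar - b * xbar      # intercept, forcing the line through (xbar, ybar)
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
```

Because a is defined as ȳ − b·x̄, the fitted line is guaranteed to pass through (x̄, ȳ).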
Example 2.4: Teen Survey
- Regression for relationship between number of friends and evening Facebook checks.
- Slope interpretation: Each additional friend is associated with, on average, 0.5492 more evening Facebook checks.
Residuals
- Difference: Between observed and predicted values.
- Positive Residual: Model underestimated.
- Negative Residual: Model overestimated.
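To make the sign convention concrete, here is a short sketch computing residuals against a hypothetical fitted line ŷ = 2.2 + 0.6x (an assumed equation, not from the notes):

```python
# Hypothetical fitted line for illustration: y-hat = 2.2 + 0.6x
def predict(x):
    return 2.2 + 0.6 * x

# (x, observed y) pairs; arbitrary illustrative data.
observed = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]

# Residual = observed y - predicted y-hat.
# Positive residual -> model underestimated; negative -> overestimated.
residuals = [y - predict(x) for x, y in observed]
```

A useful check: for a least-squares line fit to the same data, the residuals always sum to zero.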
Outliers, Influential Points, Leverage
- Outliers: Points with large discrepancies from the pattern.
- Influential Points: Points whose removal changes the regression line significantly.
- High Leverage: x-values far from the mean.
Example 2.5: GPA vs. TV Time
- Identification of outliers based on regression context.
More on Regression
- The regression equation is tied to the correlation: the slope is b = r·(sy/sx), so the sign of the slope matches the sign of r.
Example 2.8 & 2.9
- Calculations using attendance and popcorn sales data for predictions.
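The actual attendance and popcorn-sales figures are not reproduced here, so the sketch below uses made-up numbers purely to show the prediction workflow: fit the least-squares line, then plug a new x-value into ŷ = a + bx:

```python
from statistics import mean, stdev

# Hypothetical (attendance, popcorn sales) pairs -- NOT the book's data.
attendance = [100, 150, 200, 250, 300]
sales = [45, 70, 92, 118, 140]

n = len(attendance)
xbar, ybar = mean(attendance), mean(sales)
sx, sy = stdev(attendance), stdev(sales)
r = sum((x - xbar) / sx * (y - ybar) / sy
        for x, y in zip(attendance, sales)) / (n - 1)
b = r * sy / sx
a = ybar - b * xbar

# Predicted sales for an attendance of 220 (an interpolated x-value):
predicted = a + b * 220
```

Predictions are most trustworthy for x-values inside the range of the data; extrapolating far beyond it is risky.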
Transformations to Achieve Linearity
- Transformation: Logarithmic transformations can reveal linear relationships.
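To illustrate why a logarithmic transformation can linearize data, consider exponential growth y = c·bˣ: taking logs gives log(y) = log(c) + x·log(b), which is linear in x. A minimal sketch with made-up exponential data:

```python
import math

# Exponential data y = 3 * 2**x (made up for illustration).
xs = list(range(6))
ys = [3 * 2 ** x for x in xs]

# After a log transformation, the data lie on a straight line in x:
log_ys = [math.log(y) for y in ys]

# Linear means successive differences are constant; here each step is ln(2).
diffs = [log_ys[i + 1] - log_ys[i] for i in range(len(log_ys) - 1)]
```

If the transformed scatterplot looks linear, fit the least-squares line to (x, log y) and transform back for predictions.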
Example 2.10
- Population data appears roughly linear at first glance, but a nonlinear (e.g., exponential) model may fit more strongly.
This summary captures key concepts and examples from Unit 2, providing a comprehensive guide to understanding two-variable data analysis.