Example: Studying the influence of gender on preferred newspaper.
Descriptive Statistics: Summarizes the sample data without making generalizations about the population.
Inferential Statistics: Draws conclusions about the population from sample data, with a quantified degree of uncertainty.
Descriptive Statistics Breakdown
Measures of Central Tendency
Mean: Arithmetic average of a data set.
Median: Middle value of the sorted data set (the average of the two middle values when the number of values is even).
Mode: Most frequently occurring value.
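A minimal sketch of the three measures using Python's standard `statistics` module. The data set is made up for illustration (weekly hours spent reading a newspaper) and does not come from a real survey.

```python
import statistics

# Hypothetical sample: weekly reading hours for nine people.
# The single large value (17) pulls the mean upward but not the median.
hours = [2, 3, 3, 3, 4, 6, 7, 9, 17]

mean_hours = statistics.mean(hours)      # sum of the values divided by their count
median_hours = statistics.median(hours)  # middle value of the sorted data
mode_hours = statistics.mode(hours)      # most frequently occurring value

print("mean:", mean_hours)
print("median:", median_hours)
print("mode:", mode_hours)
```

In this sample the three measures differ, which is typical for skewed data: the mean reacts to the outlier, while the median and mode do not.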
Measures of Dispersion
Standard Deviation: Square root of the variance; indicates how far data points typically lie from the mean.
Variance: Average squared deviation of the data points from the mean.
Range: Difference between maximum and minimum values.
Interquartile Range (IQR): Difference between the third and first quartiles (Q3 - Q1), i.e., the range spanned by the middle 50% of the data.
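The dispersion measures can be computed with the standard `statistics` module, as in the sketch below. The data are invented for illustration. Note two assumptions: `pvariance`/`pstdev` treat the data as a whole population (divide by n), while `variance` uses the sample formula (divide by n - 1); and quartiles depend on the interpolation convention, here the module's default "exclusive" method, so other tools may report slightly different IQR values.

```python
import statistics

# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = statistics.pvariance(data)  # mean squared deviation (divide by n)
population_std = statistics.pstdev(data)          # square root of the population variance
sample_variance = statistics.variance(data)       # divide by n - 1 instead of n

value_range = max(data) - min(data)               # maximum minus minimum

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print("variance:", population_variance, "std dev:", population_std)
print("sample variance:", sample_variance)
print("range:", value_range, "IQR:", iqr)
```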
Tables and Charts
Frequency Tables: Display how often each distinct value appears.
Contingency Tables: Analyze and compare the relationship between two categorical variables.
Charts: Bar charts, pie charts, histograms, box plots, violin plots, and raincloud plots.
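As an illustration of the two table types, the sketch below builds a frequency table and a contingency table with `collections.Counter`. The responses, the category labels, and the newspaper names ("Newspaper A", "Newspaper B") are all hypothetical, loosely following the gender and preferred newspaper example.

```python
from collections import Counter

# Hypothetical survey responses: (gender, preferred newspaper).
responses = [
    ("female", "Newspaper A"),
    ("male", "Newspaper A"),
    ("female", "Newspaper B"),
    ("male", "Newspaper B"),
    ("female", "Newspaper A"),
    ("male", "Newspaper B"),
    ("female", "Newspaper A"),
    ("male", "Newspaper B"),
]

# Frequency table: how often each distinct newspaper appears.
newspaper_counts = Counter(paper for _, paper in responses)

# Contingency table: counts for every (gender, newspaper) combination.
pair_counts = Counter(responses)
genders = sorted({gender for gender, _ in responses})
papers = sorted({paper for _, paper in responses})
contingency = {
    gender: {paper: pair_counts[(gender, paper)] for paper in papers}
    for gender in genders
}

print(dict(newspaper_counts))
for gender in genders:
    print(gender, contingency[gender])
```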
Part 2: Hypothesis Testing 🔍
Fundamentals
Hypothesis: A statement that we want to test (e.g., does a drug affect blood pressure?).
Null Hypothesis (H0): Assumes no effect or no difference.
Alternative Hypothesis (H1): Assumes an effect or a difference.
P-Value: Probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true.
Statistical Significance: If the p-value is below the chosen significance level (conventionally 0.05), the result is considered statistically significant and H0 is rejected.
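One way to make the p-value definition concrete is an exact permutation test, sketched below. This is an illustrative technique chosen for this example, not a method named in these notes. The blood pressure readings are invented. Under H0 (the drug has no effect) the group labels are interchangeable, so the p-value is the share of all possible relabelings whose difference in means is at least as extreme as the observed one.

```python
from itertools import combinations
from statistics import mean

# Hypothetical systolic blood pressure readings (mmHg).
treatment = [118, 121, 115, 119]
placebo = [128, 131, 126, 124]

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for the difference in means."""
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Enumerate every way to choose which values are labeled "group A".
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total

p_value = permutation_p_value(treatment, placebo)
print("p-value:", round(p_value, 4))
print("significant at 0.05:", p_value < 0.05)
```

Full enumeration is only practical for very small samples; with larger groups a random subset of relabelings is usually drawn instead.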
Types of Errors
Type I Error: Rejecting H0 when it is true (false positive).
Type II Error: Not rejecting H0 when it is false (false negative).
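The link between the significance level and the Type I error rate can be checked by simulation. The sketch below assumes a simple two-sided z-test with known standard deviation 1 (an assumption made to keep the code short) and repeatedly draws samples for which H0 is true. The fraction of false rejections should land close to the chosen level of 0.05.

```python
import math
import random

def type_i_error_rate(n_simulations=5000, sample_size=30, seed=0):
    """Estimate how often a true H0 (mean = 0) is rejected at the 5% level."""
    rng = random.Random(seed)   # fixed seed so the run is reproducible
    critical_z = 1.96           # two-sided critical value for alpha = 0.05
    rejections = 0
    for _ in range(n_simulations):
        # H0 is true here: the data really come from a mean-0, sd-1 normal.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        z = (sum(sample) / sample_size) * math.sqrt(sample_size)
        if abs(z) > critical_z:
            rejections += 1     # a Type I error: H0 rejected although true
    return rejections / n_simulations

rate = type_i_error_rate()
print("estimated Type I error rate:", rate)
```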
Common Hypothesis Tests
T-Test: Compares means: a sample mean against a known value (one-sample), two separate groups (independent samples), or two measurements on the same subjects (paired samples).
ANOVA (Analysis of Variance): Compares means across three or more groups.
Chi-Square Test: Tests relationships between categorical variables.
Nonparametric Tests: Used when data do not meet the assumptions required for parametric tests (e.g., Mann-Whitney U test).
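To show what one of these tests computes, here is a minimal sketch of Pearson's chi-square test for a 2x2 contingency table, written with the standard library only. The counts are invented (rows could be two genders, columns two newspapers). Two assumptions are worth flagging: no continuity correction is applied, and the p-value formula used is valid only for 1 degree of freedom, which is what a 2x2 table has.

```python
import math

def chi_square_2x2(table):
    """Return (statistic, p_value) for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical observed counts.
observed_counts = [[30, 20],
                   [15, 35]]

statistic, p_value = chi_square_2x2(observed_counts)
print("chi-square:", round(statistic, 3))
print("p-value:", round(p_value, 4))
```

For larger tables the statistic is computed the same way, but the p-value needs the chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom, which a statistics library would normally provide.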
Part 3: Correlation and Regression Analysis 📈
Correlation
Pearson Correlation: Measures the linear relationship between two metric variables.
Spearman Correlation: Measures the monotonic relationship between two variables using their ranks.
Kendall's Tau: Measures the ordinal association between two variables, useful for small samples with many ties.
Point-Biserial Correlation: Measures the relationship between a binary variable and a continuous variable.
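The difference between Pearson and Spearman correlation shows up on data that are monotonic but not linear. The sketch below implements both by hand, with Spearman computed as the Pearson correlation of the ranks and ties given average ranks. The helper names and the data are made up for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# y grows with x, but quadratically rather than linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print("Pearson:", round(r_pearson, 3))
print("Spearman:", round(r_spearman, 3))
```

Because the relationship is perfectly monotonic, the rank-based coefficient reaches 1, while the Pearson coefficient stays slightly below 1.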
Regression Analysis
Simple Linear Regression: Uses one independent variable to predict a dependent variable.
Multiple Linear Regression: Uses multiple independent variables to predict a dependent variable.
Logistic Regression: Used when the dependent variable is categorical, most commonly binary (two categories).
Assumptions (linear regression): Linear relationship, normally distributed residuals, homoscedasticity (constant residual variance), and no multicollinearity among the predictors.
Dummy Variables: Encode categorical predictors as 0/1 variables so they can be used in regression models; a predictor with k categories needs k - 1 dummy variables.
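Two of the ideas above can be sketched in a few lines of plain Python: fitting a simple linear regression by least squares, and dummy-coding a categorical predictor. The function names, the data (hours studied versus exam score), and the region labels are hypothetical. The choice of the alphabetically first category as the reference level is an assumption of this example, not a rule.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def dummy_encode(values):
    """Encode a categorical variable as k - 1 indicator (0/1) columns.

    The alphabetically first category is the reference level (all zeros).
    """
    levels = sorted(set(values))
    kept_levels = levels[1:]
    rows = [[1 if value == level else 0 for level in kept_levels]
            for value in values]
    return kept_levels, rows

# Hypothetical data: hours studied (independent) and exam score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [2.9, 5.1, 7.0, 9.2, 10.8]

slope, intercept = fit_simple_regression(hours, scores)
predicted_for_6_hours = intercept + slope * 6
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
print("prediction for 6 hours:", round(predicted_for_6_hours, 2))

# Hypothetical categorical predictor with three categories -> two dummies.
columns, rows = dummy_encode(["north", "south", "west", "north"])
print(columns, rows)
```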
Part 4: Cluster Analysis 👥
K-Means Clustering
Purpose: Identify hidden groups or clusters within data.
Steps:
1. Define the number of clusters (k).
2. Set cluster centers randomly.
3. Assign each element to the nearest cluster center.
4. Recalculate the cluster centers.
5. Repeat steps 3 and 4 until convergence.
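The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function names and the points are made up, the random start picks k of the data points as initial centers (one common choice), and "convergence" is taken to mean that no assignment changes between two passes.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=0, max_iterations=100):
    """Return (centers, labels) for a basic k-means run."""
    rng = random.Random(seed)        # fixed seed for a reproducible run
    centers = rng.sample(points, k)  # step 2: random initial centers
    labels = None
    for _ in range(max_iterations):
        # Step 3: assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:     # step 5: stop once nothing changes
            break
        labels = new_labels
        # Step 4: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:              # keep the old center if a cluster is empty
                centers[c] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return centers, labels

# Hypothetical 2-D data with two visibly separate groups.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]

centers, labels = k_means(points, k=2)   # step 1: choose k
print("centers:", centers)
print("labels:", labels)
```

Which group ends up labeled 0 or 1 depends on the random start, so only the grouping itself is meaningful, not the label numbers.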
Optimal Cluster Number
Elbow Method: Determines the optimal number of clusters by identifying the point (the "elbow") beyond which adding another cluster no longer substantially reduces the within-cluster sum of squared distances.
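A small sketch of the elbow method: run k-means for several values of k and compare the within-cluster sum of squared distances. The data and function names are hypothetical. One deliberate substitution should be noted: to keep the curve reproducible, this example seeds the centers with a deterministic farthest-point heuristic rather than a random start, which is an assumption of the example and not part of the method as described.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_centers(points, k):
    """Deterministic seeding (assumption of this sketch): start at the first
    point, then repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers),
        ))
    return centers

def within_cluster_ss(points, k, max_iterations=100):
    """Run basic k-means and return the within-cluster sum of squares."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:     # converged: assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return sum(squared_distance(p, centers[label])
               for p, label in zip(points, labels))

# Hypothetical 2-D data with three visibly separate groups.
points = [(1, 1), (1, 2), (2, 1),
          (8, 8), (8, 9), (9, 8),
          (8, 1), (9, 1), (8, 2)]

wcss = {k: within_cluster_ss(points, k) for k in range(1, 6)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

On this data the sum of squares falls steeply up to k = 3 and only marginally afterwards, so the elbow suggests three clusters. On real data the bend is often less clear-cut and calls for judgment.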
Tools and Software
DataTab: Example tool for performing statistical analysis online, including hypothesis tests, regression, and cluster analysis.
Conclusion
Statistics provides powerful tools for understanding data and making informed decisions.
Various statistical methods can be applied depending on the research question and the type of data.
Always check assumptions and use appropriate tests to ensure valid conclusions.