Overview
This lecture introduces statistical distributions, explains their importance in data analysis, covers common distribution types, and demonstrates how to visualize and interpret distributions using Python.
Introduction to Distributions
- A distribution describes how data values are spread across their range and how often each value (or range of values) occurs.
- Understanding distributions is essential for data analysis, statistical inference, and machine learning.
- Statistical inference uses a sample to estimate population parameters, because collecting data on the full population is usually impractical.
Visualizing Distributions
- Histograms group data values into intervals (bins) and show how many values fall in each; the y-axis represents counts or relative frequencies.
- Symmetric distributions have mirror-image bars around the center; the normal distribution is the most familiar example.
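As a rough sketch of what a histogram computes, the snippet below bins synthetic data using only the standard library. The data, the bin width of 10, and the helper name bin_counts are assumptions made for illustration; in practice a plotting library would draw the bars.

```python
import random
from collections import Counter

random.seed(0)
# Synthetic data (an assumption for illustration): 1,000 draws from a normal distribution
data = [random.gauss(50, 10) for _ in range(1000)]

bin_width = 10

def bin_counts(values, width):
    """Count how many values fall into each interval [k*width, (k+1)*width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

counts = bin_counts(data, bin_width)

# Text histogram: one '#' per 10 observations in the bin
for left, n in counts.items():
    print(f"{left:6.0f} to {left + bin_width:6.0f} | {'#' * (n // 10)} {n}")
```

For roughly symmetric data such as this, the rows above and below the central bins should have similar lengths.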
Importance and Types of Distribution Shapes
- Distributions help summarize and model datasets as well as validate statistical assumptions.
- Common shapes: symmetric (bell curve/normal), uniform (equal frequency), bimodal (two peaks), skewed left/right (longer tail on one side).
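- Shape affects the relative positions of mean, median, and mode: they roughly coincide in a symmetric, single-peaked distribution, while skew pulls the mean toward the longer tail.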
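A small worked example may help here. The two samples below are made up for illustration, and the ordering they show is a common tendency for skewed data rather than a strict rule.

```python
import statistics

# Small made-up sample with a long right tail (values chosen for illustration)
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]
# Mirror the values to get a long left tail
left_skewed = [20 - x for x in right_skewed]

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )

print(summarize(right_skewed))  # mean 5, median 3, mode 2
print(summarize(left_skewed))   # mean 15, median 17, mode 18
```

In the right-skewed sample the mean sits above the median and mode; in the left-skewed sample the order is reversed.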
Exploring Dataset Distributions in Python
- Use lists or create synthetic data when datasets are unavailable.
- Functions like .unique() and .value_counts() help summarize categorical variables.
- Histograms can reveal data concentration, skewness, and possible outliers.
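The following is a minimal sketch of summarizing a categorical column, assuming the notebook uses pandas (where .unique() and .value_counts() are Series methods). The DataFrame and the column name fuel are hypothetical stand-ins, not the lecture's dataset.

```python
import pandas as pd

# Hypothetical synthetic data; the column name "fuel" is an assumption
df = pd.DataFrame(
    {"fuel": ["gas", "diesel", "gas", "electric", "gas", "diesel"]}
)

# Distinct categories, in order of first appearance
categories = df["fuel"].unique()

# Number of rows per category, most frequent first
counts = df["fuel"].value_counts()

# Same summary expressed as proportions that sum to 1
proportions = df["fuel"].value_counts(normalize=True)

print(categories)
print(counts)
print(proportions)
```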
Discrete Distributions
- Discrete distributions model countable outcomes (e.g., dice rolls).
- Uniform discrete: all outcomes equally likely (e.g., fair die).
- Bernoulli: two outcomes (success/failure).
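- Binomial: number of successes in a fixed number n of independent trials, each with the same success probability.
- Geometric: number of independent trials up to and including the first success.
- Poisson: number of events in a fixed interval of time or space, assuming events occur independently at a constant average rate.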
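The distributions listed above can be written down as probability mass functions. The sketch below uses the standard textbook formulas; the function names and the example parameters are chosen here for illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """P(first success occurs on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

def poisson_pmf(k, lam):
    """P(exactly k events in an interval with average rate lam)."""
    return exp(-lam) * lam**k / factorial(k)

# Discrete uniform: a fair die gives each face probability 1/6
die_pmf = {face: 1 / 6 for face in range(1, 7)}

# Bernoulli is the binomial special case with a single trial (n = 1)
bernoulli_success = binomial_pmf(1, 1, 0.3)

three_heads_in_five = binomial_pmf(3, 5, 0.5)       # 10/32 = 0.3125
first_six_on_third_roll = geometric_pmf(3, 1 / 6)   # (5/6)^2 * (1/6)
no_events = poisson_pmf(0, 2.0)                     # e^(-2)

print(bernoulli_success, three_heads_in_five, first_six_on_third_roll, no_events)
```

Summing a PMF over all possible outcomes should give 1, which is a quick sanity check for any of these functions.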
Continuous Distributions
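- Continuous distributions describe variables that can take any value within an interval, so probabilities are assigned to ranges of values rather than to single points.
- Uniform continuous: the density is constant, so every subinterval of the same length is equally likely.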
- Normal distribution: bell-shaped curve, defined by mean and standard deviation.
- Student's t, exponential, gamma, and beta are other important continuous distributions.
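For continuous distributions, probabilities come from areas over intervals. The sketch below uses statistics.NormalDist from the standard library; the mean of 170, the standard deviation of 10, and the helper uniform_prob are assumptions chosen for illustration.

```python
from statistics import NormalDist

def uniform_prob(a, b, low, high):
    """P(a <= X <= b) for X uniform on [low, high], assuming low <= a <= b <= high."""
    return (b - a) / (high - low)

# Hypothetical normal distribution: mean 170, standard deviation 10
heights = NormalDist(mu=170, sigma=10)

# Probability of landing within one and two standard deviations of the mean
within_one_sd = heights.cdf(180) - heights.cdf(160)   # about 0.68
within_two_sd = heights.cdf(190) - heights.cdf(150)   # about 0.95

# Uniform on [0, 10]: an interval of length 2.5 has probability 0.25
quarter = uniform_prob(2, 4.5, 0, 10)

print(round(within_one_sd, 3), round(within_two_sd, 3), quarter)
```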
Practical Use and Statistical Testing
- Many statistical tests (e.g., t-tests) require assumptions about underlying distributions.
- It's important to verify distribution assumptions before applying these tests.
- Overlaying theoretical distributions (e.g., normal) on data enables visual comparison, but formal tests are used for confirmation.
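One informal way to compare data against a fitted normal curve, without plotting, is to check how much of the sample lies within one standard deviation of its mean (about 68% under normality). This is only a rough numeric stand-in for the visual overlay, not a formal test; the helper name and both synthetic samples are assumptions for illustration. A formal check such as Shapiro-Wilk (for example scipy.stats.shapiro) would be a separate step.

```python
import random
from statistics import NormalDist

def share_within_one_sd(values):
    """Fraction of values within one sample standard deviation of the sample mean."""
    fitted = NormalDist.from_samples(values)  # normal curve with the sample's mean and sd
    low, high = fitted.mean - fitted.stdev, fitted.mean + fitted.stdev
    return sum(low <= v <= high for v in values) / len(values)

# Synthetic samples (assumptions): one normal, one strongly right-skewed
normal_sample = NormalDist(0, 1).samples(2000, seed=1)
rng = random.Random(1)
skewed_sample = [rng.expovariate(1.0) for _ in range(2000)]

# Share a normal distribution places within one standard deviation of its mean
expected = NormalDist().cdf(1) - NormalDist().cdf(-1)

print(round(expected, 3))
print(round(share_within_one_sd(normal_sample), 3))
print(round(share_within_one_sd(skewed_sample), 3))
```

A share far from the expected value, as with the skewed sample, suggests the normality assumption deserves a closer look before running a test that depends on it.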
Key Terms & Definitions
- Distribution — The way data values are spread or arranged.
- Histogram — A plot showing frequency of data in intervals.
- Statistical Inference — Drawing conclusions about populations from samples.
- Symmetric Distribution — Both halves mirror each other around the center.
- Skewness — Asymmetry in data distribution; can be left (tail on left) or right (tail on right).
- Discrete Distribution — Deals with countable outcomes.
- Continuous Distribution — Deals with outcomes over intervals with infinite values.
- Normal Distribution — Bell-shaped, symmetric distribution defined by mean and standard deviation.
- Bernoulli/Binomial/Geometric/Poisson — Specific types of discrete distributions.
Action Items / Next Steps
- Review notebook functions for creating and analyzing distributions in Python.
- Practice generating histograms and identifying distribution types on given datasets.
- Attempt the exercise: plot and interpret the distribution of car mileage in the provided dataset.
- Prepare for future lessons on statistical testing of distribution assumptions.