
Understanding Principal Component Analysis (PCA)

Mar 3, 2025

StatQuest: Principal Component Analysis (PCA)

Introduction

  • Presenter: Josh Starmer
  • Overview of PCA using Singular Value Decomposition (SVD)
  • Key topics: what PCA does, how it works step by step, and how to interpret the results.

Data Set Example

  • Example: Transcription of two genes (Gene 1 and Gene 2) in 6 mice.
    • Mice = individual samples
    • Genes = variables measured
    • Alternative examples: High school students (test scores) or businesses (market capitalization and employees).

Graphing Data

  • 1 Gene:

    • Data plotted on a number line
    • Groups: Mice 1, 2, 3 (high values) vs. Mice 4, 5, 6 (low values).
  • 2 Genes:

    • Data plotted on a 2D X-Y graph
    • Gene 1 (x-axis) and Gene 2 (y-axis) create clustering of mice.
  • 3 Genes:

    • Transition to 3D graph representation.
  • 4 Genes:

    • 4D data cannot be visualized easily.

Principal Component Analysis (PCA)

  • PCA reduces high-dimensional data to a 2D plot while preserving as much of the variation (and thus the relationships among samples) as possible.
    • Helps identify which variables (genes) contribute most to the clustering.
    • Also reports how accurately the 2D plot represents the original data, via the share of variation each component accounts for.
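As a minimal sketch of this idea (the four-gene counts below are hypothetical, and SVD is used as in the video), un-plottable 4D data can be projected onto its first two principal components to get a drawable 2D PCA plot:

```python
import numpy as np

# Hypothetical counts for four genes across six mice (4D: cannot be plotted directly)
X = np.array([[10.0, 6.0, 12.0, 5.0],
              [11.0, 4.0,  9.0, 6.0],
              [ 8.0, 5.0, 10.0, 5.5],
              [ 3.0, 3.0,  2.5, 1.0],
              [ 2.0, 2.8,  1.3, 0.8],
              [ 1.0, 1.0,  2.0, 1.5]])

# Center the data, then use SVD to find the principal component directions
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project each mouse onto PC1 and PC2: the coordinates of a 2D PCA plot
coords = Xc @ Vt[:2].T        # shape (6, 2)
```

Mice with similar gene profiles land near each other in `coords`, which is how clusters become visible in two dimensions.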

Steps of PCA Calculation

  1. Centering the Data:

    • Calculate average measurements for each gene.
    • Shift data so the center is at the origin.
  2. Fitting a Line:

    • Draw an initial random line through the origin and rotate it until it best fits the data.
    • Fit quality can be judged either by minimizing the perpendicular distances from the data points to the line or, equivalently, by maximizing the distances from the projected points to the origin.
  3. Distance Measurement:

    • Project points onto the line and measure distances to the origin.
    • Calculate the sum of squared distances (SS distances).
  4. Principal Component 1 (PC1):

    • Identified as the line with the largest sum of squared distances.
    • Example: PC1 slope = 0.25 (4 parts Gene 1 for every 1 part Gene 2), indicating Gene 1 contributes more to PC1 than Gene 2.
  5. Scaling the Line:

    • Scale the 4-to-1 recipe so the vector along PC1 has a length of 1 unit (using the Pythagorean theorem).
    • The entries of the resulting unit vector (the eigenvector, or singular vector, for PC1) are the loading scores, indicating how much each gene contributes.
  6. Principal Component 2 (PC2):

    • The line through the origin perpendicular to PC1; no further optimization is needed.
    • Example recipe: -1 parts Gene 1 to 4 parts Gene 2.
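The six steps above can be sketched numerically. This is a brute-force rotation search over hypothetical 2-gene data, mirroring the "rotate the line until it best fits" description; real implementations find PC1 directly via SVD rather than a grid search:

```python
import numpy as np

# Hypothetical 2-gene measurements for six mice
X = np.array([[10.0, 6.0], [11.0, 4.0], [8.0, 5.0],
              [ 3.0, 3.0], [ 2.0, 2.8], [1.0, 1.0]])

# Step 1: center the data so its average sits at the origin
Xc = X - X.mean(axis=0)

# Steps 2-4: rotate a line through the origin and keep the angle that
# maximizes the sum of squared distances (SS) of the projections from the origin
best_angle, best_ss = 0.0, -np.inf
for angle in np.linspace(0.0, np.pi, 1800, endpoint=False):
    direction = np.array([np.cos(angle), np.sin(angle)])
    ss = np.sum((Xc @ direction) ** 2)   # squared projected distances
    if ss > best_ss:
        best_angle, best_ss = angle, ss

# Step 5: the unit vector along the best-fitting line is PC1;
# its entries are the loading scores for Gene 1 and Gene 2
pc1 = np.array([np.cos(best_angle), np.sin(best_angle)])

# Step 6: PC2 is simply perpendicular to PC1 (no optimization needed)
pc2 = np.array([-pc1[1], pc1[0]])
```

The grid search makes the geometry explicit: maximizing SS over directions is exactly what the top singular vector of the centered data achieves in one step.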

Eigenvalues and Variation

  • Eigenvalues represent measures of variation in the dataset.
  • Example Variations:
    • PC1 = 15 (83% total variation)
    • PC2 = 3 (17% total variation)
  • Scree Plot: a bar chart showing the percentage of total variation each PC accounts for.
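The percentages above follow from simple arithmetic on the eigenvalues (using the example values of 15 and 3 from the notes):

```python
# Variation attributed to each PC (eigenvalues from the example above)
eigenvalues = [15.0, 3.0]

# Each PC's share of the total variation: these fractions are
# the bar heights a scree plot would display
total = sum(eigenvalues)
explained = [v / total for v in eigenvalues]   # 15/18 ≈ 0.83, 3/18 ≈ 0.17
```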

More Complex Example (3 Variables)

  • Same principles apply with an additional variable.
  • Calculate PC1, PC2, and PC3; interpret loading scores and variation.
  • Total variation example: PC1 = 79%, PC2 = 15%, PC3 = 6%.
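A sketch of the three-variable case (hypothetical counts; same SVD-based pipeline, just with one more column): compute all three PCs, then keep only the top two for a 2D plot.

```python
import numpy as np

# Hypothetical counts for three genes across six mice
X = np.array([[10.0, 6.0, 12.0],
              [11.0, 4.0,  9.0],
              [ 8.0, 5.0, 10.0],
              [ 3.0, 3.0,  2.5],
              [ 2.0, 2.8,  1.3],
              [ 1.0, 1.0,  2.0]])

# Center, then find PC1-PC3 via SVD; rows of Vt are the loading-score vectors
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the covariance matrix, and each PC's share of variation
eigvals = s**2 / (len(X) - 1)
explained = eigvals / eigvals.sum()

# Keep only PC1 and PC2 for a 2D plot, even though the data is 3D
scores_2d = Xc @ Vt[:2].T
```

If PC1 and PC2 together account for most of the variation (as in the 79% + 15% example), the 2D plot is a faithful summary of the 3D data.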

Conclusion

  • Use of PCA helps simplify complex data into meaningful 2D representations.
  • PCA can still analyze higher dimensional data mathematically, even if visualization is not possible.
  • Even when only the first few principal components are kept, clusters in the data can often still be identified.

Closing

  • Encouragement to subscribe for more StatQuest content and support through music purchases.