Principal Component Analysis (PCA) using Singular Value Decomposition (SVD)
Introduction
Presenter: Josh Starmer of StatQuest
Objectives:
- Understand what PCA does, how it works, and how to use it for data insight.
- Demonstrate using an example data set with two genes measured in six mice.
Data Set and Visualization
Data Measurements:
- Samples (Mice): 6 samples
- Variables (Genes): 2 genes
Data Visualization:
- 1 Gene: Plot on a number line
- Mice 1, 2, 3 (high values)
- Mice 4, 5, 6 (low values)
- 2 Genes: Plot on a 2D X-Y graph
- Gene 1: X-axis
- Gene 2: Y-axis
- Mice clustering: Mice 1, 2, 3 (right side), Mice 4, 5, 6 (lower left side)
- 3 Genes: 3D graph by adding another axis
- 4+ Genes: Cannot be drawn directly, since each gene needs its own axis (4 genes would need 4 dimensions)
Principal Component Analysis (PCA)
Purpose:
- Reduce higher-dimensional data to two dimensions for easier visualization.
- Identify which gene is most valuable for clustering data.
- Assess how accurately the 2D graph represents the original data.
Steps in PCA:
- Plotting the Data:
- Use 2D data example (2 genes) for explanation.
- Calculate Averages:
- Average of Gene 1
- Average of Gene 2
- Determine center of the data from averages
- Shift Data:
- Center the data on the origin
- Fit a Line:
- Draw a random line through the origin
- Rotate line to best fit data
- Quantify Fit:
- Project data onto line
- Measure distances from the data points to the line
- Minimize these distances or, equivalently, maximize the distances from the projected points to the origin
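The steps above can be sketched in a few lines of Python. This is only an illustration: the six measurements are made-up values (not taken from the presentation), and the brute-force angle search stands in for "rotating the line"; it is not how SVD finds the best fit in practice.

```python
import math

# Hypothetical measurements for six mice (illustrative values only).
gene1 = [10.0, 11.0, 8.0, 3.0, 2.0, 1.0]
gene2 = [6.0, 4.0, 5.0, 3.0, 2.8, 1.0]

# Calculate averages and shift the data so its center sits on the origin.
mean1 = sum(gene1) / len(gene1)
mean2 = sum(gene2) / len(gene2)
centered = [(x - mean1, y - mean2) for x, y in zip(gene1, gene2)]


def ss_projected(angle):
    """Sum of squared distances from the projected points to the origin
    for a line through the origin at the given angle (radians)."""
    ux, uy = math.cos(angle), math.sin(angle)
    return sum((x * ux + y * uy) ** 2 for x, y in centered)


def ss_residual(angle):
    """Sum of squared distances from the data points to the same line."""
    ux, uy = math.cos(angle), math.sin(angle)
    return sum((x * uy - y * ux) ** 2 for x, y in centered)


# "Rotate" the line in 0.1-degree steps and keep the angle that
# maximizes the projected sum of squares.
angles = [math.radians(d / 10) for d in range(1800)]
best = max(angles, key=ss_projected)

print(f"best-fit angle: {math.degrees(best):.1f} degrees")
print(f"slope of best-fit line: {math.tan(best):.3f}")
print(f"SS projected: {ss_projected(best):.3f}")
print(f"SS residual:  {ss_residual(best):.3f}")
```

For any angle, the two sums add up to the same total, which is why minimizing one is the same as maximizing the other.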
Mathematical Details and Terminology
Understanding Distances:
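- Use the Pythagorean theorem: each point's distance to the origin is fixed, so as its distance to the line shrinks, the distance from its projected point to the origin grows
- Goal: Maximize sum of squared distances (SS distances) from projected points to origin
- Principal Component 1 (PC1): Line through the origin with the largest SS distances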
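The Pythagorean argument can be written out as a short derivation. The symbols a, b and c are notation assumed here for illustration: a_i is the distance from centered point i to the origin, b_i its distance to the candidate line, and c_i the distance from its projected point to the origin.

```latex
\begin{align*}
a_i^2 &= b_i^2 + c_i^2
  && \text{(right triangle: point, projection, origin)} \\
\sum_{i=1}^{n} b_i^2 &= \sum_{i=1}^{n} a_i^2 - \sum_{i=1}^{n} c_i^2
  && \text{(sum over all } n \text{ points)}
\end{align*}
```

Because the a_i do not change as the line rotates, the line that minimizes the sum of b_i^2 is the same line that maximizes the sum of c_i^2, which is the SS distances.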
Terminology:
- Linear Combination: Mix of Gene 1 and Gene 2 to form PC1
- Eigenvector: Unit vector corresponding to a principal component
- Loading Scores: Proportions of each gene in eigenvector
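- Eigenvalue: Sum of squared distances (SS distances) from the projected points to the origin along the best-fit line; dividing it by n - 1 gives the variation for that principal component
- Singular Value: Square root of the SS distances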
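As a rough illustration of how these terms map onto an SVD, the sketch below uses NumPy's `numpy.linalg.svd` on made-up measurements (the values are assumptions, not data from the presentation). Note that conventions differ: some texts call SS distances the eigenvalue, others reserve the word for SS distances divided by n - 1, which is the eigenvalue of the covariance matrix.

```python
import numpy as np

# Hypothetical measurements: rows are mice, columns are Gene 1 and Gene 2.
data = np.array([[10.0, 6.0], [11.0, 4.0], [8.0, 5.0],
                 [3.0, 3.0], [2.0, 2.8], [1.0, 1.0]])
centered = data - data.mean(axis=0)

# SVD of the centered data: each row of vt is a unit-length eigenvector
# (one per principal component) and s holds the singular values.
_, s, vt = np.linalg.svd(centered, full_matrices=False)

pc1 = vt[0]                  # eigenvector for PC1 (its sign is arbitrary)
loading_scores = pc1         # how much of Gene 1 and Gene 2 goes into PC1

# SS distances: squared distances from the projected points to the origin.
ss_distances = np.sum((centered @ pc1) ** 2)
singular_value = s[0]        # equals the square root of ss_distances
variation = ss_distances / (len(data) - 1)

print("loading scores for PC1:", np.round(loading_scores, 3))
print("SS distances:", round(float(ss_distances), 3))
print("singular value:", round(float(singular_value), 3))
print("variation for PC1:", round(float(variation), 3))
```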
Principal Components:
- PC1: First principal component (best fit line)
- PC2: Second principal component (perpendicular to PC1)
- In 2D, no further optimization is required: PC2 is simply the line through the origin perpendicular to PC1
- Recipe for PC2 involves different proportion of genes
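A small sketch of why PC2 needs no fitting in 2D: once the unit vector for PC1 is known, rotating it by 90 degrees gives PC2. The 4-to-1 recipe below is an assumed example, chosen only to make the arithmetic easy to follow.

```python
import math

# Illustrative PC1 recipe: 4 parts Gene 1 to 1 part Gene 2 (an assumption).
recipe = (4.0, 1.0)
length = math.hypot(*recipe)
pc1 = (recipe[0] / length, recipe[1] / length)   # unit vector for PC1

# In 2D, rotating PC1 by 90 degrees gives PC2 with no further fitting.
pc2 = (-pc1[1], pc1[0])

# A dot product of zero confirms the two directions are perpendicular.
dot = pc1[0] * pc2[0] + pc1[1] * pc2[1]

print("PC1 loading scores:", [round(v, 3) for v in pc1])
print("PC2 loading scores:", [round(v, 3) for v in pc2])
print("dot product:", dot)
```

The loading scores for PC2 use the same two genes in different proportions, which is what "a different recipe" means here.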
PCA with More Variables
Example with 3 Genes:
- Follow similar steps as 2D example
- PC1: Best fitting line with 3-gene recipe
- PC2: Perpendicular to PC1
- PC3: Perpendicular to both PC1 and PC2
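The 3-gene case can be sketched the same way. The measurements below are again made up for illustration; `numpy.linalg.svd` returns one unit vector per principal component, each with a 3-gene recipe, and the vectors are mutually perpendicular.

```python
import numpy as np

# Hypothetical measurements: rows are mice, columns are Genes 1, 2 and 3.
data = np.array([[10.0, 6.0, 12.0], [11.0, 4.0, 9.0], [8.0, 5.0, 10.0],
                 [3.0, 3.0, 2.5], [2.0, 2.8, 1.3], [1.0, 1.0, 2.0]])
centered = data - data.mean(axis=0)

# Singular values come back sorted from largest to smallest, so the rows
# of vt are already ordered as PC1, PC2, PC3.
_, s, vt = np.linalg.svd(centered, full_matrices=False)
pc1, pc2, pc3 = vt

for name, pc in zip(("PC1", "PC2", "PC3"), vt):
    print(name, "recipe (Gene 1, Gene 2, Gene 3):", np.round(pc, 3))

# Perpendicular unit vectors: every pairwise dot product is zero.
print("PC1 . PC2:", round(float(pc1 @ pc2), 10))
print("PC1 . PC3:", round(float(pc1 @ pc3), 10))
print("PC2 . PC3:", round(float(pc2 @ pc3), 10))
```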
Variance and Scree Plot:
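- Eigenvalues measure variation: dividing the SS distances for a principal component by n - 1 gives the variation around the origin along that component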
- Scree Plot: Graphical representation of percentage of variation each principal component accounts for
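The percentages behind a scree plot are a simple ratio. The sketch below uses made-up variation values for two principal components and prints a text-only bar chart instead of a real plot.

```python
# Illustrative variation (SS distances / (n - 1)) per principal component.
# These numbers are assumptions chosen for easy arithmetic.
variation = {"PC1": 15.0, "PC2": 3.0}

total = sum(variation.values())
percent = {pc: 100.0 * v / total for pc, v in variation.items()}

# Text-only scree plot: one '#' per two percentage points.
for pc, pct in percent.items():
    print(f"{pc} {pct:5.1f}% " + "#" * round(pct / 2))
```

If the first two components together account for most of the variation, a 2D PCA plot is a reasonable approximation of the higher-dimensional data.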
Drawing the PCA Plot
Final PCA Plot:
- Rotate data to make PC1 horizontal
- Place samples based on projected points
- Use to identify data clusters
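One way to obtain the coordinates for the final plot is to project the centered data onto PC1 and PC2, which amounts to rotating the data so PC1 is horizontal. The sketch below assumes NumPy and uses made-up measurements for six mice.

```python
import numpy as np

# Hypothetical measurements: rows are mice, columns are Gene 1 and Gene 2.
data = np.array([[10.0, 6.0], [11.0, 4.0], [8.0, 5.0],
                 [3.0, 3.0], [2.0, 2.8], [1.0, 1.0]])
centered = data - data.mean(axis=0)

u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Projecting onto PC1 and PC2 rotates the centered data so that PC1 lies
# along the horizontal axis and PC2 along the vertical axis.
scores = centered @ vt.T

for mouse, (x, y) in enumerate(scores, start=1):
    print(f"Mouse {mouse}: PC1 = {x:6.2f}, PC2 = {y:6.2f}")
```

With these values, Mice 1, 2 and 3 land on one side of the origin along PC1 and Mice 4, 5 and 6 on the other, which is the kind of clustering the plot is meant to reveal. The sign of each axis is arbitrary, so the plot may appear mirrored.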
Key Takeaways
- PCA reduces dimensionality for easier data visualization.
- Eigenvalues and eigenvectors play critical roles in understanding data variance and determining principal components.
- Scree plots help in understanding the significance of each principal component.
Conclusion
- PCA is a powerful tool for data analysis and visualization.
- StatQuest offers more detailed videos and explanations on related topics.