Notes on Principal Component Analysis (PCA) with Singular Value Decomposition (SVD)
Introduction
- Presenter: Josh Starmer
- Focus on PCA using SVD to extract insights from data.
- Example dataset: Transcription of two genes (Gene 1 and Gene 2) from 6 mice.
- The same approach works for other data, such as students' test scores or business metrics.
Data Visualization
One Gene Measurement
- Data can be plotted on a number line.
- Mice 1, 2, and 3 have high values; Mice 4, 5, and 6 have low values.
Two Gene Measurement
- Data can be plotted on a 2D X-Y graph (Gene 1 on x-axis, Gene 2 on y-axis).
- Clustering of subjects observed based on gene expression.
Three Gene Measurement
- Data visualized in 3D, adding a third axis for Gene 3.
More than Three Genes
- Cannot visualize directly beyond 3D; PCA allows reduction to 2D.
- PCA helps identify important genes for clustering.
PCA Process
- Centering Data: Calculate the average for Gene 1 and Gene 2, shift data to center around the origin.
- Fitting a Line: Draw a random line through the origin and rotate to best fit the data.
- PCA evaluates the fit either by minimizing the sum of squared distances from the data points to the line or, equivalently, by maximizing the sum of squared distances from the projected points to the origin.
- Mathematical Intuition: Use a single data point to understand how distances relate when projecting onto the fitting line.
- Apply the Pythagorean theorem: a point's distance to the origin is fixed, so as its distance to the line shrinks, the distance from its projection to the origin grows.
- Principal Component 1 (PC1): The line that maximizes the sum of squared distances from projected points to the origin.
- PC1 points along the direction of greatest variation in the data.
- Example: The slope of PC1 gives the ratio of Gene 1 to Gene 2 in the mix that makes up PC1.
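The centering and line-fitting steps above can be sketched in a few lines of NumPy. This is only an illustration: the expression values are made up (the notes do not list the actual measurements), and the names ss_projected and ss_residual are hypothetical helpers. The line is "rotated" by trying a grid of angles rather than by any method the video prescribes.

```python
import numpy as np

# Made-up measurements: 6 mice (rows) x 2 genes (columns).
data = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])

# Step 1: center the data so the average of each gene sits at the origin.
centered = data - data.mean(axis=0)

def ss_projected(angle):
    """Sum of squared distances from the projected points to the origin."""
    direction = np.array([np.cos(angle), np.sin(angle)])  # unit vector
    projections = centered @ direction
    return float(np.sum(projections ** 2))

def ss_residual(angle):
    """Sum of squared distances from the data points to the line."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    projections = centered @ direction
    residuals = centered - np.outer(projections, direction)
    return float(np.sum(residuals ** 2))

# Step 2: rotate a line through the origin over a grid of angles.
angles = np.linspace(0.0, np.pi, 1801)
projected = np.array([ss_projected(a) for a in angles])
residual = np.array([ss_residual(a) for a in angles])

# The best-fitting line maximizes the projected sum of squares.
best = int(np.argmax(projected))
pc1_direction = np.array([np.cos(angles[best]), np.sin(angles[best])])

# Pythagorean theorem: at every angle, projected + residual equals the
# fixed total squared distance of the points from the origin.
total = float(np.sum(centered ** 2))

print("PC1 direction (unit vector):", pc1_direction)
print("PC1 slope (Gene 2 per unit of Gene 1):",
      pc1_direction[1] / pc1_direction[0])
```

Because the two sums always add up to the same total, the angle that maximizes one is the angle that minimizes the other, which is why the two ways of judging the fit agree.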
Terminology Alerts
- Linear Combination: Representation of PC1 using proportions of Genes 1 and 2.
- Eigenvector: The unit vector for PC1 indicating the direction of variance.
- Eigenvalue: Sum of squared distances from the projected points to the origin for the best-fit line; dividing it by n - 1 (n = number of samples) gives the variation that PC accounts for.
- Singular Value: Square root of the sum of squared distances for PC1.
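As a rough illustration of how these terms map onto an SVD, the sketch below runs numpy.linalg.svd on made-up data (6 mice, 2 genes). Conventions differ on whether "eigenvalue" means the raw sum of squared distances or that sum divided by n - 1 (the eigenvalue of the sample covariance matrix), so the sketch computes both.

```python
import numpy as np

# Made-up data: 6 mice (rows) x 2 genes (columns).
data = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])
centered = data - data.mean(axis=0)
n_samples = centered.shape[0]

# SVD factors the centered data as U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Eigenvector: the first row of Vt is a unit vector along PC1.
pc1 = Vt[0]

# Linear combination: the entries of pc1 are the proportions of
# Gene 1 and Gene 2 that are mixed to make PC1.
gene1_part, gene2_part = pc1

# Sum of squared distances from the projected points to the origin.
ss_distances = float(np.sum((centered @ pc1) ** 2))

# Singular value: the square root of that sum.
singular_value = float(S[0])

# Variation for PC1: the sum divided by n - 1. This matches the largest
# eigenvalue of the sample covariance matrix (eigvalsh sorts ascending).
variation_pc1 = ss_distances / (n_samples - 1)
cov_eigenvalues = np.linalg.eigvalsh(np.cov(centered, rowvar=False))

print("PC1 unit vector:", pc1)
print("sum of squared distances:", ss_distances)
print("singular value:", singular_value)
print("variation for PC1:", variation_pc1)
```

The sign of pc1 is arbitrary: an SVD routine may return the vector pointing either way along the same line.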
Finding PC2
- PC2 is the line through the origin perpendicular to PC1.
- Loading scores: The proportions of each gene that make up a PC; a larger absolute loading means that gene contributes more to that PC.
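A small check of the perpendicularity claim, again on made-up data: with two genes, the second row of Vt from the SVD should be PC1 rotated by 90 degrees (up to sign). The entries of each row are read here as the loading scores for Gene 1 and Gene 2.

```python
import numpy as np

# Made-up data: 6 mice (rows) x 2 genes (columns).
data = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])
centered = data - data.mean(axis=0)

# Rows of Vt are unit vectors along PC1 and PC2.
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]

# In 2D, rotating PC1 by 90 degrees gives the only perpendicular line.
rotated = np.array([-pc1[1], pc1[0]])

# A dot product of zero means the two directions are perpendicular.
dot = float(pc1 @ pc2)

for name, pc in (("PC1", pc1), ("PC2", pc2)):
    print(f"{name}: Gene 1 loading = {pc[0]:+.3f}, "
          f"Gene 2 loading = {pc[1]:+.3f}")
```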
Scree Plot and Variation
- Scree Plot: Graphical representation of variation explained by each PC.
- Example variation values: PC1 = 83%, PC2 = 17% of the total variation.
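The percentages on a scree plot can be computed from the singular values. The sketch below uses made-up data, so its output will not match the 83% / 17% example; it prints a crude text bar chart in place of a real plot.

```python
import numpy as np

# Made-up data: 6 mice (rows) x 2 genes (columns).
data = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])
centered = data - data.mean(axis=0)
n_samples = centered.shape[0]

# Only the singular values are needed here.
S = np.linalg.svd(centered, compute_uv=False)

# Variation per PC: sum of squared distances divided by n - 1.
variation = S ** 2 / (n_samples - 1)

# Share of the total variation that each PC accounts for.
percent = 100.0 * variation / variation.sum()

# Text version of a scree plot: one bar per PC.
for i, p in enumerate(percent, start=1):
    bar = "#" * int(round(p / 2))
    print(f"PC{i} {p:5.1f}% {bar}")
```

The total variation across the PCs equals the sum of the per-gene sample variances, so the percentages always add up to 100.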
More Complicated Example: PCA with 3 Variables
- Center data and find the best fitting line for PC1.
- Determine PC2 perpendicular to PC1.
- Identify PC3, and continue adding PCs.
- The number of PCs is at most the number of variables or the number of samples, whichever is smaller.
- Use eigenvalues to determine the proportion of variation each PC accounts for.
- Example: PC1 = 79%, PC2 = 15%, PC3 = 6%.
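The three-variable case follows the same recipe. This sketch, on made-up data for 6 mice and 3 genes, counts the PCs returned by the SVD, checks that they are mutually perpendicular, and computes each PC's share of the variation. The numbers will differ from the 79% / 15% / 6% example because the data are invented.

```python
import numpy as np

# Made-up data: 6 mice (rows) x 3 genes (columns).
data = np.array([
    [10.0, 6.0, 12.0],
    [11.0, 4.0, 9.0],
    [8.0, 5.0, 10.0],
    [3.0, 3.0, 2.5],
    [2.0, 2.8, 1.3],
    [1.0, 1.0, 2.0],
])
centered = data - data.mean(axis=0)
n_samples, n_genes = centered.shape

U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# One PC per variable or per sample, whichever is smaller: here 3.
n_pcs = S.shape[0]

# Rows of Vt are unit vectors; Vt @ Vt.T is the identity matrix when
# every pair of PCs is perpendicular.
gram = Vt @ Vt.T

# Proportion of the variation each PC accounts for.
variation = S ** 2 / (n_samples - 1)
percent = 100.0 * variation / variation.sum()

# With only 2 samples and 3 genes, the SVD returns just 2 components.
subset = data[:2] - data[:2].mean(axis=0)
S_small = np.linalg.svd(subset, compute_uv=False)

print("number of PCs:", n_pcs)
print("percent of variation per PC:", np.round(percent, 1))
print("PCs with only 2 samples:", S_small.shape[0])
```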
Conversion to 2D PCA Graph
- Plot the samples using only PC1 and PC2, which is reasonable when they account for most of the variation (79% + 15% = 94% in this case).
- Illustrate samples' positions in the new PCA plot.
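One way to get the coordinates for the 2D PCA plot is to project the centered samples onto the first two PCs. The sketch below assumes made-up data for 6 mice and 3 genes and prints the coordinates instead of drawing them.

```python
import numpy as np

# Made-up data: 6 mice (rows) x 3 genes (columns).
data = np.array([
    [10.0, 6.0, 12.0],
    [11.0, 4.0, 9.0],
    [8.0, 5.0, 10.0],
    [3.0, 3.0, 2.5],
    [2.0, 2.8, 1.3],
    [1.0, 1.0, 2.0],
])
centered = data - data.mean(axis=0)

U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project each mouse onto PC1 and PC2: one (x, y) pair per mouse.
scores = centered @ Vt[:2].T

# Fraction of the total variation kept by the first two PCs.
retained = float((S[:2] ** 2).sum() / (S ** 2).sum())

for mouse, (x, y) in enumerate(scores, start=1):
    print(f"Mouse {mouse}: PC1 = {x:+.2f}, PC2 = {y:+.2f}")
print(f"variation retained by PC1 and PC2: {100 * retained:.1f}%")
```

The same coordinates can be read directly from the SVD as U[:, :2] * S[:2], which avoids the extra matrix product.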
Conclusion
- PCA is a powerful method for reducing dimensionality while retaining most of the variation in the data.
- Even with noisy data, PCA can reveal clusters.
Reminder: Understand terminology such as eigenvalues, eigenvectors, and the concept of PCA as it applies to dimensionality reduction and variance analysis.