Understanding PCA with SVD Techniques

Aug 8, 2024

Notes on Principal Component Analysis (PCA) with Singular Value Decomposition (SVD)

Introduction

  • Presenter: Josh Starmer
  • Focus on PCA using SVD to extract insights from data.
  • Example dataset: Transcription levels of two genes (Gene 1 and Gene 2) measured in 6 mice.
    • Alternatives: Students’ test scores or businesses’ metrics.

Data Visualization

One Gene Measurement

  • Data can be plotted on a number line.
  • Mice 1, 2, and 3 have high values; Mice 4, 5, and 6 have low values.

Two Gene Measurement

  • Data can be plotted on a 2D X-Y graph (Gene 1 on x-axis, Gene 2 on y-axis).
  • The mice form visible clusters based on their gene expression.

Three Gene Measurement

  • Data visualized in 3D, adding a third axis for Gene 3.

More than Three Genes

  • Cannot visualize directly beyond 3D; PCA allows reduction to 2D.
  • PCA helps identify important genes for clustering.

PCA Process

  1. Centering Data: Calculate the average for Gene 1 and Gene 2, shift data to center around the origin.
  2. Fitting a Line: Draw a random line through the origin and rotate to best fit the data.
    • PCA evaluates the fit by minimizing the sum of squared distances from the data points to the line or, equivalently, maximizing the sum of squared distances from the projected points to the origin (see the numpy sketch after this list).
  3. Mathematical Intuition: Use a single data point to see how the two distances trade off when it is projected onto the candidate line.
    • By the Pythagorean theorem, the point's (fixed) squared distance to the origin equals its squared distance to the line plus the squared distance from its projected point to the origin, so minimizing one is equivalent to maximizing the other.
  4. Principal Component 1 (PC1): The line that maximizes the sum of squared distances from projected points to the origin.
    • PC1 determines the main direction of variance in the data.
    • Example: PC1 is described by a ratio ("recipe") of Gene 1 to Gene 2, given by the slope of the best-fitting line.
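A minimal numpy sketch of these steps (the gene measurements below are made up for illustration; the SVD finds the best-fitting line directly rather than by rotating a random line):

```python
import numpy as np

# Hypothetical measurements: 6 mice (rows) x 2 genes (columns).
X = np.array([[10.0, 6.0],
              [11.0, 4.0],
              [ 8.0, 5.0],
              [ 3.0, 3.0],
              [ 2.0, 2.8],
              [ 1.0, 1.0]])

# 1. Center the data so the average of each gene sits at the origin.
X_centered = X - X.mean(axis=0)

# 2-4. The first right singular vector is the direction (unit vector) of PC1,
# i.e. the line through the origin that maximizes the sum of squared
# distances from the projected points to the origin.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
pc1_direction = Vt[0]          # unit vector along PC1
ss_distances  = s[0] ** 2      # sum of squared projected distances for PC1

print(pc1_direction, ss_distances)
```

The singular values come back sorted in descending order, so Vt[0] always corresponds to PC1.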

Terminology Alerts

  • Linear Combination: The recipe of proportions of Gene 1 and Gene 2 that makes up PC1.
  • Eigenvector: The unit vector for PC1 indicating the direction of variance.
  • Eigenvalue: The sum of squared distances from the projected points to the origin for the best-fitting line; dividing it by the number of samples minus 1 gives the variation that PC accounts for.
  • Singular Value: Square root of the sum of squared distances for PC1.
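Continuing the sketch above, these terms can be read straight off the SVD (a sketch under the video's convention that the eigenvalue is the raw sum of squared distances; dividing by n − 1 gives the eigenvalue of the covariance matrix, i.e. the variation):

```python
n = X_centered.shape[0]

eigenvalue_pc1     = s[0] ** 2              # sum of squared distances for PC1
singular_value_pc1 = s[0]                   # square root of the eigenvalue
eigenvector_pc1    = Vt[0]                  # unit vector along PC1
variation_pc1      = eigenvalue_pc1 / (n - 1)

# Cross-check: the largest eigenvalue of the covariance matrix of the
# centered data equals variation_pc1.
cov = np.cov(X_centered, rowvar=False)
print(np.linalg.eigvalsh(cov)[-1], variation_pc1)
```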

Finding PC2

  • With only two variables, PC2 is simply the line through the origin perpendicular to PC1.
  • Loading scores (the entries of each eigenvector) indicate how much each gene contributes to a PC (see the sketch below).
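In the same sketch, the loading scores are just the entries of the rows of Vt:

```python
loading_scores_pc1 = Vt[0]   # contributions of Gene 1 and Gene 2 to PC1
loading_scores_pc2 = Vt[1]   # PC2's recipe
print(np.dot(Vt[0], Vt[1]))  # ~0, confirming PC1 and PC2 are perpendicular
```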

Scree Plot and Variation

  • Scree Plot: Graphical representation of variation explained by each PC.
  • Example variation values:
    • PC1 = 83%, PC2 = 17% of the total variation.
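Continuing the numpy sketch, the percentages (and the scree plot itself) can be computed from the singular values; matplotlib is an assumption here, and the 83%/17% split is the example from the video, not what the made-up data above would give:

```python
import matplotlib.pyplot as plt

variation = s ** 2 / (n - 1)                        # variation per PC
percent_variation = 100 * variation / variation.sum()

plt.bar(['PC1', 'PC2'], percent_variation)
plt.ylabel('Percent of total variation')
plt.title('Scree plot')
plt.show()
```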

More Complicated Example: PCA with 3 Variables

  1. Center data and find the best fitting line for PC1.
  2. Determine PC2 perpendicular to PC1.
  3. Identify PC3, and continue adding PCs.
    • The number of PCs is at most the smaller of the number of variables and the number of samples.
  4. Use eigenvalues to determine the proportion of variation each PC accounts for.
    • Example: PC1 = 79%, PC2 = 15%, PC3 = 6%.
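With more than two variables the by-hand geometry gets tedious, so here is a sketch using scikit-learn's PCA (an assumption about tooling; the values are random stand-ins, not real gene data, and the 79/15/6 split is the video's example):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 6 mice x 3 genes matrix.
X3 = np.random.default_rng(0).normal(size=(6, 3))

pca = PCA()        # keeps min(n_samples, n_features) components by default
pca.fit(X3)        # scikit-learn centers the data internally

print(pca.components_)                # eigenvectors, one row per PC
print(pca.explained_variance_ratio_)  # proportion of variation per PC
```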

Conversion to 2D PCA Graph

  • Project the samples onto PC1 and PC2 only; together these two PCs account for most of the variation (94% in this example).
  • Plot each sample at its PC1 and PC2 coordinates in the new 2D PCA graph (see the sketch below).
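Continuing the scikit-learn sketch, projecting onto the first two PCs gives the coordinates for the 2D PCA graph:

```python
import matplotlib.pyplot as plt

scores = pca.transform(X3)[:, :2]   # each row: a mouse's (PC1, PC2) coordinates

plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('2D PCA plot')
plt.show()
```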

Conclusion

  • PCA is a powerful method for reducing dimensionality while maintaining variance.
  • Even with noisy data, PCA can reveal clusters.

Reminder: Understand terminology such as eigenvalues, eigenvectors, and the concept of PCA as it applies to dimensionality reduction and variance analysis.