Understanding PCA with SVD Techniques

Aug 8, 2024

Notes on Principal Component Analysis (PCA) with Singular Value Decomposition (SVD)

Introduction

  • Presenter: Josh Starmer
  • Focus on PCA using SVD to extract insights from data.
  • Example dataset: Transcription levels of two genes (Gene 1 and Gene 2) measured in 6 mice.
    • Alternatives: Students’ test scores or businesses’ metrics.

Data Visualization

One Gene Measurement

  • Data can be plotted on a number line.
  • Mice 1, 2, and 3 have high values; Mice 4, 5, and 6 have low values.

Two Gene Measurement

  • Data can be plotted on a 2D X-Y graph (Gene 1 on x-axis, Gene 2 on y-axis).
  • The mice form visible clusters based on their gene expression.

Three Gene Measurement

  • Data visualized in 3D, adding a third axis for Gene 3.

More than Three Genes

  • Cannot visualize directly beyond 3D; PCA allows reduction to 2D.
  • PCA helps identify important genes for clustering.

PCA Process

  1. Centering Data: Calculate the average for Gene 1 and Gene 2, shift data to center around the origin.
  2. Fitting a Line: Draw a random line through the origin and rotate to best fit the data.
    • PCA evaluates the fit by minimizing the sum of squared distances from the data points to the line or, equivalently, maximizing the sum of squared distances from the projected points to the origin (see the numpy sketch after this list).
  3. Mathematical Intuition: Use a single data point to see how the two distances trade off when it is projected onto the candidate line.
    • By the Pythagorean theorem, the point's (fixed) squared distance to the origin equals its squared distance to the line plus the squared distance from its projected point to the origin, so minimizing one is equivalent to maximizing the other.
  4. Principal Component 1 (PC1): The line that maximizes the sum of squared distances from projected points to the origin.
    • PC1 determines the main direction of variance in the data.
    • Example: PC1 is described by a ratio ("recipe") of Gene 1 to Gene 2, given by the slope of the best-fitting line.
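A minimal numpy sketch of these steps (the gene measurements below are made up for illustration; the SVD finds the best-fitting line directly rather than by rotating a random line):

```python
import numpy as np

# Hypothetical measurements: 6 mice (rows) x 2 genes (columns).
X = np.array([[10.0, 6.0],
              [11.0, 4.0],
              [ 8.0, 5.0],
              [ 3.0, 3.0],
              [ 2.0, 2.8],
              [ 1.0, 1.0]])

# 1. Center the data so the average of each gene sits at the origin.
X_centered = X - X.mean(axis=0)

# 2-4. The first right singular vector is the direction (unit vector) of PC1,
# i.e. the line through the origin that maximizes the sum of squared
# distances from the projected points to the origin.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
pc1_direction = Vt[0]          # unit vector along PC1
ss_distances  = s[0] ** 2      # sum of squared projected distances for PC1

print(pc1_direction, ss_distances)
```

The singular values come back sorted in descending order, so Vt[0] always corresponds to PC1.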

Terminology Alerts

  • Linear Combination: The recipe of proportions of Gene 1 and Gene 2 that makes up PC1.
  • Eigenvector: The unit vector for PC1 indicating the direction of variance.
  • Eigenvalue: The sum of squared distances from the projected points to the origin for the best-fitting line; dividing it by the number of samples minus 1 gives the variation that PC accounts for.
  • Singular Value: Square root of the sum of squared distances for PC1.
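Continuing the sketch above, these terms can be read straight off the SVD (a sketch under the video's convention that the eigenvalue is the raw sum of squared distances; dividing by n − 1 gives the eigenvalue of the covariance matrix, i.e. the variation):

```python
n = X_centered.shape[0]

eigenvalue_pc1     = s[0] ** 2              # sum of squared distances for PC1
singular_value_pc1 = s[0]                   # square root of the eigenvalue
eigenvector_pc1    = Vt[0]                  # unit vector along PC1
variation_pc1      = eigenvalue_pc1 / (n - 1)

# Cross-check: the largest eigenvalue of the covariance matrix of the
# centered data equals variation_pc1.
cov = np.cov(X_centered, rowvar=False)
print(np.linalg.eigvalsh(cov)[-1], variation_pc1)
```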

Finding PC2

  • With only two variables, PC2 is simply the line through the origin perpendicular to PC1.
  • Loading scores (the entries of each eigenvector) indicate how much each gene contributes to a PC (see the sketch below).
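In the same sketch, the loading scores are just the entries of the rows of Vt:

```python
loading_scores_pc1 = Vt[0]   # contributions of Gene 1 and Gene 2 to PC1
loading_scores_pc2 = Vt[1]   # PC2's recipe
print(np.dot(Vt[0], Vt[1]))  # ~0, confirming PC1 and PC2 are perpendicular
```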

Scree Plot and Variation

  • Scree Plot: Graphical representation of variation explained by each PC.
  • Example variation values:
    • PC1 = 83%, PC2 = 17% of the total variation.
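Continuing the numpy sketch, the percentages (and the scree plot itself) can be computed from the singular values; matplotlib is an assumption here, and the 83%/17% split is the example from the video, not what the made-up data above would give:

```python
import matplotlib.pyplot as plt

variation = s ** 2 / (n - 1)                        # variation per PC
percent_variation = 100 * variation / variation.sum()

plt.bar(['PC1', 'PC2'], percent_variation)
plt.ylabel('Percent of total variation')
plt.title('Scree plot')
plt.show()
```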

More Complicated Example: PCA with 3 Variables

  1. Center data and find the best fitting line for PC1.
  2. Determine PC2 perpendicular to PC1.
  3. Identify PC3, and continue adding PCs.
    • The number of PCs is at most the smaller of the number of variables and the number of samples.
  4. Use eigenvalues to determine the proportion of variation each PC accounts for.
    • Example: PC1 = 79%, PC2 = 15%, PC3 = 6%.
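With more than two variables the by-hand geometry gets tedious, so here is a sketch using scikit-learn's PCA (an assumption about tooling; the values are random stand-ins, not real gene data, and the 79/15/6 split is the video's example):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 6 mice x 3 genes matrix.
X3 = np.random.default_rng(0).normal(size=(6, 3))

pca = PCA()        # keeps min(n_samples, n_features) components by default
pca.fit(X3)        # scikit-learn centers the data internally

print(pca.components_)                # eigenvectors, one row per PC
print(pca.explained_variance_ratio_)  # proportion of variation per PC
```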

Conversion to 2D PCA Graph

  • Project the samples onto PC1 and PC2 only; together these two PCs account for most of the variation (94% in this example).
  • Plot each sample at its PC1 and PC2 coordinates in the new 2D PCA graph (see the sketch below).
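Continuing the scikit-learn sketch, projecting onto the first two PCs gives the coordinates for the 2D PCA graph:

```python
import matplotlib.pyplot as plt

scores = pca.transform(X3)[:, :2]   # each row: a mouse's (PC1, PC2) coordinates

plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('2D PCA plot')
plt.show()
```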

Conclusion

  • PCA is a powerful method for reducing dimensionality while maintaining variance.
  • Even with noisy data, PCA can reveal clusters.

Reminder: Understand terminology such as eigenvalues, eigenvectors, and the concept of PCA as it applies to dimensionality reduction and variance analysis.