Principal Component Analysis (PCA) using Singular Value Decomposition (SVD)

Jul 12, 2024

Introduction

Presenter: Josh Starmer of StatQuest

Objectives:

  • Understand what PCA does, how it works, and how to use it for data insight.
  • Demonstrate using an example data set with two genes measured in six mice.

Data Set and Visualization

Data Measurements:

  • Samples (Mice): 6 samples
  • Variables (Genes): 2 genes

Data Visualization:

  • 1 Gene: Plot on a number line
    • Mice 1, 2, 3 (high values)
    • Mice 4, 5, 6 (low values)
  • 2 Genes: Plot on a 2D X-Y graph
    • Gene 1: X-axis
    • Gene 2: Y-axis
    • Mice clustering: Mice 1, 2, 3 (right side), Mice 4, 5, 6 (lower left side)
  • 3 Genes: 3D graph by adding another axis
  • 4+ Genes: Cannot be plotted directly, because each gene needs its own axis and we cannot draw four or more dimensions

Principal Component Analysis (PCA)

Purpose:

  • Reduce higher-dimensional data to two dimensions for easier visualization.
  • Identify which gene is most valuable for clustering data.
  • Assess accuracy of the 2D graph representation.

Steps in PCA:

  1. Plotting the Data:
    • Use 2D data example (2 genes) for explanation.
  2. Calculate Averages:
    • Average of Gene 1
    • Average of Gene 2
    • Determine center of the data from averages
  3. Shift Data:
    • Center the data on the origin
  4. Fit a Line:
    • Draw a random line through the origin
    • Rotate line to best fit data
  5. Quantify Fit:
    • Project data onto line
    • Measure distances from data to line
    • Either minimize these distances or, equivalently, maximize the distances from the projected points to the origin
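
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's code: the gene values are made up for the example, and the "rotation" is approximated by trying many angles and keeping the one whose projected points have the largest sum of squared distances to the origin.

```python
import numpy as np

# Hypothetical measurements (illustrative values only):
# rows are mice 1-6, columns are Gene 1 and Gene 2.
data = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])

# Steps 2-3: average each gene, then shift the data so its center sits on the origin.
center = data.mean(axis=0)
centered = data - center

def ss_projected(angle):
    """Sum of squared distances from the projected points to the origin."""
    direction = np.array([np.cos(angle), np.sin(angle)])  # unit vector along the line
    projections = centered @ direction                     # signed position of each projected point
    return float(np.sum(projections ** 2))

# Steps 4-5: try lines through the origin in 0.1 degree steps and keep the best fit.
# Angles from 0 to pi cover every line, since angle a and a + pi describe the same line.
angles = np.linspace(0.0, np.pi, 1801)
scores = np.array([ss_projected(a) for a in angles])
best_angle = angles[np.argmax(scores)]
best_direction = np.array([np.cos(best_angle), np.sin(best_angle)])

print("center of the data:", center)
print("best-fit direction:", best_direction)
print("SS(distances) on best-fit line:", scores.max())
```

In practice the best-fit line is found with SVD rather than a search over angles; the search is only here to mirror the "rotate the line" description.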

Mathematical Details and Terminology

Understanding Distances:

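  • Use the Pythagorean theorem: each point's distance to the origin is fixed, so as its distance to the line shrinks, the distance from its projected point to the origin grows
  • Goal: Maximize the sum of squared distances (SS distances) from the projected points to the origin
  • Principal Component 1 (PC1): The line through the origin with the largest SS distances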
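
The inverse relation can be checked numerically. The sketch below uses made-up gene values and a hypothetical helper, split_ss, to show that for any line through the origin the two sums of squares always add up to the same total, so minimizing one is the same as maximizing the other.

```python
import numpy as np

# Hypothetical measurements (illustrative values only): 6 mice x 2 genes.
data = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])
centered = data - data.mean(axis=0)

# Sum of a^2: squared distance from each point to the origin. It does not depend on the line.
total = float(np.sum(centered ** 2))

def split_ss(degrees):
    """Return (SS of projected-point-to-origin distances, SS of point-to-line distances)."""
    angle = np.radians(degrees)
    direction = np.array([np.cos(angle), np.sin(angle)])  # unit vector along the line
    c = centered @ direction                               # projected point to origin
    residual = centered - np.outer(c, direction)           # each point minus its projection
    b = np.linalg.norm(residual, axis=1)                   # point to line
    return float(np.sum(c ** 2)), float(np.sum(b ** 2))

for degrees in (0, 20, 45, 90):
    ss_c, ss_b = split_ss(degrees)
    print(degrees, "degrees:", round(ss_c, 3), "+", round(ss_b, 3), "=", round(ss_c + ss_b, 3))

print("fixed total:", round(total, 3))
```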

Terminology:

  • Linear Combination: Mix of Gene 1 and Gene 2 to form PC1
  • Eigenvector: Unit vector corresponding to a principal component
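  • Loading Scores: Proportions of each gene in the eigenvector
  • Eigenvalue: SS distances from the projected points to the origin on the best-fit line; dividing by n - 1 (sample size minus 1) gives the variation for that principal component, which is the eigenvalue of the covariance matrix
  • Singular Value: Square root of the SS distances for a principal component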
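
As a rough illustration of how these terms relate, the sketch below runs NumPy's SVD on made-up, centered gene values. The variable names are chosen for this example; only np.linalg.svd, np.cov, and np.linalg.eigvalsh are actual NumPy calls.

```python
import numpy as np

# Hypothetical measurements (illustrative values only): 6 mice x 2 genes.
data = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])
centered = data - data.mean(axis=0)
n = centered.shape[0]

# SVD of the centered data. Each row of vt is a unit vector for one principal component.
u, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

pc1 = vt[0]                 # eigenvector for PC1 (its overall sign is arbitrary)
loading_scores = pc1        # amounts of Gene 1 and Gene 2 in the PC1 "recipe"

# SS(distances): squared distances from the projected points to the origin along PC1.
ss_distances = float(np.sum((centered @ pc1) ** 2))
variation = ss_distances / (n - 1)

# Largest eigenvalue of the covariance matrix, for comparison with the variation above.
cov_eigenvalue = float(np.linalg.eigvalsh(np.cov(centered, rowvar=False))[-1])

print("loading scores (Gene 1, Gene 2):", loading_scores)
print("length of eigenvector:", np.linalg.norm(pc1))
print("SS(distances) for PC1:", ss_distances)
print("singular value for PC1:", singular_values[0])
print("variation for PC1:", variation, "vs covariance eigenvalue:", cov_eigenvalue)
```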

Principal Components:

  • PC1: First principal component (best fit line)
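  • PC2: Second principal component (the line through the origin perpendicular to PC1)
    • In 2D, no further optimization is required, because only one line through the origin is perpendicular to PC1
    • The recipe for PC2 uses different proportions of the two genes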
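
A small sketch of why PC2 needs no extra fitting in two dimensions, assuming made-up gene values: rotating the PC1 unit vector by 90 degrees gives the PC2 recipe, and it matches the second direction returned by SVD up to sign.

```python
import numpy as np

# Hypothetical measurements (illustrative values only): 6 mice x 2 genes.
data = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])
centered = data - data.mean(axis=0)

_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

# Rotate PC1 by 90 degrees: (a, b) becomes (-b, a). No optimization is involved.
pc2 = np.array([-pc1[1], pc1[0]])

print("PC1 recipe (Gene 1, Gene 2):", pc1)
print("PC2 recipe (Gene 1, Gene 2):", pc2)
print("dot product (0 means perpendicular):", pc1 @ pc2)
print("agrees with SVD's second direction up to sign:", np.isclose(abs(pc2 @ vt[1]), 1.0))
```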

PCA with More Variables

Example with 3 Genes:

  • Follow similar steps as 2D example
  • PC1: Best fitting line with 3-gene recipe
  • PC2: Next best-fitting line through the origin that is perpendicular to PC1
  • PC3: Perpendicular to both PC1 and PC2
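
The same idea with three genes might look like the sketch below. The values, including the third gene, are invented for illustration. SVD returns all three directions at once, and the checks confirm that they are mutually perpendicular unit vectors ordered from largest to smallest SS distances.

```python
import numpy as np

# Hypothetical measurements (illustrative values only):
# rows are mice 1-6, columns are Gene 1, Gene 2, and Gene 3.
data3 = np.array([
    [10.0, 6.0, 12.0],
    [11.0, 4.0, 9.0],
    [8.0, 5.0, 10.0],
    [3.0, 3.0, 2.5],
    [2.0, 2.8, 1.3],
    [1.0, 1.0, 2.0],
])
centered3 = data3 - data3.mean(axis=0)

# Rows of vt3 are the 3-gene recipes for PC1, PC2, and PC3.
_, s3, vt3 = np.linalg.svd(centered3, full_matrices=False)
ss_per_pc = s3 ** 2   # SS(distances) along each principal component

for i, (recipe, ss) in enumerate(zip(vt3, ss_per_pc), start=1):
    print(f"PC{i} recipe:", np.round(recipe, 3), " SS(distances):", round(float(ss), 3))

# Every pair of PCs has a dot product of 0, and each PC has length 1.
print("PCs are perpendicular unit vectors:", np.allclose(vt3 @ vt3.T, np.eye(3)))
```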

Variance and Scree Plot:

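  • Eigenvalues as measures of variation: dividing the SS distances for each PC by n - 1 (sample size minus 1) gives the variation around the origin for that PC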
  • Scree Plot: Graphical representation of percentage of variation each principal component accounts for
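
The numbers behind a scree plot can be computed as in the sketch below, which assumes made-up values for three genes and prints a text-only bar chart instead of drawing a figure.

```python
import numpy as np

# Hypothetical measurements (illustrative values only): 6 mice x 3 genes.
data3 = np.array([
    [10.0, 6.0, 12.0],
    [11.0, 4.0, 9.0],
    [8.0, 5.0, 10.0],
    [3.0, 3.0, 2.5],
    [2.0, 2.8, 1.3],
    [1.0, 1.0, 2.0],
])
centered3 = data3 - data3.mean(axis=0)
n = centered3.shape[0]

# Singular values squared are the SS(distances); dividing by n - 1 gives variation per PC.
singular_values = np.linalg.svd(centered3, compute_uv=False)
variation = singular_values ** 2 / (n - 1)
percent = 100.0 * variation / variation.sum()

# Text-only scree plot: one bar per principal component, one '#' per 2 percent.
for i, p in enumerate(percent, start=1):
    print(f"PC{i} {p:6.2f}% " + "#" * int(round(p / 2)))
```

If the first two bars account for most of the variation, a 2D plot of PC1 against PC2 is a reasonable approximation of the full data.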

Drawing the PCA Plot

Final PCA Plot:

  • Rotate the data so that PC1 is horizontal and PC2 is vertical
  • Place each sample using its projected points on PC1 and PC2
  • Use to identify data clusters
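
One way to compute the plot coordinates, sketched with made-up gene values: projecting the centered data onto PC1 and PC2 gives each mouse's position in the rotated plot. The printout stands in for an actual figure.

```python
import numpy as np

# Hypothetical measurements (illustrative values only): 6 mice x 2 genes.
data = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])
centered = data - data.mean(axis=0)

u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Project each mouse onto PC1 and PC2. Column 0 is the horizontal (PC1) coordinate,
# column 1 is the vertical (PC2) coordinate in the PCA plot.
coords = centered @ vt.T

for mouse, (x, y) in enumerate(coords, start=1):
    print(f"Mouse {mouse}: PC1 = {x:7.3f}, PC2 = {y:7.3f}")

# The same coordinates also come straight from the SVD as u scaled by the singular values.
print("matches u * s:", np.allclose(coords, u * s))
```

With these illustrative values, mice 1-3 land on one side of the PC1 axis and mice 4-6 on the other. Which side is which depends on the arbitrary sign of the eigenvector.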

Key Takeaways

  • PCA reduces dimensionality for easier data visualization.
  • Eigenvalues and eigenvectors play critical roles in understanding data variance and determining principal components.
  • Scree plots show how much of the variation each principal component accounts for.

Conclusion

  • PCA is a powerful tool for data analysis and visualization.
  • StatQuest offers more detailed videos and explanations on related topics.