
Understanding Principal Component Analysis (PCA)

Mar 3, 2025

StatQuest: Principal Component Analysis (PCA)

Introduction

  • Presenter: Josh Starmer
  • Overview of PCA using Singular Value Decomposition (SVD)
  • Key topics: what PCA does, how it works step by step, and how to interpret the results.

Data Set Example

  • Example: Transcription of two genes (Gene 1 and Gene 2) in 6 mice.
    • Mice = individual samples
    • Genes = variables measured
    • Alternative examples: High school students (test scores) or businesses (market capitalization and employees).

Graphing Data

  • 1 Gene:

    • Data plotted on a number line
    • Groups: Mice 1, 2, 3 (high values) vs. Mice 4, 5, 6 (low values).
  • 2 Genes:

    • Data plotted on a 2D X-Y graph
    • Gene 1 (x-axis) and Gene 2 (y-axis) create clustering of mice.
  • 3 Genes:

    • Transition to 3D graph representation.
  • 4 Genes:

    • 4D data cannot be visualized easily.

Principal Component Analysis (PCA)

  • PCA reduces high-dimensional data to a 2D plot while preserving as much of the variation (and thus the relationships among samples) as possible.
    • Helps identify which variables (genes) contribute most to the clustering.
    • Also reports how accurately the 2D plot represents the original data, via the share of variation each component accounts for.
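As a minimal sketch of this idea (the four-gene counts below are hypothetical, and SVD is used as in the video), un-plottable 4D data can be projected onto its first two principal components to get a drawable 2D PCA plot:

```python
import numpy as np

# Hypothetical counts for four genes across six mice (4D: cannot be plotted directly)
X = np.array([[10.0, 6.0, 12.0, 5.0],
              [11.0, 4.0,  9.0, 6.0],
              [ 8.0, 5.0, 10.0, 5.5],
              [ 3.0, 3.0,  2.5, 1.0],
              [ 2.0, 2.8,  1.3, 0.8],
              [ 1.0, 1.0,  2.0, 1.5]])

# Center the data, then use SVD to find the principal component directions
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project each mouse onto PC1 and PC2: the coordinates of a 2D PCA plot
coords = Xc @ Vt[:2].T        # shape (6, 2)
```

Mice with similar gene profiles land near each other in `coords`, which is how clusters become visible in two dimensions.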

Steps of PCA Calculation

  1. Centering the Data:

    • Calculate average measurements for each gene.
    • Shift data so the center is at the origin.
  2. Fitting a Line:

    • Draw an initial random line through the origin and rotate it until it best fits the data.
    • Fit quality can be judged either by minimizing the perpendicular distances from the data points to the line or, equivalently, by maximizing the distances from the projected points to the origin.
  3. Distance Measurement:

    • Project points onto the line and measure distances to the origin.
    • Calculate the sum of squared distances (SS distances).
  4. Principal Component 1 (PC1):

    • Identified as the line with the largest sum of squared distances.
    • Example: PC1 slope = 0.25 (4 parts Gene 1 for every 1 part Gene 2), indicating Gene 1 contributes more to PC1 than Gene 2.
  5. Scaling the Line:

    • Scale the 4-to-1 recipe so the vector along PC1 has a length of 1 unit (using the Pythagorean theorem).
    • The entries of the resulting unit vector (the eigenvector, or singular vector, for PC1) are the loading scores, indicating how much each gene contributes.
  6. Principal Component 2 (PC2):

    • The line through the origin perpendicular to PC1; no further optimization is needed.
    • Example recipe: -1 parts Gene 1 to 4 parts Gene 2.
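The six steps above can be sketched numerically. This is a brute-force rotation search over hypothetical 2-gene data, mirroring the "rotate the line until it best fits" description; real implementations find PC1 directly via SVD rather than a grid search:

```python
import numpy as np

# Hypothetical 2-gene measurements for six mice
X = np.array([[10.0, 6.0], [11.0, 4.0], [8.0, 5.0],
              [ 3.0, 3.0], [ 2.0, 2.8], [1.0, 1.0]])

# Step 1: center the data so its average sits at the origin
Xc = X - X.mean(axis=0)

# Steps 2-4: rotate a line through the origin and keep the angle that
# maximizes the sum of squared distances (SS) of the projections from the origin
best_angle, best_ss = 0.0, -np.inf
for angle in np.linspace(0.0, np.pi, 1800, endpoint=False):
    direction = np.array([np.cos(angle), np.sin(angle)])
    ss = np.sum((Xc @ direction) ** 2)   # squared projected distances
    if ss > best_ss:
        best_angle, best_ss = angle, ss

# Step 5: the unit vector along the best-fitting line is PC1;
# its entries are the loading scores for Gene 1 and Gene 2
pc1 = np.array([np.cos(best_angle), np.sin(best_angle)])

# Step 6: PC2 is simply perpendicular to PC1 (no optimization needed)
pc2 = np.array([-pc1[1], pc1[0]])
```

The grid search makes the geometry explicit: maximizing SS over directions is exactly what the top singular vector of the centered data achieves in one step.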

Eigenvalues and Variation

  • Eigenvalues represent measures of variation in the dataset.
  • Example Variations:
    • PC1 = 15 (83% total variation)
    • PC2 = 3 (17% total variation)
  • Scree Plot: a bar chart showing the percentage of total variation each PC accounts for.
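The percentages above follow from simple arithmetic on the eigenvalues (using the example values of 15 and 3 from the notes):

```python
# Variation attributed to each PC (eigenvalues from the example above)
eigenvalues = [15.0, 3.0]

# Each PC's share of the total variation: these fractions are
# the bar heights a scree plot would display
total = sum(eigenvalues)
explained = [v / total for v in eigenvalues]   # 15/18 ≈ 0.83, 3/18 ≈ 0.17
```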

More Complex Example (3 Variables)

  • Same principles apply with an additional variable.
  • Calculate PC1, PC2, and PC3; interpret loading scores and variation.
  • Total variation example: PC1 = 79%, PC2 = 15%, PC3 = 6%.
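A sketch of the three-variable case (hypothetical counts; same SVD-based pipeline, just with one more column): compute all three PCs, then keep only the top two for a 2D plot.

```python
import numpy as np

# Hypothetical counts for three genes across six mice
X = np.array([[10.0, 6.0, 12.0],
              [11.0, 4.0,  9.0],
              [ 8.0, 5.0, 10.0],
              [ 3.0, 3.0,  2.5],
              [ 2.0, 2.8,  1.3],
              [ 1.0, 1.0,  2.0]])

# Center, then find PC1-PC3 via SVD; rows of Vt are the loading-score vectors
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the covariance matrix, and each PC's share of variation
eigvals = s**2 / (len(X) - 1)
explained = eigvals / eigvals.sum()

# Keep only PC1 and PC2 for a 2D plot, even though the data is 3D
scores_2d = Xc @ Vt[:2].T
```

If PC1 and PC2 together account for most of the variation (as in the 79% + 15% example), the 2D plot is a faithful summary of the 3D data.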

Conclusion

  • Use of PCA helps simplify complex data into meaningful 2D representations.
  • PCA can still analyze higher dimensional data mathematically, even if visualization is not possible.
  • Even when only the first few principal components are kept, clusters in the data can often still be identified.

Closing

  • Encouragement to subscribe for more StatQuest content and support through music purchases.