
Understanding Principal Component Analysis (PCA)

Apr 22, 2025

StatQuest: Principal Component Analysis (PCA) Explained

Introduction

  • Presenter: Josh Starmer
  • Topic: Principal Component Analysis (PCA) using Singular Value Decomposition (SVD)
  • Objective: Understand how PCA transforms higher-dimensional data into a lower-dimensional form and what insights that gives into the data.

Basic Concepts

  • Data Set Example:
    • Measured transcription of two genes in 6 mice.
    • Can be generalized to other datasets (e.g., students and test scores).
  • Data Plotting:
    • 1 Gene: Plot on a number line.
    • 2 Genes: Plot on a 2D X-Y graph.
    • 3 Genes: Plot on a 3D graph.
    • 4+ Genes: Cannot be directly plotted on a graph.

PCA Overview

  • PCA Goal: Reduce dimensionality while preserving as much variation as possible.
  • PCA Plot: Shows clusters of similar samples.
  • Value of Variables: Identifies which variables are most valuable for clustering.
  • Accuracy of Graphs: PCA can indicate how accurately a lower-dimensional graph represents the higher-dimensional data.

PCA Process

  1. Data Centering: Subtract each variable's mean so that the center of the data sits at the origin.
  2. Line Fitting:
    • Start with a random line through the origin.
    • Rotate the line to minimize the sum of squared distances from the data points to the line or, equivalently, to maximize the sum of squared distances from the projected points to the origin.
  3. Principal Components:
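    • PC1: Best-fit line through the origin, i.e., the line with the largest sum of squared distances from the projected points to the origin.
      • Recipe: A linear combination of Gene 1 and Gene 2; the slope of PC1 gives the ratio of the two.
    • PC2: Line through the origin perpendicular to PC1.
      • Recipe: Also a linear combination of Gene 1 and Gene 2, with the ratio fixed by being perpendicular to PC1.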
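
The steps above can be sketched in plain Python. This is a minimal illustration, not the method used by any particular library: the gene values are made up for the example, the helper names (center, ss_projected, fit_pc1) are hypothetical, and the brute-force scan over line angles stands in for the "rotate the line until it fits best" idea.

```python
import math

# Hypothetical measurements for illustration only: 6 mice, 2 genes.
gene1 = [10.0, 11.0, 8.0, 3.0, 2.0, 1.0]
gene2 = [6.0, 4.0, 5.0, 3.0, 2.8, 1.0]

def center(values):
    """Step 1: shift the values so that their mean is 0."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def ss_projected(xs, ys, angle):
    """Sum of squared distances from the projected points to the origin
    for the line through the origin at the given angle (in radians)."""
    ux, uy = math.cos(angle), math.sin(angle)  # unit vector along the line
    return sum((x * ux + y * uy) ** 2 for x, y in zip(xs, ys))

def fit_pc1(xs, ys, steps=18000):
    """Step 2: try many lines through the origin and keep the one with the
    largest sum of squared projected distances (a brute-force search)."""
    angles = (math.pi * k / steps for k in range(steps))
    best = max(angles, key=lambda a: ss_projected(xs, ys, a))
    return math.cos(best), math.sin(best)

x, y = center(gene1), center(gene2)
pc1 = fit_pc1(x, y)        # unit vector: the "recipe" of Gene 1 and Gene 2
pc2 = (-pc1[1], pc1[0])    # step 3: PC1 rotated by 90 degrees

print("PC1 recipe (Gene 1, Gene 2):", pc1)
print("PC2 recipe (Gene 1, Gene 2):", pc2)
```

In practice the search is replaced by SVD, which finds the same directions directly; the scan is only here to mirror the line-fitting description.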

Mathematical Explanation

  • PCA Optimization:
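    • Uses the Pythagorean theorem: each point's distance to the origin is fixed, so shrinking its distance to the line grows the distance from its projection to the origin.
    • Maximizes the sum of squared distances from the projected points to the origin.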
  • Terminology:
    • Eigenvector (or Singular Vector): Unit vector with loading scores indicating importance of each gene.
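    • Eigenvalue: Sum of squared distances from the projected points to the origin along a PC, divided by n - 1 (n = number of samples); this is the variation along that PC.
    • Singular Value: Square root of the sum of squared distances along a PC.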
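
The relationships in this section can be checked numerically. The sketch below assumes NumPy and uses made-up data (6 samples, 2 genes); the variable names are illustrative. It verifies the Pythagorean relation for every point and compares the sum of squared projected distances with the singular value from SVD and with the largest eigenvalue of the covariance matrix.

```python
import numpy as np

# Hypothetical data for illustration: rows are 6 mice, columns are Gene 1 and Gene 2.
data = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])
centered = data - data.mean(axis=0)
n = centered.shape[0]

# SVD of the centered data: the rows of vt are unit-length singular vectors,
# i.e. the PC directions; their components are the loading scores.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

# Pythagorean theorem per point: a^2 = b^2 + c^2
a_sq = (centered ** 2).sum(axis=1)                        # point to origin
c = centered @ pc1                                        # projected point to origin (signed)
b_sq = ((centered - np.outer(c, pc1)) ** 2).sum(axis=1)   # point to the PC1 line

ss_pc1 = (c ** 2).sum()               # sum of squared projected distances
singular_value_pc1 = np.sqrt(ss_pc1)  # should match singular_values[0]
variation_pc1 = ss_pc1 / (n - 1)      # should match the top covariance eigenvalue

# Eigenvalues of the covariance matrix, sorted from largest to smallest.
cov_eigenvalues = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]

print("a^2 == b^2 + c^2 for all points:", np.allclose(a_sq, b_sq + c ** 2))
print("sqrt(SS) vs singular value:", singular_value_pc1, singular_values[0])
print("SS / (n - 1) vs eigenvalue:", variation_pc1, cov_eigenvalues[0])
```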

Complex Data Sets

  • 3 Variables Example:
    • Same process as with 2 variables, but with an additional principal component.
    • PC1, PC2, PC3: Each is a line through the origin; PC2 is perpendicular to PC1, and PC3 is perpendicular to both PC1 and PC2.
  • Variation Measurement:
    • Eigenvalues give the variation each PC accounts for; dividing each by their total gives that PC's percentage of the variation.
    • Scree Plot: Bar chart of the percentage of variation each PC accounts for.
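
A scree plot boils down to a few lines of arithmetic. The sketch below assumes NumPy and a made-up 3-gene data set; it prints a text version of the plot rather than drawing one, to stay free of plotting dependencies.

```python
import numpy as np

# Hypothetical data for illustration: 6 samples, 3 genes.
data = np.array([
    [10.0, 6.0, 12.0],
    [11.0, 4.0, 9.0],
    [8.0, 5.0, 10.0],
    [3.0, 3.0, 2.5],
    [2.0, 2.8, 1.3],
    [1.0, 1.0, 2.0],
])
centered = data - data.mean(axis=0)
n = centered.shape[0]

# Rows of vt are PC1, PC2, PC3: mutually perpendicular unit vectors.
_, s, vt = np.linalg.svd(centered, full_matrices=False)

variation = s ** 2 / (n - 1)                  # variation along each PC
percent = 100 * variation / variation.sum()   # share of the total variation

# Text-mode scree plot: one bar per PC, one '#' per 2 percent.
for i, p in enumerate(percent, start=1):
    print(f"PC{i}: {p:5.1f}% " + "#" * int(round(p / 2)))
```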

Summary and Application

  • 2D Graphs from Higher Dimensions:
    • Use the PCs with the highest variation percentages (typically PC1 and PC2) as the axes of a lower-dimensional graph.
    • Check the scree plot to see how much of the total variation those PCs capture.
  • Cluster Identification:
    • Even noisy PCA plots can help identify similar sample clusters.
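
As a rough sketch of the whole workflow, the example below takes a made-up data set with 4 genes (too many to plot directly), projects it onto PC1 and PC2, and reports how much of the variation the resulting 2D graph keeps. NumPy is assumed and all names and values are illustrative.

```python
import numpy as np

# Hypothetical data for illustration: 6 samples, 4 genes.
data = np.array([
    [10.0, 6.0, 12.0, 5.0],
    [11.0, 4.0, 9.0, 7.0],
    [8.0, 5.0, 10.0, 6.0],
    [3.0, 3.0, 2.5, 2.0],
    [2.0, 2.8, 1.3, 4.0],
    [1.0, 1.0, 2.0, 7.0],
])
centered = data - data.mean(axis=0)

_, s, vt = np.linalg.svd(centered, full_matrices=False)
percent = 100 * s ** 2 / (s ** 2).sum()   # variation percentage per PC

# Coordinates of each sample along PC1 and PC2: the points of the 2D PCA plot.
coords_2d = centered @ vt[:2].T

# How accurately the 2D graph represents the 4D data.
explained_2d = percent[:2].sum()

for sample, (x, y) in enumerate(coords_2d, start=1):
    print(f"sample {sample}: PC1 = {x:7.2f}, PC2 = {y:7.2f}")
print(f"PC1 + PC2 account for {explained_2d:.1f}% of the variation")
```

Samples whose coordinates land close together in this output are the candidates for a cluster; a low combined percentage signals that the 2D picture is a noisy summary.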

Conclusion

  • Encouragement to subscribe and support StatQuest via music purchase.
  • Emphasis: PCA simplifies complex data into understandable formats, aiding in insights and data analysis.