Lecture on Linear Discriminant Analysis (LDA)
Introduction
- Presented by the Genetics Department at the University of North Carolina at Chapel Hill.
- Topic: Linear Discriminant Analysis (LDA).
- Goal: Understand what LDA does and how it works through examples.
- Purpose: To differentiate between two groups using gene expression to decide the best use of a cancer drug.
Motivation
- Cancer Drug Example:
- Objective: Identify who the drug helps and who it harms using gene expression.
- One Gene: Not sufficient to create a clear separation.
- Two Genes: Do a better job; the categories can be separated with a line.
- Three Genes: Uses a plane to separate categories but hard to visualize.
- More than Three Genes: Impossible to draw high-dimensional graphs.
Principal Component Analysis (PCA) Review
- Reduces dimensions by finding new axes, built from combinations of genes, along which the data vary the most.
- Useful for creating simple XY plots from high-dimensional data.
- However, PCA doesn't maximize the separation between groups.
Linear Discriminant Analysis (LDA)
- Similar to PCA: Both reduce dimensions.
- Key Difference: LDA focuses on maximizing separability among known categories.
Simple Example
- 2D to 1D Reduction:
- Bad Option: Ignoring one gene (gene X or gene Y) leads to loss of information.
- Good Option: LDA creates a new axis maximizing the separation of categories.
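The difference between the two options can be sketched in a few lines of Python. The expression values and the diagonal axis below are made up for illustration, and the axis is chosen by eye rather than computed by LDA; the point is only that projecting onto a well-chosen new axis keeps information from both genes, while dropping gene Y does not.

```python
import math

# Hypothetical (gene X, gene Y) expression values for the two categories.
green = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
red = [(3.0, 2.0), (4.0, 3.0), (5.0, 4.0)]


def project(points, axis):
    """Project 2D points onto the line through the origin along `axis`."""
    length = math.hypot(axis[0], axis[1])
    ux, uy = axis[0] / length, axis[1] / length
    return [x * ux + y * uy for x, y in points]


# Bad option: ignore gene Y (the same as projecting onto the gene X axis).
green_x = project(green, (1.0, 0.0))
red_x = project(red, (1.0, 0.0))

# Alternative: a new axis that mixes both genes (picked by eye, not by LDA).
green_new = project(green, (1.0, -1.0))
red_new = project(red, (1.0, -1.0))

print("gene X only:", green_x, red_x)
print("new axis:   ", green_new, red_new)
```

On gene X alone the largest green value meets the smallest red value, so the categories touch. On the mixed axis every green sample lands on one side and every red sample on the other.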
Criteria for New Axis
- Maximize Distance Between Means:
- The Greek letter μ denotes the mean of each category (green and red) on the new axis.
- Minimize Scatter (Variation):
- Represented as s² for each category.
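- Formula: (μ₁ − μ₂)² / (s₁² + s₂²), the squared difference between the two means divided by the sum of the two scatters. LDA chooses the axis that makes this ratio as large as possible.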
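As a rough sketch, the ratio can be computed for samples that have already been projected onto a candidate axis. The function names and the sample values are hypothetical. Scatter is assumed here to be the sum of squared deviations from the category mean; dividing by the sample count would turn it into a variance, which changes the number but not which axis scores best when the categories are the same size.

```python
def mean(values):
    return sum(values) / len(values)


def scatter(values):
    """Sum of squared deviations from the mean (assumed meaning of s²)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def separation_score(group_a, group_b):
    """(mean_a - mean_b)^2 / (scatter_a + scatter_b) for 1D values."""
    gap = mean(group_a) - mean(group_b)
    return gap ** 2 / (scatter(group_a) + scatter(group_b))


# Hypothetical positions of the samples along one candidate axis.
green_1d = [1.0, 2.0, 3.0]
red_1d = [6.0, 7.0, 8.0]

score = separation_score(green_1d, red_1d)
print(score)
```

Here the means are 2 and 7 and each category has a scatter of 2, so the score is 25 / 4. A larger score means the means are far apart relative to how spread out each category is.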
Practical Example: Importance of Both Criteria
- If only the distance between the means is maximized, the categories can still overlap on the new axis.
- Optimizing both distance and scatter provides better separation.
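The effect can be checked numerically on a small made-up data set. The sketch below compares an axis that only follows the line between the two means with the standard closed-form solution for two categories, w = S_W⁻¹ (mean_red − mean_green), where S_W is the pooled within-category scatter matrix. That closed form is a textbook result, not something stated in these notes, and all names and values are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def centroid(points):
    return (mean([p[0] for p in points]), mean([p[1] for p in points]))


def project(points, axis):
    # The score below does not depend on the axis length, so no normalizing.
    return [x * axis[0] + y * axis[1] for x, y in points]


def separation_score(points_a, points_b, axis):
    a, b = project(points_a, axis), project(points_b, axis)
    scatter_a = sum((v - mean(a)) ** 2 for v in a)
    scatter_b = sum((v - mean(b)) ** 2 for v in b)
    return (mean(a) - mean(b)) ** 2 / (scatter_a + scatter_b)


def within_scatter(points_a, points_b):
    """Pooled 2x2 within-category scatter matrix as (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    for pts in (points_a, points_b):
        cx, cy = centroid(pts)
        for x, y in pts:
            sxx += (x - cx) ** 2
            sxy += (x - cx) * (y - cy)
            syy += (y - cy) ** 2
    return sxx, sxy, syy


# Hypothetical data: both categories are stretched along gene X.
green = [(0.0, 0.0), (2.0, 1.0), (4.0, 0.0), (6.0, 1.0)]
red = [(4.0, 3.0), (6.0, 4.0), (8.0, 3.0), (10.0, 4.0)]

(gx, gy), (rx, ry) = centroid(green), centroid(red)
dx, dy = rx - gx, ry - gy

# Axis 1: only maximizes the distance between the means.
distance_only_axis = (dx, dy)

# Axis 2: also accounts for scatter, via the inverse of the 2x2 matrix S_W.
sxx, sxy, syy = within_scatter(green, red)
det = sxx * syy - sxy * sxy
lda_axis = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)

score_distance_only = separation_score(green, red, distance_only_axis)
score_lda = separation_score(green, red, lda_axis)
print(score_distance_only, score_lda)
```

On this data the distance-only axis leaves the highest green sample above the lowest red sample, while the second axis separates the categories cleanly and has a much higher ratio.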
Multi-Dimensional LDA
- Same process for more than two genes.
- Create a new axis that maximizes the distance between the category means while minimizing the scatter.
- Three Categories: Introduces two axes (defining a plane) for separation.
- Find a central point for all of the data, then measure the distance from each category's center to that point.
- Optimized for separation: Example shows three distinct categories.
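A minimal sketch of the multi-category criterion, assuming it is the sum of squared distances from each category mean to the central point, divided by the sum of the category scatters, all measured along one candidate axis. The sketch only scores axes that are handed to it; finding the best axes is normally done with an eigen-decomposition, which is not shown here. Textbook formulations also weight each category by its sample count, which makes no difference in this example because the categories are the same size. The data and names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def project(points, axis):
    """Dot product of each sample with the axis (any number of genes)."""
    return [sum(p_i * a_i for p_i, a_i in zip(p, axis)) for p in points]


def multi_category_score(categories, axis):
    projected = [project(points, axis) for points in categories]
    # Central point of all the data on this axis.
    center = mean([v for group in projected for v in group])
    # Squared distance from each category mean to the central point.
    distances = sum((mean(group) - center) ** 2 for group in projected)
    # Scatter of each category around its own mean.
    scatters = sum(
        sum((v - mean(group)) ** 2 for v in group) for group in projected
    )
    return distances / scatters


# Hypothetical data: three categories, three genes; gene 3 carries no signal.
cat_a = [(1.0, 1.0, 5.0), (2.0, 1.0, 6.0), (1.0, 2.0, 5.0)]
cat_b = [(5.0, 1.0, 5.0), (6.0, 2.0, 6.0), (5.0, 2.0, 5.0)]
cat_c = [(3.0, 6.0, 5.0), (3.0, 7.0, 6.0), (4.0, 6.0, 5.0)]
categories = [cat_a, cat_b, cat_c]

score_gene1 = multi_category_score(categories, (1.0, 0.0, 0.0))
score_gene2 = multi_category_score(categories, (0.0, 1.0, 0.0))
score_gene3 = multi_category_score(categories, (0.0, 0.0, 1.0))
print(score_gene1, score_gene2, score_gene3)
```

The axis along gene 3 scores essentially zero because all three category means coincide on it, while the other two axes pull the category means away from the central point. With three categories, at most two axes carry separation of this kind, which matches the plane described above.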
Comparison of LDA and PCA
- Real Data Example: LDA vs PCA applied to 10,000 genes.
- LDA: Better separates categories.
- PCA: Finds the axes with the most variation, which are not necessarily the best axes for separating categories.
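The 10,000-gene comparison cannot be reproduced here, but the same contrast shows up on a tiny made-up example with two genes. In the sketch below, PC1 is the direction of greatest variation of all samples pooled together (labels ignored), found with the closed-form angle for a 2x2 symmetric matrix. LD1 uses the labels through the two-category solution w = S_W⁻¹ (mean_red − mean_green). Both formulas are standard results rather than something taken from these notes, and the data are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def scatter_matrix(points):
    """2x2 scatter of points around their own centroid, as (sxx, sxy, syy)."""
    cx = mean([x for x, _ in points])
    cy = mean([y for _, y in points])
    sxx = sum((x - cx) ** 2 for x, _ in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    return sxx, sxy, syy


def separation_score(points_a, points_b, axis):
    a = [x * axis[0] + y * axis[1] for x, y in points_a]
    b = [x * axis[0] + y * axis[1] for x, y in points_b]
    scatter = sum((v - mean(a)) ** 2 for v in a) + sum((v - mean(b)) ** 2 for v in b)
    return (mean(a) - mean(b)) ** 2 / scatter


# Hypothetical data: gene X varies a lot but is the same in both categories;
# gene Y varies little but differs between the categories.
green = [(0.0, 0.0), (3.0, 1.0), (6.0, 0.0), (9.0, 1.0)]
red = [(0.0, 2.0), (3.0, 3.0), (6.0, 2.0), (9.0, 3.0)]

# PCA ignores the labels: first principal axis of all samples together.
sxx, sxy, syy = scatter_matrix(green + red)
angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
pc1 = (math.cos(angle), math.sin(angle))

# LDA uses the labels: pooled within-category scatter and the mean difference.
gxx, gxy, gyy = scatter_matrix(green)
rxx, rxy, ryy = scatter_matrix(red)
wxx, wxy, wyy = gxx + rxx, gxy + rxy, gyy + ryy
dx = mean([x for x, _ in red]) - mean([x for x, _ in green])
dy = mean([y for _, y in red]) - mean([y for _, y in green])
det = wxx * wyy - wxy * wxy
ld1 = ((wyy * dx - wxy * dy) / det, (wxx * dy - wxy * dx) / det)

score_pc1 = separation_score(green, red, pc1)
score_ld1 = separation_score(green, red, ld1)
print(pc1, score_pc1)
print(ld1, score_ld1)
```

PC1 ends up almost parallel to gene X, where the variation is largest, and barely separates the categories. LD1 leans on gene Y and gives a far higher separation ratio.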
Similarities between LDA and PCA
- Both rank new axes in order of importance:
- PCA: PC1, PC2, etc., based on data variation.
- LDA: LD1, LD2, etc., based on category separation.
- Both methods allow analysis of contributing genes:
- PCA: Loading scores.
- LDA: Correlation of each gene with the new axes.
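One way to read the LDA bullet is to correlate each gene's expression values with the samples' positions on a discriminant axis. The sketch below assumes an axis has already been found; the axis, the sample values, and the function names are all hypothetical. It computes a plain Pearson correlation per gene.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (gene X, gene Y) samples from both categories together.
samples = [
    (0.0, 0.0), (3.0, 1.0), (6.0, 0.0), (9.0, 1.0),
    (0.0, 2.0), (3.0, 3.0), (6.0, 2.0), (9.0, 3.0),
]

# Hypothetical LD1 axis, assumed to have been computed beforehand.
ld1_axis = (-1.0 / 12.0, 1.25)
ld1_values = [x * ld1_axis[0] + y * ld1_axis[1] for x, y in samples]

corr_gene_x = pearson([x for x, _ in samples], ld1_values)
corr_gene_y = pearson([y for _, y in samples], ld1_values)
print(corr_gene_x, corr_gene_y)
```

In this example gene Y is strongly correlated with LD1 and gene X is essentially uncorrelated, suggesting that gene Y is the one driving the separation along that axis.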
Summary
- LDA vs PCA:
- PCA: Reduces dimensions by focusing on variation.
- LDA: Reduces dimensions by maximizing category separation.
Tune in next time for another exciting StatQuest!