Linear Discriminant Analysis (LDA)

Jul 2, 2024

Lecture on Linear Discriminant Analysis (LDA)

Introduction

Presented by the Genetics Department at the University of North Carolina at Chapel Hill.
Topic: Linear Discriminant Analysis (LDA).
Goal: Understand what LDA does and how it works through examples.
Purpose: Separate two groups of patients (those a cancer drug helps and those it harms) by their gene expression, to decide who should receive the drug.

Motivation

Cancer Drug Example:
- Objective: Identify who the drug helps and who it harms using gene expression.
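- One Gene: Not sufficient to separate the two categories cleanly; they overlap along a single axis.
- Two Genes: Does a better job, separating the categories with a line on a 2D plot.
- Three Genes: Uses a plane to separate the categories, but a 3D plot is hard to read.
- More than Three Genes: Four or more dimensions cannot be drawn directly, so the data has to be reduced to fewer dimensions.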

Principal Component Analysis (PCA) Review

Reduces dimensions by finding new axes (combinations of genes) along which the data varies the most.
Useful for creating simple XY plots from high-dimensional data.
However, PCA doesn't maximize the separation between groups.

Linear Discriminant Analysis (LDA)

Similar to PCA: Both reduce dimensions.
Key Difference: LDA focuses on maximizing separability among known categories.

Simple Example

2D to 1D Reduction:
- Bad Option: Ignoring one gene (gene X or gene Y) leads to loss of information.
- Good Option: LDA creates a new axis maximizing the separation of categories.

Criteria for New Axis

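Maximize Distance Between Means:
- The Greek letter "μ" represents the mean of each category (green and red) after projection onto the new axis.
Minimize Scatter (Variation) Within Each Category:
- Represented as s² for each category.
Both Criteria Combined:
- Formula: (μ_green - μ_red)² / (s²_green + s²_red), the squared difference between the means divided by the sum of the scatter. LDA picks the axis that makes this ratio as large as possible.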
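The ratio above can be evaluated for any candidate axis. The sketch below is illustrative only: the expression values are made up, the helper names (project, scatter, separation) are hypothetical, and scatter is assumed to mean the sum of squared deviations from the category mean.

```python
from statistics import mean

def project(points, axis):
    """Project 2D points onto the unit vector pointing along `axis`."""
    norm = (axis[0] ** 2 + axis[1] ** 2) ** 0.5
    ux, uy = axis[0] / norm, axis[1] / norm
    return [x * ux + y * uy for x, y in points]

def scatter(values):
    """Scatter s^2, assumed here to be the sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def separation(green, red, axis):
    """(mu_green - mu_red)^2 / (s^2_green + s^2_red) along `axis`."""
    g, r = project(green, axis), project(red, axis)
    return (mean(g) - mean(r)) ** 2 / (scatter(g) + scatter(r))

# Made-up (gene X, gene Y) expression values, for illustration only.
green = [(1, 2), (2, 4), (3, 4), (4, 6)]
red = [(2, 1), (3, 1), (4, 3), (5, 3)]

score_x = separation(green, red, (1, 0))          # keep only gene X
score_y = separation(green, red, (0, 1))          # keep only gene Y
score_combined = separation(green, red, (1, -1))  # an axis mixing both genes

print(score_x, score_y, score_combined)
```

On this toy data, the axis that mixes both genes scores far higher than either gene alone, which is the reason for building a new axis instead of dropping a gene.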

Practical Example: Importance of Both Criteria

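If only the distance between the means is maximized, the projected categories can still overlap, because each one may be widely spread along the new axis.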
Optimizing both distance and scatter provides better separation.
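A small sketch of why both criteria matter, under stated assumptions: the data is made up, the helper names are hypothetical, and the "best" axis is computed with the standard two-category closed form (inverse of the pooled scatter matrix times the difference of the means), which the lecture summary itself does not spell out.

```python
from statistics import mean

def centroid(points):
    """Mean of each coordinate."""
    return mean(x for x, _ in points), mean(y for _, y in points)

def scatter_matrix(points):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix about the centroid."""
    cx, cy = centroid(points)
    sxx = sum((x - cx) ** 2 for x, _ in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    return sxx, sxy, syy

def separation(green, red, axis):
    """Squared distance between projected means over the summed scatter."""
    norm = (axis[0] ** 2 + axis[1] ** 2) ** 0.5
    ux, uy = axis[0] / norm, axis[1] / norm
    g = [x * ux + y * uy for x, y in green]
    r = [x * ux + y * uy for x, y in red]
    mg, mr = mean(g), mean(r)
    total_scatter = sum((v - mg) ** 2 for v in g) + sum((v - mr) ** 2 for v in r)
    return (mg - mr) ** 2 / total_scatter

def lda_axis(green, red):
    """Axis proportional to inverse(pooled scatter) times (mu_green - mu_red)."""
    gxx, gxy, gyy = scatter_matrix(green)
    rxx, rxy, ryy = scatter_matrix(red)
    a, b, d = gxx + rxx, gxy + rxy, gyy + ryy  # pooled scatter [[a, b], [b, d]]
    det = a * d - b * b
    (gx, gy), (rx, ry) = centroid(green), centroid(red)
    dx, dy = gx - rx, gy - ry
    # Closed-form inverse of the symmetric 2x2 matrix, applied to (dx, dy).
    return (d * dx - b * dy) / det, (a * dy - b * dx) / det

# Made-up (gene X, gene Y) expression values, for illustration only.
green = [(1, 2), (2, 4), (3, 4), (4, 6)]
red = [(2, 1), (3, 1), (4, 3), (5, 3)]

(gx, gy), (rx, ry) = centroid(green), centroid(red)
mean_axis = (gx - rx, gy - ry)     # maximizes distance between means only
best_axis = lda_axis(green, red)   # balances distance against scatter

score_mean_axis = separation(green, red, mean_axis)
score_best_axis = separation(green, red, best_axis)
print(score_mean_axis, score_best_axis)
```

Here the axis through the two means gives the largest possible distance between them, yet its ratio is much lower, because the categories are stretched out along that direction. The second axis gives up a little distance in exchange for far less scatter.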

Multi-Dimensional LDA

Same process for more than two genes.
Create a new axis that maximizes the distance between the category means while minimizing the scatter within each category.
Three Categories: Introduces two axes (defining a plane) for separation.
- Find a point central to all of the data, then measure the distance from each category's center to this point.
- Maximize these distances while minimizing the scatter within each category; in the example, the three categories end up clearly separated.
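The three-category criterion can be sketched along a single candidate axis as follows. This is a minimal illustration with made-up 2D data and hypothetical helper names; the central point is assumed to be the mean of all samples, and scatter is assumed to be the sum of squared deviations from each category's mean.

```python
from statistics import mean

def project(points, axis):
    """Project 2D points onto the unit vector pointing along `axis`."""
    norm = (axis[0] ** 2 + axis[1] ** 2) ** 0.5
    ux, uy = axis[0] / norm, axis[1] / norm
    return [x * ux + y * uy for x, y in points]

def scatter(values):
    """Sum of squared deviations from the category mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def multiclass_separation(categories, axis):
    """Sum of squared center-to-central-point distances over summed scatter."""
    projected = [project(points, axis) for points in categories.values()]
    # Central point of all the data (assumed: mean of every projected sample).
    center = mean(v for values in projected for v in values)
    between = sum((mean(values) - center) ** 2 for values in projected)
    within = sum(scatter(values) for values in projected)
    return between / within

# Made-up (gene X, gene Y) values for three categories, for illustration only.
categories = {
    "a": [(1, 1), (2, 1), (1, 2), (2, 2)],
    "b": [(5, 1), (6, 1), (5, 2), (6, 2)],
    "c": [(3, 5), (4, 5), (3, 6), (4, 6)],
}

score_x = multiclass_separation(categories, (1, 0))
score_y = multiclass_separation(categories, (0, 1))
print(score_x, score_y)
```

In this toy data the gene Y axis scores higher overall, but categories "a" and "b" land on top of each other along it, so a second axis is still needed to tell them apart. That is one way to see why three categories call for two axes.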

Comparison of LDA and PCA

Real Data Example: LDA vs PCA applied to 10,000 genes.
- LDA: Better separates categories.
- PCA: Finds the axes with the most variation, which are not necessarily the axes that best separate the categories.
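The same contrast shows up even in two dimensions. The sketch below is not the 10,000-gene example from the lecture; it uses made-up data and hypothetical helper names, computes the first principal axis from the closed-form principal angle of a 2x2 scatter matrix, and computes the first LDA axis with the standard two-category closed form.

```python
from math import atan2, cos, sin
from statistics import mean

def scatter_matrix(points):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix about the centroid."""
    cx = mean(x for x, _ in points)
    cy = mean(y for _, y in points)
    sxx = sum((x - cx) ** 2 for x, _ in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    return sxx, sxy, syy

def pc1_axis(points):
    """Direction of greatest variation, ignoring category labels."""
    sxx, sxy, syy = scatter_matrix(points)
    # Principal-axis angle of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    return cos(theta), sin(theta)

def ld1_axis(green, red):
    """Axis proportional to inverse(pooled scatter) times (mu_green - mu_red)."""
    gxx, gxy, gyy = scatter_matrix(green)
    rxx, rxy, ryy = scatter_matrix(red)
    a, b, d = gxx + rxx, gxy + rxy, gyy + ryy
    det = a * d - b * b
    dx = mean(x for x, _ in green) - mean(x for x, _ in red)
    dy = mean(y for _, y in green) - mean(y for _, y in red)
    return (d * dx - b * dy) / det, (a * dy - b * dx) / det

def separation(green, red, axis):
    """Squared distance between projected means over the summed scatter."""
    norm = (axis[0] ** 2 + axis[1] ** 2) ** 0.5
    ux, uy = axis[0] / norm, axis[1] / norm
    g = [x * ux + y * uy for x, y in green]
    r = [x * ux + y * uy for x, y in red]
    mg, mr = mean(g), mean(r)
    total_scatter = sum((v - mg) ** 2 for v in g) + sum((v - mr) ** 2 for v in r)
    return (mg - mr) ** 2 / total_scatter

# Made-up (gene X, gene Y) expression values, for illustration only.
green = [(1, 2), (2, 4), (3, 4), (4, 6)]
red = [(2, 1), (3, 1), (4, 3), (5, 3)]

pc1 = pc1_axis(green + red)   # PCA never sees the labels
ld1 = ld1_axis(green, red)    # LDA uses the labels

score_pc1 = separation(green, red, pc1)
score_ld1 = separation(green, red, ld1)
print(pc1, score_pc1)
print(ld1, score_ld1)
```

On this data PC1 follows the direction in which both categories are stretched, so the categories overlap heavily along it, while LD1 cuts across that direction and keeps them apart.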

Similarities between LDA and PCA

Both rank new axes in order of importance:
- PCA: PC1, PC2, etc., based on data variation.
- LDA: LD1, LD2, etc., based on category separation.
Both methods allow analysis of contributing genes:
- PCA: Loading scores.
- LDA: Correlation with new axes.

Summary

LDA vs PCA:
- PCA: Reduces dimensions by focusing on variation.
- LDA: Reduces dimensions by maximizing category separation.

Tune in next time for another exciting StatQuest!
