Linear Discriminant Analysis (LDA)

Jul 2, 2024

Lecture on Linear Discriminant Analysis (LDA)

Introduction

  • Presented by the Genetics Department at the University of North Carolina at Chapel Hill.
  • Topic: Linear Discriminant Analysis (LDA).
  • Goal: Understand what LDA does and how it works through examples.
  • Purpose: To differentiate between two groups using gene expression to decide the best use of a cancer drug.

Motivation

  • Cancer Drug Example:
    • Objective: Identify who the drug helps and who it harms using gene expression.
    • One Gene: Not sufficient to create a clear separation.
    • Two Genes: Plotted on an XY graph, the categories can be separated better, using a line.
    • Three Genes: A plane separates the categories, but the 3D graph is hard to read.
    • Four or More Genes: The data cannot be drawn directly, since each gene needs its own axis.

Principal Component Analysis (PCA) Review

  • Reduces dimensions by finding the directions (combinations of genes) along which the data varies most.
  • Useful for creating simple XY plots from high-dimensional data.
  • However, PCA ignores the category labels, so it doesn't maximize the separation between groups.

Linear Discriminant Analysis (LDA)

  • Similar to PCA: Both reduce dimensions.
  • Key Difference: LDA focuses on maximizing separability among known categories.

Simple Example

  • 2D to 1D Reduction:
    • Bad Option: Ignoring one gene (gene X or gene Y) leads to loss of information.
    • Good Option: LDA creates a new axis maximizing the separation of categories.
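To make "creating a new axis" concrete: projecting the data onto an axis is just a dot product with a unit vector. The sketch below is illustrative only; the sample values and the helper name `project_onto_axis` are made up and do not come from the lecture.

```python
import numpy as np

def project_onto_axis(points, direction):
    """Return each point's 1D coordinate along `direction` (through the origin)."""
    direction = np.asarray(direction, dtype=float)
    unit = direction / np.linalg.norm(direction)   # scale the axis to length 1
    return np.asarray(points, dtype=float) @ unit  # one number per sample

# Hypothetical expression values: columns are gene X and gene Y.
samples = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 7.0]])

only_gene_x = project_onto_axis(samples, [1.0, 0.0])  # "bad option": gene Y is ignored
diagonal = project_onto_axis(samples, [1.0, 1.0])     # a new axis that uses both genes
```

Ignoring a gene is the special case where the axis points along one of the original coordinates; LDA instead searches over all directions for the one that best separates the categories.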

Criteria for New Axis

  1. Maximize Distance Between Means:
    • The Greek letter "μ" denotes the mean of each category (green and red) on the new axis.
  2. Minimize Scatter (Variation):
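    • The scatter within each category is written s².
  Both criteria are optimized together by maximizing one ratio, the squared difference between the means over the sum of the scatters: (μ_green − μ_red)² / (s²_green + s²_red).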
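The ratio described above can be computed directly for data that has already been projected onto a candidate axis. This is a minimal sketch; it assumes "scatter" means the sum of squared deviations from the category mean, and the function name `fisher_ratio` and the numbers are invented for illustration.

```python
import numpy as np

def fisher_ratio(group_a, group_b):
    """Squared distance between the two means divided by the summed scatter."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    mean_gap_sq = (a.mean() - b.mean()) ** 2
    scatter_a = ((a - a.mean()) ** 2).sum()  # assumed definition of s^2
    scatter_b = ((b - b.mean()) ** 2).sum()
    return mean_gap_sq / (scatter_a + scatter_b)

# Same means (2 and 8) in both cases, but different scatter.
tight = fisher_ratio([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])   # 36 / (2 + 2) = 9.0
loose = fisher_ratio([0.0, 2.0, 4.0], [6.0, 8.0, 10.0])  # 36 / (8 + 8) = 2.25
```

With the distance between the means held fixed, the tighter groups score higher, which is why the scatter sits in the denominator.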

Practical Example: Importance of Both Criteria

  • If only the distance between the means is maximized, the categories can still overlap on the new axis because their scatter along it is large.
  • Optimizing distance and scatter together gives better separation.
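A small numeric sketch of this point, under stated assumptions: the data are hand-made, and the second axis uses the standard closed-form solution for two categories (within-category scatter matrix inverted and applied to the difference of means), which the lecture does not spell out.

```python
import numpy as np

def fisher_ratio(a, b):
    """(difference of means)^2 / (sum of scatters) for 1D projected data."""
    gap_sq = (a.mean() - b.mean()) ** 2
    scatter = ((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()
    return gap_sq / scatter

def scatter_matrix(points):
    """Sum of outer products of deviations from the category mean."""
    centered = points - points.mean(axis=0)
    return centered.T @ centered

# Made-up data: both categories are stretched along the diagonal,
# and red is green shifted 3 units along gene X.
green = np.array([[0.0, 0.0], [2.0, 2.5], [4.0, 3.5], [6.0, 6.0]])
red = green + np.array([3.0, 0.0])

mean_gap = red.mean(axis=0) - green.mean(axis=0)

# Axis 1: only maximize the distance between the means.
axis_means = mean_gap / np.linalg.norm(mean_gap)

# Axis 2: also account for scatter (two-category closed form).
within = scatter_matrix(green) + scatter_matrix(red)
axis_lda = np.linalg.solve(within, mean_gap)
axis_lda /= np.linalg.norm(axis_lda)

ratio_means = fisher_ratio(green @ axis_means, red @ axis_means)  # 0.225
ratio_lda = fisher_ratio(green @ axis_lda, red @ axis_lda)        # 9.25

overlap_means = (green @ axis_means).max() > (red @ axis_means).min()  # True
overlap_lda = (green @ axis_lda).max() > (red @ axis_lda).min()        # False
```

On the means-only axis the projected categories overlap; on the axis that also minimizes scatter they do not, even though the projected means end up closer together.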

Multi-Dimensional LDA

  • Same process for more than two genes.
  • Create a new axis that maximizes the distance between the means while minimizing scatter.
  • Three Categories: Two new axes (defining a plane) are needed for separation.
    • Find a central point for all of the data, then measure the distance from each category's center to that point.
    • Maximize those distances while minimizing the scatter within each category; the example shows three clearly separated categories.
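One way to write the three-category criterion, by analogy with the two-category ratio. The symbols are assumptions introduced here: d_k is the distance from the center of category k to the central point of all the data, and s_k² is the scatter of category k.

```latex
\[
\text{maximize} \quad
\frac{d_1^{2} + d_2^{2} + d_3^{2}}{s_1^{2} + s_2^{2} + s_3^{2}}
\]
```

The numerator rewards category centers that sit far from the overall center; the denominator penalizes categories that are spread out.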

Comparison of LDA and PCA

  • Real Data Example: LDA vs PCA applied to 10,000 genes.
    • LDA: Better separates categories.
    • PCA: Finds the directions of greatest variation, which are not necessarily the best for separating categories.
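The comparison can be sketched on a tiny synthetic data set. This is not the lecture's 10,000-gene data: the five "genes", the group names, and all values below are made up, with one noisy gene that carries no category information and one quiet gene that does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, n_genes = 40, 5

# Synthetic stand-in for expression data (invented for illustration).
noise_scale = np.array([10.0, 1.0, 1.0, 1.0, 1.0])  # gene 0 varies a lot
class_shift = np.array([0.0, 3.0, 0.0, 0.0, 0.0])   # only gene 1 differs by group
helped = rng.normal(size=(n_per_class, n_genes)) * noise_scale
harmed = rng.normal(size=(n_per_class, n_genes)) * noise_scale + class_shift
data = np.vstack([helped, harmed])

def fisher_ratio(a, b):
    gap_sq = (a.mean() - b.mean()) ** 2
    scatter = ((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()
    return gap_sq / scatter

def scatter_matrix(points):
    centered = points - points.mean(axis=0)
    return centered.T @ centered

# PCA: PC1 is the top right-singular vector of the centered data (labels unused).
centered = data - data.mean(axis=0)
pc1 = np.linalg.svd(centered, full_matrices=False)[2][0]

# LDA for two categories: standard closed form, which uses the labels.
within = scatter_matrix(helped) + scatter_matrix(harmed)
ld1 = np.linalg.solve(within, harmed.mean(axis=0) - helped.mean(axis=0))
ld1 /= np.linalg.norm(ld1)

ratio_pc1 = fisher_ratio(helped @ pc1, harmed @ pc1)
ratio_ld1 = fisher_ratio(helped @ ld1, harmed @ ld1)
print(f"separation along PC1: {ratio_pc1:.4f}, along LD1: {ratio_ld1:.4f}")
```

PC1 follows the noisy gene because that is where the variation is, while LD1 follows the gene that actually distinguishes the groups. Note that this closed form needs more samples than genes; with thousands of genes the scatter matrix cannot be inverted without extra steps that are outside the scope of these notes.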

Similarities between LDA and PCA

  • Both rank new axes in order of importance:
    • PCA: PC1, PC2, etc., based on data variation.
    • LDA: LD1, LD2, etc., based on category separation.
  • Both methods allow analysis of contributing genes:
    • PCA: Loading scores.
    • LDA: Correlation with new axes.
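A sketch of both kinds of inspection on synthetic data. It assumes that "correlation with new axes" means the Pearson correlation between each gene and the samples' coordinates on LD1; the data and variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): 5 genes, gene 0 is noisy, only gene 1 differs by group.
scale = np.array([10.0, 1.0, 1.0, 1.0, 1.0])
shift = np.array([0.0, 3.0, 0.0, 0.0, 0.0])
helped = rng.normal(size=(40, 5)) * scale
harmed = rng.normal(size=(40, 5)) * scale + shift
data = np.vstack([helped, harmed])

def scatter_matrix(points):
    centered = points - points.mean(axis=0)
    return centered.T @ centered

# PCA: the loading scores are the entries of the PC1 vector.
centered = data - data.mean(axis=0)
pc1_loadings = np.linalg.svd(centered, full_matrices=False)[2][0]

# LDA: project every sample onto LD1, then correlate each gene with those scores.
within = scatter_matrix(helped) + scatter_matrix(harmed)
ld1 = np.linalg.solve(within, harmed.mean(axis=0) - helped.mean(axis=0))
ld1_scores = data @ ld1
gene_ld1_corr = np.array(
    [np.corrcoef(data[:, g], ld1_scores)[0, 1] for g in range(data.shape[1])]
)

top_pca_gene = int(np.argmax(np.abs(pc1_loadings)))   # gene driving PC1
top_lda_gene = int(np.argmax(np.abs(gene_ld1_corr)))  # gene driving LD1
```

Here the gene with the largest PC1 loading is the noisy one, while the gene most correlated with LD1 is the one that differs between the groups.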

Summary

  • LDA vs PCA:
    • PCA: Reduces dimensions by focusing on variation.
    • LDA: Reduces dimensions by maximizing category separation.

Tune in next time for another exciting StatQuest!