
Understanding Decision Trees and Overfitting

Nov 20, 2024

Lecture 4: Overfitting & KNNs

Front Matter

  • Instructors: Matt Gormley & Henry Chai
  • Date: 9/9/24

Announcements

  • HW2 released 9/4, due 9/16 at 11:59 PM
  • Written portion: 1 graded submission
  • Programming portion: 10 submissions to autograder

Decision Tree Learning

Pseudocode Recap

  • Function train: calls tree_recurse on the full training dataset
  • Function tree_recurse:
    • Base case: if a stopping condition holds (no data, all labels identical, all feature vectors identical, etc.), create a leaf with the majority-vote label
    • Recursive case: choose the best attribute to split on, partition the data, and recurse to create child nodes (see the sketch after this list)
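
A minimal Python sketch of this recursive procedure, assuming a list-of-dicts data representation and illustrative names (train, tree_recurse, majority_vote); it is not the lecture's reference implementation. The fallback label covers the empty-node edge case discussed next, and the majority stored at each internal node is reused by the pruning sketch later.

```python
from collections import Counter
from math import log2

def majority_vote(labels, fallback=None):
    """Most common label; fall back (e.g., to the whole-dataset majority) if empty."""
    return Counter(labels).most_common(1)[0][0] if labels else fallback

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(data, labels, attr):
    """Mutual information between attribute `attr` and the labels."""
    total = len(labels)
    cond = 0.0
    for value in set(row[attr] for row in data):
        sub = [y for row, y in zip(data, labels) if row[attr] == value]
        cond += len(sub) / total * entropy(sub)
    return entropy(labels) - cond

def tree_recurse(data, labels, attrs, fallback=None):
    """Grow a decision tree as a nested dict.
    Base cases: empty data, pure labels, or no attributes left -> majority-vote leaf.
    Recursive case: split on the attribute with the highest information gain."""
    if not labels:
        return {"leaf": fallback}                      # empty node: use fallback label
    if len(set(labels)) == 1 or not attrs:
        return {"leaf": majority_vote(labels)}
    best = max(attrs, key=lambda a: information_gain(data, labels, a))
    node = {"split_on": best, "majority": majority_vote(labels), "children": {}}
    for value in set(row[best] for row in data):
        pairs = [(row, y) for row, y in zip(data, labels) if row[best] == value]
        node["children"][value] = tree_recurse(
            [r for r, _ in pairs], [y for _, y in pairs],
            [a for a in attrs if a != best],
            fallback=node["majority"])                 # parent majority covers empty children
    return node

def train(data, labels, attrs):
    """Entry point: recurse with the whole dataset's majority vote as the fallback."""
    return tree_recurse(data, labels, attrs, fallback=majority_vote(labels))
```

For example, train([{"outlook": "sunny"}, {"outlook": "rainy"}], ["+", "-"], ["outlook"]) returns a one-split nested dict.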

Handling Edge Cases

  • tree_recurse can be called with an empty subset of the data (no training points reach that branch)
  • Handling leaf nodes with no training data:
    • Options: assign a random label, or use the majority vote over the entire training set (or over the parent node's data)

Splitting Criteria

  • Key splitting criteria: Gini impurity and information gain (compared in the sketch below)
  • An empirical study by Buntine & Niblett (1992) found only minor differences between splitting criteria in practice
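
A small sketch comparing the two criteria on a made-up binary split; the counts are purely illustrative.

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def split_score(parent, children, impurity):
    """Impurity reduction of a candidate split (equals information gain when impurity=entropy)."""
    n = len(parent)
    return impurity(parent) - sum(len(ch) / n * impurity(ch) for ch in children)

# Toy example: 10 labels in the parent node, one candidate split into two children.
parent   = ["+"] * 6 + ["-"] * 4
children = [["+"] * 5 + ["-"] * 1, ["+"] * 1 + ["-"] * 3]

print("Gini reduction:", split_score(parent, children, gini_impurity))
print("Info gain     :", split_score(parent, children, entropy))
```

Both criteria reward the same kind of split (purer children), which is consistent with the finding that they rarely disagree much in practice.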

Inductive Bias and Decision Trees

  • Prefers the smallest tree that places high-mutual-information features near the root
  • Follows Occam's Razor: prefer the simplest hypothesis (tree) that explains the data

Pros & Cons of Decision Trees

  • Pros:

    • Interpretable
    • Efficient (in terms of computation and storage)
    • Versatile for classification and regression
    • Handles categorical and real-valued features
  • Cons:

    • Greedy learning approach
    • Possible overfitting
    • Not guaranteed to find the simplest tree with zero training error

Overfitting

Concepts

  • Overfitting:
    • Model too complex, fitting noise rather than pattern
    • Lacks inductive bias for generalization
  • Underfitting:
    • Model too simple, missing the actual pattern
    • Excessive inductive bias

Error Rates

  • Distinguish the training error rate, test error rate, and true error rate (see the error-rate sketch below)
  • Overfitting is indicated when the test (or true) error is substantially higher than the training error; the gap grows as the model fits noise
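
A tiny sketch of the error-rate computation; the labels below are made up to illustrate a large train/test gap.

```python
def error_rate(y_true, y_pred):
    """Fraction of examples where the prediction disagrees with the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative numbers only: a model that memorized the training set might look like this.
train_err = error_rate(["+", "-", "+", "+"], ["+", "-", "+", "+"])   # 0.00
test_err  = error_rate(["+", "-", "-", "+"], ["+", "+", "+", "+"])   # 0.50
print(f"train error = {train_err:.2f}, test error = {test_err:.2f}")
```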

Combatting Overfitting

  • Heuristics: do not split
    • beyond a maximum depth
    • when a node contains fewer than some minimum number of training points
    • when the best split's information gain falls below a threshold
  • Pruning (see the sketch after this list):
    • Use a held-out validation dataset to evaluate splits
    • Remove splits whose removal does not increase (or even decreases) validation error
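
A hedged sketch of reduced-error pruning on the nested-dict trees from the earlier decision-tree sketch; the predict helper and the dict layout are assumptions carried over from that sketch, not the lecture's exact procedure.

```python
def predict(tree, x):
    """Route example x (a dict of attribute values) down a nested-dict tree."""
    while "leaf" not in tree:
        value = x[tree["split_on"]]
        if value not in tree["children"]:
            return tree["majority"]            # unseen attribute value: node's majority vote
        tree = tree["children"][value]
    return tree["leaf"]

def error_rate(tree, data, labels):
    """Fraction of examples the tree misclassifies."""
    return sum(predict(tree, x) != y for x, y in zip(data, labels)) / len(labels)

def prune(tree, val_data, val_labels, root=None):
    """Reduced-error pruning: bottom-up, collapse a subtree into a majority-vote leaf
    whenever doing so does not increase the error on the validation set."""
    root = tree if root is None else root
    if "leaf" in tree:
        return
    for child in tree["children"].values():
        prune(child, val_data, val_labels, root)
    before = error_rate(root, val_data, val_labels)
    saved = dict(tree)                          # shallow copy so the subtree can be restored
    tree.clear()
    tree["leaf"] = saved["majority"]            # tentatively replace the subtree with a leaf
    if error_rate(root, val_data, val_labels) > before:
        tree.clear()
        tree.update(saved)                      # pruning hurt validation error: undo it
```

Called as prune(tree, val_data, val_labels) on a tree produced by the earlier train sketch, where val_data/val_labels are held out from training.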

Nearest Neighbor (KNN)

The Duck Test

  • Classify by most similar training point
  • Use distance metrics (e.g., Euclidean, Manhattan distances)

Nearest Neighbor Model

  • No explicit training: the model simply stores the training data (see the sketch below)
  • Zero training error, since each training point is its own nearest neighbor (assuming no duplicate points with conflicting labels)
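
A minimal sketch of a 1-nearest-neighbor classifier with the two distance metrics mentioned above; all names and data here are illustrative.

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def nearest_neighbor_predict(x, train_X, train_y, distance=euclidean):
    """'Training' just stores the data; prediction copies the label of the closest point."""
    i = min(range(len(train_X)), key=lambda j: distance(x, train_X[j]))
    return train_y[i]

# Querying a training point returns its own label, which is why training error is zero.
train_X = [(1.0, 2.0), (3.0, 3.5), (0.5, 0.2)]
train_y = ["A", "B", "A"]
print(nearest_neighbor_predict((1.0, 2.0), train_X, train_y))                         # -> "A"
print(nearest_neighbor_predict((2.8, 3.0), train_X, train_y, distance=manhattan))     # -> "B"
```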

Summary of Objectives

  • Implement and use decision tree criteria
  • Understand model generalization and memorization
  • Identify inductive bias
  • Recognize overfitting/underfitting
  • Use pruning to combat overfitting

Real-Valued Features: Fisher Iris Dataset

  • Four real-valued features: sepal length/width and petal length/width (in cm)
  • A standard benchmark for classification models; used here to illustrate splits on real-valued features (see the sketch below)
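
A hedged example of inspecting one real-valued feature of the Iris dataset, assuming scikit-learn is installed; the 2.5 cm threshold is illustrative, not a value given in the lecture.

```python
from collections import Counter
from sklearn.datasets import load_iris   # assumes scikit-learn is available

iris = load_iris()
X, y = iris.data, iris.target            # features: sepal/petal length & width (cm)

# Real-valued features are handled by thresholding: each candidate split asks
# "is feature f < t?"; here we just inspect one illustrative threshold.
feature, threshold = 2, 2.5              # feature 2 = petal length (cm)
left  = Counter(iris.target_names[y[i]] for i in range(len(y)) if X[i, feature] <  threshold)
right = Counter(iris.target_names[y[i]] for i in range(len(y)) if X[i, feature] >= threshold)
print("petal length <  %.1f cm:" % threshold, dict(left))
print("petal length >= %.1f cm:" % threshold, dict(right))
```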

The lecture provided an in-depth look at decision trees, their advantages and limitations, and introduced strategies to combat overfitting. It also introduced the K-Nearest Neighbor algorithm with a focus on understanding distance metrics and model characteristics.