
Understanding Decision Trees and Overfitting

Nov 20, 2024

Lecture 4: Overfitting & KNNs

Front Matter

  • Instructors: Matt Gormley & Henry Chai
  • Date: 9/9/24

Announcements

  • HW2 released 9/4, due 9/16 at 11:59 PM
  • Written portion: 1 graded submission
  • Programming portion: 10 submissions to autograder

Decision Tree Learning

Pseudocode Recap

  • Function train: calls tree_recurse on the full training dataset
  • Function tree_recurse:
    • Base case: if a stopping condition holds (no data, all labels identical, all feature vectors identical, etc.), create a leaf with the majority-vote label
    • Recursive case: choose the best attribute to split on, partition the data, and recurse to create child nodes (see the sketch after this list)
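
A minimal Python sketch of this recursive procedure, assuming a list-of-dicts data representation and illustrative names (train, tree_recurse, majority_vote); it is not the lecture's reference implementation. The fallback label covers the empty-node edge case discussed next, and the majority stored at each internal node is reused by the pruning sketch later.

```python
from collections import Counter
from math import log2

def majority_vote(labels, fallback=None):
    """Most common label; fall back (e.g., to the whole-dataset majority) if empty."""
    return Counter(labels).most_common(1)[0][0] if labels else fallback

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(data, labels, attr):
    """Mutual information between attribute `attr` and the labels."""
    total = len(labels)
    cond = 0.0
    for value in set(row[attr] for row in data):
        sub = [y for row, y in zip(data, labels) if row[attr] == value]
        cond += len(sub) / total * entropy(sub)
    return entropy(labels) - cond

def tree_recurse(data, labels, attrs, fallback=None):
    """Grow a decision tree as a nested dict.
    Base cases: empty data, pure labels, or no attributes left -> majority-vote leaf.
    Recursive case: split on the attribute with the highest information gain."""
    if not labels:
        return {"leaf": fallback}                      # empty node: use fallback label
    if len(set(labels)) == 1 or not attrs:
        return {"leaf": majority_vote(labels)}
    best = max(attrs, key=lambda a: information_gain(data, labels, a))
    node = {"split_on": best, "majority": majority_vote(labels), "children": {}}
    for value in set(row[best] for row in data):
        pairs = [(row, y) for row, y in zip(data, labels) if row[best] == value]
        node["children"][value] = tree_recurse(
            [r for r, _ in pairs], [y for _, y in pairs],
            [a for a in attrs if a != best],
            fallback=node["majority"])                 # parent majority covers empty children
    return node

def train(data, labels, attrs):
    """Entry point: recurse with the whole dataset's majority vote as the fallback."""
    return tree_recurse(data, labels, attrs, fallback=majority_vote(labels))
```

For example, train([{"outlook": "sunny"}, {"outlook": "rainy"}], ["+", "-"], ["outlook"]) returns a one-split nested dict.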

Handling Edge Cases

  • tree_recurse can be called with an empty subset of the data (no training points reach that branch)
  • Handling leaf nodes with no training data:
    • Options: assign a random label, or use the majority vote over the entire training set (or over the parent node's data)

Splitting Criteria

  • Key splitting criteria: Gini impurity and information gain (compared in the sketch below)
  • An empirical study by Buntine & Niblett (1992) found only minor differences between splitting criteria in practice
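
A small sketch comparing the two criteria on a made-up binary split; the counts are purely illustrative.

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def split_score(parent, children, impurity):
    """Impurity reduction of a candidate split (equals information gain when impurity=entropy)."""
    n = len(parent)
    return impurity(parent) - sum(len(ch) / n * impurity(ch) for ch in children)

# Toy example: 10 labels in the parent node, one candidate split into two children.
parent   = ["+"] * 6 + ["-"] * 4
children = [["+"] * 5 + ["-"] * 1, ["+"] * 1 + ["-"] * 3]

print("Gini reduction:", split_score(parent, children, gini_impurity))
print("Info gain     :", split_score(parent, children, entropy))
```

Both criteria reward the same kind of split (purer children), which is consistent with the finding that they rarely disagree much in practice.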

Inductive Bias and Decision Trees

  • Prefers the smallest tree that places high-mutual-information features near the root
  • Follows Occam's Razor: prefer the simplest hypothesis (tree) that explains the data

Pros & Cons of Decision Trees

  • Pros:

    • Interpretable
    • Efficient (in terms of computation and storage)
    • Versatile for classification and regression
    • Handles categorical and real-valued features
  • Cons:

    • Greedy learning approach
    • Possible overfitting
    • Not guaranteed to find the simplest tree with zero training error

Overfitting

Concepts

  • Overfitting:
    • Model too complex, fitting noise rather than pattern
    • Lacks inductive bias for generalization
  • Underfitting:
    • Model too simple, missing the actual pattern
    • Excessive inductive bias

Error Rates

  • Distinguish the training error rate, test error rate, and true error rate (see the error-rate sketch below)
  • Overfitting is indicated when the test (or true) error is substantially higher than the training error; the gap grows as the model fits noise
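
A tiny sketch of the error-rate computation; the labels below are made up to illustrate a large train/test gap.

```python
def error_rate(y_true, y_pred):
    """Fraction of examples where the prediction disagrees with the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative numbers only: a model that memorized the training set might look like this.
train_err = error_rate(["+", "-", "+", "+"], ["+", "-", "+", "+"])   # 0.00
test_err  = error_rate(["+", "-", "-", "+"], ["+", "+", "+", "+"])   # 0.50
print(f"train error = {train_err:.2f}, test error = {test_err:.2f}")
```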

Combatting Overfitting

  • Heuristics: do not split
    • beyond a maximum depth
    • when a node contains fewer than some minimum number of training points
    • when the best split's information gain falls below a threshold
  • Pruning (see the sketch after this list):
    • Use a held-out validation dataset to evaluate splits
    • Remove splits whose removal does not increase (or even decreases) validation error
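
A hedged sketch of reduced-error pruning on the nested-dict trees from the earlier decision-tree sketch; the predict helper and the dict layout are assumptions carried over from that sketch, not the lecture's exact procedure.

```python
def predict(tree, x):
    """Route example x (a dict of attribute values) down a nested-dict tree."""
    while "leaf" not in tree:
        value = x[tree["split_on"]]
        if value not in tree["children"]:
            return tree["majority"]            # unseen attribute value: node's majority vote
        tree = tree["children"][value]
    return tree["leaf"]

def error_rate(tree, data, labels):
    """Fraction of examples the tree misclassifies."""
    return sum(predict(tree, x) != y for x, y in zip(data, labels)) / len(labels)

def prune(tree, val_data, val_labels, root=None):
    """Reduced-error pruning: bottom-up, collapse a subtree into a majority-vote leaf
    whenever doing so does not increase the error on the validation set."""
    root = tree if root is None else root
    if "leaf" in tree:
        return
    for child in tree["children"].values():
        prune(child, val_data, val_labels, root)
    before = error_rate(root, val_data, val_labels)
    saved = dict(tree)                          # shallow copy so the subtree can be restored
    tree.clear()
    tree["leaf"] = saved["majority"]            # tentatively replace the subtree with a leaf
    if error_rate(root, val_data, val_labels) > before:
        tree.clear()
        tree.update(saved)                      # pruning hurt validation error: undo it
```

Called as prune(tree, val_data, val_labels) on a tree produced by the earlier train sketch, where val_data/val_labels are held out from training.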

Nearest Neighbor (KNN)

The Duck Test

  • Classify by most similar training point
  • Use distance metrics (e.g., Euclidean, Manhattan distances)

Nearest Neighbor Model

  • No explicit training: the model simply stores the training data (see the sketch below)
  • Zero training error, since each training point is its own nearest neighbor (assuming no duplicate points with conflicting labels)
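
A minimal sketch of a 1-nearest-neighbor classifier with the two distance metrics mentioned above; all names and data here are illustrative.

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def nearest_neighbor_predict(x, train_X, train_y, distance=euclidean):
    """'Training' just stores the data; prediction copies the label of the closest point."""
    i = min(range(len(train_X)), key=lambda j: distance(x, train_X[j]))
    return train_y[i]

# Querying a training point returns its own label, which is why training error is zero.
train_X = [(1.0, 2.0), (3.0, 3.5), (0.5, 0.2)]
train_y = ["A", "B", "A"]
print(nearest_neighbor_predict((1.0, 2.0), train_X, train_y))                         # -> "A"
print(nearest_neighbor_predict((2.8, 3.0), train_X, train_y, distance=manhattan))     # -> "B"
```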

Summary of Objectives

  • Implement and use decision tree criteria
  • Understand model generalization and memorization
  • Identify inductive bias
  • Recognize overfitting/underfitting
  • Use pruning to combat overfitting

Real-Valued Features: Fisher Iris Dataset

  • Four real-valued features: sepal length/width and petal length/width (in cm)
  • A standard benchmark for classification models; used here to illustrate splits on real-valued features (see the sketch below)
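
A hedged example of inspecting one real-valued feature of the Iris dataset, assuming scikit-learn is installed; the 2.5 cm threshold is illustrative, not a value given in the lecture.

```python
from collections import Counter
from sklearn.datasets import load_iris   # assumes scikit-learn is available

iris = load_iris()
X, y = iris.data, iris.target            # features: sepal/petal length & width (cm)

# Real-valued features are handled by thresholding: each candidate split asks
# "is feature f < t?"; here we just inspect one illustrative threshold.
feature, threshold = 2, 2.5              # feature 2 = petal length (cm)
left  = Counter(iris.target_names[y[i]] for i in range(len(y)) if X[i, feature] <  threshold)
right = Counter(iris.target_names[y[i]] for i in range(len(y)) if X[i, feature] >= threshold)
print("petal length <  %.1f cm:" % threshold, dict(left))
print("petal length >= %.1f cm:" % threshold, dict(right))
```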

The lecture provided an in-depth look at decision trees, their advantages and limitations, and introduced strategies to combat overfitting. It also introduced the K-Nearest Neighbor algorithm with a focus on understanding distance metrics and model characteristics.