Lecture 4: Overfitting & KNNs
Front Matter
- Instructors: Matt Gormley & Henry Chai
- Date: 9/9/24
Announcements
- HW2 released 9/4, due 9/16 at 11:59 PM
- Written portion: 1 graded submission
- Programming portion: 10 submissions to autograder
Decision Tree Learning
Pseudocode Recap
- Function train: calls tree_recurse
- Function tree_recurse:
- Base case: if some condition (data empty, identical labels/features, etc.), assign majority vote label
- Recursive case: Find best attribute to split, create child nodes
Handling Edge Cases
- Possibility of calling tree_recurse with empty data
- Handling leaf nodes with no training data:
- Options: assign a random label or take a majority vote over the entire training dataset (see the sketch below)
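The following is a minimal Python sketch of the recap above, including the empty-data edge case. The dict-based node representation and the helper names (majority_vote, info_gain) are illustrative assumptions, not the course's reference implementation.

```python
import math
from collections import Counter


def majority_vote(labels, fallback=None):
    """Most common label; fall back when there are no labels at all."""
    if not labels:
        return fallback
    return Counter(labels).most_common(1)[0][0]


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(data, attr):
    """Mutual information between the split attribute and the label."""
    labels = [y for _, y in data]
    remainder = 0.0
    for v in set(x[attr] for x, _ in data):
        subset = [y for x, y in data if x[attr] == v]
        remainder += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - remainder


def train(data, attributes):
    """data: list of (feature_dict, label) pairs."""
    fallback = majority_vote([y for _, y in data])
    return tree_recurse(data, attributes, fallback)


def tree_recurse(data, attributes, fallback):
    labels = [y for _, y in data]

    # Base case: empty data. Label the leaf with the majority vote over the
    # entire training dataset (a random label would be the other option).
    if not data:
        return {"label": fallback}

    # Base cases: identical labels, or no attributes left to split on.
    if len(set(labels)) == 1 or not attributes:
        return {"label": majority_vote(labels)}

    # Recursive case: split on the attribute with the highest information gain,
    # then recurse on each child.
    best = max(attributes, key=lambda a: info_gain(data, a))
    remaining = [a for a in attributes if a != best]
    children = {}
    for v in set(x[best] for x, _ in data):
        subset = [(x, y) for x, y in data if x[best] == v]
        children[v] = tree_recurse(subset, remaining, fallback)

    # The node also stores its own majority label, which is convenient for
    # unseen attribute values and for the pruning sketch further below.
    return {"split_on": best, "children": children,
            "majority": majority_vote(labels)}
```

The (feature_dict, label) pair representation is just one convenient choice; the same recursion works for any row-based representation of the training data.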
Splitting Criteria
- Key Splitting Criteria: Gini impurity and Information gain
- A study by Buntine & Niblett (1992) shows only minor differences between the criteria in practice
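For reference, here is a hedged sketch of the two criteria in Python; the helper names and the toy example are illustrative, not from the lecture.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k log2 p_k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_score(parent_labels, child_label_lists, impurity):
    """Reduction in impurity from a split: with `entropy` this is
    information gain, with `gini_impurity` it is the Gini gain."""
    n = len(parent_labels)
    weighted = sum(len(c) / n * impurity(c) for c in child_label_lists)
    return impurity(parent_labels) - weighted


# Toy example: both criteria agree that this split is as good as it gets.
parent = ["+", "+", "+", "-", "-", "-"]
split = [["+", "+", "+"], ["-", "-", "-"]]
print(split_score(parent, split, entropy))        # 1.0 (information gain)
print(split_score(parent, split, gini_impurity))  # 0.5 (Gini gain)
```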
Inductive Bias and Decision Trees
- Prefers the smallest tree that fits the data, splitting on high mutual information features near the top
- Follows Occam's Razor: simplest tree that explains the dataset
Pros & Cons of Decision Trees
- Pros:
- Interpretable
- Efficient (in terms of computation and storage)
- Versatile for classification and regression
- Handles categorical and real-valued features
- Cons:
- Greedy learning approach
- Possible overfitting
- Not guaranteed to find the simplest tree with zero training error
Overfitting
Concepts
- Overfitting:
- Model too complex, fitting noise rather than pattern
- Lacks inductive bias for generalization
- Underfitting:
- Model too simple, missing the actual pattern
- Excessive inductive bias
Error Rates
- Training error rate vs Test error rate vs True error rate
- Overfitting shows up as a large gap between test (or true) error and training error, i.e., test error ≫ training error
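A small sketch of measuring these error rates in Python; the single-example predict interface and the commented usage are hypothetical.

```python
def error_rate(predict, X, y):
    """Fraction of labeled examples a classifier gets wrong."""
    wrong = sum(1 for x_i, y_i in zip(X, y) if predict(x_i) != y_i)
    return wrong / len(y)


# Hypothetical usage, assuming some model with a predict_one(x) method:
#   train_err = error_rate(model.predict_one, X_train, y_train)
#   test_err  = error_rate(model.predict_one, X_test, y_test)
# A large gap (test_err >> train_err) suggests overfitting;
# high error on both suggests underfitting.
```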
Combatting Overfitting
- Heuristics: Do not split:
- Beyond a certain depth
- When a node has too few data points
- When information gain is minimal
- Pruning:
- Use validation dataset to evaluate splits
- Remove splits whose removal does not hurt validation error (see the sketch below)
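A hedged sketch of one common pruning variant (reduced-error pruning) on the dict-based tree from the earlier sketch; it assumes each internal node stores a "majority" label, as in that sketch, and is not the course's reference implementation.

```python
def predict(node, x):
    """Walk a dict-based tree: leaves store 'label', internal nodes a split."""
    if "label" in node:
        return node["label"]
    child = node["children"].get(x[node["split_on"]])
    return predict(child, x) if child is not None else node["majority"]


def error_rate(tree, data):
    """Fraction of (features, label) pairs the tree misclassifies."""
    return sum(1 for x, y in data if predict(tree, x) != y) / len(data)


def prune(node, root, val_data):
    """Bottom-up reduced-error pruning: collapse an internal node into a
    leaf labeled with its stored majority vote whenever doing so does not
    increase error on the validation set."""
    if "label" in node:
        return
    for child in node["children"].values():
        prune(child, root, val_data)
    error_before = error_rate(root, val_data)
    backup = dict(node)
    node.clear()
    node["label"] = backup["majority"]   # tentatively replace the split with a leaf
    if error_rate(root, val_data) > error_before:
        node.clear()
        node.update(backup)              # pruning hurt validation error: revert
```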
Nearest Neighbor (KNN)
The Duck Test
- Classify by most similar training point
- Use distance metrics (e.g., Euclidean, Manhattan distances)
Nearest Neighbor Model
- No training required
- Zero training error, since each training point is its own nearest neighbor (barring duplicate points with conflicting labels)
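A minimal nearest-neighbor sketch with both distance metrics; the function names and the toy data are illustrative assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def nearest_neighbor_predict(train_data, x, distance=euclidean):
    """'Training' is just storing the data; prediction scans all stored
    points and returns the label of the closest one."""
    _, label = min(train_data, key=lambda pair: distance(pair[0], x))
    return label


# On the training set itself, each point is its own nearest neighbor
# (distance 0), so 1-NN attains zero training error (barring duplicate
# points with conflicting labels).
train_data = [((1.0, 2.0), "A"), ((3.0, 1.0), "B"), ((0.5, 0.5), "A")]
print(nearest_neighbor_predict(train_data, (2.8, 1.1)))             # "B"
print(all(nearest_neighbor_predict(train_data, x) == y
          for x, y in train_data))                                   # True
```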
Summary of Objectives
- Implement and use decision tree criteria
- Understand model generalization and memorization
- Identify inductive bias
- Recognize overfitting/underfitting
- Use pruning to combat overfitting
Real-Valued Features: Fisher Iris Dataset
- Overview of features (sepal/petal length/width)
- Used for testing classification models
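A hedged sketch of testing a classifier on the Iris data: loading it via scikit-learn and fitting its KNeighborsClassifier are assumptions for illustration, not the course's starter code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 4 real-valued features per example: sepal/petal length and width.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)   # 1-NN, Euclidean distance by default
knn.fit(X_train, y_train)
print("train error:", 1 - knn.score(X_train, y_train))  # expected 0.0 for 1-NN
print("test error:", 1 - knn.score(X_test, y_test))
```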
The lecture provided an in-depth look at decision trees, their advantages and limitations, and strategies to combat overfitting. It also introduced the nearest neighbor algorithm, with a focus on distance metrics and model characteristics.