Decision Trees and K-Nearest Neighbors (KNN)

Jul 10, 2024

Lecture Notes - Decision Trees and K-Nearest Neighbors (KNN)

Introduction

  • Lecture Focus: Decision trees and K-Nearest Neighbors (KNN) for classification and regression.
  • Review of previous topics: Logistic regression, confusion matrix, classification metrics (recall, precision, specificity, accuracy, F1 measure), and ROC curve.

Decision Trees

Performance Measures & Classification Metrics

  • Confusion Matrix: A table delineating true positives, true negatives, false positives, and false negatives.
  • Recall: True positives / actual positives.
  • Precision: True positives / predicted positives.
  • Specificity: True negatives / actual negatives.
  • Accuracy: (True positives + true negatives) / total number of samples.
  • F1 Score: Harmonic mean of precision and recall.
  • ROC Curve & AUC: Plot of the true positive rate against the false positive rate across classification thresholds; a larger area under the curve indicates a better model (a sketch computing the count-based metrics above follows this list).
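
A minimal sketch computing the count-based metrics above directly from confusion-matrix counts; the counts themselves are made-up illustration values, not data from the lecture.

```python
# Sketch only: compute the metrics from raw confusion-matrix counts.
# These counts are made-up illustration values.
tp, tn, fp, fn = 40, 45, 5, 10

recall = tp / (tp + fn)                    # true positives / actual positives
precision = tp / (tp + fp)                 # true positives / predicted positives
specificity = tn / (tn + fp)               # true negatives / actual negatives
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"recall={recall:.2f} precision={precision:.2f} "
      f"specificity={specificity:.2f} accuracy={accuracy:.2f} f1={f1:.2f}")
```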

Bias-Variance Tradeoff

  • Bias: High bias indicates the model is too simple and underfits the training data.
  • Variance: High variance indicates the model is too complex and overfits the training data.
  • Training vs Testing Errors: Should be similar; a testing error much higher than the training error indicates overfitting, while both errors being high indicates underfitting.
  • k-Fold Cross Validation: Splits the data into k folds; each fold serves once as the validation set while the remaining folds are used for training, giving k estimates of model performance (see the sketch after this list).
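
A minimal sketch of k-fold cross-validation, assuming scikit-learn is available; the iris dataset and the tree settings are illustrative choices, not part of the lecture.

```python
# Sketch: 5-fold cross-validation with scikit-learn (assumed available).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Each of the 5 folds is held out once for validation while the other
# four are used for training, giving five accuracy estimates.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```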

Construction & Intuition

  • Tree Construction: Recursive partitioning of data space into regions based on feature splits.
  • Homogeneity: Goal is to create leaves that are as homogeneous as possible.
  • Decision Node Splitting: Splits are chosen on feature-value criteria that improve classification purity (see the sketch after this list).
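
A minimal sketch of recursive partitioning in practice, assuming scikit-learn is available: fit a shallow tree and print the learned feature-threshold splits. The dataset and depth limit are illustrative.

```python
# Sketch: fit a shallow tree and inspect the recursive feature-threshold
# splits it learned (scikit-learn assumed; iris is an illustrative dataset).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each internal node splits on one feature and threshold; each leaf is a
# region of the feature space labeled with its majority class.
print(export_text(tree, feature_names=list(iris.feature_names)))
```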

Measures of Purity

  • Information Gain: Difference in impurity before and after a split (Entropy-based measure).
  • Gini Index: Measures impurity as the probability that a randomly chosen sample from the node would be misclassified if it were labeled according to the node's class distribution.
  • Classification Error: Proportion of misclassified instances within a node.
  • Variance: Used for regression trees to determine node impurity.
  • ID3, C4.5, CART: Different algorithms using different split criteria (entropy/information gain for ID3, gain ratio for C4.5, Gini index for CART); the purity measures above are sketched in code after this list.
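
A minimal sketch of the purity measures above, assuming NumPy; the function names and the example split are illustrative, not taken from a specific algorithm's implementation.

```python
# Sketch: purity measures on class-probability vectors (NumPy assumed).
import numpy as np

def entropy(p):
    """Entropy of a class distribution p; zero-probability terms are dropped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini index: chance a random sample is misclassified when labeled
    by drawing from the node's class distribution."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def classification_error(p):
    """Misclassification error: 1 minus the majority-class proportion."""
    return 1.0 - np.max(np.asarray(p, dtype=float))

def information_gain(parent_p, children, impurity=entropy):
    """Parent impurity minus the weighted impurity of the children.
    `children` is a list of (weight, class_probabilities) pairs, with
    weights equal to the fraction of parent samples sent to each child."""
    return impurity(parent_p) - sum(w * impurity(p) for w, p in children)

# Illustrative split: a 50/50 parent divided into two purer children.
parent = [0.5, 0.5]
children = [(0.5, [0.8, 0.2]), (0.5, [0.2, 0.8])]
print(information_gain(parent, children))          # entropy-based gain
print(information_gain(parent, children, gini))    # Gini-based gain
print(classification_error([0.8, 0.2]))            # node error = 0.2
```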

Overfitting and Pruning

  • Pre-pruning: Setting criteria for early stopping (e.g., minimum samples per node, maximum depth).
  • Post-pruning: Pruning nodes after the tree is fully grown to reduce overfitting.
  • Pruning Criteria: Based on misclassification error and performance on a validation set (see the sketch after this list).
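
A minimal sketch contrasting the two pruning styles, assuming scikit-learn: pre-pruning via stopping criteria (max_depth, min_samples_leaf) and post-pruning via cost-complexity pruning (ccp_alpha). The dataset and parameter values are illustrative.

```python
# Sketch: pre-pruning (stopping criteria) vs post-pruning (cost-complexity
# pruning via ccp_alpha) with scikit-learn; data and values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing the tree early.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow fully, then prune back; larger ccp_alpha prunes more.
# In practice ccp_alpha is chosen by performance on a validation set.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post.fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))
```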

Advantages and Disadvantages

  • Pros: Fast, explainable, robust to missing values, handles non-linear data well.
  • Cons: Susceptible to overfitting; greedy, one-attribute-at-a-time splitting can miss interactions where no single attribute is informative on its own (e.g., XOR) and may settle for locally optimal trees.

K-Nearest Neighbors (KNN)

Concept and Intuition

  • Idea: Classify based on 'nearness' to K closest points (neighbors) in the dataset.
  • Classification: Assign class based on the majority class among neighbors.
  • Regression: Predict values based on the mean (or median) of the neighbors' values (see the sketch after this list).
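
A minimal from-scratch sketch of the KNN idea, assuming NumPy; the toy points, the helper name knn_predict, and the choice k = 3 are illustrative.

```python
# Sketch: the core KNN idea for one query point (NumPy assumed; toy data).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, regression=False):
    """Predict for a single query x from its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    labels = y_train[nearest]
    if regression:
        return labels.mean()                      # mean of neighbor values
    return Counter(labels).most_common(1)[0][0]   # majority class

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))   # -> class 0
```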

Characteristics

  • Lazy Learning: No model is built; stores training data and computes distances on the fly.
  • Instance-based Learning: Predictions are made by comparing the query directly against stored training instances.
  • Non-parametric: Does not assume a fixed form for the model or function.

Distance Metrics and Normalization

  • Distance Calculation: Commonly uses Euclidean distance; considers numeric proximity in multidimensional space.
  • Normalization: Important to bring features to the same scale to avoid dominance by variables with larger value ranges.
  • Z-score Standardization: Transforms each feature to mean 0 and standard deviation 1 (see the sketch after this list).
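
A minimal sketch of z-score standardization before computing Euclidean distances, assuming NumPy; the feature values are illustrative, with one feature on a much larger scale than the other.

```python
# Sketch: z-score standardization before distance computation (NumPy assumed).
import numpy as np

# Illustrative data: the first feature's range dwarfs the second's.
X_train = np.array([[150.0, 1.0], [160.0, 3.0], [170.0, 5.0]])
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

# Fit mean/std on the training data, then apply the same transform to queries.
X_scaled = (X_train - mu) / sigma                 # each feature: mean 0, std 1
x_query = (np.array([165.0, 2.0]) - mu) / sigma

# After scaling, both features contribute comparably to the distance.
print(np.linalg.norm(X_scaled - x_query, axis=1))
```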

Handling K and Distance Weighting

  • Choosing K: Higher K smooths out decision boundaries; smaller K can lead to overfitting.
  • Weighted KNN: Closer neighbors have more influence; weights are inversely proportional to the square of the distance (see the sketch after this list).
  • Edited KNN: Removes training points whose class disagrees with that of their neighbors (class outliers) to improve model performance.
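
A minimal sketch of distance-weighted voting with weights proportional to 1/distance², assuming NumPy; the helper name weighted_knn_classify and the toy data are illustrative.

```python
# Sketch: distance-weighted voting with weights proportional to 1/d**2
# (NumPy assumed; helper name and toy data are illustrative).
import numpy as np

def weighted_knn_classify(X_train, y_train, x, k=3, eps=1e-12):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] ** 2 + eps)   # closer neighbors count more
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)              # class with the largest total weight

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0]])
y_train = np.array([0, 0, 1])
print(weighted_knn_classify(X_train, y_train, np.array([0.5, 0.4])))  # -> 0
```

scikit-learn's KNeighborsClassifier(weights='distance') implements a similar idea, though it weights neighbors by the inverse of the distance rather than by its square.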

Advantages and Challenges

  • Advantages: Simple, effective for small datasets, robust to noisy data with proper K selection.
  • Challenges: Computationally intensive with large datasets, sensitive to irrelevant or redundant features, requires tuning of K.

Conclusion

  • Decision Trees: Crucial concepts include information gain, Gini index, and pruning methods.
  • KNN: Emphasizes normalization, appropriate K selection, and potential for weighted approaches for improved performance.
  • Practical application involves combining methods and leveraging domain knowledge to improve model accuracy.