Jul 10, 2024
Lecture Notes - Decision Trees and K-Nearest Neighbors (KNN)
Introduction
Lecture Focus
: Decision trees and K-Nearest Neighbors (KNN) for classification and regression.
Review of previous topics: Logistic regression, confusion matrix, classification metrics (recall, precision, specificity, accuracy, F1 measure), and ROC curve.
Decision Trees
Performance Measures & Classification Metrics
Confusion Matrix
: A table delineating true positives, true negatives, false positives, and false negatives; the metrics below follow from these counts (see the sketch after this list).
Recall
: True positives / actual positives.
Precision
: True positives / predicted positives.
Specificity
: True negatives / actual negatives.
Accuracy
: (True positives + true negatives) / total number of instances.
F1 Score
: Harmonic mean of precision and recall.
ROC Curve & AUC
: Plotting true positive rate against false positive rate; higher area under the curve indicates a better model.
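A minimal sketch of how these metrics follow from confusion-matrix counts (plain NumPy; the label arrays are illustrative placeholders):

```python
import numpy as np

# Illustrative binary labels and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Confusion-matrix counts.
tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

recall = tp / (tp + fn)             # true positives / actual positives
precision = tp / (tp + fp)          # true positives / predicted positives
specificity = tn / (tn + fp)        # true negatives / actual negatives
accuracy = (tp + tn) / len(y_true)  # correct predictions / total instances
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
```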
Bias-Variance Tradeoff
Bias
: High bias indicates the model is too simple and underfits the training data.
Variance
: High variance indicates the model is too complex and overfits the training data.
Training vs Testing Errors
: Should be similar; a test error much higher than the training error indicates overfitting, while high error on both indicates underfitting.
k-Fold Cross Validation
: Splits the data into k folds; each fold is used once as the test set while the remaining folds train the model, and performance is averaged over the k runs (see the sketch below).
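A sketch of k-fold cross-validation with scikit-learn (the dataset and model are placeholders chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold serves once as the test set
# while the remaining four folds are used for training.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())  # average accuracy and its spread across folds
```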
Construction & Intuition
Tree Construction
: Recursive partitioning of data space into regions based on feature splits.
Homogeneity
: Goal is to create leaves that are as homogeneous as possible.
Decision Node Splitting
: Splits based on feature value criteria to improve classification purity.
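A minimal scikit-learn sketch of growing a classification tree by recursive partitioning (the Iris dataset is used only as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# criterion="gini" is the default; criterion="entropy" switches to an information-gain split.
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Each printed line is either a decision node (feature <= threshold) or a leaf.
print(export_text(tree, feature_names=data.feature_names))
```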
Measures of Purity
Information Gain
: Reduction in impurity from the parent node to the weighted average of its children after a split (entropy-based measure).
Gini Index
: Measures impurity as the probability that a randomly drawn sample would be misclassified if labeled according to the node's class distribution.
Classification Error
: Proportion of misclassified instances within a node.
Variance
: Used for regression trees to determine node impurity.
ID3, C4.5, CART
: Different algorithms using various split criteria (entropy for ID3, gain ratio for C4.5, Gini index for CART).
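A plain-NumPy sketch of the impurity measures above and the information gain of a candidate split (the class labels are made up for illustration):

```python
import numpy as np

def proportions(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(p ** 2)

def classification_error(p):
    return 1.0 - np.max(p)

# Hypothetical parent node and a candidate split into two children.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

# Information gain = parent impurity - weighted impurity of the children (entropy-based here).
w_left, w_right = len(left) / len(parent), len(right) / len(parent)
gain = entropy(proportions(parent)) - (
    w_left * entropy(proportions(left)) + w_right * entropy(proportions(right))
)
print(gain)  # > 0 means the split reduces impurity
```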
Overfitting and Pruning
Pre-pruning
: Setting criteria for early stopping (e.g., minimum samples per node, maximum depth).
Post-pruning
: Pruning nodes after the tree is fully grown to reduce overfitting.
Pruning Criteria
: Based on misclassification error and performance on validation sets.
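A sketch of both strategies with scikit-learn's DecisionTreeClassifier; the parameter values are arbitrary examples, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing early with a depth limit and a minimum leaf size.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10)

# Post-pruning: grow the full tree, then prune back weak branches with
# cost-complexity pruning (larger ccp_alpha prunes more aggressively).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```

In practice ccp_alpha is itself tuned on a validation set or by cross-validation.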
Advantages and Disadvantages
Pros
: Fast, explainable, robust to missing values, handles non-linear data well.
Cons
: Susceptible to overfitting, does not capture interactions between attributes well, and the greedy splitting strategy may settle on locally rather than globally optimal trees.
K-Nearest Neighbors (KNN)
Concept and Intuition
Idea
: Classify based on 'nearness' to K closest points (neighbors) in the dataset.
Classification
: Assign class based on the majority class among neighbors.
Regression
: Predict values based on the mean (or median) of neighbor values.
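A from-scratch sketch of the idea (Euclidean distance, majority vote; the toy data and function name are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the query point to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    # Classification: majority class among the neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]
    # For regression, return y_train[nearest].mean() instead.

X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> 0
```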
Characteristics
Lazy Learning
: No model is built up front; the training data is stored and distances are computed at prediction time.
Instance-based Learning
: Predictions are made based on instances in the dataset.
Non-parametric
: Does not assume a fixed form for the model or function.
Distance Metrics and Normalization
Distance Calculation
: Commonly uses Euclidean distance; considers numeric proximity in multidimensional space.
Normalization
: Important to bring features to the same scale to avoid dominance by variables with larger value ranges.
Z-score Standardization
: Makes mean = 0 and standard deviation = 1 for features.
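A sketch of z-score standardization, by hand and with scikit-learn's StandardScaler (the feature matrix is a placeholder):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# By hand: subtract each feature's mean and divide by its standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same idea with scikit-learn; in practice, fit on the training set only
# and reuse that scaler to transform the test set.
X_scaled = StandardScaler().fit_transform(X)
```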
Handling K and Distance Weighting
Choosing K
: Higher K smooths out decision boundaries; smaller K can lead to overfitting.
Weighted KNN
: Closer neighbors have more influence; weights inversely proportional to the square of the distance.
Edited KNN
: Removes class outliers to improve model performance.
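A sketch of tuning K and enabling distance weighting with scikit-learn's KNeighborsClassifier (the grid values and dataset are illustrative; note that the built-in "distance" option weights neighbors by 1/d, so the inverse-squared weighting mentioned above would require a custom weight function):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Search over K and the weighting scheme; "distance" weights each neighbor by 1/d.
params = {"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), params, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```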
Advantages and Challenges
Advantages
: Simple, effective for small datasets, robust to noisy data with proper K selection.
Challenges
: Computationally intensive with large datasets, sensitive to irrelevant or redundant features, requires tuning of K.
Conclusion
Decision Trees
: Crucial concepts include information gain, Gini index, and pruning methods.
KNN
: Emphasizes normalization, appropriate K selection, and potential for weighted approaches for improved performance.
In practice, combining these methods and leveraging domain knowledge enhances model accuracy.