Jul 10, 2024
Lecture Notes - Decision Trees and K-Nearest Neighbors (KNN)
Introduction
Lecture Focus
: Decision trees and K-Nearest Neighbors (KNN) for classification and regression.
Review of previous topics: Logistic regression, confusion matrix, classification metrics (recall, precision, specificity, accuracy, F1 measure), and ROC curve.
Decision Trees
Performance Measures & Classification Metrics
Confusion Matrix
: A table delineating true positives, true negatives, false positives, and false negatives; the metrics below follow from these counts (see the sketch after this list).
Recall
: True positives / actual positives.
Precision
: True positives / predicted positives.
Specificity
: True negatives / actual negatives.
Accuracy
: (True positives + true negatives) / total number of instances.
F1 Score
: Harmonic mean of precision and recall.
ROC Curve & AUC
: Plotting true positive rate against false positive rate; higher area under the curve indicates a better model.
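A minimal sketch of how these metrics follow from confusion-matrix counts (plain NumPy; the label arrays are illustrative placeholders):

```python
import numpy as np

# Illustrative binary labels and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Confusion-matrix counts.
tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

recall = tp / (tp + fn)             # true positives / actual positives
precision = tp / (tp + fp)          # true positives / predicted positives
specificity = tn / (tn + fp)        # true negatives / actual negatives
accuracy = (tp + tn) / len(y_true)  # correct predictions / total instances
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
```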
Bias-Variance Tradeoff
Bias
: High bias indicates the model is too simple and underfits the training data.
Variance
: High variance indicates the model is too complex and overfits the training data.
Training vs Testing Errors
: Should be similar; a test error much higher than the training error indicates overfitting, while high error on both indicates underfitting.
k-Fold Cross Validation
: Splits the data into k folds; each fold is used once as the test set while the remaining folds train the model, and performance is averaged over the k runs (see the sketch below).
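A sketch of k-fold cross-validation with scikit-learn (the dataset and model are placeholders chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold serves once as the test set
# while the remaining four folds are used for training.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())  # average accuracy and its spread across folds
```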
Construction & Intuition
Tree Construction
: Recursive partitioning of data space into regions based on feature splits.
Homogeneity
: Goal is to create leaves that are as homogeneous as possible.
Decision Node Splitting
: Splits based on feature value criteria to improve classification purity.
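A minimal scikit-learn sketch of growing a classification tree by recursive partitioning (the Iris dataset is used only as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# criterion="gini" is the default; criterion="entropy" switches to an information-gain split.
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Each printed line is either a decision node (feature <= threshold) or a leaf.
print(export_text(tree, feature_names=data.feature_names))
```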
Measures of Purity
Information Gain
: Reduction in impurity from the parent node to the weighted average of its children after a split (entropy-based measure).
Gini Index
: Measures impurity as the probability that a randomly drawn sample would be misclassified if labeled according to the node's class distribution.
Classification Error
: Proportion of misclassified instances within a node.
Variance
: Used for regression trees to determine node impurity.
ID3, C4.5, CART
: Different algorithms using various split criteria (entropy for ID3, gain ratio for C4.5, Gini index for CART).
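A plain-NumPy sketch of the impurity measures above and the information gain of a candidate split (the class labels are made up for illustration):

```python
import numpy as np

def proportions(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(p ** 2)

def classification_error(p):
    return 1.0 - np.max(p)

# Hypothetical parent node and a candidate split into two children.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

# Information gain = parent impurity - weighted impurity of the children (entropy-based here).
w_left, w_right = len(left) / len(parent), len(right) / len(parent)
gain = entropy(proportions(parent)) - (
    w_left * entropy(proportions(left)) + w_right * entropy(proportions(right))
)
print(gain)  # > 0 means the split reduces impurity
```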
Overfitting and Pruning
Pre-pruning
: Setting criteria for early stopping (e.g., minimum samples per node, maximum depth).
Post-pruning
: Pruning nodes after the tree is fully grown to reduce overfitting.
Pruning Criteria
: Based on misclassification error and performance on validation sets.
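A sketch of both strategies with scikit-learn's DecisionTreeClassifier; the parameter values are arbitrary examples, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing early with a depth limit and a minimum leaf size.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10)

# Post-pruning: grow the full tree, then prune back weak branches with
# cost-complexity pruning (larger ccp_alpha prunes more aggressively).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```

In practice ccp_alpha is itself tuned on a validation set or by cross-validation.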
Advantages and Disadvantages
Pros
: Fast, explainable, robust to missing values, handles non-linear data well.
Cons
: Susceptible to overfitting, does not capture interactions between attributes well, and the greedy splitting strategy may settle on locally rather than globally optimal trees.
K-Nearest Neighbors (KNN)
Concept and Intuition
Idea
: Classify based on 'nearness' to K closest points (neighbors) in the dataset.
Classification
: Assign class based on the majority class among neighbors.
Regression
: Predict values based on the mean (or median) of neighbor values.
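A from-scratch sketch of the idea (Euclidean distance, majority vote; the toy data and function name are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the query point to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    # Classification: majority class among the neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]
    # For regression, return y_train[nearest].mean() instead.

X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> 0
```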
Characteristics
Lazy Learning
: No model is built up front; the training data is stored and distances are computed at prediction time.
Instance-based Learning
: Predictions are made based on instances in the dataset.
Non-parametric
: Does not assume a fixed form for the model or function.
Distance Metrics and Normalization
Distance Calculation
: Commonly uses Euclidean distance; considers numeric proximity in multidimensional space.
Normalization
: Important to bring features to the same scale to avoid dominance by variables with larger value ranges.
Z-score Standardization
: Makes mean = 0 and standard deviation = 1 for features.
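A sketch of z-score standardization, by hand and with scikit-learn's StandardScaler (the feature matrix is a placeholder):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# By hand: subtract each feature's mean and divide by its standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same idea with scikit-learn; in practice, fit on the training set only
# and reuse that scaler to transform the test set.
X_scaled = StandardScaler().fit_transform(X)
```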
Handling K and Distance Weighting
Choosing K
: Higher K smooths out decision boundaries; smaller K can lead to overfitting.
Weighted KNN
: Closer neighbors have more influence; weights inversely proportional to the square of the distance.
Edited KNN
: Removes class outliers to improve model performance.
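A sketch of tuning K and enabling distance weighting with scikit-learn's KNeighborsClassifier (the grid values and dataset are illustrative; note that the built-in "distance" option weights neighbors by 1/d, so the inverse-squared weighting mentioned above would require a custom weight function):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Search over K and the weighting scheme; "distance" weights each neighbor by 1/d.
params = {"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), params, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```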
Advantages and Challenges
Advantages
: Simple, effective for small datasets, robust to noisy data with proper K selection.
Challenges
: Computationally intensive with large datasets, sensitive to irrelevant or redundant features, requires tuning of K.
Conclusion
Decision Trees
: Crucial concepts include information gain, Gini index, and pruning methods.
KNN
: Emphasizes normalization, appropriate K selection, and potential for weighted approaches for improved performance.
In practice, combining these methods and leveraging domain knowledge enhances model accuracy.