Lecture: Machine Learning - KNN and Association Rule Mining
Introduction
The professor clarified that the earlier class was postponed because of a calendar mix-up, not an emergency.
Agenda: finish KNN, then move on to Association Rule Mining (often called Market Basket Analysis).
K-Nearest Neighbors (KNN)
Overview
Classification Technique: KNN is primarily used for classification, but it can also be used for regression (by averaging the target values of the neighbors).
Comparison with Decision Trees: Unlike decision trees, which partition the feature space with explicit rules, KNN classifies a point by the labels of its nearest training examples, so how 'nearest' is defined (the distance metric) is central.
Key Concepts
Distance Metrics: KNN commonly uses Euclidean distance, d(x, z) = sqrt(sum_i (x_i - z_i)^2); metrics better suited to categorical variables will be discussed later, in the clustering lectures.
Impact of Scale: Because KNN compares raw distances, a feature measured on a large scale can dominate the metric, so standardization (using z-scores) is essential (see the first sketch after this list).
Selection of K: An odd K (3, 5, etc.) is common because it avoids ties in binary classification; weighted KNN gives closer neighbors a higher vote.
Overfitting: Small K values tend to overfit, since a prediction can hinge on a single noisy point, while larger K values generalize better. Remedies include increasing K and removing class outliers via Wilson editing (sketched below).
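To make the scaling and weighting points concrete, here is a minimal sketch, assuming scikit-learn and its built-in iris data (both assumptions, not from the lecture): z-scores feed a distance-weighted KNN with an odd K.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # StandardScaler applies z-scores; weights="distance" implements weighted
    # KNN (closer neighbors get larger votes); Euclidean distance is the default.
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=5, weights="distance"))
    knn.fit(X_train, y_train)
    print("test accuracy:", knn.score(X_test, y_test))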
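Wilson editing was only named in the lecture; the following is a hypothetical implementation sketch of the usual rule: each training point is checked against the majority label of its k nearest neighbors, and points the neighborhood disagrees with are removed.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def wilson_edit(X, y, k=3):
        """Sketch of Wilson editing: drop points misclassified by their
        k nearest neighbors (labels y must be non-negative integers)."""
        # Ask for k+1 neighbors because each point's nearest neighbor is itself.
        _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        keep = np.array([
            np.bincount(y[nbrs[nbrs != i][:k]]).argmax() == y[i]
            for i, nbrs in enumerate(idx)
        ])
        return X[keep], y[keep]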
Issues and Improvements
Curse of Dimensionality: As dimensions are added, the amount of data needed to cover the space grows rapidly and distances between points become less meaningful (demonstrated after this list).
Computational Complexity: KNN must search the training set at prediction time; Condensed Nearest Neighbors (CNN) reduces the stored training data while approximately preserving the classification boundary (sketched below).
Dimensionality Reduction: Principal Component Analysis (PCA) and similar methods reduce the number of features before distances are computed (see the pipeline sketch below).
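The distance-concentration effect can be demonstrated with random data; a small illustrative script (NumPy assumed, numbers synthetic) shows the gap between nearest and farthest neighbor shrinking as dimension grows.

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        X = rng.random((500, d))                      # 500 uniform points in [0,1]^d
        dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from one query point
        # Relative contrast shrinks with d: nearest and farthest neighbors
        # become almost equally far away, so "nearest" carries less signal.
        print(f"d={d:4d}  (max-min)/min = {(dists.max() - dists.min()) / dists.min():.3f}")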
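CNN was described only at a high level; a minimal sketch of Hart's condensation rule (scikit-learn assumed) keeps a small prototype set and absorbs any point the current prototypes misclassify.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def condense(X, y):
        """Sketch of Hart's CNN: return indices of a reduced prototype set."""
        keep = [0]                                    # seed with an arbitrary point
        changed = True
        while changed:
            changed = False
            clf = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
            for i in range(len(X)):
                if i not in keep and clf.predict(X[i:i + 1])[0] != y[i]:
                    keep.append(i)                    # absorb the misclassified point
                    clf = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
                    changed = True
        return np.array(keep)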
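PCA can be slotted in front of KNN as a preprocessing step; a sketch using scikit-learn's digits data (the dataset and the component count are illustrative assumptions, not from the lecture):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_digits(return_X_y=True)               # 64 pixel features per digit
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Project the 64 features onto 10 principal components before measuring
    # distances, so KNN operates in a much lower-dimensional space.
    pipe = make_pipeline(StandardScaler(), PCA(n_components=10),
                         KNeighborsClassifier(n_neighbors=5))
    pipe.fit(X_train, y_train)
    print("test accuracy:", pipe.score(X_test, y_test))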
KNN in Anomaly Detection
Credit Card Fraud: The example used a highly imbalanced dataset, split into training and test sets, with KNN applied to flag fraudulent transactions; a hedged sketch follows.
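The lecture's actual fraud dataset is not reproduced here; this sketch substitutes a synthetic dataset at roughly 99:1 imbalance (make_classification and the 1% fraud rate are assumptions) and reports per-class metrics, since plain accuracy is misleading under heavy imbalance.

    from sklearn.datasets import make_classification
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in: ~1% positive ("fraud") class.
    X, y = make_classification(n_samples=20000, n_features=10,
                               weights=[0.99], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=5, weights="distance"))
    model.fit(X_tr, y_tr)
    # At 99:1 imbalance, accuracy alone is misleading; inspect the
    # minority class's precision and recall instead.
    print(classification_report(y_te, model.predict(X_te), digits=3))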