Machine Learning: KNN and Association Rule Mining

Jul 16, 2024


Introduction

  • The professor clarified that the earlier postponement of class was not due to an emergency but to a calendar mix-up.
  • Agenda: Finish KNN and move on to Association Rule Mining, often referred to as Market Basket Analysis after its best-known retail application.

K-Nearest Neighbors (KNN)

Overview

  • Classification Technique: Primarily used for classification, though it can also be used for regression.
  • Comparison with Decision Trees: Unlike decision trees, which learn explicit split rules, KNN classifies a point by the labels of its nearest training examples; the central design question is how 'nearest' is defined.

Key Concepts

  • Distance Metrics: Commonly uses Euclidean distance; metrics suited to categorical variables will be discussed in the clustering lectures.
  • Impact of Scale: Features measured on larger scales dominate distance calculations, so standardization with Z-scores is essential (see the sketch after this list).
  • Selection of K: Commonly an odd number (3, 5, etc.) to avoid ties in binary classification. Weighted KNN gives closer neighbors a higher weight in the vote.
  • Overfitting: Small K values may overfit; larger K values generalize better. Other remedies include removing class outliers, i.e., points whose label disagrees with most of their neighbors (Wilson editing).
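
A minimal sketch of these points, assuming scikit-learn and synthetic data (none of the identifiers below come from the lecture): scaling lives inside a pipeline so test points are standardized with the training set's statistics, K is odd, and weights="distance" enables weighted KNN.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Odd K avoids ties in binary classification; weights="distance" gives
# closer neighbors more influence (weighted KNN).
knn = make_pipeline(
    StandardScaler(),  # Z-score scaling, fit on the training data only
    KNeighborsClassifier(n_neighbors=5, weights="distance"),
)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```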

Issues and Improvements

  • Curse of Dimensionality: As dimensions are added, the data needed to cover the space grows rapidly and pairwise distances become nearly uniform, making 'nearest' less meaningful.
  • Computational Complexity: Prediction requires comparing a query against every training point; Condensed Nearest Neighbors (CNN) reduces the training set while preserving the classification boundaries.
  • Dimensionality Reduction: Principal Component Analysis (PCA) and other methods reduce the number of features before applying KNN (see the sketch after this list).
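
A minimal sketch, again assuming scikit-learn and made-up data, of PCA as a preprocessing step for KNN; the component count here is an illustrative assumption, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Many features, few of them informative: a setting where PCA helps.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

pipe = make_pipeline(
    StandardScaler(),     # scale first so PCA is not dominated by units
    PCA(n_components=5),  # keep 5 components (a hyperparameter to tune)
    KNeighborsClassifier(n_neighbors=5),
)
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```

For the CNN idea above, the third-party imbalanced-learn package offers a CondensedNearestNeighbour under-sampler in the same spirit.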

KNN in Anomaly Detection

  • Credit Card Fraud: A highly imbalanced dataset; the data were split into training and testing sets, and KNN was used to flag fraudulent transactions.
  • Metrics: Accuracy, precision, recall, F1-score, and ROC curve analysis. On imbalanced data, accuracy alone is misleading (always predicting 'not fraud' already scores highly), so the other metrics matter; see the sketch after this list.
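
A minimal sketch, assuming scikit-learn and a synthetic stand-in for the fraud data (the 99:1 split and all parameters are illustrative), showing why precision, recall, F1, and ROC AUC are reported alongside accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data: ~1% positive ("fraud") class.
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)

# Per-class precision/recall/F1 expose what overall accuracy hides.
print(classification_report(y_te, model.predict(X_te), digits=3))
print("ROC AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```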

Association Rule Mining (ARM)

Introduction

  • Definition: Unsupervised learning technique used to find interesting associations between variables in large datasets.
  • Applications: Retail (market basket analysis), medical diagnosis, credit card businesses, web usage mining, etc.

Key Concepts

  • Term Definitions:
    • Item Set: A collection of one or more items, e.g., {milk, bread}.
    • Frequent Item Set: An item set whose support meets a minimum support threshold.
    • Rules: Implications of the form X -> Y, read "if X then Y", where X and Y are disjoint item sets.
    • Antecedent and Consequent: Items on the left and right side of the rule, respectively.

Measures of Rule Interestingness

  • Support: Fraction of transactions where both the antecedent and consequent appear together.
    • Support = (Transactions containing both X and Y) / (Total transactions).
  • Confidence: Given the antecedent, fraction of transactions that contain the consequent.
    • Confidence = (Support of {X, Y}) / (Support of X).
  • Lift: Measures how much more likely the consequent is when the antecedent is present, relative to the consequent's baseline frequency.
    • Lift = (Confidence of {X -> Y}) / (Support of Y).
    • Interpretation: Lift > 1 indicates a positive association, Lift < 1 a negative association, and Lift = 1 no association (a small computation appears after this list).
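
A minimal plain-Python sketch computing these three measures directly from their definitions; the transactions and item names are invented for illustration.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(X, Y, transactions):
    """Support of {X, Y} divided by support of X."""
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """Confidence of X -> Y divided by support of Y."""
    return confidence(X, Y, transactions) / support(Y, transactions)

X, Y = {"diapers"}, {"beer"}
print("support:", support(X | Y, transactions))     # 3/5 = 0.60
print("confidence:", confidence(X, Y, transactions))  # 0.60/0.80 = 0.75
print("lift:", lift(X, Y, transactions))            # 0.75/0.60 = 1.25
```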

Rule Evaluation

  • Example: Tea and coffee purchases (a worked version appears after this list).
  • Interpretations:
    • Lift = 1: X and Y are independent; no real association.
    • Lift > 1: Positive association; the occurrence of X increases the likelihood of Y.
    • Lift < 1: Negative association; the occurrence of X decreases the likelihood of Y.
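  • Worked illustration (classic textbook numbers, not necessarily the lecture's): out of 1,000 transactions, suppose 200 contain tea, 800 contain coffee, and 150 contain both. Then Confidence(tea -> coffee) = 150/200 = 0.75, which looks strong, yet Support(coffee) = 0.80, so Lift = 0.75/0.80 ≈ 0.94 < 1: buying tea actually makes coffee slightly less likely than average.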

Practical Applications

  • Retail: Store layout, cross-selling, promotions, catalog design.
  • Medical: Diagnosis, treatment analysis.
  • Web Mining: Page associations, user behavior.

Challenges

  • Too Many Rules: The number of candidate rules grows exponentially with the number of items.
  • Filtering Rules: Minimum support, confidence, and lift thresholds filter out irrelevant or trivial rules (see the sketch after this list).
  • Business Decisions: Planogram layouts and marketing strategies can be driven by the discovered associations.
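
A minimal plain-Python sketch (invented transactions; thresholds are arbitrary) that enumerates every one-item-to-one-item rule over a small basket and keeps only those passing the thresholds. In general, n items admit on the order of 3^n possible rules, which is why real miners such as Apriori prune the search rather than enumerating.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "cola"},
]
items = set().union(*transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

MIN_SUPPORT, MIN_CONFIDENCE, MIN_LIFT = 0.4, 0.6, 1.0

# Restricted to single-item antecedents/consequents for brevity;
# general rules over all item-set pairs blow up combinatorially.
for x, y in combinations(items, 2):
    for X, Y in (({x}, {y}), ({y}, {x})):  # both rule directions
        s = support(X | Y)
        if s < MIN_SUPPORT:
            continue                        # fails minimum support
        conf = s / support(X)
        lft = conf / support(Y)
        if conf >= MIN_CONFIDENCE and lft >= MIN_LIFT:
            print(f"{X} -> {Y}: support={s:.2f}, "
                  f"confidence={conf:.2f}, lift={lft:.2f}")
```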

Next Steps

  • Algorithms: The Apriori and FP-Growth algorithms will be discussed in the next class.
  • Summary: Covered KNN improvements, dimensionality issues, computational complexity, and introduced ARM with its key measures and applications.