Machine Learning: KNN and Association Rule Mining

Jul 16, 2024


Introduction

  • The professor clarified that the earlier postponement of class was not due to an emergency but to a calendar mix-up.
  • Agenda: Finish KNN and move on to Association Rule Mining, often referred to as Market Basket Analysis after its best-known retail application.

K-Nearest Neighbors (KNN)

Overview

  • Classification Technique: Primarily used for classification, though it can also be used for regression.
  • Comparison with Decision Trees: Unlike decision trees, which learn explicit split rules, KNN classifies a point by the labels of its nearest training examples; the central design question is how 'nearest' is defined.

Key Concepts

  • Distance Metrics: Commonly uses Euclidean distance; metrics suited to categorical variables will be discussed in the clustering lectures.
  • Impact of Scale: Features measured on larger scales dominate distance calculations, so standardization with Z-scores is essential (see the sketch after this list).
  • Selection of K: Commonly an odd number (3, 5, etc.) to avoid ties in binary classification. Weighted KNN gives closer neighbors a higher weight in the vote.
  • Overfitting: Small K values may overfit; larger K values generalize better. Other remedies include removing class outliers, i.e., points whose label disagrees with most of their neighbors (Wilson editing).
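
A minimal sketch of these points, assuming scikit-learn and synthetic data (none of the identifiers below come from the lecture): scaling lives inside a pipeline so test points are standardized with the training set's statistics, K is odd, and weights="distance" enables weighted KNN.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Odd K avoids ties in binary classification; weights="distance" gives
# closer neighbors more influence (weighted KNN).
knn = make_pipeline(
    StandardScaler(),  # Z-score scaling, fit on the training data only
    KNeighborsClassifier(n_neighbors=5, weights="distance"),
)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```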

Issues and Improvements

  • Curse of Dimensionality: As dimensions are added, the data needed to cover the space grows rapidly and pairwise distances become nearly uniform, making 'nearest' less meaningful.
  • Computational Complexity: Prediction requires comparing a query against every training point; Condensed Nearest Neighbors (CNN) reduces the training set while preserving the classification boundaries.
  • Dimensionality Reduction: Principal Component Analysis (PCA) and other methods reduce the number of features before applying KNN (see the sketch after this list).
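
A minimal sketch, again assuming scikit-learn and made-up data, of PCA as a preprocessing step for KNN; the component count here is an illustrative assumption, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Many features, few of them informative: a setting where PCA helps.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

pipe = make_pipeline(
    StandardScaler(),     # scale first so PCA is not dominated by units
    PCA(n_components=5),  # keep 5 components (a hyperparameter to tune)
    KNeighborsClassifier(n_neighbors=5),
)
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```

For the CNN idea above, the third-party imbalanced-learn package offers a CondensedNearestNeighbour under-sampler in the same spirit.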

KNN in Anomaly Detection

  • Credit Card Fraud: A highly imbalanced dataset; the data were split into training and testing sets, and KNN was used to flag fraudulent transactions.
  • Metrics: Accuracy, precision, recall, F1-score, and ROC curve analysis. On imbalanced data, accuracy alone is misleading (always predicting 'not fraud' already scores highly), so the other metrics matter; see the sketch after this list.
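
A minimal sketch, assuming scikit-learn and a synthetic stand-in for the fraud data (the 99:1 split and all parameters are illustrative), showing why precision, recall, F1, and ROC AUC are reported alongside accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data: ~1% positive ("fraud") class.
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)

# Per-class precision/recall/F1 expose what overall accuracy hides.
print(classification_report(y_te, model.predict(X_te), digits=3))
print("ROC AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```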

Association Rule Mining (ARM)

Introduction

  • Definition: Unsupervised learning technique used to find interesting associations between variables in large datasets.
  • Applications: Retail (market basket analysis), medical diagnosis, credit card businesses, web usage mining, etc.

Key Concepts

  • Term Definitions:
    • Item Set: A collection of one or more items, e.g., {milk, bread}.
    • Frequent Item Set: An item set whose support meets a minimum support threshold.
    • Rules: Implications of the form X -> Y, read "if X then Y", where X and Y are disjoint item sets.
    • Antecedent and Consequent: Items on the left and right side of the rule, respectively.

Measures of Rule Interestingness

  • Support: Fraction of transactions where both the antecedent and consequent appear together.
    • Support = (Transactions containing both X and Y) / (Total transactions).
  • Confidence: Given the antecedent, fraction of transactions that contain the consequent.
    • Confidence = (Support of {X, Y}) / (Support of X).
  • Lift: Measures how much more likely the consequent is when the antecedent is present, relative to the consequent's baseline frequency.
    • Lift = (Confidence of {X -> Y}) / (Support of Y).
    • Interpretation: Lift > 1 indicates a positive association, Lift < 1 a negative association, and Lift = 1 no association (a small computation appears after this list).
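
A minimal plain-Python sketch computing these three measures directly from their definitions; the transactions and item names are invented for illustration.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(X, Y, transactions):
    """Support of {X, Y} divided by support of X."""
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """Confidence of X -> Y divided by support of Y."""
    return confidence(X, Y, transactions) / support(Y, transactions)

X, Y = {"diapers"}, {"beer"}
print("support:", support(X | Y, transactions))     # 3/5 = 0.60
print("confidence:", confidence(X, Y, transactions))  # 0.60/0.80 = 0.75
print("lift:", lift(X, Y, transactions))            # 0.75/0.60 = 1.25
```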

Rule Evaluation

  • Example: Tea and coffee purchases (a worked version appears after this list).
  • Interpretations:
    • Lift = 1: X and Y are independent; no real association.
    • Lift > 1: Positive association; the occurrence of X increases the likelihood of Y.
    • Lift < 1: Negative association; the occurrence of X decreases the likelihood of Y.
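  • Worked illustration (classic textbook numbers, not necessarily the lecture's): out of 1,000 transactions, suppose 200 contain tea, 800 contain coffee, and 150 contain both. Then Confidence(tea -> coffee) = 150/200 = 0.75, which looks strong, yet Support(coffee) = 0.80, so Lift = 0.75/0.80 ≈ 0.94 < 1: buying tea actually makes coffee slightly less likely than average.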

Practical Applications

  • Retail: Store layout, cross-selling, promotions, catalog design.
  • Medical: Diagnosis, treatment analysis.
  • Web Mining: Page associations, user behavior.

Challenges

  • Too Many Rules: The number of candidate rules grows exponentially with the number of items.
  • Filtering Rules: Minimum support, confidence, and lift thresholds filter out irrelevant or trivial rules (see the sketch after this list).
  • Business Decisions: Planogram layouts and marketing strategies can be driven by the discovered associations.
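
A minimal plain-Python sketch (invented transactions; thresholds are arbitrary) that enumerates every one-item-to-one-item rule over a small basket and keeps only those passing the thresholds. In general, n items admit on the order of 3^n possible rules, which is why real miners such as Apriori prune the search rather than enumerating.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "cola"},
]
items = set().union(*transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

MIN_SUPPORT, MIN_CONFIDENCE, MIN_LIFT = 0.4, 0.6, 1.0

# Restricted to single-item antecedents/consequents for brevity;
# general rules over all item-set pairs blow up combinatorially.
for x, y in combinations(items, 2):
    for X, Y in (({x}, {y}), ({y}, {x})):  # both rule directions
        s = support(X | Y)
        if s < MIN_SUPPORT:
            continue                        # fails minimum support
        conf = s / support(X)
        lft = conf / support(Y)
        if conf >= MIN_CONFIDENCE and lft >= MIN_LIFT:
            print(f"{X} -> {Y}: support={s:.2f}, "
                  f"confidence={conf:.2f}, lift={lft:.2f}")
```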

Next Steps

  • Algorithms: The Apriori and FP-Growth algorithms will be discussed in the next class.
  • Summary: Covered KNN improvements, dimensionality issues, computational complexity, and introduced ARM with its key measures and applications.