Introduction to Machine Learning
Jun 27, 2024
Lecture Overview
Lecture Recording Warning: There's a temptation to skip classes and rely on video recordings.
Watching recordings at home is less engaging; attending live lectures is better.
Emails from Course Team: Check your spam folder for emails related to Vocareum or exam performance.
Contact the TAs if you haven't received any communication.
Classroom Rules: No laptops or iPads allowed during the lecture.
Topics and Questions
Discussion on project partners and how collaboration should be carried out.
Machine Learning:
Initial setup: a dataset D with feature vectors and labels.
Goal: find a function that predicts Y from X.
Hypothesis Class (H): the set of all possible functions considered. Example: decision trees.
The best hypothesis is chosen from H based on the data by minimizing a loss function L.
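As a minimal sketch of this setup, empirical risk minimization picks the hypothesis in H with the lowest loss on the data. The tiny hypothesis class of threshold classifiers and the 0/1 loss below are illustrative assumptions, not the lecture's example:

```python
def zero_one_loss(h, data):
    """Fraction of (x, y) pairs in `data` that hypothesis `h` mislabels."""
    return sum(h(x) != y for x, y in data) / len(data)

# Toy dataset D: 1-D features with binary labels.
D = [(0.5, 0), (1.2, 0), (2.8, 1), (3.1, 1)]

# Hypothesis class H: threshold classifiers h_t(x) = 1 if x > t else 0.
H = [lambda x, t=t: int(x > t) for t in (0.0, 1.0, 2.0, 3.0)]

# Choose the hypothesis minimizing the loss L on the data.
best_h = min(H, key=lambda h: zero_one_loss(h, D))
print(zero_one_loss(best_h, D))  # 0.0, achieved by the threshold t = 2.0
```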
Validation and Testing
Training and Test Split: Divide the data into D_training and D_test.
The test set gives an unbiased estimate of the actual loss.
Be cautious about how the data is split.
Example: for a spam filter, a uniformly random split leaks future emails into the training set; the data should be split by time instead.
Overfitting: Avoid memorizing the dataset, as a memorized model doesn't generalize well to new data.
Minimizing expected loss: Aim to minimize loss on new, unseen data.
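In symbols (a standard formulation consistent with the setup above; the notation for the loss ℓ and data distribution P is assumed, not quoted from the lecture):

```latex
% Expected (true) loss of a hypothesis h under the data distribution P,
% and its unbiased estimate on a held-out test set D_test.
\epsilon(h) = \mathbb{E}_{(x,y)\sim P}\big[\ell(h(x), y)\big]
\approx \frac{1}{|D_{\text{test}}|} \sum_{(x,y)\in D_{\text{test}}} \ell\big(h(x), y\big)
```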
Practical Concerns and Methods
Splitting Data: Use a method appropriate to the data when splitting it for training and testing.
Temporal component: split by time.
IID data: split uniformly at random.
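A minimal sketch of both strategies (the (timestamp, x, y) record layout and the 80/20 ratio are assumptions for illustration):

```python
import random

def temporal_split(data, frac_train=0.8):
    """Split time-stamped (timestamp, x, y) records: oldest records go to
    training, newest to test, so the test set mimics future data."""
    data = sorted(data, key=lambda record: record[0])
    cut = int(len(data) * frac_train)
    return data[:cut], data[cut:]

def iid_split(data, frac_train=0.8, seed=0):
    """Uniformly random split, appropriate when the data are i.i.d."""
    data = data[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(data)
    cut = int(len(data) * frac_train)
    return data[:cut], data[cut:]
```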
Validation Set: Divide the data into a training set, a validation set, and a test set.
Use validation to choose the best model.
Final model evaluation on the test set.
Avoid overfitting to the validation set.
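A runnable sketch of this protocol (the fixed threshold-classifier candidates and the split sizes are hypothetical, reused from the earlier sketch):

```python
import random

def zero_one_loss(h, data):
    """Fraction of (x, y) pairs in `data` that hypothesis `h` mislabels."""
    return sum(h(x) != y for x, y in data) / len(data)

# Toy data split into training, validation, and test portions.
random.seed(0)
D = [(x, int(x > 2.0)) for x in (random.uniform(0, 4) for _ in range(30))]
train, val, test = D[:20], D[20:25], D[25:]

# Candidate models (hypothetical threshold classifiers).
candidates = [lambda x, t=t: int(x > t) for t in (0.5, 1.0, 2.0, 3.0)]

# Choose the model with the lowest validation loss...
best = min(candidates, key=lambda h: zero_one_loss(h, val))
# ...then report its loss on the untouched test set, exactly once.
print(zero_one_loss(best, test))
```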
Challenges in Data Splitting: Example: issues in splitting patient data in medical studies, e.g., data from the same patient appearing in both the training and test sets.
Key Machine Learning Concepts
Machine Learning Assumptions: Different algorithms make different assumptions about the data.
Example: Smoothness assumptions in function prediction.
No-Free-Lunch Theorem: No single algorithm performs best on all types of data.
Importance of choosing the right algorithm based on data assumptions.
K-Nearest Neighbors (KNN)
Introduction to KNN: Assumption: data points that are close together have similar labels.
Algorithm Steps:
For a test point x, find the k nearest neighbors in the dataset D.
Take a majority vote among the k neighbors' labels to determine the label of x.
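A minimal, self-contained KNN sketch (Euclidean distance is used here as a placeholder; the lecture turns to the general Minkowski family next):

```python
from collections import Counter
import math

def euclidean(a, b):
    """Euclidean (Minkowski p = 2) distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(D, x, k=3):
    """Majority vote over the k training points nearest to the test point x.
    D is a list of (feature_vector, label) pairs."""
    neighbors = sorted(D, key=lambda pair: euclidean(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy usage: two clusters in the plane.
D = [((0, 0), 'a'), ((0, 1), 'a'), ((1, 0), 'a'),
     ((5, 5), 'b'), ((5, 6), 'b'), ((6, 5), 'b')]
print(knn_predict(D, (0.5, 0.5)))  # 'a'
```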
Distance Metrics: Choosing the right distance metric is important.
Minkowski Distance: a generalized distance metric that includes special cases:
Manhattan distance (P=1)
Euclidean distance (P=2)
Max distance (P → ∞)
Choosing the best distance metric suitable for the data.
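In symbols, the Minkowski distance of order p between feature vectors x and z (standard definition, matching the special cases above):

```latex
d_p(x, z) = \left( \sum_{i=1}^{d} |x_i - z_i|^p \right)^{1/p}
% p = 1: Manhattan;  p = 2: Euclidean;
% p -> infinity: \max_i |x_i - z_i| (max distance).
```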
Formalizing KNN:
Define the set S(x) of the k nearest neighbors of a test point x.
Every point not in S(x) must be at least as far from x as the furthest point in S(x); see the formalization after this list.
Practical applications and considerations for implementing KNN.
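Written out (a standard formalization consistent with the two conditions above; dist denotes the chosen distance metric):

```latex
S(x) \subseteq D \quad \text{with} \quad |S(x)| = k, \quad \text{and}
\forall (x', y') \in D \setminus S(x):\;
\operatorname{dist}(x, x') \ge \max_{(x'', y'') \in S(x)} \operatorname{dist}(x, x'')
```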
Next Steps and In-Class Activity
A KNN demo is planned for the next session; remember to start the next class with it.