Introduction to Machine Learning

Jun 27, 2024

Lecture Overview

  • Lecture Recording Warning: There's a temptation to skip classes and rely on the video recordings.
    • Watching recordings at home is far less engaging than being in the room.
    • Attending the live lectures is strongly recommended.
  • Emails from Course Team: Check your spam folder for emails about Bo Karyam or your exam performance.
    • Contact the TAs if you haven't received any communication.
  • Classroom Rules: No laptops or iPads are allowed during the lecture.

Topics and Questions

  • Discussion of project partners and how the collaboration should be organized.
  • Machine Learning:
    • Initial setup: a data set D = {(x_1, y_1), ..., (x_n, y_n)} of feature vectors x_i with labels y_i.
    • Goal: find a function h that predicts Y from X.
    • Hypothesis Class (H): the set of all functions the learner considers.
    • Example: the set of all decision trees.
    • Choose the best hypothesis in H based on the data.
    • Concretely, pick the hypothesis that minimizes a loss function (L), as formalized below.
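
In this notation, learning amounts to empirical risk minimization; the symbols n (the number of examples) and \hat{h} (the chosen hypothesis) are introduced here for illustration:

    \hat{h} = \arg\min_{h \in H} \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i)

That is, pick the hypothesis in H with the smallest average loss over the training examples.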

Validation and Testing

  • Training and Test Split:
    • Divide the data into D_training and D_test.
    • The loss on the test set is an unbiased estimate of the actual (expected) loss, as long as the test set is never touched during training.
    • Be cautious about how the split is made.
    • Example: spam changes over time, so evaluating a spam filter with a uniformly random split leaks future information into training; such data should be split by time instead.
  • Overfitting: Avoid memorizing the dataset; a model that memorizes does not generalize to new data.
  • Minimizing expected loss: The real objective is low loss on new, unseen data; a minimal splitting sketch follows below.
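
A minimal sketch of the uniformly random split for IID data (the function name, seed, and 20% test fraction are illustrative choices, not from the lecture):

    import random

    def train_test_split(data, test_fraction=0.2, seed=0):
        """Uniformly random split, appropriate for IID data only.
        For data with a temporal component, sort by time and cut
        at a fixed point instead of shuffling."""
        rng = random.Random(seed)
        indices = list(range(len(data)))
        rng.shuffle(indices)
        n_test = int(len(data) * test_fraction)
        d_test = [data[i] for i in indices[:n_test]]
        d_training = [data[i] for i in indices[n_test:]]
        return d_training, d_test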

Practical Concerns and Methods

  • Splitting Data: The correctness of the splitting method determines whether the test estimate can be trusted.
    • Temporal component: split by time.
    • IID data: split uniformly at random.
  • Validation Set: Split the data into a training set, a validation set, and a test set.
    • Use the validation set to choose the best model.
    • Evaluate the final model exactly once, on the test set.
    • Avoid overfitting to the validation set by repeatedly tuning against it; a sketch of this workflow follows the list.
  • Challenges in Data Splitting:
    • Example: in medical studies, records from the same patient must not appear in both the training and the test split, or patient-specific information leaks across the split.
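
A minimal sketch of the train/validation/test workflow (the fit method on the candidate models and the loss_fn signature are hypothetical placeholders):

    def select_and_evaluate(models, d_training, d_val, d_test, loss_fn):
        """Fit each candidate model on the training set, keep the one
        with the lowest validation loss, then report that single model's
        loss on the untouched test set."""
        best_model, best_val_loss = None, float("inf")
        for model in models:
            model.fit(d_training)             # hypothetical fit() interface
            val_loss = loss_fn(model, d_val)
            if val_loss < best_val_loss:
                best_model, best_val_loss = model, val_loss
        # The test set is consulted exactly once, for the final estimate.
        return best_model, loss_fn(best_model, d_test)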

Key Machine Learning Concepts

  • Machine Learning Assumptions: Every algorithm encodes assumptions about the data.
    • Example: smoothness assumptions, i.e., nearby inputs should have similar outputs.
    • No-Free-Lunch Theorem: no single algorithm performs best on all possible data distributions.
    • Hence the importance of choosing an algorithm whose assumptions match the data at hand.

K-Nearest Neighbors (KNN)

  • Introduction to KNN:
    • Assumption: data points that are close together have similar labels.
    • Algorithm Steps:
      • For a test point X, find its K nearest neighbors in the dataset D.
      • Predict the label of X by majority vote among those K neighbors.
    • Distance Metrics:
      • "Nearest" depends on the distance metric, so choosing it well matters.
      • Minkowski Distance: a generalized metric with several special cases (see the formula after this list):
        • Manhattan distance (P = 1)
        • Euclidean distance (P = 2)
        • Max distance (P → ∞)
      • Choose the distance metric best suited to the data.
  • Formalizing KNN:
    • Define S(X) as the set of the K nearest neighbors of X in D.
    • Formally: every point not in S(X) is at least as far from X as the farthest point in S(X).
    • Practical applications and implementation considerations; a minimal sketch follows below.
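
For reference, the Minkowski distance between two d-dimensional points x and z can be written as (the dimension symbol d is introduced here):

    d_p(x, z) = \left( \sum_{i=1}^{d} |x_i - z_i|^p \right)^{1/p}

Setting p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and letting p → ∞ yields the max distance max_i |x_i - z_i|.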
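
And a minimal sketch of the classifier under the assumption above (function names are illustrative; ties in distance and in the vote are broken arbitrarily here):

    from collections import Counter

    def minkowski(x, z, p=2):
        """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
        return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1 / p)

    def knn_predict(D, x, k=3, p=2):
        """Predict the label of x by majority vote among its k nearest
        neighbors in D, a list of (feature_vector, label) pairs."""
        # S(x): the k points of D closest to x under the chosen metric.
        neighbors = sorted(D, key=lambda pair: minkowski(pair[0], x, p))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

For example, knn_predict([([0, 0], 'a'), ([1, 1], 'a'), ([5, 5], 'b')], [0.5, 0.5], k=3) returns 'a'.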

Next Steps and In-Class Activity

  • A demo is planned for the next session.
  • Reminder: start the next class with the KNN demo.