Understanding K-Nearest Neighbors Algorithm

Sep 9, 2024

Lecture Notes: K-Nearest Neighbors (KNN) Algorithm

Overview

  • Focus on K-Nearest Neighbors (KNN) algorithm for classification.
  • Objective: Predict the presence of kyphosis (a spinal condition) in children from features such as Age, Number, and Start.
  • KNN can be used for regression tasks, but the focus is on classification.

Basics of KNN

  • KNN finds similar data points in training data to make predictions.
  • Example: Classifying t-shirt sizes (large or small) based on weight (kg) and height (cm).
    • Red class: large size.
    • Blue class: small size.

Working of KNN

  1. Select a Value for K
    • K represents the number of neighbors considered.
    • It is a tunable hyperparameter.
  2. Calculate Euclidean Distance
    • Distance between a new data point and all points in the dataset.
    • Formula: \( \text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \)
  3. Pick K Closest Data Points
    • Choose points with the smallest distances.
  4. Majority Vote
    • Determine the class by majority vote among the K selected neighbors.
    • Assign the new point to that dominant class (a minimal code sketch of these steps follows below).
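
A minimal sketch of these four steps in plain NumPy. The height/weight values and labels below are purely illustrative placeholders, not the lecture's actual dataset:

```python
import numpy as np
from collections import Counter

# Illustrative training data: [height (cm), weight (kg)] with t-shirt sizes
X_train = np.array([
    [158, 58], [160, 59], [163, 61], [165, 62],   # smaller customers
    [170, 68], [172, 70], [175, 75], [178, 77],   # larger customers
])
y_train = np.array(["small", "small", "small", "small",
                    "large", "large", "large", "large"])

def knn_predict(x_new, X, y, k=5):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))
    # Step 3: pick the K closest points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the K neighbors
    votes = Counter(y[nearest])
    return votes.most_common(1)[0][0]

# Step 1: choose K, then classify a new customer (167 cm, 63 kg)
print(knn_predict(np.array([167, 63]), X_train, y_train, k=5))
```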

Example Process

  • Example adapted from ListenData.com.
  • Data includes height, weight, and t-shirt size (small or large).
  • Calculate Euclidean distances and rank them.
  • Select K (e.g., K=5) and identify the five closest data points.
  • Perform majority voting among these points to classify the new data point.
    • If the majority of the five neighbors are small, classify the new point as small; otherwise, classify it as large (see the scikit-learn sketch below).
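
The same workflow can be expressed with scikit-learn's KNeighborsClassifier. The data values here are placeholders standing in for the ListenData-style height/weight table, not the actual figures used in the lecture:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder height (cm) / weight (kg) data and t-shirt sizes
X = np.array([[158, 58], [160, 60], [163, 61], [165, 64],
              [168, 66], [170, 68], [173, 72], [175, 75]])
y = ["small", "small", "small", "small", "large", "large", "large", "large"]

# K=5 neighbors with Euclidean distance (scikit-learn's default metric)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)

# Classify a new customer and inspect the neighbors behind the decision
new_point = [[166, 63]]
print(model.predict(new_point))      # majority-vote class
print(model.kneighbors(new_point))   # distances and indices of the 5 nearest neighbors
```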

Visual Representation

  • Training data: Small size vs. Large size.
  • New data point classification via KNN.
  • Majority voting decides the final class (a plotting sketch follows).
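
One way to reproduce this picture with matplotlib, reusing the illustrative data from the sketches above and following the lecture's colour convention (red = large, blue = small):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative training data: blue = small, red = large
small = np.array([[158, 58], [160, 60], [163, 61], [165, 64]])
large = np.array([[168, 66], [170, 68], [173, 72], [175, 75]])
new_point = np.array([166, 63])

plt.scatter(small[:, 0], small[:, 1], c="blue", label="small")
plt.scatter(large[:, 0], large[:, 1], c="red", label="large")
plt.scatter(*new_point, c="green", marker="*", s=200, label="new point")
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.legend()
plt.title("KNN: new point classified by majority vote of nearest neighbors")
plt.show()
```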

Next Steps

  • Next lesson: KNN in AWS SageMaker.
    • Discuss CPU and data requirements.
    • Demonstration of coding the algorithm in SageMaker Studio.

  • Stay tuned for the next lesson.
  • Best of luck!