Hello everyone, and welcome to this lesson. In this lesson, let's go ahead and cover the K-Nearest Neighbors algorithm, or KNN for short. If you guys recall, in the previous lesson we covered the project overview, and we learned that we are going to develop a classifier model based on the KNN algorithm. The objective, again, is to try to predict whether children have kyphosis disease or not based on their features, and these are the simple features that we have in our data: Age, Number, and Start. And that's about it.
Okay, so let me walk you through the KNN algorithm. K-Nearest Neighbors, or KNN, is a classification algorithm. And actually, within SageMaker, you can also use it to perform regression-type tasks as well.
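As a quick aside, here is roughly what that classification-versus-regression choice looks like with the SageMaker Python SDK's built-in KNN estimator. Treat this as a hedged sketch rather than our project code: the IAM role ARN and the instance settings are placeholders, and we'll walk through the SageMaker side of KNN properly in an upcoming lesson.

```python
from sagemaker import KNN

# A minimal sketch of SageMaker's built-in KNN estimator (Python SDK v2).
# The role ARN and instance settings below are placeholders, not real values.
knn = KNN(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.m5.large",
    k=5,                           # number of neighbors to vote
    sample_size=500,               # number of training samples the model retains
    predictor_type="classifier",   # switch to "regressor" for regression tasks
)
```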
But here, we're primarily going to focus on the classification piece of it. Simply put, KNN works by finding the most similar data points in the training data and making an educated guess about a new point's class based on how those neighbors are classified. I know that might sound a little strange, so let me show you an example right away.
Okay, so let's assume that I have two different sizes of t-shirts, and the size is either large or small. The features that I'm going to use here in my data are simply the weight of a person and also their height. The weight is in kilograms, and the height is in centimeters.
So essentially what I have here are two classes, and based on the customer's features, which are their weight and their height, I'm going to say: well, do they belong to the blue class or do they belong to the red class? That's all there is. Would they wear a large size or a small size? So the red class is simply the large size.
So those are these points here, and the blue points are the small size, which are these ones here. If I click here again, you will see which points are the large size and which are the small size. And the question is, let's assume that I have a new customer right now, okay?
And this customer shows up here as this big question mark, that green point. This customer has specific features: they have their own weight, and they also have their own height. And the question is: should I give them a small size or should I give them a large size? That's essentially the question here.
All right. So let me show you how KNN, or K-Nearest Neighbors, can actually solve that problem for you. Here is the algorithm from a very high level.
What KNN does first is select a value for k, and k is simply the number of neighbors of this specific data point that we're going to consider. This k could be 1, 2, 3, 4, and so on; it's a tunable parameter. Next, we go ahead and calculate the Euclidean distance between that point and every point in the training data. I'm going to show you an example coming up next, and I'm also going to show you what we mean by the Euclidean distance. Actually, let me show you what we mean by Euclidean distance first.
It's pretty simple. Let's say I have two points: point A with coordinates (x1, y1), and another data point, point B, with coordinates (x2, y2). If you want to calculate the Euclidean distance between the two, all you need to do is the following.
You take (x2 minus x1) squared, plus (y2 minus y1) squared, and then you take the square root of that: distance = sqrt((x2 − x1)² + (y2 − y1)²). That tells you the Euclidean distance between these two points, and that is going to be the key metric for deciding whether a data point belongs to class large or class small, class A or class B, essentially. Okay?
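To make that concrete, here is that same formula as a tiny Python function; the name euclidean_distance is just my choice for this sketch:

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two 2-D points a = (x1, y1) and b = (x2, y2)."""
    x1, y1 = a
    x2, y2 = b
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

# For example, the distance between (160, 60) and (161, 61):
print(euclidean_distance((160, 60), (161, 61)))  # ~1.41
```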
All right, so let's go back; this is simply this piece here. Third, we pick the k closest data points. These are the points with the k smallest distances.
And then afterwards, we run what we call a majority vote among the selected data points, and the dominating classification is simply the winner. So the point is classified based on the dominant class. And then afterwards, we go back and repeat for the next point.
We select the value of k, we calculate the Euclidean distance, we pick the k closest data points (the points with the smallest distances), we run a majority vote, and then we repeat.
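If it helps to see those steps written down, here is a minimal from-scratch sketch in Python. The function name knn_classify and the data layout are my own choices for illustration, not code from the course:

```python
import math
from collections import Counter

def knn_classify(training_data, new_point, k):
    """Classify new_point by a majority vote among its k nearest neighbors.

    training_data is a list of (features, label) pairs, where features is
    a tuple such as (height, weight).
    """
    # Step 2: Euclidean distance from the new point to every training point.
    distances = []
    for features, label in training_data:
        dist = math.sqrt(sum((f - n) ** 2 for f, n in zip(features, new_point)))
        distances.append((dist, label))

    # Step 3: pick the k closest data points (the k smallest distances).
    distances.sort(key=lambda pair: pair[0])
    k_nearest_labels = [label for _, label in distances[:k]]

    # Step 4: run a majority vote; the dominant class among the neighbors wins.
    return Counter(k_nearest_labels).most_common(1)[0][0]
```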
Let me show you an actual example with numbers. Okay, so please note that this example has been adapted from this source here, ListenData.com. It's actually a really great resource.
I strongly recommend that you guys go ahead and check it out. And this is just a very simple example here. We have the height.
We also have the weight, and I also have here the t-shirt size, which is either small or large. Simply, all I need to do is go ahead and calculate the Euclidean distance between a given data point that I want to classify and all the data points I have in my training data. So if you go ahead and calculate the Euclidean distance, you will see that I get a bunch of numbers here, right?
And if I decide to select, let's say, k, that tunable parameter, equal to 5, then if you look at it visually here, you will see that here I have the large size, here I have the small size, and here I have this new data point that I want to classify. And I need to go and select the neighboring points.
So if I say, for example, k equals 5, that means I'm looking at the 5 data points closest to the new data point that I would like to classify. And if you guys scroll back... actually, before we do that, you can go ahead and calculate the Euclidean distance. That's what I'm trying to do. So if you go back here, that's what I've done.
Let's go back. So here I have these five data points. And then I calculated the Euclidean distance.
And then you go ahead afterwards and rank them. So here I'm ranking them: this point has the smallest distance, 1.4, which is this row here with 160 and 60, so it gets rank 1. That is the closest point to the new data point that I would like to classify.
And then the second one is this distance here, distance 2, so that data point gets rank 2. Afterwards, I have these two points with the exact same distance; that's why I put rank 3 for both of them. And the final one here has the largest distance, 3.16, so it gets rank 5.
And simply what I'm going to do now is a majority vote. I'm going to say: well, now I have small, small, small, small, and one large. So this data point belongs to the small class, or the small category, because, again, if you check the majority vote, this data point sits closer to the small class than to the large one.
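To tie the numbers together, here is how that worked example might look in code. I've made up a tiny training set whose first five rows reproduce the distances we just ranked (1.41, 2.0, 2.24, 2.24, 3.16); treat the exact heights and weights as illustrative stand-ins, not the actual table from ListenData:

```python
import math
from collections import Counter

# Illustrative training data: (height_cm, weight_kg) -> t-shirt size.
# The first five rows are chosen so their distances to the new customer
# come out to roughly 1.41, 2.0, 2.24, 2.24, and 3.16, as in the example.
training_data = [
    ((160, 60), "small"),
    ((163, 61), "small"),
    ((160, 59), "small"),
    ((163, 60), "small"),
    ((160, 64), "large"),
    ((167, 66), "large"),
    ((170, 68), "large"),
]
new_customer = (161, 61)  # hypothetical new customer: height 161 cm, weight 61 kg

# Rank every training point by its Euclidean distance to the new customer.
ranked = sorted(
    (math.dist(features, new_customer), label)
    for features, label in training_data
)
for dist, label in ranked[:5]:
    print(f"{dist:.2f}  {label}")
# 1.41 small / 2.00 small / 2.24 small / 2.24 small / 3.16 large

# Majority vote among the k = 5 nearest neighbors: 4 small vs. 1 large.
votes = Counter(label for _, label in ranked[:5])
print(votes.most_common(1)[0][0])  # -> small
```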
So I'm going to classify it as a small. That's all there is to it. Okay, so if you guys want to check that out visually, this is simply what I got.
Again, I have the training data: here I have the small size, and here I have the large size. Now I have my new data point, this green data point, and I don't know: should I classify it as large or should I classify it as small?
Let's go ahead and run the algorithm. What we do first is select k. We say k is 5.
So I'm going to pick the five closest data points to my new data point here. Then I'm going to calculate the Euclidean distance between the new point and all these data points, and then I'm going to rank them. Once they're ranked, based on a majority vote, you will notice that I have four data points that belong to the small class and one data point that belongs to the large class. Accordingly, this point is classified as the blue class, the small size.
It will be classified as belonging to the small class because, again, based on the majority vote, I have four versus one. That's all it is. Okay, and that's about it. That's simply all I have for this lesson. I hope you guys enjoyed it.
In the next lesson, let's go ahead and take a look at the K-Nearest Neighbors algorithm in SageMaker. We're just going to show you, from a very high level, what the CPU requirements are and what the input and output data requirements are, and that's about it. Right afterwards, we'll go ahead and jump into AWS SageMaker Studio, where we're going to code our algorithm from scratch. Please stay tuned. Best of luck, and I'll see you guys in the next lesson.