"Hi professor, very good evening." Good evening, good morning, everyone. Yes, about the assignment: I will be focusing only on linear and logistic regression in this assignment. Okay, is this recording now? Good.

Good morning, good afternoon, good evening, late evening, and welcome to today's session. Yesterday we started talking about decision trees, but before that, because we had already finished logistic regression as a classification technique, we looked at how to measure the performance of classification models. We talked about the confusion matrix: once the classification is done, we create a table, a matrix, where we quantify how many true positives we got, meaning cases my model predicted as positive that were actually positive; how many true negatives, where the model and reality match on the negative outcomes; and then the two kinds of wrong predictions, false positives and false negatives. False positives are those the model predicted as positive but the reality was negative; false negatives are those the model predicted as negative but the reality was positive.

Based on that table we came up with the metrics. Recall is the true positives, the correctly predicted positive cases, which are what we are primarily interested in, divided by the actual number of positive cases. Precision is the same true positives, but divided by the number of positive predictions the model made: not the actual positives, the predicted positives. Specificity is the same idea as recall (sensitivity) but for the negatives: true negatives divided by actual negatives. Accuracy is all the correct predictions over all predictions. The F1 measure is the harmonic mean, a kind of average, of precision and recall. We went through a couple of examples to see that, depending on the business context, sometimes precision is more important and sometimes recall is more important; but even if one matters more, you should not throw the other away completely. The F1 measure is useful when precision and recall are roughly equally important, or when you are comparing different models and want to know whether the trade-off between precision and recall is worth it.
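To make that recap concrete, here is a minimal sketch of those metrics computed from raw confusion-matrix counts. The counts are made up purely for illustration; they are not from anything we discussed in class.

```python
# Classification metrics from a confusion matrix -- illustrative counts only.
TP, FP, FN, TN = 40, 10, 5, 45   # hypothetical counts, not from the class example

recall      = TP / (TP + FN)                  # true positives / actual positives (sensitivity)
precision   = TP / (TP + FP)                  # true positives / predicted positives
specificity = TN / (TN + FP)                  # true negatives / actual negatives
accuracy    = (TP + TN) / (TP + FP + FN + TN)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"recall={recall:.2f} precision={precision:.2f} "
      f"specificity={specificity:.2f} accuracy={accuracy:.2f} F1={f1:.2f}")
```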
We also looked at AUC, the area under the ROC curve, as another measure for comparing models: the higher the area, the better the model. The ROC, the receiver operating characteristic curve, is a plot of the true positive rate versus the false positive rate. How do we get it? The software goes through a variety of thresholds for discriminating between the two classes. It says, let 0.1 be the threshold: anybody with a predicted probability above 0.1 is positive, the others are negative; now count the true positives, false positives, and so on. Then it changes the threshold to 0.11, 0.12, 0.2, and so on, very quickly, and comes up with that curve of true positive rate versus false positive rate. That also helps us identify the best threshold to use for making predictions on new test data, and we said the place where the curve starts flattening out, where the slope starts reducing, the elbow of the curve, is what we normally use.

Once that was over, we said there is something else to understand when testing performance, because the idea is not to perform well only on the data we have but on new data that comes in. So we should check the training and testing errors, and the testing performance should be very similar to the training performance if we have done a good job. If they are similar but both are very high, it means my training model itself is not good; that indicates a bias problem: I have not built a model that captures the real patterns in the data. If there is nonlinearity and I force a linear model, I am biasing it. If the model has been overfitted, I have built a very complex model that works beautifully on the training data, so the training error is low, but the moment the testing data comes in, nothing works; that is a variance problem. The testing data is from the same population, and if those slight changes cause the model to perform very differently, the model is too complex. That was the bias-variance trade-off.

We also saw k-fold cross-validation. Sometimes a training/testing split may not be feasible if the data set is not large, and in that case we go to k-fold cross-validation. It has other purposes too, like hyperparameter tuning, which we will talk about as such models come into play. K is a number; 10-fold is typical, and I showed the illustration with fivefold; five and ten are the two common choices. Divide the data randomly into K buckets, take K minus one of them, build your model, test it on the one bucket you held out, and capture the error. With fivefold I can build five such models by switching one fold at a time between training and testing. The benefit is that 100% of the data gets used for training and 100% for testing, but not at any given point: each model still only sees 80% of the data for training and 20% for testing, so fivefold is effectively an 80-20 split anyway.

What we want to see is that when we get the five errors in fivefold cross-validation, or ten errors in 10-fold, the mean of those errors is small. If it is small, the bias is low, because a low error means my predictions and reality are close to each other; the accuracy is good on average, so there is no bias problem. But if the standard deviation of the errors is high, then just by changing one fold here and there the model behaves very differently, and that is a variance problem. If the standard deviation of the errors is also low, the variance is small. That is what we did with the bias-variance trade-off.
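As a minimal sketch of that procedure, assuming scikit-learn and a synthetic data set (not the class example), the fold-wise errors, their mean, and their spread can be computed like this:

```python
# 5-fold cross-validation sketch (scikit-learn assumed; the data is synthetic).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Each of the 5 folds is held out once; the model trains on the other 4 folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
errors = 1 - scores

print("fold errors:", np.round(errors, 3))
print(f"mean error (a high mean suggests bias): {errors.mean():.3f}")
print(f"std of errors (a high spread suggests variance): {errors.std():.3f}")
```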
Then we moved into decision trees. We mainly talked about the intuition: the idea is to recursively partition the data set into various regions. We looked at it as a feature space, as a tree, and as rules, but the idea is to take some variable, split based on its values, and see how good the classification is. If the filtered data in a branch turns out to be very homogeneous, we call it a leaf, and we get a rule like: if income is more than 119,000 and age is more than 57 years, then they will not take the loan. That is pretty much the intuition behind a decision tree: recursively partitioning the data with axis-parallel splits, age less than or more than some value, income less than or more than some value. It does not go through an equation like "if age plus 10 times income is less than something, do this, otherwise do that"; that is what Dr. S did in his oblique decision trees, but decision trees are generally axis-parallel. Nice splits, easy to understand, very intuitive rules, high explainability, a simple model.

The challenges are: how do I pick the root node and the following nodes, and how do I avoid overfitting? Decision trees are highly prone to overfitting, because if you keep splitting the data you will eventually get down to the last data point and classify it, but you go so deep and build such a big tree that it is too complicated. So: how do I grow the tree, how do I select the nodes for splitting, and how do I prune, or not grow it as much, to avoid overfitting? Those are the two things we need to see today. Then we will go to k-nearest neighbors, which is even simpler. We will talk about the intuition the same way, some similarities with decision trees, and, it being a pretty simple method, certain issues we face and how we can address them with some tweaks to the KNN algorithm. These are not complicated concepts, but we will see if we get through all the slides.

So today the first thing is splitting: identifying which variable to pick for the first split, and then the following splits. This was the slide we ended with yesterday. The data has 20 data points: 10 of them are class 0 and 10 are class 1, and we have gender, car type, customer ID, and shirt size as four variables. What I am showing here is the split based on gender. There are 10 males and 10 females, and of the 10 males, six are in class 0 and four in class 1; that is what this shows: if gender is male, six people in class 0 and four in class 1. We are just counting and putting the numbers up.
Likewise for car type: family, sports, luxury, and how many family-car owners belong to class 0, how many to class 1, and so on. And then customer ID, a very interesting one: each customer ID has a single class associated with it. The first customer is class 0, the second customer class 0, the first 10 customers are class 0, and the remaining 10 are class 1. Which variable should I use for my first split? The root node is what we need to figure out, so let's start with that.

The principle is very logical: I should pick the variable that leads to as pure a leaf, as pure a region, as possible, which means all the data points coming to that node belong to the same class. Getting there perfectly may or may not happen, but that is the idea: as much as possible they should belong to the same class, and the quicker I can classify, with as few splits as possible, the better the tree. To do that we talk about various measures of purity. Pure means homogeneous: all the examples at that node are of the same class. How pure a node is, is what we want to quantify, and there are many different metrics, but the most common ones are information gain and the Gini index for classification trees, and variance for regression trees. Classification is the main purpose decision trees are used for, as with logistic regression, but a decision tree can also be used for predicting numeric variables, like linear regression does. It is a simplistic approach, but it is used and it can do that as well; in that case the variance of the numeric variable is what is checked for purity. We will see how from the next slide on.

Just an illustration of what I mean by impurity. This is a highly impure node, because both classes, the green circles and the pink pluses, are present in almost equal numbers; it is not homogeneous, so the impurity is very high. This one has less impurity: it is mostly green circles, and that could be the class I decide for this zone, but there is still some impurity in it. And this one has no impurity at all: it is completely homogeneous. Now I want to quantify, with a number, how pure each node is.

Classification error is a very simple approach: we just count how many points belong to the wrong class. For example, based on the majority here I would classify this region as green circle, but two of the points are then wrongly classified, so two divided by the total number of data points is my misclassification. That is all this formula is really saying: count how many are misclassified, as a proportion, so we divide by the total. I will show what it means on the next slide.

Then people came up with other measures. Shannon, in the 1940s, took the idea of entropy from physics, where it is a measure of uncertainty in a physical system, and used it as a measure of uncertainty, of unpredictability. It is the same idea: impure means multiple classes mixed together, which means uncertainty.
There is a formula for it; we don't need to worry too much about it, it is not complex, but I will show some calculations, gray out the calculation part, and show the final number. What we need to keep in mind is that whether it is classification error, entropy, or the Gini index, all of these are formulas that quantify the impurity, the error: how much error there is in my classification. That is what will be used to decide the split: whichever split gives me less error is the one to use. These are the node impurity measures for classification; for regression we use variance, which we covered early on, the mean of the squared deviations.

Let me show this with a very simple example. I have partitioned the data; this is my entire data space, with green circles and red circles, and I want to compute the classification error in each region. You can all see there is no error here: this region is homogeneous, and so is this one. Here I see two green circles and one red circle; I will eventually have to classify the region as one class or the other, and because the majority is green I will say this region is green circle, so I am misclassifying one out of three, a one-third misclassification.

Here is the math. In region T1 the error is one minus the maximum of the class proportions, the proportion of green and the proportion of red. I have three greens out of three and zero reds out of three, so the proportions are 1 and 0, the maximum is 1, and 1 minus 1 is zero; it is really just saying there is no error. In T2 it is the same: zero greens out of four, four reds out of four, the maximum of 0 and 1 is 1, and 1 minus 1 is zero. In T3, green is 2/3 and red is 1/3; the higher of the two is 2/3, so the error is 1 minus 2/3, which is 1/3, about 0.33. That is exactly what the formula captures: how many have I misclassified, one out of three. So from here on I will show the calculations but gray them out; the idea is simply that the error is quantified by classification error, entropy, or Gini index, and different algorithms use different measures.

What is the goal? I am trying to find a variable for splitting the data, and the one that gives me the minimum error after splitting is my best choice.
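Here is a minimal sketch of the three impurity measures as plain helper functions (my own, not any particular library's API), checked against the T1, T2, and T3 regions from the slide:

```python
import math

def classification_error(counts):
    """1 minus the largest class proportion."""
    total = sum(counts)
    return 1 - max(c / total for c in counts)

def entropy(counts):
    """-sum p*log2(p) over classes with p > 0."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    """1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# The three regions from the slide, as (green, red) counts.
for name, counts in [("T1", (3, 0)), ("T2", (0, 4)), ("T3", (2, 1))]:
    print(name,
          round(classification_error(counts), 3),
          round(entropy(counts), 3),
          round(gini(counts), 3))
# T1 and T2 are pure (all three measures are 0); T3 has classification error 1/3.
```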
So the process is: what is the error before I start splitting? Before splitting I have 10 records of class 0 and 10 records of class 1, but I still have to assign one class, like in Hunt's algorithm when we started: we counted seven people who did not default on the loan and three who defaulted, so I said the majority, the mode, is "not defaulted", and I was obviously making three mistakes. Here both classes are equal, so I will just have to pick one of them, and I am making 50% mistakes: if I say everything is class 0 before splitting, I am wrong half the time. That is my error before splitting.

Then, after I split based on gender, or customer ID, or car type, or shirt size, I will again look at the error, the weighted error across all the nodes after the split; it will be clear when I show it step by step. There will still be error: using gender I have classified, but not perfectly, six in class 0 and four in class 1, so it is still impure. With car type, the sports node has become pure, but there is still some impurity in family and luxury. And then customer ID, a strange case that we will talk about as well. The split that maximizes the difference between the error before and after is the best split. So the concept is simple: look at the error if I put everything in one bucket, that is the error before splitting; then split based on different variables, and whichever variable reduces that error the most and classifies the data best is the one I split on first, and the same process continues further down.

Yes? "Sir, in the third picture, the customer ID, how are we splitting? Is it splitting on every value, or on some bag of values? From the picture it is not clear." Well, actually the picture is clear, and you are catching on to the fact that it does not make sense, which is why it looks unclear. I have 20 customers, so if I split based on customer ID I get 20 branches: customer one is class 0 and not class 1, customer two, and so on, because the split is always based on all the values of the variable. For gender the values are male and female, so I split into male and female; for car type I have family, sports, and luxury; for customer ID I have 20 different values, so I get 20 branches. "Oh, thanks, so between C1 and C20 there are dots in between, right?" Yes, there is a dot-dot in between. "So the customer ID split is the purest one in this case, right?" Par says it is the purest because there is no error. Right, there is no error. I was going to ask this question anyway, but now that you have said it: should we then split on customer ID? "No, that would make it very complex." "It is very high variance." "It will become overfitted." Yes, all of those are correct answers. And not only that: when a new customer comes, how do I classify them? That customer will have a new ID. Exactly. So this split is meaningless, but decision trees don't understand that. You all understood it right away, for various reasons, but the computer will not, and the early algorithms used to make exactly that mistake; I will talk about that when we come to those slides. Good, because that was the theme I wanted to get to.

So let's just go through the motions so we understand what is going on; the intuition is clear, and that is what we should really know: before splitting, how much error; after splitting, how much error; whichever split reduces the error more is better. Before splitting I have 10 records of class 0 and 10 of class 1, so the proportion of class 0 is 10 out of 20, and the same for class 1.
For my error I have those three types of measures, and I am just showing all of them; different algorithms pick one of them, and we don't have to do any of this manually. C4.5 uses one thing, ID3 another, CART (classification and regression trees) another, and the chi-square-based approach something else; they each pick one. I am showing all three just so we understand what is going on, and I have grayed out the calculations, which I explained on the previous slide.

The classification error is 1 minus the maximum of the two proportions, and both are equal, so I can pick either: 1 minus 10/20, which is 1 minus 0.5, which is 0.5; that is the classification error before splitting. Likewise the entropy works out to 1, and the Gini index works out to 0.5. The numbers don't mean anything on their own; what matters is that this is the error before the split. Then I will split based on gender and calculate the error the same way, and see which split reduces it more: gender, car type, customer ID, or something else.

I have also plotted how these measures behave for various values. Right now we have a 50% proportion, which is here, and that is why we see 0.5 for the classification error, 0.5 for the Gini index, and 1 for the entropy. I have done the calculation for other proportions too, 0.4, 0.3, 0.2, and so on, to see how they look. Where is the peak error? Let me ask the question a different way. I have multiple coins. A fair coin has a 0.5 probability of heads and 0.5 of tails. Say I also have unfair coins: one that always falls heads (it doesn't really happen, but say it does), one that always falls tails, and another that falls heads 75% of the time. Which of these coins has the highest classification error, or entropy, or Gini index? Let's say entropy, since that is the first one we will talk about. The one that always falls heads? Always tails? The fair coin, says Rahul, and the others agree; everyone says the fair coin. Sometimes I get other answers and it sounds confusing, so let me ask: why is the error maximum for the fair coin? Because both classes are equally likely. If the coin always falls heads there is no error, I can always predict heads; always tails, no error; 75% heads, if I say heads every time I am right 75% of the time, so my classification error is only 25%. So the fair coin has the maximum entropy, which is exactly the situation we are in right now with 10 and 10.

And Rahul says the 100%-heads coin was invented in Sholay; that was a Hindi movie from the mid-seventies, and I do sometimes show the scene where the coin falls heads every time. There is an English movie too, a western, where they are riding along tossing a coin and it keeps falling heads. The Good, the Bad and the Ugly? Probably not. A Fistful of Dollars? No, not even that. Anyway, a funny bit with an unfair coin.
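Coming back to the coins for a second, a tiny sketch with nothing assumed beyond the Python standard library makes the same point numerically:

```python
import math

def coin_entropy(p_heads):
    """Entropy in bits of a coin with the given probability of heads."""
    ps = [p_heads, 1 - p_heads]
    return -sum(p * math.log2(p) for p in ps if p > 0)

for p in (1.0, 0.75, 0.5):
    print(f"P(heads)={p}: entropy = {coin_entropy(p):.3f} bits")
# 1.0 -> 0.000, 0.75 -> 0.811, 0.5 -> 1.000: the fair coin is the most uncertain.
```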
Okay, good. So that is the error before splitting; I am showing all three measures, though we only need to use one. Now let's look at the splits on gender, car type, and customer ID that I was showing earlier.

When we split on gender, of the 10 males six belong to class 0 and four to class 1; of the 10 females, four are class 0 and six class 1. The classification error for males is 1 minus the maximum of 0.6 and 0.4, which is 0.4; for females it is 1 minus the maximum of 0.4 and 0.6, again 0.4. Now that I have split, what is the total, weighted error? The weights happen to be the same here because there are 10 out of 20 on each side, but the idea is: before the split the error was 0.5; after the split I have these two node errors, and the total is their weighted sum, 10/20 times 0.4 for males plus 10/20 times 0.4 for females, which is 0.4. In this case the numbers match, so it is easy; the next example is not so symmetric. So the total classification error after splitting on gender is 0.4; before the split it was 0.5, so the gain is 0.5 minus 0.4, which is 0.1, if I am using classification error. The same thing with entropy: that is the p log p formula, minus (0.6 times log 0.6 plus 0.4 times log 0.4), which comes to 0.97 for each node, and it was 1 before, so I have gained 0.03 if I am using entropy. Likewise the Gini index for males and females; because the split is equal the weighted value is 0.48, the original Gini was 0.5, so 0.02 is my gain. Some numbers; I don't yet know whether they are good or bad.

Now let me try splitting on car type. There were four family cars, of which one is class 0, so 1 out of 4, or 0.25, is the proportion, or the probability; in the context of data science the two mean the same thing. Sports cars: eight out of eight belong to class 0, so the proportion is 100% for class 0 and zero for class 1. Luxury cars: there are eight in total, 1 out of 8 for class 0 and 7 out of 8 for class 1. From this I calculate the errors again: for family cars, 1 minus the maximum of 0.25 and 0.75 is 0.25; for sports cars there is no error, they are already perfectly classified; and for luxury cars it is 1 minus 7/8, which is 0.125. Then the weighting: 4 out of 20 are family cars with an error of 0.25, so 4/20 times 0.25; the sports term I don't even have to compute because the error is zero; and 8 out of 20 are luxury, so 8/20 times 0.125. The total classification error after the split is 0.1. It was 0.4 when I split by gender, and 0.5 before any split, so picking car type seems a much better thing to do than gender. Likewise the entropy: after splitting on car type it is 0.38, compared with 1 before the split and 0.97 for the gender split.
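A minimal sketch that reproduces these numbers from the counts on the slide (the helper functions are mine; the counts are gender male 6/4 and female 4/6, and car type family 1/3, sports 8/0, luxury 1/7):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def class_error(counts):
    total = sum(counts)
    return 1 - max(c / total for c in counts)

def weighted_impurity(groups, measure):
    """Weighted average impurity over the child nodes of a split."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * measure(g) for g in groups)

parent = (10, 10)                      # 10 class-0 and 10 class-1 records
gender = [(6, 4), (4, 6)]              # male, female
car    = [(1, 3), (8, 0), (1, 7)]      # family, sports, luxury

for name, split in [("gender", gender), ("car type", car)]:
    err_gain  = class_error(parent) - weighted_impurity(split, class_error)
    info_gain = entropy(parent) - weighted_impurity(split, entropy)
    print(f"{name}: error gain = {err_gain:.2f}, information gain = {info_gain:.2f}")
# gender:   error gain = 0.10, information gain = 0.03
# car type: error gain = 0.40, information gain = 0.62
```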
The Gini index behaves similarly, and then customer ID: you have already told me the error after splitting is zero, because it is a perfect classification; every customer ID has exactly one class.

So let's summarize the table. All the formulas are there again, not of much interest to us; these are the three measures we have been looking at for the past few slides, and I have put in the numbers we saw: before splitting, for each type of error, and after splitting. When we split on gender, three slides back, I got 0.4 as the classification error, so the gain is 0.5 minus 0.4, which is 0.1; likewise for the car type split and for customer ID. The question, for which you have already given me the answer, is: obviously I should pick the split that gives me the maximum gain, the numbers in parentheses, and the maximum gain happens with customer ID, 0.5 before splitting, zero after, the entire error gained. But that should not be how I do it.

We call that quantity information gain: the entropy of the system before the split, which is a measure of the impurity of the whole data with all the classes mixed together, minus the entropy of the system after I have filtered the data; that is the information I have gained in the process of splitting. This is what was used in ID3, when Quinlan came up with it in 1986, and he faced exactly this problem: ID3 kept picking customer ID as the variable to split on, which makes no sense. If you manually remove that variable, information gain works fine, but the computer doesn't know to do that, so plain information gain was not enough, and Quinlan, in C4.5 and C5.0, updated it with something else. Let's take a look at that.

The problem Quinlan faced is that any variable with many values, not just customer ID, behaves this way: customer ID had 20 different values while gender had only two, and when a variable has many, many levels, the variation within each level is expected to be small, essentially already homogeneous, so it automatically gives a much higher information gain. So he wanted to penalize information gain by normalizing it: divide it by the information already present in the variable itself. Don't just go by plain-vanilla information gain, entropy before minus entropy after; normalize it, because if the information already present in the variable is very high, the gain is not really adding much value. That is the idea; let's see what he means by it.

The formula is about the same as the one we used for entropy, but in entropy we were interested in which class the records belonged to: how many males here, how many females there. When I look at the information within a variable, I only ask: what proportion are males and what proportion are females, or what proportions are family, sports, and luxury. That is all; it doesn't care about the class at all.
What is this? I guess someone is able to draw on my screen; where is this green line coming from? "Prof, we can continue." Yes, but the line will stay on every slide. I think somebody accidentally turned on the whiteboard and was scrawling on it. "If you click Annotate you can clear the line." Okay, the Annotate option, let me just do that, clear all. Thank you very much. Accidents happen; sometimes we need a small break.

So, the information content of customer ID. Customer ID has 20 states. What does the formula say? For each state: there is only one customer out of 20 with an ID of 1, so 1/20 times the log of 1/20; likewise for the second customer, and the third, and all of them. Instead of adding the same term 20 times I just multiply by 20, and I get 4.32 as the information content already present in customer ID. It is the same formula used for entropy, but in entropy we were interested in the class; here it is just how much information is within the variable itself. For car type, there are 4 family cars out of 20, so 4/20 times the log of 4/20, plus 8/20 for sports cars, plus 8/20 for luxury cars, which comes to 1.52 as the information contained in that variable.

If you play Wordle, you might relate to this notion of "information", especially if you look at the Wordle bot. Once you finish a puzzle, it shows how you performed, your skill and your luck. I think I got lucky two days back, on Friday: I got it on the second chance. I picked some random word for the first guess, "prune" (maybe decision trees were weighing on my head, pruning the tree), and based on that I picked "paint" and got it. When the bot analyzed my game it said that after "prune" I was very lucky and only eight possible words were left: phony, paint, plant, point, piano, plank, plink, and one more starting with "pl". It shows the info gained, the share of available information, which is exactly the same information we are talking about, the same calculation as the previous slide: the proportions, 24%, 13%, 13%, and so on, and it says 2.8 bits of information were available after that word. The bot would have picked "plant"; I picked "paint". That calculation, the 2.8 bits, is exactly what we did on the last slide.
Let me show it with today's word, which I got on the fourth attempt. Only three possible words were left: bulge, lunge, and bugle, and I got bugle correctly on the fourth chance; lunge was my third attempt. The bot says 1.6 bits of information, and it also says that, on average, more and smaller groups mean faster solving: this group had two words, the other groups had only one each, whereas sometimes on the first or second attempt one group still has many, many words in it. If you pick a word that leads to smaller groups, and more of them, the solving is cleaner. That is the same situation as our customer ID: many small groups. There it ended up being a problem; here in Wordle information gain works, because we are not looking at IDs. The bits of information come from the same formula: the proportion of each remaining group times the log of that proportion, summed up, which comes to about 1.6 bits. The bot is literally computing where it can gain the maximum information among the available words and picking accordingly; artificial intelligence generally performing better than the real intelligence. Interesting times.

So anyway, because information gain sometimes messes things up and variables like ID win, Quinlan normalized it: he took the information gain used in the ID3 decision tree algorithm and divided it by what we just calculated, the 4.32 for customer ID and so on, and called it the gain ratio. The gain ratio for customer ID is 1 divided by 4.32: the information gain was 1, 100% gained, zero error, but divided by 4.32 it is about 0.23. For car type it is 0.62 divided by 1.52, which is about 0.41. So the gain ratio is higher for car type, and Quinlan used gain ratio in the updated version of his decision tree algorithm, C4.5, in 1993. By doing this he was penalizing a large number of small partitions, which is exactly what Wordle wants but not what we want in a tree. C4.5 and C5.0 also allow multi-way splits; yesterday I showed this for ordinal and nominal variables. Car type has three levels, family, sports, and luxury, so a multi-way split keeps all three branches, whereas a binary split would force me to combine two of them. So C4.5 and C5.0 allow multi-way splits and use the gain ratio, not raw information gain, as the criterion for splitting the nodes. That is the summary of this part of decision trees.
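A minimal sketch of that normalization, using the value counts from the slide and taking the information gains (1.00 for customer ID, 0.62 for car type) as given from the earlier calculation:

```python
import math

def entropy_from_props(props):
    return -sum(p * math.log2(p) for p in props if p > 0)

def split_info(value_counts, n):
    """Information contained in the splitting variable itself (its value proportions)."""
    return entropy_from_props([c / n for c in value_counts])

n = 20
cases = {
    "customer ID": {"gain": 1.00, "values": [1] * 20},   # 20 singleton groups
    "car type":    {"gain": 0.62, "values": [4, 8, 8]},  # family, sports, luxury
}
for name, c in cases.items():
    si = split_info(c["values"], n)
    print(f"{name}: split info = {si:.2f}, gain ratio = {c['gain'] / si:.2f}")
# customer ID: split info = 4.32, gain ratio = 0.23
# car type:    split info = 1.52, gain ratio = 0.41
```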
CART, which Breiman came up with in 1984, uses binary splits. The ideas differ slightly in the details, but at the highest level the question is the same: which node should I pick? The one that minimizes the error. Quinlan used the gain ratio; Breiman, in CART, uses the Gini index, not information gain or entropy or gain ratio, and a two-way split rather than the multi-way split that C4.5 and C5.0 do. So for car type he considers sports plus luxury versus family as one option, sports versus family plus luxury as another, family plus sports versus luxury, all possible binary groupings, and whichever gives the lower Gini index is the one he uses. He does the same for gender (which is binary anyway) and for every other variable, and among all of these the split with the lowest Gini index is the one CART picks.

"Does this mean we are creating new features? Is this an example of feature engineering?" Not really, in this particular case. We are combining levels, and you may call that creating a new feature, but essentially we are just finding a split; further down the tree it can again consider all possible groupings at the next level, so it is not really engineering a new feature here.

That was for classification, and this is about growing the tree; how to stop is the other problem we will come to. But I said decision trees can also be used for regression, and if the target is a numeric variable, the idea for splitting is variance. What is the variance in the whole data? Then, if I split, what is the variance within each group? For example, as I showed yesterday: income less than 80,000 versus income more than 80,000. I split, calculate the variance within this group and within that group, count how many records are in each, and take the weighted average. If the values in a group are very close to each other, the variance is low, and that is the homogeneity part. And what would the prediction be? If one leaf has values like 80,000, 82,000, 87,000, 83,000, then for anyone landing in that branch I predict the mean or the median of that leaf. That is how a decision tree does regression: the mode for classification, the mean or median for regression; that is the standard way of doing things.
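A minimal sketch of that variance-reduction idea; the ages, incomes, and candidate thresholds below are all made up for illustration:

```python
import numpy as np

def weighted_variance(y_left, y_right):
    """Weighted average of the target variance in the two child nodes."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * np.var(y_left) + (len(y_right) / n) * np.var(y_right)

# Made-up example: predict income (numeric target, in thousands) from age.
age    = np.array([25, 30, 35, 40, 55, 60, 65, 70])
income = np.array([80, 82, 87, 83, 150, 155, 160, 148])

# Try a few candidate thresholds on age and keep the one with the lowest weighted variance.
best = min(
    ((t, weighted_variance(income[age <= t], income[age > t])) for t in (30, 40, 50, 60)),
    key=lambda pair: pair[1],
)
print("best threshold on age:", best[0], "weighted variance:", round(best[1], 2))

left_leaf = income[age <= best[0]]
print("prediction for the left leaf = mean of that leaf:", left_leaf.mean())
```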
A couple of quick questions. Are decision trees greedy? A lot of machine learning algorithms are greedy, and the meaning is: yes, at each step the tree picks the variable that gives the maximum gain right now. It doesn't consider that if it had picked, say, gender first, it might eventually have arrived at a much better overall classification; the decision tree won't do that, and in fact most algorithms won't. It asks "what is best for me at this point", and that is a greedy approach. Equivalently, it is likely to find local minima: it may find the best option locally rather than globally, because it is only finding the best split at each particular point, and that may be the model it generates; there may have been a better one, but we won't know. There are techniques that overcome this: when we get to random forests and other methods that build multiple trees on different subsets, that kind of issue is addressed, but by definition the decision tree in its vanilla form is a greedy method.

That also makes decision trees quite unstable: towards the end, when very few data points are left in a node, even a small change in the data can flip the class near the leaf. That is also why they overfit, because they will happily keep splitting all the way down to individual data points. So the next part of this session is how to address the overfitting.

What we have done so far: decision trees split the data using a variable and its values; we needed to understand how to pick the variable, and the measures used are impurity measures, because the goal is to reach homogeneous nodes with as few splits as possible. C4.5 and C5.0 use information gain normalized into the gain ratio, CART uses the Gini index, and plain classification error exists too; some algorithm or other uses each of these in some way. The goal is always: measure the impurity before splitting, measure it after splitting, and the split with the best gain, the minimum remaining error, is the one you take, and then you keep splitting recursively. Any questions on that?

Sud has a question: "For decision trees, is there a chance to change the greedy parameter to gain ratio rather than information gain?" Well, gain ratio and information gain are not the "greedy parameter". Whatever measure we use, the greediness is that at each node the algorithm takes whatever minimizes the error at that moment; it doesn't matter whether I use classification error, information gain, gain ratio, Gini index, or variance. And in fact C4.5 does use gain ratio; ID3 used information gain, and C4.5 is the updated version of ID3.

"Professor, I have a basic question. The decision tree can be based on numeric columns also, right? These examples are categorical columns; will the same algorithm apply to numeric columns?" Yes. Yesterday, towards the end, I talked about how ordinal, nominal, and continuous (numeric) variables are handled, and I mentioned two approaches. One: the algorithm can go to every value in the data set, 80,000, 81,000, 87,000, 89,500, split at less than versus more than each value, and see how it does; but that is compute-intensive, because with hundreds of thousands or millions of data points it has to check everything (doable with the power we have, but expensive). The other: discretize the variable into the kind of thing decision trees handle naturally, which often also makes business sense. I can say incomes of less than 10,000, 10,000 to 20,000, 20,000 to 30,000, and count how many people fall in each bucket; that converts it into a categorical, discrete variable. I can do the same with age: if ages range from 20 to 90 years, I can bucket them into young, middle-aged, senior, very senior, say 20 to 40, 40 to 60, 60 to 80, more than 80, and then it behaves like any other categorical variable.
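As a sketch of that kind of bucketing, assuming pandas is available; the bin edges and labels are just one reasonable choice, not a rule:

```python
import pandas as pd

# Illustrative ages; the bucket edges and labels are one possible discretization.
age = pd.Series([23, 37, 45, 58, 63, 71, 84])

age_bucket = pd.cut(
    age,
    bins=[20, 40, 60, 80, 100],
    labels=["young", "middle-aged", "senior", "very senior"],
    right=False,                 # intervals are [20, 40), [40, 60), ...
)
print(age_bucket.value_counts())
# The tree can then treat age_bucket like any other categorical variable.
```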
There are other ways people do this too. One is to sort the values in increasing order, see at what value the class changes from one class to the other, and take that as a split point; then see where it switches again, and so on. So there are a variety of ways to bucket a numeric variable into groups, and after that it is the same procedure we have been doing. "But that bucketing, Professor, how does the software do it? We cannot pitch in and do the bucketing ourselves, can we?" Actually, your team will have to do that. When you are building models you do some exploratory data analysis to understand the distribution of each variable, whether there are outliers, and so on, and based on that you decide how to discretize. The software will not do it for you automatically. "Thank you."

Okay, let's prune the trees, and I guess that will bring us to a point where we can take a break; after that, k-nearest neighbors is something much simpler. Here too there isn't much to do manually: the algorithm already has this built in, whether it uses entropy or gain ratio or Gini index; we just wanted to understand the criteria used to split and the thought process, which is minimizing the error, meaning the data points in a bucket are mostly homogeneous, classified correctly with very little error left. That is the split to go for; the idea is pretty simple. Now, as I said, decision trees are prone to overfitting, so how do we avoid it? Each algorithm has some technique built in, but there are also parameters you can specify manually, so I will mention how we go about it.

Pruning is the approach, and we can distinguish pre-pruning from post-pruning. Pre-pruning, as the words suggest, is early stopping: I do not let the tree grow to its maximum depth; I terminate the tree-building process by giving it some criteria. The other option is to let it grow to the maximum and then start pruning, checking whether removing a branch changes the error much; if the error hardly changes, prune it, why keep it if it doesn't affect the error? Let's see how we go about it.

Some pre-pruning criteria, then; I will put several bullet points on one slide, and some of these can be given as parameters when you build the model, whether in a drag-and-drop tool or in code. The first: stop when all the records at a node belong to one class. That is when it stops anyway, because the node cannot be split any further.
But you can also relax that to a majority fraction, say 90% or 95% belonging to the same class. If 50 data points have come into a node and, say, 40 or 45 of them, 80% or 90%, are one class, let me live with that; 10% or 20% error is something I can accept, because insisting on perfection is exactly what leads to overfitting, and I am okay with a little bit of noise going in there. So you can specify that fraction, 90, 95, 99, things like that.

Or you can specify a criterion on the number of records per node: if a split would send fewer than, say, 20 points to a node, don't split, stop at the previous level. I want a minimum number of data points arriving at a node; if that is not met, don't split further.

Or stop if the information gain is not large enough. How do I decide "not large enough"? Compare it with the information gain at the root node: if a further split gains less than, say, 5% of what the root split gained, I am not gaining much, so just stop there.

Or, and this is generally available in the coding interfaces, specify a maximum tree depth: I don't want to go beyond, say, five or ten levels. "Income less than 10,000, and age more than 30, and default is no, and family size less than four, and credit card balance more than 100 dollars": each of those conditions is a level, a node, and by about seven or eight levels the explainability is beginning to be lost. People do build trees with 40 or 50 levels; I have written four and five on the slide, which would not be very practical in many cases, but something like ten levels is often a pretty good decision tree. The deeper it goes, the greater the chance of overfitting; the right number depends on the data and the number of variables, so I cannot give a universal value for maximum depth, but once you see the data you can set something reasonable.

These are the pre-pruning criteria; they stop the tree before the whole thing is built, and they are parameters you can specify in your software. Some of them I have seen directly in the tools; others may only be available when you code it.
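For reference, here is roughly what those criteria look like as parameters, assuming scikit-learn's DecisionTreeClassifier; the parameter names are scikit-learn's, while the values and the synthetic data are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Pre-pruning (early stopping) expressed through constructor parameters.
tree = DecisionTreeClassifier(
    criterion="entropy",          # or "gini"
    max_depth=10,                 # don't grow beyond 10 levels
    min_samples_split=20,         # don't split a node with fewer than 20 records
    min_samples_leaf=10,          # every leaf must keep at least 10 records
    min_impurity_decrease=0.01,   # skip splits that gain less than this
    random_state=0,
)
tree.fit(X, y)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```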
The other way to look at it is post-pruning: build the tree on the training data, test it on the testing data, and plot the training and testing errors, or one minus the error, the accuracy, against the size of the tree on the x-axis, the number of nodes, the depth. Here we have plotted accuracy; if I plotted error it would just be the inverse. What we see is that the training accuracy keeps going up as we build a deeper and deeper tree, which is to be expected, but the testing accuracy starts coming down, which is exactly what we saw with the bias-variance trade-off: training error very low, testing error high, overfitting. Given this plot, how many nodes would you choose? "The intersection." The intersection is around eight. Zahid says 25, somewhere between 20 and 25; Samarjit wants to go further, at 40; Deep says 30, I think because you are all looking at this bump; Goan says 30, actually 20. Okay, so I have answers here, here, and here; let me just erase these marks, okay.

Well, it is your choice, but I would pick somewhere around here, mainly because if you compare 20 nodes with 10 nodes there is no real difference: both the training and testing accuracies are very similar, so why do I want a deeper tree? I would not pick 30 or 40, because there the gap between training and testing is getting large. I might not pick the intersection either; some difference between training and testing is expected and I can still gain a little by going slightly deeper, but beyond about 10 the testing accuracy has practically flattened out. So somewhere around 10 or 12 is where I would stop. It is also less costly to train if I stop at 10 rather than 20, 30, or 40, and beyond that I am heading into overfitting. So that is one way of identifying where to prune: plot the error or accuracy at different tree sizes. If it is a regression tree you can use variance; if it is a classification tree you can use recall, precision, and those kinds of metrics.

The algorithms also have their own pruning built in. I think C4.5 and C5.0 use something called pessimistic pruning, which again goes back to counting the misclassified proportions and putting a statistical formula around them. CART, Breiman's algorithm, uses cost-complexity pruning. The idea is straightforward even if the expression looks strange: the generalization error is taken as the training error, which we want to minimize, plus a penalty on the size of the tree, the number of leaf nodes, multiplied by a penalty factor alpha. If alpha is zero I get the minimum error, because the moment the penalty term adds anything the total goes up, maybe a little, maybe a lot; but what does alpha equal to zero correspond to? The biggest tree you can build, which is overfitted, because the minimum training error comes precisely with overfitting.
So I have to balance the two: I cannot live with the minimum training error, it will increase as I prune, but I also have to make sure I don't keep the tree too deep; I have to balance the penalty and the error to find the right point. You can see it here: when alpha is zero my training accuracy is very high and the testing accuracy is gone; as I increase alpha, the training accuracy comes down and the testing accuracy goes up, and somewhere around this alpha level both behave similarly, with pretty decent accuracies, say 94% testing and 96 or 97% training, something like that; it depends on the data set. The idea is: try multiple alpha values, do k-fold cross-validation, build the trees, look at the training and testing errors, and pick the alpha that gives you the best balance; alpha is multiplied by the number of leaf nodes, so increasing it prunes the tree. The software does all of this; I am just giving the general thought process. It will output the best alpha, or you can plot the values and pick the pruned tree; in CART this is built in.

So overfitting is minimized by pruning the tree. Either pre-prune, by specifying things like the fraction of points at a node that must belong to one class (say 90%), or a minimum number of data points arriving at a node, or a maximum depth; that saves time and compute and builds the tree much faster, but in the process you might lose a slightly better tree that could have been built. Or post-prune, which takes more time: first build the deepest, overfitted tree, then prune and check the error; if the error after pruning is similar, prune more, and keep going until the error starts becoming too large, and stop there. That is the idea of pre-pruning and post-pruning to avoid overfitting.

So what decision trees are really doing is simple: break the data into multiple zones so that the leaves are very homogeneous. The two issues we needed to address were, first, how to select the node to split, which is based on the Gini index, or entropy and information gain, some formula that measures the error before and after splitting, with the split that gains the most being the best; and second, where to stop, how to prevent overfitting, either by pre-pruning with parameters in the code or the software, or by growing fully and pruning back while checking the training and testing errors until they are similar, the same thought process we covered in the bias-variance trade-off. That's decision trees for us.
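A minimal sketch of that alpha search, assuming scikit-learn's cost-complexity pruning support (ccp_alpha); the data is synthetic and this is just one reasonable way to pick alpha, not the only one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Candidate alphas for the penalty term (training error + alpha * number of leaves).
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)

# Pick the alpha with the best cross-validated accuracy on the training data.
best_alpha, best_score = max(
    ((a, cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=1),
                         X_train, y_train, cv=5).mean())
     for a in path.ccp_alphas),
    key=lambda pair: pair[1],
)
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=1).fit(X_train, y_train)
print(f"alpha={best_alpha:.4f}  train={pruned.score(X_train, y_train):.3f}  "
      f"test={pruned.score(X_test, y_test):.3f}")
```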
One more slide on this topic and then we will take a break. Decision trees are pretty inexpensive, they run fast, explainability is great, and they are fairly robust to noise and to missing values. Multicollinearity is not a concern, because the tree selects one variable at a time, so we don't need to bother about that. Irrelevant variables also get handled: when you prune, if a variable is irrelevant the error will not change much, so those branches get removed. Some disadvantages: it is a greedy algorithm. It picks one variable at a time, so it may not pick a variable that is not the best at this step but could lead to a better tree overall; many ML algorithms can do that kind of look-ahead, but decision trees don't. Interaction attributes, which linear and logistic regression can handle well through feature engineering — creating a new variable from two interacting variables — decision trees don't exploit on their own, since they look at one variable at a time. If I create the interaction variable explicitly the tree will consider it, but it may not pay as much attention to it. So things like that are the disadvantages. Overall, though, it is a very powerful, simple, fast technique, and certainly one of the baseline models you should be building, because in machine learning we never know how good a model is unless we compare it with other models, and linear regression, logistic regression, and decision trees are powerful, highly explainable baselines. Unless your other models are doing much better, these are the workhorses the majority of people still use. Any questions on decision trees? Otherwise we will take a break — it's 7:57, we can come back at 8:05, so 8 minutes. If there are questions, I'll just grab a glass of water and come back. Is that good? ("Sure, sure.")

Okay, welcome back. In the next hour we will talk about K nearest neighbors. It is as simple as decision trees — in fact simpler — and, being a very simple algorithm, it has certain issues, but modifications to the algorithm have resolved some of those, making it a pretty useful algorithm. So let's start with K nearest neighbors. That's the whole idea: if it looks like something, it is that thing. And what is that something? Something close to you — K nearest neighbors. You are like your neighbor; that's the general idea. There are certain names given to such algorithms — instance-based learning or case-based learning — and we will define why in a moment. This is what we saw with decision trees yesterday: at each point we approximate the Y value using points near that point, and the approximation is the mode if it is a categorical variable, or the mean or median if it is numeric. But the definition of "near x" — what do I mean by points near it — was given by the regions: we were partitioning the entire feature space, the entire data set, into homogeneous regions, and that region is what defines nearness. If the data point falls here, this region is what counts as near it.
These are the points near it, and if the data point is over there instead, then those are the points near it. That region is the definition of nearness, and that is the only thing that changes between decision trees and K nearest neighbors. Everything else is exactly the same, except that "near" is no longer defined by a region: it is defined by the K nearest neighbors, where K is a number — one nearest neighbor, two nearest neighbors, three nearest neighbors. If it is three nearest neighbors, it picks the nearest three data points — not all the data points in some region, just the ones closest to the query. If it is five, it picks those five, and then does the same thing as before: take the mode, or the mean, or the median. So KNN can be used for classification, which is its most common usage, or for regression. If the nearest three people have incomes of, say, 80,000, 90,000 and 100,000, the average is 90,000, so if those are the three nearest people, the income prediction for the new data point is 90,000. As simple as that — that is the entire K nearest neighbors concept.

Let me just talk about the terminology. These are called lazy learners. An important thing to note is that K nearest neighbors does not build a model. When you give it the training data, it just stores it and does nothing — no model building. Then a new data point comes, you ask which class it belongs to, and only then does it wake up: it measures the distance from the new point to all the data points it has stored, finds the closest ones, and gives you the answer. That is why it is called a lazy learner, as opposed to an eager learner. Decision trees, linear regression and logistic regression were eager to build a model, eager to find a pattern — they are eager learners; KNN is a lazy learner, with no model built. It is also called instance-based or case-based learning, where an instance is nothing but a data point: it takes the raw data points and makes predictions from them. The same data points can be called cases, from the domain the data comes from, so we also call it case-based learning — different names for the same approach. It is also a non-parametric approach: it does not learn any parameters, it does not learn a mathematical function the way linear and logistic regression do. In that sense decision trees are also non-parametric — we did not learn an equation with coefficients — whereas linear and logistic regression were parametric, because the parameters beta zero, beta one, beta two and so on were computed. At least a decision tree is a non-parametric approach that builds a model; KNN doesn't even do that — it is very lazy. So, repeating the process, because there is nothing more to this topic: when the data is given, KNN stores all of it; at prediction time it sees which stored points are closest to the new data point, and it gives you the class for classification, or the mean or median of those close points for regression.
So let me show you this picture and ask you a question, and that will carry the intuition and the concept of KNN. I have some blue squares and some red triangles, and there is a new point, x_q. Suppose I am doing 3-NN, so K is three. Where is it used? Any example? The same places you use decision trees or logistic regression: it is a classification technique, so any problem that requires you to classify things — KNN is one more algorithm for that, and, as with decision trees, it can also do regression. Same problems: spam or not spam, heart attack or no heart attack, buyer or non-buyer, or multiple classes like high-value, medium-value, low-value customers. ("Thank you, sir. Is it commonly used — is it popular?") Yes, it is a very popular and very simple technique, and as I said it is also used for regression, although classification is where it is used a lot more: if the closest points have numeric values, it takes their mean or median and gives you the answer. The benefit of decision trees and K nearest neighbors is that they are not linear models — linear regression is — so when there is a lot of nonlinearity, KNN and decision trees can do the job pretty well. It is very much used in imaging solutions, for example.

So my question now is this. If I use three nearest neighbors — and obviously one of the issues is how to pick K, which we will come to — intuitively, with K equal to three, if this is the query point and these are its three closest points among all of these, what would I classify it as: a blue square or a red triangle? ("Red triangle.") Yes, red triangle, because that is the mode — the highest-frequency class — and two of those three neighbors are red triangles. Whether the classification matches reality I check afterwards, and that gives me precision, recall and all those metrics again. Now let's say K equals five, so the closest points are one, two, three, four, five; three of them are blue squares and two are red triangles, so I go with the majority and say it is a blue square. Just intuition. The complexities and the improved versions of KNN we will come to as we see what issues arise and how to fix them — for example you might say, "fine, those are the closest five, but the point is so close to these two reds that maybe it should be red"; those are the challenges we will talk about. So at its simplest, KNN uses the mode for categorical targets and the mean or median for numeric ones, exactly like decision trees. The only difference is that in a decision tree I define a region by splitting the feature space with some algorithm to create a model, and once those region boundaries are defined, all the points inside the region are the "near" points.
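Before going back to the picture, here is a minimal sketch of that majority vote in code, assuming scikit-learn; the handful of points and the class labels are made up to mirror the blue-square / red-triangle example.

```python
# Minimal 3-NN majority-vote sketch; assumes scikit-learn, toy points are made up.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two features per point; labels: 0 = "blue square", 1 = "red triangle".
X = np.array([[1.0, 1.2], [1.5, 0.9], [4.8, 5.1], [5.2, 4.7], [4.9, 4.9], [1.1, 1.4]])
y = np.array([0, 0, 1, 1, 1, 0])

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3 nearest neighbours
knn.fit(X, y)                               # "fit" only stores the data: lazy learner

x_query = np.array([[4.5, 4.6]])
print(knn.predict(x_query))                 # majority class among the 3 closest points
print(knn.kneighbors(x_query))              # distances and indices of those neighbours
```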
Let me just go back to the picture. Suppose the target is categorical — loan taker versus non-taker — and this is the query point. In a decision tree I go with the mode: I look at all the points in this region, because the region is what defines nearness there, not the K nearest points. You might say some point outside the region is actually closer, but the tree has separated the data into boxes, into regions, so I treat this box as the nearest region, count the classes, and say the mode is green — loan taker. On the other hand, if the target is numeric, say the loan balance remaining — the red ones maybe 20,000 and 10,000, the green ones 7,000, 5,000, 20,000, 19,000, 4,000, and so on — then I take the values in the region and use the mean or the median, my choice; there is no rule telling me which, and that becomes the value assigned to the query point. In KNN I don't break the data into regions, so I have to define what "near" means differently. There, near was the region from the partitioning we did with entropy and the Gini index; here, I just choose a number K, and I don't even build a model. Once the query point is given, I say, for example, "look for three nearest neighbors": it finds the red dot, a green one, and the next closest green one — two greens, one red — so I classify it as green. That is the mode if the target is categorical; if it is numeric, I take the values of those closest points and use their mean or median. The thought process is exactly the same: there I built a model that created regions; here I don't build a model at all, I just pick the closest points, and the rest is the same.

("Professor, just one quick query: all these things are represented as vectors, and at the end we are trying to find the distance between two vectors — is it internally doing that?") Let's get to that point — let's find out how it computes that distance. But first, here is a question. Say I have four nearest neighbors, K is four. What would I classify this point as — default or not default? ("Depends on the minimum distance — whichever point has the minimum distance.") Well, actually it doesn't, because minimum distance is not the criterion here: I am looking at four nearest neighbors, two of them defaulted and two did not, so what do I do? I am not completely denying what you are saying, Sudar — the nearness idea you brought up is exactly what improves KNN; we will look at weighted KNN, giving more weight to closer points and less to farther ones. ("Can we look at the next nearest data point — the fifth one?") Yes, that is one option. The point is that ties are an issue, and when a tie happens you can pick one class by a coin toss, or look at some other variables, but in general, yes, there is a simple rule of thumb for choosing K so that ties are unlikely.
For a binary problem — an even number of classes — take an odd K, and that gets rid of the tie. Alternatively, if there is an odd number of classes, three say, the rule of thumb would be an even K, but that guarantees nothing: with three classes and six nearest neighbors I could get two, two, two, which doesn't help me. Then, as Goan said, if that causes too much trouble, just change K to seven and that takes care of it; or there are other fixes, like what Sudna was suggesting — give more weight to the closer points. There are solutions, but in general, at least for binary problems, go with an odd K: three, five, seven, that kind of thing. A lot of real-world problems are binary classification, so an odd K works out, though like I said there can still be challenges, and we will come to those. Why not use an odd K all the time? Well, if I have three classes — high-value, medium-value, low-value customers — an odd K like five can still give me two, two and one: two of the closest are high value and two are medium value. For binary, yes, always take an odd K, because binary means an even number of classes. So the point I wanted to make is simply that ties can be an issue.

Now that the intuition is done, what remains is a few small and big issues and how to fix them. There is really nothing more to this algorithm — all it does is look at the points close to the query — so the challenges and their fixes are what we will discuss, and it is pretty simple from here on. I will show a challenge and then show how to fix it. So I have these variables: age, loan amount, and whether the person defaulted or not, which is the classification I have to make. When we say nearest neighbors, the concept of distance is natural: what is the distance between these points? There are many distance measures — we will spend a lot of time on them when we come to clustering, because clustering also relies heavily on distance, and all those distance metrics can be applied back to KNN when you build models. For now we will keep it simple and talk only about the Euclidean distance, the most common way distances are calculated: for two variables, d = sqrt((x1 − x2)² + (y1 − y2)²). So here is the new data point: a person 48 years old with a loan of 142,000 — will that person default or not is what we need to answer from this data. Here one axis is age and the other is the loan amount; this y is just the second variable, not the dependent variable. So I compute the distance between this person and the first person: on age, 48 minus 25 — this person is 23 years older — so 23 squared, plus the difference on the loan amount, 142,000 for this person versus 40,000 for the first person.
That difference is 142,000 minus 40,000, roughly 102,000; I square it, add the age term, and take the square root — that is my distance number, and I compute it like that for every person. This is what KNN does: it takes the query point, compares it with every data point, finds the distances, and sees which one is closest. Here I am using K equal to one, so it is the single closest point, and I assign its class. In this case the minimum distance came out around 8,000. (Richard has already jumped to the next step — normalizing the values first — which is exactly the case I am building, so the suspense is gone, but no problem at all.) So this is the minimum distance; let's look at the calculation, because this is where Richard's point comes in. Can you tell me the problem with this kind of calculation? I have 142,000 minus 150,000, which is about 8,000, squared, plus 48 minus 33, which is 15, squared. 8,000 squared is around 64 million, while 15 squared is only 225; whether I add that 225 or not, the loan term is so much bigger that it completely eclipses the age term, and the square root still comes out close to 8,000. Z says the weight given to higher values is the problem — yes. Because the scales are so different, the loan amount takes all the weight and age doesn't even come into the picture.

So there is an important step for KNN, one we did not need for decision trees, and it is what Richard was saying: normalize. I have to bring the variables onto the same scale, because one of them was taking over and not letting age, which sits on a much smaller scale, play any role. Sometimes it is the range that matters: if one variable ranges from 20 to 50 and another from 20 to 250, the second produces bigger distances and takes over. That is why we standardize or normalize — two words used all the time in machine learning and statistics — so that one variable does not overpower another. When we say "normalize" we usually mean rescaling into the 0-to-1 range, and the formula is (x − min) / (max − min) for that variable. For each data point: the first age is 25, the minimum age is 20, so 25 minus 20 is 5; the maximum is 60, so the denominator is 60 minus 20, which is 40; 5 divided by 40 is 0.125. From every value I subtract the minimum and divide by the range; the value 20 becomes zero, and so on. Likewise for the loan amount: 40,000 minus the minimum of 18,000 is 22,000, divided by the maximum 220,000 minus 18,000, which is 202,000 — so that normalized value is 22,000 divided by 202,000. Once you say you want to normalize, the software takes care of it for you; this is what it does.
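For reference, a small sketch of the arithmetic just described — the raw Euclidean distance versus the distance after min-max normalization — using NumPy; the three (age, loan) records are illustrative numbers echoing the example, not the actual data set.

```python
# Raw scale versus min-max normalized scale in a Euclidean distance; toy numbers.
import numpy as np

ages  = np.array([25.0, 33.0, 48.0])                  # years (the query person is 48)
loans = np.array([40_000.0, 150_000.0, 142_000.0])    # loan amounts (the query is 142,000)

# Raw Euclidean distance from the query (last entry) to the first two people:
raw = np.sqrt((ages[:2] - ages[2])**2 + (loans[:2] - loans[2])**2)
print(raw)            # the loan term dominates; age barely matters

# Min-max normalization: (x - min) / (max - min), per variable, into [0, 1].
ages_n  = (ages  - ages.min())  / (ages.max()  - ages.min())
loans_n = (loans - loans.min()) / (loans.max() - loans.min())
norm = np.sqrt((ages_n[:2] - ages_n[2])**2 + (loans_n[:2] - loans_n[2])**2)
print(norm)           # now both variables contribute on a comparable scale
```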
The other way to bring variables onto the same scale, which we call standardizing, is the z-score that we talked about when we were comparing the performance of Darby and Hannah: the number of standard deviations, with the formula (x − mu) / sigma — how far the data point is from the mean. I take the mean of all the data points, subtract it from each value, and divide by the standard deviation; that is the z-score. So those are the two common approaches for putting data on the same scale, and even if the variables carry different units, these are unitless quantities: x and sigma share the same unit, x and the minimum share the same unit, so the units cancel. Normalizing puts the values into the 0-to-1 range; the z-score, when the data is nice and symmetrical, mostly lies between plus and minus three, though it can go beyond that with outliers. The point is that both bring all the variables onto the same scale, and now I can compute distances without giving heavy weight to a variable simply because it has a big scale. When I do that and look at the same distance formula again, the minimum distance is now to a different point — this one was actually closer, not that one — so where previously I had said the person will default, based on that 8,000 distance, now I am saying they will not default. That is why this matters: normalizing the data, bringing everything onto the same scale, is an important step in KNN, or in any algorithm where we compute distances across different variables. Clustering will need the same treatment.

So the first issue — scale — is resolved by normalizing. Next, KNN is sensitive to the number of variables, the number of features: the more variables, the more things can change. Suppose I add another variable, salary, alongside age and loan. Having realized that distances should not be computed on the original scales, I normalize salary too, and when I compute the distances again, the closest point may change — in this example the predicted class happens to stay the same, but the definition of "closest point" has changed because the number of features changed. In a higher-dimensional space a different data point may be closer, so if you add or remove a feature you have to redo this calculation. Richard also points out that normalization, depending on how it is computed, is susceptible to outliers — yes, the z-score especially will be more susceptible than min-max normalization, which always brings values back into the 0-to-1 range.
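A minimal sketch of letting the software handle the scaling, assuming scikit-learn: a scaler chained with KNN in one pipeline, where StandardScaler is the z-score version and MinMaxScaler the 0-to-1 version; the tiny age/loan table and labels are made up.

```python
# Chaining the scaling step and KNN so "the software takes care of it";
# assumes scikit-learn, illustrative data only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler   # z-score; MinMaxScaler gives the 0-to-1 version
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[25.0,  40_000.0],
              [33.0, 150_000.0],
              [60.0, 220_000.0],
              [45.0,  18_000.0]])          # columns: age, loan amount
y = np.array([0, 1, 1, 0])                 # 1 = defaulted, 0 = did not (made up)

query = np.array([[48.0, 142_000.0]])

raw_knn    = KNeighborsClassifier(n_neighbors=1).fit(X, y)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1)).fit(X, y)

# The nearest neighbour (and hence the prediction) can change once both
# variables are put on the same scale.
print(raw_knn.predict(query), scaled_knn.predict(query))
```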
("Professor, in linear regression we had p-values, where we could check how much a new attribute or feature really affects the dependent variable. Here, the feature we just added may be a false alarm — it may not be the right feature to pick. Is there a way to check?") KNN itself will not do that, and it is an important point you bring up, because it is one of the things I was going to mention when handling the issues. What can be done is to mix and match multiple algorithms. Nothing says that if you want to do classification you must use only KNN, or only logistic regression. KNN, all it does is find distances and return the closest points when a new data point comes; it doesn't even build a model, so there is no model to inspect. But given this data set you can build a decision tree or a logistic regression, classify, and find out which features are not important. Logistic regression might flag some features as unimportant, the decision tree might flag some others — perfectly fine — so you can build both, see which variables neither considers important, throw those out, and bring the remaining features into KNN. One of the major issues here is what we call the curse of dimensionality: when there are too many variables, even dealing with them becomes a challenge. So you can use logistic regression not for the final classification but to get rid of variables based on p-values, and then compute the distances on the remaining variables. I am not saying that is guaranteed to give the best result — it may or may not, and you only know when you try — but there is nothing stopping you from mixing and matching the benefits different algorithms give: not using an algorithm for its final purpose, but using some of what it does as a step in your workflow. And you are right that a lot of variables is the real problem: three variables is fine, but with 3,000 variables there are techniques such as PCA and SVD that combine highly correlated variables into fewer ones, and you can do that and then run your KNN.

So it is a good question at this point, now that we have learned some prediction and classification techniques, and the practical input is that you have multiple approaches: you can use p-values to remove variables, you can use the variable importance that decision trees report, you can use both, or you can use your domain knowledge and say "I know this variable does not make a big difference." You can combine all of them — the only thing is that when you combine these ideas, you build models, test how they perform, and iterate. So yes, Prati, you can do that; but the final point is that once you come back to KNN, it will not give you any input on which variable is important. It only looks at each data point and at which other data points are close; it doesn't look at the columns at all, because it is not building a model. ("Okay, thank you so much." — Great.)
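A hedged sketch of that mix-and-match idea, assuming scikit-learn: use a decision tree's feature importances to drop weak features (p-values from a logistic regression would be the other route mentioned), then run KNN on what remains; the data set and the 0.01 threshold are arbitrary illustrative choices.

```python
# Feature selection with a tree, then KNN on the reduced feature set; assumes scikit-learn.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)          # stand-in data set

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
keep = tree.feature_importances_ > 0.01              # keep features the tree actually used
X_reduced = X[:, keep]

knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
print("all features:", cross_val_score(knn, X, y, cv=5).mean())
print("reduced set :", cross_val_score(knn, X_reduced, y, cv=5).mean())
```

Whether the reduced set actually helps is exactly the kind of thing you find out by building both versions and comparing, as described above.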
Okay, so KNN for regression. I want to predict the house price index rather than loan default or no default. Same approach: normalize first, then look at distances. If I am using K equal to one, I take the minimum-distance neighbor and say the house price index will be 231. But if K is more than one — say K equals three — I take the three closest points, with distances around 31, 34 and 36: one has a house price index of 139, another also 139, and the third 231. If I take the median of those, it is 139; if I take the mean, 139 plus 139 plus 231 is about 509, divided by three is roughly 170, so around 170 would be the value I assign. You see how simple these things are: for regression, both decision trees and KNN just take the nearby values and use the mean or the median, and for categorical targets they take the mode. You can go further with weighted K nearest neighbors, which Sudna brought up during the intuition discussion: the closer point gets a higher weight than the farther one. How do I decide the weight? Let me hold that for when we reach the weighted version — it is a one-line answer: closer means higher weight.

Any questions on that? Now that you have understood it we don't really need to, but let me quickly go to a demo website. The dots are very small — I don't know how to make them bigger apart from zooming in — but let me clear the data. There is a button to generate data: you choose how many positive and how many negative data points, the mean and standard deviation for the X variable and for the Y variable, click generate, and it creates the points — here it has generated 50 and 500. K is five, so I want five nearest neighbors: I click somewhere, click "find five nearest neighbors", and it finds them and reports positive class 2, negative class 3, so it assigns the negative class. It is just distances. I can click some other data point and find its five nearest; the interesting cases are always where you have a mix of both blues and reds. Nothing fancy — just finding the closest points and measuring the distance. Like I said, we are only using the Euclidean distance at this point; when we come to clustering we will talk about the whole variety of distance measures — between categorical variables, between categorical and numeric variables, between numeric variables — and all of that applies here as well, so we will bring those ideas back.

So, in summary: it is a lazy learner — it does nothing until you give it the new test data.
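Going back to the house-price example for a moment, a minimal sketch of KNN regression with scikit-learn; the already-normalized features and the price-index values are made up.

```python
# KNN regression: the prediction is the mean of the k nearest targets; assumes scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.12, 0.30], [0.15, 0.28], [0.80, 0.75], [0.45, 0.50]])  # already-normalized features
y = np.array([139.0, 139.0, 231.0, 180.0])                              # house price index

knn_reg = KNeighborsRegressor(n_neighbors=3)       # average of the 3 nearest targets
knn_reg.fit(X, y)
print(knn_reg.predict([[0.20, 0.33]]))             # roughly the mean of 139, 139 and 180 here
```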
Among the lazy learners this is the most famous lazy fellow, but there are others, like LOESS — locally estimated scatterplot smoothing, or local regression — which splits the data into small regions and, within each region, builds a linear regression or even just takes the mean or the median. These approaches are very useful when you do not have nice global patterns but local patterns exist: localized models, or in KNN's case a model-free technique, will perform much better than a single global model — one equation for the entire data set, the way linear regression, logistic regression or a decision tree produces. And if the data is continuously evolving, a model you built earlier may not work very well; these lazy methods can come to your rescue and do a good job. The challenge, of course, is what K to pick — that is the open issue right now.

On the difference between lazy and eager learning: lazy, as I keep saying, stores the data and waits until the test data comes. Eager learning creates a model, and once the model is created you can set the data aside — you don't need it for prediction. The lazy learner's issue is that you have to store all the data, which is a storage cost, and then compute at prediction time, which means predictions take long because it has to start measuring distances. With eager learning the prediction-time computation is easy: once the model is built, new data comes in and it gives you the answer in no time, with nothing extra to store. Those are the pros and cons. On accuracy, as I said on the previous slide: if there are local patterns rather than one global pattern, the lazy method works out beautifully, because the eager methods fit a global model to the overall pattern in the data, and if no such pattern exists, the local approach is much better. An eager method only stores the beta coefficients, or the rules the tree came up with, and needs nothing else. SVM — the support vector machine — is also an eager method; we are not covering it, so I won't get into it, but just like the local regression there is a local SVM, which is a lazy method. Neural nets you will cover in the next course. The majority of the model-building methods are eager; only a few are lazy.

("Professor, one quick question: when you say that for models such as a decision tree we don't need the training data available any more, do you mean we can save that H5 model and use it?") What is an H5 model? ("The model that finally gets saved — the one we then use for further processing later on.") Yes. Theoretically, obviously, you don't throw the data away, but it has no further use for prediction unless we want to retrain; we keep it so that when new data comes in we can retrain and build new models. Suppose I found an equation: income equals 20 plus 7 times the years of experience plus 5 times the age minus 2 times the family size, or something like that.
Once I have that, the parameters are done; when new data comes I know the age, the experience and the family size, I plug them into the equation and get the output, so I don't really need the training data any more for the prediction itself — that is what I mean. And yes, the support vector machine is another technique we can use for these classifications. It is a useful one: when the data is nonlinear it actually increases the dimensionality, mapping the points into a higher-dimensional space — which we can't visualize — where they end up far enough apart to be roughly linearly separable. High dimensionality is actually a problem for many other algorithms, including KNN, which we will address, but let's not bother about SVM at this point; it is just one more algorithm for the same purposes.

So, finally, improving KNN. We will probably not cover the dimensionality part today — we will go for another 5 to 10 minutes and close with the summary, and move the dimensionality slides to the next class. These improvements are pretty straightforward. The first one was suggested earlier during the intuition discussion. Say I have a query point and K equals three: two of the three neighbors belong to one class and the third, which is much closer, belongs to the other. Because those two are the majority, traditional KNN — which gives every neighbor equal weight — assigns their class, but you can clearly see the single closer point should count for more. So the improvement is to give more weight to the closer point than to the two farther ones. How should I think about that weight? Obviously it should be inversely proportional to the distance, but the standard approach is to make it inversely proportional to the square of the distance, which emphasizes closeness even more: if the distance is large, its square is much larger, and the inverse of that is a very small weight, so far-away points get very little say, and the query may end up classified with the close point, or given a regression value closer to its value. That is distance-weighted KNN, and it resolves the problem of genuinely close points being outvoted just because I fixed K at three. Weighting is an important idea. The next questions are how to avoid overfitting — because KNN can overfit — which also answers how to set K, and then we will see another variant, edited nearest neighbors, that improves things further and helps fix the overfitting. We have already seen the bias-variance trade-off, so I'll just use this slide to connect with that thought.
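A minimal sketch of that distance-weighted vote, assuming scikit-learn. Its built-in weights='distance' option uses 1/d; to get the 1/d-squared weighting described here, a small custom weight function can be passed in — the points below are made up.

```python
# Plain KNN vote versus a 1/d^2 distance-weighted vote; assumes scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def inverse_square(distances):
    # weight each neighbour by 1 / d^2 (epsilon guards against a zero distance)
    return 1.0 / (distances ** 2 + 1e-12)

X = np.array([[1.0, 1.0], [1.1, 0.9], [3.0, 3.0], [3.1, 2.9], [2.0, 2.1]])
y = np.array([0, 0, 1, 1, 1])

plain    = KNeighborsClassifier(n_neighbors=3).fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=3, weights=inverse_square).fit(X, y)

query = np.array([[1.9, 2.0]])
# The unweighted majority and the distance-weighted vote can disagree when one
# neighbour is much closer than the others.
print(plain.predict(query), weighted.predict(query))
```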
This one is too simple a model, and this one is too complex — it goes around every point, perfect classification on the training data, but far too convoluted. The one in the middle is probably reasonable: a couple of points are misclassified, but the boundary is sensible. The too-simple model has many misclassifications on both sides — that is high bias, the same idea we covered yesterday, just a different picture — and the too-complex one is low bias but high variance. So how do we control this in KNN? By picking the value of K. The same overfitting-versus-underfitting idea extends here: going around every single data point is a very complex boundary, and that is K equal to one — I just follow the single nearest neighbor, so the regions get carved around individual points. Whereas if I say K equal to seven, I ask what the seven closest points are; maybe a couple of them sit over here, but the majority are there, so the odd point gets ignored and the region becomes smooth enough, allowing a few misclassifications. So a small K gives convoluted boundaries and a low training error — an overfitted model, low bias but high variance — while a large K gives smoother boundaries, the training error increases, and in the extreme it underfits. Small K means overfitting, large K means underfitting — that is what we need to remember, and this is how we play with bias and variance through K.

So how do we fix it? KNN doesn't really have a training-error concept in the usual sense, but for the bias-variance part: take some sort of validation data set, start with K equal to one and record the error, then increase K to three, five, seven, and so on — try five or ten different values — and whichever K gives you the minimum error on the validation or testing data is the one you use. Which error metric? For regression, mean absolute error and the like; for classification, precision, recall and the other metrics we discussed. The other thing you can do to avoid overfitting is edited nearest neighbors, also called Wilson editing: remove data points that are class outliers, meaning points whose nearest neighbors belong to a different class — exactly what we saw in the previous picture. Throw those points out, and the decision boundary becomes much smoother. In this data set of blues and reds, with K equal to seven, this lone blue point gets thrown away, and the boundary is no longer contorted to accommodate it; it is a smoother way of doing things. Another example shows the same idea with a few more such points: throw those two away rather than drawing a boundary that wraps around them.
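A minimal sketch of that K sweep, assuming scikit-learn; the candidate K values, the stand-in data set, and the 5-fold cross-validation are illustrative choices.

```python
# Picking K by validation error; assumes scikit-learn, illustrative settings.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)           # stand-in data set

scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X, y, cv=5).mean()   # validation accuracy for this K

best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)    # very small K tends to overfit, very large K to underfit
```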
The boundary then just runs straight through; the misclassified class outliers are simply discarded. So that is one way to avoid overfitting: throw away the class outliers. And the broader takeaway is that a higher K helps prevent overfitting.

I think we will stop here — it is 9:02. There are only four or five slides left, but I would need to say a bit more on that part, so let's do it in the next class. We have come up to this point; dealing with dimensionality is what we will cover next time — four or five slides, five to ten minutes at most.

So, what have we done today? We continued with decision trees — we had the intuition yesterday, and today we covered the specifics: how to select the nodes for splitting, and how to stop and avoid overfitting. The splitting is done by choosing whichever variable, and whichever value it splits on, gives the most homogeneous split: the data points filtered down to the next level should be as homogeneous as possible. To measure that homogeneity — or, inversely, the impurity — we have measures like the classification error, entropy, and the Gini index, and for a regression problem with a numeric target, the variance. We measure the impurity before the split, when the classes are all mixed up, and then, for each candidate variable, we test the split and measure again afterwards. For example, splitting on gender: among males, how many are class zero and how many class one — defaulted or not — and among females the same. If all the males defaulted, that branch is nicely classified; if not, how much impurity is left? I compute the entropy or Gini index for each branch and take the weighted average after the split. Whichever variable reduces the error the most — or, equivalently, gives the largest information gain — is the one I split on at that point. Then at the next level the same process repeats, node by node, splitting and splitting to give homogeneous leaves. That is the approach for selecting the variable and growing the tree.

Then, to avoid overfitting, I can do pre-pruning — terminating before the entire tree is built, by specifying the maximum depth, or the minimum number of data points a node must receive, or a purity threshold such as 90% of the points belonging to the same class. Specify those criteria and the tree stops there without overfitting. Alternatively, I allow the entire tree to be built and then prune it back, and pruning is based on how much the error changes: if there are useless variables that are only overfitting the tree, removing them will not change the error much.
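Since this recap leans on measuring impurity before and after a split, here is a tiny worked Gini calculation with made-up counts — just to pin down the arithmetic, not something taken from the session.

```python
# Gini impurity before a split and the weighted average after it; made-up counts.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Parent node: 10 defaulters, 10 non-defaulters.
parent = gini([10, 10])                          # 0.5, maximally mixed for two classes

# Candidate split, e.g. on gender: males -> 8 vs 2, females -> 2 vs 8.
left, right = gini([8, 2]), gini([2, 8])
after = (10 / 20) * left + (10 / 20) * right     # weighted by how many points go each way

print(parent, after, "gain:", parent - after)    # the split with the largest gain wins
```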
And when the error doesn't change much, it is good to prune: if a huge tree and a pruned tree give me similar error, why keep the huge tree? To decide how far to prune, we can compute the training and testing error at different levels of pruning and look at both, which also tells us whether we have a bias or a variance problem. That was decision trees.

The KNN idea is even simpler. With decision trees we were breaking the space into regions of similar things; here I don't even build a model. When a data point comes in, I look at its nearest neighbors: if it is a classification problem I take the majority — the mode — and assign that class; if it is a regression problem I take the values of the closest points and assign their mean or median. One thing we have to do is normalize the data, because when we compute distances across variables, if age is a two-digit number and income is a five- or six-digit number, the larger scale will overpower the other, so everything has to be brought onto the same scale. We also saw that the number of variables matters: when we added a new variable, the definition of closeness changed, because in a higher-dimensional space a different data point can become the closest. What happens when there are too many dimensions is the part I have not covered — that is what we will do next time. Beyond that, how can KNN be improved? Give more weight to the points that are closer: the weight is the inverse of the squared distance, so a nearby point gets a high weight and a farther one a low weight — that is the weighted-KNN improvement. The other question is how to choose K and avoid overfitting: the higher the K, the smoother the boundary, so bias increases and variance goes down; the lower the K, the more it overfits, chasing each individual neighbor, so variance is high. We balance it by trying a few different K values and seeing which gives the lowest validation error. That is all we have done today. Any questions? It is 9:09 — one minute left, but I can stay longer.

("Are we going to do some hands-on practice?") Yes — next Saturday there will be a hands-on session that will help you build these models and practice all of this. ("Thank you, looking forward to that.") And I will teach on Sunday, I believe; on Sunday we will finish this, and then I think the next topic is Market Basket Analysis and Association Rules. ("Samit, there's a question.") ("Yes, thanks, Professor. You explained that a high K value will increase the bias. I probably missed that intuition — I could not visualize how a larger K increases the bias.") So what I am saying is this: if K equals one, the algorithm builds a boundary around each individual data point — a very complex boundary — so it overfits. Whereas if I say K equals, say, seven, as in this example, I take the seven closest data points
and assign the majority class among them, so a point that belongs to a different class effectively gets outvoted and smoothed over. If K had been one, I would have drawn the boundary around that single data point, then around this one, then around the red point here, and so on; with a larger K I take a bigger neighborhood around the query, assign the majority class, and the boundary gets smoothed a lot more. ("Yes, I got it — thank you, Professor.")

("Professor, a quick question about the assignment: if there is a problem we have to implement, is it required that we use Azure ML Studio, or can we just build it in Jupyter Labs?") Well, that is really a question for Deup, since he will evaluate the hands-on aspects; but if you ask me, I think you can use whatever you want, because I am not going to ask you to submit code. I will want to see the charts and visuals that come out of it, so it doesn't matter which tool you use — ultimately it is learning the concept and applying it that matters. ("Thank you so much, Professor.") ("So it is coming Wednesday?") Who said Wednesday — ah, okay. No: I said that by Wednesday I will give the assignment, which means I will post it on Teams with all the instructions; it will be a straightforward explanation. We don't have a class that day. Any other questions? If not, thank you very much. Have a great day, a great evening, a great weekend, and a great week — I'll see you next Sunday. ("Thank you very much." "Thank you, sir." "Thank you, Professor." "Thanks a lot.") I am staying on in case there are questions, but please feel free to drop off. ("Professor, good night.") Good night, have a great day. ("Good night, Professor. Thanks, bye.") Thank you, bye-bye.