Transcript for:
Introduction to Machine Learning

One thing I want to warn you about: once we post the video recordings, there is a temptation among students to say, "You know what, I'm just going to stay in bed and watch them at home." It's pretty boring to watch lectures at home, so most students who plan to watch them later never get around to it. It's better to come to lecture, but you're welcome to watch them online. Someone had a question in the front? No? Okay, it's resolved. All right, good.

Another thing: at this point everybody should have received an email, either an invitation to Vocareum, or an email saying that maybe the placement exam didn't go so well and the recommendation is that you drop the class. If you have not yet received an email, please contact us via Piazza or contact the TAs. Please also check your spam folder: the Vocareum invite is sent automatically through Vocareum, so it can look like spam. Please make sure you find it. By the way, please all close your laptops. Come on, guys, it's the one rule I have: no laptops. No iPads either. Any more questions about Vocareum, the placement exam, or anything else? (Student question about project partners.) Right. Unfortunately, the way it works is that you first have to register your partner, then you decide on the project, and then you start with the project; you cannot add a partner later on. The reason is that we have had problems in the past where people added partners the day before, as a favor or something, and it just gets complicated. This is the clean solution: you enter the title and a short description of the project that we can look at. Okay, any more questions? Okay, good.

So last time we talked about the general setup of machine learning. We have a data set D that consists of pairs, basically feature vectors and labels, and our goal is to find a function h that predicts y from a vector x: you put in x_i and ideally it outputs y_i. This function h is chosen from a hypothesis class, curly H, the set of all possible functions that we consider. Someone at the end asked me: what is this curly H? It seems very abstract. So let me give you an example. I showed you a teeny little example of a decision tree. In a decision tree you basically split on a feature: for example, is the person male or female, how old is the person, is the person over 21, yes or no? You can build a large tree like this, and at the end you make a prediction, saying yes, the patient will return within 60 days, or no, they won't. In this case the curly H would be the set of all possible trees you could imagine. With a finite feature set it is actually a finite set, but it is very, very large, larger than you think, even larger than that. That doesn't matter; we just have to find some algorithm that finds the right little h out of that set. So basically, by choosing curly H, all we are saying is "I'll be using decision trees"; that's what that means. Then we try to find the best one that can be learned on that data. The learning is choosing this one particular instance out of the set of all possible functions.
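To make this concrete in code rather than on the board: here is a minimal sketch in Python, assuming scikit-learn is available and using made-up toy numbers. Choosing DecisionTreeClassifier is the act of picking curly H ("all decision trees"), and calling fit is the learning step that selects one particular little h out of that class.

```python
# A minimal sketch, not the lecture's code: pick the hypothesis class
# H = "all decision trees", then let the learner choose one h from it.
from sklearn.tree import DecisionTreeClassifier

# Toy data set D = {(x_i, y_i)}: features are [age, is_female], the label is
# "patient returns within 60 days" (made-up numbers, purely for illustration).
X = [[45, 0], [23, 1], [67, 1], [34, 0], [58, 1], [19, 0]]
y = [1, 0, 1, 0, 1, 0]

h = DecisionTreeClassifier(max_depth=2)  # h will be one tree out of curly H
h.fit(X, y)                              # "learning": choose h to fit D

print(h.predict([[30, 1]]))              # apply the learned h to a new x
```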
Any questions about this? Raise your hand if that made sense. Okay. Raise your hand if I'm speaking too fast. Really? All right, I've watched my own videos and thought, "Wow, I'm talking fast," but I'm trying to slow down. Okay, good.

So then we said: great, we have this data set, and we try to find the h that has minimum loss on this data set. What was the loss function again? The loss function is some function L: you stick in an h and you stick in our data set, and it tells us how well we are doing. Basically it says that this particular function h makes the following error: it gets this fraction, this percentage, of examples wrong, or this is the average error we are making in predicting house prices, or something like that. We would like this function L to be small, so the idea of machine learning is that we try to find an h that minimizes this loss. You remember this? Raise your hand if you remember it. All right, good.

So then we said, okay, that's great, and we came up with this really cool algorithm that gets zero loss. And it was a terrible algorithm: it just memorized the data set, and when you stick in an x it looks up that x in the data set and outputs the corresponding y. That gives you zero loss, because for every (x, y) pair in the data set you output exactly the right y. Great, we're done. So what's the problem with this? The problem is that it only works on data that is in this data set D; it doesn't generalize. And that led us to the next notion: what we really want is not a function h that minimizes the loss on this data set; what we really want is a function that minimizes the loss on new data. How do we write this? What we really want to minimize is the expected loss if we were to apply our function h to a new point drawn from the distribution P from which our data set was drawn. Basically: if we take a new data point x with its label y, what would the loss be? That's exactly what we want. If this expected loss is zero, it means that whenever you get a new point you always get it right, which is awesome. The problem is that we can't possibly compute it, because we don't know the distribution P. The distribution P is this elusive distribution from which our data is sampled; we don't know where the data comes from, or at least we don't know what the distribution is.

So we said: well, it's elusive, you can't actually compute this, but there is good news, and the good news is that we can approximate it. This is exactly where we left off last time. We take our data set D and we cut it into two parts, and we call this part here D_train and this part here D_test. The moment we get this data set, we should chop off this D_test part, copy it onto some disk, and bury it somewhere underground, and we don't tell anyone where it is, so no one is ever allowed to look at it. Now we train our algorithm only on the training set, and then ultimately, if we want to know how well it really works, we compute the loss on the test set. That's the key, because the test loss is an unbiased estimator of the actual true loss we are trying to estimate. (A small sketch of this memorizer, and of the split that exposes it, follows below.)
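Here is a minimal sketch of that memorizer and of the held-out split that exposes it, in plain Python with an entirely made-up toy data set (illustrative only, not the lecture's own code):

```python
# A minimal sketch: the "memorizer" gets zero training loss, but the
# held-out test split reveals that it does not generalize.
import random

def train_memorizer(D):
    """'Learning' here is just storing the training pairs in a lookup table."""
    table = {x: y for x, y in D}
    def h(x):
        return table.get(x, 0)   # perfect on seen points, blind guess otherwise
    return h

def zero_one_loss(h, D):
    return sum(h(x) != y for x, y in D) / len(D)

# Made-up data; in reality D is sampled from some unknown distribution P.
random.seed(0)
D = [(i, random.randint(0, 1)) for i in range(100)]
random.shuffle(D)
D_train, D_test = D[:80], D[80:]     # "bury" D_test and never train on it

h = train_memorizer(D_train)
print(zero_one_loss(h, D_train))     # 0.0, looks perfect
print(zero_one_loss(h, D_test))      # roughly 0.5 here, the honest estimate
```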
There are a few pitfalls here, and this is very important: you cannot just split the data any way you want. A lot of grave mistakes in machine learning have been made by people splitting between training and testing the wrong way. Let me tell you one example. I used to work in industry, and one thing our research team built was an email spam filter. One day one of us, and it may have been me, was really, really excited, because the new algorithm got 0% training error and 0% test error. You evaluate the algorithm on the test set and you get every single one right; not a single spam message slips through. Awesome, take that, spammers! But what was the problem? The problem is that I had split the data into training and testing randomly, and it turns out that spam messages are very, very similar to each other. People send out a spam message not just to one person but to a hundred thousand people, so you have essentially the same email many, many times. If you split randomly, every single email that was in the test set was also, in one form or another, in the training set. Our algorithm basically learned the right answer for all of these, then saw essentially the same emails again and got zero error.

The right thing to do in this case, because there is a temporal aspect to the data, is to split by time: when was the data collected? You have your time axis, and you say all the data collected up to Thursday goes into training, and Thursday to Sunday goes into test, something like that. Then it's an accurate setting, because the algorithm gets no information from the future, and that is exactly what you want: you want the spam filter that is trained today to work tomorrow. So if you have a temporal component, you always have to split on time.

If you don't have a temporal component and the data really is sampled i.i.d., then the right thing to do is to shuffle it randomly. It's a good exercise to always do this. I've certainly seen cases, and this time it was someone else, not me, where the problem was that the data was stored so that first came all the points with label 1 and then all the points with label 2, and someone split it in the middle, so the test set was all the same label. That, of course, doesn't give you an accurate estimate of the test error. So: if it's i.i.d. data, split uniformly at random; if it's temporal data, split at some time point, and everything in the future goes into the test set.

There are other cases too. In medical care, one mistake I've often seen is that people take a data set with multiple samples from each patient and split those uniformly at random. You can't do that either: you have to move entire patients, some patients into test and some patients into training. The idea is always that you train on one set of patients and then test on a different set of patients. So the important thing is to be really, really careful when you do this; but if you do it right, the test loss is a great estimate of your true error. Any questions? None? All right. I guess you have it right in front of you, so the two points I really want to make are: split uniformly at random if the data is i.i.d., and split by time if there is a temporal component. (Both are sketched in code below.)
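As a concrete illustration of those two rules, plus the patient example, here is a minimal sketch assuming numpy and scikit-learn, with entirely made-up arrays and variable names:

```python
# Minimal sketches of the splitting rules discussed above (toy data only).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# 1) Temporal data (e.g. spam): split by time, never randomly.
#    Everything up to the cutoff goes into training, everything after into test.
timestamps = np.array([1, 5, 2, 9, 7, 3, 8, 4, 6, 10])   # collection times
cutoff = 7
train_idx_time = np.where(timestamps <= cutoff)[0]
test_idx_time = np.where(timestamps > cutoff)[0]

# 2) Grouped data (e.g. several samples per patient): whole patients go
#    into either training or test, never both.
X = np.random.randn(10, 3)                    # toy features
y = np.random.randint(0, 2, size=10)          # toy labels
patient_id = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx_group, test_idx_group = next(splitter.split(X, y, groups=patient_id))
```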
Very often, people do one additional thing in practice: they don't just split into training and test; they actually split the data into three sets, and they call the third one the validation set. Does anyone have an idea why you might want to do that? Yeah, that's right. If you have many, many models and you train them all, you only have one test set. Here's something that is really, really important: if you evaluate a model on the test set, that is an estimate of your generalization error, but only the first time you do it. If you then train another model, get a different test error, and pick the better of the two, you have made a decision that depends on these particular test points, and the result may no longer be reflective of the true error.

Let me give you an example. Say we have a face recognition data set and a pretty good classifier that gets 99% correct; maybe on this test set it only makes two mistakes. Now I tweak my algorithm, retrain it, and now I only make one mistake on this test set. I've halved my error, that's amazing! Should I really pick the second algorithm over the first? Maybe it's truly better, but there's also a good chance I just made some tweak that happens to get that one particular person right, and it may not actually be better in general. And say I try a hundred algorithms, all a little different, all making around one or two mistakes: if I try many, many algorithms, there is bound to be one of them that coincidentally makes no mistakes, and if I pick that one it seems like I'm doing really, really well, but actually what I did is overfit to the test set. I made a choice that probably only applies to this particular test set. Does that make sense? In some sense, if I look at this test set too often, then I'm also training my algorithm on the test set, and I lose the fact that it is an unbiased estimate. Any questions? Yeah, so here comes the point.

If you have the right setup, what you really do is take your multiple algorithms, evaluate them on the validation set, pick the best one on the validation set, and then, when you finally have a champion, you evaluate it on the test set. That's the error you tell your boss, because that's the true error you can expect in real life. (A sketch of this protocol appears below.) Usually what you see is that the training error is lower than the validation error, and the validation error is a little lower than the test error, because you made some choices based on the validation set. And it's not just different models: often you also have parameters. For example, if you have a decision tree, how deep should the tree be? There are some trade-offs there that you will learn about; that's the kind of decision you make on the validation set. But the key is that you evaluate on the test set only once. When you write a paper, you should be really, really strict about this. People get very annoyed when you publish an algorithm and it turns out you actually evaluated on the test set many times and picked the algorithm that did best, because people notice it immediately: they take your algorithm, apply it to a different data set, and get a clearly higher error. So it's easy to spot, and it's problematic. Any questions?
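Here is a minimal sketch of that protocol, with made-up toy data and scikit-learn assumed: several candidate models are compared on the validation set only, and the single winner is scored on the test set exactly once.

```python
# A minimal sketch of the three-way split: pick the champion on validation,
# then touch the test set exactly once.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = (X[:, 0] + 0.3 * rng.randn(300) > 0).astype(int)   # toy labels

# e.g. 60% train / 20% validation / 20% test
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = [DecisionTreeClassifier(max_depth=d) for d in (1, 3, 5, 10)]
val_errors = []
for h in candidates:
    h.fit(X_tr, y_tr)
    val_errors.append(np.mean(h.predict(X_val) != y_val))

champion = candidates[int(np.argmin(val_errors))]       # chosen on validation only
test_error = np.mean(champion.predict(X_te) != y_te)    # reported once, to the boss
print(val_errors, test_error)
```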
(Student question: but won't the same overfitting happen on the validation set?) Right, it will happen on the validation set, but then you still take the best algorithm, whichever one you think is best, and apply it to the test set, and now you finally get a true estimate of your generalization error. It is true, though, that this is an issue, and one thing people do, which we'll talk about later on, is change the split: this part is training and this part is validation, then you train again with a different part as training and a different part as validation, and so on. So you make different splits to get different validation errors; we'll talk about this in more detail. (A small sketch of this re-splitting idea appears at the end of this discussion.) And one thing that has lately become fashionable is to compute the error on the validation set and then add a little bit of noise to it, perturb it a little, basically saying: there is a little bit of overfitting, and I can counteract it by adding noise. We will talk about this in a few lectures.

(Student question: can you reuse the validation set?) Yes, you can reuse it. Here's what you do: you train your algorithm on the training set and evaluate it on the validation set; you try a few different models and evaluate them there. Once you pick the best model, you retrain it one more time on the entire training-plus-validation data, and now you test it on the test set. But then you stop; you throw away your computer. You're not allowed to touch anything anymore, because you've used up the test set.

(Student question about the test set.) Yes, once you've used the test set, it's done. One thing we did back when I was in industry: we had a test set of spam data, and the moment we used it, we started collecting the next one, so we always had a fresh test set available.

(Student question about temporal data.) If you have a temporal component in your data, then you have to split it by time. If you actually have time series, then it gets a little more involved; it depends on the setting. But the one thing you have to be careful about is that your algorithm cannot get a glimpse of the future, because in the actual deployment setting it would not be able to do that. That is often the pitfall: you realize afterwards that the algorithm used some information that came from the future.

(Student question: can't you just wipe your memory and run it again?) Oh, I see, you basically want another glimpse at the test set and then run it again. The problem is not the computer, the problem is you. Because you glimpsed at the test set, that informs you, and it's going to be really hard to say "I totally ignored this" afterwards. The fact that you know the result means you learned something from it: you learned that one set of algorithms works better, say that neural networks work better than decision trees, and that will change your decisions in the future. At that point you are cheating, because you are making decisions based on the test set. If you evaluate again, you are basically using it twice: first you looked at it to see what works well, and then you say, "Look, this worked well." Does that make sense?

(Student question.) Yes, that's right, you can make many more divisions; we will get into that and discuss it formally, with all the little details. (Student question: how large should the validation and test sets be?) That's a good question. The validation and test sets should be neither too large nor too small.
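Picking up the re-splitting idea from a moment ago: here is a minimal sketch of averaging validation errors over several different train/validation splits (this is the cross-validation trick that will be formalized in a later lecture; toy data, scikit-learn assumed).

```python
# A minimal sketch of "make several train/validation splits and average".
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(100, 4)                     # toy features
y = rng.randint(0, 2, size=100)           # toy labels

errors = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    h = DecisionTreeClassifier(max_depth=3)
    h.fit(X[train_idx], y[train_idx])
    errors.append(np.mean(h.predict(X[val_idx]) != y[val_idx]))

print(np.mean(errors))                    # validation error averaged over 5 splits
```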
Usually people do an 80/20 split, but it depends on how much data you have. It also depends on how noisy your data is; usually you want to look at the variance and so on. If the test set is very, very large, you get a very good estimate of the generalization error, but you have also burned a lot of data that you could have used for training. That's basically the trade-off, and there are tricks you can play that we'll talk about in a couple of lectures.

(Student question about peeking at the test set.) Yeah, if you look at the test set you basically have to poke your eyes out; don't do it. After that you're no longer allowed to look at it. Unless you say up front, "Here is my setup and here is my outcome: I am comparing these algorithms, and my point is to report the errors they make on the test set." That's fine: you run all of them on the test set, report the results, and conclude; you don't keep iterating on it. Does that make sense? As long as you lay open everything you did and say, "I looked at the following five algorithms, I ran all of them on the test set, here are the five results," that's fine, because ultimately you didn't make any decisions or change any behavior afterwards based on it. But if you try the first algorithm, and then based on that you choose a second algorithm and a third algorithm, now you are adapting to the test set, and the moment you adapt to it you are overfitting.

(Student question about very small data sets.) Right, we will get to this. One thing you can do, for example: I worked on a problem once where we had eleven data points. It was a medical study, and the experiments were very, very expensive; each data point cost a lot of money and involved taking MRI scans of people. So what we did is basically leave one person out for testing. That by itself is weird, because we either get that person right or wrong, so we'd get 0% or 100% error. But what you do is always train on ten and leave one out, do that eleven times, and then average, and report that the average error is such and such. That's fine, that's totally kosher; you just have to describe what you did. (A small sketch of this leave-one-out loop follows at the end of this discussion.) Does that make sense? Okay, one last question on this, because there is more to machine learning.

(Student question.) Oh, there is an art to this. So, as I said, this setup gives you a function: you try some h and you get a function. Your boss tells you that to deploy a spam filter you'd better have no more than 0.5 percent error. You run your thing and you're doing okay, 0.7 percent error, but it's not good enough. Darn. Let me try something else: you try some other function, you try more complicated networks and so on, run again, and now you get 0.6. Okay, let me tweak it a little more, until you get 0.4, and you say, "Okay, now I think I've got it." You evaluate it on the test set, you get 0.5. Awesome. Does that make sense? Okay, good, let me move on.
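Returning to that leave-one-out procedure: here is a minimal sketch with a tiny made-up data set standing in for the eleven expensive MRI data points (scikit-learn's decision tree is just a placeholder classifier; this is not the study's actual code).

```python
# A minimal sketch of leave-one-out evaluation on a tiny data set.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(11, 3)                       # 11 data points, 3 toy features
y = rng.randint(0, 2, size=11)             # toy binary labels

errors = []
for i in range(len(X)):                    # leave point i out, train on the other 10
    mask = np.arange(len(X)) != i
    h = DecisionTreeClassifier(max_depth=2).fit(X[mask], y[mask])
    errors.append(int(h.predict(X[i:i+1])[0] != y[i]))   # 0 or 1 for this one point

print(np.mean(errors))                     # the average of the 11 per-point errors
```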
One quick quiz; let me just say this, because I don't have the answer on the handout. If we write the loss on the test set as the average over all the data points in the test set, (1/n) Σ_{i=1}^{n} L(h, x_i, y_i), the question is: why does this, as n goes to infinity, become exactly the expected loss E_{(x,y)~P}[ L(h, x, y) ]? I claim that the former converges to the latter as n goes to infinity. Basically, that's exactly what we want: it says that if the test set is really large, then the quantity your computer computes is exactly the expected loss. Let me give you a minute to discuss with your neighbor why this converges to that.

All right, who can tell me why, and what the statistical law is? Yeah, that's the weak law of large numbers. The weak law of large numbers says exactly this: the average of i.i.d. random variables converges to the expected value in the limit. So the weak law of large numbers assures us that what we're doing is right. Any questions? Okay, good.

By the way, if I don't see your hand, there are too many people here, so please just say something, anything you want. I said this last time as well: the last time I taught this class, one student just said "Viagra." We were talking about spam filtering, so it made sense. All right.

So what have we talked about so far? We have the data set D, the feature representation, so you take your data and represent it as features, you have labels, you want a function h that goes from X to Y, and we know we have to split the data. That's all really, really good. One thing I haven't talked about is how to find this hypothesis class curly H. Here's a great idea: why don't we just, during this course, figure out what the best algorithm is, and then only use that one? Even better: why don't I just figure out what the best algorithm is, teach you only that one, and then we call it quits? Why does that not work? Let me give you an example, and this example is actually a party game. Are you ready for a party game? Oh come on, what kind of party is this?

Okay, so here's my party game. I give you a bunch of data points, and instead of training a machine learning algorithm, I want you guys to be the algorithm. I give you a bunch of points; this is my x and this is my y. So I tell you: for this x value, the answer is this y value. I give you a bunch of these points, and that's our training data. Now comes the test. I look at this point here; let me count it off to make this more concrete: one, two, three, four. So where do you think the y value is at exactly this value of x? Who wants to volunteer? Any brave people? Around two? Ah, sorry, totally off, it's four. How about this point here? Four? Oh come on. Anybody else? Two? Close, thank you, but it's negative. So I know what you're thinking: your parties must suck. So what's the problem? "We can't learn that"? That's right, you guys are terrible at this, I'm sorry, because, hey, the function actually looks like this: here it went down, negative. So what's the catch? You made an assumption. You saw these points and you made the assumption that the function is smooth. Most of you did this, and that is the key to machine learning.
If you have an adversarial setting, some setting where I try to predict something but there is an adversary, which was me in this case, who can always come up with some other answer, then machine learning is not possible. You have to assume that the world is nice; you have to make some assumptions about the data set, otherwise we cannot make predictions. The whole point of prediction is that we assume things basically keep behaving the way they did. And different functions, different algorithms, make very different assumptions. The no free lunch theorem, as we call it in machine learning, basically states (and you can actually formalize this) that there cannot be one function that is best for everything, because you have to make some assumptions to predict well, and then I can generate a data set that violates exactly those assumptions, and your algorithm will do arbitrarily badly on it. Does that make sense? So there is no single algorithm that does well on everything; it cannot exist. I'm not formalizing this in lecture, I'm just giving you the intuition.

A big part of this class is to understand what assumptions each algorithm is making. A big mistake that I see when people apply machine learning in practice is that they have one favorite algorithm that they like, for example deep learning or random forests, and they just love that algorithm and use it for everything, irrespective of what assumptions it makes and whether those assumptions actually hold for that data set. What I want you to get out of this class is to sharpen your vision in that respect: look at data sets and first analyze, what are the properties of this particular data set, what assumptions can I make, do those assumptions actually hold, and which algorithm leverages these assumptions to make good predictions? That is the most important aspect of machine learning, and it is what most people do not do. Any questions about this?

(Student question about automating this.) Yes, there are algorithms for that; it's called meta-learning, and essentially what they do is just what a data scientist would do: if you don't know what assumptions you can make, you split the data into a training set, a validation set, and a test set, try out a whole bunch of different functions, see which one gives the best validation error, and then evaluate that one on the test set. That is a purely empirical way of doing it. But if you think enough about the data, it typically helps a lot to rule out algorithms to begin with: if you can say, "this algorithm makes this assumption, and that assumption does not hold in this data set," then don't waste your time. Good question. Any more questions? All right, awesome.

That means we can move on to the next topic, and I can talk about the first algorithm. Let me hand some of these to the middle; I'll give you some more; these are for you. Okay, who does not have a handout? Can these somehow make it across to the upper half of the classroom? All right, I can already get started. The first algorithm is also one of the oldest algorithms in machine learning, alongside the perceptron. It's called k-nearest neighbors, and it comes from 1967, Cover and Hart. So let's say we have data points; this here is my data. By the way, can you see this in the back? Raise your hand if you can see this.
Okay, up to the third table, good. So we have data points, crosses and circles; I just have a binary classification problem here, crosses and circles. And we have a new data point; say this is my new data point. I would like to know: is this a cross or is this a circle? What would you predict, anybody? Is it a cross or a circle? It's a cross, that's right. And why do you think it's a cross? Because it's very close to crosses. That's right, and that is exactly the assumption that nearest neighbors is making. Given that we just played this awesome party game, you now know that the first thing you have to do when you look at an algorithm is to think about what assumptions that algorithm is making, and k-nearest neighbors makes the assumption that data points that are similar have similar labels. That doesn't have to be the case; you can easily think of a data set where it's not the case at all. But that is the assumption we are making: data points that are very close together are likely to have the same label.

Here's how the algorithm works, in a nutshell. You look at this test point, for which you don't know the label, and you look at its k nearest neighbors, where k could be 1, could be 3; usually you want some odd number. Say k = 3: this one, this one, and this one. You pick these three nearest neighbors, and in this case all three of them are crosses, so you let them vote: this one votes cross, this one votes cross, this one votes cross, so unanimously we decide it's a cross. If the test point were over here instead, and it had these three as its nearest neighbors, then you would say: I have two votes for cross and one vote for circle, so I'm still saying it's a cross, maybe with less confidence, but I'm still saying cross. Does that make sense? Raise your hand if that makes sense. All right. It's a super, super simple algorithm that is surprisingly effective.

Let me formalize this algorithm. We have a test point x and a data set D. We look at the set of k nearest neighbors, S_x, which is a subset of D, defined by two conditions. First, the cardinality |S_x| = k: there are exactly k data points in there, and k+1 is too many. Second, for every other point (x', y') that is in D but not in S_x, so every point that is not in the set of nearest neighbors, we require that the distance dist(x, x') is greater than or equal to the maximum distance from x to any point in S_x, that is, dist(x, x') >= max_{(x'', y'') in S_x} dist(x, x''). One more time: take any point in S_x, compute its distance to our test point, and take the maximum over S_x; that is the neighbor that is furthest away. Every point that is not in S_x has to be at least that far away. In colloquial English: if you are not one of the k nearest neighbors, then you must be further away than the k-th nearest neighbor. Raise your hand if that makes sense. Awesome, good stuff. (A small code sketch of this follows below.)

(Student question: what about ties?) Well, you could have points at exactly the same distance, but you can only keep k nearest neighbors, so you have to choose. It's like choosing between Momma and Poppa: what are you going to do, you like them both.
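Here is a minimal sketch of that voting rule in plain Python and numpy, with made-up toy points and Euclidean distance for now (the choice of distance is exactly the topic coming up next):

```python
# A minimal sketch of k-nearest-neighbor classification: find the k closest
# training points to the test point x and let them vote on the label.
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    S_x = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in S_x)      # each neighbor votes its own label
    return votes.most_common(1)[0][0]             # majority wins

# Toy training set: crosses (label 1) in one region, circles (label 0) in another.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_train = np.array([1, 1, 1, 0, 0, 0])

print(knn_predict(np.array([1.1, 1.0]), X_train, y_train, k=3))   # -> 1, a cross
```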
All right, we have a few more minutes. The key behind nearest neighbors is that it is only as good as its distance metric. If the distance metric perfectly reflects similarity in labels, then it is a perfect classifier: if you had a distance that is zero whenever the labels are the same and one whenever the labels are different, it would be perfect, because the nearest neighbor would always have the same label as you. So the key behind nearest neighbors is to use a good distance metric, and there is a whole field called metric learning that says we can actually learn these metrics; we're not going into that here. The most common metric that people use is the Minkowski distance. Minkowski; where is he from, Russia? We go over all the dimensions; this here is the r-th dimension, so maybe I should write it like this: the r-th dimension of x minus the r-th dimension of z, in absolute value, raised to the power p, summed over all dimensions r, and the whole thing raised to the power 1/p:

    dist(x, z) = ( Σ_{r=1}^{d} |x_r - z_r|^p )^{1/p}

That is the Minkowski distance, and it is the most commonly used. Now, a very quick quiz for you: three special cases, p = 1, p = 2, and p becoming really, really large. I claim these are distances that you are all well aware of. Discuss it with your neighbor, with your favorite neighbor maybe: what are these three metrics?

All right, who knows the answer? First one, p = 1, what is this? Who's brave? Yep, the Manhattan distance, right. p = 2? Someone else? Yeah, the Euclidean distance. p goes to infinity? Yes? Sorry, say it one more time? Yeah, it's the max, that's right, it's the max. The max of what? It's the maximum difference over the dimensions. And why is that? Well, think about it: p is now very, very large. We take all these differences and raise them to the power p; now imagine one of them is a little larger than the others. Say we look at the difference in age and the difference in weight, and the difference in age is larger than the difference in weight, and you raise both to the power of a thousand. So you have the difference in age, maybe 32, to the power 1000, plus the difference in weight, maybe 6, to the power 1000. Well, 32^1000 is so, so, so much larger than 6^1000 that the second term doesn't matter at all; the sum is completely dominated by the first term. Does that make sense? Then you take the whole thing to the power 1/1000, and that is basically just 32; the other term is epsilon, so tiny compared to the first that it doesn't do anything. So as p gets larger and larger, this effect becomes more and more pronounced, and essentially you are left with just the maximum difference.

(Student question: why do it this way at all?) The answer is: this is the Minkowski distance, and the nice thing is that if you have an algorithm that implements it, then just by changing the value of p you get a whole family of different distances, and that family includes the common distances people think of, the Euclidean distance and the Manhattan distance, as well as the case of just taking the one feature that varies the most, which in some settings is the most important one.
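Here is a minimal sketch of the Minkowski distance and its special cases in plain numpy (the large value of p stands in for the p-to-infinity limit; it is kept moderate so the powers do not overflow floating point):

```python
# A minimal sketch of the Minkowski distance and its familiar special cases.
import numpy as np

def minkowski(x, z, p):
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

x = np.array([50.0, 80.0])    # e.g. [age, weight] of one person (made-up numbers)
z = np.array([18.0, 74.0])    # and of another: the differences are 32 and 6

print(minkowski(x, z, 1))     # p = 1: Manhattan distance, 32 + 6 = 38
print(minkowski(x, z, 2))     # p = 2: Euclidean distance, sqrt(32^2 + 6^2)
print(minkowski(x, z, 50))    # large p: already very close to max(32, 6) = 32
print(np.max(np.abs(x - z)))  # the p -> infinity limit itself: the max difference
```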
Does that make sense? And you can take essentially any non-negative value for p. So that's one reason why the Minkowski distance is so popular: it is fairly general. All right, I have a short demo, but I'm already one minute over, so why don't we start next time with the demo. Can someone please remind me that I haven't shown you the demo yet?