All right, let's get started. Good morning. Welcome back. Any questions before we start?
Yes. Right. Yes.
So the question is, for the candidate elimination algorithm, if you remember, you have to maintain two boundaries, right? One is G and the other is S. And the question is, will S always have just one hypothesis in that most specific boundary?
And the answer is yes, but only if your hypothesis space is conjunctive concepts. One thing I maybe didn't stress enough last time is that candidate elimination is more general than just working with conjunctive concepts. So in general you can have more than one, but for conjunctive concepts, and those are the examples you will see in the Gradiance quizzes and in your exam as well, it will always be one. Okay, any other questions? Okay, so let's move on. Today, and maybe today and the next class, we are going to do a little bit of theoretical analysis of algorithms. In the later part of this course, most of the machine learning we talk about will be more about tools and algorithms that are out there to perform different types of learning tasks under different assumptions and so on.
But before that, we also want to understand how to analyze an algorithm. How do we say this is a good algorithm and that is not a good algorithm?
Unfortunately, these days, if you pick up papers in machine learning, or in data mining where they use machine learning or propose new methods, typically what people do is propose a new algorithm, run it on a few data sets, and if it performs well there, they assume it will perform well in general. However, that's not true. If you really go work in the real world and apply machine learning tools there, you will see that the same algorithm can have remarkably different performance depending on what the data was and what kind of problem it was. So the point I'm trying to make is that even within machine learning the problems can be different, and different in the sense that some may be more challenging than others, which means that if you apply the same machine learning algorithm you could get different sorts of performance. But there are several questions here: how do we define the difficulty of a problem? We are not talking about algorithms now, but about problem spaces.
How do we know which problem space is tougher than another? So let's do a quick recap of what we did so far. We were talking about this notion of concepts, right?
And let's say we are still talking about data that is, you know, d-dimensional vectors. So data is still defined using d binary features. And it is easy to generalize it. You can make it non-binary as well.
But let's say, just for our discussion, that x is still defined as a set of d binary features. Now, we are still talking about binary concepts, where something either belongs to the concept or does not, though this can also be expanded later. What we saw in the past is that the number of concepts you can have is essentially 2 to the power of the size of X, where X is nothing but the set of all data points you can have. And for d binary features, we saw that this is the same as 2^(2^d). So that's all possible concepts.
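To spell that counting out (the d = 3 numbers below are just an illustration I am adding, not from the lecture):

```latex
% With d binary features there are |X| = 2^d possible data points, and a
% binary concept is any subset of X (the points labeled 1), so
\[
  |\text{concepts}| \;=\; 2^{|X|} \;=\; 2^{2^{d}} .
\]
% For example, d = 3 gives |X| = 8 points and 2^8 = 256 possible concepts.
```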
However, what I quickly said in maybe the last two minutes of last class was that learning in this full concept space is hard. In fact, there are theoretical proofs which show that if you do not make any assumption about your concept class, which means you are trying to find the best concept in this large space, you will not be able to find it unless you have all the data labeled, which means you don't really need learning at all. So we saw that one way to make this a little more tractable is to make some assumptions, what we call an inductive bias, in which case we reduce our concept space.
So we make a hypothesis that our target concept actually lies in this smaller space, and you can define that space in many ways. One way we defined this space was conjunctive concepts, where we said that every concept can be expressed as a conjunction of one or more features. But you could have other concept classes as well; for example, you could have disjunctive concepts, and I'll talk about those in a second. Within the same hypothesis class that you have assumed, you can have many concepts, right? So one concept could have been circular AND large.
So essentially what you're saying is that this concept says that any tumor which is circular in shape and large in size is a malignant tumor. Otherwise, it's a benign tumor, right? You could have another concept which says oval and smooth and something else, dark. That could be another concept, right? And this concept says that any tumor which is oval and smooth and dark belongs to, is a malignant tumor.
So maybe these are two different types of tumors and two different diseases. And this one is defined by this, this is defined by this, right? But what is common between these two is that they both belong to conjunctive concepts, right?
So we can apply the same algorithm, for example our candidate elimination algorithm, to learn this one, or we can apply it to learn that one, given training data: you will need some training data that has been labeled by a concept in order to learn that concept, and the same goes for the other one. But you could have another concept class. You could have disjunctive concepts, in which you say anything circular OR large OR dark. You can see this is different from the conjunctive case, because what this says is that if the tumor is either circular or large or dark, all three or two or just one, at least one of these things has to be true, then I'm going to call it a malignant tumor.
Right. So the concepts then that can be written in this form are known as disjunctive concepts. So they would be a separate class here as well.
OK. I mean, there will be some overlap between the two, because a concept which is just "circular" could be here as well as here, since it means the same thing in both, and the concept that rejects everything will also belong to both. So there could be overlap, but the idea is that you can have different types of classes, and what we want to understand is: is one class more difficult to learn than another? You can see that the size of the disjunctive class will be much larger, because these concepts apply more broadly, while the conjunctive ones are a little more restrictive. But then you can have other types of concept classes as well. I could have a concept class which says a concept can only have exactly two features in the expression, that's it, just two in a conjunction, things like that.
So I can come up with these; I can make those assumptions. That would be another class somewhere here, with some overlap, and so on. The point is that you can have different types of concept classes, which means you can think of each of these as a problem: each one of them is a problem space from which you need to find your target concept. So the question you want to ask is which one is harder, which one is easier, and so on.
Right? So let's switch. Any questions so far? Okay.
So let's quickly switch to our slides here, where I have something. What we want to do now is understand the complexity of these problem classes. And what we will see is that it's not easy to just say the complexity is four or five, because you're looking at many possibilities. So what we can actually do is put some bounds: the complexity will not be more than this, or not less than this, and things like that. Now the question is, how do we measure the complexity? What do we call a more complex problem or a less complex problem?
So you can define it in many ways. One is maybe you define it in terms of how much data you need; next week we will see one way to measure problem spaces in terms of how much data you need. Another could be how much resource you need: maybe one class of algorithms needs a lot of CPU cycles and another does not.
So you can measure it with that as well, but we will not look at that at all. The third one, which is what we are going to look at today, is how many mistakes you make while you're doing the learning. This is specific to one type of learning which we call online learning, so I'll give you a description of that.
So there are three, and there could be more, but here there are three ways to measure complexity. One is how much data you need, and we will see later how to measure algorithms or problem spaces using that metric. One could be how much resource you need. And one is how many mistakes you make. So today we are going to look at how many mistakes you make while you are doing your learning. To do that we have to look at machine learning in a slightly different way, which is what we call online learning.
So, online learning. I'm going to change some of the notation just so that it's easier to talk about, but a lot of this is very similar to what we have seen so far. What we are going to say is that our data points x are vectors of d binary features. That is the same as before, but instead of giving names to each possibility, we are going to say that each feature is either true or false. So this is just a change in notation. So far we had seen features like circular or oval, right?
And we saw dark or light, and smooth or irregular. All we are doing is saying that if it is circular we call it true, and if it is oval we call it false. Okay, so that's just a slightly different way of writing the same thing. The reason I'm doing that is because then I can easily write a data point: x_k is just one possible x, and it could be (true, true, false, false, true).
It's easier to write that as TTFFT, or maybe 1 1 0 0 1, and so on. So that's how I'm changing it a little bit. Okay. Then I have some training data, and I have a target concept, I almost forgot about that. The target concept is still our c, which takes a data point x and gives you either zero or one, depending on whether it belongs to the concept or not, so that part is still the same. And c belongs to some concept class, a calligraphic C, which is where we put our assumption: we say, okay, it's all conjunctive, or disjunctive, and so on. So that is our class. And our algorithm works in an iterative fashion. We say for i equal to 1 through n, let's say we have n training data points. For every data point, the learner, that's our machine learning algorithm, is given object number i, so x_i, which obviously belongs to X. Then the learner predicts c*(x_i), where c* is what the learner currently thinks the target concept is. It initializes some c*, the concept that it thinks is the target; this is like our other algorithms, where we start with some hypothesis. So you use that hypothesis to predict what the class of x_i is going to be. And then the learner is told c(x_i), which is the true label. So this is what you predict, and this is the truth. Now what the learner does is update c*. Two things can happen: maybe it did not make a mistake, or maybe it made a mistake, right?
So there are two scenarios. Case one is c*(x_i) equals c(x_i), which means there is no mistake, and the algorithm may have some way to update its c*; maybe it doesn't do anything, or it does something specific to the algorithm. The second case is that c*(x_i) is not equal to c(x_i), which means it made a mistake. And it keeps doing this until the training data is exhausted. So that's the setup, that's the online learning setup. You start with some guess about your hypothesis, and as the data comes in, you take a data point, you make a prediction, then you see what the right answer was; two things can happen, you make a mistake or you don't, and depending on the case you update your c* accordingly. So that's the algorithm. The objective of these learning algorithms is to eventually learn c, and to do it with as few mistakes as possible.
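Just to make that protocol concrete, here is a minimal sketch in Python. The `learner` object with `predict` and `update` methods, the `run_online` function, and the concept argument are all hypothetical names I'm introducing; the actual update rule is whatever the specific algorithm prescribes.

```python
# Minimal sketch of the online learning protocol, assuming a hypothetical
# learner with predict(x) and update(x, label) methods.

def run_online(learner, examples, c):
    """Feed examples one at a time and count the mistakes made along the way."""
    mistakes = 0
    for x in examples:                    # x_i arrives
        prediction = learner.predict(x)   # learner predicts c*(x_i)
        truth = c(x)                      # learner is then told the true label c(x_i)
        if prediction != truth:           # case 2: a mistake
            mistakes += 1
        learner.update(x, truth)          # learner revises its hypothesis c*
    return mistakes
```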
So that is how we measure the performance of this kind of algorithm. If an algorithm makes a lot of mistakes until it learns c, then it's not a very good algorithm.
But if an algorithm makes very few mistakes, it's a good algorithm. You can think of it with the tumor example: maybe there's a radiologist, and on the first day the radiologist doesn't know anything, but he or she has some hypothesis in mind. A new case comes in, a new tumor image, and the radiologist makes a prediction: yes, it's malignant, or it's not. Later that day the radiologist is told whether it actually was or not, and if there was a miss, the hypothesis gets refined. But during that time the radiologist might have made a few mistakes, sometimes calling a tumor not a tumor, or a non-tumor a tumor. You want to minimize that, because you do not want to make too many mistakes even during the learning phase, since you are already using the method; that's why we call it online.
You are using it while you are training. A lot of machine learning methods are used that way: you keep using the model as the data comes in, and at some point you will have learned enough and you're fine, but you also want to minimize your mistakes up to that point. So that is what we want to minimize, and that's what I had here: the number of mistakes made. Any questions? Yes? Yeah, we start with some initial guess; that c* is our hypothesis saying, okay, this is the right answer, and then slowly you refine it. What I have not given you in this algorithm is the exact steps of how the hypothesis is updated; that is specific to different algorithms. We will see one called Winnow later, where you will see how it updates. Initially? You're asking about the initial guess: initially, of course, it's up to the algorithm, so you can maybe assume that the initial hypothesis is accept everything, or reject everything; that will be very specific to the design of the algorithm. Any other questions? Yes? So the question is whether the notion, or the measure, of mistakes is totally dependent on what training sample you get, right?
That is why, when we analyze our algorithms, we will count the number of mistakes over all possible training data sets you can have. You want to generalize to all possible ones, because, as I said in the beginning, it could happen that for one particular set you do very well. But that should not be the only way to measure it; we want to generalize. In fact, the next slide does exactly that.
Alright. Any other questions? Okay. So again, so the setup is this. You have a learning algorithm L and the details of L will come later.
That's up to you. You have a target concept that you're trying to learn, and the target concept belongs to this concept class. Sometimes I call it the hypothesis class, but it's basically what you arrive at after applying your inductive bias.
And D is one data set, only one particular sequence of training examples; as you said, it will depend on that sequence. So let's consider one D. This number, M_L(C, D), is the number of mistakes that your learning algorithm L makes to learn C using D as your training data. Let's start with this. Of course, as you see, it depends on what your concept C is, because even within the same concept class you can have different concepts, and even among conjunctive concepts one concept could be easier than another. So right now this number is specific to the concept, the data set, and of course the algorithm. Now, to make it a little more convincing, or more acceptable, we want to find the number of mistakes you make for any possibility. So you define this value, which is the max over D, the worst you can do for any D. This is the number of mistakes the algorithm would make to learn this concept on the hardest training data set available, because you can come up with many different sequences, not infinite, but a large number of possibilities, and over all of them, what is the worst you can do? That's M_L(C). And then we go further, because, as I said, C also determines how easy or tough things are, so you also take out C: let's find the worst you can do for any concept within the class. That is the worst-case scenario for this learner in learning any concept in C. That is the number you want to measure; then you can really understand how good this learner, this learning algorithm, is.
Of course you can't measure it exactly, because there are too many possibilities to evaluate, but we can at least get some bounds, and I'll come to that. But actually we want to go one step further: we want to remove the learner as well. What this says is that this number, Opt(C), is how many mistakes the best possible learner would make for this concept class, and this defines the complexity of the concept class. So let's say C is the class of conjunctive concepts: we want to figure out, over all possible machine learning algorithms, what is the best you can do in learning the toughest concept on the hardest training data. That is what this defines. We want to find this number for a concept class because it helps us distinguish between two concept classes. Maybe for a concept class like conjunctive concepts, Opt is 4, and for disjunctive concepts it's 5; then we can say that learning conjunctive concepts is easier than learning disjunctive concepts.
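Written compactly, the three quantities from the slide nest like this:

```latex
% M_L(C, D): mistakes learner L makes while learning target concept C
% from the training sequence D.
\[
  M_L(C) \;=\; \max_{D} \; M_L(C, D)
  \qquad\text{(worst sequence for this concept)}
\]
\[
  M_L(\mathcal{C}) \;=\; \max_{C \in \mathcal{C}} \; M_L(C)
  \qquad\text{(worst concept in the class)}
\]
\[
  \mathrm{Opt}(\mathcal{C}) \;=\; \min_{L} \; M_L(\mathcal{C})
  \qquad\text{(the best learner's worst case)}
\]
```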
The reason this is important is that when you go out, let's say you go to work in some company, and somebody gives you a machine learning problem to solve, you might say: I can do this while making 20 mistakes, that's it. And that is good. But tomorrow they change the problem, they apply your algorithm, and suddenly it's making 40 mistakes.
And the person says, hey, why is there a difference here? Yesterday it was making 20, today it's making 40.
The answer is that you point them to this: maybe the new problem is tougher, more challenging to solve, and that's why the algorithm is making more mistakes. So that is the whole idea behind this: how do we define the complexity, the hardness, of the problems over all possible learners L?
Of course, anybody can look at this and say it's very hard to get this number. Well, it's maybe okay to get the first one, M_L(C, D), because there is a particular data set and you have a concept, so you can measure this.
This is easy. The next one becomes a little harder, because you have to try out all possible sequences of data points there can be. So it becomes harder, right?
And then here you have to iterate over all possible concepts. That also becomes complex. But after that, it just becomes impossible to do because you have to iterate over all possible learning algorithms, which is...
Impossible, right? Because every day someone can come up with a new learning algorithm. But what we can do is set bounds on this, bounds on the optimal number of mistakes. So this, Opt(C), is what we want to find: the number of mistakes that the best learning algorithm would make in the worst-case scenario, hardest concept and hardest data set. We can't compute it exactly, but we can find some easy bounds on it. For example, 2^d is an obvious bound, because if your data is defined using d binary features, then 2^d is the number of distinct data points you can have, and you can't make more mistakes than the number of data points you have, right?
So 2^d is a bound, but it's a pretty loose bound; we want to go tighter, we want finer numbers here. The strategy for getting tighter bounds is to come up with a specific algorithm,
and then find this value for that algorithm, which is the worst case that this algorithm can do for any concept in this class. And that becomes an obvious bound, because Opt is the minimum over all possible algorithms.
So if you have any one algorithm, Opt will always be less than or equal to that algorithm's worst case; that's just how it is defined. So what we can do, essentially, is come up with an algorithm and figure this number out for it, which itself is hard, but for certain algorithms and certain concept classes you can come up with that number. For example, we saw list-then-eliminate. List-then-eliminate makes at most the size of the concept class minus one mistakes. What list-then-eliminate does is this: you have your whole concept class, and when you get a new data point, you first try to make a prediction, and then you throw away all the concepts that do not agree with the revealed label. So at the end of it, you need at most that many steps until you have only one concept left, which means that's the maximum number of mistakes you can make. When you go back, you can look at it again.
And I think there is a Gradiance problem, this time or next time, which will give you that intuition as well: why list-then-eliminate makes at most that many mistakes for any data set. Because that is the total number of concepts there are, and on every mistake you throw away some concepts.
So the worst case is that every time you make a mistake, you throw away just one concept, one concept at a time, and at the end you are left with your target concept, having made size-of-the-class-minus-one mistakes. I'm sort of waving my hands here, but I think that should convince you.
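Here is a rough sketch of one round of list-then-eliminate, assuming you could actually hold the whole concept class in a list (you usually can't); the function and variable names are mine. In the worst case each mistake eliminates only a single hypothesis, which is where the size-minus-one bound comes from.

```python
# One round of list-then-eliminate in the online setting. Hypotheses are
# callables returning 0 or 1; version_space starts out as the full concept class.

def list_then_eliminate_step(version_space, x, true_label):
    """Predict with any surviving hypothesis, then drop the inconsistent ones."""
    prediction = version_space[0](x)       # predict using some surviving hypothesis
    mistake = (prediction != true_label)
    # Keep only hypotheses that agree with the revealed label.
    survivors = [h for h in version_space if h(x) == true_label]
    return survivors, mistake
```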
All right, it turns out there are other algorithms with which you can get even tighter bounds, and so far these algorithms are not even making any assumptions about what kind of concept class it is; it could be disjunctive, conjunctive, or anything. So for example, and we'll talk about this, there is one algorithm called the halving algorithm, which is purely academic; you can't really use it in real life.
But what you can show with that algorithm is that for any data set and for any concept in the concept class, it takes at most log base 2 of the size of the concept class mistakes to learn the right concept. We'll look at this right away.
But even this might be very large: if your concept class is large, then even log 2 of its size might be large, and you might want to go even further down. So we will see another algorithm with which you can get even tighter bounds, but only by making assumptions about your concept class C at that point. Any questions so far?
Yes? Uh-huh. So the question is, since there's only one target concept, will list-then-eliminate make zero mistakes? The point is that for certain sequences of data it might indeed make zero mistakes; that can happen. But what we want to figure out is, for the worst possible sequence of data points, how many mistakes does it make?
So you just need to think about the worst case. In the worst case, this is your concept space, you get a new data point, and every time you pull out only one concept from the version space and chuck it out, because that's what you do.
So you will need the size of the class minus one steps to do that, and in the worst case, at every one of those steps you make a mistake. That's the worst number of mistakes you can make. In reality you might make fewer than that; this is worst-case analysis.
Any other questions? Yes. No, no, you mean the uppercase calligraphic C?
That's our concept class, or hypothesis class. In the case of conjunctive concepts, its size would be 3^d + 1, or, if you don't make any assumptions, 2^(2^d). In fact, for certain concept classes, this number might be larger than that other number as well.
It could happen. I just wrote it down like this. But it depends on what concept class there is. Okay?
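To give a feel for those sizes, here is a small worked comparison; the d = 4 numbers are my own illustration, not from the lecture.

```latex
% Size of the hypothesis class for d binary features:
\[
  |\mathcal{C}_{\text{conjunctive}}| = 3^{d} + 1,
  \qquad
  |\mathcal{C}_{\text{all concepts}}| = 2^{2^{d}} .
\]
% For d = 4:  3^4 + 1 = 82  versus  2^{16} = 65{,}536.
```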
All right. So now let's look at halving algorithm. Because the reason we are looking at halving algorithm is because it lets us get tighter bounds. Theoretical machine learning is all about getting tighter and tighter bounds.
Right, so the halving algorithm is pretty easy to understand, and you don't want to use it in real life; if you go to a job interview and say "I'm going to apply the halving algorithm", the interview won't go very well. But for theoretical analysis it's pretty nice. So the idea there is that you first define two sets. Given your current version space, in fact I should maybe switch to my document camera, I'll fix that later, but the idea is this: for any version space V that you have and a data point x, you define a set, call it V_0. Let's say this is your current version space, meaning you have whittled down your hypothesis space and this is what you're left with; you think the target concept lies in there. So V_0 is a particular subset of this V.
Let's call the hypotheses h: V_0 is the set of all h that belong to V such that h(x) = 0. The simple way to understand this is: let's say this is our version space right now. Some of the hypotheses in this version space are going to label x as belonging to the concept, as one, and some will label it as zero, not belonging to the concept. The first group is what we call V_1 and the second is what we call V_0. Okay, so that's the definition. Now what does the halving algorithm do? At any iteration of that online learning loop, it looks at the new x and tries to figure out these two sets.
The way it does that is it takes every hypothesis in the version space, applies it to x, sees whether it gives 0 or 1, and constructs the two sets, which of course will be disjoint. Then it sees which one is bigger. If V_0 is bigger, then my prediction for x, which we call c*(x), is 0, because more hypotheses in my current space say it is 0; else my prediction for x is 1.
Okay, so that's my prediction. Then it checks whether it made a mistake, because now it gets the true label, so it can compare what it predicted with the truth. If there is a mistake, if the true label was zero and it predicted one or the other way around, then you remove the hypotheses you made the mistake with: remove them from V. You see what I mean? Let's take an example: say V_1 was the bigger set, so you predicted one, and you made a mistake. What we do is remove all of V_1, because those are the ones that gave us the wrong answer, so I just chuck those out.
Next time I'll be left with a smaller set. Now you get a new data point, you again divide the version space into two parts, the ones that say one and the ones that say zero, you make the majority prediction, compare it with the true label, and if you make a mistake you again chuck things out. So what you can easily see is that your space becomes smaller every time you make a mistake; if you do not make a mistake, you don't do anything. And eventually you want to arrive at your concept c. So initially you start with the whole class C as your version space, and at every mistake you remove some part of it.
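Here is a rough sketch of one round of the halving algorithm, under the unrealistic assumption that the version space V can be held explicitly as a list of hypotheses; the names V1, V0, and halving_step are mine.

```python
# One round of the halving algorithm. Hypotheses are callables returning 0 or 1;
# V starts out as the full concept class and shrinks only on mistakes.

def halving_step(V, x, true_label):
    """Predict by majority vote over V; on a mistake, drop the side that was wrong."""
    V1 = [h for h in V if h(x) == 1]      # hypotheses that label x as 1
    V0 = [h for h in V if h(x) == 0]      # hypotheses that label x as 0
    prediction = 1 if len(V1) > len(V0) else 0
    if prediction != true_label:
        # Mistake: the (larger) side we trusted was wrong, so at least half of V goes.
        V = V1 if true_label == 1 else V0
        return V, True
    return V, False                       # no mistake: this version does nothing
```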
So now we need to figure out the worst you can do, how many mistakes you will make. The way to understand this is to look at what happens on a mistake.
What happens on a mistake is that you reduce the size of your version space. In the worst case you reduce it by exactly half; you could reduce it by more, but it will always be half or more, because you chose the set which was bigger, and that's where you made the mistake, so you always remove the bigger set. What happened?
We removed the bigger one. So in the worst case maybe that bigger chunk was exactly half, but it could be larger than that. We want to do worst-case analysis, though, so let's say at every mistake you remove exactly half. Then in at most log base 2 of the size of C mistakes you will be left with one concept, which will be the target concept.
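The arithmetic behind that statement, in one line:

```latex
% Each mistake removes at least half of the version space, so after m mistakes
\[
  |V| \;\le\; \frac{|\mathcal{C}|}{2^{m}} ,
\]
% and since we stop only when |V| = 1,
\[
  1 \;\le\; \frac{|\mathcal{C}|}{2^{m}}
  \quad\Longrightarrow\quad
  m \;\le\; \log_{2} |\mathcal{C}| .
\]
```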
So the size of V drops to 1 in at most log_2 of the size of C steps; you could need fewer if your chunks are bigger. That implies the number of mistakes needed to learn the target concept c is at most log_2 of the size of C, because you only throw things away when you make a mistake, and each mistake cuts the space at least in half. Any questions so far? Yes, there's a question, and there's a Gradiance problem about this as well. You mean which one to remove? Yes, right, right.
You remove the one on which you made the mistake. You said, I'm going to predict one for this object because the "one" set was bigger, and you made a mistake, so you remove all of them, because every hypothesis in that larger set was inconsistent with this data point. Yes, you're right: you remove the set that you chose to make the prediction with. I'm sorry, what's that? In case of no mistake? No, you don't do anything, at least for this algorithm, because if you do not make a mistake, the target concept is still in there and nothing forces you to remove anything. I mean, maybe you could, but for this algorithm you don't. It's up to you, so you could design it in a way that does some refinement
even when you don't make a mistake. In a real, practical machine learning algorithm, that's what you do; even the correct answers get used in some way. But for this algorithm you don't, because this algorithm is purely there to understand these bounds.
So we don't want to make it more complex. Alright? Any questions so far? Any other questions?
Yes. Aha! So that's a good question. So the question is, can it so happen that by mistake, my algorithm removed the half that contained the C, the target concept?
And the answer is that it will not happen. The reason it will not happen is that if your true concept is in that version space, then... actually, I know that problem is in Gradiance next time, so I'll just post a note, oh, on Piazza, about this.
Yeah, I forgot the exact example, but the question is: will it ever make a mistake like that, where it accidentally removes the part that contained the target concept? And the answer is no, but I'll give you the reason a little later. Okay, any other questions? All right.
So one point I want you to see is that this algorithm, even though it's very interesting in that it gives you these bounds, is not computationally feasible, because every time you have to start with the whole version space and keep halving it. So it's not a good algorithm at all. Our next question is: can we come up with algorithms which we can actually implement, which are feasible to implement, but which also give us similar mistake bounds? The answer is that for a general concept class you cannot.
However, for certain concept classes, you can come up with algorithms that you can actually code, and I'll show you the code for those, and you can also show that they won't make more than a certain number of mistakes. Okay, so let's look at that algorithm; we call it Winnow. The first thing we want to do is redefine our concept class. For the Winnow algorithm, the concept class we will look at is not the conjunctive concept class; it's the monotone disjunctive concept class with at most k variables, or features. So let's see what that means.
You know what disjunction is, right? Disjunction essentially means ORs, logical ORs. So a concept in this class will always have this form: some feature or another feature or another feature, and so on. And another restriction we put on is that it can only have k such literals, not more than k. The reason we do that is because only then can we do this analysis.
Otherwise we can't. Okay, so the concept class has at most k such features. The reason we call it monotone is that we make another restriction: it will only use the positive values of the features. Remember, earlier in the class I said that instead of using feature values like circular or oval, I'll call them true or false. So the idea is that you can write these disjunctions like this: let's say x is our data object, and x_i is the i-th feature, which can be either true or false. Then a concept in this class can be written as x1 OR x3 OR x4, which means I'm going to call a tumor malignant if the first feature is true, or the third feature is true, or the fourth feature is true. The reason we call it monotone is that it does not include negated features, like x1 OR NOT(x3) OR x4. That is not allowed; it does not belong to our k-monotone disjunctions. Again, the reason we restrict to monotone is that it allows us to do this analysis; otherwise it becomes more complex.
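Just to make the concept class concrete, here is a small sketch of such a concept as a predicate over a binary feature vector; the function name and the 1-based indexing are my own choices for illustration.

```python
# A monotone disjunction over binary features, e.g. x1 OR x3 OR x4.
# Feature indices are 1-based here to match the x1, x3, x4 notation.

def monotone_disjunction(relevant):
    """Return a concept c(x) that is 1 iff any of the relevant features of x is true."""
    def concept(x):                      # x is a tuple of booleans (or 0/1 values)
        return int(any(x[i - 1] for i in relevant))
    return concept

c = monotone_disjunction([1, 3, 4])      # the concept x1 OR x3 OR x4
print(c((1, 0, 0, 0)))                   # 1: x1 is true
print(c((0, 1, 0, 0)))                   # 0: only x2 is true, and x2 is not in the disjunction
```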
So there are three things. First, they are disjunctions of feature values. Second, they can only have at most k features, k or fewer. And third, they are monotone, which means you only consider the positive values of your features in the disjunction. So a concept could be x1 OR x4, for example.
So this is our class C, the k-monotone disjunctions. The first question you want to ask is: what is the size of C, how many k-monotone disjunctions can you have?
The answer to that is just combinatorics. Let's say we have d features in total, and a concept can have at most k of them. Let's first look at the simple case of exactly k: how many exactly-k monotone disjunctions can we have? You can choose k of the d features in d-choose-k ways. That's just combinatorics: how many ways can you pick a certain number of things from a larger set?
So you can choose k, which means you can construct that many exactly-k monotone disjunctions. Then, if you want at most k, you have to enumerate from i equal to 0 all the way up to k: the size of C is the number of ways to choose 0 features, plus choose 1, plus choose 2, and so on, up to choose k. That is the at-most-k case. And if you work the math out, what you will see is that, in our sort of theta notation, log_2 of the size of C comes out to roughly k log_2(2d). That is just computing this sum and writing it in compact form. So you can quickly see that if you apply your halving algorithm to learn these concepts, you will make on the order of k log_2(2d) mistakes, because that is the log of the class size.
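Putting the counting argument into symbols:

```latex
% Exactly k of the d features can be chosen in \binom{d}{k} ways;
% "at most k" sums over the possible sizes:
\[
  |\mathcal{C}| \;=\; \sum_{i=0}^{k} \binom{d}{i} \;\le\; (2d)^{k} ,
\]
% so the halving-algorithm mistake bound on this class is
\[
  \log_{2} |\mathcal{C}| \;\le\; k \log_{2}(2d) .
\]
```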
So that is the worst-case mistake bound of the halving algorithm on this type of concept class. What we want to do now is come up with an algorithm that we can actually implement, not just run in that halving kind of way, but which still keeps a mistake bound of this order. That is what our Winnow algorithm will do: it's an algorithm that is easy to implement, and what we will see is that it makes about the same number of mistakes as the halving algorithm.
So it's a good algorithm in both ways. And the reason we're going to look at Winnow is because Winnow is our stepping stone to perceptrons, and perceptrons are our stepping stone to neural networks. So next class, hopefully, we'll finish Winnow and then move to perceptrons. And the programming assignment, any questions before that? Yes? All right, as for Winnow, when we get to the algorithm you will see that it's pretty easy to implement.
The reason the halving algorithm is not a good algorithm is because you have to start with the entire concept class in your version space, which could be huge. So it's just in terms of how you implement it, its complexity. All right, the last announcement I wanted to make is that the programming assignment will be out on Monday.
Yeah. So it will be out on Monday, but we won't really have covered the material you need for it yet, so you might not be able to start working on it right away.
But by Friday you should have everything you need to start implementing it. All right. Any other questions?
OK. Thank you very much. Have a good weekend and I'll see you on Monday.