Hi everyone. In this video I would like to present the CS229 course, the machine learning course delivered by the esteemed, world-renowned Andrew Ng at Stanford University. Andrew Ng is a globally recognized leader in the field of artificial intelligence and machine learning, with a career that has profoundly shaped the field. As a professor at Stanford University, he has not only educated some of the brightest minds but also co-founded Coursera and DeepLearning.AI, pioneering efforts to make AI education accessible worldwide. He served as the co-founder and head of Google Brain, building its artificial intelligence group into a team of thousands. Named among Time magazine's 100 most influential people and recognized by Fast Company as one of the most creative people in the world, Andrew continues to influence the world of AI and machine learning as a board member at Amazon and as one of the Time 100 most influential people in AI in 2023.

CS229 provides a thorough introduction to the principles of machine learning and statistical pattern recognition. The curriculum encompasses foundational concepts such as supervised and unsupervised learning, learning theory, reinforcement learning, and adaptive control. In addition to the theoretical underpinnings, the course delves into real-world applications of machine learning, spanning fields like robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing.

Originally recorded in 2008, this course remains one of the most influential resources in the field of machine learning education. Its timeless approach to the subject continues to equip learners like yourself with the essential skills and understanding needed to excel in this rapidly evolving and highly in-demand field. We would like to extend our heartfelt gratitude to Stanford University and to Andrew Ng for their generosity in sharing this exceptional course with us and with learners across the globe. Their commitment to education and innovation has made it possible for countless individuals to access world-class knowledge and resources at their fingertips. If you would like to support the Stanford School of Engineering and its mission to advance learning and discovery, we highly encourage you to explore the links provided below. Your contributions, whether through direct support or by participating in their open-source collaboration projects, will play a vital role in sustaining and expanding their work. So, if you're ready, we are really excited, and without further ado, let's get started.

This presentation is delivered by the Stanford Center for Professional Development.

Okay, good morning. Welcome to CS229, the machine learning class. So what I want to do today is spend some time going over the logistics for the class, and then we'll start to talk a bit about machine learning. By way of introduction, my name is Andrew Ng, and I'll be the instructor for this class. I personally work in machine learning, and I've worked on it for about 15 years now, and I actually think that machine learning is, you know, the most exciting field of all of computer science, so I'm actually always excited about teaching this class. Sometimes I actually think that machine learning is not only the most exciting thing in computer science, but the most exciting thing in all of human endeavor, so maybe a little bias there. I also want to introduce the TAs, who are all graduate students doing research in or related to machine learning, and are all experts in machine learning.
Paul Baumstarck works in machine learning and computer vision. Catie Chang is actually a neuroscientist who applies machine learning algorithms to try to understand the human brain. Tom Do is another PhD student who works in computational biology and in sort of the basics and fundamentals of machine learning. Zico Kolter is the head TA (he's been a TA two years in a row now) and works on machine learning and applies it to a bunch of robots. And Daniel Ramage, I guess he's not here; Daniel applies learning algorithms to problems in natural language processing.

So you'll get to know the TAs and me much better throughout the quarter, but just from the sorts of things the TAs do, I hope you can already tell that machine learning is a highly interdisciplinary topic, in which just the TAs alone are applying learning algorithms to problems in computer vision and biology and robots and language. And machine learning is one of those things that has had, and is having, a large impact on many applications. Just in my own daily work, I actually frequently end up talking to people like helicopter pilots, to biologists, to people in computer systems or databases, to economists, and there's also sort of an unending stream of people from industry coming to Stanford, interested in applying machine learning methods to their own problems.

So, this is fun: a couple of weeks ago, a student actually forwarded to me an article in Computerworld about the 12 IT skills that employers can't say no to, sort of the 12 most desirable skills in all of IT, all of information technology, and topping the list was actually machine learning. So I think this is a good time to be learning this stuff, and learning algorithms are having a large impact on many segments of science and industry.

I'm actually curious about something. Since learning algorithms are one of those things that touch many areas of science and industry, I'm kind of curious: how many people here are computer science majors, or in the computer science department? Okay, about half of you. How many people are from EE? Okay, maybe about a fifth. How many biologists are there here? Wow, just a few, not many; I'm surprised. Anyone from statistics? Okay, a few. So where are the rest of you from? Say again? ICME, cool. And what else? Systems, cool. Aero/Astro, cool. Anyone else? MS&E? Right, cool. Pardon? Oh, I see, industry. Okay, cool, great.

So, as you can tell from the cross-section of this class, I think we're a very diverse audience in this room, and that's one of the things that makes this class fun to teach and fun to be in. So in this class, I'll try to convey to you a broad set of principles and tools that will be useful for doing many, many things. And every time I teach this class, I can very confidently say that after December, no matter what you're going to do after this December, when you have completed this class, you'll find the things you learn in this class very useful, and these things will be useful pretty much no matter what you end up doing later in your life.

So I have more logistics to go over later, but let's say a few more words about machine learning. Machine learning grew out of early work in AI, early work in artificial intelligence, and over the last 15, 20 years or so,
it's been viewed as a sort of growing new capability for computers. In particular, it turns out that there are many programs, many applications, that you can't program by hand. For example, if you want to get a computer to read handwritten characters, to read handwritten digits, it turns out to be amazingly difficult to write a piece of software that takes as input an image of something that I wrote and figures out just what it is, that translates my cursive handwriting and extracts the characters that I wrote out in longhand. As another example, one thing that my students and I work on is autonomous helicopter flight; it turns out to be extremely difficult to sit down and write a program to fly a helicopter.

But in contrast, if you want to do things like these, to get software to fly a helicopter, or to have software recognize handwritten digits, one very successful approach is to use a learning algorithm and have a computer learn by itself how to, say, recognize your handwriting. And in fact, for handwritten digit recognition, this is pretty much the only approach that works well. These are the sorts of applications that are hard to program by hand.

Learning algorithms have also made, I guess, significant inroads in what's sometimes called database mining. So, for example, with the growth of IT and computers, increasingly many hospitals are keeping around medical records of what sorts of patients they had, what problems they had, what their prognoses were, what the outcomes were. And taking all of these medical records, which started to be digitized only about maybe 15 years ago, and applying learning algorithms to them, we can turn raw medical records into what I might loosely call medical knowledge, in which we start to detect trends in medical practices, and even start to alter medical practice as a result of medical knowledge derived by applying learning algorithms to the sorts of medical records that hospitals have just been building up over the last, you know, 15, 20 years in electronic format.

It turns out that most of you probably use learning algorithms, I don't know, I think half a dozen times a day, or maybe a dozen times a day or more, and often without knowing it. For example, every time you send mail via the US Postal System, it turns out there's an algorithm that tries to automatically read the zip code you wrote on your envelope, and that's done by a learning algorithm. So every time you send US mail, you're using a learning algorithm, perhaps without even being aware of it. Similarly, every time you write a check (I actually don't know the number for this, but a significant fraction of the checks that you write) the dollar amount that you wrote down on your check is processed by a learning algorithm that's learned to read the digits. So every time you write a check, there's another little learning algorithm that you're probably using without even being aware of it.

If you use a credit card, and I know at least one phone company was doing this, and lots of companies like eBay as well that do electronic transactions, there's a good chance that there's a learning algorithm in the back end trying to figure out if, say, your credit card's been stolen, or if someone's engaging in a fraudulent transaction. If you use a website like Amazon or Netflix, they will often recommend
books for you to buy, or movies for you to rent, or whatever. These are other examples of learning algorithms that have learned what sorts of things you like to buy, what sorts of movies you like to watch, and can therefore give customized recommendations to you. And, I don't know, just about a week ago I had my car serviced, and even there, my car mechanic was trying to explain to me some learning algorithm in the innards of my car that's sort of doing its best to optimize my driving performance for fuel efficiency, or something. So, as I said, most of us use learning algorithms half a dozen, a dozen, maybe dozens of times a day without even knowing it.

And of course, learning algorithms are also doing things like giving us a growing understanding of the human genome. So if someday we ever find a cure for cancer, I bet learning algorithms will have had a large role in that; that's sort of the thing that Tom works on, yes.

So in teaching this class, I sort of have three goals. One of them is just to help convey some of my own excitement about machine learning to you. The second goal is that by the end of this class, I hope all of you will be able to apply state-of-the-art machine learning algorithms to whatever problems you're interested in; and if you ever need to build a system for reading zip codes, you'll know how to do that by the end of this class. And lastly, by the end of this class, I realize that only a subset of you are interested in doing research in machine learning, but by the conclusion of this class, I hope that all of you will actually be well qualified to start doing research in machine learning.

So let's say a few words about logistics. The prerequisites of this class are written on one of the handouts, and are as follows. In this class, we're going to assume that all of you have basic knowledge of computer science, and basic knowledge of computer skills and principles. So I assume all of you know what big-O notation is, that all of you know about data structures like queues, stacks, and binary trees, and that all of you have enough programming skills to, you know, write a simple computer program. And it turns out that most of this class will not be very programming-intensive, although we will do some programming, mostly in either MATLAB or Octave; I'll say a bit more about that later.

I also assume familiarity with basic probability and statistics. Most undergraduate statistics classes, like Stat 116 taught here at Stanford, will be more than enough. I'm going to assume that all of you know what random variables are, that all of you know what expectation is, and what the variance of a random variable is. And in case it's been a while since you've seen some of this material, some of the discussion sections will actually go over the prerequisites, sort of as a refresher course on the prerequisite classes; I'll say a bit more about that later as well.

Lastly, I also assume familiarity with basic linear algebra, and again, most undergraduate linear algebra classes are more than enough. So if you've taken classes like Math 51, 103, Math 113, or CS 205 at Stanford, that would be more than enough. Basically, I'm going to assume that all of you know what matrices and vectors are, that you know how to multiply matrices and vectors, and multiply matrices and matrices, and that you know what a matrix inverse is. If you know what an eigenvector of a matrix is, that'd be even better, but if you're not quite sure, that's fine too; we'll go over it in the review sections. Okay.
So there are a couple more logistical things I should deal with in this class. One is that, as most of you know, CS229 is a televised class; in fact, I guess many of you are probably watching this at home on TV, so let me say hi to our home viewers. Earlier this year, I approached SCPD, which televises these classes, about trying to make a small number of Stanford classes publicly available by posting the videos on the web. And so this year, Stanford is actually starting a small pilot program in which we'll post videos of a small number of classes online, on the internet, in a way that makes it publicly accessible to everyone. I'm very excited about that, because machine learning is cool; let's get the word out there.

One of the consequences of this is that, let's see, videos or pictures of the students in this classroom will not be posted online, so don't worry about being surprised by seeing your own face appear on YouTube one day. But the microphones may pick up your voices, so I guess the consequence of that is that no matter how irritated you are at me, don't yell out swear words in the middle of class. But because there won't be video of you, you can safely sit there and make faces at me, and that won't be shown. Okay.

Let's see. I also handed out two handouts, and I hope most of you have the course information handout, so let me just say a few words about parts of it. On the third page, there's a section that says online resources. (Yeah? Oh, okay, louder? Actually, could you turn up the volume? Testing, is this better? Testing, testing. Okay, cool, thanks.)

So, right, online resources. The URL of the course homepage is written on the handouts, so I won't write it on the chalkboard: http://cs229.stanford.edu. In the interest of saving trees, we will usually not give out many handouts in class, so homework assignments and homework solutions will be posted online at the course homepage. As part of this class, I've also written, and I guess I also revise every year, a set of fairly detailed lecture notes that cover the technical content of this class. So if you go to the course homepage, you'll also find these detailed lecture notes, which go over in detail all the math and equations and so on that I'll be doing in class.

There's also a newsgroup, su.class.cs229, also on the handout. This is a newsgroup that's sort of a forum for people in the class to get to know each other and have whatever discussions you want to have amongst yourselves. The class newsgroup will not be monitored by the TAs and me, but this is a place for you to form study groups, or find project partners, or discuss homework problems, and so on. And since it's not monitored by the TAs and me, feel free to talk trash about this class there.

If you want to contact the teaching staff, please use the email address written down here: cs229-qa@cs.stanford.edu.
This goes to an account that's read by all the TAs and me, so rather than sending us email individually, if you send email to this account, it'll actually let us get back to you maximally quickly with answers to your questions. If you're asking questions about homework problems, please state in the subject line which assignment and which question the email refers to, since that will also help us route your question to the appropriate TA, or to me, and get the response back to you quickly.

Let's see, skipping ahead: there are four homeworks, one midterm, and one open-ended term project. And next, on to the honor code. One thing that I think will help you to succeed and do well in this class, and even help you to enjoy this class more, is to form a study group. So start looking around at where you're sitting now, or at the end of class today, mingle a little bit and get to know your classmates. I strongly encourage you to form study groups, to have a group of people to study with, a group of your fellow students to talk over these concepts with. You can also post on the class newsgroup if you want to use that to try to form a study group. Some of the problems in this class are reasonably difficult; people that have taken it before may tell you they were very difficult. And I think the class will be more fun for you, and you'll probably have a better learning experience, if you form a study group of people to work with. So I definitely encourage you to do that.

Now, a word on the honor code. I definitely encourage you to form study groups and work together, and discuss homework problems together. But if you discuss homework problems with other students, then I ask you to sort of go home and write down your own solutions independently, without referring to notes that were taken in any of your joint study sessions. In other words, when you turn in a homework problem, what you turn in should be something that was reconstructed independently by yourself, without referring to notes that you took during your study sessions with other people. Okay? And obviously, showing your solutions to others, or copying others' solutions directly, is right out.

We occasionally also reuse problem set questions from previous years, so that the problems are a bit more debugged and work more smoothly. And as a result of that, I also ask you not to look at solutions from previous years, and this includes both the sort of official solutions that we've given out to previous generations of this class, and solutions that people who took the class in previous years may have written out by themselves. Okay? Sadly, in previous years there have often been a few honor code violations in this class, and last year I think I prosecuted five honor code violations, which I think is a ridiculously large number. So just don't look at those solutions, and hopefully there'll be zero honor code violations this year; I'd love for that to happen.

There's a section here on the late homework policy; if you ever want to hand in a homework late, I'll leave you to read that yourself.

We also have a midterm, which is scheduled for the evening of Thursday, November 8th.
So please keep that evening free. And let's see, one more thing I want to say is about the class project. So part of the goal of this class is to leave you well equipped to apply machine learning algorithms to a problem, or to do research in machine learning, and so as part of this class, I'll ask you to execute a small research project, as a small term project. What most students do for this is either apply machine learning to a problem that you find interesting, or investigate some aspect of machine learning.

To those of you that are either already doing research, or those of you who are in industry and taking this from a company: one fantastic way to do a class project would be to apply machine learning algorithms to a problem that you're interested in, a problem that you're already working on, whether it's a science research problem, or a problem in industry where you're trying to get a system to work using a learning algorithm. To those of you that are not currently doing research: one great way to do a project would be to just pick a problem that you care about, a problem that you find interesting, apply learning algorithms to it, play with the ideas, and see what happens. And the goal of the project should really be for you to do a publishable piece of research in machine learning. Okay?

And if you go to the course website, you'll actually find a list of the projects that students did last year. I'm holding the list in my hand; you can go home later and take a look at it online. Reading down this list, I see that last year there were students that applied learning algorithms to control a snake robot; there were a few projects on improving learning algorithms; there was a project on flying autonomous aircraft; there was a project, actually done by our TA Paul, on improving computer vision algorithms using machine learning; there were a couple of projects on Netflix rankings using learning algorithms; a few medical robotics ones, on segmenting pieces of the body using learning algorithms; one on musical instrument detection; another on RNA sequence alignment; a few algorithms on understanding the brain, on neuroscience (actually quite a few projects on neuroscience, including a couple of projects on understanding fMRI data, on brain scans); another project on market making in financial trading; an interesting project on trying to use learning algorithms to decide what it is that makes a person's face physically attractive; projects on learning from optical illusions; and so on, it goes on. So, lots of fun projects. Take a look, then come up with your own ideas, but whatever you find cool and interesting, I hope you'll be able to make a machine learning project out of it. Yeah, question? [Student asks how many people can be in a group.] Oh yes, thank you. Right, so projects can be done in groups of up to three people. So as part of forming study groups later today, as you get to know your classmates, I definitely also encourage you to grab two other people and form a group of up to three people for your project. Okay? And just start brainstorming ideas for now amongst yourselves; you can also come talk to me or the TAs if you want to brainstorm ideas with us.

Okay, so one more organizational question.
I'm curious: how many of you know MATLAB? Wow, cool, quite a lot. Okay. And actually, how many of you know Octave, or have used Octave? Oh okay, a much smaller number. So as part of this class, especially in the homeworks, we'll ask you to implement a few programs, a few machine learning algorithms, as part of the homeworks. And most of those homeworks will be done in either MATLAB or Octave, which is sort of, I don't know, some people call it the free version of MATLAB, which it sort of is, and sort of isn't.

So I guess for those of you that haven't seen MATLAB before (and I know most of you have), MATLAB is, I guess it's called a programming language, that makes it very easy to write code using matrices, to write code for numerical routines, to move data around, and to plot data. It's sort of an extremely easy-to-learn tool for implementing a lot of learning algorithms. And for those of you who want to work on your own home computer or something, if you don't have a MATLAB license: for the purposes of this class, there's also (let me just write that down) a software package called Octave that you can download for free off the internet. It has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it'll work for just about everything.

So, actually, just a side comment for those of you that haven't seen MATLAB before. Once, a colleague of mine at a different university (not at Stanford) who teaches another machine learning class, and has taught it for many years, was in his office one day, and a student of his from, like, ten years ago came into his office. And the student said, oh professor, professor, thank you so much for your machine learning class. I learned so much from it. There's this stuff that I learned in your class that I now use every day, and it's helped me make lots of money, and here's a picture of my big house. So my friend was very excited, and he said, wow, that's great, I'm glad to hear this machine learning stuff was actually useful. So what was it that you learned? Was it the logistic regression? Was it the PCA? Was it the Bayesian networks? What was it that you learned that was so helpful? And the student said: oh, it was the MATLAB. So for those of you that don't know MATLAB yet, I hope you do learn it; it's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it yet.

Okay, the very last logistical thing is the discussion sections. Discussion sections will be taught by the TAs, and attendance at discussion sections is optional, although they'll also be recorded and televised. We'll use the discussion sections mainly for two things. For the next two or three weeks, we'll use the discussion sections to go over the prerequisites to this class; if some of you haven't seen probability or statistics for a while, or linear algebra, we'll go over those in the discussion sections as a refresher for those of you that want one. Later in this quarter, we'll also use the discussion sections to go over extensions to the material that I teach in the main lectures, because machine learning is a huge field and there are a few extensions that we really wanted to teach but didn't have time in the main lectures for. So later this quarter, we'll use the discussion sections to talk about things like convex
optimization, to talk a bit about hidden Markov models, which are a type of machine learning algorithm for modeling time series, and a few other extensions to the material that I'll be covering in the main lectures. And again, attendance at the discussion sections is optional. Okay.

So that was all I had for logistics. Before I move on to start talking a bit about machine learning, let me check what questions you have. Yeah? [Student asks: can we use R or something like that?] Let's see, right, so the policy has been that you're welcome to use R, but I would strongly advise against it, mainly because in the later problem sets we actually supply some code that will run in MATLAB or Octave, and that would be somewhat painful for you to translate into R yourself. So for the early assignments, if you want to submit a solution in R, that's fine, but I think MATLAB is actually totally worth learning. I know both R and MATLAB, and I personally end up using MATLAB quite a bit more often, for various reasons. Yeah? [Student asks about project group sizes.] Oh, so for the term project, you're welcome to do it in smaller groups; you're welcome to do it by yourself, or in groups of two. Grading is the same regardless of the group size, so with a larger group you can probably do more; I recommend trying to form a team, but it's certainly fine to do it in a smaller group if you want. Yeah? [Student asks which programming languages are required.] That's true; so let's see, there is no C programming in this class, other than any that you may choose to do yourself in your project. All the homeworks can be done in MATLAB or Octave. And I guess the prerequisite is more the ability to understand big-O notation, and knowledge of what a data structure like a linked list or a queue or a binary tree is, more so than your knowledge of C or Java specifically. Yeah? [Student asks: looking at the end-of-semester project, what exactly will you be testing, and what will the grading methodology be for the project?] Yeah, let me answer that later. In a couple of weeks, I'll give out a handout with sort of guidelines for the project. But for now, you should think of the goal as being to do a cool piece of machine learning work that will let you experience the joys of machine learning firsthand, and to really try to think about doing a publishable piece of work. So many students will try to build a cool machine learning application; that's probably the most common project. Some students will try to improve state-of-the-art machine learning algorithms; some of those projects are also very successful, though they're a bit harder to do. And there's also a smaller minority of students that sometimes try to develop the theory of machine learning further, or try to prove theorems about machine learning. They're usually great projects, of all of those types, with applications of machine learning being the most common. Anything else? Okay, cool.

So that was it for logistics. Let's talk about learning algorithms. Can I have the laptop display, please? Can I have the projector? Actually, could you load the big screen? Cool, this is amazing customer service, thank you. (Let me try to get this going. Oh, it's okay. Cool. No, that's fine, we have technicians working on it. I see, okay, that's cool, thanks.) Okay, the big screen isn't working today, but I hope you can read things on the smaller screens up there. Actually, in case you're wondering, I think
this room just got a new projector. Someone sent an excited email, I think it was just on Friday, saying we just got a new projector in this room, with a 4,000-to-1 or something-or-other brightness ratio. I don't know, someone was very excited about the new projector in this room, but I guess we'll see it in operation on Wednesday.

So, let's start by talking about what machine learning is. What is machine learning? Actually, can you read the text up there? Can you raise your hand if the text on the small screens is legible? Okay, cool, mostly legible, okay, so I'll just read it out.

So, what is machine learning? Way back in about 1959, Arthur Samuel defined machine learning informally as the field of study that gives computers the ability to learn without being explicitly programmed. So Arthur Samuel, way back in the history of machine learning, actually did something very cool, which was: he wrote a checkers program that would play games of checkers against itself. And because a computer can play thousands of games against itself relatively quickly, Arthur Samuel had his program play thousands of games against itself, and over time it would start to learn to recognize patterns which led to wins, and patterns which led to losses. So over time it learned things like, you know: gee, if I get a lot of my pieces taken by the opponent, then I'm more likely to lose than win; or gee, if I get my pieces into a certain position, then I'm especially likely to win rather than lose. And so over time, Arthur Samuel's checkers program actually learned to play checkers, by learning what the board positions were that tended to be associated with wins, and what the board positions were that tended to be associated with losses.

And way back around 1959, the amazing thing about this was that his program actually learned to play checkers much better than Arthur Samuel himself could. So even today there are some people that say, well, computers can't do anything that they're not explicitly programmed to do, and Arthur Samuel's checkers program was maybe the first really convincing refutation of this claim: namely, Arthur Samuel managed to write a checkers program that could play checkers much better than he personally could, and this is an instance of computers learning to do things that they really were not programmed explicitly to do.

Here's a more recent, more modern, more formal definition of machine learning, due to Tom Mitchell, who defines a well-posed learning problem as follows: a computer program is said to learn from experience E, with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. So not only is it a definition, it even rhymes. So, for example, in the case of checkers, the experience E that the program has would be the experience of playing lots of games of checkers against itself; the task T is the task of playing checkers; and the performance measure P would be something like the fraction of games it wins against a certain set of human opponents. And by this definition, we'd say that Arthur Samuel's checkers program has learned to play checkers. Okay.

So, as an overview of what we're going to do in this class: this class is sort of
organized into four major sections. I'm going to talk about four major topics in this class, the first of which is supervised learning. So let me give you an example of that. Suppose you collect a data set of housing prices; and actually, Dan, one of the TAs here, collected a data set for me last week to use as an example later. Suppose you go and collect statistics about how much houses cost in a certain geographic area; Dan the TA collected data on housing prices in Portland, Oregon. So what you can do is, let's say, plot the square footage of the house against the list price of the house. So you collect data on a bunch of houses, and you get a data set like this, with houses of different sizes that are listed for different amounts of money.

Now, let's say that I'm trying to sell a house in the same area as Portland, Oregon, where the data comes from. Let's say I have a house that's, you know, this size in square footage, and I want an algorithm to tell me about how much I should expect my house to sell for. Okay, so there are lots of ways to do this, and some of you may have seen elements of what I'm about to say before. So one thing you could do is look at this data and maybe fit a straight line to it, and then, if this is my house, you may then look at the straight line and predict that my house is going to go for about that much money, right? There are other decisions that you could make, which we'll talk about later: well, what if I don't want to fit a straight line? Maybe I should fit a quadratic function to it; maybe that fits the data a little bit better. And you notice that if you do that, the predicted price of my house goes up a bit, so that'd be nice.

So this sort of learning problem, of learning to predict housing prices, is an example of what's called a supervised learning problem. And the reason it's called supervised learning is because we're providing the algorithm a data set of a bunch of square footages, a bunch of housing sizes, as well as the right answer, that is, what the actual prices of a number of houses were. So we call it supervised learning because we're supervising the algorithm; in other words, we're giving the algorithm the, quote, right answer for a number of houses, and then we want the algorithm to learn the association between the inputs and the outputs, and to give us more of the right answers. Okay?

It turns out that this particular example I drew here is an example of something called a regression problem, and the term regression refers to the fact that the variable you're trying to predict is a continuous value, a continuous-valued price.
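(As an illustration, here is a minimal sketch in Octave of the line-fitting and quadratic-fitting idea just described. The numbers are made up for the example; this is not Dan's actual Portland data set.)

% Hypothetical housing data: square footage vs. list price (in $1000s).
sqft  = [1000; 1500; 2000; 2500; 3000; 3500];
price = [ 200;  260;  310;  380;  400;  440];

line_fit = polyfit(sqft, price, 1);   % least-squares straight line
quad_fit = polyfit(sqft, price, 2);   % least-squares quadratic

% Predict the price of a hypothetical 2800 sq ft house with each model.
polyval(line_fit, 2800)
polyval(quad_fit, 2800)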
There's another class of supervised learning problems, which we'll also talk about: classification problems. In a classification problem, the variable you're trying to predict is discrete rather than continuous. As one specific example (this is actually a standard data set you can download online, one that I've worked on and that lots of machine learning people have played with), let's say you collect a data set on breast cancer tumors, and you want an algorithm to learn to predict whether or not a certain tumor is malignant. Malignant is the opposite of benign, so malignant means a harmful, bad tumor. So we collect some number of features, some properties of these tumors, and for the sake of having a simple toy explanation, let's just say that we're going to look at the size of the tumor, and depending on the size of the tumor, we'll try to figure out whether or not the tumor is malignant or benign. So the tumor is either malignant or benign, and so the variable on the y-axis is either zero or one, and your data set may look something like that, right? That's a one, and that's a zero. Okay? So this is an example of a classification problem, where the variable you're trying to predict is a discrete value, either zero or one.

In fact, more generally, there will be many learning problems where we'll have more than one input variable, more than one input feature, and we'll use more than one variable to try to predict, say, whether a tumor is malignant or benign. So, for example, continuing with this example, you may instead have a data set that looks like this. I'm going to plot this data set in a slightly different way now, and I'm making this data set look much cleaner than it really is in reality, for illustration: where, say, the crosses indicate malignant tumors and the O's indicate benign tumors. So you may have a data set comprising patients of different ages who have different tumor sizes, where a cross indicates a malignant tumor and an O indicates a benign tumor, and you may want an algorithm to learn to predict, given a new patient, whether their tumor is malignant or not. So, for example, what a learning algorithm may do is come in and decide that a straight line like that separates the two classes of tumors really well, and so if you have a new patient whose age and tumor size fall over there, then it may predict that the tumor is benign rather than malignant. Okay? So this is just another example of a supervised learning problem, and another classification problem.

So it turns out that one of the issues we'll talk about later in this class is that, in this specific example, we're trying to predict whether a tumor is malignant or benign based on two features, or two inputs, namely the age of the patient and the tumor size. It turns out that when you look at real data sets, you find that learning algorithms often use other sets of features; in the breast cancer data, for example, you'd also use properties of the tumors like clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, and so on, various other medical properties.
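(As an illustration of what such a classifier looks like in code, here is a minimal sketch in Octave, not the course's own code, using logistic regression, which the class covers in a later lecture. The patient data is made up, and the features are standardized so that plain gradient ascent behaves well.)

% Made-up training data: [age, tumor size], with y = 1 malignant, 0 benign.
data = [40 1.0 0; 55 2.5 1; 62 3.0 1; 35 0.8 0; 50 2.8 1; 45 1.2 0];
X = data(:, 1:2);   y = data(:, 3);
X = (X - mean(X)) ./ std(X);        % standardize each feature column
m = size(X, 1);
X = [ones(m, 1) X];                 % add intercept term
theta = zeros(3, 1);
alpha = 0.1;                        % learning rate
for iter = 1:2000
  h = 1 ./ (1 + exp(-X * theta));               % sigmoid of the linear score
  theta = theta + (alpha / m) * X' * (y - h);   % ascend the log-likelihood
end
% theta now defines a straight-line decision boundary in (age, size) space:
% points with h >= 0.5 are predicted malignant, the rest benign.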
And one of the most interesting things we'll talk about later this quarter is: what if your data doesn't lie in a two-dimensional or three-dimensional or even a finite-dimensional space? What if your data actually lies in an infinite-dimensional space? I've plotted a two-dimensional space here; I can't plot an infinite-dimensional space, right? And it turns out that one of the most successful classes of machine learning algorithms, something called support vector machines, actually takes data and maps it to an infinite-dimensional space, and then does classification using not two features, like I've done here, but an infinite number of features. And that will actually be one of the most fascinating things we talk about when we go more deeply into classification algorithms. It's actually an interesting question, right? Think about how you would even represent an infinite-dimensional vector in computer memory; you don't have an infinite amount of computer memory. How do you even represent a point that lies in an infinite-dimensional space? We'll talk about that when we get to support vector machines.
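(A brief sketch of how that's even possible: the trick, which the SVM lectures develop, is that many algorithms only ever need inner products between feature vectors, and for certain feature maps that inner product has a cheap closed form, a kernel, so the infinite-dimensional vector is never formed explicitly. For example, in Octave:)

% The Gaussian kernel below equals an inner product <phi(x), phi(z)> for a
% feature map phi into an infinite-dimensional space, yet computing it
% costs only O(n); phi(x) itself is never constructed.
x = [1; 2];   z = [1.5; 1.8];   sigma = 1.0;
k = exp(-sum((x - z).^2) / (2 * sigma^2))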
Okay, so let's see. So that was supervised learning. The second of the four major topics of this class will be learning theory. So, I have a friend who teaches math at a different university, not Stanford, and when you talk to him about his work and what he's really out to do, this friend of mine (he's a math professor, right) will sort of get this look of wonder in his eyes, and he'll tell you about how, in his mathematical work, he feels like he's discovering truth and beauty in the universe. And he says it in sort of a really touching, sincere way, and you can see it in his eyes: this deep appreciation of the truth and beauty in the universe, as revealed to him by the mathematics.

In this class, I'm not going to do any truth and beauty. In this class, I'm going to talk about learning theory to try to convey to you an understanding of how and why learning algorithms work, so that we can apply these learning algorithms as effectively as possible. So, for example, it turns out you can prove surprisingly deep theorems on when you can guarantee that a learning algorithm will work. So think about a learning algorithm for reading zip codes: when can you prove a theorem guaranteeing that a learning algorithm will be at least 99.9% accurate on reading zip codes? This is actually somewhat surprising, but we'll actually prove theorems showing when you can expect that to hold. We'll also delve into learning theory to try to understand what algorithms can approximate different functions well, and when, and also to try to understand things like how much training data you need. So how many examples of houses do I need in order for a learning algorithm to recognize the pattern between the square footage of a house and its price? And this helps us answer questions like: if you're trying to design a learning algorithm, should you be spending more time collecting more data, or is it the case that you already have enough data, so it would be a waste of time to try to collect more? Okay?

So I think learning algorithms are a very powerful tool, but as I walk around industry in Silicon Valley, or as I work with various people in CS and outside CS, you find that there's often a huge difference between how well someone who really understands this stuff can apply a learning algorithm, versus someone who sort of gets it, but sort of doesn't. The analogy I like to think of is: imagine you were going to, you know, a carpentry school instead of a machine learning class, right? If you go to a carpentry school, they can give you the tools of carpentry; they'll give you a hammer, a bunch of nails, a screwdriver, or whatever. But a master carpenter will be able to use those tools far better than most of us in this room; I know carpenters can do things with a hammer and nails that I couldn't possibly. And it's actually a little bit like that in machine learning, too. One thing that's sadly not taught in many courses on machine learning is how to take the tools of machine learning and really, really apply them well. So in the same way (the tools of machine learning are, I'll say, quite a bit more advanced than the tools of carpentry; maybe a carpenter will disagree), a large part of this class will be just giving you the raw tools of machine learning, just the algorithms and so on. But what I plan to do throughout this entire quarter, not just in the segment on learning theory, but actually as a theme running through everything I do this quarter, will be to try to convey to you the skills to really take the learning algorithm ideas and really get them to work on a problem.

It's sort of hard for me to stand here and say how big a deal that is, but when I walk around companies in Silicon Valley, it's completely not uncommon for me to see someone using some machine learning algorithm, and then explaining to me what they've been doing for the last six months, and I go: oh gee, it should have been obvious from the start that for the last six months you've been wasting your time, right? And so my goal in this class, running through the entire quarter, not just in learning theory, is actually not only to give you the tools of machine learning, but to teach you how to use them well. And I notice this is something that really not many other classes teach; this is something I'm really convinced is a huge deal. And so by the end of this class, I hope all of you will be master carpenters; I hope all of you will be really good at applying these learning algorithms and getting them to work amazingly well on many problems. Okay.

Let's see, so let me clear off the board. After learning theory, there's another class of learning algorithms that I then want to teach you about, and that's unsupervised learning. So you recall, a little earlier I drew an example like this, right, where you have a couple of features, a couple of input variables, and malignant tumors and benign tumors, or whatever. And that was an example of a supervised learning problem, because the data you have gives you the right answer for each of your patients; the data tells you, this patient has a malignant tumor, this patient has a benign tumor. So you had the right answers, and you wanted the algorithm to just produce more of the same.

In contrast, in an unsupervised learning problem, this is the sort of data you get. Okay? Where, speaking loosely, you're given a data set, and I'm not going to tell you what the right answer is on any of your data. I'm just going to give you a data set, and I'm going to say: would you please find interesting structure in this data for me? So that's the unsupervised learning problem, where you're not given the right answer for everything. So, for example, an algorithm may find structure in the data in the form of the data being partitioned into two clusters; so clustering is one example of an unsupervised learning problem.
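(As a concrete illustration, here is a minimal sketch of one clustering algorithm, k-means, in Octave, on made-up two-dimensional data. This is just an example of the family of algorithms being described, not code from the course.)

% Two made-up blobs of points in 2-D.
X = [randn(50, 2); randn(50, 2) + 5];
k = 2;
centroids = X(randperm(size(X, 1), k), :);   % start from k random points
for iter = 1:20
  % Step 1: assign every point to its nearest centroid.
  d = zeros(size(X, 1), k);
  for j = 1:k
    d(:, j) = sum((X - centroids(j, :)).^2, 2);   % squared distances
  end
  [~, idx] = min(d, [], 2);
  % Step 2: move each centroid to the mean of the points assigned to it.
  for j = 1:k
    centroids(j, :) = mean(X(idx == j, :), 1);
  end
end
% idx now partitions the data into two clusters; centroids are their centers.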
So it turns out that these sorts of unsupervised learning algorithms are used for many problems. This is a screenshot, a picture I got from Su-In Lee, who's a PhD student here, who's applying unsupervised learning algorithms to try to understand gene data. So she's trying to look at genes in individuals and group them into clusters, based on properties of how the genes respond to different experiments. Another interesting application of these sorts of clustering algorithms is in image processing. This picture I got from Stephen Gould, who's another PhD student. It turns out that if you give this sort of data, say an image, to certain unsupervised learning algorithms, they will then learn to group pixels together and say: gee, this set of pixels seems to belong together, and that set of pixels seems to belong together. And so the images you see on the bottom (I guess you can just barely see them) are the groupings, what the algorithm has done to group different pixels together. On a small display, it might be easier to just look at the image on the right: the two images at the bottom are two sort of identical visualizations of the same grouping of the pixels into contiguous regions. And so it turns out that this sort of clustering algorithm, this sort of unsupervised learning algorithm that learns to group pixels together, turns out to be useful for many applications in vision, in computer vision and image processing.

Let me show you one example. This is a rather cool one that two students, Ashutosh Saxena and Min Sun here, did, which is: given an image like this (this is actually a picture taken on the Stanford campus), you can apply that sort of clustering algorithm and group the picture into regions. Let me actually blow that up so you can see it more clearly. Okay, so in the middle you see the lines sort of grouping the image into contiguous regions. And what Ashutosh and Min did was, they then applied a learning algorithm to say: can we take this clustering and use it to build a 3D model of the world? And so, using the clustering, they then had a learning algorithm try to learn what the 3D structure of the world looks like, so that they could come up with a 3D model that you can sort of fly through. Okay? By the way, many people used to think it's not possible to take a single image and build a 3D model from it, but using a learning algorithm, with that sort of clustering algorithm as the first step, they were able to do it. Let me just show you one more example; I like this one because it's a picture of our beautiful Stanford campus. So again, with the same sort of clustering algorithms, the same sort of unsupervised learning algorithms, you can group the pixels into different regions, and using that as a pre-processing step, they eventually built this sort of 3D model of the Stanford campus from a single picture, so you can walk into the scene and look around the campus. This one actually turned out to be a mix of supervised and unsupervised learning, but the unsupervised learning, the sort of clustering, was the first step.

So it turns out these sorts of unsupervised clustering algorithms are actually routinely used for many different tasks: things like organizing computing clusters, social network analysis, market segmentation (so if you're a marketer and you want to divide your market into different segments, different groups of people, to market to them separately), even for astronomical data analysis and understanding how galaxies are formed. These are just a small sample of the applications of unsupervised learning algorithms, of clustering algorithms, that we'll talk about later in this class.

There's just one particularly cool example of an unsupervised learning algorithm that I want to tell you about, and to motivate that, let me tell you about what's called the cocktail party problem.
Imagine that you're at some cocktail party, and there are lots of people standing all around, and you know how it is, right: if you're at a large party where everyone's talking, it can sometimes be very hard to hear even the person right in front of you. So the problem is: with all of these people talking, can you separate out the voice of just the person you're interested in talking to, despite all this loud background noise? I'll show you a specific example in a second.

So here's a cocktail party that's, I guess, rather sparsely attended by just two people. What we're going to do is put two microphones in the room, okay? And because the microphones are at slightly different distances from the two people, and the two people may be speaking at slightly different volumes, each microphone will pick up an overlapping combination of these two people's voices, but slightly different overlapping combinations. So speaker one's voice may be louder on microphone one, and speaker two's voice may be louder on microphone two, or whatever. The question is: given these microphone recordings, can you separate out the original speakers' voices? So I'm going to play you some audio clips; they were collected by Te-Won Lee at UCSD. Let me actually play for you the original raw microphone recordings from this cocktail party. [Plays microphone one: two overlapping voices counting from one to ten.] So it's a fascinating party, with people counting from one to ten. And this is the second microphone. [Plays microphone two.]

Okay, so remember: in unsupervised learning, we don't know what the right answer is, right? So what we're going to do is take exactly the two microphone recordings you just heard and give them to an unsupervised learning algorithm, and tell the algorithm: would you please discover structure in this data; what structure is there in this data? We actually don't tell it what the right answer is beforehand. So we give this data to an unsupervised learning algorithm, and what the algorithm does in this case is discover that this data can actually be explained by two independent speakers speaking at the same time, and it can separate out the two speakers for you. So here's output one of the algorithm. [Plays a single separated voice counting from one to ten.] And that was the second one. [Plays the other separated voice.] And so the algorithm discovers that, gee, the structure underlying the data is really that there are two sources of sound, and here they are.

Let me show you one more example. This is a second, different pair of microphone recordings. [Plays recording one: a voice counting over a radio playing.] So the poor guy at this cocktail party is now talking to his radio. [Plays the second recording.] And you give this data to the same unsupervised learning algorithm (the algorithm is actually called independent component analysis, and later this quarter you'll see why), and it then outputs the following. [Plays the separated voice, then the separated radio.]

Okay, so it turns out that beyond solving the cocktail party problem, this specific class of unsupervised learning algorithms is also applied to a bunch of other problems, like text processing, or understanding functional brain imaging data, like magnetoencephalography and EEG data. We'll talk about that more when we go on to describe ICA, or independent component analysis, which is what you just saw.

And as an aside: this algorithm I just showed you, it seems like it must be a pretty complicated algorithm, right, to take these overlapping audio streams and separate them out? It sounds like a pretty complicated thing to do. So you might ask: how complicated is it, really, to implement an algorithm like this? It turns out that if you do it in MATLAB, you can do it in one line of code. I got this from Sam Roweis at Toronto. The example I showed you actually used a more complicated ICA algorithm than this, but nonetheless, I guess this is why, for this class, I'm going to ask you to do most of your programming in MATLAB and Octave: because if you try to implement the same algorithm in C or Java or something, I can tell you from personal, painful experience, you end up writing pages and pages of code, rather than relatively few lines of code. I should just mention that it did take researchers many, many years to come up with that one line of code, so this is not easy.
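(For the curious: the one-liner attributed to Sam Roweis is usually quoted in roughly the following form, reproduced here from memory, so treat it as illustrative rather than authoritative. Given a matrix x whose rows are the mixed microphone signals, it performs one ICA-style unmixing step based on fourth-order statistics.)

% x is n-by-m: n mixed recordings, m samples (assumed zero-mean, whitened).
[W, s, v] = svd((repmat(sum(x .* x, 1), size(x, 1), 1) .* x) * x');
% Under those assumptions, the rows of W * x approximate the separated sources.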
So that was unsupervised learning, and then the last of the four major topics I want to tell you about is reinforcement learning. And this refers to problems where you don't do one-shot decision-making. So, for example, in the supervised learning cancer prediction problem, you have a patient come in, you predict that the cancer is malignant or benign, and then based on your prediction, maybe the patient lives or dies, and then that's it, right? You make one decision, and then there's a consequence; you either got it right or wrong. In reinforcement learning problems, you're usually asked to make a sequence of decisions over time.

So, for example, this is something that my students and I work on: if I give you the keys to an autonomous helicopter (we have one of these helicopters, you know, here at Stanford), how do you write a program to make it fly, right? You notice that if you make a wrong decision on a helicopter, the consequence of crashing it may not happen until much later, and in fact, usually you need to make a whole sequence of bad decisions to crash a helicopter. But conversely, you also need to make a whole sequence of good decisions in order to fly a helicopter really well. So let me show you some fun videos of learning algorithms flying helicopters. This is a video of our helicopter at Stanford flying using a controller that was learned using a reinforcement learning algorithm. [Plays video.] So this was done on the Stanford football field, and we'll zoom out the camera in a second; you'll see the trees planted against the sky. So this is one of the most difficult aerobatic maneuvers flown on any helicopter under computer control, and this controller, which would be very, very hard for a human to sit down and write out by hand, was learned using one of these reinforcement learning algorithms.

Just a word about how that works: the basic idea behind a reinforcement learning algorithm is this idea of what's called a reward function. The way to think about it is: imagine you're trying to train a dog. So every time your dog does something good, you say "good dog," and you reward the dog. Every time your dog does something bad, you go "bad dog," right? And hopefully, over time, your dog will learn to do the right things to get more of the positive rewards, to get more of the "good dogs" and fewer of the "bad dogs." So the way we teach a helicopter to fly, or any of these robots, is sort of the same thing: every time the helicopter crashes, we go "bad helicopter," and every time it does the right thing, we go "good helicopter," and over time it learns how to control itself so as to get more of these positive rewards.
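(To make the reward idea concrete, here is a toy sketch in Octave, nothing to do with the actual helicopter code, of value iteration on a five-cell corridor, where reaching state 5 is "good dog" (+1) and reaching state 1 is "bad dog" (-1). The setup and numbers are invented for illustration.)

% Rewards for each of the 5 states; states 1 and 5 are terminal.
R = [-1 0 0 0 1];
gamma = 0.9;          % discount factor
V = R;                % value estimates, initialized to the rewards
for iter = 1:100
  for s = 2:4         % only the interior states get updated
    % Bellman backup over the two actions: move left or move right.
    V(s) = R(s) + gamma * max(V(s - 1), V(s + 1));
  end
end
V   % values rise toward state 5, so a greedy agent heads for the +1 reward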
So reinforcement learning, the way I think of it, is a way for you to specify what you want done: you have to specify what counts as good-dog and what counts as bad-dog behavior, and then it's up to the learning algorithm to figure out how to maximize the good-dog reward signals and minimize the bad-dog punishments.

It turns out reinforcement learning is applied to other problems in robotics, and to things like web crawling and so on, but it's just cool to show videos, so let me just show a bunch of them. This learning algorithm was implemented by our head TA, Zico, programming a four-legged robot dog; a few other students in this class have been working on related projects as well. I guess this really is a good dog/bad dog, isn't it, since it's a robot dog. In the second video, on the upper right, some of the students were working on a robotic snake, again using learning algorithms to teach a snake robot how to climb over obstacles. On the lower left is a kind of fun example: Ashutosh Saxena and Jeff Michels used learning algorithms to teach a car how to drive at reasonably high speeds off-road while avoiding obstacles. And on the lower right, that's a robot programmed by one of our PhD students to teach a somewhat strangely configured robot how to get on top of an obstacle and how to get over an obstacle; sorry, I know the video is kind of small, but I hope you can sort of see it. OK, so I think all of these are robots for which it would be very difficult to hand-code a controller, but using these sorts of learning algorithms you can, in relatively short order, get a robot to do often pretty amazing things.

OK, so that was most of what I wanted to say today; just a couple more last things, but let me just check what questions you have right now. If there are no questions, I'll close with two reminders. After class today, as you start to talk with other people in this class, let me just encourage you again to try to find project partners to do your project with, and this is also a good time to start forming study groups. So either talk to your friends or post to the newsgroup, but I encourage you to try to start both of those today: form study groups, and try to find two other project partners. So thank you; I'm looking forward to teaching this class, and I'll see you in a couple of days.

This presentation is delivered by the Stanford Center for Professional Development.

OK, so let's get started with today's material. Welcome back to the second lecture. What I want to do today is talk about linear regression, gradient descent, and the normal equations. I should also say that lecture notes have been posted online, so if some of the math I go over today goes by rather quickly and you want to see every equation written out, and to work through the details more slowly yourself, go to the course homepage and download the detailed lecture notes, which pretty much cover all the mathematical and technical content I'm going to go over today. Today I'm also going to delve into some amount of linear algebra, so if you would like a refresher on linear algebra, this week's discussion section, taught by the TAs, will be a refresher on linear algebra.
So if some of the linear algebra I talk about today seems to be going by a bit quickly, or if you just want to see some of the things I'm claiming today without proof written out in detail, you can come to this week's discussion section.

I actually want to start by showing you a fun video. Remember, at the last lecture, the initial lecture, I talked about supervised learning, and supervised learning was this machine learning problem where I said we're going to tell the algorithm what the, quote, "right answer" is for a number of examples, and then we want the algorithm to replicate more of the same. The example I used in the first lecture was the problem of predicting housing prices: you may have a training set, we tell the algorithm what the, quote, right housing price was for every house in the training set, and then we want the algorithm to learn the relationship between the sizes of houses and their prices and essentially produce more of the, quote, right answers. So let me play a video for you now (I should load the big screen, please). This is from Dean Pomerleau, on some work he did at Carnegie Mellon on applying supervised learning to get a car to drive itself. This was work on a vehicle known as ALVINN; it was done about fifteen years ago, and I think it was a very elegant example of the sorts of things you can get supervised learning algorithms to do. On the video you'll hear Dean Pomerleau's voice mention an algorithm called a neural network; we'll say a little bit about that later, but the essential learning algorithm used here is something called gradient descent, which I'll actually talk about later in today's lecture. Let's watch the video.

"ALVINN is a system of artificial neural networks that learns to steer by watching a person drive. ALVINN is designed to control the NavLab 2, a modified Army Humvee equipped with sensors, computers, and actuators for autonomous navigation experiments. The initial step in configuring ALVINN is training a network to steer. During training, a person drives the vehicle while ALVINN watches. Once every two seconds, ALVINN digitizes a video image of the road ahead and records the person's steering direction. This training image is reduced in resolution to 30 by 32 pixels and provided as input to ALVINN's three-layered network."

So, two comments. One is that this is supervised learning, because it's learning from a human driver: the human driver demonstrates that on this segment of road I would steer at this angle, on that segment of road I would steer at that angle, and so the human provides a number of, quote, correct steering directions to the car; it's then the job of the car to try to learn to produce more of these, quote, correct steering directions that keep the car following the road. Second, on the monitor display up here, let me tell you a little bit about what this display means. On the upper left, where the mouse pointer is moving, this horizontal line shows the human's steering direction: this white bar, or white region, shows the steering direction chosen by the human driver by moving the steering wheel, so here the human is steering a little bit to the left, indicated by the position of this white region. The second line here, where my mouse is pointing, is the output of the learning algorithm, where the learning algorithm currently thinks the right steering direction is.
Right now, what you're seeing is the learning algorithm at the very beginning of training, so it actually has no idea where to steer, and its output is sort of a white smear over the entire range of steering directions. As the algorithm collects more examples and learns over time, you'll see it start to choose the steering direction more confidently. So let's keep watching the video.

"Using the back-propagation learning algorithm, ALVINN is trained to output the same steering direction as the human driver for that image. Initially, the network's steering response is random. After about two minutes of training, the network learns to accurately imitate the steering reactions of the human driver. This same training procedure is repeated for other road types. After the networks have been trained, the operator pushes the run switch and ALVINN begins driving. Twelve times per second, ALVINN digitizes an image and feeds it to its neural networks. Each network, running in parallel, produces a steering direction and a measure of its confidence in its response. The steering direction from the most confident network, in this case the network trained for the one-lane road, is used to control the vehicle. Suddenly, an intersection appears ahead of the vehicle. As the vehicle approaches the intersection, the confidence of the one-lane network decreases. As it crosses the intersection and the two-lane road ahead comes into view, the confidence of the two-lane network rises. When its confidence rises, the two-lane network is selected to steer, safely guiding the vehicle into its lane on the two-lane road."

All right, so who thought driving could be that dramatic, right? Can I switch back to the chalkboard, please? I should say, this work was done about fifteen years ago, and autonomous driving has come a long way since then. Many of you will have heard of the DARPA Grand Challenge, where one of my colleagues, Sebastian Thrun, led the winning team, which drove a car across the desert by itself. So ALVINN was, I think, absolutely amazing work for its time, but autonomous driving has come a long way since then.

What you just saw was an example, again, of supervised learning, and in particular it was an example of what we call a regression problem, because the vehicle is trying to predict a continuous-valued variable, a continuous-valued steering direction; we call that a regression problem. What I want to do today is talk about our first supervised learning algorithm, and it will also be for a regression task. For the running example that I'm going to use throughout today's lecture, we're going to return to the example of trying to predict housing prices. So here is a data set collected by TA Dan Ramage on housing prices in Portland, Oregon: a data set of a number of houses of different sizes, together with their asking prices in thousands of dollars. You can take this data and plot it, square feet against price, and you may get a data set like that. The question is: given a data set like this, or given what we call a training set like this, how do you learn to predict the relationship between the size of a house and its price? I'm actually going to come back and modify this chart a little bit later, but first let me go ahead and introduce some notation, which I'll be using throughout the rest of this quarter.
The first piece of notation: I'm going to let the lowercase letter m denote the number of training examples, and that just means the number of rows, the number of (house, price) examples, that we have. In this particular data set we happen to have 47 training examples, although I wrote down only five. So throughout this quarter I'm going to use the letter m to denote the number of training examples. I'm going to use lowercase x to denote the input variables, which are often also called the features; in this case x would denote the size of a house we're looking at. And I'm going to use y to denote the, quote, output variable, which is sometimes also called the target variable. One pair (x, y) is what comprises one training example; in other words, one row of the table I just drew would be what I call one training example. And the i-th training example, in other words the i-th row of that table, I'm going to write as (x^(i), y^(i)). In this notation, the superscript (i) is not exponentiation: this is not x to the power of i or y to the power of i; the superscript i in parentheses is just an index into the i-th row of my list of training examples.

So in supervised learning, this is what we're going to do. We're given a training set, and we're going to feed our training set, comprising our m training examples (the 47 training examples), into a learning algorithm, and our algorithm then has to output a function which, by tradition and for historical reasons, is usually denoted lowercase h, and which is called a hypothesis. Don't worry too much about whether the term hypothesis has a deep meaning; it's more of a term that's used for historical reasons. The hypothesis's job is to take as input some new house whose price you want to estimate, say a new living area in square feet, and to output the estimated price of that house. So the hypothesis h maps from inputs x to outputs y.

In order to design the learning algorithm, the first thing we have to decide is how we want to represent the hypothesis. Just for the purposes of this lecture, for the purposes of our first learning algorithm, I'm going to use a linear representation for the hypothesis: I'm going to represent my hypothesis as h(x) = θ0 + θ1·x1, where x here is an input feature, namely the size of the house we're considering. More generally (actually, let me come back to this), for many regression problems we may have more than one input feature. For example, if instead of just knowing the size of the house, our training set also has a second feature telling us the number of bedrooms in each house, then, letting x1 denote the size in square feet and x2 denote the number of bedrooms, I would write the hypothesis as h(x) = θ0 + θ1·x1 + θ2·x2. And sometimes, when I want to make the hypothesis h's dependence on θ explicit, I'll write it as hθ(x).
And so this is the price that my hypothesis predicts a house with features x will cost: given a new house with a certain size and a certain number of bedrooms, this is the price my hypothesis predicts it's going to cost.

One last piece of notation. For conciseness, just to write this a bit more compactly, I'm going to take the convention of defining x0 to be equal to 1, and so I can now write h(x) = Σ_{i=0}^{2} θi·xi; and if you think of θ and x as vectors, this is just θᵀx. The very final piece of notation: I'm going to let lowercase n be the number of features in my learning problem, so this becomes a sum from i = 0 to n, where in this example, with two features, n would be equal to 2.

All right, I realize there was a fair amount of notation, and as I proceed through the rest of the lecture today, or in future weeks as well, if you're looking at me writing a symbol and you're wondering, gee, what was that lowercase n again, or what was that lowercase x again, please raise your hand and ask. There's a fair amount of notation; we'll probably all get used to it in a few days, and with standardized notation a lot of our descriptions of learning algorithms will become much easier. But again, if you see me write some symbol and you don't quite remember what it means, chances are there are others in this class who have forgotten too, so please raise your hand and ask if you're ever wondering what some symbol means.

What questions do you have about any of this? [Student asks about θ0 and θ1.] Right, this is what I was going to do next: the θi are called the parameters of our learning algorithm, and θ0, θ1, θ2 are just real numbers; it is then the job of the learning algorithm to use the training set to choose, or to learn, appropriate parameters θ. Are there other questions? [Student asks about the superscript T.] Oh, yes, sorry: when I write θᵀx, the T means transpose, so this is the inner product between the two vectors. [Student asks whether this is a typical hypothesis function, whether it could have higher-order terms, or whether θ could itself be a function.] Yeah, all great questions, and the answer is, sort of, yes. For now, just for this first learning algorithm we'll talk about, I'm using a linear hypothesis class; a little bit later this quarter we'll talk about much more complicated hypothesis classes, and we'll actually talk about higher-order functions a little bit later today as well. Cool.
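To make the notation concrete, here is a minimal Octave/MATLAB sketch of the hypothesis hθ(x) = θᵀx with the x0 = 1 convention folded in; all the numbers are made up for illustration.

% Minimal sketch of the linear hypothesis h(x) = theta' * x, with x0 = 1.
theta = [50; 0.1; 20];         % [theta0; theta1; theta2]: intercept, per-sqft, per-bedroom
size_sqft = 2104;              % features of one (made-up) house
bedrooms  = 3;
x = [1; size_sqft; bedrooms];  % prepend x0 = 1 for the intercept term
price = theta' * x;            % h(x) = theta0 + theta1*x1 + theta2*x2
fprintf('predicted price: %.1f (thousands of dollars)\n', price);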
OK, so for the learning problem, then: how do we choose the parameters θ so that our hypothesis h will make accurate predictions about housing prices? Well, we have a training set, and on the training set our hypothesis will make some predictions of the prices of the houses, so one reasonable thing to do is to try to make the predictions of the learning algorithm accurate on the training set, at least. Given some features x and some correct prices y, we might want to make the squared difference between the prediction of the algorithm and the actual price small. So to choose the parameters θ, let's say we want to minimize, over the parameters θ, this sort of squared error between the predicted price and the actual price. And since we have m training examples, I'm going to sum from i = 1 through m, over my m training examples, the squared difference between the price predicted on the i-th house in my training set and the actual price on the i-th training example. By convention, instead of minimizing just this sum of squared differences, I'm going to put a one-half in front, which will simplify some of the math we do later. So let me define J(θ) to be exactly this quantity:

J(θ) = (1/2) Σ_{i=1}^{m} (hθ(x^(i)) - y^(i))²

that is, one half times the sum, over my m training examples, of the squared difference between the value predicted by my hypothesis and the actual value. And what we'll do, let's say, is minimize, as a function of the parameters θ, this quantity J(θ).

I should say, to those of you who have taken linear algebra classes, or maybe basic statistics classes, some of you may have seen things like this before, called least-squares regression or ordinary least squares. Many of you will not have seen this before, but either way, regardless of whether you've seen it before, let's keep going; and for those of you who have seen it before, I should say that eventually we will actually show that this algorithm is a special case of a much broader class of algorithms. But let's keep going; we'll get there eventually.
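Here is a minimal Octave/MATLAB sketch of this cost function on a tiny made-up training set, written as the explicit sum over examples:

% Sketch of the cost J(theta) = 1/2 * sum_i (h(x_i) - y_i)^2
% on a tiny made-up training set of house sizes (sqft) and prices.
X = [1 2104; 1 1600; 1 2400; 1 1416];   % each row is [x0 x1], with x0 = 1
y = [400; 330; 369; 232];               % made-up prices in thousands of dollars
theta = [0; 0.2];                       % some candidate parameters
m = size(X, 1);
J = 0;
for i = 1:m
  h = X(i,:) * theta;                   % hypothesis on the i-th example
  J = J + 0.5 * (h - y(i))^2;           % accumulate the squared error
end
fprintf('J(theta) = %.2f\n', J);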
So I'm going to talk about a couple of different algorithms for performing that minimization over θ of J(θ). The first algorithm I'm going to talk about is a search algorithm, where the basic idea is that we'll start with some value of my parameter vector θ (maybe initialize θ to be the vector of all zeros; to write that, I'll write a 0 with an arrow on top, denoting the vector of all zeros), and then we'll keep changing the parameter vector θ to reduce J(θ) a little bit, until we hopefully end up at the minimum, with respect to θ, of J(θ).

Could I switch to the laptop display, please? Let me show you an animation of this first algorithm for minimizing J(θ), which is an algorithm called gradient descent. Here's the idea. You see on the display a plot where the horizontal axes are θ0 and θ1, and the function J(θ) that we want to minimize is represented by the height of the plot: the surface represents the function J(θ), and the inputs of this function are the parameters θ0 and θ1, written down here below. Here's the gradient descent algorithm. I'm going to choose some initial point; it could be the vector of all zeros or some randomly chosen point, and let's say we start from the point denoted by the cross. Now, you should imagine that this display actually shows a 3D landscape: imagine you're out in some grassy park and this is the 3D shape of a hill, and imagine that you're physically standing at the position of that cross on the hill. Imagine you look all 360 degrees around you and ask: if I were to take a small step, what would allow me to go downhill the most? So imagine this is physically a hill; you're standing there, you look around, and you ask: if I take a small step, what is the direction of steepest descent that would take me downhill as quickly as possible? The gradient descent algorithm does exactly that: you take a small step in the direction of steepest descent, which turns out to be the direction given by the gradient, and you end up at a new point, shown up there. Then you keep going: you're now at a new point on this hill, and again you look all 360 degrees around you and ask which direction takes you downhill as quickly as possible (we want to go downhill as quickly as possible because we want to find the minimum of J(θ)). You do that again, take another step, and you sort of keep going until you end up at a local minimum of this function J(θ).

One property of gradient descent is illustrated by where you end up; in this case we ended up at a point in the lower-left corner of the plot. Let's try running gradient descent again from a different position: that was where I started gradient descent just now, so let's rerun it using a slightly different initial starting point, a point slightly further up and to the right. It turns out that if you run gradient descent from that point, taking the steepest-descent direction at each step, then with a slightly different initial starting point you can actually end up at a completely different local optimum. So this is a property of gradient descent, and we'll come back to it in a second, but be aware that gradient descent can sometimes depend on where you initialize your parameters θ0 and θ1.

Now let's switch back to the chalkboard, please, and work out the math of the gradient descent algorithm; we'll come back and revisit this issue of local optima. Here's the gradient descent algorithm: we're going to repeatedly take a step in the direction of steepest descent, and it turns out you can write that as updating each parameter as

θi := θi - α · ∂/∂θi J(θ).

This is how we're going to update the i-th parameter θi on each iteration of gradient descent. Just a note on notation: I use this colon-equals notation (:=) to denote setting the variable on the left-hand side equal to the value on the right-hand side. So if I write a := b, this is part of a computer program, or part of an algorithm, where we take the value on the right-hand side and use it to overwrite the value on the left-hand side. In contrast, if I write a = b, then this is an assertion of truth: I'm claiming that the value of a is equal to the value of b. So whereas a := b is a computer operation where we overwrite the value of a, writing a = b asserts that the values of a and b are equal. Let's see that this algorithm makes sense; well, actually, let's just move on.
Let's go ahead and take gradient descent and apply it to our problem. This being the first somewhat mathematical lecture, I'm going to step through the derivations much more slowly and carefully than I will later in the school term; we'll work through the steps in much more detail than I will later. So let's work out what this gradient descent rule is, and I'll do this first for the case where we have only one training example. In that case, we need to work out the partial derivative with respect to the parameter θi of J(θ). If we have only one training example, then J(θ) is going to be (1/2)(hθ(x) - y)²: if your training set comprises one pair (x, y), this is what J(θ) is. So, taking derivatives: you have one-half times something squared, so the 2 comes down and you get 2 · (1/2) · (hθ(x) - y); then, by the chain rule, we also need to multiply by the derivative of what's inside the square. The 2 and the 1/2 cancel, so this leaves

(hθ(x) - y) · ∂/∂θi (θ0·x0 + θ1·x1 + … + θn·xn).

Now, if you look inside this sum we're differentiating, all of the terms except one do not depend on θi: the only term in the sum that depends on θi is the term θi·xi, and the partial derivative of θi·xi with respect to θi is just xi. So we get (hθ(x) - y) · xi, and this gives us our learning rule: θi gets updated as

θi := θi - α · (hθ(x) - y) · xi.

This Greek letter α here is a parameter of the algorithm called the learning rate, and this parameter controls how large a step you take: if you're standing on the hill and you've decided what direction to take a step in, α controls how aggressively, how large a step, you take in that direction of steepest descent. This is a parameter of the algorithm that's often set by hand. If you choose α to be too small, your steepest descent algorithm will take very tiny steps and take a long time to converge; if α is too large, steepest descent may actually end up overshooting the minimum, if you take too aggressive a step. [Student: isn't there a one-half missing somewhere?] Let me check. OK, cool, thanks; I do make lots of errors at the board, so it's good to ask. Any questions about this? OK.

So over there I derived the algorithm for when you have just one training example. More generally, for m training examples, gradient descent becomes the following: repeat until convergence the update

θi := θi - α · Σ_{j=1}^{m} (hθ(x^(j)) - y^(j)) · xi^(j)

where I'm just writing out the appropriate equation for m examples rather than one example. I won't bother to show it here, but you can go home and verify for yourself that this summation is indeed the partial derivative with respect to θi of J(θ), if you use the original definition of J(θ) for when you have m training examples.
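As a concrete sketch, here is this batch update in Octave/MATLAB on the toy data from before, with a crude stop-when-J-stops-changing test; the learning rate, tolerance, and iteration cap are all made-up values, and the division by m just rescales α.

% Batch gradient descent on the toy data (sizes rescaled to thousands of
% sqft so a simple fixed learning rate behaves reasonably).
X = [1 2.104; 1 1.600; 1 2.400; 1 1.416];
y = [400; 330; 369; 232];
m = size(X, 1);
theta = [0; 0];                        % initialize to the vector of all zeros
alpha = 0.1;                           % learning rate, set by hand
J = @(t) 0.5 * sum((X*t - y).^2);      % the cost from before, vectorized
J_old = Inf; iters = 0;
while abs(J_old - J(theta)) > 1e-6 && iters < 1e5   % crude convergence test on J
  J_old = J(theta);
  grad = X' * (X*theta - y);           % sum_j (h(x_j) - y_j) * x_j, all i at once
  theta = theta - (alpha/m) * grad;    % dividing by m only rescales alpha
  iters = iters + 1;
end
disp(theta')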
OK, so let me switch back to the laptop display and show you what this looks like when you run the algorithm. It turns out that for the specific problem of linear regression, or ordinary least squares, which is what we're doing today, the function J(θ) does not look like that nasty surface I was showing you just now, with multiple local optima. In particular, for ordinary least squares the function J(θ) is just a quadratic function, so it always has a nice bowl shape, like what you see up here, and it has only one global minimum, with no other local optima. The contours of a bowl-shaped function like that are ellipses, and if you run gradient descent on this, here's what you might get. Let's initialize the parameters, say, at the position of that cross on the upper right. After one iteration of gradient descent you've moved in the space of parameters; that's the result of one step. Then two steps, three steps, four steps, five steps, and so on, and it converges reasonably rapidly to the global minimum of this function J(θ). And this is the property of ordinary least squares regression with a linear hypothesis class: the function J(θ) has no local optima.

[Student: is α changing every time? The step size isn't constant.] I see; good question. It turns out this was done with a fixed value of α, and one of the properties of gradient descent is that as you approach a local minimum it automatically takes smaller and smaller steps, so that it converges. The reason is that the update subtracts α times the gradient from θ, and as you approach a local minimum the gradient gets smaller and smaller; at the local minimum itself the gradient is zero. So gradient descent automatically takes smaller and smaller steps as you approach a local minimum, even with a fixed α. Does that make sense?

And here's the same algorithm on a plot of the housing prices data. Let's initialize the parameters to the vector of all zeros, so θ0 and θ1 are both zero, and so your hypothesis, this blue line at the bottom, predicts that all prices are equal to zero. After one iteration of gradient descent, that's the blue line you get; after two iterations, three, four, five, and a few more iterations, it converges, and you've now found the least-squares fit to the data.
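As a quick numerical aside on that fixed-learning-rate point, here is a throwaway sketch on a one-variable quadratic (the constants are arbitrary) showing the steps shrinking on their own:

% With a fixed learning rate, the step alpha*gradient still shrinks as we
% approach the minimum, because the gradient itself goes to zero.
% One-variable example: J(t) = t^2, gradient 2t, minimum at t = 0.
t = 8; alpha = 0.1;
for k = 1:6
  step = alpha * 2*t;              % alpha is fixed; the gradient is not
  fprintf('iter %d: theta = %7.4f, step taken = %7.4f\n', k, t, step);
  t = t - step;
end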
OK, cool; let's switch back to the chalkboard. So, what questions do you have about this? [Student: when we say run an iteration, do we mean we run through all the training examples once, update the parameters, and then run again with the new values? And does "converged" mean the parameters stay roughly the same?] Yes; so this is really a question of how you test for convergence, and there are different ways of doing it. One is to look at θ across two different iterations and see whether it has changed much; if θ hasn't changed much between two iterations, you might say it's more or less converged. Something that's done maybe slightly more often is to look at the value of J(θ): if the quantity you're trying to minimize is not changing much anymore, you might be inclined to believe it has converged. These are standard heuristics, standard rules of thumb, that are often used to decide whether gradient descent has converged yet.

[Student: you mentioned looking in all the directions and choosing the lowest. With one feature there are two directions, you can go this way or that way; where in the math do you choose which way to go?] I see; so the question is how gradient descent is "looking 360 degrees around you" and choosing the direction of steepest descent. It turns out that when you compute the gradient of the function, when you compute the derivative, that just turns out to be the direction of steepest descent; and of course you would never want to go in the opposite direction, because the opposite direction is the direction of steepest ascent. So you don't explicitly look all 360 degrees around you; you just compute the derivative, and that turns out to give the direction of steepest descent. Maybe the TAs can talk a bit more about this in the discussion section on Friday, if there's interest.

OK, let's see; let me go ahead and give this algorithm a specific name. This algorithm here is actually called batch gradient descent. The term "batch" isn't a great term, but it refers to the fact that on every step of gradient descent you look at the entire training set: you perform a sum over all m training examples. Batch gradient descent often works very well, and I use it very often, but it turns out that sometimes you have a really, really large training set. Imagine that instead of 47 houses from Portland, Oregon, you had, say, the US Census database, or something like it, with hundreds of thousands or millions of training examples. If m is a few million, then to perform every single step of batch gradient descent you need to perform a sum from j = 1 to a million, which is a lot of training examples for your computer program to look at before you can take even one step downhill on the function J(θ). So it turns out that when you have very large training sets, there is an alternative algorithm, called stochastic gradient descent, sometimes also called incremental gradient descent.
The algorithm is as follows: repeat until convergence, and inside that loop iterate over j = 1 to m, performing one of these gradient descent updates using just the j-th training example:

θi := θi - α · (hθ(x^(j)) - y^(j)) · xi^(j).

And, as usual, you update all the parameters simultaneously: you perform this update for all values of i, for all of your parameters, at once. The advantage of this algorithm is that, in order to start learning, in order to start modifying the parameters, you only need to look at your first training example: you look at the first example and perform an update using the derivative of the error with respect to just that example, then you look at your second training example and perform another update, and so on. Your program starts adapting the parameters much more quickly, without needing to take a full scan over your entire US Census database, say, before it can even begin to adapt. So for large data sets, stochastic gradient descent is often much faster.

What happens with stochastic gradient descent is that it won't actually converge to the global minimum exactly: if these are the contours of your function, then as you run stochastic gradient descent you tend to wander around, and you may occasionally even go uphill, but the parameters will tend to wander into the region close to the global minimum and keep wandering around a little bit near it. And often that's just fine, having parameters that wander around a little bit near the global minimum; so, in practice, this often works much faster than batch gradient descent, especially if you have a large training set.

I'm going to clean a couple of boards; while I do that, why don't you take a look at the equations, and after I'm done cleaning the boards I'll ask what questions you have. OK, so what questions do you have about all this? [Student: with stochastic gradient descent, are you just rearranging the order of the computation? Do you use the first training example and update all of the θi, then step, then update with the second training example, and so on?] Let's see; right: I look at my first training example and take a step, and then I perform the second gradient descent update using my new parameter vector, which has already been modified using my first training example, and then I keep going. Does that make sense? [Student: so in each update of all the θi you're only using one training example?] One training example, yes. [Student: is the speedup empirical, or is there theory?] Let's see; I believe there is theory that sort of supports this as well. How clean the statement of the theorem is, I don't remember, but yes.
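Here is the stochastic update as a minimal Octave/MATLAB sketch, reusing the toy data from before. The pass count and learning rate are made up; a common refinement, not shown, is to shrink α over time so the wandering settles down.

% Stochastic (incremental) gradient descent: update the parameters after
% every single example rather than after a full pass over the data.
X = [1 2.104; 1 1.600; 1 2.400; 1 1.416];
y = [400; 330; 369; 232];
m = size(X, 1);
theta = [0; 0];
alpha = 0.05;                        % made-up learning rate
for pass = 1:200                     % a fixed number of passes, for simplicity
  for j = 1:m                        % one update per training example
    err = X(j,:) * theta - y(j);     % h(x_j) - y_j for the current theta
    theta = theta - alpha * err * X(j,:)';   % update all theta_i at once
  end
end
disp(theta')   % wanders near, not exactly at, the least-squares solution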
OK, cool. So in what I've done so far, I've talked about an iterative algorithm for performing this minimization of J(θ). It turns out that there's another way, for the specific problem of least-squares regression, of ordinary least squares: another way to perform this minimization of J(θ) that allows you to solve for the parameters θ in closed form, without needing to run an iterative algorithm. I know some of you may have seen some of what I'm about to do before, in, say, an undergraduate linear algebra course, and the way it's typically done there requires messy orthogonal projections, or taking lots of derivatives and writing lots of algebra. What I'd like to do instead is show you a way to derive the closed-form solution for θ in just a few lines of algebra. But to do that, I need to introduce a new notation for matrix derivatives. It turns out that the notation I'm about to define has, in my own personal work, turned out to be one of the most useful things I use all the time: having a notation for taking derivatives with respect to matrices lets you solve for the minimum of J(θ) with a few lines of algebra, rather than writing out pages and pages of matrices and derivatives. So let me define this new notation first, and then we'll go ahead and work out the minimization.

Given a function J, and since J is a function of a vector of parameters θ, I'm going to define the derivative, the gradient, of J with respect to θ as itself a vector: ∇θJ is the vector whose entries are the partial derivatives ∂J/∂θ0 through ∂J/∂θn. So this is an (n+1)-dimensional vector; remember, θ is an (n+1)-dimensional vector with indices ranging from 0 to n. With this notation we can actually rewrite the gradient descent algorithm (this is batch gradient descent) as updating the whole parameter vector θ at once; notice there is no subscript i now:

θ := θ - α · ∇θJ

where, in this equation, all of these quantities, θ and the gradient vector, are (n+1)-dimensional vectors.

More generally, if you have a function f that maps from, say, the space of n-by-n matrices to the space of real numbers, so a function f(A) where A is an n-by-n matrix, a function that takes as input a matrix, let me define the derivative of f with respect to the matrix A. This is the gradient of f with respect to its input, which is a matrix, and I'm going to define it to be itself a matrix: the derivative of f with respect to A is the matrix containing all the partial derivatives of f with respect to the elements of A.

One more definition: if A is a square matrix, an n-by-n matrix with the number of rows equal to the number of columns, let me define the trace of A to be the sum of A's diagonal elements, so tr A = Σi Aii. For those of you who haven't seen this sort of operator notation before, you can think of tr(A) as the trace operator applied to the square matrix A, but it's more commonly written without the parentheses; I'll usually just write tr A, and this just means the sum of the diagonal elements.

So here are some facts about the trace operator and about matrix derivatives. I'm just going to write these down without proof; you can see proofs of some of them at the discussion section, or you can go home and verify them yourself. It turns out that, given two matrices A and B, the trace of the matrix product AB is equal to the trace of BA: tr AB = tr BA. I'm not going to prove this, but you should be able to go home and prove it yourself without too much difficulty.
Similarly, for the trace of a product of three matrices, you can take the matrix at the end and cyclically permute it to the front: tr ABC = tr CAB (take the matrix C at the back and move it to the front), and this is also equal to tr BCA (take the matrix B and move it to the front).

Also, suppose you have a function f(A) defined as f(A) = tr AB. The trace is a real number, so tr AB is a function that takes as input a matrix A and outputs a real number. Then the derivative with respect to the matrix A of this function is ∇A tr AB = Bᵀ. This is just another fact that you can prove yourself by going back to the definitions of traces and matrix derivatives, but I'm not going to prove it; work it out.

And lastly, a couple of easy ones: tr A = tr Aᵀ, because the trace is just the sum of the diagonal elements, and if you transpose a matrix the diagonal elements don't change. And if lowercase a is a real number, then the trace of a real number is just itself: think of a real number as a 1-by-1 matrix, and the trace of a 1-by-1 matrix is just whatever that real number is. And finally, this one is somewhat trickier: the derivative with respect to the matrix A of tr(ABAᵀC) is ∇A tr ABAᵀC = CAB + CᵀABᵀ. I won't prove that either; it's just algebra, so work it out yourself. These five facts about traces and matrix derivatives are the key equations I'm going to use.

We have about ten minutes left, so, armed with these facts, let's try to come up with a quick derivation of how to minimize J(θ), as a function of θ, in closed form, without needing to use an iterative algorithm. To work this out, let me define the matrix X, called the design matrix, to be the matrix containing all the inputs from my training set. So x^(1) was the vector of inputs, the vector of features, for the first training example, and I'm going to set x^(1) transpose to be the first row of this matrix X, my second training example's inputs to be the second row, and so on down through my m training examples; that's going to be my design matrix X. Now let me take this matrix X and multiply it by my parameter vector θ; this derivation will take just two or three steps. Remember how matrix-vector multiplication goes: you multiply the vector by each of the rows of the matrix, so Xθ is just the vector [x^(1)ᵀθ; … ; x^(m)ᵀθ], which is of course just the vector of predictions of your hypothesis on each of your m training examples. Let me also define the vector y to be the vector of all the target values, y^(1) through y^(m), in my training set, so y is an m-dimensional vector. Then Xθ - y, combining the math from the previous board, is going to be the vector whose entries are hθ(x^(i)) - y^(i). Now, Xθ - y is an m-dimensional vector if I have m training examples, and I'm going to take this vector and take its inner product with itself. Recall that if z is a vector, then zᵀz is just Σi zi²; that's how you take the inner product of a vector with itself.
So I'm going to take this vector Xθ - y and take the inner product of this vector with itself, and that gives me

(Xθ - y)ᵀ(Xθ - y) = Σ_{i=1}^{m} (hθ(x^(i)) - y^(i))²

since it's just the sum of the squares of the elements of this vector. Put the one-half in front, and this is exactly our previous definition of J(θ):

J(θ) = (1/2) (Xθ - y)ᵀ(Xθ - y).

[Student asks about the indices.] Yes; I know I threw a lot of notation at you today. So m is the number of training examples, and the training examples are indexed from 1 through m, whereas the feature vector is indexed from 0 through n. So this is a sum from i = 1 through m, and, separately, θᵀx is equal to Σ_{j=0}^{n} θj·xj. Does that make sense? The features are indexed from 0 through n, where x0 = 1, whereas the training examples are indexed from 1 through m. Let me clean a few more boards; take another look at this and make sure it all makes sense. [Student points out a typo.] Oh, yes, thank you: i-th training example. Anything else?
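As a quick sanity check in code, on the same made-up data as before, the compact matrix-vector expression and the earlier per-example sum give the same number:

% Check that 1/2*(X*theta - y)'*(X*theta - y) matches the loop version of J.
X = [1 2.104; 1 1.600; 1 2.400; 1 1.416];
y = [400; 330; 369; 232];
theta = [10; 150];                       % arbitrary test point
J_vec = 0.5 * (X*theta - y)' * (X*theta - y);
J_loop = 0;
for i = 1:size(X, 1)
  J_loop = J_loop + 0.5 * (X(i,:)*theta - y(i))^2;
end
fprintf('vectorized: %.4f   loop: %.4f\n', J_vec, J_loop);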
So we're actually nearly done with this derivation. We would like to minimize J(θ) with respect to θ, and we've written J(θ) fairly compactly using this matrix-vector notation; so, in order to minimize J(θ) with respect to θ, what we're going to do is take the derivative with respect to θ of J(θ), set it to zero, and solve for θ. I should mention that there will be some steps here that I'll do fairly quickly without proof. For example, is it really true that the derivative of half of that is half of the derivative; can I really exchange the derivative and the one-half? It turns out the answer is yes, but later you should go home, look at the lecture notes, and make sure you understand and believe why every step is correct. I'm going to do things relatively quickly here, and you can work through every step yourself more slowly by referring to the lecture notes.

So we have

∇θ J(θ) = ∇θ (1/2)(Xθ - y)ᵀ(Xθ - y) = (1/2) ∇θ (θᵀXᵀXθ - θᵀXᵀy - yᵀXθ + yᵀy)

which is just taking the quadratic function and expanding it out by multiplying through the arguments; again, work through this step yourself later if you're not quite sure how I did it. Now, this quantity in the parentheses is, up to the one-half, just J(θ), so it's a real number, and the trace of a real number is just itself; so I can stick a trace operator in front without changing anything:

(1/2) ∇θ tr(θᵀXᵀXθ - θᵀXᵀy - yᵀXθ + yᵀy).

By the cyclic permutation property of the trace, I can take the θ at the end of the first term and move it to the front, so that term becomes tr(θθᵀXᵀX). For the middle terms: θᵀXᵀy here is also a real number, and the transpose of a real number is just itself, so I can take the transpose without changing anything; the transpose of θᵀXᵀy is yᵀXθ, so the two middle terms are equal, and together they give -2 tr(yᵀXθ). And the last term, yᵀy, doesn't depend on θ at all, so when I take the derivative with respect to θ, that term gives zero and we can just drop it.

Now, for the derivative with respect to θ of tr(θθᵀXᵀX), I'm going to use one of the facts I wrote down earlier, ∇A tr(ABAᵀC) = CAB + CᵀABᵀ: insert an identity matrix, so this is tr(θ·I·θᵀ·XᵀX) with A = θ, B = I, and C = XᵀX. This gives XᵀXθ·I (the identity, which we can ignore) plus (XᵀX)ᵀθ·Iᵀ, and since the matrix XᵀX is symmetric, Cᵀ = C, so both terms are just XᵀXθ. Similarly, the derivative with respect to θ of tr(yᵀXθ) has the form of the derivative with respect to a matrix A of tr(BA), which is Bᵀ by one of the rules I wrote down earlier, and that gives Xᵀy. So, plugging these back into our formula, we find that

∇θ J(θ) = (1/2)(XᵀXθ + XᵀXθ - Xᵀy - Xᵀy) = XᵀXθ - Xᵀy.

We set this to zero, and we get

XᵀXθ = Xᵀy

which is called the normal equations, and we can now solve this equation for θ in closed form:

θ = (XᵀX)⁻¹ Xᵀy.

And so this gives us a way of solving for the least-squares fit to the parameters in closed form, without needing to use an iterative algorithm like gradient descent.
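In Octave/MATLAB, on the same toy data, that closed form is essentially one line; the backslash operator (or pinv) is the numerically preferred way to apply it:

% Normal equations on the toy data: theta = (X'X)^(-1) X'y.
X = [1 2.104; 1 1.600; 1 2.400; 1 1.416];
y = [400; 330; 369; 232];
theta = (X'*X) \ (X'*y);      % solves X'X * theta = X'y directly
% theta = pinv(X'*X) * (X'*y) % pseudo-inverse variant; also handles the
%                             % degenerate case where X'X is not invertible
disp(theta')                  % should match what gradient descent approaches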
And using this matrix-vector notation, I think we did this whole thing in about ten minutes, which we couldn't have done if I had been writing out reams of algebra. OK, some of you look a little bit dazed, but, guys, this is our first learning algorithm; aren't you excited? Any quick questions about this before we close for today? [Student: isn't what you derived there just the pseudo-inverse of X?] The pseudo-inverse? Yeah; it turns out that in cases where XᵀX is not invertible, you use the pseudo-inverse to solve this. But if XᵀX is not invertible, that usually means your features were linearly dependent; it usually means you did something like repeat the same feature twice in your training set. So if this is not invertible, the minimum is obtained by using the pseudo-inverse instead of the inverse; and if you don't know what I just said, don't worry about it, because it usually won't be a problem. Anything else? [Student asks about a step on the second board.] Let me take that offline; I think we're running over, so let's close for today, and if there are other questions I'll take them offline. OK, thanks, guys.

This presentation is delivered by the Stanford Center for Professional Development.

OK, good morning. Welcome back to the third lecture of this class. Here's what I want to do today, and some of the topics I'll cover may seem like I'm jumping from topic to topic, but here's the outline and the logical flow of ideas. In the last lecture we talked about linear regression, and today I want to talk about an adaptation of it called locally weighted regression; it's a very powerful algorithm that was probably one of my former mentor's favorite machine learning algorithms. We'll then talk about the probabilistic interpretation of linear regression and use that to move on to our first classification algorithm, which is logistic regression. We'll take a brief digression to tell you about something called the perceptron algorithm, which we'll come back to again later this quarter, and, time allowing, I hope to get to Newton's method, which is an algorithm for fitting logistic regression models.

So let's just recap what we talked about in the previous lecture. Remember, the notation I defined was that I use x^(i), y^(i) to denote the i-th training example; and when we're talking about linear regression, or ordinary least squares, we use hθ(x^(i)) to denote the predicted value output by the hypothesis h on the input x^(i), where the hypothesis is parameterized by the vector of parameters θ. So we said that

hθ(x) = Σ_{j=0}^{n} θj·xj

written most simply as θᵀx, and we had the convention that x0 = 1, which accounts for the intercept term in our linear regression model. Lowercase n here was the notation I was using for the number of features in my training set; so in the example where we're trying to predict housing prices from two features, the size of the house and the number of bedrooms, little n was equal to 2. And just to finish recapping the previous lecture, we defined the quadratic cost function

J(θ) = (1/2) Σ_{i=1}^{m} (hθ(x^(i)) - y^(i))²

where the sum is over the m training examples in my training set; lowercase m was the notation I've been using to denote the number of training examples I have, the size of my training set. And at the end of the last lecture we derived, in closed form, the value of θ that minimizes this:

θ = (XᵀX)⁻¹ Xᵀy.

As I move forward in today's lecture, I'll continue to use this notation, and again, I realize there's a fair amount of notation to remember, so if partway through this lecture you have trouble remembering what lowercase m is, or what lowercase n is, or something, please raise your hand and ask.

When we talked about linear regression last time, we used two features: one of the features was the size of the house in square feet, the living area of the house, and the other feature was the number of bedrooms in the house. In general, when you're out there applying machine learning algorithms to some problem that you care about, the choice of features will very much be up to you, and the way you choose the features to give the learning algorithm will often have a large impact on how well it actually does. So, just for example, the choice we made last time was x1 = the size; and let's leave aside the other feature, the number of bedrooms, for now, and say we don't have data that tells us how many bedrooms these houses have. One thing you could do is actually define additional features; let me draw this out.
So let's say that axis is the size of the house, and that axis is the price. If you use size as your feature, maybe you get h(x) = θ0 + θ1·x1; this is the linear model. If instead, copying the same data set over, you define a set of features where x1 is equal to the size of the house and x2 is the square of the size of the house (so x1 is the square footage, and x2 is just whatever the square footage is, squared), that would be another way to come up with features, and if you do that, then the same algorithm will end up fitting a quadratic function for you: θ0 + θ1·x1 + θ2·x1², because x1² is playing the role of x2. And, depending on what the data looks like, maybe this fits the data slightly better.

You can actually take this even further. Let's see: I have seven training examples here, so you could actually fit up to a sixth-order polynomial; you can fit the model θ0 + θ1·x + θ2·x² + … + θ6·x⁶ to these seven data points. If you do that, you'll find that you come up with a model that fits your data exactly: in this example, with seven data points, fitting a sixth-order polynomial gives you a curve that passes through all seven points perfectly, and you'd probably find that the curve you get looks maybe something like that. On the one hand, this is a great model, in the sense that it fits your training data perfectly. On the other hand, this is probably not a very good model, in the sense that none of us seriously thinks this is a very good predictor of housing prices as a function of the size of the house. We'll come back to this later, but of the models we have here, I feel like maybe the quadratic model fits the data best, whereas with the linear model it looks like there's actually a bit of a quadratic component in this data that the linear function is not capturing.

So we'll come back to this a little bit later and talk about the problems associated with fitting models that are either too simple, using too small a set of features, or too complex, using too large a set of features. Just to give these problems names: we call the first one the problem of underfitting, and, very informally, this refers to a setting where there are obvious patterns in the data that the algorithm is failing to fit. The second problem we refer to as overfitting, and, again very informally, this is when the algorithm fits the idiosyncrasies of the specific data set: it just so happens that, of the seven houses we sampled in Portland, or wherever you collected data from, that house happened to be a bit more expensive and that house happened to be a little less expensive, and by fitting a sixth-order polynomial we're fitting the idiosyncratic properties of this data set rather than the true underlying trend of how housing prices vary with the size of the house. So these are two very different problems; we'll define them more formally later and talk about how to address each of them. But for now, I hope you appreciate that there is this issue of selecting features, and if you want to choose features for a learning problem, there are a few ways to do so; we'll talk about feature selection algorithms, automatic algorithms for choosing which features to use in a regression problem like this, later this quarter as well.
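Here is a minimal sketch of this feature-construction trick, again on made-up data: the fitting algorithm (the normal equations from last lecture) is unchanged; only the columns of the design matrix differ.

% Fitting a quadratic by feeding the squared size in as an extra feature.
sizes  = [1.0; 1.4; 1.6; 2.1; 2.4; 2.8; 3.0];   % made-up sizes (thousands of sqft)
prices = [200; 250; 330; 400; 380; 460; 430];   % made-up prices
X_lin  = [ones(7,1), sizes];                    % features: 1, x
X_quad = [ones(7,1), sizes, sizes.^2];          % features: 1, x, x^2
theta_lin  = (X_lin'*X_lin)   \ (X_lin'*prices);
theta_quad = (X_quad'*X_quad) \ (X_quad'*prices);
% With all powers up to x^6, the 7x7 system fits the 7 points exactly:
% zero training error, but almost certainly overfitting.
X_six = ones(7,1);
for p = 1:6, X_six = [X_six, sizes.^p]; end
theta_six = X_six \ prices;                     % interpolates all seven points
disp(theta_quad')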
We'll talk about feature selection algorithms later this quarter as well — automatic algorithms for choosing what features to use in a regression problem like this. What I want to do today is talk about a class of algorithms called non-parametric learning algorithms that will help to alleviate somewhat the need to choose features very carefully, and this will lead us into our discussion of locally weighted regression.

Let me just define a term. Linear regression, as we've defined it so far, is an example of a parametric learning algorithm: an algorithm with a fixed number of parameters that are fit to the data. In linear regression we had a fixed set of parameters theta that was fit to the data. In contrast, what I'm going to talk about now is our first non-parametric learning algorithm. The formal definition, which is not very intuitive — I'll replace it in a second with something more intuitive — is that a non-parametric learning algorithm is one where the number of parameters grows with m, the size of the training set; usually it's defined as the number of parameters growing linearly with the size of the training set. A slightly less formal definition is that the amount of stuff your learning algorithm needs to keep around grows linearly with the training set — in other words, an algorithm that needs to keep around the entire training set even after learning. Don't worry too much about this definition; what I want to do now is describe a specific non-parametric learning algorithm called locally weighted regression, which also goes by a couple of other names — it goes by LOESS for sort of historical reasons, sometimes spelled LOWESS as well; I'll just call it locally weighted regression.

So here's the idea — this will be an algorithm that lets us worry a little bit less about having to choose features very carefully. For my motivating example, let's say I have a training set that looks like this: this axis is x and that's y. If you run linear regression on this and fit a linear function, you end up with a more or less flat straight line, which is not a very good fit to this data. You can sit around and stare at the data and try to decide what other features to use — maybe you want to toss in a quadratic function, but this isn't really quadratic either, so maybe you want to model it as x plus x^2 plus maybe some function of sin(x) or something. You sit around and fiddle with features, and after a while you can probably come up with a set of features that models the data okay — but let's talk about an algorithm you can use without needing to do that.

Suppose you want to evaluate your hypothesis h at a certain query point, lowercase x — say you want to know the predicted value of y at this position. For linear regression, what we did was: fit theta to minimize sum over i of (y^(i) - theta^T x^(i))^2, and return theta^T x. That was linear regression. In contrast, in locally weighted linear regression I'm going to do something slightly different: I'm going to look at this point x, then look at my data set and take into account only the data points that
are sort of in a little vicinity of x. So I'm going to look at where I want to evaluate my hypothesis, look only in the vicinity of that point, take, let's say, just these few points, and apply linear regression to fit a straight line just to this subset of the data — I'm using the term "subset" loosely; we'll come back to that later. So I take this data, fit a straight line to it, maybe I get a straight line like that, and then I evaluate that straight line at this particular value, and that's the value my algorithm returns — the value that my hypothesis outputs in locally weighted regression.

Let me go ahead and formalize that. In locally weighted regression, we're going to fit theta to minimize sum over i of w^(i) (y^(i) - theta^T x^(i))^2, where these terms w^(i) are called weights. There are many possible choices of weights; I'm just going to write one down: w^(i) = exp(-(x^(i) - x)^2 / 2). So let's look at what these weights really are. Suppose you have a training example x^(i) that is very close to x, so x^(i) - x is small, close to zero. Then the weight is e to the minus zero, and e^0 is 1. So if x^(i) is close to x, then w^(i) will be close to one — in other words, the weight associated with the i-th training example is close to one when x^(i) and x are close to each other. Conversely, if x^(i) - x is large, then w^(i) will be close to zero: if x^(i) is very far from x, this is e to the minus of some large number, which is close to zero.

So the picture is: if I'm querying at a certain point x, shown here on the x-axis, and my data set looks like that, then I give the points close to x a large weight and the points far away a small weight. For the faraway points, w^(i) is close to zero, so it's as if they barely contribute to the summation at all — think of the sum over i as roughly one times the quadratic term for nearby points plus zero times the quadratic term for faraway points. The effect of this weighting is that locally weighted linear regression fits a set of parameters theta paying much more attention to fitting the nearby points accurately, while largely ignoring the contribution from faraway points.

Student: Why an exponential? — Let's see: it turns out there are many other weighting functions you can use. Different communities of researchers tend to choose different defaults, and there's somewhat of a literature debating exactly which function to use; this exponential-decay function just happens to be a reasonably common one that seems to be a reasonable choice for many problems, but you can plug in other functions as well. One side comment, for those of you familiar with the normal distribution, the Gaussian distribution: the formula I've written here cosmetically looks a bit like a Gaussian, but it actually has absolutely nothing to do with Gaussian
distributions. We are not saying, probabilistically, that x^(i) is Gaussian or anything like that — there is no such interpretation. It's just a convenient function that happens to be bell-shaped; it doesn't endow us with any Gaussian semantics. In fact, recalling the familiar bell-shaped Gaussian: the weights you end up associating with these points work as if you put a bell-shaped bump centered at the position where you want to evaluate your hypothesis h — this point gets a weight proportional to the height of the bell-shaped function evaluated at it, that training example gets a weight proportional to that height, and so on, so training examples that are really far away get a very small weight.

One last small generalization: normally there's one other parameter to this algorithm, which I'll denote tau. Again, this looks suspiciously like the variance of a Gaussian, but it isn't one — it's just a convenient functional form: w^(i) = exp(-(x^(i) - x)^2 / (2 tau^2)). This parameter tau is called the bandwidth parameter, and informally it controls how fast the weights fall off with distance. Let me copy my diagram from the other board: if tau is very small, then for a query x you end up choosing a fairly narrow bell shape, so the weights of faraway points fall off rapidly; if tau is large, you end up choosing a weighting function that falls off relatively slowly with distance from your query.

So I hope you can see that if you apply locally weighted regression to a data set like this, then to ask what your hypothesis outputs at a point like this, you fit a straight line there and make that prediction; to ask what it outputs at that other value, you fit a straight line there and predict that value. It turns out that every time you evaluate your hypothesis — every time you ask your learning algorithm to make a prediction, say for how much a new house will cost — you need to run a new fitting procedure and then evaluate the line you fit at the position of the query. But if you do this for every point along the x-axis, you find that locally weighted regression is able to trace out a very non-linear curve for a data set like this. In the problem set we'll actually let you play around more with this algorithm, so I won't say too much more about it here — a sketch of the whole procedure in code follows below — but before I move on to the next topic, let me check what questions you have.

Student: It seems like you still have the same problem with overfitting and underfitting — you have to choose tau, and if you make it too small... — Yes, absolutely. Locally weighted regression is not a panacea for the problem of overfitting or underfitting; you can still run into the same problems. Some of this I'll leave you to discover for yourself in the homework problem — you'll actually see what she just mentioned.
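Here's the promised sketch of locally weighted regression — a minimal version, assuming one input feature plus the x_0 = 1 intercept and made-up data; the weighted fit solves the weighted normal equations theta = (X^T W X)^{-1} X^T W y at each query point:

    import numpy as np

    def lwr_predict(x_query, X, y, tau=0.8):
        """Locally weighted linear regression prediction at one query point.

        X is the (m, 2) design matrix with x0 = 1 prepended; y holds the
        targets. Note theta is refit for every query -- that is what makes
        the algorithm non-parametric.
        """
        # Bell-shaped weights: w_i = exp(-(x_i - x)^2 / (2 tau^2)).
        w = np.exp(-((X[:, 1] - x_query) ** 2) / (2.0 * tau ** 2))
        W = np.diag(w)
        # Weighted least squares: theta = (X^T W X)^{-1} X^T W y.
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        return np.array([1.0, x_query]) @ theta

    # Tiny made-up data set with a clearly non-linear trend.
    rng = np.random.default_rng(0)
    xs = np.linspace(0.0, 10.0, 40)
    ys = np.sin(xs) + 0.1 * rng.normal(size=40)
    X = np.column_stack([np.ones_like(xs), xs])

    # Tracing out predictions along the x-axis recovers the non-linear curve.
    preds = [lwr_predict(q, X, ys) for q in np.linspace(0.0, 10.0, 11)]
    print(np.round(preds, 2))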
Student: It almost seems like you're not even building a model with this — you need all the data you originally had anyway, the original data points. — Right, so the question is that it's almost as if you're not building a model, because you need the entire data set; another way of saying that is that this is a non-parametric learning algorithm. I won't debate whether we're "really" building a model or not, but this is perfectly fine — when you write code implementing locally weighted linear regression on a data set, I think of that code as a whole as building your model. And we've actually used this quite successfully — to model the dynamics of an autonomous helicopter, for instance.

Student: Is there a variant of this algorithm that learns the weights from the data, instead of using the exponential? — I see, yes. It turns out there are a few things you can do; one thing that's quite common is choosing the bandwidth parameter tau using the data, and we'll talk about that a bit later when we talk about model selection.

One last question. Student: Could you make the weighting function a Gaussian? — Let's see: the weights are not random variables, and for the purposes of this algorithm it's not useful to endow them with probabilistic semantics. You could choose to define things as Gaussian, but it doesn't lead anywhere. In fact, I happened to choose this bell-shaped function to define my weights, but it's actually fine to choose a function that doesn't even integrate to one — that integrates to infinity, say — as your weighting function. So in that sense you could force in the definition of a Gaussian, but it's sort of not useful, especially since you can use other functions that integrate to infinity rather than to one.

Okay, last question, then I should move on. Student: Assume we have a very huge data set — a very huge set of houses — and we want to predict the label for each house, so we have to run the algorithm for each input. Isn't that very costly? — Yes, you're right. Because locally weighted regression is a non-parametric algorithm, every time you make a prediction you need to fit theta to your entire training set again, so if you have a very large training set this is a somewhat expensive algorithm to use: every time you want to make a prediction, you need to fit a straight line to a huge data set again. It turns out there are ways to make this much more efficient for large data sets — if you're interested, look up the work of Andrew Moore on KD-trees; he figured out ways to fit these models much more efficiently. That's not something I want to go into today, though. Let me move on, and let's take more questions later.

Okay, so that was locally weighted regression. Remember the outline I gave at the beginning of this lecture: what I want to do now is talk about a probabilistic interpretation of linear regression, and in particular it's this probabilistic interpretation that will let us move on to logistic regression, which will be our first classification algorithm. So let's put aside locally weighted regression for now, talk about ordinary, unweighted linear regression, and ask the question: why least squares? Of all the things we could optimize, how did we come
up with this criterion of minimizing the square of the error between the predictions of the hypothesis and the observed values y? Why not minimize the absolute value of the errors, or the errors to the power of four, or something? What I'm going to do now is present one set of assumptions that will serve to, quote, justify why we minimize the sum of squared errors. It turns out there are many sets of assumptions sufficient to justify least squares, and this is just one of them — so, just to be clear, I'll present one set of assumptions under which least-squares regression makes sense, but this is not the only such set; even if the assumptions I describe don't hold, least squares can still make sense in many circumstances. But this will give you one rationalization, one reason, for doing least-squares regression.

In particular, I'm going to endow the least-squares model with probabilistic semantics. Let's assume, in our example of predicting housing prices, that the price a house sells for is some linear function of the features plus an error term: y^(i) = theta^T x^(i) + epsilon^(i). Epsilon^(i) is our error term. You can think of it as capturing unmodeled effects — maybe there are other features of a house, like how many fireplaces it has or whether it has a garden, that we just failed to capture — or you can think of epsilon as random noise. So epsilon captures both unmodeled effects — things we forgot to model, or maybe the function isn't quite linear — as well as random noise, like maybe that day the seller was in a really bad mood and refused to sell for a reasonable price.

Now I'll assume the errors have a probability distribution: epsilon^(i) ~ N(0, sigma^2), where the tilde denotes "is distributed as" — a Gaussian distribution with mean zero and variance sigma^2. I'm using script N here, which stands for normal, to denote the normal distribution, also known as the Gaussian distribution. Actually, just quickly, raise your hand if you've seen a Gaussian distribution before. Okay, cool — most of you, almost everyone. So in other words, the density of epsilon^(i) is p(epsilon^(i)) = 1/(sqrt(2 pi) sigma) * exp(-(epsilon^(i))^2 / (2 sigma^2)) — the bell-shaped curve with standard deviation sigma. That's the formula for the bell-shaped curve; let me erase the board.

So this implies that the probability distribution of the price of a house, given x^(i) and the parameters theta, is Gaussian with that density: p(y^(i) | x^(i); theta) = 1/(sqrt(2 pi) sigma) * exp(-(y^(i) - theta^T x^(i))^2 / (2 sigma^2)). Another way of saying this is that the price of a house, given its features and my parameters theta, is a random variable distributed Gaussian with mean theta^T x^(i) and variance sigma^2 — because we imagine that the housing price is generated as theta^T x^(i) plus some random Gaussian noise with variance sigma^2. A small simulation of this generative story is sketched below.
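This is just an illustration of the generative model — everything here (the "true" theta, sigma, and the feature values) is made up — simulating data as y = theta^T x + epsilon with Gaussian noise and checking that least squares recovers something near the truth:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical "true" parameters and noise level for the simulation.
    theta_true = np.array([50.0, 0.15])   # intercept, price per square foot
    sigma = 20.0

    # Features with the x0 = 1 intercept convention.
    sizes = rng.uniform(1000.0, 3000.0, size=100)
    X = np.column_stack([np.ones_like(sizes), sizes])

    # y^(i) = theta^T x^(i) + epsilon^(i),  epsilon^(i) ~ N(0, sigma^2), IID.
    eps = rng.normal(0.0, sigma, size=100)
    y = X @ theta_true + eps

    # Fitting by least squares recovers an estimate near theta_true
    # (up to the sampling noise).
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(theta_true, theta_hat)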
Does this make sense? Can you raise your hand if this makes sense? Yeah, okay — most of you. One point of notation — oh, yes, a question first.

Student: We don't know anything about the error — why do you assume the error is Gaussian? — Right, why assume the error is Gaussian? Two reasons. One is that it turns out to be mathematically convenient. The other — I could also mumble about justifications such as the central limit theorem — is that for the vast majority of problems, if you apply a linear regression model like this and actually measure the distribution of the errors, not all the time but very often you find that the errors really are Gaussian: a Gaussian model is a good assumption for the error in regression problems like these. Some of you may have heard of the central limit theorem, which says that the sum of many independent random variables tends towards a Gaussian. So if the error is caused by many effects — the mood of the seller, the mood of the buyer, some features we missed, whether the place has a garden or not — and all of these effects are independent, then by the central limit theorem you might be inclined to believe that their sum will be approximately Gaussian. But in practice, I guess the two real answers are: one, in practice this is a reasonably accurate assumption, and two, it's mathematically convenient.

Student: It seems like by assuming the error around our model has zero mean, the error is centered around our model — it's almost like we're assuming what we're trying to prove. — Let's set aside what I'm trying to prove, but yes, we are assuming that the error has zero mean. I suspect later this quarter we'll get to some of these other issues, but for now, think of this as a mathematically convenient and actually not unreasonable assumption. In machine learning, the assumptions we make are almost never true in the absolute sense. For instance, housing prices are quoted in dollars and cents, so errors in prices are not continuous-valued random variables — houses can only be priced at some whole number of dollars and cents, and you never have fractions of cents in housing prices, whereas a Gaussian random variable would be continuous. In that sense the assumptions we make are never, quote, absolutely true, but for practical purposes this is an accurate enough assumption to be useful. I think in a week or two we'll come back and say a bit more about these sorts of assumptions — when they help our learning algorithms and when they hurt them — when we talk about generative and discriminative learning algorithms.

Okay, so let me point out one bit of notation: when I wrote this down, I wrote p(y^(i) | x^(i); theta) with a semicolon before theta. I'm going to use this notation when we are not thinking of theta as a random variable. In statistics this is sometimes called the frequentist point of view, where we think of there being some true value of theta out there that's generating the data; we don't know what theta is, but theta is not a random variable —
it's not that there's some random value of theta out there; there's some true value of theta out there, and we just don't know what it is. If theta is not a random variable, then I'm going to avoid writing p(y^(i) | x^(i), theta) with a comma, because that would mean the probability of y^(i) conditioned on x^(i) and theta, and you can only condition on random variables. So at least for this part of the class, where we take the frequentist viewpoint rather than a Bayesian viewpoint — thinking of theta not as a random variable but just as something we're trying to estimate — I'll use the semicolon notation. The way to read it is: the probability of y^(i), given x^(i), and parameterized by theta. You read the semicolon as "parameterized by", and in the same way here, y^(i), given x^(i) and parameterized by theta, is distributed Gaussian with that density.

All right. Then we'll make one more assumption: the error terms are IID, which stands for independently and identically distributed. The "identically distributed" part just means they all come from the same Gaussian distribution with the same variance; the more important part is that the epsilon^(i) are independent of each other.

Now let's talk about how to fit a model. The probability of the y's given the x's, parameterized by theta — I'm going to give this another name: I'll call it the likelihood of theta. L(theta) = p(y | X; theta), which is the product over my training set, product over i of p(y^(i) | x^(i); theta), which in turn is a product of the Gaussian densities I wrote down just now. Again, a point of notation: I defined this term to be the likelihood of theta, and the likelihood of theta is just the probability of the data y, given X and parameterized by theta. The terms "likelihood" and "probability" are often confused. The likelihood of theta is the same quantity as the probability of the data you saw — they're really the same thing — except that when I use the term likelihood, I'm emphasizing that I'm viewing this quantity as a function of theta, holding X and y fixed. So you'll hopefully hear me say "the likelihood of the parameters" and "the probability of the data", rather than "the likelihood of the data" or "the probability of the parameters"; I'll try to be consistent in that terminology.

So, given that the probability of the data is this — and this is also the likelihood of the parameters — how do you estimate the parameters theta? Given a training set, what parameters theta do you choose for your model? The principle of maximum likelihood estimation says: choose the value of theta that makes the data as probable as possible — choose theta to maximize the likelihood. That's maximum likelihood estimation from statistics: choose the parameters that make it as likely, as probable, as possible for me to have seen the data I just saw. And so, for mathematical convenience, let me
define lowercase l(theta) — this is called the log likelihood function — which is just the log of capital L(theta). So l(theta) = log of the product over i of 1/(sqrt(2 pi) sigma) * exp(-(y^(i) - theta^T x^(i))^2 / (2 sigma^2)). The log of a product is the sum of the logs, so this is l(theta) = sum_{i=1}^m [ log(1/(sqrt(2 pi) sigma)) - (y^(i) - theta^T x^(i))^2 / (2 sigma^2) ], because the log and the exponentiation cancel each other: log of e of something is just whatever is inside the exponent. So maximizing the likelihood — equivalently, maximizing the log likelihood — is the same as minimizing that squared-error term over there: because of the minus sign, maximizing this expression is the same as minimizing (1/2) sum_{i=1}^m (y^(i) - theta^T x^(i))^2 as a function of theta, and that is of course just the quadratic cost function J(theta) we had last time. So what we've just shown is that the ordinary least-squares algorithm from the previous lecture is just maximum likelihood under this probabilistic model — assuming IID Gaussian errors on our data.

One thing we'll revisit in the next lecture: notice that the value of sigma^2 doesn't matter. Sigma^2 has to be a positive number — it's the variance of a Gaussian — but no matter what positive value it takes, the value of theta we end up with is the same, because minimizing this expression gives the same theta regardless of sigma^2. So it's as if, in this model, the value of sigma^2 doesn't really matter; just remember that for the next lecture, when we'll come back to it. Questions about this? Let me clean a couple of boards and see what questions you have — in the meantime, a quick numerical check of this equivalence is sketched below.
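This is a minimal numerical check, with made-up data, that the theta maximizing the Gaussian log likelihood is exactly the least-squares theta, for any positive sigma — nudging theta away from the least-squares solution only increases the negative log likelihood, no matter which sigma you pick:

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(50), rng.uniform(0.0, 10.0, 50)])
    y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 3.0, 50)

    def neg_log_likelihood(theta, sigma):
        """-l(theta) under the IID Gaussian error model, at a fixed sigma."""
        resid = y - X @ theta
        return (len(y) * np.log(np.sqrt(2.0 * np.pi) * sigma)
                + np.sum(resid ** 2) / (2.0 * sigma ** 2))

    # Least-squares fit (the closed form from the previous lecture).
    theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

    # For every sigma > 0, perturbing theta away from theta_ls increases
    # -l(theta): the MLE is the least-squares theta, and sigma plays no
    # role in where the maximum sits.
    perturbations = (np.array([0.1, 0.0]), np.array([0.0, 0.1]), np.array([-0.1, 0.05]))
    for sigma in (0.5, 3.0, 10.0):
        base = neg_log_likelihood(theta_ls, sigma)
        print(sigma, all(neg_log_likelihood(theta_ls + d, sigma) > base
                         for d in perturbations))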
Student: Here you measure the quality of theta by a function of the error, but I think that's not a good measure, because it depends on the family of theta — for example, if we have a lot of parameters we have a problem with overfitting. — I think the things you're mentioning are deeper questions about learning algorithms that we'll come back to later, so I don't really want to get into them right now. Other questions? Okay.

So this endows linear regression with a probabilistic interpretation, and I'm going to use this probabilistic interpretation to derive our next learning algorithm, which will be our first classification algorithm. You'll recall I said that regression problems are those where the variable y you're trying to predict is continuous-valued. Now I'm going to talk about our first classification problem, where the value y you're trying to predict is discrete-valued — it can take on only a small number of discrete values. In this case I'll talk about binary classification, where y takes on only two values. You come up with classification problems if you're trying to do, say, medical diagnosis and want to decide, based on some features, whether a patient has a disease or not; or, in the housing example, maybe you're trying to decide: will this house sell in the next six months or not? The answer is either yes or no — it'll either be sold in the next six months or it won't. Other standard examples: if you want to build a spam filter — is this email spam or not, yes or no. Or, some of my colleagues are interested in predicting whether a computer system will crash, so you could have a learning algorithm predict: will this computing cluster crash within the next 24 hours? Again, a yes-or-no answer.

So that's x, that's y, and in binary classification y takes on two values, zero and one. What can you do? Well, one thing you could do is take linear regression, as we've described it so far, and apply it to this problem. Given this data set, you can fit a straight line to it; maybe you get that straight line. Now, the data set I've drawn is an amazingly easy classification problem: it's pretty obvious to all of us what the relationship between x and y is — you just look at a value around here, and to the right y is one, to the left y is zero. If you apply linear regression to this data set you get a reasonable fit, and you can then take your linear regression hypothesis — the straight line — and threshold it at 0.5. If you do that, you sort of get the right answer: you predict y is one if x is to the right of the midpoint, and y is zero if x is to the left of it.

Some people actually do this — apply linear regression to classification problems — and sometimes it works okay, but in general it's a pretty bad idea, and here's why. Let's say I change my training set by giving you just one more training example, all the way out there. Given this new training set, it's actually still entirely obvious what the relationship between x and y is: if x is greater than some threshold, y is one; if less, y is zero. Giving you this additional training example really shouldn't change anything — I didn't convey much new information; there's no surprise that this point corresponds to y = 1. But if you now fit linear regression to this data set, you end up with, I don't know, maybe a line that looks like that, and now the predictions of your hypothesis have changed completely if you threshold it at 0.5.

Student: So in between there might be an interval where it predicts zero, because of that far-out point? — Oh, you mean a data set like that? Same thing, right. Yes, you're right, but this is just an example. Student: Wouldn't adding more points out there change it even more? — Yeah, I think that would actually make it even worse — you'd get a line that pulls down even further. This is my example, so I get to make it whatever I want, but there's no deep meaning to this particular picture: the point is just that it can be a really bad idea to apply linear regression to classification problems. Sometimes it works fine, but usually I wouldn't do it.

So, what do we want to do for classification? If we know the value of y lies between zero and one, then, to try to fix this problem, let's start by changing the form of the hypothesis so that
my hypothesis always lies in the unit interval between zero and one. If I know y is either zero or one, let's at least not have my hypothesis predict values much larger than one or much smaller than zero. So instead of choosing a linear function for my hypothesis, I'm going to choose something slightly different. In particular, I'm going to choose h_theta(x) = g(theta^T x), where g(z) = 1/(1 + e^{-z}), so that h_theta(x) = 1/(1 + e^{-theta^T x}). g(z) is called the sigmoid function, and it's often also called the logistic function — it goes by either of these names. What g(z) looks like is the following: on the horizontal axis I plot z, and g(z) looks like this — I can't draw it very well — g(z) tends towards zero as z becomes very small, asymptotes towards one as z becomes large, and crosses the vertical axis at 0.5. So that's the sigmoid function, also called the logistic function.

Student: Why choose a sigmoid and not a step function? Isn't a step function better for binary outputs? — Let me come back to that later. So where did I get this function from? I just wrote it down, but it turns out there are two reasons for using it, which we'll come to. One is that when we talk about generalized linear models, we'll see that it falls out naturally as part of a broader class of models; and the other, which we'll talk about next week, is that there are a couple of, I think, very beautiful reasons for choosing the logistic function. We'll see that in a little bit, but for now let me just define it, and take my word for it that this is a reasonable choice.

Notice now that the values output by my hypothesis will always be between zero and one. Furthermore, just as we did for linear regression, I'm going to endow the outputs of my hypothesis with a probabilistic interpretation: I'll assume that the probability that y equals one, given x and parameterized by theta, is P(y = 1 | x; theta) = h_theta(x). In other words, my hypothesis is outputting all these numbers between zero and one, and I'll think of it as estimating the probability that y equals one. And because y has to be either zero or one, the probability that y equals zero is P(y = 0 | x; theta) = 1 - h_theta(x). More simply, it turns out you can take these two equations and write them more compactly: p(y | x; theta) = h_theta(x)^y * (1 - h_theta(x))^{1 - y}. I know this looks somewhat bizarre, but it actually makes the derivation much, much nicer. If y equals one, this is h_theta(x) to the power of one times something to the power of zero, and anything to the power of zero is just one, so it says P(y = 1) = h_theta(x). In the same way, if y equals zero, the first factor is something to the power of zero, which disappears, and we're left with one times (1 - h_theta(x)) to the power of one. So this is a compact way of writing both equations together in one line — a small sketch of the hypothesis and this compact formula appears below.
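Here's a minimal sketch of the hypothesis and the compact probability formula, with made-up numbers for theta and x:

    import numpy as np

    def sigmoid(z):
        """The logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
        return 1.0 / (1.0 + np.exp(-z))

    def h(theta, x):
        """Logistic-regression hypothesis h_theta(x) = g(theta^T x)."""
        return sigmoid(theta @ x)

    def p_y_given_x(y, theta, x):
        """p(y | x; theta) = h^y * (1 - h)^(1 - y), for y in {0, 1}."""
        p1 = h(theta, x)
        return p1 ** y * (1.0 - p1) ** (1 - y)

    theta = np.array([-1.0, 0.5])   # made-up parameters
    x = np.array([1.0, 3.0])        # x0 = 1 intercept convention
    print(h(theta, x))              # the estimated P(y = 1 | x; theta)
    print(p_y_given_x(1, theta, x) + p_y_given_x(0, theta, x))  # sums to 1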
So let's talk about parameter fitting. Given this model of my data, how do I fit the parameters theta? The likelihood of the parameters is, as before, just the probability of the data: L(theta) = product over i of p(y^(i) | x^(i); theta), which, plugging in the formula, is the product over i of h_theta(x^(i))^{y^(i)} * (1 - h_theta(x^(i)))^{1 - y^(i)} — and those should be x^(i)'s and y^(i)'s; I dropped a subscript just to write a little less.

As before, we want the maximum likelihood estimate of the parameters theta — the setting of theta that maximizes the likelihood L(theta). And it turns out that very often, when you work through the derivations, it's much easier to maximize the log of the likelihood rather than the likelihood itself. So the log likelihood l(theta) is just the log of capital L(theta), which is therefore l(theta) = sum over i of [ y^(i) log h_theta(x^(i)) + (1 - y^(i)) log(1 - h_theta(x^(i))) ]. To fit the parameters theta of the model, we find the value of theta that maximizes this log likelihood.

So how do you maximize this function? It turns out we can apply the same gradient descent algorithm we learned before — that was the first algorithm we used to minimize the quadratic cost function when we talked about least squares — and we can use exactly the same idea to maximize the log likelihood. Remember that algorithm: you repeatedly take the value of theta and replace it with the previous value plus a learning rate alpha times the gradient of the objective — here, the log likelihood — with respect to theta. One small change: previously we were trying to minimize the quadratic error term, and today we're trying to maximize rather than minimize, so rather than a minus sign we have a plus sign. This is just gradient descent for maximization rather than minimization, so we call it gradient ascent; it's really the same algorithm.

To derive gradient ascent, what you need to do is compute the partial derivatives of the objective with respect to each parameter theta_j. If you take this log likelihood l(theta) and take its partial derivative with respect to theta_j — the derivation isn't terribly complicated, but in the interest of saving you watching me fill a couple of blackboards with math, I'll just write down the final answer — you find that d/d theta_j of l(theta) = sum over i of (y^(i) - h_theta(x^(i))) x_j^(i). The way you get this is to plug in the definition of h_theta as a function of x^(i), take the derivatives, and work through the algebra; it simplifies down to this formula. And so gradient ascent is the following rule: theta_j := theta_j + alpha * sum over i of (y^(i) - h_theta(x^(i))) x_j^(i). Does this look familiar to anyone? Do you remember seeing this formula in the last lecture? When I worked out batch gradient descent for least-squares regression, I
actually wrote down exactly the same thing — or maybe there was a minus sign and the terms were flipped around — but I had essentially exactly the same learning rule last time for least-squares regression. So is this the same learning algorithm, then? What's different? How come I was making all that noise earlier about least-squares regression being a bad idea for classification problems, and then I did a bunch of math, skipped some steps, and now I'm sort of claiming you end up with the same learning algorithm?

Student: The hypothesis is different — it's the logistic function. — Exactly right. This is not the same algorithm, and the reason is that in logistic regression the definition of h_theta(x^(i)) is different from before: it's no longer theta transpose x — not the linear function anymore — but the logistic function of theta transpose x^(i). So even though this looks cosmetically similar on the surface to the batch gradient descent rule I derived last time for least-squares regression, it's actually a totally different learning algorithm. And it turns out to be no coincidence that you end up with the same form of learning rule — we'll talk more about this when we get to generalized linear models. It's one of the most elegant results about generalized linear models: even though you're using a different model, you end up with what looks like the same learning algorithm, and that's no coincidence.

One last comment, as part of the learning process: over here I said I take the derivatives and end up with this line. I didn't want to make you sit through a long algebraic derivation, but later today or later this week, please do go home and look at the lecture notes, where I wrote out the derivation in full, and make sure you can follow every single step of how we take partial derivatives of this log likelihood to get the formula over here. By the way, for those of you interested in seriously mastering this machine learning material: when you go home and look at the lecture notes, it will be very easy for most of you to read through every line and go, "yep, that makes sense, that makes sense — okay, cool, I see how you get this line." If you want to make sure you really understand the material, my concrete suggestion is to go home, read the lecture notes, check every line — and then cover up the derivation and see if you can derive it yourself. In general, that's good advice for studying technical material like machine learning: if you've worked through a proof and think you understood every line, the way to make sure is to cover it up and see if you can rederive the entire thing yourself. I did this a lot when I was studying various pieces of machine learning theory and various proofs, and it's a great way to study. And here, as a quick aside, is what that gradient ascent rule looks like in code.
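A minimal sketch of batch gradient ascent on the logistic-regression log likelihood — the toy data, learning rate, and iteration count are all made up for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_gradient_ascent(X, y, alpha=0.1, n_iters=5000):
        """Batch gradient ascent: theta_j += alpha * sum_i (y_i - h_i) x_ij."""
        theta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            h = sigmoid(X @ theta)          # logistic hypothesis, not theta^T x
            theta += alpha * X.T @ (y - h)  # gradient of the log likelihood
        return theta

    # Tiny 1-D toy problem: y flips from 0 to 1 as x grows.
    xs = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
    y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
    X = np.column_stack([np.ones_like(xs), xs])

    theta = fit_logistic_gradient_ascent(X, y)
    # Predictions go to ~0 on the left points and ~1 on the right ones.
    print(np.round(sigmoid(X @ theta), 2))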
Okay — I probably won't get to Newton's method today; I just want to take one quick digression to talk about one more algorithm, which an earlier question was sort of alluding to: the perceptron algorithm. I'm not going to say a whole lot about it, but it's something we'll come back to later this quarter when we talk about learning theory. In logistic regression we said that g(z) — and hence my hypothesis — outputs values that are real numbers between zero and one. The question is: what if you want to force g(z) to output either zero or one? The perceptron algorithm defines g(z) to be the step function: g(z) = 1 if z >= 0, and 0 otherwise. So the picture is that, rather than the smooth sigmoid, g(z) now looks like the step function you were asking about earlier. Same as before, we use h_theta(x) = g(theta^T x) — everything is exactly the same as before, except that g(z) is now the step function.

It turns out there's a learning rule called the perceptron learning rule whose form is the same as that of stochastic gradient ascent for logistic regression: theta_j := theta_j + alpha (y^(i) - h_theta(x^(i))) x_j^(i). So it looks just like the gradient ascent rule — really, the stochastic gradient ascent rule — for logistic regression, but this is a very different flavor of algorithm from least-squares regression and logistic regression. In particular, because it outputs only values that are either zero or one, it turns out to be very difficult to endow this algorithm with probabilistic semantics. Even though this learning rule looks cosmetically very similar to the one for logistic regression, it's actually a very different type of learning rule from the others we've seen in this class. And because this is such a simple learning algorithm — just compute theta transpose x, threshold, and you end up with zero or one — it's a relatively simpler algorithm than logistic regression, I think, and when we talk about learning theory later in this class, its simplicity will let us come back and use it as a building block. That's all I want to say about this algorithm for now; here's a small sketch of the perceptron rule in code.
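A minimal sketch of the perceptron update under these definitions — the data and learning rate are made up, and the loop just sweeps the training set a fixed number of times:

    import numpy as np

    def step(z):
        """Perceptron threshold g(z): 1 if z >= 0, else 0."""
        return np.where(z >= 0.0, 1.0, 0.0)

    def fit_perceptron(X, y, alpha=0.1, n_epochs=25):
        """theta_j := theta_j + alpha (y_i - h_i) x_ij, one example at a time."""
        theta = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for x_i, y_i in zip(X, y):
                h_i = step(x_i @ theta)
                theta = theta + alpha * (y_i - h_i) * x_i
        return theta

    # Same linearly separable toy set as before.
    xs = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
    y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
    X = np.column_stack([np.ones_like(xs), xs])

    theta = fit_perceptron(X, y)
    print(step(X @ theta))  # separates this toy set perfectly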
Okay, so welcome back. What I want to do today is talk about Newton's method, which is an algorithm for fitting models like logistic regression, and then we'll talk about exponential family distributions and generalized linear models — a very nice class of ideas that will tie together the logistic regression and ordinary least-squares models we've seen. Hopefully I'll get through that today. Throughout the previous lecture and this one, we're starting to use increasingly large amounts of material on probability, so if you'd like a refresher on the foundations of probability — if you're not sure you quite have the prerequisites for this class in terms of a background in probability and statistics — the discussion section taught this week by the TAs will go over a review of probability. At the same discussion section, the TAs will also briefly go over MATLAB and Octave notation, which you'll need for your problem sets. So if either you'd like a review of the probability and statistics prerequisites, or you'd like a short tutorial on MATLAB and Octave, please come to the next discussion section.

All right, so just to recap briefly: towards the end of the last lecture I talked about the logistic regression model, an algorithm for classification, where we had P(y = 1 | x; theta) = 1/(1 + e^{-theta^T x}). You can write down the log likelihood given a training set, and by taking derivatives of it you can derive a gradient ascent rule for finding the maximum likelihood estimate of the parameters theta of this logistic regression model. Last time I wrote down the learning rule for batch gradient ascent; the version for stochastic gradient ascent, where you look at just one training example at a time, is theta_j := theta_j + alpha (y^(i) - h_theta(x^(i))) x_j^(i). So last time we wrote down batch gradient ascent; this is stochastic gradient ascent. If you want to fit a logistic regression model — meaning find the value of theta that maximizes this log likelihood — gradient ascent, stochastic gradient ascent, or batch gradient ascent are all perfectly fine algorithms to use. But what I want to do is talk about a different algorithm for fitting models like logistic regression, one that will often run much faster than gradient ascent. This algorithm is called Newton's method.

To describe Newton's method, I'm actually going to ask you to consider a different problem first: suppose you have a function f(theta), and you want to find the value of theta at which f(theta) is equal to zero. Let's start with this problem, and then we'll slowly change it until it becomes an algorithm for fitting maximum likelihood models like logistic regression. So let's say that's my function f: the horizontal axis is theta, the vertical axis is f(theta), and we're trying to find the value of theta at which f(theta) equals zero. Here's the algorithm. I initialize theta at some value, which I'll call theta^(0), and then here's what Newton's method does: we evaluate the function f at that value of theta, compute the derivative of f there, and use the linear approximation to f at that value of theta. In particular — let me doctor the function a bit to make this work out nicely — I take the tangent to my function at the point theta^(0) and extend this tangent line down until it intercepts the horizontal axis; I see what value that is and call it theta^(1). That's one iteration of Newton's method. Then I do the same thing: look at the new point, take the tangent down there, and that's two iterations of the algorithm; keep going, and that's step three, and so on.

So let's write down what this algorithm actually does. To go from theta^(0) to theta^(1), let me just call that horizontal length capital Delta.
If you remember the definition of a derivative: the derivative of f evaluated at theta^(0) — the slope of that tangent line — is equal to this vertical length divided by this horizontal length. The slope of the function is the ratio between the vertical height and the width of the triangle, so f'(theta^(0)) = f(theta^(0)) / Delta, which implies Delta = f(theta^(0)) / f'(theta^(0)). And so theta^(1) is theta^(0) minus capital Delta, which is theta^(1) = theta^(0) - f(theta^(0)) / f'(theta^(0)). More generally, one iteration of Newton's method proceeds as theta^(t+1) = theta^(t) - f(theta^(t)) / f'(theta^(t)). That's one iteration of Newton's method; make sense?

Now, this is an algorithm for finding a value of theta at which f(theta) equals zero, and we can apply the same idea to maximizing the log likelihood. We have a function l(theta) and we want to maximize it. How do you maximize a function? You set its derivative to zero: we want theta such that l'(theta) = 0. So to maximize this function we want to find the place where its derivative is zero, and we just apply the same idea, giving theta^(t+1) = theta^(t) - l'(theta^(t)) / l''(theta^(t)) — because to maximize l, we let f be equal to l', the first derivative of l, and find the value of theta at which the derivative of l is zero, which must therefore be a local optimum. Does this make sense? Here's a quick one-dimensional sketch of the update in code, and then I'll take questions.
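A minimal one-dimensional sketch of the update theta := theta - f(theta)/f'(theta), applied first to a made-up root-finding problem and then, via f = l', to maximizing a made-up concave l:

    def newton_root(f, f_prime, theta=0.0, n_iters=10):
        """Newton's method for solving f(theta) = 0."""
        for _ in range(n_iters):
            theta = theta - f(theta) / f_prime(theta)
        return theta

    # Example: the root of f(theta) = theta^2 - 2, i.e. sqrt(2).
    print(newton_root(lambda t: t * t - 2.0, lambda t: 2.0 * t, theta=1.0))

    # To maximize l(theta), apply the same update with f = l':
    #   theta := theta - l'(theta) / l''(theta).
    # Made-up concave example: l(theta) = -(theta - 3)^2,
    # so l'(theta) = -2 (theta - 3) and l''(theta) = -2.
    print(newton_root(lambda t: -2.0 * (t - 3.0), lambda t: -2.0))  # -> 3.0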
Student: What are the conditions on f for the method to work? — The answer to that is fairly complicated: there are conditions on f that guarantee this will work, but they're fairly complicated — more complex than I want to go into now. In practice this works very well for logistic regression and for the sort of generalized models I'll talk about later.

Student: How do you select theta^(0)? — It usually doesn't matter. When I implement this, I usually just initialize the parameters to the vector of all zeros, and usually that works fine; it's not a huge deal how you initialize.

Last question. Student: Without getting into exactly when it converges — is it more or less likely to converge than gradient descent, or just different? — Let me say some things about that. All of these algorithms tend not to have convergence problems — they all generally converge, unless you choose too large a learning rate for gradient ascent or something — but their speeds of convergence are very different. It turns out that Newton's method enjoys extremely fast convergence; the technical term is that it enjoys a property called quadratic convergence. Don't worry too much about exactly what that means, but stated informally, it means that asymptotically, every iteration of Newton's method doubles the number of significant digits to which your solution is accurate. So, up to constant factors: suppose that on a certain iteration your solution is within 0.01 of the optimum — you have 0.01 error. Then after one iteration your error will be on the order of 0.0001, and after another iteration it will be on the order of 0.00000001. This is called quadratic convergence because you essentially get to square the error on every iteration of Newton's method. The caveat is that this is an asymptotic result that holds only once you're pretty close to the optimum, so it's a theoretical result that is true but, because of constant factors and so on, may paint a slightly rosier picture than is accurate. In practice, though, when I implement Newton's method for logistic regression, it usually converges in something like a dozen iterations for most reasonably sized problems with tens or hundreds of features.

One more thing: what I wrote down over there was Newton's method for the case where theta is a single real number. The generalization to the case where theta is a vector rather than a real number is the following: theta^(t+1) = theta^(t) + H^{-1} grad_theta l(theta), where grad_theta l is the usual gradient of your objective, and H is a matrix called the Hessian — the matrix of second derivatives, with entries H_jk = d^2 l / (d theta_j d theta_k). So instead of the first derivative divided by the second derivative, you now have the vector of first derivatives multiplied by the inverse of the matrix of second derivatives; it's the same thing, written out for multiple dimensions.

For logistic regression, again, for a reasonable number of features and training examples, when I run this algorithm you usually see convergence in anywhere from three iterations to a dozen or so. Compared to gradient ascent — say, batch gradient ascent — it usually needs far fewer iterations to converge. The disadvantage of Newton's method is that on every iteration you need to invert the Hessian, and the Hessian is an n-by-n matrix — or (n+1)-by-(n+1) if n is the number of features — so if you have a large number of features in your learning problem, say tens of thousands of features, inverting H could be a computationally expensive step. But for smaller, more reasonable numbers of features, this is usually a very fast algorithm.

Student: For the matrix case, shouldn't there be a minus sign? — Oh yes, you're right — let's see — I think you're right, that should be a minus: theta^(t+1) = theta^(t) - H^{-1} grad_theta l(theta). Thanks. And actually, I invite you to think about the following problem: I wrote down this algorithm for finding the maximum likelihood estimate of the parameters for logistic regression — that is, for maximizing a function. So I'll leave you to think about this yourself: if you want to use Newton's method to minimize a function, how does the algorithm change? Actually, the answer is that it doesn't change — I'll leave you to work out
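Here's a minimal sketch of the vector update theta := theta - H^{-1} grad l(theta) applied to logistic regression. For this model the gradient is X^T (y - h) and the Hessian is H = -X^T S X with S = diag(h_i (1 - h_i)); the toy labels here are deliberately made non-separable so the maximum likelihood estimate exists:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_newton(X, y, n_iters=10):
        """Newton's method for maximizing the logistic log likelihood."""
        theta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            h = sigmoid(X @ theta)
            grad = X.T @ (y - h)                      # gradient of l(theta)
            S = np.diag(h * (1.0 - h))
            H = -X.T @ S @ X                          # Hessian of l(theta)
            theta = theta - np.linalg.solve(H, grad)  # theta := theta - H^{-1} grad
        return theta

    xs = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
    y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])  # overlapping classes
    X = np.column_stack([np.ones_like(xs), xs])

    # Typically converges in far fewer iterations than gradient ascent.
    theta = fit_logistic_newton(X, y)
    print(np.round(sigmoid(X @ theta), 2))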
for yourself why. All right, let's talk about generalized linear models. Let me first give a recap of the two algorithms we've talked about so far — two different algorithms for modeling p(y | x; theta). In one of them, y was a real number, and we saw that if we assume y has a Gaussian distribution, then we get ordinary least squares — linear regression. In the other case, we had a classification problem where y took on a value of either zero or one, and — what's the most natural distribution over zeros and ones? the Bernoulli distribution, which models random variables with two values — in that case we got logistic regression. Along the way, some of the questions that came up were: for logistic regression, where on earth did I get the sigmoid function from? There are other choices you could have used — where did this function come from, and what other functions could we have plugged in? The sigmoid function turns out to be a particularly natural default choice, the one that led us to logistic regression. What I want to do now is take both of these algorithms and show that they are special cases of a broader class of algorithms called generalized linear models, and it's as part of this broader class that things like the sigmoid function fall out very naturally. Let me look for a longer piece of chalk.

I'll just warn you that the ideas in generalized linear models are somewhat complex, so what I'm going to do today is point out the key ideas and give you the gist of the entire story; some of the details, the math, and the derivations I'll leave you to work through yourselves in the lecture notes, which are posted online.

Okay, so let's start by considering these two distributions, the Bernoulli and the Gaussian. Suppose we have data that is 0-1 valued and we want to model it with a Bernoulli random variable parameterized by phi. The Bernoulli distribution has P(y = 1; phi) = phi — the parameter phi specifies the probability of y being one. As you vary the parameter phi, you get different Bernoulli distributions: different probability distributions on y with different probabilities of being equal to one. So I want you to think of this not as one fixed distribution but as a set, or a class, of distributions that you get as you vary phi. In the same way, if you consider the Gaussian distribution, as you vary mu you get different Gaussian distributions — so again, think of it as a class or set of distributions.

What I want to do now is show that both of these are special cases of a class of distributions called the exponential family. In particular, we'll say that a class of distributions — like the class of Bernoulli distributions you get as you vary phi — is in the exponential family if it can be written in the following form: p(y; eta) = b(y) exp(eta^T T(y) - a(eta)). Let me give some of these terms names, and then I'll say a bit more about what this means. Eta is called the natural parameter of the distribution, and T(y) is called the
Usually, for many of the examples we'll see — including the Bernoulli and the Gaussian — T(y) is just equal to y, so for most of this lecture you can mentally replace T(y) with y. This won't be true for the very final example we do today, but for now think of T(y) as equal to y. And so, for a given choice of these functions — say you fix the forms of the functions a, b and T — this formula defines a set of distributions, a class of distributions, that is now parameterized by η. So again: write down specific formulas for a, b and T; then, as I vary η, I get different distributions. I'm going to show that the Bernoulli and the Gaussian are special cases of exponential family distributions, and by that I mean that I can choose specific functions a, b and T so that this becomes the formula for the distribution of either a Bernoulli or a Gaussian; then, as I vary η, I'll get Bernoulli distributions with different means, or Gaussian distributions with different means, for my fixed choice of a, b and T. Oh — and for those of you who know what a sufficient statistic in statistics is, T(y) actually is a sufficient statistic in the formal sense you may have seen in a statistics class; if you don't know what a sufficient statistic is, don't worry about it — we sort of don't need that property today. One last comment: often T(y) is equal to y, and in many of these cases η is also just a real number, so in many cases the parameter of the distribution is just a real number and ηᵀT(y) is just a product of real numbers. Again, that'll be true for our first two examples, but not for the last example I'll do today. So now let's show that the Bernoulli and the Gaussian are examples of exponential family distributions. We'll start with the Bernoulli. The Bernoulli distribution with parameter φ — I guess I wrote this down already — has p(y = 1; φ) = φ, so the parameter φ specifies the probability that y equals one, and my goal now is to choose a, b and T so that my formula for the exponential family becomes identical to my formula for the distribution of a Bernoulli. You already saw a similar exponential notation when we talked about logistic regression: the probability of y being one is φ, the probability of y being zero is 1 − φ, and we can write this compactly as p(y; φ) = φ^y (1 − φ)^(1−y). Now I'm going to take the exponential of the logarithm of this — exponentiation and taking logarithms cancel each other out, so this doesn't change anything — and it's equal to exp(y log(φ/(1−φ)) + log(1−φ)). So I'm going to define log(φ/(1−φ)) to be η, this y to be T(y), and −log(1−φ) to be a(η); b(y) is just one, so b(y) doesn't matter. Okay, take a second to look through this and make sure it makes sense; I'll clean another board while you do that. Okay, so let me write down a few more things, just copying from the previous board: we had that η = log(φ/(1 − φ)).
It turns out — I won't do the algebra; just take this formula, invert it, and solve for φ as a function of η — that you find φ = 1/(1 + e^(−η)). And so, somehow, the logistic function magically falls out of this — and we'll take this even further later. Again copying definitions from the previous board: a(η), I said, is −log(1 − φ). Now φ and η are functions of each other — η depends on φ and φ depends on η — so if I plug this definition of φ into that, I find that a(η) = log(1 + e^η). Again, this is just algebra, not terribly interesting. And just to complete the rest of it: T(y) = y, and b(y) = 1. So, to recap what we've done: we've come up with a certain choice of functions a, T and b so that my formula for the exponential family distribution becomes exactly the formula for the probability mass function of the Bernoulli distribution, and the natural parameter η has a certain relationship with the original parameter φ of the Bernoulli. [Student question about the second-to-last line.] Let's see, from which line to which line — oh, this line to this line. Well, expand this term out: (1 − y) log(1 − φ) gives 1 · log(1 − φ), which becomes this, and the other term is −y log(1 − φ); and then minus a log is the log of one over the argument, so −y log(1 − φ) becomes y log(1/(1 − φ)). Does that make sense? Cool. Anything else? [Student: η is a scalar, isn't it? Up there it's written η transpose, so it can be a vector?] Yes — in this and the next example, η will turn out to be a scalar, and if η is a scalar and T(y) is a scalar, then ηᵀT(y) is just a real number times a real number — like a one-dimensional vector transposed times a one-dimensional vector. Towards the end of today's lecture we'll go over one example where both of these are vectors, but for many distributions they turn out to be scalars. [Student: I'm not sure how the Bernoulli written this way is a distribution — it doesn't assign zero probability to values other than zero and one.] I see — so, for this, let's imagine that we're restricting the domain of the input of the function to be y ∈ {0, 1}; think of that as an implicit constraint. I could write it down and make it more explicit, but this is a probability mass function for y = 0 or y = 1, with that constraint left implicit. Okay, cool. So that takes the Bernoulli distribution and writes it in the form of an exponential family distribution.
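To make the relationship between φ and η concrete, here is a minimal numerical check — my own illustration, not from the lecture — that inverting η = log(φ/(1−φ)) gives the sigmoid, and that the exponential-family form reproduces the Bernoulli pmf:

```python
import numpy as np

phi = 0.3                                  # Bernoulli parameter
eta = np.log(phi / (1 - phi))              # natural parameter: eta = log(phi/(1-phi))
phi_back = 1 / (1 + np.exp(-eta))          # inverting gives the sigmoid
assert np.isclose(phi, phi_back)

# the exponential-family form b(y) exp(eta*T(y) - a(eta)) matches phi^y (1-phi)^(1-y)
a = np.log(1 + np.exp(eta))                # a(eta) = log(1 + e^eta)
for y in (0, 1):
    pmf = phi**y * (1 - phi)**(1 - y)
    expfam = np.exp(eta * y - a)           # b(y) = 1, T(y) = y
    assert np.isclose(pmf, expfam)
```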
Let me also do that very quickly for the Gaussian. I won't do the algebra for the Gaussian — I'll basically just write out the answer. So start with the normal distribution with mean μ and variance σ². You remember — was it two lectures ago? No, just the previous lecture — when we were deriving the maximum likelihood estimate for the parameters of ordinary least squares, we showed that the parameter σ² didn't matter: when we derived the probabilistic model for least squares regression, no matter what σ² was, we ended up with the same values of the parameters. So, for the purposes of writing less in today's lecture and not keeping track of σ², I'm just going to set σ² equal to one. The lecture notes talk a little bit more about the general case, but just to make what I'm doing in class a bit easier and simpler today, let's set σ² = 1 — σ² is essentially just a scaling factor on the variable y. In that case the Gaussian density is p(y; μ) = (1/√(2π)) exp(−½(y − μ)²), and by a couple of steps of algebra — which I'm not going to do, but which are written out in full in the lecture notes you can download — this is (1/√(2π)) e^(−y²/2) · exp(μy − ½μ²). So that first factor is b(y), and we have η = μ, T(y) = y, and a(η) = ½μ² — [after sorting out a sign on the board] — and because μ is equal to η, this is just a(η) = ½η². And so this is a specific choice of a, b and T that expresses the Gaussian density in the form of an exponential family distribution, and in this case the relationship between μ and η is that μ is just equal to η: the mean of the Gaussian is just equal to the natural parameter of the exponential family distribution.
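Again purely as an added illustration (not from the lecture), a quick numerical check that the σ² = 1 Gaussian factors into exponential-family form with b(y) = (1/√(2π)) e^(−y²/2), η = μ, T(y) = y, a(η) = η²/2:

```python
import numpy as np

mu, y = 0.7, 1.3                               # arbitrary test values
density = np.exp(-0.5 * (y - mu)**2) / np.sqrt(2 * np.pi)

b = np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)   # b(y)
eta = mu                                       # natural parameter equals the mean
a = 0.5 * eta**2                               # a(eta) = eta^2 / 2
assert np.isclose(density, b * np.exp(eta * y - a))
```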
And it actually turns out — if you've taken an undergrad statistics class — that most of the "textbook" distributions, not all of them but most of them, can be written in the form of an exponential family distribution. You've seen the Gaussian, the normal distribution; it turns out the multivariate normal distribution — the generalization of Gaussian random variables to high-dimensional vectors — is also in the exponential family. You saw that the Bernoulli is in the exponential family; it turns out the multinomial distribution is too: the Bernoulli models outcomes over zero and one, coin tosses with two outcomes, while the multinomial models outcomes over k possible values, and that's also an exponential family distribution. You may have heard of the Poisson distribution — the Poisson is often used for modeling counts, things like the number of radioactive decays in a sample, or the number of customers on your website, or the number of visitors arriving in a store; the Poisson distribution is also in the exponential family. The gamma and the exponential distributions, if you've heard of them, are distributions over positive numbers, so they're often used to model intervals — if you're standing at the bus stop and want to know how long you'll have to wait for the next bus to arrive, you often model that with a gamma distribution or the exponential distribution; those are also in the exponential family. Even more esoteric distributions, like the beta and the Dirichlet, which are probability distributions over fractions and over probability distributions, and also things like the Wishart distribution, which is a distribution over covariance matrices — all of these, it turns out, can be written in the form of exponential family distributions. And in the problem set, we'll ask you to take one of these distributions, write it in the form of the exponential family, and derive a generalized linear model for it. Okay, which brings me to the next topic: having chosen an exponential family distribution, how do you use it to derive a generalized linear model? Generalized linear models are often abbreviated GLMs. I'm going to write down three assumptions — you can think of them as assumptions, or as design choices, that will then allow me to sort of turn a crank and come up with a generalized linear model. The first one: given my inputs x and my parameters θ, I'm going to assume that the variable y I'm trying to predict — the output y, or the response variable y — is distributed ExponentialFamily(η) for some natural parameter η. This means there is some specific choice of the functions a, b and T so that the conditional distribution of y given x, parameterized by θ, is exponential family with parameter η, where η may depend on x in some way. For example, if you want to predict how many customers will arrive at your website, you may choose to model the number of hits via a Poisson distribution, since the Poisson is natural for modeling count data; so you'd choose the exponential family distribution here to be the Poisson. The second assumption: given x, our goal is to output the expected value of y given x. So, given the features in the website example — whether there were any promotions, whether there were sales, how many people linked to your website, or whatever — I'm going to assume that our goal in the machine learning problem is to estimate the expected number of people arriving at your website on a given day. In other words, I want h(x) — excuse me, I meant to write T(y) here — I want my learning algorithm's hypothesis to output h(x) = E[T(y) | x]. Again, for most of our examples T(y) is just equal to y, so for most examples the goal is for the learning algorithm to output the expected value of y given x. [Student: is that the same T(y), the sufficient statistic?] Yes, same T(y). And lastly — the first two I wrote down as assumptions; this last one you may want to think of as a design choice. Up here I assumed the distribution of y given x is exponential family with some parameter η — the number of visitors to my website on any given day is Poisson with some parameter η — and the last decision I need to make is: what is the relationship between my input features and this parameter η parameterizing my Poisson distribution or whatever? For this last step, I'm going to assume that the relationship between η and my features x is linear; in particular, they are
governed by η = θᵀx. The reason I make this design choice is that it will allow me to turn the crank of the generalized linear model machinery and come up with some very nice algorithms for fitting, say, Poisson regression models, or regression with Gaussian-distributed outputs, or exponential-distribution outputs, and so on. So let's work through an example. Oh, excuse me — η = θᵀx works for the case where η is a real number; in the more general case, where η is a vector rather than a real number, we'd have η_i = θ_iᵀx, but in most of our examples η will just be a real number. All right, let's work through the Bernoulli example. We assume y given x, parameterized by θ, is distributed exponential family with natural parameter η, and for the Bernoulli distribution I choose a, b and T to be the specific forms that cause this exponential family to become the Bernoulli — the first example we worked through just now. For any fixed value of x and θ, my hypothesis — my learning algorithm — will output h(x), which by assumption number two is the expected value of y given x, parameterized by θ. When y can take on only the values zero and one, the expected value of y is just equal to the probability that y equals one — the expected value of a Bernoulli random variable is just the probability that it equals one. And the probability that y = 1 is just φ, because φ is, by definition, the parameter of my Bernoulli distribution — the probability of taking the value one. We worked out previously that φ = 1/(1 + e^(−η)): when we wrote the Bernoulli distribution in the form of an exponential family, we worked out this relationship between φ and η — that is, the relationship between the expected value of y and η. And lastly, because we made the design choice that η and x are linearly related, η = θᵀx, this is therefore equal to h(x) = 1/(1 + e^(−θᵀx)). And so that's how I come up with the logistic regression algorithm, when you have a target variable y — also called a response variable — that takes on two values and you choose to model it with a Bernoulli distribution. Okay, do the steps make sense? Raise your hand if this makes sense. Okay, cool. So I hope you see the ease of use of this — the power of this. The only decision I really made was: say I have a new machine learning problem and I'm trying to predict the value of a variable y that happens to take on two values; the only decision I need to make is to choose the Bernoulli distribution. I say I'm going to assume that, given x and θ, y is distributed Bernoulli — that's the only decision I made — and then everything else follows automatically from having made the decision to model y given x, parameterized by θ, as Bernoulli.
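Here is the derivation just completed, written as a two-line numpy sketch — an added illustration, not lecture code; the function name is my own:

```python
import numpy as np

def glm_bernoulli_hypothesis(theta, x):
    """GLM with a Bernoulli response: h(x) = E[y|x] = phi = 1/(1 + e^{-eta}),
    with the design choice eta = theta^T x -- i.e., logistic regression."""
    eta = theta @ x                    # assumption 3: eta is linear in x
    return 1.0 / (1.0 + np.exp(-eta))  # canonical response for the Bernoulli
```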
In the same way, you can choose a different distribution — choose y is Poisson, y is gamma, y is whatever — follow a similar process, and come up with a different model, a different learning algorithm, a different generalized linear model, for whatever learning problem you're faced with. A tiny bit more notation: the function g that relates the natural parameter to the expected value of y — g(η) = E[T(y); η], which in this case is 1/(1 + e^(−η)) — is called the canonical response function, and g⁻¹ is called the canonical link function. These aren't a huge deal, and I won't use this terminology a lot; I'm just mentioning it in case you hear people talk about generalized linear models and canonical response functions or canonical link functions, so you know what they are. Actually, many texts use the reverse convention, where this is g⁻¹ and that is g, but this notation turns out to be more consistent with other algorithms in machine learning, so I'm going to use it — though I probably won't use the terms canonical response function and canonical link function much in lecture. I'm not big on memorizing lots of names of things; I'm just tossing those out there in case you see them elsewhere. Okay — you know what, in the interest of time, I'm going to skip over the Gaussian example. But just as I chose the Bernoulli, went through the derivation, and got out logistic regression, you can do the same thing with the Gaussian distribution and end up with the ordinary least squares model. The problem with the Gaussian is that it's almost so simple that, when you see it for the first time, it's sometimes more confusing than the Bernoulli example, because it looks like it has to be more complicated. So let me skip that, and leave you to read about the Gaussian example in the lecture notes; what I want to do instead is go through a more complex example. [Student question: how do you choose what θ will be?] Let me get to that at the end. What you have there is the logistic regression model, which is a probabilistic model that assumes the probability of y given x is given by a certain form, and so what you do is write down the log likelihood of your training set and find the value of θ that maximizes the log likelihood of the parameters. Does that make sense? I'll say that again towards the end of today's lecture, but for logistic regression the way you choose θ is exactly maximum likelihood, as we worked out in the previous lecture, using Newton's method or gradient ascent or whatever; I'll try to do that again for one more example towards the end of today's lecture. So what I want to do is use the remaining nineteen minutes or so of this class to go through probably the most complex example of a generalized linear model that I've used — I want to go through this one because it's a little bit trickier than many of the other textbook examples of generalized linear models. Again, I'm going to go through the derivation reasonably quickly and give you the gist of it, and any steps I skip or details I omit, I'll leave you to read about more carefully in the lecture notes. Okay, what I want to talk about is the multinomial.
The multinomial is a distribution over, say, k possible outcomes. Imagine you have a machine learning problem where the value y you're trying to predict can take on k possible outcomes rather than only two. Examples: you want a learning algorithm to automatically sort emails into the right email folders for you, and you have a dozen email folders you want your algorithm to classify emails into; or, instead of predicting whether a patient has a disease or does not have a disease — a binary classification problem — you think the patient may have one of k diseases, and you want the learning algorithm to figure out which of the k diseases your patient has; and so on. There are lots of multiclass classification problems where you have more than two classes, and you can model them with the multinomial. For logistic regression I had pictures like these, where you have a training set and you find a decision boundary that separates the two classes; for the model I'm going to talk about, we're going to entertain the label taking on multiple values, so now you have three classes, and the learning algorithm will learn some way to separate out three classes — or more — rather than just two. Okay, so let's write the multinomial in the form of an exponential family distribution. The parameters of a multinomial are φ₁, φ₂, up to φ_k — I'll actually change this in a second — where the probability p(y = i) = φ_i, since there are now k possible outcomes. But if I choose this as my parameterization of the multinomial, then my parameters are actually redundant: these are probabilities, so they have to sum to one, and therefore I can, for example, derive the last parameter as φ_k = 1 − (φ₁ + ··· + φ_{k−1}). This would be a redundant parameterization of the multinomial — also called overparameterized. So, for the purposes of this derivation, I'm going to treat the parameters of my multinomial as φ₁, φ₂, up to φ_{k−1}, and I won't think of φ_k as a parameter; I just have k − 1 parameters parameterizing my multinomial. I'll sometimes still write φ_k in my derivations, and you should think of φ_k as just shorthand for one minus the rest of the parameters. Now, the multinomial is one of the few examples we'll see where T(y) is not equal to y. In this case, y is one of k possible values, and T(y) will be defined as follows: T(1) is a vector with a one in the first position and zeros everywhere else; T(2) is (0, 1, 0, …); and so on — except that these are going to be (k − 1)-dimensional vectors, so T(k − 1) is (0, 0, …, 0, 1) and T(k) is the vector of all zeros. This is just how I'm choosing to define T(y) in order to write the multinomial in the form of an exponential family distribution; again, these are (k − 1)-dimensional vectors. This is also a good point to introduce one more useful piece of notation, called indicator function notation: I write 1{·} with a statement inside the curly braces; if the statement is true, the indicator of that statement is one, and if the statement is false, the value of the indicator function is zero. For example, 1{2 = 3} is clearly false, so it's equal to zero, whereas 1{1 + 1 = 2} is a true statement, so the indicator is equal to one.
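As an added illustration of the T(y) encoding just defined (not lecture code; the function name and 1-indexing convention are my own):

```python
import numpy as np

def T(y, k):
    """T(y) for the multinomial: a (k-1)-dimensional indicator vector.
    T(i) has a 1 in position i (1-indexed) for i < k; T(k) is all zeros."""
    t = np.zeros(k - 1)
    if y < k:
        t[y - 1] = 1
    return t

# e.g. with k = 4: T(1,4) = [1,0,0], T(2,4) = [0,1,0], T(4,4) = [0,0,0]
```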
So the indicator function is just very useful notation for indicating the truth or falsehood of the statement inside. Now let me combine both of these. T(y) is a vector: y is one of k values, and T(y) is one of these k vectors. If I use (T(y))_i to denote the i-th element of the vector T(y), then (T(y))_i is just equal to 1{y = i} — the indicator for whether y is equal to i. Take a look at this for a second — let me clean a couple more boards while you do — and make sure you understand the notation and why this is true. All right, raise your hand if this equation makes sense to you. Most of you — not all. Okay, let me give one concrete example. Say I want to look at the i-th element of the vector T(y), and I want to know whether it's one or zero. Well, the i-th element of T(y) will be equal to one if and only if y is equal to i: if y = 1, then only the first element of the vector is one and the rest are zero; if y = 2, then only the second element is one; and so on. So the question of whether the i-th element of T(y) equals one is answered by asking whether y equals i. If you're still not quite sure why that's true, go home and think about it a bit more, and take a look at the lecture notes as well — maybe that'll help — but for now, just take my word for it. So let's go ahead and write out the distribution of the multinomial in exponential family form: p(y) = φ₁^{1{y=1}} · φ₂^{1{y=2}} ··· φ_k^{1{y=k}} — and again, φ_k is not a parameter of the distribution; it's shorthand for 1 − φ₁ − φ₂ − ··· − φ_{k−1}. Using the equation on the left as well, I can also write this as p(y) = φ₁^{(T(y))₁} · φ₂^{(T(y))₂} ··· φ_{k−1}^{(T(y))_{k−1}} · φ_k^{1 − Σ_j (T(y))_j} — that exponent should be a sum over j. And it turns out — this takes several steps of algebra that I don't have time to show — that you can simplify this into the exponential family form, where η is a (k − 1)-dimensional vector with components η_i = log(φ_i/φ_k), together with a(η) = −log(φ_k) and b(y) = 1. Deriving this is a few steps of algebra that you can work out yourself, but I won't do it here. So, using my definition of T(y), and by choosing a, b and T this way, I can take my multinomial distribution and write it in the form of an exponential family distribution. It turns out also that — let's see — when we worked out the Bernoulli, one of the things we did was write η as a function of φ and then invert that relationship to write φ as a function of η. You can do that here as well: this defines η as a function of the multinomial's parameters φ, so take this relationship between η and φ, invert it, and write φ as a function of η; it turns out you get that φ_i is
equal to φ_i = e^(η_i) / (1 + Σ_{j=1}^{k−1} e^(η_j)). The way you do this: this defines η as a function of φ; take it and solve for φ, and you end up with this — again, there are a couple of steps of algebra I'm just not showing. And then, lastly, using our assumption that the η's are a linear function of the inputs x, φ_i is therefore equal to e^(θ_iᵀx) / (1 + Σ_{j=1}^{k−1} e^(θ_jᵀx)). This is just using the fact that η_i = θ_iᵀx, which was one of our earlier design choices for generalized linear models. So we're just about done. My learning algorithm's hypothesis h_θ(x) — I'm going to think of it as outputting the expected value of T(y) given x, parameterized by θ. T(y) was this vector of indicator functions, (1{y = 1}, …, 1{y = k−1}), so I want my learning algorithm to output the expected value of this vector of indicator functions. But the expected value of 1{y = 1} is just the probability that y = 1, which is given by φ₁ — I have a random variable that's one whenever y equals 1 and zero otherwise, so its expected value is just the probability that y equals 1 — and therefore, by what we worked out earlier, this is e^(θ₁ᵀx) over that same denominator. And so my learning algorithm will output the probabilities that y = 1, y = 2, up to y = k − 1, with these probabilities parameterized by functions like these. Just to give this algorithm a name: it's called softmax regression, and it's widely thought of as the generalization of logistic regression — which is regression with two classes — to the case of k classes rather than two. And just to be very concrete about what you do — say you have a machine learning problem and you want to apply softmax regression to it, working through the entire derivation; I think the question someone had was about how to fit the parameters. Say you have a machine learning problem in which y takes on one of k classes. What you do is sit down and say: I want to model y, given x and θ, as distributed multinomial — so you choose the multinomial as your exponential family. Then you sort of turn the crank, and everything else I wrote down follows automatically from your having made the choice of the multinomial as your exponential family distribution. Then you take your training set (x⁽¹⁾, y⁽¹⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾) — where each value of y now takes on one of k possible values — and you find the parameters of the model by maximum likelihood: you write down the likelihood of the parameters and maximize it. So what's the likelihood? As usual, the likelihood is the product over your training set of p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ) — the same as we had before — and that's the product over your training set of φ₁^{1{y⁽ⁱ⁾ = 1}} · φ₂^{1{y⁽ⁱ⁾ = 2}} ··· φ_k^{1{y⁽ⁱ⁾ = k}}.
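Here is the softmax hypothesis just derived, as a short numpy sketch — an added illustration, not lecture code; stacking θ₁…θ_{k−1} as rows of a matrix is my own packaging choice (the lecture keeps them as separate vectors):

```python
import numpy as np

def softmax_probs(Theta, x):
    """phi_i = e^{theta_i^T x} / (1 + sum_j e^{theta_j^T x}) for i = 1..k-1.
    Theta: (k-1, n) matrix whose rows are theta_1 .. theta_{k-1};
    returns all k class probabilities (the last is phi_k = 1 - sum of the rest)."""
    etas = Theta @ x                      # eta_i = theta_i^T x
    exp_etas = np.exp(etas)
    denom = 1.0 + exp_etas.sum()
    phis = exp_etas / denom              # phi_1 .. phi_{k-1}
    phi_k = 1.0 / denom                  # shorthand for 1 - (phi_1 + ... + phi_{k-1})
    return np.append(phis, phi_k)
```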
Where, for example, φ₁ depends on θ through the formula e^(θ₁ᵀx) / (1 + Σ_j e^(θ_jᵀx)) — the formula I had just now — so φ₁ here is really shorthand for that formula, and similarly for φ₂ and so on, up to φ_k, where φ_k is one minus all of the others. All right — this formula looks more complicated than it really is. What you actually do is write this down, take logs, compute the derivative of the resulting formula with respect to θ, and apply, say, gradient ascent to maximize the likelihood. [Student: what are the rows of θ? Before, θ was just a vector, and now it looks like it's two-dimensional.] Yeah — in the notation I've been using, I think of θ₁ through θ_{k−1} each as an n-dimensional, or (n+1)-dimensional, vector — (n+1)-dimensional if x is (n+1)-dimensional — so you have a set of parameters comprising k − 1 vectors. You could group all of these together into a matrix, but I just haven't been doing that; I've been thinking of them as k − 1 parameter vectors. [Student: what do they correspond to?] I don't know — we're out of time, so let me take that offline; it's hard to answer, in the same way that asking what θ corresponds to for logistic regression you can only sort of answer. [Student: it's kind of like the weights for each feature?] Yeah, a similar interpretation. I'm running a little bit late, so let's officially close for the day, but you can come up if you have more questions and we'll take them offline. Okay, thanks. This presentation is delivered by the Stanford Center for Professional Development. So, what I want to do today is talk about a different type of learning algorithm — in particular, to start talking about generative learning algorithms, and a specific algorithm called Gaussian discriminant analysis. We'll take a slight digression to talk about Gaussians, I'll briefly discuss generative versus discriminative learning algorithms, and then hopefully wrap up today's lecture with a discussion of naive Bayes and Laplace smoothing. Just to motivate our discussion of generative learning algorithms: by way of contrast, the sort of classification algorithms we've been talking about, I think of as algorithms that do this — you're given a training set, and if you run an algorithm like logistic regression on it, the way I think of logistic regression is that it looks at the data and tries to find a straight line to divide the crosses and the O's — a straight line that separates the positive and negative classes as well as possible. In fact, let me show something on the laptop — maybe we'll just use the small monitors for this. Here you see a data set with logistic regression: I've initialized the parameters randomly, so the current hypothesis that logistic regression outputs at iteration zero is the straight line shown at the bottom right; after one iteration of gradient descent the straight line changes a bit, after two iterations, three, four — until logistic regression converges and has found a straight line that more or less separates the positive and negative classes. Okay, so you can think of this as
logistic regression searching for a line that separates the positive and negative classes. What I want to do today is talk about an algorithm that does something slightly different. To motivate it, let's use our old example of classifying between malignant and benign cancers: a patient comes in with a cancer, and you want to know whether it's a malignant — harmful — cancer, or a benign, meaning harmless, cancer. Rather than trying to find a straight line to separate the two classes, here's something else we could do. We can go through our training set and look at all the cases of malignant cancers — all the positive examples — and build a model for what malignant cancers look like; then we'll go through the training set again, take out all the examples of benign cancers, and build a model for what benign cancers look like. Then, when you need to classify a new example — you have a new patient and you want to decide whether this cancer is malignant or benign — you take the new cancer and match it against your model of malignant cancers, match it against your model of benign cancers, and see which model it matches better; depending on which model it matches better, you predict whether the new cancer is malignant or benign. So what I just described — this class of methods where you build a separate model for malignant cancers and a separate model for benign cancers — is called a generative learning algorithm. Let me go ahead and formalize this. The models we've been talking about previously were actually all discriminative learning algorithms. Slightly more formally, a discriminative learning algorithm is one that either learns p(y | x) directly, or learns a hypothesis that outputs values 0 or 1 directly; so logistic regression is an example of a discriminative learning algorithm. In contrast, a generative learning algorithm models p(x | y) — the probability of the features given the class label — and, as a technical detail, it also models p(y), but that's the less important part. The interpretation is that a generative model builds a probabilistic model of what the features look like, conditioned on the class label: conditioned on whether a cancer is malignant or benign, it models the probability distribution over what the features of the cancer look like. Then, having built a model for p(x | y) and p(y), by Bayes' rule you can compute p(y = 1 | x) = p(x | y = 1) p(y = 1) / p(x), and if necessary you can calculate the denominator from these same quantities. So, by modeling p(x | y) and p(y), you can use Bayes' rule to get back to p(y | x) — but a generative learning algorithm starts by modeling p(x | y), rather than p(y | x). We'll talk about some of the trade-offs, and why this may be a better or worse idea than a discriminative model, a bit later. Let's go through a specific example of a generative learning algorithm; for this specific motivating example, I'm going to assume that your input features x are in Rⁿ and are continuous-valued. Under this assumption, let me describe a specific algorithm called Gaussian discriminant analysis, and
the — I guess — core assumption is that, in the Gaussian discriminant analysis model, we're going to assume that p(x | y) is Gaussian. Actually, just raise your hand: how many of you have seen a multivariate Gaussian before — not a 1-D Gaussian, but higher-dimensional? Okay, cool — maybe half to two-thirds of you. So let me say a few words about Gaussians; for those of you who have seen them before, this will be a refresher. We say that a random variable Z is distributed multivariate Gaussian — written Z ∼ N(μ, Σ), script N for normal, with parameters mean μ and covariance Σ — if Z has density p(z) = 1/((2π)^(n/2) |Σ|^(1/2)) · exp(−½ (z − μ)ᵀ Σ⁻¹ (z − μ)). That's the formula for the density: it's the generalization of the one-dimensional Gaussian — the familiar bell-shaped curve — to a high-dimensional, vector-valued random variable z. Don't worry too much about this formula for the density; you rarely end up needing to use it. The two key quantities are the vector μ, which is the mean of the Gaussian, and the matrix Σ, the covariance matrix, where, by the definition of covariance of a vector-valued random variable, Σ = E[(z − μ)(z − μ)ᵀ]. If this doesn't look familiar to you, you might rewatch the discussion section the TAs held last Friday — or the one they'll be holding later this week — which gives a recap of probability. So, the multivariate Gaussian is parameterized by a mean and a covariance. Can I have the laptop display, please — let me show you graphically the effect of varying the parameters of a Gaussian. What I have up here is the density of a zero-mean Gaussian with covariance matrix equal to the identity — the covariance matrix is shown in the upper-right-hand corner of the slide — the familiar bell-shaped curve in two dimensions. If I shrink the covariance matrix, the Gaussian becomes more peaked; if I widen the covariance, say Σ = 2I, the density becomes more spread out. That's back to the standard normal with identity covariance. If I increase the off-diagonals of the covariance matrix — if I make the variables correlated — the Gaussian flattens out along the x = y direction; if I increase them even further, my two variables Z₁ and Z₂ on the horizontal axes become even more correlated. Showing the same thing in contours: the standard normal distribution has contours that are actually circles — because of the aspect ratio these look like ellipses, but they should be circles — and if you increase the off-diagonals of the covariance matrix, the contours become ellipses aligned along the 45° direction in this example. And here's an example of a Gaussian density with negative covariances — now the correlation goes the other way — then an even stronger negative covariance, and the same things shown in contours. The other parameter of the Gaussian is the mean: this is with mean zero, and as you change the mean parameter — say μ = (0, 1.5) — the location of the Gaussian just moves around. All right, so that was a quick primer on what Gaussians look like.
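For reference, the density just written on the board, as a small numpy helper — an added illustration, not from the lecture; the function name is my own:

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate normal density p(x; mu, Sigma) for x, mu in R^n:
    1/((2 pi)^{n/2} |Sigma|^{1/2}) * exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))."""
    n = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))
```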
As a roadmap — a picture to keep in mind as we describe the Gaussian discriminant analysis algorithm — here's what we're going to do. Here's the training set, and in the Gaussian discriminant analysis algorithm I'm going to look at the positive examples, say the crosses, and, looking only at the positive examples, fit a Gaussian distribution to them — so maybe I end up with a Gaussian distribution like that; that's p(x | y = 1). Then I'll look at the negative examples — the O's in this figure — fit a Gaussian to those, and maybe I get a Gaussian centered over there; that's my second Gaussian. Together — I'll say how later — these two Gaussian densities will define a separator for the two classes, and it turns out the separator will be a little bit different from what logistic regression gives you: if you ran logistic regression you'd get the decision boundary shown by the green line, whereas Gaussian discriminant analysis gives you the blue line. Switch back to the chalkboard, please. All right, here's the Gaussian discriminant analysis model. We model p(y) as a Bernoulli random variable as usual, parameterized by the parameter φ — you've seen this before. We model p(x | y = 0) as a Gaussian — oh, you know what, excuse me, I knew this looked strange: there should be a determinant of Σ to the one-half in the denominator on the previous board. So I model p(x | y = 0) as a Gaussian with mean μ₀ and covariance Σ: p(x | y = 0) = 1/((2π)^(n/2) |Σ|^(1/2)) exp(−½ (x − μ₀)ᵀ Σ⁻¹ (x − μ₀)), and similarly p(x | y = 1), with mean μ₁ and the same covariance Σ. So the parameters of this model are φ, μ₀, μ₁ and Σ, and I can now write down the likelihood of the parameters — actually, the log likelihood of the parameters — as ℓ(φ, μ₀, μ₁, Σ) = log ∏ᵢ p(x⁽ⁱ⁾, y⁽ⁱ⁾). In other words, given a training set, the log likelihood of the parameters is the log of the product of the joint probabilities p(x⁽ⁱ⁾, y⁽ⁱ⁾), and this is just equal to log ∏ᵢ p(x⁽ⁱ⁾ | y⁽ⁱ⁾) p(y⁽ⁱ⁾), where each of the terms p(x⁽ⁱ⁾ | y⁽ⁱ⁾) and p(y⁽ⁱ⁾) is given by one of the three equations on top. I want to contrast this again with discriminative learning algorithms. To give it a name, this is sometimes called the joint likelihood — the joint data likelihood. Contrast it with what we had previously, when we were talking about logistic regression, where the log likelihood of the parameters θ was log ∏_{i=1}^m p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ): back when we were fitting logistic regression models, or generalized linear models, we were always modeling p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ), and that was the conditional likelihood, in which we model y⁽ⁱ⁾ given x⁽ⁱ⁾ — whereas now, with generative learning algorithms, we look at the joint likelihood, p(x⁽ⁱ⁾, y⁽ⁱ⁾). So, given a training set, to use the Gaussian discriminant analysis model we fit the parameters by maximum likelihood estimation as usual: you maximize ℓ with respect to the parameters φ, μ₀, μ₁, Σ. And when you find the maximum likelihood estimates of the parameters, you find — no surprise; I'm writing this down mainly as practice with indicator notation —
that the maximum likelihood estimate for φ is (1/m) Σᵢ y⁽ⁱ⁾, or, written alternatively, the sum over all your training examples of 1{y⁽ⁱ⁾ = 1}, divided by m. In other words, the maximum likelihood estimate for the Bernoulli parameter φ is just the fraction of training examples with label one — with y = 1. The maximum likelihood estimate for μ₀ is μ₀ = Σᵢ 1{y⁽ⁱ⁾ = 0} x⁽ⁱ⁾ / Σᵢ 1{y⁽ⁱ⁾ = 0}. You should stare at this for a second and see if it makes sense — I'll let you write down the analogous one for μ₁ yourself. So, what this is: the denominator is the sum, over your training set, of the indicator 1{y⁽ⁱ⁾ = 0} — for every training example with y⁽ⁱ⁾ = 0, this increments the count by one — so the denominator is just the number of examples with label zero. For the numerator: summing from i = 1 to m, every time y⁽ⁱ⁾ equals zero this indicator is one, and otherwise it's zero; so the indicator function means you include only the terms for which y⁽ⁱ⁾ equals zero, because all the terms where y⁽ⁱ⁾ equals one contribute zero to the sum. You multiply that by x⁽ⁱ⁾, so the numerator is the sum of the x⁽ⁱ⁾ corresponding to examples whose class label was zero. Raise your hand if this makes sense... okay, cool. So just to say this in words: it just means, look through your training set, find all the examples for which y equals zero, and take the average of the values of x over those examples — take all your negative training examples and average their x values, and that's μ₀. If this notation is still a little bit cryptic — if you're still not sure why this equation translates into what I just said — do go home and stare at it for a while until it makes sense. It's sort of no surprise — it just says: to estimate the mean for the negative examples, take all your negative examples and average them — but it's useful practice with indicator notation. Again, you can derive the maximum likelihood estimate for Σ as well; I won't do that here — you can read it in the notes yourself. And so, having fit the parameters φ, μ₀, μ₁ and Σ to your data, we now need to make a prediction. When you're given a new value of x — a new cancer to classify as malignant or benign — your prediction is the most likely value of y given x (I should write the parameters after a semicolon, but I'll just skip that): argmax_y p(y | x), which by Bayes' rule is argmax_y p(x | y) p(y) / p(x). Because the denominator p(x) doesn't depend on y, this is in turn just argmax_y p(x | y) p(y); and if p(y) is uniform — in other words, if each of your classes is equally likely, so p(y) takes the same value for all values of y — then this is just argmax_y p(x | y). That happens sometimes, though maybe not very often, so usually you end up using the formula with p(x | y) and p(y), both computed using your model. [Student question about what argmax means.] So, let's see: the argmax means the value of y that maximizes this. Just as an example: min_x (x − 5)² is zero, because by choosing x = 5 you make the expression zero, while argmin_x (x − 5)² is equal to 5, because 5 is the value of x that achieves the minimum. Okay, cool — thanks for asking that. Any other questions about this?
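Pulling the fitting and prediction steps together, here is a minimal sketch — an added illustration, not lecture code; the function names are my own, and predict_gda reuses the gaussian_density helper sketched earlier:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for Gaussian discriminant analysis.
    X: (m, n) real-valued features; y: (m,) labels in {0, 1}."""
    phi = np.mean(y == 1)                 # fraction of examples with label one
    mu0 = X[y == 0].mean(axis=0)          # average of the negative examples
    mu1 = X[y == 1].mean(axis=0)          # average of the positive examples
    # shared covariance: average outer product of (x - mu_{y_i})
    diffs = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = diffs.T @ diffs / len(y)
    return phi, mu0, mu1, Sigma

def predict_gda(x, phi, mu0, mu1, Sigma):
    # argmax_y p(x|y) p(y), using the class-conditional Gaussian densities
    p0 = gaussian_density(x, mu0, Sigma) * (1 - phi)
    p1 = gaussian_density(x, mu1, Sigma) * phi
    return int(p1 > p0)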
[Student: why is it distributed uniform?] Oh — by "uniform" I was being loose; I meant p(y = 0) = p(y = 1), that is, y is uniform over the set {0, 1}. That's all I meant. Anything else? Okay. So, it turns out Gaussian discriminant analysis has an interesting relationship to logistic regression — let me illustrate that. Let's say you have a training set — let me draw a 1-D training set; that'll kind of work — comprising a few negative and a few positive examples, and let's say I run Gaussian discriminant analysis: I'll fit a Gaussian density to each of the two classes, my positive and my negative training examples. So maybe my positive examples, the X's, are fit with a Gaussian like this, and for my negative examples I fit a Gaussian that looks like that. I hope this picture makes sense. Now let's vary x along the horizontal axis, and what I want to do is overlay on top of this plot a plot of p(y = 1 | x) for a variety of values of x. Actually, I realize what I should have done: I'm going to call the X's the negative examples and the O's the positive examples — it makes the plot come out nicer. So let's take a fairly small value of x, say this value here on the horizontal axis. What's the probability of y being equal to one, conditioned on x? The way you calculate it is to write p(y = 1 | x) and plug in all of these formulas as usual: p(x | y = 1), which is your Gaussian density, times p(y = 1), which is essentially just φ, divided by p(x) — I'll show you how to calculate p(x) in a moment — all computed using these two Gaussians and my prior on y. In this case, if x is this small, it clearly belongs to the left Gaussian and is very unlikely to belong to the positive class, so p(y = 1 | x) will be very small — very close to zero, say. Then we can increment the value of x, consider a slightly different value, and ask again what p(y = 1 | x) is — again, it'll be pretty small. Now let's choose a point right where the two Gaussian densities have equal value: if I ask, for the x shown by the arrow, what's the probability of y being one for that value of x, you really can't tell, so maybe it's about 0.5. If you fill in a bunch more points, you get a curve like that, and you can keep going: for a point way out there, what's the probability of y being one? Well, if x is that far out, it clearly belongs to the rightmost Gaussian, so the probability of y being one will be very high — almost one. And so you can repeat this exercise for a bunch of points — compute p(y = 1 | x) at each point — and if you connect up the points, you find that the curve you've just plotted takes the form of a sigmoid function. In other words, when you make the assumption, as under the Gaussian discriminant analysis model, that p(x | y) is Gaussian, and you then go back and compute what p(y | x) is, you actually get back exactly the same sigmoid function that we used in logistic regression.
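The dot-plotting exercise just described can be reproduced in a few lines — an added illustration under the lecture's 1-D, equal-variance setup; all of the numbers are arbitrary assumptions:

```python
import numpy as np

# 1-D illustration: two class-conditional Gaussians and the posterior p(y=1|x)
mu0, mu1, var, phi = -1.0, 1.0, 1.0, 0.5

def gauss1d(x, mu):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

xs = np.linspace(-6, 6, 200)
p1 = gauss1d(xs, mu1) * phi               # p(x|y=1) p(y=1)
p0 = gauss1d(xs, mu0) * (1 - phi)         # p(x|y=0) p(y=0)
posterior = p1 / (p0 + p1)                # Bayes rule: p(y=1|x)
# plotting `posterior` against `xs` traces out a sigmoid curve
```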
But it turns out the key difference is that Gaussian discriminant analysis will end up choosing a different position and steepness of the sigmoid than logistic regression would. [Student question about which Gaussian is which.] Let's see — this Gaussian is p(x | y = 1), and this Gaussian is p(x | y = 0). Does that make sense? Anything else? [Student: could you go over how you figured out where to draw each dot?] Let's see — okay, so the computation is as follows. I have a training set, and given my training set, I fit a Gaussian discriminant analysis model to it, which means I build a model for p(x | y = 1), I build a model for p(x | y = 0), and I also fit a Bernoulli distribution to p(y). So, given my training set, I fit p(x | y) and p(y) to my data, and that chooses my parameters φ, μ₀, μ₁ and Σ. Then the process I went through to plot all those dots is: I pick a point on the x-axis, I compute p(y = 1 | x) for that value of x — it's some real number between zero and one — and whatever that real number is, I plot it on the vertical axis. And the way I compute p(y = 1 | x) is by using the quantities p(x | y) and p(y), plugging them into Bayes' rule; that allows me to compute p(y | x) from those quantities. Does that make sense? [Student: and how did you model p(x)?] Oh, okay — so p(x) can be written as p(x) = p(x | y = 0) p(y = 0) + p(x | y = 1) p(y = 1), and each of these terms — p(x | y) and p(y) — is something I can get directly out of the Gaussian discriminant analysis model. So I plug those into the denominator, and that's how I compute p(y = 1 | x). Makes sense? Okay, cool. So let's talk a little bit about the advantages and disadvantages of using a generative learning algorithm. In the particular case of Gaussian discriminant analysis, we assumed that x conditioned on y is Gaussian, and the argument I showed on the previous chalkboard — I didn't prove it formally, but you can go back and prove it yourself — is that if you assume x given y is Gaussian, then that implies that when you compute p(y | x) you find a logistic posterior: the form of p(y = 1 | x) will be a logistic function. And it turns out that the implication in the opposite direction does not hold. In particular — this is actually kind of cool — it turns out that if you assume x given y = 1 is Poisson with parameter λ₁, and x given y = 0 is Poisson with parameter λ₀, then that also implies that p(y | x) is logistic. So there are lots of assumptions on x given y that lead to p(y | x) being logistic, and therefore the assumption that x given y is Gaussian is a stronger assumption than the assumption that y given
x is logistic — because the first implies the second, the Gaussian assumption is the stronger one: the logistic form holds whenever x given y is Gaussian, but not vice versa. And this leads to the trade-offs between Gaussian discriminant analysis and logistic regression. Gaussian discriminant analysis makes the much stronger assumption that x given y is Gaussian, and when this assumption is true, or approximately holds — if you plot the data and x given y is indeed approximately Gaussian — then making this assumption explicit in the algorithm means the algorithm will do better, because it's as if the algorithm is making use of more information about the data: the algorithm knows the data is Gaussian. So if the Gaussian assumption holds, or roughly holds, Gaussian discriminant analysis may do better than logistic regression. Conversely, if you're actually not sure what x given y is, then logistic regression — the discriminative algorithm — may do better. In particular, if you use logistic regression, and maybe you secretly thought the data was Gaussian but it turned out to be Poisson, logistic regression would still do perfectly fine — because if the data were actually Poisson, p(y = 1 | x) would still be logistic — whereas if you assumed it was Gaussian, the algorithm may go off and do something that's not as good. So, to say it slightly differently: it turns out the real advantage of generative learning algorithms is often that they require less data. In particular, data is never really exactly Gaussian — it's at best approximately Gaussian — and it turns out that generative learning algorithms often do surprisingly well even when these modeling assumptions are not met. But one trade-off is that, by making stronger assumptions about the data, Gaussian discriminant analysis often needs less data to fit an okay model — it can do reasonably even with less training data — whereas logistic regression, by making fewer assumptions, is more robust to modeling assumptions, but sometimes needs a slightly larger training set to fit than Gaussian discriminant analysis. [Student question: do we need any assumption about the number of samples? Because here we assume that p(y = 1) equals the number of positive samples over the total number of samples — is that true when the number of samples is small?] Okay, so let me try answering that differently. The modeling assumptions are made independently of the size of the training set. In all of these models I'm assuming these are random variables drawn from some distribution, and then finally I'm given a single training set, and I try to estimate the parameters of the distribution. So — and this gets back a little bit to the philosophy of maximum likelihood estimation — I'm assuming that p(y) = φ^y (1 − φ)^(1−y), that there is some true value of φ generating all my data. And when I write this down, I guess there are really two values of φ: one is the true underlying value of φ that, you know, "God" used to
So when I was writing those formulas earlier, those formulas for phi, mu0, and mu1 were really the maximum likelihood estimates of phi, mu0, and mu1, and that's different from the true underlying values of phi, mu0, and mu1. The maximum likelihood estimate comes from the data, and there's some true underlying value of phi that I'm trying to estimate: my maximum likelihood estimate is my attempt to estimate that true value. But, by notational convention, often I'll just write the same symbol for both, without bothering to distinguish between the maximum likelihood value and the true underlying value that's out there and that I'm only hoping to estimate. For philosophical questions like these about maximum likelihood and so on, I hope the TAs' Friday discussion sections are a good time to ask about probabilistic definitions like these as well. Are there other questions? Okay. Oh — let me just mention one more thing that's kind of cool. I said that if x given y is Poisson, then you also get a logistic posterior. It actually turns out there's a more general version of this: if you assume x given y = 1 is exponential family with natural parameter eta_1, and x given y = 0 is exponential family with natural parameter eta_0, then this implies that p(y = 1 | x) is also logistic. And that's kind of cool: it means that x given y could be, I don't know, some strange thing — it could be Gaussian, as we've seen; it could be gamma, exponential, beta — just rattling off my mental list of exponential family distributions — it could be any one of those things. So long as it's the same exponential family distribution for the two classes, with different natural parameters, the posterior p(y = 1 | x) will be logistic. And this shows the robustness of logistic regression to the choice of modeling assumptions: it could be that the data was actually gamma distributed, and the posterior still turns out to be logistic. So: robustness of logistic regression to modeling assumptions. And I think early on I promised two justifications for where I pulled the logistic function out of a hat: one was the exponential family derivation we went through last time, and this is the second one — all of these modeling assumptions also lead to the logistic function. Question — if p(y = 1 | x) is logistic, does that imply the exponential family assumption? No, this direction is also not true: the exponential family assumption implies p(y = 1 | x) is logistic, but the reverse implication does not hold — there are actually all sorts of really bizarre distributions for x that would also give rise to a logistic function. Okay. So that was our first generative learning algorithm. Let me now talk about the second generative learning algorithm, which is called the naive Bayes algorithm, and the motivating example I'm going to use is spam classification. Let's say that you want to build a spam classifier to take your incoming stream of email and decide whether it's spam or not. Let's say y will be zero or one, with one being spam email and zero being non-spam, and the first decision we need to make is: given a piece of email, how do you represent it using a feature vector x?
Email is just a piece of text — your email is like a list of words, or a list of ASCII characters — and you want to represent it as a feature vector x. We'll use a couple of different representations, but the one I'll use today is this: I'm going to go through my dictionary and make a listing of all the words in it. So the first word is "a", the second word in my dictionary is "aardvark", then "aardwolf", and maybe somewhere along the way you see the word "buy" — you see spam emails, and you buy stuff. Depending on how you collect your list of words, you won't find "cs229" — a course call number — in a dictionary, but if you collect your list of words from other emails you've gotten, you'd have it in the list somewhere as well. And the last word in my dictionary is "zymurgy", which pertains to the branch of chemistry that deals with the fermentation process in brewing. So I get a piece of email, and what I'll do is scan through this list of words, and wherever a certain word appears in my email, I put a one there. So if the email has the word "a", that entry is a one; my email doesn't have the words "aardvark" or "aardwolf", so those get zeros; it's a piece of email where they want me to buy something; "cs229" doesn't appear; and so on. So this would be one way of creating a feature vector to represent a piece of email.
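As a tiny sketch of my own (with a toy six-word stand-in for the ~50,000-word dictionary), the 0/1 representation just described looks like this:

```python
# Toy stand-in for the ~50,000-word dictionary from the lecture.
vocab = ["a", "aardvark", "aardwolf", "buy", "cs229", "zymurgy"]

def email_to_features(email_text):
    """Binary bag-of-words: x[i] = 1 iff vocab word i appears in the email."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(email_to_features("please buy my cs229 notes"))  # [0, 0, 0, 1, 1, 0]
```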
Now let's build a generative model for this. I want to model p(x | y) — p(x | y = 0) and p(x | y = 1) — and my feature vectors are going to be in {0, 1}^n: binary-valued vectors that are n-dimensional, where n may be on the order of, say, 50,000 if you have 50,000 words in your dictionary, which is not atypical — values from the mid-thousands to the tens of thousands are fairly typical for problems like these. And therefore there are 2^50,000 possible values for x: 2^50,000 possible bit vectors of length 50,000. So one way to model this is with a multinomial distribution, but because there are 2^50,000 possible values for x, I would need 2^50,000 minus one parameters — minus one because they have to sum to one — and this is clearly way too many parameters to model with a multinomial distribution over all 2^50,000 possibilities. So in the naive Bayes algorithm, I'm going to make a very strong assumption on p(x | y). Let me just say what it's called, then write out what it means: I'm going to assume that the x_i are conditionally independent given y. Here's what this means. We have p(x_1, x_2, ..., x_50,000 | y). By the chain rule of probability, this is p(x_1 | y) times p(x_2 | y, x_1) times p(x_3 | y, x_1, x_2) times dot dot dot — you get the idea — up to p(x_50,000 | y, x_1, ..., x_49,999). This is the chain rule of probability; it always holds — I've not made any assumption yet. Now we make what's called the naive Bayes assumption, the assumption that the x_i are conditionally independent given y: nothing changes for the first term, but I'm going to assume that p(x_2 | y, x_1) is equal to p(x_2 | y), that the next term equals p(x_3 | y), and so on, up to p(x_50,000 | y). Or, written more compactly, I'm going to assume that p(x_1, ..., x_50,000 | y) is the product from i = 1 to 50,000 of p(x_i | y). Stated informally, what this means is that I'm assuming that so long as you know the class label y — so long as you know whether this is spam or not spam — then knowing whether the word "a" appears in the email does not affect the probability of whether the word "aardwolf" appears in the email. In other words, once you know whether an email is spam or not spam, knowing whether some words appear in the email won't help you predict whether any other word appears in it. And obviously this assumption is false, right? It can't possibly be true: if you see the word "cs229" in an email, you're much more likely to see my name in the email, or the TAs' names, or whatever — so this assumption just fails for normal written English. But it turns out that despite the assumption being false in the literal sense, the naive Bayes algorithm is an extremely effective algorithm for classifying text documents: into spam or not spam, for classifying your emails into different email folders automatically, for looking at web pages and classifying whether a web page is trying to sell something, or whatever. This assumption works very well for classifying text documents, and for other applications too that I'll talk a bit about later. As a digression that will make sense to only some of you: if you're familiar with Bayesian networks and graphical models, the Bayesian network associated with this model looks like this — a random variable y that then generates x_1, x_2, through x_50,000. If you've not seen Bayesian networks before, just ignore this; it's not important for our purposes. But if you have seen them, that's what it looks like. Okay. So the parameters of the model are as follows: phi_{i | y = 1}, which is the probability p(x_i = 1 | y = 1); phi_{i | y = 0}; and phi_y. These are the parameters of the model, and to fit them you can write down the joint likelihood of the parameters, as usual. So given a training set, you write down the joint likelihood, and when you do maximum likelihood estimation, you find that the maximum likelihood estimates of the parameters are pretty much what you'd expect. The maximum likelihood estimate for phi_{j | y = 1} is the sum from i = 1 to m of the indicator that x^(i)_j = 1 and y^(i) = 1, divided by the sum from i = 1 to m of the indicator that y^(i) = 1. Stated more simply, the numerator says: run through your entire training set, summing over training examples, and count up the number of times you saw word j in a piece of email for which the label y was equal to one — in other words, look at all your spam emails and count the number of those emails in which word j appeared. And the denominator is just the number of spam emails you got. So this ratio is: of all the spam emails in your training set, what fraction did word j — the j-th word in your dictionary — appear in? And that's your maximum likelihood estimate for the probability of seeing word j conditioned on the email being spam.
Similarly, your maximum likelihood estimate for phi_y is pretty much what you'd expect: the fraction of training emails that are spam. And so, having estimated all these parameters, when you're given a new piece of email that you want to classify, you can compute p(y | x) using Bayes' rule, same as before. Together, these parameters give you a model for p(x | y) and for p(y), and using Bayes' rule on those two terms, you can compute p(y | x) — and this gives you a spam classifier. It turns out we need one more elaboration to this idea, but let me check if there are questions about this so far. Does this model depend on the number of inputs? — the number of inputs, meaning the number of features? — no, the number of samples. Well, m is the number of training examples, so given m training examples, this is the formula for the maximum likelihood estimate of the parameters. Does that make sense? When you have m training examples, you plug into this formula, and that's how you compute the maximum likelihood estimates. The number of emails? Right — so to collect a training set, I would go through all the email I've gotten in, say, the last two months and label each one as spam or not spam, and so you'd have a few hundred emails labeled spam or not spam, and that comprises your training set: (x^(1), y^(1)) through (x^(m), y^(m)), where x is one of those vectors representing which words appeared in the email, and y is 0 or 1 depending on whether it's labeled spam or not spam. You're saying this model depends on the number of samples, but the last model doesn't? — well, those are different things. There's the model, which is the set of modeling assumptions I've made — I'm making the naive Bayes assumption, so the probabilistic model is an assumption on the joint distribution of x and y. And then I'm given a fixed number of training examples, m of them, and only after I'm given a training set do I go and derive the maximum likelihood estimates of the parameters. Maybe we should take that offline. Last question: how do you select the 50,000 words? It turns out this is a very practical question — how do I come up with this list of words? In practice, one common way is just to take all the words that appear in your training set. Or, if that turns out to be too many words, you can take all the words that appear at least three times in your training set, so that words you didn't see at least three times in the emails you got over the last two months get discarded. I was talking about going through a dictionary, which is a nice way of thinking about it, but in practice you might go through your training set and just take the union of all the words that appeared in it. Sometimes there are even better ways to select these features, but this is one way to think about creating your feature vector with zero and one values.
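Putting those maximum likelihood formulas together — a hedged sketch of my own, assuming X is an m-by-n binary matrix and y a vector of 0/1 labels (neither the data nor the function name is from the lecture):

```python
import numpy as np

def fit_naive_bayes(X, y):
    X, y = np.asarray(X), np.asarray(y)
    phi_j_y1 = X[y == 1].mean(axis=0)  # fraction of spam emails containing word j
    phi_j_y0 = X[y == 0].mean(axis=0)  # fraction of non-spam emails containing word j
    phi_y = y.mean()                   # fraction of emails that are spam
    return phi_j_y1, phi_j_y0, phi_y

# Three toy emails over a three-word vocabulary, the first two labeled spam.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
y = [1, 1, 0]
print(fit_naive_bayes(X, y))  # ([1.0, 0.5, 0.5], [0.0, 0.0, 1.0], 0.666...)
```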
Okay, one more question — I'm not entirely clear how you came up with the parameters. Okay, so in naive Bayes, I need to build a model for p(x | y) and for p(y) — in generative learning algorithms, I need to come up with models for both of these. So how do I model p(y)? Well, I just chose to model it with a Bernoulli distribution, so p(y) is parameterized by phi_y. And how do I model p(x | y)? Under the naive Bayes assumption, I assume that p(x | y) is the product of these per-word probabilities, and so I'm going to need parameters telling me the probability of each word occurring, or not occurring, conditioned on the email being spam or non-spam. Why Bernoulli? Because x_i is either zero or one: by the way I defined the feature vector, x_i is one or zero depending on whether word i appears in the email, so x_i is always 0 or 1, and by definition, if x_i is either zero or one, it has to be a Bernoulli distribution. If x_i were continuous, you might model it as a Gaussian, and you'd end up with something like what we did in Gaussian discriminant analysis; it's just that the way I constructed my features for email, x is always binary-valued, and so you end up with a Bernoulli here. All right, I should move on. So it turns out this idea almost works, and here's the problem. Let's say you complete this class, and you start to do a class project, and you keep working on it for a bit and it becomes really good, and you want to submit it to a conference. So, you know, around June every year is the paper deadline for the NIPS conference — that's just the name of a conference, some acronym — and maybe you send your project partners or your friends email saying, hey, let's work on a project and submit it to the NIPS conference. And suddenly you're getting these emails with the word "nips" in them, which you've probably never seen before. A piece of email comes from your project partner saying, let's send the paper to the NIPS conference — and then your spam classifier: let's say "nips" is the 30,000th word in your dictionary, so p(x_30,000 = 1 | y = 1) will be equal to zero. That's the maximum likelihood estimate, because you've never seen the word "nips" in your training set, so the maximum likelihood estimate of the probability of seeing the word "nips" is zero. And similarly, in non-spam mail, the chance of seeing the word "nips" is also estimated as zero. So when your spam classifier goes to compute p(y = 1 | x), it computes p(x | y = 1) p(y = 1) over p(x), and if you look at the p(x | y = 1) term, it's the product from i = 1 to 50,000 of p(x_i | y), and one of those probabilities is equal to zero, because p(x_30,000 = 1 | y = 1) = 0. So you have a zero in the product, and the numerator is zero — and in the same way it turns out the denominator will also be zero; all of these terms end up being zero — so you end up with p(y = 1 | x) equal to 0 over (0 + 0), which is undefined.
And the problem with this is that it's just statistically a bad idea to say that p(x_30,000 | y) is zero. Just because you haven't seen the word "nips" in your last two months' worth of email, it's statistically not sound to conclude that the chance of ever seeing this word is zero. Just because you haven't seen something before may mean that the event is unlikely, but it doesn't mean it's impossible — and saying that if you've never seen the word "nips" before, then there's zero chance of ever seeing it in future emails, is just wrong. So we're going to fix this, and to motivate the fix, here's the example I'm going to use. Let's say that you've been following the Stanford basketball team for all of their away games, tracking their wins and losses to gather statistics and, I don't know, maybe form a betting pool about whether they're likely to win or lose the next game. So here are some of the statistics: on, I guess, the 8th of February last season they played Washington State, and they did not win; on the 11th of February they played Washington; on the 22nd they played USC; then UCLA; then USC again. And now you want to estimate the chance that they win or lose their next away game. So far, they've lost five times in a row, and these were all away games — but it seems awfully harsh to say there's zero chance they win the next one. So here's the idea behind Laplace smoothing. We're estimating the probability of y being equal to one, and normally the maximum likelihood estimate is the number of ones divided by the number of zeros plus the number of ones — hopefully this informal notation makes sense: normally, the maximum likelihood estimate for a Bernoulli random variable, for a win or a loss, is just the number of ones you saw divided by the total number of examples, which is the number of zeros plus the number of ones. In Laplace smoothing, we take each of these counts and add one: add one to the number of ones, and add one to the number of zeros. And so in our example, instead of estimating the probability of winning the next game as 0 over (5 + 0), we add one to all of these counts and say the chance of their winning the next game is 1/7. Which means that having seen them lose five away games in a row, we don't think it's terribly likely they win the next game — but at least we're not saying it's impossible. As a historical side note, Laplace actually came up with this method — it's called Laplace smoothing after him — when he was trying to estimate the probability that the sun will rise tomorrow. His rationale was: for a lot of days now, we've seen the sun rise, but that doesn't mean we can be absolutely certain the sun will rise tomorrow — and he used this estimate for the probability that the sun will rise tomorrow. Isn't that kind of cool?
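Numerically, the fix is one line — a small sketch of Laplace smoothing as just described, adding one to each count and k (the number of possible outcomes) to the denominator:

```python
def laplace_estimate(count_j, total, k):
    """Smoothed estimate of outcome j: (count + 1) / (total + k outcomes)."""
    return (count_j + 1) / (total + k)

# The basketball example: 0 wins in 5 games, k = 2 outcomes (win / lose).
print(laplace_estimate(0, 5, 2))  # 1/7 ~= 0.1428..., rather than 0
```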
More generally, if y takes on k possible values — if you're trying to estimate the parameter of a multinomial — then the maximum likelihood estimate of p(y = j) is the sum from i = 1 to m of the indicator that y^(i) = j, divided by m. That's the maximum likelihood estimate for the probability that y equals j, and when you apply Laplace smoothing to it, you add one to the numerator and add k to the denominator, if y can take on k possible values. So for naive Bayes, what that gives us is: we take the maximum likelihood estimate from before, add one to the numerator, and add two to the denominator — and this solves the problem of the zero probabilities. When your friend sends you email about the NIPS conference, your spam filter will still be able to make a meaningful prediction. Okay — any questions about this? Yes: to what extent does this make sense? For instance, for those games, it seems quite reasonable to say that the probability of winning is very close to zero, so shouldn't the prediction be equal to zero? Well, in this case the prediction is 1/7. If you see someone lose five games in a row, you may not have a lot of faith in them, but as an extreme example, suppose you saw them lose just one game: it's just not reasonable to say the chance of winning the next game is zero — yet that's what maximum likelihood would say, and in such a case the learning algorithm simply doesn't have enough data. So the question is, given just five training examples, what's a reasonable estimate for the chance of winning the next game? And 1/7 is, I think, actually pretty reasonable: it says winning the next game is fairly unlikely, without calling it impossible. It turns out that under a certain set of Bayesian assumptions about the prior and posterior — which I won't go into — Laplace smoothing actually gives the optimal estimate, in a certain sense I also won't go into, of the chance of winning the next game; it's optimal under a certain assumption about the Bayesian prior on the parameter. So it actually seems like a pretty reasonable estimate to me. Although I should say — actually, no, I'm just being mean; we actually have a pretty good basketball team, I just chose a losing streak because it's funnier that way. Let's see — are there other questions about this? Okay, so there's more that I want to say about naive Bayes, but we'll do that in the next lecture, so let's wrap up for today. This presentation is delivered by the Stanford Center for Professional Development. What I want to do today is continue our discussion of naive Bayes, the learning algorithm I started to discuss in the previous lecture, and talk about a couple of different event models for naive Bayes; then I'll take a brief digression to talk about neural networks, which is something I actually won't spend a lot of time on; and then I want to start to talk about support vector machines — the supervised learning algorithm that many people consider the most effective off-the-shelf supervised learning algorithm. That point of view is debatable, but many people hold it, and we'll start discussing it today; it will actually take us a few lectures to complete. So let's talk about naive Bayes, and recap from the previous lecture.
I started off describing spam classification as the motivating example for naive Bayes, in which we would create feature vectors like these, corresponding to words in the dictionary, so that based on which words appear in a piece of email, it's represented as a feature vector with ones and zeros in the corresponding places. Naive Bayes is a generative learning algorithm, and by that I mean it's an algorithm in which we model p(x | y) — and for naive Bayes specifically, we modeled it as the product from i = 1 to n of p(x_i | y) — and we also model p(y), and then we use Bayes' rule to combine the two. So when you're given a new piece of email and you want to tell whether it's spam or not, you predict the arg max over y of p(y | x), which by Bayes' rule is the arg max over y of p(x | y) times p(y). Okay, so that's naive Bayes. And just to draw attention to two things: one, in this model each of our features was 0/1, indicating whether different words appear; and two, the length n of the feature vector was the number of words in the dictionary, so it might be on the order of, say, 50,000 words. What I want to do now is describe two variations on this algorithm. The first one is the simpler one — it's just the generalization to the case where x_i takes on more values. One thing that's commonly done is to apply naive Bayes to problems where some of the features x_i take on k values rather than just two, and in that case you build a very similar model, where p(x | y) is really the same product, except these are now multinomial probabilities rather than Bernoullis, because each x_i can take on up to k values. It turns out that one situation where this arises very commonly is when you have a feature that's actually continuous-valued, and you choose to discretize it — to take a continuous-valued feature and discretize it into a finite set of k values. As a concrete example: if you remember our very first supervised learning problem, predicting the price of houses — suppose you instead have a classification problem on these houses, where based on features of a house you want to predict whether or not the house will be sold in the next six months. If you want to use naive Bayes, then given a continuous-valued feature like the living area, one pretty common thing to do is to discretize it into a few discrete buckets: depending on whether the living area of the house is less than 500 square feet, between 500 and 1,500 square feet, and so on, or greater than 2,000 square feet, you choose the value of the corresponding feature x_i to be 1, 2, 3, or 4. So that was the first variation, or generalization, of naive Bayes — let me just check for questions about this. Okay, cool. And it turns out that in practice it's fairly common to use about 10 buckets to discretize a continuous-valued feature; I drew four here only to save on writing.
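Here's a quick sketch of that discretization step; the bucket edges are hypothetical, chosen just to mirror the living-area example:

```python
import numpy as np

edges = [500, 1500, 2000]                 # hypothetical cut points in square feet
areas = np.array([350, 800, 1700, 2600])  # continuous living areas
buckets = np.digitize(areas, edges) + 1   # +1 so the labels run 1..4, as on the board
print(buckets)  # [1 2 3 4]
```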
The second, and final, variation I want to talk about for naive Bayes is specific to classifying text documents, or more generally to classifying sequences. A text document, like a piece of email, you can think of as a sequence of words, and you can apply the model I'm about to describe to classifying other sequences as well, but let me just focus on text. Here's the idea: in the naive Bayes algorithm as I've described it so far, given a piece of email, we were representing it using this binary-valued vector representation, and one of the things that representation loses is the number of times different words appear. For example, if some word appears a lot of times — say you see the word "buy" a lot, or you see the word "Viagra" (which seems to be a common example) a ton of times in an email — that email is more likely to be spam than if it appears only once; even once, I guess, is enough. So let me describe a different, what's called an event model, for naive Bayes, that will take into account the number of times a word appears in the email. And to give the previous model a name as well: that particular model for text classification is called the multivariate Bernoulli event model. It's not a great name — it refers to the fact that there are multiple Bernoulli random variables, but really, don't worry about what the name means. In contrast, what I want to do now is describe a different representation for email in terms of the feature vector, and this is called the multinomial event model. Again, there's a rationale behind the name, but it's slightly cryptic, so don't worry about why it's called that. Here's what we're going to do: given a piece of email, I'm going to represent it as a feature vector, and my i-th training example x^(i) will be a feature vector (x^(i)_1, x^(i)_2, ..., x^(i)_{n_i}), where n_i is equal to the number of words in this email. So if one of my training examples is an email with 300 words in it, then I represent this email via a feature vector with 300 elements, and each element of the feature vector — let's write it as x_j — will be an index into my dictionary. So if my dictionary has 50,000 words, then each position in my feature vector is a variable that takes on one of 50,000 possible values, corresponding to which word appeared in the j-th position of my email. In other words, I take all the words in my email, and I get a feature vector that just records which word in my dictionary each word of the email was. So it's a different definition of n — n now varies, and is different for every training example — and these x_j's are now indices into the dictionary, not binary random variables; they take on a much larger set of values.
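A rough sketch of this representation (toy vocabulary, my own code): each email becomes a variable-length list of dictionary indices rather than a fixed-length 0/1 vector.

```python
vocab = ["a", "aardvark", "buy", "discount", "nips", "zymurgy"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def email_to_indices(email_text):
    """x_j = dictionary index of the j-th word; n = len(x) varies per email."""
    return [word_to_index[w] for w in email_text.lower().split()
            if w in word_to_index]

print(email_to_indices("buy a discount aardvark"))  # [2, 0, 3, 1]
```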
So our generative model for this will be that the joint distribution over x and y is p(y) times the product over the positions j of p(x_j | y), where again n is now the length of the email. The way to think about this formula is: imagine there was some probability distribution over emails — some random process that generated emails — and that process proceeds as follows. First, y is chosen: the class label, whether someone is going to send you spam or not, is generated first. Then, having decided whether to send you spam or not spam, the sender iterates over all, say, 300 positions of the email — the 300 words they're going to compose — and generates each word from some distribution that depends on whether they chose to send you spam or not. If they're sending spam, they'll tend to generate words like "buy" and "Viagra" and "discount" and "sale" and whatever; and if someone chose to send you non-spam, you get the more normal words you see in email. So just be careful: x_j here has a very different definition than in the previous event model, and n has a very different definition too. The parameters of this model are: phi_{k | y = 1}, which is the probability that, conditioned on someone having decided to send you spam, the next word they choose to put in the spam email is word k; similarly phi_{k | y = 0} — same thing, I'll just write it out; and phi_y, which is the same as before. So these are the parameters of the model, and given a training set, you can work out their maximum likelihood estimates. The maximum likelihood estimate of phi_{k | y = 1} is equal to — and I'm going to write one of these big indicator-function things again — the sum over your training set of the indicator that y^(i) = 1, times the sum from j = 1 to n_i (where n_i is the number of words in email i) of the indicator that x^(i)_j = k, all divided by the sum over i of the indicator that y^(i) = 1 times n_i. So the numerator says: sum over all your emails, taking into account only the emails that had class label one — only the spam emails, since if y^(i) = 0 the indicator is zero and that term goes away — and then sum over all the words in each spam email, counting up the number of times you observed word k. In other words, the numerator is: look at all the spam emails in your training set, and count up the total number of times the word k appeared across them. The denominator is: sum over your entire training set, and whenever one of your examples is spam, add up the length of that spam email — so the denominator is the total length of all the spam emails in your training set. And so the ratio is: of all the words in all your spam emails, what fraction were word k? That's your estimate for the probability of the next piece of spam mail generating the word k in any given position. At the end of the previous lecture I talked about Laplace smoothing, and when you apply that here as well, you add one to the numerator and k to the denominator, and that's the Laplace-smoothed estimate of this parameter. And similarly you do the same thing for phi_{k | y = 0}, and you can work out the estimates for the other parameters yourself — it's very similar. Question: sorry, on the top right, I was just wondering what the x_j is, and what the n is. Right — so in this second event model, the definitions of x_j and n are different. Here, n is the number of words in a given email, and if we subscript by which email it is, that's n_i, so n will be different for different training examples. And x_j takes values from 1 to 50,000: x_j is essentially the identity of the j-th word in a given piece of email. That's why this is a product over all the different positions of your email, of the probability of the j-th word, conditioned on y.
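As a hedged sketch of that estimate (my own code, with Laplace smoothing folded in using the dictionary size in the denominator — a point that gets clarified in the exchange just below): emails are lists of word indices, as in the earlier sketch.

```python
import numpy as np

def fit_phi_k_y1(emails, labels, vocab_size):
    """phi_{k|y=1}: word counts across all spam emails / total spam words."""
    counts, total = np.zeros(vocab_size), 0
    for x, y in zip(emails, labels):
        if y == 1:
            for j in x:            # numerator: occurrences of each word k
                counts[j] += 1
            total += len(x)        # denominator: total length of spam emails
    return (counts + 1) / (total + vocab_size)  # Laplace-smoothed estimate

emails = [[2, 0, 3], [2, 2, 4], [0, 5]]  # toy emails as index lists
labels = [1, 1, 0]
print(fit_phi_k_y1(emails, labels, 6))   # word 2 appeared 3 times in 6 spam words
```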
Oh — actually, you know what, I apologize: I just realized I've overloaded notation there; I shouldn't have used k in both places. Let me use a different letter — does that make sense? Oh, you're right, thank you: in the Laplace smoothing, that shouldn't be k in the denominator; it should be 50,000, if you have 50,000 words in your dictionary. I was still carrying notation over from the previous lecture without translating it properly. In Laplace smoothing here, what goes in the denominator is the number of possible values that the random variable x_j can take on. Cool — just raise your hand if this makes sense. Okay, cool. Are there more questions about this? Yes — in the denominator of the Laplace smoothing, is it the number of values that y could take? Let's see — Laplace smoothing is a method to give you, hopefully, better estimates of a probability distribution over a multinomial. Was I using x or y in the previous lecture? I think I was using a different definition for the random variable, but the point is this: suppose you have a multinomial random variable x which takes on L different values. Then the maximum likelihood estimate for the probability p(x = k) is the number of observations of x = k, divided by the total number of observations of x. That's the maximum likelihood estimate, and to add Laplace smoothing to it, you add one to the numerator and you add L to the denominator, where L is the number of possible values that x can take on. So in this case, since this is the probability that x equals k, and x can take on 50,000 values if 50,000 is the size of your dictionary — it may be something else — that's why 50,000 goes in the denominator. What other questions? Yes — is there a specific definition for the maximum likelihood estimate of a parameter? We've talked about it a couple of times, and all the examples make sense, but I don't know what the general formula for it is. I see — right. So, the question is what's the definition of the maximum likelihood estimate. Actually, in today's and the previous lecture, when I talked about Gaussian discriminant analysis, I was sort of writing up the maximum likelihood estimates on the board without proving them. The way to actually work them out is to write down the likelihood of the parameters given the training set. So, given a training set, the likelihood — I guess I'm actually writing the log likelihood — will be the log of the product from i = 1 to m of p(x^(i), y^(i)), parameterized by these parameters, where p(x^(i), y^(i)) is given by the product from j = 1 to n_i of p(x^(i)_j | y^(i)) — I'll just drop the parameters to write this most simply — times p(y^(i)). So this is my log likelihood, and the way you get the maximum likelihood estimates of the parameters is: given a fixed training set, a fixed set of (x^(i), y^(i)), you maximize this in terms of the parameters, and then you get the maximum likelihood estimates that I've been writing out.
So in the previous lecture and in today's lecture, I wrote out the maximum likelihood estimates for the Gaussian discriminant analysis model and for naive Bayes without proving them — but you get to play with that yourself in the homework problems. For one of these models, you'll be able to verify that when you maximize the log likelihood, you do get the same formulas I was writing up on the board. But the way these are defined — the way they're derived — is by maximizing this. Okay, cool. All right, so that more or less wraps up what I want to say about naive Bayes. It turns out that for text classification, the naive Bayes algorithm with the second event model — the version of naive Bayes with the multinomial event model — almost always does better than the first naive Bayes model I talked about, when you're applying it to the specific case of text classification. One of the reasons hypothesized for this is that the second model, the multinomial event model, takes into account the number of times a word appears in a document, whereas the former model doesn't. In truth, it's actually not completely understood why the latter model does better than the former for text classification — researchers are still debating about it a bit — but if you ever have a text classification problem: naive Bayes is probably not, by far, the best learning algorithm out there, but it is relatively straightforward to implement, and it's a very good first algorithm to try. Is that a question? Yes — the second model is still position-invariant, right? It doesn't actually care where the words are relative to each other. And I assume a Markov model would add that information — does that usually do better, if you have enough data? So the question is: the second model, the multinomial event model, doesn't care about the ordering of the words — you can shuffle all the words in the email and it does exactly the same thing. In natural language processing, this model actually has another name: it's called the unigram model. And there are other models, like higher-order Markov models, that take into account some of the ordering of the words. It turns out that for text classification, models like the bigram and trigram models do, I believe, only very slightly better, if at all — but that's when you're applying them to text classification. Okay. All right, so the next thing I want to talk about is to start to get into a discussion of nonlinear classifiers. The very first classification algorithm we talked about was logistic regression, which had the familiar form for the hypothesis, and you can think of it as predicting one when the estimated probability is greater than or equal to 0.5, and predicting zero when it's less than 0.5. And given a training set, logistic regression will do gradient descent, or use Newton's method, to find a straight line that reasonably separates the positive and negative classes.
But sometimes the data set just can't be separated by a straight line. So, is there an algorithm that will let you start to learn these sorts of nonlinear decision boundaries — how do you go about getting a nonlinear classifier? By the way, one cool result: remember, when we talked about generative learning algorithms, I said that if you assume x given y = 1 is exponential family with natural parameter eta_1, and x given y = 0 is exponential family with natural parameter eta_0, and you build a generative learning algorithm using this — I think when I talked about Gaussian discriminant analysis, I said that if this holds true, then you end up with a logistic posterior. It actually turns out that the naive Bayes model falls into this exponential family as well, and therefore, under the naive Bayes model, you're actually also using a linear classifier. Okay — so, the question is how we can start to get nonlinear classifiers, and I'm going to talk about one method today, just very briefly: taking a simpler algorithm, like logistic regression, and using it to build up to more complex nonlinear classifiers. To motivate this discussion, I'm going to use a little picture. Suppose your features are x1, x2, and x3, and, following our earlier convention, x0 is set to one. I'm going to use a little diagram like this to denote a logistic regression unit: think of this little circle as denoting a computation node that takes as input several features, and then outputs another number, h_theta(x), given by a sigmoid function — so this little computational unit has parameters theta. Now, in order to get nonlinear decision boundaries, all we need to do — at least one thing we could do — is come up with a way to represent hypotheses that can output nonlinear decision boundaries. And when you put together a bunch of those little units that I drew on the previous board, you get what's called a neural network, in which I have my features here, and I feed them into, say, a few of these little sigmoidal units, and those together feed into yet another sigmoid unit, which outputs my final output h_theta(x). Just to give these things names, let me call the values output by the three intermediate sigmoidal units a1, a2, a3, and let me be completely concrete about what this formula represents. Each of the units in the middle has its own associated set of parameters, and so the value a1 will be computed as g(x transpose theta_1); similarly, a2 will be computed as g(x transpose theta_2), and a3 as g(x transpose theta_3), where g is the sigmoid function g(z). Then, finally, our hypothesis will output g(a transpose theta_4), where this vector a is the vector of a1, a2, a3, with another 1 appended at the first position — sorry about the chalk dust. So h_theta(x) is a function of all the parameters theta_1 through theta_4.
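To be concrete about that forward computation, here is a minimal sketch with random placeholder weights — nothing fit to data, just the wiring:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Theta = rng.normal(size=(3, 4))   # theta_1..theta_3; x includes x0 = 1
theta4 = rng.normal(size=4)       # weights for the vector [1, a1, a2, a3]

def h(x):
    a = sigmoid(Theta @ x)               # hidden units a1, a2, a3
    a = np.concatenate(([1.0], a))       # append the extra 1
    return sigmoid(theta4 @ a)           # final output h_theta(x)

print(h(np.array([1.0, 0.5, -1.2, 2.0])))  # some value in (0, 1)
```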
So one way to learn parameters for this model is to write down the cost function, say J(theta) = 1/2 times the sum from i = 1 to m of (y^(i) − h_theta(x^(i)))^2 — our familiar quadratic cost function — and then use gradient descent to minimize J(theta) as a function of theta. So you apply gradient descent to minimize this squared error, which, stated differently, means you use gradient descent to make the predictions of your neural network as close as possible to the labels you observed in your training set. It turns out that gradient descent on this sort of neural network has a specific name: the algorithm that implements gradient descent here is called back propagation. So if you ever hear that term, all it means is gradient descent on a cost function like this, or a variation of it, on a neural network that looks like that. This algorithm has some advantages and disadvantages, but let me show you something first. One of the interesting things about the neural network is that you can look at what the intermediate nodes are computing. This neural network has one what's called hidden layer before the output layer, and more generally you can have inputs feed into a layer of computation units, which feed into more layers of computation units, and more layers still, and then finally an output layer at the end. One cool thing you can do is look at those intermediate units — the units in what's called the hidden layer of the neural network (don't worry about why it's called that) — and ask what the hidden units are computing. To give you a better sense of what neural networks might be doing, let me show you a video — switching to the laptop — made by a friend, Yann LeCun, who's currently a professor at New York University, of a network he developed for handwritten digit recognition. There's one other thing he did in this network that I'm not going to talk about, called a convolutional neural network; his system is called LeNet. Let's see — should we put it on the laptop display? Or you can just put it on the screens at the side; that will work too if the big screen isn't working. Okay — how do I keep you entertained while we're waiting for the video to come on? Well, let me say a few more things about neural networks. It turns out that when you write down the quadratic cost function like I wrote on the chalkboard just now, then, unlike logistic regression, it will almost always correspond to a non-convex optimization problem. Whereas for logistic regression, if you run gradient descent or Newton's method or whatever, you converge to the global optimum, this is not true for neural networks in general: there are lots of local optima, and it's a much harder optimization problem. If you're familiar with neural networks and good at making design choices — like what learning rate to use, and how many hidden units to use, and so on — you can get them to be fairly effective.
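Here's a toy, self-contained sketch of that training idea — one hidden unit, squared error, and numerical gradients standing in for the analytic ones that back propagation would compute. Note it can land in a local optimum, which is exactly the non-convexity issue just mentioned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(theta, x):
    a = sigmoid(theta[0] + theta[1] * x)      # one hidden sigmoid unit
    return sigmoid(theta[2] + theta[3] * a)   # sigmoid output unit

def J(theta, xs, ys):
    preds = np.array([forward(theta, x) for x in xs])
    return 0.5 * np.sum((ys - preds) ** 2)    # quadratic cost from the board

xs = np.array([-2.0, -1.0, 1.0, 2.0])
ys = np.array([0.0, 0.0, 1.0, 1.0])
rng = np.random.default_rng(1)
theta = rng.normal(scale=0.5, size=4)         # small random initialization
alpha, eps = 0.5, 1e-5
for _ in range(2000):                         # plain gradient descent
    grad = np.zeros(4)
    for i in range(4):
        d = np.zeros(4); d[i] = eps
        grad[i] = (J(theta + d, xs, ys) - J(theta - d, xs, ys)) / (2 * eps)
    theta -= alpha * grad
print(J(theta, xs, ys))  # should end up far below the initial cost
```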
And there are sort of ongoing debates about whether this learning algorithm or that one is better. The vast majority of machine learning researchers today seem to perceive support vector machines — which is what I'll talk about later — as a much more effective off-the-shelf learning algorithm than neural networks. That point of view is contested a bit, but neural networks are not something that I personally use a lot right now, because it's a hard optimization problem and it usually takes a while to converge. It actually works reasonably well; it's just fairly complicated, so it's not an algorithm that I use commonly, or that my friends use all the time. Okay, cool — so let me go ahead and show you an example of a neural network, one which for many years was the most effective learning algorithm of its kind, before support vector machines were invented. So here's Yann's video — there's actually audio on this too, but that's not important. Let me tell you what's happening. What you're seeing is a trained neural network, and this display where the mouse pointer is pointing — this big three there — is the input to the neural network: you show the network this image, and it tries to recognize what it is. The final answer output by the network is this number up here, right below where it says LeNet-5, and the network correctly recognizes this image as a three. And if you look to the left of the image, what's interesting is that the display on the left portion is actually showing the intermediate computations of the neural network — in other words, it's showing you what the hidden layers of the network are computing. For example, if you look at the third image down from the top, that unit seems to be computing certain edges in the digits — say, edges on the right-hand side or the bottom of the input image. So let me just play the video, and you can see some of the inputs and outputs of the network. [Video plays: very different fonts... note the robustness to noise... multiple digits — that's kind of cool.] All right. So, just for fun, let me show you one more video. This is another video, from the series "The Machine That Changed the World," which was produced by WGBH Television in cooperation with the British Broadcasting Corporation, and aired on PBS a few years ago, I think. I want to show you the part describing the NETtalk neural network, which was developed by Terry Sejnowski, a researcher. NETtalk was actually one of the major milestones in the history of neural networks, and its specific application was getting a neural network to read text — in other words, can you show a piece of English to a computer and have the computer read it, verbally producing the sounds that correspond to reading the text out. It turns out that in the history of AI and of machine learning, this video created a lot of excitement about neural networks and about machine learning, and part of the reason was that Terry Sejnowski had the foresight to use, in his video, a childlike voice talking about visiting grandmother's house and so on — you'll see it in a second — and so it really created the impression of the network being like a young child learning how to speak and talking about going to their grandmother's, and so on.
So this actually helped generate a lot of excitement, within academia and outside it, about neural networks early in their history. Let me just go ahead and show you the video. [Video:] "You're going to hear first what the network sounds like at the very beginning of the training, and it won't sound like words, but it'll sound like attempts that will get better and better with time. The network takes the letters — say the phrase 'grandmother's house' — and makes a random attempt at pronouncing it. ... The phonetic difference between the guess and the right pronunciation is sent back through the network. ... By adjusting the connection strengths after each attempt, the net slowly improves, and finally, after letting it train overnight, the next morning it sounds like this: 'Grandmother's house. I like to go to my grandmother's house.' 'Well, because she gives us candy.' NETtalk understands nothing about the language; it is simply associating letters with sounds." All right — so, for the time this was done, this was an amazing piece of work. I should say that today there are other text-to-speech systems that work better than what you just saw — and, I guess you can also appreciate, having it talk about the Dow Jones falling 15 points and profit-taking or whatever would have been less impressive than getting candy from your grandmother's house. But I wanted to show that, because it was another major landmark in the history of neural networks. Okay, so let's switch back to the chalkboard. What I want to do next is tell you about support vector machines — and that wraps up our discussion of neural networks. So, I started off talking about neural networks by motivating them as a way to get us to output nonlinear classifiers. I didn't really prove it, but it turns out you'll be able to come up with nonlinear decision boundaries using a neural network like the one I drew on the chalkboard earlier. Support vector machines will be another learning algorithm that gives us a way to come up with nonlinear classifiers — a very effective off-the-shelf learning algorithm — but it turns out that in the progression, the development, I'm going to pursue, I'm actually going to start off by describing yet another class of linear classifiers, with linear decision boundaries. Only later — in either the next lecture or the one after that — will we take the support vector machine idea and do some clever things to it to make it work very well for generating nonlinear decision boundaries too. So let's start by talking about linear classifiers a little bit more, and to do that, I want to convey two intuitions about classification. One is this: think about logistic regression. We had this logistic function that outputs the probability that y equals 1, and it crosses this line at zero. So when you run logistic regression, I want you to think of it as an algorithm that computes theta transpose x, and then predicts one if and only if theta transpose x is greater than or equal to zero — "iff," the double implication, means the same thing — and predicts zero if and only if theta transpose x is less than zero.
Now, if theta transpose x is much greater than zero — the double greater-than sign means "much greater than" — then you can think of that as a very confident prediction that y is equal to one: we predict one, and moreover, we're very confident about it. The picture for that is: if theta transpose x is way out here, then we're estimating that the probability of y being equal to one, on the sigmoid function, is very close to one. In the same way, if theta transpose x is much less than zero, then we're very confident that y is equal to zero. So wouldn't it be nice if, when we fit logistic regression to a training set, the following held: for all i such that y^(i) = 1, we have theta transpose x^(i) much greater than zero, and for all i such that y^(i) = 0, we have theta transpose x^(i) much less than zero. Essentially: wouldn't it be nice if, for our training set, we could find parameters theta such that the learning algorithm not only makes correct classifications on all the examples in the training set, but is furthermore very confident about all of those correct classifications. This is the first intuition I want you to have, and we'll come back to it in a second, when we talk about the functional margin — I'll define that later. The second intuition I want to convey is this. And — oh, it turns out that for the rest of today's lecture, I'm going to assume the training set is linearly separable. By that I mean, for the rest of this section, I'm going to assume that there is indeed a straight line that can separate your training set; we'll remove this assumption later, but let's make it just to develop the algorithm. So, say we have a linearly separable training set. Doesn't it make sense that, out of all the straight lines that separate the training set, maybe that straight line isn't such a good one, and that one actually isn't so great either, but maybe the line in the middle is a much better linear separator than the others? And one reason that, when you and I look at it, that one seems best, is that it is just further from the data: it separates the data with a greater distance between your positive examples, your negative examples, and the decision boundary. This second intuition — this notion of the distance from the training examples to that final line I drew, maybe the best line — is the one we'll come back to shortly and formalize, when we talk about the geometric margin of our classifiers. So, in order to describe support vector machines, I'm unfortunately going to have to pull a notation change. It wasn't possible to do logistic regression, support vector machines, and all the other algorithms using one completely consistent notation, so I'm going to change the notation slightly for linear classifiers, and that will make it much easier for us, later today and in next week's lectures, to talk about support vector machines. The notation I'm going to use for the rest of today, and for most of next week, is that my labels y, instead of being 0 and 1, will be minus one and plus one.
Okay, so in order to describe the support vector machine, I'm unfortunately going to have to make a notation change. It wasn't possible to present logistic regression, support vector machines, and all the other algorithms using one completely consistent notation, so I'm going to change the notation slightly for linear classifiers; that will make it much easier, later today and in next week's lectures, to talk about the support vector machine. The notation I'll use for the rest of today and most of next week is this: my labels y, instead of being 0 or 1, will be minus one and plus one, and in the development of the support vector machine we'll have a hypothesis that outputs values that are either plus one or minus one. So we'll let g(z) equal 1 if z >= 0, and minus one otherwise; rather than 0 and 1, we change everything to plus one and minus one. And finally, whereas previously I wrote h_theta(x) = g(theta^T x) with the convention that x_0 = 1, so that x is in R^(n+1), I'm going to drop that convention and instead parameterize my linear classifier as h_{w,b}(x) = g(w^T x + b). So b now plays the role of theta_0, and w plays the role of the remaining parameters, theta_1 through theta_n. Separating out the intercept term b, rather than lumping it in, will make it easier for us to develop support vector machines. (Question from a student; yes, right: w is a vector in R^n, x is now also a vector in R^n rather than R^(n+1), and lowercase b is a real number.) Now let's formalize the notions of functional margin and geometric margin. Let me make a definition. The functional margin of the hyperplane (w, b) with respect to a specific training example (x^(i), y^(i)) ("wrt" on the board stands for "with respect to") is defined as gamma-hat(i) = y^(i) (w^T x^(i) + b). A set of parameters (w, b) defines a classifier, a linear separating boundary, and when I say "hyperplane" I just mean the decision boundary defined by the parameters (w, b); if you're confused by the term hyperplane, just ignore it. The interpretation is this: if y^(i) = 1, then for the functional margin to be large we want w^T x^(i) + b to be a large positive number; and if y^(i) = -1, then the only way for the functional margin to be large is if w^T x^(i) + b is much less than zero. So this captures the intuition we had earlier: if y^(i) = 1 we want this quantity to be big, and if y^(i) = -1 we want it to be very negative, and it collapses the two cases into a single statement, that we'd like the functional margin to be large. Notice also that so long as y^(i) (w^T x^(i) + b) is greater than zero, we've classified the example correctly. One more definition: the functional margin of a hyperplane with respect to an entire training set is defined as gamma-hat = min over i of gamma-hat(i). So if you have more than one training example, the functional margin with respect to the whole training set is the worst case, the minimum, over the functional margins of all the examples. For now, think of the first intuition as saying that we would like this functional margin to be large.
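Here is the definition computed directly, as a quick sketch; the toy data and variable names are my own, not from the lecture:

```python
import numpy as np

def functional_margins(w, b, X, y):
    """Per-example functional margins: gamma_hat_i = y_i * (w . x_i + b).
    X holds one training example per row; y holds labels in {-1, +1}."""
    return y * (X @ w + b)

# A small, hypothetical, linearly separable training set.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

gammas = functional_margins(w, b, X, y)
print(gammas)        # [4. 4. 3. 3.]  all positive: every example classified correctly
print(gammas.min())  # 3.0  the functional margin of (w, b) w.r.t. the whole set
```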
And for our purposes, for now, let's just say we would like that worst-case functional margin to be large. (We'll change this a little bit later as well.) Now, it turns out there's one little problem with this intuition, which will bite us later: it's actually very easy to make the functional margin large. For example, suppose I have a classifier with parameters w and b. If I take w and multiply it by two, and take b and multiply it by two, then, referring back to the definition gamma-hat(i) = y^(i)(w^T x^(i) + b), doubling w and b doubles my functional margin. So this goal of making the functional margin large isn't, in and of itself, so useful, because you can make the functional margin arbitrarily large just by scaling up the parameters. Maybe one thing we'll need to do later is add a normalization condition; for example, maybe we want the L2 norm of the parameter vector w to be equal to one. We'll come back to this in a second. Okay, let's see how much time I have: fifteen minutes. I'm trying to decide how much to do in the last fifteen minutes. Okay, let's go for the geometric margin. For the geometric margin of a training example, it's convenient to draw a picture. The decision boundary of my classifier is given by the plane w^T x + b = 0. These are the axes, x1 and x2 say, and I'm deliberately drawing relatively few training examples here so that I can add things to the figure. Assuming we classify an example correctly, I'm going to define the geometric margin as just the geometric distance between a training example (x^(i), y^(i)) and this separating hyperplane. I'm going to do some algebra fairly quickly; in case it doesn't make sense, we go through it more carefully in the lecture notes. By standard geometry, the normal, in other words the vector at 90 degrees to the separating hyperplane, is given by w divided by the norm of w. That's just how planes in high dimensions work, and if this looks somewhat mysterious to you, take a look at the lecture notes on the website. So let's say this distance is gamma(i), and I'll use the convention of putting a hat on top when referring to functional margins and no hat for geometric margins. If the geometric margin of this example is gamma(i), that means this point here, drawn as a heavy circle, is x^(i) minus gamma(i) times w/||w||, because w/||w|| is a unit vector, a length-one vector normal to the separating hyperplane, and subtracting gamma(i) times that unit vector from the point x^(i) lands us on the point I've drawn as the heavy circle. And because this heavy point is on the separating hyperplane, it must satisfy the hyperplane's equation: w transpose times that point, plus b, equals zero.
All points x on the separating hyperplane satisfy the equation w^T x + b = 0, and this point is on the separating hyperplane, so it must satisfy that equation. Raise your hand if this makes sense. Okay, cool, most of you. Again, I'm being slightly fast and loose with the geometry, so if you're not quite sure why this is the normal vector, or how the subtraction works, take a look at the details in the lecture notes. So what I'm going to do is take that equation and solve it for gamma(i). Writing it out: w^T (x^(i) - gamma(i) w/||w||) + b = 0, which gives w^T x^(i) + b = gamma(i) w^T w / ||w||, and the right-hand side is just gamma(i) times the norm of w, because w^T w is the norm of w squared. Therefore gamma(i) = (w^T x^(i) + b) / ||w||. In other words, this little calculation shows that if you have a training example x^(i), the distance between x^(i) and the separating hyperplane defined by the parameters w and b can be computed by this formula. The last thing I want to do is take into account the sign, the correct classification of the training example. I've been assuming we classify the example correctly, so more generally, define the geometric margin of a training example to be gamma(i) = y^(i) times that quantity on top, that is, gamma(i) = y^(i) ((w/||w||)^T x^(i) + b/||w||). This is very similar to the functional margin, except for the normalization by the norm of w. As before, saying we'd like the geometric margin to be large means that, so long as we're classifying an example correctly, we would ideally like it to be as far as possible from the separating hyperplane, on the right side of the hyperplane; that's what multiplying by y^(i) accomplishes. A couple of easy facts. One: if the norm of w is equal to one, then the functional margin is equal to the geometric margin; you can see that quite easily. And more generally, the geometric margin is just the functional margin divided by the norm of w. One final definition: so far I've defined the geometric margin with respect to a single training example; as before, I'll define the geometric margin with respect to an entire training set as gamma = min over i of gamma(i). And so the maximum margin classifier, which is a precursor to the support vector machine, is the learning algorithm that chooses the parameters w and b so as to maximize the geometric margin. Let me just write that down: the maximum margin classifier poses the following optimization problem. Choose gamma, w, and b so as to maximize gamma, subject to y^(i)(w^T x^(i) + b) >= gamma for every training example. That's just one way to write it; there are several ways, and one of the things we'll do next time... I'm trying to figure out whether I can do this in five minutes, and I'm guessing it could be difficult. Well: this maximum margin classifier is a maximization problem over the parameters gamma, w, and b. And it turns out the geometric margin doesn't change depending on the norm of w, because in the definition of the geometric margin, notice that we're dividing by the norm of w anyway.
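To collect what was just derived on the board in one place (reconstructed from the spoken description):

```latex
% Distance from x^{(i)} to the hyperplane, and the signed (geometric) margin:
w^\top\!\Big(x^{(i)} - \gamma^{(i)}\,\tfrac{w}{\|w\|}\Big) + b = 0
\;\;\Longrightarrow\;\;
\gamma^{(i)} = \frac{w^\top x^{(i)} + b}{\|w\|},
\qquad
\gamma^{(i)} = y^{(i)}\!\left(\Big(\tfrac{w}{\|w\|}\Big)^{\!\top} x^{(i)} + \tfrac{b}{\|w\|}\right),
\qquad
\gamma = \min_i\, \gamma^{(i)}.
```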
So you can actually set the norm of w to be anything you want: you can multiply w and b by any constant, and it doesn't change the geometric margin. This will actually be important, and we'll come back to it later: notice that you can take the parameters (w, b) and impose any normalization on them, or change w and b by any scaling factor (replace them by 10w and 10b, whatever), and it does not change the geometric margin. So in this first formulation I'm just going to impose the constraint that the norm of w must equal one, so that the functional and the geometric margins are the same, and say: maximize gamma, subject to every training example having geometric margin at least gamma. And this is the geometric margin, because when the norm of w is equal to one, the functional and geometric margins are identical. So this is the maximum margin classifier, and it turns out that if you run it, it'll perform, you know, maybe comparably to logistic regression. But it turns out that as we develop this algorithm further, there will be a clever way to change it so that it works in infinite-dimensional feature spaces and gives us very efficient nonlinear classifiers. So there's a ways to go before we turn this into a support vector machine, but this is the first step. Other questions about this? Student: Do we need any assumptions about the positions of the samples in the space, for example that all the points lie in the unit circle? Because if not, we could expand the space, and the margins and the quality of the separator would change. Instructor: For now, let's just say you're given a fixed training set, and the scaling of the training set is not something you get to play with. So everything I've said is for a fixed training set: you can't change the x's and you can't change the y's. Are there other questions? Okay. So next week we will take this and talk about optimization algorithms, and work our way toward turning this into one of the most effective off-the-shelf learning algorithms. And just a final reminder: the next discussion section will be on MATLAB and Octave, so show up for that if you want to see a tutorial. Okay, see you guys in the next class. [This presentation is delivered by the Stanford Center for Professional Development.] So welcome back. What I want to do today is continue our discussion of support vector machines. In particular, we'll talk about the optimal margin classifier; then we'll take a brief digression to talk about primal and dual optimization problems, and in particular what are called the KKT conditions; and then we'll derive the dual of the optimization problem we posed earlier. That will lead us into a discussion of kernels, which we'll only get to say a couple of words about today, but which I'll do properly in the next lecture. As part of today's lecture I'll spend some time talking about optimization problems, and in the little time I have today I won't really be able to do the topic justice; I'd like to talk about convex optimization and do that topic justice, so at this week's discussion section the TAs will have more time and will teach a section focused on convex optimization. It's a very beautiful and useful theory,
so if you want to learn more about it, come to this Friday's discussion section. Just to recap what we did in the previous lecture: as we were beginning our development of support vector machines, I said that our hypothesis will be represented as h_{w,b}(x) = g(w^T x + b), where g outputs plus one or minus one depending on whether z is greater than or equal to zero, and I said that in our development of support vector machines we'd change conventions and let y be plus one or minus one to denote the class labels. Last time we also talked about the functional margin, which was this quantity gamma-hat(i) = y^(i)(w^T x^(i) + b), and we had the intuition that if the functional margin is a large positive number, then we're classifying that training example correctly and very confidently: if y^(i) = +1 we'd like w^T x^(i) + b to be very large, and if y^(i) = -1 we'd like w^T x^(i) + b to be a large negative number. So we'd like functional margins to be large. But I also said the functional margin has the strange property that you can increase it just by taking your parameters w and b and multiplying them by two. Then we also defined the geometric margin, which is essentially the functional margin divided by the norm of w, and which has the interpretation of the distance between a training example and the hyperplane. It's actually a signed distance: the distance is positive if you're classifying the example correctly, and if you're misclassifying the example, it's the negative of the distance between the point and your separating hyperplane, where the separating hyperplane is defined by the equation w^T x + b = 0. I also defined the functional and geometric margins with respect to a training set as the worst case, the minimum, of the functional or geometric margins over the examples. In our development of the optimal margin classifier, our learning algorithm will choose the parameters w and b so as to maximize the geometric margin: our goal is to find the separating hyperplane that separates the positive and negative examples with as large a distance as possible between the hyperplane and the examples. And one property of the geometric margin, as you go to choose w and b to maximize it, is that you can scale w and b arbitrarily. If you look at the definition of the geometric margin, I can choose to multiply my parameters w and b by two, or by ten, or by any other constant, and it doesn't change my geometric margin. One way of interpreting that is to look at the separating hyperplane itself, the line separating the positive and negative training examples: if I scale w and b, that doesn't change the position of the plane, because the equation w^T x + b = 0 gives the same straight line as 2w^T x + 2b = 0. What that means is that I can choose whatever scaling of w and b is convenient for me, and in particular (we'll use this in a minute) I can impose a constraint such as requiring the norm of w to be equal to one.
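A quick numeric sketch of the two properties just recapped (made-up numbers, my own code): scaling (w, b) inflates the functional margin at will, while the geometric margin is unchanged.

```python
import numpy as np

# One hypothetical, correctly classified training example.
x_i, y_i = np.array([2.0, 2.0]), 1
w, b = np.array([1.0, 1.0]), 0.0

for c in (1.0, 2.0, 10.0):
    w_c, b_c = c * w, c * b
    functional = y_i * (w_c @ x_i + b_c)          # grows linearly with c
    geometric = functional / np.linalg.norm(w_c)  # invariant to c
    print(c, functional, geometric)
# c=1 -> (4.0, 2.828...), c=2 -> (8.0, 2.828...), c=10 -> (40.0, 2.828...)
```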
The reason I can do that is that given any solution for w and b, by rescaling the parameters you can easily meet this condition: just rescale w to norm one. So I can add a condition like this and essentially not change the problem. Or I could add other conditions: I could instead add the condition that the absolute value of w1, the first component of w, must equal one. I can add only one of these conditions, not both; but again, you can find the optimal solution and just rescale w and b to meet the condition. And I could add other, more esoteric conditions as well, because any condition that you can always meet just by scaling w up and down works equally well. So again, I can choose one of these conditions, not all of them, and our ability to choose any scaling condition on w that's convenient to us will be useful in a second. All right, so let's go ahead and write down an optimization problem. Again, my goal is to choose parameters w and b so as to maximize the geometric margin. Here's my first attempt at writing down an optimization problem; I actually wrote this one down right at the end of the previous lecture: solve for the parameters gamma, w, and b, such that y^(i)(w^T x^(i) + b) >= gamma for all i, for your m training examples, and let's say I choose to add the normalization condition that the norm of w is equal to one. That condition makes the geometric and functional margins the same, so this says: find the largest possible value of gamma such that all of my training examples have functional margin greater than or equal to gamma; and with the constraint that the norm of w equals one, functional and geometric margins are the same, so this finds the largest gamma such that all the geometric margins are greater than or equal to gamma. If you solve this optimization problem, then you have derived the optimal margin classifier. But this is not a very nice optimization problem, because this is a nasty non-convex constraint: in particular, I'm asking you to solve for parameters w that lie on the surface of a unit sphere (a unit circle in two dimensions, a unit sphere in higher dimensions). If we could instead come up with a convex optimization problem, we'd be guaranteed that algorithms like gradient descent, or other local search algorithms, would have no local optima; it turns out this is an example of a nasty non-convex constraint that I would like to get rid of. So let me change the optimization problem one more time and pose a slightly different one: maximize the functional margin divided by the norm of w, that is, maximize gamma-hat / ||w||, subject to y^(i)(w^T x^(i) + b) >= gamma-hat for all i. In other words, I want to find a number gamma-hat such that every one of my training examples has functional margin at least gamma-hat, and my optimization objective is to maximize gamma-hat divided by the norm of w. And we saw previously that the functional margin divided by the norm of w is just the geometric margin, so this is a different way of posing the same optimization problem.
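For reference, the two formulations just written on the board (reconstructed):

```latex
% Formulation 1 -- the constraint \|w\| = 1 is non-convex:
\max_{\gamma,\, w,\, b}\ \gamma
\quad \text{s.t.}\quad
y^{(i)}\big(w^\top x^{(i)} + b\big) \ge \gamma,\ \ i = 1,\dots,m,
\qquad \|w\| = 1.

% Formulation 2 -- same problem; the objective \hat\gamma / \|w\| is the
% geometric margin, but the objective is now non-convex:
\max_{\hat\gamma,\, w,\, b}\ \frac{\hat\gamma}{\|w\|}
\quad \text{s.t.}\quad
y^{(i)}\big(w^\top x^{(i)} + b\big) \ge \hat\gamma,\ \ i = 1,\dots,m.
```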
I'm seeing some slightly confused looks; are there questions about this? Student: In the second statement, the constraint says greater than or equal to the functional margin. Why aren't you also dividing by the norm of w there? Is there some trick there that's important? Instructor: I see, let me try to explain this again. The parameters of this optimization problem are gamma-hat, w, and b. I'm going to have, say, convex optimization software solve this problem over that set of parameters, and I'm imposing the constraint that, whatever values it comes out with, y^(i)(w^T x^(i) + b) must be greater than or equal to gamma-hat; so the functional margin of every example had better be at least gamma-hat. But what I care about is not really maximizing the functional margin. What I really care about, in my optimization objective, is maximizing gamma-hat divided by the norm of w, which is the geometric margin. So my optimization problem is: maximize the functional margin divided by the norm of w, subject to every example having functional margin at least gamma-hat. Does that make sense now? Student: When you say you maximize with respect to gamma-hat, w, and b, does that mean gamma and gamma-hat are no longer functions of w and b? Instructor: Right, so the question is whether gamma-hat is a function of w and b. It turns out that in my previous mathematical definition it was, but the way I'm going to pose this as an optimization problem is different. Say we have software for solving convex optimization problems: I'm going to pretend these are independent variables, and ask my convex optimization software to find me values for gamma-hat, w, and b that make this objective as big as possible, subject to the constraints. And it turns out that when it does that, it will obviously choose gamma-hat to be as big as possible, because the optimization objective is increasing in gamma-hat. But how big can it make gamma-hat? That is limited by the constraints, which say that every training example must have functional margin greater than or equal to gamma-hat. So the biggest the software can make gamma-hat is the value of the smallest functional margin; and so when you solve this optimization problem, the value of gamma-hat you get out will indeed be the minimum of the functional margins over your training set.
Student (Justin): I guess I'm a little confused, because it's like, okay, you have two clouds of data, and you'd say: please draw me a line that maximizes the smallest distance between that line and any of the data points. It seems like that's what we're doing, but this seems more complicated than that, and I'm wondering what the difference is. Instructor: I see; the question was, given two clouds of data, this seems more complicated than just finding the line that separates them. (I'm repeating the questions since I'm not sure how well the audio captures them.) The answer is that this is exactly that problem: given two clouds of data, positive and negative examples, this is exactly the formalization of the problem of finding a line that separates the positive and negative examples while maximizing the worst-case distance between any point and the line. Student: Why do you care about the worst-case distance? Instructor: For now, let's just care about the worst-case distance; we'll come back and change that later. Caring about the worst case is just a clean way to formulate this optimization problem; I'll come back and actually change it later. Can you raise your hand if this formulation makes sense? Okay, cool, great. So this is a different way of posing the same optimization problem, and on the one hand I've gotten rid of the nasty non-convex constraint, but on the other hand I've now added a nasty non-convex objective: in particular, gamma-hat / ||w|| is not a convex function of the parameters, and so you don't have the usual guarantees, like gradient descent converging to the global minimum; at least, that does not follow immediately, because this is non-convex. So here's what I'm going to do. Earlier I said that I can impose any one of a number of even fairly bizarre scaling constraints on w, and things are still fine. So here's the scaling I'm going to choose to add. Again, for the purposes of today's lecture, I'm going to assume the examples are linearly separable, that you can actually separate the positive and negative classes; we'll come back and fix this later as well. The scaling constraint I want to impose on w is that the functional margin be equal to one. Another way of writing that is min over i of y^(i)(w^T x^(i) + b) = 1: the worst-case functional margin equals one. And clearly this is just a scaling constraint, because if you solve for w and b and find that your worst-case functional margin is actually, say, 10, then by dividing w and b through by a factor of 10 I can get my functional margin to be equal to one. So this is the scaling constraint I'm going to impose, written more compactly as the constraint that the functional margin equal one, and if we just take our previous optimization problem, number two, and add this scaling constraint, we then get our final formulation of the optimal margin classifier problem.
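Written out (reconstructed from the spoken description; the factor of one half is a convention mentioned in the next lecture's recap, added because it makes the later algebra cleaner):

```latex
% With the scaling \hat\gamma = 1, maximizing \hat\gamma / \|w\| = 1 / \|w\|
% is the same as minimizing \|w\|^2:
\min_{w,\, b}\ \frac{1}{2}\,\|w\|^2
\quad \text{s.t.}\quad
y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1,\ \ i = 1,\dots,m.
```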
Previously we had a maximization of gamma-hat divided by the norm of w; with gamma-hat scaled to one, that means maximizing 1/||w||, and maximizing one over the norm of w is the same thing as minimizing the norm of w squared. And these are our constraints: y^(i)(w^T x^(i) + b) >= 1, since I've added the constraint that the functional margin equal one. This is actually my final formulation, at least for now, of the optimal margin classifier problem. The picture to keep in mind for this is the following. Our optimization objective is to minimize the norm of w, so the objective is just a big quadratic function. If I draw it in the space of the parameters, w1 and w2, a quadratic function has contours that look like concentric rings; and moreover, you have a number of linear constraints on your parameters, each of which eliminates a half-space: this linear constraint eliminates that half-space, that one eliminates another half-space, and so on. So the picture is a quadratic bowl with various half-spaces ruled out by the linear constraints, and if you can picture this in 3D, I hope it helps convince you that this is a convex problem that has no local optima: if you run gradient descent within the set of points that hasn't been ruled out, you converge to the global optimum. So this is a convex optimization problem; it's nice, and relatively easy to solve. Questions about this? Actually, just raise your hand if this makes sense. Okay, cool. So this gives you the optimal margin classifier algorithm. It turns out that, since this is a convex optimization problem, you can actually take this formulation of the problem and throw it at off-the-shelf software, what's called a QP, or quadratic programming, solver. This sort of optimization problem, with a convex quadratic objective function and a set of linear constraints, is called a quadratic program, and you can actually download software that solves these optimization problems for you. (I usually wouldn't use gradient descent here, because you have constraints like these, although you could modify gradient descent to work for this too.) So we could just declare success and say that we're done with this formulation of the problem.
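As an illustrative sketch of what "throw it at off-the-shelf software" can look like: this is my own toy example with made-up data, and it uses scipy's general-purpose SLSQP solver for convenience rather than the dedicated QP package the lecture has in mind.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable data: one example per row, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[1]

# Decision variables packed as z = (w_1, ..., w_n, b).
def objective(z):
    w = z[:n]
    return 0.5 * (w @ w)  # (1/2) ||w||^2

# One constraint per example: y_i (w . x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda z, xi=xi, yi=yi: yi * (z[:n] @ xi + z[n]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(n + 1), method="SLSQP", constraints=constraints)
w, b = res.x[:n], res.x[n]
print(w, b)
print(y * (X @ w + b))  # every functional margin should be >= 1 (up to solver tolerance)
```

A dedicated QP solver would exploit the quadratic-plus-linear structure directly; SLSQP is just the most convenient constrained minimizer in the standard scientific Python stack.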
But what I'm going to do now is take a digression to talk about primal and dual optimization problems. In particular, later I'm going to come back and derive yet another, very different form of this optimization problem. The reason we'll do that is that this optimization problem has certain properties that make it amenable to very efficient algorithms; moreover, we'll be deriving what's called the dual formulation, which will allow us to apply the optimal margin classifier even in very high-dimensional, sometimes even infinite-dimensional, feature spaces. We'll come back to that later, but let me now spend some time talking about convex optimization. I'm actually curious: how many of you remember, from calculus, the method of Lagrange multipliers for solving a minimization or maximization problem subject to some constraint? Oh, okay, cool, some of you. If you don't remember, don't worry; I'll describe it briefly here as well. But what I'm really going to do is talk about a generalization of the method of Lagrange multipliers that you may or may not have seen in some calculus classes; if you haven't seen it, don't worry about it. So, the method of Lagrange multipliers was this. Suppose there's some function you want to minimize: you want to minimize f(w), subject to a set of constraints h_i(w) = 0 for i = 1 through l. Given these constraints, I'll usually write them in vector form, writing h(w) as the vector-valued function of all the constraints and saying it equals zero, with the arrow on top denoting the vector of all zeros. So you want to solve this optimization problem. Some of you have seen the method of Lagrange multipliers, where you construct what's called the Lagrangian, L(w, beta) = f(w) + sum over i of beta_i h_i(w): your original optimization objective plus a sum of Lagrange multipliers times the constraints, where the parameters beta_i are called the Lagrange multipliers. The way you actually solve the optimization problem is that you take the partial derivatives of the Lagrangian with respect to your original parameters w and set them to zero, and take the partial derivatives with respect to the Lagrange multipliers beta and set them to zero. The famous theorem, proved I guess by Joseph-Louis Lagrange, is that for some value w-star to be a solution, it is necessary that there exist a beta-star (that's the backwards E, "there exists") such that those partial derivatives are equal to zero. So in the method of Lagrange multipliers, you construct the Lagrangian, take the derivatives with respect to the original parameters w and with respect to the Lagrange multipliers beta, set the partial derivatives equal to zero, solve for all solutions, and then check each solution to see whether it is indeed a minimum. What I'm going to do now is write down the generalization of this; again, if you haven't seen the basic method before, don't worry about it. The generalization solves a slightly more difficult type of constrained optimization problem: suppose you want to minimize f(w) subject to the constraints that g_i(w) <= 0 and that h_i(w) = 0. Again using my vector notation, I'll write this as g(w) <= 0 and h(w) = 0. So unlike the previous case, we now have inequality constraints as well as equality constraints. I then have a Lagrangian, sometimes called a generalized Lagrangian, which is now a function of my original optimization parameters w as well as two sets of Lagrange multipliers, alpha and beta: it's f(w) plus a sum of alpha_i times g_i(w) plus a sum of beta_i times h_i(w). Now here's the cool part. I'm going to define theta_P(w) to be the max over alpha and beta, subject to the constraint that the alpha_i be greater than or equal to zero, of the Lagrangian. And I want you to consider the optimization problem: min over w of max over alpha, beta (such that the alpha_i >= 0) of the Lagrangian, which is just min over w of theta_P(w). To give this a name, the subscript P here stands for "primal".
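Written out (reconstructed, using the usual convention of k inequality constraints and l equality constraints):

```latex
\mathcal{L}(w, \alpha, \beta)
  = f(w) + \sum_{i=1}^{k} \alpha_i\, g_i(w) + \sum_{i=1}^{l} \beta_i\, h_i(w),
\qquad
\theta_P(w) = \max_{\alpha,\,\beta \,:\, \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta),
\qquad
p^\ast = \min_{w}\ \theta_P(w).
```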
This whole thing, this optimization problem I've written down, is called the primal problem; it corresponds to the original optimization problem we've been solving, and later on I'll derive another version of it, but that's what the P stands for. So this is the primal problem. Now I want you to consider what happens when you minimize theta_P as a function of w. So let's look at what theta_P(w) is: let's pick some value of w and ask what theta_P(w) is. Notice that if w violates one of your primal problem's constraints, say g_i(w) > 0, then theta_P(w) is infinity. Why is that? Going back to the previous board for a second: theta_P is the maximum over alpha and beta of the Lagrangian, so if one of the g_i(w)'s is positive, then by setting the corresponding alpha_i to plus infinity I can make the Lagrangian arbitrarily large. So if w violates one of the primal problem's g_i constraints, the max over alpha of the Lagrangian is plus infinity. Similarly, if h_i(w) is not equal to zero for some i, then theta_P(w) is also infinity, for a very similar reason: the Lagrangian has a beta_i times h_i(w) term, and by setting beta_i to plus infinity or minus infinity, depending on the sign of h_i(w), I can make the Lagrangian plus infinity as well. And otherwise, theta_P(w) is just equal to f(w). It turns out that if I have a value of w that satisfies all of the g_i and h_i constraints, then when you maximize over alpha and beta, the maximum is actually obtained by setting all of the Lagrange-multiplier terms to zero, so theta_P is just left with f(w). Thus theta_P(w) = f(w) if the g_i and h_i constraints are satisfied, and is equal to plus infinity otherwise. So the problem I wrote down, minimize theta_P(w) as a function of w, is just your original problem: exactly the same problem as my original primal problem, because if you choose a value of w that violates the constraints you get infinity, and if you satisfy the constraints you get f(w). Minimizing theta_P(w) really is just the same as saying: satisfy the constraints and minimize f(w). Raise your hand if this makes sense. Okay, cool. (And I hope no one's getting mad at me for doing so much work to come back to exactly the same thing we started with.) So here's the cool part: let me now define the dual problem. Define theta_D, where the D stands for "dual"; this is now not a function of w, it's a function of the Lagrange multipliers alpha and beta. Define theta_D(alpha, beta) to be the minimum over w of my generalized Lagrangian. My dual optimization problem is then to maximize theta_D(alpha, beta) over alpha and beta, subject to alpha_i >= 0; in other words, it is max over alpha, beta of min over w of the Lagrangian. So this is my dual problem, and I want you to compare it to our previous primal optimization problem: the only difference is that I took the max and the min and switched the order around.
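Side by side (reconstructed):

```latex
\theta_D(\alpha, \beta) = \min_{w}\ \mathcal{L}(w, \alpha, \beta),
\qquad
d^\ast = \max_{\alpha,\,\beta \,:\, \alpha_i \ge 0}\ \theta_D(\alpha, \beta)
       = \max_{\alpha,\,\beta \,:\, \alpha_i \ge 0}\ \min_{w}\ \mathcal{L}(w, \alpha, \beta),

% versus the primal:
p^\ast = \min_{w}\ \max_{\alpha,\,\beta \,:\, \alpha_i \ge 0}\ \mathcal{L}(w, \alpha, \beta).
```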
And it turns out to be a fact, true in general, that d-star is less than or equal to p-star, where p-star, which I defined previously, is the value of the primal optimization problem. In other words, it's just generally true that the max of the min of something is less than or equal to the min of the max of that something. As a concrete example, take the max over y in {0, 1} of the min over x in {0, 1} of the indicator 1{x = y}. The inner min is always equal to zero, because whatever y is, I can choose x to be something different; so the max-min is zero. Whereas if I exchange the order of the min and the max, the inner quantity is always equal to one, because for any x I can choose y equal to it; so the min-max is one, and zero is less than or equal to one. And more generally, this max-min <= min-max inequality holds true for any function you might plug in there. But it turns out that sometimes, under certain conditions, these two optimization problems have the same value: the primal and the dual problems have the same value, and so you might be able to solve the dual problem rather than the primal problem. The reason to do that is that sometimes, as we'll see for the optimal margin classifier, the support vector machine problem, the dual problem turns out to be much easier to solve, and has many, many useful properties that we'll make use of, compared to the primal. So what I'm going to do now, for the sake of completeness, is write down formally the set of conditions under which that's true, under which the primal and the dual optimization problems give you the same solution. Let f be convex; if you're not sure what convex means, for the purposes of this class you can take it to mean that the Hessian H is positive semi-definite, so it's a bowl-shaped function. (And if you want to learn more about convex optimization, again, please come to this week's discussion section taught by the TAs.) Let's suppose the h_i constraints are affine, and what that means is that h_i(w) = a_i^T w + b_i. This essentially means linear: without the constant term b_i we'd say h_i is linear, and when we have a constant intercept term as well, this is technically called affine rather than linear. And let's suppose the g_i are strictly feasible, which means that there exists a value of w such that g_i(w) < 0 for all i. I'm writing these things down for the sake of completeness, but don't worry too much about the technical details.
Strictly feasible just means that there's a value of w such that all of these constraints are satisfied with strict inequality, rather than with less-than-or-equal. Under these conditions, there will exist w-star, alpha-star, beta-star such that w-star solves the primal problem, and alpha-star and beta-star, the Lagrange multipliers, solve the dual problem; and the value of the primal problem will be equal to the value of the dual problem, which will be equal to the value of the generalized Lagrangian evaluated at (w-star, alpha-star, beta-star). In other words, you can solve either the primal or the dual problem and you get the same solution. Further, your parameters will satisfy these conditions: the partial derivatives of the Lagrangian with respect to the parameters w will be zero (and actually, keep this equation in mind: we'll use it in a second when we take the Lagrangian of our support vector machine problem and take its derivative with respect to w to derive the dual), and the partial derivatives with respect to the Lagrange multipliers beta will be equal to zero as well. There's also what's called the KKT complementarity condition: alpha-star_i times g_i(w-star) is equal to zero. KKT stands for Karush-Kuhn-Tucker, the authors of this theorem. By tradition one usually lists five KKT conditions, but the other two are just that the alpha_i are greater than or equal to zero, which we had previously, and that your constraints are actually satisfied. So let me say one more word about this KKT complementarity condition, the condition that at your solution you must have alpha-star_i times g_i(w-star) equal to zero. If a product of two numbers is equal to zero, that means at least one of them must be equal to zero: either alpha_i or g_i(w-star) is zero. So the KKT (most people just say KKT, but that's the right spelling of the names: Karush-Kuhn-Tucker) complementarity condition implies that if alpha-star_i is non-zero, that necessarily implies that g_i(w-star) is equal to zero. All the condition guarantees is that at least one of them is zero; it may actually be the case that both alpha_i and g_i are zero. But in practice, when you solve this optimization problem, you find that to a large extent alpha-star_i is non-zero if and only if g_i(w-star) is equal to zero. This is not strictly true, because it's possible that both are zero, but in practice, when you solve problems like these, for the most part exactly one of them will be non-zero. Also, when g_i(w-star) = 0 holds, we say that g_i is an active constraint, because recall our constraint was that g_i(w) must be less than or equal to zero; if it holds with equality, we say it's an active constraint. Once we get to SVMs, I'll actually come back and extend this idea a little bit more.
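For reference, the KKT conditions collected in one place (reconstructed from the board):

```latex
% At the solution (w^*, \alpha^*, \beta^*):
\frac{\partial}{\partial w_i}\,\mathcal{L}(w^\ast, \alpha^\ast, \beta^\ast) = 0,
\qquad
\frac{\partial}{\partial \beta_i}\,\mathcal{L}(w^\ast, \alpha^\ast, \beta^\ast) = 0,

\alpha_i^\ast\, g_i(w^\ast) = 0 \ \ \text{(complementarity)},
\qquad
g_i(w^\ast) \le 0,
\qquad
\alpha_i^\ast \ge 0.
```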
I might return to this board in a second, but let's now go back and work out what the primal and dual optimization problems are for our optimal margin classifier, for the optimization problem we worked out just now. As a point of notation: in what I wrote down so far in deriving the KKT conditions, my Lagrange multipliers were alpha_i and beta_i. It turns out that when I apply this to work out the SVM, I'll have only one set of Lagrange multipliers, the alpha_i. Also, as I was working out the KKT conditions, I used w to denote the parameters of my primal optimization problem (I wanted to minimize f(w), so my optimization problem had parameters w), whereas in my SVM problem I'm actually going to have two sets of parameters, w and b. So keep that slight notation change in mind. The problem we worked out previously was: minimize one half the norm of w squared (we just add the one half by convention, because it makes the math work out nicer), subject to the constraints that y^(i)(w^T x^(i) + b) must be greater than or equal to one. So let me just take this constraint and rewrite it as a constraint g_i(w, b) <= 0 (again, previously I had g_i(w), but now I have parameters w and b), with g_i(w, b) = 1 - y^(i)(w^T x^(i) + b). Now let's look at the implications of this in terms of the KKT complementarity condition. We have that alpha_i > 0 implies that g_i(w, b) is equal to zero; in other words, this is an active constraint. And what does that mean? It turns out that g_i(w, b) = 0 means exactly that the training example (x^(i), y^(i)) has functional margin equal to one. Remember, this constraint was that the functional margin of every example has to be greater than or equal to one; if this is an active constraint, if the inequality holds with equality, that means my training example i has functional margin exactly one. In pictures (I'll do this on a different board), what that means is this: you have some training set, and some separating hyperplane, the line w^T x + b = 0, and the examples with functional margin equal to one will be exactly those which are closest to the separating hyperplane. In the cartoon example I've drawn, it's exactly these three examples that have functional margin equal to one, and all of the other examples, being further away than these three, have functional margin strictly greater than one. The examples with functional margin equal to one will usually correspond to the ones where the corresponding Lagrange multiplier is also not equal to zero. (Again, this may not hold exactly: it may be the case that both g_i and alpha_i are zero; but usually, when g_i is zero, alpha_i will be non-zero. So the examples of functional margin one will be the ones where alpha_i is not equal to zero.) One useful property of this, suggested by the picture and also true in general, is that when you find the solution to the SVM optimization problem, you find that relatively few training examples have functional margin equal to one.
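To restate the step just made (reconstructed):

```latex
% The primal constraint as a g_i <= 0 constraint, and what complementarity buys us:
g_i(w, b) = 1 - y^{(i)}\big(w^\top x^{(i)} + b\big) \le 0,
\qquad
\alpha_i > 0 \ \Longrightarrow\ g_i(w, b) = 0
  \ \Longleftrightarrow\ y^{(i)}\big(w^\top x^{(i)} + b\big) = 1.
```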
In the picture I've drawn, there are three examples with functional margin equal to one: just three examples at the minimum possible distance from the separating hyperplane. These examples with functional margin equal to one are what we're going to call the support vectors, and this leads to the name "support vector machine": there will be these few points with functional margin one that we call the support vectors. The fact that there are relatively few support vectors also means that usually most of the alpha_i are equal to zero; alpha_i will be zero for the examples that are not support vectors. Okay, let's go ahead and work out the actual optimization problem. We have the maximum margin optimization problem, so let me go and write down the Lagrangian. Because we only have inequality constraints (only g_i-style constraints, no h_i-style equality constraints), I only have Lagrange multipliers of type alpha, no betas, in my generalized Lagrangian: L(w, b, alpha) = (1/2)||w||^2 minus the sum over i of alpha_i [y^(i)(w^T x^(i) + b) - 1]. That's my Lagrangian. So let's work out what the dual problem is. To do that, I need to figure out theta_D(alpha) (again, there are no betas here), which is the min with respect to w and b of L(w, b, alpha); the dual problem is then to maximize theta_D as a function of alpha. So let's work out what theta_D is, and that will give us our dual problem. What do we need to do? We need to take the Lagrangian and minimize it as a function of w and b. How do we minimize it? We do the usual thing: we take the derivatives of the Lagrangian with respect to w and b and set them to zero. So take the derivative with respect to w of the Lagrangian; I'll just write down the answer, since you know how to do calculus like this. Setting the derivative to zero implies that w must be equal to the sum over i of alpha_i y^(i) x^(i). So w is a linear combination of your input feature vectors x^(i): a sum of weights, given by alpha_i y^(i), times the x^(i)'s, the examples in your training set. This will be useful later. The other equation we get is from the partial derivative of the Lagrangian with respect to b, which is minus the sum from i = 1 to m of alpha_i y^(i), and I set that to zero as well. So those are my two conditions. What I'm going to do next is take what I worked out to be the value for w and plug it back in there, to figure out what the Lagrangian really is when I minimize it with respect to w; I'll deal with b in a second.
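The derivation just described, written out (reconstructed):

```latex
\mathcal{L}(w, b, \alpha)
  = \frac{1}{2}\,\|w\|^2
    - \sum_{i=1}^{m} \alpha_i \Big[\, y^{(i)}\big(w^\top x^{(i)} + b\big) - 1 \,\Big],

\nabla_w \mathcal{L} = 0 \ \Longrightarrow\
w = \sum_{i=1}^{m} \alpha_i\, y^{(i)}\, x^{(i)},
\qquad
\frac{\partial \mathcal{L}}{\partial b} = 0 \ \Longrightarrow\
\sum_{i=1}^{m} \alpha_i\, y^{(i)} = 0.
```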
So my Lagrangian is (1/2) w^T w minus that sum; plugging in the value for w that I worked out previously, the first term w^T w becomes a double sum over the training examples, and when I expand out this quadratic function, plugging in w over there as well, I find that the whole thing simplifies. I'm using angle brackets to denote inner products, so <x^(i), x^(j)> just means the inner product x^(i) transpose x^(j); the first and second terms of the expansion are actually the same except for the factor of minus one half, and so it simplifies to: sum over i of alpha_i, minus one half the sum over i, j of y^(i) y^(j) alpha_i alpha_j <x^(i), x^(j)>. Let me go ahead and call this W(alpha). My dual problem is therefore the following: maximize W(alpha), that big expression I wrote out, subject to the alpha_i being greater than or equal to zero, and subject to the sum over i of alpha_i y^(i) being equal to zero. (I realize the notation is somewhat unfortunate: I'm using capital W(alpha) to denote that formula, and we also had a lowercase w, the original parameters of the primal problem; uppercase and lowercase w are totally different things. Unfortunately, the notation is standard, as far as I know.) That last constraint, sum over i of alpha_i y^(i) = 0, is the constraint I got from setting the partial derivative with respect to b to zero; that's where that final constraint came from. Let me just take two minutes to say what the real interpretation of that is, since in previous years I've found that where this constraint comes from can be slightly confusing; if you don't understand it, it's not a big deal. When we took the partial derivative of the Lagrangian with respect to b, we ended up with this constraint that the sum of alpha_i y^(i) must be zero. The interpretation is this: the Lagrangian is actually a linear function of the parameter b, and so if the sum of alpha_i y^(i) is not equal to zero, then theta_D(alpha), the minimum over w and b of the Lagrangian, is equal to minus infinity, since a non-degenerate linear function of b can be driven down without bound. So if your goal is to maximize theta_D(alpha) as a function of alpha, you had better choose values of alpha for which the sum of alpha_i y^(i) is equal to zero; and when the sum of alpha_i y^(i) is zero, then theta_D(alpha) is equal to W(alpha). That's why we end up maximizing W(alpha) subject to the constraint that the sum of alpha_i y^(i) equals zero. Unfortunately, the fact that we have to deal with b adds just a little bit of extra notation to our derivation of the dual; by the way, all the real action in the optimization problem is with w, because b is just one parameter. Okay, are there other questions about this? Cool. So we've derived the dual optimization problem. (And really, don't worry if you're not quite sure where that constraint came from; just think of it as something we worked out when we took the partial derivative with respect to b, and I copied it over here.) Our approach to deriving the optimal margin classifier, the support vector machine, will be to solve this dual optimization problem for the parameters alpha-star; and then, once you solve for the alphas, you can go back and quickly derive w, using the equation we worked out on the previous board: w must be equal to the sum over i of alpha_i y^(i) x^(i).
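The dual problem in full (reconstructed):

```latex
\max_{\alpha}\ \ W(\alpha)
  = \sum_{i=1}^{m} \alpha_i
    - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
      y^{(i)} y^{(j)}\, \alpha_i \alpha_j\,
      \big\langle x^{(i)},\, x^{(j)} \big\rangle

\quad \text{s.t.}\quad
\alpha_i \ge 0,\ \ i = 1,\dots,m,
\qquad
\sum_{i=1}^{m} \alpha_i\, y^{(i)} = 0.
```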
Moreover, once you solve for alpha and w, it's really easy to solve for b. The interpretation is this: you have a training set, and you've found the direction of w, so you know the orientation of the separating hyperplane; it's got to be one of these parallel planes, and you just have to decide where to place it. That's what solving for b is. So once you solve for alpha and w, you can plug them back into the primal optimization problem and solve for b; I'll just write it down for the sake of completeness. It works out to b-star = -( max over i with y^(i) = -1 of w-star^T x^(i), plus min over i with y^(i) = 1 of w-star^T x^(i) ) / 2, and the intuition behind this formula is just that you find the worst positive example and the worst negative example, say this one and this one, and you split the difference between them; that tells you where to set the threshold, where to place the separating hyperplane. And that's the optimal margin classifier, which is also called the support vector machine if you do not use one more idea, called kernels, that I'll say a few words about. But I hope the process is clear: it's a dual problem; we solve the dual problem for the alpha_i; that gives us w; and that gives us b. There's just one more thing I want to point out, as a lead-in to the next lecture: it turns out we can take the entire algorithm and express it in terms of inner products. Here's what I mean by that. I said that the parameter vector w is a sum over your input examples, and so when you need to make a prediction, when someone gives you a new value of x and you want to evaluate the hypothesis at it, that's given by g(w^T x + b), where g is the threshold function that outputs minus one or plus one. And w^T x + b is equal to the sum over i of alpha_i y^(i) <x^(i), x>, plus b: it can be expressed as a sum of inner products between your training examples and the new value of x at which you're evaluating the hypothesis. This will lead into the next lecture, the idea of kernels. It turns out that in the sorts of feature spaces we'll use with support vector machines, the training examples may be very high-dimensional; it may even be the case that the features you want to use are infinite-dimensional feature vectors. But despite this, it will turn out that there's an implicit representation you can use that allows you to compute inner products like these efficiently. This holds true only for certain feature spaces, not for arbitrary sets of features; but when we talk about the idea of kernels in the next lecture, we'll see examples where, even though you have extremely high-dimensional feature vectors (you may never even want to represent x^(i) explicitly, since if it's an infinite-dimensional feature vector you can't even store it in computer memory), you will nonetheless be able to compute inner products between different input feature vectors very efficiently, and so you can, for example, make predictions by making use of these inner products; <x^(i), x> here is just x^(i) transpose x.
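A sketch of prediction via inner products (my own code; the multipliers below are made up so that they satisfy sum_i alpha_i y_i = 0, purely for illustration; in practice they come out of the dual QP solver, with zeros for the non-support vectors):

```python
import numpy as np

def svm_predict(alphas, X, y, b, x_new):
    """Prediction using only inner products with the training examples:
    sign( sum_i alpha_i * y_i * <x_i, x_new> + b ).
    Only support vectors (alpha_i > 0) contribute to the sum."""
    score = np.sum(alphas * y * (X @ x_new)) + b
    return 1 if score >= 0 else -1

# Hypothetical training data and multipliers, for illustration only.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alphas = np.array([0.1, 0.0, 0.1, 0.0])   # zeros: not support vectors
b = 0.0

print(svm_predict(alphas, X, y, b, np.array([1.0, 3.0])))   # -> 1
print(svm_predict(alphas, X, y, b, np.array([-2.0, 0.0])))  # -> -1
```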
You will be able to compute these inner products very efficiently, and therefore make predictions. And let me point out the other reason we derived the dual: on this board, when we worked out what W(alpha) is, W(alpha) has the same property; it too is written entirely in terms of these inner products. So if you look at the dual optimization problem and step through all the steps of the algorithm, you find that you can actually do everything you want (solve the dual optimization problem, learn the parameters alpha) without ever needing to represent x^(i) directly; all you need to do is compute inner products between pairs of feature vectors. One last property of this algorithm that's kind of nice: I said previously that the alpha_i are non-zero only for the support vectors, the vectors with functional margin one, and in practice there are usually fairly few of them. What this means is that if you're representing w this way, then w will be represented in terms of a fairly small fraction of your training examples, because most of the alpha_i will be zero; so when you're computing that sum, you need to compute inner products only with the support vectors, which is usually a reasonably small fraction of your training set. That's another nice property, again because most of the alpha_i are zero. Much of this will make more sense when we talk about kernels. A quick question before we close? Student: It seems that for anything we've done to work, the point cloud has to be really well behaved, in that none of the points can be on the wrong side. Instructor: Yeah, you're right; again, for today's lecture I assumed the data is linearly separable, that you can actually get perfect training error. I'll relax that assumption in the next lecture as well. Student: And here we assume we have two point clouds; what if you have multiple classes? Instructor: It turns out there are ways to generalize this to multiple classes, but I probably won't cover them; that's a generalization we'll leave for later. Okay, let's close for today; we'll talk about kernels in the next lecture.