Transcript for:
Understanding Optimization in Machine Learning

All right, welcome to day four. Today we're doing optimization. My name is Ioannis Mitliagkas, and along with Jose Gallego-Posada I've put together the material for this day. Let me start by introducing myself. I'm an Assistant Professor at the University of Montreal and a core member of MILA. I'm an amateur musician, and I like to say I'm a professional researcher in machine learning and optimization. I will always miss Greek summers, and my life's goal is to spend every summer there, this summer included.

A few thank-yous, credits, and acknowledgements for the day. I want to thank the academy organizers for the invitation and for all the great organization and support, and of course Jose Gallego-Posada, a PhD candidate at MILA and the University of Montreal, who expertly put together all of the practical content you will see in the labs. I would also like to thank Lyle and Konrad, because I've borrowed some of their material, especially some of the very nice stories in the introduction and conclusion of today.

So, introduction to optimization: what is it and why is it important? That's where we start. There are two very important questions we will focus on. The first is: what do we optimize? Before we even begin making things better, we have to decide what it means to make things better, what is "good". We need a scale, a quantity, typically a scalar quantity, that we want to make bigger if it's a measure of goodness, or smaller if it's a cost of some sort. We will mostly focus on minimization. This is a very difficult question; as we will see in some examples, even when we start with good intentions, it's quite possible to end up with adverse consequences. The second question is: how do we do it?
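To make the idea of a scalar objective concrete, here is a tiny sketch; this toy example is entirely our own, not from the lecture materials. The cost is a mean squared error, and deciding which of two candidate models is "better" reduces to comparing two numbers:

```python
# Toy illustration: "what we optimize" is a single scalar cost.
# Here the model is y_hat = w * x and the cost is mean squared error.

def mse(w, xs, ys):
    """Scalar cost -- smaller is better, so we minimize it."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # data generated by y = 2x

# Comparing two candidate parameters is just comparing two scalars.
print(mse(1.5, xs, ys))       # a positive cost
print(mse(2.0, xs, ys))       # 0.0 -- the best possible value here
```

Once "better" has been reduced to "smaller cost", the second question, how to actually drive that number down, is what the rest of the day is about.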
Once we decide that this is the quantity we want to minimize, how do we go about doing it? That involves some numerical methodology and some math, and today we will see the most important bits and pieces we need when doing optimization for machine learning.

To give you a first idea of the first question, what do we optimize: we have to choose. For machine learning, the very first step, which you have already seen, is the choice of a loss function, and you have seen a few loss functions before: the mean squared error, the cross-entropy loss, and so on. Some of these questions are technical, but some are more high-level, as we will see. For example, when we're doing classification with multiple classes, it is possible to do classification with the mean squared error loss, which I believe you saw a couple of days ago. However, as you saw yesterday, the cross-entropy loss is particularly good for the multi-class classification problem. It's possible to use one or the other, but they don't have the same effect, and there's a reason why, for multi-class classification, we often tend to prefer cross-entropy.

At another level, when we're trying to evaluate the final performance of our model, we could either measure accuracy, meaning the percentage of examples we classify correctly, or we could measure what we call the area under the curve (AUC). Without getting into too much detail, I would like to bring this up as an example of what happens when we have class imbalance. Imagine you're working on a medical data set where the overwhelming majority of your patients are cancer-free. Only a tiny fraction of your patients have cancer, let's say 0.1 percent, and it's good that this percentage is small. Now, if we train a classifier
Now, if we train a classifier   trying to maximize our accuracy, okay,  we will do a beautiful job, 99.9%,   if we predict that everyone is healthy,  everyone is cancer free. Of course that's   not true because we expect a small percentage of  our patients to unfortunately have this condition,   right. So, there are alternatives like the  area under the curve which is much better for   evaluating our performance when we have class  imbalance. So when we're choosing our training   losses and our evaluation metrics and losses  there are a lot of technical issues like that,   that we take into account. And then of  course there are some societal questions,   okay. For example, some of the advanced and more  modern topics that we see in the field are about   fairness. Are our algorithms fair, and the  predictions we make fair. We will see at the very   end section nine today some ethical concerns  about the use of Machine learning in various   sensitive let's say applications in society  okay. So what does it mean for an algorithm to   perform well, or for an algorithm to be fair.  And are we optimizing for one class of people   versus another class of people. Are the systems  we create good for predicting and helping   one particular demographic versus another.  So these questions become very important   once we adopt this methodology in  some of those societal applications.   And then of course there's always unintended  consequences, once we put incentives in order   because when we choose an objective we implicitly  choose an incentive for finding the right solution   that will guide us to the right solution. We  might often be surprised, okay. So here we will   see a couple of examples that I borrowed from from  Lyle's lectures, and the first one is a very well   known example of perverse incentives, as it's  called. And it's a story about cobras in India,   and it goes as follows. 
Years ago, the rulers of India realized there was a serious problem with cobras in some regions of India: an overpopulation of cobras. They were dangerous and caused problems, so the rulers announced a bounty. Anyone who presented the head of a cobra they had just killed to the right authorities would be compensated by some amount per dead cobra, an incentive to get people to get rid of the cobras. Eventually this incentive worked the opposite way, and the story goes as follows. Some farmers decided it was profitable to actually farm those cobras, to breed them, kill them en masse, and then present them to the authorities for the compensation. Of course, once the authorities realized what was going on, they stopped the program. And once the farmers stopped getting compensated for the cobras they were breeding, they released all the cobras into the wild. The net effect was that there were more cobras at the end of the program than at the beginning. So this is an example of an unintended consequence.

There's another example here, taken from an OpenAI blog. What we're going to see is a short video of a Reinforcement Learning agent trained on a little boating game called Coast Runners. Technically, the job of the player is to go around the track and complete it, and along the way gather as many points as possible. There are little bonuses, which you see in green, that give you extra points: every time you run over one you get an extra set of points, and they respawn.
What this Reinforcement Learning agent figured out through the training process is: I don't need to finish the game; I can get a lot of points by going round and round at the right period, so I can grab all of those green bonuses at the frequency at which they respawn. This is another example of an adverse side effect of the loss function, in this particular case called the reward function, that the designers chose. So we should always be mindful, always keep our eyes open, because our choices can have adverse side effects.

Now that we've alerted you to the importance of the first choice, what we optimize, we will spend a lot of time discussing how we do it, because when it comes to Deep Learning, how we optimize is quite complex. There are a lot of tips and tricks, little engineering pieces of methodology, that we have to keep in mind because they have proven to be very, very helpful. That's what we're going to do today. And just to motivate things a little better: in industry in particular, there are some big models with billions of parameters. This particular quote is about a big 11-billion-parameter model trained within Google, and apparently it costs more than one million dollars to do one run, to train the model from beginning to end. As you might know, or as we will see, one run is typically not enough; you will need multiple runs to tune, to validate, to tweak things a little bit. So the estimate was that a project to develop this model has to cost at least 10 million dollars. This is a particularly difficult limitation for anyone who is not Google and wants to follow along with this kind of technology, implement it, adopt it, and so on. At the heart of this whole heavy process is optimization.
Any little improvement we make on the optimization side bears the promise of a very significant cost reduction, and this motivates a lot of what we will see.

So here is the summary for today. We have nine little micro-lectures of content, each focused on a very specific element, and we're just done with the first one: why optimization is important. Next, we're going to introduce our case study, a simple MLP classifier, on which we're going to work step by step in the labs with gradient descent with momentum. We're going to see what non-convexity looks like and what we can do about it. We're going to see the value of mini-batches and of adaptive methods. And in the end we're going to put it all together in one final lab before concluding and discussing some ethical considerations. So thank you for joining us, and let's do this.
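As a small preview of what the labs will build step by step, here is a minimal sketch of gradient descent with momentum on a toy one-dimensional quadratic. The constants and function are our own illustration, not the lab code:

```python
# Gradient descent with momentum on the toy objective f(w) = (w - 3)^2.
# The velocity v accumulates a decaying sum of past gradients,
# which smooths and accelerates the parameter updates.

def grad(w):
    return 2.0 * (w - 3.0)   # derivative of (w - 3)^2

w, v = 0.0, 0.0              # initial parameter and velocity
lr, beta = 0.1, 0.9          # learning rate and momentum coefficient

for _ in range(500):
    v = beta * v + grad(w)   # momentum: mix old velocity with new gradient
    w = w - lr * v           # take a step along the velocity

print(round(w, 4))           # 3.0 -- the minimizer of f
```

With `beta = 0`, this reduces to plain gradient descent; the labs will explore how the momentum coefficient changes the trajectory on harder, non-convex losses.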