Transcript for:
Understanding Optimization in Machine Learning

All right, welcome to day four. Today we're doing optimization. My name is Ioannis Mitliagkas, and along with Jose Gallego-Posada I've put together the material for this day. Let me start by introducing myself. I'm an Assistant Professor at the University of Montreal and a core member of MILA. I'm an amateur musician, and I like to say I'm a professional researcher in machine learning and optimization. I will always miss Greek summers, and my life's goal is to spend every summer there, this summer included.

A few thank-yous, credits, and acknowledgements for the day. I want to thank the academy organizers for the invitation and for all the great organization and support, and of course Jose Gallego-Posada, a PhD candidate at MILA and the University of Montreal, who expertly put together all of the practical content you will see in the labs. I would also like to thank Lyle and Konrad, because I've borrowed some of their material, especially some of the very nice stories in the introduction and conclusion of today.

So, introduction to optimization: what is it and why is it important? That's where we start. There are two very important questions we will focus on. The first is: what do we optimize? Before we even begin making things better, we have to decide what it means to make things better, what is "good". We need a scale, a quantity, typically a scalar quantity, that we want to make bigger if it's a measure of goodness, or smaller if it's a cost of some sort. We will mostly focus on minimization. This is a very difficult question; as we will see in some examples, even when we start with good intentions, it's quite possible to end up with adverse consequences. The second question is: how do we do it?
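To make the idea of a scalar objective concrete, here is a tiny sketch; this toy example is entirely our own, not from the lecture materials. The cost is a mean squared error, and deciding which of two candidate models is "better" reduces to comparing two numbers:

```python
# Toy illustration: "what we optimize" is a single scalar cost.
# Here the model is y_hat = w * x and the cost is mean squared error.

def mse(w, xs, ys):
    """Scalar cost -- smaller is better, so we minimize it."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # data generated by y = 2x

# Comparing two candidate parameters is just comparing two scalars.
print(mse(1.5, xs, ys))       # a positive cost
print(mse(2.0, xs, ys))       # 0.0 -- the best possible value here
```

Once "better" has been reduced to "smaller cost", the second question, how to actually drive that number down, is what the rest of the day is about.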
Once we decide that this is the quantity we want to minimize, how do we go about doing it? That involves some numerical methodology and some math, and today we will see the most important bits and pieces we need when doing optimization for machine learning.

To give you a first idea of the first question, what do we optimize: we have to choose. For machine learning, the very first step, which you have already seen, is the choice of a loss function, and you have seen a few loss functions before: the mean squared error, the cross-entropy loss, and so on. Some of these questions are technical, but some are more high-level, as we will see. For example, when we're doing classification with multiple classes, it is possible to do classification with the mean squared error loss, which I believe you saw a couple of days ago. However, as you saw yesterday, the cross-entropy loss is particularly good for the multi-class classification problem. It's possible to use one or the other, but they don't have the same effect, and there's a reason why, for multi-class classification, we often tend to prefer cross-entropy.

At another level, when we're trying to evaluate the final performance of our model, we could either measure accuracy, meaning the percentage of examples we classify correctly, or we could measure what we call the area under the curve (AUC). Without getting into too much detail, I would like to bring this up as an example of what happens when we have class imbalance. Imagine you're working on a medical data set where the overwhelming majority of your patients are cancer-free. Only a tiny fraction of your patients have cancer, let's say 0.1 percent, and it's good that this percentage is small. Now, if we train a classifier
Now, if we train a classifier   trying to maximize our accuracy, okay,  we will do a beautiful job, 99.9%,   if we predict that everyone is healthy,  everyone is cancer free. Of course that's   not true because we expect a small percentage of  our patients to unfortunately have this condition,   right. So, there are alternatives like the  area under the curve which is much better for   evaluating our performance when we have class  imbalance. So when we're choosing our training   losses and our evaluation metrics and losses  there are a lot of technical issues like that,   that we take into account. And then of  course there are some societal questions,   okay. For example, some of the advanced and more  modern topics that we see in the field are about   fairness. Are our algorithms fair, and the  predictions we make fair. We will see at the very   end section nine today some ethical concerns  about the use of Machine learning in various   sensitive let's say applications in society  okay. So what does it mean for an algorithm to   perform well, or for an algorithm to be fair.  And are we optimizing for one class of people   versus another class of people. Are the systems  we create good for predicting and helping   one particular demographic versus another.  So these questions become very important   once we adopt this methodology in  some of those societal applications.   And then of course there's always unintended  consequences, once we put incentives in order   because when we choose an objective we implicitly  choose an incentive for finding the right solution   that will guide us to the right solution. We  might often be surprised, okay. So here we will   see a couple of examples that I borrowed from from  Lyle's lectures, and the first one is a very well   known example of perverse incentives, as it's  called. And it's a story about cobras in India,   and it goes as follows. 
Years ago, the rulers of India realized there was a serious problem with cobras in some regions of India: an overpopulation of cobras. They were dangerous and caused problems, so the rulers announced a bounty. Anyone who presented the head of a cobra they had just killed to the right authorities would be compensated by some amount per dead cobra, an incentive to get people to get rid of the cobras. Eventually this incentive worked the opposite way, and the story goes as follows. Some farmers decided it was profitable to actually farm those cobras, to breed them, kill them en masse, and then present them to the authorities for the compensation. Of course, once the authorities realized what was going on, they stopped the program. And once the farmers stopped getting compensated for the cobras they were breeding, they released all the cobras into the wild. The net effect was that there were more cobras at the end of the program than at the beginning. So this is an example of an unintended consequence.

There's another example here, taken from an OpenAI blog. What we're going to see is a short video of a Reinforcement Learning agent trained on a little boating game called Coast Runners. Technically, the job of the player is to go around the track and complete it, and along the way gather as many points as possible. There are little bonuses, which you see in green, that give you extra points: every time you run over one you get an extra set of points, and they respawn.
What this Reinforcement Learning agent figured out through the training process is: I don't need to finish the game; I can get a lot of points by going round and round at the right period, so I can grab all of those green bonuses at the frequency at which they respawn. This is another example of an adverse side effect of the loss function, in this particular case called the reward function, that the designers chose. So we should always be mindful, always keep our eyes open, because our choices can have adverse side effects.

Now that we've alerted you to the importance of the first choice, what we optimize, we will spend a lot of time discussing how we do it, because when it comes to Deep Learning, how we optimize is quite complex. There are a lot of tips and tricks, little engineering pieces of methodology, that we have to keep in mind because they have proven to be very, very helpful. That's what we're going to do today. And just to motivate things a little better: in industry in particular, there are some big models with billions of parameters. This particular quote is about a big 11-billion-parameter model trained within Google, and apparently it costs more than one million dollars to do one run, to train the model from beginning to end. As you might know, or as we will see, one run is typically not enough; you will need multiple runs to tune, to validate, to tweak things a little bit. So the estimate was that a project to develop this model has to cost at least 10 million dollars. This is a particularly difficult limitation for anyone who is not Google and wants to follow along with this kind of technology, implement it, adopt it, and so on. At the heart of this whole heavy process is optimization.
Any little improvement we make on the optimization side bears the promise of a very significant cost reduction, and this motivates a lot of what we will see.

So here is the summary for today. We have nine little micro-lectures of content, each focused on a very specific element, and we're just done with the first one: why optimization is important. Next, we're going to introduce our case study, a simple MLP classifier, on which we're going to work step by step in the labs with gradient descent with momentum. We're going to see what non-convexity looks like and what we can do about it. We're going to see the value of mini-batches and of adaptive methods. And in the end we're going to put it all together in one final lab before concluding and discussing some ethical considerations. So thank you for joining us, and let's do this.
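As a small preview of what the labs will build step by step, here is a minimal sketch of gradient descent with momentum on a toy one-dimensional quadratic. The constants and function are our own illustration, not the lab code:

```python
# Gradient descent with momentum on the toy objective f(w) = (w - 3)^2.
# The velocity v accumulates a decaying sum of past gradients,
# which smooths and accelerates the parameter updates.

def grad(w):
    return 2.0 * (w - 3.0)   # derivative of (w - 3)^2

w, v = 0.0, 0.0              # initial parameter and velocity
lr, beta = 0.1, 0.9          # learning rate and momentum coefficient

for _ in range(500):
    v = beta * v + grad(w)   # momentum: mix old velocity with new gradient
    w = w - lr * v           # take a step along the velocity

print(round(w, 4))           # 3.0 -- the minimizer of f
```

With `beta = 0`, this reduces to plain gradient descent; the labs will explore how the momentum coefficient changes the trajectory on harder, non-convex losses.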