Good afternoon everyone!
Thank you all for joining today. My name is Alexander Amini, and I'll be one of your course organizers this year along with Ava -- and together we're super excited to introduce you all to Introduction to Deep Learning. MIT Intro to Deep Learning is a really fun, exciting, and fast-paced program here at MIT, so let me start by giving you a little bit of background on what we do and what you're going to learn this year. This week of Intro to Deep Learning we're going to cover a ton of material in just one week. You'll learn the foundations of this fascinating and exciting field of deep learning and artificial intelligence, and, more importantly, you're going to get hands-on experience reinforcing what you learn in the lectures as part of hands-on software labs.
Now, over the past decade, AI and deep learning have had a huge resurgence and many incredible successes, and a lot of problems that even just a decade ago we thought were not really solvable in the near future we're now solving with deep learning with incredible ease. This past year in particular, 2022, has been an incredible year for deep learning progress, and I like to say that this past year has been the year of generative deep learning: using deep learning to generate brand new types of data that have never been seen before and never existed in reality. In fact, I want to start this class by showing you how we started it several years ago, which was by playing a video that I'll play in a second. This video was an introductory video for the class, and it exemplifies the idea I'm talking about. So let me just stop there and play this video first of all: Hi everybody and welcome to MIT 6.S191 -- the official introductory course on deep learning taught here at MIT.
Deep Learning is revolutionizing so many fields: from robotics to
medicine and everything in between. You'll learn the fundamentals of
this field and how you can build some of these incredible algorithms.
In fact, this entire speech and video are not real and were created using deep
learning and artificial intelligence. And in this class you'll learn how. It has been an honor to speak with you
today and I hope you enjoy the course. So, in case you couldn't tell, that video and its entire audio were not real -- they were synthetically generated by a deep learning algorithm. We created this video several years ago, and even then, when we introduced it and put it on YouTube, it went somewhat viral. People really loved this video; they were intrigued by how real the video and audio felt and looked while being entirely generated by an algorithm, by a computer, and people were shocked by the power and realism of these types of approaches. And that was a few years ago. Now fast forward to today and the state of deep learning: we have seen deep learning accelerating at a rate faster than we've ever seen before. In fact, we can now use deep learning to generate not just images of faces but full synthetic environments, where we can train autonomous vehicles entirely in simulation and deploy them on full-scale vehicles in the real world, seamlessly. The videos you see here are actually from a data-driven, neural-network-based simulator called VISTA that we built here at MIT and have open-sourced to the public, so all of you can train and build the future of autonomy and self-driving cars.
And of course it goes far beyond this as well. Deep learning can be used to generate content directly from how we speak and the language that we convey to it. From prompts that we give it, deep learning can reason about the prompts in natural language -- English, for example -- and then guide and control what is generated according to what we specify. We've seen examples where we can generate things that, again, have never existed in reality: we can ask a neural network to generate a photo of an astronaut riding a horse, and it can actually imagine, or hallucinate, what this might look like, even though not only has this photo never occurred before, I don't think any photo of an astronaut riding a horse has ever occurred before, so there isn't really even training data to go off of in this case. And my personal favorite is how we can not only build software that can generate images and videos, but also build software that can generate software. We can have algorithms that take language prompts -- for example, a prompt like "write code in TensorFlow to train a neural network" -- and not only will it write the code and create that neural network, it will have the ability to reason about the code it has generated and walk you through it step by step, explaining the process and procedure from the ground up so that you can learn how to do this yourself.
Now, I think some of these examples really highlight how far deep learning and these methods have come in the past six years since we started this course -- you saw that example from the introductory video just a few years ago, and now we're seeing such incredible advances. The most amazing part of this course, in my opinion, is that within this one week we're going to take you, from the ground up starting today, through all of the foundational building blocks that will allow you to understand and make all of these amazing advances possible. So with that, hopefully you're all super excited about what this class will teach. I want to start by taking a step back and defining some of the terminology I've been throwing around so far: deep learning, artificial intelligence -- what do these things actually mean?
First of all, I want to take a second to speak about intelligence and what intelligence means at its core. To me, intelligence is simply the ability to process information such that we can use it to inform some future decision or action that we take. The field of artificial intelligence is simply the ability for us to build algorithms -- artificial algorithms -- that can do exactly this: process information to inform some future decision. Machine learning is a subset of AI which focuses specifically on how we can build, or teach, a machine to do this from some experience or data. Deep learning goes one step beyond this: it is a subset of machine learning which focuses explicitly on what are called neural networks, and how we can build neural networks that can extract features in the data -- basically what you can think of as patterns that occur within the data -- so that the machine can learn to complete these tasks as well. And that's exactly what this class is all about at its core: we're going to try to give you the foundational understanding of how we can build and teach computers to learn many different types of tasks directly from raw data. That's really what this class boils down to in its simplest form.
And we'll provide a very solid foundation for you, both on the technical side through the lectures -- which will happen in two parts each day, a first lecture and a second lecture, each about one hour long -- followed by a software lab which will immediately follow the lectures and will reinforce a lot of what we cover in the technical part of the class, giving you hands-on experience implementing those ideas. So this program is split between these two pieces: the technical lectures and the software labs. We have several new updates this year, especially in many of the later lectures. The first lecture will cover the foundations of deep learning, which is what we're doing right now, and finally we'll conclude the course with some very exciting guest lectures from both academia and industry, from people who are really leading and driving forward the state of AI and deep learning. And of course we have many awesome prizes that go with all of the software labs and the project competition at the end of the course.
So, to quickly go through these: each day, like I said, we'll have dedicated software labs that couple with the lectures. Starting today with lab one, and keeping with this theme of generative AI, you'll actually build a neural network that can listen to a lot of music and learn how to generate brand new songs in that genre. At the end of the class, on Friday, we'll host a project pitch competition where you, either individually or as part of a group, can participate and present a novel deep learning idea to all of us. It'll be roughly three minutes in length, and because this is a one-week program we're not going to focus so much on the results of your pitch, but rather on the innovation and the novelty of what you're trying to propose. The prizes here are quite significant: first prize is going to be an NVIDIA GPU, which is a key piece of hardware that is instrumental if you want to actually build a deep learning project and train these neural networks, which can be very large and require a lot of compute. These prizes will give you the compute to do so.
And finally, this year we'll be awarding a grand prize for labs two and three combined, which will occur on Tuesday and Wednesday, focused on what I believe are some of the most exciting problems in this field: specifically, how we can build models that are not only accurate but also robust, trustworthy, and safe when they're deployed. You'll actually get experience developing those types of solutions that can advance the state of the art in AI. All of the labs and competitions I mentioned are going to be due on Thursday night at 11 PM, right before the last day of class, and we'll be helping you all along the way. This competition in particular has very significant prizes, so I encourage all of you to enter and get a chance to win. And like I said, we're going to be helping you all along the way; there are many resources available throughout this class to help you. Please post to Piazza if you have any questions, and of course this program has an incredible team that you can reach out to at any point in case you have any issues or questions about the materials. Myself and Ava will be your two main lecturers for the first part of the class, and like I said, in the later part of the class we'll also hear from some guest lecturers who will share some really cutting-edge, state-of-the-art developments in deep learning. And I want to give a huge shout-out and thanks to all of our sponsors, without whose support this program wouldn't have been possible, yet again, for another year. So thank you all.
Okay, so now with that, let's really dive into the fun stuff of today's lecture: the technical part. I want to start this part by having you ask yourselves: why are all of you here? Why do you care about this topic in the first place? To answer that question, we have to take a step back and think about the history of machine learning, what machine learning is, and what deep learning brings to the table on top of it. Traditional machine learning algorithms typically define sets of features in the data -- you can think of these as certain patterns in the data -- and usually these features are hand-engineered: a human with a lot of domain knowledge and experience looks at the data set and tries to uncover what these features might be. The key idea of deep learning, and this is really central to this class, is that instead of having a human define these features, what if we could have a machine look at all of this data and actually try to extract and uncover the core patterns in the data, so that it can use those patterns when it sees new data to make decisions?
For example, if we wanted to detect faces in an image, a deep neural network might actually learn that, in order to detect a face, it first has to detect things like lines and edges in the image; when you combine those lines and edges you can create compositions of features like corners and curves; and when you combine those, you can create higher-level features, for example eyes and noses and ears; and those are the features that ultimately allow you to detect what you care about detecting, which is the face. All of this comes from what is called a hierarchical learning of features, and you can actually see some examples of these -- these are real features learned by a neural network, and how they're combined defines this progression of information. But in fact, what I just described -- this underlying, fundamental building block of neural networks and deep learning -- has actually existed for decades.
So why are we studying all of this now, in this class, with all of this great enthusiasm to learn it? Well, there have been several key advances that have occurred in the past decade. Number one: data is so much more pervasive than it has ever been before in our lifetimes. These models are hungry for data, and we're living in the age of big data -- more data is available to these models than ever before, and they thrive off of that. Secondly, these algorithms are massively parallelizable and require a lot of compute, and we're at a unique time in history where we have the ability to train these extremely large-scale algorithms -- techniques that have existed for a very long time but that we can only now train, thanks to the hardware advances that have been made. And finally, due to open-source tooling and software platforms like TensorFlow, which all of you will get a lot of experience with in this class, training and building the code for these neural networks has never been easier. So from the software point of view as well, there have been incredible advances in open-sourcing the underlying fundamentals of what you're going to learn.
So let me start now by building up, from the ground up, the fundamental building block of every single neural network that you're going to learn about in this class, and that's a single neuron. In neural network language, a single neuron is called a perceptron. So what is a perceptron? A perceptron is, like I said, a single neuron, and it's actually a very simple idea, so I want to make sure that everyone in the audience understands exactly what a perceptron is and how it works. Let's start by defining a perceptron as taking as input a set of inputs. On the left-hand side, you can see this perceptron takes m different inputs, 1 through m -- these are the blue circles -- and we denote these inputs as x's. Each of these inputs is then multiplied by a corresponding weight, which we call w -- so x1 is multiplied by w1 -- and we add the results of all of these multiplications together. Now we take that single number after the addition and pass it through what we call a nonlinear activation function, and that produces the final output of the perceptron, which we call y. Now, this picture of a perceptron is not entirely accurate; there's one step I left out. In addition to multiplying all of these inputs by their corresponding weights, we also add what's called a bias term, denoted here as w0, which is just a scalar weight that you can think of as coming with an input of one. The bias allows the network to shift its nonlinear activation function to the left or right, independently of its inputs. On the right-hand side, you can see this diagram formulated mathematically as a single equation, and we can rewrite this equation in linear algebra terms, using vectors and dot products. For example, we can define our entire set of inputs x1 through xm as a large vector X, and that vector X can be dotted with our vector of weights W, containing w1 through wm. Taking their dot product not only multiplies the entries together but also adds the resulting terms up; we then add the bias, as we said before, and apply the nonlinearity.
Now you might be wondering: what is this nonlinear function? I've mentioned it a few times already. Well, it's a function that we pass the output of the neuron through before we hand it to the next neuron in the pipeline. One common example of a nonlinear function that's very popular in deep neural networks is the sigmoid function. You can think of it as a kind of continuous version of a threshold function: it outputs values between zero and one, it can take as input any real number on the real number line, and you can see an example of it illustrated on the bottom right. In fact, there are many types of nonlinear activation functions that are popular in deep neural networks, and here are some common ones. Throughout this presentation you'll see code snippets at the bottom of the slides, where we'll try to tie what you're learning in the lectures to actual software and how you can implement these pieces, which will help you a lot in your software labs. The sigmoid activation on the left is very popular since it outputs values between zero and one, which is especially useful when you want to deal with probability distributions, because probabilities live between 0 and 1. In modern deep neural networks, though, the ReLU function, which you can see on the far right, is a very popular activation function because it's piecewise linear and extremely efficient to compute, especially when computing its derivatives: its derivative is a constant everywhere except at the nonlinearity at zero.
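As a quick reference for the labs, here is a minimal sketch of these activation functions in TensorFlow, applied to some made-up example pre-activation values (the numbers are placeholders for illustration only, not from the slides):

```python
import tensorflow as tf

z = tf.constant([-2.0, 0.0, 3.0])   # example pre-activation values (made up)

sigmoid_out = tf.math.sigmoid(z)    # squashes each value into (0, 1)
tanh_out    = tf.math.tanh(z)       # squashes each value into (-1, 1)
relu_out    = tf.nn.relu(z)         # zeroes out negatives, keeps positives unchanged

print(sigmoid_out.numpy(), tanh_out.numpy(), relu_out.numpy())
```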
Now, I hope all of you are asking yourselves: why do we even need this nonlinear activation function at all? It seems like it just complicates the whole picture when we didn't really need it in the first place. I want to spend a moment answering this. The point of a nonlinear activation function is, of course, to introduce nonlinearities into our model. If we think about our data, almost all real-world data that we care about is highly nonlinear. This is important because, if we want to deal with those types of data sets, we need models that are also nonlinear, so they can capture those same kinds of patterns. Imagine I gave you this data set of red points and green points and asked you to separate the two types of points. You might think this is easy, but what if I told you that you could only use a single line to do so? Well, now it becomes a very complicated problem; in fact, you can't really solve it effectively with a single line. If you introduce nonlinear activation functions into your solution, that's exactly what allows you to deal with these types of problems: nonlinear activation functions let you handle nonlinear data, and that's exactly what makes neural networks so powerful at their core.
So let's understand this with a very simple example, walking through this diagram of a perceptron one more time. Imagine I give you a trained neural network with weights -- now not just the symbols w1 and w2, I'm going to actually give you numbers at these locations: the trained weight w0 is 1, and W is the vector (3, -2). This neural network has two inputs, like we said before: input x1 and input x2. If we want to get its output -- and this is the main thing I want all of you to take away from this lecture today -- there are three steps we need to take: we first multiply our inputs with our weights, then add the results together, and then compute the nonlinearity. It's these three steps that define the forward propagation of information through a perceptron. So let's take a look at how that works. If we plug these numbers into those equations, we can see that everything inside our nonlinearity -- the nonlinearity is g, that function which could be the sigmoid we saw on a previous slide -- is in fact just a two-dimensional line: it has two inputs. And if we consider the space of all possible inputs that this neural network could see, we can actually plot this line as a decision boundary separating the two halves of our input space. In fact, it's not only a boundary; there's also a directionality component depending on which side of the boundary you live on. If we see an input, for example (-1, 2), we know that it lives on one side of the boundary and will have a certain type of output: in this case, plugging those components into our equation gives a negative number inside the nonlinearity, so the sigmoid output will be below one half, and that gets propagated through as well. Of course, if you're on the other side of the space, you'll get the opposite result, and the thresholding behavior essentially lives at this decision boundary: depending on which side of the space you live on, that thresholding sigmoid function controls how you move to one side or the other.
for you on this slide it's only a two-dimensional space so it's very easy for us to visualize
but of course for almost all problems that we care about our data points are not going to
be two-dimensional right if you think about an image the dimensionality of an image is going
to be the number of pixels that you have in the image right so these are going to be thousands
of Dimensions millions of Dimensions or even more and then drawing these types of plots like
you see here is simply not feasible right so we can't always do this but hopefully this gives
you some intuition to understand kind of as we build up into more complex models so now that we
So now that we have an idea of the perceptron, let's see how we can take this single neuron and start building it up into something more complicated: a full neural network, and a model built from that. Let's revisit the previous diagram of the perceptron. Just to reiterate one more time, the core piece of information I want all of you to take away from this class is how a perceptron works and how it propagates information to its decision. There are three steps: first the dot product, second the bias, and third the nonlinearity, and you keep repeating this process for every single perceptron in your neural network. Let's simplify the diagram a little bit: I'll remove the weight labels, and you can assume that every line now has an associated scalar weight corresponding to the input coming in along that line. I've also removed the bias, just for the sake of simplicity, but it's still there. The result is that z -- let's call that the result of our dot product plus the bias -- is what we pass into our nonlinear activation function. The final output is simply g(z), our activation function applied to z, where z is basically what you can think of as the state of this neuron: the result of the dot product plus the bias.
Now, if we want to define a multi-output neural network -- if we want two outputs from this function, for example -- it's a very simple procedure: we just have two neurons, two perceptrons, and each perceptron controls the output for its associated piece. So now we have two outputs; each one is a normal perceptron, and they both take the same inputs. Amazingly, with this mathematical understanding, we can now start to build our first neural network layer entirely from scratch. What does that look like? We can start by initializing the two components we saw: the first is the weight vector -- a vector of weights -- and the second is the bias, which gets added to the dot product of our inputs with our weights. The only remaining step, after we've defined these parameters of our layer, is to define how forward propagation of information works, and that's exactly those three main components I've been stressing. We can create a call function to do exactly that -- to define this forward propagation of information -- and the story here is exactly the same as we've been seeing: matrix multiply our inputs with our weights, add a bias, apply a nonlinearity, and return the result. That code, literally, will define a full neural network layer that you can then use. And luckily for all of you, all of that code -- which wasn't much code -- has been abstracted away by libraries like TensorFlow: you can simply call a function that replicates exactly that piece of code, so you don't need to write it all out yourself; you can just call it.
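As a rough sketch of what that from-scratch layer might look like in TensorFlow (the class name, sizes, and initializers here are my own illustrative choices, not necessarily what appears on the slides), followed by the equivalent one-line library call:

```python
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # initialize this layer's weight matrix and bias vector
        self.W = self.add_weight(shape=(input_dim, output_dim), initializer="random_normal")
        self.b = self.add_weight(shape=(1, output_dim), initializer="zeros")

    def call(self, inputs):
        # forward propagation: matrix multiply, add bias, apply the nonlinearity
        z = tf.matmul(inputs, self.W) + self.b
        return tf.math.sigmoid(z)

# the same thing, abstracted away by the library:
layer = tf.keras.layers.Dense(units=2, activation="sigmoid")
```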
With that understanding, we've just seen how to build a single layer, but of course now you can start to think about how to stack these layers as well. Since we now have this transformation from our inputs to a hidden output, you can think of it as a way of transforming our inputs into some new dimensional space -- perhaps one closer to the value that we want to predict -- and that transformation is eventually going to be learned, so the network knows how to map its inputs onto our desired outputs; we'll get to that later. For now, the piece I really want to focus on is that, even with these more complex neural networks, there is nothing more complicated here than what we've already seen. If we focus on just one neuron in this diagram -- take z2, for example, the neuron highlighted in the middle layer -- it's just the same perceptron we've been seeing so far in this class: its output is obtained by taking a dot product over all of its inputs, adding a bias, and applying the nonlinearity. If we look at a different node, for example z3, the one right below it, it's the exact same story: it sees the same inputs, but it has a different set of weights that it applies to those inputs, so it will have a different output -- but the mathematical equations are exactly the same. From now on, I'm just going to simplify all of these lines in the diagrams and show these symbols in the middle, to indicate that everything is fully connected to everything and defined by the mathematical equations we've been covering -- there's no extra complexity in these models beyond what you've already seen.
Now, if you want to stack these layers on top of each other, you can not only define one layer very easily, you can also create what are called sequential models. In a sequential model, you define one layer after another, and together they define the forward propagation of information not just at the neuron level but at the layer level: every layer is fully connected to the next, and the inputs of each layer are all of the outputs of the prior layer. And if you want to create a very deep neural network, all a deep neural network is is stacking these layers on top of each other, over and over -- there's nothing more to the story; it's really as simple as that. You just keep stacking layers until you get to the last one, the output layer, which produces the final prediction that you want. We can create a deep neural network to do all of this by stacking these layers, forming the kind of hierarchical model we saw very early in today's lecture, where the final output is computed by going deeper and deeper into this hierarchy of layers.
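Here is a minimal sketch of that stacking in TensorFlow (the layer sizes, e.g. 32 units, are arbitrary placeholders rather than values from the lecture):

```python
import tensorflow as tf

# a deep neural network: dense layers stacked one after another,
# ending in an output layer with as many units as outputs we want to predict
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2)   # output layer: 2 outputs in this example
])
```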
Okay, so that's awesome. We've now seen how we can go from a single neuron, to a layer, all the way to a deep neural network, building off these foundational principles. Let's take a look at how we can use the principles we've just discussed to solve a very real problem that I think all of you were probably very concerned about when you woke up this morning: will I pass this class? To answer this question, let's see if we can train a neural network to solve the problem. Let's start with a very simple neural network with just two inputs: one input is the number of lectures that you attend over the course of this one week, and the second input is how many hours you spend on your final project or competition. What we're going to do is first go out and collect a lot of data from all of the past years that we've taught this course, and because it's only a two-input space, we can plot this data on a two-dimensional feature space. We can look at all of the students before you who passed the class and who failed the class, and see where they lived in this space of hours spent and lectures attended: green points are the people who passed, red are those who failed. And here's you: your coordinates are four and five -- you've attended four lectures and spent five hours on your final project -- and you fall right there. We want to build a neural network to answer the question: will you pass the class? So let's do it.
We have two inputs: one is four, one is five. These are two numbers, and we can feed them through a neural network like the one we've just seen how to build -- a single-layer neural network with three hidden units in this example, though we could make it larger if we wanted it to be more expressive and more powerful. And we see here that the predicted probability of you passing this class is 0.1, which is pretty dismal. So why would this be the case? What did we do wrong? Because I don't think it's correct: when we looked at the feature space, it looked like you were actually a good candidate to pass the class. So why is the neural network saying there's only a 10% likelihood that you'll pass? Does anyone have any ideas? Exactly -- this neural network was just born; it has no information about the world or about this class. It doesn't know what four and five mean, or what the notion of passing or failing means. This neural network has not been trained; you can think of it as a baby that hasn't learned anything yet. So our job is first to train it, and part of that is that we first need to tell the neural network when it makes mistakes. Mathematically, we need to answer the question: did my neural network make a mistake, and if so, how big was that mistake -- so that the next time it sees this data point, it can do better and minimize that mistake.
In neural network language, those mistakes are called losses. Specifically, you want to define what's called a loss function, which takes as input your prediction and the true answer; how far your prediction is from the true answer tells you how big of a loss there is. Before going further, let me give you some terminology, because there are multiple ways of saying the same thing in neural networks and deep learning: what I just described as a loss function is also commonly referred to as an objective function, an empirical risk, or a cost function. These are all exactly the same thing -- they're all a way for us to train the neural network, to teach it when it makes mistakes. And what we ultimately want to do is minimize not the mistake on just one data point, but the mistakes averaged over the entire data set. So if we look at the problem like I said -- binary classification, will I pass this class or not, a yes-or-no answer -- we can use what's called the softmax cross-entropy loss. For those of you who aren't familiar, this notion of cross-entropy was actually developed here at MIT by Claude Shannon, who was a visionary; he did his master's here over 50 years ago, and he introduced this notion of cross-entropy, which has been pivotal in our ability to train these types of neural networks even now, into the future. Now, instead of predicting a binary output, what if we wanted to predict a final grade, your class score, for example? That's no longer a binary yes-or-no output; it's a continuous variable -- the grade, let's say out of 100 points, on your class project. For this type of problem we can use what's called a mean squared error loss: you can think of it literally as subtracting your predicted grade from the true grade, squaring it, and minimizing that distance.
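As a hedged sketch of how these two kinds of losses look in TensorFlow (the predicted and true values below are made-up placeholders, and I'm using the binary cross-entropy variant here since the class example has a single yes/no output):

```python
import tensorflow as tf

# binary classification (will I pass the class?): cross-entropy loss
y_true_class = tf.constant([[1.0]])    # true label: passed
y_pred_class = tf.constant([[0.1]])    # predicted probability of passing
bce = tf.keras.losses.BinaryCrossentropy()
classification_loss = bce(y_true_class, y_pred_class)

# continuous prediction (what will my grade be?): mean squared error loss
y_true_grade = tf.constant([[90.0]])   # true grade
y_pred_grade = tf.constant([[75.0]])   # predicted grade
mse = tf.keras.losses.MeanSquaredError()
regression_loss = mse(y_true_grade, y_pred_grade)
```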
So I think now we're ready to put all of this information together and tackle the problem of training a neural network: not just quantifying how erroneous it is, how large its loss is, but more importantly minimizing that loss as a function of all the training data it observes. We want to find the neural network, like we mentioned before, that minimizes this empirical risk, this empirical loss, averaged across our entire data set. Mathematically, this means we want to find the weights W that minimize J(W), where J(W) is our loss function averaged over the entire data set and W is the set of our weights. We want to find the set of weights that, on average, gives us the smallest possible loss. Remember that W here is just a collection of all of the weights in our neural network -- you may have hundreds of weights in a very small neural network, or billions or trillions of weights in today's neural networks -- and you want to find the value of every single one of these weights that results in the smallest loss possible.
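Written out as an equation, reconstructed from the description above (with n training examples x^(i), labels y^(i), per-example loss L, and network prediction f(x^(i); W)):

```latex
\mathbf{W}^{*} \;=\; \arg\min_{\mathbf{W}} \; J(\mathbf{W})
\;=\; \arg\min_{\mathbf{W}} \; \frac{1}{n} \sum_{i=1}^{n}
\mathcal{L}\!\left( f\!\left(x^{(i)}; \mathbf{W}\right),\; y^{(i)} \right)
```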
Now, how can you do this? Remember that our loss function J(W) is just a function of our weights, so for any instantiation of the weights we can compute a scalar value telling us how erroneous our neural network would be with those weights. Let's visualize this in a very simple example: a two-dimensional space where we have only two weights, an extremely small neural network, and we want to find the optimal weights to train it. We can plot the loss -- how erroneous the neural network is -- for every single instantiation of these two weights. It's a huge, in fact infinite, space, but we have a function we can evaluate at every point in it. What we ultimately want to do is find the set of weights that gives us the smallest loss possible; that means the lowest point on this landscape that you can see here -- where are the weights that bring us to that lowest point? The way we do this is by first starting at a random place -- we have no idea where to start, so pick a random place in this space and start there. At that location we can evaluate our neural network and compute the loss, and on top of that we can compute how the loss is changing: we can compute the gradient of the loss, because our loss function is a continuous function, so we can take derivatives of it across the space of our weights. The gradient tells us the direction of steepest ascent -- from where we stand, the gradient tells us which way to go to increase the loss. Of course, we don't want to increase the loss, we want to decrease it, so we negate the gradient and take a step in the opposite direction. That brings us one step closer to the bottom of the landscape, and we just keep repeating this process over and over: evaluate the neural network at the new location, compute its gradient, and step in that new direction, traversing the landscape until we converge to a minimum.
We can summarize this algorithm, which is known formally as gradient descent, like this: we initialize all of our weights -- this can be two weights like in the previous example, or billions of weights in real neural networks -- then we compute the gradient, the partial derivative of our loss with respect to the weights, and we update the weights a small step in the opposite direction of that gradient. That small step, denoted here as eta, is commonly referred to as the learning rate: it's how much we want to trust the gradient and step in its direction. We'll talk more about this later, but just to give you some sense of the code, this algorithm translates very directly into real code: for every line of the pseudocode you can see on the left, there is corresponding real code on the right that is runnable and directly implementable by all of you in your labs.
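Here is a hedged sketch of how that loop might look in TensorFlow (the `compute_loss` function is a placeholder you would replace with your own network and data, and the shapes, step count, and learning rate are illustrative, not values from the slides):

```python
import tensorflow as tf

weights = tf.Variable(tf.random.normal(shape=(2, 1)))   # initialize weights randomly
lr = 0.01                                                # learning rate (the eta above)

def compute_loss(w):
    # placeholder loss: a real problem would run the network on data here
    return tf.reduce_sum(w ** 2)

for step in range(1000):                      # loop until (approximate) convergence
    with tf.GradientTape() as tape:
        loss = compute_loss(weights)
    gradient = tape.gradient(loss, weights)   # dJ/dW, computed by backpropagation
    weights.assign_sub(lr * gradient)         # step in the opposite direction of the gradient
```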
But now let's look specifically at this term: the gradient. We touched on it very briefly in the visual example; it explains, like I said, how the loss is changing as a function of the weights -- as the weights move around, will my loss increase or decrease? -- and that tells the neural network whether it needs to move its weights in a certain direction or not. But I never actually told you how to compute it, and that's an extremely important part, because if you don't know that, you can't train your neural network. This is a critical part of training neural networks, and the process of computing this gradient is known as backpropagation. So let's do a very quick intro to backpropagation and how it works. Again, let's start with the simplest neural network in existence: one input, one output, and only one hidden neuron -- this is as simple as it gets. We want to compute the gradient of our loss with respect to a weight; in this case, let's compute it with respect to w2, the second weight. This derivative tells us how much a small change in that weight will affect our loss -- if we change the weight a little bit in one direction, will the loss increase or decrease? To compute it, we can write out this derivative by applying the chain rule backwards from the loss function through the output: specifically, we can decompose the derivative into two components, the derivative of our loss with respect to our output, multiplied by the derivative of our output with respect to w2. This is just a standard application of the chain rule to the original derivative on the left-hand side. Now suppose we wanted to compute the gradient for the weight before that -- not w2 but w1. All we do is replace w2 with w1, and the chain rule still holds; but now, for that last component, shown in red, we have to recursively apply the chain rule one more time, because it's another derivative that we can't directly evaluate. So we expand it once more with another application of the chain rule.
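The two chain-rule expansions just described, written out (here y-hat is the network's output, z1 is the hidden-unit state, and J(W) is the loss; this is a reconstruction of the equations described verbally above):

```latex
\frac{\partial J(\mathbf{W})}{\partial w_2}
  = \frac{\partial J(\mathbf{W})}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2},
\qquad
\frac{\partial J(\mathbf{W})}{\partial w_1}
  = \frac{\partial J(\mathbf{W})}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}
```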
Now all of these components let us directly propagate the gradients through the hidden units of our neural network, all the way back to the weight we're interested in. So we first computed the derivative with respect to w2, then we backpropagated that and used that information for w1 as well -- that's why it's called backpropagation: this process runs from the output all the way back to the input. We repeat this process many, many times over the course of training, propagating these gradients over and over through the network, from the output to the inputs, to determine, for every single weight, how much a small change in that weight affects the loss function -- whether it increases it or decreases it -- and how we can use that to improve the loss, because that's our ultimate goal in this class.
So that's the backpropagation algorithm -- that's the core of training neural networks. In theory it's very simple; it's really just an application of the chain rule. But let's touch on some insights that make training neural networks extremely complicated in practice, even though the algorithm of backpropagation is simple and many decades old. In practice, optimization of neural networks looks something like this -- nothing like the picture I showed you before. There are ways to visualize the loss landscapes of very large, deep neural networks, and this is an illustration from a paper that came out several years ago in which the authors tried to visualize the landscape of a very deep neural network. That is what the landscape actually looks like, that's what you're trying to deal with and find the minimum in, and you can imagine the challenges that come with that. To cover those challenges, let's first recall the update equation defined in gradient descent. I didn't talk much about this parameter eta, but now let's spend a bit of time on it. It's called the learning rate, like we saw before, and it determines how big of a step we take in the direction of the gradient on every single iteration of backpropagation.
In practice, even setting the learning rate can be very challenging. You, as the designer of the neural network, have to set this value, and how do you pick it? It can actually be quite difficult, and it has really large consequences when building a neural network. For example, if we set the learning rate too low, then we learn very slowly: assume we start on the right-hand side at that initial guess; if the learning rate is not large enough, not only do we converge slowly, we may not even converge to the global minimum, because we get stuck in a local minimum. What if we set the learning rate too high? Then we can overshoot and actually start to diverge from the solution -- the gradients can explode, very bad things happen, and the neural network doesn't train. So that's not good either. In reality, there's a happy medium between too small and too large, where you set it just large enough to overshoot some of these local minima, put yourself in a reasonable part of the search space, and then converge on the solutions that you care most about. But how do you actually set these learning rates in practice -- how do you pick the ideal learning rate? One option, and this is a very common option in practice, is simply to try out a bunch of learning rates and see what works best: try a whole grid of different learning rates, train all of those neural networks, and see which one works the best. But I think we can do something a lot smarter. What are some more intelligent ways to do this? Instead of exhaustively trying a whole bunch of different learning rates, can we design a learning rate algorithm that adapts to our neural network and to its landscape, so that it's a bit more intelligent than the previous idea?
This ultimately means that the learning rate -- the degree to which the algorithm trusts the gradients it sees -- is going to depend on how large the gradient is at that location, how fast we're learning, and many other factors that we might consider as part of training neural networks. It's not only how quickly we're learning; you might judge it on many different aspects of the learning landscape. In fact, these adaptive learning rate algorithms have been very widely studied in practice: there is a thriving community in deep learning research that focuses on developing and designing new algorithms for learning rate adaptation and faster optimization of large neural networks like these. During your labs you'll get the opportunity not only to try out a lot of these different adaptive algorithms, which you can see here, but also to try to uncover the patterns and benefits of one versus the other, and I think that's something you'll find very insightful as part of your labs.
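For reference, here is a sketch of how some of these optimizers are instantiated in TensorFlow (the learning rates shown are arbitrary illustrative values, not recommendations from the lecture):

```python
import tensorflow as tf

sgd      = tf.keras.optimizers.SGD(learning_rate=0.01)     # plain (stochastic) gradient descent
adam     = tf.keras.optimizers.Adam(learning_rate=0.001)   # adaptive moment estimation
adadelta = tf.keras.optimizers.Adadelta()                  # adaptive learning rate method
adagrad  = tf.keras.optimizers.Adagrad()                   # per-parameter adaptive rates
rmsprop  = tf.keras.optimizers.RMSprop()                   # running average of squared gradients
```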
Another key component of your labs is how you can put all of the information we've covered today into a single picture that looks roughly like this. At the top, you define your model -- we talked about that in the beginning part of the lecture. For your model, you then need to define this optimizer, which we've just talked about; the optimizer is defined together with a learning rate, determining how quickly you want to move over your loss landscape. Then, over many loops, you pass over all of the examples in your data set, observe how to improve your network -- that's the gradient -- and then actually improve the network in those directions, and you keep doing that over and over again until eventually your neural network converges to some solution.
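Putting those pieces together, here is a hedged sketch of what such a training loop might look like in TensorFlow (the data, model sizes, loss, epoch count, and batch size are all placeholders standing in for whatever problem you're working on):

```python
import numpy as np
import tensorflow as tf

# placeholder data standing in for a real training set
x_train = np.random.rand(256, 2).astype("float32")
y_train = np.random.rand(256, 1).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)   # optimizer with its learning rate
loss_fn = tf.keras.losses.MeanSquaredError()              # placeholder loss

for epoch in range(10):                     # loop over the data set many times
    for x_batch, y_batch in dataset:
        with tf.GradientTape() as tape:
            predictions = model(x_batch)              # forward pass through the model
            loss = loss_fn(y_batch, predictions)      # how wrong is the network?
        grads = tape.gradient(loss, model.trainable_variables)             # backpropagation
        optimizer.apply_gradients(zip(grads, model.trainable_variables))   # improve the weights
```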
In the remaining time we have, I want to very briefly continue with tips for training these neural networks in practice, and focus on the very powerful idea of batching your data into what are called mini-batches of smaller pieces of data. To do this, let's revisit the gradient descent algorithm. The gradient we talked about before is actually extraordinarily expensive to compute, because it's computed as a summation across all of the data points in your data set, and in most real-world problems it's simply not feasible to compute a gradient over your entire data set -- data sets are just too large these days. So what are the alternatives? Instead of computing the gradient across your entire data set, what if you computed it over just a single example in your data set, just one example? Well, this estimate of your gradient is going to be exactly that -- an estimate. It will be very noisy; it may roughly reflect the trends of your entire data set, but because it's only one example out of that data set, it can be very noisy. The advantage, though, is that it's obviously much faster to compute the gradient over a single example -- computationally this has huge advantages -- but the downside is that it's extremely stochastic. That's why this algorithm is not called gradient descent; it's called stochastic gradient descent.
Now, what's the middle ground? Instead of computing the gradient with respect to one example in your data set, what if we computed it over what's called a mini-batch of examples -- a small batch of examples that we compute the gradients over? These gradients are still computationally efficient to compute, because a mini-batch is not too large -- we're talking on the order of tens or hundreds of examples -- but, more importantly, because we've expanded from a single example to maybe a hundred examples, the stochasticity is significantly reduced and the accuracy of our gradient estimate is much improved. Normally we're thinking of mini-batch sizes roughly on the order of tens or hundreds of data points. This is much faster to compute than full gradient descent and much more accurate than stochastic gradient descent with its single-example estimate. This increase in gradient accuracy allows us to converge to our solution much more quickly than would be possible with the limitations of plain gradient descent. It also means that we can increase our learning rate, because we can trust each of those gradients much more -- we're now averaging over a batch, which is going to be much more accurate than the stochastic version -- so we can increase the learning rate and learn faster as well. And it allows us to massively parallelize this entire computation: we can split batches up onto separate workers and achieve even more significant speed-ups of this entire problem using GPUs.
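As a minimal sketch of how you might form those mini-batches in TensorFlow (the arrays below are placeholder data standing in for a real training set, and the batch size of 100 simply mirrors the ballpark mentioned above):

```python
import numpy as np
import tensorflow as tf

# placeholder data standing in for a real training set
x_train = np.random.rand(1000, 2).astype("float32")
y_train = np.random.rand(1000, 1).astype("float32")

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shuffle(buffer_size=1000)   # shuffle so each mini-batch is a random sample
dataset = dataset.batch(100)                  # mini-batches on the order of ~100 examples

for x_batch, y_batch in dataset:
    pass  # compute the gradient on this mini-batch and take a step, as sketched earlier
```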
The last topic I want to very briefly cover in today's lecture is overfitting. When we're optimizing a neural network with stochastic gradient descent, we have this challenge called overfitting, which looks roughly like this. We want to build a neural network -- or, in general, a machine learning model -- that can accurately describe some patterns in our data. But remember, we ultimately don't want to describe the patterns in our training data; ideally, we want to capture the patterns in our test data. Of course, we don't observe test data; we only observe training data. So we have this challenge of extracting patterns from training data and hoping that they generalize to test data. Said a different way: we want to build models that learn representations from the training data that still generalize even when shown brand new, unseen test data. So suppose you want to fit a line that describes, or finds, the patterns in the points you see on this slide. If you have a very simple model, which is just a single straight line, you'll describe this data sub-optimally -- the data here is nonlinear, so you won't accurately capture all of the nuances and subtleties in the data set; that's the left-hand side. If you move to the right-hand side, you see a much more complicated model, but here you're actually too expressive: you're capturing the spurious nuances in your training data that are not representative of your test data. Ideally, you want to end up with the model in the middle, which is the middle ground: not too complex and not too simple, it still performs well even when you give it brand new data.
To address this problem, let's briefly talk about what's called regularization. Regularization is a technique that you can introduce into your training pipeline to discourage complex models from being learned. As we've seen, this is really critical because neural networks are extremely large models and are extremely prone to overfitting, so regularization techniques have major implications for the success of neural networks and for having them generalize beyond the training data, far into the testing domain. The most popular technique for regularization in deep learning is called dropout, and the idea of dropout is actually very simple. Let's revisit the picture of deep neural networks that we drew earlier in today's lecture. In dropout, during training, we randomly select some subset of the neurons in the network and prune them out with some probability: for example, we can select a subset of neurons with a probability of 50 percent, and with that probability randomly turn them off or on on different iterations of training. This essentially forces the neural network to learn what you can think of as an ensemble of different models: on every iteration it's exposed to a different internal model than the one it had on the last iteration, so it has to learn to build internal pathways to process the same information, and it can't rely on information it learned on previous iterations. This forces it to capture some deeper meaning within the pathways of the neural network, and it can be extremely powerful: number one, it lowers the capacity of the neural network significantly -- roughly by 50 percent in this example -- and it also makes the network easier to train, because the number of weights that have gradients on any given iteration is reduced, so each iteration is faster as well. And like I mentioned, on every iteration we randomly drop out a different set of neurons, and that helps the network generalize better.
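A minimal sketch of how dropout is added between layers in TensorFlow (the 0.5 rate matches the 50 percent example above, while the layer sizes are placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # randomly zero 50% of this layer's activations during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # a different random subset is dropped on every iteration
    tf.keras.layers.Dense(1)
])
```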
The second regularization technique, which is actually a very broad technique that goes far beyond neural networks, is simply called early stopping. We know that the definition of overfitting is simply when our model starts to represent the training data more than the testing data -- that's really what overfitting comes down to at its core. If we set aside some of the training data that we don't train on, we can use it as a kind of synthetic test set and monitor how our network is doing on this unseen portion of data. For example, over the course of training we can plot the performance of our network on both the training set and this held-out test set. As the network trains, we'll see that at first both losses decrease, but there comes a point where the test loss plateaus and then starts to increase, while the training loss keeps decreasing. That is exactly the point where you start to overfit, because you're now starting to fit to the particulars of your training data. This pattern continues for the rest of training, and this middle point is where we need to stop training, because after this point -- assuming the held-out set is a valid representation of the true test set -- the accuracy of the model on new data will only get worse. So this is where we would want to early-stop our model and regularize its performance. We can also see that stopping any time before this point is not ideal either: we'd produce an underfit model when we could have had a better model on the test data. So it's a trade-off: you can't stop too late, and you can't stop too early.
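In TensorFlow's high-level API, this idea is available as a callback; here is a hedged sketch (the patience value and the monitored quantity are illustrative choices, not values from the lecture):

```python
import tensorflow as tf

# stop training when the held-out (validation) loss has stopped improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the loss on the held-out data, not the training loss
    patience=5,                 # allow a few epochs of no improvement before stopping
    restore_best_weights=True   # roll back to the weights from the best epoch
)

# usage with a compiled model (model, x_train, y_train defined elsewhere):
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```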
So I'll conclude this lecture by summarizing the three key points we've covered so far. First, we covered the fundamental building block of all neural networks: the single neuron, the perceptron. We built these up into larger neural layers, and from there into neural networks and deep neural networks. And we learned how to train them: how to apply them to data sets, backpropagate through them, and use some tips and tricks for optimizing these systems end to end. In the next lecture, we'll hear from Ava on deep sequence modeling using RNNs, and specifically a very exciting new type of model called the Transformer architecture, built on attention mechanisms. So let's resume the class in about five minutes, after we have a chance to swap speakers. Thank you so much for your attention.