Good afternoon everyone!
Thank you all for joining today. My name is Alexander Amini, and I'll be one of your course organizers this year along with Ava -- and together we're super excited to introduce you all to Introduction to Deep Learning. MIT Intro to Deep Learning is a really fun, exciting, and fast-paced program here at MIT, so let me start by giving you a little bit of background on what we do and what you're going to learn this year. This week of Intro to Deep Learning we're going to cover a ton of material in just one week. You'll learn the foundations of this fascinating and exciting field of deep learning and artificial intelligence, and, more importantly, you're going to get hands-on experience reinforcing what you learn in the lectures as part of hands-on software labs.
Now, over the past decade, AI and deep learning have had a huge resurgence and many incredible successes, and a lot of problems that even just a decade ago we thought were not really solvable in the near future we're now solving with deep learning with incredible ease. This past year in particular, 2022, has been an incredible year for deep learning progress, and I like to say that this past year has been the year of generative deep learning: using deep learning to generate brand new types of data that have never been seen before and never existed in reality. In fact, I want to start this class by showing you how we started it several years ago, which was by playing a video that I'll play in a second. This video was an introductory video for the class, and it exemplifies the idea I'm talking about. So let me just stop there and play this video first of all: Hi everybody and welcome to MIT 6.S191 -- the official introductory course on deep learning taught here at MIT.
Deep Learning is revolutionizing so many fields: from robotics to
medicine and everything in between. You'll learn the fundamentals of
this field and how you can build some of these incredible algorithms.
In fact, this entire speech and video are not real and were created using deep
learning and artificial intelligence. And in this class you'll learn how. It has been an honor to speak with you
today and I hope you enjoy the course. So, in case you couldn't tell, that video and its entire audio were not real -- they were synthetically generated by a deep learning algorithm. We created this video several years ago, and even then, when we introduced it and put it on YouTube, it went somewhat viral. People really loved this video; they were intrigued by how real the video and audio felt and looked while being entirely generated by an algorithm, by a computer, and people were shocked by the power and realism of these types of approaches. And that was a few years ago. Now fast forward to today and the state of deep learning: we have seen deep learning accelerating at a rate faster than we've ever seen before. In fact, we can now use deep learning to generate not just images of faces but full synthetic environments, where we can train autonomous vehicles entirely in simulation and deploy them on full-scale vehicles in the real world, seamlessly. The videos you see here are actually from a data-driven, neural-network-based simulator called VISTA that we built here at MIT and have open-sourced to the public, so all of you can train and build the future of autonomy and self-driving cars.
And of course it goes far beyond this as well. Deep learning can be used to generate content directly from how we speak and the language that we convey to it. From prompts that we give it, deep learning can reason about the prompts in natural language -- English, for example -- and then guide and control what is generated according to what we specify. We've seen examples where we can generate things that, again, have never existed in reality: we can ask a neural network to generate a photo of an astronaut riding a horse, and it can actually imagine, or hallucinate, what this might look like, even though not only has this photo never occurred before, I don't think any photo of an astronaut riding a horse has ever occurred before, so there isn't really even training data to go off of in this case. And my personal favorite is how we can not only build software that can generate images and videos, but also build software that can generate software. We can have algorithms that take language prompts -- for example, a prompt like "write code in TensorFlow to train a neural network" -- and not only will it write the code and create that neural network, it will have the ability to reason about the code it has generated and walk you through it step by step, explaining the process and procedure from the ground up so that you can learn how to do this yourself.
Now, I think some of these examples really highlight how far deep learning and these methods have come in the past six years since we started this course -- you saw that example from the introductory video just a few years ago, and now we're seeing such incredible advances. The most amazing part of this course, in my opinion, is that within this one week we're going to take you, from the ground up starting today, through all of the foundational building blocks that will allow you to understand and make all of these amazing advances possible. So with that, hopefully you're all super excited about what this class will teach. I want to start by taking a step back and defining some of the terminology I've been throwing around so far: deep learning, artificial intelligence -- what do these things actually mean?
First of all, I want to take a second to speak about intelligence and what intelligence means at its core. To me, intelligence is simply the ability to process information such that we can use it to inform some future decision or action that we take. The field of artificial intelligence is simply the ability for us to build algorithms -- artificial algorithms -- that can do exactly this: process information to inform some future decision. Machine learning is a subset of AI which focuses specifically on how we can build, or teach, a machine to do this from some experience or data. Deep learning goes one step beyond this: it is a subset of machine learning which focuses explicitly on what are called neural networks, and how we can build neural networks that can extract features in the data -- basically what you can think of as patterns that occur within the data -- so that the machine can learn to complete these tasks as well. And that's exactly what this class is all about at its core: we're going to try to give you the foundational understanding of how we can build and teach computers to learn many different types of tasks directly from raw data. That's really what this class boils down to in its simplest form.
And we'll provide a very solid foundation for you, both on the technical side through the lectures -- which will happen in two parts each day, a first lecture and a second lecture, each about one hour long -- followed by a software lab which will immediately follow the lectures and will reinforce a lot of what we cover in the technical part of the class, giving you hands-on experience implementing those ideas. So this program is split between these two pieces: the technical lectures and the software labs. We have several new updates this year, especially in many of the later lectures. The first lecture will cover the foundations of deep learning, which is what we're doing right now, and finally we'll conclude the course with some very exciting guest lectures from both academia and industry, from people who are really leading and driving forward the state of AI and deep learning. And of course we have many awesome prizes that go with all of the software labs and the project competition at the end of the course.
So, to quickly go through these: each day, like I said, we'll have dedicated software labs that couple with the lectures. Starting today with lab one, and keeping with this theme of generative AI, you'll actually build a neural network that can listen to a lot of music and learn how to generate brand new songs in that genre. At the end of the class, on Friday, we'll host a project pitch competition where you, either individually or as part of a group, can participate and present a novel deep learning idea to all of us. It'll be roughly three minutes in length, and because this is a one-week program we're not going to focus so much on the results of your pitch, but rather on the innovation and the novelty of what you're trying to propose. The prizes here are quite significant: first prize is going to be an NVIDIA GPU, which is a key piece of hardware that is instrumental if you want to actually build a deep learning project and train these neural networks, which can be very large and require a lot of compute. These prizes will give you the compute to do so.
And finally, this year we'll be awarding a grand prize for labs two and three combined, which will occur on Tuesday and Wednesday, focused on what I believe are some of the most exciting problems in this field: specifically, how we can build models that are not only accurate but also robust, trustworthy, and safe when they're deployed. You'll actually get experience developing those types of solutions that can advance the state of the art in AI. All of the labs and competitions I mentioned are going to be due on Thursday night at 11 PM, right before the last day of class, and we'll be helping you all along the way. This competition in particular has very significant prizes, so I encourage all of you to enter and get a chance to win. And like I said, we're going to be helping you all along the way; there are many resources available throughout this class to help you. Please post to Piazza if you have any questions, and of course this program has an incredible team that you can reach out to at any point in case you have any issues or questions about the materials. Myself and Ava will be your two main lecturers for the first part of the class, and like I said, in the later part of the class we'll also hear from some guest lecturers who will share some really cutting-edge, state-of-the-art developments in deep learning. And I want to give a huge shout-out and thanks to all of our sponsors, without whose support this program wouldn't have been possible, yet again, for another year. So thank you all.
Okay, so now with that, let's really dive into the fun stuff of today's lecture: the technical part. I want to start this part by having you ask yourselves: why are all of you here? Why do you care about this topic in the first place? To answer that question, we have to take a step back and think about the history of machine learning, what machine learning is, and what deep learning brings to the table on top of it. Traditional machine learning algorithms typically define sets of features in the data -- you can think of these as certain patterns in the data -- and usually these features are hand-engineered: a human with a lot of domain knowledge and experience looks at the data set and tries to uncover what these features might be. The key idea of deep learning, and this is really central to this class, is that instead of having a human define these features, what if we could have a machine look at all of this data and actually try to extract and uncover the core patterns in the data, so that it can use those patterns when it sees new data to make decisions?
For example, if we wanted to detect faces in an image, a deep neural network might actually learn that, in order to detect a face, it first has to detect things like lines and edges in the image; when you combine those lines and edges you can create compositions of features like corners and curves; and when you combine those, you can create higher-level features, for example eyes and noses and ears; and those are the features that ultimately allow you to detect what you care about detecting, which is the face. All of this comes from what is called a hierarchical learning of features, and you can actually see some examples of these -- these are real features learned by a neural network, and how they're combined defines this progression of information. But in fact, what I just described -- this underlying, fundamental building block of neural networks and deep learning -- has actually existed for decades.
So why are we studying all of this now, in this class, with all of this great enthusiasm to learn it? Well, there have been several key advances that have occurred in the past decade. Number one: data is so much more pervasive than it has ever been before in our lifetimes. These models are hungry for data, and we're living in the age of big data -- more data is available to these models than ever before, and they thrive off of that. Secondly, these algorithms are massively parallelizable and require a lot of compute, and we're at a unique time in history where we have the ability to train these extremely large-scale algorithms -- techniques that have existed for a very long time but that we can only now train, thanks to the hardware advances that have been made. And finally, due to open-source tooling and software platforms like TensorFlow, which all of you will get a lot of experience with in this class, training and building the code for these neural networks has never been easier. So from the software point of view as well, there have been incredible advances in open-sourcing the underlying fundamentals of what you're going to learn.
So let me start now by building up, from the ground up, the fundamental building block of every single neural network that you're going to learn about in this class, and that's a single neuron. In neural network language, a single neuron is called a perceptron. So what is a perceptron? A perceptron is, like I said, a single neuron, and it's actually a very simple idea, so I want to make sure that everyone in the audience understands exactly what a perceptron is and how it works. Let's start by defining a perceptron as taking as input a set of inputs. On the left-hand side, you can see this perceptron takes m different inputs, 1 through m -- these are the blue circles -- and we denote these inputs as x's. Each of these inputs is then multiplied by a corresponding weight, which we call w -- so x1 is multiplied by w1 -- and we add the results of all of these multiplications together. Now we take that single number after the addition and pass it through what we call a nonlinear activation function, and that produces the final output of the perceptron, which we call y. Now, this picture of a perceptron is not entirely accurate; there's one step I left out. In addition to multiplying all of these inputs by their corresponding weights, we also add what's called a bias term, denoted here as w0, which is just a scalar weight that you can think of as coming with an input of one. The bias allows the network to shift its nonlinear activation function to the left or right, independently of its inputs. On the right-hand side, you can see this diagram formulated mathematically as a single equation, and we can rewrite this equation in linear algebra terms, using vectors and dot products. For example, we can define our entire set of inputs x1 through xm as a large vector X, and that vector X can be dotted with our vector of weights W, containing w1 through wm. Taking their dot product not only multiplies the entries together but also adds the resulting terms up; we then add the bias, as we said before, and apply the nonlinearity.
Now you might be wondering: what is this nonlinear function? I've mentioned it a few times already. Well, it's a function that we pass the output of the neuron through before we hand it to the next neuron in the pipeline. One common example of a nonlinear function that's very popular in deep neural networks is the sigmoid function. You can think of it as a kind of continuous version of a threshold function: it outputs values between zero and one, it can take as input any real number on the real number line, and you can see an example of it illustrated on the bottom right. In fact, there are many types of nonlinear activation functions that are popular in deep neural networks, and here are some common ones. Throughout this presentation you'll see code snippets at the bottom of the slides, where we'll try to tie what you're learning in the lectures to actual software and how you can implement these pieces, which will help you a lot in your software labs. The sigmoid activation on the left is very popular since it outputs values between zero and one, which is especially useful when you want to deal with probability distributions, because probabilities live between 0 and 1. In modern deep neural networks, though, the ReLU function, which you can see on the far right, is a very popular activation function because it's piecewise linear and extremely efficient to compute, especially when computing its derivatives: its derivative is a constant everywhere except at the nonlinearity at zero.
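As a quick reference for the labs, here is a minimal sketch of these activation functions in TensorFlow, applied to some made-up example pre-activation values (the numbers are placeholders for illustration only, not from the slides):

```python
import tensorflow as tf

z = tf.constant([-2.0, 0.0, 3.0])   # example pre-activation values (made up)

sigmoid_out = tf.math.sigmoid(z)    # squashes each value into (0, 1)
tanh_out    = tf.math.tanh(z)       # squashes each value into (-1, 1)
relu_out    = tf.nn.relu(z)         # zeroes out negatives, keeps positives unchanged

print(sigmoid_out.numpy(), tanh_out.numpy(), relu_out.numpy())
```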
Now, I hope all of you are asking yourselves: why do we even need this nonlinear activation function at all? It seems like it just complicates the whole picture when we didn't really need it in the first place. I want to spend a moment answering this. The point of a nonlinear activation function is, of course, to introduce nonlinearities into our model. If we think about our data, almost all real-world data that we care about is highly nonlinear. This is important because, if we want to deal with those types of data sets, we need models that are also nonlinear, so they can capture those same kinds of patterns. Imagine I gave you this data set of red points and green points and asked you to separate the two types of points. You might think this is easy, but what if I told you that you could only use a single line to do so? Well, now it becomes a very complicated problem; in fact, you can't really solve it effectively with a single line. If you introduce nonlinear activation functions into your solution, that's exactly what allows you to deal with these types of problems: nonlinear activation functions let you handle nonlinear data, and that's exactly what makes neural networks so powerful at their core.
So let's understand this with a very simple example, walking through this diagram of a perceptron one more time. Imagine I give you a trained neural network with weights -- now not just the symbols w1 and w2, I'm going to actually give you numbers at these locations: the trained weight w0 is 1, and W is the vector (3, -2). This neural network has two inputs, like we said before: input x1 and input x2. If we want to get its output -- and this is the main thing I want all of you to take away from this lecture today -- there are three steps we need to take: we first multiply our inputs with our weights, then add the results together, and then compute the nonlinearity. It's these three steps that define the forward propagation of information through a perceptron. So let's take a look at how that works. If we plug these numbers into those equations, we can see that everything inside our nonlinearity -- the nonlinearity is g, that function which could be the sigmoid we saw on a previous slide -- is in fact just a two-dimensional line: it has two inputs. And if we consider the space of all possible inputs that this neural network could see, we can actually plot this line as a decision boundary separating the two halves of our input space. In fact, it's not only a boundary; there's also a directionality component depending on which side of the boundary you live on. If we see an input, for example (-1, 2), we know that it lives on one side of the boundary and will have a certain type of output: in this case, plugging those components into our equation gives a negative number inside the nonlinearity, so the sigmoid output will be below one half, and that gets propagated through as well. Of course, if you're on the other side of the space, you'll get the opposite result, and the thresholding behavior essentially lives at this decision boundary: depending on which side of the space you live on, that thresholding sigmoid function controls how you move to one side or the other.
for you on this slide it's only a two-dimensional space so it's very easy for us to visualize
but of course for almost all problems that we care about our data points are not going to
be two-dimensional right if you think about an image the dimensionality of an image is going
to be the number of pixels that you have in the image right so these are going to be thousands
of Dimensions millions of Dimensions or even more and then drawing these types of plots like
you see here is simply not feasible right so we can't always do this but hopefully this gives
you some intuition to understand kind of as we build up into more complex models so now that we
So now that we have an idea of the perceptron, let's see how we can take this single neuron and start building it up into something more complicated: a full neural network, and a model built from that. Let's revisit the previous diagram of the perceptron. Just to reiterate one more time, the core piece of information I want all of you to take away from this class is how a perceptron works and how it propagates information to its decision. There are three steps: first the dot product, second the bias, and third the nonlinearity, and you keep repeating this process for every single perceptron in your neural network. Let's simplify the diagram a little bit: I'll remove the weight labels, and you can assume that every line now has an associated scalar weight corresponding to the input coming in along that line. I've also removed the bias, just for the sake of simplicity, but it's still there. The result is that z -- let's call that the result of our dot product plus the bias -- is what we pass into our nonlinear activation function. The final output is simply g(z), our activation function applied to z, where z is basically what you can think of as the state of this neuron: the result of the dot product plus the bias.
Now, if we want to define a multi-output neural network -- if we want two outputs from this function, for example -- it's a very simple procedure: we just have two neurons, two perceptrons, and each perceptron controls the output for its associated piece. So now we have two outputs; each one is a normal perceptron, and they both take the same inputs. Amazingly, with this mathematical understanding, we can now start to build our first neural network layer entirely from scratch. What does that look like? We can start by initializing the two components we saw: the first is the weight vector -- a vector of weights -- and the second is the bias, which gets added to the dot product of our inputs with our weights. The only remaining step, after we've defined these parameters of our layer, is to define how forward propagation of information works, and that's exactly those three main components I've been stressing. We can create a call function to do exactly that -- to define this forward propagation of information -- and the story here is exactly the same as we've been seeing: matrix multiply our inputs with our weights, add a bias, apply a nonlinearity, and return the result. That code, literally, will define a full neural network layer that you can then use. And luckily for all of you, all of that code -- which wasn't much code -- has been abstracted away by libraries like TensorFlow: you can simply call a function that replicates exactly that piece of code, so you don't need to write it all out yourself; you can just call it.
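As a rough sketch of what that from-scratch layer might look like in TensorFlow (the class name, sizes, and initializers here are my own illustrative choices, not necessarily what appears on the slides), followed by the equivalent one-line library call:

```python
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # initialize this layer's weight matrix and bias vector
        self.W = self.add_weight(shape=(input_dim, output_dim), initializer="random_normal")
        self.b = self.add_weight(shape=(1, output_dim), initializer="zeros")

    def call(self, inputs):
        # forward propagation: matrix multiply, add bias, apply the nonlinearity
        z = tf.matmul(inputs, self.W) + self.b
        return tf.math.sigmoid(z)

# the same thing, abstracted away by the library:
layer = tf.keras.layers.Dense(units=2, activation="sigmoid")
```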
With that understanding, we've just seen how to build a single layer, but of course now you can start to think about how to stack these layers as well. Since we now have this transformation from our inputs to a hidden output, you can think of it as a way of transforming our inputs into some new dimensional space -- perhaps one closer to the value that we want to predict -- and that transformation is eventually going to be learned, so the network knows how to map its inputs onto our desired outputs; we'll get to that later. For now, the piece I really want to focus on is that, even with these more complex neural networks, there is nothing more complicated here than what we've already seen. If we focus on just one neuron in this diagram -- take z2, for example, the neuron highlighted in the middle layer -- it's just the same perceptron we've been seeing so far in this class: its output is obtained by taking a dot product over all of its inputs, adding a bias, and applying the nonlinearity. If we look at a different node, for example z3, the one right below it, it's the exact same story: it sees the same inputs, but it has a different set of weights that it applies to those inputs, so it will have a different output -- but the mathematical equations are exactly the same. From now on, I'm just going to simplify all of these lines in the diagrams and show these symbols in the middle, to indicate that everything is fully connected to everything and defined by the mathematical equations we've been covering -- there's no extra complexity in these models beyond what you've already seen.
Now, if you want to stack these layers on top of each other, you can not only define one layer very easily, you can also create what are called sequential models. In a sequential model, you define one layer after another, and together they define the forward propagation of information not just at the neuron level but at the layer level: every layer is fully connected to the next, and the inputs of each layer are all of the outputs of the prior layer. And if you want to create a very deep neural network, all a deep neural network is is stacking these layers on top of each other, over and over -- there's nothing more to the story; it's really as simple as that. You just keep stacking layers until you get to the last one, the output layer, which produces the final prediction that you want. We can create a deep neural network to do all of this by stacking these layers, forming the kind of hierarchical model we saw very early in today's lecture, where the final output is computed by going deeper and deeper into this hierarchy of layers.
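Here is a minimal sketch of that stacking in TensorFlow (the layer sizes, e.g. 32 units, are arbitrary placeholders rather than values from the lecture):

```python
import tensorflow as tf

# a deep neural network: dense layers stacked one after another,
# ending in an output layer with as many units as outputs we want to predict
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2)   # output layer: 2 outputs in this example
])
```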
Okay, so that's awesome. We've now seen how we can go from a single neuron, to a layer, all the way to a deep neural network, building off these foundational principles. Let's take a look at how we can use the principles we've just discussed to solve a very real problem that I think all of you were probably very concerned about when you woke up this morning: will I pass this class? To answer this question, let's see if we can train a neural network to solve the problem. Let's start with a very simple neural network with just two inputs: one input is the number of lectures that you attend over the course of this one week, and the second input is how many hours you spend on your final project or competition. What we're going to do is first go out and collect a lot of data from all of the past years that we've taught this course, and because it's only a two-input space, we can plot this data on a two-dimensional feature space. We can look at all of the students before you who passed the class and who failed the class, and see where they lived in this space of hours spent and lectures attended: green points are the people who passed, red are those who failed. And here's you: your coordinates are four and five -- you've attended four lectures and spent five hours on your final project -- and you fall right there. We want to build a neural network to answer the question: will you pass the class? So let's do it.
We have two inputs: one is four, one is five. These are two numbers, and we can feed them through a neural network like the one we've just seen how to build -- a single-layer neural network with three hidden units in this example, though we could make it larger if we wanted it to be more expressive and more powerful. And we see here that the predicted probability of you passing this class is 0.1, which is pretty dismal. So why would this be the case? What did we do wrong? Because I don't think it's correct: when we looked at the feature space, it looked like you were actually a good candidate to pass the class. So why is the neural network saying there's only a 10% likelihood that you'll pass? Does anyone have any ideas? Exactly -- this neural network was just born; it has no information about the world or about this class. It doesn't know what four and five mean, or what the notion of passing or failing means. This neural network has not been trained; you can think of it as a baby that hasn't learned anything yet. So our job is first to train it, and part of that is that we first need to tell the neural network when it makes mistakes. Mathematically, we need to answer the question: did my neural network make a mistake, and if so, how big was that mistake -- so that the next time it sees this data point, it can do better and minimize that mistake.
In neural network language, those mistakes are called losses. Specifically, you want to define what's called a loss function, which takes as input your prediction and the true answer; how far your prediction is from the true answer tells you how big of a loss there is. Before going further, let me give you some terminology, because there are multiple ways of saying the same thing in neural networks and deep learning: what I just described as a loss function is also commonly referred to as an objective function, an empirical risk, or a cost function. These are all exactly the same thing -- they're all a way for us to train the neural network, to teach it when it makes mistakes. And what we ultimately want to do is minimize not the mistake on just one data point, but the mistakes averaged over the entire data set. So if we look at the problem like I said -- binary classification, will I pass this class or not, a yes-or-no answer -- we can use what's called the softmax cross-entropy loss. For those of you who aren't familiar, this notion of cross-entropy was actually developed here at MIT by Claude Shannon, who was a visionary; he did his master's here over 50 years ago, and he introduced this notion of cross-entropy, which has been pivotal in our ability to train these types of neural networks even now, into the future. Now, instead of predicting a binary output, what if we wanted to predict a final grade, your class score, for example? That's no longer a binary yes-or-no output; it's a continuous variable -- the grade, let's say out of 100 points, on your class project. For this type of problem we can use what's called a mean squared error loss: you can think of it literally as subtracting your predicted grade from the true grade, squaring it, and minimizing that distance.
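As a hedged sketch of how these two kinds of losses look in TensorFlow (the predicted and true values below are made-up placeholders, and I'm using the binary cross-entropy variant here since the class example has a single yes/no output):

```python
import tensorflow as tf

# binary classification (will I pass the class?): cross-entropy loss
y_true_class = tf.constant([[1.0]])    # true label: passed
y_pred_class = tf.constant([[0.1]])    # predicted probability of passing
bce = tf.keras.losses.BinaryCrossentropy()
classification_loss = bce(y_true_class, y_pred_class)

# continuous prediction (what will my grade be?): mean squared error loss
y_true_grade = tf.constant([[90.0]])   # true grade
y_pred_grade = tf.constant([[75.0]])   # predicted grade
mse = tf.keras.losses.MeanSquaredError()
regression_loss = mse(y_true_grade, y_pred_grade)
```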
So I think now we're ready to put all of this information together and tackle the problem of training a neural network: not just quantifying how erroneous it is, how large its loss is, but more importantly minimizing that loss as a function of all the training data it observes. We want to find the neural network, like we mentioned before, that minimizes this empirical risk, this empirical loss, averaged across our entire data set. Mathematically, this means we want to find the weights W that minimize J(W), where J(W) is our loss function averaged over the entire data set and W is the set of our weights. We want to find the set of weights that, on average, gives us the smallest possible loss. Remember that W here is just a collection of all of the weights in our neural network -- you may have hundreds of weights in a very small neural network, or billions or trillions of weights in today's neural networks -- and you want to find the value of every single one of these weights that results in the smallest loss possible.
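Written out as an equation, reconstructed from the description above (with n training examples x^(i), labels y^(i), per-example loss L, and network prediction f(x^(i); W)):

```latex
\mathbf{W}^{*} \;=\; \arg\min_{\mathbf{W}} \; J(\mathbf{W})
\;=\; \arg\min_{\mathbf{W}} \; \frac{1}{n} \sum_{i=1}^{n}
\mathcal{L}\!\left( f\!\left(x^{(i)}; \mathbf{W}\right),\; y^{(i)} \right)
```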
Now, how can you do this? Remember that our loss function J(W) is just a function of our weights, so for any instantiation of the weights we can compute a scalar value telling us how erroneous our neural network would be with those weights. Let's visualize this in a very simple example: a two-dimensional space where we have only two weights, an extremely small neural network, and we want to find the optimal weights to train it. We can plot the loss -- how erroneous the neural network is -- for every single instantiation of these two weights. It's a huge, in fact infinite, space, but we have a function we can evaluate at every point in it. What we ultimately want to do is find the set of weights that gives us the smallest loss possible; that means the lowest point on this landscape that you can see here -- where are the weights that bring us to that lowest point? The way we do this is by first starting at a random place -- we have no idea where to start, so pick a random place in this space and start there. At that location we can evaluate our neural network and compute the loss, and on top of that we can compute how the loss is changing: we can compute the gradient of the loss, because our loss function is a continuous function, so we can take derivatives of it across the space of our weights. The gradient tells us the direction of steepest ascent -- from where we stand, the gradient tells us which way to go to increase the loss. Of course, we don't want to increase the loss, we want to decrease it, so we negate the gradient and take a step in the opposite direction. That brings us one step closer to the bottom of the landscape, and we just keep repeating this process over and over: evaluate the neural network at the new location, compute its gradient, and step in that new direction, traversing the landscape until we converge to a minimum.
We can summarize this algorithm, which is known formally as gradient descent, like this: we initialize all of our weights -- this can be two weights like in the previous example, or billions of weights in real neural networks -- then we compute the gradient, the partial derivative of our loss with respect to the weights, and we update the weights a small step in the opposite direction of that gradient. That small step, denoted here as eta, is commonly referred to as the learning rate: it's how much we want to trust the gradient and step in its direction. We'll talk more about this later, but just to give you some sense of the code, this algorithm translates very directly into real code: for every line of the pseudocode you can see on the left, there is corresponding real code on the right that is runnable and directly implementable by all of you in your labs.
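Here is a hedged sketch of how that loop might look in TensorFlow (the `compute_loss` function is a placeholder you would replace with your own network and data, and the shapes, step count, and learning rate are illustrative, not values from the slides):

```python
import tensorflow as tf

weights = tf.Variable(tf.random.normal(shape=(2, 1)))   # initialize weights randomly
lr = 0.01                                                # learning rate (the eta above)

def compute_loss(w):
    # placeholder loss: a real problem would run the network on data here
    return tf.reduce_sum(w ** 2)

for step in range(1000):                      # loop until (approximate) convergence
    with tf.GradientTape() as tape:
        loss = compute_loss(weights)
    gradient = tape.gradient(loss, weights)   # dJ/dW, computed by backpropagation
    weights.assign_sub(lr * gradient)         # step in the opposite direction of the gradient
```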
But now let's look specifically at this term: the gradient. We touched on it very briefly in the visual example; it explains, like I said, how the loss is changing as a function of the weights -- as the weights move around, will my loss increase or decrease? -- and that tells the neural network whether it needs to move its weights in a certain direction or not. But I never actually told you how to compute it, and that's an extremely important part, because if you don't know that, you can't train your neural network. This is a critical part of training neural networks, and the process of computing this gradient is known as backpropagation. So let's do a very quick intro to backpropagation and how it works. Again, let's start with the simplest neural network in existence: one input, one output, and only one hidden neuron -- this is as simple as it gets. We want to compute the gradient of our loss with respect to a weight; in this case, let's compute it with respect to w2, the second weight. This derivative tells us how much a small change in that weight will affect our loss -- if we change the weight a little bit in one direction, will the loss increase or decrease? To compute it, we can write out this derivative by applying the chain rule backwards from the loss function through the output: specifically, we can decompose the derivative into two components, the derivative of our loss with respect to our output, multiplied by the derivative of our output with respect to w2. This is just a standard application of the chain rule to the original derivative on the left-hand side. Now suppose we wanted to compute the gradient for the weight before that -- not w2 but w1. All we do is replace w2 with w1, and the chain rule still holds; but now, for that last component, shown in red, we have to recursively apply the chain rule one more time, because it's another derivative that we can't directly evaluate. So we expand it once more with another application of the chain rule.
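The two chain-rule expansions just described, written out (here y-hat is the network's output, z1 is the hidden-unit state, and J(W) is the loss; this is a reconstruction of the equations described verbally above):

```latex
\frac{\partial J(\mathbf{W})}{\partial w_2}
  = \frac{\partial J(\mathbf{W})}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2},
\qquad
\frac{\partial J(\mathbf{W})}{\partial w_1}
  = \frac{\partial J(\mathbf{W})}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}
```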
Now all of these components let us directly propagate the gradients through the hidden units of our neural network, all the way back to the weight we're interested in. So we first computed the derivative with respect to w2, then we backpropagated that and used that information for w1 as well -- that's why it's called backpropagation: this process runs from the output all the way back to the input. We repeat this process many, many times over the course of training, propagating these gradients over and over through the network, from the output to the inputs, to determine, for every single weight, how much a small change in that weight affects the loss function -- whether it increases it or decreases it -- and how we can use that to improve the loss, because that's our ultimate goal in this class.
So that's the backpropagation algorithm -- that's the core of training neural networks. In theory it's very simple; it's really just an application of the chain rule. But let's touch on some insights that make training neural networks extremely complicated in practice, even though the algorithm of backpropagation is simple and many decades old. In practice, optimization of neural networks looks something like this -- nothing like the picture I showed you before. There are ways to visualize the loss landscapes of very large, deep neural networks, and this is an illustration from a paper that came out several years ago in which the authors tried to visualize the landscape of a very deep neural network. That is what the landscape actually looks like, that's what you're trying to deal with and find the minimum in, and you can imagine the challenges that come with that. To cover those challenges, let's first recall the update equation defined in gradient descent. I didn't talk much about this parameter eta, but now let's spend a bit of time on it. It's called the learning rate, like we saw before, and it determines how big of a step we take in the direction of the gradient on every single iteration of backpropagation.
In practice, even setting the learning rate can be very challenging. You, as the designer of the neural network, have to set this value, and how do you pick it? It can actually be quite difficult, and it has really large consequences when building a neural network. For example, if we set the learning rate too low, then we learn very slowly: assume we start on the right-hand side at that initial guess; if the learning rate is not large enough, not only do we converge slowly, we may not even converge to the global minimum, because we get stuck in a local minimum. What if we set the learning rate too high? Then we can overshoot and actually start to diverge from the solution -- the gradients can explode, very bad things happen, and the neural network doesn't train. So that's not good either. In reality, there's a happy medium between too small and too large, where you set it just large enough to overshoot some of these local minima, put yourself in a reasonable part of the search space, and then converge on the solutions that you care most about. But how do you actually set these learning rates in practice -- how do you pick the ideal learning rate? One option, and this is a very common option in practice, is simply to try out a bunch of learning rates and see what works best: try a whole grid of different learning rates, train all of those neural networks, and see which one works the best. But I think we can do something a lot smarter. What are some more intelligent ways to do this? Instead of exhaustively trying a whole bunch of different learning rates, can we design a learning rate algorithm that adapts to our neural network and to its landscape, so that it's a bit more intelligent than the previous idea?
This ultimately means that the learning rate -- the degree to which the algorithm trusts the gradients it sees -- is going to depend on how large the gradient is at that location, how fast we're learning, and many other factors that we might consider as part of training neural networks. It's not only how quickly we're learning; you might judge it on many different aspects of the learning landscape. In fact, these adaptive learning rate algorithms have been very widely studied in practice: there is a thriving community in deep learning research that focuses on developing and designing new algorithms for learning rate adaptation and faster optimization of large neural networks like these. During your labs you'll get the opportunity not only to try out a lot of these different adaptive algorithms, which you can see here, but also to try to uncover the patterns and benefits of one versus the other, and I think that's something you'll find very insightful as part of your labs.
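For reference, here is a sketch of how some of these optimizers are instantiated in TensorFlow (the learning rates shown are arbitrary illustrative values, not recommendations from the lecture):

```python
import tensorflow as tf

sgd      = tf.keras.optimizers.SGD(learning_rate=0.01)     # plain (stochastic) gradient descent
adam     = tf.keras.optimizers.Adam(learning_rate=0.001)   # adaptive moment estimation
adadelta = tf.keras.optimizers.Adadelta()                  # adaptive learning rate method
adagrad  = tf.keras.optimizers.Adagrad()                   # per-parameter adaptive rates
rmsprop  = tf.keras.optimizers.RMSprop()                   # running average of squared gradients
```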
Another key component of your labs is how you can put all of the information we've covered today into a single picture that looks roughly like this. At the top, you define your model -- we talked about that in the beginning part of the lecture. For your model, you then need to define this optimizer, which we've just talked about; the optimizer is defined together with a learning rate, determining how quickly you want to move over your loss landscape. Then, over many loops, you pass over all of the examples in your data set, observe how to improve your network -- that's the gradient -- and then actually improve the network in those directions, and you keep doing that over and over again until eventually your neural network converges to some solution.
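Putting those pieces together, here is a hedged sketch of what such a training loop might look like in TensorFlow (the data, model sizes, loss, epoch count, and batch size are all placeholders standing in for whatever problem you're working on):

```python
import numpy as np
import tensorflow as tf

# placeholder data standing in for a real training set
x_train = np.random.rand(256, 2).astype("float32")
y_train = np.random.rand(256, 1).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)   # optimizer with its learning rate
loss_fn = tf.keras.losses.MeanSquaredError()              # placeholder loss

for epoch in range(10):                     # loop over the data set many times
    for x_batch, y_batch in dataset:
        with tf.GradientTape() as tape:
            predictions = model(x_batch)              # forward pass through the model
            loss = loss_fn(y_batch, predictions)      # how wrong is the network?
        grads = tape.gradient(loss, model.trainable_variables)             # backpropagation
        optimizer.apply_gradients(zip(grads, model.trainable_variables))   # improve the weights
```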
In the remaining time we have, I want to very briefly continue with tips for training these neural networks in practice, and focus on the very powerful idea of batching your data into what are called mini-batches of smaller pieces of data. To do this, let's revisit the gradient descent algorithm. The gradient we talked about before is actually extraordinarily expensive to compute, because it's computed as a summation across all of the data points in your data set, and in most real-world problems it's simply not feasible to compute a gradient over your entire data set -- data sets are just too large these days. So what are the alternatives? Instead of computing the gradient across your entire data set, what if you computed it over just a single example in your data set, just one example? Well, this estimate of your gradient is going to be exactly that -- an estimate. It will be very noisy; it may roughly reflect the trends of your entire data set, but because it's only one example out of that data set, it can be very noisy. The advantage, though, is that it's obviously much faster to compute the gradient over a single example -- computationally this has huge advantages -- but the downside is that it's extremely stochastic. That's why this algorithm is not called gradient descent; it's called stochastic gradient descent.
Now, what's the middle ground? Instead of computing the gradient with respect to one example in your data set, what if we computed it over what's called a mini-batch of examples -- a small batch of examples that we compute the gradients over? These gradients are still computationally efficient to compute, because a mini-batch is not too large -- we're talking on the order of tens or hundreds of examples -- but, more importantly, because we've expanded from a single example to maybe a hundred examples, the stochasticity is significantly reduced and the accuracy of our gradient estimate is much improved. Normally we're thinking of mini-batch sizes roughly on the order of tens or hundreds of data points. This is much faster to compute than full gradient descent and much more accurate than stochastic gradient descent with its single-example estimate. This increase in gradient accuracy allows us to converge to our solution much more quickly than would be possible with the limitations of plain gradient descent. It also means that we can increase our learning rate, because we can trust each of those gradients much more -- we're now averaging over a batch, which is going to be much more accurate than the stochastic version -- so we can increase the learning rate and learn faster as well. And it allows us to massively parallelize this entire computation: we can split batches up onto separate workers and achieve even more significant speed-ups of this entire problem using GPUs.
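As a minimal sketch of how you might form those mini-batches in TensorFlow (the arrays below are placeholder data standing in for a real training set, and the batch size of 100 simply mirrors the ballpark mentioned above):

```python
import numpy as np
import tensorflow as tf

# placeholder data standing in for a real training set
x_train = np.random.rand(1000, 2).astype("float32")
y_train = np.random.rand(1000, 1).astype("float32")

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shuffle(buffer_size=1000)   # shuffle so each mini-batch is a random sample
dataset = dataset.batch(100)                  # mini-batches on the order of ~100 examples

for x_batch, y_batch in dataset:
    pass  # compute the gradient on this mini-batch and take a step, as sketched earlier
```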
The last topic I want to very briefly cover in today's lecture is overfitting. When we're optimizing a neural network with stochastic gradient descent, we have this challenge called overfitting, which looks roughly like this. We want to build a neural network -- or, in general, a machine learning model -- that can accurately describe some patterns in our data. But remember, we ultimately don't want to describe the patterns in our training data; ideally, we want to capture the patterns in our test data. Of course, we don't observe test data; we only observe training data. So we have this challenge of extracting patterns from training data and hoping that they generalize to test data. Said a different way: we want to build models that learn representations from the training data that still generalize even when shown brand new, unseen test data. So suppose you want to fit a line that describes, or finds, the patterns in the points you see on this slide. If you have a very simple model, which is just a single straight line, you'll describe this data sub-optimally -- the data here is nonlinear, so you won't accurately capture all of the nuances and subtleties in the data set; that's the left-hand side. If you move to the right-hand side, you see a much more complicated model, but here you're actually too expressive: you're capturing the spurious nuances in your training data that are not representative of your test data. Ideally, you want to end up with the model in the middle, which is the middle ground: not too complex and not too simple, it still performs well even when you give it brand new data.
To address this problem, let's briefly talk about what's called regularization. Regularization is a technique that you can introduce into your training pipeline to discourage complex models from being learned. As we've seen, this is really critical because neural networks are extremely large models and are extremely prone to overfitting, so regularization techniques have major implications for the success of neural networks and for having them generalize beyond the training data, far into the testing domain. The most popular technique for regularization in deep learning is called dropout, and the idea of dropout is actually very simple. Let's revisit the picture of deep neural networks that we drew earlier in today's lecture. In dropout, during training, we randomly select some subset of the neurons in the network and prune them out with some probability: for example, we can select a subset of neurons with a probability of 50 percent, and with that probability randomly turn them off or on on different iterations of training. This essentially forces the neural network to learn what you can think of as an ensemble of different models: on every iteration it's exposed to a different internal model than the one it had on the last iteration, so it has to learn to build internal pathways to process the same information, and it can't rely on information it learned on previous iterations. This forces it to capture some deeper meaning within the pathways of the neural network, and it can be extremely powerful: number one, it lowers the capacity of the neural network significantly -- roughly by 50 percent in this example -- and it also makes the network easier to train, because the number of weights that have gradients on any given iteration is reduced, so each iteration is faster as well. And like I mentioned, on every iteration we randomly drop out a different set of neurons, and that helps the network generalize better.
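A minimal sketch of how dropout is added between layers in TensorFlow (the 0.5 rate matches the 50 percent example above, while the layer sizes are placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # randomly zero 50% of this layer's activations during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # a different random subset is dropped on every iteration
    tf.keras.layers.Dense(1)
])
```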
The second regularization technique, which is actually a very broad technique that goes far beyond neural networks, is simply called early stopping. We know that the definition of overfitting is simply when our model starts to represent the training data more than the testing data -- that's really what overfitting comes down to at its core. If we set aside some of the training data that we don't train on, we can use it as a kind of synthetic test set and monitor how our network is doing on this unseen portion of data. For example, over the course of training we can plot the performance of our network on both the training set and this held-out test set. As the network trains, we'll see that at first both losses decrease, but there comes a point where the test loss plateaus and then starts to increase, while the training loss keeps decreasing. That is exactly the point where you start to overfit, because you're now starting to fit to the particulars of your training data. This pattern continues for the rest of training, and this middle point is where we need to stop training, because after this point -- assuming the held-out set is a valid representation of the true test set -- the accuracy of the model on new data will only get worse. So this is where we would want to early-stop our model and regularize its performance. We can also see that stopping any time before this point is not ideal either: we'd produce an underfit model when we could have had a better model on the test data. So it's a trade-off: you can't stop too late, and you can't stop too early.
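In TensorFlow's high-level API, this idea is available as a callback; here is a hedged sketch (the patience value and the monitored quantity are illustrative choices, not values from the lecture):

```python
import tensorflow as tf

# stop training when the held-out (validation) loss has stopped improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the loss on the held-out data, not the training loss
    patience=5,                 # allow a few epochs of no improvement before stopping
    restore_best_weights=True   # roll back to the weights from the best epoch
)

# usage with a compiled model (model, x_train, y_train defined elsewhere):
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```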
So I'll conclude this lecture by summarizing the three key points we've covered so far. First, we covered the fundamental building block of all neural networks: the single neuron, the perceptron. We built these up into larger neural layers, and from there into neural networks and deep neural networks. And we learned how to train them: how to apply them to data sets, backpropagate through them, and use some tips and tricks for optimizing these systems end to end. In the next lecture, we'll hear from Ava on deep sequence modeling using RNNs, and specifically a very exciting new type of model called the Transformer architecture, built on attention mechanisms. So let's resume the class in about five minutes, after we have a chance to swap speakers. Thank you so much for your attention.