Introduction to Backpropagation in Machine Learning

what do nearly all machine Learning Systems have in common from GPT and M journey to Alpha fold and various models of the brain despite being designed to solve different problems having completely different architectures and being trained on different data there is something that unites all of them a single algorithm that runs under the hood of the training procedures in all of those cases this algorithm called back propagation is the foundation of the entire field of machine learning although its details are often overlooked surprisingly what enables artificial networks to learn is also what makes them fundamentally different from the brain and incompatible with Biology this video is the first in a two-part Series today we will explore the concept of back propagation in arcial systems and develop an intuitive understanding of what it is why it works and how you could have developed it from scratch yourself in the next video we will focus on synaptic plasticity enabling learning in biological brains and discuss whether back propagation is biologically relevant and if not what kind of algorithms the brain may be using instead if you're interested stay tuned despite its transformative imp impact it's hard to say who invented back propagation in the first place as certain Concepts can be traced back to liins in 17th century however it is believed that the first modern formulation of the algorithm still in use today was published by sepo linar in his master's thesis in 1970 although he did not reference any neural networks explicitly another significant Milestone occurred in 1986 when David rumelhart Joffrey Hinton and Ronald Williams published a paper titled learning representations by back propagating errors they applied the back propagation algorithm to multi-layer perceptrons a type of a neural network and demonstrated for the first time that training with back propagation enables the network to successfully solve problems and develop meaningful representations at the hidden neuron level capturing important regularities in the task as the field progressed researchers scaled up these models significantly and introduced various architectures but the fundamental principles of training remained largely unchanged to gain a comprehensive understanding of what exactly it means to train a network let's try to build the concept of back propagation from the ground up consider the following problem suppose you have collected a set of points XY on the plane and you want to describe their relation ship to achieve this you need to fit a curve y of X that best represents the data since there are infinitely many possible functions we need to make some assumptions for instance let's assume we want to find a smooth approximation of the data using a polom of degree 5 that means that the resulting curve we're looking for will be a combination of a constant term a polinomial of degree Z a straight line a parabola and so on up to a power of five each weighted by specific coefficients in other words the equation for the curve is as follows where each K is some arbitrary real number our job then becomes finding the configuration of k0 through K5 which leads to the best fitting curve to make the problem totally unambiguous we need to agree on what the best curve even means while you you can just visually inspect the data points and estimate whether a given curve captures the pattern or not this approach is highly subjective and impractical when dealing with large data sets instead we need an objective measurement a numerical value that quantifies the quality of a fit one popular method is to measure the square distance between data points and the fitted curve a high value suggests that the data points are significantly far from the curve indicating a poor approximation conversely low values indicate a better fit as the curve closely aligns with the data points this measurement is commonly referred to as a loss and the objective is to minimize it now notice that for a fixed data this distance the value of the loss depends only on the defining characteristics of the curve in our case the Coe ients from k0 through K5 this means that it is effectively a function of parameters so people usually refer to it as a loss function it's important not to confuse two different functions we are implicitly dealing with here the first one is the function y of X which has one input number and one output number and defines the curve itself it has this polinomial form given by K's there are infinitely many such functions and we would like to find the best one to achieve this we introduce a loss function which instead has six inputs numbers k0 through K5 and for each configuration it constructs the corresponding curve y calculates the distance between observed data points and the curve and outputs a single number the particular value of the loss our job then becomes finding the configuration of KS that yields a minimum loss or minimizing the loss function with respect to the coefficients then plugging these optimal cases into the general equation for the Curve will give us the best curve described in the data all right great but how do we find this magic configuration of case that minimizes the loss well we might need some help let's build a machine called Curve fitter 6000 designed to simplify manual calculations it is equipped with six adjustable knobs for k0 through K5 which we can freely turn to begin we initialize the machine with our data points and then for each setting of The Knobs it will evaluate the curve y ofx compute the distance from it to the data points and print out the value of the loss function now we can begin twisting the knobs in order to find the minimum loss for example let's start with some initial setting and slightly noge Noob number one to the right the resultant curve changed as well and we can see that the value of the loss function slightly decreased great it means we are on the right track let's turn knob number one in the same direction once again uh-oh this time the fit gets worse and the loss function increases apparently that last noge was a bit too much so let's revert the knob to the previous position and try knob two and we can keep doing this iteratively many many times nudging each individual knob one at a time to see whether the resulting curve is a better fit this is a so-called random perturbation method since we are essentially wandering in the dark not knowing in advance how each adjustment will affect the loss function this would certainly work but it's not very efficient is there a way we can be more intelligent about the knob adjustments in the most General case when the machine is a complete Black Box nothing better than a random perturbation is guaranteed to exist however a great deal of computations including what's carried out under the hood of our curve fitter have a special property to them something called differentiability that allows us to compute the optimal knob setting much more efficiently we will dive deeper into what differentiability means in just a minute but for now let's quickly see the big picture overview of where we are going our goal would be to upgrade the machine so that it would have a tiny screen next to each knob and for any configuration those screens should say which direction you need to nudge each knob in order to decrease the loss function and by how much think about it for a second we are essentially asking the machine to predict the future and estimate the effect of the noob adjustment on the loss function without actually performing that adjustment calculating the loss and then reverting the knob back like we did previously wouldn't this glance into the future violate some sort of principle after all we are jumping to the result of the computation without performing in it sounds like cheating right well it turns out that this idea lies on a very simple mathematical foundation so let's spend the next few minutes building it up from scratch all right let's consider a simpler case first where we freeze five out of six knobs for example suppose someone tells you that the rest of them are already in the optimal position so all you need to do is to find the best value for one remaining kn essentially the machine now has only one variable parameter K1 that we can tweak and so the loss function is also a simpler function which accepts one number the knob setting and outputs another number the loss value as a function of one variable it can be conveniently visualized as a graph in a two-dimensional plane which captures the relationship between the input and the output for example it may have this shape right here and our goal goal is to find this value of K1 which corresponds to the minimum of the loss function but we don't have access to the true underlying shape all we can do is to set the knob at a chosen position and kind of query the machine for the value of the loss in other words we can only sample individual points along the function we're trying to minimize and we are essentially blind to how the function behaves in between the known points before we sample them but suppose we would like to know something more about the function not just each value at each point for example whether at this point the function is going up or down this information will ultimately guide our adjustments because if you know that the function is going down as you increase the input turning the knob to the right is a safe bad since you are guaranteed to decrease the loss with this manipulation let's put this notion of going up or down around a point on a stronger mathematical ground suppose we have just sampled the point x KN y KN on this graph what we can do is increase the input by a small amount Delta X this new adjusted input will result in a new value of y which will differ from the old value by some Delta y this Delta depends on the magnitude of our adjustment for example if we take a step Delta X which is 10 times smaller Delta y will also be approximately 10 times as small this is why it makes sense to take the ratio Delta y over Delta X the amount of change in the output per unit change in the input graphically this ratio corresponds to a slope of a straight line going through the points X not Y and X Plus Delta X Y KN plus Delta Y no notice that as we take smaller and smaller steps this straight line will more and more accurately align with the graph in the neighborhood of the point x y KN let's take a limit of this ratio as Delta X goes to infinitely small values then this limiting case value which this ratio converges to for infinitesimally small Delta X's is what is called the derivative of a function and it is dened Ed by dy/ DX visually the derivative of a function at some point is the slope of the line that is tangent to the graph and thus corresponds to the instantaneous rate of change or steepness of that function around that point but different points along the graph might have different stiffness values so the derivative of the entire function is not a single number in fact the derivative Dy by DX X is itself a function of X that takes an arbitrary value of x and outputs the local steepness of Y ofx at that point this definition assigns to every function its derivative Alter Ego another function operating on the same input domain which carries information about the steepness of the original function there is a bit of a subtlety strictly speaking the derivative may not exist if the function doesn't have a steepness around some point for example if it has sharp corners or discontinuities however for the remainder of the video we are going to assume that all functions we are dealing with are smooth so that the derivative always exists this is a reasonable claim because we can control what sort of functions go into our models when we build them and people usually restrict everything to smooth or differential functions to make all the math work out nicely all right great now along with the underlying loss as a function of K1 which is hidden from us we can also reason about its derivative another function of K1 which we also don't know that is equal to the steepness of the loss function at that point let's suppose that similarly to how we can query the loss function by running our machine and obtaining individual samples there is a mechanism for us to sample the derivative function as well so for every input value of K1 the machine will output the value of the loss and the local steepness of the loss function around that point notice that this derivative information is exactly the sort of look into the future we were looking for to make smarter knob adjustments for example let's use it to efficiently find the the optimal value of K1 what we can do is the following first start at some random position ask the machine for a value of the loss and the derivative of the loss function at that position take a tiny step in the direction opposite of the derivative if the derivative is negative it means that the function is going down and so if we want to arrive at the minimum we need to move in the direction of increas in value of K1 repeat this procedure until you reach the point where the derivative is zero which essentially corresponds to the minimum where the tangent line is flat essentially each adjustment in such a guided fashion Works kind of like a ball rolling down the hill along the graph until it reaches a valley although in the beginning we froze five out of six knobs for Simplicity this process is easily carried out to higher dimensions for example suppose now we are free to tweak two different knobs K1 and K2 the loss would become a function of two variables which can be visualized as a surface but what about the derivative recall that by definition the derivative at each point tells us how the output changes per unit change of the input but now we have two different inputs should we nudge only K1 K2 or both essentially our function will have two different derivatives that are usually called partial derivatives because of this ambiguity which input to notch namely when we have two knobs the derivative of the loss function with respect to parameter K1 is written like this it is how much the output changes per unit change in K1 if you hold K2 constant and conversely this expression tells you the rate of change of the output if you hold K1 constant and slightly noge K2 geometrically you can imagine slicing the surface with planes parallel to the axis intersecting at the point of Interest K1 K2 so that each of the two cross-sections is like a one-dimensional graph of the loss as a function of one variable while the other one is kept constant then the slope of a tangent line at each cross-section will give you a corresponding partial derivative of the loss at that point while thinking about partial derivatives as two separate surfaces one for each variable is a perfectly valid way people usually plug the two different values into a vector called a gradient Vector essentially this is a mapping from two input values to another two numbers where the first signifies how much the output changes per tiny change in the first input and similarly for the second input geometrically this Vector points in the direction of steepest Ascent so if you want to minimize a function like in the case for our loss we need to take steps in in the direction opposite to this gradient this iterative procedure of noding the parameters in the direction opposite of the gradient Vector is called gradient descent which you have probably heard of this is analogous to a ball rolling down the hill for the two-dimensional case and the partial derivatives essentially tell you which direction is downhill going Beyond two Dimensions is impossible to visualize directly but the math stays exactly the same for instance if we are now free to tweak all the six knobs the loss function is a hyper surface in Six Dimensions and the gradient Vector now has six numbers packed into it but it still points in the direction of steepest Ascent so if we iteratively take small steps in the direction opposite to it we are going to roll the ball down the hill in Six Dimensions and eventually reach the minimum of the loss function great let's back up a bit remember how we were looking for ways to add screens next to each knob that would give us the direction of optimal adjustment well it is essentially nothing more but the components of the gradient Vector if at a particular configuration the partial derivative of the loss with respect to K1 is positive it means that increasing K1 will lead to increased loss so we need to decrease the value of the knob by turning it to the left and similarly for all other parameters this is how the derivatives serve as these windows into the future by providing us with information about local behavior of the function and once we have a way of accessing the derivative we can perform gradient descent and efficiently find the minimum of the loss function thus solving the optimization problem however there is an elephant in a room so far we have implicitly assumed the derivative information is given to us or that we can sample the derivative at a given point similarly to how we sample the loss function Itself by running the calculation of the machine but how do you actually find the derivative as we will see further this is the main purpose of the back propagation algorithm essentially the way we find derivatives of arbitrarily complex functions is the following first there are a handful of building blocks to begin with simple functions derivatives of which are known from calculus these are the kind of derivative formulas you often memorize in college for example if the function is linear it's pretty clear that its derivative will be a constant equal to the slope of that line everywhere which coincides with its own tangent line a parabola x² becomes more steep as you increase X and its derivative is actually 2x in fact there is a more general formula for the derivative of x to the^ of n similarly derivatives of the exponent and logarithm can be written down explicitly but these are just individual examples of simple well-known functions in order to compute arbitrary derivatives we need a way to combine such Atomic building blocks together there are a few rules how to do it for instance the derivative of a sum of two functions is the sum of the derivatives there is also a formula for the derivative of a product of two functions this gives you a way to compute things like the derivative of 3x^2 - e ^ of x but to complete the picture and to be able to find derivatives of almost everything we need one other rule called The Chain rule which Powers the entire field of machine learning it tells you how to compute the derivative of a combination of two functions when one of them is an input to another here is a way to reason about this suppose you take one of those simpler machines which receives a single input x that you can vary with a knob and spits out an output J of X now you take a second machine of this kind which performs a different function f of x what would happen if you connect them in sequence so that the output of the first machine is fed into the second one as an input notice that such a construction can be thought of as a single function into the second function is J of X and so the local rate of change of the second machine is thus the derivative of f evaluated at the point J of X now imagine you nudge the knob X by a tiny amount Delta that input nudge when it comes out of the first machine will be multiplied by the derivative of J since the derivative is the rate of change in the output per unit change of the input so after the first function the output will increase by Delta multiplied by the derivative of J this expression is essentially a tiny nudge in the input to the second machine whose derivative at that point is given by this expression this means that for each Delta increase in the input we bump the output by this much hence the derivative when you divide that by Delta will look like this you can think about it as a set of three interconnected Cog Wheels where the first one represents the input knob X and the other two wheels are functions J of X and F of J of X respectively when you Nodge the first wheel it induces a nudge in the middle wheel and the amplitude of that change is given by the derivative of J which in turn causes the Third Wheel to rotate and the amplitude of that resulting nudge is given by changing the derivatives together all right great now we have a straightforward way of obtaining a derivative of any arbitrarily complex function as long as it can be decomposed into building blocks simple functions with explicit derivative formulas such as summations multiplications exponents logarithms Etc but how can it be used to find the best curve using our curve fitter the big picture we are aiming for is the following for each of our parameter knobs we will write down its effect on the loss in terms of simple easily differentiable operations once we have that sequence of building blocks no matter how long we should be able to sequentially apply the chain rule to each of them in order to find the value of the derivative of the loss function with respect to each of the input knobs and perform iterative gradient descent to minimize the loss let's see an example of this first we are going to create a knob for each number the loss function can possibly depend on this obviously includes the parameters but there is also the data itself coordinates of points to which we are fit in the curve in the first place now during optimization the data points are set in Stone so changing them in order to obtain a lower loss would make no sense however for conceptional purposes we can think about these values as fixed knobs set in one position so that we cannot n them once we have all the existing numbers being fed into the machine we can start to break down the loss calculation Remember by definition it is the sum of squar vertical distances from each point to the curve parameterized by case so for instance let's take the first data point X1 y1 multiply the x coordinate by K1 add that to the squared value of X1 multiplied by K2 and so on for other KS including the constant term k0 this sum of weight and powers of X1 is the value of y predicted by the current curve F of X1 let's call it y1 hat next we need to take the squared difference between the actual value and the predicted value this is how much the first data point contributes to the resulting value of the loss function repeating the same procedure for all remaining data points and summing up the resulting squared distances gives us the overall total loss that we are trying to minimize the computation we just performed finding the value of the loss for a given configuration of parameter and data knobs is known as the forward step the entire sequence of calculations can be visualized as this kind of computational graph where each node is some simple operation like addition or multiplication forward step then corresponds to computations flowing from left to right but to perform optimization we also need information about gradients how each knob influences the loss now we are going to do what's known as the backward step and unroll the sequence of calculations in reverse order to find derivatives what makes the backward possible is the fact that every note in our compute graph is an easily differentiable operation think of individual nodes as these tiny machines which simply add multiply or take Powers we know their derivatives and because their outputs are connected sequentially we can apply the chain rule this means that for each node we can find its gradient the partial derivative of of the output loss with respect to that node let's see how it can be done consider a region of the compute graph where two number nodes A and B are being fed into a machine that performs addition and its result a plus b is further processed by the system to compute the overall output L suppose we already computed the gradient of a plus b earlier so that we know how nding this sum will affect the output the question is what are individual gradients of A and B well intuitively if you nudge a by some amount a + b will be nudged by the same amount so the gradient or the partial derivative of the loss with respect to a is the same as the gradient of the sum and similarly for B this can be seen more form by writing down the chain Rule and noticing that the derivative of A+ B with respect to a is just one in other words when you encounter this situation in the compute graph then the gradient of the sum just simply propagates into the gradients of the nodes that plug into the sum machine another possible scenario is When A and B are multiplied just like before suppose we know the gradient of the their product because it was computed before in this case individual noge to a will be scaled by a factor of B so the product will be nudged B times as much which propagates into the output so whatever the derivative of the output with respect to the product of ab is the output derivative with respect to a will get Scaled by a factor of B and vice versa for the gradient of B once again it can be seen more formally by examining the chain rule in other words the multiplication node in the compute graph distributes the downstream gradient across incoming nodes by multiplying it cross Ways by their values similar rules can be easily formulated for other building block calculations such as raising a number to a power or taking the logarithm finally when a single node takes part in multiple branches of the compute graph gradients from the corresponding branches are simply added together indeed suppose you have the following structure in the graph where the same node a plugs into two different operations that contribute to the overall loss then if you nudge a by Delta the output will be simultaneously noodged by this derivative from the first branch and this derivative from the second second Branch so the overall effect of ning a will be the sum of the two gradients all right great now that we have constructed a computational graph and established how to process individual chunks of it we can just sequentially apply those rules starting from the output and working our way backwards for instance the rightmost node in the graph is the resulting value of the loss function how does the incremental change in that node affect the output well it is the output so its gradient is by definition equal to one next the loss function is the sum of many Delta y's squared we know what to do with the summation node it just copies whatever the gradient value is to the right of it into all incoming nodes consequently the gradients of all Delta y squared will also be equal to one each of those nodes is this squared value of the corresponding Delta Y and we know how to differentiate this squaring operation the derivative of the loss function with respect to Delta y1 will be 2 * the Delta y1 which is just the number we found during the forward calculation and we can keep doing this propagation of sequential derivative calculation backwards along our compute graph until we reach the leftmost nodes which are the data and parameter knobs the Der derivatives of the loss with respect to the input data don't really matter but the derivatives with respect to the parameters is exactly what we want once these parameter gradients are found we can perform one iteration of gradient descent namely we're going to slightly tweak the knobs in the directions opposite to the gradient the exact magnitude of each adjustment being the negative product of the gradient and some small number called The Learning rate for example 0.01 note that after the adjustment is performed the configuration of the machine and the resulting loss are different and so the old gradient values we found no longer hold so we need to run the forward and backward calculations once again to obtain updated gradients and the new decreased loss performing this Loop of forward pass backward pass nudge repeat is the essence of training every modern machine Learning System and exactly the same algorithm is used today in even the most complicated models as long as the problem you're trying to solve with a given model architecture can be decomposed into individual operations that are differentiable you can sequentially apply the chain rule many times to arrive at the optimal setting of The parameters for instance a feed forward neural network is essentially a bunch of multiplications and summations with a few nonlinear activation functions sprinkled between the layers each of those atomic computations is differentiable so you can construct the compute graph and run the backward path on it to find how each parameter like connection weights between neurons influence the loss function and because neural networks given enough neurons can in theory approximate any function imaginable we can create a large enough sequence of these building block mathematical machines to solve problems such as classifying images and even generating new text this seems like a very elegant and efficient solution after all if you want to solve the optimization problem derivatives tell you exactly which adjustments are necessary but how similar is this to what the brain actually does when we learn to walk speak and read is the brain also minimizing some sort of loss function does it calculate derivatives or could it be doing something totally different in the next video we are going to dive into the world of synaptic plasticity and talk about how biological neural networks learn in keeping with the topic of biological learning I'd like to take a moment to give a shout out to shortform a longtime partner of this channel short form is a platform which stay tuned for more interesting topics coming up goodbye and thank you for the interest in the [Music] brain for

Transcript for:Introduction to Backpropagation in Machine Learning

Transcript for:
Introduction to Backpropagation in Machine Learning