Transcript for:
Regularization in Neural Networks

Why does regularization help with overfitting? Why does it help with reducing variance problems? Let's go through a couple of examples to gain some intuition about how it works.

Recall that our high bias, "just right," and high variance pictures from the earlier video looked something like this. Now let's say we're fitting a large and deep neural network. I know I haven't drawn this one too big or too deep, but let's say this neural network is currently overfitting. So you have some cost function J(W, b) equal to the sum of the losses, like so, and what we did for regularization was add this extra term that penalizes the weight matrices for being too large; we said that was the Frobenius norm penalty. So why is it that shrinking the L2 norm, or the Frobenius norm, of the parameters might cause less overfitting?

One piece of intuition is that if you crank the regularization parameter lambda to be really, really big, then gradient descent will be heavily incentivized to set the weight matrices W reasonably close to zero. So maybe it sets the weights so close to zero for a lot of hidden units that it's basically zeroing out the impact of those hidden units. If that's the case, then this much simplified neural network becomes a much smaller neural network; in fact, it's almost like a logistic regression unit, just stacked multiple layers deep. That would take you from the overfitting case much closer to the left, toward the high bias case. But hopefully there's an intermediate value of lambda that results in something closer to the "just right" case in the middle.

So the intuition is that cranking up lambda to be really big sets W close to zero. In practice that isn't exactly what happens, but you can think of it as zeroing out, or at least reducing the impact of, a lot of the hidden units, so you end up with what might feel like a simpler network, closer and closer to as if you were just using logistic regression. The intuition of completely zeroing out a bunch of hidden units isn't quite right. What actually happens is that the network still uses all the hidden units, but each of them just has a much smaller effect. You do end up with a simpler network, as if you had a smaller network, and that is therefore less prone to overfitting. I'm not sure how much this intuition helps, but when you implement regularization in the programming exercise, you'll actually see some of these variance-reduction results yourself.
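To make the penalized cost concrete, here is a minimal NumPy sketch of the L2-regularized cost described above. It is not the course's starter code, and the names (l2_regularized_cost, lambd, and so on) are just illustrative, but the formula is the one from the slides: the cross-entropy cost plus lambda over 2m times the sum of the squared Frobenius norms of the weight matrices.

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weight_matrices, lambd, m):
    """Add the L2 (Frobenius norm) penalty to an unregularized cost.

    cross_entropy_cost: the original cost, i.e. the average of the losses
    weight_matrices:    list of weight matrices W[1], ..., W[L]
    lambd:              the regularization parameter lambda
    m:                  number of training examples
    """
    # Sum of squared Frobenius norms ||W[l]||_F^2 over all layers.
    frobenius_penalty = sum(np.sum(np.square(W)) for W in weight_matrices)
    return cross_entropy_cost + (lambd / (2 * m)) * frobenius_penalty
```

Cranking lambd up makes the penalty term dominate this sum, which is exactly the regime in which gradient descent is pushed to shrink every W toward zero.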
Here's another attempt at additional intuition for why regularization helps prevent overfitting. For this, I'm going to assume that we're using the tanh activation function, which looks like this: g(z) = tanh(z). Notice that so long as z is quite small, so that z takes on only a smallish range of values, maybe around here, you're just using the linear regime of the tanh function. It's only if z is allowed to wander up to larger values, or down to smaller ones, that the activation function starts to become less linear.

So the intuition you might take away from this is that if lambda, the regularization parameter, is large, then your parameters will be relatively small, because they are penalized for being large in the cost function. And since z equals Wa plus b, if W tends to be very small, then z will also be relatively small. In particular, if z ends up taking on relatively small values, just in this little range, then g(z) will be roughly linear. So it's as if every layer is roughly linear, as if it were just linear regression. And we saw in course one that if every layer is linear, then your whole network is just a linear network. So even a very deep network, if it effectively has a linear activation function, is only able to compute a linear function. It's not able to fit those very complicated, very nonlinear decision boundaries that let it really overfit the dataset, like we saw in the overfitting, high variance case on the previous slide.

Just to summarize: if the regularization parameter is very large, the parameters W will be very small, so z will be relatively small, ignoring the effect of b for now; really I should say z takes on a small range of values. And so the activation function, tanh here, will be relatively linear, and your whole neural network will be computing something not too far from a big linear function, which is therefore a pretty simple function rather than a very complex, highly nonlinear function, and so it's also much less able to overfit. Again, when you implement regularization for yourself in the programming exercise, you'll be able to see some of these effects yourself.

Before wrapping up our discussion of regularization, I just want to give you one implementational tip. When implementing regularization, we took our definition of the cost function J and modified it by adding this extra term that penalizes the weights for being too large. If you implement gradient descent, one of the steps to debug it is to plot the cost function J as a function of the number of iterations of gradient descent, and you want to see that J decreases monotonically after every iteration. If you're implementing regularization, then please remember that J now has this new definition. If you plot the old definition of J, just the first term, then you might not see it decrease monotonically. So to debug gradient descent, make sure you're plotting this new definition of J that includes the second term as well; otherwise you might not see J decrease on every single iteration. A small sketch of this monitoring setup follows below.

So that's it for L2 regularization, which is actually the regularization technique that I use the most in training deep learning models. In deep learning, there's another sometimes-used regularization technique called dropout regularization. Let's take a look at that in the next video.
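Here is the promised sketch of that monitoring tip: a minimal, self-contained example, assuming a toy logistic regression model rather than the deep network from the video, with made-up data, learning rate, and lambda. The point is simply that the quantity appended to costs and plotted is the regularized J, including the L2 term, not the cross-entropy term alone.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data: 2 features, 200 examples stored as columns, made-up labels.
rng = np.random.default_rng(0)
n, m = 2, 200
X = rng.normal(size=(n, m))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, m)

W = np.zeros((1, n))                         # weights
b = 0.0                                      # bias
lambd, alpha, num_iters = 0.7, 0.1, 500
costs = []

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for i in range(num_iters):
    A = sigmoid(W @ X + b)                   # forward pass
    cross_entropy = -np.mean(Y * np.log(A + 1e-8)
                             + (1 - Y) * np.log(1 - A + 1e-8))
    J = cross_entropy + (lambd / (2 * m)) * np.sum(np.square(W))
    costs.append(J)                          # track the regularized cost, not cross_entropy alone

    dZ = A - Y                               # backward pass
    dW = (dZ @ X.T) / m + (lambd / m) * W    # gradient includes the L2 term
    db = np.mean(dZ)
    W -= alpha * dW
    b -= alpha * db

plt.plot(costs)                              # this curve should decrease monotonically
plt.xlabel("iteration")
plt.ylabel("regularized cost J")
plt.show()
```

If you plotted cross_entropy alone instead of J, the curve could occasionally tick upward even when the optimization is working correctly, which is exactly the debugging pitfall described above.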