Hey everyone, welcome back. Today we're going to be talking about a cool concept called Bayesian linear regression. In the initial Bayes video that I made, there were a lot of requests to do a statistical model using Bayesian statistics, and what better place to start than with our old friend, linear regression? So in this video, first I'll give you a quick reminder of the linear regression framework, including some of the assumptions we're going to need, and then we'll talk about how to attack this linear regression problem from a Bayesian framework. The most important thing we'll get out of this video is that the solution we get for the betas with the Bayesian approach is not some new thing; it is exactly and very intimately related to the solution for the betas we get using L1 or L2 regularization, also known as lasso or ridge, on our OLS problem. We'll get there in just a moment.

Let's start with the assumptions of linear regression, just to take you back to Stats 101; maybe you haven't seen it in a while. The mechanics of the linear model are that we have some known data X, an n by p matrix, and we assume the response variable y, which we also know, is generated according to a linear combination of the features, where the weights on that linear combination are given by the parameter vector beta. It's exactly that beta that is our goal to solve for. In a nutshell, as an equation, we're saying that X times beta plus some error epsilon equals the response variable we observe. A couple of other assumptions are stated here; it might have been a while since you saw them, but they're part of the setup: we assume each error epsilon_i is normally distributed with mean 0 and variance sigma squared, and we assume each y_i is normally distributed with mean beta transpose x_i and variance sigma squared. I state these because they're going to be important as we talk about the Bayesian framework.

Now, we know how to solve for the OLS beta in matrix form, but the issue with that solution is that it has high variance. This gets into the bias-variance trade-off, which is a video I've linked below, but the gist is that small changes in our data X, this n by p matrix of observations, could potentially lead to large changes in the beta we solve for. That's typically not something that's favorable in machine learning: we don't want small changes to our training data to change the model that much.
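Just to make that concrete, here's a rough numpy sketch of what I mean; it's not a definitive experiment, and the particular numbers and variable names are just for illustration. We fit OLS on simulated data whose columns are highly correlated, then refit on a slightly jittered copy of the data and look at how far the coefficients move.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10

# simulate features whose columns are highly correlated,
# which is exactly the situation where OLS gets unstable
base = rng.normal(size=(n, 1))
X = base @ np.ones((1, p)) + 0.1 * rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=1.0, size=n)

def ols(X, y):
    # closed-form OLS solution: (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

beta_hat = ols(X, y)

# a tiny perturbation of the training data...
X_jittered = X + 0.05 * rng.normal(size=(n, p))
beta_hat_jittered = ols(X_jittered, y)

# ...can move the fitted coefficients quite a lot
print(np.linalg.norm(beta_hat - beta_hat_jittered))
print(np.linalg.norm(beta_hat))
```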
One of the ways we deal with this, and probably the way most familiar to you right now, is the idea of regularization. It comes in two main flavors, lasso and ridge, also known as using the L1 norm versus the L2 norm. They vary slightly, but both have the same goal, which is to solve a slightly different optimization problem. In each of these optimization problems you see two distinct terms, and they relate to the two distinct goals we're trying to achieve in the minimization. The first goal is the same as in the OLS solution: we want to minimize the errors we get from fitting the model with some setting of the betas. That's the first term, the error term, in both cases, and it says that I want the predictions I get, which are X times beta, to be a good fit to the actual values I see, which are y. Now, this is called regularization because we're also trying to achieve a secondary goal: we want to keep the L1 norm of the betas, or in the other case the L2 norm of the betas, small in some sense. The reason is that, because the OLS solution has high variance, the betas could otherwise end up very large in absolute value, so we include these regularization terms to keep the betas in check and encourage them to be small.
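Just so we have them written down concretely, the two optimization problems look like this; I'm writing the error term as a squared L2 norm, and the exact constant factors (a 1/2 here, a 1/n there) vary by textbook, so treat the form rather than the constants as the point:

```latex
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta}\;
  \underbrace{\lVert y - X\beta \rVert_2^2}_{\text{error}}
  + \underbrace{\lambda \lVert \beta \rVert_2^2}_{\text{regularization}}
\qquad
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\;
  \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
```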
It seems very natural when we think about it that way; maybe the first or second time we learned it, we started to get a grasp of why this works and what it's trying to achieve. But at least for me, there was always a slight logical disconnect: I understood what we were trying to do, but tacking on these regularization terms, this lambda times the L1 norm of beta or lambda times the L2 norm of beta, seemed somewhat arbitrary. Where did these come from? Whoever created this, why did they choose to tack on these terms and not something else? That actually got answered for me years later, when I looked at linear regression again, but now attacked from a Bayesian statistical point of view. We'll see that by starting from a Bayesian setting, we come back to these same optimization problems, and that gives us much more motivation about where these terms actually come from and what it means to do regularization, looked at from a different point of view.

So let's start looking at that now. Since we're using Bayesian stats, it's probably going to be helpful to look at a posterior probability first and try to maximize it. The posterior we're going to look at is the probability of beta given y. Let me pause here and make sure we understand what this means: beta is the unknown parameter vector we're trying to solve for, and y is the known response variable, data we actually have. So the posterior distribution asks the question: given the known response y, what is the probability of some setting of the betas? The natural thing to do is to maximize this posterior, because if we can find the betas which maximize it, we have answered a very natural question: this setting of the betas is the most likely given the data we actually observed. We give that setting of the betas the special name beta hat MAP, where MAP stands for maximum a posteriori, a Latin phrase which, in a nutshell, means the betas which maximize the posterior probability.

Now, how do we actually solve this? The first step, as usual, is to write the posterior out using Bayes' theorem, in terms of three other probabilities. Notice that the denominator has nothing to do with beta, and since this is an optimization problem over beta, we can ignore the denominator altogether. That's a nice thing, because if we think about actually computing the denominator, it's the probability of observing these y's, which is unclear or at least seems difficult to get, so it's good that we don't have to compute it explicitly to solve the problem. So we're looking at the argmax over all betas of the probability of y given beta, times the probability of beta unconditional on anything. These have special names which we've already seen. The first is called the likelihood, and it's the reverse conditional probability of the posterior: whereas the posterior asks what's the probability of this beta given the data we see, the likelihood asks what's the probability of observing this data given some setting of the betas. Related, but fundamentally different questions. The other term is going to be the most important one in this video, and it's called the prior. Let's make sure we understand what it's getting at: the probability of beta, with no y anywhere, asks the almost philosophical question, before I even observe any data at all, what is the probability of this setting of the betas? In a sense it's asking: what is my prior understanding of the world, my prior understanding of these betas, so that I can bring that knowledge into the problem and use it to get to my maximum a posteriori solution?

We'll come back to the prior in just a moment, but first let's work on the first term, the likelihood. The first thing we'll do is take the log of the inside, of the product of these two terms. We know that maximizing or minimizing something is the same as maximizing or minimizing its log, and logs are friendlier to our computers if we were to actually code this. Taking the log turns the product into a sum of logs. And the first part, the probability of y given beta, we can actually compute, believe it or not. The reason is the OLS assumption that the y_i are normally distributed with mean beta transpose x_i and variance sigma squared. Keep in mind y here is a vector of n observations, so the probability of seeing these y's, given some beta, is just the product of n normal probability density functions. Although that expression looks complicated, what you're looking at is a product from i = 1 to n of normal densities, and when we take the log, as we need to do here, the product becomes a sum of the logs of the insides. The leading constant doesn't matter, because it has no betas in it and we're optimizing over beta, so we only care about the term that does. So that's fine: we have a good way to compute the likelihood now.
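Written out, the pieces so far look like this; it's just a sketch of the algebra, with everything that doesn't involve beta lumped into a constant:

```latex
\hat{\beta}_{\text{MAP}}
  = \arg\max_{\beta}\; p(\beta \mid y)
  = \arg\max_{\beta}\; p(y \mid \beta)\, p(\beta)
  = \arg\max_{\beta}\; \log p(y \mid \beta) + \log p(\beta)

\log p(y \mid \beta)
  = \sum_{i=1}^{n} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left( -\frac{(y_i - \beta^{\top} x_i)^2}{2\sigma^2} \right) \right]
  = \text{const} - \frac{1}{2\sigma^2}\,\lVert y - X\beta \rVert_2^2
```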
Now let's come back to the most important part of this whole video, the prior. We said the prior is our understanding of the world for the betas before even looking at the data. But what should that be? This is where some of the criticism and some of the nuance of Bayesian statistics comes in: how do we pick this prior? If we're allowed to pick any prior understanding of the world, which one should we pick, and who's correct if we have two different opinions? That truly is one of the downsides, or at least things to think about, with Bayesian stats: you need to really think about what prior you're setting and why you're picking it.

But let's just pick a prior for now, roll with it, and see what we get. A quick note first: how many betas are there? Remember, X is an n by p matrix, so there are p betas. Let's assume that each beta_j, for j = 1 to p, has a prior distribution that is normal with mean zero and some variance tau squared, where tau is up to us. In graphical form, each beta_j has a PDF with a lot of mass around zero and values close to zero, and not much mass around more extreme values of beta. I bring this up because this is where the regularization comes in, in the Bayesian framework. Remember, with lasso and ridge we were trying to keep the absolute values of the betas small, close to zero, and now we see that Bayesian statistics, using this prior, is trying to do the exact same thing. We are insisting, before we even see the data, that small values of these parameters are more likely, so if we then see data which would suggest more extreme values of beta, that is going to be counteracted, kept in check, by this prior understanding of the world. Put another way, it would take some extreme evidence in the data for us to accept very large values of beta, because of this prior distribution.

So we can write the optimization problem in this form. I understand that going from the previous expression to this one looks a little different, so let me explain the high-level steps; the algebra is just two or three lines, and I'm confident you can do it. The argmax becomes an argmin because some negative signs come out: one from the log-likelihood, and another out front when you take the log of the prior. Taking the negative of those negatives gives positive terms, and the problem becomes a minimization. The first term, the squared L2 norm of y minus X times beta, is exactly the sum from i = 1 to n of the squared errors, just written in vector form. The second part comes from the prior distribution: although I didn't show the algebra, if you take the log of the PDF of that normal prior and simplify, you get exactly sigma squared divided by tau squared, times the squared L2 norm of beta. Then we rename the ratio sigma squared over tau squared as lambda, and we get this problem.

You probably saw this already, but if you look at this problem and you look at the ridge problem from before, they're the exact same. When I first saw this it kind of blew my mind, but it also helped shed light on where ridge actually comes from and why it's motivated. Using this Bayesian statistical approach to linear regression, which didn't inherently have anything to do with lasso or ridge, we get exactly the same solution as the ridge problem.
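Here's a small numerical sanity check of that claim; it's just a sketch with made-up numbers, not code from the video. Under the Gaussian prior, the MAP estimate has the closed form (X^T X + lambda I)^{-1} X^T y with lambda = sigma^2 / tau^2, which is the ridge solution, and minimizing the negative log posterior directly gives the same answer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
sigma, tau = 1.0, 0.5            # noise std and prior std (arbitrary choices)
y = X @ beta_true + rng.normal(scale=sigma, size=n)

lam = sigma**2 / tau**2          # lambda = sigma^2 / tau^2

# ridge / MAP closed form: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# the same estimate by minimizing the negative log posterior directly:
# (1 / 2 sigma^2) ||y - X beta||^2  +  (1 / 2 tau^2) ||beta||^2
def neg_log_posterior(beta):
    resid = y - X @ beta
    return resid @ resid / (2 * sigma**2) + beta @ beta / (2 * tau**2)

beta_map = minimize(neg_log_posterior, np.zeros(p)).x

print(np.allclose(beta_ridge, beta_map, atol=1e-4))  # expect True
```

Multiplying the negative log posterior through by 2 sigma squared gives exactly the ridge objective from earlier, which is where lambda = sigma squared over tau squared comes from.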
And if you're wondering about the lasso problem, there's actually a nice analog there too; we just need to pick a different prior. Here we picked a Gaussian prior for the betas. If instead we pick a Laplacian prior, a distribution that looks kind of like a Gaussian but with a sharp peak, a point, at the top, then beta hat MAP, the maximum a posteriori solution that comes from the Bayesian approach, is exactly the same as the solution of the L1, or lasso, approach. Again, this is kind of crazy: those terms that seemed unmotivated, which made sense but were hard to see why they are what they are, we now understand exactly why they are what they are when we approach the problem from a Bayesian framework.

We can also draw analogies between the parameters, the choices that we have. For example, let's do a sanity check. We know that lambda in the ridge setting controls the amount of regularization: the bigger lambda is, the more the betas are driven toward zero. Does that check out? Well, a big lambda here means a small value of tau, because tau squared is in the denominator. A small tau means the prior distribution becomes narrower and more peaked, and a more peaked prior means a stronger prior belief that the betas are close to zero; there's a lot more probability density in that region. On the other hand, if tau is very large, we get a wide, flat prior, which means we don't have any real preference for the betas being near zero; and a very large tau means a very small lambda, which, as we know from the ridge setting, means we don't regularize much. So all of this knowledge is indeed consistent.

Approaching the end of this video, I'll say that I've made it seem like Bayesian linear regression is this great thing that solves all of our problems with regularization and linear regression, but you can also think about it as just shifting the problem somewhere else; I don't want to make it seem better than it is. Our main issue was: where do these regularization terms come from? They seemed arbitrary. Using Bayesian linear regression we found that they're not arbitrary: we get them from this maximum a posteriori problem by picking certain priors. But then you can just ask, okay, why did you pick these priors? How do you pick the correct value for tau, or b in the Laplace case? Why not some other prior entirely? These are all totally valid questions and things we do need to think about in Bayesian statistics. So I think this is better seen as a different path to the same place we saw before, rather than some strictly better way of thinking about linear regression; it's just a different way of thinking about it. I'll leave a quick recap of the prior-to-penalty correspondence below. If you learned something in this video, please like and subscribe for more just like this. I'll see you next time.
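As a quick recap, here are the two priors and the penalties they induce, written in their usual parameterizations (Gaussian with variance tau squared, Laplace with scale b). After multiplying the negative log posterior through by 2 sigma squared, the leftover factors become the lambdas; the exact constants depend on how you write the lasso and ridge objectives, so treat this as a sketch:

```latex
\beta_j \sim \mathcal{N}(0, \tau^2):\quad
  -\log p(\beta) = \frac{1}{2\tau^2}\lVert \beta \rVert_2^2 + \text{const}
  \;\Longrightarrow\; \lambda = \frac{\sigma^2}{\tau^2} \;\text{(ridge)}

\beta_j \sim \mathrm{Laplace}(0, b):\quad
  -\log p(\beta) = \frac{1}{b}\lVert \beta \rVert_1 + \text{const}
  \;\Longrightarrow\; \lambda = \frac{2\sigma^2}{b} \;\text{(lasso)}
```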