This is Level II of the CFA Program, the Quantitative Methods topic, and the learning module on the basics of multiple regression and underlying assumptions. You might be tempted to skim this learning module because of the word "basics" in there, but I caution you not to. Although this module is relatively short and relatively basic, there are a handful of items we're going to need as we extend through the next six learning modules.

I'm guessing you've figured out that CFA Institute has moved to this learning-module presentation, and I think it's an awesome evolution by the Institute. It's good news for you candidates, because it tends to condense the material into shorter chapters (they're calling them modules) and it allows for greater linkage between and among the different modules inside a topic.

Now, those of you who have watched me over the years (hopefully it hasn't been that long) will note that I try to steer away from identifying LOSs that are more important than others; I just give you hints, because for some of these LOSs it's almost impossible to write an exam question. But I'm going to break my rule here, so I want you to look at these three LOSs. The first one is pretty basic; we learned it back in Level I for the simple linear regression model. In fact, I got out my old materials, and when I went through the program, you know, a hundred years ago, simple linear regression wasn't even part of Level I. So this is a logical extension of what we do in Level I, from simple linear regression to multiple linear regression, and all of the material you learned within the last year still applies. But the third LOS, I'm going to say, is the most important of the three, for the sole reason that the end-of-module vignette has five questions in it and every question asks you to interpret residual plots. So pay close attention as we go through the slide deck on the graphs: what we're going to do is take LOS 1 and LOS 2 and figure out what they mean in terms of a graphical representation of the data.

All right, I want you to take a look at this multiple regression model. You see a plus sign in there; visually eliminate the second plus sign and everything to the right of it, and what's left is what shows up in Level I, a simple linear regression, in which we're saying we have two variables and we think one does a really good job of explaining the other. We talked about this in Level I: there's a dependent variable, and there is one independent variable on the right-hand side of the equal sign. Now all we're doing is adding extra independent variables on the right-hand side. We still have the dependent variable, noted as Y, then there's an intercept term (lowercase b), and then there are all these slope coefficients. And this is important: in Level I we probably just called them slope coefficients, but now they're partial slope coefficients, because they tell us the relationship between that independent variable and the dependent variable holding all of the other independent variables constant. That's important; remember to interpret them as partial slope coefficients. So there's X1 and X2 and X3, and X10 and X50 and X100; you could have as many as you want over there.
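Just to pin down the notation before we move on, the general model we're describing can be written as

\[ Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} + \varepsilon_i \]

where Y is the dependent variable, b_0 is the intercept, b_1 through b_k are the partial slope coefficients, and epsilon is the error, or disturbance, term. Each b_j is the change in Y for a one-unit change in X_j with all the other independent variables held constant.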
As we go through the next six learning modules, we'll talk about whether there's an optimal number of independent variables, and the answer is probably no. But based on all the material we cover in the CFA Program, you could probably reasonably determine how many is, let's say, nearly optimal. I'm pretty sure "nearly optimal" is not a calculus term or a finance term, but you get my point. And then all the way at the end there's something called an error term, or a disturbance term, and that's going to be really important, especially when we do some graphing here. So what are we saying? We're just saying there are a handful of variables that can be used to explain some other variable out there.

All right, so what are we doing here? On the horizontal axis there's an X variable, and on the vertical axis there's a Y variable. There's the intercept term right there, and then there's the slope coefficient, just like back in Level I. We're going to do what you learned back in grade school or high school: rise over run for the slope coefficient. That just means changes along one axis relative to changes along the other axis (I almost said "intercept" there; I meant axis). And notice that we have drawn a blue line: one of the things we're assuming, under both multiple and simple linear regression, is that these variables have a linear relationship. I'm not sure that second arrow point is really important, but boy, the Institute could slip it into one of these exams: the intercept term is defined as the value of the dependent variable when all of the independent variables are zero. Think about a model where you're trying to predict something that was critical in 2022, say inflation, and you have a whole bunch of variables on the right-hand side, like Fed policy, the federal funds rate, maybe the yield to maturity on AAA-rated bonds. What's the likelihood of all of those things being zero? Probably not very high. My professors in college, I don't know that we ever even talked about the intercept term as being important.

All right, why are we using multiple regression? Well, I said this in my introduction: to test existing theories. Variables like currency values and interest rates are probably super important in explaining changes in inflation. I like that second one: to identify relationships between variables. On the economic surface we may find that currency values and inflation are related, but there might be some hidden relationships, maybe a latent relationship that we don't obviously see, and regression analysis can help uncover those relationships. And then of course we can use multiple regression to forecast inflation in the future. Look at that second embedded bullet point: "due to their complexity, statistical tests and fundamental justification are necessary to explain financial and economic relations fully." That sounds like a mouthful, so let me rewrite that sentence as something like this: the Institute is teaching us all sorts of things about the relationships between and among variables, so let's use all of that material to make a reasonable justification for performing a statistical test.
Clearly, the Institute is not going to put an exam question in front of you that says: Jim holds the CFA designation, and he's trying to predict the returns on Twitter after the attempted takeover, and then the poison pill, and then all the politics involved, and everything else going on with Twitter, and what Jim does with everything he learned in the CFA Program is come up with a model that says the returns to Twitter are a function of the number of clouds in the sky on a daily basis and the number of foul shots LeBron James makes during the course of that day. That makes absolutely no sense at all. So economic justification: that's super important.

All right, what the reading does is give us some examples, and that's what we do here in the purple and in the red. We can use profitability, growth, revenue, dividends, all sorts of things, and come up with some kind of a model that predicts either financial difficulty or financial success. We can also look at stock price and trading volume and see whether there are correlations between and among other kinds of variables. Maybe an analyst is trying to determine how they change daily: why is it that Twitter had this much trading volume before the whole Elon Musk situation and now it has that much trading volume? Those are really great questions.

All right, let's look at an example: the impact of inflation and real rates on the price of the US dollar. On the left-hand side of the equal sign there's our dependent variable, the price of a US dollar index, which is probably a basket of exchange rates between the US dollar and maybe five or eight or ten or twenty different currencies. So you have this basket index, and, as good financial analysts, we think that inflation is probably important and we also think that the real interest rate is important. There's our multiple regression model. So what do we have to do? We need to go out and collect data on the price of the US dollar index, collect data on inflation, and collect data on the real rate of interest. In an ideal world we would have this data readily available, let's say on a daily basis, so we could go back over, pick a number, the last 100 trading days and get the two independent variables and the dependent variable for each of those days. The problem in reality is that while we could get the price of the US dollar every day, it would be difficult to get inflation daily, because the government doesn't announce rates of inflation on a daily basis; it does it monthly. Now, the real rate of interest we might be able to back into using things like Treasury securities and inflation rates predicted by some kind of a model, but then you run into the question of what the problems are with that model. You see my point: when we're collecting all this data, we need to make sure the data points refer to the same time periods. So what we'll probably do in this model, since we can get actual inflation from the government on a monthly basis, is just use monthly prices and monthly real rates of interest.
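This next bit isn't in the reading, but just to make that frequency point concrete, here's a minimal sketch in Python of lining everything up at a monthly frequency. The file names and column names are hypothetical placeholders I made up, and it assumes the monthly file is stamped with month-end dates:

```python
import pandas as pd

# Hypothetical inputs: daily USD index prices, and monthly inflation and real-rate figures.
daily = pd.read_csv("usd_index_daily.csv", index_col="date", parse_dates=True)
monthly = pd.read_csv("inflation_real_rates_monthly.csv", index_col="date", parse_dates=True)

# Collapse the daily prices to one observation per month (the month-end price),
# so every row of the combined data set refers to the same time period.
month_end_price = daily["usd_index_price"].resample("M").last()

# Keep only the months where all three series are available.
df = monthly.join(month_end_price, how="inner")
print(df.head())
```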
So let's suppose we do all that and we perform the regression analysis. There are the coefficients going down the left-hand side: 81, minus 276, and a positive 902. What does that tell us? That there's a negative relationship between inflation and the price of the dollar, and a positive relationship between real rates and the price of the dollar.

The next column is a measure of standard error. You should remember this from Level I. Almost any phrase that has the word "standard" in it, like standard deviation, standard error, and standard error of the estimate, has its own distinct meaning, but they all essentially tell us our degree of error, how wrong we are. If we have a data set whose observations are spread way, way out around the mean, then we'll probably have a high value for whatever "standard" measure we're using. Then, and this is important, in the next columns there's the test statistic and there's the p-value, and you can use either one of those to determine significance.

So what did I say about that second column? That there's a negative relationship with inflation and a positive relationship with real interest rates. But we need to ask ourselves: do we even care? Maybe they go up and down together, but is it material? Of course, we can't use that term in our quantitative methods terminology; we ask whether the relationship is statistically significant. You should remember from Level I that a decent general rule of thumb is that a t-statistic is significant if it's greater than about 2 in absolute value. My professors in college let us get away with that, in general terms, when we were reporting results in papers or dissertations, but they also said that to do the real test you need to get out the tables and do the t-test or the z-test, and that's where the p-value comes into play.

So what do we see? The test statistic for inflation is minus 1.18 and its p-value is 0.27, so that is not statistically significant. All we can say about that partial coefficient of minus 276 is that the estimated relationship is negative, but it is not statistically significant. Then look at real interest rates: a p-value of 0.01 and a test statistic of 3.22. That sounds an awful lot like a partial coefficient that is statistically significant. And what's the standard? A p-value of less than five percent, if we want 95% confidence. We could demand 97.5% confidence or 99% confidence, or we could say something crazy like 50% confidence (not sure that would be reasonable, but we could). So there's that last sentence down there: only the real interest rate variable is significant at the five percent level. There we go. And here's some more summary data on what I was talking about on that previous slide.
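Again, this isn't the module's own material, but if you want to see where a table of coefficients, standard errors, t-statistics, and p-values comes from, here's a minimal sketch using statsmodels and the hypothetical monthly data frame from the earlier sketch (the column names are still placeholders):

```python
import statsmodels.api as sm

y = df["usd_index_price"]                            # dependent variable
X = sm.add_constant(df[["inflation", "real_rate"]])  # add_constant supplies the intercept column

results = sm.OLS(y, X).fit()                         # ordinary least squares

print(results.params)    # intercept and the two partial slope coefficients
print(results.bse)       # standard errors
print(results.tvalues)   # t-statistics (each coefficient divided by its standard error)
print(results.pvalues)   # p-values: compare each one to 0.05 for 95% confidence
```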
All right, let's move on to the next one: possible drivers of a company's percentage return on capital. We take profit margin, sales, and the debt ratio as the possible drivers. So what are we working with? A five percent level of significance and a sample size of 25. Ah, that's important: that's less than 30, so we're going to use a t-value here. Let's quickly go through it. Notice these coefficients are all positive; there are the standard errors, and there are the t-stats. The intercept term is statistically significant, but that only tells you the line doesn't go through zero, and this is probably why my professors said not to worry about the intercept term. Although there was one example we did about trying to determine whether the risk-free rate of interest, which, remember, shows up on the vertical axis in the capital market line, is significantly different from zero; that one was probably worthwhile. Anyway, what do we have? T-statistics of 1.7 and 1.3, so p-values of 0.09 and 0.17. Sales and the debt ratio are clearly not statistically significant, but profit margin is, and that should make perfect sense: notice its p-value over there of essentially zero. And there's that far-right column comparing each p-value to five percent.

All right, how about the LOS regarding the assumptions? These are the exact same assumptions that show up in simple linear regression, but they take on a little more meaning because now we have a few more variables to worry about. First, the relationship is linear. Next, the independent variables are not random, and there is no exact linear relationship between two or more of the independent variables; if you have that, it's called multicollinearity. Here's the example I give my students whenever we run into this. Suppose you're trying to determine whether your newly installed home air conditioner is working, and working efficiently, so you decide to collect data on the number of times the air conditioning unit outside clicks on during the course of a day. Your independent variables are going to be a thermometer on the roof facing east, a thermometer on the roof facing west, and a thermometer out in the middle of the backyard. Those are the three independent variables. Clearly they're not independent, because the sun is going to shine here, then here, then there. There might be some shading involved, so there will be variability and they won't read exactly the same temperatures, but when the sun is shining and it's 100 degrees outside, those three thermometers are going to have a high degree of correlation.

Next: the expected value of the error term is equal to zero. This simply means we know we're going to make mistakes in our model. Go back to that return-on-capital model: sales, the debt ratio, and profit margin are not going to perfectly predict the percentage return on capital, so we're going to make some mistakes. The expected value of the error term being zero simply means that the mistakes we make on one side offset the mistakes we make on the other side.
Now we also need to concern ourselves with the variance of the error term. The error term has a mean of zero, but it's not going to have a variance (or standard deviation) of zero, because we're making mistakes in both directions. That's why we square those differences, to get rid of the negatives, so the mistakes on one side don't simply cancel the mistakes on the other. What we want is for the variance of the error term to be equal across all observations. Error terms like that are called homoskedastic; a violation of that is called heteroskedastic.

Next, the error term is uncorrelated across observations. Go back to my silly air conditioner example. Clearly we're going to make mistakes, but how will we make them? If the errors are correlated, then when the thermometer on the front roof makes a mistake, the thermometer on the back roof and the thermometer out in the backyard are probably going to make mistakes too. They won't be identical mistakes, but they'll be related, and that's what we don't want.

And finally, we want the error term to be normally distributed, which is kind of a summary assumption based on the three points above.

Now, this is what I was saying earlier in my introduction: if you look at the vignette at the end of this learning module, there are five questions, and every question asks you to interpret a graph. That's what we're going to do over the next handful of slides. After we're done with this slide deck, I want you to immediately open up your book, or pull it up on your phone digitally, and look at that final question set. Hopefully, after what we're doing here today, you'll be able to answer those questions correctly.

All right, let's work through a couple of these assumptions. First, is the relationship linear? It might not be. The way we check is to plot the residuals up the vertical axis against the fitted values along the horizontal axis. And what did we say just a moment ago? Those residuals, that error term, should have an expected value of zero, so notice there's a zero on the vertical axis. Notice the red line we've drawn: it's not really a straight line here. What the module does in its graphs is draw a perfectly straight line at zero all the way across the horizontal axis; here we've just drawn a red line that stays close to zero. To judge linearity, look at the graph and imagine you're walking along zero (I can't even act that out with my fingers). As we walk along zero, if we look to the left we see a bunch of data points, and if we look to the right we see a bunch of data points. Do we see about the same number of points, at about the same distances? This is really just a regular old eyeball test, and for this one we could probably say, yes, this is probably a linear relationship, because, as it says up at the top, there's no pattern evident.
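Here's a rough sketch of how you could draw that residuals-versus-fitted picture yourself, reusing the fitted results object from the regression sketch a few slides back (again, an illustration, not the module's own code):

```python
import matplotlib.pyplot as plt

plt.scatter(results.fittedvalues, results.resid)
plt.axhline(0, color="red")          # the "walk along zero" reference line
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted values: look for no pattern")
plt.show()
```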
The Institute is very likely to throw a graph like this at you and ask: is this linear or not? You just eyeball it and say, you know what, it probably is. But suppose they give you something like this other plot. There's no way that relationship is linear, because there's a little, what does that look like, a mountain? Dare I say it looks almost like part of a normal distribution. So that one is nonlinear. I'm guessing that when the Institute asks this question it will give you something relatively obvious: this one is linear and this one is not. If it gives you a graph somewhere in between, without a full hump, just a little hill, I would probably err on the side of nonlinearity.

All right, how about the second assumption here, autocorrelation? What we're concerned about is that the error terms are related to each other. This particular learning module mentions it just in passing, but we put this slide together because there's a learning module (I can't remember whether it's learning module four or five in Quantitative Methods) where we'll look at all of these violations of the assumptions, how to correct for them, and what their consequences are. So this slide is probably not terribly important beyond the definition for this particular learning outcome statement, but it will become super important when we get to that future module. Remember this: we check for autocorrelation using the Durbin-Watson statistic. Without getting into too many details (again, we'll do this in another recording shortly), it's a standardized statistic, so its values fall between 0 and 4. A Durbin-Watson statistic of about 2 means no autocorrelation, and that's a good thing. If it's less than 2, you have positive autocorrelation; if it's greater than 2, you have negative autocorrelation. There are some other tests that we'll talk about as well. And how about this one, this is really good: seasonal or correlated patterns. Go way back up to the very top: autocorrelation most often appears in time-series models, and that's going to be a super fun discussion, but not today. This is what I was talking about with my thermometers on my roofs and in my backyard (roofs? rooves? I'm not sure what the plural is).

So, multicollinearity. Here we go: we have an X1 variable and an X2 variable, and clearly, if those are thermometer one and thermometer two, there's going to be a high level of correlation between the two variables. Now, notice on the left-hand side we have this VIF, the variance inflation factor, which lets us gauge the degree of multicollinearity. These relationships aren't described in this particular learning module, but they are in a future one, so just think about it this way: if the factor is below about 5, there's probably not too much concern about multicollinearity; if it's greater than 10, you've got serious issues (I'm certain my temperature example would have a factor greater than 10); and that future learning module says that if you're between 5 and 10, you need to investigate further.
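To tie those two diagnostics to something concrete, here's a small sketch using statsmodels, reusing the hypothetical X matrix and results object from the earlier regression sketch; the thresholds in the comments are just the rules of thumb mentioned above:

```python
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Durbin-Watson: about 2 suggests no autocorrelation; below 2 positive, above 2 negative.
print("Durbin-Watson:", durbin_watson(results.resid))

# Variance inflation factor for each independent variable (skip the constant column).
# Roughly: below 5 is usually fine, 5 to 10 deserves a closer look, above 10 is serious.
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(name, variance_inflation_factor(X.values, i))
```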
What's the point here? The point is to be able to look at a graph like this and say: yes, when X2 increases, X1 increases, and it does so in a linear fashion, and that's clearly a violation of the multicollinearity assumption.

Now, heteroskedasticity is the opposite of homoskedasticity, where we said we want a constant variance. So take a look at this same kind of plot, residuals on the vertical axis and fitted values on the horizontal. What do we want to see? We'd like the blue points to run straight across, like a rectangle. But what's happening here is that the cloud spreads out into a funnel shape. This is a clear picture of heteroskedasticity: we start out with very low errors, very low variability, but as we move out through the fitted values our mistakes become bigger and bigger and bigger, so the variance is not constant. Look at the bottom there: there are a couple, or a handful, of tests we can use to test for heteroskedasticity and then correct for it, but that will come up in a later learning module. So once again, we do exactly what we did in the previous example. Look at the red line. Is that a line? Kind of a curve? It's not quite a line and not quite a curve; how about if I say "nearly linear"? That pretty much looks like what we want. Again, pretend you're walking along that red path, looking to your left and to your right. If you see about the same number of points at about the same distances, that's probably evidence the model is correctly specified, at least as far as the homoskedasticity assumption applies. But if we walk along that line and notice that early on the points are pretty close to us, and then as we move further along they get further and further away from us, and we're sloping upward as well, that's the funnel again.

Now this is a really cool test, the normal Q-Q plot. We're plotting quantiles versus quantiles, and notice that there's a line up there. What you do is plot the data, and those circles are the quantiles; the quantiles can be almost anything, the 10th percentile, the 30th percentile, whatever. In a perfectly normal distribution you'd have all those circles sitting right along that line from beginning to end. In this example we're basically right on the line all the way until we get to some point where the circles squiggle away from the black line, but this is still a good example of a normal distribution. What you'll look for in this kind of picture on the exam (and I think there's a really good example at the end of the learning module) is something like a smile, where the points start above the black line, run along it, maybe dip a little below, and then come back above, so it looks like a smile. Or a frown. If you see a smile or a frown, you're seeing evidence of skewness and evidence of kurtosis. Remember, the normal distribution has no skew and no excess kurtosis, so this normal Q-Q plot goes a long way toward establishing whether we have a normal distribution.
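And here's a minimal sketch of producing that normal Q-Q plot of the residuals with statsmodels (same caveat: an illustration built on the earlier hypothetical regression, not the module's own code):

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Points hugging the 45-degree line suggest normally distributed residuals;
# a "smile" or "frown" away from the line suggests skewness or excess kurtosis.
sm.qqplot(results.resid, line="45", fit=True)
plt.title("Normal Q-Q plot of residuals")
plt.show()
```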
And remember, it's the normal distribution of the error terms we're checking; that's important. So here's an example: is that a smile? I don't know, it's kind of a smile, but think about how that path of circles can drop below the line and come back up above it. If you get one that looks like this, I don't even know what that distribution would be, but clearly it's a distribution that is not normal.

All right, so here's the list of the main assumptions: homoskedasticity, independence of the independent variables and of the error terms, normality, and linearity. And that takes us through these learning outcome statements. So what have we done? We've really just built on what we learned in simple linear regression and added a couple of important things, but I think the key is interpreting the residual plots, and you can only interpret the residual plots if you know LOS 1 and LOS 2. That should be your focus. Hey, thanks for watching, have a great day, and good luck studying.