Okay, we're recording, so now we're ready. Welcome to our first session of SD 501. I will say I think this is a tremendously fun course, so my hope is that everyone will agree with me on that, at least by the end of the course if not at the beginning. It should feel very familiar in the sense that you just had 500, but it has a different goal in mind. 500 was where we did a survey of a lot of different topics; we covered them very quickly and never went particularly in depth with any of them. The goal of this course is for us to leave with much more of an in-depth understanding of a few topics. To that aim, we're going to remind ourselves today of regression analysis and go in detail through each of the components involved in doing it. We'll then also, most likely, look at an example and work out how we model things in regression analysis, which is also referred to as best-fit or fit analysis; there are a lot of different terms, but the base term is regression. We'll also review the ANOVA table, which is something that comes up when we're doing regression analysis, and this will give us all the tools we need to feel competent and successful at the first project, which is due in about two weeks, something to that effect. At any point during the course, if you suspect something like a due date is off, or you're having a hard time meeting a due date, let me know. Hopefully by now you know I'm pretty chill; I don't believe in going through life with anxiety, it just doesn't make sense. Okay, so let's go over the topics one at a time. For starters: why do we call it regression modeling? The name
itself is actually very telltale about how the topic is used. Imagine that we have a set of axes, and then a bunch of data values; the exact dots I draw don't matter, but the idea is that you can see the dots I've drawn have a linear relationship that seems to be occurring. If I then draw a line of best fit, this will be the best-fit line for the set of values that I drew. The term regression, or regressing, means moving toward something, and so when we refer to something as a regression model, the idea is that the model, which is this red line, is what happens when we move toward the average value. A regression equation is sometimes described as regressing toward the mean, which really just means moving toward the mean. That tells us a lot about how to interpret what any regression equation is and what it does: it is simply an average of the available data values. Now, when we look at the example dots, the idea is that this line is the average based upon the data values that are there, but only if we assume that everything is a line, because these dots could represent a different relationship. In fact, if I wanted, I could see a bit of a swirl in this, meaning I could have a curve that seems to represent the data, and that could be something we would regress toward. Now, which one is better? That's a different question; the fact that I can do both is a different question from which one is best. And that means that when we are talking about regression analysis, we need to have a specific equation in mind. So we would sometimes refer to the different options for what the model could look like as having different fits.
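As an aside, the idea of fitting both a line and a curve to the same dots can be sketched in a few lines of Python. This is my own illustration with made-up data, not the lecture's actual plot, using NumPy's `polyfit`:

```python
import numpy as np

# Hypothetical sample: 21 points that are roughly linear in x, with some noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 21)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.size)

# Degree-1 fit: the "line of best fit" (regressing toward the mean).
slope, intercept = np.polyfit(x, y, deg=1)

# The same data could also be fit with a higher-degree curve (the "swirl").
# Which fit is *better* is a separate question, answered by later analysis.
cubic_coeffs = np.polyfit(x, y, deg=3)
```

Both calls succeed on the same sample; having two candidate fits is exactly why we need analyses to decide between them.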
Analyses then determine which model has the better fit for the sample, and it's really the sample that's going to determine everything here. Now, when we're talking about particularly large data sets and we do regression analysis on them, we do not use all of the data in the construction of our model. We have a rule: we pick a sample as the training set for our data, where n, the sample size, equals 10% of the population or at most 1,000, whichever one is smaller. Now, is this a hard rule? No, this is not a hard rule; this is a guideline. Think of it like a speed limit, at least in the United States: a speed limit is actually a bit of a gray area, in the sense that you can technically drive faster than it. I'm not advocating for that, I want to be clear, since I'm currently recording myself; I'm not saying speed limits don't exist. The point is that your car is capable of traveling faster than the speed limit. However, is it a good idea? The answer is, well, it depends upon how much over the speed limit, and how many law enforcement officers might be in the area. If I am five minutes from home and I need to use the restroom, I tend to go at a different speed than when I have another three hours to my drive; interpret that as you will. The ability to do something doesn't necessarily mean it's the intelligent choice. So why are we capping this at a thousand? Because beyond that it becomes very computationally taxing, and we wish to avoid that. There are some situations where we could use more, a couple thousand values, as our training set; it doesn't mean we should. So, when we have a regression model, we get an ANOVA table as the standard analysis.
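That training-set guideline, as I read it here (10% of the population, capped at 1,000 because of the computational cost), is easy to express as a small helper; the function name is my own, not from the lecture:

```python
def training_set_size(population_size: int, cap: int = 1000) -> int:
    """Guideline, not a hard rule: n = 10% of the population, capped at `cap`."""
    return min(population_size // 10, cap)

# Small population: 10% of 500 is 50, well under the cap.
small = training_set_size(500)     # 50
# Large population: 10% of 50,000 would be 5,000, so the cap kicks in.
large = training_set_size(50_000)  # 1000
```

And, as with the speed limit: nothing stops you from raising the cap, it just gets computationally taxing.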
What is the ANOVA table? ANOVA stands for ANalysis Of VAriation, and the way the abbreviation works is that it's the first two letters of "analysis," then the first letter of "of," then the first two letters of "variation." That's where it comes from; wasn't me, I didn't invent it. I will say it's not as clever as one would hope, but it works. Analysis of variation: variation, as a reminder, means difference from the mean. So when we go back to our picture and see that we have a bunch of dots, and we say, okay, we're regressing toward the average line, well, not every dot is on that line, and therefore they are different from the line. An ANOVA analysis means we're analyzing how those sample values are different from the line. Now, in simple terms, just to state this: a regression model predicts a dependent variable, which is an outcome, based upon independent variables, which are what we refer to as the predictors. Independent means that one occurrence does not influence the next. Now, we have hypotheses; hypotheses are a key component of any analysis. In particular, for the ANOVA test, the null hypothesis is that the means of the groups are all equal, versus the alternative, which is that at least one group mean is different. In regression, it tests whether the regression model explains a significant portion of the variance in the dependent variable. As we said, those dots were not all on the line. Well, can we explain how they're not? If we can explain why they're not, based upon some criteria, then everything in the model is explained: we have the line, which is the average value, and we have explained why the dots are not on the line; that would be everything we would want to explain. So we're going to now formally state the null and alternative hypotheses. I'll wait just a second so you can make sure that
you catch up with writing things down. Typically our days are going to be very practical, and I would say this is practical, because it explains exactly what we need to know to understand how to do the first project; the first project is actually asking us to do regression modeling. Okay, so formally we have an H0, which is that the model explains none of the variability of the response data around its mean. In other words, there's no explanation of why the dots are not on whatever equation we're using. Our alternative hypothesis is that the model explains some of the variability. Now, you'll notice that being left with the alternative hypothesis does not mean everything's great; it just means we're not horrid. If the null hypothesis is true, that is a horrible circumstance from the perspective of predicting, but having some of the variability explained does not mean things are great. That means that if we're left with the alternative, we still have additional analysis to perform. It would be like saying: imagine you get some of your monthly paychecks. Would you be happy with getting some of your monthly paychecks? I would not; I want all of my monthly paychecks. So "some" is not a good term. What do we use for testing this? One thing we use is what's called the F statistic. This is the test statistic in ANOVA; it's the ratio of the mean regression sum of squares to the mean error sum of squares. Okay, so what are the sums of squares? Well, there are a couple of sums of squares. There's the TSS; somebody back in the day thought it was somehow extremely helpful to come up with all these abbreviations, and I will tell you, it's like a lexicon overdose, there are so many terminologies, and it's a terrible naming convention, as you'll see in just a second. But this is the total variance in the data. What does that
mean? Well, it means the total amount that all of the data values differ from the mean. That makes sense; that one's actually named very well. Then there's RSS, and this is where I would say the naming convention starts getting bad, because any time you come up with multiple terms where all you're doing is changing the first letter, it's like, come on. This is how much of the variance in the dependent variable our model does explain. And then our final one is ESS, and now you know why I said this is such a bad naming convention, because if you're a letter off, you're on to a completely different term: this is how much of the variance in the dependent variable the model does not explain. Now, this particular one is also referred to as the residual sum of squares. I will tell you that when I was in college, which was a long time ago by the way, one of these was called RSS and the other SSR, and that was tons of fun, trying to remember which was which when the letters involved are exactly the same. So it's at least slightly better to refer to it as ESS, which stands for error sum of squares; a lot of times they don't bother telling you what it stands for. So RSS stands for regression sum of squares, and then there's ESS, which is also known as the residual sum of squares. So there's the residual sum of squares and the regression sum of squares; what a massacre of naming. Somebody should have been more creative, but not my fault, I wasn't alive when it was developed. Now you can notice that, when I look at these three things, these two are in a sense opposites of each other, and if I add up
these two, I have to get the total, because the total variation in the model has to be the sum of these two things; together they make up all possibilities of what could be described. You're either variance that the model explains, or variance that the model doesn't explain. That's it; those are the only two possibilities. So when we go back to the F statistic above: the mean regression sum of squares, well, which one is the regression sum of squares? That's the one measuring how much of the variance in the dependent variable the model explains. And the mean error sum of squares is the other one. What the F statistic is looking at is, on average, what's the ratio of the explained to the unexplained. That's pretty cool. Now, those components of the ANOVA table are actually pretty useful. The following is not, I would say, necessarily useful on its own, but it is an important component of what happens: we have what are referred to as the degrees of freedom. These are the number of values in the final calculation of a statistic that are allowed, or free, to vary. We typically have two of them: a regression degree of freedom and a residual degree of freedom. I will tell you that in the modern world of computers these are not really that important, because they're automatically calculated and used, and you almost never have to reference them in any way, shape, or form; but knowing there's a reason they're given is kind of nice, if not vital. Why are they given? Because back in the day you needed them to look up values for the F statistic: you'd have to go to a table of values in order to read off its interpretation.
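The decomposition TSS = RSS + ESS, and the F statistic built from the mean squares, can be checked numerically. This sketch uses made-up data and follows the lecture's naming convention (RSS = regression/explained, ESS = error/residual):

```python
import numpy as np

# Made-up sample of 21 points with a roughly linear trend.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 21)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.size)

# Fit the line of best fit and compute the fitted values.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

tss = np.sum((y - y.mean()) ** 2)      # total variance in the data
rss = np.sum((y_hat - y.mean()) ** 2)  # variance the model explains
ess = np.sum((y - y_hat) ** 2)         # variance the model does not explain

# Explained + unexplained accounts for everything: TSS = RSS + ESS.
assert np.isclose(tss, rss + ess)

# Degrees of freedom for a simple linear model: 1 for regression, n - 2 residual.
df_reg, df_err = 1, x.size - 2
msr = rss / df_reg   # mean regression square
mse = ess / df_err   # mean error square
f_stat = msr / mse   # on average, the ratio of explained to unexplained
```

The final line is exactly the F statistic the ANOVA table reports; software computes it (and its p-value) automatically, which is why the degrees of freedom feel like legacy bookkeeping today.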
Two of the parameters you would need, in order to know where in the table to read off the value for the F statistic, were these two degrees of freedom. In the modern computing age, the value of the F statistic is automatically given to you, which means this is not something you need in order to look up values. At this point it's a legacy fact: still true, still important, it's just that there's nothing you'll do with it. How are we doing so far, any questions? Okay. The next things we have are the mean square values. These are the sums of squares divided by the degrees of freedom involved. So let's, for instance, go back to this picture and count the data values: it looks like there are 21, which is actually a remarkably convenient number, so I am declaring that there are 21 data values, in case I was off by one. Now, there's a total amount by which those dots differ from the line of best fit. What do I mean by differ? See how there's a vertical height from that blue dot to the line? That vertical height is what we refer to as a residual; it's the leftover, that's what residual means. It can also be referred to as the error. If I look at this whole example, there's a total amount of residual, or error, involved in the calculation. So what is the average one? In other words, if I divide the total by 21, I should get how much an average value is off; that's what these mean square values are, how much an individual one is off. That means there are in fact two main mean square values: there's the mean regression square, which is the RSS divided by its degrees of
freedom, and then there's the mean error square, which is the ESS divided by its degrees of freedom. Now, is that something that, in the modern computing age, is really important for us to find? The answer is: nope, not really. Is it still important? Yes, but it's important because it's what the computer is using to give us values. So again, it's given as a legacy result in the ANOVA table; it's not something we tend to use a whole lot separate from other information. Next, the p-value. Okay, here's where we come to something important. This is not a legacy item included just so we know what it is; this is straight up important to the analysis. What is a p-value? After calculating the F statistic, you need to find the probability that the statistic takes at least as extreme a value as the one observed, assuming H0 is true. What counts as low? Typically, less than or equal to 0.05. Now, a quick comment: normally when we write a decimal, we prefer to write it as 0.05; that's the standard, unless of course we're using the European system, in which case it would be 0,05 with a comma, but this is the United States, so we use the dot. Why? Who knows; nobody else in the world does it our way, though it has spread through a lot of North America. But when we report p-values, there's a tradition of cutting the initial zero off: because we will always be reporting something less than one, we cut off the zero. Is this a big deal? No, it's not a big deal, but it's something that shows you're educated in statistics. So continuing the
tradition of cutting the initial zero off communicates to the reader, ah, this person knows what they're doing, because the zero is unnecessary. Does it make sense for that to be such an important thing? Honestly, not at all, but it's one of those things that is tantamount to grammar in the context of statistics. If in your writing you don't capitalize the beginning of a sentence, the person reading it goes, oh, this person's uneducated, and it doesn't mean it's true: imagine you took a play of Shakespeare and just uncapitalized all the first letters of sentences; that does not change the meaning at all, it's still Shakespeare, it's still great, but the reader does not interpret it correctly, because they go, oh, this person doesn't know what they're doing. And in the off chance you have somebody who doesn't know statistics and they go, hey, you're missing the decimal, then you get to say: allow me to tell you, in statistics we actually don't include the zero; which, come on, that's worth it at least. So: a low p-value indicates that you can reject the null hypothesis. That also means that if you have a high p-value, the F statistic is such that you fail to reject the null hypothesis, and that's where we say we retain the null hypothesis; the opposite of rejecting is retaining. If you remember back from last semester, which was oh so long ago, i.e., like three weeks, one of the things we're supposed to imagine is that we are statistical lawyers: we presume the truth of the null hypothesis unless we come to terms with its rejection. The null hypothesis is the assumed truth, and we reject it only if given enough information.
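The two conventions just described, strip the leading zero when reporting, and reject H0 only when the p-value is at or below the threshold (typically .05), can be captured in two tiny helpers; the names are my own:

```python
def format_p(p: float) -> str:
    """Report a p-value without the leading zero, per statistical convention."""
    return f"{p:.3f}".lstrip("0")

def decision(p: float, alpha: float = 0.05) -> str:
    """Low p-value: reject H0. Otherwise: retain it (the opposite of rejecting)."""
    return "reject H0" if p <= alpha else "retain H0"

print(format_p(0.032), "->", decision(0.032))  # .032 -> reject H0
print(format_p(0.470), "->", decision(0.470))  # .470 -> retain H0
```

The formatting changes nothing mathematically, just like uncapitalized Shakespeare, but it signals to the reader that you know the convention.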
Okay, let's take a break here, and when we come back I'll cover multiple R and R-squared, and then we'll maybe take a look at an example and run through the components, because there are some additional parts. But this gets us started, because it defines all the terms I need in order to walk us through an ANOVA analysis. So we're going to take a little break here.