Understanding Regression Analysis Techniques

All right, okay. So, regression analysis. One of the things we're going to work on today is an example of how we can check whether we have things like log-linear, linear-log, et cetera, models, and investigate how we determine whether something is such a model. That's going to be our goal for today.

For starters, just a reminder. When we have these models, suppose that this red line represents a particular linear model, and suppose that the blue data values represent actual values that are collected, so these are sample values. Each green line then represents the difference between your actual data value and your predicted value; your predicted value is what you're predicting on your line. This model makes a key assumption: it supposes that things behave linearly. As we learned, that's not always the case.

For instance, we could have something with logarithmic growth. It could be reasonable to say that as some particular item, let's suppose it was food, were consumed by a species, the population would increase, but the total food consumed per individual goes down, and it's higher per individual when the population is lower. So we said, for instance, if we had an entire grass field and we placed a bunch of rabbits in said grass field, we could imagine that this is no longer a linear equation. In other words, more food doesn't necessarily equal more rabbits, because we would then have a rate of increase of b over x. So as x gets larger and larger, where x is the amount of land, for instance, then the overall number of rabbits should go down, or so we would hypothesize. Or, pardon me, reverse that: as the number of
rabbits goes up, the amount of food consumed goes down. There we go. This is an example of what we would refer to as a linear-log model. A linear-log model is where we take the logarithm of the independent variable, which is the x variable.

But we also learned that we could have things like cubic or quadratic models. That's where the data values look like the blue dots that we see right here, and then our model, if it's quadratic, looks like a parabola; if it's cubic, it looks like an x cubed equation.

Another example is a log-linear model. This is something that we use when we have a skewed distribution, and in general, where we have things that cannot be negative, it's at least worth a shot to see whether something is a log-linear model. A log-linear model is where we take the logarithm of the dependent variable, which is the y variable.

And then finally we saw that we have a log-log model. This is oftentimes used for things like supply and demand equations, and this is where you take the logarithm of both the independent and dependent variables. An example of this would be where we have birds and grasshoppers: as the number of birds goes up, the number of grasshoppers goes down, and as the number of birds gets smaller, the number of grasshoppers gets bigger, because the more birds you have, the fewer grasshoppers you have, et cetera. So this is an example of supply and demand.

Okay, so that now takes us to the point where we want to begin to investigate a little bit using R. Let's open up our favorite way of getting to R, so RStudio, that's my personal favorite. What we're going to do is load the data set, which is the example data set that I gave. It doesn't particularly matter which data set we use as an example,
but that is the one that I'm going to use today. I happen to have it in my downloads. Now, as I'm loading it into R, I'd like to rename the file, so I'm going down here to rename, because I don't want to have to call it by that extremely long name. I'm just going to call it g, and g will be short for "gee, I can't wait for Thanksgiving, because I'm really tired and need a break." That sounds like a pretty good name.

Okay, so I now have this data file that is called g, and you can see that it has a couple of different columns: X1, X2, a, b, and c. One of the first things we'd like to do is investigate whether we could come up with any kind of relationships between the various columns, which really means relationships between the variables that are involved in this problem. So one of the things we might want to do is come up with a plot of some of those dots, in other words a scatter plot. So let's suppose that we're going to plot g$X1, g$X2.
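
The class data file is not reproduced in this transcript, so the scatter plot step can only be sketched with stand-in data. The snippet below is a minimal sketch that builds a hypothetical data frame `g` with columns `X1` and `X2` (the values are invented for illustration; in class the file was imported through RStudio) and draws the same kind of plot.

```r
# Hypothetical stand-in for the class data set; the real values are not
# given in the lecture, so these numbers are invented for illustration.
g <- data.frame(X1 = 1:20)
g$X2 <- 3 * g$X1 + 2   # assumed to be exactly linear in X1

# Look at the first few rows and the column names.
head(g)
names(g)

# Scatter plot with X1 on the horizontal axis and X2 on the vertical axis.
plot(g$X1, g$X2, xlab = "X1", ylab = "X2")
```

The first argument to `plot` supplies the horizontal coordinates and the second supplies the vertical ones, which is why the order of the two columns matters.
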
In other words, I'd like to plot the x's as being X1 and the y's as being X2. Let's see what that plot looks like. Wow, that looks like a straight line, doesn't it? That looks rather remarkably like a straight line. So at that point we should maybe say, well, this is likely something that we should test with a linear model. Does that sound about right? I would think so, because I don't know about you, but that looks incredibly straight to me.

So let's give this a name. Let's call it lX1X2; that's the name I'm going to give it. That little left arrow means I'm about to make an assignment, and I want that to be a linear model. I want to predict my g$X2 variable, and then that little squiggle, which is a tilde, means I'd like to use the g$X1 variable. Okay, and then something happened, something mysterious, but nothing showed up. So this is where I can say, well, I'd like to get a summary of this lX1X2 model.

Now you will notice that it's in fact telling us that our best expectations could not possibly have been achieved better, because it's actually giving us a warning: this is essentially a perfect fit. I would agree, because if you're looking at the R-squared value, it's 1. What percentage of the variation in the data is explained? Well, that 1 stands for 100 percent. Yikes, that's a pretty great model. That's a really great model.

Okay, so this turns out to be a perfect example for us to investigate one of the other things that we can do. Let's take this perfect example and ask for a series of plots. I'm going to type in here plot and that lX1X2 model. I wish greatly I could make that larger for you, and yet sadly I cannot, so if you cannot currently see, I would highly recommend that you, I don't know, get closer to the board. Okay, and now I'm going to
click enter, and then it tells me that it's about to give me a couple of different plots, but I have to press Enter, the Return key, in order for it to give me my first one. Okay, so this is my first one. This is a plot of the residuals versus the fitted values. What does that mean? Well, the residuals here are the errors. Do you remember our picture? Let's get our good old trusty picture that we had up here, where we were saying that you have your data values and then you have these little green lines, and they stand for the amount of error. The blue dots are the actual data values, and the red line is what we're using to predict. Well, this right here is a plot of all of those green lines for this particular example.

Right here, it looks like, wow, there are a lot of errors, until we actually turn our heads sideways and see what the scaling is on this graph. How large are these errors? Small, right? How small are they? Just a little small? Well, this is 5 times 10 to the negative 15. Is that just a small number? I wouldn't call that just a small number. I'd call that a crazy small number.

What does it mean to say 5 times 10 to the negative 1? It means 5 times 1 over 10. What is 5 times 1 over 10? I agree with you, that's 5 times 0.1, thanks for the help on that, and 5 times 0.1 is 0.5. Thank you again, coming in for the win. Everybody's very ready for Thanksgiving; the silence tells me that. Okay, what does it mean to say 5 times 10 to the negative 3?
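
The perfect-fit summary and the scale of the residual plot can be reproduced on stand-in data. This is a minimal sketch, assuming a hypothetical data frame `g` in which `X2` is an exact linear function of `X1`; the object name `lX1X2` follows the lecture, while the numbers are invented. It also writes a few powers of ten out as decimals with `sprintf`.

```r
# Hypothetical data: X2 is an exact linear function of X1 (invented values).
g <- data.frame(X1 = 1:20)
g$X2 <- 3 * g$X1 + 2

# Fit the line: response on the left of the tilde, predictor on the right.
lX1X2 <- lm(g$X2 ~ g$X1)

# summary() may warn "essentially perfect fit" for data like this.
r2 <- summary(lX1X2)$r.squared
largest_residual <- max(abs(resid(lX1X2)))
print(r2)
print(largest_residual)   # zero or within floating-point rounding of zero

# Scientific notation written out as decimals.
sprintf("%.1f", 5e-1)     # 5 times 10^-1
sprintf("%.3f", 5e-3)     # 5 times 10^-3
sprintf("%.15f", 5e-15)   # 5 times 10^-15
```

Checking the largest absolute residual is a quick way to read the vertical scale of the residuals-versus-fitted plot without squinting at the axis labels.
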
It's 5 times 1 over a thousand, which as a decimal is 0.005. So how many zeros were there after the decimal point before the five? Two zeros. Okay, so if I have 5 times 10 to the negative 15, which is what this means, how many zeros in a row do I have after the decimal point before I actually get a five? Fourteen zeros. That is an extremely small number. So that means that this value right here is crazy small, but also this value right here, which looks like it's so far away, is crazy small. All of these values are extremely small, which tells us, yeah, this is pretty much a straight line. When all of these residuals are very close to that red line, that is a strong indication of a quote-unquote good model. So this would be not just a good model, this would be a great model.

Next graph. This is what we would refer to as the normal Q-Q plot. This is testing whether our errors come from a normal distribution, which is an assumption we have about our model: when we do a line of best fit, we're assuming that our errors follow a normal distribution. What this gives us is a picture where we want everything to fall largely on that line there. Do things kind of look like a line in this picture? The answer is heck yeah, it looks a lot like it falls on a line. As a result, we would say that this seems to indicate that our residuals are normally distributed, and that means our errors are normally distributed, which means that the only errors we have are just by chance.

Okay, next graph. This is what's sometimes referred to as a spread-location plot, and it shows whether the residuals are spread equally across the range of the predictors. This is how we can check the assumption of equal variance, which is what we would refer to as homoscedasticity. I think I gave that term. Did I give that term? Homoscedasticity.
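
The normal Q-Q and spread-location checks can also be built by hand, which makes it clearer what each picture is testing. The following is a minimal sketch on invented data (a hypothetical straight line plus random noise, since an exactly perfect fit leaves nothing to look at); it uses only base R functions and is not the class data.

```r
set.seed(42)  # make the invented noise reproducible

# Hypothetical data: a straight line plus normally distributed noise.
x <- 1:50
y <- 2 * x + 5 + rnorm(50, sd = 3)
fit <- lm(y ~ x)

# Standardized residuals: each residual rescaled by its estimated spread.
std_res <- rstandard(fit)

# Normal Q-Q check: points near the reference line suggest normal errors.
qqnorm(std_res)
qqline(std_res)

# Spread-location check: a roughly flat, patternless cloud suggests
# equal variance across the fitted values.
plot(fitted(fit), sqrt(abs(std_res)),
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)")
```

Both checks are judged by eye, so they give an indication rather than a verdict, which matches how the pictures are read in class.
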
And then there's another one, heteroscedasticity. So homoscedasticity means same error, and it's good if you see a horizontal line with equally, randomly spread points. Now, do we have a horizontal line here? Not quite. Is it a kind of horizontal line? It's a kind of horizontal line, kind of. So does this model obey homoscedasticity? Good enough. Is that a very precise way of describing things? No, it's not precise at all, but we're interpreting things based upon a picture. We're going to see a bad example; this is a good example. So this is actually like, eh, this is not so bad, and also all the dots appear relatively, kind of, randomly spread-ish.

Okay, the last plot that we get is this right here, and this is a plot of the residuals versus the leverage, and I've not talked about leverage at all. But the point of this graph is to look at whether we have what we refer to as influential data values, which means: do we have values that are so unlike the other ones that we should consider them outliers? Now, we don't happen to have any particular outliers on this plot, and I wouldn't say that there's anything with a particularly high amount of leverage in this picture. Therefore there's no particular relevance for this plot in this example.

Great. So now we're done with that particular pair of variables. Let's now look at another one. Let's do the following plot: let's plot g$X2 with g$a. This is where we'd like our x values to be our X2 values and our y values to be our a values. Now, looking at this plot, we have a very reasonable question: does this look like it should be linear? Hmm. I would say no, right? I would say very much no. But at the moment we don't have a way of checking it. So, knowing that it's not
going to work out well, we're going to move ahead with it. Let's have this be lX2aV1, V1 standing for version one, and now we're going to do a linear model. What you want to do is put what are going to be your y values first, which means g$a, then a tilde, and then g$X2. This is saying I'd like to predict all of the a values using X2. Okay, done. I've now created it, but I don't have any actual information yet, so I'd like to call that up. I'm going to create a summary of lX2aV1, and here we go. We see that this multiple R-squared value is, ew. So approximately what percentage of the variation in the data is explained in this particular model by having a line? Approximately three percent, right? That's not very good.

Now let's do a plot of this lX2aV1, and this is where we're going to get those four pictures. Now, we know that this is a bad example, right? Picture number one, residuals versus fitted: does this look like a straight line? No. So is it fair to think that this is in fact a linear model? The answer is nope, it's not. How about, is it fair to say that we have normally distributed residuals? The answer is actually yeah, our residuals are normally distributed. What's the thing that helps us to say that? Well, in general they happen to go close to the line, so they don't significantly deviate from that straight line.

The next thing is our spread-location plot, and this is for homoscedasticity. Does it look like we're following homoscedasticity? They don't really seem like they're spread evenly, so this is where I would say not a particularly homoscedastic model. And then the last: do we have any outliers? Well, none are really particular outliers for this model, because none are really far off from the predicted line.

Okay, so that was version one. Let's do a brand new lX2a; let's
call it version two, and we're going to create a linear model. We're again going to use g$a, but now we're going to predict it using g$X2, and I'd like to square it. This is a way of saying I don't want to predict all of the a's using the X2's, I want to predict all of them using X2 squared. And now I'd like to do a summary of lX2aV2. We look at this and we say, okay, well, this sure doesn't look like we got any better here, and in fact if we go up above, you'll notice absolutely nothing of any significance happened. Well, that is kind of a bummer, and the reason why it's a bummer is that R did not recognize what we were trying to do.

So let's make a new attempt at inputting this. If I'd like it to use a second-degree polynomial... wow, that is not how we spell polynomial. Hold on. There we go, it's poly, if I could just remember the right code the first time. poly, and then I have to put the degree in. I'm typing "ploy"; there we go, silly me. Two, meaning I want it to be a degree-two polynomial. There we go. Summary of lX2aV2, and now we get that same shocking message, which tells us that we're doing a perfect fit.

Now let's do plot of lX2aV2, and this is where we're going to get our four pictures again. Is this a mostly straight line? Eh, kind of a mostly straight line. And this is once again where we have to pay attention to the size of the errors along this side. Well, the errors along this side are extremely small. Are our residuals normally distributed? The answer is yeah, for the most part they're all basically along that line. So yes, the residuals are mostly normally distributed, meaning they mostly behave like a bell curve. Again, what's the check? Do they line up along a line on this
plot? Then we have the scale-location plot. Do things appear to be, for the most part, randomly placed on the page? Does it look like there's a particular pattern to them? Not really. So I would say they're mostly random, and when it's mostly evenly spread out, that's a good indicator of homoscedasticity. Now, this is not by any means a good picture; this is at best a fair picture. And then we check for leverage, and we actually have a fairly decent chance of there being some leverage in this problem, but that doesn't matter, because we haven't really learned much about that.

Awesome. So now we have worked through a second example and figured out how to find out if it's quadratic. That's what we just did. Now let's do a test: what if we wanted to figure out, for some example, whether we have logarithmic behavior? Let's pick a particular example and plot g$c with g$a. Let's see what it looks like.

Okay, now here's an observation we should make. You see how there's a bunch of data values that are stacked here? Can you observe that there's a bunch that line up over the exact same line? If we look at what values are actually in a, there are a couple of different threes here, aren't there? But there are also a couple of different 13s. Those are the values that are stacking, where we're getting the same value occurring. So is it likely that data like this is going to match a line? The answer is, hmm, honestly, not too much. When the data looks like this, it's not very linear in appearance, is it? How about quadratic? Does it look very quadratic at this point? The answer is, um, maybe, but not really. And how about something logarithmic? Well, let's give it a try. So what is it that we would be trying to do? This is where we could do either log-linear or linear-log. So let's try it, where we're going
to attempt a model that we'll call lLogCa, and we'll say we're going to do a linear model. We're going to try to predict g$a using the logarithm, and the way we write it is log of g$c. So we're trying to predict the a values using the logarithm of the c values. All right, now let's run a summary of lLogCa, and we observe that our multiple R-squared value is 0.418, which means approximately what percentage of the variation in the data is explained? Like 42-ish percent. So is that a good model? That's not so good.

How do you determine whether something is good or bad based upon that? In other words, what decimal values are good and what decimal values are bad? Well, I think I gave this example the other day. What if every time you came to a stop sign you could make either a left or a right, and you were just going to randomly choose whether you go right or left, with a 42 percent chance? Let's suppose you have a 42 percent chance that you get an R for right, and you're going to use this as a way of getting to some destination: you just keep spinning the dial and letting the answer randomly assign where you go. Would you want to take those kinds of chances to figure out how to get home? The answer is no, you wouldn't use that. When would you maybe start considering that option? Probably once it got up into 0.85-ish territory, then it might be something you'd be willing to take a chance on. At 0.7 it'd be a fun day. You know, if there's a 0.7 chance that you take the right turn every time you come to a stop, that's an exciting day. That's a really exciting day. I don't know if you've ever had that fun where, as you're driving, you're
just going to randomly decide where you go based upon a whim, but that's a great way of finding different locations. I quite enjoy doing that.

So let's now check the plots for this model. Is this a mostly horizontal line? The answer is nope, not so much. What about the distribution of the residuals, does it look normally distributed? The answer is no, there's actually a pattern to it, so this doesn't look like it's normally distributed. Do we think that we have homoscedasticity? The answer is no, this does not look like homoscedasticity; there seems to be a very specific pattern to the fitted values here. And how about leverage? Well, there's some rather significant leverage, and there are values that very clearly are following a pattern. So would we call this a good model? The answer is nope, we would not call this a good model.

Okay, so we're going to end here for today. Next time we're going to pick this up, because we want to become good at evaluating this, and I would like to review the concepts that we just covered and then use them to gain additional information about this procedure. Any questions before we go? Have a good one.
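
The three attempts from this session (a straight line, a quadratic, and a model in the logarithm of the predictor) can be compared side by side. The sketch below runs on an invented data frame `g`, not the class file, and `lX2aWrong` is a hypothetical name added for illustration. It also shows why the first squared attempt changed nothing: inside an R formula, `^` is a formula operator rather than arithmetic, so `a ~ X2^2` is read the same as `a ~ X2`, whereas `poly(X2, 2)` or `I(X2^2)` asks for an actual square.

```r
# Invented data for illustration only (not the class data set).
g <- data.frame(X2 = seq(-5, 5, length.out = 41))
g$a <- 2 * g$X2^2 - g$X2 + 1         # assumed exactly quadratic in X2
g$c <- seq(1, 40, length.out = 41)   # assumed positive, so log(c) is defined

# Version 1: straight line.
lX2aV1 <- lm(a ~ X2, data = g)

# In a formula, X2^2 is not arithmetic squaring, so this matches version 1.
lX2aWrong <- lm(a ~ X2^2, data = g)

# Version 2: degree-two polynomial, as in the lecture.
lX2aV2 <- lm(a ~ poly(X2, 2), data = g)

# Linear-log model: response against the logarithm of the predictor.
lLogCa <- lm(a ~ log(c), data = g)

# Compare the share of variation each model explains.
# summary() may warn about a perfect fit for the polynomial model.
r2 <- c(line   = summary(lX2aV1)$r.squared,
        wrong  = summary(lX2aWrong)$r.squared,
        poly2  = summary(lX2aV2)$r.squared,
        linlog = summary(lLogCa)$r.squared)
print(round(r2, 3))

# The four diagnostic plots for one model, shown on a single page.
par(mfrow = c(2, 2))
plot(lLogCa)
```

Setting `par(mfrow = c(2, 2))` before `plot` puts all four pictures on one page, which avoids pressing Enter between them.
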
