Hey team, Justin Zeltzer here from zstatistics.com, where today I'm responding to a challenge that was issued to me. Someone asked me if I could explain statistics to them in under half an hour. While initially I thought that was a bit of an ambitious ask, I decided it was actually a really good challenge, and one I might as well do for everybody. So this is it: an introduction to statistics, with no maths, done in under half an hour. Now, you can probably see that the timing of this video is a bit longer than that, but that's because I've bunged on a little extra section at the end, which is an optional extra; I think I get most of it done in under half an hour. The idea is for you to develop your intuition around statistics, so it's great for people who are just enrolling in a statistics course and are a bit apprehensive, or for others who aren't studying statistics but kind of want to know what it's all about. And to keep it light and interesting, I've themed all of the examples in this video on my latest obsession, which is the NBA. Despite proudly following Australian sports, my brother has got me hopelessly addicted to American basketball.

So here we go. The first thing we're going to delve into is the types of data we're going to encounter when dealing with statistics. Roughly, we can divide data into two distinct classes: categorical data and numerical data. They sound somewhat self-explanatory, because numerical means numbers and categorical means categories, and I'll give you some examples in a second. Categorical data can be further split into nominal categorical data and ordinal categorical data: nominal meaning there is no order to the categories of a particular variable, and ordinal meaning there is some kind of order to the categories. Numerical data can be further split into discrete numerical data and continuous numerical data, and we'll have a look at some examples right now.

If I was to ask you what team Steph Curry plays for, you can see that the answer to that question is clearly not going to be numerical, so it's a categorical piece of data. Here in these brackets I've put what's called the sample space for this particular question: Steph Curry could play for the Atlanta Hawks, the Boston Celtics, et cetera (it turns out he plays for the Golden State Warriors). All these potential values for what team Steph Curry plays for, taken together, are what we call the sample space, and you can see that there's no order to the teams. It doesn't really matter which order you put them in, so that's why we'd say this is a nominal piece of data.

Now, the question "What position does Steph play?" might provide us with ordinal categorical data. Steph could play Guard, Forward or Center (or you could split that up into Point Guard, Shooting Guard, et cetera), but there is some loose order to these positions. The guards generally play in the back court, the forwards play closer to the ring, and the center plays underneath it. There's also a general height progression from the smaller players who play guard, through the taller players at forward, to the tallest at center. So while this is still categorical data, there's some kind of order to it.

An example of discrete numerical data might be: how many free throws has Steph missed tonight? Clearly he could miss 0, 1, 2, 3 and so on. This is numerical data, but importantly he can't miss 1.5 or 2.3 free throws, right? There are only discrete possible values that this piece of data can take.

And finally, continuous data might come from the question: what is Steph's height? Google lists Steph's height as 191 centimeters, but of course Steph's actual height might be something like 191.3, or 191.2, or finer still; you can keep subdividing those centimeters into as many decimal places as you like, so height in this instance is an example of a continuous numerical piece of data. In practice we generally treat height as discrete, because we're only really interested in whole centimeters (or, for the Imperial measure, whole inches); we don't usually care if someone's six foot three and a half or six foot three and two thirds. But in a pure sense, you could say that height is continuous.
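If it helps to see those four data types side by side, here's a minimal Python sketch built around a made-up player record; the field names and values are purely illustrative, not anything from the video.

```python
# A hypothetical player record, one field per data type discussed above.
player = {
    "team": "Golden State Warriors",  # nominal categorical: no natural order among teams
    "position": "Guard",              # ordinal categorical: a loose order Guard < Forward < Center
    "free_throws_missed": 2,          # discrete numerical: only whole numbers are possible
    "height_cm": 191.3,               # continuous numerical: can be subdivided indefinitely
}

# One way to make the ordinal ordering explicit:
position_rank = {"Guard": 1, "Forward": 2, "Center": 3}
print(position_rank[player["position"]])  # 1
```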
Now, here's an interesting question: if I asked you what Steph's three-point percentage is this season, what kind of data do you think that is? Is it categorical? Is it numerical? And which of these subcategories would it belong to? Feel free to pause the video and have a think about it, but spoiler alert: I'm about to ruin it for you as we have a look at a special kind of data called proportions. (Just appreciate that percentages and proportions are pretty much the same thing: one is expressed as being out of a hundred, and the other is just a number between zero and one.)

Even though I've asked for Steph's three-point percentage, hopefully you can appreciate that a percentage is in fact just a proportion expressed out of a hundred. But note that each three-point attempt Steph takes actually provides us with nominal data: it's either a three-pointer that's made or a three-pointer that's missed. What a proportion does is aggregate this information to provide a numerical summary figure. So in some sense a proportion is numerical, because it obviously provides us with a number, but it's built up from nominal data. So far in the 2018-19 season, Steph Curry has attempted 128 three-point shots and made 61 of them, and this is his percentage: 0.4766. Each of those 128 shots is a nominal piece of information; this proportion is a summary of them. Now, here's an interesting question for you: is a proportion discrete or continuous numerical data? That's not necessarily such an obvious question, and I might leave it to you to answer in the comments of this video, so feel free to start a little discussion on it; it's an interesting one, I think. Anyway, that's your data types.
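To make that concrete, here's a quick sketch of how a proportion aggregates a pile of nominal outcomes into a single number, using the 61-of-128 figure from above.

```python
# Each three-point attempt is a nominal outcome: "made" or "missed".
attempts = ["made"] * 61 + ["missed"] * (128 - 61)   # Steph's 128 attempts so far

# The proportion aggregates those 128 nominal data points into one number.
proportion = attempts.count("made") / len(attempts)
print(round(proportion, 4))  # 0.4766
```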
Now, distributions. If I was to ask how the heights of NBA players are distributed: the smallest player currently playing in the 2018-19 season is Isaiah Thomas at five foot nine, and the tallest is Boban Marjanović at seven foot three (there's a picture of both of those players), and all the other players fit somewhere between them. What we can present is something called a probability density function, which essentially describes the distribution of all the players between the smallest and the tallest. You can think about it in two ways: it's either the distribution of the whole population of basketball players in the NBA, or alternatively it's the probability of selecting someone at random from that population at every given height. Quite clearly, as the bulk of the players are going to be somewhere in the middle, say six foot five or six foot six, if I was to select someone at random I'd have the highest probability of selecting someone around that height, as opposed to someone at five foot nine or seven foot three; there are just fewer players at those heights.

Now, the curve I'm presenting here is a very common curve in statistics. Some people call it a bell curve, others call it a normal distribution, but it's a very commonly occurring distribution, and it basically just means that the bulk of the distribution sits towards the middle and it gets rarer as you go towards the extremes. I've created a whole video on the normal distribution, which I'll put a little flash hyperlink up for now if you're keen on learning a bit more, but there's a symmetry to this distribution, with the bulk of the players around that middle height.

I've just assumed that this would be the distribution of basketballers' heights, but what other possible distributions might there be? A distribution like this would indicate that if I was to select someone at random from the NBA, there'd be the same probability of them being six foot six as of being five foot nine, or indeed seven foot three; this is called a uniform distribution, which probably doesn't match up with the reality of the NBA. A distribution like this we can call a bimodal distribution: it's got two modes, where a mode is just a peak in the graph. Or something like this, which is a skewed distribution: let's just say there's a larger predominance of players up towards the seven-foot mark and it gets a lot more scarce down towards the smaller players. This particular type of skew is actually called left skew, because the tail points in the left direction; you can guess what right skew might look like.
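If you'd like to play with those shapes yourself, here's a small simulation sketch. The population parameters (a "typical" height around 198 cm with the league spanning roughly 175 cm to 221 cm, i.e. about 5'9" to 7'3") are my own rough stand-ins just to make the shapes visible, not figures from the video.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Three hypothetical height populations (in cm), one per shape discussed above.
normal    = rng.normal(loc=198, scale=9, size=n)       # bell curve: bulk near the middle
uniform   = rng.uniform(low=175, high=221, size=n)     # every height equally likely
left_skew = 221 - rng.exponential(scale=10, size=n)    # long tail pointing toward the shorter players

for name, heights in [("normal", normal), ("uniform", uniform), ("left skew", left_skew)]:
    near_middle = np.mean(np.abs(heights - np.median(heights)) < 5)  # share within 5 cm of the median
    print(f"{name:9s} share of players within 5 cm of the median: {near_middle:.2f}")
```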
Now, before we move on to sampling distributions, I just want to reiterate that the distribution we've been looking at describes the probability distribution of heights if I were to select one single player. But what if I had a whole sample of players, say ten players, and I wanted to know the probability distribution of their average height? For that we'll be looking at sampling distributions. So the question is: if I select ten players at random, what is the probability distribution of their average height?

Okay, well, here's the underlying distribution again. There's five foot nine and there's seven foot three, and if I select one person at random, the probability density function looks like this. But if I select ten people and take their average, what will that distribution look like? It turns out it'll have the same mean, but it'll be a lot skinnier. Why is that? Well, think of it this way: if I select someone at random, it's possible that I select Isaiah Thomas. He's five foot nine, and while there might only be a few players in the league that small, it's still possible I pick one of them. But if I'm selecting ten players, the probability of their average height being five foot nine is very, very small indeed: after selecting Isaiah Thomas I'll have to select other players, and it's likely they'll be somewhere else in the distribution, so the average gets shifted up. When you take a sample, the larger your sample size is, the less likely you are to get extreme sample means. That's why this distribution is going to be a lot skinnier than the distribution on the left here.

And this is important in statistics, because every study that ever gets conducted starts with a sample: you want to test some kind of effect, so you take a sample and then you make an inference using that sample. So it's important for us to get a handle on what happens when we take a sample: the distribution becomes a lot skinnier, or in other words the variance of our statistic is reduced.
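Here's a little simulation sketch of that narrowing effect. The normal population with mean 198 cm and standard deviation 9 cm is, again, my own stand-in for "NBA heights", just to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in population of "NBA heights" in cm (assumed roughly normal).
population = rng.normal(loc=198, scale=9, size=500)

# Draw many samples of 10 players and record each sample's average height.
sample_means = np.array([rng.choice(population, size=10, replace=False).mean()
                         for _ in range(10_000)])

print(f"spread of individual heights: {population.std():.2f} cm")
print(f"spread of 10-player averages: {sample_means.std():.2f} cm")
# The averages share roughly the same mean but have a much smaller spread
# (around 9 / sqrt(10), i.e. roughly 2.8 cm): the 'skinnier' sampling distribution.
```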
And that indeed takes us to sampling and estimation. My question is: how good is Steph Curry, currently, at three-pointers? In the current season, 2018-19, he's shot 128 threes and has nailed 61 of them, so that's 0.4766. What I'm going to try to get across here is that this is actually a sample statistic: he has a sample of size 128 (his 128 three-point attempts), and 61 of them have been successes, so here we have that proportion, 0.4766, which is our sample statistic. Now, when I ask how good Steph Curry is at three-pointers, appreciate that this 0.4766 is actually an estimate for something we're going to call theta. Theta is a Greek letter, and it represents exactly how good Steph Curry is; it's something we can never know. Maybe he's a 50% shooter and this season he's just a little bit off, or maybe he only shoots at 42% and this season he's doing much better. Either way, what statistics does is set up this unknowable, almost god-like value of theta, which we can then try to estimate by taking a sample.

So given our sample, where Steph Curry currently sits at 0.4766, maybe the best estimate for theta is 0.4766: if you were trying to guess Steph Curry's long-run three-point percentage, 0.4766 is probably your best bet. But appreciate that there's some kind of variance around this estimate, some kind of uncertainty. If Steph Curry shoots some more three-pointers, this proportion could go up or down, and all of a sudden that would be our new best estimate for theta. The whole idea behind statistics is trying to get a hold of the uncertainty you have around your estimates. I'm not going to enter into the calculation of these particular intervals in this video (I've made plenty of videos that delve into that precise question), but what statisticians like to do is create these things called 95% confidence intervals, where we can say: look, we don't know what theta is, but given our sample we have 95% confidence that theta lies between these two particular values; and generally our sample estimate is bang in the middle of those two limits. So that's what you're going to be doing when you study statistics: developing ways of quantifying the uncertainty you have over Steph Curry's long-term three-point percentage, or over other things that are maybe more meaningful.

Now here's an interesting point. We all know that Steph Curry is probably the best three-point shooter in the league, if not in basketball history, but at this point in the 2018-19 season, only about 12 or 13 games in, there's a player called Meyers Leonard who has made 9 of his 15 three-point attempts, a three-point percentage of 0.600. Who do you think is the better three-point shooter? If you were just looking at the sample statistics, you'd say Meyers Leonard, right? He's shooting 0.600 from three, whereas Steph Curry is only shooting 0.4766. So what is it about Meyers Leonard that might tweak your intuition that something's not quite right here? Well, let's investigate in a statistical way. This 0.600 we've got for Meyers Leonard is an estimate for his theta, and I've got this in green; it's a different theta from the one we saw before, which was Steph Curry's. This is Meyers Leonard's long-term three-point percentage, and the best estimate for it is again our sample estimate, 0.600. But in this case, the confidence interval we create is a lot larger for Meyers Leonard than it is for Steph Curry. Why? Because he's only had 15 three-point attempts in the season so far, so we're going to be less sure about where this value of theta sits for Meyers Leonard. We can still construct his 95% confidence interval, but it's going to be a lot wider than Steph Curry's, because we had more information for Steph Curry. If you put the two side by side, the red being Steph Curry and the green being Meyers Leonard, it's true that if we knew nothing else about these two players, the best estimate for Meyers Leonard would still be higher than for Steph Curry. But you can see we'd have a much larger confidence interval for Meyers Leonard; in other words, we'd be less confident about where his long-term three-point percentage is going to be, and it could be down here, below Steph Curry's. Knowing what we do about the two players, it's probably likely to be less than Steph Curry's.
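If you want numbers to hang that picture on, here's one simple way to compute approximate 95% confidence intervals for the two proportions: the standard "Wald" formula p ± 1.96·√(p(1−p)/n). It's only one of several interval constructions (and not necessarily the one used in the video), but it shows how the smaller sample produces the wider interval.

```python
from math import sqrt

def wald_ci(successes, attempts, z=1.96):
    """Approximate 95% confidence interval for a proportion (Wald interval)."""
    p = successes / attempts
    half_width = z * sqrt(p * (1 - p) / attempts)
    return p, (p - half_width, p + half_width)

for name, made, shots in [("Steph Curry", 61, 128), ("Meyers Leonard", 9, 15)]:
    p, (lo, hi) = wald_ci(made, shots)
    print(f"{name}: estimate {p:.3f}, 95% CI roughly ({lo:.3f}, {hi:.3f})")

# Curry's interval is narrow (n = 128); Leonard's is much wider (n = 15),
# wide enough that his true long-run percentage could easily sit below Curry's.
```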
So again, this is preparing you for what statistics can do, which is deal with and quantify uncertainty.

Now, we met theta just a second ago; it's this long-term three-point percentage, and when I described it as a Greek letter I was really referring to it as what we call a parameter. Let's have a look at some common parameters we're going to see in the study of statistics; you might have heard of some of these. Mu is often used for the mean of a numerical variable, so for example the mean height of players might be given the symbol mu. Sigma is the standard deviation of a numerical variable; I haven't dealt with standard deviation in this video, but all standard deviation is is a measure of the variation of a particular distribution, or of the uncertainty around a particular estimate. Another parameter, pi (these are all Greek letters, by the way), is used for the proportion of a categorical variable, so I could have used pi in the example I just gave; I ended up using theta, and as I say down the bottom here, theta is used for all parameters in some texts, and I like it because it's a bit more general, but pi is sometimes used for proportions. Rho is used for the correlation between two variables, and beta is used for the gradient between two variables, which comes up in regression, a very important topic in statistics and one for which I've put together a whole series of videos, so you can investigate those if you like.

Again, all of these represent parameters: those unknowable, fixed values that we try to estimate. They themselves don't have any uncertainty about them; technically they're these godly figures that we, as mere statisticians, merely try to estimate. And the way we estimate them is by taking a sample, and those sample statistics are given other symbols. For a numerical variable, say height, the average height of a sample gets the symbol x-bar; a sample standard deviation is given the symbol s; p is generally used for a sample proportion, r for a correlation, and b for a gradient. So be prepared to see all of these lowercase Roman letters representing the sample values that estimate the parameters written in Greek. But I will say, be prepared also for your statistics textbook to break all of those rules: despite them being conventions, sometimes you'll find they don't stick to them, annoyingly.
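As a quick sketch of that parameter-versus-statistic idea: the Greek letters are fixed, unknown population values, and the Roman letters computed below are the sample quantities that estimate them. The data here are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up sample of 10 player heights (cm), weights (kg), and made/missed threes.
heights    = rng.normal(198, 9, size=10)
weights    = rng.normal(100, 12, size=10)
made_three = rng.integers(0, 2, size=10)          # 1 = made, 0 = missed (nominal outcomes)

x_bar = heights.mean()                            # x-bar estimates mu (population mean)
s     = heights.std(ddof=1)                       # s estimates sigma (population standard deviation)
p_hat = made_three.mean()                         # p estimates pi (population proportion)
r     = np.corrcoef(heights, weights)[0, 1]       # r estimates rho (population correlation)
b     = np.polyfit(heights, weights, deg=1)[0]    # b estimates beta (regression gradient)

print(x_bar, s, p_hat, r, b)
```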
All right, so with that under our belt, let's go and have a look at a very common topic in statistics called hypothesis testing. I'm going to start you off with an example rather than some kind of hypothetical definition. Using the data we've just seen: is there enough evidence to suggest that Meyers Leonard is shooting above 50%? Let's review his stats again: he's made nine three-pointers out of fifteen, and that's 0.600. So yes, sure, his sample proportion is greater than 50%, greater than 0.5; but does that suggest that his long-term three-point performance is going to be above 0.5? That is a question worthy of a hypothesis test. As we saw in the previous section, there's going to be some variation, some variance, around this estimate of 0.600; it's not as if that's definitely going to be his long-term three-point proportion. So what statisticians like to do is set this thing called a null hypothesis, which is given the expression H-naught, and here we're going to set the null hypothesis to be that Meyers Leonard's long-term three-point percentage is less than or equal to 50%, less than or equal to 0.5.

Now, why might we do that? Well, as statisticians we're always very conservative: we assume that the reverse is true, and then see whether there's enough evidence to really budge from that assumption. It's a bit like when someone's on trial: the null hypothesis might be that they're innocent, and you need a lot of evidence to budge from that null hypothesis. It's not good enough that there's just a little bit of evidence; you need evidence beyond reasonable doubt, right? It's the same with hypothesis tests. This here on the right-hand side is called the alternate hypothesis, and in general, whenever we're doing a hypothesis test in statistics, whatever we're seeking evidence for goes in the alternate hypothesis. Because we're very conservative as statisticians, we always set a null hypothesis that the reverse is true, and we see whether our sample is extreme enough, far enough away from that null hypothesis, to suggest that the alternate hypothesis might be true. This is the way you're going to be framing your thinking when you're dealing with statistics. One thing to note: different textbooks will have different ways of writing the null hypothesis. They mean the same thing, but some will say theta is less than or equal to 0.5 and others will say theta is equal to 0.5. It doesn't much matter, because the important thing is that theta being greater than 0.5 sits in our alternate hypothesis.

So let's see how this pans out using what we now understand about probability distributions. Essentially, we start with this null hypothesis that theta is equal to 0.5, and ask: if it is indeed equal to 0.5, how many three-pointers out of 15 would Meyers Leonard sink? Here's the probability distribution if he truly is an exactly 50% three-point shooter: if he shoots 15 three-pointers, on average he's going to make 7.5 of them. Of course you can't sink exactly 7.5, so 7 and 8 will be approximately the same height, with the same probability of occurring; he's less likely to make 6 or 9, less likely again to make 5 or 10, et cetera. So this is the probability distribution of the number of successes in Meyers Leonard's 15 three-point attempts, assuming the null hypothesis is true. (For the advanced players, this is actually a binomial distribution; if you're keen on learning more, I'll put a link up here.)

Now, what did he get in this sample? He actually got 9. So what this tells us is that even if he does have a 50% three-point percentage, it's still quite likely for him to make nine three-pointers out of 15; it's not beyond the realm of possibility that he could be a 50% three-point shooter who just happened to do a little bit better than expected in his first 15 shots. If I asked you how much doubt this sample casts on our null hypothesis, you'd say: not very much. But what if I told you he'd scored 12 out of 15 three-pointers instead? At that point you're starting to think, you know what, that's quite unlikely. It's still possible that he's truly a 50% three-point shooter who managed to do better in his first 15 than expected, but it's starting to cast some doubt on our null hypothesis. And this is what a hypothesis test does: it takes the sample and asks how extreme that sample is. Is it too extreme, given our null hypothesis, for us to realistically hold on to that null hypothesis?
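Here's that distribution computed directly: a small sketch of the binomial probabilities under the null hypothesis of a true 50% shooter taking 15 attempts.

```python
from math import comb

n, p = 15, 0.5  # 15 attempts, null hypothesis: a true 50% shooter

# Probability of making exactly k of the 15 threes under the null hypothesis.
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

for k in (7, 8, 9, 12):
    print(f"P(exactly {k:2d} makes) = {pmf[k]:.3f}")
# 7 and 8 are the most likely outcomes (about 0.196 each);
# 9 is still quite plausible (about 0.153), while 12 is much rarer (about 0.014).
```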
So in reality, what's going to happen is we're going to construct what's called a rejection region: we'll find a point on this x-axis beyond which we consider the sample too extreme to realistically hold on to the null hypothesis being true. This yellow area can effectively be customized to determine how strict you want to be about rejecting the null hypothesis, but it's often chosen as 5% of the entire distribution, and we call this the level of significance. So we might say that the level of significance here is 5%: if our sample statistic lands in this upper 5%, we'll consider it too extreme for the null hypothesis and therefore reject the null hypothesis. Just to repeat: in this case, because Meyers Leonard got 9 out of 15 (0.600), he was at this point here, at 9, and therefore not extreme enough to reject the null hypothesis. So even though in the sample he was shooting above 50%, it wasn't extreme enough to allow us to infer that he shoots above 50% in the long term. We need more evidence, as conservative statisticians.
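As a sketch of how that 5% cut-off falls out of the same binomial distribution: under the null hypothesis, the smallest number of makes whose upper tail has probability at or below 5% turns out to be 12, so 9 sits comfortably outside the rejection region.

```python
from math import comb

n, p, alpha = 15, 0.5, 0.05

def upper_tail(k):
    """P(X >= k) when X ~ Binomial(15, 0.5), i.e. under the null hypothesis."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Rejection region: the smallest k whose upper tail is no bigger than 5%.
cutoff = next(k for k in range(n + 1) if upper_tail(k) <= alpha)
print(f"reject the null only if he makes {cutoff} or more of 15")       # 12
print(f"P(X >= 12) = {upper_tail(12):.3f}, P(X >= 9) = {upper_tail(9):.3f}")
# 9 makes is well short of the cut-off, so we do not reject the null hypothesis.
```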
I've just got a little extra section here for hypothesis testing, with two important notes for you to be aware of. The first is that we never, ever prove anything in a hypothesis test. Here again is that setup with Meyers Leonard and our conclusion, which was to not reject the null hypothesis, as in: there's not enough evidence to suggest Meyers Leonard is shooting above 50%. Never say the word "prove" in your conclusion. That's the frustrating thing about statistics, I guess: you can never prove anything; all you can do is infer. So we were unable to infer that Meyers Leonard is shooting above 50%. The other thing we never say is the word "accept". Notice I've written "do not reject the null hypothesis": this was our null, but in the event that we don't reject it, you should never say that we "accept" the null hypothesis. Don't forget he made 60% of his first 15 three-pointers, so it's not as if we have evidence that he's less than a 50% three-point shooter; it's just that we don't have evidence that he's more than a 50% three-point shooter. That's a really important distinction. In fact, a whole judicial system relies on it: when you find someone not guilty, that doesn't necessarily mean they're innocent; it just means there hasn't been enough evidence to convince you of their guilt, and because of the presumption of innocence they walk free.

Okay, so let's have a look at p-values, the much-maligned p-values of statistics. To introduce them: for whatever null hypothesis we're testing, a hypothesis test assesses whether our sample is extreme enough to reject that null hypothesis; that's exactly what we did in the last section. What the p-value does is measure how extreme the sample is. The hypothesis test sort of sets up the goalposts, and we assess whether we've scored the goal or not; the p-value measures exactly how far we kicked the ball, to continue with a fairly loose analogy. So here's the example again, using the same setup as before with Meyers Leonard and the 50% three-point percentage. Our test statistic was 9 (he made 9 of his 15 three-pointers), and this is the distribution under the null hypothesis. How extreme was that test statistic? Well, we found out it wasn't extreme enough: the hypothesis test said to reject the null hypothesis if the test statistic is in the top 5% of the distribution, and we found that he was not in the top 5%. What the p-value does is take our test statistic and actually calculate that region: it tells us that our test statistic is in the top 30.4% of the distribution (0.304). It's measuring how much of the distribution is at or above our test statistic; in other words, how extreme our sample is. The smaller the p-value, the more extreme our sample must have been, and therefore the more likely we are to reject the null hypothesis. If the p-value is large, closer to one (ours was 0.304, so we had quite a large pink area here), we're less likely to reject the null hypothesis, and that's exactly what happened in our case: we did not reject our null hypothesis.

Now, the final point I'll make, and it's something you've probably figured out already: if the p-value drops below 0.05, it implies that our test statistic must be in the rejection region. Let me repeat that: if the p-value is less than 0.05, our test statistic, wherever it is, must be in the rejection region. That rejection region, the yellow bit from before, was constructed to cover 5% of the whole distribution; so if our p-value, the pink bit, is less than 5%, less than 0.05, we know we must be somewhere inside that rejection region. What that implies is: if the p-value is less than the level of significance for your hypothesis test, you're going to reject the null hypothesis. It's a really quick way of assessing whether we're going to reject the null hypothesis. So, to recap: whenever you conduct a hypothesis test, whatever you're seeking evidence for goes in your alternate hypothesis, and if you then conduct the test and your p-value is very, very small, that provides evidence for the alternate hypothesis; it provides enough evidence for us to reject the null hypothesis.
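That 0.304 is just the upper tail of the same binomial distribution at the observed test statistic; a minimal sketch:

```python
from math import comb

n, p = 15, 0.5          # null hypothesis: a true 50% shooter over 15 attempts
observed = 9            # Meyers Leonard's test statistic

# p-value: probability of a result at least as extreme as the one observed,
# assuming the null hypothesis is true.
p_value = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(observed, n + 1))
print(f"p-value = {p_value:.3f}")   # about 0.304

alpha = 0.05
print("reject the null hypothesis" if p_value < alpha else "do not reject the null hypothesis")
```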
All right, so that's pretty much it for the theoretical component of this video. Stop the clock: did I get under 30 minutes? I don't think so; I think it was a few minutes over. But I'm not going to quite stop the video here, because I figured I'd give you an extra little section to do with p-values. They've been in the news over the last few years, I'd say, and not necessarily in a good way: people have been throwing a lot of shade at scientific research over the last little while, and it's somewhat justified, due to this thing called p-hacking. So if you've had enough of the theoretical component of the stats today, well, that's it, we're done; but let's have a look at p-hacking to see how a misuse of p-values can invalidate scientific research.

So, let's talk about what p-hacking is all about, and I might start with that same boring old probability density function we saw before. As we've seen in hypothesis testing, we start with the null hypothesis that there's no effect, and then we take a sample, and we want to see a sample that's extreme enough for us to reject that null hypothesis. So we construct this rejection region, which is the yellow shaded region up here (my choice of colors might not have been the best, but hopefully you can see it's shaded yellow). If our sample lies up in this region, we're able to reject the null hypothesis, and in doing so we'd say there's a significant effect. And all of a sudden that's great: we'll be able to publish our paper showing that X affects Y, and we'll get all the plaudits from the research community.

But here's the thing. Remember how I said that statistics doesn't prove anything? This is exactly the case here. If we have a sample that's in our rejection region, in other words a sample extreme enough for us to reject the null hypothesis, that doesn't mean the null hypothesis is false. It's still possible that we just happened to get a freak sample. The whole purpose of a p-value is to ask: how likely is it for us to get this sample statistic if the null hypothesis is true? If the p-value is low enough, we say, that's starting to become too low; but as long as the p-value is non-zero, there's an outside chance that you just happened to get a freak sample when there was in fact no effect. To put it in basketball terms: say Meyers Leonard really were a 50% three-point shooter. It's still possible for him to make 14 or 15 of 15 three-pointers, right? Very unlikely if his true three-point percentage is 50%, but possible.

So how does this relate to good and bad research? In good research, you theorize some kind of effect first; maybe that red wine causes cancer, let's just say that as an example. You then collect your data, and you test only that effect, red wine causing cancer, and if the p-value of that test is less than 0.05, you can conclude there's some strong evidence for the effect of red wine on cancer. That's all well and good, and that's good research: theorize some effect, then collect your data, then test that exact effect.

Now, bad research gets conducted like this, and unfortunately I'm going to suggest it gets done all the time. You collect your data first, with just the general idea of "let's see what causes cancer". So you collect a whole bunch of data from people who have cancer, including a lot of lifestyle data: whether they smoke, whether they drink wine, all that kind of stuff. You test all these different effects: red wine, smoking, exercise, exposure to main roads, and so on. Then you look through all those effects and find the ones where p is less than 0.05; let's just say it happened to be the test of red wine on cancer. And then you publish your results and say, yep, red wine causes cancer, because the p-value is less than 0.05. This is called p-hacking, it is potentially rife in research, and it's quite problematic.

Now, it's not necessarily obvious why this is so much worse than our good research over here on the left. But as I said before, when we conclude strong evidence for some effect, we're essentially saying there's a very, very low probability that this came about by chance. What happens when you test 10 different things? It becomes more likely that one of them, purely by chance, will give quite an extreme sample. Push it even further: if you test 20 different things, we actually expect one of those 20 to have a p-value less than 0.05. That's what the p-value means: if there's a 5% chance that an apparent effect is just due to the randomness of the sampling process, then when we test 20 things, 5% of 20 is 1, so one of those 20 things is likely to show that apparent strength of effect. And that's where p-hacking comes into it: we test all these different things, we find the one that happens to look significant, and we then sort of pretend that that was the thing we were looking for the whole time. It's actually a big, big problem.
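Here's a toy simulation of that multiple-testing problem, phrased in basketball terms rather than the cancer example. Every "shooter" below is a genuine 50% shooter (the null hypothesis is true for all of them), yet across 20 such tests you'll typically see about one p-value dip below 0.05 just by chance; the scenario and numbers are illustrative only.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(42)

def upper_tail_p(makes, n=100, p=0.5):
    """One-sided p-value: P(X >= makes) for a true 50% shooter over n attempts."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(makes, n + 1))

n_tests, significant = 20, 0
for _ in range(n_tests):
    makes = rng.binomial(100, 0.5)        # a shooter who truly shoots 50%
    if upper_tail_p(makes) < 0.05:
        significant += 1

print(f"{significant} of {n_tests} true-50% shooters looked 'significantly above 50%'")
# On average about 1 in 20 will, purely by chance: reporting only that one is p-hacking.
```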
Anyway, that's hopefully brought into practice some of the stuff you can learn in statistics. Of course I've dealt with things in a very superficial way, but that was the whole point of this video. If you liked this, I've got more in-depth discussions, where I go into the actual formulas and the mathematics of it all; you can check it all out at zstatistics.com. And hey, if you dig it, you can like and subscribe and do all those things you're meant to do. But yeah, hope you enjoyed.