Transcript for:
Introduction to Statistics by Justin Zeltzer

Hey team, Justin Zeltzer here from zstatistics.com, where today I'm responding to a challenge that was issued to me. Someone asked me if I could explain statistics to them in under half an hour. While initially I thought that was a bit of an ambitious ask, I thought no, that's actually a really good challenge, and one that I might do for everybody. So this is it! An introduction to statistics, with no maths, and done in under half an hour. Now, you can probably see that the timing of this video is a bit longer than that, but that's because I bunged on a little extra section at the end, which is a bit of an optional extra; I think I get most of it done in under half an hour. The idea is for you to develop your intuition around statistics, so it's great for those people who are just enrolling in a statistics course and are a bit apprehensive, or for others who aren't studying statistics but kind of want to know what it's all about. And to keep it light and interesting, I've themed all of the examples in this video on my latest obsession, which is the NBA. Despite proudly following Australian sports, my brother's getting me hopelessly addicted to American basketball.

So here we go. The first thing we're going to delve into is what types of data we're going to encounter when we're dealing with statistics. Roughly, we can divide data into two distinct classes: categorical data and numerical data. They sound somewhat self-explanatory, because numerical means numbers and categorical means categories, so I'll give you some examples of those in a second. But categorical data can be further split into nominal categorical data and ordinal categorical data. Nominal means there is no order to the various categories of a particular variable, and ordinal means that there is some kind of order to the categories; we'll see some examples in a second. Numerical data can be further split into discrete numerical data and continuous numerical data, and again we'll have a look at some examples right now.

So if I was to ask you what team Steph Curry plays for, you can see that clearly the answer to that question is not going to be numerical, so it's a categorical piece of data. And here in these brackets I've put what's called the sample space for this particular question. Steph Curry could either play for the Atlanta Hawks, the Boston Celtics, etc. It turns out he plays for the Golden State Warriors, but all these potential values for what team Steph Curry plays for, when we combine them, are what we call the sample space. And you can see that there's no order to the teams; it doesn't really matter which order you put them in, so that's why we would say this is a nominal piece of data. Now, the question "what position does Steph play?" might provide us with ordinal categorical data. Steph could either play Guard, Forward or Center, or you could split that up into Point Guard, Shooting Guard, etc., but there is some loose order to these positions. The Guards generally play in the back court, the Forwards play closer to the ring, and the Center plays underneath it. There's also a general kind of height difference, from smaller players who play guard, to taller players playing forward, with the tallest playing center. So while this is still categorical data, there's some kind of order to it.
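If it helps to see the idea of a sample space and the nominal versus ordinal distinction written down concretely, here's a tiny Python sketch of those two examples (the team list is truncated and purely illustrative):

```python
# Nominal categorical data: the sample space has no inherent order.
team_sample_space = ["Atlanta Hawks", "Boston Celtics", "Golden State Warriors"]  # ... etc.
curry_team = "Golden State Warriors"   # the observed value for Steph Curry

# Ordinal categorical data: the categories have a loose order
# (back court -> closer to the ring -> under the ring).
position_order = ["Guard", "Forward", "Center"]
curry_position = "Guard"
print(position_order.index(curry_position))  # 0: position order is meaningful, team order is not
```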
Now, an example of discrete numerical data might be: how many free throws has Steph missed tonight? Clearly he could miss 0, 1, 2, 3, and so on. This is numerical data, but importantly he can't miss 1.5 or 2.3 free throws, right? So there are only discrete possible values that this piece of data can take. And finally, continuous data might come from the question: what's Steph's height? Google lists Steph's height as 191 centimeters, but of course Steph's actual height might be something like 191.3, or 191.34, and so on; you can keep subdividing these centimeters into as many decimal places as you like, so height in this instance is an example of a continuous numerical piece of data. Generally we kind of make height a discrete piece of data, because we're only really interested in whole centimeters, or in the case of the imperial measure, whole inches. We don't usually care if someone's, you know, six foot three and a half or six foot three and two thirds. But in a pure sense, you could say that height is continuous.

Now, here's an interesting question: if I asked you what Steph's three-point percentage is this season, what kind of data do you think that is? Is it categorical or is it numerical, and which of these subcategories does it fit into? Feel free to pause the video and have a think about it, but spoiler alert, I'm about to ruin it for you when we have a look at a special kind of data called proportions.
Now, even though I've asked for Steph's three-point percentage, hopefully you can appreciate that a percentage is in fact just a proportion that's been expressed out of a hundred. But note that each three-point attempt that Steph takes actually provides us with nominal data: it's either a three-pointer that's made or a three-pointer that's missed. What a proportion does is aggregate this information to provide a numerical summary figure. So in some senses a proportion is numerical, because obviously it provides us with a number, but it's built up off nominal data. Now, Steph Curry so far this season, in 2018-19, has taken 128 three-point shots, and this is his percentage: 0.4766. Each of those 128 shots is actually a nominal piece of information; this proportion is a summary of that. And here's an interesting question for you: is a proportion discrete or continuous numerical data? That's not necessarily such an obvious question, and I might leave it to you to answer in the comments of this video, so feel free to start a little discussion on that. It's an interesting one, I think. Anyway, that's your data types.

Now, distributions. If I was to ask how the heights of NBA players are distributed: the smallest player currently playing in the 2018-19 season is Isaiah Thomas at five foot nine, and the tallest player is Boban Marjanović at seven foot three, and all the other players will fit somewhere between the smallest and the tallest. (Here we go, there's a picture of both of those players.) What we can present is something called a probability density function, which essentially describes the distribution of all the players in between the smallest and tallest player here. You can think about it in two ways: it's either the distribution of the whole population of basketball players that we have in the NBA, or alternatively it's the probability of selecting someone at random from that population at every given height. Quite clearly, as the bulk of the players are going to be somewhere in the middle, say six foot six or six foot five, if I was to select someone at random I would have the highest probability of selecting someone around that height, as opposed to selecting someone at five foot nine or seven foot three; there are just fewer players at those heights. Now, this curve I'm presenting here is a very common curve in statistics. Some people call it a bell curve, other people call it a normal distribution, but it's a very commonly occurring distribution in statistics, and it basically just means that the bulk of the distribution happens towards the middle and it gets rarer as you go towards the extremes. I've created a whole video on the normal distribution, which I'll flash up as a hyperlink now if you're keen on learning a bit more about it, but there's a symmetry about this distribution, with the bulk of the players being around that middle height. Now, I've just assumed that this would be the distribution of basketballers' heights, but what other possible distributions might there be? A distribution like this would indicate that if I was to select someone at random from the NBA, there'd be the same probability of them being six foot six as of being five foot nine, or indeed seven foot three. This is called a uniform distribution, which probably doesn't match up with the reality of the NBA.
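If you want to play with that bell-curve idea yourself, here's a minimal sketch using a toy normal model of heights. The mean and standard deviation here are made-up round numbers for illustration, not real league figures:

```python
from scipy.stats import norm

# Toy normal model of NBA heights (assumed numbers, in centimeters).
mean_cm, sd_cm = 200, 9
heights = norm(loc=mean_cm, scale=sd_cm)

# Density is highest near the middle of the distribution and tiny at the extremes.
for label, cm in [("5'9\"", 175), ("6'7\"", 200), ("7'3\"", 221)]:
    print(f"{label} ({cm} cm): density = {heights.pdf(cm):.4f}")

# Probability that a randomly selected player is taller than 7 feet (~213 cm).
print("P(height > 213 cm) =", round(heights.sf(213), 4))
```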
A distribution like this, we can call a bimodal distribution: it's got two modes, where a mode is just a peak of the graph. Or something like this, which is a skewed distribution; let's just say that there's a larger predominance of players up towards the seven-foot mark, and it gets a lot more scarce down towards the smaller players. This particular type of skew is actually called left skew, because the tail points in the left direction, and you can guess what right skew might look like.

Now, before we move on to have a look at sampling distributions, I just want to reiterate that the distribution we've been looking at describes the probability distribution of heights if I was to select one single player. But what if I had a whole sample of players, say ten players, and I wanted to know the probability distribution of their average height? For that we'll be looking at sampling distributions. So the question is: if I select ten players at random, what is the probability distribution of their average height? Okay, well, here's the underlying distribution again. There's five foot nine and there's seven foot three, and if I was to select someone at random, the probability density function would be a bit like that. But if I select ten people and take a look at their average, what will that distribution look like? It turns out that it'll have the same mean, but it'll be a lot skinnier. Now, why is that? Well, think of it this way: if I select someone at random, it's possible that I select Isaiah Thomas. He's five foot nine, and while there might only be a few players in the league that are that small, it's still possible that I select that player at random. But if I'm selecting ten players, the probability of them having an average height of five foot nine is very, very small indeed. After selecting Isaiah Thomas, I'll have to select other players, and it's likely that they're going to be somewhere else in the distribution, so their average gets shifted up. So when you take a sample, the larger your sample size is, the more unlikely you are to get extreme sample means. That's why this distribution is going to be a lot skinnier than the distribution on the left here. And this is important in statistics, because every study that ever gets conducted starts with a sample: you want to test some kind of effect, so you take a sample and then you make an inference using that sample. So it's important for us to get a handle on what happens when we take a sample: the distribution becomes a lot skinnier, or in other words, the variance of our statistic is reduced.
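To make that "skinnier distribution" point concrete, here's a small simulation sketch, again using the made-up normal model of heights from before (so the specific numbers are assumptions, not real NBA data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy normal model of heights as before (assumed mean and sd, in cm).
mean_cm, sd_cm = 200, 9

# 100,000 single players drawn at random vs. 100,000 averages of 10 players each.
singles = rng.normal(mean_cm, sd_cm, size=100_000)
sample_means = rng.normal(mean_cm, sd_cm, size=(100_000, 10)).mean(axis=1)

print("single players: mean %.1f, sd %.2f" % (singles.mean(), singles.std()))
print("averages of 10: mean %.1f, sd %.2f" % (sample_means.mean(), sample_means.std()))
# The means match, but the spread of the averages is roughly sd / sqrt(10),
# i.e. about 2.8 cm instead of 9 cm: the "skinnier" sampling distribution.
```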
And that indeed takes us to sampling and estimation. My question is: how good is Steph Curry currently at three-pointers? In the current season, 2018-19, he's shot 128 threes and has nailed 61 of them, so that's 0.4766. What I'm going to try to get across here is that this is actually a sample statistic: he has a sample of size 128, he has 128 three-point attempts and 61 of them have been successes, so here we have that proportion, 0.4766, which is our sample statistic. Now, when I ask you how good Steph Curry is at three-pointers, appreciate that this 0.4766 is actually an estimate for this thing we're going to call theta. Theta is a Greek letter, and it represents exactly how good Steph Curry is. It's something we can never know: maybe he's a 50% shooter and this season he's just a little bit off, or maybe he only shoots at 42% and this season he's doing much better. Either way, what statistics does is it creates this unknowable, almost god-like value of theta, which we can then try to estimate by taking a sample. So given our sample, where Steph Curry currently sits at 0.4766, maybe the best estimate for theta is 0.4766: as in, if you were trying to guess what Steph Curry's long-run three-point percentage would be, 0.4766 is probably your best bet. But appreciate that there's some kind of variance around this estimate, some kind of uncertainty. If Steph Curry shoots some more three-pointers, this proportion could either go up or down, right? And all of a sudden that would be our new best estimate for theta. The whole idea behind statistics is trying to get a hold of the uncertainty you have behind your estimates. Now, I'm not going to enter into calculations of these particular intervals in this video, though I've made plenty of videos that delve into this precise question, but what statisticians like to do is create these things called 95% confidence intervals, where we can say: look, we don't know what theta is, but given our sample, we have 95% confidence that theta is between these two particular values. And generally, our sample estimate is bang in the middle of those two limits. So that's what you're going to be doing when you study statistics: developing means of calculating, of quantifying, the uncertainty you have over Steph Curry's long-term three-point percentage, or other things that are maybe more meaningful.

Now, here's an interesting point. We all know that Steph Curry is probably the best three-point shooter in the league, if not in basketball history. But at this point in the 2018-19 season, we're only about 12 or 13 games in, and there's a player called Meyers Leonard who has scored 9 out of 15 three-pointers and has a three-point percentage of 0.6. Now, who do you think out of these two players is the better three-point shooter? If you were just looking at the sample statistics here, you'd say, well, Meyers Leonard is, right? Because he's got a 60%, or 0.6, proportion for three-pointers, whereas Steph Curry is only shooting 0.4766. So what is it about Meyers Leonard that might tweak your intuition that something's not quite right here? Well, let's investigate in a statistical way. This 0.6 that we've got for Meyers Leonard is an estimate for his theta, and I've got this in green. This is a different theta from the one we saw before, which was for Steph Curry; this is Meyers Leonard's long-term three-point percentage. And the best estimate for that, again, is our sample estimate, which is 0.600. But in this case we might find that the confidence interval we create is a lot larger for Meyers Leonard than it is for Steph Curry. Why is it a lot larger for Meyers Leonard? Well, because he's only had 15 three-point attempts in the season so far, so we're going to be less sure about where this value of theta is going to be for Meyers Leonard. But again we can construct his 95% confidence interval, which is going to be a lot wider than Steph Curry's, because we actually had more information for Steph Curry. So if you put both of these side by side, the red being Steph Curry and the green being Meyers Leonard, it's true that if we didn't know anything about these two players, the best estimate for Meyers Leonard would still be higher than the best estimate for Steph Curry.
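If you're curious roughly how those two intervals compare, here's a small sketch using the simple normal-approximation ("Wald") interval for a proportion. This is just one common way to build such an interval; the video doesn't commit to a particular method:

```python
import math

def wald_interval(successes, attempts, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    p_hat = successes / attempts
    se = math.sqrt(p_hat * (1 - p_hat) / attempts)   # standard error shrinks as attempts grow
    return p_hat, (p_hat - z * se, p_hat + z * se)

for name, made, taken in [("Steph Curry", 61, 128), ("Meyers Leonard", 9, 15)]:
    p_hat, (lo, hi) = wald_interval(made, taken)
    print(f"{name}: estimate {p_hat:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")

# Curry's interval comes out at roughly (0.39, 0.56); Leonard's at roughly (0.35, 0.85),
# much wider, because 15 attempts carry far less information than 128.
```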
But you can see we'd have a much larger confidence interval for Meyers Leonard. In other words, we'd be less confident about where his long-term three-point percentage is going to be, and it could be down here, below Steph Curry's; and knowing what we do about the two players, it's probably likely to be less than Steph Curry's. So again, this is preparing you for what statistics can do, which is deal with and quantify uncertainty.

Now, we met theta just a second ago, which is this long-term three-point percentage, but when I described it as a Greek letter, I was referring to it essentially as what we call a parameter. So let's have a look at some common parameters that we're going to see in the study of statistics; you might have heard of some of these. Mu is often used for the mean of a numerical variable, so for example the mean height of players might be given mu. Sigma is the standard deviation of a numerical variable. Now, I haven't dealt with standard deviation in this video, but all standard deviation is, is a measure of the variation of a particular distribution, or of the uncertainty of a particular estimate. Another parameter is pi (these are all Greek letters, by the way, if you haven't noticed): pi is for the proportion of a categorical variable, so I could have used pi in the example I just gave. I ended up using theta, and as I say down the bottom here, theta is generally used for all parameters in some texts, and I like using theta because it's a bit more general, but pi is sometimes used for the proportion. Rho is used when you're dealing with the correlation between two variables, and beta is used for the gradient between two variables, which comes up in regression, a very important topic in statistics and one for which I've put together a whole series of videos; you can investigate the videos I've done on regression if you like. Again, all of these represent parameters: those unknowable fixed values that we try to estimate. They themselves do not have any uncertainty about them; technically they are these godly figures that we, as mere statisticians, merely try to estimate. And the way we estimate them is by taking a sample, and those sample statistics are given other symbols. For a numerical variable, say height, if we're taking the average height of a sample, that gets given the symbol x-bar. A standard deviation is given the symbol s, p is generally used for a proportion, r for correlation, and b for the gradient. So be prepared to see all of these lowercase Roman letters representing the sample values that estimate the parameters given in Greek. But I will say, be prepared also for your statistics textbook to break all of those rules, because despite these being conventions, sometimes you'll find they don't stick to them, annoyingly.

Alright, so with that under our belt, let's go and have a look at a very common topic in statistics called hypothesis testing. I'm going to start you off with an example rather than give you some kind of hypothetical definition here. Using the data we've just seen: is there enough evidence to suggest that Meyers Leonard is shooting above 50%? So let's review his stats again. He's got nine three-pointers made out of fifteen, and that's 0.6. So yes, sure, his sample proportion is greater than 50%, greater than 0.5, but is that suggesting to us that his long-term three-point performance is going to be above 0.5? Well, that is a question worthy of a hypothesis test.
So, as we saw in the previous section, there's going to be some variation, or variance, around this estimate of 0.600; it's not as if that's definitely going to be his long-term three-point proportion. So what statisticians like to do is set this thing called a null hypothesis, and it's given the symbol H-naught. Here we're going to set the null hypothesis to be that Meyers Leonard's long-term three-point percentage is less than or equal to 50%, less than or equal to 0.5. Now, why might we do that? Well, as statisticians we're always very conservative: we assume that the reverse is true, and then see if there's enough evidence to really budge from that assumption. It's kind of like when someone's on trial: the null hypothesis might be that they're innocent, and you really need a lot of evidence to budge from that null hypothesis. It's not good enough that there's just a little bit of evidence; you really need evidence beyond reasonable doubt, right? And that's the same with hypothesis tests. This here on the right-hand side is called the alternate hypothesis, and in general, whenever we're doing a hypothesis test in statistics, whatever we're seeking evidence for goes in the alternate hypothesis. Because we're very conservative as statisticians, we're always going to have a null hypothesis that the reverse is in fact true, and we're going to see if our sample is extreme enough, far enough away from that null hypothesis, to suggest that the alternate hypothesis might be true. This is the way you're going to be framing your thinking when you're dealing with statistics. Now, one thing: different textbooks will have different ways of writing a null hypothesis. They mean the same thing, but some will say theta is less than or equal to 0.5, and others might say something like theta is equal to 0.5, and it doesn't much matter, because the important thing is that theta being greater than 0.5 is in our alternate hypothesis.

So let's see how this pans out, using what we now understand about probability distributions. Essentially, we're going to start with this null hypothesis that theta is equal to 0.5 and ask: if it is indeed equal to 0.5, how many three-pointers out of 15 would Meyers Leonard sink? Well, here's the probability distribution if he truly is an exactly 50% three-point shooter. If he shoots 15 three-pointers, on average he's going to get 7.5 of them in, right? But of course you can't sink exactly 7.5, so 7 and 8 will be approximately the same height; they'll have about the same probability of occurring. He's less likely to get 6 or 9, less likely again to get 5 or 10, etc. So this is the probability distribution of Meyers Leonard's 15 three-point attempts, with the number of successes on this axis, assuming the null hypothesis is true. For those advanced players, this is actually a binomial distribution; if you're keen on learning more, I'll put a link up here now. Now, what did he get in this sample? He actually got 9. So what this tells us is that if indeed he has a 50% three-point percentage, it's still quite likely for him to get nine three-pointers out of 15. It's not beyond the realm of possibility that he could be a 50% three-point shooter and just happened to do a little bit better in his first 15 shots than expected. Now, if I was to ask you how much doubt this sample casts on our null hypothesis, in this case you'd say: well, not very much.
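Here's a hedged little sketch of that null distribution, using scipy's binomial distribution with 15 attempts and a success probability of 0.5 under the null hypothesis:

```python
from scipy.stats import binom

n, p_null = 15, 0.5   # 15 attempts, assuming the null hypothesis (a 50% shooter)

# Probability of sinking exactly k three-pointers out of 15 under the null.
for k in range(n + 1):
    prob = binom.pmf(k, n, p_null)
    print(f"{k:2d} made: {prob:.3f} {'#' * int(round(prob * 100))}")

# The distribution peaks at 7 and 8 made (each about 0.196) and tails off
# towards 0 and 15; making exactly 9 of 15 still has probability ~0.153.
```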
What if I then told you that someone scored 12 out of 15 three-pointers? Let's just say Meyers scored 12 out of 15 instead. At that point you're starting to think, you know what, that's quite unlikely. Now, it's still possible that he's truly a 50% three-point shooter and managed to just do better in his first 15 than we expected, but it's starting to cast some doubt on our null hypothesis. And this is what a hypothesis test does: it takes the sample and asks how extreme that sample is. Is it too extreme, given our null hypothesis, for us to realistically hold on to that null hypothesis? In reality, what's going to happen is we're going to construct what's called a rejection region: we'll find a point on the x-axis here, beyond which we're going to consider a sample too extreme to realistically hold on to the null hypothesis being true. Now, this yellow area can effectively be customized to determine how strict you want to be with rejecting the null hypothesis, but often it's chosen as 5% of the entire distribution, and we call this the level of significance. So we might say that the level of significance here is 5%, because if our sample statistic is in this upper 5%, we will consider it too extreme for the null hypothesis, and therefore reject the null hypothesis.

So, just to repeat: in this case, because Meyers Leonard got 9 out of 15, or 0.6, he was at this point here, at 9, and therefore not extreme enough to reject the null hypothesis. So even though in the sample he was shooting above 50%, it wasn't extreme enough to allow us to infer that he's shooting above 50% in the long term; we need more evidence, as conservative statisticians.

Anyway, I've just got a little extra section here for hypothesis testing, just for you to be aware of two important notes. The first is that we never, ever prove anything in a hypothesis test. So here again is that setup with Meyers Leonard, and our conclusion, which was to not reject the null hypothesis; as in, there's not enough evidence to suggest Meyers Leonard is shooting above 50%. Never say the word "prove" in your conclusion, which is the frustrating thing about statistics, I guess: you can never prove anything at all. All you can do is infer, so we were unable to infer that Meyers Leonard is shooting above 50%. The other thing we never say is the word "accept". Notice I've written "do not reject the null hypothesis": this was our null, but in the event that we do not reject it, you should never say that we then accept the null hypothesis. Don't forget, he made 60% of his first 15 three-pointers, so it's not as if we have evidence that he's less than a 50% three-point shooter; it's just that we don't have evidence that he's more than a 50% three-point shooter. That's a really important distinction. In fact, a whole judicial system relies on that distinction: when you find someone not guilty, that doesn't necessarily mean they're innocent, right? It just means that there's not been enough evidence to convince you of their guilt, and because of the presumption of innocence, they walk free.
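Before we move on to p-values, here's a rough sketch of how that 5% rejection region could be found for this example, again with the binomial null distribution. With a discrete distribution you can't hit exactly 5%, so this takes the largest upper tail that stays at or under 5%; the video doesn't pin down the exact cutoff, so treat the number as illustrative:

```python
from scipy.stats import binom

n, p_null, alpha = 15, 0.5, 0.05

# Smallest number of makes whose upper tail holds at most 5% of the distribution;
# that tail is the rejection region.
cutoff = next(k for k in range(n + 1) if binom.sf(k - 1, n, p_null) <= alpha)
print("reject the null if makes >=", cutoff)               # 12 out of 15
print("P(X >= 12) =", round(binom.sf(11, n, p_null), 4))   # ~0.018, inside the 5%
print("P(X >= 11) =", round(binom.sf(10, n, p_null), 4))   # ~0.059, just over 5%

print("observed 9 in rejection region?", 9 >= cutoff)      # False: do not reject
```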
Okay, so let's have a look at p-values now: the much-maligned p-values in statistics. To introduce them, consider a null hypothesis. Whatever null hypothesis we're testing, a hypothesis test assesses whether our sample is extreme enough to reject that null hypothesis; that's exactly what we did in the last section. What the p-value does is then measure how extreme the sample is. So the hypothesis test sort of sets up the goal posts, and we assess whether we've scored the goal or not, but the p-value measures exactly how far we kicked the ball, to continue with a fairly loose analogy. So here's the example again, using the same setup as before, with Meyers Leonard and the hypothesized 50% three-point percentage. Our test statistic was 9: he got 9 out of 15 three-pointers, right? And this is the distribution under the null hypothesis. So how extreme was the test statistic that we got? Well, we found out it wasn't extreme enough. The hypothesis test said to reject the null hypothesis if the test statistic is in the top 5% of the distribution, and indeed we found that he was not in the top 5% of the distribution. What the p-value does is take our test statistic and actually calculate that region. It says our test statistic is in the top 30.4% of the distribution: 0.304. So it's measuring how much of the distribution is at or above our test statistic; in other words, it's measuring how extreme our sample is. The smaller our p-value is, the more extreme our sample must have been, and therefore the more likely we are to reject the null hypothesis. And if the p-value is large, closer to one, we're less likely to reject the null hypothesis. This one was about 0.3, so we had quite a large pink area here, and we become less likely to reject the null hypothesis; that's exactly what happened in our case, where we did not reject our null hypothesis. Now, the final point I might make, and it's something you've probably figured out already, is that if this p-value drops below 0.05, it implies that our test statistic must be in the rejection region. Let me repeat that: if the p-value is less than 0.05, it means that our test statistic, wherever it is, must be in the rejection region. That rejection region, that yellow bit, was constructed so that 5% of the whole distribution is highlighted. So if our p-value is less than five percent, less than 0.05, if the pink bit is smaller than 0.05, we know that our test statistic must be somewhere in that rejection region. What that implies is that if the p-value is less than the level of significance for your hypothesis test, you're going to reject the null hypothesis. So it's a really quick way of assessing whether we're going to be rejecting our null hypothesis. Right, so all up, whenever you conduct a hypothesis test, let's recap: whatever you're seeking evidence for goes in your alternate hypothesis, and then if you conduct the test and your p-value is very, very small, that provides evidence for that alternate hypothesis; it provides evidence enough for us to reject the null hypothesis.
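And here's that p-value calculated directly, again with the binomial null distribution from before, taking the "at or above the test statistic" tail described above:

```python
from scipy.stats import binom

n, p_null, observed = 15, 0.5, 9

# One-sided p-value: probability of 9 or more makes out of 15 if theta really is 0.5.
p_value = binom.sf(observed - 1, n, p_null)   # P(X >= 9)
print(f"p-value = {p_value:.3f}")             # ~0.304, matching the figure in the video

# Compare against the 5% level of significance.
print("reject the null?", p_value < 0.05)     # False: not extreme enough
```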
Alright, so that's pretty much it for the theoretical component of this video. Stop the clock: did I get under 30 minutes? I don't think so; I think it was a few minutes over. But I'm not going to quite stop the video here, because I figured I might give you an extra little section to do with p-values, because they've been in the news over the last few years, I would say, and not necessarily in a good way. People have been throwing a lot of shade at scientific research over the last little while, and it's somewhat justified, due to this thing called p-hacking. So if you've had enough of the theoretical component of the stats today, well, I'll tell you, that's it, we're done. But let's have a look at p-hacking, to see how a misuse of p-values can invalidate scientific research.

So let's talk about what p-hacking is all about, and I might start with that same boring old probability density function that we saw before. As we've seen in hypothesis testing, we start with the null hypothesis that there's no effect, and then we take a sample, and we want to see a sample that's extreme enough for us to reject that null hypothesis. So we'll construct this rejection region, which is this yellow shaded region up here (my choice of colors might not have been the best, but hopefully you can see it's shaded yellow). If our sample lies up in this region here, we're able to reject the null hypothesis, and in doing so we would say that there's a significant effect. And all of a sudden, that's great: we'll be able to publish our paper showing that X affects Y, and we'll get all the plaudits from the research community. But here's the thing. Remember how I said that statistics doesn't prove anything? Well, this is exactly the case. If we have a sample which is in our rejection region, in other words a sample which is extreme enough for us to reject the null hypothesis, it doesn't mean the null hypothesis is false. It's still possible that we just happened to get a freak sample, right? The whole purpose of a p-value is to ask: how likely is it for us to get this sample statistic if the null hypothesis is true? And if the p-value is low enough, we go, oh, that's starting to become too low. But at the same time, as long as that p-value is non-zero, there is an outside chance that you just happened to get a freak sample when there was in fact no effect. To put it in basketball terms: say Meyers Leonard really was a 50% three-point shooter. It's still possible for him to score 14 or 15 out of 15 three-pointers, right? Very unlikely, if his true three-point percentage is 50%, but it's possible. So how does this relate to good and bad research? Well, in good research, what you do is theorize some kind of effect; maybe that's that red wine causes cancer, let's just use that as an example. We then collect our data, and we test only that effect, red wine causing cancer, and if we find the p-value of this test is less than 0.05, we can conclude strong evidence for the effect of red wine on cancer. And that's all well and good, and that's good research: that process of theorizing some effect, then collecting your data and testing that exact effect, is how one conducts good research. Now, bad research gets conducted like this, and unfortunately I'm going to suggest this gets done all the time. You collect your data first, with just the general idea of "let's see what causes cancer". So let's collect a whole bunch of data from people that have cancer, including a lot of lifestyle kinds of data: whether they smoke, whether they drink wine, all this kind of stuff. We're going to test all these different effects: we're going to test red wine, we're going to test smoking, we're going to test exercise, we're going to test exposure to main roads, all this kind of stuff. And then we're going to look through all those effects and find the ones where p is less than 0.05; let's just say the one that came up happened to be where we're testing red wine on cancer. And then we're going to publish our results and say: yep, red wine causes cancer, because the p-value is less than 0.05.
This is called p-hacking, and it is potentially rife in research, and it's quite problematic. Now, it's not necessarily obvious why this is so much worse than our good research over here on the left, but as I said before, when we conclude strong evidence for some effect, we're essentially saying there's a very, very low probability that this came about by chance. Now, what happens when you test 10 different things? If you test 10 different things, it becomes more likely that one of them, by chance, will be quite extreme in its sampling. Let's push it even further: if you test 20 different things, we're actually expecting one out of those 20 things to have a p-value less than 0.05. That's actually what the p-value means: if there's a 5% chance that the effect we've seen was just due to the randomness of the sampling process, then if we test 20 things, 5% of 20 is 1, so one of those 20 things is likely to show that strength of effect. So that's where p-hacking comes into it: we test all these different things, we find the one that happens to look significant, and we can then sort of pretend that that was the thing we were looking for the whole time. It's actually a big, big problem.

Anyway, that's hopefully brought into practice some of the stuff that you can learn in statistics. Of course, I've dealt with things in a very superficial way, but that was the whole point of this video. Look, if you liked this, I've got more in-depth discussions, ones where I go into the actual formulas and the mathematics of it all; you can check it all out on zstatistics.com. And hey, if you dig it, you can like and subscribe and do all those things that you're meant to do. But yeah, hope you enjoyed.
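As a quick postscript to that multiple-testing point, here's a small simulation sketch. The data is entirely made up: 20 "effects" that are all truly null, each tested with a two-sample t-test at the 5% level, so any significant result is a false positive by construction.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
false_positive_counts = []

# Simulate 1,000 "studies". Each study tests 20 effects that are all truly null
# (both groups drawn from the same distribution) at the 5% level.
for _ in range(1000):
    significant = 0
    for _ in range(20):
        group_a = rng.normal(0, 1, size=50)
        group_b = rng.normal(0, 1, size=50)      # no real difference between groups
        if ttest_ind(group_a, group_b).pvalue < 0.05:
            significant += 1
    false_positive_counts.append(significant)

counts = np.array(false_positive_counts)
print("average 'significant' effects per study:", round(counts.mean(), 2))   # close to 20 * 0.05 = 1
print("share of studies with at least one false positive:",
      round((counts > 0).mean(), 2))             # roughly 1 - 0.95**20, about 0.64
```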