Hey team, Justin Zeltzer here from
zstatistics.com, where today I'm responding to a challenge that was issued to me.
Someone asked me if I could explain statistics to them in under half an hour. While initially
I thought that was a bit of an ambitious ask, I thought no that's actually a really good
challenge, and one that I might do for everybody. So this is it! An introduction to statistics,
with no maths, and done in under half an hour. Now you can probably see that the timing of this video is a bit longer than that, but that's because I bunged on a little extra section at the end, which is a bit of an optional extra; I think I get most of it done in under half
an hour. But the idea is for you to develop your intuition around statistics, so it's great
for those people who are just enrolling in a statistics course and are a bit apprehensive,
or for others who aren't studying statistics, but kind of want to know what it's all about. And
to keep it light and interesting, I've themed all of the examples in this video on my latest
obsession, which is the NBA. Despite my proudly following Australian sports, my brother's got me hopelessly addicted to American basketball. So here we go: the first thing we're going to
delve into is what types of data we're going to encounter when we're dealing with statistics.
Now roughly we can divide data into two distinct classes: categorical data and numerical data.
Now they sound somewhat self-explanatory because numerical means numbers and categorical means
categories, so I'll give you some examples of those in a second, but categorical data can be
further split into nominal categorical data and ordinal categorical data. Nominal meaning
there is no order to the various categories of a particular variable and ordinal means that
there is some kind of order to the categories, and we'll see some examples in a second. Numerical
data can be further split into Discrete numerical data or Continuous numerical data, and again
we'll have a look at some examples right now. So if I was to ask you what team does Steph Curry
play for, you can see that clearly the answer to that question is not going to be numerical, so
it's a categorical piece of data. And here in these brackets I've put what's called the sample
space for this particular question. Now Steph Curry could either play for the Atlanta Hawks, the
Boston Celtics, etc. It turns out he plays for the Golden State Warriors, but all these potential values for what team Steph Curry plays for, when we combine them, we call the sample
space and you can see that there's no order to the teams. It doesn't really matter which order you
put them in, so that's why we would say this is a nominal piece of data. Now the question: what
position does Steph play? Well that might provide us with ordinal categorical data. Now Steph could
either play Guard, Forward or Center, or you could split that up into Point Guard, Shooting
Guard etc etc, but there is some loose order to these positions. The Guards generally play in the
back court, the forwards play closer to the ring, and the center plays underneath it. There's also a general kind of height difference, from smaller players who play guard, to taller players playing forward, to the tallest playing center. So while this is still categorical data,
there's some kind of order to it. Now an example of discrete numerical data might be how many free
throws has Steph missed tonight? Clearly he could miss 0, 1, 2, 3, and so on. This is numerical data, but
importantly he can't miss 1.5 or 2.3 free throws, right? So there are only discrete possible
values that this piece of data can take. And finally continuous data might be the question:
What's Steph's height? Now, Google lists Steph's height as 191 centimeters, but of course Steph's actual height might be something like 191.3, or 191.27, and so on. You can keep subdividing these centimeters into as many decimal places as you like, so height in this instance is an example of a
continuous numerical piece of data. Generally we kind of make height a discrete piece of data
because we're only really interested in whole centimeters, or in the case of the Imperial measure, whole inches. We don't usually care if someone's, you know, six foot three and a half
or six foot three and two thirds. But in a pure sense, you could say that height is continuous.
Now here's an interesting question: if I asked you what Steph's three-point percentage is this season, what kind of data do you think that is? Is it categorical or is it numerical, and which of these subcategories does it fit into? Feel free to pause the video and have a think about it, but spoiler alert, I'm about to ruin it for you when we have a look at a special kind of data called proportions.

Now, even though I've asked for Steph's three-point percentage, hopefully you can appreciate that a percentage is in fact just a proportion that's being expressed out of a hundred; one is a number out of a hundred and the other is just a number between zero and one, but it's the same thing. Note, though, that each three-point attempt Steph takes actually provides us with nominal data: it's either a three-pointer that's made or a three-pointer that's missed. What a proportion does is aggregate that information to provide a numerical summary figure. So in some senses a proportion is numerical, because obviously it provides us with a number, but it's built up off nominal data. Now, Steph Curry so far this season, in 2018-19, has attempted 128 three-point shots and made 61 of them, and this is his percentage: 0.4766. Each of those 128 shots is a nominal piece of information, and this proportion is a summary of them. And here's an interesting question for you: is a proportion discrete or continuous numerical data? That's not necessarily such an obvious question, and I might leave it for you to answer in the comments of this video, so feel free to start a little discussion on that; it's an interesting one, I think. Anyway, that's your data types.
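To make that concrete, here's a minimal Python sketch of a proportion aggregating nominal made/missed data into one number; the shot list is reconstructed from the 61-of-128 figures quoted above, not an actual shot-by-shot log.

```python
# Each three-point attempt is a nominal piece of data: "made" or "missed".
# 61 makes out of 128 attempts, as quoted above; the ordering is made up.
attempts = ["made"] * 61 + ["missed"] * 67

# A proportion aggregates all that nominal data into a single numerical summary.
proportion = attempts.count("made") / len(attempts)
print(round(proportion, 4))  # 0.4766
```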
Now, distributions. If I was to ask how the heights of NBA players are distributed, well, the smallest player currently playing in the 2018-19 season is Isaiah Thomas at 5 foot 9, and the largest player is Boban Marjanović at 7 foot 3, and all the other players will fit somewhere between the smallest and largest. Here we go, there's a picture of both of those two players. Anyway, what we can present is something called a probability density function, which essentially describes the distribution of all the players in between the smallest and largest player here. You can think about it in two ways: it's either the distribution of the whole population of basketball players that we have in the NBA, or alternatively it's the probability of selecting someone at random from that population at every given height. Quite clearly, as the bulk of the players are going to be somewhere in the middle, say six foot five or six foot six or something, if I was to select someone at random I would have the highest probability of selecting someone around that height, as opposed to selecting someone at five foot nine or seven foot three; there are just fewer players at those heights. Now, this curve I'm presenting here is a very common curve in statistics. Some people call it a bell curve, other people call it a normal distribution, but it's a very commonly occurring distribution, and it basically just means that the bulk of the distribution happens towards the middle and it gets rarer as you go towards the extremes. I've created a whole video on the normal distribution, which I'll put a little flash hyperlink up now, if you're keen on learning a bit more about it, but there's a symmetry about this distribution, with the bulk of the players being around that middle height.
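If you want to see that bell-curve idea numerically, here's a rough sketch; the mean and spread below are made-up numbers, purely for illustration, not actual league figures.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal (bell-curve) distribution at the value x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 198.0, 8.0  # assumed mean height in cm and spread, illustrative only
for height_cm in (175, 190, 198, 206, 221):  # roughly 5'9" up to 7'3"
    print(height_cm, round(normal_pdf(height_cm, mu, sigma), 4))
# The density peaks near the middle (198 cm here) and shrinks towards the extremes.
```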
Now, I've just assumed that this would be the distribution of basketballers' heights, but what other possible distributions might there be? A distribution like this would indicate that, if I was to select someone at random from the NBA, there'd be the same probability of them being six foot six as of being five foot nine, or indeed seven foot three. This is called a uniform distribution, which probably doesn't match up with the reality of the NBA. A distribution like this we can call a bimodal distribution: it's got two modes, where the mode is just the highest peak of the graph. Or something like this, which is a skewed distribution; let's just say there's a larger predominance of players up towards the seven-foot mark, and it gets a lot more scarce down towards the smaller players. This particular type of skew is actually called left skew, because the tail points in the left direction, and you can guess what right skew might look like.

Now, before we move on to have a look at sampling distributions, I just want to reiterate that the distribution we've been looking at describes the probability distribution of heights if I was to select one single player. But what if I had a whole sample of players, say ten players, and I wanted to know the probability distribution of their average height? For that, we'll be looking at sampling distributions. So the question is: if I select ten players at random, what is the probability distribution of their average height? OK, well, here's the underlying distribution again; there's five foot nine and there's seven foot three, and if I was to select someone at random, the probability density function would be a bit like that. But if I select ten people and take a look at their average, what will that distribution look like? It turns out that it'll have the same mean, but it'll be a lot skinnier. Now why is that? Well, think of it this way: if I select someone at random, it's possible that I select Isaiah Thomas. He's five foot nine, and while there might only be a few players in the league that are that small, it's still possible that I select that player at random. But if I'm selecting ten players, the probability of them having an average height of five foot nine is very, very small indeed; after selecting Isaiah Thomas, I'll have to select other players, and it's likely that they're going to be somewhere else in the distribution, so their average gets shifted up. When you take a sample, the larger your sample size is, the more unlikely you are to get extreme sample means, and that's why this distribution is going to be a lot skinnier than the distribution on the left here. This is important in statistics, because every study that ever gets conducted starts with a sample: you want to test some kind of effect, so you take a sample and then you make an inference using that sample. So it's important for us to get a handle on what happens when we take a sample: the distribution becomes a lot skinnier, or in other words, the variance of our statistic is reduced.
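Here's a quick simulated sketch of that effect; the height distribution is an assumption (normal, with a made-up mean and spread), just to show how averaging ten players shrinks the spread.

```python
import random
import statistics

random.seed(1)
mu, sigma = 198.0, 8.0  # assumed height distribution in cm, illustrative only

# Lots of single players, versus lots of averages of 10 randomly drawn players.
singles = [random.gauss(mu, sigma) for _ in range(10_000)]
means_of_10 = [statistics.mean(random.gauss(mu, sigma) for _ in range(10))
               for _ in range(10_000)]

print(round(statistics.stdev(singles), 2))      # about 8: spread of individual heights
print(round(statistics.stdev(means_of_10), 2))  # about 2.5: spread of the sample mean, roughly 8/sqrt(10)
```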
And that indeed takes us to sampling and estimation. My question is: how good is Steph Curry currently at three-pointers? In the current season, 2018-19, he's shot 128 threes and has nailed 61 of them, so that's 0.4766. What I'm going to try to get across here is that this is actually a sample statistic. He has a sample of size 128: 128 three-point attempts, 61 of which have been successes. So here we have that proportion, 0.4766, which is our sample statistic. Now, when I ask you how good Steph Curry is at three-pointers, appreciate that this 0.4766 is actually an estimate for this thing we're going to call theta. Theta is a Greek letter, and it represents exactly how good Steph Curry is; it's something we can never know. Maybe he's a 50% shooter but this season he's just a little bit off, or maybe he only shoots at 42% and this season he's doing much better. Either way, what statistics does is create this unknowable, almost god-like value of theta, which we can then try to estimate by taking a sample. So given our sample, where Steph Curry has currently got 0.4766, maybe the best estimate for theta is 0.4766; as in, if you were trying to guess Steph Curry's long-run three-point percentage, 0.4766 is probably your best bet. But appreciate that there's some kind of variance around this estimate, some kind of uncertainty. If Steph Curry shoots some more three-pointers, this proportion could either go up or down, and all of a sudden that would be our new best estimate for theta. The whole idea behind statistics is trying to get a hold of the uncertainty you have behind your estimates.

Now, I'm not going to enter into the calculations of these particular intervals in this video, but I've made plenty of videos that delve into this precise question. What statisticians like to do is create these things called 95% confidence intervals, where we can say, look, we don't know what theta is, but given our sample, we have 95% confidence that theta is between these two particular values, and generally our sample estimate is bang in the middle of those two limits. So that's what you're going to be doing when you study statistics: you're going to be developing means of calculating, of quantifying, this uncertainty you have over Steph Curry's long-term three-point percentage, or other things, maybe more meaningful.
Now here's an interesting point. We all know that Steph Curry is probably the best three-point shooter in the league, if not in basketball history, but at this point in the 2018-19 season we're only about 12 or 13 games in, and there's a player called Meyers Leonard who has scored 9 out of 15 three-pointers, so he has a three-point percentage of 0.6. Now, who do you think out of these two players is the better three-point shooter? If you were just looking at the sample statistics here, you'd say, well, Meyers Leonard is, right? Because he's got a 60%, or 0.6, proportion for three-pointers, whereas Steph Curry is only shooting 0.4766. So what is it about Meyers Leonard that might tweak your intuition that something's not quite right here? Well, let's investigate in a statistical way. This 0.6 that we've got for Meyers Leonard is an estimate for his theta, and I've got this in green. This is a different theta from the one we saw before, which was for Steph Curry; this is Meyers Leonard's long-term three-point percentage, and the best estimate for that, again, is our sample estimate, which is 0.600. But in this case we might find that the confidence interval we create is a lot larger for Meyers Leonard than it is for Steph Curry. Why is it a lot larger for Meyers Leonard? Well, because he's only had 15 three-point attempts in the season so far, so we're going to be less sure about where this value of theta is going to be for Meyers Leonard. But again, we can construct his 95% confidence interval, which is going to be a lot wider than Steph Curry's, because we actually had more information for Steph Curry. So if you put both of these side by side, the red being Steph Curry and the green being Meyers Leonard, it's true that if we didn't know anything about these two players, we'd still have the best estimate for Meyers Leonard being higher than for Steph Curry. But you can see we'd have a much larger confidence interval for Meyers Leonard; in other words, we'd be less confident about where his long-term three-point percentage is going to be, and it could be down here, below Steph Curry's. And knowing what we do about the two players, it's probably likely to be less than Steph Curry's. So again, this is preparing you for what statistics can do, which is deal with and quantify uncertainty.
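If you're curious what those intervals could look like, here's a rough sketch using the standard normal-approximation formula for a proportion; this is a generic textbook approximation, not necessarily the exact method used in my other videos.

```python
import math

def approx_95_ci(made, attempts):
    """Rough 95% confidence interval for a proportion, via the normal approximation."""
    p_hat = made / attempts
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / attempts)
    return p_hat - margin, p_hat + margin

for name, made, attempts in [("Curry", 61, 128), ("Leonard", 9, 15)]:
    low, high = approx_95_ci(made, attempts)
    print(name, round(low, 2), "to", round(high, 2))
# Curry comes out at roughly 0.39 to 0.56 (n = 128), Leonard at roughly 0.35 to 0.85 (n = 15):
# the smaller sample gives a much wider interval.
```

The width difference is driven almost entirely by the sample size sitting under the square root: with only 15 attempts, the uncertainty around the estimate is far larger.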
Now, we met theta just a second ago, which is this long-term three-point percentage, but when I described it as a Greek letter I was referring to it essentially as what we call a parameter. So let's have a look at some common parameters that we're going to see in the study of statistics; you might have heard of some of these. Mu is often used for the mean of a numerical variable, so for example the mean height of players might be given mu. Sigma is the standard deviation of a numerical variable; now, I haven't dealt with standard deviation in this video, but all standard deviation is is a measure of the variation, a measure of the uncertainty of a particular estimate or the variation of a particular distribution. Another parameter is pi (these are all Greek letters, by the way, if you haven't noticed); pi is for the proportion of a categorical variable, so I could have used pi in that example I just gave. I ended up using theta, and as I say down the bottom here, theta is generally used for all parameters in some texts, and I like using theta because it's a bit more general, but pi is sometimes used for the proportion. Rho is used when you're dealing with the correlation between two variables, and beta is used for the gradient between two variables, which is often used in regression, a very important topic in statistics and one for which I've put together a whole series of videos, so you can investigate the videos I've done on regression if you like.

Again, all of these represent parameters, those unknowable fixed values that we try to estimate. They themselves do not have any uncertainty about them; technically they are these godly figures that we merely try to estimate as mere statisticians. And the way we estimate them is by taking a sample, and those sample statistics are given other symbols. For a numerical variable, say height, if we're taking the average height of a sample, that gets given the symbol x-bar; the standard deviation is given the symbol s; p is generally used for the proportion, r for the correlation, and b for the gradient. So be prepared to see all of these lowercase Roman letters representing the sample values that estimate the parameters given in Greek. But I will say, be prepared also for your statistics textbook to break all of those rules, because despite them being conventions, sometimes you'll find they don't stick to them, annoyingly.
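As a tiny sketch of the statistic-versus-parameter idea: the sample below is made up, and x-bar and s are just the sample mean and sample standard deviation standing in as estimates of the unknowable mu and sigma.

```python
import statistics

# A made-up sample of player heights in centimetres.
sample_heights = [188, 191, 196, 198, 201, 203, 206, 208, 211, 216]

x_bar = statistics.mean(sample_heights)  # sample mean: our estimate of mu
s = statistics.stdev(sample_heights)     # sample standard deviation: our estimate of sigma

print(round(x_bar, 1), round(s, 1))
# mu and sigma themselves stay unknown; x-bar and s are just the estimates our sample gives us.
```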
All right, so with that under our belt, let's go and have a look at a very common topic in statistics called hypothesis testing. Now, I'm going to start you off with an example rather than give you some kind of hypothetical definition here. Using the data we've just seen: is there enough evidence to suggest that Meyers Leonard is shooting above 50%? Let's review his stats again. He's got nine three-pointers made out of fifteen, and that's 0.6. So yeah, sure, his sample proportion is greater than 50%, greater than 0.5, but is that suggesting to us that his long-term three-point performance is going to be above 0.5? Well, that is a question worthy of a hypothesis test. As we saw in the previous section, there's going to be some variation, some variance, around this estimate of 0.600; it's not as if that's definitely going to be his long-term three-point proportion. So what statisticians like to do is set this thing called a null hypothesis, and it's given the expression H-naught, and here we're going to set the null hypothesis to be that Meyers Leonard's long-term three-point percentage is less than or equal to 50%, less than or equal to 0.5. Now, why might we do that? Well, as statisticians we're always very conservative: we assume that the reverse is true and then see if there's enough evidence to really budge from that assumption. It's kind of like when someone's on trial: the null hypothesis might be that they're innocent, and you really need a lot of evidence to budge from that null hypothesis. It's not good enough that there's just a little bit of evidence; you really need evidence beyond reasonable doubt, right? And that's the same with hypothesis tests. So this here on the right-hand side is called the alternate hypothesis, and in general, whenever we're doing a hypothesis test in statistics, whatever we're seeking evidence for goes in the alternate hypothesis. Just for the reason that we're very conservative as statisticians, we're always going to have a null hypothesis that the reverse is in fact true, and we're going to see if our sample is extreme enough, is far enough away from that null hypothesis, to suggest that the alternate hypothesis might be true. This is the way you're going to be framing your thinking when you're dealing with statistics. Now, one thing different textbooks will do is have different ways of writing the null hypothesis. They both mean the same thing, but some will say theta is less than or equal to 0.5 and others might say something like theta is equal to 0.5, and it doesn't much matter, because the important thing is that theta being greater than 0.5 is in our alternate hypothesis.

So let's see how this pans out using what we now understand about probability distributions. Essentially, we're going to start with this null hypothesis that theta is equal to 0.5, and if it indeed is equal to 0.5, how many three-pointers out of 15 would Meyers Leonard sink? Well, here's the probability distribution if he truly is an exactly 50% three-point shooter. If he shoots 15 three-pointers, on average he's going to get 7.5 of those in, right? But of course you can't sink exactly 7.5, so 7 and 8 will be approximately the same height on the graph, so they'll have the same probability of occurring; he's less likely to get 6 or 9, less likely again to get 5 or 10, and so on. So this is the probability distribution of Meyers Leonard's 15 three-point attempts, where the number of successes is on this axis, assuming the null hypothesis is true. For those advanced players, this is actually a binomial distribution; if you're keen on learning more, I'll put a link up here.
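For those keen, here's a sketch of that distribution using the binomial probability formula; the n = 15 attempts and the null value of 0.5 come straight from the example.

```python
from math import comb

n, p = 15, 0.5  # 15 attempts, null hypothesis: a true 50% three-point shooter

# Probability of exactly k makes out of n under the null hypothesis.
for k in range(n + 1):
    prob = comb(n, k) * p**k * (1 - p)**(n - k)
    print(k, round(prob, 3))
# 7 and 8 makes are the most likely outcomes (about 0.196 each),
# and the probabilities tail away towards 0 and 15 makes.
```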
Now, what did he get in this sample? Well, he actually got 9, so what this tells us is that if indeed he has a 50% three-point percentage, it's still quite likely for him to get nine three-pointers out of 15. It's not beyond the realm of possibility that he could be a 50% three-point shooter and just happened to do a little bit better in his first 15 shots than expected. If I was to ask you how much doubt this sample casts on our null hypothesis, in this case you'd say, well, not very much. What if I then told you that someone scored 12 out of 15 three-pointers? Let's just say Meyers scored 12 out of 15 instead. At that point you're starting to think, you know what, that's quite unlikely. It's still possible that he's truly a 50% three-point shooter and managed to just do better in his first 15 than we expected, but it's starting to cast some doubt on our null hypothesis. And this is what a hypothesis test does: it takes the sample and asks how extreme that sample is. Is it too extreme, given our null hypothesis, for us to realistically hold on to that null hypothesis? So in reality, what's going to happen is we're going to construct what's called a rejection region: we'll find a point on this x-axis beyond which we're going to consider it too extreme to realistically hold on to the null hypothesis being true. Now, this yellow area can effectively be customized to determine how strict you want to be with rejecting the null hypothesis, but often it's chosen as 5% of the entire distribution, and we call this the level of significance. So we might say that the level of significance here is 5%, because if our sample statistic is in this upper 5%, we will consider it too extreme for the null hypothesis and therefore reject the null hypothesis. Just to repeat: in this case, because Meyers Leonard got 9 out of 15, or 0.6, he was at this point here, at 9, and therefore not extreme enough to reject the null hypothesis. So even though in the sample he was shooting above 50%, it wasn't extreme enough to allow us to infer that he's shooting above 50% in the long term; we need more evidence, as conservative statisticians.
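And here's a sketch of how that 5% rejection region could be located for this example: find the smallest number of makes whose upper tail probability, under the null, is at most 0.05.

```python
from math import comb

n, p, alpha = 15, 0.5, 0.05

def upper_tail(k):
    """P(k or more makes out of n) under the null hypothesis of a 50% shooter."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# The rejection region starts at the smallest k whose upper tail is at most 5%.
cutoff = next(k for k in range(n + 1) if upper_tail(k) <= alpha)
print(cutoff)       # 12: the rejection region is 12 or more makes out of 15
print(9 >= cutoff)  # False: 9 makes is not extreme enough to reject the null
```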
Anyway, I've just got a little extra section here for hypothesis testing, just for you to be aware of two important notes. The first thing is that we never, ever prove anything in a hypothesis test. So here again is that setup with Meyers Leonard and our conclusion, which was to not reject the null hypothesis, as in there's not enough evidence to suggest Meyers Leonard is shooting above 50%. Never say the word prove in your conclusion, which is the frustrating thing about statistics, I guess: you can never prove anything at all; all you can do is infer. So we were unable to infer that Meyers Leonard is shooting above 50%. The other thing we never say is the word accept. Notice I've written it as do not reject the null hypothesis. This was our null, but in the event that we do not reject it, you should never say we then accept the null hypothesis, because don't forget he scored 60% on his first 15 three-pointers, so it's not as if we have evidence that he's less than a 50% three-point shooter; it's just that we don't have evidence that he's more than a 50% three-point shooter. That's a really important distinction. In fact, a whole judicial system relies on that distinction: when you find someone not guilty, that doesn't necessarily mean that they're innocent, right? It just means that there's not been enough evidence to convince you of their guilt, and because of the presumption of innocence, they walk free.
OK, so let's have a look at p-values, the much-maligned p-values in statistics. To introduce them, I've said to consider a null hypothesis: whatever null hypothesis we're testing, hypothesis tests assess whether our sample is extreme enough to reject the null hypothesis. That's exactly what we did in the last section. What the p-value does is measure how extreme the sample is. So the hypothesis test sort of sets up the goal posts and we assess whether we've scored the goal or not, but the p-value measures out exactly how far we kicked the ball, to continue with a fairly loose analogy there. So here's the example again. We're using the same setup as before, with Meyers Leonard and the 50% three-point percentage, and our test statistic was 9: he got 9 out of 15 three-pointers, right? And this is the distribution under the null hypothesis. So how extreme was the test statistic that we got? Well, we found out it wasn't extreme enough, right? The hypothesis test said: reject the null hypothesis if the test statistic is in the top 5% of the distribution, and indeed we found that he was not in the top 5% of the distribution. What the p-value does is take our test statistic and actually calculate that region. It says our test statistic is in the top 30.4 percent of the distribution, 0.304. So it's actually measuring how much of the distribution is at or above our test statistic; in other words, it's measuring how extreme our sample is. The smaller our p-value is, the more extreme our sample must have been, and therefore the more likely we are to reject the null hypothesis; and if the p-value is large, closer to one, we're less likely to reject the null hypothesis. This one was 0.3, so we had quite a large pink area here, and we become less likely to reject the null hypothesis. And that's exactly what happened in our case: we did not reject our null hypothesis.
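That 0.304 can be sketched directly: the p-value here is just the probability, under the null, of seeing 9 or more makes out of 15.

```python
from math import comb

n, p, observed = 15, 0.5, 9

# P(9 or more makes out of 15) if he truly is a 50% shooter.
p_value = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(observed, n + 1))
print(round(p_value, 3))  # about 0.304, nowhere near small enough to reject the null
```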
Now, the final point I might make, and it's something you've probably figured out already, is that if this p-value drops below 0.05, it implies that our test statistic must be in the rejection region. Let me repeat that: if the p-value is less than 0.05, it means that our test statistic, wherever it is, must be in the rejection region. That rejection region, that yellow bit, let's go back, was constructed so that five percent of the whole distribution is highlighted. So if our p-value is less than five percent, less than 0.05, if the pink bit is less than 0.05, we know that our test statistic must be somewhere in that rejection region. What that implies is that if the p-value is less than the level of significance for your hypothesis test, you're going to reject the null hypothesis. So it's a really quick way of assessing whether we're going to be rejecting our null hypothesis. So, all up, whenever you conduct a hypothesis test, let's sort of recap: whatever you're seeking evidence for goes in your alternate hypothesis, and then, if you conduct the test and your p-value is very, very small, that provides evidence for that alternate hypothesis; it provides evidence enough for us to reject the null hypothesis.
All right, so that's pretty much it for the theoretical component of this video. Stop the clock: did I get under 30 minutes? I don't think so, I think it was a few minutes over, but I'm not even going to quite stop the video here, because I figured I might give you an extra little section to do with p-values. They've been in the news over the last few years, I would say, and not necessarily in a good way. People have been throwing a lot of shade at scientific research over the last little while, and it's somewhat justified, due to this thing called p-hacking. So if you've had enough of the theoretical component of the stats today, well, I'll tell you, that's it, we're done; but let's have a look at p-hacking to see how a misuse of p-values can invalidate scientific research. So let's talk about what p-hacking is all about, and I might start with that same boring old probability density function that we saw before.
Now, as we've seen in hypothesis testing, we start with the null hypothesis that there's no effect, and then we take a sample, and we want to see a sample that's extreme enough for us to reject that null hypothesis. So we'll construct this rejection region, which is this yellow shaded region up here; my choice of colors might not have been the best, but hopefully you can see that's shaded yellow. If our sample lies up in this region, we're able to reject the null hypothesis, and in doing so we would say that there's a significant effect. And all of a sudden that's great: we'll be able to publish our paper to show that X affects Y, and we'll get all the plaudits from the research community. But here's the thing. Remember how I said that statistics doesn't prove anything? Well, this is exactly the case: if we have a sample which is in our rejection region, in other words a sample which is extreme enough for us to reject the null hypothesis, it doesn't mean the null hypothesis is false. It's still possible that we just happened to get a freak sample, right? The whole purpose of a p-value is to say, well, how likely is it for us to get this sample statistic if the null hypothesis is true? And if the p-value is low enough we go, oh, that's starting to become too low. But at the same time, as long as that p-value is nonzero, there is an outside chance that you just happened to get a freak sample where there was in fact no effect. To put it in basketball terms, just say Meyers Leonard were a 50% three-point shooter: it's still possible for him to score 14 or 15 out of 15 three-pointers, right? Very unlikely, if his true three-point percentage is 50%, but it's possible. So how does this relate to good and bad research? Well, in good research,
what you do is theorize some kind of effect, and maybe that might be that red wine causes cancer, let's just say that as an example, right? We then collect our data and we test only that effect, red wine causing cancer, and if we find the p-value of this test is less than 0.05, we can conclude some strong evidence for the effect of red wine on cancer. And that's all well and good, and that's good research: that process of theorizing some effect, then collecting your data and testing that exact effect, is how one conducts good research. Now, bad research gets conducted like this, and unfortunately I'm going to suggest this gets done all the time. You collect your data first, with just the general idea of, let's see what causes cancer. So let's collect a whole bunch of data from people that have cancer, a lot of lifestyle kinds of data as well: whether they smoke, whether they drink wine, all this kind of stuff. We're going to test all these different effects: we're going to test red wine, we're going to test smoking, we're going to test exercise, we're going to test exposure to main roads, all this kind of stuff. And then we're going to look through all those effects and find the ones where p is less than 0.05; let's just say that happened to be for the test of red wine on cancer. And then we're going to publish our results and say, yep, red wine causes cancer, because the p-value is less than 0.05. This is called p-hacking, and it is potentially rife in research, and it's quite problematic.
Now, it's not necessarily obvious why this is so much worse than our good research over here on the left, but as I said before, when we conclude strong evidence for some effect, we're essentially saying there's a very, very low probability that this came about by chance. Now, what happens when you test 10 different things? If you test 10 different things, it becomes more likely that one of them, by chance, will be quite extreme in its sampling. Well, let's push it even further: if you test 20 different things, we're actually expecting 1 out of those 20 things to have a p-value less than 0.05. That's actually what the p-value means: if there's a 5% chance that the effect we've seen was just due to the randomness of the sampling process, then if we test 20 things, 5% of 20 is 1, so one of those 20 things is likely to show that strength of effect. So that's where p-hacking comes into it: we test all these different things, we just find the one that happens to look significant, and we can then sort of pretend that that was the thing we were looking for the whole time. It's actually a big, big problem.
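To see the multiple-testing problem in action, here's a toy simulation, nothing to do with any real cancer data: twenty "effects" are tested where none of them is real, and on average about one of them still comes up with p < 0.05.

```python
import math
import random
import statistics

random.seed(42)

def null_study_p_value(n=30):
    """One fake study: two groups from the same distribution, so there is no real effect."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    z = diff / se
    # Two-sided p-value from a normal approximation (rough, but fine for a toy example).
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# "Test 20 different things" when none of them is a real effect.
p_values = [null_study_p_value() for _ in range(20)]
print(sum(p < 0.05 for p in p_values))  # typically around one false positive, sometimes zero or a few
```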
Anyway, that's hopefully brought into practice some of the stuff you can learn in statistics. Of course, I've dealt with things in a very superficial way, but that was the whole point of this video. Look, if you liked this, I've got more in-depth discussions, ones where I go into the actual formulas and the mathematics of it all; you can check it all out on zstatistics.com. But hey, if you dig it, you can like and subscribe and do all those things that you're meant to do. But yeah, hope you enjoyed it.