Transcript for:
Understanding Bootstrapping in Statistics

[Music] Quest! Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about bootstrapping, part one: main ideas.

Now, imagine we had a new drug to treat an illness, and we gave that drug to eight different people who had the illness. For five of these people, the drug appeared to help them feel better, but for three people, the drug appeared to make them feel worse. If we calculate the mean of the response to the drug, we get 0.5. 0.5 is not a huge improvement, but since most of the people, five of eight, improved, maybe this drug is better than using no drug at all. However, maybe these five people all felt better because they were healthier to begin with, and maybe these three people all felt worse because they had unhealthy lifestyles. So it is possible that the reason we got a mean value equal to 0.5 instead of 0 is because of random things that we can't control.

Is there anything we can do to decide if the drug works or not? Yes. One expensive and time-consuming option would be to replicate the experiment a bunch of times. If we repeat the experiment a bunch of times, then we can keep track of each mean value, and we will end up with a histogram of mean values. Just by looking at this distribution, we can see that mean values close to zero, which suggest that the drug does not do anything, are relatively likely to occur, and mean values far from zero, indicating that the drug does something, are relatively rare.

However, as I said earlier, repeating the experiment a bunch of times is both expensive and time-consuming. Is there something else we can do that is less expensive and time-consuming? Yes. Instead of replicating the experiment a bunch of times, we can use bootstrapping. Bam!

So let's use bootstrapping to get a better sense of which results are likely and which are rare. First, let's create a new number line. Now, from the eight original measurements, choose one at random and add that value to the new number line. Now go back to the original eight measurements and choose another value
at random and add it to the new number line. Then we repeat that process, randomly selecting one of the eight original values for the new number line, a total of eight times. Note: we can randomly select the same value more than once.

Oh no, it's the dreaded terminology alert! Randomly selecting data and allowing for duplicates is called sampling with replacement.

Anyway, so far we've only selected six measurements, so we need two more. Bip, boop. Note: the reason we selected eight measurements for the new number line is that the original data set we are sampling from contains eight measurements. If we had started with 10 measurements, then we would need to add 10 measurements to the new number line. Anyway, this new data set, which was created using sampling with replacement so that it had the same number of values as the original data set, is called a bootstrapped data set.

Okay, now that we have a new bootstrapped data set, we calculate the mean. Note: because the bootstrapped data set is different from the original data set, we get a different mean. Now let's add the mean of the bootstrapped data set to what will soon be a histogram of means. Now we start over with a fresh number line and randomly select from the eight original values for the new number line, repeating a total of eight times and allowing duplicates. Then we calculate the mean and add that to our histogram.

Note: this process of creating a bootstrapped data set, then calculating something (in this case, we calculated the mean), then keeping track of those calculations is called bootstrapping. In other words, bootstrapping consists of four steps: first, make a bootstrapped data set; second, calculate something (in this case, we calculated the mean); third, keep track of that calculation; and fourth, repeat steps one through three a bunch of times. Note: in step two we calculated the mean, but we could have just as easily calculated the median or the standard deviation or any other statistic. Later on I'll say more about why this flexibility is awesome. For
now, I'll just say bam!

Okay, now that we know what bootstrapping is, we just do it a bunch of times. Usually we use a computer to bootstrap thousands of times, and after creating thousands of bootstrap samples, calculating their means, and adding them to the histogram, we end up with this. Because we sampled with replacement, the histogram ended up with a wide variety of mean values. Because there are so many possible combinations, bootstrapping usually only creates a subset, like 10,000, to estimate the full distribution.

In this case, the histogram tells us how the mean might change if we redid the experiment a bunch of times. Just by looking at the histogram, we can get a sense of what might happen if we redid the experiment. If we redid the experiment, there's a high likelihood we will get something close to the original mean, and getting something really far from the original mean should be relatively rare.

Because the histogram tells us how the mean might change if we redid the experiment a bunch of times, if we want to know the standard error of the mean value from the original data set, we only need to calculate the standard deviation of this distribution. And a 95% confidence interval is just an interval that covers 95% of the bootstrap means. Double bam!

In this case, we see that the 95% confidence interval covers zero, so we cannot reject the hypothesis that the drug is not doing anything. Note: what we just did with the confidence interval was a type of hypothesis testing. If you want to learn more about hypothesis testing, check out the Quest. Also note: just so you know, there are other, fancier ways to use bootstrapping to calculate confidence intervals. While these fancy methods can result in better confidence intervals, we'll save them for another day, since the purpose of this video is to explain the main ideas behind bootstrapping. Small bam.

Now, so far we have used bootstrapping to calculate the standard error and a confidence interval for the mean. However, both the standard error and the
confidence interval can be calculated directly with a formula, without having to create bootstrapped data sets. So what is it that makes bootstrapping so awesome? The awesome thing about bootstrapping is that we can apply it to any statistic to create a histogram of what might happen if we repeated the experiment a bunch of times, and we can use that histogram to calculate stuff like standard errors and confidence intervals without having to worry about whether or not there is a nice formula. For example, if we started out by calculating the median of the original data, then we can use bootstrapping to create a distribution and use that distribution to create the confidence interval. So regardless of the statistic we calculate, bootstrapping allows us to see it in the context of a distribution, and we can use that distribution to help us interpret the initial results. Triple bam!

Now it's time for some shameless self-promotion. If you want to review statistics and machine learning offline, check out the StatQuest study guides at statquest.org. There's something for everyone.

Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs or a t-shirt or a hoodie, or just donate. The links are in the description below. Alright, until next time, quest on!
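
The four bootstrapping steps described in this transcript can be sketched in Python. This is a minimal sketch and not code from the video. The eight response values are made up, chosen only so that five are positive, three are negative, and the mean is 0.5; the transcript never states the actual measurements. The function names `bootstrap` and `percentile_interval`, the random seed, and the choice of 10,000 repetitions are also illustrative assumptions. The confidence interval is the simple percentile version described above, not one of the fancier methods.

```python
import random
import statistics

# HYPOTHETICAL data: invented values with mean 0.5 (five positive, three negative).
responses = [1.5, 2.0, 1.0, 2.5, 1.8, -1.4, -2.0, -1.4]


def bootstrap(data, statistic, n_boot=10_000, seed=0):
    """Return `n_boot` values of `statistic`, one per bootstrapped data set."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    results = []
    for _ in range(n_boot):  # step 4: repeat steps 1-3 a bunch of times
        # Step 1: sample with replacement, same size as the original data.
        sample = rng.choices(data, k=len(data))
        # Steps 2 and 3: calculate something and keep track of it.
        results.append(statistic(sample))
    return results


def percentile_interval(values, level=0.95):
    """Interval covering roughly the central `level` share of `values`."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1 - level) / 2
    lower_index = int(tail * n)
    upper_index = min(n - 1, int((1 - tail) * n))
    return ordered[lower_index], ordered[upper_index]


# Bootstrap the mean.
boot_means = bootstrap(responses, statistics.mean)

# Standard error = standard deviation of the bootstrap distribution.
standard_error = statistics.stdev(boot_means)

# 95% confidence interval = interval covering 95% of the bootstrap means.
ci_low, ci_high = percentile_interval(boot_means)

print("original mean:", statistics.mean(responses))
print("bootstrap standard error:", standard_error)
print("95% confidence interval:", (ci_low, ci_high))
print("interval covers zero:", ci_low <= 0 <= ci_high)

# The same function works for any other statistic, such as the median.
boot_medians = bootstrap(responses, statistics.median)
print("95% interval for the median:", percentile_interval(boot_medians))
```

Passing a different function as `statistic` (for example `statistics.pstdev`) is all it takes to bootstrap another statistic, which is the flexibility the video highlights.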