Transcript for:
Inference for Categorical Data Overview

What's up, my Z-Stars! Michael PR here, ready to talk to you about the Unit 6 summary review video. Now, this is part one of two videos, part one and part two, that are going to summarize all of Unit 6, which is over inference for categorical data, with an emphasis on proportions. But before we begin, let me talk about two real quick things.

First, this is just a summary review video. It does not cover every single teeny-tiny topic. I mean, I do try to cover it all, to be honest, but there are plenty of things that your teacher probably taught you in class that are super important as well. This video is just trying to give you a recap of it all.

Second, stop right now, whatever you're doing, and get out that study guide from the Ultimate Review Packet. If you don't have it, visit the Ultimate Review Packet, find the Unit 6 study guide, print it out, open it up, and get ready to do it. You could do it while you're watching this video, or you could do it after watching the video, but everything we talk about in this video you could then go and practice in that study guide, to make sure that you are ready for your Unit 6 test in class and ready for the AP exam in May.

This unit is all about inference for proportions. Now, proportions mainly come from categorical data. For example: what proportion of high school students did their math homework last night? Or what is the difference between the proportion of males that did their homework last night and the proportion of females that did their homework last night?

Now, inference is a really important topic, and understanding it right now is going to really help you out, because Units 6, 7, 8, and 9 are all about statistical inference. Inference is using sample statistics to make judgments about a population parameter. So maybe a sample of high school students showed that 78% did their math homework last night. Well, what does that mean about the larger population of all students? Is it also 78%? Probably not; it could be a little bit more,
could be a little bit less, but that's what this unit is all about: trying to use that sample statistic to make a larger judgment or prediction about what is true for the population proportion.

Now, that inference comes in two different procedures: confidence intervals and significance tests. A confidence interval uses a sample statistic to try to make a prediction for what the population parameter could be. A significance test is used to determine if a claim that has been made about a population parameter is true or false. These two procedures could be used to analyze a single sample proportion taken from a single population, or they could be used to analyze the difference between two sample proportions taken from two different populations. In this Unit 6 part one video we're going to be going over confidence intervals for population proportions, and in the part two video we're going to be talking about significance tests for population proportions.

Let's start off by looking at a confidence interval for one sample proportion. But first, this entire process starts with analyzing a sample, and when it comes to looking at samples there are a couple of rules or conditions that we have to follow, primarily that they're random and independent. But there's a little bit more detail that gets involved. The sample must be random to avoid bias, the sample size must be less than 10% of the population in order for independence to be assumed, and the sample must be big enough. Now, how do you know your sample's big enough when it comes to proportions? Well, when you're working with proportions, that means we need our sample to contain 10 or more successes and 10 or more failures. Now, a success is simply what we're looking for. So if our sample involves looking at the proportion of students that do their homework, then we would need 10 or more people that do their homework, and then failures would be people that do not do their homework, so we need 10 or more of those as well. So as long as these three conditions
are met for our sample, we are allowed to proceed through all the processes that we're going to talk about in this video and in the part two video. But anytime you're analyzing a sample, the sample's got to be random and independent, and checking these conditions ensures that.

Now recall, inference is taking a sample statistic and trying to make a judgment about the larger population parameter that that sample came from. Now, there's no way that a sample proportion is going to match a population proportion perfectly. For example, let's say that a random sample of 780 teachers in the United States revealed that 82% currently have college loan debt that they are paying off. This doesn't mean that 82% of all teachers in the US have college debt, but it should be close, right? We call the 82% a point estimate. It isn't the population proportion, but it will point us to what it could be.

But what we learned from sampling distributions is that many sample proportions are really, really close to the true population proportion. Recall, a sampling distribution of a sample proportion contains all possible sample proportions of the same size taken from the same population, and it looks like this. We see that the center is the mean of all the sample proportions, and it will be the true population proportion. The spread is the standard deviation of all the possible sample proportions, and we could use this formula to find that standard deviation. And the shape of all the possible sample proportions is, of course, approximately normal. But the only way these three things are going to be true is if those three conditions I mentioned earlier are true as well, which is why we always have to check them.

And for example, we know that 95% of all possible sample proportions are within two standard deviations of the true proportion in the center. Well, actually, let's be a little bit more accurate with that two-standard-deviation concept. If 95% are in the center, then 5% of sample proportions are left out: 2.5% at
the very, very bottom and 2.5% at the very, very top. So using either invNorm on your calculator or a z-table, we see that the z-score with 2.5% at the bottom is actually -1.96, and the 2.5% at the top starts at positive 1.96. So the middle 95% of all sample proportions is within 1.96 standard deviations of the true proportion in the center. 1.96 is a little bit more accurate, as opposed to just saying two.

So the idea is simple. Our sample proportion of 82% isn't perfect, but we are 95% confident it is pretty close to the true proportion, and by pretty close we mean it's within 1.96 standard deviations of that true value in the center. So if we start with our sample proportion, which could be anywhere within the sampling distribution, and add 1.96 standard deviations and subtract 1.96 standard deviations, we would capture the true population proportion within it. At the very least, we can say that we are 95% confident that the true proportion is in our interval.

So we do not know the true population proportion that's in the center, but we do have our sample proportion. And for example, it could fall a little bit to the left, a little bit on the low side, but if we add or subtract 1.96 standard deviations to our p-hat, we create an interval that should capture the truth in it. As long as we are one of those 95% of samples that's in the middle, our interval around our sample proportion should contain the truth. Or our sample proportion might be a teeny bit on the high side, but once again, add or subtract 1.96 standard deviations to our p-hat and we create an interval that does contain our true population proportion p.

Now, it is possible that we get a sample proportion that's really, really high, one of those very, very rare sample proportions at that very high tail, and if we add or subtract 1.96 standard deviations to this p-hat, unfortunately we will not capture the true proportion in the middle. We'll miss it. Well, that's why we're only 95% confident, but that's really, really confident. So it's a very
simple concept: 95% of sample proportions are in the center near the true population proportion, so as long as we take our sample proportion and create this interval around it, we should be one of the samples in the center. Hence our interval around our sample proportion should contain the true population proportion within it.

So really, that's it for building a confidence interval. We take our sample proportion p-hat and we add or subtract 1.96 standard deviations to it to create an interval where we have 95% confidence the true population proportion will fall. Since 95% of all sample proportions are near the true proportion, we could be 95% confident ours is too. But wait, there are a few more details. Oh, that's funny.

Wait. First, we actually can't call it standard deviation, because check out the formula for standard deviation: it requires p, the true population proportion, which we don't know. Remember, that's the whole point of a confidence interval: trying to use a sample proportion to find out what the true population proportion could be. So that's why we can't actually use standard deviation. We have no choice but to use our sample proportion p-hat in place of the true population proportion p that we don't know. So we can't call it standard deviation, because it's not actually using the proper formula, but we can call it standard error. Think of standard error as the twin brother of standard deviation. It's the exact same formula, but instead of the p's, which we don't know, we replace them with p-hat, our sample proportion. Think of it as the best we could do in this situation.

Second, we can change how confident we want to be. We don't always have to be 95% confident. If we want to be a little bit more confident, like 99%, we're going to make our interval a little bit wider. If we want to be a little bit less confident, like 90%, we could make our interval a little bit more narrow. Now, how many standard deviations we're willing to reach up and down is called our z*,
it's our critical z-score for how confident we want to be. We could be 90% confident, 92% confident, 95%, 98%, 99%; it's really up to you and what the question is asking.

Now let's talk about exactly how to find our z* based on our level of confidence. The C% confidence level refers to the C percent of sample proportions that are in the middle of the sampling distribution, surrounding the true population proportion in the center. Now, alpha is what we call the outside proportion, which is 1 minus C. So again, if we're 95% confident, that is 1 minus 95%, which is 5% on the outside. That's alpha, the 0.05. But then what we have to do is actually take half of alpha, because of symmetry: half of alpha is at the bottom, half of alpha is at the top. Again, an example would be 95% in the middle, 5% left out, but that 5% gets split up, 2.5% at the bottom and 2.5% at the top.

Then, to find our z* (again, this is that critical z-score that we want based on our level of confidence), we're going to go to either a calculator or a z-table to actually figure it out. But we need to know that alpha divided by 2 value, because a calculator or z-table requires the area at the bottom, or below your confidence interval.

So for example, if we want to be 90% confident, that's 90% of sample proportions in the center, which means 10% are left out, and half of 10% is 5%, so our alpha divided by 2 is 0.05. So if we have a TI-84 calculator, we could do invNorm, the area at the bottom is 0.05, hit enter, and we get a z-score of -1.645, and then the top would be positive 1.645. So our critical z* is plus or minus 1.645. We could also use a z-table to look up that bottom 0.05, and then we see the z-score would be -1.64 or -1.65, or you can take the average of the two there and get the -1.645 that we got.

Another example could be 99% confident, which means 99% of sample proportions are in the middle. That leaves an alpha value of 1% that's left out. Split that in half:
half a percent at the bottom, half a percent at the top, and again use invNorm or a z-table to look up that half a percent, 0.005, at the bottom, and we get a z* of 2.576. So the more confident you want to be, the wider you're going to have to make your interval.

So here is the final formula to construct a one-sample z-interval for a population proportion. We start with our sample proportion p-hat, then we're going to add and subtract the product of the critical z* and the standard error of the sample proportion. Now, that back part, the z* critical value times the standard error, all multiplied together, is called the margin of error. So our p-hat is always dead center of our interval, then we're going to go up and we're going to go down by a margin of error to create our interval.

Now that we understand the process, let's see a full example of a one-sample z-interval for a population proportion. What proportion of teachers in the United States have student college loan debt? Well, a random sample of 780 teachers revealed that 82% said yes, that they have student loan debt, and let's create a 98% confidence interval for what the true population proportion of all teachers that have student loan debt could be. Now, it is a four-step process. Follow those four steps with me here.

The first step is naming the procedure in context. This is a one-sample z-interval, looking for the proportion of all teachers in the US that have student loan debt.

Step two is checking those conditions. Those conditions are to ensure that the sample is random and independent. So the sample must be random to avoid bias. 780 teachers must be assumed to be less than 10% of all teachers in the US, and I think there are a lot of teachers in the US, so 780 being less than 10% is a safe assumption. And we also need to make sure that our sample has 10 or more successes and 10 or more failures. So if we take the 82% and multiply by 780, we get 640 teachers that said yes, they have student debt, and then we take the 18%, multiply that by 780
teachers, and we get roughly 140 that did not have college debt. So we're safe to move on, because our conditions all check out.

Now, step three is actually the easiest part: it's actually building our interval. We only need a couple of things to build our interval. First, we need our p-hat, 0.82. We need our z* for 98% confidence, 2.326. Now real quick, how'd I get that? Remember, 98% in the middle, 2% is left out, and half of 2% is 1% at the bottom and 1% at the top. So use invNorm on your calculator to look at that bottom 1%, or look up the bottom 1% in your z-table, and you get a z-score of 2.326. Then we have our standard error, which again is the same formula as for standard deviation, but instead of using p we're using p-hat, so we have the square root of 0.82 times 0.18 divided by 780. Now we have to put it all together in our confidence interval formula, which starts off with our p-hat, 0.82. We're going to add and subtract the z* times the standard error. Multiply all that together and we get our margin of error of 0.032. Add that to 0.82, subtract that from 0.82, and we get a confidence interval of 0.788 to 0.852.

Step four is interpreting that interval in context. I am 98% confident that the true proportion of all teachers in the United States that have student loan debt is somewhere between 78.8% and 85.2%.

So just to recap those four steps generically: step one is to name the process, a one-sample z-interval, in context, so make sure you add in exactly what the proportion is you're looking for, which is going to be in the problem. Step two is checking those conditions. Step three is actually building the interval using our formula. And step four is to interpret that interval in context. Don't forget to start off with how confident you are.

So overall, the process of constructing a confidence interval is pretty simple. It's not too complicated. But there are a couple of really popular follow-up questions that could come with understanding confidence intervals, and the first one is: what does C% confidence mean? So in our example we were
98% confident, so a lot of times a follow-up question will be, what the heck does 98% confident even mean? Let's talk about it.

A C% confidence level is referencing the C percent of all sample proportions that are, again, really, really close to or in the center of that sampling distribution. And that's the idea: 98% confident is not a probability, it's not a chance. It's referencing samples, and it's saying, hey listen, there are lots and lots and lots of samples out there just like the one that we found. 98% of all those samples are so close to the truth in the center that when we build our interval around our sample, we'll capture that true proportion p somewhere within our interval. Is it at the bottom? Is it at the top? Is it towards the middle? No one knows, but we just hope that it's somewhere in our interval.

So if we were specifically asked in our problem what 98% confidence means, I'd say something like this: 98% of all possible samples of 780 teachers will result in an interval that contains the population proportion of all teachers that have college loan debt. Again, it's not about probability. It's not about, well, there's a 98% chance that it's in the interval. No, it has absolutely nothing to do with that. 98% is referencing all of the samples: 98% of all possible samples of 780 teachers will create an interval that contains the truth in it.

So 98% are going to contain it; 2% of all samples will not. Well, I'm not going to dilly-dally with the 2% and be a Negative Nancy. I'm going to focus on the fact that 98% of samples create an interval that contains the truth, so why wouldn't mine be one of them? Yes, I know I could be one of the 2% that don't contain it, but I'm not going to worry about that. I'm going to focus on the fact that I have a sample that is one of the 98% in the center that contains the true population proportion in it.

We could also use confidence intervals to determine if there's evidence to justify a claim that's been made about the population
proportion. For example, does our sample provide evidence that less than 84% of teachers nationwide have college loan debt? Now recall, our interval concluded that the proportion of teachers nationwide that have college loan debt is somewhere between 78.8% and 85.2%. Since there is a portion of our interval that is not below 84%, we cannot be certain the truth will be below 84%. It could be, but it could also be higher. So we do not have evidence that less than 84% of teachers nationwide have college loan debt.

But check out this problem: does our sample provide evidence that the proportion of teachers nationwide that have college loan debt is over 70%? Now, once again, our interval was from 78.8% to 85.2%, so our entire interval is over 70%. So yes, we can conclude that we have evidence that more than 70% of teachers nationwide have college loan debt. So the key is that when you create an interval, any number in that interval is possible. No one value is more likely than any other value, so if a proportion is in our interval, it's possible for it to be true.

Another very popular question is asking for a specific sample size that meets a desired margin of error. Researchers would much prefer a small margin of error with high confidence, so that their interval is more accurate. I mean, what good is an interval if you are 99% confident a true population proportion is somewhere between 10% and 80%? That's a huge interval. I mean, yeah, I'm 99% confident, but that is not a very accurate interval. So to achieve a small margin of error with high confidence, we need a bigger sample size. Bigger samples are more accurate because they vary less. It should be common sense that a bigger sample will more closely match the population proportion, hence giving a more accurate answer.

So a question could be: investigators want to find the proportion of all men that take a multivitamin by constructing a 95% confidence interval with a margin of error of only plus or minus 2%. What sample size will they
need? Well, first we have to start off with our margin of error formula. Remember, the margin of error is that back part, what we're adding and subtracting to our p-hat. So the margin of error alone is the z*, that critical z-score, times the standard error. So all we have to do is fill in everything we know. The margin of error that we want to achieve is 0.02. The z* for 95% confidence (again, use invNorm or use your z-table) is 1.96. And then comes the square root of p-hat times 1 minus p-hat, divided by n. Now, n in the denominator is exactly what we are looking for.

Now, the only thing left that we do not know is p-hat. Well, we haven't even taken the sample yet, so how could we possibly know what p-hat is if we don't even know the sample size we're working with yet? So oftentimes in a problem like this, when we don't have any idea what p-hat could be because we haven't even looked at a sample, we could just replace the p-hat with 0.5, or 50%. So that would look like this: we've got the margin of error, 0.02, equaling our z* of 1.96 times the square root of 0.5 times 1 minus 0.5 (which is also 0.5), all divided by n.

Now we have to do some good old-fashioned algebra to solve for n, the sample size that's going to achieve that 2% margin of error. Step one is to divide the 1.96 over; pretty simple, the opposite of multiplication is division. Then we have to get rid of the square root by squaring both sides, so on the left we have 0.02 divided by 1.96, all squared, equal to 0.5 times 0.5 divided by n. Then we're going to multiply the n to the left-hand side, and then we have to solve for that n by dividing by the 0.02 divided by 1.96, squared. Now notice, I did not do any math in the middle of my problem. I'm a big proponent of doing all the math at the very, very end and keeping everything exact in my work. So our final answer is going to be taking 0.5 times 0.5 and dividing it all by 0.02 divided by 1.96, all squared. Type all that into your calculator and you'll get a sample size of 2,401. So if they want to estimate the proportion of men that use a multivitamin and they want
their interval to have a margin of error of only 2%, they need to interview at the very least 2,401 men to get that low margin of error of only 2% and still be 95% confident. A 2% margin of error is pretty small. It's only up two, down two; that's a window of only 4%, and that's a pretty accurate window for what proportion of men could be using a multivitamin. But they are going to require a pretty big sample size of 2,401 men.

Another parameter we could use confidence intervals to estimate is the difference between two population proportions. For example, what is the difference between the proportion of United States teachers that have college loan debt and nurses that have college loan debt? I don't know what the difference is. Well, we could start off by looking at a sample of teachers and a sample of nurses. We already have the sample of 780 teachers that showed 82% have college debt. Now we need a second sample. Let's say we get a sample of 550 nurses that shows that 70% have college debt. Now we're going to try to estimate what the true difference could be. The observed difference is 12%: 82% minus 70% is a 12% difference. But that's just a 12% difference between our samples. That doesn't mean that the true difference in the populations is 12% as well. So let's build a confidence interval to estimate what that true difference could be, and it's going to follow our same four-step method.

Step one is to name the procedure in context. So I will construct a two-sample z-interval for the difference in the proportion of teachers that have college loan debt and the proportion of nurses that have college loan debt.

Then we have to check the conditions. It's a little bit annoying here because we have to check the conditions for both samples, but it's just about making sure that the samples are both random, both samples are less than 10% of their respective populations, and both samples have 10 or more successes and 10 or more failures, which all pass for both samples.

Third, we're going to actually build
the interval. Now, we're going to start off with our information. The first sample, of the teachers, was 780, and it showed 82%. The second sample was the sample of the nurses; the sample size is 550, and 70% showed that they have student loan debt. Now we start off with our observed difference, which is again 12%. Now, you could get positive or you could get negative, all depending on the order you subtract them, so make sure that you know the order, and that actually does matter for the interpretation. But I did the teachers' proportion minus the nurses' proportion to get 0.12.

Now, here is the formula for creating an interval for the difference between two population proportions. On the left we're going to start off with our observed difference; that's the difference between our two sample proportions. Then we're going to add and subtract that margin of error. The entire back is the margin of error. It's a combination of, once again, that critical z-score and the standard error of the difference. Now notice, the standard error of the difference is a little bit different, but if you remember from sampling distributions, it's actually the same formula as the standard deviation of the difference. But because we don't know the true proportion for the nurses or the true proportion for the teachers, we have to use our p-hats here. That's totally okay, but that's why it's going to be called standard error. It's a pretty ugly formula, but not too bad to use. Just make sure you keep your p-hats and your sample sizes n all in the right places.

So here it is, actually substituting all the data in. Our observed difference, that's the 82% minus the 70%, is 0.12, or 12%. We're going to go up and we're going to go down by our margin of error. For 95%, that z*, that z critical value, is 1.96. Then here comes that standard error formula, so it's one giant square root. For the teachers we've got 0.82 times 0.18 (that's 1 minus 0.82), so I've got 0.82 times 0.18 divided by the 780 teachers, plus, now comes the nurses' data, 0.7 times 0.3 (the 0.3 was 1 minus the 0.7), all
divided by the 550 nurses. Now multiply all of that together to get our margin of error of 0.0468. So we're going to take that 0.12, and we're going to add and we're going to subtract the 0.0468 to get a confidence interval of 0.0732 to 0.1668.

Step four is to interpret this interval in context. Now, interpreting an interval for a difference could be a little bit tricky, but keep in mind it's a difference, right? I did teachers minus nurses and I got 12% in favor of the teachers. Well, I don't know if that's "in favor," because that means more teachers have debt, but the teachers being higher, right? Because if I did teachers first, a positive number means the teachers' proportion was bigger. So interpreting this interval has to be done in comparison. So: I'm 95% confident the proportion of teachers in the US that have college loan debt is anywhere from 7.32% higher to 16.68% higher than the proportion of nurses that have college loan debt. Again, my entire interval was positive, and positive in the order I did the subtracting, teachers minus nurses. Positive means the teachers are more. So my sample showed 12% more, but my interval showed that it could be as low as about 7.3% more to as high as 16.68% more than the proportion of nurses that have college loan debt. It's really important that you write that interpretation in context, making comparisons between the two groups.

Now let's play the what-if game for a second, because sometimes you create an interval for a difference and you get one end of the interval to be negative and the other end to be positive. What the heck would that mean? So in our example the entire interval was positive, which showed a higher proportion of teachers than nurses have student loan debt. How much more? I don't know, anywhere from 7-something percent more to 16% more, but more. But imagine if we had this interval. So let's just pretend we did all the work, and this is just a what-if, and our interval came back -0.032 to +0.0548. How would I interpret this interval? Well, I would interpret
it like this: I'm 95% confident that the proportion of teachers that have college loan debt is anywhere from 3.2% lower (that's the negative, right? lower) to 5.48% higher than the proportion of nurses that have college loan debt. Again, the idea is a negative would mean that the nurses were actually higher and the teachers were lower. Remember the order I subtracted: I did teachers minus nurses. So if we get a negative difference, it means the nurses were higher, hence the teachers were 3.2% lower. Or the positive would go back to the teachers being more, so the teachers could be 5.48% higher than the proportion of nurses that have college loan debt.

So again, if your entire interval is positive, it shows that the proportion of teachers that have college loan debt is going to be more than the proportion of nurses. But if you get, like, half negative, half positive, well, then it could go either way. The teachers could be 3.2% lower, or the teachers could be 5.48% higher. It just all depends. That's why you really have to understand the interval and how to interpret it.

Now, we could also answer those questions about making justifications based on some kind of a claim. So, based on the interval that we actually got, the one that was entirely positive, about 7% to 16%: if somebody said, hey, is it true that the proportion of teachers with student loan debt is definitely more than the proportion of nurses with student loan debt? Well, based on that first interval I would say yeah, my interval does provide evidence. My entire interval is positive. I don't know exactly what the difference in proportions is going to be, but it's somewhere between 7% and 16% higher for teachers that have student loan debt. So do I know the exact difference? No, but it does appear that the proportion of teachers that have student loan debt is going to be more than nurses. Whereas the second interval that I just kind of made up, that
was half negative, half positive, well, it's like, listen, when you've got a negative and a positive, you also have a zero in there, and a zero would mean there's no difference between nurses and teachers in what proportion of them have college loan debt. So if the question was, is there a difference, based on the second interval you'd say, well, I don't know. There could be a difference, teachers could be more, teachers could be less, or there might even be no difference, because zero's in my interval. So unfortunately, with that interval I just do not have the evidence to officially say teachers are more or nurses are less or anything like that.

I know that's a lot to unpack there, but be very careful with interpreting a confidence interval for the difference in proportions. It's really important that you understand what your interval is telling you and how you could actually use that interval to help justify some claims made about the true difference between your two populations.

So that's it when it comes to constructing confidence intervals for proportions. There are two of them: we can either do a one-sample z-interval, where we're trying to find one population proportion, or a two-sample z-interval, where we're trying to find the difference between two population proportions.

Now here's the good news: all the formulas needed are on the AP Stats formula sheet. Just let me show you real quick. Here is one of the pages of the formula sheet. It looks like this: it's a big box, all dedicated to proportions. Now, what we see here is actually the information about how to build sampling distributions. The top row is a sampling distribution for single proportions, or one population proportion, and the bottom row there is for two population proportions and looking at the difference. We see the mean, we see the standard deviation, but then on the right we see the standard error. So there are those two standard error formulas that we needed, one at the top for a single and one at the bottom for looking
at the difference between two. But again, notice how the formulas are exactly the same for standard deviation and standard error. The only difference is if we don't know p, we have to replace the p with p-hat, and it's its twin brother; we just have to call it standard error instead of standard deviation.

Now, I know that the actual confidence interval formulas are not on here, but they do give you a very generic confidence interval formula right here. A confidence interval in very generic form is a statistic, like p-hat, plus or minus the margin of error, which is a combination of the critical z* times the standard error of the statistic. So I know it's not full and complete, but again, they're trying to give you a generic formula that could be used in any situation. So in our situations, we start off with a sample statistic, our sample proportion p-hat, and we go up and down by that critical z* times the standard error. And then we could also do an interval for the difference, where our statistic is the difference between two sample proportions, plus or minus that same critical z* value and then the standard error for the difference. So again, this is a very generic formula for a confidence interval, but then you could go to the chart for proportions to get the standard error formulas that you need. To get the critical value, you've got to go to your calculator or z-table to actually calculate those critical z-stars. And then your sample statistic is, well, it's your sample statistic; it's what was given to you from your data.

All right, that's it for part one of Unit 6, over confidence intervals for population proportions. In the next video, which I know you're going to watch, part two, we talk about significance tests for population proportions. There is a lot to unpack in this unit. I know there's a lot; there are a lot of different types of problems, and in these videos I'm just trying to give you a quick overview of it all, because I really hope you learned the majority of the information from your teachers in class;
you're just coming to me for review. So I can't wait to see you in the next video, where we're going to talk about significance tests.
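
The worked examples in this video can be re-checked with a short script. What follows is only a minimal Python sketch and is not something shown in the video: the function names (z_star, one_prop_interval, two_prop_interval, sample_size) are made up for illustration, and the standard library's NormalDist().inv_cdf stands in for invNorm or a z-table. It assumes the random, 10%, and 10-successes/10-failures conditions have already been checked by hand.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_star(confidence):
    """Critical z-score for a confidence level given as a decimal, e.g. 0.95."""
    alpha = 1 - confidence  # proportion left outside the middle
    # Half of alpha sits in each tail, so look up the z-score with
    # area 1 - alpha/2 below it (the positive critical value).
    return NormalDist().inv_cdf(1 - alpha / 2)


def one_prop_interval(p_hat, n, confidence):
    """One-sample z-interval: p_hat +/- z* times the standard error."""
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    margin = z_star(confidence) * standard_error
    return p_hat - margin, p_hat + margin


def two_prop_interval(p1, n1, p2, n2, confidence):
    """Two-sample z-interval for p1 - p2 (the subtraction order matters)."""
    standard_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z_star(confidence) * standard_error
    difference = p1 - p2
    return difference - margin, difference + margin


def sample_size(margin, confidence, p_guess=0.5):
    """Smallest n whose margin of error is at most `margin`.

    p_guess defaults to 0.5, the value used when no sample exists yet.
    """
    n = p_guess * (1 - p_guess) / (margin / z_star(confidence)) ** 2
    return ceil(n)  # always round a sample size up


if __name__ == "__main__":
    # Teachers example: p-hat = 0.82, n = 780, 98% confidence
    print(one_prop_interval(0.82, 780, 0.98))
    # Multivitamin example: 2% margin of error, 95% confidence
    print(sample_size(0.02, 0.95))
    # Teachers minus nurses: 0.82 of 780 vs 0.70 of 550, 95% confidence
    print(two_prop_interval(0.82, 780, 0.70, 550, 0.95))
```

Run as written, the results should land on roughly the same numbers as the worked examples: about 0.788 to 0.852 for the teachers, 2,401 for the multivitamin sample size, and about 0.0732 to 0.1668 for teachers minus nurses. The script carries z* at full precision instead of rounding to 1.96 or 2.326, so the last digits can differ slightly from hand calculations.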