Transcript for:
Chi-Squared Test Lecture

Welcome to the video for the chi-squared test. In this chapter, what we're doing is testing two or more proportions. Now think about this with me for a second: a proportion is really a calculation based on a categorical variable. Remember, categorical means what category something fits into. So if you're talking about animals: dog, cat, horse, banana (not an animal, but you know what I mean). You have categories, and then you could be testing different levels within each category.
For example, suppose you wanted to test college students' caf preferences, and maybe you boiled it down to the pizza or the salad bar. Maybe you decided, I'm going to take Point Loma freshmen and Point Loma seniors, and I'm going to see what proportion of them like the pizza over the salad bar, and I'm going to break this up by freshmen versus seniors. That's two categorical variables: type of student (freshman versus senior) and food preference (pizza versus salad). So it's two categorical variables with two levels each.
In this chapter we're going to be talking about those types of scenarios. We have two or more proportions, and we have these possible levels, and that'll become clearer in a little bit; in this video I have a couple of examples for you. The question is: we want to test the claim that it doesn't make a difference, in other words that the proportions are all the same, and we're going to test that against the alternate claim that there is some relationship. That's what the chi-squared test is going to do for us, and that's what we're going to be looking at in this chapter.
So let's do a little bit of setup first. We're going to have to calculate what's known as the chi-squared statistic, and in order to do that we need something called expected counts. So let's lay our foundation for the
moment. When we talk about expected counts, we're back to having our tables, except now we can have more than just a two-by-two table (well, we actually did before, too). This right here, what I'm looking at, is a two-by-two table. I call it a two-by-two table because that is the number of rows, two rows, by the number of columns, two columns, and we always go row, then column. Look at this next one: this is a two-by-three. There are two rows and three columns. And if you look at the next one, oops, my label is actually wrong, because it looks like we've got four rows and two columns. Let me change that label really quick. This is a four-by-two: four rows, two columns.
The reason I'm stressing this is that we're going to be working with these tables, and the language that we use refers to the position you are at in the table. Each little entry in the table we call a cell. For every cell we're going to calculate something called the expected count: you take the row total, multiply it by the column total, and divide by the table total. For example, if we were up here in this cell that I have highlighted, the expected count that would go into this cell would be the row total (adding up the entire row it sits in) times the column total (adding up the entire column it sits in), divided by the table total, which is the total count over all the cells. We're going to be working with this calculation throughout this chapter.
We'll do an example of this in just a second, but I do want to stress that we're doing hypothesis testing, and when we compute expected counts we're assuming that H-naught is true. Again, the null hypothesis is that there is no relationship, that all the proportions should be the same, and that's what we're calculating here in the expected count. The expected count for each cell
is what we would expect to have if there were no relationship between the variables represented in this table, that is, if the null hypothesis is true.
Let's take a look at an example. We're going to compare two animals, i.e., cats and dogs, and their leisure activity preference. "Toy" stands for playtime (they like to play with a toy), and "nap" stands for, well, kind of self-explanatory, nap. Imagine that we did a simple random sample, and we surveyed and found out how many cats and how many dogs we were working with. Notice with me there's a total of 16 cats in this study and a total of 17 dogs. These are the row totals: the total number of cats and the total number of dogs. Out of those animals, 15 preferred playtime as their leisure activity and 18 preferred nap time. Don't ask me how we measure playtime preference versus nap-time preference; we'll just assume we had some way to measure this.
Then, just like what we saw in a previous chapter with our tables, if you add up the column totals, that 15 and that 18, you get 33, and that's the same as if we added up the row totals: 16 and 17 gives me 33.
So the table total here is 33. This is what our sample, our observations, gave us; this is our data. Here the null hypothesis would be that it doesn't matter: cats and dogs both like napping and playing exactly the same. In other words, there's no relationship between the animal type and the leisure activity type. If that were true, then what we would expect to see is what's known as our expected counts.
I've already gone ahead and filled in some of this, so let me show it to you. This cell right here is the expected count, the number of dogs I would expect to find who prefer play over nap in our study if there's no relationship. So what are we going to do here? This 7.7 was found by taking the row total (this is in the dog row, so that's 17) times the column total (it's in the toy column, so that's 15) divided by the table total, 33. Do me a favor: if you need to, stop the video, pick up a calculator, and make sure that you know how this was calculated and that you can do it.
Next door I did the same thing. If there's no preference for leisure activity between the cats and dogs, then based on my sampling I would expect to find 9.2 and some change dogs preferring napping. Again, this is the expected count: the row total, which is the dog row total, 17, times the column total for nap, which is 18, divided by the table total, which is 33.
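For anyone following along without a calculator, here is a minimal Python sketch of the two dog-row calculations just described. The function name `expected_count` and the variable names are illustrative choices, not something from the lecture; the totals are the ones read out above.

```python
def expected_count(row_total, col_total, table_total):
    """Expected cell count if H0 (no relationship) is true:
    row total times column total, divided by table total."""
    return row_total * col_total / table_total


# Totals from the cats-and-dogs table in the lecture.
dog_total = 17
toy_total = 15
nap_total = 18
table_total = 33

dog_toy = expected_count(dog_total, toy_total, table_total)
dog_nap = expected_count(dog_total, nap_total, table_total)

print(round(dog_toy, 2), round(dog_nap, 2))  # 7.73 9.27
```

Calling the same function with the cat row total (16) gives the two remaining cells.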
Again, I recommend you stop the video and see if you can find the expected counts for cats/toy and cats/nap, and then start the video back up again, because I'll show you the correct answers.
Okay, hopefully you did that, and hopefully you got the same thing I did. We have the expected count of cats preferring toys, or playtime, seven and some change, and cats preferring nap time, eight and some change. Now I want to show you something really quick. If I find the row total for cats, notice with me I get the same row total as in the observed table. That is by design, because remember, this expected count is what we would expect if there is no preference, no relationship, based on our sample. We had 16 cats in our sample, so this is what we would expect out of the 16 cats. Again, I'm going to find the dog total, and you'll notice if I add these two numbers across the row I get 17; I had 17 dogs. Likewise we should see the same thing with toy preference and nap preference: we have 15 and 18, and hence our table total is the same.
So again, this table down below is the numbers we would expect to see if there's no relationship between cats and dogs and preference for naps and toys. The table at the top, the observed table, is what our data is telling us. So then you might ask, as we do in hypothesis testing: does our sample give strong enough evidence to reject the null, the null being that there is no preference? That's what we're testing here.
Let's continue on. I want to show you how we're going to carry out this test, so let's hop on over and look at some formulas (actually, a formula). Let me scoot my screen here. The question is: is there a relationship between the categorical variables? Remember, cats and dogs, toys and naps: those are categorical variables, so we denote this C, C, right:
relationship between the categorical variables. Again, our null is that there's no relationship between the variables. We call this independent, so they're independent. And the alternate hypothesis is that, well, that's incorrect (that should be an "a", sorry about that). The alternate hypothesis is that there is a relationship between the variables, so we would say that they're dependent.
In order to test a hypothesis, we've seen lots of examples: we've used a z test statistic and a t test statistic (those things are hard to say fast). I'm going to introduce you to our next test statistic, which is what's known as chi-squared. This is the Greek letter chi, and this is the test statistic we're going to use to determine whether our sample gives us evidence against the null hypothesis. It's actually not that hard to calculate. In cases like the example we have, we can easily do it by hand, and anything more involved we can easily do with a computer.
Here's what we do. We take the observed count for every cell, subtract off the expected count, square it, and then divide by the expected count. We do this for every cell in our table, and then we add all these values together, and that is the chi-squared test statistic. Again: for each cell, the observed count minus the expected count, squared, divided by the expected count, and then the total of that quantity over every cell.
Let's take a look at a different example: Coke versus Pepsi. This is a very famous example. This data I actually got from Dr. Crowe; he's been running this experiment in his statistics class for years, I
think five or six years, so this is his data. Here's the scenario. The question is basically: can people tell the difference between Coke and Pepsi, and does it match their preference?
The experiment went as follows. Each person had to write down whether they had a preference between Coke and Pepsi, and if they didn't have a preference, they were asked to flip a coin: heads said you like Coke and tails said you like Pepsi. That was to ensure that we have a simple random sample here. So before any tasting and testing was going on, everybody was asked what their preference was (well, we didn't even ask; somebody else asked everybody), and that was recorded. Then there was a blind taste test: each individual was given Coke and Pepsi and asked to say which one they liked, and later on it was figured out whether or not they correctly identified their preference.
Let's take a look at this. Here's the observed data, and I have the totals; let me highlight the totals for you. Let's get all these numbers in. For reading our table, let's read column-wise. The Coke column is how many people said they preferred Coke, and the Pepsi column is how many people preferred Pepsi. The row that starts with "own" is how many people correctly identified their preference. If you're in this cell, that means you preferred Coke and you actually identified it; if you're in this cell, that means you preferred Pepsi and you actually identified it, after tasting them, that yes, you do like Pepsi. This second row, the "oops" row, is the number of people who got it wrong. For example, this cell in the oops row, Coke column: those are the people who said they prefer Coke, but when it came down to testing it, they accidentally got it
wrong. And this cell right here, notice it's in the Pepsi column, oops row: these are the people who said they prefer Pepsi, but when it came down to tasting and saying which they prefer, they actually got it wrong in the testing.
Let's look at what we have here. Our Coke total was 270, so 270 individuals preferred Coke. Our Pepsi total was 242. Total number of individuals in this study: 512. Let's see how many people guessed their beverage correctly: 312, regardless of which one it is. And how many guessed incorrectly: 200. So this is the observed data.
The reason we're going over this is one of the conditions that you need. I don't know if you remember, but in previous chapters we always had these three requirements that we needed when we did our testing. The first requirement when you're doing a chi-squared test is that you have a simple random sample with independent populations, or independent groups if you want to think of it that way. With this particular experiment we actually did do that: there was blind tasting, and we made sure that there was a random selection. So that's the first requirement for the chi-squared test: you have an SRS where the groups are independent.
Let's talk about our expected counts. Remember, the expected count for each cell is the row total times the column total divided by the table total. If we're sitting in the cell for own and Coke, we want to find the expected count there, so we take the own row total, 312, multiply it by the Coke column total, which is 270, and divide by 512.
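The same row-total-times-column-total rule can be applied to every cell at once. The sketch below is one possible way to do that in Python, using only the Coke/Pepsi totals read out above; the dictionary layout and the names `row_totals`, `col_totals`, and `expected` are assumptions made for illustration.

```python
# Totals from the Coke/Pepsi table in the lecture.
row_totals = {"own": 312, "oops": 200}
col_totals = {"coke": 270, "pepsi": 242}
table_total = sum(row_totals.values())  # 512

# Expected count for each (row, column) cell, assuming H0 is true.
expected = {
    (row, col): row_total * col_total / table_total
    for row, row_total in row_totals.items()
    for col, col_total in col_totals.items()
}

for cell, value in expected.items():
    print(cell, round(value, 2))

# By design, the expected counts add back up to the observed totals.
own_sum = expected[("own", "coke")] + expected[("own", "pepsi")]
coke_sum = expected[("own", "coke")] + expected[("oops", "coke")]
print(own_sum, coke_sum)
```

The last two sums are a quick sanity check: if a row or column of expected counts does not reproduce its observed total, one of the cells was computed incorrectly.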
And if we do this, we should get 164 point something. Do me a favor: stop the video and double-check that you're getting the correct number.
Let's go over to the next column (excuse me, let's go diagonal here) and look at the cell that sits in the oops row, Pepsi column. We take the oops row total, 200, multiply it by the Pepsi column total, 242, and divide by the table total, and if we do this we should get 94 and some change. I'm going to go ahead and fill in the rest of the table. Again, I recommend you stop the video and make sure that you're getting the right numbers. Let's find our totals, just to make sure everything's adding up right.
All right, so here are our expected counts. Again, this is what we would expect if the null hypothesis were true; in other words, your beverage preference has no relationship with your ability to pick out your preference in the Coke/Pepsi tasting.
Let's talk about the other two requirements that are needed for the chi-squared test. Second requirement: your expected counts, all of these cells here, have to be at least one. Well, we're clearly good on this, because the smallest number we see is 94, but in general your expected counts all have to be at least one. Third requirement: at least 80 percent of our expected counts have to be five or greater. On this particular one we're totally good; again, 94 is the smallest number. So, the three requirements you need to be able to successfully use chi-squared. Number one, you have an SRS with independent groups. Number two, your expected counts are all at least one. Number
three, at least 80 percent of your expected counts are five or greater.
Let's continue on. We're going to actually calculate the chi-squared test statistic. I'm going to move things up here. The next thing we need to do is take observed minus expected. For example, for the own/Coke cell, we have 152 minus 164 and some change, and when we do that we get negative 12 point something. Let's do that for the next one, the oops/Coke cell, and notice we get a very similar number. Let's take a look at Pepsi, and there we go. We've got something going on here: you'll notice all the numbers are the same except for plus or minus. That's not surprising, actually. We'll talk a little bit about that in just a second, but let's keep going.
Now we need to square these. I'm going to take each of the numbers in the cells and square them. If I square them, we get 157 point something, and since all these numbers are the same in size, we know when we square them we're going to get the same thing.
Now, finally, we're going to calculate observed minus expected, squared, divided by the expected. I apologize, my labels went bye-bye; this should be Coke, so let me type that in here. Okay, there we go. If I'm looking at the own/Coke cell, we take the 157 and some change and divide it by the expected count, which was 164 and some change, and we should get .954. That sounds about right, because this is 157 divided by 164 and some change: a little less than one. Okay, the next one: this is the
oops/Coke cell, so I'm going to take the observed minus expected squared, which is 157 and some change, and divide it by 105 and some change. There's that one. We do the same thing with Pepsi (oops, sorry about that), there we go, and let's do the next one too. Now I'm going to add them all up. This number here, this 5.17, is what's known as the chi-squared test statistic for this example.
For the chi-squared test statistic, I do have a table in the back of the book that I could use to look this up, but I'm going to use software instead. That's why I have Excel for this video; I want you to see how we can use Excel.
Now I need to talk a little bit about degrees of freedom. Chi-squared has degrees of freedom that we have to figure out, and it is just one number: the number of rows minus one, times the number of columns minus one. Since in this example we have two rows and two columns, you can see that our degrees of freedom are going to be one.
What I have here is what Excel says is our p-value, given that 5.17 is our chi-squared test statistic. Let's talk about this p-value really quick, and then I'll show you what formula was used to find it. Let's use alpha equal to 0.05. Notice with me that if alpha is .05, our p-value is below that, so our sample gives enough evidence to reject the null hypothesis. In other words, there is a relationship between people's preference, Coke or Pepsi, and their ability to identify their preference in a blind tasting. And it turns out this is actually true. I'm going to scoot this up. It turns out Pepsi folks can usually pick out their Pepsi a lot better than Coke folks, and there are various reasons for that,
but that's probably for another video. Here's the summary again: our sample does give enough evidence, significant evidence at the .05 level, to reject the null hypothesis.
I wanted to show you how this was found using Excel, so let me double-click on this and you can see the formula. This is the chi distribution function. The first entry here, E25, is my number, which is unfortunately being covered by this formula, but that's right here: that's that 5.17. Then you type a comma and then the number of degrees of freedom, which in this example is 1, and this gives me the p-value that I'm looking for. So, a really nice formula: again, the chi distribution function found in Excel; you give it your number and the degrees of freedom, and you get your p-value.
Thanks for listening, and congratulations: this is our last video on hypothesis testing. Have a good day, you guys.
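For readers without Excel, the whole Coke/Pepsi calculation can be sketched in plain Python. This is a rough sketch under stated assumptions, not the spreadsheet used in the video: only the own/Coke count (152) and the totals are read out in the lecture, so the other three observed cells below are derived from those totals by subtraction. The p-value line uses a shortcut that holds only when the degrees of freedom equal 1 (the right-tail area equals `erfc(sqrt(x / 2))`); all variable names are illustrative.

```python
import math

# Observed counts. Only 152 (own/Coke) and the totals are stated in the
# lecture; the other three cells are derived from the totals by subtraction.
observed = [
    [152, 160],  # own row:  Coke, Pepsi  (row total 312)
    [118, 82],   # oops row: Coke, Pepsi  (row total 200)
]

row_totals = [sum(row) for row in observed]        # [312, 200]
col_totals = [sum(col) for col in zip(*observed)]  # [270, 242]
table_total = sum(row_totals)                      # 512

# Expected counts under H0: row total * column total / table total.
expected = [[r * c / table_total for c in col_totals] for r in row_totals]

# Requirements two and three from the lecture, checked on expected counts.
flat_expected = [e for row in expected for e in row]
all_at_least_one = all(e >= 1 for e in flat_expected)
share_five_or_more = sum(e >= 5 for e in flat_expected) / len(flat_expected)

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Right-tail p-value. This erfc shortcut is valid ONLY for df == 1.
p_value = math.erfc(math.sqrt(chi_squared / 2))

print(round(chi_squared, 2), df, round(p_value, 3))  # 5.17 1 0.023
print(all_at_least_one, share_five_or_more >= 0.8)   # True True
```

The p-value comes out below 0.05, which agrees with the decision to reject the null hypothesis. For tables with more than one degree of freedom, a statistics library is the safer route; SciPy's `scipy.stats.chi2_contingency`, for example, applies a continuity correction to 2-by-2 tables by default, so `correction=False` would be needed to match the hand calculation.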