Transcript for:
Categorical Association (Chi-Square) Test

Hi everyone, this is matt2show with Intro Stats, and today we're introducing the categorical association hypothesis test. The categorical association test is another one of those famous relationship tests; this is the classic categorical relationship test. So if you're trying to see if two categorical data sets indicate a relationship between the variables, this is the test. We've been talking about how, through all these relationship tests, we want to know the null and alternative hypothesis, what type of data we need for this test, what the assumptions are for this test, and what test statistic we're going to be using.

So let's talk a little bit about the null and alternative hypothesis for the categorical association test. The null hypothesis would be that the categorical variables are not related, and the alternative would be that the categorical variables are related. Sometimes you'll see this as not associated for the null or associated for the alternative, or you might see it as independent for the null hypothesis versus dependent for the alternative hypothesis. So there are different ways you'll see this written in different stat books, but the idea is very important: how are we gonna show that the categorical variables are not related versus related? Well, it goes back to what we were doing with goodness of fit. If you remember, in the goodness of fit we were trying to show that if the percentages were equal, then it would not matter what group I'm in, and that would indicate the grouping is not really related to the percentage we're dealing with. Now with the categorical association test we're dealing with multiple percentages in multiple groups, but the idea is still the same. So think about it this way: ask yourself the question, does it matter what group I'm in? Right, does it matter what group you're in? So "the categorical variables are not related"
would indicate that all the groups have about the same percentage for that variable. Sometimes you'll hear stat books say the distributions of conditional percentages are the same, or equal. And if the categorical variables are related, we expect that the distributions of conditional percentages would be different. We'll flesh that out a little bit, what we mean by that, but again it goes back to this: the more closely the percentages align in the various groups, the more that tells me the grouping doesn't matter, but if the percentages are very different, then maybe it does matter what group you're in.

All right, so what kind of data do we need? Well, we really need two categorical data sets. You can collect them in two ways, but basically you're gonna get two bits of categorical information from people. So we might be asking them "do you have a tattoo, yes or no?" and "what type of social media do you prefer: Twitter, Snapchat, Facebook, Instagram?" If we were asking one random sample of people both of those questions, that's one way we could get the data. Another way might be to collect the data from multiple random samples, so I might just ask them one categorical question. I might go to a random sample of people with tattoos and ask them "what's your favorite social media?", and I know all these people have tattoos. Then I might go to another random sample of people that do not have tattoos, and I know these people don't have tattoos, and I would ask them what's their favorite social media. At the end of the day I'm still going to get the same data: a categorical data set about tattoos and one about social media. So whether it came from one random sample or from multiple random samples, you get basically the same test. I kind of refer to both
of these as just a categorical association test; the data is going to look a lot alike. Now there are slight differences. If you have one random sample, if I just asked two categorical questions from one random sample of people, people in the stat world sometimes refer to that as an independence test. That's just the name; like I say, I kind of call these things all categorical association tests, but people like to make distinctions. If you have multiple random samples, in other words I got different random samples for my groups and then I just asked one categorical question in each of my groups, sometimes they refer to that in stat books as a homogeneity test. Again, these are just names that you might hear in stat books; they'll say something about independence or homogeneity. Like I say, it's really the same idea; it's just a matter of semantics about how the data was collected. There are also slight differences in the way we would check our assumptions, depending on if we collected the data from one random sample or multiple random samples.

Now, one of the big keys is that you can summarize your data, your counts, in a contingency table, or a two-way table as some people refer to it. A contingency table is a way of summarizing your counts, so it kind of looks like this. This is a contingency table right here; this one has to do with music and trying to memorize information. This is what a table looks like, and if your data looks like this, you know you're doing the categorical association test. A lot of people ask me, "well, how do I know if it's going to be a goodness of fit test or a categorical association test?" Well, if your data looks like that, if it's a contingency table, you know you're doing the categorical association test. If I was only looking at just a few observed counts from one particular variable, then that's going to be the goodness of fit. Like if I just looked at high retention, just these three numbers, 10, 11, and 18, then
that would be a goodness of fit, but if I have more data dealing with different retention levels, then that's going to have to graduate to the categorical association test.

Okay, so how are we going to do this? Well, a couple of things: our assumptions and our test statistic are actually going to be the same as what we looked at in the goodness of fit test. We're going to be using the chi-square test statistic again, same as the goodness of fit: for each cell, the observed count minus the expected count, squared, divided by the expected count, and then summed over all the cells. The degrees of freedom formula will be a little bit different. In a goodness of fit we were doing k minus 1, where k was the number of groups, but now you have rows and columns because you have a table, so we use the formula number of rows minus one, times the number of columns minus one. That's the formula for degrees of freedom for a categorical association test. Again, a lot of the stuff we're talking about today can all be calculated very quickly with a computer. This is more about understanding the ideas of what the computer is doing, so we can explain it to people and have a better understanding of it.

Just like we did with goodness of fit, we have the assumptions. We said we wanted a random sample or samples, depending on if you have one random sample or multiple random samples. So it could just be one random sample that you asked two categorical questions, or it might be multiple random samples from different groups that you're asking one categorical question. And that's where the assumptions can change. If you collected the data from one random sample, then really you just need the individuals in that one random sample to be independent of each other, so usually it would just say "individuals within the sample are independent." Now if I collected the data from multiple random samples, again sometimes referred to as a
homogeneity test, then I would want the individuals within the samples and between the samples to be independent. So there's a slight variation in terms of the assumptions, but we're still looking for everybody in the data to basically be independent of each other and not have people that are related in some way. And to make sure the data set is big enough, we check that the expected counts are at least five. That's the same as the goodness of fit: we need all the expected counts to be five or greater. So when you get a printout from the computer program, make sure you look for the expected counts. The computer will calculate the expected counts for you, and you want to check that they're all at least five. If any of your expected counts drop below five, a lot of times the computer program will give you an error message; it'll say "hey, your data set's not big enough to handle this test, some of your expected counts are too low."

Just remember that the idea of the chi-square is observed versus expected. We talked about this with the goodness of fit test. The observed counts are what really happened in the sample data, so these numbers right here in the table, not the totals but just the numbers in the table, those are what we call the observed counts. But then we need to compare the observed counts to what we expect to happen if the null hypothesis was true, and that's where this gets really tricky. How do I figure out what I expect to happen if the categorical variables are not related? That's where we're going to dig into this example and see if we can understand that a little bit: how do we get the expected counts? So when you hear the word "observed," think sample data; for expected counts, think null hypothesis, theoretical counts based on the null hypothesis.

All right, so let's take a look at an
example here. This is actually one of the examples I pulled from my book, and we're basically looking at an experiment that was done to try to show that listening to music is or is not related to memorizing information. A lot of college students think that if they listen to music they may be better at memorizing information, so basically they did a little experiment on this. This is actually an experiment that's been done quite a few times. The interesting part of this experiment was that they allowed the students to listen to their favorite music. So one sample of students got to listen to their favorite music while they tried to memorize information. The other had to listen to music that they hated. Now, they all had to listen to the same hated music, so whatever that hated music was, they all had to listen to the music they absolutely hated. And then one sample of people got to listen to no music. So this was three different samples that had to actually memorize information and see how they did. Again, if you're talking about whether it was independence or homogeneity, this would be the homogeneity variety, because we had multiple random samples; we had three different random samples of people. Okay, so let's