Exploring Bivariate Analysis and Hypothesis Testing

Hello everybody, this is Dr. Alvarez, and welcome to another lecture in sociology through three statistics. Today we're going to be talking about bivariate analysis and writing hypotheses. This begins our next section of the course, where we'll really be focusing on bivariate analysis.

Before we do that, let's stop for a second and consider where we are right now. In unit 1 we reviewed research methods: induction, deduction, different methods, and conceptualization. That all gave us our broad research methods overview. In unit 2 we learned about sampling, in other words, how to collect data using good sampling techniques. We learned about the difference between probability and nonprobability samples, and why probability samples are so important for generating representative samples that allow us to infer something from our sample to our population. When we use good probability samples, it minimizes the chances for bias. That's what we did in unit 2. Unit 3 was a quick introduction to using SPSS. And then in unit 4 we did what you might call univariate descriptive statistics. I didn't really use this term very much, but it's another way to describe it: how to describe the variables that are in your sample, in other words, how respondents in your sample answered any particular question. I gave you a bunch of tools for this: frequency tables, measures of central tendency, and measures of dispersion.

That's all well and good. These are really fundamental portions of doing research, and this is a lot of the preparatory work that you will get into before starting a big research project. But normally we're not interested in just a single variable on its own. We're normally interested in bivariate relationships. What does that mean? It means that we're normally interested in the relationships between variables, so not just the distribution of any one single variable
but rather one variable's relationship to another. For instance, earlier in one of the lectures we talked about human capital theory, which posits a relationship between education and income. So what we are going to be focusing on in this section of the course is the relationship between two variables. We normally call the X variable the independent variable, and the variable that X affects, the Y variable, the dependent variable. Y will be dependent upon X. This next section of the course will focus on different types of bivariate relationships.

Let's actually be a little more specific than that. Where are we going? We're moving from looking at a single variable to looking at two variables in a relationship. How do two variables impact each other? Or, more specifically, not how they impact each other, but how does the independent variable affect the dependent variable? And this is an important point for you to write down: what we're really doing is looking to see if the two variables we are interested in covary in some meaningful way. We're trying to establish that there is in fact a correlation between two variables.

We have lots of different tools for that, and we're going to discuss those today. There is a thing called the correlation coefficient that is very useful for doing this, but that comes later in this course. There are other ways of determining whether two variables are related, and that's what we're going to be dealing with first. Just so you're clear about this, they're called crosstabs and comparisons of means.

We're going to be looking at the relationship between two variables in a particular sample. The sample we're looking at, of course, is the GSS 2014, but whenever we're doing a research project, we're using variables from some specific sample that is representative of some population. The GSS
2014 is a sample of people who live in the United States, and so the population we want to infer to is all people who live in the United States. Notice I did not say American citizens; I said all the people who live in the United States. That is the population that the GSS speaks to.

Whenever we are doing an analysis, early on in our research we are only looking at it within the context of our sample. For instance, in the GSS 2014 there are about 3,800 people. We're trying to see if, amongst these 3,800 people, there's a relationship between two variables we care about. Is there a relationship between education and, say, income? Is there a relationship between age and vocabulary score? Whatever it is, we're asking within that particular sample. If we had a different sample, the answer might be different.

The next thing that we're going to do is learn how to write hypotheses about the relationship between variables. What does this mean? I encourage you to write this down. When we started this class, we said that the types of methods used in this course, particularly survey research methods and quantitative analysis, are deductive methods, meaning that we typically start off with some theory or hypothesis and then evaluate data in order to see if there's evidence for that hypothesis. So in this class we will normally approach relationships between variables by thinking about whether or not they support specific hypotheses. One of the things that you have to learn as budding statisticians and budding researchers is how to write an actual hypothesis so that it can be tested.

So we're going to learn about writing hypotheses, but more than that, we're also going to learn about testing hypotheses. We generally test hypotheses in two steps, and we're going to talk about that, but it is grounded in this idea of testing for statistical significance. Some of you all
probably heard this term before, statistical significance, which indicates that there's a real relationship between two variables. We evaluate and test hypotheses in order to see whether the relationship we see in our sample is one we would expect to find in the population, and on top of that, whether or not the relationship that we've identified confirms our hypothesis. So there are two steps to this process. Right now you're like, "I don't know what the hell you're talking about right now, Dr. Alvarez." That's okay. We're going to get there, I promise you.

In this lecture specifically, I'm just going to lay out for you the different types of bivariate analysis we're going to see and the different types of tests of statistical significance you're going to see. We're going to learn about writing hypotheses, and then we're going to learn about testing hypotheses. In the coming lectures I'm going to show you how to put together the different types of bivariate analysis with their corresponding tests of significance, and give you the tools to evaluate hypotheses when presented with them. So this lecture is a general lecture that starts to lay the groundwork, and in the following lectures you'll get more detail about how to actually do it.

So let's review. Today, three things: introduce bivariate analysis, introduce writing hypotheses, and discuss testing hypotheses. That's what we're going to do. Okay, ready? Take a deep breath. Let's go.

All right, so this is a general matrix that I put together that illustrates the different types of bivariate analysis that you should do depending upon the type of variable, or rather the level of measurement of the variables, that you are trying to analyze. Let's step back from this for a second. I told you that levels of measurement were a very important part of what we do in this class, right?
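The matrix on the slide is not visible in a transcript, so here is a minimal sketch of it in Python, pieced together from the rules the lecture states out loud. The function name `choose_analysis` and the string labels are illustrative assumptions, not part of the course materials (the course itself works in SPSS).

```python
# Levels of measurement discussed in the lecture.
NOMINAL = "nominal"
ORDINAL = "ordinal"
INTERVAL_RATIO = "interval/ratio"
LEVELS = (NOMINAL, ORDINAL, INTERVAL_RATIO)


def choose_analysis(dependent, independent):
    """Return (bivariate analysis, test of significance) for two variables.

    Both arguments are levels of measurement. The pairing follows the
    lecture's matrix; the names used here are hypothetical.
    """
    for level in (dependent, independent):
        if level not in LEVELS:
            raise ValueError(f"unknown level of measurement: {level!r}")

    # Nominal or ordinal dependent variable: crosstab, whatever the
    # level of measurement of the independent variable.
    if dependent in (NOMINAL, ORDINAL):
        return ("crosstab", "chi-square")

    # Interval/ratio dependent variable with a nominal or ordinal
    # independent variable: comparison of means.
    if independent in (NOMINAL, ORDINAL):
        return ("comparison of means", "t-test or ANOVA")

    # Both variables interval/ratio: correlation and regression.
    return ("correlation and regression", "F test and t-test")


if __name__ == "__main__":
    # Example: cigarettes smoked (interval/ratio) by sex (nominal).
    print(choose_analysis(INTERVAL_RATIO, NOMINAL))
```

Note that the dependent variable is checked first: that ordering is what makes the independent variable's level irrelevant whenever the dependent variable is nominal or ordinal.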
Well, one of the reasons is that you need to be able to determine what the levels of measurement are for the variables that you're interested in so that you can run the correct form of bivariate analysis. If you get those levels of measurement incorrect, you're going to choose the wrong type of bivariate analysis to do. Is that clear? That's one of the reasons why level of measurement is so important.

Here I've shown you that there are essentially four types of bivariate analysis we're going to focus on in this class. The first is crosstabs, the second is comparison of means, the third is correlation, and the fourth is regression. Those are the four bivariate analyses that we're going to focus on in this class.

So when would you do each? If you have a nominal or ordinal dependent variable, regardless of the level of measurement of your independent variable, you're going to run a crosstab. Does that make sense to everybody? If your dependent variable, your Y variable, is nominal or ordinal, it doesn't make a difference what level of measurement your independent variable is: you're going to run a crosstab. Right now you don't know what a crosstab is, and that's perfectly fine. I am giving you tools so that as we get more complex through these additional lectures, you have a sense of what these things are, what they mean, and how they fit together.

Your job now is to know the things that you should write down. The first thing: the type of bivariate analysis I run for any analysis will depend upon the level of measurement of the variables that I am looking at. The second thing: if my dependent variable is nominal or ordinal, it does not matter what level of measurement my independent variable is; I will need to run a crosstab. Those are the first two things
you need to write down. Boom, right there. You're welcome. So let's keep going.

If you have an interval/ratio dependent variable and your independent variable is nominal or ordinal, you would run a comparison of means. Again, you don't know what that means yet, and that's okay. You just want to hear the term so that when I get there, you have an idea of what it means. This is the third point that you would write down: if you have an interval/ratio dependent variable and your independent variable is nominal or ordinal, you will run a comparison of means.

If your dependent variable is interval/ratio and your independent variable is also interval/ratio, you should use correlation and regression. We're going to save correlation and regression until near the end of this class because it's the most complex analysis that we run. We're going to focus right now on crosstabs and comparison of means.

Now, these bivariate analyses look at the relationship between two variables in my specific sample, the GSS or whatever it is I'm looking at. There is also what is called a test of significance for each of these bivariate analyses, and you should write that down: there is a test of significance for each of these bivariate analyses. What do those tests of significance tell you? They tell you whether or not the relationship between the two variables you see in your sample will be found in the population. In other words, if you have a statistically significant finding, for instance if we found that there's a statistically significant relationship between education and income in the GSS, that would mean that we would expect to find that relationship between education and income in the population overall. That's what a test of statistical significance tells
you: the thing you see in your sample, will you see it in the population?

For those of you who have heard of statistical significance before, you know that it generally revolves around identifying a p-value, a probability, and that we're looking for low-probability events. We look for p-values less than .05, and we use those p-values to help us determine statistical significance. We use statistical significance to help us figure out if the relationship that we see in our sample is a real relationship that we would expect to find in the population. If it's not significant, then we treat the relationship that we see in the sample as not real: it's something that could have occurred by chance or through variation, and it's not something that we would expect to find in the population.

Now, I'm talking in general terms by saying "the thing we find in the sample." You may be thinking, "That's weird, what does he mean by that?" What we mean is that we're going to run a bivariate analysis, and that bivariate analysis is going to give you some result. We're going to look at the crosstab and describe and interpret it. We're going to look at the comparison of means and describe and interpret it. And this thing that we see, that we're telling our reader a story about in our description and interpretation, for example "it looks like there's a strong positive relationship between these two variables" or "the patterning of these results suggests a relationship between these two variables": would we expect to find that relationship in the population overall, or is it just in this sample? This will make more sense when we start actually doing it, but it's really important that you get a sense of what we're really talking about. What we're really talking about when we do statistical significance testing is inference: can we infer from our sample to our population? The sample shows that there is a relationship between
education and income. Now let's run a test of significance to see if we can infer from that relationship to the population overall. Does that make sense? We call that inference, and it's one of the two things that we do.

In the early portion of this class, all we did was lay out the basics of research and understand the distribution of a single variable. Now we're getting into the nitty-gritty, where we're actually talking about doing some analysis. Once we analyze that sample, we have to figure out whether or not our findings are things we would actually expect to see in the population. Can we reliably and confidently infer from our sample to the population?

So what are the specific tests of significance that we're going to be dealing with? I have listed them here in red. For crosstabs, the test of significance, the inferential test, is chi-square. And that's another thing you can write down and put a star next to: an inferential test and a test of significance are, for all intents and purposes, the exact same thing. We will look at a crosstab, describe and interpret the results of that crosstab, and then use the chi-square analysis to see if we can take this information about the sample and infer it to the population. Same thing with comparison of means: there the inferential tests are the t-test and ANOVA. Those are the significance tests that we're going to be using. When we get to correlation and regression near the end of this course, it is an F test and a t-test.

This next section of the course is really going to force you to be able to look at SPSS output, to describe and interpret crosstabs, to describe and interpret comparisons of means, and then to be able to look at the SPSS output for the inferential test, whether it's chi-square, t-test, or ANOVA, and
look at that output and find the information that you need to determine if there is a statistically significant relationship between the variables, one that you would expect to find in the population. That's what your job is going to be. And please, if you're unsure about something right now, stop what you're doing, open up an email window, write out your thoughts and questions, and send me an email right away.

All right, so let's talk about hypotheses. Hypotheses are very important; much of what we do is centered around hypotheses. You will generally start off all the research that you do with a hypothesis. There are two different types of hypotheses that you have to be able to write, read, and understand, and there is a specific format and a specific language that you need to use when you write them. I'm talking about writing them, and you'll get some experience and some examples of writing them, but the big thing will also be your ability to read them, make sure that they are correct, and know what they are telling you about the expected relationship between the two variables.

So let's jump in and get a little more concrete. Introducing hypotheses: yes, we're going to talk about testing in just a little bit, but we start off with writing the research hypothesis. What is a research hypothesis? A research hypothesis is a statement regarding the proposed relationship between two variables, or the value of a statistic. We're going to focus on the relationship between two variables. If you remember, earlier in this course we talked about the relationship between education and income, and we hypothesized that as
education increased, so would income. That's a research hypothesis: it states what the researcher believes the relationship between two variables will look like.

Here are a couple of examples real fast. Let's say somebody is interested in the relationship between sex and cigarette smoking. Oftentimes we designate a research hypothesis with H sub 1; for the purposes of this class you can just write "research hypothesis," and that's perfectly fine. The mean number of cigarettes smoked will be greater for men than for women. So stop, and let's do a brief test real fast. What's the independent variable and the dependent variable here? This is practice for you. The mean number of cigarettes smoked will be greater for men than for women. It should be clear when you read a research hypothesis which is the independent variable and which is the dependent variable, and if you would like some more practice with this, you should email me right now: open up another email window and send it.

So, the mean number of cigarettes smoked will be greater for men than for women. The dependent variable is cigarettes smoked, and the independent variable is sex, because it's saying that the number of cigarettes smoked will be different for men than for women. In other words, cigarette smoking will be dependent upon sex. Well, technically gender, but that's okay; it doesn't make a difference here. And notice I am saying that I think men will smoke more than women. I'm providing guidance about what the relationship between the two variables looks like.

Let's do another one. Let's think about the relationship between smoking and health: the more you smoke, the lower your health. What's the independent and dependent variable there? Real fast: the more you smoke, the lower
your health. What do you think? The dependent variable is health. Why? Because health is dependent upon how much you smoke. The more you smoke, the lower your health, so as smoking increases, health decreases. You see that?

Let's do one more. I think I already gave a version of this: education and income. Those with greater levels of education will have higher incomes. The dependent variable is income, because we're saying income is dependent upon education, and as education increases, income will increase.

For any analysis that you do in this course, you would always start by writing out a research hypothesis. If I were to ask you to evaluate something in this course, like on an exam or a homework, your first job would be to look at the research hypothesis and make sure you understand what the researcher thinks the relationship is going to look like.

However, we don't just have a research hypothesis. We also have what's called a null hypothesis. The null hypothesis is a statement that there will be no relationship between the two variables. The research hypothesis states what you think the relationship will look like; the null hypothesis says there will be no relationship between those two things. You might be thinking, "Why would I write a hypothesis that says there's no relationship between two things?" That's what we do. As a matter of fact, we will center hypothesis testing around the null hypothesis. The null hypothesis is actually very, very important for what we do when we evaluate hypotheses, so you must write an appropriate null hypothesis.

In the last slide we went through three research hypotheses. The first was: the mean number of cigarettes smoked will be greater
for men than for women. What's the null hypothesis there? It's designated with H0, but you can write it as "null hypothesis," and that's perfectly fine. Men and women will smoke the same number of cigarettes. What does that actually mean? It's another way of saying that sex will have no effect on smoking.

How about the next one: the more you smoke, the lower your health. What would the null hypothesis be there? I'm going to give you a second. You might want to write it down just to give yourself a chance to do it. The null would be: smoking has no effect on health. Now, of course, we know that not to be true, but this is what the null hypothesis would be. For every research hypothesis you write in this course, you must write a null hypothesis, and you should always be prepared to evaluate the null hypothesis as well as the research hypothesis.

How about the last one: those with greater levels of education will have higher incomes. What's the null hypothesis there? Education has no effect on income. That's the null hypothesis.

The null hypothesis is the statement that there is no relationship between the two variables that you are looking at. You should write that sentence down. The research hypothesis says what you think the relationship will look like; the null hypothesis simply states that there will be no relationship between the two variables. You need to be able to do that. So every relationship should have a research hypothesis and a null hypothesis.

But I need to make something about your research hypotheses very, very clear. In this class we
don't just write hypotheses, we write good hypotheses. A good hypothesis contains a direction: it is explicit about what it expects to see in the relationship. The relationship between variables can have what's called a direction. Write this down: there are two directions for a relationship. I'm going to explain them in the next slide. Here we go.

There can be a positive relationship: the variables move together in the same direction. If one goes up, the other goes up. Have we given a research hypothesis that can be described as stating a positive relationship between two variables? Think about it for one second. What did we do? We did "the mean number of cigarettes smoked will be greater for men than for women," "the more you smoke, the lower your health," and "those with greater levels of education will have higher incomes." Which one of those is a positive relationship? Education will be positively related to income, because as education goes up, you expect income to go up, and those with lower levels of education will also have lower levels of income. Education and income move in the same direction; therefore we hypothesize a positive relationship between the two variables. You must be able to use this language.

Now, if there's positive, there's probably also the flip side: negative. If positive means they move in the same direction, negative means they move in opposite directions. As one goes up, the other goes down, and it doesn't make a difference which. Of the three research hypotheses we've discussed so far today, is one of them a negative relationship? Again, we said the mean number of cigarettes smoked will be greater for men than for women, and the more you smoke, the lower your health. Is that negative? Is that positive? Well,
is it negative? The more you smoke, as smoking goes up, what happens to health? It goes down. They move in opposite directions, so smoking is negatively related to health. Does that make sense? When things are positively related, they move in the same direction; when they're negatively related, they move in opposite directions. You need to use that language when you write hypotheses, and you also need to be able to understand what it means when someone says that two things are negatively related.

Now here's an additional point that you need to write down that's not on the slides. If you were just scrolling through the slides, this is a piece of information you would not be able to find, so you actually have to listen to the lecture in order to know it. Are you ready? Write it in your notes right now, put a star next to it, and add an exclamation point. And by the way, if you want to impress me when you hear this, send me an email and say: Dr.
Alvarez, I wrote down the important point about the direction of relationships in writing hypotheses.

I want you to notice something. If our hypothesis has a direction, this implies that both the dependent variable and the independent variable can go up or down. For instance, with a positive relationship, we said they move in the same direction: independent variable goes up, dependent variable goes up; independent variable goes down, dependent variable goes down. So each has to be able to go up or down. With negative relationships, they still have to be able to go up and down, even if they move in opposite directions.

Here's the question that I'm going to ask you. We have three levels of measurement that we discuss in this class: nominal, ordinal, and interval/ratio. Can we apply the terms up and down, increasing or decreasing, to all three levels of measurement? I'm going to say it differently now, in a way that takes us a little closer to the answer, so prepare yourself. If I have an interval/ratio variable, how much money you earned last year in dollars, can I identify whether it's increasing or decreasing? I can. Let's say I have an ordinal variable about how much you enjoy smoking: you enjoy it very much, you enjoy it a little, you don't really enjoy it at all, you really dislike it but you're addicted. With an ordinal variable we can rank those categories. Can we go up and down there? We can. If you enjoy it more, it goes up; if you enjoy it less, it goes down. We can rank them; it has an order to it.

But what about a nominal variable? Can it go up and down? Let's say the variable measures someone's race: white, black, Latino, Asian, other. Is there an up and down there? Is there an order? Is there an amount that can
increase or decrease for a nominal variable? There is not. So if we have a nominal independent variable, how do we provide direction? We provide direction by comparing categories of our independent variable. Did we do that already? We did indeed. What was our very first research hypothesis? I'm going to go back to it so you can see it. We say men and women will smoke the same number of cigarettes. Excuse me, that's the null hypothesis. We said the mean number of cigarettes smoked will be greater for men than for women. We can say one is greater or less than the other because the dependent variable there is interval/ratio, but we can't say up or down on the independent variable, can we? So how did I provide direction? This is the right formula: I compared men and women, and I said that men are going to smoke more cigarettes than women. That is how we provide direction if we have a nominal independent variable.

Let's think about race for a second: white, black, Latino, Asian, other. We compare them on smoking. Can I say race will be positively related to smoking? Well, you can do more or less smoking, but can you have more or less race? Can your race go up or down? It can't. So how do you write a good research hypothesis in that circumstance? You compare categories. I think whites will smoke more than blacks and Latinos. I think those in the other racial category will smoke more than Asians. All of those provide a level of direction that's not up or down but does say what you believe with greater specificity, by comparing the categories. You don't have to use all the categories; you just have to compare at least two of them. Some people like to say whites will smoke more than
non-whites, and then group everybody who's not white into one big bucket. I myself do not like doing those types of analysis, but that's what some people do. This is the way that we provide direction when we have a nominal independent variable. So that was a wee bit of additional material that you will need to know that doesn't show up in the slides. Sorry about that, but not really sorry: I want to make sure everybody watches the lecture, and this is important.

So what do these look like in a formal statement? What would you actually see in an analysis, or what might you see on an exam? It would look something like this:

H1: Men will smoke more cigarettes than women.
H0: Men and women will smoke the same number of cigarettes.

H1: There is a negative relationship between smoking and health.
H0: There is no relationship between smoking and health.

H1: There is a positive relationship between education and income.
H0: There is no relationship between education and income.

These are perfectly good examples of how you would write out a research hypothesis and, right after it, a null hypothesis. You will not always see this. If you look in a journal article in sociology, psychology, political science, or economics, you won't always see people write out their null hypotheses, but they're there, lurking in the background. As burgeoning new social scientists, I am forcing you to write null hypotheses here.

I want to say something to you, and I'm going to sound like a broken record. I do a lot of repeating; it's a pedagogical technique. I want to make this point to you: when we in this class formally test hypotheses, and we will do this, we evaluate both the null hypothesis and the research hypothesis. I want you to write that down: you must evaluate both the null hypothesis and evaluate
the research hypothesis. Okay. In order to evaluate your null hypothesis, write this down, you will use your test of significance, your inferential test. I'll say it one more time: in order to evaluate your null hypothesis, you will use your test of significance, your inferential test. In order to evaluate your research hypothesis, you will use the bivariate analysis itself, the results of your bivariate analysis. You look at your comparison of means in order to determine, hey, men actually do smoke more cigarettes than women; there is support for this in this data. And then you will look at your test of significance to say, hey, the t-test shows me that the p-value is 0.01. This shows that there is a statistically significant relationship between sex and smoking, and so I expect that men will smoke more cigarettes than women not just in this sample but in the population overall. I can infer from my sample to the population. That's what we do. Or: I have found a negative correlation coefficient between smoking and health. This is what the researcher hypothesized, therefore there's support for that researcher's hypothesis. The p-value for this correlation coefficient is 0.04, so there is a statistically significant relationship between smoking and health that we would expect to find in the population. That's what that looks like and sounds like, and this is what we're going to learn how to do in this course. So that all falls under testing hypotheses, and I'm going to give you more information about this. This might be a good time for you to go and take a break. Things get a little bit more complicated right now, so if your brain is already feeling a little hot and itchy, this is a good time to stop. You know, maybe have a beer, do a shot
if that's your thing. I would prefer you not do that, but if you do, make sure you're not going to drive or anything. I am going to have a sip of my water. Delicious. I encourage you to drink water as well. I'm just being silly now because this is a long lecture. So we jump into testing hypotheses. All right, let's do it. Testing hypotheses has two goals. The first is to evaluate whether the relationship between variables I see in my analysis of the sample will likely be found in the population; in other words, inference. We test our hypotheses, specifically our null hypothesis, in order to see if we can infer from our sample to the population. If we have a statistically significant result, that means that we can infer from our sample to the population. If we do not have a statistically significant result, this means that we cannot infer from the sample to the population, and what we see in this sample could plausibly have been produced by chance or randomness. The second goal is to evaluate whether there is support for the researcher's hypothesis about the relationship between variables. As I said, you will use the bivariate analysis itself to do the second one, and you will use a test of significance to do the first one. Okay, so inference: the test of significance. There are a bunch of different tests of significance. We mentioned chi-square, we mentioned the t-test, we mentioned the F test, and the test of significance you use will depend on the type of analysis you run, whether you use crosstabs, comparison of means, etc. For each type of bivariate analysis you run there's a different test of significance, and your job is to know which ones go with which. When you look at the tests of significance, you'll be looking for the p-value, the probability, in order to determine significance. We want unlikely events. We want small probabilities. We want our p-value to be small. P is a probability, meaning that it ranges from 0 to 1: 0 means something will never happen
and 1 means that it will always happen. A probability of 1 means the percentage chance is 100; a probability of 0 means the percentage chance is 0, so it will not happen. We will always be looking for very unlikely events; we are looking for p-values that are very small. We generally call a finding significant when p is less than 0.05, in other words when the outcome that we see has less than a 5% chance of occurring by chance alone. Your job will be to look through the SPSS output and identify the probability in the output so that you can write about it and make that determination. Any p-value at 0.05 or above is not statistically significant. It's treated as noise, randomness, something that exists in our sample but that we cannot expect to find in the population. That threshold can vary. Do you know who gets to make the decision about which p-value we use to determine statistical significance? You do. As the researcher, you do so by determining your confidence level, and we're going to talk about that in just one second. But the most general p-value that we look for to determine statistical significance is below 0.05. Are we clear about that? Write that down in your notes: if it's below 0.05, it's statistically significant; if it's 0.05 or above, it is not statistically significant. You need to know this. When we determine that the p-value is significant or not significant, that allows us to specifically address our null hypothesis, and there is a very specific language that we use to discuss the null hypothesis. Are you ready for this? I would write this down. If you have a p-value that is less than 0.05, we would say that that is a significant finding and that therefore we reject the null hypothesis of no difference, or of no effect. We reject the null
hypothesis. Why are we rejecting the null hypothesis? Think for yourself right now: what does the null hypothesis state? Take a second and write it down. In general, null hypotheses state that there's no relationship between the two variables, that there's no effect of one variable on the other. That's the null hypothesis: no effect, no relationship. If we reject the null hypothesis, what does that imply? That there is a relationship between the two variables, that the independent variable does have an impact on the dependent variable. Do you see that? We reject the null hypothesis to say that we reject the idea that there is no relationship between the two variables. So a finding that there is a statistically significant relationship between the two variables leads us to reject the null hypothesis that there is no relationship, or no effect of one variable on the other. Are we clear about this? So what do we say when the other thing happens, when the p-value is at 0.05 or above? Then we say we fail to reject the null hypothesis. In other words, we fail in our ability to say that the null hypothesis does not hold. I want you to see that we never, ever use the words accept or prove true. That is not what we are doing. We are either rejecting or failing to reject. That's it. We're not proving anything, and we certainly don't use words like true. We reject or fail to reject. Now, we can get into some more conversation about why we use this language, but at the end of the day, what am I looking for you to demonstrate for me? That if you see a p-value, you know whether to reject the null or fail to reject the null based upon that p-value. And if somebody says that something is statistically significant, you know what to say about the null hypothesis even if you don't see the
p-value. Do you see what I'm saying? This is what you need to be able to do. Are we clear about this? Okay, let's keep going. So what is really going on here conceptually? This is just a thin version of it; we're not going to go into the weeds on this. We evaluate hypotheses based on probability. I said to you that the common threshold is a p of 0.05, in other words a 5% chance, where we're willing to accept a 5% chance of being wrong. That threshold is set by what's called the confidence level, and there are different confidence levels that you can use. By far the most commonly used confidence level is the 95% confidence level, meaning that we're looking for a p of something below 0.05. You can think about it as a 5% chance plus a 95% chance together giving you a 100% chance. That's one way of thinking about it. So if you have a 99% confidence level, what do you think the p-value is that we would look for? It's 0.01: a 1% chance plus a 99% chance is a 100% chance. So we are looking for highly unlikely events, and the less likely it is, the more evidence we have to reject the null hypothesis. That's the way that you should think about this. When we use a higher confidence level, that means that we have higher confidence in our inference and we think there's less of a chance for us to be wrong. At the 95 percent confidence level, there's a 5 percent chance that we're wrong about our inference. At the 99 percent confidence level, there's a 1 percent chance that we're wrong. Do you see what I'm saying? At the 90 percent confidence level, which people do use, there's a 10 percent chance that we're wrong. So when would you change your confidence level, and when does it matter? Well, here's what I'll say to you. The most commonly used is the 95 percent confidence level, which means that you would look for a
p-value below 0.05. We would normally change the confidence level that we're looking at based upon the size of our sample. If we have a small sample size, it is more difficult to find statistical significance, and so there we might be more willing to allow for the possibility of a mistake and therefore use the 90% confidence level. So, sample size. On the flip side, if we have a very large sample size, it is easier to find statistical significance, and therefore we might use a higher confidence level, the 99 percent confidence level. Again, sample size sometimes is the primary driver of the confidence level. Or we're just not willing to be wrong. In other words, when you infer to the population, you really, really do want to be right about it, because the stakes of being wrong are very high. Imagine that you're doing a medical test and you want to make sure that there are no strong adverse effects of taking a medication on the human body. There you want to use very high confidence levels, because if you're wrong about the effect of your medication on the human body, there could be real, terrible consequences for people. And so there are these different reasons that might shape what your confidence level is. Now, there's one thing that I think is really important, that has become a bigger and bigger issue both in the philosophy of science and amongst statisticians more generally, and that is that the p-values that we use, 0.05, 0.01, 0.10, depending on the confidence level, are not magical numbers. Those numbers are not grounded in some deep-rooted scientific understanding of the level of risk we can take. They are traditional numbers that statisticians and scientists have used. That's it. There's no magical thing that happens because we use a p-value of 0.05 at the 95% confidence level. You know,
somebody made that up a long time ago and people have just used it traditionally. That's it. And so I don't want you to make believe that there's some hard and fast rule about these things. Typically, if you want to publish in a journal, yes, you're going to use a 95% confidence level, or a 99% confidence level if you've got a large sample size. That is the traditional thing for you to do. But those are not mystical numbers, and on the other side there's no deep scientific or mathematical proof that the 95 or 99 percent confidence levels are the appropriate ones to use. It's just what we've used traditionally. If you're a Marxist, the term is reify: I don't want you to reify them. I want you to choose the confidence level that makes sense for you based upon your specific circumstances, and I'm saying all this to empower you to make decisions for yourself. So we've talked about what this probability is and what we're looking for, p of 0.05 or 0.01; it all depends upon the confidence level that we're at. But what does that p-value really mean? There's a lot of debate about this. This is what is traditionally said: the p-value tells us, if we assume the null hypothesis is true, what is the probability of obtaining this outcome? That's the generally understood way of thinking about what the p-value is. It's not exactly accurate, but that's one way of looking at it. So let's imagine, and write this down for me if you don't mind, that we say that men smoke on average 25 cigarettes in a day and women smoke 15. 25 and 15. I'm going to write it down myself as I'm coming up with this: 25, 15, and we got a p-value of 0.01,
right? So what does that tell us? It says that if we lived in a world where the null hypothesis were true, where there is no relationship between sex and smoking, there would only be a 1% chance that we would actually find a sample where men smoked 25 and women smoked 15. That is a highly unlikely outcome. Therefore, what should we do? We should reject that null hypothesis, and we should conclude that there is a statistically significant relationship between gender, or rather sex, and smoking, and that we would expect to find that in the population. Make sense? Most of you are probably going to be like, no, not really. There's another way that I often tell students to think about it, which is absolutely incorrect but can be useful as a heuristic to help you understand what it's trying to tell you. The wrong but useful way of thinking about the p-value is: what is the likelihood that the null hypothesis is true? That's not what it's telling you, but that's a useful way of thinking about it. Under this view, the p-value that you're looking for, based upon your confidence level, is the percentage chance that you're willing to be wrong about the null hypothesis being true. So, in other words, men smoke 25 cigarettes per day, women smoke 15, and in this world the probability that the null hypothesis is true, as I said, is 0.01. So there's a 1% chance that the null hypothesis is true. That's a very, very low chance that the null hypothesis is true. So what do we do? We kick that damn null hypothesis out of here. Well, let's imagine that we take another sample, and in that sample we find that men smoke 21 and women smoke
19, and the p-value associated there is 0.12. In other words, in the world where we get this sample, there's a 12 percent chance that the null hypothesis is true. That's too likely for us, so we fail to reject the null hypothesis. We fail to reject the null hypothesis, and we say the thing that we see in the sample, 21 versus 19, is just here in the sample. It is something that could be produced by chance; it's not something that we would expect to find in the population. You see what I'm saying? That's a useful way of thinking about it. If it's unlikely that the null hypothesis is true, we understand that this outcome is unlikely to be produced by chance and is likely to be found in the population, and therefore we can infer our findings to the population. So I would write down all that stuff that I just said, just in case. So let's just make sure we're clear about this. For the null hypothesis, we're looking at p-values. We use that p-value to determine: is this a statistically significant relationship? If we're at the 95% confidence level, the p-value threshold that we're looking for is 0.05. So if it's less than 0.05, we say there is a statistically significant relationship and we reject the null hypothesis. If the p-value is at 0.05 or above, it's not statistically significant, and therefore we fail to reject the null hypothesis. So what do we do in the case of the research hypothesis? There we must look at the actual bivariate analysis, and you have to use your description and interpretation to help you evaluate whether there is or is not support for the research hypothesis. A research hypothesis describes what the relationship between two variables will look like. You must determine if it does describe that relationship or if it does not. If it
does describe that relationship, then there is support for this research hypothesis, and you say: because we found that men averaged 25 cigarettes per day whereas women only had 15, there is support for this research hypothesis. You must address whether the results you obtained support your research hypothesis. Notice I didn't say anything about the null hypothesis. I didn't say anything about that at all. Even in the case of the second sample that we took, where we said that men smoked 21 and women smoked 19. In that case we said that the p-value was 0.12, that there is not a statistically significant relationship between the variables that we would expect to find in the population, and therefore we fail to reject the null hypothesis. Even then we still have to evaluate the research hypothesis, and even there we would say, hey, we do find support, weak support, for the research hypothesis, because men smoked 2 more cigarettes than women did, 21 versus 19. However, the difference between men and women is not statistically significant, and we would not expect to find this difference in the population. See what I did there? There is weak support, but there is not a statistically significant difference that we would expect to find in the population. You must evaluate both the null hypothesis and the research hypothesis. You must do both. Regardless of whether it's statistically significant or not, you will still evaluate the research hypothesis. Okay, so let's summarize. We have a research hypothesis and a null hypothesis. We have to evaluate both. We evaluate our research hypothesis using the analysis: the crosstabs, the comparison of means, whatever it is. We evaluate our null hypothesis using the inferential test of significance: chi-square, t-test, F test. More specifically, we use the p-value obtained in the chi-square, in the t-test, in
the F test to either reject or fail to reject the null hypothesis. Are we clear? That was a fun little lecture. Email me if you have questions about this. Okay, all right, have a good one, everybody.
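
The two-part evaluation summarized in this lecture can be sketched in a few lines of Python. This is only an illustration, not something shown in the lecture: the function names `evaluate_null` and `evaluate_research` are hypothetical, the means and p-values are the made-up smoking numbers from the lecture, and the threshold is passed directly as `alpha` (0.05 for the 95% confidence level, 0.01 for the 99% level).

```python
def evaluate_null(p_value, alpha=0.05):
    """Decide about the null hypothesis from a p-value.

    alpha is the threshold implied by the confidence level
    (0.05 for 95%, 0.01 for 99%, 0.10 for 90%).
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value is a probability and must lie between 0 and 1")
    # Below the threshold: reject. At or above the threshold: fail to reject.
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


def evaluate_research(mean_men, mean_women):
    """Check whether the sample means point in the hypothesized direction.

    The research hypothesis in the lecture is: men smoke more than women.
    """
    return mean_men > mean_women


# The two hypothetical samples used in the lecture.
samples = [
    {"men": 25, "women": 15, "p": 0.01},
    {"men": 21, "women": 19, "p": 0.12},
]

for s in samples:
    supported = evaluate_research(s["men"], s["women"])
    decision = evaluate_null(s["p"])
    print(f"men={s['men']}, women={s['women']}, p={s['p']}: "
          f"research hypothesis supported in sample: {supported}; {decision}")
```

Both samples show men smoking more than women, so both give some support to the research hypothesis, but only the first one (p = 0.01) leads us to reject the null hypothesis. The two evaluations are kept as separate functions on purpose, because the lecture treats them as separate steps.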
