Transcript for:
Understanding Correlation and Analytical Tests (Lecture 12: Correlation 1)

This lecture begins a series on correlation. Up to this point, we have learned about several different analytical tests that we can use to assess information when we have data across different groups or populations. This requires us to have some continuous numerical data, where that data falls into some type of associated category, whether that might be a bay system, a period of time, different species of fish, or maybe males and females. These are really useful tools with which to assess questions that we're interested in. But oftentimes we're also interested in the relationship between data that don't have any specific, clear groups. So if we go back to our examples of hypothesis testing and the chi-squared test, we can think about the question of how temperature affects phytoplankton biomass.

And in this particular instance, several lectures ago, our null hypothesis was that all the temperature groups had the same mean biomass, and our alternative hypothesis might be that as temperature gets warmer, we're going to see more biomass. In this particular example, we grouped our data into different categories of temperature, ranging from coldest to hottest. But when we think about temperature, it's a continuous numerical variable, and how do we classify temperature as warm versus warmer?

And so rather than grouping temperature, usually it's more appropriate to measure it as a continuous variable and compare its relationship to phytoplankton biomass, which is also a continuous numerical variable. We can think back to our data visualization lectures: one of the main ways to visually represent two continuous numerical variables is through scatter plots. So we can determine if there's any kind of relationship, similar to what we see here with the body size of animals and their respective prey weight.

And so correlation is a way that we can measure the strength of association between two variables that are continuous and numerical. And these variables are going to be in x-y pairs, meaning that for every data point we measure both of these continuous numerical variables. And then that in turn will allow us to determine their relationship, or if there is a relationship, and how strong their relationship is.

And so we're measuring these two variables for each individual in the sample. Now, the key here is going to be that we're determining if and in what way these variables are related to each other. And so for example, we can think about humans and pollution. And generally when we see more humans, there tends to be more pollution.

And so we'd say that there is a positive, direct relationship: more humans, more pollution. We can think about a different example here, in that temperature actually has a negative relationship with shark abundance in South Florida. As the temperatures in the southern portion of Florida increase, we tend to see fewer sharks, as they migrate north along the eastern seaboard of the United States. So here we would have a negative relationship.
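The two examples above can be sketched numerically. This is a minimal Python sketch with made-up numbers: the data values, the variable names, and the small `pearson_r` helper are all hypothetical, used only to show that the sign of the correlation coefficient matches the direction of the relationship.

```python
def pearson_r(x, y):
    """Sample Pearson correlation coefficient for paired (x, y) data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: more humans, more pollution (positive relationship)
humans = [100, 250, 400, 600, 900]
pollution = [12, 30, 45, 70, 98]

# Hypothetical data: warmer water, fewer sharks (negative relationship)
temp_c = [22, 24, 26, 28, 30]
sharks = [40, 33, 25, 18, 9]

print(pearson_r(humans, pollution) > 0)  # positive r: direct relationship
print(pearson_r(temp_c, sharks) < 0)     # negative r: inverse relationship
```

The sign of r captures the direction of the association; its magnitude, discussed below, captures the strength.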

And so the goal here for correlation is to determine the relationship. Now oftentimes this is worked into the idea of cause and effect relationships. And when we're looking at two continuous numerical variables, our mind immediately goes to which variable is causing the relationship to occur. And for this particular idea, we use linear regression. And so when we're thinking about using correlation versus linear regression, these are two separate analytical frameworks.

And the question that's being asked is what's going to determine if we should use correlation or regression. Sometimes we can get mixed up because they are relatively similar to each other. But correlation simply determines the degree of association between those variables. So how closely are they related?

Whereas regression is going to help us determine the level of dependence of one variable on another. And this is oftentimes when we think about these cause and effect relationships between what we call a response variable and a predictor variable. These are terms that if we remember way back to the beginning of the semester we discussed. So in this particular set of lectures, we're going to focus on correlation, that degree of association between the two measurements. And we're going to determine what that strength of association is.

So, first question. The first question for correlation is whether two variables are related and, if they are related in a consistent way, how they are related. Are they positively related?

Are they negatively related? And then the second question that we address with correlation is how strong their relationship is, and whether there is some type of statistically significant relationship. So we can think back to our example of body size and prey size, and what we can see here is that it looks like there's some type of relationship.

It looks like as the size of the animals increases, their prey size also increases. For correlation, we're simply interested in whether there's an association, and the strength of that association between these two variables, not which variable is causing the other. So in thinking about correlation, oftentimes we're thinking about these continuous numerical data, and we can see positive relationships, we can see negative relationships, and the strength of those relationships varies based on the association of those two variables. And so we need a way to determine how strong that association is.

And so in order to do this, we use the correlation coefficient. And it tells us how well the data fit to a straight line. And so the closer the data points fit to a straight line, whether that be positive or negative, then the stronger the relationship is.

As those points deviate from a straight line, the relationship tends to be weaker. Now this measure is Pearson's correlation coefficient, and at the population level it is a population parameter (commonly written with the Greek letter rho, ρ).

Now as we know, it's rare for us to know these population parameters ahead of time, so we use sample data to calculate the correlation between our respective variables. This sample correlation coefficient is represented by r. So this is going to be our test statistic for correlation, similar to how t was for our t-test and F was for our F-test and for ANOVA.
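As a preview of the calculation covered in the next lecture, r can be sketched as the average product of standardized (z) scores of the paired measurements. The body-size and prey-weight values below are hypothetical, chosen only to illustrate the computation, assuming the sample form with n − 1 in the denominator.

```python
import statistics

def pearson_r(x, y):
    # r as the mean product of z-scores, using sample standard
    # deviations and dividing by n - 1 (sample form)
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (n - 1)

body_size = [30, 45, 60, 80, 95]       # hypothetical body sizes (cm)
prey_mass = [2.1, 3.0, 4.2, 5.9, 7.5]  # hypothetical prey weights (g)

r = pearson_r(body_size, prey_mass)
print(round(r, 3))  # positive and close to 1: a strong direct association
```

Because each z-score is unitless, r itself is unitless, which is why it can be compared across very different pairs of variables.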

So the range for this correlation coefficient is between negative one and positive one. And if we have a correlation coefficient that equals positive one, it implies a perfect positive linear relationship. And if it's negative one, it implies a perfect negative linear relationship.

Now notice on the plots below that the slope of the line isn't necessarily what's dictating the value of our correlation coefficient. It's more the clustering of those data points around the line, so long as the slope of the line does not equal zero. And if r equals zero, then it implies that there's no linear relationship between these two variables that we're interested in (they could still be related in some nonlinear way).
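That point can be sketched quickly with made-up data: rescaling y changes the slope of the line but not r, while random scatter around the line is what pulls |r| below one. The `pearson_r` helper and all data here are hypothetical.

```python
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient for paired (x, y) data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = list(range(1, 11))
y_steep = [3.0 * xi for xi in x]    # steep line, perfectly linear
y_gentle = [0.5 * xi for xi in x]   # gentle line, perfectly linear

# Both slopes give r = 1: the clustering around the line is what matters
print(pearson_r(x, y_steep), pearson_r(x, y_gentle))

random.seed(0)
y_noisy = [3.0 * xi + random.gauss(0, 5) for xi in x]
print(pearson_r(x, y_noisy))  # scatter around the line weakens r
```

The steep and gentle lines have very different slopes, yet both have points sitting exactly on a line, so r is 1 for each; only the noisy version has |r| below one.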

Again, something key to note here is that the slope of the line isn't important; we just care how tightly the points fit the straight line, again assuming that the slope does not equal zero. So moving forward into the next lecture, we're going to discuss how to calculate this correlation coefficient from our raw data. Now, something important for you to do before viewing the next video is to go to section 13.1 of your text and read through it to make sure you are comfortable with all the background information that you need.