Transcript for:
Understanding Pearson's Correlation Coefficient

So in a previous video we looked at graphs like this and we interpreted them, and we used words to do that: a strong positive association, a moderate negative association, a weak negative association, no association at all. There are lots of other ones, but you get the idea. Words, words, words, words. I'm a maths teacher; numbers is what I want.

The correlation coefficient, Pearson's r, is a number that we can use to describe the association in a scatter plot. Now, I'm not going to get into the nitty-gritty of how to calculate it right now because it's really complicated, but I am going to explain it to you.

Pearson's r is a number between positive one and negative one. Now, one is a perfect positive association: dots such that, if you were to draw a line through them, the line would go through every single one of those dots. A little-known fact: I used to be a fruiterer. I used to sell fruit. If you go to the shop and buy fruit, say bananas, then how many kilos you buy and how much money it costs you is a perfect positive correlation. If you buy three kilos of bananas it'll cost you, say, six dollars. If you buy five kilos of bananas it will cost you ten dollars. It's a straight line: perfect positive association, one.

And so, as you might expect, a perfect negative association is negative one, and it's a straight line of dots heading downwards. If you're far away from home and you get in your car and start driving home at a constant speed, the distance from your house will get less and less and less until you are home, at zero. So distance from your house against time: perfect negative association, as long as you drive at exactly the same speed for the entire time.

And zero is just random dots on a page: no association. I'm not going to give you an example; you know what no association means. Random dots.

Now, it's not just zero that's no association. We can kind of expand that out a little bit, and we can say that if your r value is between 0.25 and negative 0.25, anywhere within that bound, it's going to be
no association. I think you can probably see where I'm going with this. Between 0.25 and 0.5 is going to be weak positive; between negative 0.25 and negative 0.5 is going to be weak negative; between 0.5 and 0.75, moderate positive; and between negative 0.5 and negative 0.75, moderate negative. Finally, no surprise here, strong positive from 0.75 to 1 and strong negative from negative 0.75 to negative 1.

It's probably important to write all of that down, because you might get asked a question like so: two variables have a correlation coefficient of r equals 0.6; describe the association. It's going to be really straightforward. We go to our little graph here and we say, well, 0.6, that's about there. It falls into the moderate positive, so we can say that the association is moderate positive. You could also make a small attempt at sketching what that might look like, right? It's positive, so it's moving up in that direction, and it's moderate, so it's dots going upwards but a little bit spread out, not right next to each other. That looks like it's probably got an r of about 0.6.

Which leads me to a different question you might get asked: look at this graph and tell me the approximate r value. So we can have a bit of a crack at this. We can see that it's heading downwards, so it's negative, right? So it's somewhere between zero and negative one. And we might say, well, they're sort of close-ish; they're probably a little bit closer together than this previous picture here, which we said was 0.6. So this is going to be negative and a bit stronger than that, so r is probably approximately negative 0.7. Now, there's no right answer there, but maybe it was a multiple choice question. Maybe they gave you five different r values to choose from, and it's your job to choose the r value that most looks like that one. That would be a great exam question.

Super important: it only works for linear associations. Linear, like a line. So if you tried to calculate the r of something that looked like this,
well, the r value for that would probably be 0, or it might be something else, but it would be meaningless and you just wouldn't want to do it, because it only works for linear associations. And you can see this looks more like a bendy one, right? It's non-linear. Not much more to say there: r is meaningless if you're dealing with non-linear associations. It only works for linear associations.

So you know r is a number that measures linear association, and you know it can go between negative 1 and 1, but I haven't told you how to calculate it. Now, if you're interested you should stick around, because I'll explain kind of how it's calculated. But you should know it's very complicated to calculate, and there's a low likelihood that anyone would ever ask you to calculate it using the formula I'm about to show you.

So if you're still here, thanks. It means you're a curious person. I appreciate you. So let's go through this. We've got a bunch of points. Each of those points has an x-coordinate and a y-coordinate, so this might be (12, 18) and this might be (3, 15), and all of them are going to have things like that. Now, there are 12 dots here, and that means there's going to be 12 x-coordinates and 12 y-coordinates. So what I'm going to do is take all of the x-coordinates, add them together and divide them by 12.
In other words, find the average x-coordinate of all of these dots. Now, that's the average x-coordinate of all these dots, seven, and I've ruled a line through seven. What I'm going to do now is measure the distance of every dot from there to there; in other words, find the difference between the x-coordinate and the average x-coordinate. And then we can do the same with the y-coordinates: we take the average y-coordinate of every single one of these dots and rule a neat little line in there, and then we just find all of these distances. So now I've found all of the distances from the dots to this pink line here, and now we're kind of in business.

But there's still a bit of an issue here, because look at the scale of the x-axis compared to the scale of the y-axis, and look how spread out they are here compared to how spread out they are here. We have to do this weird thing called standardization. Essentially, using the standard deviation, which you'll know is a measure of spread, we take these and we standardize the width of this and the width of this. It's a bit technical, but you're going to have to sort of trust me on this. What we're going to do is take this line here and this line here, standardize them, and then multiply the lines together.

So let me show you what I mean for this particular dot, because we actually have to do this calculation for every single dot. It's really crazy. We take the x value, 12, and we subtract the average x value, 7, and then we divide by the standard deviation. Now, the standard deviation is a number, and you know it's a measure of spread. I don't know what the standard deviation is for this because I'm just making it up, but we don't really need to know. Okay, now you multiply that by 18 minus whatever this pink value is here, it's probably like 14, so 18 minus 14, over the standard deviation. Now, it's important to note these standard deviations are different: this is the standard deviation of the x's and this is the standard deviation of the
y's. So this gives you the standardized area of that rectangle. Crazy. We're going to do that for every single one of the 12 dots there, so we do the sum of all of these particular ones. Now I'll get rid of these numbers, because it's not just this dot, it's all of the dots, right? I just need to put something in that's a little more general. And then finally, once we've done all of that, we divide by the number of dots minus 1 to come up with a sort of variance here, a sort of average. Now, when you do that it's going to spit out the r value, and the r value is going to be somewhere between negative 1 and 1. Closer to one means it looks a bit more like this, and closer to negative one means it looks like that.

And you might be thinking, like, what? But why? And the reason is that if you've got things in this quadrant here, you've got 12 minus 7, which is 5, and you've got 18 minus 14, which is 4. Both of those numbers end up being positive, and when you multiply them together you get a positive answer. Now, in this quadrant, if you do the same thing, you get a negative multiplied by a negative, which is also a positive, so you get a positive answer here. So dots that appear in this quadrant and dots that appear in this quadrant are positive; dots that appear in this quadrant and dots that appear in this quadrant are negative. What it means is that if you get a lot of dots in this quadrant and this quadrant, you're going to get a very positive answer. If you get more dots appearing in this quadrant and this quadrant, you're going to get a more negative answer. So that's why: positive, lots of dots in these quadrants; negative, lots of dots in these quadrants.

Okay, that gives you an intuition for how that formula works. But seriously, we never calculate it by hand because it's such a pain. We grab our calculator and we type in the numbers, and the calculator spits out an r value for us. Or we put it into Excel; Excel has a nice little formula in it that'll spit out the r value for you.
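
For anyone who wants the spoken steps written down, here is a sketch of the calculation in symbols. The notation is an assumption, since the video points at the screen rather than naming symbols: n is the number of dots, x̄ and ȳ are the average x- and y-coordinates, and s_x and s_y are the standard deviations of the x's and the y's (taken here to be sample standard deviations, which matches dividing by the number of dots minus 1).

```latex
r = \frac{1}{n - 1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

For the dot at (12, 18), with an average x of 7 and an average y of about 14, the term inside the sum is (12 − 7)/s_x multiplied by (18 − 14)/s_y, which is the "standardized rectangle" for that dot.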
We never use a formula that complicated. I just wanted to show it to you.
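
As a rough illustration of letting a machine do the work, here is a minimal Python sketch of the steps described in the video, together with the word labels for each band of r. The function names `pearson_r` and `describe` are made up for this example, and which band the exact boundary values (0.25, 0.5, 0.75) fall into is an assumption, because the video doesn't say.

```python
from statistics import mean, stdev  # stdev is the sample standard deviation (divides by n - 1)


def pearson_r(xs, ys):
    """Pearson's r: sum the standardized rectangles, then divide by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    total = sum(
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)  # one standardized rectangle per dot
        for x, y in zip(xs, ys)
    )
    return total / (len(xs) - 1)


def describe(r):
    """Turn an r value into the words used in the video.

    Assumption: a value exactly on a boundary goes into the stronger band.
    """
    size = abs(r)
    if size < 0.25:
        return "no association"
    if size < 0.5:
        strength = "weak"
    elif size < 0.75:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


# Bananas at two dollars a kilo: kilos bought against cost.
bananas = pearson_r([1, 3, 5], [2, 6, 10])

# Driving home at a constant speed: hours driven against kilometres from home.
driving = pearson_r([0, 1, 2, 3], [180, 120, 60, 0])

# A bendy, non-linear pattern (a symmetric U shape).
bendy = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(round(bananas, 6), round(driving, 6), round(bendy, 6))
print(describe(0.6))
```

The banana data comes out at 1 and the driving data at negative 1, as in the two examples from the video. The U-shaped data comes out at essentially 0 even though the dots follow a clear pattern, which is the reason r should only be used for linear associations.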