Transcript for:
Understanding Correlation Analysis Techniques

This video is about correlation analysis. We start by asking what a correlation analysis is. We will then look at the most important correlation analyses: Pearson correlation, Spearman correlation, Kendall's tau, and point-biserial correlation. Finally, we will discuss the difference between correlation and causation.

Let's start with the first question: what is a correlation analysis? Correlation analysis is a statistical method used to measure the relationship between two variables. For example, is there a relationship between a person's salary and age? In this scatter plot, every single point is a person. In correlation analysis we usually want to know two things: number one, how strong the correlation is, and number two, in which direction the correlation goes. We can read both in the correlation coefficient, which is between -1 and 1.

The strength of the correlation can be read in a table. If the absolute value of r is between 0 and 0.1, we speak of no correlation. If it is between 0.7 and 1, we speak of a very strong correlation.

A positive correlation exists when high values of one variable go along with high values of the other variable, or when small values of one variable go along with small values of the other variable. A positive correlation is found, for example, for body size and shoe size. The result is a positive correlation coefficient. A negative correlation exists when high values of one variable go along with low values of the other variable and vice versa. A negative correlation usually exists between product price and sales volume. The result is a negative correlation coefficient.

Now, we have different correlation coefficients. The most popular are the Pearson correlation coefficient r, the Spearman correlation coefficient r_s, Kendall's tau, and the point-biserial correlation coefficient r_pb. Let's start with the first, the Pearson correlation coefficient.

What is the Pearson correlation? Like all correlation coefficients, the Pearson correlation r is a statistical measure that quantifies the relationship between two
variables. In the case of Pearson correlation, the linear relationship of metric variables is measured. More about metric variables later. So with the help of Pearson correlation, we can measure the linear relationship between two variables, and of course the Pearson correlation coefficient r tells us how strong the correlation is and in which direction the correlation goes.

How is the Pearson correlation calculated? The Pearson correlation coefficient is obtained via this equation, where r is the Pearson correlation coefficient, x_i are the individual values of one variable, for example age, y_i are the individual values of the other variable, for example salary, and x-bar and y-bar are respectively the mean values of the two variables. In the equation, we can see that the respective mean value is first subtracted from both values. So in our example, we calculate the mean values of age and salary. We then subtract the mean values from each person's age and salary. Then we multiply both values and we sum up the individual results of the multiplication. The expression in the denominator ensures that the correlation coefficient is scaled between -1 and 1.
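
As a rough sketch of the calculation just described, the steps can be written out in plain Python. It assumes the standard Pearson formula, r = sum((x_i - x-bar)(y_i - y-bar)) / sqrt(sum((x_i - x-bar)^2) * sum((y_i - y-bar)^2)). The function name pearson_r and the age and salary numbers are made up for illustration; they are not the data shown in the video.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Subtract the respective mean from every value
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    # Numerator: multiply the paired deviations and sum up the products
    numerator = sum(a * b for a, b in zip(dx, dy))
    # Denominator: scales the result to the range -1 to 1
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

# Hypothetical example data (not from the video)
age = [23, 31, 38, 45, 52, 60]
salary = [2100, 2900, 3100, 4000, 3900, 4600]
print(round(pearson_r(age, salary), 3))
```

Note that this sketch divides by zero if one of the variables is constant, in which case the coefficient is not defined anyway.
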
If we now multiply two positive values, we get a positive value, so all values that lie in this area (both above their mean) have a positive influence on the correlation coefficient. If we multiply two negative values, we also get a positive value, minus times minus is plus, so all values that lie in this area (both below their mean) also have a positive influence on the correlation coefficient. If we multiply a positive value and a negative value, we get a negative value, minus times plus is minus, so all values that lie in these ranges (one above and one below its mean) have a negative influence on the correlation coefficient. Therefore, if our values are predominantly in the first two areas, we get a positive correlation coefficient and thus a positive relationship. If our values are predominantly in the other two areas, we get a negative correlation coefficient and thus a negative relationship. If the points are distributed over all four areas, the positive terms and the negative terms cancel each other out and we get a very small or no correlation.

But now there's one more thing to consider. The correlation coefficient is usually calculated with data taken from a sample. However, we often want to test a hypothesis about the population. In the case of correlation analysis, we then want to know if there is a correlation in the population. For this, we check whether the correlation coefficient in the sample is statistically significantly different from zero. The null hypothesis in the Pearson correlation is: the correlation coefficient does not differ significantly from zero; there is no linear relationship. The alternative hypothesis is: the correlation coefficient differs significantly from zero; there is a linear relationship. Attention: it is always tested whether the null hypothesis is rejected or not.

In our example, the research question is: is there a correlation between age and salary in the British population? To find out, we draw a sample and test whether in this sample the correlation coefficient is significantly different from zero. The null hypothesis then is: there is
no correlation between salary and age in the British population. The alternative hypothesis is: there is a correlation between salary and age in the British population. Whether the correlation coefficient is significantly different from zero based on the sample collected can be checked using a t-test with the test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2), where r is the correlation coefficient and n is the sample size. A p-value can then be calculated from the test statistic t. If the p-value is smaller than the specified significance level, which is usually five percent, then the null hypothesis is rejected; otherwise it is not.

But what about the assumptions for a Pearson correlation? Here we must distinguish whether we just want to calculate the Pearson correlation or whether we want to test a hypothesis. To calculate the Pearson correlation coefficient, only two metric variables need to be present. Metric variables are, for example, a person's weight, a person's salary, or electricity consumption. The Pearson correlation coefficient then tells us how large the linear relationship is. If there is a non-linear relationship, we cannot tell from the Pearson correlation coefficient. However, if we want to test whether the Pearson correlation coefficient is significantly different from zero, the two variables must also be normally distributed. If this is not given, the calculated test statistic t or the p-value cannot be interpreted reliably.

Let's continue with the Spearman correlation. The Spearman rank correlation is the non-parametric counterpart of the Pearson correlation, but there is an important difference between both correlation coefficients: Spearman correlation does not use the raw data but the ranks of the data. Let's look at this with an example. We measure the reaction time of 8 computer players and ask their age. When we calculate a Pearson correlation, we simply take the two variables reaction time and age and calculate the Pearson correlation coefficient. However, we now want to calculate the Spearman rank correlation, so
first we assign a rank to each person for reaction time and age. The reaction time is already sorted by size: 12 is the smallest value, so it gets rank 1; 15 is the second smallest value, so it gets rank 2; and so on and so forth. We are now doing the same with age: here we have the smallest value, there the second smallest, there the third smallest, fourth smallest, and so on and so forth.

Let's take a look at this in the scatter plots. Here we see the raw data of age and reaction time, but now we would like to use the rankings, so we form ranks from the variables age and reaction time. Through this transformation, we have now distributed the data more evenly. To calculate the Spearman correlation, we simply calculate the Pearson correlation from the ranks. So the Spearman correlation is equal to the Pearson correlation, only that the ranks are used instead of the raw values.

Let's have a quick look at that in DATAtab. Here we have the reaction time and age, and there we have the just created ranks of reaction time and age. Now we can either calculate the Spearman correlation of reaction time and age, where we get a correlation of 0.9, or we can calculate the Pearson correlation from the ranks, where we also get 0.9, so exactly the same as before. If you like, you can download the data set; you can find the link in the video description.

If there are no rank ties, we can also use this equation to calculate the Spearman correlation. r_s is the Spearman correlation, n is the number of cases, and d is the difference in ranks between the two variables. Referring to our example, we get the different d's like this: one minus one is equal to zero, two minus three is minus one, three minus two is one, and so on. Now we square the individual d's and add them all up, so the sum of d_i squared is eight. n, which is the number of people, is eight. If we put everything in, we get a correlation coefficient of 0.9. Just like the Pearson correlation coefficient r, the Spearman correlation coefficient r_s also varies between -1 and 1.
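
The two routes to the Spearman coefficient can be sketched in plain Python. This assumes the usual no-ties formula, r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)). The data set is hypothetical: only the two smallest reaction times (12 and 15) and the totals n = 8 and sum of d_i squared = 8 come from the video; the remaining values and the helper names are invented so that the numbers line up.

```python
import math

def ranks(values):
    """Rank 1 for the smallest value, 2 for the next, and so on (assumes no ties)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def spearman_from_d(rank_x, rank_y):
    """Spearman correlation from rank differences (valid only without ties)."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical data chosen so that the sum of squared rank differences is 8
reaction_time = [12, 15, 17, 18, 20, 21, 22, 26]
age = [16, 24, 20, 35, 41, 30, 52, 60]

rank_rt = ranks(reaction_time)
rank_age = ranks(age)

r_s_ranks = pearson_r(rank_rt, rank_age)          # Pearson on the ranks
r_s_formula = spearman_from_d(rank_rt, rank_age)  # shortcut formula
print(round(r_s_ranks, 3), round(r_s_formula, 3))
```

Both routes give the same value of roughly 0.9 here. With tied ranks, the shortcut formula no longer matches, and the Pearson correlation of the (averaged) ranks is the one to use.
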
Let's continue with Kendall's tau. Kendall's tau is a correlation coefficient and thus a measure of the relationship between two variables. But what is the difference between Pearson correlation and Kendall's rank correlation? In contrast to Pearson correlation, Kendall's rank correlation is a non-parametric test procedure. Thus, for the calculation of Kendall's tau, the data need not be normally distributed and the variables need only have ordinal scale levels. Exactly the same is true for the Spearman rank correlation, right? That's right. Kendall's tau is very similar to Spearman's rank correlation coefficient. However, Kendall's tau should be preferred over Spearman's correlation if very few data with many rank ties are available.

But how is Kendall's tau calculated? We can calculate Kendall's tau with this formula, where C is the number of concordant pairs and D is the number of discordant pairs. What are concordant and discordant pairs? We will now go through this with an example. Suppose two doctors are asked to rank six patients according to their physical health. One of the two doctors is now defined as a reference and the patients are sorted from 1 to 6. Now the sorted ranks are matched with the ranks of the other doctor, e.g. the patient who is in third place with the reference doctor is in fourth place with the other doctor. Now, using Kendall's tau, we want to know if there is a correlation between the two rankings.

For the calculation of Kendall's tau, we only need these ranks. We now look at each individual rank and note whether the values below are smaller or greater than itself. So we start at the first rank, 3: 1 is smaller than 3, so it gets a minus; 4 is greater, so it gets a plus; 2 is smaller, so it gets a minus; 6 is greater, so it gets a plus; and 5 is also greater, so it also gets a plus. Now we do the same for 1. Here, of course, each subsequent rank is greater than 1, so we have a plus everywhere. At rank 4, 2 is smaller, and 6 and 5 are greater. Now we do this for rank 2 and rank 6.
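
The plus and minus bookkeeping just described can be sketched in a few lines of Python. The ranks 3, 1, 4, 2, 6, 5 are the other doctor's ranks as read out above, already sorted by the reference doctor's ranking; the function name kendall_tau is just an illustrative choice, and the sketch assumes there are no tied ranks.

```python
from itertools import combinations

def kendall_tau(ranks):
    """Count concordant and discordant pairs and return (C, D, tau).

    `ranks` must already be ordered by the reference ranking and contain no ties.
    """
    concordant = 0  # the "plus" signs: a later rank is greater
    discordant = 0  # the "minus" signs: a later rank is smaller
    # combinations() yields every pair (earlier, later) in list order
    for earlier, later in combinations(ranks, 2):
        if later > earlier:
            concordant += 1
        elif later < earlier:
            discordant += 1
    tau = (concordant - discordant) / (concordant + discordant)
    return concordant, discordant, tau

other_doctor = [3, 1, 4, 2, 6, 5]
c, d, tau = kendall_tau(other_doctor)
print(c, d, round(tau, 2))  # prints: 11 4 0.47
```

For six patients there are 15 pairs in total, so the pluses and minuses always add up to 15 when there are no ties.
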
Then we can easily calculate the number of concordant and discordant pairs. We get the number of concordant pairs by counting all the pluses; in our example, we have 11 pluses in total. We get the number of discordant pairs by counting all the minuses; in our example, we have a total of four minuses. C is thus 11 and D is 4. Kendall's tau now is 11 minus 4 divided by 11 plus 4, and we get a Kendall's tau of 0.47. We get an alternative formula for Kendall's tau here, tau = 2S / (n(n - 1)), where S is C minus D, therefore 7, and n is the number of cases, i.e. 6. If we insert everything, we also get 7 divided by 15. Just like the Pearson correlation coefficient r, Kendall's tau also varies between -1 and +1.

We have again calculated the correlation coefficient using data from a sample. Now we can test if the correlation coefficient is significantly different from zero. Thus, the null hypothesis is: the correlation coefficient tau is equal to zero; there is no relationship. The alternative hypothesis is: the correlation coefficient tau is unequal to zero; there is a relationship. Therefore, we want to know if the correlation coefficient is significantly different from zero. You can analyze this either by hand or with software like DATAtab. For the calculation by hand, we can use the z distribution as an approximation. However, for this we should have at least 40 cases, so the six cases from our example are actually too few. We get the z value with this formula: z = 3 * tau * sqrt(n(n - 1)) / sqrt(2(2n + 5)), where tau is Kendall's tau and n is the number of cases.

This brings us to the last correlation analysis, the point-biserial correlation. Point-biserial correlation is a special case of Pearson correlation and examines the relationship between a dichotomous variable and a metric variable. What is a dichotomous variable and what is a metric variable? A dichotomous variable is a variable with two values, for example gender with male and female, or smoking status with smoker and non-smoker. A metric variable is, for example, the weight of a person, the salary of a person, or the electricity
consumption. So if we have a dichotomous variable and a metric variable and we want to know if there is a relationship, we can use a point-biserial correlation. Of course, we need to check the assumptions beforehand, but more about that later.

How is the point-biserial correlation calculated? As stated at the beginning, the point-biserial correlation is a special case of the Pearson correlation. But how can we calculate the Pearson correlation when a variable is nominal? Let's look at this with an example. Let's say we are interested in investigating the relationship between the number of hours studied for a test and the test result, passed or failed. We collected data from a sample of 20 students, where 12 students passed the test and eight students failed. We have recorded the number of hours each student studied for the test. To calculate the point-biserial correlation, we first need to convert the test result into numbers. We can assign a score of 1 to students who passed the test and a score of 0 to students who failed the test. Now we can either calculate the Pearson correlation of time and test result, or we use the equation for the point-biserial correlation. x1-bar is the mean value of the people who have passed and x2-bar is the mean value of the people who failed. n1 is the number of people who passed, n2 the number of people who failed, and n is the total number. But whether we calculate the Pearson correlation or we use the equation for the point-biserial correlation, we get the same result both times.

Let's take a quick look at this in DATAtab. Here we have the learning hours, the test result with passed and failed, and there the test result with 0 and 1.
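
That both routes agree can be checked with a small Python sketch. The equation shown in the video is not part of the transcript, so this assumes one common form of it: r_pb = (x1-bar - x2-bar) / s_n * sqrt(n1 * n2 / n^2), where s_n is the standard deviation of all values computed with n in the denominator. The learning hours below are hypothetical (ten invented students, not the 20 from the video), and the function names are illustrative only.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def point_biserial(values, groups):
    """r_pb for a metric variable and a dichotomous variable coded as 0/1."""
    passed_vals = [v for v, g in zip(values, groups) if g == 1]
    failed_vals = [v for v, g in zip(values, groups) if g == 0]
    n1, n2, n = len(passed_vals), len(failed_vals), len(values)
    # Standard deviation of all values, dividing by n (not n - 1)
    s_n = statistics.pstdev(values)
    mean_diff = statistics.fmean(passed_vals) - statistics.fmean(failed_vals)
    return mean_diff / s_n * math.sqrt(n1 * n2 / n ** 2)

# Hypothetical data: hours studied and test result (1 = passed, 0 = failed)
hours = [2, 5, 6, 3, 8, 7, 4, 9, 1, 6]
passed = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]

r_pb = point_biserial(hours, passed)
r_pearson = pearson_r(hours, passed)
print(round(r_pb, 3), round(r_pearson, 3))  # the two values are identical
```

Which group is coded as 1 only affects the sign of the coefficient, not its size.
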
We define the test result with 0 and 1 as metric. If we now go to correlation and calculate the Pearson correlation for these two variables, we get a correlation coefficient of 0.31. If we calculate the point-biserial correlation for learning hours and the exam result with passed and failed, we also get a correlation of 0.31.

Just like the Pearson correlation coefficient r, the point-biserial correlation coefficient r_pb also varies between -1 and 1. If we have a coefficient between -1 and less than 0, there is a negative correlation, thus a negative relationship between the variables. If we have a coefficient between greater than 0 and 1, there is a positive correlation, that is, a positive relationship between the two variables. If the result is zero, we have no correlation.

As always, with the point-biserial correlation we can also check whether the correlation coefficient is significantly different from zero. Thus, the null hypothesis is: the correlation coefficient r is equal to zero; there is no relationship. The alternative hypothesis is: the correlation coefficient r is unequal to zero; there is a relationship.

Before we get to the assumptions, here's an interesting note. When we compute a point-biserial correlation, we get the same p-value as when we compute a t-test for independent samples for the same data. So whether we test a correlation hypothesis with the point-biserial correlation or a difference hypothesis with the t-test, we get the same p-value. Now, if we compute a t-test in DATAtab with these data and we have the null hypothesis that there is no difference between the groups failed and passed in terms of the variable learning hours, we get a p-value of 0.179. And also, if we calculate a point-biserial correlation and have the null hypothesis that there is no correlation between learning hours and test result, we get a p-value of 0.179. In our example, the p-value is greater than 0.05, which is most often used as a significance level, and thus the null hypothesis is not rejected.

But what
about the assumptions for a point-biserial correlation? Here we must distinguish whether we just want to calculate the correlation coefficient or whether we want to test a hypothesis. To calculate the correlation coefficient, only one metric variable and one dichotomous variable must be present. However, if we want to test whether the correlation coefficient is significantly different from zero, the metric variable must also be normally distributed. If this is not given, the calculated test statistic t or the p-value cannot be interpreted reliably.

This brings us to the last question: what is causality, and what is the difference between causality and correlation? Causality is the relationship between a cause and an effect. In a causal relationship, we have a cause and a resultant effect. An example: coffee contains caffeine, a stimulating substance. When you drink coffee, the caffeine enters the body, affects the central nervous system, and leads to increased alertness. Drinking coffee is the cause of the feeling of alertness that comes afterwards. Without drinking coffee, the effect, i.e. the feeling of alertness, would not occur. But causality is not always so easy to determine. Clear requirements must be met in order to speak of a causal relationship, but more about that later.

So what is the difference between correlation and causality? A correlation tells us that there is a relationship between two variables. Example: there is a positive correlation between ice cream sales and the number of sunburns. However, an existing correlation cannot tell us which variable influences which, or whether a third variable is responsible for the correlation. In our example, both variables are influenced by a common cause, namely sunny weather. On sunny days, people buy more ice cream and spend more time outdoors; this can lead to an increased risk of sunburns. Causality means that there is a clear cause-effect relationship between two variables. Causality exists when you can say with certainty which variable
influences which. However, a common mistake in the interpretation of statistics is that a correlation is immediately assumed to be a causal relationship. Here is an example. The American statistician Darrell Huff found a negative correlation between the number of head lice and the body temperature of the inhabitants of an island. A negative correlation means that people with many head lice generally have a lower body temperature and people with few head lice generally have a higher body temperature. The islanders concluded that head lice were good for health because they reduced fever. So their assumption was that head lice have an effect on the temperature of the body. In reality, the correct conclusion is the other way around. In an experiment, it was possible to prove that high fever drives away lice. So the high body temperature is the cause, not the effect.

What are the conditions for talking about causality? There are two conditions for causality. Number one: there is a significant correlation between the variables. This is easy to check; we simply check whether the correlation coefficient is significantly different from zero. Number two: the second condition can be met in three ways. First, chronological sequence: there is a chronological sequence and the results of one variable occurred before the results of the other variable. Second, experiment: a controlled experiment was conducted in which the two variables can be specifically influenced. And number three, theory: there is a well-founded and plausible theory in which direction the causal relationship goes. If there is only a significant correlation but none of the other three conditions are met, we can only speak of correlation, never of causality.

Thanks for watching, and I hope you enjoyed the video.