Welcome to our full and free tutorial about statistics, in which we will uncover the tools and techniques that help us make sense of data. This video is designed to guide you through the fundamental concepts and most powerful statistical tests used in research today. From the basics of descriptive statistics to the complexities of regression and beyond, we'll explore how each method fits into the bigger picture of data analysis. And don't worry if you have no clue about statistics: we'll go through everything step by step. If you like, you can find all topics in our book as well; the link is in the video description.

So what is the outline of the video? Our video has three major parts. In the first part, we discuss what statistics is and what the differences between descriptive and inferential statistics are. In the second part, we go through the most common hypothesis tests, like the t-test and ANOVA, and we discuss the differences between parametric and non-parametric tests. In the third part, we take a look at correlation analysis and regression analysis, and finally we talk about cluster analysis. We have prepared detailed videos for each section, so let's start with the video that explains what statistics is. After this video, you will know what statistics is, what descriptive statistics is, and what inferential statistics is.

So let's start with the first question: what is statistics? Statistics deals with the collection, analysis, and presentation of data. An example: we would like to investigate whether gender has an influence on the preferred newspaper. Then gender and newspaper are our so-called variables, the things we want to analyze. In order to analyze whether gender has an influence on the preferred newspaper, we first need to collect data. To do this, we create a questionnaire that asks about gender and preferred newspaper. We then send out the survey and wait two weeks. Afterwards, we can display the received answers in a table. In this table, we have one column for each variable: one for gender and one for newspaper. Each row, on the other hand, is the response of one surveyed person. The first respondent is male and stated New York Post, the second is female and stated USA Today, and so on and so forth. Of course, the data does not have to come from a survey; it can also come from an experiment in which you, for example, want to study the effect of two drugs on blood pressure.

Now the first step is done: we have collected data, and we can start analyzing it. But what do we actually want to analyze? We did not survey the entire population; we took a sample. Now the big question is: do we just want to describe the sample data, or do we want to make a statement about the whole population? If our aim is limited to the sample itself, i.e. we only want to describe the collected data, we use descriptive statistics. Descriptive statistics provides a detailed summary of the sample. However, if we want to draw conclusions about the population as a whole, inferential statistics is used. This approach allows us to make educated guesses about the population based on the sample data.

Let us take a closer look at both methods, starting with descriptive statistics. Why is descriptive statistics so important? Let's say a company wants to know how its employees travel to work, so the company creates a survey to answer this question. Once enough data has been collected, this data can be analyzed using descriptive statistics.
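To make the data layout concrete, here is a minimal sketch (not from the video) of what such a raw data table looks like in code, using pandas; the column names and the responses are just illustrative assumptions.

```python
# Minimal sketch of a raw survey table: one row per respondent,
# one column per variable. The responses here are hypothetical.
import pandas as pd

survey = pd.DataFrame({
    "gender":    ["male", "female", "male", "female"],
    "newspaper": ["New York Post", "USA Today", "USA Today", "New York Post"],
})
print(survey)
```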
But what is descriptive statistics? Descriptive statistics aims to describe and summarize a data set in a meaningful way. It is important to note, however, that descriptive statistics only describes the collected data, without drawing conclusions about a larger population. Put simply: just because we know how some people from one company get to work, we cannot say how all working people of the company get to work. That is the task of inferential statistics, which we will discuss later.

To describe data descriptively, we now look at the four key components: measures of central tendency, measures of dispersion, frequency tables, and charts.

Let's start with the first one, measures of central tendency. Measures of central tendency are, for example, the mean, the median, and the mode. Let's first have a look at the mean. The arithmetic mean is the sum of all observations divided by the number of observations. An example: imagine we have the test scores of five students. To find the mean score, we sum up all the scores and divide by the number of scores. The mean test score of these five students is therefore 86.6.

What about the median? When the values in a data set are arranged in ascending order, the median is the middle value. If there is an odd number of data points, the median is simply the middle value; if there is an even number of data points, the median is the average of the two middle values. It is important to note that the median is resistant to extreme values, or outliers. Let's look at this example: no matter how tall the last person is, the person in the middle remains the person in the middle, so the median does not change. But if we look at the mean, it does matter how tall the last person is; the mean is therefore not robust to outliers.

Let's continue with the mode. The mode refers to the value or values that appear most frequently in a set of data. For example, if 14 people travel to work by car, six by bike, five walk, and five take public transport, then car occurs most often and is therefore the mode.

Great, let's continue with the measures of dispersion. Measures of dispersion describe how spread out the values in a data set are. Measures of dispersion are, for example, the variance and standard deviation, the range, and the interquartile range. Let's start with the standard deviation. The standard deviation indicates the average distance between each data point and the mean. But what does that mean? Each person has some deviation from the mean; now we want to know how much the persons deviate from the mean value on average. In this example, the average deviation from the mean value is 11.5 cm. To calculate the standard deviation, we can use the equation σ = √(Σ(xᵢ − x̄)² / n), where σ is the standard deviation, n is the number of persons, xᵢ is the height of each person, and x̄ is the mean value of all persons. But attention: there are two slightly different equations for the standard deviation. The difference is that we divide once by n and once by n − 1. To keep it simple: if our survey doesn't cover the whole population, we always use the equation with n − 1, that is s = √(Σ(xᵢ − x̄)² / (n − 1)), to estimate the standard deviation. Likewise, if we have conducted a clinical study, we also use this equation to estimate the standard deviation.

But what is the difference between the standard deviation and the variance? As we now know, the standard deviation is the quadratic mean of the distances from the mean; the variance is simply the squared standard deviation. If you want to know more details about the standard deviation and the variance, please watch our video.

Let's move on to the range and the interquartile range. The range is easy to understand: it is simply the difference between the maximum and the minimum value. The interquartile range represents the middle 50% of the data; it is the difference between the first quartile Q1 and the third quartile Q3. Therefore, 25% of the values are smaller than the interquartile range and 25% of the values are larger; the interquartile range contains exactly the middle 50% of the values.
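As a quick illustration, here is a minimal sketch of these measures in Python's standard library; the five test scores are hypothetical values chosen so that their mean matches the 86.6 from the example.

```python
# Descriptive statistics with Python's standard library.
import statistics

scores = [92, 85, 78, 95, 83]   # hypothetical test scores; mean = 86.6

print(statistics.mean(scores))      # arithmetic mean: sum / count -> 86.6
print(statistics.median(scores))    # middle value of the sorted data
print(statistics.stdev(scores))     # sample standard deviation (divides by n - 1)
print(statistics.pstdev(scores))    # population standard deviation (divides by n)
print(statistics.variance(scores))  # squared sample standard deviation
print(max(scores) - min(scores))    # range: maximum minus minimum

# The mode works on categorical data too:
transport = ["car"] * 14 + ["bike"] * 6 + ["walk"] * 5 + ["public transport"] * 5
print(statistics.mode(transport))   # -> "car"
```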
Before we get to the last two points, let's briefly compare measures of central tendency and measures of dispersion. Let's say we measure the blood pressure of patients. Measures of central tendency provide a single value that represents the entire data set, helping to identify a central value around which the data points tend to cluster. Measures of dispersion, like the standard deviation, the range, and the interquartile range, indicate how spread out the data points are: whether they are closely packed around the center or spread far from it. In summary, while measures of central tendency provide a central point of the data set, measures of dispersion describe how the data is spread around that center.

Let's move on to tables. Here we will have a look at the most important ones: frequency tables and contingency tables. A frequency table displays how often each distinct value appears in a data set. Let's have a closer look at the example from the beginning: a company surveyed its employees to find out how they get to work. The options given were car, bicycle, walk, and public transport, and here are the results from 30 employees: the first answered car, the next walk, and so on and so forth. Now we can create a frequency table to summarize this data. To do this, we simply enter the four possible options, car, bicycle, walk, and public transport, in the first column and then count how often they occurred. From the table it is evident that the most common mode of transport among the employees is the car, with 14 employees preferring it. The frequency table thus provides a clear and concise summary of the data.

But what if we have not only one but two categorical variables? This is where the contingency table, also called crosstab, comes in. Imagine the company doesn't have one factory but two: one in Detroit and one in Cleveland. So we also ask the employees at which location they work. If we want to display both variables, we can use a contingency table. A contingency table provides a way to analyze and compare the relationship between two categorical variables. The rows of a contingency table represent the categories of one variable, while the columns represent the categories of the other variable. Each cell in the table shows the number of observations that fall into the corresponding category combination. For example, the first cell shows that car and Detroit were answered six times.
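Here is a minimal sketch of both tables in pandas; apart from the 14/6/5/5 transport counts and the six car-and-Detroit answers mentioned above, the site assignments are made up for illustration.

```python
import pandas as pd

# Hypothetical responses matching the counts from the example:
# 14 car, 6 bicycle, 5 walk, 5 public transport.
transport = (["car"] * 14 + ["bicycle"] * 6
             + ["walk"] * 5 + ["public transport"] * 5)
# Sites are assumed, except that exactly 6 "car" answers come from Detroit.
site = (["Detroit"] * 6 + ["Cleveland"] * 8     # car
        + ["Detroit"] * 3 + ["Cleveland"] * 3   # bicycle
        + ["Detroit"] * 2 + ["Cleveland"] * 3   # walk
        + ["Detroit"] * 3 + ["Cleveland"] * 2)  # public transport
df = pd.DataFrame({"transport": transport, "site": site})

print(df["transport"].value_counts())            # frequency table
print(pd.crosstab(df["transport"], df["site"]))  # contingency table
```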
And what about the charts? Let's take a look at the most important ones. To do this, let's simply use DATAtab. If you like, you can load this sample data set with the link in the video description, or you just copy your own data into this table. Here below you can see the variables: distance to work, mode of transport, and site. DATAtab gives you a hint about the level of measurement, but you can also change it here. Now, if we only click on mode of transport, we get a frequency table, and we can also display the percentage values. If we scroll down, we get a bar chart and a pie chart. Here on the left we can adjust further settings: for example, we can specify whether we want to display the frequencies or the percentage values, or whether the bars should be vertical or horizontal. If you also select site, we get a cross table here and a grouped bar chart; for the diagrams, we can specify whether we want the chart to be grouped or stacked. If we click on distance to work and mode of transport, we get a bar chart where the height of the bars shows the mean value of the individual groups; here we can also display the dispersion. We also get a histogram, a box plot, a violin plot, and a rainbow plot. If you would like to know more about what a box plot, a violin plot, and a rainbow plot are, take a look at my videos.

Let's continue with inferential statistics. At the beginning, we briefly go through what inferential statistics is, and then I'll explain the six key components to you. So, what is inferential statistics? Inferential statistics allows us to make a conclusion, or inference, about a population based on data from a sample. What is the population and what is the sample? The population is the whole group we're interested in: if you want to study the average height of all adults in the United States, the population would be all adults in the United States. The sample is a smaller group we actually study, chosen from the population, for example 150 adults selected from the United States. And now we want to use the sample to make a statement about the population. Here are the six steps for how to do that.

Number one: the hypothesis. First we need a statement, a hypothesis, that we want to test. For example, you want to know whether a drug will have a positive effect on blood pressure in people with high blood pressure. But what's next? In our hypothesis we stated that we would like to study people with high blood pressure, so our population is all people with high blood pressure in, for example, the US. Obviously we cannot collect data from the whole population, so we take a sample from the population. Now we use this sample to make a statement about the population. But how do we do that? For this we need a hypothesis test. Hypothesis testing is a method for testing a claim about a parameter in a population using data measured in a sample. Great, that's exactly what we need. There are many different hypothesis tests, and at the end of this video I will give you a guide on how to find the right test; of course, you can find videos about many more hypothesis tests on our channel.

But how does a hypothesis test work? When we conduct a hypothesis test, we start with a research hypothesis, also called the alternative hypothesis. This is the hypothesis we are trying to find evidence for; in our case, the research hypothesis is that the drug has an effect on blood pressure. But we cannot test this hypothesis directly with a classical hypothesis test, so we test the opposite hypothesis: that the drug has no effect on blood pressure. But what does that mean? First, we assume that the drug has no effect in the population; we therefore assume that, in general, people who take the drug and people who don't take the drug have the same blood pressure on average. If we now take a random sample and it turns out that the drug has a large effect in the sample, we can ask how likely it is to draw such a sample, or one that deviates even more, if the drug actually has no effect, so if in reality there is, on average, no difference in the population. If this probability is very low, we can ask ourselves: maybe the drug does have an effect in the population, and we may have enough evidence to reject the null hypothesis that the drug has no effect. And it is this probability that is called the p value.
Let's summarize this in three simple steps. Number one: the null hypothesis states that there is no difference in the population. Number two: the hypothesis test calculates how much the sample deviates from the null hypothesis. Number three: the p value indicates the probability of getting a sample that deviates as much as our sample, or one that deviates even more, assuming the null hypothesis is true.

But at what point is the p value small enough for us to reject the null hypothesis? This brings us to the next point: statistical significance. If the p value is less than a predetermined threshold, the result is considered statistically significant. This means that the result is unlikely to have occurred by chance alone and that we have enough evidence to reject the null hypothesis. This threshold is often 0.05. Therefore, a small p value suggests that the observed data or sample is inconsistent with the null hypothesis, which leads us to reject the null hypothesis in favor of the alternative hypothesis. A large p value suggests that the observed data is consistent with the null hypothesis, and we will not reject it.

But note: there is always a risk of making an error. A small p value does not prove that the alternative hypothesis is true; it only says that it is unlikely to get such a result, or a more extreme one, when the null hypothesis is true. And again: if the null hypothesis is true, there is no difference in the population. The other way around, a large p value does not prove that the null hypothesis is true; it only says that it is likely to get such a result, or a more extreme one, when the null hypothesis is true. So there are two types of errors, which are called type I and type II errors.

Let's start with the type I error. In hypothesis testing, a type I error occurs when a true null hypothesis is rejected: in reality the null hypothesis is true, but we make the decision to reject it. In our example, it means that the drug actually had no effect, so in reality there is no difference in blood pressure whether the drug is taken or not; the blood pressure remains the same in both cases. But our sample happened to be so far off the true value that we mistakenly thought the drug was working. A type II error occurs when a false null hypothesis is not rejected: in reality the null hypothesis is false, but we make the decision not to reject it. In our example, this means the drug actually did work, and there is a difference between those who have taken the drug and those who have not, but it was just a coincidence that the sample taken did not show much difference, and we mistakenly thought the drug was not working.

And now I'll show you how DATAtab helps you to find a suitable hypothesis test and, of course, calculates it and interprets the results for you. Let's go to datatab.net and copy your own data in here; we will just use this example data set. After copying your data into the table, the variables appear down here. DATAtab automatically tries to determine the correct level of measurement, but you can also change it up here. Now we just click on hypothesis testing and select the variables we want to use for the calculation of the hypothesis test. DATAtab will then suggest a suitable test, for example in this case a chi-square test or, in that case, an analysis of variance. Then you will see the hypotheses and the results; if you're not sure how to interpret the results, click on "Summary in words". Further, you can check the assumptions and decide whether you want to calculate a parametric or a non-parametric test.
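To make the type I error rate concrete, here is a small simulation sketch (not from the video): both groups are drawn from the same distribution, so the null hypothesis is true by construction, and a test at the 5% level should reject in roughly 5% of samples. The group sizes and the blood-pressure distribution are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
trials = 10_000
rejections = 0
for _ in range(trials):
    a = rng.normal(loc=120, scale=10, size=30)  # "drug" group, no real effect
    b = rng.normal(loc=120, scale=10, size=30)  # "no drug" group
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        rejections += 1       # a type I error: a true null hypothesis rejected

print(rejections / trials)    # close to 0.05
```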
Now we know the differences between descriptive and inferential statistics. Our next step is to take a closer look at inferential statistics and at choosing the appropriate hypothesis test. Which hypothesis test you can use depends on the level of measurement of your data. There are four levels of measurement: nominal, ordinal, interval, and ratio, and here is an easy explanation for you.

In this video, we are going to explore the four levels of measurement: nominal, ordinal, interval, and ratio. Each level gives us important information about the variable and supports different types of statistical analysis. By the end of this video, you will know what the levels of measurement are and, especially, you will understand why you need these levels. So whether you are analyzing survey data, optimizing business operations, or studying for a statistics exam: stay tuned.

What are levels of measurement? Levels of measurement refer to the different ways that variables can be quantified or categorized. If you have a data set, every variable in the data set corresponds to one of the four primary levels of measurement: nominal, ordinal, interval, and ratio. In practice, interval and ratio data are often used to perform the same analyses; therefore, the term metric level is used to combine these two levels.

Why do you need levels of measurement? The level of measurement is crucial in statistics for several key reasons: it tells us how our data can be collected, analyzed, and interpreted. Here's why understanding these levels is so important. Different levels of measurement support different statistical analyses. For instance, the mean and standard deviation are suitable for metric data; in some cases they may be suitable for ordinal data, but only if you know how to interpret the results correctly, and it definitely makes no sense to calculate them for nominal data. The level of measurement also tells us which hypothesis tests are possible and determines the most effective type of data visualization: for example, bar charts are great for nominal data, while histograms are better suited for metric data. So each level provides different information and supports various types of statistical analysis. But attention: the level of measurement is mainly relevant at the end of the research process; however, the types of data to be collected and their form are determined at the beginning. Therefore, it is crucial to consider the level of measurement of the data from the start, to ensure that the desired tests can be conducted at the end.

So let's take a closer look at each level of measurement. What characterizes nominal variables? This is the most basic level of measurement. Nominal data can be categorized, but it is not possible to rank the categories in a meaningful way. Examples of nominal variables are gender, with the categories male and female; type of animal, with for example the categories dog, cat, and bird; or preferred newspaper. In all these cases, you can tell whether one value is the same as another, so you can distinguish the values, but it is not possible to put the categories in a meaningful order. An example: we would like to investigate whether gender has an influence on the preferred newspaper. Both variables are nominal, so when we create a questionnaire, we simply list the possible answers for both variables. Since there is no meaningful order for nominal variables, it usually does not matter in which order the categories are listed in the questionnaire. Then we can display the collected data in a table, where each row is a person with the respective answers. We can now use our data to create frequency tables or bar charts.
But what about the ordinal level of measurement? Ordinal data can be categorized and, in comparison with nominal data, it is possible to have a meaningful ranking of the categories; but the differences between ranks do not have a mathematical meaning. This means the intervals between the data points are not necessarily equal. Examples of ordinal variables are all kinds of rankings, like first, second, and third; satisfaction ratings, like very unsatisfied, unsatisfied, neutral, satisfied, and very satisfied; and levels of education, like high school, bachelor's, and master's. In a questionnaire, you could ask: how satisfied are you with your current job? In this case we have these five possible options; the answers can be categorized and there is a logical order, and that's why the variable "satisfaction with the job" is an ordinal variable.

What about metric variables? Metric variables are the highest level of measurement. Metric data is like ordinal data, but the intervals between values are equally spaced; this means that differences and sums can be formed meaningfully. Examples of metric variables are income, weight, age, and electricity consumption. If you ask for a metric variable in a questionnaire, there is usually just an input field in which the person directly enters the value, for example age or body weight.

Let's look at what we've learned so far using an example. Imagine you're conducting a survey in a school to understand how pupils get to school. Here are questions you might ask, each corresponding to a different level of measurement. The first question could be: what mode of transportation do you use to get to school: bus, car, bicycle, walk? This is of course a nominal variable: the answers can be categorized, but there is no meaningful order. This means that bus is not higher than bicycle, walk is not higher than car, and so on and so forth. If you want to analyze the results of this question, you can count how many students use each mode of transportation and present it in a bar chart. Further, you can ask: how satisfied are you with your current mode of transportation? Choices might include very unsatisfied, unsatisfied, neutral, satisfied, and very satisfied. This is of course an ordinal variable: you can rank the responses to see which mode of transportation ranks higher in satisfaction, but the exact difference between satisfied and very satisfied, for example, or the other options, isn't quantifiable. And the last question: how many minutes does it take you to get to school? Minutes to get to school is a metric variable: here you can calculate the average time to get to school and use all standard statistical measures. We can visualize this data with a histogram showing the distribution of times to get to school and compare the different transportation modes.

So, using nominal data we can categorize and count responses but cannot infer any order; ordinal data allows us to rank responses but not to measure precise differences between ranks; metric data enables us to measure exact differences between data points.

As already mentioned, the metric level of measurement can be further subdivided into the interval scale and the ratio scale. But what is the difference between interval and ratio level? Let's look at an example. In a marathon, of course, the times of the marathon runners are measured. Let's say the first one took 2 hours and the last one finished the marathon in 6 hours. Here we can say that the fastest runner was three times as fast as the slowest, or to put it the other way around, the slowest one took three times as long as the fastest one. This is possible because there is a true zero point at the beginning of the marathon, where all runners start from zero.
In this case, we have a ratio level of measurement. If, however, someone forgot to start the stopwatch at the beginning of the race and only the differences are measured, starting from the fastest runner, we don't have this true zero; now the runners cannot be put in proportion. In this case we can say how big the interval between the runners is, for example that the fastest runner is 4 hours faster than the slowest runner, but we cannot say that the fastest runner was three times as fast as the slowest. This is because we don't know the absolute values for both runners. We still have equal intervals: we can say things like "runner B finished one hour after the fastest runner" and "runner C finished 1 hour and 45 minutes after the fastest runner". The time differences are measurable and meaningful, but since there is no true zero point, we cannot say that the fastest runner was x times as fast as the slowest runner; we only know how much later the other runners finished relative to the fastest runner, but not their total running times. In this case, we have an interval level of measurement. In summary: while both interval and ratio scales have equal intervals and support similar operations like addition and subtraction, ratio scales have a true zero point, where zero represents the absence of the quantity being measured; this allows meaningful multiplication and division.

And now a little exercise to check whether everything is clear to you. First we have "state of the US", which is a nominal level of measurement: the data is used for labeling or naming categories without any quantitative value; in this case, the states are names with no inherent order or ranking. Next we have product ratings on a scale from 1 to 5, which is an example of ordinal data: here the numbers do have an order or rank, five is better than one, but the intervals between the ratings are not necessarily equal. Moving on to religious confession: like the states, this is also nominal; the categories here, such as the different religions, are for categorization and do not imply any order. Next we have CO2 emissions per year, which is measured on a metric ratio scale: this level allows the full range of mathematical operations, including meaningful ratios, and zero emissions mean no emissions at all. Then we have telephone numbers: although telephone numbers are numeric, they are categorized as nominal; they are just identifiers with no numerical value for analysis. Care level of patients is another ordinal example: this might include levels such as low, medium, and high care, which indicate an order but not the exact difference between these levels. Living space in square meters is measured on a ratio scale: like CO2 emissions, zero square meters means there is no living space, and comparisons like double or half are meaningful. Lastly, we have job satisfaction on a scale from 1 to 4: this is ordinal data; it ranks satisfaction levels, but the differences between the levels aren't quantified.
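Here is a minimal sketch of how these levels can be reflected in code, using pandas; the categories and values are illustrative assumptions, not data from the video.

```python
import pandas as pd

# Nominal: categories without a meaningful order.
transport = pd.Categorical(["bus", "car", "bicycle", "walk", "bus"])
print(pd.Series(transport).value_counts())     # counting is fine; ranking is not

# Ordinal: ordered categories, but with unknown, unequal intervals.
satisfaction = pd.Categorical(
    ["satisfied", "neutral", "very satisfied"],
    categories=["very unsatisfied", "unsatisfied", "neutral",
                "satisfied", "very satisfied"],
    ordered=True,
)
print(satisfaction.min(), satisfaction.max())  # order-based operations work

# Metric: plain numbers; means and standard deviations are meaningful.
minutes_to_school = pd.Series([12, 25, 8, 40, 15])
print(minutes_to_school.mean(), minutes_to_school.std())
```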
Now we know what the level of measurement is, and we can go through the most popular hypothesis tests and discuss when to use which test. Let's start with the video on the most common hypothesis test, the t-test. See you in a moment.

This video is about everything you need to know about the t-test. After this video, you will know what a t-test is and when you use it, what types of t-tests there are, what the hypotheses and the assumptions are, how a t-test is calculated, and how you interpret the results. Let's start with the first question: what is a t-test? The t-test is a statistical test procedure. Hm, and what does the t-test do? The t-test analyzes whether there is a significant difference between the means of two groups. For example, the two groups may be patients who received drug A and patients who received drug B, and we would now like to know if there is a difference in blood pressure between these two groups.

Now, there are three different types of t-tests: the one-sample t-test, the independent samples t-test, and the paired samples t-test.

When do we use a one-sample t-test? We use the one-sample t-test when we want to compare the mean of a sample with a known reference mean. Example: a chocolate bar manufacturer claims that its chocolate bars weigh 50 g on average. To check this, we take a sample of 30 bars and weigh them; the mean value of this sample is 48 g. Now we can use a one-sample t-test to check whether the mean of 48 g is significantly different from the claimed 50 g.

When do we use the independent samples t-test? We use the t-test for independent samples when we want to compare the means of two independent groups or samples, and we want to know if there is a significant difference between these means. Example: we would like to compare the effectiveness of two painkillers. We randomly divide 60 people into two groups; the first group receives drug A and the second group receives drug B. Using an independent t-test, we can now test whether there is a significant difference in pain relief between the two drugs.

When do we use the paired samples t-test? We use the paired samples t-test to compare the means of two dependent groups. Example: we want to know how effective a diet is. To do this, we weigh 30 people before the diet, and then weigh exactly the same people after the diet. Now we can look at the difference in weight between before and after for each subject, and we can use a paired samples t-test to test whether there is a significant difference. In a paired sample, the measurements are available in pairs; the pairs result, for example, from repeated measurements with the same people. Independent samples, in contrast, are made up of people and measurements that are independent of each other.

Here's an interesting note: the paired samples t-test is very similar to the one-sample t-test. We can also think of the paired samples t-test as having one sample that was measured at two different times. We then calculate the difference between the paired values, giving us values for one sample: the difference is, say, once −5, once +2, once −1, and so on and so forth. Now we want to test whether the mean value of the differences just calculated deviates from a reference value, in this case zero, and this is exactly what the one-sample t-test does.

What are the assumptions for a t-test? Of course, we first need a suitable sample: in the one-sample t-test we need a sample and the reference value, in the independent t-test we need two independent samples, and in the case of a paired t-test, a paired sample. The variable for which we want to test whether there is a difference between the means must be metric; examples of metric variables are age, body weight, and income, whereas, for example, a person's level of education is not a metric variable. In addition, the metric variable must be normally distributed in all three test variants; to learn how to test whether your data is normally distributed, watch my video on testing for normal distribution. In the case of an independent t-test, the variances in the two groups must also be approximately equal; you can check whether the variances are equal using Levene's test. For more information, watch my video on Levene's test.
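Here is a minimal sketch of the three t-test variants with scipy; the data are randomly generated stand-ins for the chocolate, painkiller, and diet examples, so the numbers themselves are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One-sample t-test: do 30 chocolate bars differ from the claimed 50 g?
bars = rng.normal(loc=48, scale=4, size=30)        # hypothetical weights
print(stats.ttest_1samp(bars, popmean=50))

# Independent samples t-test: pain relief under drug A vs. drug B.
relief_a = rng.normal(loc=5, scale=2, size=30)     # hypothetical scores
relief_b = rng.normal(loc=6, scale=2, size=30)
print(stats.ttest_ind(relief_a, relief_b))         # equal variances assumed by default

# Paired samples t-test: weight before vs. after a diet, same 30 people.
before = rng.normal(loc=80, scale=10, size=30)
after = before - rng.normal(loc=2, scale=1, size=30)
print(stats.ttest_rel(before, after))
# Equivalently, a one-sample test of the differences against zero:
print(stats.ttest_1samp(before - after, popmean=0))
```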
So, what are the hypotheses of the t-test? Let's start with the one-sample t-test. In the one-sample t-test, the null hypothesis is: the sample mean is equal to the given reference value, so there is no difference. And the alternative hypothesis is: the sample mean is not equal to the given reference value. What about the independent samples t-test? In the independent t-test, the null hypothesis is: the mean values in both groups are the same, so there is no difference between the two groups. And the alternative hypothesis is: the mean values in both groups are not equal, so there is a difference between the two groups. And finally, the paired samples t-test: in a paired t-test, the null hypothesis is that the mean of the differences between the pairs is zero, and the alternative hypothesis is that the mean of the differences between the pairs is not zero.

So now we know what the hypotheses are. Before we look at how the t-test is calculated, let us look at an example of why we actually need a t-test. Let's say we want to know whether there is a difference in the length of study for a bachelor's degree between men and women in Germany. Our population is therefore made up of all graduates of a bachelor's program in Germany. However, as we cannot survey all bachelor graduates, we draw a sample that is as representative as possible. We now use the t-test to test the null hypothesis that there is no difference in the population. Even if there is no difference in the population, we will certainly still see a difference in study duration in the sample: it would be very unlikely to draw a sample where the difference is exactly zero. In simple terms, we now want to know at what difference, measured in a sample, we can say that the duration of study of men and women is significantly different. And this is exactly what the t-test answers.

But how do we calculate a t-test? To do this, we first calculate the t value. To calculate the t value, we need two things: the difference between the means, and the standard deviation of the mean, which is also known as the standard error. In the one-sample t-test, we calculate the difference between the sample mean x̄ and the known reference mean μ; s is the standard deviation of the collected data and n is the number of cases. s divided by the square root of n is then the standard deviation of the mean, i.e. the standard error, so t = (x̄ − μ) / (s / √n). In the independent samples t-test, we simply calculate the difference between the two sample means; to calculate the standard error, we need the standard deviations and the numbers of cases from the first and second sample. Depending on whether we can assume equal or unequal variances for our data, there are different formulas for the standard error; read more about this in our tutorial on datatab.net. In a paired samples t-test, we only need to calculate the differences between the paired values and calculate the mean from them; the standard error is then the same as for a one-sample t-test.

So what have we learned so far about the t value? No matter which t-test we calculate, the t value will be greater if we have a greater difference between the means, and smaller if the difference between the means is smaller. Further, the t value becomes smaller when we have a larger dispersion of the means; so the more scattered the data, the less meaningful a given mean difference is. Now we want to use the t-test to see whether we can reject the null hypothesis or not. To do this, we can use the t value in two ways: either we read the critical t value from a table, or we simply calculate the p value from the t value. We'll go through both in a moment.
But what is the p value? A t-test always tests the null hypothesis that there is no difference, so first we assume that there is no difference in the population. When we draw a sample, this sample deviates from the null hypothesis by a certain amount. The p value tells us how likely it is to draw a sample that deviates from the population by the same amount as, or more than, the sample we drew. Thus, the more the sample deviates from the null hypothesis, the smaller the p value becomes. If this probability is very small, we can of course ask whether the null hypothesis holds for the population; perhaps there is a difference. But at what point can we reject the null hypothesis? This border is called the significance level, which is usually set at 5%. So if there is only a 5% chance of drawing such a sample, or one that is even more different, we have enough evidence to reject the null hypothesis; to put it simply, we then assume that there is a difference, i.e. that the alternative hypothesis is true.

Now that we know what the p value is, we can finally look at how the t value is used to determine whether or not the null hypothesis is rejected. Let's start with the path through the critical t value, which you can read from a table. To do this, we first need a table of critical t values, which we can find on datatab.net under "Tutorials" and "t-distribution". Let's start with the two-tailed case; we'll briefly look at the one-tailed case at the end of this video. Here below we see the table. First we need to decide what level of significance we want to use; let's choose a significance level of 0.05, or 5%. Then we look in the column for 1 − 0.05, which is 0.95. Now we need the degrees of freedom. In the one-sample t-test and the paired samples t-test, the degrees of freedom are simply the number of cases minus one; so if we have a sample of 10 people, there are 9 degrees of freedom. In the independent samples t-test, we add the numbers of people from both samples and subtract 2, because we have two samples. Note that the degrees of freedom can be determined in a different way depending on whether we assume equal or unequal variances.

So, if we have a 5% significance level and 9 degrees of freedom, we get a critical t value of 2.262. Now, on the one hand, we've calculated a t value with the t-test, and on the other hand we have the critical t value. If our calculated t value is greater than the critical t value, we reject the null hypothesis. For example, suppose we calculate a t value of 2.5: this value is greater than 2.262, and therefore the two means are so different that we can reject the null hypothesis. On the other hand, we can also calculate the p value for the t value we've calculated. If we enter 2.5 for the t value and 9 for the degrees of freedom, we get a p value of 0.034; the p value is less than 0.05, and we therefore reject the null hypothesis. As a check, if we enter the t value of 2.262, we get exactly a p value of 0.05, which is exactly the limit.
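You can reproduce both paths with scipy's t-distribution; this is a small sketch, not part of the video.

```python
from scipy import stats

df = 9        # degrees of freedom: 10 cases - 1
alpha = 0.05

# Critical t value for a two-tailed test at the 5% level.
print(stats.t.ppf(1 - alpha / 2, df))      # -> about 2.262

# Two-tailed p value for a calculated t value of 2.5.
print(2 * (1 - stats.t.cdf(2.5, df)))      # -> about 0.034, reject at the 5% level

# Control: the critical value itself gives a p value of almost exactly 0.05.
print(2 * (1 - stats.t.cdf(2.262, df)))
```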
If you want to calculate a t-test with DATAtab, you just need to copy your own data into this table, click on "Hypothesis test", and then select the variables of interest. For example, if you want to test whether gender has an effect on income, you simply click on the two variables and automatically get a t-test for independent samples. Here below you can read the p value, and if you're still unsure about the interpretation of the results, you can simply click on "Interpretation in words": a two-tailed t-test for independent samples (equal variances assumed) showed that the difference between female and male with respect to the dependent variable salary was not statistically significant; thus, the null hypothesis is retained.

The final question now is: what is the difference between a directed and an undirected hypothesis? In the undirected case, the alternative hypothesis is simply that there is a difference, for example: there is a difference between the salary of men and women in Germany. We don't care who earns more; we just want to know if there is a difference or not. In a directed hypothesis, we are also interested in the direction of the difference: for example, the alternative hypothesis might be that men earn more than women, or that women earn more than men. If we look at the t-distribution graphically, we can see that in the two-sided case we have a rejection range on the left and one on the right; we reject the null hypothesis if we land in either of them. With a 5% significance level, both ranges have a probability of 2.5%, together 5%. If we do a one-tailed t-test, the null hypothesis is rejected only if we are in this range, or, depending on the direction we want to test, in that range; with a 5% significance level, all 5% fall within this one range.

We have seen how the t-test is a powerful tool for comparing the means of two groups to determine if they differ significantly. But what if we extend our analysis to more than two groups? This is where the analysis of variance, or ANOVA, comes into play. So let's get started with ANOVA.

Hello and welcome! In this video I explain to you what an analysis of variance, a so-called ANOVA, is and how you can calculate it. There are different types of analyses of variance; this video is about the one-way, or single-factor, analysis of variance without repeated measures, and that's where we start. The first question is: why do you need an analysis of variance at all, and what does an analysis of variance do? An analysis of variance checks whether there are statistically significant differences between more than two groups; the analysis of variance is therefore the extension of the t-test for independent samples to more than two groups. When calculating an independent t-test, we looked at whether there is a difference, or more precisely a difference in means, between two independent groups, for example whether there is a difference in the salary of men and women. In that case, we have two groups: the group of men and the group of women. If we want to compare more than two independent groups, we use the analysis of variance. In the case of the t-test, we used an independent t-test if the two groups or samples were independent; this is the case if a person in the first group has nothing to do with a person from the second group. Exactly the same now applies to the analysis of variance without repeated measures, except that here we have at least three independent samples. If we have more than two dependent samples, we would use an analysis of variance with repeated measures.

Now let's look at an example. As the founder of DATAtab, I might be interested in whether there are differences in age between people who use DATAtab, SPSS, or R. In order to investigate this, I take a sample of people who use statistical software and ask them which statistical software they use and how old they are. I've only compared three groups in this example; of course, there could also be more groups. To analyze this example, I would now use an ANOVA. So the next question is: what research question can I answer with an ANOVA?
The research question is: is there a difference in the population between the different groups of the independent variable with respect to the dependent variable? The independent variable is the variable with the different categories; in our example, it is the statistics software used, where we have the three groups DATAtab, SPSS, and R. The dependent variable, in our example, is the age of the software users. We would like to know whether the groups of the independent variable have an influence on the dependent variable. Of course, the analysis of variance does not give us any information about the direction of the causal relationship.

But why is our research question about the population? Don't we just have a sample? Actually, we want to make a statement about the population; unfortunately, in most cases it is not possible to survey the whole population, and we can only draw a sample. The aim is to make a statement about the population based on our sample, with the help of the analysis of variance. For our example, the question would be: is there a difference between the users of the different statistical software solutions in terms of age?

But what about the hypotheses? In the case of the analysis of variance, the null hypothesis is that there are no differences between the means of the individual groups: we have our individual groups, for each of which we can calculate the mean, and our null hypothesis is that there is no difference in the means in the population. The alternative hypothesis H1 is that there is a difference between at least two group means. So our null hypothesis assumes that there is no difference, and the alternative hypothesis says that there is a difference.

All well and good; now we know what the null hypothesis is, but what does this mean graphically? How can one picture it vividly? Let's say we want to test whether there is a difference in salary between three groups: group one, group two, and group three. The salary has some dispersion: some people earn €400 a month, some €2,600, and others €6,000 a month. Thus, both in the population and in our sample, the salary is broadly distributed. Now the question is: where does this variation come from, and can we explain some of it by these three groups? So how much of the variation in salary can we explain by dividing the people into these three groups? In the extreme case, the result could be that the salary in group one has this distribution, in group two that distribution, and in group three a distribution that looks like this; in this case, the division into groups could explain a lot of the variance in the variable salary. The result would be different in this other case: here we could explain almost no variance by forming the three groups; within the groups, the variance is almost the same as in the whole sample, so it does not matter whether we form the groups or not, and the three groups have nearly no influence on the salary. If we now look at the variance within the groups, we can see that in the first case we have very small variances within the groups: within this group we have a very small variance, within that group we have a small variance, and likewise in the last group. On the other hand, the variance between the groups is very large, because the mean values of the individual groups are very far apart. In the other case, we have a very large variance within the groups, but the variance between the groups is very small, because the mean values of the groups are very close together.
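As a minimal code sketch of this idea, here is a one-way ANOVA with scipy on hypothetical ages for the three software groups from the example; the numbers are made up for illustration.

```python
from scipy import stats

# Hypothetical ages of users of three statistics tools.
datatab_users = [24, 28, 31, 27, 25]
spss_users    = [35, 32, 38, 30, 36]
r_users       = [29, 27, 33, 31, 28]

# One-way ANOVA: are at least two of the group means different?
f_stat, p_value = stats.f_oneway(datatab_users, spss_users, r_users)
print(f_stat, p_value)   # for this made-up data, p < 0.05 -> reject the null hypothesis
```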
How can we calculate an ANOVA? There are two possibilities for the calculation: either you use a statistics software like DATAtab, or you calculate the analysis of variance by hand. Admittedly, no one will calculate the analysis of variance by hand, but the knowledge is very helpful to understand more precisely how an analysis of variance works. In this video, I show you how you can easily calculate an analysis of variance online with DATAtab. To calculate an analysis of variance with DATAtab, just visit datatab.net (you can find the link in the video description below). Then you copy your own data into this table and click on this tab; under this tab you will find a variety of hypothesis tests. Here below you can see the variables you copied into the table; depending on which variables you select, DATAtab will calculate the appropriate hypothesis test. If you click on a metric variable and a nominal variable with at least three categories, DATAtab calculates an analysis of variance. Here you can read the p value; if you don't know exactly how to interpret the p value, just get the summary in words above. Furthermore, you can check the assumptions of the analysis of variance here.

Now that we understand how ANOVA helps us compare means across multiple groups, let's take it a step further. A one-way ANOVA looks at one factor at a time, but what if our study involves more than one factor? This is where the two-way ANOVA becomes essential. A two-way ANOVA allows us not only to explore the effects of each individual factor on our outcome, but also how these factors interact with each other; this gives us a deeper understanding of the dynamics within our data. Let's explore the two-way ANOVA.

What is a two-way ANOVA? A two-way ANOVA is a statistical method used to test the effect of two categorical variables on a continuous variable. The categorical variables are the independent variables, for example the variable drug type, with drugs A and B, and gender, with female and male; the continuous variable is the dependent variable, for example the reduction in blood pressure. So the two-way ANOVA is the extension of the one-way ANOVA: while a one-way ANOVA tests the effects of a single independent variable on a dependent variable, a two-way ANOVA tests the effects of two independent variables. The independent variables are called factors.

But what is a factor? A factor is, for example, the gender of a person, with the levels male and female; the type of therapy, with therapies A, B, and C; or the field of study, with medicine, business administration, psychology, and mathematics. In an analysis of variance, a factor is therefore a nominal variable. We use an ANOVA whenever we want to test whether these levels have an influence on the so-called dependent variable: you might want to test whether gender has an effect on salary, whether the therapy has an effect on blood pressure, or whether the field of study has an effect on the length of study. Salary, blood pressure, and length of study would then be the dependent variables, and in each of these cases you test whether the factor has an effect on the dependent variable. Since you only have one factor in these cases, you would use a one-way ANOVA. Okay, you're right: in the first case we have a variable with only two categories, so of course we would use the independent samples t-test.

But when do we use a two-way ANOVA? We use a two-factor analysis of variance when we have a second factor and we want to know whether this factor also has an effect on the dependent variable. For example, we would also like to know whether, in addition to gender, the highest level of education has an impact on salary; or we would like to include gender in addition to the type of therapy.
Or, in the third case, we would also like to know whether the university attended, in addition to the field of study, has an influence on the length of study. Now we don't have one factor in each of the three cases, but two factors, and since we now have two factors, we use a two-way analysis of variance.

So, in a one-way ANOVA we have one factor, from which we create the groups. If the factor we're looking at has three levels, for example three different types of drugs, we have three groups to compare. In the case of a two-way analysis of variance, the groups result from the combination of the levels of the two factors: if we have one factor with three levels and one with two levels, we have a total of six groups to compare.

But what kind of statements can we make with a two-way ANOVA? With the help of a two-way ANOVA we can answer three things: whether the first factor has an effect on the dependent variable, whether the second factor has an effect on the dependent variable, and whether there is an interaction effect between the two factors.

But what about the hypotheses? In a two-way ANOVA there are three null hypotheses, and therefore also three alternative hypotheses. The first null hypothesis is: there is no significant difference between the groups of the first factor; the alternative hypothesis: there is a significant difference between the groups of the first factor. The second null hypothesis is: there is no significant difference between the groups of the second factor; the alternative hypothesis: there is a significant difference between the groups of the second factor. And the third null hypothesis reflects the interaction effect: one factor has no influence on the effect of the other factor; the alternative hypothesis: at least one factor has an influence on the effect of the other factor.

And what about the assumptions? For the test results to be valid, several assumptions must be met. Number one, normality: the data within the groups should be normally distributed, or alternatively the residuals should be normally distributed; this can be checked with a quantile-quantile plot. Number two, homogeneity of variances: the variances of the data in the groups should be equal; this can be checked with Levene's test. Number three, independence: the measurements should be independent, i.e. the measured value of one group should not be influenced by the measured values of another group. Number four, measurement level: the dependent variable should have a metric scale level.

But how do we calculate a two-way ANOVA? Let's look at the example from the beginning: we would like to know if drug type and gender have an influence on the reduction in blood pressure. Drug type has the two levels drug A and drug B, and gender has the two levels male and female. To answer the question, we collect data: we randomly assign patients to the treatment combinations and measure the reduction in blood pressure after a month. For example, the first patient receives drug A, is male, and after one month a reduction in blood pressure of 6 was measured. Now let us answer these questions: is there a main effect of drug type on the reduction in blood pressure, is there a main effect of gender on the reduction in blood pressure, and is there an interaction effect between drug type and gender on the reduction in blood pressure? For the calculation, we can use either a statistical software like DATAtab or do it by hand. First I will show you how to calculate it with DATAtab and how to interpret the results; at the end, I will show you how to calculate the ANOVA by hand and go through all the equations.
To calculate a two-way ANOVA online, simply visit datatab.net and copy your data into this table, then click on "Hypothesis test". Under this tab you will find a lot of hypothesis tests, and depending on which variables you select, an appropriate hypothesis test will be suggested. We want to know if drug type and gender have an influence on the reduction in blood pressure, so let's just click on all three variables. DATAtab now automatically gives us a two-way ANOVA. We can read the three null and the three alternative hypotheses here. Afterwards, we get the descriptive statistics and the Levene test for equality of variances; with the Levene test we can check whether the variances within the groups are equal. The p value is greater than 0.05, so we assume equality of variances in the groups for this data. And here we see the results of the analysis of variance; we'll look at these in more detail in a moment, but if you don't know exactly how to interpret the results, you can also just click on "Summary in words". In addition, you can check here whether the requirements for the analysis of variance are met at all.

But now back to the results. Let's take a closer look at this table. The first row tests the null hypothesis of whether drug type has an effect on the reduction in blood pressure, the second row tests whether gender has an effect on the reduction in blood pressure, and the third row tests whether the interaction has an effect. You can read the p value for each row right at the back here. Let's say we set the significance level at 5%: if our calculated p value is less than 0.05, the null hypothesis is rejected, and if the calculated p value is greater than 0.05, the null hypothesis is not rejected. Thus we see that all three p values are greater than 0.05, and therefore we cannot reject any of the three null hypotheses. So neither the drug type nor gender has a significant effect on the reduction in blood pressure, and there is also no significant interaction effect.

But what does an analysis of variance actually do, and why is the word variance in "analysis of variance"? In a two-way analysis of variance, the total variance of the dependent variable is divided into the variance that can be explained by factor A, the variance that can be explained by factor B, the variance of the interaction, and the error variance. Strictly speaking, SS is not the variance but the sum of squares; we will discuss how to calculate the variance from it in a moment. But how can I picture this? The dependent variable has some variance; in our example, not everyone will have the same reduction in blood pressure. We now want to know whether we can explain some of this variance by the variables drug type and gender and their interaction; the part that we cannot explain by these three terms accumulates in the error. If the result looked like this, we would be able to explain almost all the variance by factors A and B and their interaction, and only a very small proportion could not be explained; this means that we can make a very good statement about the reduction in blood pressure using drug type, gender, and their interaction. In this other case, it would be the other way around: drug type, gender, and the interaction have almost no effect on the reduction in blood pressure, and it all adds up in the error.
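In code, such a two-way ANOVA can be run with statsmodels; this is a minimal sketch with hypothetical blood-pressure reductions, not the data from the video.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data: reduction in blood pressure by drug type and gender.
df = pd.DataFrame({
    "drug":      ["A"] * 10 + ["B"] * 10,
    "gender":    (["male"] * 5 + ["female"] * 5) * 2,
    "reduction": [6, 4, 7, 5, 7,  8, 6, 5, 7, 4,
                  5, 6, 4, 7, 5,  3, 5, 4, 6, 4],
})

# 'drug * gender' expands to both main effects plus their interaction.
model = ols("reduction ~ C(drug) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # one row per effect, with sum of squares, F, and p
```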
But how do we calculate the sums of squares, the F values, and the p values? Here we have our data: once drug type, with drug A and drug B, and once gender, with male and female; so these individuals, for example, are all male and have been given drug A. First we calculate the mean values we need. We calculate the mean value of each group: male and drug A, that is 5.8; then male and drug B, that is 5.4; and we do the same for female. Then we calculate the mean value of all males and of all females, and the mean value of drug A and of drug B. Finally, we need the total mean.

We can now start to calculate the sums of squares. Let's start with the total sum of squares: we subtract the total mean from each individual value, square the result, and add up all the values. The total mean is 5.4, so we calculate (6 − 5.4)² + (4 − 5.4)² + … and finally (3 − 5.4)², and we get a sum of squares of 84.8. The degrees of freedom are given by n · p · q − 1, where n is the number of people in a group (in our case five) and p and q are the numbers of categories of the two factors (in both cases two). The total variance is calculated by dividing the sum of squares by the degrees of freedom, so we get 4.46.

Now we can calculate the sum of squares between the groups. For this, we take each group mean minus the total mean: (5.8 − 5.4)² + (5.4 − 5.4)², plus the same for these two values. We get 7.6. In this case the degrees of freedom are three, which gives us a variance of 2.53. Now we can calculate the sum of squares of factor A; Ā is the mean of each category of factor A, so we calculate 5.9 minus the total mean and 4.9 minus the total mean, which results in 5. Together with the degrees of freedom, we can now calculate the variance for factor A, which is 5. We do the same for factor B: in this case, we use the mean values of male and female, and we get a variance of 0.8. Now we can calculate the sum of squares for the interaction: we obtain it by taking the sum of squares between the groups minus the sums of squares of A and B. The degrees of freedom come to one, so for the interaction we get a variance of 1.8. Finally, we can calculate the sum of squares of the error: we subtract the mean value of each group from the respective group values, so in this group we subtract 5.8 from each individual value, in this group we subtract 5.4, here we subtract 6, and there we subtract 4.4. This gives us a sum of squares of 77.2; the degrees of freedom are 16, and we get a variance of 4.83.

And now we calculate the F values. These are obtained by dividing the variance of factor A, factor B, or the interaction by the error variance. So we get the F value for factor A by dividing the variance of factor A by the error variance, which equals 1.04. We can now do exactly the same for F_B and F_AB. To verify: we get exactly the same values with DATAtab, 1.04, 0.17, and 0.37. For the calculation of the p value, you need the degrees of freedom and the F distribution; with these values, you can either look up the critical value in a table or, as usual, just use software to calculate the p values. You can find a table of critical F values on DATAtab, e.g. for a significance level of 5%. If the calculated F value is greater than the critical value read from the table, the null hypothesis is rejected; otherwise, it is not.
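The last step, going from the F values and degrees of freedom to p values and the critical value, can be reproduced with scipy's F distribution; this small sketch is not from the video.

```python
from scipy import stats

df_effect, df_error = 1, 16   # each factor and the interaction have 1 df; error has 16

# p values for the F values calculated by hand above.
for name, f_val in [("drug type", 1.04), ("gender", 0.17), ("interaction", 0.37)]:
    p = 1 - stats.f.cdf(f_val, df_effect, df_error)
    print(name, round(p, 3))   # all well above 0.05 -> no null hypothesis rejected

# Critical F value at the 5% level, as would be read from an F table.
print(round(stats.f.ppf(0.95, df_effect, df_error), 2))   # about 4.49
```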
We've seen how ANOVA allows us to compare means across different groups to determine if there are significant differences. But what if our research design involves measurements taken from the same subject at different time points? This is where dependency among the observations comes into play. Let's dive into how the repeated measures ANOVA adjusts our approach to interconnected data points. This video is about the repeated measures ANOVA. We will go through the following questions: What is a repeated measures analysis of variance? What are the hypotheses and the assumptions? How is the repeated measures ANOVA calculated? How are the results interpreted? And what is a post-hoc test and how do you interpret it? We'll go through all points using a simple example. Let's start with the first question: what is a repeated measures ANOVA? A repeated measures analysis of variance tests whether there is a statistically significant difference between three or more dependent samples. What are dependent samples? In a dependent sample, the same participants are measured multiple times under different conditions or at different time points; we therefore have several measurements from each person involved. Let's take a look at an example. Let's say we want to investigate the effectiveness of a training program, and for this we've started looking for volunteers to participate. In order to investigate the effectiveness of the program, we measure the physical fitness of the participants at several points in time: before the training program, immediately after completion, and two months later. So for each participant we have a value for physical fitness before the program, a value immediately after completion, and a value two months later. And since we are measuring the same participants at different points in time, we are dealing with dependent samples. Now of course it doesn't have to be about people or points in time. In a generalized way we can say: in a dependent sample, the same test units are measured several times under different conditions. The test units can be people, animals or cells, for example, and the conditions can be time points or treatments. But what is the purpose of the repeated measures ANOVA? We want to know whether the fitness program has an influence on physical fitness, and it is precisely this question that we can answer with the help of an ANOVA with repeated measures. Physical fitness is therefore our dependent variable, and time is our independent variable, with the time points as levels. So the analysis of variance with repeated measures checks whether there is a significant difference between the different time points. But isn't that what the paired samples t-test does? Doesn't it also test whether there is a difference between dependent samples? That's correct: the paired samples t-test evaluates whether there is a difference between two dependent groups. The repeated measures ANOVA extends this concept, allowing you to examine differences among three or more dependent groups. What are the hypotheses for the repeated measures ANOVA? The null hypothesis is that there are no differences between the means of the different conditions or time points. In other words, the null hypothesis assumes that each person has the same value at all times; the values of the individual persons themselves may differ, but one and the same person always has the same value. The alternative hypothesis, on the other hand, is that there is a difference between the dependent groups. In our example, the null hypothesis states that the training program has no influence on physical fitness, i.e. that physical fitness does not change over time, and the alternative hypothesis assumes that the training program does have an influence, i.e. that physical fitness changes over time. To correctly apply a repeated measures ANOVA, certain assumptions about the data must be fulfilled. Number one, normality: the dependent variable should be approximately normally distributed. This can be tested using the Q-Q plot or the Kolmogorov-Smirnov test; for more information please watch my video on testing for normal distribution,
you can find the link in the video description. Number two, sphericity: the variances of the differences between all combinations of factor levels or time points should be the same. This can be tested with the help of Mauchly's test for sphericity. If the resulting p-value is greater than 0.05, we can assume that the variances are equal and the assumption is not violated. In this case the p-value is greater than 0.05, therefore this assumption is fulfilled. If the assumption is violated, adjustments such as Greenhouse-Geisser or Huynh-Feldt can be made. Now I'll show you how to calculate and interpret an analysis of variance online with DATAtab, and then we'll go through the formulas to explain how to calculate the analysis of variance with repeated measures by hand. To calculate an ANOVA online, you simply go to datatab.net and copy your own data into this table. I use this example data set; you can find a link to load it in the video description. Make sure that your data is structured correctly, i.e. one row per participant and one column per condition or time point. Now we click on the hypothesis test tab. At the bottom we see the three variables "before", "middle" and "end" from the data set. If we now click on all of them, a repeated measures ANOVA is automatically calculated. Firstly, we can check the assumptions. Here we see that Mauchly's test for sphericity results in a p-value of 0.357; this value is greater than 0.05, and thus the assumption is fulfilled. If this is not the case, you can apply a sphericity correction. I will explain how to test for normal distribution in a separate video; in our example we will now assume normal distribution, with a lot of gnashing of teeth. If the assumption is not fulfilled, you can simply calculate the nonparametric counterpart to the repeated measures ANOVA, the Friedman test; this does not require your data to be normally distributed. First of all: if you do not know exactly how to interpret the individual tables in your analysis, you can simply click on "summary in words" or on "AI interpretation" for the tables. But now back to the results. First we see the null and the alternative hypothesis: the null hypothesis is that there is no difference between the dependent samples "before", "middle" and "end", and the alternative hypothesis is that there is a difference. At the end of the test we can say whether we reject this null hypothesis or not. Now we see the descriptive statistics and a box plot. We then get the results of the ANOVA with repeated measures in this table. The p-value is the most important value here: it is 0.01 and indicates the probability that a sample deviates as much or even more from the null hypothesis as our sample does. With a p-value of 0.01, the results are statistically significant at the conventional significance level of 0.05, which means that there are significant differences between the mean values of the three levels "before", "middle" and "end". This rejects the null hypothesis, and we assume that there is a difference between the groups and that the training program or therapy has a significant effect. If you want an interpretation of the other values in this table, simply click on "AI interpretation". Finally, here is the table for the Bonferroni post-hoc test. Since the p-value of the analysis of variance is smaller than 0.05, we know that there is a difference between one or more groups. With the post-hoc test we can now determine between which groups this difference exists. We see that there is a significant difference between "before" and "end" and between "middle" and "end"; both have a p-value of less than 0.05.
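As a side note, a repeated measures ANOVA can also be run in Python with statsmodels. The sketch below uses a made-up long-format data set (the column names subject, time and fitness are my own placeholders), so treat it as a template rather than a reproduction of the example above.

```python
# Sketch of a repeated measures ANOVA with statsmodels' AnovaRM.
# The fitness values are made up; AnovaRM expects "long" format,
# i.e. one row per subject-by-timepoint measurement.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

long = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time": ["before", "middle", "end"] * 4,
    "fitness": [5, 7, 8, 4, 6, 7, 6, 6, 9, 5, 8, 8],
})

res = AnovaRM(long, depvar="fitness", subject="subject", within=["time"]).fit()
print(res)  # F value, numerator/denominator df and the p-value for time
```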
How do you calculate an analysis of variance with repeated measures by hand? Let's say this is our data: we have five people, each of whom we measured at three different points in time. Now we can calculate the necessary mean values. First we calculate the mean value of all the data, which is 5.4. Then we calculate the mean value of the three groups: for the first group we get a mean value of 5, for the second a value of 6.1, and for the third a value of 5.1. And finally we can calculate the mean value of the three measurements for each person: so for the first person, for example, we have an average value of 8 over the three measurements, and for the last person we have an average value of 5. Now that we have all the mean values, we need to calculate the required sums of squares. But note: our goal is the so-called F value, from which we subsequently calculate a p-value. There are different ways of getting this F value; I will demonstrate one common way of doing this. Depending on which statistics textbook you use, you may come across a different formula. But back to the calculation. Let's start with the sum of squares within the subjects. We obtain this by calculating each individual value x_mi minus the mean value of the respective subject, squaring this, and adding it up: so we start with (7 − 8)² + (9 − 8)², until finally (3 − 5)² and (7 − 5)². We can then calculate the sum of squares of the treatment, i.e. the sum of squares of the three points in time. We obtain this by subtracting the total mean value from each group mean value, squaring it, and adding it up; n is the number of people in a group. So we get (5 − 5.4)² + (6.1 − 5.4)² + (5.1 − 5.4)², each multiplied by n. Now we can calculate the sum of squares of the residual. We get this by simply calculating the sum of squares within the subjects minus the sum of squares of the treatment. Alternatively, we can also use this formula here: x_mi is again the value of each individual person, ā_i is the mean value of the respective group, p̄_m is the mean value of the respective person over the three points in time, and ḡ is the total mean value. We can then calculate the mean squares; to do this we divide the respective sum of squares by the degrees of freedom. The mean square of the treatment is therefore calculated by dividing the sum of squares of the treatment by the degrees of freedom of the treatment. The degrees of freedom of the treatment are the number of factor levels minus one; so we have three time points minus one, which is two. The mean square of the residual is obtained in the same way; here the degrees of freedom are the number of factor levels minus one times the number of subjects minus one, and we get 2 · 7, which is equal to 14. Now we calculate the F value, which is done by dividing the mean square of the treatment by the mean square of the residual, or error. Finally, we calculate the p-value using the F value and the degrees of freedom from the treatment and the residual. To calculate the p-value you can simply go to this page on DATAtab; the link can be found in the video description. Here you can enter your values: our F value is 1.69, the numerator degrees of freedom, i.e. those of the treatment, are two, and the denominator degrees of freedom, i.e. those of the error, are 14. We get a p-value of 0.22. The p-value is greater than 0.05, and therefore we do not have enough evidence to reject the null hypothesis. Of course, we can then compare the results with DATAtab: to do this we copy the data back into this table and click on the variables. We can see that we also get a p-value of 0.22 here.
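The very last step, getting the p-value from F = 1.69 with 2 and 14 degrees of freedom, is a one-liner in Python if you don't want to use the page on DATAtab:

```python
# The p-value that belongs to F = 1.69 with 2 and 14 degrees of freedom.
from scipy import stats

p = stats.f.sf(1.69, 2, 14)
print(round(p, 2))  # 0.22, matching the DATAtab result above
```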
After exploring how the repeated measures ANOVA can be used to analyze data, we might wonder how to handle even more complex designs. This is where the mixed-model ANOVA comes in; let's find out how this powerful tool can help us. What is a mixed-model ANOVA, what are the hypotheses and assumptions, and how do you interpret the results of a mixed-model ANOVA? This is what we discuss in this video. Let's start with the first question: what is a mixed-model ANOVA? A mixed-model ANOVA is a statistical method used to analyze data that involves both between-subjects factors and within-subjects factors. But what are between-subjects factors and within-subjects factors? Let's look at an example. Let's say we want to test whether different diets have an effect on cholesterol levels. We would like to compare the three diets A, B and C, so the factor diet has the three levels A, B and C. To test whether there is a difference between the diets, we are conducting a study with 18 participants; the individual participants are called subjects. Now we randomly assign six participants to each of the three groups; each participant, or subject, is assigned to only one group. In this case we have a between-subjects factor: different subjects are exposed to different levels of a factor. In this analysis, our objective is to determine whether significant differences exist in the mean cholesterol levels among the various groups in the study, and this is exactly what a one-way ANOVA does. Now of course we could also examine the impact of one diet across multiple time points: we could measure the cholesterol level of each participant at the start of the diet, after two weeks, and after four weeks. So the factor time has the three levels start, two weeks and four weeks, and in this case the same subjects are being exposed to all levels of the factor. This is called a within-subjects factor: the same subjects are exposed to all levels or conditions. In this case we want to know if there is a difference in the mean cholesterol levels between the different points in time, and this is exactly what a repeated measures ANOVA does; therefore, in a repeated measures ANOVA we have within-subjects factors. So in a between-subjects design, each subject or participant is only assigned to one factor level, so that the different subjects only experience the influence of their respective group. In contrast, in the within-subjects design, the same subjects or participants are exposed to all factor levels, which enables a direct comparison of the reactions to each factor level. But what if we want to test if there is a difference between diets A, B and C over the different points in time? So we want to test if there is a difference between the diets and if there is a difference between the different time points. Then we need a mixed-model ANOVA, because we have both one between-subjects factor and one within-subjects factor. So in a mixed-model ANOVA we have at least one between-subjects factor and at least one within-subjects factor in the same analysis. Note: a mixed-model ANOVA is also called a two-way ANOVA with repeated measures, because there are two factors and one of them results from repeated measures. Therefore, the mixed-model ANOVA tests whether there is a difference between more than two samples which are divided between at least two factors, one of which results from measurement repetition. With the help of a mixed-model ANOVA you can now answer three things: first, whether the within-subjects factor has an effect on the dependent variable; second,
whether the between-subjects factor has an effect on the dependent variable; and third, whether there is a so-called interaction effect between the two factors. This gives us a good transition to the hypotheses. The first null hypothesis is: the mean values of the different measurement times do not differ; there are no significant differences between the groups of the within-subjects factor. Then of course there is the second: the means of the different groups of the between-subjects factor do not differ. And the third null hypothesis reflects the interaction effect: one factor has no effect on the effect of the other factor. What are the assumptions of a mixed-model ANOVA? Normality: the dependent variable should be approximately normally distributed within each group. This assumption is especially important when the sample size is small; when the sample size is large, the ANOVA is somewhat robust to violations of normality. Homogeneity of variances: the variances in each group should be equal. In a mixed-model ANOVA this needs to be true for both the within-subjects and between-subjects factors; Levene's test can be used to check this assumption. Homogeneity of covariances (sphericity): this applies to the within-subjects factors and assumes that the variances of the differences between all combinations of the different groups are equal. What does that mean? Let's start with the differences between all combinations. To do this we simply need to calculate the difference of the first group minus the second group, the difference of the first group and the third group, and the difference of the second group and the third group. These calculated differences should now have the same variance. This assumption can be tested using Mauchly's test of sphericity; when it is violated, adjustments to the degrees of freedom such as Greenhouse-Geisser or Huynh-Feldt can be used. Independence of observations: this assumes that the observations are independent of each other; this is a fundamental assumption in ANOVA and is usually assured by the study design. No significant outliers: outliers can have a disproportionate effect on the ANOVA, potentially causing misleading results, so it's important to identify and address them. Let's calculate an example, and I'll show you how to interpret the results. Let's say this is the data we want to analyze: we want to know whether the therapies A, B and C and three different time points have an effect on cholesterol levels. Each row is one person; the therapy is the between-subjects factor, and time, with the levels before the therapy, in the middle of the therapy and at the end of the therapy, is the within-subjects factor. So the first patient on therapy A had a cholesterol level of 165 before therapy, a cholesterol level of 145 in the middle, and a cholesterol level of 140 at the end. Let's first calculate the example online with DATAtab and then discuss how to interpret the results. To calculate an ANOVA online, simply go to datatab.net and copy your data into this table; you can also load this example data set using the link in the video description. Then click on hypothesis testing. Under this tab you will find a lot of hypothesis tests, and depending on which variables you click on, you will get an appropriate hypothesis test suggested. If you copy your data up here, the variables will appear down here; if the correct scale level is not automatically detected, you can simply change it under the variables view. For example, if we click on "before", "middle" and "end", a repeated measures ANOVA is automatically calculated. But we
also want to include the therapy, so we just click on therapy. Now we get a mixed-model ANOVA. We can read the three null and the three alternative hypotheses here. Then we get the descriptive statistics output, and here we see the results of the analysis of variance and also the post-hoc test; we'll look at these again in detail in a moment. But if you don't know exactly how to interpret the results, you can also click on "summary in words". But now back to the results. Most important in this table are these three rows: with them you can check whether the three null hypotheses we stated before are rejected or not. The first row tests the null hypothesis of whether the cholesterol level changes over time, i.e. whether time has an effect on the cholesterol level; the second row tests whether there is a difference between the respective forms of therapy with respect to the cholesterol level; and the last row checks if there is an interaction between the two factors. You can read the p-value in each case right at the back here. Let's say we set the significance level at 5%: if our calculated p-value is less than 0.05, then the respective null hypothesis is rejected, and if the calculated p-value is greater than 0.05, then the null hypothesis is not rejected. Thus we see here that the p-value of "before", "middle" and "end" is less than 0.05, and therefore the values at before, middle and end are significantly different in terms of cholesterol levels. The p-value in the second row is greater than 0.05; therefore the types of therapy have no significant influence on the cholesterol level. It is important to note that the mean value over the three time points is considered here. It could also be that in one therapy the cholesterol level increases and in the other therapy it decreases, but on average over the time points it is the same; if this were the case, however, we would have an interaction between the therapies and time. We test this with the last hypothesis. In this case there is no significant interaction between therapy and time. So there is an influence over time, but it does not matter which therapy is used; the therapy has no significant influence. If one of the two factors has a significant influence, the following two tables show which of the combinations differ significantly.
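For completeness, a mixed-model ANOVA can also be computed in Python, for example with the pingouin package (this is an assumption about your setup; pingouin has to be installed separately). Only the first patient's values (165, 145, 140) are taken from our example; the rest of the data frame is made up, so the printed table will not reproduce the exact results from DATAtab.

```python
# Sketch of a mixed-model ANOVA with the pingouin package.
# Only the first patient's values (165, 145, 140) come from the example;
# everything else is made up, so the output will differ from DATAtab's.
import pandas as pd
import pingouin as pg

values = [165, 145, 140, 158, 150, 148, 170, 160, 150,
          162, 155, 149, 168, 152, 141, 160, 148, 146]
therapy_of = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "C"}
rows = [{"subject": s, "therapy": therapy_of[s], "time": t,
         "cholesterol": values[(s - 1) * 3 + i]}
        for s in range(1, 7) for i, t in enumerate(["before", "middle", "end"])]
long = pd.DataFrame(rows)

res = pg.mixed_anova(data=long, dv="cholesterol", within="time",
                     subject="subject", between="therapy")
print(res)  # one row each for therapy, time, and their interaction
```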
So far we've explored various types of ANOVA and t-tests. These are also called parametric tests and require certain assumptions about the data, like normality. But what happens if our data doesn't meet these assumptions? This is where nonparametric tests come into play; let's compare these two families of tests to understand their differences. Hi, in this video I explain the difference between parametric and nonparametric hypothesis tests. Why are you interested in this topic? You want to calculate a hypothesis test, but you don't know exactly what the difference is between a parametric and a nonparametric test, and you're wondering when to use which test. If you want to calculate a hypothesis test, you must first check the assumptions for the test. One of the most common assumptions is that your data is normally distributed. In simple terms: if your data is normally distributed, parametric tests are used, such as the t-test, the analysis of variance or Pearson correlation. If your data is not normally distributed, nonparametric tests are used, such as the Mann-Whitney U test or Spearman's correlation. What about the other assumptions? Of course, you still need to check whether there are other assumptions for the test; in general, however, nonparametric tests make fewer assumptions than parametric tests. So why use parametric tests at all? Parametric tests are generally more powerful than nonparametric tests. What does that mean? Here's an example: you have formulated your null hypothesis that men and women are paid equally. Whether this null hypothesis is rejected depends on the difference in salary, the dispersion of the data, and the sample size. In a parametric test, a smaller difference in salary or a smaller sample is usually sufficient to reject the null hypothesis. So if possible, always use parametric tests. What is the structural difference between parametric and nonparametric tests? Let's take a look at the Pearson correlation and Spearman's rank correlation, as well as the t-test for independent samples and the Mann-Whitney U test. Let's start with the Pearson and Spearman correlation. Spearman's rank correlation is the nonparametric counterpart to the Pearson correlation. What is the difference between the two correlation coefficients? Spearman correlation does not use the raw data but the ranks of the data. Let's look at an example: we measure the reaction time of eight computer players and ask their age. When we calculate a Pearson correlation, we simply take the two variables reaction time and age and calculate the Pearson correlation coefficient. However, we now want to calculate Spearman's rank correlation, so first we assign a rank to each person for reaction time and for age. The reaction time is already sorted by size: 12 is the smallest value, so it gets rank one; 15 is the second smallest, so it gets rank two, and so on and so forth. We now do the same with age: here we have the smallest value, there the second smallest, here the third smallest, the fourth smallest, and so on. Let's take a look at this in a scatter plot. Here we see the raw data of age and reaction time, but now we would like to use the rankings, so we form ranks from the variables age and reaction time. Through this transformation we have now distributed our data more evenly. To get Spearman's correlation, we simply calculate the Pearson correlation from the ranks. So Spearman correlation is equal to Pearson correlation, except that the ranks are used instead of the raw values. What about the t-test for independent samples and the Mann-Whitney U test? The t-test for independent samples and the Mann-Whitney U test check whether there is a difference between two groups. An example: is there a difference between the reaction times of men and women? The Mann-Whitney U test is the nonparametric counterpart to the t-test for independent samples, but there's an important difference between the two tests. The t-test for independent samples tests whether there is a mean difference: for both samples the mean value is calculated, and it is tested whether these mean values differ significantly. The Mann-Whitney U test, on the other hand, checks whether there is a rank-sum difference. How do we calculate the rank sums? For this purpose we sort all persons from the smallest to the largest value: this person has the smallest value, so gets rank one; that person has the second smallest value, so gets rank two; and this person has the third smallest value, and so on and so forth. Now we have assigned a rank to each person, and then we can simply add up the ranks of the first group and the second group. In the first group we get a rank sum of 42 and in the second group a rank sum of 36. Now we can investigate whether there is a significant difference between these rank sums. If you want to know more about the Mann-Whitney U test, check out my related video.
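You can see this "Pearson on ranks" idea directly in code. In the Python sketch below, the reaction-time and age values are invented for illustration; the point is only that spearmanr and pearsonr-on-ranks return the identical coefficient.

```python
# Spearman's correlation really is Pearson's correlation on the ranks.
# Reaction time and age values are invented for illustration.
from scipy import stats

reaction = [12, 15, 18, 21, 25, 28, 33, 38]
age = [14, 17, 21, 22, 30, 35, 38, 45]

rho, _ = stats.spearmanr(reaction, age)
r_ranks, _ = stats.pearsonr(stats.rankdata(reaction), stats.rankdata(age))
print(rho, r_ranks)  # identical coefficients
```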
So we can summarize: the raw data are used for parametric tests, and the ranks of the raw data are used for nonparametric tests. The hypothesis test you use usually depends on how many variables you have and whether it is an independent or a dependent sample. In most cases there is a nonparametric counterpart to each parametric test, so if you do not meet the assumptions for the parametric test, you can use the nonparametric counterpart. But don't worry, DATAtab will do its best to help you choose the right hypothesis test. Of course, you can calculate the most common parametric and nonparametric tests with DATAtab online: simply copy your own data into the table, and your variables will appear here below. Now click on the variables you want to calculate a hypothesis test for. For example, if you choose salary and gender, a t-test will be calculated; here you can check the assumptions, and if the assumptions are not met, you can simply click on "nonparametric" and a Mann-Whitney U test will be calculated. If you click on salary and company, an analysis of variance is calculated, or, in the nonparametric case, the Kruskal-Wallis test. As we've seen, parametric tests rely heavily on the assumption that the data are normally distributed. This leads us to an essential step in data analysis: checking our data for normality. Before applying parametric tests, it is crucial to check this assumption; otherwise we would get inaccurate results. Let's now look into different methods and statistical tests to check our data for normal distribution. In this video I show you how to test your data for normal distribution. First of all, why do you need normal distribution? Let's say you've collected data and you want to analyze this data with an appropriate hypothesis test, for example a t-test or an analysis of variance. One of the most common requirements for hypothesis tests is that the data used must be normally distributed. Data are normally distributed if the frequency distribution of the data has this bell curve. Now of course the big question is: how do you know if your data is normally distributed or not, or how can you test that? There are two ways: either you check the normal distribution analytically or graphically. We now look at both in detail. Let's start with the analytical tests for normal distribution. In order to test your data analytically for normal distribution, there are several test procedures; the best known are the Kolmogorov-Smirnov test, the Shapiro-Wilk test and the Anderson-Darling test. With all these tests you test the null hypothesis that your data are normally distributed; so the null hypothesis is that the frequency distribution of your data fits the normal distribution. In order to reject or not reject the null hypothesis, you get a p-value out of all these tests, and the big question is whether this p-value is greater or less than 0.05. If the p-value is less than 0.05, this is interpreted as a significant deviation from the normal distribution, and you can assume that your data are not normally distributed. If the p-value is greater than 0.05 and you want to be statistically clean, you cannot necessarily say that the frequency distribution corresponds to the normal distribution; you just cannot disprove the null hypothesis. In practice, however, values greater than 0.05 are taken to indicate normally distributed data. To be on the safe side, you should always take a look at the graphical solution, which we will talk about in a moment. So, in summary: all these tests give you a p-value; if this p-value is less than 0.05, you assume no normal distribution, and if it is greater than 0.05, you assume
normal distribution. For your information: with the Kolmogorov-Smirnov test and the Anderson-Darling test you can also test against distributions other than the normal distribution. Now, unfortunately, there is a big disadvantage of the analytical methods, which is why more and more people are switching to the graphical methods. The problem is that the calculated p-value is influenced by the size of the sample. Therefore, if you have a very small sample, your p-value may be much larger than 0.05, but if you have a very large sample, your p-value may be smaller than 0.05. Let's assume the distribution in your population deviates very slightly from the normal distribution. If you take a very small sample, you will get a very large p-value, and thus you will assume that the data is normally distributed. However, if you take a larger sample, the p-value becomes smaller and smaller, even though the samples come from the same population with the same distribution. Therefore, if you have a minimal deviation from the normal distribution which isn't actually relevant, the larger your sample, the smaller the p-value becomes; with a very large sample you may even get a p-value smaller than 0.05 and thus reject the null hypothesis that the data is normally distributed. To get around this problem, graphical methods are being used more and more; we'll come to that now. If the normal distribution is checked graphically, you either look at the histogram or, even better, at the Q-Q plot. If you use the histogram, you plot the normal distribution curve over the histogram of your data, and then you can see whether the distribution of your data roughly corresponds to the normal distribution curve. However, it is better to use the so-called quantile-quantile plot, or Q-Q plot for short. Here, the theoretical quantiles that the data should have if they were perfectly normally distributed are compared with the quantiles of the measured values. If the data were perfectly normally distributed, all points would lie on the line; the more the data deviates from the line, the less it is normally distributed. In addition, DATAtab plots the 95% confidence interval: if all or almost all of your data lies within this interval, it is a very strong indication that your data is normally distributed. Your data would not be normally distributed if, for example, the points form an arc and lie far away from the line in some areas. If you use DATAtab to test for normal distribution, you get the following evaluation: first you get the analytical test procedures clearly arranged in a table, then come the graphical test procedures. I will now show you how you can test your data for normal distribution with DATAtab: just copy your data into this table, click on descriptive statistics, and then select the variable you want to test for normal distribution, for example age. After that you can simply click on "test for normal distribution" here, and you will get the results down here. I know the test procedures are not actually descriptive methods, but if you want to get an overview of your data, it's usually also relevant to look at the distribution of your data. Furthermore, if you calculate a hypothesis test, for example whether gender has an influence on the salary of a person, you can check the preconditions for each hypothesis test, and you will also get the test for normal distribution. If the precondition is not met, you would click on this, and a nonparametric test, the Mann-Whitney U test, would be calculated; the Mann-Whitney U test does not need normally distributed data.
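If you want to run these analytical checks in code, here is a short Python sketch with scipy (the sample itself is randomly generated). Note the caveat from above in code form: the Kolmogorov-Smirnov p-value is only approximate when the mean and standard deviation are estimated from the same data.

```python
# Analytical normality checks with scipy; the sample is randomly generated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=40)

print(stats.shapiro(sample))                       # Shapiro-Wilk: statistic and p-value
print(stats.kstest(stats.zscore(sample), "norm"))  # Kolmogorov-Smirnov on standardized data
print(stats.anderson(sample, dist="norm"))         # Anderson-Darling: statistic and critical values
# Remember: p < 0.05 means a significant deviation from the normal distribution.
```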
Another important assumption is the equality of variances. You can check whether two or more groups have the same variance using Levene's test; let's take a look at it. What is Levene's test? Levene's test tests the hypothesis that the variances are equal in different groups; the aim is to determine whether the variances in different groups are significantly different from each other. The hypotheses for Levene's test are as follows: the null hypothesis is that the variances of the groups are equal, and the alternative hypothesis is that at least one of the groups has a different variance. When is Levene's test most commonly used? Levene's test is most often used to test the assumptions of another hypothesis test. What does that mean? Let's say your hypothesis is: there is a difference between two medications in terms of perceived pain relief. To test this hypothesis, you've collected data. Now, to test the hypothesis based on your data, you use a hypothesis test such as a t-test. Many hypothesis tests have the assumption that the variances in each group are equal, and this is where Levene's test comes in: it tells us whether this assumption is fulfilled or not. How is Levene's test calculated? Here's an example: we want to know if there is a significant difference in variance between these groups. First we simply calculate the mean of each group; then we subtract the respective group mean from each person's value. We now take the absolute value of each difference, so that the negative values become positive. From these new values, the group mean can be calculated again: the larger this group mean, the greater the variance within the group. Thus there is a smaller variance in this group than in that group. In addition, we can calculate the total mean value. Now we can calculate the squared deviations of the group means from the overall mean and sum them up, and then we can calculate the squared deviations of the individual values from the respective group means and add them up. We can now compare the two calculated sums, and that is exactly what Levene's test does. The test statistic of Levene's test is obtained with this equation: L = ((N − k) / (k − 1)) · Σᵢ nᵢ (z̄ᵢ − z̄)² / Σᵢ Σⱼ (zᵢⱼ − z̄ᵢ)², where N is the total number of cases, nᵢ is the number of cases in the i-th group, z̄ᵢ is the mean value of the i-th group, z̄ is the overall average, zᵢⱼ is the respective value in the groups, and k is the number of groups. The calculated test statistic L corresponds to an F statistic; therefore, with this value and the degrees of freedom, the p-value can be calculated. The degrees of freedom result from the number of groups minus one and the number of cases minus the number of groups. If the p-value is greater than 0.05, the null hypothesis that the variances are equal is not rejected; thus equality of variances can be assumed. If you use DATAtab and calculate an analysis of variance, you can find Levene's test under "test assumptions"; in an independent t-test you will find Levene's test at the bottom of the results. If equality of variances is not given, you can use the t-test for unequal variances.
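Levene's test is also a one-liner in Python. One detail worth knowing: scipy's default centers each group at its median (the Brown-Forsythe variant), so center="mean" is passed below to match the hand calculation just described; the three groups are made-up values.

```python
# Levene's test with scipy; the three groups are made-up values.
# center="mean" matches the hand calculation above (scipy's default,
# center="median", is the Brown-Forsythe variant of the test).
from scipy import stats

group_a = [22, 25, 24, 28, 21]
group_b = [18, 30, 15, 33, 20]
group_c = [24, 26, 23, 25, 27]

stat, p = stats.levene(group_a, group_b, group_c, center="mean")
print(stat, p)  # if p > 0.05, equality of variances is not rejected
```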
Now that we understand the importance of testing for normal distribution, we might find ourselves in a situation where our data do not meet this assumption. In this case we turn to nonparametric methods that are less sensitive to the distribution of the data. We will discuss various kinds of tests for this purpose, like the Mann-Whitney U test, the Kruskal-Wallis test, the Wilcoxon signed-rank test and the Friedman test. Let's start with the nonparametric counterpart to the t-test for independent samples: the Mann-Whitney U test. What is a Mann-Whitney U test and how is it calculated? That's what we will discuss in this video. Let's start with the first question: what is a Mann-Whitney U test? A Mann-Whitney U test tests whether there is a difference between two independent samples. An example: is there a difference between the reaction times of women and men? But the t-test for independent samples does the same: it also tests whether there is a difference between two independent samples. That's right, the Mann-Whitney U test is the nonparametric counterpart to the t-test for independent samples, but there is an important difference between the two tests. The t-test for independent samples tests whether there is a mean difference: for both samples the mean value is calculated, and it is tested whether these mean values differ significantly. The Mann-Whitney U test, on the other hand, checks whether there is a rank-sum difference. How do we calculate the rank sums? For this purpose we sort all persons from the smallest to the largest value: this person has the smallest value, so gets rank one; this person has the second smallest value, so gets rank two; and this person has the third smallest value, and so on and so forth. Now we have assigned a rank to each person, and then we can simply add up the ranks of the first group and the second group. In the first group we get a rank sum of 42 and in the second group a rank sum of 36. Now we can investigate whether there is a significant difference between these rank sums, but more on that later. The advantage of taking the rank sums rather than the difference in means is that the data need not be normally distributed; so, in contrast to the t-test, the data in the Mann-Whitney U test do not have to be normally distributed. What are the hypotheses of the Mann-Whitney U test? The null hypothesis is: in the two samples, the rank sums do not differ significantly. The alternative hypothesis is: in the two samples, the rank sums do differ significantly. Now let's go through everything with an example. First we calculate the example with DATAtab, and then we see if we can get the same results by hand. If you like, you can load the sample data set to follow the example; you can find a link in the video description. We simply go to datatab.net and open the statistics calculator. I've already loaded the data from the link here; you can also copy your own data into this table. Then all you have to do is click on the hypothesis test tab and simply select the desired variables. We measured the reaction time of a group of men and women and want to know if there is a difference in reaction time, so we click on reaction time and gender. We don't want to calculate a t-test for independent samples but a Mann-Whitney U test, so let's just click on "nonparametric test". Here we see the results of the Mann-Whitney U test. If you're not sure how to interpret the results, just click on "summary in words": for the given data, a Mann-Whitney U test showed that the difference between female and male with respect to the dependent variable reaction time was not statistically significant; thus the null hypothesis is not rejected. So now we calculate the Mann-Whitney U test by hand. For this we have plotted the values in a table: on one side we have gender with female and male, and on the other side the values for reaction time. Unfortunately, the data is not normally distributed, so we cannot use a t-test and we calculate the Mann-Whitney U test instead. First we assign a rank to each value: we pick the smallest value, which is 33, and it gets rank one; the second smallest value is 34, which gets rank two; the third smallest value is 35, which gets rank three. Now we do the same for all other
values. So now we have all ranks assigned, and we can just add up all the ranks from the women and all the ranks from the men. The rank sum is abbreviated with T, and we get T1 for female with 2 + 4 + 7 + 9 + 10 + 5, which is 37. Now we do the same for male: here we get 11 + 1 + 3 + 6 + 8, which is 29. Again, our null hypothesis is that both rank sums are equal. Now we want to calculate the p-value. For this we have calculated the rank sum for the female participants, and we have a number of cases of six, so we have six female subjects. We can now calculate U1, that is the U for the female participants, using this formula: U1 = n1 · n2 + n1 · (n1 + 1) / 2 − T1, where n1 and n2 are the numbers of cases of female and male and T1 is the rank sum of the female participants. If we insert our values, we get a U1 of 14. We now do exactly the same for the male participants, and we get a U2 of 16. So now we have calculated U1 and U2. The U for the Mann-Whitney U test is now given by the smaller of the two values, so in our case we take the minimum of 14 and 16, which is of course 14. Next we need to calculate the expected value of U, which we get with n1 · n2 / 2; in our case that is 6 · 5 / 2, which is equal to 15. Last but not least, we need the standard error of U. The standard error can be calculated with this formula, and in our case it is equal to 5.4772. With all these values we can now calculate z: the z-value results from U minus μ_U, divided by the standard error. In our case we get (14 − 15) / 5.4772, which is equal to −0.1826. So now we have the z-value, and with the z-value we can calculate the p-value. However, it should be noted that depending on how large the sample is, the p-value for the Mann-Whitney U test is calculated in different ways: for up to 25 cases, the exact values are used, which can be read from a table; for large samples, the normal distribution of the U value can be used as an approximation. In our example we would actually use the exact values; nevertheless, we assume a normal distribution here. For this we can simply go to DATAtab and calculate the p-value for a given z-value. The p-value of 0.855 is clearly greater than the significance level of 0.05, and thus the null hypothesis cannot be rejected based on this sample. How to calculate the Mann-Whitney U test with tied ranks, you can learn in our tutorial on datatab.net; you find the link in the video description.
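We can recompute exactly this hand calculation in a few lines of Python, using the rank sums from above (T1 = 37, T2 = 29):

```python
# Recomputing the Mann-Whitney example: rank sums T1 = 37 (six women)
# and T2 = 29 (five men).
import math
from scipy import stats

n1, n2, T1, T2 = 6, 5, 37, 29
U1 = n1 * n2 + n1 * (n1 + 1) / 2 - T1  # = 14
U2 = n1 * n2 + n2 * (n2 + 1) / 2 - T2  # = 16
U = min(U1, U2)                        # = 14

mu_U = n1 * n2 / 2                              # expected U = 15
se_U = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # standard error, about 5.4772
z = (U - mu_U) / se_U                           # about -0.1826

p = 2 * stats.norm.cdf(z)  # two-sided p-value via the normal approximation
print(U, round(z, 4), round(p, 3))  # 14.0 -0.1826 0.855
```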
But what if you want to compare two dependent samples and need a nonparametric test? Let's take a look at the Wilcoxon signed-rank test. In this video I will explain the Wilcoxon test: we will go through what the Wilcoxon test is, what the assumptions are and how it is calculated, and at the end I will show you how you can easily calculate the Wilcoxon test online with DATAtab. And we get started right now. The Wilcoxon test analyzes whether there is a difference between two dependent samples or not. Therefore, if you have two dependent groups and you want to test whether there is a difference between them, you can use the Wilcoxon test. Now you rightly say: hey, the t-test for dependent samples does the same thing, it also tests whether there is a difference between two dependent groups. That's correct: the Wilcoxon test is, of course, the nonparametric counterpart of the t-test for dependent samples. The special thing about the Wilcoxon test is that your data do not have to be normally distributed. To put it simply: if your data are normally distributed, you use a parametric test; in the case of two dependent samples this is the t-test for dependent samples. If your data is not normally distributed, you use a nonparametric test; in the case of two dependent samples this would be the Wilcoxon test. Now of course you could say: hm, well then I'll just always use the Wilcoxon test and I don't even have to check the normal distribution. At the end of this video I will show you why you should always use the t-test if it's possible to do so. First, a little reminder of what dependent samples are: in dependent samples, the measured values are available in pairs; the pairs result, for example, from repeated measures of the same person. But what is the difference between a t-test for dependent samples and the Wilcoxon test? The t-test for dependent samples tests whether there is a difference in means. If we have a dependent sample, say we took a value from each person once in the morning and once in the evening, then we can calculate the difference for each pair: so, for example, we would have 45 − 34, which equals 11. The t-test for dependent samples now tests whether these differences differ from zero or not. In the Wilcoxon test we don't use the differences themselves, but we form ranks and then compare these ranks with each other: 3 is the smallest value in terms of amount, so it gets rank one; 4 is the second smallest value, and it gets rank two; 6 gets rank three; and 11 gets rank four. We assign a plus to all positive values and a minus to all negative values. But don't worry, we will go through this slowly, and we'll also look at an example. Now we come to the assumptions and the hypotheses. For the Wilcoxon test, only two dependent random samples with at least ordinal-scaled characteristics need to be present; the variables do not have to satisfy any distribution curve. What should be mentioned, however, is that the distribution of the differences between the two dependent samples should be approximately symmetric. The null hypothesis in the Wilcoxon test is that there is no difference in the so-called central tendency of the two samples in the population, that is, that there is no difference between the dependent groups. The alternative hypothesis is that there is a difference in the central tendency in the population, so we expect that the two dependent groups are different. So now we finally look at a quite simple example. Let's say you have measured the reaction time of a small group of people once in the morning and once in the evening, and you want to know if there is a difference between morning and evening. In order to do this, you measure the reaction time of seven people in the morning and in the evening; the measured values are therefore available in pairs. Now you want to calculate the difference between morning and evening. If the differences were normally distributed, you would use a t-test for dependent samples; if not, you would use a Wilcoxon test. Let's just assume that there is no normal distribution and we need to calculate a Wilcoxon test. In order to do this, the first thing we do is form ranks. We look for the smallest value in terms of amount: that is −2, which gets rank one. What is the second smallest value? That is 3, which gets rank two, and so on and so forth until we have ranked all the values. The next thing is that we look at the differences and figure out which difference is positive and which is negative; for the negative differences we simply add a minus. Then we can add up the positive ranks and the negative ranks: for the positive ranks we get 7 + 2 + 3 + 4 + 6, which is equal to 22, and for the negative ranks we get 5 + 1, which is equal to 6.
If there is no difference between morning and evening, the positive and negative rank sums should be approximately equal; therefore the null hypothesis is that both rank sums are equal. But how can we test this? We have the rank sums, and we use them to calculate the test statistic W: this is simply the minimum of T+ and T−. In our case it is the minimum of 22 and 6, and therefore the test statistic W is 6. Now we can calculate the value for T, or W, that we would expect if there were no difference between morning and evening; in this case we would get a value of 14. Therefore, if there is no difference between morning and evening, we would actually expect a value for T+ and T− of 14, and thus W would also be 14. Further, we can calculate the standard deviation; this is given by this, to be fair, a little bit complicated formula. Once we are finished with that, we can calculate the z-value: the z-value is obtained by calculating W minus μ and dividing that by the standard deviation. So we compare the value that would be expected if there were no difference with the value that actually occurred. Note that if there are more than 25 cases, a normal distribution is assumed, in which case we can calculate the z-value using this formula; if there are fewer than 25 values, the critical T value is read from a table of critical T values. Therefore, in our case, we would actually use the table. Now I will show you how you can easily calculate the Wilcoxon test online, and then I will go into why you should always prefer the dependent t-test to the Wilcoxon test if possible. In order to calculate the Wilcoxon test, simply go to datatab.net (you will also find the link in the video description) and copy your own data into this table. When you click on this tab, you will see the names of all the variables that you copied into the table above. Underneath this tab, many hypothesis tests are summarized, and DATAtab automatically suggests the appropriate hypothesis test for your data. If you now select morning and evening, DATAtab automatically recognizes that it is a dependent sample and calculates the dependent t-test. But we don't want to calculate a t-test, we want to calculate the Wilcoxon test, so we just click here. Now DATAtab automatically calculates the Wilcoxon test. Here we can read the negative and positive ranks, and here we see the z-value and the p-value. If you don't know exactly how this is interpreted, just look at the summary in words: it says that the morning group had lower values than the evening group, and a Wilcoxon test showed that this difference was not statistically significant (p = 0.312).
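Here too the hand calculation can be checked in a few lines of Python, using the rank sums from above. scipy.stats.wilcoxon would run the whole test from the raw pairs; the sketch below just mirrors the steps we carried out, with the normal-approximation z-value shown for illustration even though for n < 25 you would read the exact critical value from the table:

```python
# Recomputing the Wilcoxon example: positive rank sum 22, negative rank
# sum 6, n = 7 pairs. For n < 25 you would read an exact critical value
# from the table; the z-value below is just the normal approximation.
import math

T_plus, T_minus, n = 22, 6, 7
W = min(T_plus, T_minus)                             # test statistic = 6
mu_W = n * (n + 1) / 4                               # expected value = 14
sigma_W = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # standard deviation, about 5.92
z = (W - mu_W) / sigma_W
print(W, mu_W, round(sigma_W, 2), round(z, 2))
```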
And finally, we can take a look at the nonparametric counterpart of the ANOVA. Let's start with the Kruskal-Wallis test. This tutorial is about the Kruskal-Wallis test. The Kruskal-Wallis test is a hypothesis test that is used when you want to test whether there is a difference between several independent groups. Now you may wonder a little bit and say: hey, if there are several independent groups, I use an analysis of variance. That's right, but if your data are not normally distributed and the assumptions for the analysis of variance are not met, the Kruskal-Wallis test is used. The Kruskal-Wallis test is the nonparametric counterpart of the one-factor analysis of variance; I will now show you what that means. There is an important difference between the two tests: the analysis of variance tests if there is a difference in means, so when we have our groups, we calculate the mean of the groups and check if all the means are equal. When we look at a Kruskal-Wallis test, on the other hand, we don't check if the means are equal; we check if the rank sums of all the groups are equal. What does that mean now, what is a rank and what is a rank sum? In the Kruskal-Wallis test we do not use the actual measured values, but we sort all people by size, and then the person with the smallest value gets the new value, or rank, one; the person with the second smallest value gets rank two; the person with the third smallest value gets rank three; and so on and so forth, until each person has been assigned a rank. Now we have assigned a rank to each person, and then we can simply add up the ranks of the first group, the ranks of the second group, and the ranks of the third group. In this case we get a rank sum of 42 for the first group, 70 for the second group, and 47 for the third group. The big advantage is that if we do not look at the mean difference but at the rank sums, the data does not have to be normally distributed: when using the Kruskal-Wallis test, our data does not have to satisfy any distributional form, and therefore it also doesn't need to be normally distributed. Before we discuss how the Kruskal-Wallis test is calculated (and don't worry, it's really not complicated), we first take a look at the assumptions. When do we use the Kruskal-Wallis test? We use the Kruskal-Wallis test if we have a nominal or ordinal variable with more than two values and a metric variable. A nominal or ordinal variable with more than two values is, for example, the variable preferred newspaper with the values Washington Post, New York Times and USA Today; it could also be frequency of television viewing, with daily, several times a week, or never. A metric variable is, for example, the salary, well-being or weight of people. What are the assumptions now? Only several independent random samples with at least ordinal-scaled characteristics must be available; the variables do not have to satisfy any distribution curve. So the null hypothesis is: the independent samples all have the same central tendency and therefore come from the same population, or, in other words, there is no difference in the rank sums. And the alternative hypothesis is: at least one of the independent samples does not have the same central tendency as the other samples and therefore comes from a different population, or, to say it in other words again, at least one group differs in its rank sum. So the next question is: how do we calculate a Kruskal-Wallis test? It's not difficult. Let's say you have measured the reaction time of three groups, group A, group B and group C, and now you want to know if there is a difference between the groups in terms of reaction time. Let's say you've written down the measured reaction times in a table, and let's just assume that the data is not normally distributed and therefore you have to use the Kruskal-Wallis test. So our null hypothesis is that there is no difference between the groups, and we're going to test that right now. First we assign a rank to each person: this is the smallest value, so this person gets rank one; this is the second smallest value, so this person gets rank two; and we do this now for all people. If the groups have no influence on reaction time, the ranks should actually be distributed purely randomly. In the second step we now calculate the rank sum and the mean rank sum. For the first group, the rank sum is 2 + 4 + 7 + 9, which is equal to 22, and we have four people in the group, so the mean rank sum is 22 / 4, which equals 5.5.
Now we do the same for the second group: here we get a rank sum of 27 and a mean rank sum of 6.75. And for the third group we get a rank sum of 29 and a mean rank sum of 7.25. Now we can calculate the expected value of the ranks: the expected value, if there is no difference between the groups, would be that each group has a mean rank of 6.5. We've now almost got everything we need: we interviewed 12 people, so the number of cases is 12; the expected value of the ranks is 6.5; and we've also calculated the mean rank sums of the individual groups. The degrees of freedom in our case are two, and these are simply given by the number of groups minus one, which makes 3 − 1. Last, we need the variance: the variance of the ranks is given by (n² − 1) / 12, where n is again the number of people, so 12, and we get a variance of 11.92. Now we can calculate H. In our case the number of cases is 12 and we always have four people per group, so we can pull the group size out of the sum: 5.5 is the mean rank of group A, 6.75 is the mean rank of group B, and 7.25 is the mean rank of group C. This gives us a rounded H value of 0.5. As we just said, this value corresponds to the chi-square value, so now we can easily read the critical chi-square value from the table of critical chi-square values; you find this table on datatab.net. We have two degrees of freedom, and if we assume a significance level of 0.05, we get a critical chi-square value of 5.991. Our value is of course smaller than the critical chi-square value, and so, based on our example data, the null hypothesis is retained. And now I will show you how you can easily calculate the Kruskal-Wallis test online with DATAtab. In order to do this, you simply visit datatab.net (you will find a link in the video description), click on the statistics calculator, and insert your own data into this table. Further, you click on this tab; under this tab you will find many hypothesis tests, and when you select the variables you want to test, DATAtab will suggest the appropriate test. After you've copied your data into the table, you will see reaction time and group right here at the bottom. Now we simply click on reaction time and group, and DATAtab automatically calculates an analysis of variance for us. But we don't want an analysis of variance, we want the nonparametric test, so we just click here. Now DATAtab automatically calculates the Kruskal-Wallis test. We also get a chi-square value of 0.5, the degrees of freedom are two, and the calculated p-value is 0.779. And here below you can read the interpretation in words: a Kruskal-Wallis test showed that there is no significant difference between the categories (p = 0.779); therefore, with the data used, the null hypothesis is not rejected.
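Again we can verify the hand calculation in Python: the H value from the rank sums, the critical chi-square value, and the p-value that DATAtab reported:

```python
# Recomputing the Kruskal-Wallis example: rank sums 22, 27 and 29,
# four people per group, N = 12 cases in total.
from scipy import stats

rank_sums, n_per_group, N, k = [22, 27, 29], 4, 12, 3
H = 12 / (N * (N + 1)) * sum(T**2 / n_per_group for T in rank_sums) - 3 * (N + 1)
print(round(H, 2))  # 0.5

df = k - 1  # 2 degrees of freedom
print(round(stats.chi2.ppf(0.95, df), 3))  # critical value 5.991
print(round(stats.chi2.sf(H, df), 3))      # p = 0.779, as reported by DATAtab
```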
If we have three or more dependent samples, we can use the Friedman test as a nonparametric alternative to the ANOVA with repeated measures. This video is about the Friedman test, and we start right away with the first question: what is the Friedman test? The Friedman test analyzes whether there are statistically significant differences between three or more dependent samples. What is a dependent sample again? In a dependent sample, the measured values are linked. For example, if a sample is drawn of people who have knee surgery, and these people are interviewed before the surgery and one and two weeks after the surgery, it is a dependent sample; this is the case because the same person was interviewed at multiple time points. Now you might rightly say that the analysis of variance with repeated measures tests exactly the same thing, since it also tests whether there is a difference between three or more dependent samples. That is correct: the Friedman test is the nonparametric counterpart of the analysis of variance with repeated measures. But what is the difference between the two tests? The analysis of variance tests the extent to which the measured values of the dependent samples differ; in the Friedman test, on the other hand, it is not the actual measured values that are used, but the ranks. The time point where a person has the highest value gets rank one, the time point with the second highest value gets rank two, and the time point with the smallest value gets rank three. This is now done for all people, or for all rows. Afterwards, the ranks of the single points in time are added up: at the first time point we get a sum of seven, at the second time point we get a sum of eight, and at the third time point we get a sum of nine. Now we can check how much these rank sums differ from each other. But why are rank sums used? The big advantage is that if you don't look at the mean differences but at the rank sums, the data doesn't have to be normally distributed. To put it simply: if your data are normally distributed, parametric tests are used; for more than two dependent samples this is the analysis of variance with repeated measures. If your data are not normally distributed, nonparametric tests are used; for more than two dependent samples this is the Friedman test. This leads us to the research question that you can answer with the Friedman test: is there a significant difference between more than two dependent groups? Let's have a look at that with an example. You might be interested to know whether therapy after a slipped disc has an influence on the patients' perception of pain. For this purpose, you measure the pain perception before the therapy, in the middle of the therapy, and at the end of the therapy. Now you want to know if there is a difference between the different times. So your independent variable is time, or the therapy progressing over time, and your dependent variable is the pain perception. You now have a history of the pain perception of each person over time, and you want to know whether the therapy has an influence on the pain perception. Simplified: in this case the therapy has an influence, and in that case the therapy has no influence on the pain perception; in the course of time the pain perception does not change here, but in this case it does. Now we also have a good transition to the hypotheses. In the Friedman test, the null hypothesis is: there are no significant differences between the dependent groups. And the alternative hypothesis is: there is a significant difference between the dependent groups. Of course, as already mentioned, the Friedman test does not use the true values but the ranks. We will go through the formula behind the Friedman test in a moment; this brings us to the point of how to calculate the Friedman test. For the calculation of the Friedman test, you can of course simply use DATAtab, or you can calculate it by hand. To be honest, hardly anyone will calculate the Friedman test by hand, but it will help you to understand how the Friedman test works, and don't worry, it's not that complicated. First I will show you how to calculate the Friedman test with DATAtab, and then I will show you how to do it by hand. In order to do this, simply go to datatab.net and copy your own data into this table. Let's say you want to investigate whether there is a difference in the reaction time of people in the morning, at noon, and in the evening. We simply click on this tab; under this tab you will find many hypothesis tests,
and DATAtab will automatically suggest an appropriate test. If we click on all three variables, morning, noon and evening, DATAtab will automatically calculate an analysis of variance with repeated measures. But in our case we want to calculate the non-parametric test, so we click here, and now we get the results for the Friedman test. Up here you can read the descriptive statistics, and down here you can find the p value. If you don't know exactly how to interpret the p value, you can just read the interpretation in words down here: a Friedman test showed that there is no significant difference between the variables, chi-square = 2.57, p = 0.276. If your p value is greater than your set significance level, your null hypothesis is retained; the null hypothesis is that there is no difference between the groups. Usually a significance level of 0.05 is used, and this p value is greater than 0.05. Additionally, DATAtab gives you the post-hoc tests: if your p value is smaller than 0.05, the post-hoc test helps you to examine which of the groups really differ. So now let's look at the equation behind the Friedman test and recalculate this example by hand. Here we have the measured values of the seven people. In the first step we have to assign ranks to the values; to do this we look at each row separately. In the first row, which is the first person, 45 is the largest value, so it gets rank one; then comes 36 with rank two and 34 with rank three. We do the same for the second row: here 36 is the largest value and gets rank one, then comes 33 with rank two and 31 with rank three. Now we do this for each row, so for all people. Afterwards we can calculate the rank sum for each point in time: we just sum up all ranks at one point in time. In the morning we get 17, at noon 11 and in the evening 14. If there were no differences between the time points in terms of reaction time, we would expect the same rank sum, the expected value, at all time points. The expected value is obtained with the formula n times (k plus 1) divided by 2, and in this case it is 14. So if there is no difference between morning, noon and evening, we would actually expect a rank sum of 14 at all three time points. Next we can calculate the chi-square value. We get it with this formula: chi-square equals 12 divided by n times k times (k plus 1), multiplied by the sum of the squared rank sums, minus 3 times n times (k plus 1). n is the number of people, which is seven, k is the number of time points, so three, and the sum of the squared rank sums is 17 squared plus 11 squared plus 14 squared. So we get a chi-square value of 2.57. Now we need the number of degrees of freedom; this is given by the number of time points minus one, so in our case two. Finally we can read the critical chi-square value in the table of critical chi-square values. For this we take the predefined significance level, let's say it is 0.05, and the number of degrees of freedom. Here we can read that the critical chi-square value is 5.99. This is greater than our calculated value; therefore the null hypothesis is not rejected, and based on these data there is no difference between the responsiveness at the different time points. If the calculated chi-square value were greater than the critical one, we would reject the null hypothesis.
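Here is a minimal sketch of the Friedman test in code, using scipy.stats.friedmanchisquare. The seven people's values are hypothetical placeholders, since the full data set from the video is not reproduced in this text.

```python
# A minimal sketch of the Friedman test with SciPy. The numbers are
# hypothetical reaction times for seven people at three time points.
from scipy import stats

morning = [45, 36, 41, 38, 44, 40, 39]  # hypothetical values
noon    = [36, 33, 35, 31, 37, 34, 32]
evening = [34, 31, 36, 33, 35, 32, 30]

chi2, p = stats.friedmanchisquare(morning, noon, evening)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")

# p > 0.05 would mean the null hypothesis (no difference between
# the time points) is retained.
```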
Having explored various non-parametric tests for ordinal or non-normally distributed metric data, let's now shift our focus to nominal data. When our variables are nominal, such as gender or color preferences, we require a different kind of statistical test: the chi-square test. The chi-square test is a powerful tool for analyzing nominal data, so let's get started. What is a chi-square test, and how is the chi-square test calculated? That's what we will discuss in this video. Let's start with the first question: what is a chi-square test? The chi-square test is a hypothesis test that is used when you want to determine whether there is a relationship between two categorical variables. What are categorical variables again? Categorical variables are, for example, gender with the categories male and female, the preferred newspaper with the categories USA Today, The Wall Street Journal, The New York Times and so on, or the highest educational level with the categories without graduation, college, bachelor's degree, master's degree. So gender, preferred newspaper and highest educational level are all categorical variables. Not categorical variables are, for example, the weight of a person, the salary of a person or the power consumption. If we have two categorical variables and we want to test whether there is a relationship, we use a chi-square test. For example: is there a relationship between gender and the preferred newspaper? We have two categorical variables, so we use a chi-square test. Another example: is there a relationship between preferred newspaper and highest educational level? Here again we have two categorical variables, so we use a chi-square test. However, there are two things to note. First, the assumption for the chi-square test is that the expected frequencies per cell are greater than five; we'll go over what that means in a moment. Second, the chi-square test uses only the categories and not the rankings. In the case of the highest educational level, however, there is a ranking of categories; if you want to account for rankings, check out our tutorials on the Spearman correlation, the Mann-Whitney U test or the Kruskal-Wallis test. But how do we calculate the chi-square test? Let's go through that with an example. We would like to investigate whether gender has an influence on the preferred newspaper, so our question is: is there a relationship between gender and the preferred newspaper? Our null hypothesis is: there is no relationship between gender and the preferred newspaper. And our alternative hypothesis is: there is a relationship between gender and the preferred newspaper. So first we create a questionnaire that asks about gender and the preferred newspaper, and then we send out the questionnaire. The results of the survey are displayed in a table; in this table we see one respondent in each row. The first respondent is male and stated New York Post, the second respondent is female and stated USA Today. We can now copy this table into a statistics software like DATAtab. DATAtab then gives us the so-called contingency table. In this table you can see the variable newspaper and the variable gender, and the number of times each combination occurs is plotted in the cells: for example, in this survey there are 16 people who stated New York Post and male, or 13 people who stated New York Post and female. Now we want to know if gender has an influence on the preferred newspaper, or, put another way, is there a relationship between gender and the preferred newspaper? To answer this question we use the chi-square test. There are two ways to calculate the chi-square test: either we use a statistical software like DATAtab, or we calculate the chi-square test by hand. We start with the uncomplicated variant and use DATAtab. If you like, you can load the sample data set for the calculation; you can find the link in the video description. To calculate a chi-square test online, simply copy your own data into the table,
or use the link to load this data set. Then the variables gender and newspaper appear here below. Now we click on hypothesis tests; here you will find a variety of tests, and DATAtab will help you to choose the right one. For example, if we click on gender and newspaper, the chi-square test will be calculated automatically. Now we get the results for the chi-square test. Above we see the contingency table for the variables gender and newspaper; the contingency table shows us how often the respective combinations occur in our survey, female and USA Today, for example, occurs six times. In the second table we can see what the contingency table should actually look like if the two variables were perfectly independent, that is, if gender had no influence on the preferred newspaper. Here it is important to note that all of the expected frequencies should be larger than five so that the assumptions of the chi-square test are fulfilled, and this is the case here. The chi-square test now compares this table with that table, and here we see the results: the p value is 0.918, which is much higher than our significance level of 0.05, and therefore we keep the null hypothesis. If you don't know exactly how to interpret the results, just click on summary in words: a chi-square test was performed between gender and newspaper. All expected cell frequencies were greater than five, thus the assumptions for the chi-square test were met. There was no statistically significant relationship between gender and newspaper; this results in a p value of 0.918, which is above the defined significance level of 5%. The chi-square test is therefore not significant and the null hypothesis is not rejected. If you're unsure what exactly the p value means, just watch our video about the p value. And now we come to the question of how to calculate the chi-square test by hand, and we go through the formulas needed; don't worry, it's not difficult. We need the contingency table with the observed frequencies and the contingency table with the expected frequencies, that is, those frequencies that would occur with perfectly independent variables. You can find how to calculate the expected frequencies on DATAtab in the tutorial on the chi-square test. We can now calculate the chi-square value with this formula, where the index k stands for the respective cell, o k is the observed frequency and e k is the expected frequency. So we get (6 minus 6.08) squared divided by 6.08, plus, for the next cell, (7 minus 6.92) squared divided by 6.92. If we do this for all cells and sum everything up, we get a chi-square value of 0.504.
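In code, the observed table, the expected frequencies and the chi-square value can all be obtained with scipy.stats.chi2_contingency. Here is a minimal sketch: the counts 16, 13 and 6 come from the example above, while the remaining cells are hypothetical placeholders, so the exact result will differ from the video.

```python
# A minimal sketch of the chi-square test of independence with SciPy.
# Only the 16, 13 and 6 are from the example; the other counts are
# hypothetical placeholders.
import numpy as np
from scipy import stats

#                      male  female
observed = np.array([[16, 13],   # New York Post
                     [ 7,  6],   # USA Today (6 female, from the example)
                     [ 9,  8],   # Wall Street Journal (hypothetical)
                     [11, 10]])  # New York Times (hypothetical)

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p:.3f}")
print("expected frequencies:\n", expected.round(2))

# Check the assumption that all expected frequencies are > 5
print("assumption met:", (expected > 5).all())
```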
Now we would like to calculate the critical chi-square value. What do we need it for? If we use a statistical software, we simply get a p value displayed; if the value is smaller than the significance level, for example 0.05, the null hypothesis is rejected, otherwise not. In our example case the null hypothesis is not rejected. By hand, however, you can't really calculate the p value; therefore you read off in a table which chi-square value you would get with a p value of 0.05. This chi-square value is called the critical chi-square value. In order to determine the critical chi-square value, we need the degrees of freedom; these are obtained by taking the number of rows minus one, times the number of columns minus one. We have four rows and two columns, therefore we get 3 times 1 and thus three degrees of freedom. Now let's take a look at the table of critical chi-square values; you can find this table on DATAtab, the link is in the video description. We select a significance level of 0.05 and have three degrees of freedom; therefore our critical chi-square value is 7.815. The critical chi-square value of 7.815 is larger than our calculated chi-square value of 0.504, thus the null hypothesis is retained. Up until now we focused on statistical tests designed to compare groups or categories. Another fundamental aspect of data analysis is understanding the relationship between variables; this is where correlation comes into play. Let's transition from tests of differences to measures of association in the following video. This video is about correlation analysis. We start by asking what a correlation analysis is. We will then look at the most important correlation analyses: Pearson correlation, Spearman correlation, Kendall's tau and point-biserial correlation, and finally we will discuss the difference between correlation and causation. Let's start with the first question: what is a correlation analysis? Correlation analysis is a statistical method used to measure the relationship between two variables. For example: is there a relationship between a person's salary and age? In this scatter plot, every single point is a person. In correlation analysis we usually want to know two things: number one, how strong the correlation is, and number two, in which direction the correlation goes. We can read both from the correlation coefficient, which is between minus one and one. The strength of the correlation can be read from a table: if r is between 0 and 0.1 we speak of no correlation, and if r is between 0.7 and 1 we speak of a very strong correlation. A positive correlation exists when high values of one variable go along with high values of the other variable, or when small values of one variable go along with small values of the other variable. A positive correlation is found, for example, for body size and shoe size; the result is a positive correlation coefficient. A negative correlation exists when high values of one variable go along with low values of the other variable, and vice versa. A negative correlation usually exists between product price and sales volume; the result is a negative correlation coefficient. Now, there are different correlation coefficients. The most popular are the Pearson correlation coefficient r, the Spearman correlation coefficient r s, Kendall's tau and the point-biserial correlation coefficient r pb. Let's start with the first one, the Pearson correlation coefficient.
What is a Pearson correlation? Like all correlation coefficients, the Pearson correlation r is a statistical measure that quantifies the relationship between two variables. In the case of the Pearson correlation, the linear relationship of metric variables is measured; more about metric variables later. So with the help of the Pearson correlation we can measure the linear relationship between two variables, and of course the Pearson correlation coefficient r tells us how strong the correlation is and in which direction the correlation goes. How is the Pearson correlation calculated? The Pearson correlation coefficient is obtained via this equation, where r is the Pearson correlation coefficient, x i are the individual values of one variable, for example age, y i are the individual values of the other variable, for example salary, and x bar and y bar are respectively the mean values of the two variables. In the equation we can see that the respective mean value is first subtracted from both values. So in our example we calculate the mean values of age and salary, then subtract the mean values from each person's age and salary, then we multiply both values, and we sum up the individual results of the multiplication. The expression in the denominator ensures that the correlation coefficient is scaled between minus one and one. If we multiply two positive values, we get a positive value, so all values that lie in this area have a positive influence on the correlation coefficient. If we multiply two negative values, we also get a positive value (minus times minus is plus), so all values that lie in this area also have a positive influence on the correlation coefficient. If we multiply a positive value and a negative value, we get a negative value (minus times plus is minus), so all values that lie in these ranges have a negative influence on the correlation coefficient. Therefore, if our values lie predominantly in these two areas, we get a positive correlation coefficient and thus a positive relationship; if our values lie predominantly in these two areas, we get a negative correlation coefficient and thus a negative relationship. If the points are distributed over all four areas, the positive terms and the negative terms cancel each other out, and we get a very small or no correlation. But now there's one more thing to consider: the correlation coefficient is usually calculated with data taken from a sample. However, we often want to test a hypothesis about the population. In the case of correlation analysis, we then want to know if there is a correlation in the population; for this we check whether the correlation coefficient in the sample is statistically significantly different from zero. The null hypothesis in the Pearson correlation is: the correlation coefficient does not differ significantly from zero, there is no linear relationship. And the alternative hypothesis is: the correlation coefficient differs significantly from zero, i.e. there is a linear relationship. Attention: it is always tested whether the null hypothesis is rejected or not. In our example the research question is: is there a correlation between age and salary in the British population? To find out, we draw a sample and test whether in this sample the correlation coefficient is significantly different from zero. The null hypothesis then is: there is no correlation between salary and age in the British population. And the alternative hypothesis: there is a correlation between salary and age in the British population.
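In code, both the coefficient and the significance test come out of a single call to scipy.stats.pearsonr. A minimal sketch, with hypothetical age and salary values standing in for real survey data:

```python
# A minimal sketch of a Pearson correlation with SciPy.
# The ages and salaries are hypothetical placeholders.
from scipy import stats

age    = [23, 31, 37, 42, 48, 55, 60]
salary = [2100, 2500, 2900, 3200, 3800, 4100, 4500]

r, p = stats.pearsonr(age, salary)
print(f"r = {r:.2f}, p = {p:.4f}")

# r gives strength and direction of the linear relationship;
# p tests the null hypothesis that the correlation in the
# population is zero.
```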
Whether the correlation coefficient is significantly different from zero, based on the sample collected, can be checked using a t-test, where r is the correlation coefficient and n is the sample size; the test statistic is t = r times the square root of n minus 2, divided by the square root of 1 minus r squared. A p value can then be calculated from the test statistic t. If the p value is smaller than the specified significance level, which is usually 5%, then the null hypothesis is rejected, otherwise it is not. But what about the assumptions for a Pearson correlation? Here we must distinguish whether we just want to calculate the Pearson correlation, or whether we want to test a hypothesis. To calculate the Pearson correlation coefficient, only two metric variables need to be present. Metric variables are, for example, a person's weight, a person's salary or electricity consumption. The Pearson correlation coefficient then tells us how large the linear relationship is; if there is a non-linear relationship, we cannot tell from the Pearson correlation coefficient. However, if we want to test whether the Pearson correlation coefficient is significantly different from zero, the two variables must also be normally distributed. If this is not given, the calculated test statistic t or the p value cannot be interpreted reliably. Let's continue with the Spearman correlation. The Spearman rank correlation is the non-parametric counterpart of the Pearson correlation, but there is an important difference between the two correlation coefficients: the Spearman correlation does not use the raw data but the ranks of the data. Let's look at this with an example. We measure the reaction time of eight computer players and ask their age. If we calculated a Pearson correlation, we would simply take the two variables reaction time and age and calculate the Pearson correlation coefficient. However, we now want to calculate the Spearman rank correlation, so first we assign a rank to each person for reaction time and age. The reaction time is already sorted by size: 12 is the smallest value, so it gets rank one, 15 is the second smallest value, so it gets rank two, and so on and so forth. We now do the same with age: here we have the smallest value, here the second smallest, here the third smallest, the fourth smallest, and so on and so forth. Let's take a look at this in the scatter plot. Here we see the raw data of age and reaction time, but now we would like to use the rankings, so we form ranks from the variables age and reaction time; through this transformation we have now distributed the data more evenly. To calculate the Spearman correlation, we simply calculate the Pearson correlation from the ranks. So the Spearman correlation is equal to the Pearson correlation, only that the ranks are used instead of the raw values. Let's have a quick look at that in DATAtab. Here we have the reaction time and age, and there we have the just created ranks of reaction time and age. Now we can either calculate the Spearman correlation of reaction time and age, where we get a correlation of 0.9, or we can calculate the Pearson correlation from the ranks, where we also get 0.9, so exactly the same as before. If you like, you can download the data set; you can find the link in the video description. If there are no rank ties, we can also use this equation to calculate the Spearman correlation: r s equals 1 minus 6 times the sum of the squared rank differences, divided by n times (n squared minus 1), where r s is the Spearman correlation, n is the number of cases and d is the difference in ranks between the two variables. Referring to our example, we get the difference d for each person: 1 minus 1 is equal to 0, 2 minus 3 is minus 1, 3 minus 2 is 1, and so on. Now we square the individual d's and add them all up, so the sum of d squared is 8; n, which is the number of people, is 8. If you put everything in, we get a correlation coefficient of 0.9. Just like the Pearson correlation coefficient r, the Spearman correlation coefficient r s also varies between minus one and one.
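The rank-based definition can be verified directly in code: scipy.stats.spearmanr should give the same value as the Pearson correlation computed on the ranks. A minimal sketch with hypothetical reaction times and ages:

```python
# A minimal sketch showing that the Spearman correlation equals the
# Pearson correlation of the ranks. The data are hypothetical.
from scipy import stats

reaction_time = [12, 15, 17, 18, 21, 22, 25, 28]  # hypothetical
age           = [18, 22, 20, 25, 27, 31, 35, 40]  # hypothetical

rs, p = stats.spearmanr(reaction_time, age)

# Same result via Pearson correlation on the ranks
ranks_rt  = stats.rankdata(reaction_time)
ranks_age = stats.rankdata(age)
r_on_ranks, _ = stats.pearsonr(ranks_rt, ranks_age)

print(f"Spearman rs = {rs:.3f}, Pearson on ranks = {r_on_ranks:.3f}")
```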
Let's continue with Kendall's tau. Kendall's tau is a correlation coefficient and is thus a measure of the relationship between two variables. But what is the difference between the Pearson correlation and Kendall's rank correlation? In contrast to the Pearson correlation, Kendall's rank correlation is a non-parametric test procedure; thus, for the calculation of Kendall's tau the data need not be normally distributed, and the variables need only have an ordinal scale level. Exactly the same is true for the Spearman rank correlation, right? That's right, Kendall's tau is very similar to Spearman's rank correlation coefficient; however, Kendall's tau should be preferred over Spearman's correlation if only few data with many rank ties are available. But how is Kendall's tau calculated? We can calculate Kendall's tau with this formula: tau equals C minus D, divided by C plus D, where C is the number of concordant pairs and D is the number of discordant pairs. What are concordant and discordant pairs? We will now go through this with an example. Suppose two doctors are asked to rank six patients according to their physical health. One of the two doctors is now defined as the reference, and the patients are sorted from one to six. Now the sorted ranks are matched with the ranks of the other doctor, e.g. the patient who is in third place with the reference doctor is in fourth place with the other doctor. Now, using Kendall's tau, we want to know if there is a correlation between the two rankings. For the calculation of Kendall's tau we only need these ranks. We now look at each individual rank and note whether the values below are smaller or greater than the rank itself. So we start at the first rank, three: one is smaller than three, so it gets a minus; four is greater, so it gets a plus; two is smaller, so it gets a minus; six is greater, so it gets a plus; and five is also greater, so it also gets a plus. Now we do the same for one: here, of course, each subsequent rank is greater than one, so we have a plus everywhere. At rank four, two is smaller, and six and five are greater. Now we do this for rank two and rank six. Then we can easily calculate the number of concordant and discordant pairs. We get the number of concordant pairs by counting all the pluses; in our example we have 11 pluses in total. We get the number of discordant pairs by counting all the minuses; in our example we have a total of four minuses. C is thus 11 and D is 4. Kendall's tau now is 11 minus 4, divided by 11 plus 4, and we get a Kendall's tau of 0.47. There is an alternative formula for Kendall's tau: tau equals 2 times S, divided by n times (n minus 1), with S equal to C minus D, therefore 7, and n the number of cases, i.e. six. If we insert everything, we also get 7 divided by 15. Just like the Pearson correlation coefficient r, Kendall's tau also varies between minus one and plus one. We have again calculated the correlation coefficient using data from a sample; now we can test if the correlation coefficient is significantly different from zero. Thus the null hypothesis is: the correlation coefficient tau is equal to zero, there is no relationship. And the alternative hypothesis is: the correlation coefficient tau is unequal to zero, there is a relationship. So we want to know if the correlation coefficient is significantly different from zero. You can analyze this either by hand or with a software like DATAtab. For the calculation by hand we can use the z distribution as an approximation; however, for this we should have at least 40 cases, so the six cases from our example are actually too few. We get the z value with this formula, where we have tau and n, the number of cases. This brings us to the last correlation analysis: the point-biserial correlation.
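The two doctors' example can be checked in code with scipy.stats.kendalltau. The second doctor's ranking (3, 1, 4, 2, 6, 5) is taken from the counting walkthrough above; a minimal sketch:

```python
# A minimal sketch of Kendall's tau with SciPy, using the two
# doctors' rankings from the example.
from scipy import stats

reference_doctor = [1, 2, 3, 4, 5, 6]
other_doctor     = [3, 1, 4, 2, 6, 5]

tau, p = stats.kendalltau(reference_doctor, other_doctor)
print(f"tau = {tau:.2f}, p = {p:.3f}")  # tau = 0.47, i.e. 7/15

# As noted above, with only six cases the significance test has
# very little meaning; the z approximation needs roughly 40+ cases.
```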
The point-biserial correlation is a special case of the Pearson correlation and examines the relationship between a dichotomous variable and a metric variable. What is a dichotomous variable, and what is a metric variable? A dichotomous variable is a variable with two values, for example gender with male and female, or smoking status with smoker and non-smoker. A metric variable is, for example, the weight of a person, the salary of a person or the electricity consumption. So if we have a dichotomous variable and a metric variable, and we want to know if there is a relationship, we can use a point-biserial correlation; of course we need to check the assumptions beforehand, but more about that later. How is the point-biserial correlation calculated? As stated at the beginning, the point-biserial correlation is a special case of the Pearson correlation. But how can we calculate the Pearson correlation when one variable is nominal? Let's look at this with an example. Let's say we are interested in investigating the relationship between the number of hours studied for a test and the test result, passed or failed. We've collected data from a sample of 20 students, where 12 students passed the test and 8 students failed, and we have recorded the number of hours each student studied for the test. To calculate the point-biserial correlation, we first need to convert the test result into numbers: we can assign a score of one to students who passed the test and a score of zero to students who failed the test. Now we can either calculate the Pearson correlation of time and test result, or we use the equation for the point-biserial correlation, where x1 bar is the mean value of the people who passed, x2 bar is the mean value of the people who failed, n1 is the number of people who passed, n2 the number of people who failed, and n is the total number. But whether we calculate the Pearson correlation or we use the equation for the point-biserial correlation, we get the same result both times. Let's take a quick look at this in DATAtab. Here we have the learning hours, the test result with passed and failed, and there the test result with zero and one; we define the test result with zero and one as metric. If we now go to correlation and calculate the Pearson correlation for these two variables, we get a correlation coefficient of 0.31. If we calculate the point-biserial correlation for learning hours and the exam result with passed and failed, we also get a correlation of 0.31. Just like the Pearson correlation coefficient r, the point-biserial correlation coefficient r pb also varies between minus one and one. If we have a coefficient between minus one and less than zero, there is a negative correlation, thus a negative relationship between the variables; if we have a coefficient between greater than zero and one, there is a positive correlation, that is, a positive relationship between the two variables; and if the result is zero, we have no correlation. As always, with the point-biserial correlation we can also check whether the correlation coefficient is significantly different from zero. Thus the null hypothesis is: the correlation coefficient r is equal to zero, there is no relationship. And the alternative hypothesis is: the correlation coefficient r is unequal to zero, there is a relationship. Before we get to the assumptions, here's an interesting note: when we compute a point-biserial correlation, we get the same p value as when we compute a t-test for independent samples on the same data. So whether we test a correlation hypothesis with the point-biserial correlation or a difference hypothesis with the t-test, we get the same p value.
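This equivalence is easy to verify in code. A minimal sketch with scipy.stats.pointbiserialr and scipy.stats.ttest_ind; the 12 passed / 8 failed split is from the example, while the study hours are hypothetical placeholders:

```python
# A minimal sketch of the point-biserial correlation and its
# equivalence to the independent-samples t-test.
from scipy import stats

passed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
hours  = [12, 9, 14, 8, 11, 13, 10, 15, 9, 12, 11, 14,
          7, 10, 6, 9, 8, 11, 7, 10]  # hypothetical study hours

r_pb, p_corr = stats.pointbiserialr(passed, hours)

# t-test for independent samples on the same data
passed_hours = [h for h, g in zip(hours, passed) if g == 1]
failed_hours = [h for h, g in zip(hours, passed) if g == 0]
t, p_ttest = stats.ttest_ind(passed_hours, failed_hours)

print(f"r_pb = {r_pb:.2f}, p (correlation) = {p_corr:.3f}")
print(f"t = {t:.2f}, p (t-test) = {p_ttest:.3f}")  # identical p values
```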
Now, if we compute a t-test in DATAtab with these data, with the null hypothesis that there is no difference between the groups failed and passed in terms of the variable learning hours, we get a p value of 0.179. And if we calculate a point-biserial correlation, with the null hypothesis that there is no correlation between learning hours and test result, we also get a p value of 0.179. In our example the p value is greater than 0.05, which is most often used as the significance level, and thus the null hypothesis is not rejected. But what about the assumptions for a point-biserial correlation? Here we must distinguish whether we just want to calculate the correlation coefficient, or whether we want to test a hypothesis. To calculate the correlation coefficient, only one metric variable and one dichotomous variable must be present. However, if we want to test whether the correlation coefficient is significantly different from zero, the metric variable must also be normally distributed; if this is not given, the calculated test statistic t or the p value cannot be interpreted reliably. This brings us to the last question: what is causality, and what is the difference between causality and correlation? Causality is the relationship between a cause and an effect: in a causal relationship we have a cause and a resulting effect. An example: coffee contains caffeine, a stimulating substance. When you drink coffee, the caffeine enters the body, affects the central nervous system and leads to increased alertness. Drinking coffee is the cause of the feeling of alertness that comes afterwards; without drinking coffee, the effect, i.e. the feeling of alertness, would not occur. But causality is not always so easy to determine; clear requirements must be met in order to speak of a causal relationship, but more about that later. So what is the difference between correlation and causality? A correlation tells us that there is a relationship between two variables. Example: there is a positive correlation between ice cream sales and the number of sunburns. However, an existing correlation cannot tell us which variable influences which, or whether a third variable is responsible for the correlation. In our example both variables are influenced by a common cause, namely sunny weather: on sunny days people buy more ice cream and spend more time outdoors, which can lead to an increased risk of sunburn. Causality means that there is a clear cause-effect relationship between two variables; causality exists when you can say with certainty which variable influences which. However, a common mistake in the interpretation of statistics is that a correlation is immediately assumed to be a causal relationship. Here is an example: the American statistician Darrell Huff found a negative correlation between the number of head lice and the body temperature of the inhabitants of an island. A negative correlation means that people with many head lice generally have a lower body temperature, and people with few head lice generally have a higher body temperature. The islanders concluded that head lice were good for health because they reduced fever; so their assumption was that head lice have an effect on the temperature of the body. In reality the correct conclusion is the other way around: in an experiment it was possible to prove that high fever drives away the lice. So the high body temperature is the cause, not the effect. What are the conditions for talking about causality? There are two conditions for causality. Number one: there is a significant correlation between the variables. This is easy to check: we simply check whether the correlation coefficient is significantly different from zero.
Number two: the second condition can be met in three ways. First, chronological sequence: there is a chronological sequence, and the values of one variable occurred before the values of the other variable. Second, experiment: a controlled experiment was conducted in which the two variables can be specifically influenced. And third, theory: there is a well-founded and plausible theory about the direction in which the causal relationship goes. If there is only a significant correlation, but none of these three conditions is met, we can only speak of correlation, never of causality. After examining how correlation analysis helps us determine the extent to which variables are related, we now move into the field of regression. Regression analysis allows us to understand the nature of the relationship more deeply and make predictions. Let's look at how we can use simple linear regression to predict values, and explore how this extends into multiple regression, offering even richer insights into complex data interactions. After watching this video you will know what regression analysis is and what the differences between simple and multiple linear regression are. Further, you will be able to interpret your results, understand the assumptions for a linear regression and also the use of dummy variables, and at the end I explain to you the basics about logistic regression. Of course, I'll also show you how to calculate a regression easily online. Let's get started. A regression analysis allows you to infer or predict a variable based on one or more other variables. Let's say you want to find out what influences the salary of people. For example, you could take the highest level of education, the weekly working hours and the age of people. Now you can investigate whether these three variables have an influence on the salary. If they do, you can predict a person's salary by taking the highest educational level, the weekly working hours and a person's age. The variable you want to infer, the one that you want to predict, is also called the dependent variable or criterion; the variables you use for your prediction are called independent variables or predictors. Regression analysis can be used to achieve two goals: first, you can measure the influence of one variable or several variables on another variable, or, second, you can predict a variable based on other variables. In order to give you a feeling for this, let's go through some examples, starting with measuring the influence of one or more variables on another. In the context of your research work, you might be interested in what has an influence on children's ability to concentrate: you're interested in whether you can prove that there are parameters that positively or negatively influence children's ability to concentrate, but in this case you're not interested in predicting children's ability to concentrate. Another example: you could investigate whether the educational level of parents and the place of residence have an influence on the future educational level of children. This area is very research-based and has a lot of application in the social and economic sciences. The second area, using regressions for predictions, is more application-oriented. Let's say, in order to get the most out of hospital occupancy, you might be interested in how long a patient will stay in the hospital. So, based on the characteristics of the prospective patient, such as age, reason for stay and pre-existing conditions, you want to know how long this person is likely to stay in the hospital.
Based on this prediction you can, for example, optimize bed planning. Another example would be that, as an operator of an online store, you may be very interested in which product a person is most likely to buy, so that you can suggest this product to the visitor in order to increase the sales of the online store. This is where regression comes into play. What we now need to know is that there are different types of regression analysis, and we get started with these types right now. In regression analysis, a distinction is made between simple linear, multiple linear and logistic regression. In simple linear regression, you use only one independent variable to infer the dependent variable. In the example where we want to predict the salary of people, we use only one variable: either, for example, whether a person has studied or not, the weekly working hours, or the age of a person. In multiple linear regression, we use several independent variables to infer our dependent variable, so we use the highest educational level, the weekly working hours and the age of a person. So the difference between a simple and a multiple regression is that in one case we use only one independent variable, and in the other case we use several independent variables. Both cases have in common that the dependent variable is metric; metric variables are, for example, the salary of a person, the body size, the shoe size or the electricity consumption. In contrast to that, logistic regression is used when you have a categorical dependent variable, so for example when you want to infer whether a person is at high risk of burnout or not. Whenever yes and no answers are possible, you use logistic regression. So in linear regression the dependent variable is metric, in logistic regression it is categorical. Whenever the dependent variable is yes or no, you will use a logistic regression: does a person buy a product, yes or no? Is a person healthy or sick? Does a person vote for a certain party, yes or no? And so on and so forth. In all these cases it does not matter what scale level the independent variables have: they can be either nominal, ordinal or metric, and this holds in all three cases, so in the simple linear, the multiple linear and the logistic regression. The dependent variable, on the other hand, is metric in the linear case, and nominal or ordinal in the case of a logistic regression. It is important to know that in the case of a nominal or an ordinal independent variable, the variable may classically have only two characteristics, such as gender with male and female. If your variables have more than two characteristics, then you have to form so-called dummy variables, but we will talk about dummy variables a bit later. So now a quick recap for you. There is the simple linear regression, where the question could be: does the weekly working time have an impact on the hourly wage of people? The distinguishing point is that we have only one independent variable in this case. Then there is the multiple linear regression, where a question could be: do the weekly working hours and the age of employees have an influence on the hourly wage? In this case we have at least two independent variables, for example weekly working hours and age. And now let's look at the last case, which is logistic regression. Here the question could be: do the weekly working hours and the age of employees have an influence on the probability that they are at risk of burnout?
In this case, burnout at risk has the expressions yes or no. So now I hope you got a first impression of what regression analysis is, and we'll move on to the linear regression now. Let's get started. As you already know, a regression analysis allows you to infer or predict a variable based on one or more other variables. But what is the difference between a simple linear regression and a multiple linear regression? The simple linear regression uses only one independent variable to infer the dependent variable, for example: does the weekly working time have an influence on the hourly salary of employees? In a multiple linear regression, several independent variables are used to infer the dependent variable, for example: do the weekly working hours and the age of employees have an influence on their hourly salary? So the difference between a simple regression and a multiple regression is that in one case only one independent variable is used, and in the other case several independent variables are used; both have in common that the dependent variable is metric. In contrast, logistic regression is used when you have a categorical dependent variable; logistic regression is discussed later in this video. Let's first start with the simple linear regression. The goal of a simple linear regression is to predict the value of a dependent variable based on an independent variable: the greater the linear relationship between the independent variable and the dependent variable, the more accurate is the prediction. Visually, the relationship between the variables can be represented in a scatter plot: the greater the linear relationship between the dependent and the independent variable, the more the data points lie on a straight line. The task of a simple linear regression is now to determine exactly the straight line that best describes the relationship between the dependent and the independent variable. In the context of a linear regression analysis, a straight line is plotted on the scatter plot; in order to determine this straight line, linear regression uses the method of least squares. Let's say a hospital asks you to give them an estimate, based on the age of a person, of how long this person will stay in the hospital after a surgery; the goal of the hospital operator is to optimize bed planning. In this example, your dependent variable, the one you want to infer, is the length of stay after surgery, and your independent variable is the age of a person. The equation that describes the model looks like this, where b is the slope and a is the intercept. If a person were zero years old, which doesn't really make sense in this example, the model would tell us that this person stays a days in the hospital. In order to calculate the coefficients, the hospital must of course provide you with a sample of people for whom you know the age and the length of stay after surgery. By using your data, you could find out that b is 0.14 and a is 1.2. This is now our model, which helps us to estimate the length of stay after surgery based on the age of people. Now let's say a person who is 33 years old is registered for a surgery; then we put in 33 for x, and our model tells us that this person stays in the hospital for 5.82 days after surgery. Now of course the question is: how do you calculate the slope b and how do you calculate the intercept a? Usually you use a statistics program like DATAtab, but in the case of simple linear regression it is also quite simple to do this by hand: b results from the correlation of the two variables times the standard deviation of the variable length of stay after surgery, divided by the standard deviation of age, and a is obtained by calculating the mean value of the length of stay minus the slope times the mean value of the age.
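Those two by-hand formulas are easy to translate into code. A minimal sketch with NumPy, using hypothetical ages and lengths of stay (the coefficients will therefore differ from the 0.14 and 1.2 above):

```python
# A minimal sketch of the by-hand formulas for simple linear
# regression: b = r * s_y / s_x and a = mean(y) - b * mean(x).
# The ages and lengths of stay are hypothetical placeholders.
import numpy as np

age  = np.array([25, 33, 41, 52, 60, 68, 74])            # hypothetical
stay = np.array([4.0, 5.5, 6.8, 8.4, 9.9, 10.7, 12.1])   # days, hypothetical

r = np.corrcoef(age, stay)[0, 1]             # Pearson correlation
b = r * stay.std(ddof=1) / age.std(ddof=1)   # slope
a = stay.mean() - b * age.mean()             # intercept

print(f"stay = {a:.2f} + {b:.2f} * age")
print(f"predicted stay at age 33: {a + b * 33:.2f} days")
```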
The regression line always tries to map the given data with a straight line as well as possible; of course, this always results in an error, and this error is called epsilon. Let's say we estimate the length of stay after surgery of a person who is 33 years old: our model tells us that the person stays in the hospital for 5.82 days, but in fact he or she stays in the hospital for seven days. Then the difference between the estimated value and the true value is our error, and exactly this is the error epsilon. Now you have a good overview of what a simple linear regression is, and we proceed with the multiple linear regression. Unlike simple linear regression, multiple linear regression allows for the consideration of more than one independent variable. As we already know, the goal of a regression is to estimate one variable based on several other variables. The variable to be estimated is called the dependent variable or criterion; the variables that are used for prediction are called independent variables or predictors. Multiple linear regression can also be used to control for the influence of a third variable. Multiple linear regression is often used in empirical social research as well as in market research: in both areas it is of interest to find out what influence different factors have on a certain variable. The equation for simple linear regression looks like this: we have one independent variable x plus the constant. If we now go the way of multiple linear regression, we have several independent variables: the first variable, the second variable, and so on up to the k-th variable. The coefficients can now be interpreted similarly to the simple linear regression equation: if all independent variables are zero, we get the value a, and if an independent variable changes by one unit, the corresponding coefficient b indicates by how much the dependent variable changes. Let's say x1 is the age of a person and b1 is 10; then for every further year, y is increased by 10. If the person is five, we have 5 times 10, which equals 50, so y is increased by 50. Before we now look at the interpretation of the regression results, I will first explain to you how you can easily calculate a linear regression online. In order to do this, simply visit DATAtab (you can find the link in the video description) and just copy your own data into the table; then you simply click on the regression tab. After you've copied your data into the upper table, your variables will appear here below. Now you only need to select your dependent variable and one or more independent variables. Of course this data set is far too small, but it just serves us as an example; your own data set may of course have several thousand rows. Let's say we want to find out what influences a person's salary, so we choose salary as the dependent variable; as the independent variables we can now choose gender, age and, for example, weight. As soon as you've selected the variables you want to use, DATAtab calculates a regression analysis for you. Since your dependent variable has a metric scale level, a linear regression is calculated, and now you can see the results here. We will go through the interpretation of the results in detail in a moment; the good thing is that DATAtab also helps you with the interpretation, so just click here:
a multiple linear regression analysis was performed to examine whether the variables male, age or weight significantly predict salary. The regression model indicated that the predictors explained 48.9% of the variance, and a collective significant effect was not found, F = 2.56, p = 0.10, R squared = 0.49. But let's now go into more detail about the interpretation of the results. Your results will be presented to you in this form; you can of course just use the interpretation in words on DATAtab, but we'll go through each block individually. Let's start with the model summary. The multiple correlation coefficient R measures the correlation between the dependent variable and the combination of the independent variables. What does that mean? Here we see the equation for the linear regression. Our statistics program, like DATAtab, calculates the regression coefficients; with the regression coefficients and the independent variables, an estimate of the dependent variable is calculated, and we can then calculate the correlation between this estimate and the true dependent variable. This correlation is the multiple correlation coefficient R. R thus indicates how strong the correlation is between the true dependent variable and the estimated variable; the greater the correlation, the better the regression model. The coefficient of determination R squared indicates how much of the variance of the dependent variable can be explained by the independent variables. Your dependent variable has a certain variance, and the more of this variance we can explain, the better it is. Let's take the example of the length of stay in the hospital after surgery: of course, not every person stays in the hospital for the same amount of time, so we have variation, and it's this variation that we want to explain with the help of a regression model. If, for example, we can predict the length of stay in hospital very well with the age and the type of surgery, then R squared is very large. If we could even explain everything, the result would be one: in this case we could predict exactly how long a person will stay in the hospital after a surgery, using only the age and the type of operation, and we could explain all the variance of the dependent variable. But of course that doesn't happen very often in practice. R squared overestimates the coefficient of determination when too many independent variables are used; to fix this issue, the adjusted R squared is often calculated. The standard estimation error indicates by how much the model misestimates the dependent variable on average. Let's say you want to predict the length of stay in days after surgery: if your standard estimation error is, for example, four, this would mean that your prediction is off by four days on average. Now of course your question to the hospital management would be whether an average deviation of four days is too much, or whether the hospital says: great, this enables much better planning security. Next, this table is displayed. Here the so-called F test is calculated. The F test tests the null hypothesis that the variance explanation R squared in the population is zero. This test is often not of great interest; the test is equivalent to asserting that all true slope coefficients in the population are zero, so b1 is zero, b2 is zero, and so on up to bk. In our example, since we have a very small data set, the results show that the null hypothesis cannot be rejected: the p value is greater than 0.05, and we assume, based on the available data, that all coefficients are zero.
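The same quantities (R squared, adjusted R squared, the F test and the coefficients) can be obtained in code with statsmodels. A minimal sketch; all data values below are hypothetical placeholders, not the data set from the video:

```python
# A minimal sketch of a multiple linear regression with statsmodels.
# All data values are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "salary": [2300, 2900, 3100, 3600, 2800, 4100, 3300, 3900],
    "age":    [25, 31, 34, 42, 29, 50, 37, 46],
    "weight": [70, 82, 75, 88, 68, 91, 77, 85],
    "male":   [1, 0, 1, 1, 0, 1, 0, 0],
})

X = sm.add_constant(df[["male", "age", "weight"]])  # adds the intercept a
model = sm.OLS(df["salary"], X).fit()

print(model.summary())  # full tables: coefficients, p values, F test
print("R squared:", round(model.rsquared, 3))
print("adjusted R squared:", round(model.rsquared_adj, 3))
print("F =", round(model.fvalue, 2), ", p =", round(model.f_pvalue, 3))
```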
So now we reach the probably most interesting table. Here we see the unstandardized coefficients, the standardized coefficients and the significance level. Capital B, the unstandardized coefficients, are just the coefficients that you can put into your regression equation. So if we want to predict the salary, we have a constant of 1,920 plus the coefficient times the gender variable, plus the coefficient times the age variable, plus the coefficient times weight; this allows us to use the unstandardized regression coefficients to build our regression model. If we want to predict the salary, we now only have to insert the variables. Beta are the standardized coefficients; here you can see which variable has the greatest influence on the salary. In this case, age has the greatest influence on salary, because the beta of age has the greatest magnitude. And finally we see the significance value for the coefficients; values smaller than 0.05 are considered significant. In this case we see that not a single coefficient is significant. What does that mean? The null hypothesis in each case is that the coefficient is zero in the population. If the p value is greater than 0.05, we cannot reject this null hypothesis based on the available data, and thus it is assumed that the coefficients in the population are not different from zero, and therefore the variables have no influence on salary. Due to the fact that I've used a data set that is actually very small, it's difficult to reach significant values. Of course, when calculating a regression analysis, you also have to check the assumptions; if the assumptions are not fulfilled, you cannot interpret the results of the regression meaningfully. Therefore we now take a look at the assumptions. The assumptions for the linear regression are as follows: there must be a linear relationship between the dependent and the independent variables; the error epsilon must be normally distributed; the third assumption is that there must be no multicollinearity, i.e. no instability of the regression coefficients; and finally, there must be no heteroscedasticity, i.e. the variance of the residuals must be constant over the predicted values. And now we'll go into each point in more detail. Let's get started. In a linear regression, a straight line is drawn through the data. This straight line should represent all points as well as possible; if the points are non-linear, the straight line cannot fulfill this task. Let's look at these two graphs: in the first one we see a nice linear relationship between the dependent and the independent variable, and here the regression line can be drawn in a meaningful way. In the second case, however, we see that there is a clearly non-linear relationship between the dependent and the independent variable; therefore it's not possible to put the regression line meaningfully through the points. Since this is not possible, the coefficients of the regression model cannot be interpreted meaningfully, or errors in the prediction can occur that are larger than expected. Therefore we have to check at the beginning whether there is a linear relationship between the dependent and the independent variables; this is usually done graphically. The next requirement is that the error epsilon must be normally distributed. In order to check this, there are two ways: one is the analytical way and the other one is the graphical way. When using the analytical way, you can use either the Kolmogorov-Smirnov test or the Shapiro-Wilk test. If the p value is greater than 0.05, there is no deviation of the data from the normal distribution, so it can be assumed that your data are normally distributed.
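Both the analytical and the graphical check of the residuals can be sketched in code, continuing the hypothetical statsmodels model from above (the `model` object and its residuals carry over from that sketch):

```python
# A minimal sketch of checking the normality of the residuals,
# continuing the hypothetical model from the previous sketch:
# analytically (Shapiro-Wilk) and graphically (QQ plot).
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

residuals = model.resid  # errors of the fitted model

w, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.3f}")  # p > 0.05: no deviation detected

sm.qqplot(residuals, line="s")  # the closer the points lie to the line,
plt.show()                      # the better the normal distribution
```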
However, these analytical tests are not used that often anymore, because they tend to always attest normal distribution in small samples and they become significant very quickly in larger samples, thus rejecting the null hypothesis that the data are normally distributed. Therefore the graphical version is used more and more. For the graphical solution, you can either look at the histogram or, better, the QQ plot: the more your data lie on the line, the better the normal distribution. And now we come to heteroscedasticity. Another assumption of the linear regression is that the residuals have a constant variance. Since your regression model never exactly predicts your dependent variable in practice, you always have a certain error. Now you can plot your dependent variable on the x axis and the error on the y axis. The error should scatter evenly over the whole range; then homoscedasticity is present. However, if the results look like this, then we have heteroscedasticity: in this case we have different error variances depending on the value range of the dependent variable, and this should not be the case. And before I show you how you can check the assumptions online, we now come to the last assumption, which is no multicollinearity. Multicollinearity means that two or more independent variables are strongly correlated with each other. The problem with multicollinearity is that the effect of individual variables cannot be clearly separated. Let's look at the regression equation again: there we have the dependent variable, and here the independent variables with their respective coefficients. Now, if for example there is a high correlation between x1 and x2, or if these two variables are almost equal, then it is difficult to determine b1 and b2; if both are completely equal, the regression model does not know what b1 and b2 should be. This means that the regression model becomes unstable. If you want to use the regression model for a prediction, it does not matter whether there is multicollinearity or not: in a prediction you're only interested in how good the prediction is, not in how big the influence of the respective variables is. However, if the regression model is used to measure the influence of the independent variables on the dependent variable, there must not be multicollinearity; if there is, the coefficients cannot be interpreted meaningfully. So now the question is: how can we diagnose multicollinearity? If we look at the regression equation again, we have the variables x1, x2, up to the variable xk. We now want to know if x1 is largely identical to another variable or to a combination of the other variables. In order to do this, we simply set up a new regression model, and in this new regression model we take x1 as the new dependent variable. If we can now predict x1 very well by using the other independent variables, we don't really need x1 anymore, because we can use the other variables instead; if we would use all variables anyway, the regression model could become unstable, because the coefficients are then no longer uniquely determined. We can now do this for all other variables: we estimate x2 by using the other independent variables, and so on up to xk, so we have k new regression models. For each of these regressions we calculate the so-called tolerance or the VIF value. The tolerance is obtained by taking 1 minus R squared, where R squared is the variance explanation; once again, the variance explanation indicates how much of the variance of x1 can be explained by the other variables, and the more this is, the more one speaks of multicollinearity. If these variables can explain 100% of the variance of x1, then we no longer need x1 in the upper equation. If the tolerance is less than 0.1, you have to be careful, because in this case we could have multicollinearity. On the other hand, we have the VIF value; the VIF value is calculated by dividing one by the tolerance, so accordingly you have to be careful if the VIF values are greater than 10.
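The tolerance and VIF diagnostics can be sketched in code as well, continuing the hypothetical example (the design matrix `X` carries over from the statsmodels sketch). statsmodels' variance_inflation_factor regresses each predictor on the others, exactly as described above:

```python
# A minimal sketch of the tolerance and VIF diagnostics, continuing
# the hypothetical example from above.
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X is the design matrix from the earlier sketch (const, male, age, weight)
for i, name in enumerate(X.columns):
    if name == "const":
        continue  # the intercept itself is not of interest here
    vif = variance_inflation_factor(X.values, i)
    tolerance = 1 / vif
    flag = "caution" if vif > 10 or tolerance < 0.1 else "ok"
    print(f"{name}: VIF = {vif:.2f}, tolerance = {tolerance:.2f} ({flag})")
```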
And now I will show you how you can test the assumptions online. If you want to test the assumptions online with DATAtab, you only need to select your variables and then click on check assumptions. Now you see a nice overview of the results: here you can check the linearity, the normal distribution of the error, the multicollinearity and the heteroscedasticity. It's very simple, just try it yourself. Now it's time to look at dummy variables. You have to use dummy variables if you want to use categorical variables with more than two values as your independent variables. As independent variables or predictors, either metric variables or categorical variables with two expressions can be used. Of course it's also possible to use variables with more than two categories, and I will explain this to you in the next slides. Categorical variables with two characteristics are also called dichotomous; an example would be gender with the categories male and female. Let's say you coded female as zero and male as one; in this case female would be the reference category. If we now look at the regression equation and say the variable x1 is gender, then b1 is the regression coefficient for gender. Now how can we interpret b1? We said zero is female and one is male, so we just put that in: then we have 0 times b1 for a female person and 1 times b1 for a male person. Accordingly, b1 indicates the difference between male and female. Let's say we want to predict the income of a person, so y is the income. If a person is male, this person earns more by the amount of b1 than a woman: if the person is male, there is a one here and we have one times this value, and if the person is female, there is a zero here and we have zero times this value. If b1 is 400, for example, then it means that men earn €400 more than women according to this model. Now we've discussed how to handle variables with two values; further, we should look at what we do when we have a variable with more than two values or categories. Let's say you want to predict the fuel consumption of a car based on its horsepower and the vehicle type, and let's say there are only three vehicle types: sedan, sports car and family van. Thus here we have a variable vehicle type with more than two characteristics, but for a regression model we need categorical variables with two characteristics. Therefore the question is: what should we do? In order to use such categorical variables in a regression, we have to create dummy variables. This means that we simply create three new variables, and each new variable has two characteristics; these characteristics are zero and one, or yes and no. So here we have our vehicle type with the characteristics sedan, sports car and family van, and now we create one new variable for each characteristic. First we create the variable: is it a sedan, yes or no? The second would be: is it a sports car, yes or no? And our third variable asks: is it a family van, yes or no? So before, we had one variable, and now we have three variables, which are dichotomous and therefore all have only two characteristics. These three new variables can now be used in the regression model.
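In code, this dummy coding is a one-liner with pandas. A minimal sketch with hypothetical car data; drop_first=True keeps only the number of characteristics minus one dummies, which is exactly the redundancy point explained in the next paragraph:

```python
# A minimal sketch of creating dummy variables with pandas.
# The vehicle data are hypothetical placeholders.
import pandas as pd

cars = pd.DataFrame({
    "horsepower":   [110, 150, 310, 95, 280],
    "vehicle_type": ["sedan", "sedan", "sports car", "family van", "sports car"],
})

# drop_first=True keeps k - 1 dummies; the dropped category becomes
# the reference category
dummies = pd.get_dummies(cars, columns=["vehicle_type"],
                         drop_first=True, dtype=int)
print(dummies)
# horsepower plus 0/1 columns such as vehicle_type_sedan and
# vehicle_type_sports car, ready to use as predictors
```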
Now that we've discussed how to handle variables with two values, we should look at what we do when we have a variable with more than two values or categories. Let's say you want to predict the fuel consumption of a car based on its horsepower and the vehicle type, and let's say there are only three vehicle types: sedan, sports car, and family van. So here we have a variable, vehicle type, with more than two characteristics, but for a regression model we need categorical variables with two characteristics. Therefore, the question now is: what should we do? In order to use such categorical variables in a regression, we have to create dummy variables. This means that we simply create three new variables, and each new variable has two characteristics: zero and one, or yes and no. So here we have our vehicle type with the characteristics sedan, sports car, and family van, and now we create one new variable for each characteristic. First we create a variable "is it a sedan, yes or no", the second would be "is it a sports car, yes or no", and our third variable asks "is it a family van, yes or no". So before we had one variable, and now we have three variables, which are dummies and therefore all have only two characteristics, and these three new variables can now be used in the regression model.

So the next question is: what does that mean for the data preparation? You originally have one column with the vehicle type, where the individual vehicles from your sample are listed: the first is a sedan, the second is also a sedan, the third is a sports car, and so on and so forth. Out of this table you create your three new variables. Is the first car a sedan? Yes, so you put a one here and a zero for the others, because it's not a sports car and it's not a family van. The second one is also a sedan, so we put a one here again; the third one is a sports car, so we put a one here and the others are zero. By continuing this procedure, we have finally created our new dummy variables.

Now there's only one important thing to note: the number of dummy variables is always the number of characteristics minus one. Why is that the case? If we know that it is a sedan, we are sure that it's not a family van; if we know that it is a sports car, we can also be sure that it's not a family van; and if it's not a sedan and not a sports car, we can be sure that it is a family van. Accordingly, one of the three variables is not needed, because this information is redundant. Therefore, n minus one dummy variables are created, so in this case we only need the dummy variables "is it a sedan" and "is it a sports car". Of course, you can also drop sedan and use the other two. And now I show you how you can create the dummy variables online with DataTab. In this example we use salary as our dependent variable and, as independent variables, the age and the company. The variable company here has three characteristics: BMW, Ford, and GM. Now we can see that DataTab automatically creates the dummy variables: we have the age and the new dummy variables "is it BMW" and "is it Ford". Since we only need n minus one dummy variables, the last category was dropped.
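If you prepare the data yourself rather than online, a minimal sketch of the same dummy coding in Python could look like this; the small car table is made up, and pandas' drop_first option implements the n minus one rule by dropping one category as the reference.

```python
import pandas as pd

# Made-up sample: one row per car, vehicle_type has three characteristics.
cars = pd.DataFrame({
    "horsepower":   [120, 150, 300, 170],
    "vehicle_type": ["sedan", "sedan", "sports car", "family van"],
})

# drop_first=True keeps only n - 1 = 2 dummies; the dropped category
# (here the alphabetically first one) is implied when both dummies are 0.
dummies = pd.get_dummies(cars, columns=["vehicle_type"], drop_first=True)
print(dummies)
```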
If your dependent variable is categorical, then you need logistic regression, and logistic regression is what we are going to look at now. In the previous part we discussed linear regression; in a linear regression the dependent variable is metric, for example salary or height. Now let's look at logistic regression: here the dependent variable is a categorical variable. Let's say you want to predict whether a person is at risk for burnout or not, and you want to make this prediction by asking a person's highest educational attainment, weekly working hours, and age. Since the variable "whether a person is at risk for burnout or not" is a categorical variable, you have to use a logistic regression. The logistic regression is a special case of regression analysis, and it is calculated when the dependent variable is nominally or ordinally scaled.

Let's look at a few examples. First: for an online retailer, you need to predict which product a particular customer is most likely to buy. In order to do this, you receive a data set with past visitors and their purchases from the online retailer. In this case you need a logistic regression, because your dependent variable is categorical, namely which product a particular customer is most likely to buy. The second example comes from the medical area: you want to investigate whether a person is susceptible to a certain disease or not. For this purpose you receive a data set with diseased and non-diseased people, as well as other medical parameters. Again, in this case you need a logistic regression, because the dependent variable is categorical: is a person susceptible to a particular disease or not? And the last example comes from politics: the question could be, would a person vote for party A if there were an election next weekend? So again a categorical dependent variable, with yes and no as response options.

What is logistic regression now? In its basic form, logistic regression predicts dichotomous variables, that is, variables with the characteristics zero and one, or yes and no. For this purpose, the probability of occurrence of characteristic one, meaning "characteristic present", is estimated. In medicine, for example, a common goal is to find out which variables have an impact on a disease. In this case zero could be "not diseased" and one "diseased", and the influence of age, gender, and, for example, smoking status on this particular disease could be examined. Let's look at this in a graphical way: we have our independent variables age, gender, and smoking status, and we use these three variables in order to predict whether a person is likely to get a certain disease or not. So the dependent variable is: will the person get the disease or not?

Maybe now you ask yourself: why do I need a logistic regression for this, why can't I just use a linear regression? A quick recap: in a linear regression, this is our regression equation. Now we have a dependent variable that is zero or one, so no matter what values the independent variables take, the observed outcome is always either zero or one. If we used a linear regression, we would simply put a linear straight line through these points, and the graph shows that predicted values between plus and minus infinity can now occur. The goal of logistic regression, however, is to estimate the probability of occurrence, not the value of the variable itself: we want to know how likely it is that a value of one results from the given values of our independent variables. The range of values for the prediction is thus restricted to the range from zero to one. In order to ensure that only values between zero and one are possible, the logistic function is used. The logistic model is based on the logistic function, and the important thing about the logistic function is that only values between zero and one are possible: no matter where we are on the x-axis, between minus and plus infinity, we can only get values between zero and one as a result, and that is exactly what we want.

The equation for logistic regression now looks like this: we have P = 1 / (1 + e^(-z)). For z we now insert the usual equation from the linear regression, which is the equation here below: b1 to bk are the regression coefficients, a is the intercept, and X1 to Xk are the independent variables. After we insert all that, our logistic function looks like this. Now we need to determine the coefficients so that our model best represents the given data. In order to solve this problem, we use the so-called maximum likelihood method. There are good numerical solutions that help us solve this problem, so usually we just use a statistics program and it will give us the values b1, b2, up to bk.
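For instance, in Python such a program call could look like the following minimal sketch; the tiny salary/age/buys data set is made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: salary, age -> does the person buy the product (1) or not (0)?
X = np.array([[2400, 22], [3100, 35], [5200, 41], [2800, 30],
              [4700, 52], [3900, 46], [2600, 25], [5600, 58]])
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])

# The coefficients are determined numerically via maximum likelihood.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Fitted model: P(buy) = 1 / (1 + e^(-(a + b1*salary + b2*age)))
print(model.intercept_, model.coef_)        # a and (b1, b2)
print(model.predict_proba([[4000, 40]]))    # P(no) and P(yes) for a new person
```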
Finally, I will now show you how to calculate a logistic regression online with DataTab. In order to do this, you just visit datatab.net and copy your own data into the table. When you select a categorical dependent variable, a logistic regression is automatically calculated. Let's say your dependent variable is "a person buys a product or not", with the values yes and no, and the independent variables are salary and age. After you click on the variables, a logistic regression is automatically calculated, and you get the results, which you can see below. The first table is the so-called classification table; this table tells us how well we can classify the categories with the regression model. After that, we get the model summary and the coefficients.

We just talked about regression, where we predict values based on known relationships. Let's now shift our focus to discovering hidden patterns in data with no apparent relationship. This brings us to cluster analysis, especially the k-means clustering technique. K-means clustering is a powerful method used to identify hidden groups, or clusters, within our data. Let's explore how it works and how it can enhance our understanding of complex data sets. In this video I will explain to you everything you need to know about cluster analysis: I will start with the question of what the k-means cluster analysis is, and then I will show you how you can easily calculate it online with DataTab.

And now let's start with the question: what is the k-means cluster analysis? The k-means cluster analysis is one of the simplest and most common methods for cluster analysis. By using the k-means method, you can cluster your data into a given number of clusters, so you already need to define the number of clusters beforehand. For example, you have a data set and you want to cluster your cases into three clusters; this can be done with the k-means cluster analysis. For example, you could have a data set with 15 European countries and you want to cluster them into three country groups.

So now the question is: how does the k-means cluster analysis work? There are five simple steps required. Let's start with the first step: first, you have to define the number of clusters, that is, the number of groups you want to find. The number of clusters is the k in k-means; in our case we simply select three clusters, so in this example k was set equal to three. The second step is to set the cluster centroids randomly; each of the centroids now represents one cluster. Let's come to step three: we have selected the number of clusters and we have set the cluster centroids randomly, and now we assign each element to one cluster, so for example we assign each country to one cluster. Let's start with one element: the distance from the first element to each of the cluster centroids is calculated, so for example we calculate the distance from this element to each cluster centroid. Afterwards, each element is assigned to the cluster to which it has the smallest distance. In our example, the distance between this element and the yellow centroid is the smallest, so we assign this element to the yellow cluster. Now this step is repeated for all further elements, so at the end we have one yellow cluster, one red cluster, and one green cluster, and all points are initially assigned to a cluster. So let's summarize it again: we first defined the number of clusters, we then set the cluster centroids randomly, and we assigned each element to a cluster. In step four, we now calculate the center of each cluster: for the green elements, for the yellow elements, and for the red elements, the center of each cluster is calculated. These centers are the new cluster centroids; this means that we simply shift the centroids into the cluster centers, so the cluster centroids are moved to the cluster centers. Now, in step number five, we assign the elements to the new clusters: since the centroids may now be located somewhere else, each element is assigned to the cluster whose centroid is closest to it. Now we have finished all steps, and from now on steps four and five are repeated until the cluster solution does not change anymore; then the clustering procedure is over.

One disadvantage of the k-means method is that the final result depends very much on the initial cluster centroids. To take this into account, the whole procedure is carried out several times, and different randomly chosen starting points are used for each of the calculations: each time we use different starting points, the outcome could be different, so we do the whole cluster analysis several times in order to get the best possible result. If you use DataTab to calculate the cluster analysis, the analysis is, for example, done 10 times with 10 different randomly chosen starting points, and at the end the best cluster solution is chosen.

So the next question is: what is the optimal number of clusters? With each new cluster, the summed distance within the clusters gets smaller and smaller. So if we have a look at this picture, where we have two clusters, and that picture, where we have three clusters, the three clusters surely fit the data better than the two clusters: the distance between the elements and the cluster centroids is higher in the two-cluster case than in the three-cluster case. So the question now is: how many clusters should be used? In order to answer this question, we use the elbow method. With each additional cluster, the summed distance between the elements and the cluster centers becomes smaller and smaller; however, there is a cluster number from which each additional cluster reduces the summed distance only slightly, and this point is used as the number of clusters. So if we have a look at this plot, we can see that there is a big gap between cluster numbers one and two, and there's also a big gap between cluster numbers two and three, but there's only a small gap between cluster numbers three and four. So in this case, we select a number of three clusters.
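Here is a minimal sketch of both ideas in Python with scikit-learn: n_init=10 repeats the whole procedure with 10 different random starting centroids and keeps the best solution, and the loop over k prints the summed squared distances you would inspect for the elbow. The salary/age data are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three made-up groups of people (salary, age), 20 people each.
data = np.vstack([
    rng.normal([2500, 25], [200, 3], size=(20, 2)),
    rng.normal([4000, 40], [200, 3], size=(20, 2)),
    rng.normal([6000, 55], [200, 3], size=(20, 2)),
])

# Elbow method: fit k-means for k = 1..6 and watch the inertia
# (summed squared distance of the elements to their cluster centroid).
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, round(km.inertia_, 1))  # look for the k where the drop flattens
```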
Now I'd like to show you how you can easily calculate a k-means cluster analysis online with DataTab. To do this, please visit datatab.net and click on the statistics calculator. In order to calculate a cluster analysis, you simply choose the tab "Cluster". If you want to use your own data, you can clear the table and then copy your own data into it; I will simply use the example data now. So, I want to calculate a k-means cluster analysis: let's say we want to cluster a group of people by salary and age. First we can define the number of clusters, and we enter the number of clusters. DataTab will now calculate everything for you, and you get your results right away. Here you can see the three clusters with their centroids, and there you can see the elbow method; in this case, the results indicate that we should use a solution with two clusters. We had selected three clusters before, so we can change it in order to get the most suitable number of clusters. But let's go through the results step by step now. Here we can see how many elements are assigned to the different clusters, and here we get the plot with the different clusters: we can see one cluster here, one cluster there, and another cluster here. Further, we get a table where each element is allocated to a cluster. If we now choose two as the number of clusters, we get new results, and we can see them here in this plot: we have one cluster here and one cluster there. Moreover, we can see that the two clusters fit the data quite well. Thanks for watching, and I hope you enjoyed our video. Don't forget to subscribe to our channel!