Transcript for:
Statistics for Experimental Research Lecture Notes

Hi, I am a professor at Liège University in Belgium, and today I am going to talk about statistics for experimental research. Today's presentation is a follow-up in a playlist on experimental research. The last lecture was about the lab notebook, and today we are starting to talk about statistics in experimental research. Before starting, it is very important to know that if experiments are at the core of laboratory research, then statistics is the way to make sense of your experimental work. Statistics forms the basis of your analysis, and therefore today's presentation caters mainly for postgraduates and scientists who are conducting, or seek to conduct, research in the medical sciences, social sciences, life sciences, and natural sciences including engineering, and for sure for mentors and supervisors.

Now let's look at this common slide that I use in every presentation of our series on experimental research. Experimental research is part of quantitative research, and as you can see there are only three approaches to conducting research. There is the qualitative approach, and we had a playlist on that which you can find on my channel. There is the quantitative approach, where we now have the first series on experimental research, and in the future you will find observational research and modeling. And there is a mixed approach, which combines both. In this context, keep in mind that everything I am saying is within the experimental research methodology.

Now let's move on to the objective of this presentation. Goal number one is to enhance your capacity to conceptualize and design experimental research, and goal number two is to help you do your statistical planning during the early stages of research. When we talk about statistical analysis we are talking about different things, but here I mainly mean defining your independent and dependent variables,
analysis of variance, correlation, regression analysis, and cause and effect, or causality. This is the foundation that will take you to the next step, which is conducting a design of experiments. In the following presentation I will talk in detail about how to conduct the design of experiments, but let us focus now on statistics, so that later on we are able to design our experiments statistically.

The content of today's presentation is: an introduction on the principles of experimental research, as a reminder; terms and definitions; the logic of statistical analysis; the statistical tests that are commonly used and how to conduct them; and finally some take-away messages.

So let's start with the introduction. What is the nature of experimental research? It is a systematic inquiry to describe, explain, predict, and control the observed phenomena. We seek to predict, control, confirm, and test, while taking into account integrity and reproducibility and eliminating external factors. In the previous presentation on control we talked about how to reduce noise and neutralize all external factors. So experimental research is structured and predetermined, and you must have a pre-test and a post-test. Have a look at my previous presentation on research ethics if you want to know more about integrity.

What are the key steps of experimental research? First of all, we have to have knowledge and experience of statistics, and this is what makes this presentation important. Then we design our experiment. Once it is designed, we can perform our measurements, and then comes the analysis, the post-processing of results, and the data visualization. So statistical analysis is not the last part of your research that will be
performed after you conduct your experiment. Statistical analysis must be embedded in the early stage of your design of experiment. Therefore it is very important to learn about statistics before designing your experiment. Keep in mind that many scholars, researchers, or scientists start collecting data, and only after that start to think about how to process them and make sense of them. This is totally wrong. From the beginning you have to consider your design of experiment, take into account what kind of statistical test you want to conduct and what kind of data you will collect, and only then start collecting data, not the other way around.

Now a very fast reminder on the principles of experimental research. As we mentioned before, the three Rs are the principles of experimental research: randomization, replication, and reduction of noise, or control. The advantage of experimental research is that it allows you to do that, so in a controlled environment for testing you have to be able to apply those three elements. Do not forget that blinding is considered most of the time as part of randomization. If you want to know more about randomization, blinding, or replication, you can watch the videos in the links below.

Now let's move to the terms and definitions of experimental research. First of all, statistics. We have to look at the type of sampling we are using. We have to look at the list of subjects we are working with. We have to look at the exact sample size, which is always indicated by n, for each experiment. You must look at the test statistics you are looking for, for example the r value, the p-value, Pearson's r, and so on. And you have to describe the covariates that will be tested. This is the first part when I am talking about statistics. So a full description of the statistical parameters, including central tendencies like the mean or other basic
estimates like the regression coefficient, or the variation, for example the standard deviation, or associated estimates of uncertainty, which are represented by the confidence interval: they are all part of the process of statistical analysis for your work.

Now let's go into more detail and ask some questions. What is sampling? To understand sampling, you first need a population of subjects. As we mentioned in the presentation on subjects, they can be humans, they can be objects, they can be cars, whatever kind of test subject you are looking at in your experimental setting. The population is the entire aggregation of cases that meets a specified set of criteria. In this sense you must have eligibility criteria, and you have to set inclusion and exclusion criteria to make sure that you are defining your population. From this population you start the sampling, which is the process of selecting a portion of the population to represent the entire population. Then you have your accessible population, which is the population of people or subjects, even if they are specimens or samples, that will be available for your experiment. And finally you must focus on your target population, which is the entire population in which the researcher is interested and to which he or she wants to generalize the results. So with sampling we are reducing the investigated subjects to a limited number of subjects that will allow us to learn about the overall population.

Now, it is very important to take reporting into account when we do sampling. If you are going to publish your experimental protocol, or publish your results in a paper, you must be very conscious about reporting. So when we talk about reporting regarding the population,
it is the estimate of the effect size for your population, for example Cohen's d or Pearson's r, and you must indicate how they were used. When we talk about sample size, you have to report the exact sample size, n, for each experimental group and condition, given as a discrete number and unit of measurement. And when you are reporting about sampling, you have to describe your sampling randomness. Those are the basic pieces of information that any description of your sample should include when you report in a report or in a publication.

Now, what are the types of sampling? There are two main types. One of them, which we are not so much concerned with here, is non-probability sampling. Non-probability sampling involves convenience sampling, snowball sampling, quota sampling, and purposive sampling, and it is rarely representative of the target population. I covered those types of sampling in the qualitative research series, because they are more suitable for qualitative research. The most suitable sampling technique in experimental research is probability sampling, and by probability sampling I mean simple random sampling, systematic sampling, stratified sampling, cluster sampling, and multi-stage sampling.

To give you an idea of the variations of probability sampling: you can see here that this is simple random sampling. Systematic sampling has a kind of system, saying I will select every third or every fifth subject, for example. In stratified sampling I create strata, stratified groups, and represent each of them. If I stratify by gender, for example, I make sure that I take one time a lady and one time a man, so I make sure that I am having
a stratified sample. There is also cluster sampling, and here I am talking about groups of subjects that might have common features, conditions, or settings. I select based on the clustering, making sure that I am representing the different clusters. But for sure, for stratified and for cluster sampling I need to do some early pre-processing to decide how I will stratify my overall sample and how I will cluster it. These are just examples to give you an idea.

Now let's talk about simple random sampling. How do we perform it? First, we define the population. Second, we decide the sample size. Third, we randomly select the sample. And finally, step four, we collect the data from the sample. Keep in mind that randomization helps ensure that the sample represents the population, and it protects against bias from variables that we might not realize are important. This is where the importance of randomization, or random sampling, comes in.

Now, what do I include in my statistical model when I start the statistical analysis for my experiment? First of all, you have to include the dependent variable. The dependent variable looks at the results of the test. It represents the response variable, or the guiding parameter of your work, and it is a quality indicator, a characteristic value for your work. The second aspect I have to include in my statistical model, when I start to think about how to design my experiment and my statistical framework, is the independent variable, which is the potential influential factor. Always, when you are reporting this information in a paper or a report, you have to investigate whether the independent variable is influencing your measurement or not, and in this sense it can be
one variable or multiple variables.

Now, I talked about sampling, and I talked about the basic statistical model that has an independent variable and a dependent variable. You can have several of them, but let's start with the simplest case. Then I have to move to the next step: how do we start the story of statistical analysis? Statistical analysis starts with the literature review and operationalization. I strongly advise you to watch the video on study variables and operationalization that was prepared previously, because it will help you to understand exactly how to operationalize and select the variables of your study. Once this is done, you can go to the next step.

When we do statistical analysis, we are trying through our experiment to trace any variance that results from introducing a treatment. We are looking to establish a correlation, and we never know, maybe yes, maybe no: a correlation can be positive, it can be negative, or maybe there is no correlation. And if possible, we are trying to establish causation. It does not always work as easily as I am saying. Sometimes it stops with variance, maybe it stops with correlation, but that is what we are seeking when we do statistics. We look at variance as a consequence of introducing an intervention or a treatment. If this variance is well documented, we check whether it is correlated, and if it is correlated, we check whether there is causality behind it or not.

So after finding ways of operationalizing the concepts into variables of your research, the data collection should start with four criteria in mind. First, you have to look at your variables: what are the sub-variables you are investigating, and what are the indicators that you will use to measure and quantify each variable? Second, you have to have an analysis of variance for the influence of your independent variable on the dependent variable. Third, you have to look at
establishing a correlation if possible, whether it is positive or negative. But you have to be very attentive: you cannot establish a correlation without proving that there is variance. If no variance is proven, you cannot move to step number three. And if step number three is established and you have a correlation, you might be able to investigate causality, the cause and effect. Finally you can predict, which is something very powerful in research. However, the prediction part is always considered on the modeling side, which is a research methodology that I will have a playlist on in the future. Now, the hypothesis predicts a correlation that must be tested after data collection, and the chosen variables must be tested for correlation, which might lead to a regression relation. So this is the overall starting point when I design my experiment and, in parallel, my statistical framework.

Talking about hypotheses, I need to be aware of the null hypothesis versus the alternative hypothesis. What is the null hypothesis? The null hypothesis states that there is no statistically significant difference between the sample means of two groups. So I introduced my intervention or treatment to my sample and I found that there is no variance. Here I say there is no difference between the baseline sample and the alternative, sample number one, which includes the intervention. Any observed difference is the result of sampling error alone. Most probably that is the case; there can be another interpretation; or it could indicate that the data are inconclusive. Those are the three options if I cannot detect variance between the base-case condition without intervention and the alternative condition with intervention.

Now, if you have a positive outcome, we talk about the alternative hypothesis: there is a statistically
significant difference between the sample means of the two groups. In this sense we can say that a true population difference exists between the intervention population and the no-intervention population. And again, when we do reporting, we must report our null hypothesis testing: all the types of tests that were conducted, for example the F, t, or r statistic, with confidence intervals, effect sizes, degrees of freedom, and the p-value. P-values must be communicated as exact values whenever possible. So this is the overall logic before I start designing my statistical framework and before I start my testing.

Now let's talk about the logic of statistics. This is a very important part of today's presentation, because many people go into the details of testing without understanding the logic. I care more about understanding the logic of statistics, and I want to make sure that you understand the logic behind your statistical testing. Why do we do statistical analysis? I advise you to take note of this and put it in front of you when you start working on statistics. First, we describe the sample or the target population. Second, we look to detect variance, and this variance will come mainly from an intervention that we introduce. Third, we compare and test a hypothesis. If there is a significant variance between the subjects with our intervention and the baseline condition, then, fourth, we can look at establishing a correlation. Fifth, we can look at predicting, through regression for example. And finally we can prove causality. That is what statistics is all about. We start by describing our statistical samples. For sure, we are
talking here about a comparative approach: I will always have statistical data for a sample without the intervention and for a sample with the intervention. I go step by step. I check the variance; if it is significant, I move to the next step, which is establishing a correlation; and if there is a negative or positive correlation, I can go to the next, which is causality and prediction. It is very important to keep this in mind. In general, an experiment is a study of cause and effect. It differs from non-experimental methods in that it involves the deliberate manipulation of one or more variables while trying to keep all other variables constant, and here lies the importance of control, or noise reduction.

Now, what are the types of statistical analysis that we conduct? We normally do two types: descriptive statistics and inferential statistics. In this presentation I also call the inferential type parametric and the descriptive type non-parametric, so these are different namings. What is descriptive, or non-parametric, statistics? It is simply summarizing, organizing, and simplifying the data. We present tables, graphs, averages, means, standard deviations, and distributions. We are just trying to characterize our sample, without any analysis or treatment that goes beyond description. When I talk about inferential, or parametric, statistics, I am trying to compare samples. I study samples to make generalizations about the population, or to interpret the experimental data and look at the confidence of my analysis. Here I can look at the margin of error, the significance, the p-value, the confidence interval, and so on. So it is very important to look at those two types of statistical analysis. As I mentioned before, if you want to go beyond description and compare your samples with intervention and without intervention, then you must
leave descriptive statistics and go into inferential statistics and do parametric statistics. So allow me to reiterate the types of statistics. We have a group of statistical tests that describe the sample or the target population; we call them descriptive or non-parametric. The second type looks to detect variance or covariance, with the ANOVA test for example, to compare different samples, to establish correlation, to predict, and to prove causality; we call it inferential or parametric statistics.

Now let's talk about the major statistical tests. I am almost reaching the end, but I will go through six main approaches to statistical testing, based on this framework: description, detection of variance, comparison and hypothesis testing, establishing correlation, prediction, and proving causality. In the rest of the presentation I will follow this numbering, and for each type of statistical analysis I will share the most common tests. That way, when you get lost in all these different statistical tests, you can go back and ask yourself: what purpose is this test meant for, and why am I doing it? Then you can make sure that you are using the right test for the right statistical purpose.

So first of all, number one, the description part. Descriptive statistics describes the sample and the target population. It involves describing the distribution of a single variable, including its central tendency and dispersion. The characteristics of the variable's distribution should be depicted in graphical or tabular format, including histograms. And always, when you report it, you need to fully describe your statistical parameters,
including the central tendencies, like the mean, and the variation, like the standard deviation. So this is one of the methods to describe your samples.

One of the best-known ways to describe your sample is the standard deviation, which is a measure that quantifies the amount of variation or dispersion of a set of values. A sigma close to zero indicates that the data points tend to be very close to the mean of the set. As you can see here, we are assuming that the sample is normally distributed and that we have a constant error. The standard deviation is one of the most common measures we use when we look at the distribution and check whether the sample is normally distributed or not.

Another very common test used to describe a sample is the Kolmogorov-Smirnov test. It is a nonparametric test of whether a variable follows a given distribution in a population. This given distribution is usually, but not always, the normal distribution, hence the Kolmogorov-Smirnov normality test. Here I can visualize it, and through juxtaposition I can characterize how far my sample is from being normally distributed, and whether it deviates from a standard distribution or not. So this is the second test that can be conducted in the field of descriptive statistics.

Now moving to the second type of statistical analysis, the one that seeks to detect variance. If I am detecting variance, most probably I have introduced a treatment. What is variance? Variance measures how far a set of numbers is spread out, so a variance of zero indicates that all values are identical. Here we look at high variability, medium variability, and low variability, and based on this formula we can detect the variance between two groups of samples. If this variance is significant, then we can proceed with
the next step. But let me build on what I said about the detection of variance. The aim is to determine the variance of your sample, and what are the most common tests in this field? There is the chi-square test for a single variance, where the test hypothesis must always be expressed in terms of variance or standard deviation. There is the Wilcoxon signed-rank test, which can be used to determine whether two dependent samples were selected from populations having the same distribution. And there is the one-way analysis of variance. Those are the most common tests used to check the variance. In your experimental setting you have a subject with intervention and a subject without intervention. You conduct these tests on the results, after replicating them and applying randomization, and then you start to prove your variance.

Following up on the detection of variance, you can look at the ANOVA test. It is a statistical method that aims to determine whether there are any statistically significant differences between the means of three or more independent groups. It is a very famous test, and again it is the key to proceeding with your statistical testing. If you cannot prove variance, then you cannot prove that your treatment or intervention has a significant effect on the results, whether an improvement or a worsening, positive or negative. So the ANOVA test is the second type of statistical analysis we do after the description. Still talking about the detection of variance: with the ANOVA test I will always have two hypotheses. H0, the null hypothesis, is that there is no difference between the group means, and H1 is that there is a difference between the group means. Here I can use the p-value, which tells me how likely it is that the difference between groups happened by
accident. If it is less than 0.05, a statistically significant difference between groups is proved. So this is the second approach, the detection of variance.

Now I will move to the third approach, when I start to compare between groups. This is called the comparing and hypothesis-testing approach. Here we look at the analysis of variance as an analytical and statistical procedure to determine whether there are differences between group means in a sample, and whether these differences exist only due to randomness or can be attributed to a specific cause. Here the level of complexity and uncertainty is getting higher, because I am also looking at causation. What are the most common tests? The one-way and the two-way ANOVA. The F-test of equality of variances is also used, to test the null hypothesis that two normal populations have the same variance. This is important when I start to look at comparison and to confirm or refute a hypothesis. Here you are always assuming that the sample is normally distributed and that the error is constant, because if not, we have different types of tests.

Comparison between groups is essential to determine whether there is a significant difference. For the null hypothesis testing we mentioned, we can have the F, t, and r tests, with confidence intervals, effect sizes, degrees of freedom, and the p-value noted. We also have the t-test, which is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups, which may be related in certain features. These are the reported tests that are mostly used to show that a comparison was done, and whether the null hypothesis holds or the alternative hypothesis is supported.

Now moving to the fourth type of testing. If I have already proved that there is variance, and I have made sure that the groups
indicate that there is variance, then I can move on to establishing correlation. What do I mean by establishing correlation? Correlation is a way to describe whether there is a relationship that links the two variables I am investigating. The definition is: two variables are correlated if a unit change in one variable goes with a change in the other variable. Correlation can be either positive or negative. As we see in this example, with a positive correlation the two variables move in the same direction: if one increases the other increases, and if one decreases the other decreases. With a negative correlation the two variables also move together, but in opposite directions: if one increases the other decreases, and if one decreases the other increases.

To indicate my correlation I need the beta coefficient, which is used to measure the degree of association of the two variables, and it is a calculation. If it is a linear model, then it will be a general linear model, which is a framework for comparing how several variables affect different continuous variables. Then I can use this equation, of the form y = f(x, β) + ε, where I have the dependent variable y, the independent variable x, a function f, the unknown parameter beta, and the error term, so that I take care of the replication error and any conditions in the control condition. When I report this information in any protocol, report, or publication, I must give my regression coefficient beta. It signifies the amount by which y changes for a unit increase in x, and in this way it represents the degree to which the line slopes upwards or downwards.

So here I have correlation, but take care that it is not a
straightforward approach. Having proved variance in my sample does not mean that there is correlation. I have to conduct a series of statistical tests, and I have to be very rigorous and very careful to check whether there really is a correlation or not, because having variance does not mean that by default I will have a correlation.

As part of establishing correlation, I also look at the significance level of the statistical test. It is the probability that the test result could have occurred by chance. If the level is quite low, that is, the probability of it occurring by chance is quite small, we say the test is significant. A p-value is a measure of the probability that an observed difference could have occurred just by random chance. Therefore we use the p-value, and it must be noted in any reporting of the test. You have to be able to interpret it and position it with regard to your relation.

Some brief information about the types of correlation. Here we look at linear relations between two continuous variables, but there are different types of correlation and different approaches. We have the famous Spearman correlation, the Pearson correlation, and the contingency coefficient, and they are all types of tests for linear correlation. For non-linear correlation you have different types of statistical analysis, but this is beyond today's presentation and you would have to go into more detail.

I am still talking about how to establish correlation, and once I prove that there is a correlation, I can go for a regression. Regression will allow me to predict, so I can have a curve and see what happens. Regression simply describes the nature of the relationship between the two variables. In a regression of y on x, y is the dependent variable, which is measured by the researcher, and x is the independent variable, which is
controlled by the researcher. It means that the average value of y is a function of x, and the relationship is represented by a regression equation. As you can see, here is the regression equation, and here it is the multiple linear one. Multiple linear regression is a generalization of simple linear regression to the case of more than one independent variable, so we can have different types of regression. Here too we have to report the regression coefficient beta, which signifies the amount by which y changes for a unit increase in x and in this way represents the degree to which the line slopes upwards or downwards. So this is very important to look at.

Now, once I have the regression and I am able to establish a correlation, I can go to the prediction part. Here regression analysis is the best way to do it, because regression analysis is a statistical technique for determining the relationship between a single dependent (criterion) variable and one or more independent (predictor) variables. The analysis yields a predicted value for the criterion, resulting from a linear combination of the predictors. Here I can use simple linear regression; if I am doing non-parametric testing, I can do simple and multiple logistic regression, and I can do log-linear models and artificial-intelligence-based methods, which are more complicated and more advanced.

Now I am moving to the last part of the types of statistical analysis that you need to be aware of. If you described your sample, you proved that there is variance, and now you also have a correlation, you can look at causality, which is not a straightforward thing. Causality here is very important, because you could use a correlation as your statistical test and demonstrate that the high-quality true experiment you conducted strongly implies causation. This rarely happens, because it is very difficult to prove causation: you need a very large sample size, you need to replicate
your test, you have to make sure there is randomization, you have to avoid any biases, you have to do blinding, and so on. If the result is positive, then causality is there. So a full description of your associated estimates of uncertainty and your confidence intervals needs to be included when you are communicating any results that confirm or negate causality.

Proving causality can involve correlation, but an important remark here: correlation does not mean that there is causation. I can have a positive correlation or a negative correlation, but it does not mean that I have causality. Causation means that one event causes another event to occur, and causation can only be determined from an appropriately designed experiment. My only advice for you, if you are seeking this causation aspect, is that you use the fishbone diagram. Most of the time the phenomena that we are investigating today are so complex, and we have this cocktail effect, that it is very difficult to say that this specific factor has this influence on this phenomenon. Most of the time it is a cocktail: a series of recurring phenomena that together are causing this effect. Therefore we use this famous fishbone diagram to identify, display and examine possible causes of variance. Once you name your possible causes, and you make sure that you control them and neutralize them all, your statistical test has to make sure that this was taken into account. Only then can you have causation and a positive confirmation.

Voila, this is the end of today's presentation. Some take-away messages before you leave. Statistical analysis is very important. What do we do with statistical analysis? I want you just to memorize this slide: this is why we do statistical analysis. Before getting into detail, learning software and knowing what type of test to do, you just need to know why we do statistical analysis.
Logically, we do statistical analysis to describe the sample, to detect variance, to compare two different groups and test our hypothesis, to establish a correlation, to predict (through regression, for example), and finally to prove causality. Keep in mind that the process is not straightforward, and not every experiment that you do will go from one to six. I can tell you that most investigations, about eighty percent, stop after step number two: we don't detect any variance, and then the experiment is dead. This is, unfortunately, the null result. But today the scientific world is trying to allow publishing even the null result, because we learn from it. If you do detect variance, then you can go on to the comparison of the groups, and then to establishing a correlation and prediction before proving causality.

The second key take-away message of today: you always have to look at the normal distribution. What I talked about today holds mainly under the condition that I am assuming the sample is normally distributed and that the error variance is constant. You also have to make sure to include all relevant factors to control your experiment.

When you do statistical analysis, or when you are preparing your experiment for statistical analysis, revisit the relevant literature and look at the different tests and variables that were used in the field or in the same area of investigation. Avoid weak statistical tests: any statistical test that is considered a weak test, try to avoid it. Pay attention to your assumptions, reinvestigate them and test them. Repeat your statistical test and your report; this is very important so that you detect errors and report the confidence interval for your experiment. And for sure you must indicate the appropriateness of each statistical test and fully report all the outcomes and all the types of tests that were conducted
when you are doing your test. Finally, I advise you to seek training courses on statistical analysis, to be better able to design your experiment, conduct your experiment and analyze the data in a meaningful way and with a straightforward approach.

With that I end today's presentation. I advise you to look at the following video on design of experiments, and don't hesitate to share this video with any colleague or scientist who is looking to learn about statistics and perform experimental research. Finally, I thank you for your attention and I hope you enjoyed this presentation. Thank you very much.
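
As a companion to the correlation and regression steps described in this presentation, here is a minimal Python sketch of how they could look in practice. It is only an illustration under stated assumptions: the `dose` and `response` values are made up, the function names are hypothetical, and the p-value comes from a simple permutation test that is swapped in here because it needs nothing beyond the standard library. It is not the specific test shown on the slides.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares fit of y = beta0 + beta1 * x; returns (beta0, beta1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    beta1 = sxy / sxx                 # change in y for a unit increase in x
    beta0 = mean_y - beta1 * mean_x   # intercept
    return beta0, beta1


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: share of random shuffles of y whose |r| is at
    least as large as the observed |r| (assumption: permutation test)."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data: x is controlled by the researcher, y is measured.
dose = [1, 2, 3, 4, 5, 6, 7, 8]
response = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

r = pearson_r(dose, response)
beta0, beta1 = fit_line(dose, response)
p = permutation_p_value(dose, response)
predicted = beta0 + beta1 * 9  # prediction for a new dose of 9

print(f"r = {r:.3f}, beta1 = {beta1:.3f}, p = {p:.4f}, prediction = {predicted:.2f}")
```

The sketch follows the same order as the lecture: first check whether a correlation exists and how likely it is to be due to chance, then fit the regression line and report beta, and only then use the line to predict. None of this says anything about causality, which still depends on how the experiment was designed.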