Transcript for:
Introduction to Statistics

Let's begin this lesson by defining the term statistics. Statistics is a mathematical science pertaining to the collection, presentation, analysis, and interpretation of data. It is widely used to understand the complex problems of the real world and simplify them to make well-informed decisions. Several statistical principles, functions, and algorithms can be used to analyze primary data, build a statistical model, and predict outcomes.

An analysis of any situation can be done in two ways: a statistical analysis or a non-statistical analysis. Statistical analysis is the science of collecting, exploring, and presenting large amounts of data to identify patterns and trends; it is also called quantitative analysis. Non-statistical analysis provides generic information and includes text, sound, still images, and moving images; it is also called qualitative analysis. Although both forms of analysis provide results, statistical analysis gives more insight and a clearer picture, a feature that makes it vital for businesses.

There are two major categories of statistics: descriptive statistics and inferential statistics. Descriptive statistics helps organize data and focuses on the main characteristics of the data. It provides a summary of the data numerically or graphically. Numerical measures such as the average, mode, standard deviation (SD), and correlation are used to describe the features of a data set. Suppose you want to study the height of students in a classroom. With descriptive statistics, you would record the height of every person in the classroom and then find the maximum height, minimum height, and average height of the population.

Inferential statistics generalizes from a sample to the larger data set and applies probability theory to draw a conclusion. It allows you to infer population parameters based on sample statistics and to model relationships within the data. Modeling allows you to develop mathematical equations that describe the interrelationships between two or more variables. Consider the same example of calculating the height of students in the classroom. With inferential statistics, you would categorize height as tall, medium, and small, and then take only a small sample from the population to study the height of students in the classroom.

The field of statistics touches our lives in many ways, from the daily routines in our homes to the business of making the greatest cities run; the effects of statistics are everywhere.

There are various statistical terms that one should be aware of while dealing with statistics: population, sample, variable, quantitative variable, qualitative variable, discrete variable, and continuous variable. A population is the group from which data is to be collected. A sample is a subset of a population. A variable is a feature that is characteristic of any member of the population, differing in quality or quantity from another member. A variable differing in quantity is called a quantitative variable, for example, the weight of a person or the number of people in a car. A variable differing in quality is called a qualitative variable or attribute, for example, color or the degree of damage to a car in an accident. A discrete variable is one for which no value can be assumed between two given values, for example, the number of children in a family. A continuous variable is one for which any value can be assumed between two given values, for example, the time taken for a 100-meter run.

Typically, there are four types of statistical measures used to describe the data: measures of frequency, measures of central tendency, measures of spread, and measures of position.
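Before going through each measure in detail, here is a minimal SAS sketch of the earlier classroom-height example. The data set name, variable name, and height values below are hypothetical and only illustrate how descriptive statistics such as the minimum, maximum, and average could be computed.

```sas
/* Hypothetical classroom heights, in centimeters (made-up values) */
data class_heights;
   input height @@;
   datalines;
152 160 165 158 171 149 163 167 155 170
;
run;

/* Count, mean, standard deviation, minimum, and maximum height */
proc means data=class_heights n mean std min max maxdec=1;
   var height;
run;
```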
Let's learn each in detail. Frequency of the data indicates the number of times a particular data value occurs in the given data set; the measures of frequency are count and percentage. Central tendency indicates whether the data values tend to accumulate in the middle of the distribution or toward the ends; the measures of central tendency are mean, median, and mode. Spread describes how similar or varied the set of observed values is for a particular variable; the measures of spread are standard deviation, variance, and quartiles. The measures of spread are also called measures of dispersion. Position identifies the exact location of a particular data value in the given data set; the measures of position are percentiles, quartiles, and standard scores.

Statistical Analysis System, or SAS, provides a list of procedures to perform descriptive statistics. They are as follows: PROC PRINT, PROC CONTENTS, PROC MEANS, PROC FREQ, PROC UNIVARIATE, PROC GCHART, PROC BOXPLOT, and PROC GPLOT. PROC PRINT prints all the variables in a SAS data set. PROC CONTENTS describes the structure of a data set. PROC MEANS provides data summarization tools to compute descriptive statistics for variables across all observations and within groups of observations. PROC FREQ produces one-way to n-way frequency and cross-tabulation tables; frequencies can also be output to a SAS data set. PROC UNIVARIATE goes beyond what PROC MEANS does, is useful in conducting some basic statistical analyses, and includes high-resolution graphical features. The GCHART procedure produces six types of charts: block charts, horizontal and vertical bar charts, pie charts, donut charts, and star charts. These charts graphically represent the value of a statistic calculated for one or more variables in an input SAS data set; the chart variables can be either numeric or character. The BOXPLOT procedure creates side-by-side box-and-whisker plots of measurements organized in groups; a box-and-whisker plot displays the mean, quartiles, and minimum and maximum observations for a group. The GPLOT procedure creates two-dimensional graphs, including simple scatter plots, overlay plots in which multiple sets of data points are displayed on one set of axes, plots against a second vertical axis, bubble plots, and logarithmic plots.

In this demo, you'll learn how to use descriptive statistics to analyze the mean from the electronic data set. Let's import the electronic data set into the SAS console. In the left pane, right-click the electronic.xlsx data set and click Import Data. The code to import the data is generated automatically; copy the code and paste it in the new window. The PROC MEANS procedure is used to analyze the mean of the imported data set. The keyword DATA identifies the input data set; in this demo, the input data set is electronic. The output obtained is shown on the screen. Note that the number of observations, mean, standard deviation, and maximum and minimum values of the electronic data set are obtained. This concludes the demo on how to use descriptive statistics to analyze the mean from the electronic data set.
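As a rough sketch of what the auto-generated import code and the PROC MEANS step of this demo might look like: the file path and the options below are assumptions for illustration, since the generated code itself is not reproduced in the transcript.

```sas
/* Import the Excel file into a SAS data set named ELECTRONIC           */
/* (the path below is a placeholder; use the location of your own file) */
proc import datafile="/folders/myfolders/electronic.xlsx"
            out=electronic
            dbms=xlsx
            replace;
   getnames=yes;
run;

/* Default PROC MEANS output: N, mean, standard deviation, minimum, maximum */
proc means data=electronic;
run;
```

By default, PROC MEANS reports the number of observations, mean, standard deviation, minimum, and maximum for every numeric variable, which matches the output described in the demo.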
So far, you've learned about descriptive statistics. Let's now learn about inferential statistics. Hypothesis testing is an inferential statistical technique to determine whether there is enough evidence in a data sample to infer that a certain condition holds true for the entire population. To understand the characteristics of the general population, we take a random sample and analyze the properties of the sample. We then test whether or not the identified conclusions correctly represent the population as a whole. The purpose of hypothesis testing is to choose between two competing hypotheses about the value of a population parameter. For example, one hypothesis might claim that the wages of men and women are equal, while the other might claim that women make more than men.

Hypothesis testing is formulated in terms of two hypotheses: the null hypothesis, which is referred to as H0, and the alternative hypothesis, which is referred to as H1. The null hypothesis is assumed to be true unless there is strong evidence to the contrary; the alternative hypothesis is assumed to be true when the null hypothesis is proven false. Let's understand the null hypothesis and the alternative hypothesis using a general example. The null hypothesis attempts to show that no variation exists between variables, and the alternative hypothesis is any hypothesis other than the null. For example, say a pharmaceutical company has introduced a medicine in the market for a particular disease, people have been using it for a considerable period of time, and it's generally considered safe. The claim that the medicine is safe is the null hypothesis. To reject the null hypothesis, we should prove that the medicine is unsafe. If the null hypothesis is rejected, then the alternative hypothesis is accepted.

Before you perform any statistical tests with variables, it's important to recognize the nature of the variables involved. Based on their nature, variables are classified into four types: categorical or nominal variables, ordinal variables, interval variables, and ratio variables. Nominal variables are ones that have two or more categories where it's impossible to order the values; examples of nominal variables include gender and blood group. Ordinal variables have values that are ordered logically; however, the relative distance between two data values is not clear. Examples of ordinal variables include the size of a coffee cup (large, medium, and small) and the ratings of a product (bad, good, and best). Interval variables are similar to ordinal variables, except that the values are measured in a way where their differences are meaningful. With an interval scale, equal differences between scale values do have equal quantitative meaning; for this reason, an interval scale provides more quantitative information than the ordinal scale. The interval scale does not have a true zero point; a true zero point means that a value of zero on the scale represents zero quantity of the construct being assessed. Examples of interval variables include the Fahrenheit scale used to measure temperature and the distance between two compartments in a train. Ratio scales are similar to interval scales in that equal differences between scale values have equal quantitative meaning; however, ratio scales also have a true zero point, which gives them an additional property. For example, the system of inches used with a common ruler is a ratio scale; there is a true zero point because zero inches does in fact indicate a complete absence of length.

In this demo, you'll learn how to perform hypothesis testing using SAS. In this example, let's test a claim about a set of observations from a random sample. The keyword DATA identifies the input data set. The INPUT statement is used to declare the aging variable, and the CARDS statement to read the data into SAS. Let's perform a t-test to check the null hypothesis. Let's assume the null hypothesis to be that the mean days to deliver a product is six days, so the null hypothesis is that the mean equals six (H0 = 6). The alpha value is the probability of making an error; 5 percent is the standard choice, and hence alpha equals 0.05. The VAR statement names the variable to be used in the analysis. The output is shown on the screen. Note that the p-value is greater than the alpha value of 0.05; therefore, we fail to reject the null hypothesis. This concludes the demo on how to perform hypothesis testing using SAS.
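Here is a minimal sketch of this demo, assuming the analysis variable is named aging as mentioned above; the data set name and the delivery-day values are made up, since the actual numbers used in the demo are not shown in the transcript.

```sas
/* Hypothetical days taken to deliver a product, read with INPUT and CARDS */
data delivery;
   input aging @@;
   cards;
5 7 6 8 6 5 9 6 7 5 6 8
;
run;

/* One-sample t-test of H0: mean = 6 at alpha = 0.05 */
proc ttest data=delivery h0=6 alpha=0.05;
   var aging;
run;
```

The PROC TTEST output includes the t statistic, its p-value, and a confidence interval for the mean; as in the demo, the p-value is compared with alpha to decide whether to reject the null hypothesis.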
Let's now learn about hypothesis testing procedures. There are two types of hypothesis testing procedures: parametric tests and non-parametric tests. In statistical inference or hypothesis testing, the traditional tests such as the t-test and ANOVA are called parametric tests; they depend on the specification of a probability distribution except for a set of free parameters. In simple words, you can say that if the population information is known completely through its parameters, then it is a parametric test. If the population or parameter information is not known and you are still required to test a hypothesis about the population, then it's a non-parametric test; non-parametric tests do not require any strict distributional assumptions.

There are various parametric tests. They are as follows: the t-test, ANOVA, chi-square, and linear regression. Let's understand them in detail.

T-test: a t-test determines if two sets of data are significantly different from each other. The t-test is used in the following situations: to test if the mean is significantly different from a hypothesized value, to test if the means of two independent groups are significantly different, and to test if the means of two dependent or paired groups are significantly different. For example, let's say you have to find out which region spends the highest amount of money on shopping. It's impractical to ask everyone in the different regions about their shopping expenditure. In this case, you can estimate the shopping expenditure by collecting sample observations from each region. With the help of the t-test, you can check if the difference between the regions is significant or a statistical fluke.

ANOVA: ANOVA is a generalized version of the t-test and is used to test whether the mean of an interval dependent variable differs across the levels of a categorical independent variable. When we want to compare two or more groups, we apply the ANOVA test. For example, let's extend the t-test example: now you want to check how much people in various regions spend every month on shopping. In this case, there are four groups, namely East, West, North, and South. With the help of the ANOVA test, you can check if the difference between the regions is significant or a statistical fluke.

Chi-square: chi-square is a statistical test used to compare observed data with the data you would expect to obtain according to a specific hypothesis. Let's understand the chi-square test through an example. You have a data set of male shoppers and female shoppers, and you need to assess whether the probability of females purchasing items of 500 or more is significantly different from the probability of males purchasing items of 500 or more.

Linear regression: there are two types of linear regression, simple linear regression and multiple linear regression. Simple linear regression is used when one wants to test how well one variable predicts another variable. Multiple linear regression allows one to test how well multiple variables, or independent variables, predict a variable of interest. When using multiple linear regression, we additionally assume the predictor variables are independent. For example, finding the relationship between two variables, say sales and profit, is simple linear regression; finding the relationship between three variables, say sales, cost, and telemarketing, is multiple linear regression.
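As a hedged sketch of how these two regressions might be set up in SAS with PROC REG: the data set, the numbers, and the choice of which variable acts as the response in each model are assumptions for illustration, not taken from the lesson.

```sas
/* Hypothetical monthly figures (all values made up for this sketch) */
data company;
   input sales profit cost telemarketing;
   datalines;
120 30 70 10
150 42 85 14
135 36 78 12
160 45 90 15
110 27 65  9
175 50 98 17
;
run;

/* Simple linear regression: profit modeled on sales alone             */
/* Multiple linear regression: sales modeled on cost and telemarketing */
proc reg data=company;
   simple_lr:   model profit = sales;
   multiple_lr: model sales  = cost telemarketing;
run;
quit;
```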
Some of the non-parametric tests are the Wilcoxon rank sum test and the Kruskal-Wallis H test. Wilcoxon rank sum test: the Wilcoxon rank sum test is a non-parametric statistical hypothesis test used to compare two independent samples to assess whether or not their population mean ranks differ (its paired-samples counterpart, the Wilcoxon signed rank test, is used for two related or matched samples). In the Wilcoxon rank sum test, you test the null hypothesis on the basis of the ranks of the observations. Kruskal-Wallis H test: the Kruskal-Wallis H test is a rank-based non-parametric test used to compare independent samples of equal or different sample sizes. In this test, you test the null hypothesis on the basis of the ranks of the independent samples.

The advantages of parametric tests are as follows: they provide information about the population in terms of parameters and confidence intervals; they are easier to use in modeling, analyzing, and describing data with central tendencies and data transformations; they express the relationship between two or more variables; and they don't need the data to be converted into rank order for testing. The disadvantages of parametric tests are as follows: they only support normally distributed data, and they are only applicable to variables, not to attributes.

Let's now list the advantages and disadvantages of non-parametric tests. The advantages of non-parametric tests are as follows: they are simple and easy to understand; they do not involve population parameters or sampling theory; they make fewer assumptions; and they provide results similar to parametric procedures. The disadvantages of non-parametric tests are as follows: they are not as efficient as parametric tests, and it is difficult to perform operations on large samples manually.
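In SAS, both of these rank-based tests can be requested through PROC NPAR1WAY with the WILCOXON option: with two groups it reports the Wilcoxon rank sum test, and with more than two groups the Kruskal-Wallis test. The sketch below uses made-up data; the data set name, group labels, and values are hypothetical.

```sas
/* Hypothetical monthly shopping spend for shoppers from two regions */
data spend;
   input region $ amount @@;
   datalines;
East 420 East 515 East 380 East 610 East 450
West 520 West 640 West 700 West 480 West 590
;
run;

/* WILCOXON requests the rank-sum test (Kruskal-Wallis when there are more than two groups) */
proc npar1way data=spend wilcoxon;
   class region;
   var amount;
run;
```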