Transcript for:
Understanding Statistics for Data Science

Hello everyone, I am Prashant Sharma. We are going to record in Hindi the lectures of Stats 1 which Usama Em has taught in English. So, let's start the first lecture, lecture 1.1. In this first lecture, we will try to understand, first of all, why this course exists.

That is, what is the objective of Statistics for Data Science 1? After that, we will see the definition of statistics. After the definition, we will see its two major branches: what descriptive statistics is and what inferential statistics is.

As soon as we come to inferential statistics, we need to know what a sample is and what a population is. We will see the difference between sample and population. After that, we will try to understand how to collect data.

As soon as we collect the data, it has to be put in tabular form, as a data set. Once we have a proper data set, we will try to understand what the variables and the cases in that data set are. Then we will see the types of data, for example that a numerical variable in the data set is called quantitative data. Then we will try to understand the difference between a cross-sectional data set and a time-series data set.

After that, we will move on to scales of measurement. There are four different scales of measurement: nominal, ordinal, interval and ratio. We will read about all of them in detail. Then we will see how to download and manipulate a data set and how to work on it. The main objective of statistical analysis is to understand the data set.

After understanding the data set, the main task is to frame the questions from the data set. We will see all these things in detail. Let's start with this.

We will see the basic definitions of population, sample, descriptive statistics and inferential statistics. First, what is statistics? The definition of statistics has been changing over the years. The definition given by Sheldon Ross is that statistics is the art of learning from data. The art of learning from data means that we learn some information from data.

It is the art of deciding what information we can extract from the data and use for our analysis. If we see the formal definition, statistics is the art of learning from data. It is concerned with the collection of data, their subsequent description, and their analysis, which often leads to the drawing of conclusions.

So, when we analyse a data set, at the end we arrive at the conclusions that we have drawn from that particular data set. Next, we will see that statistics can be divided into two major parts.

First is descriptive statistics, second is inferential statistics. So, when do we use descriptive statistics, and when do we use inferential statistics? Descriptive statistics is used when the objective is only to explore the data set, when we only have to summarize it.

If we see the definition, the part of statistics concerned with the description and summarization of data is called descriptive statistics. Whenever we work only with the data set that is given, and we are interested only in its description and summarization, we use descriptive statistics. The second part of statistics is inferential statistics. What is inferential statistics?

If we go by the word, inference means we have to infer something from the data. In this part, we draw some conclusion from the data set. If you see the definition, the part of statistics concerned with the drawing of conclusions from data is called inferential statistics. Whenever we draw a conclusion from a data set, the possibility of chance comes in, and this possibility of chance is why we need to study probability. We will read about probability in the next week. For now, we need to understand that whenever we try to infer or draw a conclusion from a small data set, there is a chance factor. And as soon as we do inferential statistics, we need two things: population and sample.

We will try to understand that through examples. Suppose that, in inferential statistics, I am interested in knowing the percentage of all the students in India who have passed their class 12th exams and study engineering. Other examples are the prices of all houses in Tamil Nadu, the total sales of all cars in India in the year 2019, and the age distribution of people who visit a city mall in a particular month. Here we have two ways.

First, suppose I want to know the percentage of all students in India who have passed their class 12th exam and want to do engineering. In that case, we can do complete enumeration.

Complete enumeration means that we collect data on all the students who have passed class 12th and then find out how many are going for engineering. Similarly, consider the prices of all houses in Tamil Nadu, say the prices of all the houses sold in a particular year. As we try to do this, we can see that it is a very difficult task.

Collecting all the data and then analysing it is very difficult from the point of view of cost and time. The other way to get this information is to take a small part of the data, which is called a sample. We can take a sample, do the analysis on it, and draw a conclusion for the whole population.

That is inferential statistics: the drawing of conclusions from data. From which data?

From small data, that is, sample data, to get an inference about the bigger population data set. So what is a population?

It is the collection of all the elements in which we are interested. What will be the population here? The collection of all the students in India who have passed the class 12th.

That will be the population, and it is very difficult to work on all of it. Similarly, collecting data on the prices of all houses in Tamil Nadu is also a difficult task. So, we will take a sample, a small data set.

What is that? It is a part of the population. Since it is a sub-part, I will draw it inside the population.

This will be my sample. So, by working on the sample data set, we will draw the conclusion for the population data set. This is inferential statistics.
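The idea of working on a sample and drawing a conclusion for the population can be sketched in a few lines of Python. This is only an illustration: the population of 100,000 students, the 30% share and the sample size of 1,000 are all invented numbers, not figures from the lecture.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 students who passed class 12.
# True means the student wants to do engineering (the 30% share is invented).
population = [random.random() < 0.30 for _ in range(100_000)]

# Complete enumeration: look at every element of the population.
population_pct = 100 * sum(population) / len(population)

# Sampling: look at only 1,000 randomly chosen students.
sample = random.sample(population, 1_000)
sample_pct = 100 * sum(sample) / len(sample)

print(f"Population percentage: {population_pct:.1f}")
print(f"Sample percentage:     {sample_pct:.1f}")
```

The sample percentage is usually close to the population percentage, but not exactly equal to it. That gap is the chance factor mentioned earlier.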

We have to draw the inference for a larger population. We need to keep in mind that the sample should be representative of the population. Now what is representative?

If we see here, what is the definition of population? The total collection of all the elements that we are interested in. For example, what will be the population?

The collection of all the students who have passed the class 12th, or the collection of prices of all the houses in Tamil Nadu, will be our population. A sample will be a part of it.

So the sample should be a good representative of the population. What does representative mean?

Suppose I have this population and this is our sample. Which is the bigger one? The population.

Which is the small one? The sample. This sample is not a good representative of the population. Why?

Because we can see here that only blue items are selected in the sample, not orange ones. So what does this mean? A sample should be a good representative of the population.
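The blue and orange picture can be imitated with a small sketch. The counts (60 blue, 40 orange) and the sample size of 20 are assumptions made up for this illustration, and `share_of_blue` is a hypothetical helper name.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 60 blue items and 40 orange items.
population = ["blue"] * 60 + ["orange"] * 40


def share_of_blue(items):
    """Return the fraction of items that are blue."""
    return items.count("blue") / len(items)


# A sample that picks only blue items, as in the picture described above.
biased_sample = [item for item in population if item == "blue"][:20]

# A simple random sample of the same size.
random_sample = random.sample(population, 20)

print("Population share of blue:   ", share_of_blue(population))
print("Biased sample share of blue:", share_of_blue(biased_sample))
print("Random sample share of blue:", share_of_blue(random_sample))
```

The all-blue sample reports a blue share of 1.0 although the population share is 0.6, so it misrepresents the population. A random sample is not guaranteed to match the population exactly, but it tends to contain both colours in roughly the right proportion.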

As we said, a sample is a subgroup of the population that will be studied in detail. Population and sample are used a lot in inferential statistics. We will study that further.

Now we will try to see the purpose of statistical analysis, that is, when we go for a descriptive study and when we go for an inferential study. If the purpose of our analysis is just to explore and examine whatever data set we have, to see what is there in that data set and what is not, then we use descriptive statistics.

For example, I will try to explain through a data set. I have a cricket data set. It is not a complete data set, it is a small data set, a subpart. It has player name, jersey number, matches played, role, batting average, highest score, wickets, bowling average and best bowling. Suppose my interest is only in questions like which player scored the most runs, whose batting average is the best, whose bowling average is good, whose highest score is good, and who has taken the most wickets.

If we just want to see all these things, if I just have to examine and explore this particular data set and summarize whatever information is given here, nothing more than that, then descriptive statistics is used. We don't have to do more than that. We just have to examine and explore.

In that case, descriptive statistics is used. Whereas if my objective is to carry information from a sample data set, a small data set, to a large population data set, then inferential statistics is used. As we can see here, if the information is obtained from a sample of a population and the purpose of the study is to use that information to draw conclusions about the population, then the study is inferential.

Consider this data set again. It is not a complete cricket data set, it is a small data set, but suppose I have to draw some conclusion from it. We all know about the IPL auction, where we have to select players. If I use this data set for that player selection procedure, I will collect information from it that will help me later in player selection. I am not interested only in examining and summarizing this data set. I am interested in the future as well.

Our job is to get players for the IPL auction. For that, I will gather information from this data set, information that we will use for player selection in the future. In this case, inferential statistics is being done.

When we gather information from this small data set and use it for the further task of player selection, that is inferential statistics, because we are moving beyond the small data set.

Now, descriptive statistics may be performed either on a sample or on a population. We can understand this from what descriptive statistics is. In the cricket data set that we have, data of 25 players is given.

So, if data of 25 players is given, we will summarize that data set, examine it and explore it. Suppose a data set of 250 players is given: we will examine and explore that data set, and that is descriptive statistics. We just have to examine and explore, nothing more than that. If we have a data set of 2500 players and we examine and explore it, that too is descriptive statistics.
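To make "examine and explore" concrete, here is a minimal sketch of the kind of questions asked about the cricket data set. The three player records, their names and all their numbers are invented for illustration. The same code would work unchanged for 25, 250 or 2500 players.

```python
# Hypothetical records; names and numbers are invented for illustration.
players = [
    {"name": "Player A", "matches": 120, "batting_avg": 45.2,
     "highest_score": 183, "wickets": 4},
    {"name": "Player B", "matches": 85, "batting_avg": 12.7,
     "highest_score": 41, "wickets": 132},
    {"name": "Player C", "matches": 60, "batting_avg": 33.9,
     "highest_score": 110, "wickets": 58},
]

# Each question is answered only from the data that is given.
best_batter = max(players, key=lambda p: p["batting_avg"])
top_scorer = max(players, key=lambda p: p["highest_score"])
top_wicket_taker = max(players, key=lambda p: p["wickets"])

print("Best batting average:", best_batter["name"])
print("Highest score:       ", top_scorer["name"])
print("Most wickets:        ", top_wicket_taker["name"])
```

Nothing here says anything about players outside the list, which is why it counts as descriptive statistics.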

So, we can perform descriptive statistics on a sample or on a population. Inferential statistics is performed on a sample, because in inferential statistics we draw a conclusion from a sample data set about a bigger data set, the population. We will use sample and population throughout the course, for statistical inference and so on. Here we just need to understand that whenever we compute any summary statistics, we should be clear whether those summary statistics are for a sample, a small data set, or for a population data set. So, through sample and population we tried to understand what inferential statistics is, and we also saw what descriptive statistics is.

At the end of this video, to summarize: we have seen that there are two major branches of statistics, descriptive statistics and inferential statistics. Descriptive statistics is when we have to summarize, examine and explore the data set.

For example, we can take a data set. Suppose that, as an analyst, I went to a school that has classes 1 to 12, and I am interested in the average performance of that school. Due to time and other factors, I simply picked 50 students: I randomly selected 50 students and collected their marks. Now this data set of 50 students is a sample, and all the students in that school are our population.

Now, if I look at the marks data of these 50 students, I can see the maximum marks, minimum marks, average marks and other things. Suppose the maximum is 92, the minimum is 32 and the average is 70. This is hypothetical data, it can be correct or wrong. If I stopped here, if I just took the data set of 50 students and saw that the maximum, minimum and average marks are this much, then this is descriptive statistics. Why? Because in this case I did not go beyond this data set. I simply summarized it and examined what the maximum, the minimum and the average are. I can explore this data set more.
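The summary described here takes only a few lines. In this sketch the list of marks is hypothetical and has just five students instead of 50. The values were chosen only so that the summary reproduces the lecture's numbers.

```python
from statistics import mean

# Hypothetical marks for five students (the lecture's sample has 50);
# the values are chosen only so the summary matches 92, 32 and 70.
marks = [92, 32, 75, 68, 83]

print("Maximum:", max(marks))
print("Minimum:", min(marks))
print("Average:", mean(marks))
```

Stopping at these three numbers, without saying anything about students who are not in the list, is descriptive statistics.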

Now where does inferential statistics come in? Suppose I have to submit a report on that particular school. Suppose I have computed the average marks of this sample data set as 70. Then I say that the students of that school, on average, score 70. That is, on average they are getting 70 marks in their exams.

Now what is happening here is that, on the basis of sample data, I have concluded that the average marks of that school is 70. On the basis of small data, I am drawing a conclusion for the whole school, for a big data set, and that is where inferential statistics comes in. So you can see the difference: descriptive is when I just examine, explore and summarize whatever data set is given to me. Inferential is when we perform analysis on a sample data set and draw some conclusion for a large population. We also understood the difference between population and sample.
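The step from the sample average to a statement about the whole school can be sketched as follows. The school of 1,200 students and the range of marks are assumptions made up for this illustration. Only the sample size of 50 comes from the lecture.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: marks of all 1,200 students in the school.
school_marks = [random.randint(30, 100) for _ in range(1200)]
population_mean = mean(school_marks)

# Draw several random samples of 50 students and compute each sample mean.
sample_means = [mean(random.sample(school_marks, 50)) for _ in range(5)]

print(f"Population mean: {population_mean:.1f}")
for i, m in enumerate(sample_means, start=1):
    print(f"Sample {i} mean:   {m:.1f}")
```

Each sample gives a slightly different average, and each is only an estimate of the school average. In practice we see just one sample and never the population mean, which is why a conclusion about the school carries a chance factor.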

We saw that a sample is a subgroup of a population and a sample should be a good representative of the population. Thank you.