Categorical Data Description

In the last lecture, you were introduced to the two important branches of statistics: descriptive statistics and inferential statistics. You also learned the difference between a sample and a population, though we restricted that introduction to what is required for this particular course; in advanced courses you will learn more about samples and populations. Then we went on to identify and understand what data is.

Again, for the purpose of this course we are restricting ourselves to structured data. Structured data comes in the form of a table where the variables are recorded in columns and the observations, or cases, are represented by the rows of the data set. We also looked at the types of data: broadly, you classify data as categorical or numerical.

We also understood the difference between cross-sectional data and time series data. And finally, you should by this time know the measurement scales. By that I mean that when you have categorical data, you should be able to tell whether it is a nominal variable or an ordinal variable, and when it comes to numerical data, whether it is an interval variable or a ratio variable. This is what you should know at this point. Moving forward, we are going to understand how to describe categorical data.

In this module, we first start with describing categorical data for a single variable, and then we will look at measures of association when I have more than one variable. So today we are going to understand how to describe categorical data, and we start with what is called a frequency distribution.

Now the definition of a frequency distribution: it is a listing of the distinct values and their frequencies. What do we mean by frequency?

Frequency is nothing but the count. And by distinct values we mean the distinct values that the categorical variable actually takes. Each row of a frequency table lists a category along with the number of cases, or count of cases, in it.

When I say number of cases, I mean how many observations fall in that particular category. So, for example, let us construct a frequency table for this simple data. The categories are just letters of the alphabet; I have 4 categories, which I can call A, B, C and D.

So, how do I construct a frequency distribution? First, list the distinct values of the observations. Here, what are the distinct values of my observations?

In the first example, the distinct values are A, B, C and D. I list them and write them under the heading category. This is the first thing I do.

That is my step 1: list the distinct values, and this is my first column. So, you can see that in the first column I have listed the distinct values A, B, C and D. Then, for each observation, place a tally mark in the second column.

So, the first observation is A, and I place a tally mark against A. The second observation is again A, so I again place a tally mark. The third observation is B, the fourth is C, the fifth is again A, the sixth is D, the seventh is A, then I have a B, then a D, and then a C. These are the tally marks I end up with. So step 2 is to place a tally mark for each observation, and then you count the number of tallies.

So, if I count the tallies, the frequency or count of A is 4, the count of B is 2, the count of C is 2 and the count of D is 2. This is what I refer to as a frequency distribution, where the distinct values are given in column 1, the tally marks in column 2 and the counts in column 3. Now, let us look at this next example. Again I place tally marks, doing it quite a bit faster this time: A, A, B, C, A, D, A, B, D, C.

Now this A I cross out, because it is my fifth value. Whenever a category already has 4 tally marks, I draw the fifth as a stroke across them. Then I have B, C, D and A. So for A this 5 plus 1 gives me 6. This is a 3, this is a 3 and this is a 3. So I have a total of 15 observations in this case, with this distribution: category A occurs 6 times, B occurs 3 times, C occurs 3 times and D occurs 3 times.
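
The list, tally and count steps for the first two examples can be sketched in Python. This is only an illustration of the manual procedure; the function name `frequency_table` and the variable names are assumptions made for this sketch, not something used in the lecture.

```python
def frequency_table(observations):
    """Build a frequency table by tallying, mirroring the manual procedure."""
    # List the distinct values (sorted so the rows read A, B, C, D).
    categories = sorted(set(observations))
    # Place one tally mark per observation in its category's row.
    tallies = {category: "" for category in categories}
    for value in observations:
        tallies[value] += "|"
    # Count the tally marks in each row.
    return {category: len(marks) for category, marks in tallies.items()}


example_1 = list("AABCADABDC")        # the 10 observations of the first example
example_2 = list("AABCADABDCABCDA")   # the 15 observations of the second example

table_1 = frequency_table(example_1)
table_2 = frequency_table(example_2)
print(table_1)  # {'A': 4, 'B': 2, 'C': 2, 'D': 2}
print(table_2)  # {'A': 6, 'B': 3, 'C': 3, 'D': 3}
```

The string of `|` characters plays the role of the tally column; the count column is simply its length.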

So, in the first data set A had a tally of 4, B had 2, C had 2 and D had 2, with a total of 10: that was 4, 2, 2, 2. This one is 6, 3, 3, 3. Now let us look at this third example, again with 15 data points. I tally the observations one by one, and A appears 3 times, B appears 6 times, C appears 3 times and D appears 3 times. I again have a total of 15 observations. So, if you compare this example with the earlier one, you see the only difference between example 2 and example 3: in example 2, A appears 6 times and B, C and D appear 3 times each.

In example 3, it is flipped between A and B, with A appearing 3 times, B appearing 6 times and C and D appearing 3 times each. Both of them have 15 observations. Now, let us look at a final example with more observations. If I tally them: A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D. So A is 6, B is 3, C is 4 and D is 5.

So I have 18 observations here. And if you look back at the previous tally, there I had a 3, a 6, a 3 and a 3. This is what I have here.
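
As a quicker alternative to tallying by hand, the 18-observation example can be counted with `collections.Counter` from the Python standard library. This is a sketch for checking the tally, not part of the lecture; the variable names are assumptions.

```python
from collections import Counter

# The 18 observations of the final example, in the order they were tallied.
final_example = list("AABCADABDCABCDACDD")

counts = Counter(final_example)
for category in sorted(counts):
    print(category, counts[category])
print("Total", sum(counts.values()))
```

The four counts it reports should match the 6, 3, 4 and 5 obtained from the tally marks.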

So, you can see that we can construct different frequency distributions. How do I build a frequency table in Google Sheets? Let us look at how to do that.

Frequency table in Google Sheets. I take the same example as before. This is the Google Sheet I have, and I am going to add a new sheet here. In that sheet I first type the category name.

So I have the heading Category here. Below it I write down the data I have: A, A, B, C, A, D, A, B, D, C. You can see that this is the first example we looked at.

I have listed the data here. Now go to step 1: select, that is highlight, the cells that hold your data. Which cells do I have?

I just have these cells, so I highlight them; that is the first step. The second step: in the formatting bar, click on the Data option. So I go to the formatting bar and click on the Data option, and within it I go to the pivot table option. To repeat: first I highlight my data, then I go to the Data option, where I have the pivot table option. For this pivot table I specify the range; you can see that the range is A1 to A11, with cell A1 holding the name of the variable. Here I have just used the name Category. I could give this variable any name, a letter or simply group, anything, but here I am calling it Category and asking the sheet to create a pivot table.

Now, let us go to the final step. After creating the pivot table, you work in the pivot table editor. What is the pivot table editor? It is the panel that appears on the right-hand side. There you add rows.

In Rows, I add Category. What are the different categories? I have A, B, C and D. And in Values I add the count of each category, which comes out as 4, 2, 2, 2. One way to turn this into a plain table is to copy it and paste values.

I can label the columns Category and Frequency. And you can see that this is nothing but the table we created by hand. So, this is one way to create a frequency table in Google Sheets.
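
For readers working outside Google Sheets, the same pivot-style count can be sketched in Python. This is an analogue of the pivot table, not the Google Sheets feature itself: it assumes the column A1 to A11 has been exported as text with the header Category in the first line, and `pivot_count` and `sheet_text` are hypothetical names introduced for this sketch.

```python
import csv
import io
from collections import Counter

# Column A of the sheet: a header cell followed by the 10 observations (A1:A11).
sheet_text = "Category\n" + "\n".join("AABCADABDC") + "\n"


def pivot_count(text, column):
    """Count how often each distinct value appears in the named column."""
    rows = csv.DictReader(io.StringIO(text))
    return Counter(row[column] for row in rows)


pivot = pivot_count(sheet_text, "Category")
for category in sorted(pivot):
    print(category, pivot[category])
print("Grand Total", sum(pivot.values()))
```

Choosing the column by its header name mirrors adding that variable under Rows in the pivot table editor.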

So, once you have your frequency table, you can see it is precisely what we got for the first example: 4, 2, 2, 2 with a total of 10. That is what Google Sheets gives: A with frequency 4, B with frequency 2, C with frequency 2 and D with frequency 2, with a grand total of 10. This is the first frequency table we have created in Google Sheets. I can create the same kind of table for any given data. If you recall, this is the hospital blood group data we discussed in our earlier class. Suppose I want to know the distribution of a particular categorical variable here. There are two categorical variables I can look at: one is gender and the other is blood group.

Take blood group; this is the categorical variable I want. So I go back to my pivot table steps. What do I do?

Remember, first I select the cells, then I click on the Data option, go to the pivot table option, and then go to the pivot table editor. We are going to do the same thing here. I select this data, click on the Data option, click on the pivot table option, and create a new sheet. Once I have the new sheet, I go to the pivot table editor and add the rows. Now you can see the names of the variables.

Here is blood group. That is the variable whose frequency distribution I want. So I click on blood group here, then I go to Values and add the values, and you can see that this is the frequency distribution of blood group.

I can copy this, paste the values, and label the columns: this is the blood group and this is the frequency. This is the frequency table of blood group that I get in Google Sheets. If you look at the sum of all of these, it is 30, and that is precisely because I have 30 observations.

So the blood group frequencies total 30 people. This is how we construct frequency tables, both manually, from first principles using tally marks, and using Google Sheets. A frequency table gives the count of each category of a categorical variable. There is another quantity that is very useful, and that is called relative frequency.

What relative frequency captures is the ratio of the frequency to the total number of observations. We have already constructed a frequency table: we saw that there are 4 As, 2 Bs, 2 Cs and 2 Ds.

Take the ratio: 4 is the frequency of A and the total number of observations is 10, so 4 by 10, which is 0.4, is the relative frequency of A in this table. So for the relative frequency distribution, I just divide each frequency by the total number of observations, and I get 0.4, 0.2, 0.2 and 0.2; the relative frequencies should always add up to 1. So for the first data set this was 0.4, 0.2, 0.2 and 0.2, adding up to 1. Now look at the second data set: this is going to be 6 by 15, and each of these is going to be 3 by 15. You can see that the three 3s give 9 by 15, and with the 6 that is 15 by 15, which is 1. Again you can see that this is 0.4, this is 0.2, this is 0.2 and this is 0.2.
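
Written out as formulas, the arithmetic above looks as follows. The symbols f_i (the frequency of category i) and n (the total number of observations) are notation assumed here for convenience; the lecture itself does not name them.

```latex
\[
\text{relative frequency of category } i \;=\; \frac{f_i}{n},
\qquad n = \sum_i f_i,
\qquad \sum_i \frac{f_i}{n} = 1 .
\]
\[
\text{First example } (n = 10):\quad
\tfrac{4}{10} + \tfrac{2}{10} + \tfrac{2}{10} + \tfrac{2}{10}
= 0.4 + 0.2 + 0.2 + 0.2 = 1 .
\]
\[
\text{Second example } (n = 15):\quad
\tfrac{6}{15} + \tfrac{3}{15} + \tfrac{3}{15} + \tfrac{3}{15}
= 0.4 + 0.2 + 0.2 + 0.2 = 1 .
\]
```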

Now what I want you to see is that the frequencies of these two distributions are different. Here the frequencies were 4, 2, 2, 2; here they were 6, 3, 3, 3. Whereas when you look at the relative frequencies, you find that the relative frequency of each of the categories A, B, C and D in this data set is the same as in the other data set. So why do we need relative frequency? As I have demonstrated here, even though the counts differ between these two data sets, the relative frequencies are the same.

Here I had 10 observations in total and here I had 15, but both have the same relative frequencies even though the frequencies are different. So what relative frequency helps us do is compare two data sets. And because a relative frequency always lies between 0 and 1, it is a good standard for comparison. Hence, we always prefer to have a relative frequency table. How do we create a relative frequency table in Google Sheets?
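
The comparison between the 10-observation and 15-observation data sets can be checked with a short Python sketch. It uses exact fractions from the standard library so that 4 by 10 and 6 by 15 compare as equal; the function name `relative_frequencies` and the variable names are assumptions made for this illustration.

```python
from fractions import Fraction


def relative_frequencies(frequencies):
    """Divide each count by the total number of observations."""
    total = sum(frequencies.values())
    return {category: Fraction(count, total)
            for category, count in frequencies.items()}


first = {"A": 4, "B": 2, "C": 2, "D": 2}    # 10 observations
second = {"A": 6, "B": 3, "C": 3, "D": 3}   # 15 observations

rel_first = relative_frequencies(first)
rel_second = relative_frequencies(second)

print(first == second)          # False: the counts differ
print(rel_first == rel_second)  # True: the proportions agree
print(float(rel_first["A"]))    # 0.4
print(sum(rel_first.values()))  # 1
```

Exact fractions are used instead of floating-point division only to avoid rounding noise when testing for equality.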

In Google Sheets, we already have a frequency table. I create a relative frequency column next to it. And I know that relative frequency is nothing but the frequency divided by the total.

That is how I define it. I can just drag this down, and you can see that the sum of these values adds up to 1, with each entry giving me the relative frequency. This is for the blood group. I can do the same thing for the other pivot table.

There the relative frequencies are just 0.4, 0.2, 0.2, 0.2, and the sum of all relative frequencies always adds up to 1. Sorry, this column is not blood group, it is category. So for the first example I have the frequency and the relative frequency listed along with the category variable. I have 2 more examples.

In the earlier example where A and B were flipped, A is going to be 3 by 15, which is 1 by 5, so I have 0.2; B is 6 by 15, which is 2 by 5, which is 0.4; and I again have 0.2 and 0.2. So you can see that they add up to 1. The last one I leave as an exercise, but you can see that it is going to be 6 by 18, 3 by 18, 4 by 18 and 5 by 18. The numerators add up: 5 plus 4 is 9, 6 plus 3 is 9, and 9 plus 9 is 18, so this again adds up to 1. So in summary, what we have learned in this portion is how to construct a frequency table, what the notion of relative frequency is, and how to construct a relative frequency table.
