Transcript for:
Understanding Skewness and Data Visualization

Okay, then I think let's start. You are only four students, so why not sit at the front? Why are you sitting at the back? Come forward. What are you saying, your leg? You are not a disabled person, so you are giving a reason without any solid basis; you are just saying that you cannot move. Anyhow, it is only 40 to 50 minutes, not more than that; it is a one-hour class. Maybe your muscles are frozen, or jammed, you can say. That is not an issue: if you move, everything will be okay. Okay, good.

This is the topic. Previously we saw what correlation is, and I think we have practiced all of those four or five different things. Now the topic is the skewness of the data: reviewing the skew of the attribute distributions. Skew literally means that your data is positioned, or aligned, on one side, either the left side or the right side. You can see skew best through visualization, but here we are in the part about understanding things through numbers, through stats. That is why we are now going to look at the skewness of the attributes (the variables, the column headings): how the data of those attributes is distributed in the data set, and whether it is aligned to the left or to the right. Why? Because normally our data should follow the normal curve, the Gaussian distribution. If our data set is like that, then it is good. But sometimes the data is skewed, like here, or here, or here. I don't want to talk about that now, because this is
visualization; I will talk about it in the visualization part, when this topic comes up again. Here we want to see the skewness through numbers. The first thing is: if there is skewness in your data, then you need to correct it. Why? Because we want our model to be in good, accurate shape. We have many machine learning algorithms, and many of them assume by default that the data follows a normal (Gaussian) distribution, the bell curve. As I explained, bell-curve data looks like this; it is not the exact shape, but this is the idea. The other name for the normal curve is the Gaussian distribution. So these algorithms normally assume that our data is distributed according to the Gaussian distribution, and if it is not, that is a kind of problem, and you have to remove that problem.

This skewness can be positive, negative, or zero. Zero is here. Sorry, this line is not straight; now it is. So this is zero, and the curve can lean to the left side or the right side, and then you call it positive or negative. If the skewness value is zero, there is no skewness. If it is close to zero, it means there is some skewness, but a small amount. And if it is far from the...
far from zero, then the amount of skewness is large. So here is the code. (When you come into the class late, please don't say salaam, just go and sit, because when you talk it creates a disturbance. Even salaam is not needed.) Okay, so this is a short piece of code to see the skewness of your data in numbers. It starts the same way as before: from pandas import read_csv. This is the path of the data set we are going to consider, here we are creating the headings for it, and then we are reading that file and storing it in this data object. Then print(data.skew()). That is all you have to write: data is the name of the object holding your data set, then skew, and you use it inside print.

This is the output, and here you can see the value for every variable. This attribute has the value 0.90, which is a little far from zero, so it is a positive skew. This one, plas, is 0.17, so it is positive but close to zero. As a rule of thumb, if the value is more than 0.5 you can say it is a positive skew, and if it is below minus 0.5 it is a negative skew. Here is another attribute with the value minus 1.84, which is definitely far from zero on the negative side, so it is a negative skew. This one is 0.11, close to zero, so it is bearable; the data in this column is okay. But this one, test, has 2.27, more than 0.5, so it is a positive skew. Similarly for the others. And the dtype here is just giving you the data type of these values, nothing else. So this is called skewness. And now we are moving towards the visualization part of data understanding.
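The skewness check described above can be sketched roughly as follows. This is a minimal sketch rather than the lecture's exact program: the data set path is not given in the transcript, so a tiny made-up DataFrame stands in for the file, and the column names and the `describe_skew` helper are hypothetical. The 0.5 cut-off is the rule of thumb mentioned in the lecture.

```python
import pandas as pd

# Made-up stand-in for the data set. In the lecture the frame comes from
# read_csv(path, names=headings), but the path is not given in the transcript.
data = pd.DataFrame({
    "right_tail": [1, 1, 1, 2, 2, 3, 10],     # long tail towards high values
    "symmetric":  [1, 2, 3, 4, 5, 6, 7],      # mirror image around the middle
    "left_tail":  [1, 8, 9, 9, 10, 10, 10],   # long tail towards low values
})

skew = data.skew()   # one skewness value per column
print(skew)


def describe_skew(value, threshold=0.5):
    """Label a skewness value using the 0.5 rule of thumb (hypothetical helper)."""
    if value > threshold:
        return "positive skew"
    if value < -threshold:
        return "negative skew"
    return "close to zero"


labels = skew.apply(describe_skew)
print(labels)
```

With these made-up columns, the long right tail should come out as a positive skew, the mirrored column as zero, and the long left tail as a negative skew.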
How can we understand the data through visualization? Through stats we have other things as well; the seven things you have seen are just a sample, and we can do many more. Some people understand data better through numbers, and some through visualization. Visualization helps us see what the situation of our data is and what the relationships, or correlations, are between the different column headings. Features, attributes, column headings, variables: these names all mean the same thing.

Data visualization techniques are divided into two categories: univariate plots and multivariate plots. A plot is anything you draw in visual form, like a chart or a graph. Univariate means you are plotting one attribute, one variable; multivariate means you are plotting more than one variable or feature together. Univariate plotting has three further categories: histograms, density plots, and box plots. Multivariate has correlation matrix plots and scatter matrix plots. We will see these one by one, starting with the univariate plots. Uni means single, and variate comes from the word variable. So it is a visualization of a single variable, or you can say it is about understanding each attribute independently, without any relation to the other attributes.
So this is a technique through which we can understand a single attribute independently, without correlating it with any other variable; that is why we are in the univariate category. There are some techniques available in Python for this, and the first one is the histogram. What does a histogram do? It shows you the data in bins. Each column you see in this visualization is a bin. Bin is the more sophisticated word; otherwise you can say column, although bins are not always drawn as columns. So the histogram groups the data into bins and shows them to you as bars. It is also the fastest way to get an idea of the distribution of each attribute, of how your data is distributed in the data set; visualization is faster than numbers.

What are the characteristics of a histogram? It gives us counts. For example, in this data set we have the pregnancy attribute. The value on the axis is not the count. Suppose a bin sits at six: that bin covers the women who have been pregnant six times, and the height of its bar tells you how many such women there are. So the histogram gives us the count of observations in each bin. Through the histogram we can also see the distribution of the data: whether it is in Gaussian form (this is the Gaussian form, the normal curve), skewed, or exponential. This one is positively skewed, and the exponential one is also skewed. So this is the normal distribution. (My pen will not work here.) This is right, or positively, skewed, and this is negatively skewed. And if you have something like this, it is called a uniform distribution.
Every value has about the same count. This is again a normal curve. And this one? Here you have two normal curves, which means the distribution is bimodal. Maybe your data set comes from two different sources. A symmetric bimodal distribution means you have two peaks that are similar in shape, which is evidence that your data contains something repeated or was taken from two different things. Sometimes the two peaks have the same shape and sometimes they have different shapes; the second case is a non-symmetric bimodal distribution. This is not good. For our purposes only the normal distribution is a good thing; the other shapes are not. Anyhow, this is the purpose of the histogram, and why we normally look at histograms of our data.

How can you do that? Again we will use the pandas DataFrame, and for plotting we will use another library along with pandas, called matplotlib. So we write from matplotlib import pyplot; we are importing pyplot from it. The pandas part you already know: from pandas import read_csv. Here is the Pima Indians Diabetes data set: creating the headings, reading the file, and putting it in the object data. Once the data is in this object, you just have to write data.hist(), hist for histogram. Then the histogram is ready, but it is not displayed yet. How do you display it? pyplot.show(). pyplot is a module available in the matplotlib library. When you call pyplot.show(), it displays something like this for the Pima Indians Diabetes data set. So have a look here. (Now I cannot use the pen.) What is this? This is skewness.
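The histogram program described here can be sketched as below. It is only a sketch under stated assumptions: the file path is not given in the transcript, so synthetic random data stands in for the Pima Indians Diabetes file, and the heading `pres` is assumed because the lecture never names that column explicitly. Only `data.hist()` and `pyplot.show()` are the calls the lecture actually describes.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot

# Column headings as used in the lecture; "pres" is an assumption.
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# Synthetic stand-in data; the lecture reads the real file with
# read_csv(path, names=names) instead.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 9)), columns=names)
data["test"] = rng.exponential(scale=80.0, size=200)   # deliberately right skewed
data["class"] = rng.integers(0, 2, size=200)           # only the values 0 and 1

axes = data.hist()   # one histogram per column, arranged in a grid
pyplot.show()        # nothing is displayed until this call
```

With nine columns, pandas arranges the histograms in a 3 by 3 grid, which is the layout discussed in the walk-through of the output.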
So this one is right skewed, or positively skewed. You can also call it exponential, falling quickly from high to low. This one is for class, and you know that the class variable has only the values zero and one, which is why there are only two bars. This one is for mass. It is not exactly a normal curve, but the peak is in the middle and the shape is like a bell. So here I can say that age does not have an appropriate distribution, but for mass the distribution is almost okay, because it is close to the Gaussian distribution, the normal curve. Here again there is skewness; the distribution is not okay. This one is again like a normal curve, with the curve a little to the right of the middle, but you can say the values are better than in those others. And here again the values are disturbed; it is not even exponential, but it is a kind of skew. This one is again close to a normal curve with the data a little to the right-hand side, so you can say it is okay; this data is not that much disturbed. This one is disturbed, and this one is disturbed. So through visualization you can understand things like that.

And it is mentioned here: age, pedi and test have an exponential distribution. Which one is age? This is age, this is pedi, and this is test. We are saying that these are disturbed due to that skewness. And mass and plas have a Gaussian distribution. Let's see: this is mass and this is plas, so that is the Gaussian distribution. You could also say that these three are almost Gaussian. This is not something final; it is just an idea. Now, density plots.
By the way, this histogram program is available here; if you go to the histogram file and run it, this is the output. (Sorry, the name is histo.) You can make the window bigger or smaller as you need. So this is how that four-line code gives you the output. Okay, now density plots. A density plot is again like a histogram, but where the histogram is a kind of filled diagram, here a line or a curve is drawn. We can also call them by another name, abstracted histograms. And this is the code for that. Again, from matplotlib import pyplot, the same as in the previous program. pandas is the same, the path is the same, the headings are the same, and here we are reading the Pima Indians Diabetes data set into the data object.

Then we call data.plot. The first argument is kind: what kind of plot do you want? So kind is equal to, within single quotes, density. You can also write hist here, and it will give you the histogram, and there are eight or nine other plot kinds whose names you can write there. You can copy this command and paste it in the browser; it will take you to the pandas documentation, where you can see how many different diagrams you can draw through this plot method: bar charts, line charts, and so on. So kind is density, and then subplots equals True, which means draw a separate subplot for each attribute: one plot for attribute 1, one for attribute 2, one for attribute 3, and so on. Then layout is (3, 3), so you get 3 rows and 3 columns. And sharex is equal to False.
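Putting the arguments just described together, the density plot call might look like the sketch below. The data is synthetic and the heading `pres` is assumed, since the transcript gives neither the file path nor that column name. `sharey=False` is the second of the two sharing options. Note that pandas relies on SciPy for this plot kind, so SciPy needs to be installed.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot

# Column headings as used in the lecture; "pres" is an assumption.
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# Synthetic stand-in for the data set read with read_csv in the lecture.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(200, 9)), columns=names)
data["age"] = 21 + rng.exponential(scale=10.0, size=200)   # right-skewed column

# kind="density" draws a smoothed curve instead of filled bars.
axes = data.plot(
    kind="density",
    subplots=True,     # one subplot per column
    layout=(3, 3),     # 3 rows, 3 columns
    sharex=False,      # each subplot keeps its own x-axis scale
    sharey=False,      # and its own y-axis scale
)
pyplot.show()
```

Changing only `kind` (for example to `"hist"` or `"box"`) reuses the same grid of subplots for a different plot type.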
There are two options: sharex is equal to False and sharey is equal to False. It means that you are not sharing the x-axis and y-axis values between the attributes; each subplot gets its own scale. Normally you set these to False, and then it will show you something like this. From this density plot you can also understand the same kinds of things. Here your data has a skew; the shape is skewed. This shape is like a normal curve. Here it looks like a normal curve, but it is not really one; the data is disturbed. Here you have skew. This is like a normal curve. Here you have skew, and here. And this one is not normal: you have two bell curves. So through this visualization you can get an idea of which attributes have an acceptable distribution and which do not.

After that come box and whisker plots. Through this technique we can also see the distribution of the data. That is the general thing you can say about all of these techniques, and then each technique has its own specialty. With this one we can summarize the distribution of each attribute. It will show you a box and whisker
plot for every attribute, for every column heading. What does it do? It draws a line at the middle value, that is, the median. So this is the box, and this center line shows the median of that attribute. Then it draws a box around the 25th and 75th percentiles. This lower edge is the lower quartile, Q1: 25 percent of the values lie below it. Then you have the upper quartile, which we call Q3: 75 percent of the values lie below it. So this is the box, and these lines are called the whiskers. The whiskers give you an idea of how much your data is spread, because this one shows the minimum value of your data and this one shows the maximum value, not counting the outliers, and the box is in between. This line is 25 percent, this line is 75 percent, and the median is, you can say, 50 percent.

Then you may have dots outside the whiskers, some here and some here. What are these dots called? Outliers, or anomalies: values that are unusual, rare events. The term is outliers. And the range covered by the box, from Q1 to Q3, we call the interquartile range, IQR. Anyhow, leave that and move on to the code. matplotlib is the same, pandas is the same, the path is the same, the headings are the same, the reading is the same. Then data.plot, and this time kind is equal to box, not density. If you want box and whisker plots, you write box. subplots True, layout (3, 3), sharex and sharey False: the same as before. The only change is kind equals box. And when you run it, you will
see something like this. It is telling you that the attribute is preg, the pregnancy data, and the values run from 0 to about 15. This is the box, this is the minimum whisker, this is the maximum, and this green line gives you the median, which is maybe three. So the median for these women is about three pregnancies. The 25 percent line is maybe at one, and the 75 percent line reaches five. And then you have these dots. These dots mean you have some exceptional cases with almost 15 or 16 pregnancies. So the data runs from 0 up to a maximum of about 15, but most of the values, the normal ones, lie between about 1 and 9.

Now look at plas. The scale goes from 0 to 200. The median is a little after 100, maybe 120. The box starts, at 25 percent, at maybe 80 and goes up to maybe 140, and the spread is maybe from 40 to 200. No issue: the data is almost in the middle of the scale. If your scale were from 0 to 200 and your whole box and whiskers sat between 0 and 50 with nothing above, then you could say your data only has low values relative to the scale. Here the scale is from 0 to 100; your data is mostly between, you can say, 60 and 70, and there are some outliers. Again your data is in the middle of the scale from 0 to 100. Now look at this one; it is something different.
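The quantities read off the box plots so far (Q1, median, Q3, IQR, and the dots outside the whiskers) can also be computed directly. The sketch below uses made-up numbers, not the real data set, and the column names are hypothetical. The 1.5 × IQR rule for deciding which points are drawn as separate dots is matplotlib's default, which the lecture does not mention.

```python
import pandas as pd
from matplotlib import pyplot

# Made-up values with one obvious extreme entry; not from the real data set.
data = pd.DataFrame({
    "example": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "flat":    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],   # two-valued, like class
})

col = data["example"]
q1 = col.quantile(0.25)    # lower edge of the box
median = col.median()      # line inside the box
q3 = col.quantile(0.75)    # upper edge of the box
iqr = q3 - q1              # interquartile range

# matplotlib's default: points more than 1.5 * IQR beyond the box are
# drawn as separate dots instead of being covered by the whiskers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = col[(col < low_fence) | (col > high_fence)].tolist()
print(q1, median, q3, iqr, outliers)

# The same plot call as in the lecture, with kind="box".
data.plot(kind="box", subplots=True, layout=(1, 2), sharex=False, sharey=False)
pyplot.show()
```

In this made-up column the box sits at the low end of the scale and the single value 100 is flagged as an outlier, which is the same pattern as an attribute whose data is aligned towards the lower values.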
Here the range of the data is 0 to 100, with one outlier at 100, and most of the data is between, you can say, 0 and 25. So your data is aligned towards the lower values, and some balancing is needed. Similarly here, from 0 to 800, most of the data sits at the lower values and then you have outliers; here balancing is needed too. Here, from 0 to 60, your data is in the middle, with some outliers; no need for balancing. Here the values are from 0 to 2, not 0 to 200, so there is no issue, because you only have values up to two or three, not more than that. Here, from 20 to 60, your data is at the beginning of the scale, so you can say a little bit of balancing is needed. And here, class, you have only two values, zero and one; that is why the box and the whiskers sit at the same positions, the minimum whisker, the maximum whisker and the box, with no separate median line, because this column only has the values 0 or 1.

Now let's get the overall idea. age, test and skin are skewed towards the smaller values. Which ones are age, test and skin? Let's look again. This is age: yes, it is aligned towards the lower values, because the range goes up to almost 100 and most of the data is in the lower part. test is very clear: 0 to 800, and most of the data is maybe between 0 and 100. And this is skin, which I already talked about: 0 to 100, with the data at the lower values. So these are skewed, or you can simply say they are aligned towards the lower values, and here some balancing, some correction, is needed.

Now the second category is multivariate plots: how we can plot multiple variables, multiple attributes, together.
If we are going to consider multiple attributes, it means we are now going to show the relationships between them, and there are some techniques for that. As we saw in the earlier list, multivariate plots have two techniques, correlation matrix plots and scatter matrix plots, and here we are going to the first one, the correlation matrix plot. This is the technique for showing the correlations between multiple variables. How do we show them? We will use the corr function, which is available in pandas, and we will use pyplot. Through the correlation matrix we can identify which variables have a strong correlation with another variable and which have a low or weak correlation. So let's see that. This is the source code. The pyplot import is the same,
pandas is the same, and we are importing numpy, another library. If it is not available in your Python you have to install it, like you installed pandas yesterday: you write pip install matplotlib and pip install numpy, and then they will be installed. This is the path of the data set and these are the headings; we are reading the file and putting it in data. Then here we calculate data.corr(). The data set is in data, so data.corr() calculates the correlations, and they are stored in this variable, correlations.

Now you have to plot them, and this whole remaining part is for plotting. First we use pyplot, from the matplotlib library, and we say pyplot.figure(). So we are setting up a figure, a space for the drawing, and that space is held in the variable fig. You can say this area is now allocated. Then fig.add_subplot(111): add a subplot to this figure with these values. What do they mean? The first 1 means one row, the second 1 means one column, and the last 1 means the first subplot. So the figure is divided into one row and one column, and we take the first subplot. Normally, when a figure holds several subplots, you say so here; for example, you might state that you will have
12 subplots in that figure. Here we are not doing that, so we fix one row, one column and the first subplot, and we call this area ax. Then we use ax.matshow. matshow is a method that takes your correlations, with vmin, the minimum value, set to minus 1 and vmax, the maximum value, set to 1, and draws the matrix using those values. It does not appear on the screen yet; the result is stored in cax. In the next statement, fig.colorbar, we pass this cax in. The color bar is the bar displayed on the right-hand side of the figure. We are attaching the color bar to cax, so it runs from the minimum value, minus one, to the maximum value, one, because of those vmin and vmax values.

Now we focus on this part, where we use numpy to lay out the cells. numpy is the library, and in it a method called arange is available. We give it the values 0, 9, 1. What do they mean? Start from 0, stop at 9, and the step is 1. This is not a list you are writing; it is a method call. It starts creating values from 0, then 1, 2, 3, 4, 5, 6, 7, 8, and it stops before 9. So from 0 to 8 there are 9 values.
It goes 0, 1, 2, 3, 4, 5, 6, 7, 8 and stops at 9, because 0 is the start and 9 is the stop. You don't have to go into the design of this method; you just have to know how to use it: a starting value, an ending value that is not included, and a step, which here is an increment of one. So it creates an array of nine values. Stop is not included; I have put that in a comment in the code for your understanding. That array is stored in ticks. ticks means these tick marks you are seeing on the axes. Do you see them?

Now, ax is the area you already fixed, the first row and first column of the figure. On this object you call ax.set_xticks(ticks), so the x-axis is set up with those ticks and they will be displayed in the diagram; if you count them, there are nine. Then ax.set_yticks(ticks) does the same on the y-axis, nine positions. You can think of it as nine cells across and nine cells down. Then ax.set_xticklabels(names): you are passing the headings in as labels on the x-axis, and similarly ax.set_yticklabels(names) sets them on the y-axis. And then pyplot.show(). Now you are showing the plot.
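The plotting steps walked through above can be assembled into one short program. This is a sketch under assumptions rather than the lecture's exact file: the data set path is not given, so synthetic data is used, the heading `pres` is assumed, and a positive link between `plas` and `skin` is built in on purpose so that at least one off-diagonal cell stands out.

```python
import numpy
import pandas as pd
from matplotlib import pyplot

# Column headings as used in the lecture; "pres" is an assumption.
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# Synthetic stand-in for the data set read with read_csv in the lecture.
rng = numpy.random.default_rng(2)
data = pd.DataFrame(rng.normal(size=(200, 9)), columns=names)
data["skin"] = 0.5 * data["plas"] + rng.normal(size=200)   # built-in positive link

correlations = data.corr()                        # 9 x 9 correlation table

fig = pyplot.figure()                             # empty figure
ax = fig.add_subplot(111)                         # 1 row, 1 column, first subplot
cax = ax.matshow(correlations, vmin=-1, vmax=1)   # colors span -1 to 1
fig.colorbar(cax)                                 # color scale on the right

ticks = numpy.arange(0, 9, 1)                     # 0..8, stop value not included
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.show()
```

Setting `vmin=-1` and `vmax=1` pins the color scale to the full possible range of a correlation, so the same color means the same strength from one data set to the next.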
The correlations are already calculated and given to matshow, with their negative and positive values, and you are displaying the color bar for those values. So pyplot.show(), and it shows you this. Here is pregnancy against pregnancy: an attribute compared with itself always has correlation 1. When the value is 1 it is a high positive correlation, when it is 0 there is no correlation, and when it is minus 1 it is a high negative correlation. That is the information the color bar is giving you: blue means negative, yellow means positive, and in between, around zero, it is a mixture. So this diagonal is the high correlation. These two cells are a little bit yellow, which means they have some positive correlation, and similarly these yellowish ones. And then the green ones are mostly also on the positive side, though closer to zero. We do not have the fully blue color anywhere in this diagram. So in short, I can say that mostly I have positive correlations between the attributes. And I can ask: what is the relation between plas and skin? This cell gives me that idea. What is the correlation between mass and class? This cell gives me that idea. This is what you can understand through visualization. And the next part we will see, inshallah, in the next lecture.
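For reference, the number behind any single cell of the correlation matrix can also be looked up directly instead of being judged by color. The sketch below uses made-up columns `a`, `b`, `c` and `d` (hypothetical names, not from the data set) that are constructed to be positively related, negatively related, and unrelated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
x = rng.normal(size=300)

# Made-up columns with known relationships to "a".
data = pd.DataFrame({
    "a": x,
    "b": 2.0 * x + rng.normal(scale=0.5, size=300),   # rises with a
    "c": -x + rng.normal(scale=0.5, size=300),        # falls as a rises
    "d": rng.normal(size=300),                        # unrelated to a
})

correlations = data.corr()
print(correlations.round(2))          # the whole table as numbers
print(correlations.loc["a", "b"])     # one cell: a against b
```

In the plot, the cell for `a` and `b` would appear towards the yellow end of the color bar, the cell for `a` and `c` towards the blue end, and the cell for `a` and `d` near the middle.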