Transcript for:
Distribution Overview and Types

Okay, in this notebook we're going to discuss distributions, which is why the title is Exploring Distributions. As always, you can see here that we have a summary of what we're going to do in this notebook and also a summary of what we did in the previous notebook. You guys can take a look at this on your own, so we can get started right away with the distributions.

First of all, what are distributions? A distribution basically describes how the data is spread, or dispersed, and what that dispersion looks like in terms of the shape of the frequencies. So we're going to take a look at how the numbers are, like the name says, distributed along the possible values within the data. This plays a crucial role in different aspects of data analysis, so in general it's an important topic for data analysis, data science, machine learning, and so on.

It's also going to allow us to explore the idea of statistical inference, which refers to using samples in order to reason about the parameters of the population. In real life, most of the time it's super hard to get the data for the whole population, because it's either too big or it costs too much to obtain. Therefore you have to rely on small samples in order to make an educated guess of where the population parameters may lie. That's the brief idea of statistical inference, and that's something the distributions help you to do once you learn them.

So, first thing, as always, we're going to import the libraries. These are your usual libraries. I already have them here for you, so you don't have to write them, even though you should know what they do, what the usual alias is, and so on. You just have to run the cell. Actually, this is my notebook, so, as I suggested before,
make a copy and work on the copy, so that when you do new stuff you can actually save it. I'm just going to run this cell in the copy of the notebook and wait for it to finish. Okay, so now it's done.

In here I have an example. We still haven't used a data set in this notebook up to this point, but something you can do, and that you always have to keep in mind, is that if you have a small data set, like a handful of numbers, and you want to work with those, you can write them as a list, save them in a variable, and use that as regular data. You don't have to work with a CSV or Excel file. You can make your own data, or even create synthetic data, which is data you create on your own, either following some guidelines from the real world or just random data in general to play with.

In this case we have a list of weights, and what we're doing here is a histogram. In the previous notebook we discussed how to build histograms, first using Matplotlib and then seaborn. If we run this, we see that we have a histogram, and something you may notice right away is that it looks like a reflection. There's a high bar in the middle, so that's the part where we have the most frequency. Remember, on the y-axis of the histogram we have counts, or frequencies, however you want to call it, and this highest bar is the one that is the most frequent. But if we draw a line down the middle, we see that the bars on both sides are basically symmetrical. It's like you have a mirror in the middle and one side is the reflection of the other. This is just an example of something that is kind of normal. We can't say that it's normal, because of the tails.
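The idea of typing a small data set into a list and drawing its histogram can be sketched roughly as follows. The weights below are made-up values chosen to be symmetric around 51 kg, not the notebook's actual list, and the bin count, labels, and output file name are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical weights in kilograms, typed in by hand (not the notebook's data)
weights = [44, 46, 47, 48, 49, 49, 50, 50, 50, 51, 51,
           51, 51, 52, 52, 52, 53, 53, 54, 55, 56, 58]

# plt.hist splits the values into ranges (bins) and counts how many fall in each one
counts, bin_edges, _ = plt.hist(weights, bins=7, edgecolor="black")
plt.xlabel("Weight (kg)")
plt.ylabel("Count")
plt.title("Histogram of weights")
plt.savefig("weights_hist.png")  # in a notebook the figure would simply display

print(counts)     # frequency in each bin
print(bin_edges)  # boundaries of the bins
```

The tallest bar sits in the middle bin and the counts fall off on both sides, which is the mirror-like shape described above.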
But this is what we mean when we talk about distributions: we want to see how these points and their frequencies are distributed along the possible values. Here there's a little bit of interpretation showing that we have values falling between 44 and 58 and that they're distributed symmetrically around 51 kilograms, which is the middle. Just some insights from the data. Like I said, this is just the idea of a distribution, and we're going to explore different types of distributions.

Okay, so why are distributions important? First of all, summarization and understanding: distributions provide a structured way to describe data sets. Then model assumptions and statistical inference: if we want to do statistical inference to find some information about the population parameters, we have to rely on distributions for some of the tests that we're going to discuss. Depending on the test, you might say, "Assuming the data is normally distributed, we're going to approach it this way." Depending on how the data is distributed, you're going to have different approaches, so it's important to understand the distributions.

Distributions have different shapes, and here we have a few shapes that are usual, with their names. For example, this shape right here, which is similar to the one we had before, is what we call symmetric. If you put a mirror in the middle, both sides are basically the same. If it's perfectly symmetric they will be exactly the same, and if it's only closely symmetric with this kind of shape, that's when people say it's approximately normal. That's the idea. Sometimes you're going to see that people call it a bell curve, but this is basically what we consider normal.

Then we have uniform, where all the values have the same frequency, or, if you're using relative frequency, the same percentage of occurrence. That's considered uniform. Okay, we
have bimodal, which means that you're going to have two mountains. It's bimodal because you're going to have two modes, two parts where the highest frequencies are concentrated. There can also be more modes if you have more peaks. It's usual to talk about bimodal, but keep in mind that you can have multiple modes.

Then we have skew left and skew right. When we say skew left, we refer to the side where the longest tail is. If you notice, we have a longer tail on the left side, and the majority of the data is towards the right side. That's why it's skew left: because the longest tail is on the left side. Similarly, it's skew right because the longest tail is on the right side.

For these three right here, depending on the skewness of the data, the mean, the mode, and the median are going to be located differently. For example, if it's symmetric, all three fall in the middle. They should all be equal if it is perfectly symmetrically distributed. Now, if it's skew left, the mode is always going to be where you have the highest frequency, because that's the idea of the mode. So the mode is here at the peak, but since we have the longest tail on the left side, those values are going to pull the mean and the median away from the mode, and the mean is typically going to follow the longer tail. So in this case, the order of appearance will be first the mean, then the median, and then the mode, meaning that the mean is going to be lower than the median and the median is going to be lower than the mode. For skew right it's the opposite: the mode is going to be at the highest peak, but then the mode is going to be less than the median and the median is going to be less than the mean. Keep in mind that the mean is very sensitive to
outliers, and these longer tails are basically unusual values. You could consider them outliers, and that's the reason why the mean is being pulled towards those values: because, again, it's sensitive. Okay, so those are examples of the shapes of distributions.

Here we're going to introduce a new data set. This data set is about cars, and we're going to use it to take a look at a few things related to distributions. There's a definition for each column. In this case I'm not asking you guys to practice anything with this one. You just have to run the cells, but of course I expect you to be a little bit curious and explore the data. So take advantage of the fact that you have a data set here that you maybe haven't seen, and start checking: what is the shape, like how many rows and how many columns? What else can you check? We already have all the columns here, but let's see if there's something that isn't showing up, or maybe you want to see the names only, so here you can pull the names. Then you can pick one of these, like the brand, and take a look at that column. You can say, "Okay, let me see what the unique brands in this column are." We can do this and see the unique brands included, and so on.

There are several things you can do. If you want to see the frequency of each one, you can use the function value_counts and find out what the frequency is for each one. We see that Ford appears 1,235 times, and so on. These are functions that we have mentioned before in the other notebooks, and you should use them to get an idea of what you're dealing with in terms of your data set. For now I'm just going to keep this as unique, but you can leave it however you want in your notebook. Okay, so now if we want to
take a look into our data, we can do kind of the same thing. We can do a histogram, but let's focus on the price. If we do a histogram of the price, we see that, if you remember the shapes, this is kind of skew right: we have a long tail on the right side. Remember the way you interpret histograms: on the bottom you have whatever your column is about, in this case price, and on the y-axis you have the counts, or frequencies. What the histogram does is split all the possible numbers within your data into ranges and find the frequency within each range. That's why we have bars, and the bars can always be changed if you want to split the data into more classes. There are a lot of alterations that can be done here, and you can take a look at those.

Here we see that there are a few cars in the 80,000 range. Maybe you can consider those outliers, or maybe not and they're just expensive cars, depending on the situation. These are unusual values that will actually pull the mean towards the right side. The most common price range should be these two bars right here. You can consider that the most common, or you can say from zero to 20,000 or 21,000, something around that area. Remember that we're doing a simple histogram without working on making it look nicer, but you can put edge colors on each bar so it would be easier to read the numbers. You can also change the ticks of the plot and make sure it shows numbers, I don't know, every 5,000. You could do that as well.

So yes, the histogram allows you to get a lot of information, especially about how the data is distributed. We see here that the majority of the data is concentrated within the 0 to 30,000 range, or 20,000 if you
want to keep it to only the first three bars. We see a high concentration here, and then it starts falling off going towards the higher prices.

Okay, so for distributions we have two types: discrete distributions and continuous distributions. We're going to define both, and we're going to explore a few of these distributions together.

First of all, we're going to define discrete distributions. Discrete distributions use countable values, meaning that you're using whole numbers to define the possible outcomes of the distribution. For example, on a die you have one, two, three, four, five, six. That would be considered discrete. Whenever you have outcomes that are countable numbers, that's discrete.

As examples of this type of distribution, we have the uniform distribution, which is when we have an equal probability for all outcomes, for example rolling a fair die. We have Bernoulli, which is when you have two possible outcomes, usually success and failure, and you're also going to see them represented as 0 and 1. So two possible outcomes: that's Bernoulli. We also have binomial, which is a combination of multiple Bernoullis. What binomial gives you is the number of successes in n trials, for example the number of defective items in a batch. Assuming you know the probability that one item is defective, you can find the probability of each number of defective items among the n items in the batch. Then we also have geometric, which is the number of trials until the first success. It's different from binomial: in binomial the n is fixed, while in geometric it could be any number, until you find your first success. For example, flipping a coin until heads appears is an example of a geometric distribution. We also have Poisson, which is something that
maybe you saw in your statistics class, or maybe you didn't. I don't think we're going to use it as much in this class, but just so you know, it's a very important model. For those interested in data science, it's important to know it because it allows you to model several real-life scenarios. This distribution is especially useful for counting events occurring in a fixed interval, for example the number of calls to a call center per hour. In those cases Poisson is a good distribution to use.

We're going to check a few of them. Here we're just going to have an example of how to use Python to get values from these distributions and visualize them. If I remember correctly, I didn't go through all of the examples that I show, because the ones that we're going to use we'll discuss as we go. This is more like an introduction only, so when we get to normal, we're going to talk a little bit more about how to use the normal, the t distribution, chi-square, and so on.

Just for you to have an idea: for a discrete uniform distribution we can use randint from scipy.stats. This library, scipy.stats, has a lot of good things for you to use in statistics, like several important distributions, several important tests, and so on. So we're going to import randint, which will allow us to do a discrete uniform distribution. In here, what I'm doing is specifying a range, because if we want a discrete distribution that, for example, mimics the probabilities for a die, we have to specify what the possible outcomes are. This part is just creating the numbers from 1 to 6, and we're doing it dynamically because maybe you have another scenario that goes from one to a thousand, counting one by one. You can do it in an easier way by doing it like
this. Here we're just using a function from the NumPy package called arange, and this function gives you counting numbers between a start and an end. The only thing you have to keep in mind is that the function stops one short of the number you provide for the end. That's why we wrote seven but only see up to six: the last one is not included.

We then pass those values to the randint distribution, and we only have to make sure that we use pmf. PMF is the probability mass function, which is just like the relative frequency at each possible outcome. That's what this function returns. So the only thing you have to do is call randint.pmf, pass the values you're interested in for the outcomes, and specify a start and an end for the range in order to calculate the probability for each value. If we do this... oh, sorry. Don't be like me: you have to run this in order. I skipped the import and went straight to showing you the numbers, but yes, you have to run the library import first, then the numbers, and now we can run this.

Here we can see the six possible outcomes with the probability for each one. This probability is just the theoretical probability for rolling a die. If you think about it, what is the probability of getting a one if you have six choices? It will be 1/6, and 1/6 is around this value, about 0.167. All of them have the same probability, because this is mimicking what is, at least theoretically speaking, the probability of each outcome when rolling a die. There are a few comments about the uniform that you can take a look at, but this is how you can do a discrete uniform distribution. Now, something that you did a lot in your intro to statistics class
is the binomial distribution. To do a binomial distribution here in Python, you can just import binom from the scipy.stats module, and here we can do a plot for this binomial distribution. The first thing is generating values. This right here will just generate the possible outcomes for the binomial case, which go from zero up to n, so we write n plus one as the end. The reason you do n plus one is that the function leaves the last number out, so if we want to go up to, I don't know, 10, we have to write 11 instead. But we're just creating the possible outcomes.

In here you just have to change these parameters if you want to play along. For example, right here we have 5 and 0.25. If you run this, you're going to see the probability for each number of successes: if you have five trials, what is the probability that you have exactly zero successes, if each of these trials has a 25% chance of success? In that case you get a probability for each of them, exactly.

This is kind of the idea when, for example, you do a quiz and you want to guess randomly. Say you have 10 questions and your probability of guessing one question is one out of four, so it's 0.25. I'm going to keep it like that. This is how you can calculate those probabilities. Again, if you have 10 questions with four choices where only one is correct, and you run this, it's going to tell you the probability of randomly guessing and getting each number of successes. For example, what is the probability that I get four questions right if I start guessing in this quiz with 10 questions? Well, you have around a 15% probability of guessing exactly four correct in a quiz with 10 questions, and a very low probability of guessing exactly eight, nine, or ten. Okay, when we say that this is "exactly", you
can also do "less than" or "more than". You're just going to have to add those probabilities: add the ones to the left, add the ones to the right, or add a few of these, depending on the situation. Later on we're also going to discuss some other functions that will let you get those probabilities in an easier way, but for now this is how you can get the exact probabilities for this distribution.

Okay, we also have continuous distributions. Before, we had countable numbers, but now we're going to have a range of numbers, and it's continuous because you have infinitely many choices in terms of the numbers you can use. For example, what are the possible outcomes between zero and one if you have a continuous function? Well, infinitely many, because you can always create a new number. If you have 0.5, you can then say 0.05, and you can add another zero and say 0.005, and you can do that an infinite number of times. So between zero and one there's an infinite amount of numbers. That's why in continuous distributions we define our possible outcomes as a range, an interval, rather than listing the outcomes. That's the definition: it's over an interval of values.

Then we also have different distributions for the continuous case. We still have a version of the uniform for the continuous case, in which all values within the interval are equally likely. We have the normal distribution, which is a continuous case. That's why you see that the normal curve includes everything under the curve: because it's continuous, you can use any value in the domain of that function. We have the Student's t distribution, which is usually used for small sample size analysis and also for cases where you don't know the population standard deviation. That's a distribution that we're going to talk
about in future notebooks as well, so you don't have to worry much about it for now. We also have the exponential distribution, which is the time between events in a Poisson process. This is the continuous counterpart of Poisson, and a nice example is the waiting time until a customer arrives. We also have some other distributions called gamma and beta, and a few others that I didn't mention, that are useful for Bayesian statistics, reliability analysis, and some other cases. They're very important if you're into data science or actuarial science, or any field where knowing these probabilities helps you model real-life scenarios. You should take the extra step and research them, but we're going to keep it simple for the purposes of this class, so we're just going to go through a few of them.

This is the continuous uniform case. Again, we're still using the same module, but instead of importing randint, which is for the discrete case, we're going to import uniform. The idea now is basically the same: in order to graph this distribution, we have to create values within the range that we want. But since we cannot create an infinite amount of values, we're going to rely on creating as many as we can within the range, and we're going to make them evenly spaced. If you have 10 numbers, you might space each one by one, so the distance between any two neighboring numbers is the same across the whole range. For that you use a function in NumPy called linspace, which will automatically create numbers that are evenly spaced. For the uniform function, you're
only going to have to pass the values that you're going to use, which we created with the linspace function as evenly spaced numbers. Something you can do if you don't understand this part is take it into a separate cell, run it, and check the numbers. Maybe 100 is too many values to check, so you can do 10 or five and see how the numbers are evenly spaced. After that you just pass those numbers and describe the range using the location and the scale, where the scale should be b minus a to represent the width of the distribution.

So if you... I keep doing this in this video. Okay, so if you forget to run your own library import, notice that I do the same thing sometimes. It's a recurring issue in this notebook. It doesn't usually happen to me, but this is the third time I've done it in this notebook. Don't forget to run the import. If you get an error, maybe you forgot to run some cell, and remember that notebooks run in sequence, so that's why that happens.

Once we have loaded the library, we can run this, and you see that what happens now is that we have a continuous range where all the numbers from 0 to 10 are equally likely. The location, a, is where you start, and the scale is just the range, basically: you subtract the first number from the last number, and that's what you pass as the scale to give the width of this uniform distribution. Okay, so that's what a continuous uniform distribution is.

Then we have the normal distribution, which is just the bell shape that we discussed before. For that we again need equally spaced numbers, but in this case we want some numbers on the negative side and some numbers on the positive side. And for this function right here, which will allow you to
get the density for the normal distribution, we need to pass the location, which in this case is going to be the mean. I'm sorry: the location, loc, is going to be the mean, and the scale is going to be the standard deviation. We're going to use a mean of zero and a standard deviation of one, which, if some of you remember, is called the standard normal curve. We're also going to discuss that later on, but that's just a standard normal distribution.

If we run this... I think I remember now why I keep making this error. Usually I put the library import in the same cell as the example and never separate it, but for some reason in this notebook I put the import in a separate cell. But yeah, okay, so this is the graph for the normal distribution. I created a thousand values. I think even if you do 100 there shouldn't be much of a difference. Right, still not much of a difference, as long as it's not too low. Let me see. Right, so if you use a very small number of equally spaced points, it doesn't look normal, because you're trying to draw the curve with just 10 points, and this type of plot, which is called a line plot, is just going to connect the points with straight lines. So, to play it safe, keep it at a thousand and you're going to see a very smooth bell shape, with a mean of zero, because, remember, it's symmetric, so the mean is exactly in the middle, at the highest point, and a standard deviation of one.

I don't know if you guys remember the empirical rule, about the percentage of data contained within a certain number of standard deviations of the mean, but this one is going to match that in terms of the amount of data contained. Okay, good. So that's basically the idea of using the function norm.pdf from scipy.stats to create normal distributions. Okay, practical implications. You need to
check for certain types of distribution in order to use some tests. For example, the t-test assumes normality, and logistic regression assumes a binomial distribution. So depending on the test that you're going to use, you have to check whether the assumptions are met. You have to make sure that your data follows that type of distribution, and therefore it's very important to know about all these different distributions. Later on we're going to have tests to check which distribution our data follows. If we believe it's normal, we have to test whether it's normal or not. This is just an example of those tests, but we're going to explore that in future classes, so you don't have to worry much about this. I don't think I'm going to put anything in the quiz about this until we get to the point where we actually start discussing it. If we run this, we see that, for example, this test right here lets us know, based on a p-value, that the data is not normally distributed. We also run a t-test to look for a difference from the population mean, and we see a significant difference. So there are a few tests that we can run as long as we assess the distribution.

Okay, here there is a problem for you to try, but maybe I forgot to leave it empty for you guys right here. Oh no, you guys should have it empty. Make sure you give it a try: check the distribution of the mileage in the cars_df data set. Pause the video and give it a try. Once you're done, you should have something like this, which is basically right skew. We can see that we have a long tail here, so the shape should go something along this line, and we see that there are some outliers. This is very far out. And if you're confused, this axis is on a large scale, because you see this 1e6. For those that remember, that's just scientific notation, which means that you multiply by ten to the sixth, so you add six zeros to the right side. So these are numbers that
have been shrunk down to scientific notation, but this looks very far away from the rest of the data, so these could be considered outliers. So you can interpret this after you do your histogram, and that's how a histogram is used.

Now, why are we discussing all these different ways to plot distributions, and the actual theoretical distributions? First, because you can use these distributions to generate synthetic data. For example, if you want to generate synthetic data for weights, you could do so using a normal distribution, and there are other real-life scenarios where you can use some of these distributions to generate synthetic data. For rolling a die you can use a discrete uniform distribution, as we saw before: you can start rolling and see what happens based on the probabilities.

In this case, what we're going to do is take a look at the distribution of the car prices, and on top of that distribution we're going to plot a normal distribution with the same mean and standard deviation as the car prices and see how close they look. There's some work that has been done here to match the scales of the two distributions. You don't have to worry about understanding all of that part. This is just for you to get an idea of why it's useful to understand the theoretical distributions and compare them to our data, to see whether the data is close to what we claim it is. For example, I might think that the prices are normally distributed. If we run this, our data is the blue one, and we're plotting this blue line, which is basically a representation of what the shape looks like based on the bars. The red line is the theoretical distribution based on the mean and the standard deviation of this data. And, I mean, not clearly, but visually, we can draw some
conclusions: these bars look like they are not following the bell shape. First of all, we're missing a tail on the left. It's being cut short, so it's not centered. We don't have that symmetry, because we have a curve here that is very pronounced while this one is slightly smoother, and so on. Visually speaking we can get an idea, but later on we're going to discuss actual tests that let us say whether this is normal or not at some level. Right now it depends on the person: some of you might say, "Oh, but that's not very far," and others will say, "Oh no, that's actually very far from the red line." So, to remove those subjective judgments, we're going to have tests that let us know whether this distribution is what we believe it is.

Okay, there's some interpretation here that you can read. It's basically what I just discussed, plus a summary of the notebook. So again, always make sure that you explore the notebooks and practice, and that you're curious about what this notebook is showing and how it works. There are parts that you shouldn't worry about, like this part. Don't worry too much about it, since it's not coming on the quizzes. Let me see the other part. This graph is just for illustration. If I remember correctly, it's not something that you have to remember for the quiz. But you need to understand the different functions that you can use for the distributions. So I'd say review the notebook carefully in terms of learning what functions you can use, what the distributions are, how they're useful, what skew left, skew right, symmetric, and bimodal mean, and so on. Okay, if you have any questions, send me a message on Slack, or an email if it's something more serious, and I'll see you in the next notebook.
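As a review aid, the four SciPy calls mentioned in this notebook (randint.pmf, binom.pmf, uniform.pdf, norm.pdf) can be sketched together roughly as follows. This is a minimal sketch, not the notebook's own cells: the variable names and the -4 to 4 range for the normal curve are assumptions, while the die, the 10-question quiz with p = 0.25, and the 0 to 10 uniform range follow the examples discussed in the video.

```python
import numpy as np
from scipy.stats import randint, binom, uniform, norm

# Discrete uniform: a fair six-sided die. Both np.arange and randint
# leave out the upper end, so 7 is written in order to include 6.
die_faces = np.arange(1, 7)
die_pmf = randint.pmf(die_faces, 1, 7)

# Binomial: 10 quiz questions, each guessed correctly with probability 0.25.
n, p = 10, 0.25
successes = np.arange(0, n + 1)   # possible outcomes 0, 1, ..., n
quiz_pmf = binom.pmf(successes, n, p)

# Continuous uniform on [0, 10]: loc is the start a, scale is the width b - a.
a, b = 0, 10
x_unif = np.linspace(a, b, 100)
unif_pdf = uniform.pdf(x_unif, loc=a, scale=b - a)

# Standard normal: loc is the mean, scale is the standard deviation.
x_norm = np.linspace(-4, 4, 1000)  # assumed plotting range
norm_pdf = norm.pdf(x_norm, loc=0, scale=1)

print(die_pmf)                  # one probability per face
print(round(quiz_pmf[4], 3))    # P(exactly 4 correct)
print(quiz_pmf[:5].sum())       # P(4 or fewer correct), by adding exact probabilities
```

Passing x_norm and norm_pdf to a line plot gives the smooth bell shape, and the last line shows the "add the exact probabilities" approach for a "less than or equal" question.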