Data Science Statistics Lecture Notes

Hello guys. What are we going to cover? Statistics, from basics to advanced, aimed specifically at positions like data scientist, data analyst, and roles working with business intelligence tools.

First we need to understand the basic difference between descriptive statistics and inferential statistics, because statistics for data science is divided into these two concepts.

In descriptive stats, the topics I want to mention are measures of central tendency and measures of dispersion. Anything related to summarizing the data belongs here, including the tools you are probably already using, such as histograms and box (whisker) plots. Subdividing further, we are going to understand histograms, the PDF and the CDF, and the techniques by which we create them. We will also cover probability and permutations, since probability is very important for data science; mean, median, and mode; and variance and standard deviation.

We are going to cover many distributions: the Gaussian distribution, the log-normal distribution, the binomial distribution, the Bernoulli distribution, the Pareto distribution (also called the power law distribution), and the standard normal distribution. Around the standard normal distribution we will be discussing techniques such as standardization and different kinds of
transformation, all with the help of Python. We'll discuss something called the Q-Q plot, and how to determine whether a distribution is a normal distribution or not.

Then there is something very important called inferential stats. Here our main focus is tests such as the z-test, t-test, ANOVA test, and chi-square test. There are multiple ways to perform a z-test, and I will show you these by executing them in Python; the t-test I'll also show you in Python. The ANOVA test is also called the F-test, and we'll discuss different kinds of ANOVA, such as factorial ANOVA.

The most important thing I almost forgot: hypothesis testing. How can I forget this? In hypothesis testing we'll cover how you determine your null hypothesis and alternate hypothesis, and we are specifically going to understand p-values. Another very important thing is confidence intervals. I'll also teach you how to read a z-table, which is a kind of sheet where you can directly look up the values; similarly there is a t-table and a chi-square table.

Let's start with the first topic, which anybody needs to understand: what is statistics? Whatever I'm discussing is very important in terms of interviews, and I'm going to teach it with interviews in mind, so that you will definitely be able
to understand and explain these things. So first, what exactly is statistics? Many people have different definitions, but I want to give a very simple one, which is from Wikipedia: statistics is the science of collecting, organizing, and analyzing data.

Given the amount of data that is getting generated now, you can understand directly how important stats is. You have tons and tons of data, and you can use it to improve your products and your business goals. So why are we doing this? For better decision making.

Before I define the types of statistics, there is one very important thing: data. Data is nothing but facts or pieces of information that can be measured. Let me give you some examples. Say I want to measure the IQ of the students of a class, and suppose the values I get lie between 0 and 100.
Suppose I'm getting 55, 75, 65. This is one example of data: something we can measure, here the IQ of a class. One more example is the ages of the students of a class: I may have ages like 30, 25, 24, 23, 27, 28. This is also data. Always remember that the most intrinsic meaning of data is that it can be measured.

Now, the types of statistics. The first type, as I said, is descriptive stats. It consists of organizing and summarizing data. That's it, very simple. The second type is inferential stats: a technique wherein we use the data that we have measured to form conclusions. The two important words there are data and conclusions.

Let's understand descriptive stats first with an example, and see what type of question may come up. Consider a classroom of maths students with around 20 people, and I want to look at the marks of the first semester. Say the marks, as percentages, are 84, 86, 78, 72, 75, 65, 80, 81, 92, 95, 96, 97. I have written 12 values here, but let's consider that we have around 20. A question could be: what are the average marks of the students in the class? So this may be a
perfect example of descriptive stats. Here I've asked about the average, but it can be anything: the mode, the standard deviation, and so on. You may also ask what percentage of the students in the class passed. More such examples will make sense when I talk about percentiles.

Now, what kind of question can you ask with respect to inferential stats? I told you the definition: it is a technique wherein we use the data that we have measured to form conclusions. So I may ask: are the average marks of the students of this classroom similar to the average marks of the maths classrooms in the college? (I said ages earlier; I mean marks here, although you can also take ages as an example.) Here the maths classrooms in the entire college are my population, and this one classroom is my sample. Say there are five different maths classrooms and I have taken the data of only one. The one classroom I took data from is called a sample, and all the classrooms together are my
entire population. Since we have touched on population and sample, it is time we really understand them; I'll come back to descriptive stats when we deep dive into the various topics.

What exactly is a population? Let's consider an example. I will give you a lot of examples, because if we learn statistics with examples in mind, we will be able to explain it to the interviewer in a very good way.

Let's take the example of elections. You may be talking about Goa, or about UP. Consider that the election has finished and we want to produce the exit poll. The press reporters cannot go and ask each and every person. Suppose the Goa population is this big: it is not possible for reporters to ask every person whom they voted for. Some people you will not find, some may be traveling, some may be busy with other things. So what the reporters do is take samples of the population from different regions (and there are different kinds of sampling techniques for this). They ask the people in those samples whom they voted for, and based on whom the maximum number of people voted for, they create their exit poll. In this case, what is my population data? It is the entire population of Goa. And these round circles that I
have drawn are my sample data. I hope everybody is clear with this. Earlier I said ages where I meant marks; sometimes while teaching, ages or marks may slip in, so don't let that confuse you.

Now, many people ask: Krish, you are using samples to solve a problem, so what are the different sampling techniques? Before I go ahead, you need to know some notation: the population size is denoted by capital N, and the sample size by small n.

The next question is: Krish, why have you selected the samples randomly? Is there a better way to do sampling, or do we always sample randomly? Sampling is done based on the scenario, and I will show you some examples. So let's go through some of the sampling techniques.

The first technique, which is used most of the time, is simple random sampling. Very simple, very important. Suppose this is my population. Simple random sampling is just going and picking some people, any way you want, as long as the choice is random. There is no confusion as
such; you just go and pick people at random. Simple random sampling is used in many scenarios; in an exit poll, for instance, you can probably use it. But suppose you want to test some kind of medicine. There you cannot use simple random sampling: you have to pick particular people, probably check their medical history, and select based on that. A small definition: when performing simple random sampling, every member of the population (N) has an equal chance of being selected for your sample (n).

The second type is stratified sampling. Stratified sampling is a technique where the population (N) is split into non-overlapping groups, called strata. Strata means layers, so stratified means layered. Let me give you one example: gender. Say I want to do a survey and I split the people into two groups, male and female, because the two groups may answer the survey differently, and then I sample from each group. Wherever you can form non-overlapping groups, you can do this. One more example: layering based on age. Ages 0 to 10 would be one age group, 10 to 20 would be one age
group, 20 to 40 another, and 40 to 100 another, with each boundary age assigned to exactly one group. So I can also sample based on age groups. This term is very important: non-overlapping. The groups should not overlap.

Can I do stratified sampling based on profession? Say the professions are different kinds of working people: one person is a .NET developer, one is a PHP developer, one is a data scientist working specifically in Python. You can say these are different strata, but in some scenarios they may overlap: a PHP person may know .NET, and a .NET person may know Python. If a person says, "no, I don't know .NET", then for that person there is no overlap. With doctors and engineers, for example, it clearly works. So in some cases we can do stratified sampling directly, and in others we have to apply extra conditions to make sure the groups do not overlap.

The third technique is systematic sampling. Here, from the population (N), we pick every nth individual. What does this mean? Say I'm standing outside a mall and I want to do a survey regarding COVID. Every eighth person that I see, I ask to take the survey. Every eighth person is just an example; I could take every second person that I
see, or every fifth, or every tenth person who passes in front of me; I just ask that person to take the survey. That is what systematic sampling is all about. There is no particular reason why you select the eighth or the ninth person. You simply decide on an interval, say every seventh person, and whoever comes up seventh, you catch them and ask them about the survey.

So when Thanos snapped his fingers, what kind of sampling technique do you think he used? Probably random sampling.

Now the fourth technique. You may see it called convenience sampling, or voluntary response sampling; I will call it convenience sampling. Strictly speaking, convenience sampling means picking whoever is easiest to reach, voluntary response sampling means people choose for themselves whether to respond, and deliberately picking people with domain knowledge is usually called purposive (judgment) sampling. The case I have in mind is this: only people who are domain experts participate in the survey. Suppose I am doing a survey related to data science. I consider only people who are interested in data science and have knowledge of it, because this is a specific topic that requires domain knowledge. Who takes the survey matters, because from surveys you extract information and make decisions. Many people also ask how you generate your data
set in the first place. In many companies, surveys are put in front of people and the responses are used as data for different purposes.

Again, to repeat what I am calling convenience sampling: I am doing a survey related to a specific topic, in this example data science. I will not go to people who have no knowledge of data science for that survey. So I collect my sample in a slightly different way, focusing on people who have knowledge of that specific topic.

People get confused between stratified sampling and this. In stratified sampling we divide the population into groups based on some property; here we are only considering people from a specific domain.

Now some examples. For an exit poll, what kind of sampling would be better? Here we would use random sampling. Next, the RBI (I hope everybody knows the RBI) does household surveys. What kind of sampling may they use? Most of the time random sampling is done, though you could argue for stratified random sampling. Suppose the household survey is trying to find out the cost of running a house. You could consider stratified sampling, or, if you don't want that, you could restrict the survey to one group, for example only women, as a convenience sample. Understand that the sampling technique depends completely on the use case that we are
following. And it is not that we depend on just one technique: we try different sampling techniques and finally come to a conclusion.

One more example: a drug needs to be tested. What kind of sample may we take? There are multiple cases here. First, for whom is this drug intended? Once I know that, I can do age groupings and sample accordingly. If the drug is for everyone, I may pick samples more broadly, but I would at least put a condition such as age greater than 15 years, because we cannot directly test a drug on kids. So it depends on the use case. Many questions may come up, like "Krish, why not this, why not that?" That is where we experiment, and in real-world data collection you will meet these scenarios a lot.

Now the next topic: variables. If you are a coder, you already know what a variable is, so I'll give you a definition that will feel familiar: a variable is a property that can take on any value. For example, height and weight are variables. Height can take values like 170, 172, 185, or 190 centimeters, or 182, 178, 168, 150, 160. Similarly, weight can take values like 78, 99, 100, 60, or 50, anything
that I want. So that is a simple definition of a variable, with a lot of examples.

There are two kinds of variables. The first kind is the quantitative variable. The second kind is the qualitative, or categorical, variable. I will divide these further, because they are very important.

First, the quantitative part. A quantitative variable can be measured numerically, and we can perform operations on it like add, subtract, multiply, and divide. Examples are age, weight, and height.

For qualitative or categorical variables, take gender as an example, with values male and female. Based on some characteristic we derive categories. On a categorical variable we cannot add, subtract, or do other arithmetic. Another example: say I take IQ and divide it into ranges, 0 to 10, 10 to 50, and 50 to 100. Values between 0 and 10 I may call low IQ, 10 to 50 medium IQ, and 50 to 100 good IQ. I'm just giving
this as an example: based on some characteristic I have classified IQ into multiple categories. Don't ask me, "Krish, how can you say more than 50 is good IQ?"; it is only an example. Blood group is another example: A positive, A negative, and so on. T-shirt size is another: XL, large, medium, small.

Now, quantitative variables are themselves of two kinds: discrete variables and continuous variables. A discrete variable takes whole-number values. For example, the number of bank accounts a person has: you can have two, three, four, or seven bank accounts, but you can't have 2.5 bank accounts. Another example is the number of children in a family: two, three, four, or five children, but not 2.5 or 3.5.

A continuous variable can take any value within a range. With height, a person can be 172.5 centimeters, 162 centimeters, or 163.5 centimeters. With weight, a person can be 100 kg, 99.5 kg, or 99.75 kg. The amount of rainfall, measured in inches, can be 1.1, 1.25, or 1.35 inches; all these
things are possible values. So that was continuous variables.

Now some questions for you. What kind of variable is gender? Marital status? River length? The population of a state? Song length? Gender is not discrete; it is a qualitative (categorical) variable. Marital status, same thing. River length is a continuous quantitative variable. The population of a state is discrete. Song length is continuous. Blood pressure is also continuous. What about a pin code: discrete or categorical? Don't worry; in later classes, when you get a data science problem statement with a pin code in the data set, you will see how to handle it.

Next topic: variable measurement, that is, how we measure variables. There are four types of measured variable: nominal, ordinal, interval, and ratio. Here too I'm going to give you a lot of examples. Why is it necessary to know these four types? Because your data sets will contain them: nominal data, ordinal data, interval data, and ratio data. Knowing which is which lets you do a good data analysis. So, if I talk about nominal
variables, or nominal data: this is categorical (qualitative) data, split into different classes with no order among them. Colour is one example, gender is another, type of flower is another. I've heard interviewers ask: what is the difference between ordinal and nominal data?

So let's discuss ordinal data. In ordinal data, the order of the values matters but the values themselves do not. Why do I say the value does not? Say I have five students with marks 100, 96, 57, 85, and 44. If I rank them, the highest marks get the first rank: 100 is first, 96 second, 85 third, 57 fourth, and 44 fifth. These ranks are my ordinal data. Here we focus on the order, not on the values: we are not worried about what marks a person got, only that he got the first rank. We also use different techniques to analyze such data, and we will see those scenarios when we look at data sets in the future.

Now, interval data. Here the order matters, the value also matters, and a natural zero is not present. What is this natural zero? Let me take an example. Let's say I have temperatures, and
let's consider Fahrenheit. I may have ranges like 70 to 80 Fahrenheit, 80 to 90 Fahrenheit, and 90 to 100 Fahrenheit. The order matters, and the differences between values are meaningful. But zero Fahrenheit is not a natural zero: it does not mean there is no temperature, so you cannot say 80 Fahrenheit is twice as hot as 40 Fahrenheit. That is interval data.

I may also have distance ranges: 10 to 20, 20 to 30, 30 to 40. You may have seen this in Ola. When you book a cab for, say, six hours, they tell you that you can go from 0 to 60, and beyond 60 you have to pay more. (Distance itself does have a natural zero, so take this only as an example of working with ranges.) Ratio data will be an assignment for you.

Let me take another topic: frequency distribution. This is important because later you will be learning about histograms. Say I have a sample data set with three types of flowers: rose, lily, and sunflower. The data set has many entries: rose, lily, sunflower, then again rose, then lily, then lily, and so on. To summarize this data set we can build a frequency distribution table, with the flower type and its frequency. For rose, how many do I have? 1, 2, 3, so 3 is the count of roses. For lily, what is the count? 1, 2, 3, 4.
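
If you want to try this tally yourself in Python, here is a minimal sketch, not something shown in the lecture. The `flowers` list is a made-up ordering, chosen only so that the counts come out as 3 roses, 4 lilies, and 2 sunflowers; the variable names are also just for illustration. It also keeps a running total of the counts, which is what is called the cumulative frequency.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical data set: the order is invented for illustration, chosen so
# the counts are 3 roses, 4 lilies, and 2 sunflowers.
flowers = [
    "rose", "lily", "sunflower",
    "rose", "lily", "lily",
    "rose", "lily", "sunflower",
]

# Count how many times each flower type appears.
counts = Counter(flowers)

# Lay the table out in a fixed category order.
categories = ["rose", "lily", "sunflower"]
frequencies = [counts[c] for c in categories]

# Running total of the frequencies (the cumulative frequency).
cumulative = list(accumulate(frequencies))

print(f"{'flower':<10} {'freq':>4} {'cum':>4}")
for name, freq, cum in zip(categories, frequencies, cumulative):
    print(f"{name:<10} {freq:>4} {cum:>4}")
```

The last value in the running-total column equals the number of entries in the list, which is a quick way to check the table.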
So 4 is the count. For sunflower, what is the count? 1 and 2. So this is the frequency of each value in this data set with respect to the different categories. This whole thing is a frequency distribution table, and from this table you can derive bar charts, pie charts and other things.

One more topic. You now know the frequency distribution, but there is also something called cumulative frequency. Initially I have rose with 3 flowers; then I add lily's 4 to it, so it becomes 7; then I add sunflower's 2, so it becomes 9. When we reach the last category, the cumulative frequency gives the total number of flowers present. The frequencies keep getting added, and that running total is the cumulative frequency.

Now what can we derive from this? There are bar graphs and pie charts, so let's draw one and see how it looks. For discrete variables we can draw a bar chart; if the variable is continuous, we draw a histogram. First, the bar graph. On the x-axis I have my flowers: rose, lily and sunflower. On the y-axis I have the frequency: one, two, three and so on. There are three roses, so I draw a bar of height three for rose. There are four lilies, so for lily, maybe in blue, I draw a bar of height four, and finally you will
be able to see sunflower: sunflower is only two, so I draw a bar of height two. This diagram is called a bar chart. Why do we use it? As I said, for summarizing the data; this is still part of descriptive statistics. Here the variable is discrete.

Now the next example, histograms. In a histogram, first of all, your data should be continuous, not discrete. Let's take age as an example. Suppose I have a data set of ages with values 10, 12, 14, 18, 24, 26, 30, 35, 36, 37, 40, 41, 42, 43, 50 and 51. These are continuous values, and if you want to represent them in a diagram for data analysis you can use a histogram. Understand one very important thing: in a histogram we make bins, which means we group the values into ranges. By default the bin setting is usually 10 (in many plotting libraries this default is the number of bins); here I use 10 as the bin width. On the y-axis I have the frequency. Now let's make the bins with a bin size of 10: 10, 20, 30, 40, 50, 60, 70, 80, 90.
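The binning step described here can be sketched in a few lines of Python. This is only an illustration, not the lecturer's code: the helper name `count_bins` is hypothetical, and the edge rule is an assumption (a value sitting exactly on an edge is counted in the lower bin, except the smallest edge, which belongs to the first bin). That rule is chosen because it matches the hand count in this lecture. Libraries such as NumPy assign edge values to the upper bin by default, so their counts can differ for values such as 30, 40 and 50.

```python
# Ages from the lecture example.
ages = [10, 12, 14, 18, 24, 26, 30, 35, 36, 37, 40, 41, 42, 43, 50, 51]

# Bin edges with a width of 10. Starting at 10 is an assumption: the
# lecture notes that the 0-10 range is empty and is not drawn.
edges = [10, 20, 30, 40, 50, 60]


def count_bins(values, edges):
    """Count values per bin (hypothetical helper).

    Assumed rule: each bin is (low, high], and the first bin also
    includes its own lower edge.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            low, high = edges[i], edges[i + 1]
            on_first_edge = i == 0 and v == low
            if on_first_edge or low < v <= high:
                counts[i] += 1
                break
    return counts


counts = count_bins(ages, edges)
for low, high, c in zip(edges, edges[1:], counts):
    print(f"{low}-{high}: {c}")
```

Each printed count is the height of one bar in the histogram; changing `edges` changes the bin width.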
You can change the bin value also. Between 0 and 10 I don't have any value, so I'm not going to draw a bar there. On the y-axis I mark the frequency count: 1, 2, 3, 4, 5. Between 10 and 20, how many values do I have? One, two, three, four, so I draw a bar of height four. Between 20 and 30 I have three values, counting 30 in this bin, so I draw the next bar. Between 30 and 40 I have three, no, sorry, I have four, because I am also counting 40. So I draw another bar. Between 40 and 50 I have one, two, three, four, so again four, and finally between 50 and 60 I have one. These bars together are called a histogram, and in a histogram your values are continuous.

Now one amazing thing. People ask what a PDF is, and I say a PDF is a smoothening of the histogram. If I smoothen this histogram, my PDF looks like a smooth curve drawn over the bars. You may be wondering, Krish, how is this created? There is something called a kernel density estimator. How it works we will see in the upcoming classes, as it is a little on the advanced side. But I hope everybody got an idea about histograms and bar graphs. So tell me, what is the difference between a bar graph and a histogram? A bar graph is used for discrete variables; a histogram is used for continuous variables. Now if somebody
asks you what exactly a probability density function is, you can say you are just smoothing the histogram.

Hello guys. Yesterday, if you remember, we discussed all the basic things. Today we will move from basic to intermediate stats for data science. There are many topics I am going to cover today: measure of central tendency, measure of dispersion, Gaussian distribution, z-score, standard normal distribution, and some more.

The first topic is the arithmetic mean for a population and for a sample. Mean simply means average. With population and sample we need to understand the formulas for the mean. The population size is denoted by capital N and the sample size by small n. Whenever we discuss the mean, remember that we are finding the average of a specific distribution. Let's say my data set is 1, 1, 2, 2, 3, 3, 4, 5, 5, 6. The population mean is denoted by the symbol mu (μ), and it is the summation of x_i for i = 1 to capital N, divided by N. What is x_i? Consider X as my random variable, with the values 1, 1, 2, 2, 3, 3, 4, 5, 5, 6 in my data set. To expand the formula we iterate through all N elements, so I write 1 + 1 + 2 + 2 + 3 + 3 + 4 + 5 + 5 + 6 divided by capital N. What is capital N here? 1 2 3 4 5 6 7 8 9 10.
So there are 10 elements, and the mean is 32 by 10, which is 3.2. So 3.2 is my average. That was with respect to the population. For a sample, we write x bar, the sample mean, as the summation of x_i for i = 1 to small n, divided by n. We get the same answer here because we are taking the same data set. So that was the arithmetic mean. Always pay attention to the notation; it is quite important. The reason I stress notation is that in the real-world industry, when you are explaining something to someone as a data scientist, you need to use the well-known notation. You can use your own notation if you like, but think of the larger picture: whatever standard is being followed, we should try to follow it.

Mean is part of the measure of central tendency. Apart from mean there are two more measures, so let me define central tendency. There are three main measures: mean, median and mode. This is a very important interview question: if someone asks you what central tendency, or measure of central tendency, is, you can say that it refers to the measure used to determine the center of the distribution of the data. So whenever I have data and I want to find the center of its distribution, I can use mean, median or mode. Why we use each one I'll talk about later, but I hope everybody is clear on the definition: it refers to the measure used to determine the center of the distribution of
the data. Average and mean are one and the same, guys; we use the same formula. So that was central tendency.

Now let's solve some problems. I have given you examples for the mean; now let's understand the median and why we use it. I'm going to take the same data set I used before: 1, 1, 2, 2, 3, 3, 4, 5, 5, 6. The mean we found was 3.2. Now what if I add one more element, 100, to this distribution? Then the mean becomes (32 + 100) divided by 11, where 32 is the sum of all the previous numbers, 100 is the new element, and 11 is the new count. 132 by 11 gives 12. Before 100 was added my mean was 3.2, but after adding 100 to the distribution my mean became 12.
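The effect of the single outlier on the mean can be checked with a short Python sketch. The variable names are illustrative; only the data values come from the example above.

```python
# Data set from the lecture example.
data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6]

# Mean = sum of all values divided by the number of values.
mean_before = sum(data) / len(data)                   # 32 / 10

# Add one extreme value and recompute.
with_outlier = data + [100]
mean_after = sum(with_outlier) / len(with_outlier)    # 132 / 11

print(mean_before)  # 3.2
print(mean_after)   # 12.0
```

One value out of eleven is enough to pull the mean from 3.2 to 12, which is the sensitivity the example is pointing at.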
Now here you can see that there is a huge movement in the mean, and it happened because of this one number. We consider this number an outlier. Outliers have an adverse impact on the entire distribution, which is why we should be very careful with outliers. In data science and in statistics we use different techniques to remove outliers, which I'll discuss today when we cover percentiles. So remember, outliers have a major impact: here the center of the data moved, and the difference is quite large. For this case we can use the median.

For the median I take the same numbers, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, and then 100. In median, the first step is always to sort the numbers. Here the numbers are already sorted; if your numbers are not sorted, you have to sort them first.

How do you define a distribution in statistical terms? Distribution means how your data looks, how your data is distributed. There are various ways to see it: we use PDFs, histograms and other techniques. I will come to the different distributions later; I have not started them yet.

So the first step is sorting the numbers. After sorting, I take the central element. Which one is that? Suppose I have an odd number of elements. Here, what is the count? 1 2 3 4 5 6 7 8 9 10 11.
So 11 is the count. To find the central element we count five from each side, one two three four five from the left and one two three four five from the right, and the element left in the middle is my central element. In this case I can say that my median is 3. Now understand, this holds even though the outlier was added. What is an outlier? An outlier is a number that is completely different from the rest of the distribution; here you can see that 100 is completely different from the rest.

Now your question may be: Krish, what if I have one more number, say 112? You told us to pick the central element, so which is the central element now? My total number of elements is now 12. To find the center I take the middle two elements, again counting five from each side. The middle elements are 3 and 4, and I take their average: (3 + 4) divided by 2. So in this case my median is 3.5, even though I had two outliers. But yes, if I keep increasing the number of such values, they stop being outliers and become a normal part of the distribution. Now understand why the median works better here: because of this outlier, my mean was 12.
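A minimal Python sketch of the median steps above: sort, then take the middle element, or the average of the two middle elements when the count is even. The function name `median` is chosen for illustration; Python's standard library also provides `statistics.median`, which follows the same rule.

```python
def median(values):
    """Return the middle value of the data after sorting."""
    ordered = sorted(values)          # step 1: always sort first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: single central element
        return ordered[mid]
    # even count: average the two central elements
    return (ordered[mid - 1] + ordered[mid]) / 2


one_outlier = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 100]
two_outliers = one_outlier + [112]

print(median(one_outlier))   # 3
print(median(two_outliers))  # 3.5
```

Compare these results with the mean of 12 for the first list: the extreme values barely move the median.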
Even after adding just one outlier my mean was 12. But here you can see that even though I added two outliers, my median, which is again a measure of central tendency, shows hardly any difference. That is the reason we use the median. To be clear, the measure in this example is the median, not the mode.

I hope everybody understood what the median is. We take the central element; if the number of elements is even, we take the central two elements and find their average. But understand the main purpose. Initially, without an outlier, the mean was 3.2. When I added an outlier, the mean became 12. When I did the same with the median, even with the outlier added it came to 3, and with two outliers I got a median of 3.5. So there is far less difference compared to the mean. The median works well with outliers; that is the proper statement I want you to remember.

Now mode, the third topic. Suppose I have a data set like 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, and even some outliers like 100 and 200. What should the mode be, which is again a measure of central tendency? In mode we find the most frequent element. We count and see which element occurs the maximum number of times. Here six has a count of three, more than any other value. So in this particular
case my mode should definitely be six, which is again a measure of central tendency. Now see, guys, there is one disadvantage. Suppose I have many outliers with the same value, like 100, 100, 100. Since we pick the most frequent element, the mode may end up being the outlier itself. So usually, when outliers are present, we use the median.

Where is mode used, then? Let's consider a data set with gender and salary; in gender you will find male, female and so on. Or let me change this data set and make it simpler. Mode can be used with both integer and categorical variables, but it works well with categorical variables. Let's say I have a data set with type of flower, petal length and petal width. You will see different flowers like rose, lily and sunflower, and let's say there is some missing data: this data set has come to me, and I see that 10 percent of the data is missing. To handle these missing values, which of mean, median and mode can we use? Don't you think I can use mode here? The missing value can be replaced with the most frequently occurring flower. The most frequent element is exactly what mode gives you, and this works well for a categorical variable.

Now let's take another example. Suppose I have a feature age, with values
like 25, 26, dash, dash, dash, 32, 34, 38. Let's say these are the ages of students. Should I apply mean, median or mode? In this case I would suggest going with the mean, because I know students' ages range from one value to another and won't extend far beyond that. So here domain knowledge also comes into play. If I said these were the ages of the entire population of the world, I would probably not go with the mean. So this is a good example of the kind of use cases we can think about.

Now let's discuss the next major topic, measure of dispersion. Here we discuss two main topics: variance and standard deviation. First, variance. How do we define variance? Variance is a measure of dispersion, and in an interview this can be a confusing question; they may ask it in different ways to confuse you. But when I say dispersion, dispersion means spread. Please remember this word: spread, meaning how spread out your data is around the mean. Let's say I have two data sets. The first is 1, 1, 2, 2, 4. What is the mean? 10 divided by 5 is 2. Now let's consider another distribution, 2, 2, 2, 2, 2. If I try to
find the average, it is also 2. So for both distributions I am getting the same mean. If the mean is the same, how do I identify that these two distributions are different? We need to think about it. An interviewer may try to confuse you by asking what variance is in dispersion. If you want to identify how two distributions differ, that is where we use variance and standard deviation.

Now let's understand the formulas for variance and standard deviation. Here too I will talk about two things, and one very important interview question will come up: population variance and sample variance. Why I teach everything in terms of population and sample will make sense. Population variance is denoted by sigma squared (σ²): the summation, for i = 1 to capital N, of (x_i minus mu) whole squared, divided by N. Sample variance is denoted by small s squared (s²): the summation, for i = 1 to small n, of (x_i minus x bar) whole squared, where x bar is the sample mean, divided by n minus 1. Many people will ask, why n minus 1? Yes, this is an interview question, and I will talk about it, don't worry.

Let's take one good example and solve it. Let's consider that my x values are 1, 2, 2, 3, 4, 5. This is my data set. First, with respect to the population, we calculate mu, which is 2.83. So here mu is two point
eight three, for every row. The next thing, from this equation, is to calculate x minus mu. Do the calculation yourself; it is good practice. For 1 I get minus 1.83; for each 2 I get minus 0.83; for 3 it becomes plus 0.17; for 4 it becomes 1.17; and for 5 it becomes 2.17. The next step is the squaring, (x minus mu) whole squared. For the first value I get 3.35; for each 2 I get 0.6889; and for the remaining ones 0.03, 1.37 and finally 4.71. Then we add these up, because the formula has a summation. The sum is about 10.84, and 10.84 divided by N, which is 1 2 3 4 5 6, gives 1.81.

Now understand one thing. Let's say I have one data set whose curve looks like this and another whose curve looks like this, the second one drawn wider. Comparing the two, where do you think the variance is more? Whenever variance comes to mind, it should be talking about spread. So in the second picture the variance will definitely be higher. Let's say the variance here is 1.81, and tomorrow I get 5.45; can I say that it belongs to the wider distribution? Yes. The variance is higher because the spread is high, and when we say the spread is high, it means the elements lie farther from the central region. More variance means the data is more dispersed. Let me talk
about this also, so that you can understand. Let's set the standard deviation formula aside for a moment and look at this image. What do you see? One curve has a standard deviation of 10 and the other a standard deviation of 50. The standard deviation is nothing but the square root of the variance. Here you can see that when the standard deviation is smaller, you get a tall, narrow curve, which means the data is not that widely distributed. When you have a big standard deviation like 50 or 60, your data is highly distributed. This is important for understanding why the variance is larger for dispersed data. Just understand this graphically for now; later I'll show you a worked example and we will solve it. But you have got the idea: if the variance is high, the dispersion is high, because the values are spread over a wider range.

Now, I got my variance as 1.81. My standard deviation is the square root of the variance, so it is the root of 1.81. If I open my calculator and take the root of 1.81, I get 1.345. Now see what the standard deviation means. What is the mean in this case? The mean is 2.83. Your data is distributed around this mean, because the mean is your measure of central tendency; it says where the center of that distribution is.
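The variance and standard deviation of the example data set can be reproduced with a short Python sketch. The variable names are illustrative. The sample variance with n minus 1 is included only for comparison, assuming the same six values were treated as a sample. Computed without intermediate rounding, the sum of squares is about 10.83 and the standard deviation comes out near 1.34; the 1.345 above comes from taking the root of the rounded 1.81.

```python
from math import sqrt

# Data set from the lecture example.
x = [1, 2, 2, 3, 4, 5]
n = len(x)

mu = sum(x) / n                              # population mean, about 2.83
squared_devs = [(xi - mu) ** 2 for xi in x]  # (x_i - mu)^2 for each value

pop_var = sum(squared_devs) / n              # divide by N, about 1.81
pop_std = sqrt(pop_var)                      # root of the variance, about 1.34

# Same sum of squares divided by n - 1 (sample variance), for comparison.
sample_var = sum(squared_devs) / (n - 1)     # about 2.17

print(round(mu, 2), round(pop_var, 2), round(pop_std, 2), round(sample_var, 2))
```

Dividing by n minus 1 instead of N always gives a slightly larger value for the same data, which is the difference between the two formulas stated earlier.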
So from here, if I go one standard deviation to the right, I get 2.83 plus 1.34, which is 4.17. That means whatever elements in this distribution lie between 2.83 and 4.17 fall within the first standard deviation to the right. If I do the same towards the left, one standard deviation to the left, I subtract 1.34, so it becomes 1.49. Any element that falls between 1.49 and 2.83 falls within one standard deviation to the left. Similarly for the second standard deviation: 4.17 plus 1.345 is 5.51, and you can calculate the rest in the same way. Here the standard deviation is a small number, and if I construct a graph it will look like a curve with its tip at the mean. This shape is called a bell curve. Based on the variance and standard deviation you can decide two important things: with the variance you understand how the data is spread, and with the standard deviation you understand what range of data falls within one standard deviation to the right and left of the mean. So the standard deviation is the square root of the variance, and it tells you how far an element can be from the mean. Consider the value 5. If you calculate, it falls somewhere here. So how are you going to represent
5? You will say that it falls about 1.6 standard deviations from the mean. This is the kind of definition you can give. So from the mean, how far a specific number lies is expressed using a unit called the standard deviation. And variance talks about spread: if the variance is high, the spread of the data is high.

Now let's understand percentiles and quartiles. This is the first step to finding outliers. How do we find an outlier? That is what we are going to discuss, and you can also do it with the help of code. Before understanding percentiles you need to understand percentage. Suppose I have a distribution 1, 2, 3, 4, 5, and my question is: what is the percentage of numbers that are odd? The formula is percentage = number of odd numbers divided by total number of values. How many numbers are odd? One, two, three, so 3. 3 divided by 5 is 0.6, which is 60 percent. Very simple; that is how we calculate a percentage, and I hope everybody knows this.

Now a very important topic, percentile. You have probably heard about percentiles if you have given the GATE, CAT, GMAT or SAT exam. One real-life example relates to my YouTube ranking on Social Blade. Here you can see the education rank; if I hover over here it shows 96.1 percentile, if I hover here it shows 94.98 percentile, and over here if I
hover it shows 94.958 percentile. So let's discuss percentiles. First, the definition: a percentile is a value below which a certain percentage of observations lie. If I say a number is the 25th percentile, it means 25 percent of the entire distribution is less than that value.

Let me take an example. Suppose I have a data set with the elements 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12. My question is: what is the percentile rank of 10? We solve this with a simple formula. Let x = 10. The percentile rank of x = (number of values below x divided by n) multiplied by 100, where small n is the sample size. What is n here? Counting 1 to 20, the sample size is 20. How many values are below x = 10? Counting 1 to 16, there are 16. So 16 divided by 20 multiplied by 100 gives 80. So the percentile rank of the value 10 is 80.
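The percentile rank formula above can be sketched in Python. The function name `percentile_rank` is hypothetical, and the sketch assumes the 20-value data set reads 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12, which is consistent with the counts quoted in the lecture (20 values in total, 16 of them below 10).

```python
# Assumed reading of the lecture's data set (20 values).
data = [2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12]


def percentile_rank(values, x):
    """Percentage of values that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


print(percentile_rank(data, 10))  # 80.0
```

Only values strictly less than x are counted, so repeated copies of x itself do not raise its rank under this formula.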
Now understand the main meaning of this: 80 percent of the entire distribution is less than 10. Please listen to this very carefully, because this is the real meaning you should take from it. Now quickly, what is the percentile rank of the value 11? 17 elements are below 11, so 17 / 20 × 100 gives the 85th percentile.

Let's do the reverse. From this distribution, what value exists at a percentile rank of 25? For this you use a very simple formula:

index position = (percentile / 100) × (n + 1)

I'm not going to derive why it is n + 1 here; for sample variance I'll discuss why it is n - 1. What matters right now is understanding what we are doing and how we use it. For the 25th percentile this is 25 / 100 × 21 = 5.25. It is very important to understand that 5.25 is the index position, not the value. Now I go and find position 5.25: first, second, third, fourth, fifth index, and 5.25 falls between the fifth and the sixth. There is no element between them, so we take the fifth and sixth elements and average them. In this case the answer is 5, so 5 is the value at the 25th percentile.

Try to find the 75th percentile. 75 / 100 × 21 = 15.75 is the index position. Count from the top to the 15th element: 15.75 falls between the 15th and 16th elements, so we average those two, and my answer is 9.
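The reverse lookup can be sketched as follows. This is a minimal sketch, not the lecture's code: `value_at_percentile` is a hypothetical helper name, the data list is the same assumed reconstruction of the 20-value data set, and a fractional position is resolved by averaging the two neighbouring elements as described above (many libraries interpolate linearly instead).

```python
import math


def value_at_percentile(sorted_values, p):
    """Value at percentile p using the (p / 100) * (n + 1) position rule."""
    n = len(sorted_values)
    pos = p / 100 * (n + 1)  # 1-based index position, may be fractional
    # Clamp the neighbouring positions so they stay inside the data
    lower = max(1, min(n, math.floor(pos)))
    upper = max(1, min(n, math.ceil(pos)))
    # Average the two neighbours; for a whole-number position they coincide
    return (sorted_values[lower - 1] + sorted_values[upper - 1]) / 2


# Assumed reconstruction of the lecture's 20-value data set (already sorted)
data = [2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12]

p25 = value_at_percentile(data, 25)  # position 5.25
p75 = value_at_percentile(data, 75)  # position 15.75
print(p25, p75)
```

The input must already be sorted, because the rule works on positions rather than on values.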
15.75 is the index position, and the value I get there is 9.

Now let's discuss a new topic called the five-number summary. The five numbers are the minimum, the first quartile (denoted Q1), the median, the third quartile (denoted Q3) and the maximum. We will use these values to remove outliers. So let's take one example and see how to remove an outlier with the help of the five-number summary, using something called the IQR.

Consider this data set: 1, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27. Which value do you think is the outlier? Obviously you will say 27. Whenever we need to remove outliers, we need to define a lower fence and a higher fence. The values we keep lie between the lower fence and the higher fence: every number above the higher fence is an outlier, and every number below the lower fence is also treated as an outlier. So we need both a higher limit and a lower limit. If this distribution had an element -50, would -50 be an outlier? Definitely yes. If you have -50 here, that is probably
below the lower fence, and it can be treated as an outlier. To define the fences we write two simple formulas:

lower fence = Q1 - 1.5 × IQR
upper fence = Q3 + 1.5 × IQR

What exactly is the IQR? IQR stands for interquartile range, and it is given by Q3 - Q1, where Q3 is the 75th percentile and Q1 is the 25th percentile.

Now check this distribution and find the 25th and 75th percentiles. Use the same formula: 25 / 100 × (n + 1). Counting the elements, n comes to 19, so n + 1 is 20, and 25 / 100 × 20 = 5. This 5 is the index position. The fifth element is 3, so the 25th percentile, Q1, is 3. Similarly, for Q3 you get index position 15, and the value there is 7, so Q3 is 7.

Now compute the interquartile range: 7 - 3 = 4. So we have computed Q1, Q3 and the IQR. Let's go ahead and compute the lower fence, which is Q1 - 1.5 × IQR. In this case Q1 is 3 and Q3 is 7.
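The quartile, IQR and fence formulas above can be put together in a short sketch. Several things here are assumptions and not from the lecture: the helper names are made up, and the data list has 19 values. The list as transcribed shows 18 values while the walkthrough counts n = 19, so one extra 2 is assumed so that positions 5 and 15 hold 3 and 7 as in the worked example. That extra value is a guess.

```python
import math


def value_at_position(sorted_values, p):
    """Value at percentile p using the (p / 100) * (n + 1) position rule."""
    n = len(sorted_values)
    pos = p / 100 * (n + 1)
    lower = max(1, min(n, math.floor(pos)))
    upper = max(1, min(n, math.ceil(pos)))
    return (sorted_values[lower - 1] + sorted_values[upper - 1]) / 2


def iqr_fences(sorted_values, k=1.5):
    """Return (q1, q3, iqr, lower_fence, upper_fence)."""
    q1 = value_at_position(sorted_values, 25)
    q3 = value_at_position(sorted_values, 75)
    iqr = q3 - q1
    return q1, q3, iqr, q1 - k * iqr, q3 + k * iqr


# ASSUMPTION: 19 values; the second of the three 2s is not in the transcript
data = [1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27]

q1, q3, iqr, lower_fence, upper_fence = iqr_fences(data)
outliers = [v for v in data if v < lower_fence or v > upper_fence]
kept = [v for v in data if lower_fence <= v <= upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

The multiplier 1.5 is exposed as the parameter `k` only for illustration; the lecture uses 1.5 throughout.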
So I write 3 - 1.5 × 4, where 4 is the IQR: 3 - 6 = -3. So the lower fence is -3. Now let's compute the higher fence, which is Q3 + 1.5 × IQR. Q3 is 7, so 7 + 6 = 13. So my lower-fence-to-higher-fence range is -3 to +13. Anything greater than 13 is considered an outlier and anything less than -3 is considered an outlier. So which number should we remove? We should remove 27, because 27 is greater than the higher fence of 13.

Now let me write the distribution once again after removing the outlier. The data was 1, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27, and 27 is removed because it is an outlier. What is the minimum value out of the remaining numbers? It is 1. Q1, the first quartile, we computed as 3. Q3 is 7. The maximum after removing the outlier is 9. Quickly compute the median and tell me: the median is 5. So here you have your five-number summary: 1, 3, 5, 7, 9.

From this data you can draw a plot called a box plot. How does a box plot get drawn? You have an x axis, and let's say this x axis has values like -2, 0, 2, 4, 6, 8, 10.
So this is your x axis. Now find where each element goes. The minimum falls at 1, Q1 falls at 3, the median is 5, Q3 is 7 and the max is 9. All you have to do is join these lines, and this is exactly your box plot. If I had kept 27 as an element, I would have to extend the axis that far and put 27 out there as a single dot. Have you seen this kind of plot? This value is the minimum, this is Q1, this is the median, this is Q3 and this is the max. This technique of removing outliers uses the lower fence, the higher fence and the IQR. It is also used extensively in data visualization, so you really need to know all these things. I have drawn it in front of you, but I can also do this with code: I just have to install a library, and with matplotlib you can do all these things.

Now let's come back to variance. The sample variance is

s² = Σ (x_i - x̄)² / (n - 1), with the sum running from i = 1 to n.

Why do we divide by n - 1? This is called Bessel's correction, and it is related to degrees of freedom. I have made a video in my stats playlist on why sample variance is divided by n - 1.
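The sample variance formula can be checked with a small sketch. The function name `sample_variance` and the five-value sample are chosen here for illustration; `statistics.variance` and `statistics.pvariance` are the standard-library functions for the n - 1 and n versions respectively.

```python
from statistics import pvariance, variance


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


sample = [1, 2, 3, 4, 5]  # illustrative values

s2 = sample_variance(sample)
print(s2)                 # divides by n - 1
print(variance(sample))   # standard library, also divides by n - 1
print(pvariance(sample))  # population version, divides by n
```

Dividing by n - 1 always gives a slightly larger value than dividing by n, and the gap shrinks as the sample grows.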
You can go and search for that video to understand it in detail. Now, one interview question that may come up: what is the application of a box plot? Box plots can be used to determine outliers. As I told you, if I had kept 27 here, it would have shown up as a separate point, so a box plot gives you a visual way to see where an outlier is present. If someone asks you how you determine an outlier, you can explain this entire concept, everything I have explained with respect to percentiles.

Hello guys. Regarding the agenda, we are going to discuss a lot of distributions. First the normal or Gaussian distribution, then the standard normal distribution, then an example on z-scores, including the z table and why z-scores are used. Then we will discuss the log-normal distribution, the Bernoulli distribution and finally the binomial distribution, and we'll solve some examples. Then whatever practical part we have not covered till now will get covered: mean, median and mode with the Python programming language, then variance and standard deviation, and then histograms and PDFs (probability density functions). We'll see how a normal distribution looks in code, how to find the IQR using code, and some examples of the log-normal distribution. I can also discuss bar plots; not to worry, that also we'll try
to discuss. I can also discuss violin plots. Whatever comes up, we'll discuss it.

First things first: today we are going to discuss distributions. What exactly is a distribution of data? Let's say I have a data set of ages like 24, 26, 27, 28, 30, 32. When we have a data set like this, the first thing we need to focus on is how to see it in a visual way. In this case I'm going to treat age as continuous data. For continuous data, what kind of graphs help you understand the data? If I want to start my analysis I need to look at a lot of visual diagrams. There are multiple ways to visualize a distribution through various graphs, and these graphs play a very important role whenever we create reports or do exploratory data analysis.

So suppose I have a specific distribution of data and I want to plot it in some way. The best and easiest way you can think of is a histogram. We have already seen how to create histograms: you get bars that look like buildings. Finally, you smooth the histogram to get a curve, and this curve right now looks like a bell curve. Considering this, let's go to the first distribution. The first
distribution that I'd like to focus on is the Gaussian or normal distribution. As I said, the main purpose of having these different kinds of distributions is to get some idea about a data set. Most of the time you have seen the Gaussian or normal distribution drawn as a bell curve. There is some very important information in this bell curve. The center line can be your mean, your median and your mode. One important property of the bell curve is that one side is exactly symmetrical to the other. There are many inferential statistics about this bell curve that we will discuss in the future. The right part of the curve is equal to the left part, which means the amount of data present on one side is equal to the amount of data present on the other. Whenever we have a distribution that follows exactly this kind of bell curve, we can say it is a normal or Gaussian distribution. Why are we focused on this distribution? It is very important because we can derive a lot of conclusions from it. What conclusions can we derive? I'll talk about that now. Let's
go ahead and discuss this distribution. Let's draw it once again. Suppose this is my distribution, and the center is the mean, median and mode. Then you can go one step to the right, a second step to the right and a third step to the right. What is each step called? A standard deviation: one standard deviation to the right, two standard deviations to the right, three standard deviations to the right. Similarly I have one, two and three standard deviations to the left. This is very important. Now what can we conclude from this kind of graph? One side is symmetrical to the other.

Suppose I draw these lines: this is my first standard deviation to the right and my first standard deviation to the left, so this is the region within one standard deviation. The center I can write as μ. To the right I have μ + σ, μ + 2σ and μ + 3σ; to the left I have μ - σ, μ - 2σ and μ - 3σ.

The first thing we come to is called the empirical rule, and it is very important. It is the 68-95-99.7 percent rule. What does this mean? Let's go with 68. Suppose I have a data set with 100 data points. The rule indicates that between the first
standard deviation on either side of the mean, in this entire region, around 68 percent of the distribution is present. That means out of these 100 data points, about 68 will be present in this region. That is why it is called a bell curve: the central area holds a lot of the data. So 68 percent of the entire data set lies within the first standard deviation.

Now the second standard deviation. Within two standard deviations, which is this wider region, around 95 percent of the entire data lies. Similarly, within the third standard deviation, from here to here, around 99.7 percent of the entire distribution falls. That is why it is called the 68-95-99.7 percent rule. So if you have a distribution that is Gaussian or normally distributed, you can definitely conclude how much data falls within the first, second and third standard deviations.

Now let's see some examples. Height is normally distributed. Who is saying this? I am not saying it; the domain expert is saying it. In this case the domain expert is a doctor. Doctors have taken various samples from different places, and whenever the doctor constructed the curve it formed a bell shape like this, and from that he was able to derive
(or she was able to derive) how much data falls within the first, second and third standard deviations. Second example: weight also follows a Gaussian distribution. Third, I hope everybody knows the iris data set. In the iris data set, petal length and sepal length follow a Gaussian distribution; I will show you practically, don't worry about that.

A question: does following the empirical rule necessarily imply that the data is normally distributed? See, whenever you have Gaussian distributed data, it will follow this 68-95-99.7 percent rule. So that was the Gaussian or normal distribution.

Now let's take an example. Suppose I have a data set where my mean is 4 and my standard deviation is 1. With these two pieces of information, can I construct a distribution? Suppose the center is 4; then the next steps to the right are 5, 6, 7, 8, and to the left 3, 2, 1 and 0. Let's consider that the data follows this kind of bell curve, where the middle value is the mean: the mean is 4 and the standard deviation is 1.
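The 68-95-99.7 figures can be reproduced for the mean-4, standard-deviation-1 example with Python's standard library. This is a sketch added for illustration: `statistics.NormalDist` is a real standard-library class, while the helper name `coverage` is made up here.

```python
from statistics import NormalDist

# Mean 4 and standard deviation 1, as in the example above
dist = NormalDist(mu=4, sigma=1)


def coverage(k):
    """Probability mass within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.4f}")
```

The printed proportions are properties of the normal curve itself, so changing `mu` or `sigma` shifts and stretches the curve without changing them.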
Now see one thing. If I talk about 4.5, where does 4.5 fall in terms of standard deviations? Since 5 is one standard deviation to the right, 4.5 is +0.5 standard deviations to the right. Similarly, where does 4.75 fall? With 4.5 it was easy to see, but for a value like 4.75 it becomes difficult to do the calculation by eye. That is the reason we use a concept called the z-score. The z-score tells you how many standard deviations a value is away from the mean. The formula is

z = (x_i - μ) / σ

For 4.75 I write (4.75 - 4) / 1 = 0.75. So 4.75 is 0.75 standard deviations to the right; it is to the right because the value is positive. Now the same question for 3.75: apply the same formula, z = (3.75 - 4) / 1 = -0.25. Whenever a minus comes, you look on the left side, so 3.75 falls at -0.25, that is, 0.25 standard deviations to the left. Now let's go to the next thing. You have understood that if I want to know how many standard deviations to the right or the left a value lies, something I need to find
out, I can definitely use the z-score. Let's use the same bell curve: this is 4, this is 5, this is 6, this is 3, this is 2, this is 1. You know that the mean is 4 and the standard deviation is 1. Now what will happen if I apply the z-score to every value? The z-score formula is (x_i - μ) / σ, with mean 4 and standard deviation 1. Initially my values were 1, 2, 3, 4, 5, 6, 7. What will they be after applying the z-score? Apply it to 1 first: (1 - 4) / 1 = -3, so 1 is converted to -3. For the next element, 2: (2 - 4) / 1 = -2. For 3: (3 - 4) / 1 = -1, so 3 is converted to -1. Then 4 is converted to 0, and 5, 6 and 7 are converted to 1, 2 and 3.
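The z-score walkthrough above can be reproduced in a few lines. This is a minimal sketch with a made-up function name `z_score`; the mean of 4 and standard deviation of 1 are the ones given for the example, not values estimated from the seven axis marks.

```python
def z_score(x, mean, std):
    """Number of standard deviations x lies from the mean (signed)."""
    return (x - mean) / std


MEAN, STD = 4, 1  # parameters given in the example

axis_values = [1, 2, 3, 4, 5, 6, 7]
converted = [z_score(x, MEAN, STD) for x in axis_values]
print(converted)

# The two single values discussed earlier
z_475 = z_score(4.75, MEAN, STD)  # positive: to the right of the mean
z_375 = z_score(3.75, MEAN, STD)  # negative: to the left of the mean
print(z_475, z_375)
```

The sign of the result gives the side of the mean and the magnitude gives the distance in standard deviations.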
Now understand the main magic of the z-score. Are the values we got after applying the z-score not the number of standard deviations each original element lies from the mean? Initially my values were 1, 2, 3, 4, 5, 6, 7; after applying the z-score I can see that 1 falls at -3 standard deviations, 2 falls at -2 standard deviations, and so on. One beautiful thing is happening here: I had 1, 2, 3, 4, 5, 6, 7, and after I applied the z-score they were converted to -3, -2, -1, 0, 1, 2, 3. What is this new distribution called? Initially it was a normal or Gaussian distribution with mean 4 and standard deviation 1. After applying the z-score, the distribution we get is called the standard normal distribution. The most important property of the standard normal distribution is that the mean is 0 and the standard deviation is 1. Is that property satisfied here? It is: the center 4 has moved to 0 and one standard deviation is now one unit. So I can write that a random variable X follows the standard normal distribution, X ~ N(0, 1), where the mean is 0 and the standard deviation is 1.
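The mean-0, standard-deviation-1 property can be checked on a small sample. This sketch makes a few assumptions that are not in the lecture: the helper name `standardize` is made up, the mean and standard deviation are estimated from the sample itself, and the population standard deviation (`statistics.pstdev`) is used so that the result has a standard deviation of exactly 1 up to rounding. The ages are the ones mentioned earlier in the session.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Apply the z-score to every value using the sample's own mean and spread."""
    mu = fmean(values)
    sigma = pstdev(values)  # ASSUMPTION: population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


ages = [24, 26, 27, 28, 30, 32]
z_ages = standardize(ages)

print([round(z, 3) for z in z_ages])
print(round(fmean(z_ages), 6), round(pstdev(z_ages), 6))
```

Only the location and scale change here; the ordering of the values and the shape of their histogram stay the same.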
So after applying the z-score we get a different distribution called the standard normal distribution. Now the question arises: why do we do this? What is the use? Let's go with one practical application; we do this in machine learning, in most of the algorithms.

Suppose I am solving a machine learning problem and I have a data set with three columns: age, salary and weight. In what unit do we measure age? Years. Salary we may measure in rupees or dollars, and weight in kilograms. These are units of measurement. The values may look like this: ages of 24, 25, 26, 27; salaries of 40k, 50k, 60k, 70k; weights of 70 kg, 80 kg, 55 kg, 45 kg. In this data the units are completely different. Our target should be to bring every column into a form where the mean is 0 and the standard deviation is 1. That means I can take one column, apply the z-score and convert it, then take the next column and do the same. This process is called standardization, and it is super important. Many people will talk about normalization; I'll talk about the difference between standardization and normalization. Whenever we talk about standardization, in short, the z-score formula is being applied internally. So standardization is a process where I convert a distribution into
standard normal form, where the mean is 0 and the standard deviation is 1. Note that the z-score only shifts and rescales the values; the shape of the distribution stays the same.

Now let's move to normalization. In standardization we convert the data so that the mean is 0 and the standard deviation is 1. In normalization you choose a range: you say, I want to shift all the values I have to lie between 0 and 1. How do we do normalization? There is a very important technique called the min-max scaler. In the min-max scaler you just provide the range 0 to 1 and the normalization happens automatically, and yes, I will show you this practically also, don't worry. If I want to shift the values between -1 and +1 instead, I can do that too. So normalization is a process where you define the lower bound and upper bound and convert your data to lie between them.

Now, very important: where do we use normalization? I hope everybody knows about deep learning. In a CNN, whenever you are doing image classification or object detection, every image is made of pixels. Suppose I have a 4 × 4 image. Each pixel ranges between 0 and 255. Before we start training, the min-max scaler can be applied to this 0 to 255 range so that it is converted to 0 to 1: the minimum value 0 maps to 0 and the maximum value 255 maps to 1.
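Min-max normalization can be sketched without any library. The formula used below, x' = lower + (x - min) / (max - min) × (upper - lower), is the usual min-max scaling formula and is an addition here, since the lecture names the technique without writing it out. The function name `min_max_scale` and the sample values are illustrative.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale values linearly so the minimum maps to lower and the maximum to upper."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant column")
    return [lower + (v - lo) / (hi - lo) * (upper - lower) for v in values]


salaries = [40, 50, 60, 70]  # in thousands, as in the example columns
scaled_01 = min_max_scale(salaries)             # range 0 to 1
scaled_pm1 = min_max_scale(salaries, -1.0, 1.0)  # range -1 to +1
print(scaled_01)
print(scaled_pm1)

# Pixels: when 0 and 255 both occur, min-max scaling equals dividing by 255
pixels = [0, 51, 128, 255]
scaled_pixels = min_max_scale(pixels)
print(scaled_pixels)
```

If an image does not contain both 0 and 255, min-max scaling and division by 255 give different results, because min-max uses the smallest and largest values actually present.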
So we can apply this kind of min-max scaling or normalization. In this particular case, though, I will not use the min-max scaler itself, because the min-max scaler has its own formula; I will simply take each pixel and divide it by 255. When we divide by 255, all the values end up between 0 and 1, and this is another type of normalization. So till here we have discussed the min-max scaler, normalization and standardization.

Now let's solve one practical example for the z-score. Recently India played South Africa, and India lost. Let's consider the ODI series, which happened last year and this year as well. Say the series average in 2021 was around 250, the standard deviation of the scores was around 10, and Rishabh Pant's final score was 70. That is the series information for 2021. Similarly I have data for the 2020 series, where the series average is a little different: the series average of the team scores in 2020 was 260 and the standard deviation of the scores across all the matches was 12.
And in 2020, say Rishabh Pant's final score was 68. My question is: comparing the two series, in which year was Rishabh Pant's final score better? Many people will say 2020, others 2021; there will be a lot of confusion. So we will apply the z-score. For 2021, z = (x_i - μ) / σ with x_i = 70, which gives (70 - 250) / 10, and similarly for 2020. But these values will not come out sensibly, so let me change the data a little: instead of the final score I'll use the average score, otherwise the numbers will be very bad. Let's say Rishabh Pant's average score is 240 in 2021 and 245 in 2020.
I am using 240 and 245 because with a single score the distance from the series average was coming out huge, so I'm taking the player's average score over the series; say it is a three-match series. Now let me make the changes. For 2021: 240 - 250 = -10, divided by 10, which is -1 standard deviation. For 2020 the value changes to 245: 245 - 260 = -15, divided by 12, which is -1.25.

Is that clear, everybody? I know the data is not really realistic. Instead of Rishabh Pant's average score I could have used the team's score in the final match of each series; I messed up the problem statement because I was thinking of a final match score. Again, this is just an example; the main idea is to teach you something that you can apply anywhere. So here I got -1 and here I got -1.25. Let me write it down again for you: in 2021 the mean is 250, x_i is 240 and the standard deviation is 10.
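The comparison of the two series can be written out as a short sketch. The numbers are the revised ones from the example (series mean, standard deviation and the score being compared); the dictionary layout and the names `z_score`, `series` and `better_year` are illustrative only.

```python
def z_score(x, mean, std):
    """Number of standard deviations x lies from the mean (signed)."""
    return (x - mean) / std


# Revised figures from the example above
series = {
    2021: {"mean": 250, "std": 10, "score": 240},
    2020: {"mean": 260, "std": 12, "score": 245},
}

z_by_year = {
    year: z_score(s["score"], s["mean"], s["std"]) for year, s in series.items()
}
print(z_by_year)

# The larger z-score is the better result relative to its own series
better_year = max(z_by_year, key=z_by_year.get)
print(better_year)
```

Comparing z-scores instead of raw scores matters here: 245 is the higher raw score, but it sits further below its own series average once the spread is taken into account.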
If I have this information, can I draw the bell curve? This is my bell curve: the mean is 250 and the standard deviation is 10, so to the right we get 260, 270, 280, and to the left 240, 230, 220. Where does 240 fall? It falls at -1 standard deviation.

Now in 2020 the mean is 260, x_i (the final score) is 245, and the standard deviation is 12. Based on this I can draw another bell curve whose centre is 260. Since the standard deviation is 12, to the right we get 272, 284, 296, and to the left 248, 236, 224. My value sits here, at a z-score of -1.25.

At -1.25 the area to the left of the score is a little less; at -1 it is a little more. So where do you think India performed better in the final match, in 2020 or in 2021? This information also tells you something about, for example, the pitch conditions: when the standard deviation is smaller, most of the scores were clustered around the mean. Understand, guys: in 2020 the standard deviation is larger and in 2021 it is smaller. In 2021 the z-score is -1 and in 2020 it is -1.25, which is further below the mean, so relative to its own series the 2021 score is the better one.

Now let's go to one more practical example of the z-score. This kind of example comes up most of the time in statistics; it may be asked in exams and in interviews as well, and it is a very important question. I will take one very good
example and show you how it is done and how you can learn it. Let's come to the stats interview question. Let's say I have a random variable X with the values 1, 2, 3, 4, 5, 6, 7, and a bell curve that looks like this. My question is: what percentage of scores fall above 4.25?

Where does 4.25 fall? This is my mean, and 4.25 falls just to the right of it. These are my scores, and from the entire data set I want the percentage of scores that are greater than 4.25, so I am interested in this region to the right. It is a simple question. Now let's see how we can use the z-score here.

Everybody knows the z-score formula: (x_i - μ) / σ. Here μ is 4, the standard deviation is 1 and x_i is 4.25, so (4.25 - 4) / 1 = 0.25. What does this mean? 4.25 lies 0.25 standard deviations above the mean. I got this with the help of the z-score, but what is the next important thing? Obviously, from this alone we will not be able to tell
what the percentage will be. I have got a z-score of 0.25 and I am interested in this region, so how do I get the percentage? Understand one thing: this is a symmetrical bell curve, and the entire area under it can be taken as 1. The region I'm interested in I will call the tail; the remaining portion, from here to here, I will call the body. The z-score helps you find the area of the body; I'll show you how.

Now guys, just think it over: what might the percentage of this red region be? There are three numbers on the right-hand side and seven numbers in total, so if I say 3/7, that is approximately 43 percent. Can we calculate the same thing with the z-score? The answer is yes. The z value is 0.25, so let me open something called a z-table, because I want the area under the curve. If you search for z table you will find one in the first link. This is how the curves look. I'll just use another table, because this
table does not look right. Let's consider this one. Always remember there are three types of z-table we can get; one is this type, which I'll be discussing, and one is this type. This is the left z-table and this is the right z-table. Let me show you how to take the reading. My z-score is 0.25, and remember, this z-table will give me the area of the body.

See what is written here: this z-table shows the area to the right-hand side of the curve; use these values to find the area between z = 0 and any positive value. For the area in the left tail, look at the left-tail z-table instead.

In this case let me take the left z-table, because I want the area to the left of my value. If I get that area, I can subtract: 1 minus the left area gives me the area I want. Let me explain once again. First come to this diagram: do you want this part or this part? Obviously you want the right part, but in order to calculate it, if I get the value of the left part I can just do 1 minus the left area. Otherwise you can go directly and look in the
right table. Again, look for 0.25 there; that table gives the area from the mean up to this z value. The left table is also given, don't worry, and you can check it too. The right table gives you the value between the mean and z, and then you still have to work out the remaining part, or subtract from 1, to get what you want.

Now I will go to the left table. I repeat, guys, it clearly says: for the area in the left tail, look at the left-tail z-table. Why the left tail? This is my right tail, and the remaining body is everything to its left, so if I subtract that body from 1 I get the tail. Very simple.

How do I check this? In the left z-table I find the row for 0.2 and the column for 0.05, and I get 0.5987. So the area of the body is 0.5987. To find the tail I compute 1 - 0.5987 = 0.4013. So what is the percentage of scores that fall above 4.25? It is about 40 percent.

Why subtract from 1? It's very simple, guys. The whole curve has area 1, so if I take the whole curve and subtract the left part, I get the right part. So here you get 40 percent. Did you understand how important it is to understand the z-score? And yes, 0.5987 is the entire region to the left of 0.25 standard deviations, not just the part from the mean. Now did
you see how important this is for interview questions, guys? Why not take it directly from the right table? Understand, guys, a true right-tail table is not given here; this table only gives the area from the mean to z, as you can see in its diagram. The diagram for the left z-table shows that it gives the area of the whole body to the left. So you cannot read this value directly from the right table. That was an example of the z-score; standardization and all these things we have discussed.

So the next question is: in India the average IQ is 100 with a standard deviation of 15. What percentage of the population would you expect to have an IQ lower than 85? First let's discuss the graph. The mean is 100 and the standard deviation is 15, so to the right I have 115, 130, 145, and to the left I have 85, 70, 55.
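The z-table lookups can be cross-checked in code. This is a minimal sketch, assuming the scores are normally distributed with the mean and standard deviation stated in each example; it uses statistics.NormalDist from Python's standard library (Python 3.8+) in place of a printed table, and the variable names are only illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Example 1: scores with mean 4 and standard deviation 1, share above 4.25.
z = (4.25 - 4) / 1
body_area = standard_normal.cdf(z)  # area to the left of z (the "body")
tail_area = 1 - body_area           # area to the right of z (the "tail")
print(f"z = {z}, body = {body_area:.4f}, tail = {tail_area:.4f}")

# Example 2: IQ with mean 100 and standard deviation 15, share below 85.
# Here the wanted region is already the left side, so no subtraction is needed.
z_iq = (85 - 100) / 15
p_iq_below_85 = standard_normal.cdf(z_iq)
print(f"z = {z_iq}, P(IQ < 85) = {p_iq_below_85:.4f}")
```

The cdf call plays the role of the left z-table: it returns the whole area to the left of z, which is why the upper tail is obtained as 1 minus that value.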
So I have all these values. First let's compute the z-score, the same way as in the previous example. Here we want IQ lower than 85, so it is (85 - 100) / 15 = -15 / 15 = -1. This is my mean and this is -1 standard deviation, and the area I want is to the left of it. This area is already on the left side of the curve, so I will just look up 1.0 in the table. This table shows something like 0.86, so let me compare the answers with a different z-table; I'm not able to find the right one. Yes, this one looks good; I will give you the link. What I'm getting is 0.84134. Understand, this is for +1: it is the area from the far left up to +1. If I subtract, 1 - 0.84134 = 0.15866, and that is my value for IQ lower than 85, about 15.9 percent.

You may also get a question like IQ between 90 and 120 for the same problem statement. Then you have to solve it in a different way, but here the idea is just to talk about the area of the body. And yes, the negative sign will not matter: if you say negative the area comes from this side, if you say positive it comes from that side, and both sides are symmetric. You can also look up -1 directly; whatever you are able to find in the table, you can check. For -1.0, from the top, the same thing
you'll be getting, right? -1.0 gives 0.15866, which is 1 - 0.84134. Same thing.

Now let me quickly show you Google Colab Pro so that we can have a programming session. First I'm going to import some libraries: import numpy as np, import matplotlib.pyplot as plt, then %matplotlib inline, and I'll also import statistics. First things first: how to compute mean, median and mode. Let me load a data set called tips into df and call df.head(); here you can see my data set.

For the mean I use np.mean on df['total_bill'], and if I execute this you see the mean of the total bill. For the median, np.median(df['total_bill']). If you see a difference between the two, think that there may be some outliers. For the mode I can use statistics.mode(df['total_bill']); here the mode is 13.42.

Now if I want to see a box plot, which is used to see outliers, I pass df['total_bill'] to it, and here is my box plot. Does this indicate outliers? Definitely outliers are present here. And what are these lines? Minimum, 25th percentile, median, 75th percentile and maximum. All of these we have calculated. If
you write sns.histplot on a specific feature, it will create a histogram for it. If I execute this you see one example. Is this normally distributed data? I guess not. If you want to see it with the probability density function, use kde=True. With kde=True, does this look normally distributed? No, it is a little bit skewed towards the right.

I'll also show you an example of normally distributed data. For that I use sns.load_dataset with the iris data set, which gives measurements for different types of iris flowers. With df1.head() you can see species like setosa and versicolor, and four features: sepal_length, sepal_width, petal_length and petal_width. Let me plot the same thing for one feature, say df1['sepal_length']. Does this follow a Gaussian distribution? I guess not. Let's try sepal_width. Finally, wow, this looks like a Gaussian distribution. Here you can also apply the 68-95-99.7 percent rule. I'll also show you how to construct this PDF as we go ahead. So this was one example of normally distributed data, and the earlier one was not normally distributed.

If I use count plot with respect to species in df1,
what is this plot, guys? It is a bar plot, or bar graph, whatever you want to call it.

Now percentiles. For percentiles I can use np.percentile on df1['sepal_length'] and pass the percentiles I want, say 25 and 75. If I execute it, I get 5.1 as the 25th percentile and 6.4 as the 75th percentile, so my IQR is 6.4 - 5.1 = 1.3. If you want the 99th percentile as well, you can write 99 in place of 75, and then you get 5.1 and 7.7.

Hello guys, how are you all? I hope everybody is doing well. So let's start. What are we going to do today? First we are going to implement IQR using Python. The second topic is probability. The third is permutation and combination. Once we finish that, the fourth is confidence intervals, and if we get time we will cover the p-value and then start with hypothesis testing.

I am going to start with Google Colab; you can open it too. I'll make a new notebook. First we'll implement the z-score and find the IQR, and see what we can do with them. Other distributions will also come, don't worry: Bernoulli, binomial, power law, everything will be discussed, but let's go in the order I have decided, and when those distributions come up we'll discuss them. So in this session we
are first going to discuss outliers. First I am going to import some libraries: import numpy as np, import matplotlib.pyplot as plt, and %matplotlib inline, and I'll execute this. Next, let's define our data set. You can take any data set you like; for my sake I have created one here. Can you spot some numbers in it that look like outliers?

Let's say I want to find outliers using the z-score. How do you do that? Let me explain. We have discussed so many things about the normal distribution: this is the mean, then the first, second and third standard deviations to the right and to the left, and you know that about 68 percent, 95 percent and 99.7 percent of the data lie within one, two and three standard deviations. So if my data is normally distributed, can I consider data beyond the third standard deviation to be outliers? Yes. Most of the time, values beyond the third standard deviation can be treated as outliers. So first we'll implement this. What I am going to do, first of all, let me make a list.
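The three-standard-deviation rule just described can be sketched as follows. This is a minimal sketch under stated assumptions: the data set below is made up for illustration (the one shown on screen in the session is not reproduced in these notes), the name detect_outliers and the threshold parameter are illustrative, and the standard library's pstdev is used for the population standard deviation, which is also what np.std computes by default.

```python
from statistics import fmean, pstdev


def detect_outliers(data, threshold=3):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = fmean(data)
    std = pstdev(data)  # population standard deviation (ddof = 0)
    outliers = []
    for x in data:
        z = (x - mean) / std
        if abs(z) > threshold:
            outliers.append(x)  # keep the original value, not its z-score
    return outliers


# Hypothetical data: mostly values between 10 and 19, plus three large ones.
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

found = detect_outliers(dataset)
print(found)
```

The rule leans on the data being roughly normal; with very small samples no point can reach a z-score of 3, so the threshold is a convention rather than a guarantee.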
Here I'm calling it outliers; I'm creating it as a list and will put all the outliers inside. With the z-score we can find out how many data points fall within the third standard deviation. So I'm going to create a function, def detect_outliers, and pass it my data. First I create a threshold; my threshold will be 3 standard deviations, and anything that falls further away than that I will flag.

I hope everybody remembers the z-score formula: (x_i - μ) / σ. We sometimes also write this formula with σ divided by root n, but I'll talk later about why I'm not using root n here. So I have to implement this formula in Python. Obviously I need to compute the mean and the standard deviation. mean = np.mean(data) gives me the mean, and std = np.std(data) gives me the standard deviation. Now for each point in my data set I apply the z-score formula: for i in data, z_score = (i - mean) / std, where i is my x_i. So this is my z-score formula, and for every item I'm trying to find
out the z-score. The z-score gives you how many standard deviations the value is away from the mean, so I can write one condition to check whether it falls within the third standard deviation or not. I use np.abs, which takes the absolute value of the z-score, and check whether np.abs(z_score) is greater than threshold. I have already defined threshold. Oh sorry, the loop should be over the data set. Now tell me, if np.abs(z_score) > threshold, what does this mean? It means the point is an outlier, because it is falling below or beyond the third standard deviation. Since I have created a list, I'll append to it. Not the z-score value; I will append the i value, so outliers.append(i). Finally I return outliers. Let's see whether it works; I'm also trying it for the first time. So the function has been executed.

threshold = 3 defines our third standard deviation: anything beyond it, on either side, is flagged. If you want to check how the data is distributed, I can write plt.hist on the data set. It says plt is not defined; okay, this should be plt. It's okay whether it is normally distributed or
not; I am just trying to see it. There are some definite outliers. Let's see whether we can detect them. What was wrong? The for loop should be over data, the parameter I'm passing in; oh sorry. This function is simple, right, guys? Everybody understood? threshold here is my third standard deviation. If you want the data set, I have pasted it in the chat.

Now let's execute it. I call detect_outliers with the data set. np.abs is the absolute value function. Once I execute it, you see it returns these three outliers. Are these my outliers or not, guys? The for loop is very simple: for i in data, I compute the z-score of every value in the list and check whether its absolute value is greater than 3; if it is, I consider it an outlier. Here you can see all the outliers; they are the big numbers. If you have not attended the previous sessions, guys, you will not be able to follow this, because it is a seven-day live session.

So I have got the outliers. This is one way to use the z-score, so I'll label this part z-score computation. Now let's go to IQR. IQR means interquartile range. What are we doing in IQR? First we need to find Q1, the 25th percentile, then Q3, the 75th percentile. Then if I subtract 75th percentile
minus 25th percentile, I get the IQR. And in IQR we then find the lower fence and the higher fence. So how do I write the code? The theory is already explained, so I'll write down the steps that are required. The first step is to sort the data. The second step is to calculate Q1 and Q3, which are pretty important here. The third step is to find the IQR, which is Q3 - Q1. Then we find the lower fence, which is Q1 - 1.5 × IQR, and then the upper fence, which is Q3 + 1.5 × IQR. These are the steps I will perform in order to find the outliers with the help of IQR.

Now first, how do I get the sorted data set? I pass my data set to sorted, which is a built-in function that sorts all the numbers: dataset = sorted(dataset). So right now I have created a data set which is completely
sorted, so my first step is done. For the second step I need Q1 and Q3, so I write q1, q3 = np.percentile(dataset, [25, 75]). Once I execute it and print q1, q3, you can see my 25th percentile and my 75th percentile. Now let's compute the lower fence and the higher fence. I'll write the comment "find the lower fence and higher fence". lower_fence = q1 - 1.5 * iqr, and before that I need iqr = q3 - q1. If I print iqr, what is this error? Now if I execute it you see the IQR is 3. For the higher fence I write higher_fence = q3 + 1.5 * iqr. Once I execute and print lower_fence, higher_fence, I get 7.5 and 19.5. The further part I think you can comfortably do: based on the lower fence and higher fence you can write a condition and remove the elements that fall outside them. Don't worry here about whether the data is normally distributed or not; whatever data set you get, you can find the lower and higher fence and do this.

Now instead of doing all this, if I import seaborn as sns, there is an option called boxplot (not histplot, sorry); we saw earlier how to create a box plot. If the lower fence is negative, then what you can do
is, based on that condition, remove any value lesser than it. And here, if I give my data set to the box plot, you can see how it is drawn. You see there is a very big outlier; this is the same outlier we found with the other methods. Here also you can see 7.5 to 19.5; most of your data points lie in that range. If I remove those three elements and plot the data set again, the box will look bigger.

Now let's discuss the next topic, probability. Probability is super important, and in this session I will discuss the major things in probability and what we can do with it. Probability is used by default in machine learning and in deep learning, in many places. Let's take one example. Suppose I have two categories of data, class A and class B, and I draw a best-fit line between them, a straight line like the one used in linear regression. My question is: for a point that lies right on the line, what is the probability that it belongs to class A, and what is the probability that it belongs to class B? So with probability we can get a lot of things; it is used in linear regression, it is used in logistic regression, Bayes' theorem is used there, and so on.

Let's understand what exactly probability is. If you want a definition: probability is a measure of the likelihood of an
event. The reason why I am writing all these definitions, guys: you really need to think about what exactly is happening over here, and you can remember a definition more easily through an example, which is why I also give you a lot of examples. Let's say that I am rolling a dice (roll, not flip; flipping is for a coin). In a dice, what is my sample space? You know that it is one, two, three, four, five, six. Now if I ask you what is the probability of getting 6 when I roll a dice, how will you calculate it? Obviously you will say one by six. It's very simple.

So how do we define probability? I'll say it is the number of ways an event can occur divided by the number of possible outcomes, when every outcome is equally likely. That is the definition. In this particular scenario I am trying to find the probability that I get a six when I roll a dice. In how many ways can that event occur? Only one. And what is the total number of possible outcomes? It is six. So this is how we find it: 1 by 6.

Similarly, let me give one more example. Let's say that I want to toss a coin. I know my sample space: head and tail. What is the probability of getting head? You will just say 1 by 2, because the sample space has 2 outcomes and the event can occur in 1 way, so it is 1 by 2.
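As a rough sketch of the definition above (number of ways an event can occur divided by number of possible outcomes, assuming every outcome is equally likely), the dice and coin examples can be checked in Python. The helper name `probability` and the variable names are only illustrative; they are not from the lecture.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Classical probability: favourable outcomes / all outcomes.

    Assumes every outcome in sample_space is equally likely.
    """
    favourable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favourable), len(sample_space))

die = [1, 2, 3, 4, 5, 6]   # sample space for rolling a dice
coin = ["H", "T"]          # sample space for tossing a coin

p_six = probability({6}, die)      # one way out of six
p_head = probability({"H"}, coin)  # one way out of two

print(p_six, p_head)  # 1/6 1/2
```

Using `Fraction` keeps the answers in the same "1 by 6" form used in the session instead of rounding them to decimals.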
So you say the probability of head is one by two. Now let's go one step further in probability, to what is called the addition rule. This is super important; probably in your aptitude tests you will be using it. We also call it "probability or", and you may see it written with an "or". In order to understand the addition rule you need to understand two things. One is mutually exclusive events. What are mutually exclusive events? I can define it like this: two events are mutually exclusive if they cannot occur at the same time.

Let's see an example: rolling a dice. When I roll a dice, at a specific time I can get 1, or 2, or 3, or 4, or 5, or 6. You cannot get 1 and 2 at the same time, or 1, 2, 3 and 4 at the same time. In one experiment, rolling a dice a single time, you will only be able to get one number; you will not get two numbers. So this is an example of mutually exclusive events. Another example is tossing a coin. In this case also you may get either head or tail; you cannot get both, unless and until your coin lands standing on its edge as shown in the movies. Which movie am I talking about? You can consider good movies like Sholay. Only one type of event occurs each time.

Now let's discuss non-mutually exclusive events. You have understood what mutually exclusive means; with non-mutually exclusive events, obviously both the events can
occur at the same time; multiple events, two or more, can occur at the same time. Let's take a very simple example: a deck of cards. Consider two events. When I pull out a card from the deck, a king can come, or let's say a queen can come, and along with the queen a red heart can also come. So here multiple events are possible together; these two events are obviously not mutually exclusive. I can pick up a king, and it can be black or it can be red. Multiple things are happening, so this is a perfect example of non-mutually exclusive events.

Now based on this there are some amazing problem statements that you can solve. First, mutually exclusive. My first question: if I toss a coin, which is again a mutually exclusive case, what is the probability of the coin landing on heads or tails? Whenever you get this kind of problem statement, first of all think about whether the events are mutually exclusive or not. Yes, obviously they are mutually exclusive. Now I need to find the probability of getting heads or tails. I want to write a common definition for this: P(A or B), where A and B are events, is equal to P(A) plus P(B). Whenever you have mutually exclusive events you can use this definition, which is also called the addition rule for mutually exclusive events. Here P(A) is 1 by 2 and P(B) is 1 by 2, so 1 by 2 plus 1 by 2, and the answer will be 1.
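The addition rule for mutually exclusive events can be sketched in a few lines of Python. This is only an illustration under the assumption of equally likely outcomes; the function names `prob` and `prob_or_mutually_exclusive` are made up for this example. The helper refuses to apply the rule when the two events overlap, because then they are not mutually exclusive.

```python
from fractions import Fraction

def prob(event, sample_space):
    """P(event) for equally likely outcomes."""
    return Fraction(len(set(event) & set(sample_space)), len(set(sample_space)))

def prob_or_mutually_exclusive(a, b, sample_space):
    """P(A or B) = P(A) + P(B), valid only when A and B cannot occur together."""
    if set(a) & set(b):
        raise ValueError("events overlap, so they are not mutually exclusive")
    return prob(a, sample_space) + prob(b, sample_space)

coin = {"H", "T"}
p_heads_or_tails = prob_or_mutually_exclusive({"H"}, {"T"}, coin)

print(p_heads_or_tails)  # 1
```

Raising an error on overlapping events is a design choice for the sketch: it makes the "first check whether it is mutually exclusive" step from the session explicit in code.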
So the probability of A or B is one. These are very important things: in exams you will be getting this, in aptitude tests also, in multiple places. Let's take one more example. Suppose I roll a dice: what is the probability of getting one or three or six? Yes, many people are saying it right, it is one by two. Here I can say it is P(1) plus P(3) plus P(6). These are all 1 by 6, so 1 by 6 plus 1 by 6 plus 1 by 6, which is 3 by 6, which is 1 by 2, which is 0.5. So this is what we have discussed for mutually exclusive events.

Let me take the next problem statement, for non-mutually exclusive events, with a very good example again. The question is very simple: you are picking a card randomly from a deck. What is the probability of choosing a card that is a queen or a heart? As the first step you will check whether these are mutually exclusive or non-mutually exclusive. In this scenario they are non-mutually exclusive, because both can occur at the same time. Now how do you solve this specific problem? First of all you need to find out the different quantities it asks about. Let's start with the probability of getting a queen. Just think over it, guys: how many queens are there in a deck of cards? In the
total deck of cards there are 52 cards. If you have never played cards, please go buy a deck today and see. The probability of getting a queen is 4 by 52, because in every deck there are 4 queens. The next thing is the probability of a heart. How many hearts are there in a deck? There are 13, so I'll say 13 by 52. The next thing is the probability of queen and heart, because this is also one possibility. How many cards are both a queen and a heart? Only one, so here I will write 1 by 52. These are the possible things that can occur.

Now I come to the formula, and this is the addition rule for non-mutually exclusive events. I can write P(A or B) is equal to P(A) plus P(B), but there is one important thing, the intersection, which I have to separate out: so it is P(A) plus P(B) minus P(A intersection B). A intersection B means A and B, the possibility of both. Now my question is very simple: what is the probability of getting a queen or a heart? I'll draw it in red. You have the answer with you: it is P(queen) plus P(heart) minus P(queen and heart). What is P(queen)? 4 by 52. What is P(heart)? 13 by 52. And what is P(queen and heart)? 1 by 52. So in the numerator 4 plus 13 is 17, minus 1 is 16, and I get 16 by 52, which is 4 by 13. This will be the probability.

Now you have understood the addition rule. We need to understand one more rule in probability. See guys, if you do this much, I think you will be able to solve any problem statement that comes to your mind. So this was the problem statement that we did, and it was specifically related to
something called the addition rule. Now coming to the third one, which is called the multiplication rule. In the multiplication rule we need to understand independent events and non-independent events; these are very important. (And in the previous question I said "or heart", okay.)

Independent events: what specifically are they? Let me give an example. Let's say that I am rolling a dice. If I roll a dice I may get one, two, three, four, five or six. Suppose in the first instance I got one; in the second instance I may get one again; in the third instance I may get two, I may get any number. So one event is not at all dependent on the other, because any time we roll, every outcome has an equal probability of coming. What you can understand is that each and every event is independent: if one comes, or two comes, it is not going to impact any other roll. This is what is called an independent event.

Now let me talk about non-independent events, or instead of non-independent I'll say dependent events. Suppose I have a bag, and in this bag I have three red marbles and two green marbles. In the first instance, if I pick out a marble, what is the probability that it is red? Very simple: how many marbles are there? A total of five. And how many red marbles? Three. So you can write three by five. Now let's consider that in the first event you
picked out a red marble; I'll mark it in red. After taking out the red marble, how many marbles are remaining? I will update the bag: it now has two red marbles and two green marbles. Now if I try to find the probability of taking out a green marble, what will you say? You will see how many marbles there are and say two by four. So what is happening here: the first event has impacted the second event, because the number of marbles is reduced, and finally you got 2 by 4. This is a perfect example of a dependent event.

The multiplication rule basically says that in the case of independent events we solve it one way, and in the case of dependent events we solve it in a different way. Because of dependent events there is an amazing algorithm called Naive Bayes. Have you heard of Naive Bayes? I think most of you have. There is a topic called conditional probability, and this is where conditional probability comes into existence; I will talk about it.

So let's go and solve some problems. First we will talk about independent events. The question is: what is the probability of rolling a five and then a four? In the first event you roll a dice and get five, and then you roll again and get four. What is the probability of getting 5 and then 4? This is a simple question, and obviously these are independent events. Now how do we solve this? Here we apply the multiplication rule. What is the probability of A and B? A and B
basically means that first event A has occurred and then event B has occurred. What is the probability of this? I'll define the formula: P(A and B) is equal to P(A) multiplied by P(B). This is the usual formula that we use for independent events in the multiplication rule. So here I'll write P(5 and then 4). You know P(5) is 1 by 6 (sorry, not 1 by 2), and P(4) is 1 by 6, so 1 by 6 multiplied by 1 by 6 is 1 by 36.

Now let's take another example, because independent events look very simple. This example will be of dependent events. What is the probability of drawing a queen and then an ace from a deck of cards? Here two events are happening. First of all, again, you need to find out whether these are independent or dependent events. In this case they are dependent, because the deck gets reduced after the first draw. So what is P(A and B) in the case of dependent events? Here I can write P(A) multiplied by P(B given A). What does this mean? This term is called conditional probability.

Let me show you an example with the bag. I have a bag with, let's say, three green marbles and two red marbles. I want to find the probability of drawing a green and then a red marble. In the first instance, if I'm taking out green, the probability is 3 by 5. After I take out the green marble, the total number of marbles remaining is 4, so the probability of red
will be 2 by 4. Now, this first term is the probability of green, and what is this term, 2 by 4? It is the probability of red given green. "Given green" means that the green event has already occurred, which is why the number of marbles has been reduced. So the answer is P(green) multiplied by P(red given green), which is 3 by 5 multiplied by 2 by 4. This is called conditional probability, and it is very helpful in Naive Bayes, or I'll say Bayes' theorem; in Bayes' theorem this will be very important.

Now back to the cards: queen and then ace. Here I'll write P(queen) multiplied by P(ace given queen). What is P(queen)? It is 4 by 52. After the queen is drawn, 51 cards remain (there are 52 cards in a deck, not 53; don't confuse me, guys) and all four aces are still there, so P(ace given queen) is 4 by 51. So it is 4 by 52 multiplied by 4 by 51, and whatever you get here is your answer.

Now let's discuss permutation and combination, a very small topic; probably in five minutes I will be able to complete it. First of all, permutation. Let's say that I have taken some students on a school trip to a chocolate factory, where they make a lot of chocolates. I catch hold of one student and say, okay, I'll give you an assignment. Let's say that in this chocolate factory six different types of chocolates are made: Dairy Milk, 5 Star, Milky Bar, Eclairs, Gems, and one more, a normal toffee or, let's say, Dairy Milk Silk. So I have given this student an assignment, saying that there are
six chocolates being made in this factory. In your diary, write down the first three chocolates you see once you enter the factory; just write those three names and come back to me. So the student went inside the factory. In the first instance, how many different options can this student have? He may see any of the chocolates, so he has six options, out of which he writes one name. In the next instance, how many chocolates remain? A total of 5, so he has 5 options for the second name. Then finally, when he comes to write the third name, he has four options. Now if I multiply these, six multiplied by five multiplied by four, it is 120. What is this 120? It is all the possible permutations of the chocolate names that he may see. He may see them in this order: Dairy Milk, Gems, Milky Bar. He may also see them in a different order: Milky Bar, Gems, Dairy Milk. So the total number of possible options is 120.
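The 6 multiplied by 5 multiplied by 4 count above can be double-checked by listing every ordered selection of three names. This is a small sketch using Python's standard library; the exact chocolate names in the list are placeholders based on the ones mentioned in the session.

```python
import math
from itertools import permutations

# Six chocolate types (names are illustrative placeholders)
chocolates = ["Dairy Milk", "5 Star", "Milky Bar", "Eclairs", "Gems", "Dairy Milk Silk"]

# Every ordered way of writing down 3 different names out of 6
ordered_triples = list(permutations(chocolates, 3))

count_by_listing = len(ordered_triples)
count_by_product = 6 * 5 * 4          # options for 1st, 2nd and 3rd name
count_by_formula = math.perm(6, 3)    # built-in count, Python 3.8+

print(count_by_listing, count_by_product, count_by_formula)  # 120 120 120

# Order matters: these two selections are counted separately
first_order = ("Dairy Milk", "Gems", "Milky Bar")
second_order = ("Milky Bar", "Gems", "Dairy Milk")
print(first_order in ordered_triples, second_order in ordered_triples)  # True True
```

The last two lines show the point made in the example: the same three chocolates seen in a different order count as a different permutation.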
Now, when I say 120, these are all the possible options, and this is what a permutation is. How do you write the permutation formula? Let's go back to school days, when we used to memorize all the formulas directly: nPr is equal to n factorial divided by (n minus r) factorial. Here n is the total number of chocolates and r is how many names I have told the student to write. So you have 6 factorial divided by (6 minus 3) factorial, which is 6 multiplied by 5 multiplied by 4 multiplied by 3 factorial, divided by 3 factorial. The 3 factorials cancel, so the answer is 120. This is permutation.

Now how does combination come into the picture, and what is the difference between permutation and combination? In combination, always understand this: suppose I have Dairy Milk, Gems and Eclairs. If I've used this set of elements once, I cannot use the same elements again to make a different combination. A combination is unique with respect to the elements used: if I have used Dairy Milk, Gems and Eclairs, I cannot swap them around and count that as a different order. So for combinations you have another formula, which focuses on the uniqueness of the objects you are picking. The formula is nCr, which is n factorial divided by the product of r factorial and (n minus r) factorial. Here n factorial is 6 factorial, r factorial is 3 factorial, and (n minus r) factorial is (6 minus 3) factorial. So you have 6 multiplied by 5 multiplied by 4 multiplied by 3 factorial, divided by 3 multiplied by 2 multiplied by 1, multiplied by 3 factorial. The 3 factorials cancel, 3 times 2 cancels the 6, and 5 times 4 is 20. So you can have twenty unique combinations.

Now, the first topic that we are going to discuss is something called the p value, a super super
important topic; many people get confused by this. Let's take one example. Everybody uses a laptop. Let's say this is my laptop and this is its mouse pad: this is your right button, this is your left button, and here is where you move your fingers. Don't you think that most of the time when you're moving your fingers, you will be moving them in this specific region, in the middle, and not in the corners? You will hardly ever touch the corners. Why am I drawing this? Because it specifies your distribution of touches, and most of the time your distribution of touches will look something like this curve. Understand why this area is bulged: because most of the time you'll be touching here. This other area is low because you will hardly be touching there.

Now let's say that my p value for this position is 0.8. What does this 0.8 mean? Let's say I touch this mouse pad 100 times. For every 100 times I touch it, 80 times I touch this specific region. I hope everybody understood this: out of every 100 touches, the probability of touching this region is 80 out of 100, that is, 80 percent. Similarly, if I say my p value over here is 0.01, what does this mean? That out of every 100 touches, only one lands here. You can consider any region the same way. This region is the broadest, so it may have p equal to 0.9, which means that out of every 100 touches I am touching here 90 times, while that one will be touched only one time. So I hope you are getting the
understanding of the p value: the p value basically tells you the probability of an outcome for that specific experiment. (In hypothesis testing, more precisely, it is the probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true.)

Now I'm going to combine multiple topics. The first topic is hypothesis testing, and in that I am going to combine confidence interval, significance value and many other things. Let's say I am solving a problem: I have a coin and I want to test whether this coin is a fair coin or not by performing 100 tosses. Now we are entering inferential statistics, which is super important. When do you think a coin is a fair coin? Obviously when the probability of heads is 0.5 and the probability of tails is 0.5. If you have these two conditions, you will definitely say that the coin is fair. But what if you have a Sholay coin? With a Sholay coin the probability of heads is 100 percent, so you will definitely not say that it is a fair coin.

In order to test this I am performing 100 experiments, which means 100 tosses. Out of these 100 tosses, what will be the mean? Let's say I'm just focusing on heads. Forget about the probability of head; the number of times I should get head is how much? 50. If I get head 50 times, I can definitely say that the coin is fair. I
can definitely say this: if, after performing 100 experiments, I get head 50 times, the coin is fair. Now, very important: in this scenario we have to focus on something called hypothesis testing. In hypothesis testing the first thing is to define our null hypothesis. The null hypothesis is usually given in the problem statement. What do we want to test? Whether the coin is a fair coin or not. Whatever the default assumption is, I'm going to use it as the null hypothesis, so here I'm saying: the coin is fair. It's like the scenario where a person cannot be called a criminal unless and until it is proved. The second thing we define is the alternate hypothesis; here I'll say the coin is unfair. Always remember, the alternate hypothesis is the opposite of the null hypothesis, whatever we are trying to prove. The third thing is that we perform the experiment, and the experiment can be anything: a z test, a t test, whatever you want. I will discuss all of this practically, don't worry. In this experiment we observe some values, and based on them, as the fourth step, we reject or accept (strictly speaking, fail to reject) the null hypothesis. These are the steps of hypothesis testing.

Now let's define this, guys. Let's say that my mean value is 50: I need to get head 50 times. Let's consider that this is my mean; not minimum 50, but 50.
50 is what I should be getting in order to say that my coin is fair. Let's say that for this problem statement, I'm just assuming, the standard deviation is 10, so the axis will go 60, 70, 80, 90 on one side and 40, 30, 20, 10 on the other. If I know my mean and standard deviation, I can draw a curve which looks like this. Now I'll perform the experiment. Let's say I have tossed 100 times, and just imagine I got head 30 times. 30 heads is somewhere at this point on the curve. Can I still say that this coin is fair, or is it unfair? Think over it: I have performed the experiment and I got head 30 times out of a hundred, so tell me whether the coin is fair or not. Many people are saying not fair, fair, fair, not fair.

To decide this, it is always said that our experiment result should be near the mean. Now how do we define how far it can be from the mean? For that we use a very important property called the significance value, which is denoted by alpha. Suppose I am taking alpha as 0.05. What exactly does this 0.05 mean? I've taken my significance value as 0.05; when I convert this into a percentage it becomes 5 percent. If from my hundred percent I subtract 5 percent, this indicates a 95 percent confidence interval. Now what is this 95 percent confidence interval? If I
subtract alpha from 1, I get the 95 percent confidence interval. Now which part is this 95 percent confidence interval? Let's consider that 2.5 percent is this part on the left and 2.5 percent is this part on the right, since this is a two-tailed test (I'll talk about two-tailed tests also, don't worry). So from this point to this point is my entire 95 percent confidence interval. This is defined by a domain expert; let's consider that it has been defined. So what does that 0.05 indicate? When I divide it into two parts, 2.5 percent comes here and 2.5 percent comes here.

Now understand one very important thing. Let's say I got 30, so this is my 30, and I have also defined my confidence interval from this point to this point. Whenever the result comes inside this interval, we say that the coin is fair. Why? Because it is within the interval. We need to define this because otherwise we don't know what the number should be. When I got head 30 times, many people said not fair, but who are we to decide? The domain expert will decide, and he will decide with the help of the significance value. Suppose they say the significance value is 0.05. That means that if the experiment falls within this 95 percent confidence interval, I will say that the coin is fair; if it falls outside this confidence interval, I will say that the coin is not fair.

Now tell me: let's say that this lower number is 20 and this upper number is 75, so 20 to 75 is my confidence interval. I perform the experiment. If I get only 10 heads out of 100 tosses, should we accept or reject the null hypothesis? The null hypothesis is that the coin is fair; the alternate hypothesis is that the coin is unfair. If I get 10 heads, which region does it fall in? It falls somewhere here, not inside the confidence interval, so we can definitely say
that the coin is not fair. In that case we reject the null hypothesis and accept the alternate hypothesis. I hope everybody is able to understand the terminology we are using. I cannot teach you these as separate topics; I have to combine them to teach you how to do it. Now what if you get 95 heads in those 100 tosses? Which region will it fall in? 95 is somewhere here, outside the interval. So should we accept or reject the null hypothesis? Obviously we have to reject the null hypothesis, and the alternate hypothesis will be accepted. It's very simple: I perform the experiment, and whatever value I get, I go and check it against this interval.

Let me tell you one more scenario, with the axis at 50, 60, 70, 80, 90. Let's say that my domain expert said, "Krish, you are a fool, why have you taken alpha as 0.05? I don't want that." So let's say that your alpha is 0.20. Now what will your confidence interval be? It will be 80 percent instead of 95. So on your graph the boundaries will move further inward: this side will have 0.10, this side will have 0.10, and everything in between will be your 80 percent. When you add all of this up, it will be one. At that point you can go and find your confidence interval: this value gives you the lower bound of the confidence interval and this value gives you the upper bound. You perform the experiment; now just imagine you got 25 from that experiment, and decide whether you should reject or accept.

Tell me one thing: if your alpha value is 0.3, what is your confidence interval? (Vishu Sharam, I just took it for heads only.) So what is your confidence interval if your alpha value is 0.3?
Obviously you'll say that it is 0.7, that is, a 70 percent confidence interval. So the significance value alpha and the confidence level are complements of each other: confidence level = 1 − alpha. Usually, when we talk about the p value: suppose the result does not fall in the confidence interval, I may say that the p value is less than 0.3, and because of that I have to reject the null hypothesis.
Hello guys. Today the first thing we are going to check out is something called type 1 and type 2 error. Very super important; in machine learning you will discuss this when you get to the confusion matrix. Fine guys, if it is not uploaded, don't worry, it will get uploaded today; I will tell the backend team to do it. The second thing we are going to discuss, after type 1 and type 2 error, is the one-tailed and two-tailed test. The third topic is how to find the confidence interval, that is, how to calculate it when an alpha value is given; I told you we need to define some confidence interval in order to solve some problems. The fourth topic, after confidence intervals, is the z test and the t test, and if we get time we will also finish the chi-square test. So let's start with type 1 and type 2 error. Always understand: in any kind of hypothesis testing we have something called the null hypothesis, which is usually denoted by H0.
We also have something called the alternate hypothesis, which is denoted by H1. Now, at the end of the day, let's say I'm performing an experiment to check whether the coin is fair or not fair; I'll take the same example we discussed yesterday. I check the significance value, based on that the confidence interval is defined, and I check whether the result falls within the confidence interval; that is what I explained in the earlier part. After we perform the experiment there are two sides to look at. First, the reality check: in reality either the null hypothesis is true or the null hypothesis is false; only these two are possible. Second, the decision, because this is what I am actually testing: in my decision I may either say the null hypothesis is true or say the null hypothesis is false. When I decide the null hypothesis is false, I say that the alternate hypothesis is accepted, or that we reject the null hypothesis. From these two you can derive the possible outcomes; this is very important. Outcome one: we reject the null hypothesis, that is my decision, when in reality it is false. Is this a good decision? Yes, it is obviously a very good decision; this is how we should take a decision. Now the second outcome.
Let's go ahead and discuss the second outcome. Outcome two: we reject the null hypothesis when in reality it is true. Is this a good decision or not? Obviously many people will say that it is a bad decision, so here I will say no, and this kind of decision is specifically called a type 1 error: rejecting the null hypothesis when in reality it is true. Let's say I take my null hypothesis as "the person is innocent" and my alternate hypothesis as "the person is not innocent". In this case we are convicting him. In movies we have seen this, right? A person is awarded a death sentence in a fake case even though he has not done anything wrong. That is the kind of example you are seeing over here: we reject the null hypothesis when in reality it is true, which basically means the person is awarded a death sentence even though he did not do anything. The decision says this person is not innocent, but in reality he is innocent, so this is a perfect example of a type 1 error. Outcome three is also a very important outcome, and all of these you will be able to relate to the confusion matrix; I don't know how many people know about the confusion matrix. Here we retain the null hypothesis, or we accept the null hypothesis. Let's say that I am saying we accept the
null hypothesis when in reality it is false. Is this a good decision? The answer should be no, and this error is called a type 2 error. In the same example, even though the person has committed the crime, he is not convicted. I hope this is clear to everybody. Only four outcomes will be there, so now let's go to outcome four: we accept the null hypothesis when in reality it is true. This is obviously a good case. So outcome one and outcome four are perfectly fine decisions, while the other two scenarios we have to consider as type 1 and type 2 errors. Similarly, in real-world scenarios you define something called the confusion matrix. What do you have in a confusion matrix? You have true and false, positive and negative, so you are defining true positive, true negative, false positive and false negative. Tell me, out of these, which is the type 1 error and which is the type 2 error? You have to tell me whether false positive is the type 1 error and false negative is the type 2 error, or vice versa. This will be one assignment for you; if you don't know, just check out one of my videos and you will be able to see it. Clear, guys? Was this explanation good for type 1 and type 2 error? Perfect, some people are saying false positive is type 1. So we have completed this topic, type 1 and type 2 error. Now let's go to the next
topic, that is, one-tailed and two-tailed tests. This is also very much super important. You have already seen that I drew a bell curve and on it defined a kind of one-tailed and two-tailed test, but let me give you one good example. Colleges in Karnataka have an 85 percent placement rate. A new college was recently opened, and it was found that a sample of 150 students had a placement rate of 88 percent with a standard deviation of 4 percent. Does this college have a different placement rate than the other colleges? Understand this question carefully. One correction to what I marked earlier, guys: true negative is not type 2; false negative is type 2. True positive and true negative are always correct decisions. Now what does this question basically say? There are colleges in Karnataka which have an 85 percent placement rate; a new college was recently opened, and a sample of 150 students had a placement rate of 88 percent with a standard deviation of 4 percent. "Does this college" means the new college. Think about the question: it asks whether the college has a different placement rate. What is the placement rate of the other colleges? 85 percent. So does it have a rate different from 85 percent? That is what we need to check. In this particular case this becomes a two-tailed test. Why? We'll think it over. Let's say that here the significance
value is given as 0.05. What we do is try to create a graph. I have a placement rate of 85 percent, so 85 percent sits at the center. With alpha 0.05 in a two-tailed test, 2.5 percent will be here, 2.5 percent will be here, and this will be my 95 percent confidence interval in the middle; if I combine all of these it becomes 1. Now you need to understand whether this is a two-tailed test or a one-tailed test, and it is very simple. This 85 percent is my mean. My value can be greater than 85 or less than 85, because we are just checking whether the college has a different placement rate. That is the reason this entire test becomes a two-tailed test: the new college may fall in this region or it may fall in this region. Now let me change the question a little. Let's say my question is: does this college have a placement rate greater than 85 percent? Think it over: what will this be now? My alpha value is 0.05, so it is obviously a 95 percent confidence interval, but this is focused only on finding "greater". So I'll put the entire alpha over here: this region will be my 5 percent, and this
region will be my 95 percent. So this becomes a one-tailed test, on the right-hand side, because the important keyword here is "greater". And remember, in this case we cannot divide the alpha value into two parts; it is present in only one part. Now think it over: if the question asks "lesser", what will happen? The region will come on this side, the left. I hope you are understanding the difference between a one-tailed and a two-tailed test. Always focus on what the question is asking. Here, "does this college have a different placement rate": from any experiment I carry out, my answer may be either greater than 85 or less than 85, so that becomes a two-tailed test. In the other case I'm asking "greater than 85" with alpha 0.05, so I am checking only this region; I am not worried about the other region, because I only need to check whether it is greater than 85. This is the most important thing with respect to one-tailed or two-tailed tests. So we have finished one-tailed and two-tailed tests. Now let's go ahead and understand how to find this confidence interval, specifically confidence intervals with respect to means. I told you that in a confidence interval we have this graph, and when my alpha value is 0.05 and the test is two-tailed, I need to find these two values. How do I find these two values? That is what we are going to see; we are going to do some calculations which will help you understand. In order to find the confidence interval, you really need to understand some things first. Someone asks: if it asks less than 85, will it be considered a one-tailed test? Yes, obviously a one-tailed test, but on the left-hand side. Right now
this will be on the right-hand side. Now let's try to understand the confidence interval. To understand it you need a topic called the point estimate. The definition: the value of any statistic that estimates the value of a parameter is called a point estimate. Two things there: a statistic and a parameter. Understand one thing, guys: in inferential statistics, in any work we do, we first consider sample data, and based on the sample data we estimate something for the population. In this example, if I have the sample mean, I'll try to estimate the population mean. This happens a lot in inferential stats: you just have the sample information, and you may know the population standard deviation, but you need to estimate the population mean. Let me give one example. This is my x̄ (x bar), and with x̄ we try to estimate the value of μ (mu); with the help of a sample I can definitely estimate μ. But always remember this value will only be approximately equal to μ; it may be a little less or a little greater. In one case my x̄ may be 2.9 while my population mean μ is equal to 3. This is what the point estimate is all about: x̄ is the point estimate, which estimates the value of μ. So in this particular case I hope you understood what exactly is
a point estimate. So a point estimate is the value of any statistic that estimates the value of a parameter, and through x̄ we are estimating the population mean; at least get this specific knowledge. In most problem statements I will be given x̄ and I need to estimate μ. How will I do this? For that we will see a problem statement, and we will use something called a confidence interval. I told you that x̄ will be approximately equal to the population mean; it may be less or it may be greater. So we define a confidence interval, which brings us towards the population mean. The confidence interval is given by the formula: point estimate ± margin of error. There is some margin of error because, as you can see, 2.9 is less than the mean, and it can also be greater, so I have written plus or minus the margin of error. We will not know the exact population mean, so the point estimate plus or minus the margin of error gives us a range that should contain it, and this is how we determine the confidence interval. From this formula you will understand how close we are to the population mean. The other things that matter are whether you are given the population standard deviation, which decides what formula to use, and how large your sample size is. Let me solve one very simple problem, not difficult at all. This is my question: on the quant test of the CAT exam (I hope everybody knows the CAT exam), the population standard
deviation is known to be 100. A sample of 25 test takers has a mean score of 520. Construct a 95 percent confidence interval about the mean. Now let's see what information is given. First, the population standard deviation is given: σ = 100. The sample size is n = 25. The confidence level is 95 percent, so alpha is 0.05. And the mean here is the sample mean, x̄ = 520. Is all this information given in the question? Obviously it is. Now my graph looks something like this: my mean is 520, my alpha value is 0.05, so here I have 2.5 percent, here I have 2.5 percent, and this is my 95 percent confidence interval. If I want to construct a 95 percent confidence interval about the mean, I need to find what range it covers, from which value here to which value here. That is my problem statement. Guys, why is alpha 0.05? See, I have given the question as a 95 percent confidence interval, so alpha is 1 minus 0.95, which is 0.05. Alpha and the confidence interval are interlinked; very simple. Now when the population standard deviation is
basically given, we apply a test; what kind of test? I know the confidence interval formula is point estimate ± margin of error. The point estimate is obviously x̄. Whenever you have the population standard deviation you apply a z test, so the margin of error is z_{α/2} multiplied by the standard deviation divided by root n: x̄ ± z_{α/2} · σ/√n. The term σ/√n is called the standard error. The second point is when we should use this formula. I have taken a sample of 25, but usually for this the sample size should be greater than or equal to 30; just for the example I have taken 25. Don't fight with me saying "Krish, why have you taken 25?"; take 30 also, we just have to do the calculation. These two conditions suit this kind of problem statement: for a z test, most of the time both conditions need to be satisfied. The z value here is nothing but a z score. So this is the entire formula to find the confidence interval when the population standard deviation is given and the sample size is greater than or equal to 30. Now let's solve this problem. I will split this equation into two parts. First I get the upper bound: x̄ + z_{0.05/2} · 100/√25, since alpha is 0.05 and the standard deviation is 100. Now you understand why I have taken 25: my calculation becomes easier. Don't fight with me guys, I don't have energy to fight nowadays; I fight with a lot of
people. So this will be my upper bound of the confidence interval. Similarly I'll find the lower bound: x̄ − z_{0.05/2} · 100/√25. Here 0.05/2 is 0.025, so I need z_{0.025}. How do I find this value? Go and open your browser and open a z table. Let me open a z table; in this one only the negative values are shown, so I'll use the other one. In the z table always understand this: the entire area under the curve is 1, so if I subtract 0.025 from 1, the area to the left of the upper cutoff becomes 0.975. So 0.975 is what I have to look up in the z table. Go and check where 0.9750 is: the row is 1.9, and if I go to the top the column is 0.06, which means the z value is 1.96. So this becomes my z score: 1.96. Now go and calculate. The standard error is 100/√25 = 100/5 = 20. For the upper bound, x̄ is the sample mean, 520, so it is 520 + 1.96 multiplied by 20. Similarly the lower bound is 520 − 1.96 multiplied by 20.
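The same calculation can be checked in a few lines of code. This is a minimal sketch, assuming the population standard deviation is known so that the z-based formula x̄ ± z_{α/2} · σ/√n applies; the function name `z_confidence_interval` is hypothetical, and the quantile comes from Python's standard library instead of a printed z table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    alpha = 1 - confidence
    # Same lookup as the z table: area 1 - alpha/2 = 0.975 gives about 1.96
    z = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sigma / sqrt(n)  # 100 / 5 = 20 in the CAT example
    margin_of_error = z * standard_error
    return x_bar - margin_of_error, x_bar + margin_of_error


low, high = z_confidence_interval(x_bar=520, sigma=100, n=25)
print(round(low, 1), round(high, 1))  # 480.8 559.2
```

Using the unrounded quantile instead of 1.96 changes the bounds only in the third decimal place.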
Now go ahead and compute this: 559.2 and 480.8, so these are my upper bound and lower bound. That basically means that when I define my confidence interval for this distribution with alpha 0.05, this value will be 559.2, this value will be 480.8, and my mean will be 520. One stats interview question I came across: find the average size of the sharks throughout the world. Can you solve this by taking your own example? One of my students solved this problem and gave a confidence interval; he said, let's assume this, this and this, and solved it this way. The interviewer said: you know the population standard deviation, you know the x̄ value, you know the n value, try to solve it with alpha as 0.05. Understand that when I look up 0.025 I am just worried about one tail: the entire area is 1, so 1 minus 0.025 is 0.975. After performing any experiment, if my value falls between these two bounds, I will accept the null hypothesis and go ahead with it; if it does not fall within this range, we need to reject the null hypothesis. Now the next question: what if the population standard deviation is not given? In that scenario you need to use something called the t test. Let me show you one very good example, and we'll solve that also. Take the same question, but the population standard deviation is not given; the sample standard deviation is given. I'll write down the question for you: on the quant test of the CAT exam, on the
quant test of the CAT exam, a sample of 25 test takers has a mean score of 520 with a standard deviation of 80; this standard deviation is the sample standard deviation. Construct a 95 percent confidence interval about the mean. So this is my question. You can see that the population standard deviation is not given, so in this case I have to use the t test, not the z test. First let's see what is given: n is 25, x̄ is 520, the sample standard deviation s is 80, and alpha is 0.05. So I can write the condition: when the population standard deviation is not given, we use the t test; when the population standard deviation is given, you use the z test. Let's compute it. The same formula is used, point estimate ± margin of error, but the margin of error changes. The formula will be x̄ ± t_{α/2} · s/√n: instead of z_{α/2} you write t_{α/2}, and you use s/√n as your standard error. Now substitute. You will have two things; one is the upper bound, x̄ + t_{0.05/2} · s/√n. First things first: to calculate the t value you need something called the degrees of freedom, because the t table asks for it. The degrees of freedom formula is just like your sample
variance formula, that is n − 1, which we also use with respect to Bessel's correction. So this will be 25 − 1, which is 24. Now I will go to my browser and open a t table. First, the degrees of freedom: 24; this row is 24, I hope everybody is able to see it. Then look at the column for 0.025, which corresponds to 0.975 cumulative. In the row for 24 it is 2.064. Is everybody getting it? We look along the line for 24 degrees of freedom on the left-hand side, and on top you can see 0.025 for one tail and 0.05 for two tails, so the answer is 2.064: t_{0.05/2} = 2.064. Once you get this, the next step: my x̄ is 520, and s/√n is 80/5, since √25 is 5. So the upper bound is 520 + 2.064 × 80/5 = 553.024, and the lower bound is 520 − 2.064 × 80/5 = 486.976. So the lower bound of the confidence interval is about 486.98 and the upper bound is about 553.02. With this (wow, I've written so much today) we have finished confidence intervals; congratulations everybody. Someone asks why this is not two-tailed: this is two-tailed only. Why are you getting confused? In the table header, 0.025 for one tail is the same column as 0.05 for two tails. Now let's go ahead and do the first test, the z test; I hope everybody has understood why we use the z test. So the first question that we are going to solve is the one-sample z test. Now we will perform
hypothesis testing. What exactly is the one-sample z test? I told you two conditions for the z test. The first condition is that the population standard deviation is given; at that time you use the z test. The second is that your sample size should be at least 30, n ≥ 30. Earlier, just to make the calculation easier, I put n = 25 because the root of 25 is 5. Don't fight with me; I think you'll beat me and go, but I don't have the energy. So let's see how to do hypothesis testing. Let's say I'm writing a problem statement: in the population, the average IQ is 100 with a standard deviation of 15. Researchers want to test a new medication to see if there is a positive or negative effect on intelligence, or no effect at all. A sample of 30 participants who have taken the medication has a mean IQ of 140. Did the medication affect intelligence? Now I'll show you how to perform the hypothesis test; you already know which test is used. The first step is to define the null hypothesis, H0. In H0 you will say that the mean is 100, because we need to check whether the medication affects intelligence or not. So my null hypothesis is that the mean IQ is 100, and my alternate hypothesis is that the mean is not equal to 100.
I hope everybody is agreeing with this. So H0: μ = 100; this is your null hypothesis. The second step is to define our alternate hypothesis, H1: μ ≠ 100, because if my null hypothesis says the mean is equal to 100, then the alternate says it is not equal to 100. One important thing I forgot to mention: my alpha here is 0.05, which means my confidence interval is 95 percent; this is also part of the question. And no, the population mean is not 140; the population mean is 100 and the sample mean is 140. You can apply this concept anywhere, guys; it need not be only one kind of problem. The third step: we state our alpha value, so alpha is 0.05. The fourth step: I need to state my decision rule, and in the decision rule you always need to draw this graph. Since alpha is 0.05, what kind of test will this be? Understand the question: did the medication affect intelligence? We are checking whether the medication increased your intelligence or decreased it; it can be either, so this definitely becomes a two-tailed test. So here I will have 2.5 percent, here I will have 2.5 percent, and this will be 95 percent. Is everybody clear with this? Can I get a quick yes? One more important thing: when I say 2.5 percent, if I want to find the cutoff with the help of the z table, what will this value be? I have
to check it for 1 minus 0.025, which is 0.975. This value I need to look up in my z table to find this point and this point. Go ahead and check it: this will be plus 1.96 and this will be minus 1.96. We checked it earlier, remember; that table was for t and this one was for z, and for z we got 1.96. So plus or minus 1.96. Now I know my decision rule: if the z score I get from my experiment lies within minus 1.96 to plus 1.96, I fail to reject the null hypothesis; if it falls outside, I reject it.

Next I calculate my test statistic, and this will be the z test (sorry, I said t test; it is the z test). What is the z score formula that we used before? x minus mu divided by the standard deviation. I hope everybody remembers this formula. But understand one thing: the real formula for the z test is x bar minus mu, divided by sigma over root n. The reason we did not have root n before is that we were working with a single data point, so n was 1, and root of 1 is 1. But when we are working with a sample of n points, we have to divide sigma by root n, and this quantity, sigma over root n, is called the standard error. It is like how we divide by n minus 1 in the sample variance; there is a reason why we do it. So this is basically my standard error formula, which is specified by this,
and always remember why we use this. I'll just give you one example: suppose I have a population of a thousand points and I take five different samples from it, say 100 points every time. Actually, just wait for it, guys; there is a topic called the central limit theorem, I will discuss that and then teach you this properly. For right now, understand that we are working with sample data. If you were working with a single value from the population, you could directly divide by the standard deviation. But the mean of a sample carries some error, the standard error, and because we divide by root n, that error becomes smaller as the sample size increases, so our sample means get closer to the population mean. So don't worry, give me some time and I will explain it; for right now just take it as standard deviation divided by root n.

Now let's calculate. What is my x bar? The mean of the sample of 30 participants is 140. So 140 minus the population mean; the average IQ is 100, so minus 100. Divided by what? The population standard deviation, which is given as 15.
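The numbers identified so far are enough to run the whole one sample z test, so here is a minimal Python sketch of the steps. It uses only the standard library; the variable names are chosen for illustration, and the critical value is computed with `NormalDist.inv_cdf` instead of being read from a z table.

```python
import math
from statistics import NormalDist

# Numbers from the problem statement
pop_mean = 100      # population mean IQ (the value under H0)
pop_sd = 15         # population standard deviation (known, so a z test applies)
sample_mean = 140   # mean IQ of the participants who took the medication
n = 30              # sample size
alpha = 0.05        # significance level

# Standard error: the standard deviation of the sample mean
standard_error = pop_sd / math.sqrt(n)

# z statistic: how many standard errors the sample mean lies from 100
z = (sample_mean - pop_mean) / standard_error

# Two-tailed test: alpha / 2 in each tail, so look up 1 - 0.025 = 0.975
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

reject_null = abs(z) > z_critical
print(f"standard error = {standard_error:.3f}")
print(f"z = {z:.2f}, critical values = +/-{z_critical:.2f}")
print("reject H0" if reject_null else "fail to reject H0")
```

Changing `sample_mean` to 110 gives the practice version of the same problem.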
So here I have 15 divided by root n. What is n, the sample that we have taken? 30, so it will be root of 30. So this is basically 40 divided by 15, multiplied by root 30, which comes to about 14.6. Finally we state our decision, and this is a very important step. I got 14.6 from my z test; now what did our decision rule say? To keep the null hypothesis, z should be between minus 1.96 and plus 1.96; if z is less than minus 1.96 or greater than plus 1.96, we reject the null hypothesis. 14.6 is obviously greater than 1.96, so we reject the null hypothesis.

But let's take out one amazing thing from this. When I reject my null hypothesis, that basically means my mean is not equal to 100; I am going with the alternate. Now tell me one very important thing: did the medication improve the intelligence or decrease it? Obviously it improved, guys, because the z value is positive and falls in the right tail. If I had got a z value below minus 1.96, in the left tail, then it would have decreased the intelligence. Rejecting the null hypothesis is saying that the medicine had an impact, and the sign tells you the direction.

Now do the same problem, I will just change one value: the mean of these 30 participants will be 110. Try to solve it from your side and tell me whether the null hypothesis is rejected or not. After that we will start our second test, which is called the one sample t
test. Now I hope you like the session so far. See, guys, I like to teach in this particular way: I write everything out and never prepare PPTs; probably you have seen on my YouTube videos that I hardly prepare any. The reason I write it like this is that it helps me practice and see what mistakes I am making, so I get better at it. Tomorrow if you call me into any session and say "Krish, teach statistics or machine learning", I can just go and start explaining.

Now let's go to the second problem statement, the one sample t test. First of all, the z test: whenever you have the population standard deviation, you use the z test; you really need to remember this. If the population standard deviation is unknown, you use the t test. That is the basic difference between the two. Now I'll take the same problem. In a population the average IQ is equal to 100. A team of researchers tried a medication to see whether there is a positive or negative effect. A sample of 30 participants was taken, and they have a mean IQ of 140 with a sample standard deviation of 20.
So did the medication affect the intelligence? First things first: what is your null hypothesis? H0: mean is equal to 100. What is your H1? Mean is not equal to 100. So the first step and the second step are done, and alpha is 0.05 as before. The extra step that we do in the t test, which I have discussed before also, is to calculate the degrees of freedom. Here I use n minus 1, so this will be 30 minus 1, which is 29. Next I go ahead with the decision rule, and it's very simple. I draw this graph; I know my alpha value is 0.05, and I know my question: did the medication affect the intelligence? It can either increase or decrease, so it becomes a two-tailed test. Here you have 2.5 percent, here you have 2.5 percent, and here is your 95 percent. What are the cut-off values? Let's look them up in the t table with 29 degrees of freedom: it is 2.045, so here you have plus 2.045 and here minus 2.045. This is your decision rule: if your t value is between these two you fail to reject the null; if it is greater than plus 2.045 or less than minus 2.045, you reject the null.

Finally we go to the test statistic. The formula for the t test is almost the same: t is equal to x bar minus mu, divided by s over root n, where s is the sample standard deviation, because the population standard deviation is not given. So try to compute the values, guys: x bar is 140, mu is 100, s is 20, and n is 30.
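If you want to check your hand calculation, the same steps can be sketched in a few lines of Python. This is only an illustration: the variable names are made up, and the critical value 2.045 is typed in from the t table lookup above rather than computed, because the standard library has no t distribution.

```python
import math

# Numbers from the problem statement
pop_mean = 100      # value of the mean under H0
sample_mean = 140   # mean IQ of the sample
sample_sd = 20      # sample standard deviation (population sd is unknown)
n = 30              # sample size

# Degrees of freedom for a one sample t test
df = n - 1

# Two-tailed critical value for df = 29 and alpha = 0.05, read from a t table
t_critical = 2.045

# t statistic: same shape as the z formula, with s in place of sigma
t_stat = (sample_mean - pop_mean) / (sample_sd / math.sqrt(n))

reject_null = abs(t_stat) > t_critical
print(f"df = {df}, t = {t_stat:.2f}, critical values = +/-{t_critical}")
print("reject H0" if reject_null else "fail to reject H0")
```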
So compute it. If I do the calculation, the answer comes to about 10.95. Since 10.95 is obviously greater than 2.045, we reject the null hypothesis. When we reject the null hypothesis, that basically means my p value is less than or equal to the significance value, that is, I am falling in this tail region or in this one. And since I am getting plus 10.95, did the intelligence increase or not? Obviously the final conclusion is that it increased the intelligence. So we reject the null hypothesis and accept the alternate. So, from my teaching, did your IQ increase or not?

Now let's see a real-world problem, which you can do yourselves. A bank wants to open an ATM in a specific area. You have to formulate this problem and think about how hypothesis testing applies. You can assume any values you want, for example the average money people take out from the ATM, with a 95 percent confidence level. This was an interview question at a bank, Standard Chartered. Think over what all is required and try to solve it.

Hello guys, I hope you're doing fine today. We will continue the discussion from where we left off. First we will solve a chi-square problem. Then there are some topics I forgot earlier: covariance, correlation, the Pearson correlation coefficient, and the Spearman rank correlation coefficient. Then we are also going to see practical implementations. So we
are also going to check out some practical implementation, where we will perform the z test and the t test and also see how to perform the chi-square test. We will also see the F-test, the one used in ANOVA. The reason I have kept the F-test for last is that the calculation is quite complex in that case.

Now let's discuss the chi-square test, which has quite an amazing problem statement. Let me define what exactly it is. The chi-square test is used to test claims about population proportions. If someone asks you in an interview, "Krish, why is the chi-square test used?", you can say that it is a non-parametric test that is performed on categorical variables, which can be nominal or ordinal data. That is how you define a chi-square test. Make sure you understand why a specific test is done; this is very important, because in the interview they may give you a problem statement and ask for your plan to solve it, but the definition you should definitely be able to tell them.

Let's solve a specific problem. I'll take a very good example; this is my question. In the 2000 Indian census, the ages of the individuals in a small town were found to be the following. Now over
here you have three categories: less than 18 years, 18 to 35 years, and greater than 35 years. From the 2000 census you have this information: less than 18 was 20 percent, 18 to 35 was 30 percent, and greater than 35 was 50 percent. This is the information given from the complete census. Then, in 2010, the ages of n equal to 500 individuals were sampled, and below are the results, again in three columns: less than 18 were 121 people, 18 to 35 were 288 people, and greater than 35 were 91 people. The question is: using alpha as 0.05, would you conclude that the population distribution of ages has changed in the last 10 years?

The question is very simple. The 2000 census gives you the population information, 20 percent, 30 percent, and 50 percent for the three age groups. In 2010 they picked 500 people as sample data and found 121, 288, and 91 in those groups. Using alpha equal to 0.05, has the distribution changed in the last 10 years? That is the problem we are going to solve. Now you may be thinking, "Krish, you said it is a non-parametric test that is performed on categorical, that is nominal or ordinal, data." Now what exactly
is a non-parametric test? A non-parametric test does not assume that the data follows a particular distribution such as the Gaussian. It usually comes up with population proportions: whenever you are given counts or proportions for categories, you cannot use a parametric test like the z test or t test on a mean, so you go with a non-parametric test. Here you have the original data for the population, then you sampled the data, and we are trying to see the difference between this and this. Just by seeing the data, do you think the population may have changed? You will probably say, "Yes sir, it may have changed": 18 to 35 shows a huge number, and greater than 35 shows a very small number, whereas from the population proportion greater than 35 should have been the largest group. In this scenario there are two possible assumptions: either the population distribution has changed or it has not. So how do we approach this problem?

Let me start the answer. As a first step, let's make two tables, because these tables will play a very important role, guys. The first table has the population information, with columns less than 18, 18 to 35, and greater than 35. This is the expected one. Why am I saying expected? Because this is the population information: less than 18 is 20 percent, 18 to 35 is 30 percent, and greater than 35 is 50 percent. This is what your distribution is expected to be, because that is what the 2000 census found. Now, after 10 years, when they took the sample of n is
equal to 500, this is the observed one: less than 18 was 121, 18 to 35 was 288, and greater than 35 was 91. So you definitely have these two pieces of information, and you will see in a moment why I am writing both down. Now I'll add one more row, the expected counts. The observed counts are fine, but we also need the expected ones: if I take n equal to 500, what should the distribution be according to the 2000 census? We split 500 by the census percentages. Less than 18 is 20 percent, so 500 multiplied by 0.2; then 500 multiplied by 0.3; then 500 multiplied by 0.5. That gives 100, 150, and 250. This is the distribution I should have seen in 500 people based on the 2000 census, but 121, 288, and 91 is what was observed.

So focus on these two rows. Obviously there is a huge difference, and just by seeing it you may say, "Krish, just reject the null hypothesis." But understand that alpha is given; the decision has to be made at the 0.05 significance level, so we have to run the test. And why did we multiply? Because we need the expected distribution for a sample of 500: if nothing had changed, a sample of 500 in 2010 should also give about 100, 150, and 250.
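The expected row is just the sample size multiplied by each census share, which can be sketched in Python like this. The dictionary names are only for illustration; the numbers are the ones from the problem.

```python
n = 500  # size of the 2010 sample

# Age-group shares from the 2000 census, in percent
census_percent = {"<18": 20, "18-35": 30, ">35": 50}

# Counts observed in the 2010 sample
observed = {"<18": 121, "18-35": 288, ">35": 91}

# Expected count = sample size x census share
expected = {group: n * pct / 100 for group, pct in census_percent.items()}

for group in census_percent:
    diff = observed[group] - expected[group]
    print(f"{group:>6}: observed {observed[group]:>3}, "
          f"expected {expected[group]:>3.0f}, difference {diff:+.0f}")
```

Both rows add up to 500, which is a quick sanity check before running the test.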
So this is my observed row and this is my expected row, over the three categories less than 18, 18 to 35, and greater than 35. Observed is 121, 288, and 91; expected is 100, 150, and 250. Now the next steps. First, as always when you start hypothesis testing, define your null and alternate hypotheses. My null hypothesis is that the data, meaning the observed 2010 data, meets the distribution of the 2000 census. My alternate hypothesis is that the data does not meet the distribution of the 2000 census. I hope everybody is able to understand both. The second step is my alpha value: alpha is 0.05, which means a 95 percent confidence level.

The third step: whenever we do a chi-square test we also need the degrees of freedom. How do we calculate them? It is always n minus 1, where n here is the number of categories, not the sample size. We have one, two, three categories, so 3 minus 1 is 2. Age has been made categorical here, so that is perfectly fine. And guys, we pick 3 as n because there are three age categories; the 500 is only used for the expected counts.

Now you know your degrees of freedom are 2 and your alpha is 0.05, so go and check the chi-square table to find your decision boundary. Is this a one-tailed or a two-tailed test? An observed count may be less than expected or more than expected, but the chi-square statistic squares every difference, so any mismatch pushes it up, and the whole 0.05 sits in the right tail. Let's open the chi-square table, hopefully I get the answer quickly: alpha 0.05 is here, degrees of freedom 2 is here, so the value is 5.99. We usually denote chi-square by the Greek letter chi squared, which looks like x squared. My decision rule is: if chi-square is greater than 5.99, I reject H0.

The fifth step is to calculate the test statistic. Chi-square is equal to the summation of f o minus f e, whole square, divided by f e. The notation can vary, but f o means observed and f e means expected. I sum over all three categories: 121 minus 100 whole square divided by 100, plus 288 minus 150 whole square divided by 150, plus 91 minus 250 whole square divided by 250. That comes to 232.494. So my chi-square is 232.494, which is obviously greater than 5.99, so we reject the null hypothesis: the population distribution of ages has changed.
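Here is a small Python sketch of the chi-square calculation, using only the standard library. One assumption to flag loudly: the critical value and p value below use the shortcut that, for exactly 2 degrees of freedom, the right-tail probability of chi-square is exp(-x / 2). That shortcut does not hold for other degrees of freedom, where you would use a table or a statistics library.

```python
import math

observed = [121, 288, 91]    # 2010 sample counts
expected = [100, 150, 250]   # counts implied by the 2000 census for n = 500
alpha = 0.05

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom = number of categories - 1
df = len(observed) - 1

# Valid only for df == 2: P(X > x) = exp(-x / 2),
# so the right-tail critical value is -2 * ln(alpha)
chi_critical = -2 * math.log(alpha)
p_value = math.exp(-chi_square / 2)

reject_null = chi_square > chi_critical
print(f"chi-square = {chi_square:.3f}, df = {df}")
print(f"critical value = {chi_critical:.2f}, p value = {p_value:.3g}")
print("reject H0" if reject_null else "fail to reject H0")
```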
So if you want to define chi-square: it tests claims about population proportions, and you can say that it is a non-parametric test that is performed on categorical data, nominal or ordinal.

Okay, let's see one Python example. Let's say I want to perform a z test, and this is my question: suppose the IQ in a certain population is normally distributed with a mean of mu equal to 100 and a standard deviation of 15. A researcher wants to know if a new drug affects IQ levels, so he recruits 20 patients to try it and records their IQ levels. I am going to show you the code in Python to determine whether the new drug causes a significant effect or not. For the z test we use the statsmodels library: from statsmodels.stats.weightstats import ztest. These are my 20 patients, with the IQ recorded after the medication was applied. To apply the z test there is no need to do that much calculation by hand: just call ztest, give it the data, and for the next parameter, value, give the IQ you are comparing against, which is 100. Here the null hypothesis is that the mean is equal to 100 and the alternate is that the mean is not equal to 100.
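The lecture runs this with `ztest` from statsmodels. The sketch below is not that function; it is a standard-library stand-in so you can see what a one sample z test on raw data does. Two assumptions: the standard error is estimated from the sample standard deviation of the data, and the 20 IQ values are illustrative stand-ins, because the transcript does not record the actual numbers shown on screen.

```python
import math
from statistics import NormalDist, mean, stdev

def one_sample_ztest(data, value):
    """Return (z, two-sided p value) for H0: population mean == value.

    The standard error uses the sample standard deviation of `data`.
    """
    standard_error = stdev(data) / math.sqrt(len(data))
    z = (mean(data) - value) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Illustrative stand-in values, NOT the data from the lecture
iq_after = [88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
            105, 109, 109, 109, 110, 112, 112, 113, 114, 115]

alpha = 0.05
for null_mean in (100, 110):
    z, p = one_sample_ztest(iq_after, null_mean)
    # Decision rule: reject H0 only when the p value is below alpha
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"H0 mean = {null_mean}: z = {z:.3f}, p = {p:.4f} -> {decision}")
```

The function returns the z statistic first and the p value second, the same order as the two values discussed in the lecture.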
Now when you execute this, I first get an error, a tuple index problem; the fix is that I have to write value equal to 100. Now see, these are the two values I am getting: the first is the z test statistic and the second is the p value. Many people were asking about the difference between the significance level and the p value. The function gives us the z score and also a p value, and the p value is what we compare with the significance level. Let's consider a significance level of 0.05. Right now the p value is 0.11. If the p value were less than the significance level, the z score would be falling in the tail region and we would reject the null hypothesis. But 0.11 is greater than 0.05, so here we fail to reject the null hypothesis. The z value is my real test statistic, so if you want you can also compare it with plus or minus 1.96 and do the remaining steps by hand; the p value just lets you decide directly.

Now suppose I use 110 as the mean to compare against. See, I am getting a p value of 0.002, which is obviously less than 0.05, so in this case we reject the null hypothesis. Alpha can vary: you can have 0.01, you can have 0.10; it depends on the domain, for example the medical domain. So the rule is: if the p value is less than the significance value, it falls in the tail region and we reject the null hypothesis; if it is greater, we fail to reject the null hypothesis, which people loosely call accepting it. And yes, alpha and the significance value are one and the same.

Like that you can do it for the t test also. What data do you require? Whatever the question describes: here the question says 20 patients, so this is the 20 patients' data, and I am running the z test on it. Someone asked what it means if the mean we test against is 110 and the p value is 0.002. It means that the data does not support a mean IQ of 110 after the medication; we reject that value. With 100 the p value was 0.11, so the data is consistent with a mean of 100, which would mean that the medication shows no significant effect.

Let me define it once again on the graph. Initially I got 0.11 as my p value, which is greater than 0.05; in this scenario I am inside the 95 percent region, so we fail to reject the null hypothesis. With a p value of 0.002, which is less than 0.05, I am falling in the tail region, so we reject. I hope now you are able to understand.

Now let's go ahead and discuss the next topic, which is called covariance. Let's say I have a data set with two columns, x and y; say x is weight and y is height. Suppose the weights and heights are 50 and 160 centimeters, 60 and 170 centimeters, 70 and 180 centimeters, and 75 and 181 centimeters. What kind of relationship are you seeing? When x is increasing, y is increasing, and similarly when x is decreasing, y is decreasing. Now let's say I have one more data set: number of hours of study and number of hours of play. If I'm studying for two hours I'm playing for six hours, if I'm studying for three hours I'm playing for four hours, and if I'm studying for four hours I'm playing for three hours. What is the relationship here? When x is increasing y is decreasing, and when x is decreasing y is increasing. So you can see these two kinds of relationship. This is what you can observe, but the main thing is: how do I quantify the relationship between x and y through numbers? Now,
in that particular case I can use a formula which is called covariance. Covariance is given by cov of x comma y, which is the summation over i equal to 1 to n of x i minus x bar, multiplied by y i minus y bar, divided by n. Here x bar is the mean of x and y bar is the mean of y. If you are working with a sample, you divide by n minus 1 instead; let's consider that we are working with a sample. When we calculate this, we will get either a positive number, a negative number, or zero. What does a positive value indicate? Two things: when x is increasing y is also increasing, and when x is decreasing y is also decreasing. So a positive number quantifies the relationship between x and y in this way. Similarly, a negative number says that when x is decreasing y is increasing, and when x is increasing y is decreasing. So the first is a positive relationship, I'll say, and the second is a negative one. If it is zero, that basically means there is no linear relationship between x and y: when x increases, y does not consistently increase or decrease. Now let's understand covariance with pictures. Suppose I have a data set that looks like this, where the points rise from left to right. If this is my x and this is my y, do you think it will have a positive
correlation or negative correlation? Think over it. It will obviously be positive, because here when x is increasing y is also increasing, and when x is decreasing y is also decreasing; both conditions are satisfied. So when you apply the formula you will get a positive value. Now suppose I have another graph, with this as my x and this as my y, where the points fall from left to right. What will you have here? A negative value. Sorry, I should not say correlation yet, because we have not started correlation; I'll call it negative covariance. And suppose I have another data set where the points are scattered with no pattern with respect to x and y. Then what will the value of the covariance be? It will be close to 0, because there is no relationship.
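To see the sign of the covariance on actual numbers, here is a minimal sketch that applies the sample formula (dividing by n minus 1) to the two small tables from the covariance discussion. The function name is made up for illustration, and the fourth weight is taken as 75, which is an assumption about what was written on the board.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of (x_i - x_bar)(y_i - y_bar) divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Weight and height: both move in the same direction
weight = [50, 60, 70, 75]        # fourth value assumed to be 75
height = [160, 170, 180, 181]    # centimeters

# Study hours and play hours: they move in opposite directions
study = [2, 3, 4]
play = [6, 4, 3]

cov_weight_height = sample_covariance(weight, height)
cov_study_play = sample_covariance(study, play)

print(f"cov(weight, height) = {cov_weight_height:.2f}")  # positive
print(f"cov(study, play)    = {cov_study_play:.2f}")     # negative
```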
Now let's understand one basic disadvantage of covariance. With covariance you can see the direction, positive or negative, but there is no fixed range for the value. You may get +100 or +1000, -200 or -2000; the magnitude has no limit, and it depends on the units of x and y. So if one pair of variables gives +100 and another pair gives +1000, you cannot say from the magnitude alone which relationship is stronger. That is the reason we need to restrict these values to some range, and for that specific reason we use the Pearson correlation coefficient.

The Pearson correlation coefficient restricts the value to between -1 and +1. The more it is towards +1, the more positively correlated; the more towards -1, the more negatively correlated. What is the difference from covariance in the formula? It is very simple: the Pearson correlation of x and y is cov(x, y) divided by (standard deviation of x multiplied by standard deviation of y). Because of this division by the product of the standard deviations, the value will always be between -1 and +1.
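A minimal sketch of the point made above: rescaling the data changes the covariance a lot, but the Pearson coefficient stays the same and stays inside [-1, +1]. The data and the helper name `pearson` are assumptions for the illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: cov(x, y) / (std(x) * std(y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population versions are used for both covariance and standard
    # deviation, so the 1/n factors cancel in the ratio.
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Hypothetical data
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

cov_small = np.cov(x, y)[0, 1]
cov_big = np.cov(100 * x, 100 * y)[0, 1]  # same data in different units

r_small = pearson(x, y)
r_big = pearson(100 * x, 100 * y)

print(cov_small, cov_big)  # covariance grows by a factor of 100 * 100
print(r_small, r_big)      # correlation is unchanged
```

The same coefficient is available directly as `np.corrcoef(x, y)[0, 1]`.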
Now let me show you some examples from Wikipedia. If you search for Pearson correlation coefficient, you will see this diagram. In the first plot all the points are on one straight line, and when x is increasing y is decreasing, so it is negatively correlated; since all the points fall on the straight line, the value is -1. In the next plot you can also see a negative correlation, but not all the points are on the line, so the correlation is somewhere between -1 and 0. Similarly, in this plot when x is increasing y is also increasing, so we have a positive correlation; since the points do not all fall on a straight line, the value is between 0 and 1. If they all fall on the straight line, it is +1.
So Pearson correlation captures linear properties very well. Everywhere you can see a straight line, it captures it in an amazing way; that is the biggest advantage of Pearson correlation. In this plot the correlation is 0. Why? Because we cannot say that when x is increasing y is also increasing; the data is scattered here and there. Some more examples: in the top row the values are 1, 0.8, 0.4, 0, -0.4, -0.8 and -1. In the next row the lines have different slopes but the values are all 1, 1, 1 and then -1, -1, -1; for the flat line in the middle the coefficient is not defined, because y does not vary. In the last row you see some more zeros: these shapes clearly have a pattern, but Pearson cannot identify it because the pattern is not linear. So there is a lot of difference between covariance and correlation: here your values will always be between -1 and +1 and nothing beyond that.

Let me search for one more thing, called Spearman rank correlation, and you will understand why we specifically use it. The thing to notice is that Pearson captures linear properties well: when the points are on a line it will say 1, and even when the distribution is curved it will still try to fit a straight line and report a value for that.

Now look at this graph on the Spearman rank correlation page. It obviously shows a positive relationship, because at every point, when x is increasing y is also increasing; in some regions it is increasing only by a small amount. But when I calculate the Pearson correlation it gives 0.88. Pearson has not been able to capture this property, and that is why it shows 0.88 even though y increases whenever x increases. We would like to get 1 here, and that is where Spearman rank correlation comes in. Pearson correlation is good at linear relationships, as we have already seen; Spearman rank correlation also works well for non-linear relationships, as long as y keeps moving in one direction as x increases.

What is the formula? Only one thing changes: everything is computed on ranks. The Spearman rank correlation between x and y is cov(rank(x), rank(y)) divided by (standard deviation of rank(x) multiplied by standard deviation of rank(y)). Now you may be thinking, what is this rank of x and rank of y? Let me show you. Let's consider two features, height and weight. If the height is 170 the weight may be 75 kg, if the height is 160 the weight may be 62, for 150 it may be 60, and for 145 it may be 55.
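The rank-based formula above can be sketched on the four height and weight points just mentioned. This is a hedged illustration: it uses `scipy.stats.rankdata`, which ranks in ascending order (smallest value gets rank 1). Ranking in descending order instead gives the same coefficient, as long as both variables are ranked the same way.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Height and weight values from the example above
height = np.array([170, 160, 150, 145], dtype=float)
weight = np.array([75, 62, 60, 55], dtype=float)

# Replace each value by its rank
rank_h = rankdata(height)
rank_w = rankdata(weight)

def pearson(a, b):
    """Pearson correlation via NumPy's correlation matrix."""
    return np.corrcoef(a, b)[0, 1]

r_pearson = pearson(height, weight)   # computed on the raw values
r_spearman = pearson(rank_h, rank_w)  # Pearson applied to the ranks

# Cross-check against SciPy's built-in Spearman coefficient
rho_scipy, _ = spearmanr(height, weight)

print(r_pearson, r_spearman, rho_scipy)
```

Because weight always increases when height increases, the ranks line up exactly and the Spearman value is 1, while the Pearson value on the raw numbers is somewhat below 1.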
In this case, how do I define the rank? This is my x and this is my y. Rank of x is very simple: you just assign a rank to each value. Let's add one more point, height 180 with weight 85, so we have five points. In height, 180 is the highest, so I give it rank 1; 170 is the next highest, so rank 2; then 160 is rank 3, 150 is rank 4 and 145 is rank 5. Similarly I assign ranks for y: 85 gets rank 1, then 75 is 2, 62 is 3, 60 is 4 and 55 is 5. These ranks are what is used in the calculation. That is the reason I told you the formula is the covariance of rank(x) and rank(y) divided by the standard deviation of rank(x) times the standard deviation of rank(y): the rank values are used and the original values are completely ignored. So this is the entire Spearman rank correlation. If someone asks you why you use the Spearman rank correlation coefficient, you should say that it captures non-linear (monotonic) relationships.

Now let's go ahead and try one example with a t test. Suppose I have these ages; you can initialize them with whatever values you want, because we are just doing a hypothesis test. First, let's compute the mean: ages_mean = np.mean(ages). If I print ages_mean, it is 30.34. Now let's consider these ages as my population. I will take a sample from them and use a t test to check whether the sample is consistent with this mean; we use a t test because we are treating the population standard deviation as unknown. I take my sample size as 10 and pick the sample using np.random.choice, passing ages and the sample size. If I show age_sample, you can see the ages that have been picked. Can the mean of this sample come near the population mean? That is what we check with the t test. I write from scipy.stats import ttest_1samp; this is the one-sample t test that we did yesterday. To ttest_1samp I give two things: age_sample, and the mean I want to compare against, which here is 30.
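A minimal sketch of the steps just described. The actual list of ages from the session is not in the transcript, so the population below is randomly generated and purely hypothetical; the seed and the use of NumPy's `default_rng` generator (instead of `np.random.choice`) are also assumptions made for reproducibility.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility

# Hypothetical "population" of ages (the lecture's own list is not available)
ages = rng.integers(18, 60, size=40)
ages_mean = ages.mean()

# Draw a sample of size 10 from the population
sample_size = 10
age_sample = rng.choice(ages, size=sample_size)

# One-sample t test: is the sample mean consistent with the population mean?
t_stat, p_value = ttest_1samp(age_sample, popmean=ages_mean)

# The same statistic by hand: (x_bar - mu) / (s / sqrt(n)), with s using n - 1
t_manual = (age_sample.mean() - ages_mean) / (
    age_sample.std(ddof=1) / np.sqrt(sample_size)
)

alpha = 0.05
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(ages_mean, age_sample.mean(), t_stat, p_value, decision)
```

Changing the seed or the value passed as `popmean` changes the p value, which is the behaviour explored next in the session.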
Here you can see that I am getting the p value as 0.76. If you compute np.mean(age_sample), I get 31.5, which is a little bit away from 30. Now say my alpha value is 0.05. My p value is greater than alpha, so tell me: should the null hypothesis be rejected or not? Since p is greater than alpha, we fail to reject it. If I execute the same code with the hypothesized mean as 31, I get 0.918; with 28, I get 0.48; with 26, I get 0.27. If I draw a new random sample again and again, the p value changes: here 0.60, here 0.45, here 0.96, here 0.67. Let's say I have taken a different sample whose mean is 24.3. If I execute the test against 30, the p value is 0.015. Is that greater than or less than 0.05? It is less, so we reject. Against 31 it is 0.006, again less than 0.05. Against 28 it is 0.085, and with the next value I try it is 0.397, so there we fail to reject. Here I have taken a small example; in a real scenario you will have a huge data set to check all these things. So this was an example of the t test.

Let's take one more example. My problem statement is this: suppose I take the ages of all the students in a college as the population. Then I take the ages of the students of one class, find their mean, and check whether it agrees with the population mean age of the college.

Everybody focus on the code. The ages are generated from a Poisson distribution. For the population we start from age 18 with a Poisson mean of 35 and generate 1500 ages; starting at 18 means the average age comes out around 18 + 35 = 53. For class A we start from 18 with a mean of 30 and take only 60 samples. If I look at school_ages, here are my values, and classA_ages holds the 60 values for class A. First, classA_ages.mean() is 46.9. Now I apply ttest_1samp again: the first argument is classA_ages and the second is the mean to compare against, school_ages.mean(). If you press Shift+Tab you can see that this second parameter is popmean. Here I get the p value; tell me whether the null hypothesis should be rejected. If I look at the school ages mean, it is far away: the class mean is 46.9 and the school mean is 53.
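The code described above is not shown in the transcript, so the following is a reconstruction under assumptions: it uses `scipy.stats.poisson.rvs` with `loc=18` as the starting age, and the seed is arbitrary, so the exact means and p value will differ from the 46.9 and 53 quoted in the session.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)  # assumed seed; the lecture's seed is unknown

# Population: ages start at 18 (loc) plus a Poisson count with mean 35
school_ages = stats.poisson.rvs(loc=18, mu=35, size=1500, random_state=rng)

# One class: ages start at 18 plus a Poisson count with mean 30
class_a_ages = stats.poisson.rvs(loc=18, mu=30, size=60, random_state=rng)

school_mean = school_ages.mean()    # around 18 + 35 = 53
class_a_mean = class_a_ages.mean()  # around 18 + 30 = 48

# One-sample t test of the class against the school mean
t_stat, p_value = stats.ttest_1samp(class_a_ages, popmean=school_mean)

alpha = 0.05
if p_value <= alpha:
    print("reject the null hypothesis: class mean differs from school mean")
else:
    print("fail to reject the null hypothesis")
```

Since the class was generated with a lower mean than the school, the p value should come out far below 0.05.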
If you are considering alpha as 0.05, the p value here is very, very small, of the order of 10 raised to -13, so you have to reject the null hypothesis, guys, not accept it. The p value is far below the significance value. Let's say I put the hypothesized mean nearer to the class A mean, 47, and see what happens; like this you can check all the cases and verify them. In code I will write _, p_value = ttest_1samp(...), and then a condition: if p_value is less than 0.05, print that we reject the null hypothesis; otherwise print that we fail to reject it. While typing this I first printed accept in that branch; that was my mistake, and reject is the correct one.

One more final thing that I missed: correlation in code. Let's say I am using seaborn. df = sns.load_dataset('iris') loads the iris data set, and df.head() shows the first rows. If I call df.corr(), I get the correlation matrix; you can see, for example, that sepal length and petal length are positively correlated. On recent pandas versions you may need df.corr(numeric_only=True), because the species column is text. We also have a way to see this as a diagram: sns.pairplot(df) plots every pair of features, so you get the entire correlation in a visual way.

Guys, I am going to repeat the rule once more, because I wrote it the wrong way round in a few places. If the p value is less than or equal to 0.05, we reject the null hypothesis. The reason is that the p value is a probability: it is the probability of getting a result at least as extreme as our sample if the null hypothesis is true. If that is 5 percent or less, the sample is very unlikely under the null hypothesis, so we reject it. If the p value is greater than 0.05, we fail to reject (loosely, we accept) the null hypothesis, because a result like ours is not unusual when the null is true. This is for alpha = 0.05. The earlier one is fine, because there the value is greater than 5.99. Only the relationship between the p value and alpha is important. In coding we always get a p value; so if I get 0.015, that indicates we have to reject the null hypothesis, and the same condition works there.

The p value is always a confusing thing, so let me go back to what it specifies. I told you the example of the mouse pad. If I say the p value for this region is 0.8, that means out of 100 touches, 80 percent of the time you touch here; if the p value is 0.01, one time out of 100 you touch here. So if the p value is less than 0.05, a result like ours would come up less than 5 percent of the time when the null hypothesis is true. A p value less than or equal to 0.05 puts you in the tail region, and a p value greater than 0.05 puts you inside the confidence interval region. Always understand this relationship between the p value and the significance value: the significance value alpha is 1 minus the confidence level and fixes the size of the tails, and the p value comes from the test result. If the p value is less than or equal to alpha we reject the null hypothesis, and when the p value is greater than alpha we fail to reject it.

Here is what we will discuss today. First, the exact relationship between the p value and the significance value: in the tests we were doing, how do we derive the p value? I will discuss that, with one practical example. Then we move to distributions: the central limit theorem, the Bernoulli distribution, the binomial distribution, the Pareto distribution and the power law, and I will include the log normal and Poisson distributions also. One final thing that is pending is the F test, which is used in ANOVA; this takes about an hour, so I will upload a separate video on it.

Let's take a problem statement; I am going to use a z test, so take up your book and pen. I have already discussed permutation and combination in the previous session. The question is: the average weight of all residents in Bangalore city is 168 pounds with a standard deviation of 3.9. We take a sample of 36 individuals and the sample mean is 169.5 pounds. From this information, check whether the sample tells us the mean weight is the same or not, with a confidence level of 95 percent. So the population mean is 168 pounds, the standard deviation is 3.9, x_bar is 169.5 and n is 36. Since n is greater than 30 and the population standard deviation is given, I am going to use the z test. The alpha value is 1 - 0.95 = 0.05. Let's solve it. What is the null hypothesis? Mean is equal to 168. What is the alternate hypothesis? Mean is not equal to 168.
Then we come to the second step, where we specify alpha as 0.05. In the third step we find the decision boundary. It is a two-tailed test, because the mean can be greater than 168 or less than 168, so I have 0.025 in each tail and 95 percent in the middle. What is the boundary value from the z table? If you look up 1 - 0.025 = 0.9750 in the z table, you get 1.96, so the boundaries are -1.96 and +1.96. I hope everybody is clear, because we did this in the previous session.

Next we calculate the z statistic. The formula is simple: z = (x_bar - mu) divided by (standard deviation / root n). Here that is (169.5 - 168) divided by (3.9 / root 36), which is 1.5 multiplied by 6 divided by 3.9, so z is about 2.307. Now the decision rule: is 2.307 greater than 1.96? Yes, so we reject the null hypothesis. We have done this many times.

This is one way of solving the problem, but where does the p value come in? On the graph I was checking the region from -1.96 to +1.96, and 2.307 falls outside it, in the right-hand tail, so we reject. To find the p value, I remove those boundaries and mark my own z value instead: +2.307 on the right and -2.307 on the left, because the curve is symmetrical. Then I take out my z table and find the area under the curve for my z score. I go to the row for 2.3 and the column for .00, and I get 0.98928. This is the area under the curve to the left of z = 2.30. If I subtract it from 1, I get the area of the right tail: 1 - 0.98928 = 0.01072. The left tail below -2.30 has the same area, 0.01072, because of the symmetry. So the two tails are 0.01072 each, and the middle area is 1 - 2 x 0.01072 = 0.97856.
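The z score and tail areas for this problem can be cross-checked numerically instead of reading a printed table. This is a sketch using `scipy.stats.norm`; the variable names are made up for the illustration. Because it uses the unrounded z of about 2.3077 rather than the table row for 2.30, the areas differ slightly from table values.

```python
import math
from scipy.stats import norm

# Values from the problem statement
mu0 = 168.0    # population mean under the null hypothesis
sigma = 3.9    # population standard deviation
n = 36         # sample size
x_bar = 169.5  # sample mean
alpha = 0.05   # significance value

# z statistic: (x_bar - mu) / (sigma / sqrt(n))
z = (x_bar - mu0) / (sigma / math.sqrt(n))

# Decision boundary for a two-tailed test: alpha / 2 in each tail
z_crit = norm.ppf(1 - alpha / 2)

# The z table gives the area to the left of z; 1 minus that is one tail
area_left = norm.cdf(abs(z))
one_tail = 1 - area_left

# Two-tailed p value: both tails added together
p_value = 2 * one_tail

reject = p_value <= alpha
print(z, z_crit, one_tail, p_value, reject)
```

Both routes should agree: comparing |z| with the critical value and comparing the p value with alpha lead to the same decision.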
If I add up the two tails and the middle, I should get 1, and you can check that you do. The p value is the sum of the area of this tail and this tail, because the test is two-tailed. Should we divide by 2 somewhere? No. The z table gives the area to the left of the z score, so 1 minus the table value is already the area of one tail; the other tail is symmetrical, so we add the same area once more.

Be careful with the table lookup, because it is easy to read the wrong row: the entry for 2.37 is 0.99111, but our z score is 2.307, so we use the row for 2.3 and the column for .00, which gives 0.98928. Each tail is then 1 - 0.98928 = 0.01072, and the p value is 0.01072 + 0.01072 = 0.02144. From the z score we already know we have to reject the null hypothesis, since 2.307 is greater than 1.96, and the p value verifies it: 0.02144 is less than 0.05, which is my significance value, so we reject the null hypothesis.

Always understand two important points. If the p value is less than or equal to the significance value, we reject the null hypothesis. If the p value is greater than the significance value, we fail to reject the null hypothesis (loosely, we accept it). I hope this is now clear from yesterday's session, and you can try it on every problem that we have discussed these past days.

One more check on the z table: for the decision boundary we take 1 - 0.025 = 0.9750, and if you search the table you will find 0.9750 in the row for 1.9 and the column for .06, which gives 1.96.

Let's solve one more problem. The average age of a college is 24 years with a standard deviation of 1.5. We take a sample of students, 35 in the question, but let's take it as 36, and we find that the sample mean is 25 years. With alpha as 0.05, that is a confidence level of 95 percent, test whether the mean age differs from 24. H0 is mean equal to 24; H1 is mean not equal to 24. You know the standard deviation is 1.5, n is 36, x_bar is 25 and alpha is 0.05. Is this a two-tailed or a one-tailed test? It is two-tailed, so the 0.05 is divided into two regions of 0.025 each; for a one-tailed test the whole 0.05 would go in one tail. Now the z score: z = (x_bar - mu) divided by (standard deviation / root n) = (25 - 24) divided by (1.5 / 6) = 1 multiplied by 6 divided by 1.5, so z is 4. The decision boundary is plus or minus 1.96. Since 4 is greater than 1.96, we reject the null hypothesis. For the p value, mark your own z value on the curve: this will be your +4 and this will be your -4.
Now go to the z-table and find the value for 4. It is 0.99997. That is the area to the left of +4, so the area in the right tail is 1 - 0.99997 = 0.00003, and since this is a two-tailed test the left tail below -4 has the same area of 0.00003. The middle region is then 0.99994. Now what is my p-value? My p-value is 0.00003 + 0.00003, which is 0.00006. My p-value is obviously very, very much smaller than the significance value of 0.05, so what do we have to do? We have to reject the null hypothesis. So here you can definitely say that, with the sample that we have taken, we are not able to conclude that the mean is 24.

So let's go ahead with the log-normal distribution. Usually a log-normal distribution will have this kind of right-skewed shape. We have seen a lot of examples, like wealth distribution. Now suppose y is a random variable that belongs to a log-normal distribution with some mean and standard deviation. Then if I apply log of y, it should follow a normal distribution. If a variable satisfies this condition, we can say it follows a log-normal distribution. I have already discussed the log-normal distribution in the previous section, and a lot of examples are there. Take people writing in the comment section: there will be very few people who write big comments. So this is one example.
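The defining property of the log-normal distribution, that log(y) is normally distributed when y is log-normal, can be checked with a small simulation. This is a minimal sketch using only the Python standard library; the parameters mu = 0 and sigma = 1, the number of draws and the seed are arbitrary assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Draw log-normal values; mu and sigma describe the underlying normal
mu, sigma = 0.0, 1.0
y = [random.lognormvariate(mu, sigma) for _ in range(10_000)]

# Taking the log of every value should recover a normal distribution
log_y = [math.log(v) for v in y]

def skewness(values):
    """Sample skewness: near 0 for symmetric data, positive for a long right tail."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

print("skewness of y:     ", round(skewness(y), 2))
print("skewness of log(y):", round(skewness(log_y), 2))
print("mean of log(y):    ", round(statistics.fmean(log_y), 2))
print("std of log(y):     ", round(statistics.pstdev(log_y), 2))
```

The raw values should show a strong positive skew, while the logged values should be roughly symmetric, with a mean and standard deviation close to the mu and sigma that were passed in.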
Also, I have uploaded a detailed video on this in my stats playlist. Let's go to the next distribution, which is the Bernoulli distribution.

The Bernoulli distribution is all about p and q. Whenever you have a Bernoulli distribution, you need to understand that there are only two outcomes: it can be either zero or one. What I really need to find out is the probability of each outcome, and for a Bernoulli distribution this is defined by two values, one is p and one is q. Suppose I am considering an experiment called tossing a coin. In tossing a coin I know the probability of head; let's say the probability of head is 0.5, so this becomes my p value. The q value will be 1 - p. When we toss a coin there are two choices, either you get a head or you get a tail. The probability of head, 0.5, is the probability of one outcome; what about the other outcome? That is nothing but 1 - p, so if p is 0.5 then q will be 1 - 0.5, which is 0.5. And this is only related to a single trial, not multiple trials; it is a single-trial distribution.

Now suppose I do not have a fair coin. Let's say my probability of head is 0.3. In this particular case, what will be my probability of tail? The probability of head is p, and my probability of tail will obviously be q, which is nothing
but 1 - p, which is nothing but 1 - 0.3 = 0.7. So the probability of tail is 0.7.

Now, here you can see that the Bernoulli distribution, named after the Swiss mathematician Jacob Bernoulli, is a discrete probability distribution. Here you see three examples of the Bernoulli distribution. In the first one, the probability of x = 0 is 0.2 and the probability of x = 1 is 0.8, so when we draw the graph one line comes up to 0.2 and the other up to 0.8. In the next one the probabilities are 0.8 and 0.2, and in green you can see 0.5 and 0.5. Understand one important thing: whenever we draw this kind of experiment in the form of a graph, one axis has the outcomes and the other axis has the probability, marked 0.2, 0.4, 0.6, 0.8 and 1.

Now suppose I have one coin, with a head and a tail. If the probability of head is 0.5, I can draw a line up to 0.5 for head, and the probability of tail will also be 0.5. If the coin is not fair and the probability of head is 0.8, then we draw the line for head up to 0.8, and the line for tail becomes 0.2. This is how we draw it. And understand, this is not a probability density function; this is a probability mass function. A probability density function is completely different: a probability density function is for continuous variables, whereas this is
specifically for categorical variables. So this probability mass function we will call the PMF, where before we used to say PDF. Whenever we have this kind of discrete, categorical variable, the function is called a probability mass function. I hope everybody is able to understand this.

Now let's go to the Wikipedia page. Here you can see the probability mass function: it is q = 1 - p if k = 0, and p if k = 1. The PMF is defined in this manner, and for any probability that I want to find out, this is the formula that is used. We really need to know only this much about the distribution, plus this one probability formula. That was the Bernoulli distribution.

Now let's discuss the binomial distribution. Till now we discussed a single trial. Whenever we take multiple trials, it becomes a binomial distribution. The binomial distribution says that every trial is a Bernoulli trial, but here we have multiple trials, which means we have a combination of many Bernoulli trials, each with the same probability of success. Suppose in every trial my probability of head is 0.6 and my probability of tail is 0.4; many such trials combined together give one binomial distribution. Again, whenever you have a discrete variable and we draw this kind of diagram, it is called the probability mass function; in the case of a continuous variable we have a probability density function. So if I go and see this, the binomial distribution is
given by two parameters, n and p, where n is the number of trials and p is the probability of success in each trial, with q = 1 - p. And this is the formula for the probability mass function, used to calculate probabilities for a binomial distribution. So this is done.

Now let's go to one very important distribution, which is called the Pareto distribution. The Pareto distribution is non-Gaussian; it is not a Gaussian distribution, and it looks something like this. The Pareto distribution is a power-law distribution. So let me show you the power law. Everybody see this diagram; I'm just going to take a snippet of it and paste it over here, because this is very important. Let's see what important information we can take out of it. The power-law distribution says that you have to remember a rule called the 80-20 rule. You can see that this part is 80 percent of the entire value and this part is the remaining 20 percent. My x-axis may be something and my y-axis may be something, but understand that 80 percent of the outcome falls here and the remaining 20 percent falls there.

Let's take some examples. Suppose I say that 80 percent of the wealth is held by 20 percent of the people. Any other examples? Can I say 80 percent of a company's projects are done by 20 percent of the people in a team? 80 percent of sales come from the 20 percent most popular products. One more example: 80 percent of a cricket match is won by 20 percent of the team. 80 percent of videos are completed by 20 percent of the viewers, out of all the hundred
percent. Likewise, 80 percent of the syllabus is completed by 20 percent of the people, and 80 percent of the spamming on YouTube videos is done by 20 percent of the people. Yes, you can take any kind of example. You can also consider salaries, or 80 percent of the oil coming from 20 percent of the land. Whenever you have this kind of distribution, it is called a power-law distribution, and this is also called a Pareto distribution.

Now listen to one thing, guys; this is something amazing. The diagram that you see looks something like this. If I extend this diagram and make it like this, then what kind of distribution is this? This is not normal. Can I say this is a log-normal distribution? Log-normal, guys, not normal: the right-hand side over here gets extended by a lot. Probably I did not draw the diagram properly, but it is log-normal, right-skewed data. So there is a very good relationship between the log-normal distribution and the power-law or Pareto distribution: both are right-skewed with a long tail.

Mathematically, you can also convert this distribution towards a normal distribution, and for this you have to watch one of my videos on data transformation, so definitely check out that video. Now guys, yes, the Box-Cox transformation is used to convert this kind of data to an approximately normal distribution. You will be able to see it in the video link that I have given. I can also show you the code, so this is how the code looks.
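The code shown on screen in the lecture is not reproduced in this transcript. As a stand-in, here is a minimal sketch of the Box-Cox transformation applied to simulated Pareto data, using only the Python standard library. The shape parameter, the seed, the helper names and the crude grid search for lambda (keeping the value that leaves the least skew) are all assumptions for illustration; SciPy's `scipy.stats.boxcox`, by contrast, estimates lambda by maximum likelihood.

```python
import math
import random
import statistics

random.seed(0)

# Simulated power-law data: Pareto with shape 3 (all values are >= 1)
data = [random.paretovariate(3.0) for _ in range(5_000)]

def box_cox(values, lam):
    """Box-Cox transform for positive data: log(x) if lam == 0, else (x**lam - 1) / lam."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def skewness(values):
    """Sample skewness: near 0 for symmetric data, positive for a long right tail."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Crude search: try lambda from -5.0 to 2.0 and keep the least skewed result
candidates = [k / 10 for k in range(-50, 21)]
best_lam = min(candidates, key=lambda lam: abs(skewness(box_cox(data, lam))))
transformed = box_cox(data, best_lam)

print("skewness before:", round(skewness(data), 2))
print("chosen lambda:  ", best_lam)
print("skewness after: ", round(skewness(transformed), 2))
```

Low skew is only a rough proxy for normality, since a symmetric distribution need not be normal, so a Q-Q plot of the transformed values is a better final check.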
I have covered everything, guys; now it's your time to flourish and learn everything. See, all the transformations are given over here: normalization, standardization, scaling, square root, everything. Here I have also discussed the Q-Q plot. You have to follow that video, guys, because it would probably take me one hour to explain all these things. See, this is the Q-Q plot, this is the reciprocal transformation, this is the logarithmic transformation, then you have the Gaussian transformation. All these transformations, whether logarithmic, reciprocal, square root, exponential or Box-Cox, are used in the initial stages: we apply them to the features and then check whether the result looks normal. So what kind of distribution does this image follow? It is in the same link on the YouTube channel that I have given. It follows a Pareto distribution, a power-law distribution, so we can use a Box-Cox transformation to convert this data. If you go through this, you are well covered with respect to everything.

So, the central limit theorem. Suppose I have a distribution that is normal, or not normal, or any kind of distribution. Suppose I take multiple samples from this data, each with a sample size n greater than or equal to 30, many many samples, and for every sample I find the mean: x-bar 1, x-bar 2, up to x-bar m. Why am I saying n should be greater than or equal
to 30? Because the larger the sample size, the better the central limit theorem holds, and n of at least 30 is the usual rule of thumb. So if I take all the sample means and plot them in the form of a PDF, the plot will look like a normal distribution. The sample means will follow an approximately normal distribution. So here you can see that whatever the distribution may be, if we take samples with n greater than or equal to 30 and find the sample mean for each one, the means are approximately normally distributed. The sample size, I told you, is the number of elements that we pick in each sample, and it should be greater than or equal to 30. And let's consider that we have taken m samples; m can be anything, but the bigger the value, the better we will be able to see the central limit theorem. As we go on doing this, you will finally see that if we plot all the sample means we get this normal distribution, whatever the distribution of the original data may be: it may be log-normal, it may be normal, it may be anything.

Now one assignment for you all will be the Poisson distribution. You have got a lot of idea about distributions now, so just go and search on Wikipedia and look at the distribution in the same way. This is also a non-Gaussian distribution; it is a discrete distribution. Just go and check it out. That's it from my side. Have a great day ahead. Thank you, Mandal. Bye bye, keep on rocking, keep on learning. Thank you guys.
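Going back to the central limit theorem: it can be illustrated with a short simulation. A minimal sketch using only the Python standard library follows; the choice of an exponential population, the sample size n = 30, the number of samples m = 5,000 and the seed are assumptions for illustration.

```python
import random
import statistics

random.seed(1)

n = 30       # sample size: number of elements in each sample
m = 5_000    # number of samples drawn

def draw_sample(size):
    """One sample from a skewed, clearly non-normal population (exponential, mean 1)."""
    return [random.expovariate(1.0) for _ in range(size)]

# Mean of each of the m samples
sample_means = [statistics.fmean(draw_sample(n)) for _ in range(m)]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.pstdev(sample_means)

# For a normal distribution about 95% of values lie within 1.96 standard deviations
within = sum(abs(x - mean_of_means) <= 1.96 * std_of_means for x in sample_means) / m

print("mean of sample means:", round(mean_of_means, 3))   # population mean is 1
print("std of sample means: ", round(std_of_means, 3))    # theory: 1 / sqrt(30)
print("share within 1.96 sd:", round(within, 3))
```

Although the population is strongly right-skewed, the sample means cluster around the population mean with a spread close to sigma / root n, and roughly 95 percent of them fall within 1.96 standard deviations, as a normal distribution would predict.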

[Music] hello guys what are we basically going to cover from basics to advanced uh this will be specifically related to positions like data scientist data analyst related to business intelligence tool everything will get covered over here we need to understand the basic differences between descriptive statistics and second one is inferential stats the differences between descriptive stats and inferential stats because the entire statistics with respect to data science is divided into this two concept in descriptive stats some of the topics that i really want to mention is measure of central tendency measure of dispersions these are some of the examples anything that is related to summarizing the data so all the tools that you&#39;re probably using like histograms you&#39;re using box plot whisker plot everything will probably come over here if i sub divide many of the topics here we are basically going to understand histograms we are going to understand about pdf we are going to understand about cdf we are going to see that probably how do we create this pdf by what techniques we care create this pdf cdf everything uh we will also be understanding some topics like probability permutations which are pretty much improbability is very much important in terms for data science mean median mode so you also have variance standard deviation we&#39;re going to cover many distributions let me name the distributions over here like gaussian distribution then you have log normal distribution other type of distribution like binomial distribution then you have bernoulli&#39;s distribution pareto distributions this is also called as power law distribution the we&#39;ll also be discussing about standard normal distributions the seventh thing that probably we will be discussing about is uh in standard normal distribution we may also have different different techniques we&#39;ll be discussing about transformation we&#39;ll be discussing about standardization we&#39;ll be discussing 
about different kind of transformation and this all will be with the help of python also we&#39;ll try to see we&#39;ll distribute we&#39;ll discuss about something called as q q plot we&#39;ll try to find out how how to determine whether a distribution is a normal distribution or not that all things we will try to discuss these are some of the topics that i have written uh there is also very something very much important which is called as inferential stats now in inferential stats our main focus is basically like z test t test anova test chi-square test if i just consider some example with respect to z-test there are multiple ways to actually perform z-test so in z-test probably you will be having different ways and this i will also try to show you by executing in python t test also i&#39;ll be showing you by using python programming language chi square test anova test so anova test is also called as something called as f test we&#39;ll be discussing about this uh like factorial anova different kind of anovas that we are going to discuss most important thing we forgot right which is called as hypothesis testing how can i forget this okay we are also going to discuss about hypothesis testing right in hypothesis testing how do you determine your null hypothesis alternate hypothesis everything will probably get covered in this uh here we are specifically going to understand about p values one very much important thing is something called as confidence intervals confidence interval then i&#39;ll also teach you how to see z table um you know which is a kind of sheet where you can directly get the values over there similarly t table is there chi square table is there many things will basically be there let&#39;s start the first topic the first topic uh that obviously anybody needs to understand is that what is statistics okay we really need to understand because whatever i&#39;m discussing right it is very much important in terms of interview in terms of interview 
i&#39;m actually going to teach so that you will definitely be able to understand many things so the first thing we will understand what exactly its statistics many people have different kind of definition with statistics but i really want to give a very simple definitions which is from wikipedia so i&#39;m going to say statistics is the science of collecting organizing and analyzing data now you know based on the amount of data that is getting generated now you can just understand directly like how important stats is you have tons and tons of data you have huge amount of data and definitely you can actually utilize this particular data to make sure that uh there is improvement in your products there is improvement in your business goals and that actually helps you to finally make a very good decision so finally why why we are doing this for why we are doing this we are doing this for better decision making so we are specifically doing this for better decision making everything that is basically getting covered on this and if i try to now dis define statistics or the types of statistics first of all there is one very important thing which is called as data so data over here is nothing but facts or pieces of information that can be measured so what is data in short of facts or pieces of information that can definitely be measured and let&#39;s go ahead and let&#39;s see some of the examples what do you think about data definitely if i if i say that okay uh let&#39;s let&#39;s consider one very simple example i am basically going to say that fine with respect to the data i can give you some lot of examples so one example i can say that let&#39;s see if i want to measure the iq of a class of the students right i want to measure the iq of the students of the class so i may probably get values between 0 to 100. 
suppose let&#39;s say that i am getting this one i&#39;m getting this i&#39;m getting 55 i&#39;m getting 75 i&#39;m getting 65 so this is one example of data here we can basically measure and the example is iq of a class suppose i i want to give one more example okay the age the ages of student of a class i may have different ages like 30 25 24 23 27 28 what is this this is specifically data and always remember the most intrinsic meaning of data is that it can be measured that is the most important thing types of statistics the first type as i said is called something called as descriptive so the first type is basically called as descriptive stats now how do you define descriptive stats descriptive stats i&#39;ll just say that it consists of organizing and summarizing of data it consists of organizing and summarizing data that&#39;s it very simple if i really want to understand i&#39;ll probably make you understand more about what is descriptive stats but let&#39;s go towards the definition of inferential stats now in inferential stats you can basically say that it is it is a technique wherein we use the data that we have measured to form conclusions now if i talk about two important things one is conclusion and one is about data now first of all we will understand about descriptive stats and then probably i i&#39;ll give you a very good example okay i&#39;ll try to give you a very good example and based on that particular example what is the type of question that may come up in descriptive stack so let&#39;s let&#39;s consider that i have a classroom of math students and in this classroom let&#39;s consider that there are around 20 people and now i want to find out the marks marks of the first sem let&#39;s say now here probably the marks with respect to percentage are like this 84 86 78 72 75 65 80 81 92 95 96 97 so over here you can see how many data there 1 2 3 4 5 6 7 8 9 10 11 12 let&#39;s consider that we have around 20 data what is the average age of the 
students in the class student in the class so this may be a perfect example of descriptive stats now here i&#39;ve just told about the average it can be anything it can be standard deviation it can be mean it can be mode it can be different different things so here you can see that i&#39;ve taken a very simple example i have uh our math students like 20 people over here and probably you can basically understand over here that we are trying to find out what is the average age of the student in the class you may also say that what is the percentage of the people passing out from the class you can also say that different different examples probably you&#39;ll be able to understand when i talk about percentiles and all now let me go ahead and let me find out and let me tell you the other example of inferences stats based on this what kind of thing what kind of question you can ask with respect to inferential stats i have told you the definition what what inferential stats basically consists of it is a technique wherein we use the data that we have measured to form conclusions i may say that are the ages of the students of this classroom similar to the age of the college similar and let&#39;s say age of the college but age of the maths classroom in the college so this is basically my question my question basically says that are the ages of the student of this classroom similar to the age of the math classroom in the college so here maths classroom in the entire college is my population and probably just a classroom student&#39;s age is just like my sample sorry did i discuss about max okay sorry i&#39;ll not say age but average marks i&#39;ll just try to change over here just a second guys i&#39;m extremely sorry so this is not age this is marks so i&#39;ll not say this as ages and let&#39;s let me but you can also take ages as an example i will say it as marks like that let&#39;s consider the maths classroom there are different five different classrooms and i have 
actually taken the data of only one classroom and from this this is basically called a sample and this is my entire population now since we have discussed about population and sample and i&#39;ll be coming more on making you understand about descriptive when we are deep diving into various topics now it is time that we really understand about population and samples so coming over here is basically population and sample what exactly is population now population basically means let&#39;s consider one example again see guys i will definitely give you lot of examples the reason why i&#39;m giving you examples is that because understand if we learn statistics in such a way that we have examples in mind we will be able to explain the interviewer in a very good way so let&#39;s take an example of elections probably you may be talking about goa you may be talking about up let&#39;s consider this two-state so obviously it is not possible probably let&#39;s consider that the election has finished and we really need to find out the exit poll now exit poll what usually this press reporters and all will do what they will do is that they cannot go and ask each and every person suppose the goa population is this big let&#39;s consider it is not possible for every reporter to go and ask each and every person that whom you have voted because it is not possible you may not find some people some people may be traveling some people may be doing different different things and also it is not possible at all so what happens in this exit poll this reporters what they do is that they take up sample of population from different different region and again there are different different kinds of sampling techniques they take up different different samples and then what they do is that they ask that whom did you vote and based on that maximum number of people whom did they vote they basically say based on that they actually create their exit poll now in this particular case what is my 
population data my population data is this entire population of goa so this specific thing is my population data and this round circles that i have actually done is basically my sample data so i hope you have basically got some some examples with respect to that guys i hope everybody is clear with this i basically told age over here so don&#39;t get confused sometime when i&#39;m teaching sometimes students may come ages may come or marks may come so you will not get confused don&#39;t worry so here is one example now let&#39;s go ahead and let&#39;s try to understand one thing now in this particular scenario in this particular example many people have told about krish why are you just considering okay you are considering samples to solve a particular problem what are the different sampling techniques you really need to understand or tell us that because there are different different sampling techniques what are the different kind of sampling techniques but before i go ahead usually population if i talk about population you really need to understand about some of the notation population is basically given by capital n and sample is basically given by small n so this is how we basically denote population this is how we basically denote sample now the next question comes that krish why you have selected samples randomly is there any better ways to do sampling also or just we need to uh do the sample randomly i would like to say that guys this entire sampling takes place based on various scenarios and for that i will be showing you some of the examples so let&#39;s go and understand about some of the sampling techniques and what are different different sampling techniques we basically have the first sampling techniques let me write it down for you now the first sampling techniques which is most of the time used is called as simple random sampling simple random sampling very simple very important suppose i have some data i have some i&#39;m sorry i have some population 
suppose this is my population simple random sampling will be just like you go and pick up some people like this anyhow you want there is no there is no such confusion as such you just go and randomly pick up people simple random sampling and simple random sampling is basically used in many of the scenarios probably in exit poll you can use simple random sampling suppose if you want to use some kind of medicines right you do some kind of test for the medicines at that point of time you cannot use simple random simple random sampling you have to pick up some people probably have to check their medical history based on that you have to apply but simple random sampling it&#39;s all about i can basically say that i&#39;ll just give you a small definition over here when performing simple random sampling every member of the population has an equal chance of being selected for your sample n now coming to the second type the second type of sampling is called as stratified sampling let&#39;s let let me give you a definition stratified sampling is a technique where the population that is capital n is split into non-overlapping groups so one example i&#39;ll be talking about it don&#39;t worry this is also called as strata strata basically means layering stratified layering like that we basically say this is what a stratified sampling basically means let me give you one example let&#39;s let&#39;s consider gender i want to do this sampling based on two things one is male and female let&#39;s consider that i want to do a survey and for a survey obviously i will be requiring some people and based on that my samples will basically be divided right based on male and female male people will give different kind of or survey female people may give different kind of survey okay so something like this so this is definitely one example any other example that you would like to say obviously wherever you can see that there can be non-overlapping groups obviously you can do it let me give 
you one more example suppose i want a survey to be done by zero to ten years of kids i want to next uh probably i&#39;ll try to make this kind of layering based on age probably 10 to 20 will be one age group probably 20 to 40 will be another age group and probably it will be for 40-100 will be another age group so based on different different age group i can also do a sampling understand one thing this terminology is very much important non-overlapping it should not overlap over here there is no chance of overlapping based on profession can i do stratified sampling based on profession can i do stratified sampling hey a profession may be that let&#39;s let&#39;s say that this profession is with respect to different different different different people who are working okay suppose a person is a dotnet developer a person is a php developer a person is a you know data scientist or he&#39;s working specifically in python over here definitely you can say that they have different different stratified layers but there may be some scenarios that it may overlap a php person may know dotnet a dotnet person may know python so both the scenarios will be there if a person is highly experienced he says that no i don&#39;t know dot net then it will not become overlapping but definitely we can apply it for doctors engineers doctors engineers different different survey can be there so just understand that in some of the cases we can do stratified sampling but by applying some other conditions we can make sure that that sampling satisfies all the conditions coming to the third one the third technique is basically called as systematic sampling the third technique is called as systematic sampling here from the population n what we do we just pick up every nth individual i&#39;ll give you a very good example nth individual from this population what does this basically mean let&#39;s consider that i&#39;m outside the mall and i want to do a survey regarding covet so what i am doing every 
seventh or eighth person that I see, I ask that person to do the survey. In systematic sampling you pick, say, every eighth person; I'm just saying that as an example. I may take every fifth person or every tenth person that I see in front of my eyes, and I'll just tell him to do the survey. That is what systematic sampling is all about. There is no particular reason why you're selecting the eighth or the ninth person; you just decided that whichever person you see on the seventh count, you are going to catch him and ask him about the survey.

So when Thanos snapped his fingers, what kind of sampling technique do you think he used? Probably random sampling may have been used.

Now let's come to the next sampling, which you can call convenience sampling or voluntary response sampling. I'll take it as the fourth technique and call it convenience sampling. Suppose I am doing a survey and only those people who are domain experts in that topic will be participating. Let's say I am doing a survey related to data science: I will say that any person who is interested in data science and has knowledge of data science can take it. If you consider only those people, it becomes convenience sampling. Only people who are interested in this, or who are expert in it, will be doing it, because this is a specific topic which requires domain knowledge, which the person should have, based
on this survey, because those surveys will be important. Through surveys you take out some kind of information and you will be able to make some kind of decisions, so who is taking the survey is very important. It also relates to how you generate your data set: in many companies, they put some kind of surveys in front of people and use that data for different things.

Again, I'm going to repeat what convenience sampling is. Let's consider that I am doing a survey related to a specific topic, in this example data science. Obviously I will not go to people who don't have knowledge of data science to do that survey. So I may collect my sample in a slightly different way, where I focus on making sure that the people taking the survey have knowledge of that specific topic.

Now let me give you some examples. Let's say there is an exit poll: what kind of sampling would be better? Okay guys, again people are getting confused between stratified sampling and this sampling. In convenience sampling we are specifically considering a domain; in stratified sampling we are dividing into groups based on something. So tell me, for the exit poll, what kind of sampling technique may we use? Obviously we will be using random sampling here.

The RBI, I hope everybody knows the RBI, does surveys of households. For these household surveys, what kind of sampling may they use? You may also consider following some stratified random sampling here, but most of the time random sampling is done in household surveys. Suppose the RBI makes sure that the survey has to be filled in by women, because they are trying to find out the cost of running a house. So here you can probably
consider stratified sampling. If you don't want to consider stratified sampling, we can also do convenience sampling: you can consider only women there. Now understand, sampling techniques may be different; it is completely dependent on the use case. And it is not that we depend on just one kind of data: we try different sampling techniques and finally come to a conclusion.

Let me give you one more example. A drug needs to be tested; what kind of samples may we take? Here I can bring up multiple use cases. First of all, on whom does this drug need to be tested? If I get that information, I will do the age groupings and then apply the sampling. Let's consider this drug is for everyone: then I may consider picking up some samples, but I'll at least put a condition that the age should be greater than 15 years, because we cannot directly use a drug on kids. So it depends on the use case, and based on that you select the technique. And again, many questions may come: "Okay Krish, why not this, why not that?" That is where we experiment with multiple things. In a real-world scenario, when you are collecting data, you will find these kinds of situations a lot.

Now let's go to the next topic, which is variables. What is a variable? If you are a coder, you obviously know what a variable is, so I will give you a definition that is more relevant to you: a variable is a property that can take on any value. Let's take an example: I'll say height, I may say
weight; these are variables. They can have any value: 170 centimeters, 172 centimeters, 185 centimeters, 190 centimeters, anything. I can have different values for height: 182, 178, 168, 150, 160, 170. Similarly, for weight I can have values like 78, 99, 100, 60 or 50. So this is a simple definition of a variable with a lot of examples.

Now understand, there are two kinds of variables. The first kind is the quantitative variable. The second type, you just send me the answer, I'll pause for five seconds, is the qualitative or categorical variable. These are the two types of variables that we use. Now I will divide these further, because these are also very important.

First, the quantitative part. It has some properties: it can be measured numerically, so we can measure it by putting numbers on it, and we can perform operations like add, subtract, multiply and divide. Some examples of quantitative variables are age, weight and height.

In qualitative or categorical variables, let's take gender as an example: in gender I have male and female. What does this mean? Based on some characteristics, we derive categories. Here we cannot add, subtract or do any kind of arithmetic, because those operations have no meaning for categories. Another
example: let's consider that I have IQ, and I divide it into 0 to 10, 10 to 50, and 50 to 100. Wherever the values are between 0 and 10 I may say low IQ, for 10 to 50 I may say medium IQ, and for 50 to 100 I may say good IQ. I'm just giving this as an example. Based on some characteristics, I have classified IQ into multiple categories. Don't ask me, "Krish sir, how can more than 50 be good IQ?"; I'm just taking an example here. Blood group is another example: I may have A positive, A negative and so on, so I may have a lot of categories. I may also say T-shirt size, where we have XL, large, medium and small.

Now coming to the quantitative part: quantitative variables also come in two kinds. We know quantitative means we have numerical values; I am going to divide this into discrete variables and continuous variables. A discrete variable takes whole-number values. Let me talk about some examples. Number of bank accounts of a person: you'll say I have two bank accounts, three, four, five, six or seven bank accounts; you can't say that you have 2.5 bank accounts. Another example is the number of children in a family: you will say there are two children, three children, four or five children, but you cannot say 2.5 children or 3.5 children.

Now let's go to continuous variables. Here, as we have already discussed, the variable can take any value. Suppose I say height: I can say that the person is 172.5
centimeters, I can say that the person is 162 centimeters, or 163.5 centimeters; any value can come here. Similarly, with weight I can say the person is 100 kg, 99.5 kg or 99.75 kg. I can also talk about the amount of rainfall, which is measured in inches: 1.1 inches, 1.25 inches, 1.35 inches. So these were examples of continuous variables.

I'll give you some examples to classify. What kind of variable is gender? Marital status? River length? The population of a state? Song length? Gender, I'll not say discrete, I'll say categorical: it is a qualitative or categorical variable. Marital status, again the same thing. River length is a continuous quantitative variable. Population of a state will be discrete, and song length will be continuous. What kind of variable is blood pressure? It will also be continuous. What kind of variable is pin code, discrete or categorical? Don't worry, as we go ahead in some of the classes you will be able to understand this, when you get a problem statement in data science where you have a pin code in the data set and have to decide how to handle it.

Now let's go to the next topic: variable measurement. Here we are going to understand how we measure variables. We have four different types of measured variable: the first type is nominal, the second is ordinal, the third is interval, and the fourth is ratio. First of all we'll try to understand nominal. Here also I'm going to give you a lot of
examples, and explain why it is necessary to know these four types of measured variables: your data set will have these kinds of variables. You'll have nominal data, ordinal data, interval data and ratio data, and knowing them lets you do a good data analysis.

So if I talk about nominal data, I can say it is categorical or qualitative data. Whenever I say categorical data, you know that it is split into different classes. Color is one example, gender is another, type of flower is another. These are some examples of nominal data. I've heard that interviewers ask, "What is the difference between ordinal and nominal data?"

Now let's discuss ordinal data. In ordinal data, the order of the values matters but the values themselves do not. I'll talk about why I'm saying the value does not matter. Let's say that I have five students, and the marks of the students are 100, 96, 57, 85 and 44.
Now tell me, if I try to find out the rank: whoever has the highest marks gets the first rank, 96 gets second, 85 gets third, 57 gets fourth, and 44 gets fifth. These ranks are my ordinal data. Here we focus on the order, not on the values: we are not worried about what marks a particular person got, only that he got the first rank. So this was ordinal data. We also use different techniques to analyze such data, and when we see some data sets in the future we will look at those scenarios too.

Now interval data. Here the order matters, the value also matters, and one more thing: a natural zero is not present. What is this natural zero? Let me take an example of interval data: temperature in Fahrenheit. I may have values like 70 to 80 Fahrenheit, 80 to 90 Fahrenheit, 90 to 100 Fahrenheit. The values are definitely there and the differences between them are meaningful, but zero Fahrenheit does not mean "no temperature"; the zero point is arbitrary. So this is called interval data: you have meaningful differences between values, and the order also matters. I may also have distance ranges, 10 to 20, 20 to 30, 30 to 40, of the kind used in Ola. I think you have booked cabs: say you're booking the cab for six hours, they'll say that you can go from 0 to 60, and if you go more than 60
that time you have to pay more. So in interval data a natural zero is not present; zero Fahrenheit has no special meaning. Ratio data will be an assignment for you.

Let me take another topic, which is frequency distribution. This is important because in the later stages you will be learning about histograms. Let's say that I have a sample data set with three types of flowers: rose, lily and sunflower. In this data set I have a lot of flowers: rose, lily, sunflower, then again rose, then again lily, then again lily, and so on. Let's consider that this is my entire data set. For showing this data set in a summarized way, we can use a frequency distribution table with the flower type and its frequency. If I take rose, how many times do I have it? One, two, three, so 3 is the count of rose. If I consider lily, the count is one, two, three, four, so 4. If I consider sunflower, the count is one, two, so 2.
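The counting above can be reproduced with a few lines of Python. This is only a minimal sketch: the order of the flowers in the list is an assumption made for illustration, and only the counts (rose 3, lily 4, sunflower 2) come from the lecture.

```python
from collections import Counter

# Hypothetical ordering of the flowers; only the per-type counts
# (rose 3, lily 4, sunflower 2) are taken from the lecture.
flowers = ["rose", "lily", "sunflower", "rose", "lily",
           "lily", "sunflower", "rose", "lily"]

# Counter builds the frequency distribution table: flower type -> frequency.
frequency = Counter(flowers)

for flower_type, count in frequency.items():
    print(f"{flower_type:<10} {count}")
```

`Counter` is used here because it maps each category directly to its count, which is exactly the shape of a frequency distribution table.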
So this is the frequency of the values in this data set for the different categories. This whole thing is a frequency distribution table, and from this table you can derive bar charts, pie charts and other things.

Now one more topic. You know that this is a frequency distribution, but there is also something called cumulative frequency. Cumulative frequency says: initially I have rose, three flowers; then I add lily's four to this, so it will be seven; then I add sunflower's two, so it will be nine. When we reach the last category, the cumulative frequency gives the total number of flowers present. So the frequency is getting added at each step.

Now what can we derive from this? There are bar graphs and pie charts; let's draw from this table and see how it looks. In the case of discrete variables we can draw a bar chart; if the variable is continuous, we can draw a histogram. So first, the bar graph. On the x-axis I will have all my flowers: this is rose, this is lily and this is sunflower. On the y-axis I will have the frequency: one, two, three and so on. I know there are three roses, so I create a bar of height three; this is my bar for rose. Lilies are four, so for lily I may use blue color and the bar goes up to four. And finally you will be able to see sunflower; this will be
sunflower. Let's say sunflower is only two, so I create a bar of height two. This diagram is called a bar chart. Why do we use it? As I said, for summarizing the data; this is still part of descriptive statistics. Here you can see that the values are discrete.

Now the next example, which is histograms. In the case of a histogram, first of all your data should be continuous. Let's take age as an example. Suppose I have a data set of ages with values 10, 12, 14, 18, 24, 26, 30, 35, 36, 37, 40, 41, 42, 43, 50 and 51. Here we treat age as a continuous value. In the case of continuous values, if you want to represent them through a diagram for data analysis, you can use a histogram. Now understand one very important thing: in a histogram we make something called bins. Bins mean we make some kind of grouping. Here I'll take the bin size as 10. On the y-axis I will have the frequency. Now let's make the bins on the x-axis. I told you 10 will be the bin size, so the edges go 0, 10, 20, 30, 40, 50, 60, 70, 80, 90.
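As a rough sketch of the grouping step, the ages listed above can be placed into bins of size 10 in Python. The half-open bin convention used here (each bin includes its left edge and excludes its right edge) is an assumption, not something the lecture fixes. Under it, values sitting exactly on an edge, such as 30 or 50, fall into the bin that starts there, so a count done by hand with a different convention can differ by one in the neighbouring bins.

```python
from collections import Counter

ages = [10, 12, 14, 18, 24, 26, 30, 35, 36, 37, 40, 41, 42, 43, 50, 51]
bin_size = 10

# Map each age to the left edge of its bin.
# Assumed convention: half-open bins, e.g. 20 <= age < 30.
bin_counts = Counter((age // bin_size) * bin_size for age in ages)

for left_edge in sorted(bin_counts):
    print(f"{left_edge}-{left_edge + bin_size}: {bin_counts[left_edge]}")
```

Changing `bin_size` regroups the same data into wider or narrower bins, which is what changing the bin size of a histogram does.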
You can change the bin size also. Between 0 and 10 I don't have any value, so I'm not going to draw anything there. On the y-axis I mark the frequency counts: 1, 2, 3, 4 and 5. Now between 10 and 20, how many values do I have? One, two, three, four. So I draw a bar of height four between 10 and 20. Then between 20 and 30 I have three values, one, two, three, so I draw my next bar. Then between 30 and 40 I have one, two, three, no sorry, four, because I'm also going to count the boundary value. So I draw another building here, and these buildings are what we call a histogram. Then between 40 and 50 I have one, two, three, four, so again four. And finally between 50 and 60 I have one. These buildings that you see are called a histogram, and in a histogram your values will be continuous.

Now one amazing thing, because people ask what a PDF is: I say that a PDF is a smoothening of the histogram. If I smoothen this histogram, my PDF, the probability density function, will look like a smooth curve over the bars. You may ask, "Krish, how is this getting created?" I'll say there is something called a kernel density estimator. How it is done we will understand in the upcoming classes; that is a little on the advanced side. But I hope everybody got an idea about histograms and bar graphs. Okay, tell me, what is the difference between a bar graph and a histogram? Why do we use a bar graph and why do we use a histogram? A bar graph is used for discrete variables, and
a histogram is used for continuous variables. And if somebody asks you what exactly a probability density function is, you're just smoothing the histogram.

Hello guys. Yesterday, if you remember, we discussed all the basic things. Today we will be moving from basic to intermediate stats, specifically for data science. There are many topics that I am going to cover today: measure of central tendency, measure of dispersion, Gaussian distribution, fourth, Z-score, then standard normal distribution, and some more topics.

The first topic we are going to discuss is the arithmetic mean for population and sample. Mean means we are talking about the average. With population and with sample, we need to understand the formulas of the mean. The population size is given by capital N, and the sample size is given by small n. Whenever we are discussing the mean, remember that we are trying to find the average of a specific distribution. If I want to find the mean of a population, I denote it by the symbol mu, and I'll say: summation over i equal to 1 to capital N of x of i, divided by capital N. Now what is this x of i? Let's consider that this is my random variable X, with these values inside my data set: 1, 1, 2, 2, 3, 3, 4, 5, 5, 6.
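For reference, the population mean spoken above can be written symbolically as follows, where $x_i$ is the i-th value of the data set and $N$ is the population size:

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
```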
So if I want to expand this, x of i means we iterate through all N elements. I may write 1 plus 1 plus 2 plus 2 plus 3 plus 3 plus 4 plus 5 plus 5 plus 6, divided by capital N. What is capital N here? Counting 1, 2, 3, up to 10, we have 10 elements. So it is 32 by 10, which is 3.2. So 3.2 is my average.

Now with respect to a sample, always remember how the symbol is given: we write x bar for the sample mean. Here I write summation over i equal to 1 to small n of x of i, divided by small n. We'll get the same answer here because we are taking the same data set. So this was the example of the arithmetic mean. Always understand the notation; the notation is quite important. The reason I'm stressing notation is that in the real-world industry, when you are explaining something to someone as a data scientist, you need to use the well-known notation. You can use your own notation, but think from a larger point of view: whatever standard is being followed, we should try to follow it.

So these were the basics of the mean. The mean is part of the measures of central tendency; apart from the mean there are two more. Let me define central tendency. There are three main measures: one is the mean, the second is the median, and the third is the mode. It is a very important interview question: if someone asks you what central tendency or a measure of central tendency is, you can say that it refers to the measure used to determine the center of the distribution of the data. That means whenever I have data, if I want to find the center part of
that particular distribution, I can use the mean, median or mode. Why specifically we use each one, I'll be talking about, but I hope everybody got the definition: it refers to the measure used to determine the center of the distribution of the data. Average and mean are one and the same, guys; we use the same formula. So this was central tendency.

Now let's try to solve some problems. I have given you a lot of examples of the mean, so now let's understand the median and why we use it. I'm going to take the same data set that I used before: 1, 1, 2, 2, 3, 3, 4, 5, 5, 6. For this data the mean we found was 3.2. Now what if I tell you that you add one more element to this distribution, like 100? When you add 100, the mean becomes 32 plus 100, divided by 11. The 32 is the sum of all the numbers we had previously, 100 is the new element that is added, and we divide by 11. So 132 by 11 gives 12. Before, when 100 was not added, my mean was 3.2, but after adding 100 to that distribution my mean became 12.
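The effect of that single extra value on the mean can be checked with a short Python sketch. It uses only the standard library's `statistics.mean`; the variable names are illustrative.

```python
from statistics import mean

data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6]

mean_before = mean(data)          # sum is 32, divided by 10 values
mean_after = mean(data + [100])   # sum is 132, divided by 11 values

print("mean without 100:", mean_before)
print("mean with 100:   ", mean_after)
```

One added value out of eleven is enough to pull the mean far away from where most of the data sits.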
Now here you can see that there is a huge movement of the mean. Why did it move? Because of this number; we consider this number an outlier. Outliers have an adverse impact on the mean of the distribution, which is why we should be very careful with outliers. In data science and in statistics we use different techniques to remove outliers, which I'll also discuss today when we talk about percentiles. So remember, outliers have a major impact: here you can see that the center of the data is moving, and the difference is quite huge. For this case, what we can do is use the median.

In the median, I take the same numbers, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, and then I have 100. Always understand, in the median the first thing you need to do is sort the numbers. Here you can see that the numbers are already sorted; if your numbers are not sorted, you have to sort them first.

How do you define a distribution in statistical terms? Distribution means how your data looks, how your data is distributed. There are various ways to see it: we use the PDF, we use histograms, we use different techniques. I will be coming to different distributions later; I have not started that yet.

So the first step is always sorting the numbers. After sorting, I take the central element. Which one is the central element? Suppose I have an odd number of elements. Here, what is the count? 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.
So 11 is the count. To find the central element, I count five from the left, one, two, three, four, five, and five from the right, one, two, three, four, five. The element left in the middle is my central element. So in this case I can say that my median is 3.

Now understand, even though the outlier is added, the median barely moved. What does outlier mean? An outlier is a number which is completely different from the rest of the distribution; here you can see that 100 is completely different from the rest.

Now your question may be, "Okay Krish, what if I have one more number?" Let's say I have one more number, 112. You told us to pick the central element; in this case which will be my central element? Now my total number of elements is 12. So to find the center, I take the middle two elements: one, two, three, four, five from the left, and one, two, three, four, five from the right. The middle elements are 3 and 4, and I take the average of them: 3 plus 4 divided by 2. So in this case I can say that my median is 3.5, even though I had two outliers. But yes, if I keep on increasing the number of such values, they stop being outliers and become a regular part of the distribution. Now understand why the median works in a better way. Before, because of this outlier, my mean was 12.
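The central-element procedure just described, which is the median, can be checked with a short Python sketch. `statistics.median` sorts the values internally, takes the middle value for an odd count, and averages the two middle values for an even count; the variable names are illustrative.

```python
from statistics import median

one_outlier = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 100]   # 11 values, odd count
two_outliers = one_outlier + [112]                  # 12 values, even count

median_odd = median(one_outlier)     # middle (6th) value after sorting
median_even = median(two_outliers)   # average of the 6th and 7th values

print("median with one outlier: ", median_odd)
print("median with two outliers:", median_even)
```

Both results stay close to the bulk of the data, unlike the mean of the same lists.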
Even after adding just one outlier my mean was 12, but here you can see that even though I added two outliers, the median, which is again a measure of central tendency, shows hardly any difference. That is the reason why we use the median. To be clear, this measure is the median, not the mode; if I wrote mode by mistake, it should be median. So I hope everybody understood what the median is: we take the central element, and if the number of elements is even, we take the central two elements and calculate their average. But understand the main point. Initially, when we did not add an outlier and calculated the mean, I got 3.2. When I calculated it after adding an outlier, my mean was 12. When I do this with the median, even after adding the outlier it came as 3, and when I used two outliers I got the median as 3.5. Here you can see that there is much less difference compared to the mean. So the median works well with outliers; that is the proper statement.

Now the third topic, the mode. Suppose I have a data set like this: 1, 2, 2, 3, 4, 5, 6, 6, 6, 7, 8, and I also have some outliers like 100 and 200. In this case, what should be our mode, which is again taken as a measure of central tendency? In the mode we find the most frequent element. We just count and see which element occurs the maximum number of times. Here you can see six occurs three times; if you see two, the count is two. So in this
particular case my mode is definitely 6, which is again a measure of central tendency. Now see, guys, there is one disadvantage here. Suppose I have many outliers that repeat, like 100, 100, 100. Since we pick the most frequent element, the mode can end up being the outlier itself. So when outliers are present, we mostly use the median.

So where specifically is the mode used? Let's consider a data set with gender and salary; in gender you will find values like male, female, male, female. Or let me change the data set and make it simpler. The mode can be used with both integer and categorical variables, but it works well with categorical variables. Let's say the columns are type of flower, petal length and petal width. Here you will see different flowers like rose, lily and sunflower, and let's say some values are missing: this data set has come to me and I have seen that 10 percent of the data is missing. To handle these missing values, what can we use from mean, median and mode? Don't you think I can use the mode here? The missing value can be replaced with the most frequently occurring flower. So the missing value is replaced with the most frequent element, which you get by using the mode, and this works well specifically for categorical variables. Now let's take another
example. Suppose I have a feature, age, with values like 25, 26, ..., 32, 34, 38. Let's say these are the ages of students. Should I apply the mean, the median or the mode? In this case I would definitely suggest going with the mean, because I know students' ages range from one value to another and won't extend much beyond that. So domain knowledge also comes into play here. If I said these were the ages of the entire population of the world, I would probably not go with the mean. So this is a very good example for thinking about the various use cases.

Now let's discuss the next major topic, measures of dispersion. Here we discuss two main topics: one is variance and the second is standard deviation. First, variance. How do we define variance? Variance is a measure of dispersion. For an interviewer this can also be a question to confuse candidates with, and they may phrase it in different ways. But when I say dispersion, dispersion basically means spread; please make sure you remember this word. It tells you how spread out your data is around the mean. Say I have two data sets. The first is 1, 1, 2, 2, 4; what is the mean? 10 divided by 5 is 2. Now let's consider that I have
another distribution which looks like this: 2, 2, 2, 2, 2. If I find the average, it is also 2. So for both distributions I am getting the same mean. If I am getting the same mean, how do I identify that these two distributions are different? We need to think about it. An interviewer may ask you what variance is in the context of dispersion, and may well confuse you. If you want to identify how two distributions differ, that is when we use variance and standard deviation.

Now let's understand the formulas for variance and standard deviation. Here I will talk about two different things, and one very important interview question comes from this: one is population variance and the other is sample variance. Why I'm teaching you both will make sense later. Population variance is denoted by sigma squared: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$. Sample variance is denoted by small s squared: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where x bar is the sample mean. Many people will ask: why n minus 1? Yes, this is an interview question; I will talk about it, don't worry. Let's take one good example and solve it. Let my x values be 1, 2, 2, 3, 4, 5; this is my data set. First things first, we'll go and
calculate. Treating this as a population, we calculate mu, which is 2.83. Next, from the equation, I calculate x minus mu for each value; it's good that you do the calculation yourself. For 1, I get minus 1.83; for each 2, minus 0.83; for 3, plus 0.17; for 4, 1.17; and for 5, 2.17. The next step is squaring, that is, (x minus mu) squared. The square of minus 1.83 is about 3.35; then 0.6889, and again 0.6889; and for the remaining ones, 0.03, 1.37 and finally 4.71. Then we add these up, because the formula has a summation. The sum is about 10.83, and 10.83 divided by the six values is about 1.81.

Now understand one thing. Suppose I have one data set that looks like this, and another data set that looks like this. Comparing the two, where do you think the variance is greater? Whenever variance comes to mind, you should be thinking about spread. So in the second picture the variance is definitely higher. Say my variance here is 1.81, and tomorrow I get 5.45: can I say it may belong to this second distribution? Yes, its variance is higher because the spread is quite high. When we say the spread is high, that basically means
the elements are less concentrated in the central region; whenever I talk about more variance, that means the data is more dispersed. Let me show this to you so that you can understand. Forget about the standard deviation calculation for now and look at this image. Here the standard deviation is 10, and here the standard deviation is 50. The standard deviation formula is nothing but the square root of the variance. You can see that when the standard deviation is smaller, you have a tall, narrow curve, which means the data is not spread out much. When you have a big standard deviation like 50 or 60, your data is widely spread. This is important for understanding why the variance is larger for dispersed data. Later I'll show you some problem statements and we'll solve an example, but for now just understand this graphically. You already have some idea: if the variance is high, the dispersion is high, because the values cover a wider range around the mean.

Now, I got my variance as 1.81. My standard deviation is the square root of the variance, so it is the root of 1.81. If I open my calculator and take the root of 1.81, I get 1.345. Now see what the standard deviation means. What is the mean in this case? It is 2.83. Let's consider this one: the mean is
2.83. From this mean your data is distributed, because the mean specifies your measure of central tendency; it says where the center of the distribution is. If I go one step, one standard deviation, to the right, I get 2.83 plus 1.34, which is 4.17. That means whatever elements lie between 2.83 and 4.17 fall within the first standard deviation to the right. If I do the same thing towards the left, I subtract 1.34, so this becomes 1.49. So any element that falls between 1.49 and 2.83 lies within one standard deviation to the left. Similarly for the second standard deviation: 4.17 plus 1.34 is 5.51, and you can calculate the rest in the same way. Here the standard deviation is a small number, and if I construct a graph it will look something like this. This shape is called a bell curve. Based on the variance and standard deviation you can decide two important things: with the variance you understand how the data is spread, and with the standard deviation you understand what range of data falls within one standard deviation to the right and to the left of the mean. So standard deviation is nothing but the root,
the square root, of the variance. It tells you, from the mean, how far an element can be. Consider 5: if you calculate, it falls somewhere here. How are you going to describe 5? You will say that it falls about 1.6 standard deviations from the mean. So you are using the standard deviation as a unit to say how far a specific number is from the mean. And variance talks about spread: if the variance is high, the spread of the data is high.

Now let's understand percentiles and quartiles. This is the first step towards finding outliers, and you can also do it with code. Before understanding percentiles you need to understand percentage. Suppose I have a distribution 1, 2, 3, 4, 5, and my question is: what percentage of the numbers are odd? The formula is: percentage = (number of values that are odd) / (total number of values). The odd numbers are 1, 3 and 5, so 3 divided by 5 is 0.6, which is 60 percent. Very simple; I hope everybody knows this. Now let's understand a very important topic, percentiles. You have probably heard about percentiles if you have taken exams such as GATE, CAT, GMAT or SAT. I'll show you one real-life example, which is related to my YouTube ranking also.
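Looking back at the variance and standard deviation worked out by hand for the values 1, 2, 2, 3, 4, 5, the arithmetic can be checked with a short sketch like the one below. The variable names are illustrative only. The unrounded results differ slightly from the hand calculation, which rounded each intermediate step.

```python
import math

x = [1, 2, 2, 3, 4, 5]
n = len(x)

mu = sum(x) / n                               # mean, about 2.83
squared_devs = [(v - mu) ** 2 for v in x]     # (x_i - mu)^2 for each value

pop_var = sum(squared_devs) / n               # population variance: divide by N
sample_var = sum(squared_devs) / (n - 1)      # sample variance: divide by n - 1
pop_std = math.sqrt(pop_var)                  # standard deviation = sqrt(variance)

# How many standard deviations the value 5 lies from the mean
z_of_5 = (5 - mu) / pop_std

print(f"mean = {mu:.2f}")
print(f"population variance = {pop_var:.3f}, sample variance = {sample_var:.3f}")
print(f"population std dev = {pop_std:.3f}")
print(f"one std dev around the mean: {mu - pop_std:.2f} to {mu + pop_std:.2f}")
print(f"5 is {z_of_5:.2f} standard deviations above the mean")
```

Dividing by n - 1 instead of N always gives a slightly larger value, which is the only difference between the two variance formulas.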
If you look at my YouTube ranking on Social Blade, you can see the education rank; hovering here shows the 96.1 percentile, here the 94.98 percentile, and here the 94.958 percentile. So let's discuss percentiles. First of all, the definition: a percentile is a value below which a certain percentage of observations lie. If I say a number is the 25th percentile, it means that 25 percent of the entire distribution is less than that value. Let me take a good example. Suppose I have a data set with the elements 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12. My question is: what is the percentile rank of 10? We solve this with a simple formula. Let x = 10. Then percentile rank of x = (number of values below x) / n × 100, where n is the sample size. Counting the elements one by one, n is 20. How many values are below x = 10? Counting them gives 16.
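The counting above can be sketched in a few lines. The function name `percentile_rank` is illustrative, and the data list is a reconstruction from the transcript (an assumption): it has the 20 values and the 16 values below 10 that the lecture counts.

```python
def percentile_rank(data, x):
    """Percentage of values in data that are strictly below x."""
    below = sum(1 for v in data if v < x)
    return 100 * below / len(data)

# Reconstructed example data (assumed): 20 values, 16 of them below 10
data = [2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12]

print("n =", len(data))
print("values below 10 =", sum(1 for v in data if v < 10))
print("percentile rank of 10 =", percentile_rank(data, 10))
print("percentile rank of 11 =", percentile_rank(data, 11))
```

Note that this definition counts only values strictly below x; some textbooks also count half of the values equal to x, which gives slightly different ranks.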
So this becomes 16 divided by 20, multiplied by 100, which is 80. So the percentile rank of the value 10 is 80. Now understand the meaning, and please listen carefully: 80 percent of the entire distribution is less than 10. Now quickly, what is the percentile rank of the value 11? 17 elements lie below 11, so 17 divided by 20 multiplied by 100 is 85. Now let's do the reverse: in this distribution, what value sits at the 25th percentile? For this you use a simple formula: index = (percentile / 100) × (n + 1). I'm not going to derive why it is n plus 1 here; for the sample variance I'll discuss why it is n minus 1. What matters is understanding what we are doing and how we use it. Here it is 25 / 100 × 21 = 5.25. Understand that 5.25 is the index position, not the value. So I go and find position 5.25: first, second, third, fourth, fifth, and 5.25 lies between the fifth and sixth elements. There is no element at that position, so we take the fifth and sixth elements and average them. In this case my answer is 5.
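A minimal sketch of the index-position method just described: compute the 1-based position (p / 100) × (n + 1), and if it falls between two elements, average them. The function name is illustrative and the data list is a reconstruction from the transcript (an assumption). Libraries such as NumPy interpolate between neighbours by default rather than taking a plain average, so their results can differ slightly.

```python
import math

def value_at_percentile(data, p):
    """Value at the p-th percentile using the (n + 1) index-position method."""
    ordered = sorted(data)
    n = len(ordered)
    pos = p / 100 * (n + 1)                    # 1-based index position, may be fractional
    lower = min(max(math.floor(pos), 1), n)    # element just below the position
    upper = min(max(math.ceil(pos), 1), n)     # element just above the position
    # When pos is a whole number, lower == upper and the element itself is returned
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

# Reconstructed example data (assumed): 20 values
data = [2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12]

print("index position for p=25:", 25 / 100 * (len(data) + 1))
print("25th percentile value:", value_at_percentile(data, 25))
print("75th percentile value:", value_at_percentile(data, 75))
```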
So 5 is the value at the 25th percentile. Now try to find the 75th percentile. 75 / 100 × 21 = 15.75 is the index position. Count from the top: 1, 2, 3, ... 15. Position 15.75 lies between the 15th and 16th elements, so I take the average of those two numbers, and my answer is 9.

Now let's discuss a new topic, the five-number summary. It consists of: first, the minimum; second, the first quartile, denoted Q1; third, the median; fourth, the third quartile, denoted Q3; and fifth, the maximum. We will use these values to remove outliers. Let's take one example and see how the five-number summary helps us remove an outlier, using something called the IQR. Consider a data set like this: 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27. Which value do you think is the outlier? Obviously you'll say 27. Always understand, guys: whenever we need to remove outliers we need to define a lower fence and a higher fence. The values we keep lie between the lower fence and the higher fence. That means all the numbers above the higher fence are outliers, and all the numbers below that particular
number, below this lower fence, are also treated as outliers. So we need both a higher fence and a lower fence. If I had one element, minus 50, is it an outlier for this distribution? Definitely yes: minus 50 is below the lower fence, so it can be treated as an outlier. To define the fences we write simple formulas: lower fence = Q1 − 1.5 × IQR, and higher fence = Q3 + 1.5 × IQR. What exactly is the IQR? It is the interquartile range, given by IQR = Q3 − Q1, where Q3 is the 75th percentile and Q1 is the 25th percentile. Now check this distribution and find the 25th and 75th percentiles. Simple formula: 25 / 100 × (n + 1). Counting the elements, n is 19, so n + 1 is 20, and 25 / 100 × 20 = 5. This 5 is the index position. What is the fifth element? It is 3, so the 25th percentile, Q1, is 3. Similarly, for Q3 you get index 15, and the 15th element is 7, so Q3 is 7. Now the interquartile range is 7 − 3, which is 4.
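The quartile calculation above can be reproduced with the same index-position method. This is a sketch under one assumption: the data list below has 19 values, matching the count n = 19 used above, and the exact values are reconstructed from the lecture (the list includes three 2s, which is what makes the 15th element equal to 7). The helper name is illustrative.

```python
import math

def value_at_percentile(data, p):
    """Value at the p-th percentile using the (n + 1) index-position method."""
    ordered = sorted(data)
    n = len(ordered)
    pos = p / 100 * (n + 1)                    # 1-based index position
    lower = min(max(math.floor(pos), 1), n)
    upper = min(max(math.ceil(pos), 1), n)
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

# Assumed data: 19 values, so that (n + 1) = 20 as in the calculation above
data = [1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27]

q1 = value_at_percentile(data, 25)   # index 5
q3 = value_at_percentile(data, 75)   # index 15
iqr = q3 - q1

print(f"n = {len(data)}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```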
Now we have calculated the IQR, along with Q3 and Q1. Let's compute the lower fence, which is Q1 − 1.5 × IQR. Q1 is 3 and Q3 is 7, as computed above, and the IQR is 4. So the lower fence is 3 − 1.5 × 4 = 3 − 6 = −3. Now let's compute the higher fence, which is Q3 + 1.5 × IQR. Q3 is 7, so 7 + 6 = 13. So my range from lower fence to higher fence is −3 to +13. Now tell me, which value is the outlier? Anything greater than 13 is considered an outlier, and anything less than −3 is considered an outlier. So which number should we remove? We should remove 27, because 27 is greater than 13, the higher fence. Now let me write the distribution once again for all of you, after removing the 27.
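A short sketch of the fence rule, taking Q1 = 3 and Q3 = 7 from the calculation above as given. The data list is the reconstructed 19-value example (an assumption); the filtering result does not depend on the exact count.

```python
# Quartiles taken from the worked example above
q1, q3 = 3, 7
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
higher_fence = q3 + 1.5 * iqr

# Assumed example data, reconstructed from the lecture
data = [1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27]

# Anything outside [lower_fence, higher_fence] is treated as an outlier
outliers = [v for v in data if v < lower_fence or v > higher_fence]
cleaned = [v for v in data if lower_fence <= v <= higher_fence]

print(f"lower fence = {lower_fence}, higher fence = {higher_fence}")
print("outliers:", outliers)
print("cleaned data:", cleaned)
```

The 1.5 multiplier is the conventional choice used in box plots; a larger multiplier would flag fewer points as outliers.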
So the remaining data I have is 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9; 27 is removed because it is an outlier. Now, what is the minimum value of all these numbers? The minimum is 1. Q1, my first quartile, we computed as 3. Then Q3 is 7, and the maximum after removing the outlier is 9. Quickly compute the median and tell me: the median is 5. So here you get your five-number summary: 1, 3, 5, 7, 9.

Now let's draw a plot called a box plot from this data. You have an x axis with values like −2, 0, 2, 4, 6, 8, 10. The minimum element falls at 1; Q1 falls at 3; the median is 5; Q3 is 7; and the max is 9. Now all you have to do is join these lines, and this is your box plot. If I had kept 27 as an element, I would have had to extend the axis this far and put 27 somewhere out here, where it would appear as a single dot. Have you seen this kind of plot? This value is the minimum, this is Q1, this is the median, this is Q3 and this is the max. This technique of removing outliers uses the lower fence, the higher fence and the IQR. It is also used extensively in data visualization, so you really need to know all these things. I have drawn it in front of you, but I can also do this with code: I just have to install a library such as Matplotlib, where you can do all these
things. Now let's come back to the sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. Why do we divide by n minus 1? This is called Bessel's correction, and it is related to degrees of freedom. I have made a video in my stats playlist on why the sample variance is divided by n minus 1; you can search for that and understand it. Now, one interview question that may come up: what is the application of a box plot? Box plots can be used to identify outliers. As I told you, if I had included 27, that element would have appeared out here, so a box plot gives you a visual way to see where an outlier is present. If someone asks how you determine an outlier, you can explain this entire concept with percentiles.

Hello guys. Regarding the agenda, we're going to discuss a lot of distributions. First the normal or Gaussian distribution, then the standard normal distribution, then an example on z-scores, including the z-table and why z-scores are used. Then we will discuss the log-normal distribution, the Bernoulli distribution and finally the binomial distribution, and we'll solve some examples. Then whatever practical part we have not covered till now will get covered: mean, median and mode with the Python programming language, and also variance and standard deviation. Third, we'll try to create histograms,
we'll try to create PDFs, probability density functions, we'll see how a normal distribution looks in code, and we'll find the IQR using code, along with some examples of the log-normal distribution. I can also discuss bar plots and violin plots; whatever comes up, we'll discuss.

First things first: today we are going to discuss distributions. What exactly is a distribution? Say I have a data set of ages like 24, 26, 27, 28, 30, 32. When we have such a data set, the first thing we need to focus on is how to see it in a visual way. This is numerical data, and in this case I'm going to treat age as continuous data. For continuous data, what kind of graphs help you understand the data? If I want to start my analysis, I need to look at a lot of visualizations. There are multiple ways to visualize this data through various graphs, and these graphs play a very important role whenever we are creating reports or doing exploratory data analysis. So let's move to distributions. Suppose I have a distribution of data and I want to plot it in some way. The best and the easiest way that you can
probably think about is your histogram, right? We have already seen how to create histograms: you get bars like buildings, and finally you smooth the histogram to get a curve, and this curve here looks like a bell curve. With that, let's go to the first distribution, the Gaussian or normal distribution. Why do we have these different kinds of distributions? So that we can have some idea about a data set. When we discuss the Gaussian or normal distribution, most of the time you have seen it as a bell curve. There is very important information in this bell curve. The center line that you see is your mean; it is also your median and your mode. One important property of this bell curve is that one side is exactly symmetrical to the other side. There are many topics in inferential statistics about this bell curve, the Gaussian distribution, that we will be discussing in the future. Since the right part of the curve is symmetrical to the left part, the amount of data present in one part is equal to the amount of data present in the other part. So this forms a bell curve, and whenever we
have a specific distribution which follows this kind of bell curve, we can say it is a normal or Gaussian distribution. Why are we focused on this distribution? Because from it we can derive a lot of conclusions, which I'll talk about now. Let's draw the distribution once again. This center is the mean, median and mode. Then you can go one step to the right, a second step to the right, and a third step to the right. What is each step called? A standard deviation: one standard deviation to the right, two standard deviations to the right, three standard deviations to the right. Similarly I have one, two and three standard deviations to the left. This is very important, guys. What can we conclude from this kind of graph? This side is symmetrical to that side. If I draw these lines, this is my first standard deviation to the right and this is my first standard deviation to the left, and between them is the region of the first standard deviation. The center I can write as mu; to the right come mu + sigma, mu + 2 sigma and mu + 3 sigma; similarly, to the left I can write mu − sigma, mu − 2 sigma and mu − 3 sigma. The first thing we come to is called the empirical rule. Now this is
very important. The empirical rule is the 68, 95, 99.7 percent rule. What does this mean? Let's go with 68 first. Let's consider that I have a data set with 100 data points. The rule says that within the first standard deviation, in this entire region, around 68 percent of the distribution is present. That means that out of these 100 data points, about 68 will be in this region. That is the reason it looks like a bell curve: in the central area you have a lot of data. Then within two standard deviations, which is this region, around 95 percent of the entire data lies. And similarly, within three standard deviations, from here to here, around 99.7 percent of the entire distribution falls. That is why it is called the 68, 95, 99.7 percent rule. So everybody is clear? It means that if you have a distribution which is Gaussian or normally distributed, you can definitely conclude how much data falls within the first, second and third standard deviations. Now let's see some examples. If I talk about
height, for example: height is normally distributed. Who is saying this? I am not saying it; the domain expert is saying it. Who is the domain expert in this case? A doctor. Doctors have taken various samples from different places, and whenever they constructed the curve it formed a bell curve like this, and from that they were able to derive how much data falls within the first, second and third standard deviations. Second example: weight will also follow a Gaussian distribution. Third, I hope everybody knows about the iris data set. In the iris data set, petal length and sepal length follow a Gaussian distribution; I will show you practically, don't worry. Does following the empirical rule necessarily imply that the data is normally distributed? See, whenever you have Gaussian-distributed data, it will follow this 68, 95, 99.7 percent rule. So that was Gaussian or normally distributed data. Now let's take an example. Suppose I have a data set where my mean is 4 and my standard deviation is 1. With these two pieces of information, can I construct a distribution? Suppose this is 4; then the next steps will be 5, 6, 7, 8, and then 3, 2, 1 and 0 on the other side. So I will be able to create this, and let's consider that it follows this kind of bell curve. Understand that the middle one is your mean: the mean is 4 and the standard deviation is 1.
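The 68, 95, 99.7 percent rule can be checked numerically. The following is a minimal sketch, assuming the example parameters from the session (mean 4, standard deviation 1) and using `NormalDist` from Python's standard `statistics` module; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

# Assumed parameters, taken from the example: mean 4, standard deviation 1.
dist = NormalDist(mu=4, sigma=1)

def share_within(k: int) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    # cdf(x) is the area to the left of x, so the difference is the area in between.
    return dist.cdf(upper) - dist.cdf(lower)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The shares come out the same for any mean and standard deviation, which is why the rule can be stated once for every normal distribution.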
Now see one thing, guys. My question is: where does 4.5 fall in terms of standard deviations? Since 5 is one standard deviation to the right, 4.5 is plus 0.5 standard deviations to the right. Now similarly, where does 4.75 fall? With 4.5 it was easy to see, but for values like 4.75 it gets more difficult to do the calculation in your head. That is why we use a concept called the z-score. The z-score tells you how many standard deviations away from the mean a value is. The formula is x_i minus mu, divided by the standard deviation. For 4.75, I write 4.75 minus mu, and mu is 4, divided by the standard deviation, which is 1, so I get 0.75. So 4.75 is 0.75 standard deviations to the right. Why right? Because this is a positive value. Now the same question for 3.75: apply the same formula. The z-score is 3.75 minus 4, divided by 1, which is minus 0.25. Whenever a minus comes, you have to look on the left side, so 3.75 falls at minus 0.25, that is 0.25 standard deviations to the left. Now let's go to the next thing. Suppose I consider this same graph. You understood that whenever I want to know how many standard deviations to the right or the left I need to
look, I can use the z-score. Let's use the same bell curve: this is my 4, this is my 5, this is my 6, this is my 3, this is my 2, this is my 1. You know that my mean is 4 and standard deviation is 1. Now let's apply the z-score to every value and see what happens. The z-score formula is x_i minus mu, divided by the standard deviation. Initially my distribution was 1, 2, 3, 4, 5, 6, 7. After applying the z-score, what will my distribution be? Apply it to 1 first: 1 minus 4, divided by 1, is minus 3, so 1 gets converted to minus 3. For the next element, 2 minus 4 divided by 1 is minus 2. For 3, 3 minus 4 divided by 1 is minus 1, so 3 gets converted to minus 1. Then 4 gets converted to 0, and 5, 6 and 7 get converted to 1, 2 and 3.
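The z-score arithmetic above is small enough to sketch directly. This is an illustrative snippet, not code from the session; the function name `z_score` and the constants `MU` and `SIGMA` are assumptions chosen to match the example (mean 4, standard deviation 1).

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies from the mean (negative = left of it)."""
    return (x - mu) / sigma

MU, SIGMA = 4, 1  # mean and standard deviation assumed in the example

print(z_score(4.75, MU, SIGMA))  # 0.75  -> 0.75 standard deviations to the right
print(z_score(3.75, MU, SIGMA))  # -0.25 -> 0.25 standard deviations to the left

values = [1, 2, 3, 4, 5, 6, 7]
z_values = [z_score(v, MU, SIGMA) for v in values]
print(z_values)  # [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
```

The sign carries the direction: positive means to the right of the mean, negative to the left.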
Now understand the main magic of the z-score. Are these values not exactly the number of standard deviations each element is from the mean? Initially my data set was like this, and after applying the z-score I got this: this element falls at minus 3 standard deviations, this element falls at minus 2 standard deviations, and so on. One beautiful thing is happening here. I had a distribution which was 1, 2, 3, 4, 5, 6, 7; after I applied the z-score, it got converted to minus 3, minus 2, minus 1, 0, 1, 2, 3. Initially this was a normal or Gaussian distribution. After applying the z-score, what is the distribution we get called? It is called the standard normal distribution. The most important property of the standard normal distribution is that the mean is 0 and the standard deviation is 1. Is this property satisfied here? It is, right? So I can write that a random variable x or y belongs to the standard normal distribution, where the mean is 0 and the standard deviation is 1.
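To see the "mean 0, standard deviation 1" property on actual numbers, here is a small sketch that standardizes a data set with its own mean and standard deviation. The `heights_cm` values are hypothetical, and `standardize` is an illustrative helper built on the standard `statistics` module, not something shown in the session.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Apply the z-score to every value, using the data's own mean and standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

heights_cm = [150, 160, 165, 170, 175, 180, 190]  # hypothetical sample
z = standardize(heights_cm)

print(round(fmean(z), 10))   # mean of the standardized values
print(round(pstdev(z), 10))  # standard deviation of the standardized values
```

One caveat worth keeping in mind: the z-score only shifts and rescales the data, it does not change its shape. The result is a standard normal distribution when the original data was normally distributed to begin with.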
So after applying the z-score we get a different distribution, which is called the standard normal distribution. Now the question arises: why do we do this? Let's go ahead with one practical application; we do this in machine learning, in most of the algorithms. Suppose I am solving a machine learning problem statement and my data set has three columns: age, salary and weight. Now understand one thing: age we measure in years, salary in rupees or dollars, weight in kgs. These are units of measurement. Age may have values like 24, 25, 26, 27; salary may be 40k, 50k, 60k, 70k; weight may be 70 kg, 80 kg, 55 kg, 45 kg. In this data you can see the units are completely different. Our main target should be to bring every column into a form where the mean is 0 and the standard deviation is 1. That means I can take this entire column and apply the z-score to convert it to the standard normal scale, and similarly take the next column and do the same. This process is called standardization. Super important. Many people will talk about normalization; I'll talk about the difference between standardization and normalization. Whenever we talk about standardization, in short, internally the z-score formula is getting applied. So standardization is a process where I convert a
distribution into a standard normal distribution, whose property is that the mean is 0 and the standard deviation is 1. Now let's go ahead to something called normalization. In standardization, we convert to mean 0 and standard deviation 1. In normalization, you have an option: you say that I want to shift all the values that I have to lie between 0 and 1. How do we do normalization? There is a very important technique called the min-max scaler. In the min-max scaler you just provide the range 0 to 1 and the normalization happens automatically, and yes, I will show you practically also, don't worry. If I want to shift the values between minus 1 and plus 1 instead, I can do that too. So normalization gives you a process where you define the lower bound and upper bound and convert your data to lie between them. Now, very important: where do we use normalization? I hope everybody knows about deep learning. In a CNN, whenever you are doing image classification or object detection, understand that every image has pixels. Suppose I have a 4 cross 4 image. Each and every pixel ranges between 0 and 255. Before we start training, this can be passed through a min-max scaler so that it gets converted to between 0 and 1, where the minimum value 0 stays 0 and the maximum value 255 is converted to 1.
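Min-max normalization as described here can be sketched in a few lines of plain Python. This is a hedged illustration rather than the scaler used in the session: `min_max_scale` is a hypothetical helper, the salary values follow the 40k to 70k example, and the pixel values are made up.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Linearly rescale values so the smallest maps to `lower` and the largest to `upper`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale values that are all identical")
    return [lower + (v - lo) * (upper - lower) / (hi - lo) for v in values]

salaries = [40_000, 50_000, 60_000, 70_000]
scaled = min_max_scale(salaries)                  # range 0 to 1
scaled_pm1 = min_max_scale(salaries, -1.0, 1.0)   # range -1 to +1

pixels = [0, 64, 128, 255]                        # made-up 8-bit pixel values
scaled_pixels = min_max_scale(pixels)             # same as dividing each pixel by 255

print(scaled)
print(scaled_pm1)
print(scaled_pixels)
```

When the data already spans 0 to 255, the min-max formula reduces to a plain division by 255, which is why image pipelines often skip the general scaler.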
So we can apply this kind of min-max scaler or normalization. In this particular case, though, I will not use the min-max scaler formula as such; I will take each and every pixel and divide by 255. When we divide by 255, all the values get changed to between 0 and 1, and this is another type of normalization. So till here we have discussed the min-max scaler, normalization and standardization. Now let's solve one practical example for the z-score. Recently there was India versus South Africa, which India lost, obviously. Let's consider the ODI series; last year the ODI series happened, and this year also it happened. Let's say the series average of 2021 was around 250, the standard deviation of the scores was around 10, and Rishabh Pant's final score was 70. So this is the series information for 2021. Similarly I have data for the 2020 series. Let's say the series average is a little bit different: the series average of the team's scoring in 2020 was 260, and the standard deviation of the scores across all the matches is 12.
And then Rishabh Pant's final score here is 68. My question is: given these two sets of data, in which year was Rishabh Pant's final score better relative to the series? Many people will say 2020, many will say 2021, there will be a lot of confusion, so we will just apply the z-score. For 2021 the z-score is x_i minus mu, divided by the standard deviation, so 70 minus 250, divided by 10. And similarly for 2020 the z-score is x_i minus mu, divided by the standard deviation. You know, these values are not coming out properly, so let me change the data a little bit, otherwise the data will be very, very bad. I'll say average score, not final score. Let's consider that Rishabh Pant's average score in 2021 is 240, and his average score in 2020 is around 245.
Let's consider it like this: 240 and 245. Because I gave a single score, a huge deviation was coming out, so I'm just taking the average score, this player's average score over the series; say it's a three-match series, guys. So let me make the changes and put 240 here. 240 minus 250 is minus 10, divided by 10, so this is minus 1 standard deviation. And the other value now changes to 245, so 245 minus 260 is minus 15, divided by 12, which is minus 1.25. Okay, clear, everybody? Along with the not-out rule and so on, let's say 240 is the average. I know the data is not quite realistic. Instead of Rishabh Pant's average score I could also have used the team's score in the final match; I messed up the problem statement a bit because I was thinking of the final match score. So here I can say team final score, and here also team final score. Again, this is just an example, guys; the main idea is to teach you something that you can apply anywhere. So here I've got minus 1, and here I got minus 1.25. Let me write it down again for you. In 2021 the mean is 250, x_i is 240, and the standard deviation is 10.
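The two-series comparison can be written out as a short sketch. The numbers are the made-up ones from the example (they are not real match data), and the dictionary layout and variable names are only illustrative.

```python
# Made-up series figures from the example: series mean, standard deviation, and the score.
series = {
    2021: {"mean": 250, "std": 10, "score": 240},
    2020: {"mean": 260, "std": 12, "score": 245},
}

# z-score of the score within its own series.
z_by_year = {
    year: (s["score"] - s["mean"]) / s["std"]
    for year, s in series.items()
}

for year, z in z_by_year.items():
    print(f"{year}: z = {z:+.2f}")

# The larger z-score is the score that sits higher relative to its own series.
best_year = max(z_by_year, key=z_by_year.get)
print("relatively better year:", best_year)
```

Both scores are below their series mean, but the raw score of 245 is the larger number while its z-score is the smaller one. Comparing z-scores instead of raw scores is what puts the two years on the same scale.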
If I have this information, can I draw the bell curve? So this is my bell curve. The mean is 250 and the standard deviation is 10, which means this side will come as 260, 270, 280, and this side will come as 240, 230, 220. Where does 240 fall? 240 falls at minus 1 standard deviation, so it falls here. Now in 2020 you know that your mean is 260, x_i, that is your final score, is 245, and your standard deviation is 12. Based on this I can draw another bell curve whose central element is 260. Since my standard deviation is 12, this side becomes 272, 284, 296, and the other side becomes 248, 236, 224. My value sits over here, and its z-score is minus 1.25. Here you can see the area is a little bit less, and here the area is a little bit more. So where do you think India performed better in the final match, in 2020 or in 2021? This information also tells you many things, probably about the pitch conditions: whenever we say the standard deviation is less, that means most of the scores were clustered around the mean. So tell me where India may have performed better. Understand, guys: here the standard deviation is more, here the standard deviation is less. Here the z-score is minus 1, and here the z-score is minus 1.25, which is further below the mean, so relative to its own series the 2021 score is the better one. Now let's go to one more practical example of the z-score. This kind of example comes up most of the time in statistics; it may be asked in exams and in interviews also, and it is a very, very important question. I will take one very
good example and show you how it is done and how you can learn it. Let's come to the stats interview question. Let's say that I have a random variable X with this kind of distribution, 1, 2, 3, 4, 5, 6, 7, and a bell curve which looks like this. My question is: what percentage of scores fall above 4.25? Now understand one thing: where does 4.25 fall? This is my mean, 4, and 4.25 falls just over here. Let's say these are my scores: 1, 2, 3, 4, 5 and so on. From my entire data set I need to find the percentage of scores that are greater than 4.25, which means I am interested in this region to the right. It is a simple question, and now we'll try to understand how we can use the z-score for it. Everybody knows the z-score formula: x_i minus mu, divided by the standard deviation. Here my mu is 4 and the standard deviation is 1. What is my x_i? It is 4.25, so 4.25 minus 4, divided by 1, which is 0.25. What does this mean? 4.25 falls 0.25 standard deviations from the mean. I got this with the help of the z-score. But now what is the next very important thing? Obviously, from this alone we will
not be able to tell what the percentage is. So far I have only got that my z-score is 0.25. I'm interested in this region, so how do I come up with the percentage for it? Understand one thing: this is a symmetrical bell curve, and the entire area under it I can consider as 1. Since I am interested in this one part of the region, I will call it the tail. The remaining portion, from here to here, I will call the body. Now understand one very important thing: the z-score helps you find the area of the body of the curve, and I'll talk about how. Now guys, just think it over: what do you think the percentage of this red region may be? Three numbers are there on the right-hand side, and let's say that the total is seven numbers. If I said three by seven, what is that? It is approximately 43 percent. Now can we calculate the same thing with the help of the z-score? The answer is yes. I have already seen that the z value is 0.25. Let me open something called a z-table, because I want to find the area under the curve. If I search for z table, you will see it in the first link, and I'll go there. See, this is how my curves look,
right? Here, I'll just use another table, because this table does not look right. Okay, let's consider this table. Always remember, there are three types of z-table we can get: one is this type, which I'll discuss again, and one is this type. See, this is the left z-table and this is the right z-table. Just a second, I will show you how to make the readings. My z-score is 0.25, and remember that the z-table gives me the area of the body of the curve. See what it says here: this z-table shows the area to the right-hand side of the curve; use these values to find the area between z equal to 0 and any positive value; for the area in the left tail, look at the left-tail z-table instead. In this case let me take the left z-table, because I want the area of this region, guys, everything to the left. If I get this area, I can just subtract it from 1, 1 minus the left area, to get the area I actually want. Let me explain once again. First of all, come to this diagram: do you want this part or this part? Obviously you want to calculate this right part. But understand one thing: in order to calculate this part, if I get the value of the left part, I can subtract, 1 minus the left area. If I subtract 1 minus the left area, will I not be
able to get the right part? Otherwise you can go directly to the right table. Again, I'm showing you: here you can see 0.25, and this table gives the area from the mean up to this z value. The right table is given and the left table is also given, see over here; you can check both. This table gives you the value between this point and this point, and then you have to work out the remaining part, or subtract, to get what you need. Now I will go to the left table. Why am I looking at the left table? Because this is my right tail, and the remaining body is what the left table gives, so from the entire area, 1 minus this body, I get the tail. Very simple. So this is my left z-table. I go and find the z value in row 0.2 and column 0.05, and I am getting 0.5987. So the area of the body of the curve is 0.5987. Now, to find the tail, I subtract: 1 minus 0.5987 is 0.4013. So what is the percentage of scores that fall above 4.25? It is about 40 percent. Why subtract from 1? It's very simple, guys: this is my mean, and I want the percentage of the distribution in this right part, so I take the whole curve and subtract the left part, and I get this part. So here you get about 40 percent. Now did you understand how important the z-score is? Yes, 0.59 is the area of everything to the left,
this entire region, from 0.25 standard deviations all the way to the left. Now did you see how important this is for interview questions, guys? Why not take it directly from the right table? Understand, guys, the right table does not give this directly: you can see from its graph that it only gives the area from the mean up to z. In the left z-table you get the area of the body, this entire part. So this was an example with respect to z-scores and standardization; all these things we have discussed. Now the next question: in India the average IQ is 100 with a standard deviation of 15. What percentage of the population would you expect to have an IQ lower than 85? First of all let's draw the graph. The mean is 100 and my standard deviation is 15, so on the right I have 115, 130, 145, and similarly on the left I have 85, 70, 55.
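The z-table lookup can be reproduced in code. This is a minimal sketch using `NormalDist` from Python's standard `statistics` module in place of a printed table; its `cdf` method returns the area to the left of a value, which is what the left z-table lists. The variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Scores above 4.25 when the mean is 4 and the standard deviation is 1.
z = (4.25 - 4) / 1
body = std_normal.cdf(z)   # area to the left of z (the "body")
tail = 1 - body            # area to the right of z (the "tail")
print(f"body = {body:.4f}, tail = {tail:.4f}")

# The IQ question: mean 100, standard deviation 15, share with IQ below 85.
iq = NormalDist(mu=100, sigma=15)
below_85 = iq.cdf(85)      # area to the left of 85, i.e. left of z = -1
print(f"share with IQ below 85 = {below_85:.4f}")
```

For a "lower than" question the left area is already the answer, so no subtraction from 1 is needed; for a "greater than" question the tail is 1 minus the left area.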
So I have all these values over here. Now let's go and compute the z-score, the same way as in the example we did earlier, where 4.25 was the value. Here we are taking IQ lower than 85, so it becomes 85 minus 100, divided by 15, which is minus 15 by 15, that is minus 1. So this is my mean and this is minus 1 standard deviation, and the area to the left of it is the area that I want to find. This area is already the left part of the curve, so I will just go and look up 1.0 in the table; this is about 0.84. Let me just compare the answers and select a different z-table so that you get an idea; I'm not able to find the right z-table... yeah, this looks good, I will give you the link. So what am I getting? 0.84134. Understand that this is for plus 1: it is the area from the far left up to plus 1. Now if I subtract, 1 minus 0.84134, that gives 0.15866, so about 15.9 percent have an IQ lower than 85. You may also get a question like IQ between 90 and 120 for the same problem statement; at that point you have to solve it in a slightly different way, but this is just to give you an idea about the area of the body. And yes, the negative sign will not matter: if you say negative, it comes from this side; if you say positive, it comes from that side. Understand that both sides are symmetric. You can also look up minus 1 directly in the table: from the top, minus one point
zero. You get the same thing: minus 1.00 is 0.15866, which is 1 minus 0.84. Now let me do one thing, guys: let me quickly show you Google Colab Pro so that we can have some programming sessions. First of all I'm going to import some libraries: import numpy as np, import matplotlib.pyplot as plt, then %matplotlib inline, and then I'll also import statistics. First things first: how to compute mean, median and mode. Let me load a data set called tips into df, and then I'll say df.head(), so here you can see my data set. If you want the mean, let's say I use np.mean on df['total_bill'], and if I execute this you will see the mean of the total bill. If I want the median, I use np.median on df['total_bill']. You can see some difference between the two; if you see a difference, think that there may be some outliers. For the mode I can use statistics.mode, again on df['total_bill'], and here you go, the mode is 13.42. Now if I want to see a box plot, which is used to see outliers, I pass df['total_bill'], and here you can see my box plot. Does this indicate that there are outliers? Definitely, outliers are present here. And what are these lines? This is the minimum, the 25th percentile, the median, the 75th percentile and the max
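The mean, median, mode and five-number summary can also be sketched without downloading any data set. The snippet below uses only the standard `statistics` module rather than the NumPy calls from the session, and the `total_bill` list is a small hypothetical sample with one deliberately large value, not the actual tips data.

```python
import statistics

# Hypothetical bills; 48.27 plays the role of an outlier.
total_bill = [16.99, 10.34, 21.01, 23.68, 24.59, 13.42, 13.42, 48.27]

mean_bill = statistics.fmean(total_bill)
median_bill = statistics.median(total_bill)
mode_bill = statistics.mode(total_bill)   # most frequent value

# Quartiles; method="inclusive" interpolates linearly between data points.
q1, q2, q3 = statistics.quantiles(total_bill, n=4, method="inclusive")

# The five numbers a box plot is drawn from.
five_number_summary = (min(total_bill), q1, median_bill, q3, max(total_bill))

print(f"mean = {mean_bill:.3f}, median = {median_bill:.3f}, mode = {mode_bill}")
print("min, Q1, median, Q3, max:", five_number_summary)
```

The mean landing above the median is the kind of gap that hints at large values pulling the average up, which is the cue for checking a box plot.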
So all these things we have calculated. In sns there is something called distplot, which helps you create a histogram of a specific feature. If I execute it, you will see a plot like this. Is this normally distributed data? I guess not. If you want to see the probability density function, use kde=True. With kde=True, does this look normally distributed? No, it is a little bit skewed towards the right. I'll also show you some examples of normally distributed data. For that I will use sns.load_dataset with the iris data set. The iris flower data set gives you data about different types of iris flowers. If you look at df1.head(), you have species like setosa and versicolor, and four features: sepal_length, sepal_width, petal_length and petal_width. Let me plot the same thing with one of the features, say sepal_length from df1. Does this follow a Gaussian distribution? No, I guess. Let's try with sepal_width. Wow, this follows a Gaussian distribution; we can definitely say that. And here you can also apply the 68, 95, 99.7 percent rule and check it out. I'll also show you how to construct this PDF as we go ahead. Okay, so this was one example of normally distributed data. This next one is not normally distributed, you know: sns.countplot of df,
if I use countplot with respect to species (the spelling of species was wrong, and it should be df1, again I was writing df), what is this plot, guys? This is a bar plot or bar graph, whatever you want to call it. Okay, percentiles. For percentiles I can use np.percentile on my df1; let me open one example that I had written for you. Let's say I'm going to use sepal_length, and here I can give some parameters: let's say I want the 25th percentile and the 75th percentile. If I execute it (sepal_length here, and this is df1), you can see that I'm getting 5.1 as the 25th percentile and 6.4 as the 75th percentile. So my IQR will be 6.4 minus 5.1, which is 1.3. If you want the 99th percentile as well, you can write 99 in place of 75, and here you will get the values 5.1 and 7.7.

Hello guys, how are you all? I hope everybody is doing well. So let's start. What are we going to do today? First of all, we are going to implement the IQR using Python. The second topic is probability. The third is permutation and combination. Once we finish that, the fourth thing we are going to discuss is confidence intervals. Then, if we get time, we will cover the p value, and then we will start with hypothesis testing. Now, first of all I am going to start with Google Colab; you can also open Google Colab. I will just make a new notebook. First we'll try to implement the z score and find the IQR, and with respect to that we will see what else we can implement. Other distributions will also come, don't worry: Bernoulli, binomial, power law distribution, everything will be discussed. First let's go in the specific order I have decided, and
when those distributions come up, we'll discuss them. Okay, here you go. In this session we are first of all going to discuss outliers. First I am going to import some libraries: import numpy as np, import matplotlib.pyplot as plt, and then %matplotlib inline. I'll execute this. The next thing is to define our dataset. You can take any dataset you want; you can define your own. For my sake I have created one dataset over here. Can you name some numbers in it that look like outliers? Now, the first thing we are going to do is use the z score to find some outliers. How do you find outliers using the z score? Let me explain. You know about the normal distribution; we have discussed so many things about it till now. We know that this is the mean, and these are the first, second and third standard deviations to the right, and the first, second and third standard deviations to the left. You know that 68 percent, 95 percent and 99.7 percent of the data fall within them. So if my data is normally distributed, can I consider that data beyond the third standard deviation are probably outliers? Yes or no? Just think it over. Most of the time, values beyond the third standard deviation are kind of outliers. So it can be treated as an outlier, right, if the data is present after the
third standard deviation. So first we'll try to implement this. First of all let me make a list; I'm calling it outliers, and I'm going to put all the outliers inside it. And how do we find how many data points fall within the third standard deviation? By using the z score. So here I'm going to create a function, def detect_outliers, and I'm going to give it my data. First I will create a threshold. My threshold will be three standard deviations: anything that falls more than three standard deviations away I will flag. I hope everybody remembers the formula for the z score. It is x_i minus mu, divided by the standard deviation. We also sometimes write this formula with a division by root n, but I'll talk later about why I'm not specifying root n here. For now I'll just use this formula, and I have to implement it in Python. Obviously I need to compute the mean and the standard deviation. You know how to compute the mean, right? I will say mean = np.mean and give it my data points. Then for the standard deviation I can write np.std of that data. So I have got my mean and standard deviation. Now, for each and every point in my dataset I will apply the z score formula. So I'll say for i in data, and I can say z
score is equal to i (i is my x_i, the data point) minus the mean, divided by the standard deviation. This is my z score formula, and I'm computing the z score for every item. The z score tells you how many standard deviations a point is away from the mean. So I can write one condition to check whether it falls within the third standard deviation or not. I can use np.abs, which gives us the absolute value of the z score, and I'll say: if np.abs(z_score) is greater than threshold. I have already defined threshold over here. (Oh sorry, it should be dataset here, I'm extremely sorry.) Now tell me, if np.abs(z_score) is greater than threshold, what does this mean, and what should I do? You want more clarity on the screen? I think now it is fine. It means that the point is an outlier, because it falls beyond the third standard deviation, on one side or the other. Since I have created a list, I'll write outliers.append. I'm not going to append the z score value; I will append the i value, because i is the element from the dataset. (Sorry, the list is called outliers, so it is outliers.append(i).) Then finally I'm just going to return outliers. Let's see whether it will work or not; I'm also trying it for the first time. So this function has been executed. Threshold three defines our third standard deviation; beyond the third standard deviation I can
say that a point is an outlier. If you want to check how this distribution looks, I can write plt.hist on the dataset. 'plt is not defined', why? Okay, this should be plt. It's okay whether it is normally distributed or not; I am just trying to see it. There are some definite outliers. Let's see whether we are able to detect them. What is wrong here? I passed dataset, but the for loop uses data. It is simple, right, guys? Did everybody understand this function? Oh sorry, this should be data; data is what I'm passing in. And see, threshold here is my third standard deviation. If you want the dataset, I can paste it in the chat; I've already given it to you all. Now I have executed this. Next I am going to call detect_outliers and pass the dataset. np.abs, by the way, means np.absolute, the absolute value function. Once I execute it, you will see that it returns these three outliers. Are these my outliers or not, guys? The for loop is very simple: for i in data, for every value in the list I'm finding the z score and checking whether its absolute value is greater than 3. If it is, I am considering it an outlier. Here you can see all the outliers; here the outliers are the very big numbers. If you have not attended the previous session, guys, you can drop off, because you will not be able to follow; this is a seven-day live session. So now I have got the outliers. This is one way we can use the z score, an actual example of it, so I'm going to label this section 'z score computation', and we are done with it.

Now let's go towards the IQR. IQR basically means
interquartile range. So what type of code will I be writing for the interquartile range? Always understand what we do in IQR. First of all we need to find Q1; Q1 is the 25th percentile. Then we have Q3; Q3 is the 75th percentile. If I subtract the 25th percentile from the 75th percentile, I get the IQR. And then in IQR what we really need to find are the lower fence and the higher fence. So how do I write the code? The theory is already explained, so I'll write down all the steps that are required. The first step is to sort the data. The second step is to calculate Q1 and Q3; Q1 and Q3 are pretty important here. The third step is to find the IQR, which is nothing but Q3 minus Q1. Then we need to find the lower fence. I hope everybody knows the lower fence formula: it is Q1 minus 1.5 multiplied by IQR. Then find the upper fence; here I will use Q3 plus 1.5 multiplied by IQR. So these are the steps I plan to follow in order to find the outliers with the help of the IQR, and based on them I will implement it. Now, first of all, how do I get the sorted dataset? I will just say this will be my dataset, and I can use the sorted
function: if I give the dataset to sorted, the result is my sorted dataset. sorted is a built-in function that sorts all the numbers. So now I have a dataset that is completely sorted, and my first step is done. For the second step I need to calculate Q1 and Q3. I will say q1, q3 = np.percentile, give it my dataset, and along with it two values, 25 and 75. Once I execute it, I am going to print q1, q3. Here you can see my Q1 and Q3: this is my 25th percentile and this is my 75th percentile. Now let's go ahead and compute the lower fence and the higher fence. I'm writing the comment 'find the lower fence and higher fence'. lower_fence is equal to q1 minus 1.5 multiplied by iqr, and before that I need to compute the IQR: iqr is equal to q3 minus q1. If I print iqr (what is this error that came up? okay, now it executes), you will see that the IQR is three. That gives my lower fence. For the higher fence I will write higher_fence is equal to q3 plus 1.5 multiplied by iqr. Once I execute it, I know my lower fence and higher fence, so I'm going to print lower_fence and higher_fence: it is 7.5 to 19.5. The further part I think you can comfortably do yourselves: based on this lower fence and higher fence you can write a condition and remove all the elements that fall outside. And don't worry here about whether the data is normally distributed or not. Whatever dataset you are getting, what you can do is, you can
find the lower fence and higher fence and do this. Now, instead of doing all these things, I can import seaborn as sns and execute it, and there is an option called histplot, no, sorry, boxplot. We also saw earlier how to create a box plot. (If the lower fence is negative, then based on that condition you can still remove any value lesser than it.) Here, if I give my dataset, you can see how the box plot is drawn. You see that there are some very big outliers, the same outliers we found with the other methods. And here also you can see 7.5 to 19.5; most of your data points lie in this range. If I remove those three elements and look at the dataset again, the box in this box plot will look bigger.

Now let's go ahead and discuss the next topic, which is probability. Probability is super important, and in this session I will discuss the major things in probability and we will see what we can do with it. Probability is used by default in machine learning, in deep learning, in many places. Let's take one example. Suppose I have one category of data points and another category of data points, and I try to create a best fit line between them. Let's say these belong to class A and these belong to class B. When I draw this straight line, as is done in linear regression, my question is: for a point that the line passes through, what is the probability that it belongs to class A, and what is the probability that it belongs to class B? So based on probability we can get a lot of things. In linear regression it is used, in logistic regression it is used, and so
probability really matters; Bayes is also used over there, and different things are used. Let's understand what exactly probability is. If you want a definition: probability is a measure of the likelihood of an event. The reason I am writing all these definitions, guys, is that you really need to think about what exactly is happening, and you can remember a definition easily through an example. That is why I also give you a lot of examples. Let's say I am rolling a dice (I said flipping, but flipping is for a coin; here I'll say roll a dice). In a dice, what are my possible outcomes? You know that they are one, two, three, four, five, six. Now if I ask you, what is the probability of getting a 6 when I roll a dice, how will you calculate it? Obviously you will say one by six, it's very simple. So how do we define probability? Number of ways an event can occur, divided by number of possible outcomes. This is the exact definition. In this scenario I am trying to find the probability that I get a six when I roll a dice. In how many ways can that event occur? Only one. And what is the total number of possible outcomes? Six. So this is how we find it. Similarly, one more example: let's say I want to toss a coin. I know my sample space: head and tail. What is the probability of getting head? You will say 1 by 2, because the sample space has 2 outcomes and the number of ways the event can occur is 1, so it is 1 by 2.
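As a minimal sketch of the definition above (number of ways an event can occur divided by number of possible outcomes), assuming every outcome is equally likely. The helper name probability is hypothetical and only used for illustration; it is not something from the lecture notebook.

```python
from fractions import Fraction


def probability(event, sample_space):
    """Favourable outcomes divided by total outcomes (assumes equally likely outcomes)."""
    favourable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favourable), len(sample_space))


die = [1, 2, 3, 4, 5, 6]   # sample space for rolling a dice
coin = ["head", "tail"]    # sample space for tossing a coin

p_six = probability({6}, die)          # one favourable outcome out of six
p_head = probability({"head"}, coin)   # one favourable outcome out of two

print(p_six, p_head)  # 1/6 1/2
```

Fraction is used here so that the results stay as exact ratios like 1/6 instead of rounded decimals.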
So you say probability of head is one by two. Now let's go one step further in probability, to the next topic, which is called the addition rule. This is super important; in your aptitude tests you will be using this addition rule. We also call it the probability 'or' rule. Now in order to understand the addition rule you need to understand two things. One is mutually exclusive events. What are they? Two events are mutually exclusive if they cannot occur at the same time. Let's see an example: rolling a dice. When I roll a dice at a specific time, I can get either 1 or 2 or 3 or 4 or 5 or 6. You cannot get 1 and 2 at the same time, or one, two, three, four at the same time. In one experiment, one single roll of the dice, you will only be able to get one number; you will not be able to get two numbers. So this is an example of mutually exclusive events. Another example is tossing a coin. In this case also you may get either head or tail; you cannot get both, unless and until your coin lands standing on its edge like it is shown in the movies. Which movie am I talking about? You can think of good movies like Sholay. So only one type of event occurs each time. Now let's discuss non-mutually exclusive events; you have understood what mutually exclusive is. Now with respect to
non-mutually exclusive events, obviously two or more events can occur at the same time. Let's take a very simple example: a deck of cards. Let's consider that when I pull out a card from a deck, a king can come, or let's say a queen can come, and along with the queen, a red heart can also come. So here multiple events are there, and these two events are obviously not mutually exclusive. I can also pick a king, and it can be black in colour or it can be red in colour; multiple things are happening. So this is a perfect example of non-mutually exclusive events. Now based on this there are some amazing problem statements that you can solve. First, mutually exclusive. My first question is: if I toss a coin, which is again a mutually exclusive case, what is the probability of the coin landing on heads or tails? Whenever you get this kind of problem statement, first of all you need to think about whether the events are mutually exclusive or not. Yes, obviously here they are. Now I need to find the probability of getting heads or tails. I want to give a common definition for this. We can write probability of A or B, where A and B are events, is equal to probability of A plus probability of B. Whenever you have mutually exclusive events, you can use this definition, which is called the addition rule for mutually exclusive events. Now here, for probability of A and probability of B, you know
that it is 1 by 2 plus 1 by 2, so the answer will be 1. So the probability of A or B coming up is one. These are some very important things: in exams you will be getting this, in aptitude tests also, in multiple places. Let's take one more example. Suppose I roll a dice; what is the probability of getting one or three or six? Yes, many people are saying it right, it is one by two. I can write this as probability of 1 plus probability of 3 plus probability of 6. Each of these is 1 by 6, so 1 by 6 plus 1 by 6 plus 1 by 6, which is 3 by 6, which is 1 by 2, which is 0.5. So you can easily solve it. This is what we have discussed for mutually exclusive events. Now let me take the next problem statement, for non-mutually exclusive events, with a very good example again. The question is very simple: you are picking a card randomly from a deck. What is the probability of picking a card that is a queen or a heart? As the first step, you check whether this is mutually exclusive or non-mutually exclusive. In this scenario it is non-mutually exclusive, because both can occur at the same time. Now let's go towards the answer. How do you solve this problem? First of all you need to find the different quantities it asks about. Let's start with the probability of getting a queen. What is the probability of getting a queen? Guys, just think
over it: how many queen cards are there in a deck of cards? In the total deck there are 52 cards, right? If none of you have played cards, please go and buy a deck today and see. The probability of getting a queen is 4 by 52, because in every deck there are 4 queens. The next thing is the probability of a heart. How many heart cards are there in a deck? Obviously there are 13, so I'll say 13 by 52. The next thing is the probability of queen and heart, because this is also one possibility. How many cards are both a queen and a heart? Only one, so here I will write 1 by 52. Now I come to the formula, the addition rule for non-mutually exclusive events. I can write probability of A or B is equal to probability of A plus probability of B, but there is one important thing, the intersection, which I have to take out separately. So it is minus probability of A intersection B, where A intersection B means A and B, the possibility of both. Now my question is very simple: what is the probability of getting a queen or a heart? You have the answer with you: it is probability of queen plus probability of heart minus probability of queen and heart. Probability of queen is 4 by 52, probability of heart is 13 by 52, and probability of queen and heart is 1 by 52. So over 52, this is 17 minus 1, that is 16; 16 by 52, which you can simplify to 4 by 13, is the probability. Now you have understood the addition rule, and we need to understand one more rule in probability. See guys, if you do this much, I think you will be able to solve any problem statement that comes to your mind.
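The queen-or-heart calculation above can be checked with a small sketch that builds a 52-card deck and applies the addition rule for non-mutually exclusive events. The deck representation and the helper name prob are assumptions made for this illustration, not code from the session.

```python
from fractions import Fraction
from itertools import product

# A standard 52-card deck as (rank, suit) pairs.
ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))


def prob(predicate):
    """Share of cards in the deck for which predicate(card) is true."""
    return Fraction(sum(1 for card in deck if predicate(card)), len(deck))


p_queen = prob(lambda c: c[0] == "Q")                             # 4/52
p_heart = prob(lambda c: c[1] == "hearts")                        # 13/52
p_queen_and_heart = prob(lambda c: c[0] == "Q" and c[1] == "hearts")  # 1/52

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_queen_or_heart = p_queen + p_heart - p_queen_and_heart

# Cross-check by counting the cards that are a queen or a heart directly.
p_direct = prob(lambda c: c[0] == "Q" or c[1] == "hearts")

print(p_queen_or_heart, p_direct)  # 4/13 4/13
```

Subtracting the intersection matters because the queen of hearts would otherwise be counted twice, once as a queen and once as a heart.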
So that was the problem statement we did, and it was specifically about the addition rule. Now coming to the third one, which is called the multiplication rule. In the multiplication rule, one thing you need to understand is the difference between independent events and non-independent events. These are very important. (One correction: it should be 'and', I had said 'or' heart, okay.) So, independent events. What exactly are independent events? Let me take an example. Let's say I am rolling a dice. If I roll a dice I may get one, two, three, four, five or six. Suppose in the first instance I got one; in the second instance it is possible I get one again; in the third instance I may get two; I may get any number. So one event is not at all dependent on the other, because any time we roll, every outcome has an equal probability of coming up. What you can understand here is that each and every event is independent: if one comes, or two comes, or any outcome comes, it is not going to impact any other event. Every time you have to roll again, and every outcome has an equal probability. This is what is called an independent event. Now let me talk about the non-independent event; instead of non-independent I'll say dependent event. For dependent events, let's say I have a bag, and in this bag I have three red marbles and two green marbles. In the first instance, if I pick a marble, what is the probability of taking out a red marble? Very simple: how many marbles are there? Five in total. And how many red marbles? Three. So you are able to
write three by five. Now let's consider that in the first event you picked out a red marble; I'll mark it in red colour. After taking out the red marble, how many marbles remain? I will update the bag: it now has two red marbles and two green marbles. Now if I go ahead and find the probability of taking out a green marble, what will you say? You will see how many marbles are there, and say two by four. So what is happening here? The first event has impacted the second event, because the number of marbles was reduced, and that is how you got 2 by 4. This is a perfect example of dependent events. The multiplication rule says that in the case of independent events we solve in one way, and in the case of dependent events we solve in a different way. Because of dependent events there is an amazing algorithm called Naive Bayes. Have you heard of Naive Bayes? I think most of you have. There is a topic called conditional probability, and this is where conditional probability comes in; I will talk about it. So let's go and solve some problems. First we will talk about independent events. The question is: what is the probability of rolling a five and then a four? What is it saying? In the first event you rolled a dice and got five, and then you rolled the dice again and got 4. So what is the probability of getting 5 and then 4? This is a simple question, and you know that these are obviously independent events. Now how do we solve this problem? So I'll write: independent events.
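The marble example above, where the first draw changes the probabilities for the second draw, can be sketched as follows. The dictionary layout and the helper name draw_probability are hypothetical choices for this illustration.

```python
from fractions import Fraction

# Bag from the example: three red marbles and two green marbles.
bag = {"red": 3, "green": 2}


def draw_probability(bag, colour):
    """Probability of drawing the given colour from the bag in a single draw."""
    return Fraction(bag[colour], sum(bag.values()))


p_red_first = draw_probability(bag, "red")  # 3 red out of 5 marbles

# A red marble is taken out and NOT put back, so the bag changes.
bag_after = dict(bag)
bag_after["red"] -= 1

p_green_second = draw_probability(bag_after, "green")  # 2 green out of 4 marbles

# Without the first draw, green would have been 2 out of 5.
p_green_original = draw_probability(bag, "green")

print(p_red_first, p_green_second, p_green_original)  # 3/5 1/2 2/5
```

The change from 2/5 to 2/4 for green is what makes the second event dependent on the first; with a dice, the corresponding value would stay at 1/6 on every roll.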
Here we'll apply the multiplication rule. What is the probability of A and B? 'A and B' means first event A has occurred and then event B has occurred. The formula is probability of A multiplied by probability of B. This is the usual formula we use for independent events in the multiplication rule. So here I'll write probability of 5 and then 4. You know the probability of 5: it is 1 by 6 (sorry, not 1 by 2). So 1 by 6 multiplied by 1 by 6 is 1 by 36. Now let's take another example, because independent events look very simple. This example will be of dependent events. What is the probability of drawing a queen and then an ace from a deck of cards? See, two events are happening here. Again, first you need to find whether these are independent or dependent events. In this case they are dependent, because the deck of cards gets reduced after the first draw. So for the probability of A and B in the case of dependent events, I can write probability of A multiplied by probability of B given A. What does this mean? This term is called conditional probability. Let me show you an example with the bag. I have a bag over here; let's say I have three green marbles and two red marbles. I want to find the probability of drawing a green and then a red marble. See, how many marbles are there? If I'm taking out green first, obviously it is three by five, right? In the first
instance I took out a green marble, and after that the total number of marbles remaining is 4, so the probability of red will be 2 by 4. Now, the first term, 3 by 5, is the probability of green. And what is the term 2 by 4? It is the probability of red given green (sorry, not green given red). 'Given green' means the green event has already occurred, which is why the number of marbles has been reduced. This is called conditional probability, and it is very helpful in Naive Bayes, or I'll say in Bayes' theorem; there it will be very important. So coming back to the queen and then the ace (sorry, not the king): I'll write probability of queen multiplied by probability of ace given queen. What is probability of queen? It is 4 by 52. Multiplied by 4 by 51. (Sorry, was it 53? How many cards are there in a deck, I forgot. No, 52 only; don't confuse me, guys.) Okay, so 4 by 52 multiplied by 4 by 51, and whatever you get here is your answer.

Now let's discuss permutation and combination, a very small topic; in five minutes I should be able to complete it. First of all let's discuss permutation. Let's say I have taken some students on a school trip, and we have gone to a chocolate factory where they make a lot of chocolates. I catch hold of a student and say, okay, I'll give you an assignment. Let's say that in this chocolate factory six different types of chocolates are made: Dairy Milk, 5 Star, Milky Bar, Eclairs, Gems. How many is that? One, two, three, four, five, and one more
chocolate, say a normal toffee (or another category such as Dairy Milk Silk). So I have given the student this assignment: there are six chocolates being made in this factory; in your diary, write the names of the first three chocolates you see once you enter the factory, and come back to me.

So the student went inside the factory. For the first name, how many different options can the student have? Six, because any one of the six chocolates may be seen first. He writes one name. For the next name, how many chocolates remain? Five, so he has 5 options. Finally, when he writes the third name, he has four options. If I multiply these, 6 multiplied by 5 multiplied by 4, it is 120. What is this 120? It is all the possible permutations of the chocolate names he may write. He may see them in this order: Dairy Milk, Gems, Milkybar. He may also see them in a different order: Milkybar, Gems, Dairy Milk. So the total number of possible options is 120.
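To make the counting concrete, here is a small Python sketch that lists every ordered way of writing three names out of six and counts them. The chocolate names are placeholders taken from the story, and `itertools.permutations` is just one convenient way to enumerate the orderings; it is an illustration, not something used in the lecture.

```python
from itertools import permutations

# Six chocolate types from the factory story (names are illustrative).
chocolates = ["Dairy Milk", "5 Star", "Milkybar", "Eclairs", "Gems", "Toffee"]

# Every ordered way to write down 3 different names out of the 6.
ordered_picks = list(permutations(chocolates, 3))

# Counting by the multiplication argument: 6 options, then 5, then 4.
count_by_multiplication = 6 * 5 * 4

print(len(ordered_picks))        # number of ordered picks
print(count_by_multiplication)   # matches the line above

# The same three chocolates in a different order are different permutations.
print(("Dairy Milk", "Gems", "Milkybar") in ordered_picks)
print(("Milkybar", "Gems", "Dairy Milk") in ordered_picks)
```

Both orderings of the same three chocolates appear in the list, which is exactly why the count is as large as 120.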
Now, when I say 120, these are all the possible ordered options, and this is what a permutation is. How do you write the permutation formula? Let's go back to school days, where we used to memorize all the formulas: nPr = n! / (n - r)!. Here n is the total number of chocolates and r is how many names I have told the student to write. So you get 6! / (6 - 3)!, which is 6 × 5 × 4 × 3! / 3!; the 3! terms cancel, so the answer is 120. This is permutation.

Now how does combination come in, and what is the difference between permutation and combination? In a combination, order does not matter. If I have used Dairy Milk, Gems, and Eclairs once, I cannot swap the same elements around and count that as a different combination. So a combination is unique with respect to the elements used. For this you have another formula, which focuses on the uniqueness of the objects you are picking: nCr = n! / (r! × (n - r)!). Here n! is 6!, r! is 3!, and (n - r)! is (6 - 3)! = 3!. So you get 6 × 5 × 4 × 3! divided by (3 × 2 × 1) × 3!; the 3! terms cancel, and 120 divided by 6 is 20. So you can have 20 unique combinations.

Now, the first topic we are going to discuss is something called the p value.
Super, super important topic; many people get confused by this. Let's take one example. Everybody uses a laptop. Let's say this is my laptop and this is the mouse pad, with the right button and the left button to click. On the mouse pad you move your fingers. Don't you think that most of the time you will be moving your fingers in this specific middle region, and hardly ever touching the corners? I am drawing this because it specifies your distribution of touches, and most of the time that distribution will look something like this curve. Why is this area bulged? Because most of the time you will be touching here. This area is small because you hardly ever touch there.

Now let's say my p value for this position is 0.8. What does this 0.8 mean? It means that out of every 100 times I touch the mouse pad, 80 times I touch this specific region. The probability of touching this region is 80 percent. Similarly, if I say my p value over here is 0.01, what does this mean? You can consider any region in the same way. This region is the broadest, so it may have p equal to 0.9, which means that out of every 100 touches, I touch here 90 times. This will
be one time, only one time out of 100. So I hope you are getting an understanding of the p value: in this picture, the p value tells you how probable a particular outcome is for that specific experiment.

Now I'm going to combine multiple topics. The first is hypothesis testing, and with it I am going to combine the confidence interval, the significance value, and many other things. Let's say I am solving a problem: I have a coin, and I want to test whether this coin is a fair coin or not by performing 100 tosses. Now we are entering inferential statistics, which is super important.

When do you think a coin is a fair coin? When the probability of heads is 0.5 and the probability of tails is 0.5. If you have these two conditions, you will definitely say the coin is fair. But what if you have a Sholay coin? With a Sholay coin, the probability of heads is 100 percent, so you will definitely not say it is a fair coin. To test this, I am performing 100 experiments, which means 100 tosses. From these 100 tosses, what should the mean be? Let's focus on heads. Forget about the probability of heads for a moment; the number of times I should get heads is how
much? 50, right? If, after performing 100 tosses, I get heads 50 times, I can definitely say the coin is fair.

Now, very important: in this scenario we have to focus on something called hypothesis testing. In hypothesis testing, the first thing is to define our null hypothesis. The null hypothesis is usually given in the problem statement. We want to test whether the coin is fair or not, so whatever the default assumption is, I use it as the null hypothesis. Here I'm saying the coin is fair. It is like the courtroom scenario: a person cannot be convicted as a criminal unless and until it is proved. So the null hypothesis is: the coin is fair.

The second thing we define is the alternate hypothesis. Here I'll say: the coin is unfair. Always remember, the alternate hypothesis is the opposite of the null hypothesis; it is whatever we are trying to prove.

The third step is that we perform the experiment, and the test can be anything: a z test, a t test, whatever you want. I will discuss all of this practically, don't worry. In this experiment we observe some values, and based on them comes the fourth step: we reject or accept (more precisely, fail to reject) the null hypothesis. These are the steps of hypothesis testing.

Now let's define this, guys. Let's say my mean value is 50. I need to get heads 50 times, yes or no? Let's consider this as my mean: not a minimum of 50, but 50.
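The idea that a fair coin should give about 50 heads in 100 tosses can be checked with a quick simulation. This is only a sketch: the helper `count_heads`, the seed, and the choice of 2000 repetitions are assumptions made for illustration and are not part of the lecture.

```python
import random

def count_heads(n_tosses, p_heads, rng):
    """Toss a coin n_tosses times and return how many heads came up."""
    return sum(1 for _ in range(n_tosses) if rng.random() < p_heads)

rng = random.Random(42)  # fixed seed so the run is repeatable (arbitrary choice)

# Repeat the 100-toss experiment many times with a fair coin (p = 0.5).
head_counts = [count_heads(100, 0.5, rng) for _ in range(2000)]
average_heads = sum(head_counts) / len(head_counts)

print(round(average_heads, 2))             # close to 50 for a fair coin
print(min(head_counts), max(head_counts))  # single runs scatter around 50
```

Individual runs rarely give exactly 50 heads, which is why the next step is to decide how far from 50 a result may be before the coin is called unfair.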
50 is what I should be getting in order to say that my coin is fair. Let's say that for this problem statement, I'm just assuming, the standard deviation is 10. So to the right of the mean we have 60, 70, 80, 90, and to the left 40, 30, 20, 10. If I know my mean and standard deviation, I can draw a curve which looks like this.

Now, I perform the experiment of 100 tosses, and just imagine I got heads 30 times. 30 heads is somewhere at this point on the curve. Can I still say that this coin is fair, or is it unfair? Think it over and tell me. Many people are saying not fair; some say fair. To decide this, we say that our experiment's result should be near the mean. But how do we define how far it can be from the mean? For that we use a very important property called the significance value.

The significance value is denoted by alpha. Suppose I take alpha as 0.05. What exactly does this 0.05 mean? When I convert it into a percentage it becomes 5 percent. If I subtract this 5 percent from 100 percent, that is 1 minus 0.05, I get 0.95, which indicates a 95 percent confidence interval. Now what is this
95 percent confidence interval? It is what I get when I subtract alpha from 1. Which part of the curve is it? Since this is a two-tailed test (I'll talk about two-tailed tests too, don't worry), 2.5 percent is this part on the left and 2.5 percent is this part on the right, and everything in between is my 95 percent confidence interval. The significance value is defined by a domain expert; let's consider that it has been defined. So what does the 0.05 indicate? When I divide it into two parts, 2.5 percent comes here and 2.5 percent comes here.

Now understand one very important thing. Let's say I got 30, and I have defined my confidence interval from this point to this point. Whenever the result falls inside this interval, we say the coin is fair. We need to define this interval because we don't otherwise know what the number should be. When I got heads 30 times, many people said not fair, but who are we to decide? The domain expert will decide, with the help of the significance value. Suppose they say the significance value is 0.05. That means that if the experiment falls within this 95 percent confidence interval, I will say the coin is fair, and if it falls outside it, I will say the coin is not fair.

Now let's say this lower number is 20 and this upper number is 75, so 20 to 75 is my confidence interval. I perform the experiment. If I get only 10 heads out of 100 tosses, should we accept or reject the null hypothesis? The null hypothesis is that the coin is fair; the alternate hypothesis is that the coin is unfair. So if I get 10 heads, in which region does it fall? It will fall
somewhere here, which is not inside the confidence interval, so we can say that the coin is not fair. In that case we reject the null hypothesis and accept the alternate hypothesis. I hope everybody is able to understand the terminology we are using here. I cannot teach these as separate topics; I have to combine them to show you how it is done.

What if you get 95 heads in those 100 tosses? In which region will it fall? 95 is somewhere here, on the far right, outside the interval. So should we accept or reject the null hypothesis? We obviously have to reject the null hypothesis, and the alternate hypothesis is accepted. It's very simple: I perform the experiment, and whatever value I get, I check it against this interval.

Let me tell you one more scenario. Let's say my domain expert says, "Krish, you are a fool," and now I will redraw the curve: this is 50, 60, 70, 80, 90.
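The decision step described above can be written as a tiny Python function. This is a minimal sketch: the function name `decide` is hypothetical, and the bounds 20 and 75 are simply the example numbers quoted in the lecture, not values computed from alpha.

```python
def decide(heads, lower, upper):
    """Return the decision for the null hypothesis 'the coin is fair'.

    lower and upper are the bounds of the interval; here they are the
    lecture's example values, not bounds derived from a significance value.
    """
    if lower <= heads <= upper:
        return "fail to reject H0 (treat the coin as fair)"
    return "reject H0 (treat the coin as unfair)"

LOWER, UPPER = 20, 75  # example interval quoted in the lecture

for heads in (10, 30, 95):
    print(heads, "->", decide(heads, LOWER, UPPER))
```

With these bounds, 10 and 95 heads fall outside the interval and lead to rejection, while 30 heads falls inside it.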
Okay, let's say the expert says, "Krish, you are a fool; why have you taken alpha as 0.05? I don't want that." So let's say your alpha is 0.20. Now what will your confidence interval be? It will be 80 percent instead of 95. So now the cut-off points on your graph move inward: this side will have 0.10, this side will have 0.10, and everything in between will be 0.80. When you add all of this up, it will be 1. Then you can go and find your confidence interval: this value gives you the lower confidence limit and this value gives you the upper confidence limit. You perform the experiment, and just imagine you got 25 from it; you decide whether to reject or accept based on whether it falls inside.

Tell me one thing: if your alpha value is 0.3, what is your confidence interval? (Vishu Sharam, I took it for heads only.) You'll say it is 0.7, that is, a 70 percent confidence interval. So the significance value alpha and the confidence level are complements of each other: confidence level = 1 minus alpha. Usually, when we talk about the p value, if the result does not fall within the confidence interval, I may say that the p value is less than alpha, here 0.3, and because of that I have to reject the null hypothesis.

Hello guys. In today's topics, the first thing we are going to check out is something called the Type 1 and Type 2 error. It is super important; in machine learning you will probably be discussing the confusion matrix. The second
thing we are going to discuss, after Type 1 and Type 2 errors, is the one-tailed and two-tailed test. The third topic is how to find the confidence interval, how to calculate it when an alpha value is given; I told you we need to define a confidence interval in order to solve some problems. The fourth topic, after confidence intervals, is the z test and the t test, and if we get time we will also finish the chi-square test.

So let's start with the first topic, Type 1 and Type 2 errors. Always understand: in any kind of hypothesis testing we have a null hypothesis, usually denoted by H0, and an alternate hypothesis, denoted by H1. Let's say I'm performing an experiment to check whether a coin is fair or not; I'll take the same example we discussed yesterday. The null hypothesis is that the coin is fair, and the alternate hypothesis is that the coin is not fair. I will check the significance value, define the confidence interval based on it, and check whether the result falls within the confidence interval; that is what I explained in yesterday's session.

After we perform the experiment, there are two things to compare. First, the reality: either the null hypothesis is true or the null
hypothesis is false. Only these two things are possible in reality. Then there is the decision, because this is what I am actually testing: in my decision, I may either conclude that the null hypothesis is true, or that the null hypothesis is false. When I conclude the null hypothesis is false, I say that the alternate hypothesis is accepted, or that we reject the null hypothesis. From these two, let's see what the possible outcomes are.

Outcome one: we reject the null hypothesis when in reality it is false. Is this a good decision? Yes, it is obviously a very good decision; this is how we should decide.

Outcome two: we reject the null hypothesis when in reality it is true. Is this a correct decision? Many people will say it is a bad decision, so here I will say no. This kind of decision is called a Type 1 error: rejecting the null hypothesis when in reality it is true. Let's say I take my null
hypothesis as "the person is innocent" and my alternate hypothesis as "the person is not innocent". In this case we are convicting the person. We have seen in movies that people are awarded a death sentence even though they have not done anything wrong, as in a fake case. That is the kind of example you are seeing here: we reject the null hypothesis when in reality it is true, which means the person is awarded a death sentence even though he did not do anything. The decision says the person is not innocent, but in reality he is innocent. So this is a perfect example of a Type 1 error.

Outcome three is also a very important outcome, and you will be able to relate all of these to the confusion matrix; I don't know how many people know about the confusion matrix. Here we retain the null hypothesis, or let's say we accept the null hypothesis, when in reality it is false. Is this a good decision? The answer should be no. This error is called a Type 2 error. In this case, even though the person has committed the crime, he is not convicted. I hope this is clear to everybody.

Now let's go to outcome four; there will only be four outcomes. Outcome four is that we accept the null hypothesis when in reality it is true. This is obviously a good case. So outcome one and outcome four are perfectly fine decisions, whereas outcome two and outcome three are the Type 1 and Type 2 errors. I hope everybody is getting it. Similarly, in the real
world scenario you define something called a confusion matrix. In a confusion matrix you have true and false, positive and negative, so you define your true positive, true negative, false positive, and false negative. Tell me, out of these, which is the Type 1 error and which is the Type 2 error? That is a question for you: tell me whether a false positive is a Type 1 error and a false negative is a Type 2 error, or vice versa. This will be one assignment for you; if you don't know, check out one of my videos. Was this explanation clear for Type 1 and Type 2 errors? Perfect, some people are saying that a false positive is Type 1. So we have completed this topic, Type 1 and Type 2 errors.

Now let's go to the next topic, the one-tailed and two-tailed test. This is also super important. You have already seen that I drew a bell curve and marked the tails on it, but let me give you one good example. Colleges in Karnataka have an 85 percent placement rate. A new college was recently opened, and it was found that a sample of 150 students had a placement rate of 88 percent, with a standard deviation of 4 percent. Does this
college have a different placement rate than the other colleges? Understand this question very carefully.

Oops, sorry guys, I made one mistake earlier: true negative should not be Type 2; false negative is the Type 2 error. True positive and true negative are always correct decisions.

Now, what does this question say? There are colleges in Karnataka with an 85 percent placement rate. A new college was recently opened, and it was found that a sample of 150 students had a placement rate of 88 percent with a standard deviation of 4 percent. Does this college, meaning the new college, have a different placement rate? The placement rate of the colleges overall is 85 percent, so does the new one have a rate different from 85 percent? That is what we need to check. In this case it becomes a two-tailed test. Why? We'll think it over.

Let's say the significance value here is given as 0.05. We will try to draw a graph. The placement rate of 85 percent sits at the centre. When alpha is 0.05 and it is a two-tailed test, 2.5 percent will be in this tail, 2.5 percent will be in that tail, and the 95 percent confidence interval will be in the middle. If I combine all of these, it becomes 1.
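The way alpha is split for a two-tailed test can be sketched in a few lines of Python. This assumes a standard normal curve, which the lecture draws but does not name at this point; `statistics.NormalDist` from the standard library is used only to look up the cut-off points, and the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05             # significance value from the example
tail_area = alpha / 2    # two-tailed: alpha is split across both tails
confidence = 1 - alpha   # area left in the middle

# Cut-off points on the standard normal curve (normality is an assumption).
standard_normal = NormalDist(mu=0, sigma=1)
z_lower = standard_normal.inv_cdf(tail_area)
z_upper = standard_normal.inv_cdf(1 - tail_area)

print(tail_area, confidence)                 # 2.5% per tail, 95% in the middle
print(round(z_lower, 2), round(z_upper, 2))  # about -1.96 and 1.96

# The two tails and the middle together cover the whole curve.
total_area = round(tail_area + confidence + tail_area, 10)
print(total_area)
```

A result whose z score lies beyond either cut-off falls in one of the 2.5 percent tails, which is the region where a "different from 85 percent" claim would be supported.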
Now you need to understand whether this is a two-tailed test or a one-tailed test, and it is very simple. This 85 percent is my mean. The new college's value can be greater than 85 or less than 85, because we are only checking whether it has a different placement rate. That is the reason this is a two-tailed test: the new college may fall in this region on the left or in this region on the right.

Now let me make a small change to the question. Let's say my question is: does this college have a placement rate greater than 85 percent? What will this be? Think it over. My alpha value is still 0.05, so it is still a 95 percent confidence level, but now we are only focused on finding "greater". So I put the entire alpha on this side: this region on the right will be my 5 percent and the rest will be my 95 percent. This becomes a one-tailed test, on the right-hand side, because the important keyword is "greater". Remember, in this case we cannot divide the alpha value into two parts; it is present in only one tail. Now think about it: if the question asks about "less than", what happens? The region will come on the left side.

So I hope you understand the difference between a one-tailed and a two-tailed test. Always focus on what the question is asking. Here, "does this college have a different placement rate?" From any experiment that I
may carry out, my answer can be either greater than 85 or less than 85, so that becomes a two-tailed test. In the modified case I'm asking about greater than 85 with an alpha value of 0.05, so I am checking only this region on the right; I am not concerned with the other region, because I only need to check whether it is greater than 85. This is the most important thing about one-tailed and two-tailed tests. And if the question asks "less than 85", is it a one-tailed test? Yes, a one-tailed test, but on the left-hand side; the "greater than" case is on the right-hand side.

So we have finished one-tailed and two-tailed tests. Now let's go ahead and understand how to find the confidence interval for a mean. I told you that when my alpha value is 0.05 and it is a two-tailed test, I need to find these two boundary values on the graph. How do I find them? That is what we are going to see, with some calculations.

To understand the confidence interval, you first need to understand a topic called the point estimate. Here is the definition: a value of any statistic that estimates the value of a parameter is called a point estimate. So there are two things, a statistic and a parameter. So what exactly is a point estimate? A value of any
statistic that estimates the value of a parameter. Now understand one thing, guys: in inferential statistics, in any work we do, we first take sample data, and based on the sample data we estimate something about the population. In this example, if I have the sample mean, I'll try to estimate the population mean. This happens a lot in inferential statistics: you only have the sample information, and you may know the population standard deviation, but you need to estimate the population mean.

Let me give one example. This is my x bar, the sample mean, and with x bar we try to estimate the value of mu, the population mean. With the help of a sample I can estimate mu, but always remember that x bar may be only approximately equal to mu; it may be a little less or a little greater. In one case my x bar may be 2.9 while my population mean mu is 3. This is what a point estimate is all about: x bar is the point estimate, which estimates mu. So a point estimate is the value of any statistic that estimates the value of a parameter, and here we are using it to estimate the population mean.

In most problem statements I will be given x bar and I need to estimate mu. How can I do this? For that we will look at a problem statement and use something called a confidence interval. I told you that x bar will be approximately equal to the population mean; it may be less or it may be greater. So in this scenario we define something called
confidence intervals, so that we can come close to the population mean. A confidence interval is given by the formula: point estimate plus or minus margin of error. There is some margin of error because, as you can see here, 2.9 is less than the population mean, and it could also be greater; that is why I have written plus or minus the margin of error. We do not know the exact population mean, so the point estimate plus or minus the margin of error gives us a range that should contain it, and this is how we determine the confidence interval. From this formula you will be able to understand how close we are to the population mean. The second thing is which formula you should use when you are given the population standard deviation, and how large your sample size should be.

Let me solve one very simple problem. This is my question: on the quant test of the CAT exam (I hope everybody knows the CAT exam), the population standard deviation is known to be 100. A sample of 25 test takers has a mean score of 520. Construct a 95% confidence interval about the mean.

Now let's see what information is given. First, the population standard deviation is given: it is 100. What is your sample size n? It
is nothing but 25. What is your confidence level? It is 95%, and with respect to this I get an alpha of 0.05. And what is your mean? The mean is x̄, which is 520. Is this information given in the question? Obviously it is. Now my graph looks something like this: my mean is 520 and my alpha value is 0.05, so here I have 2.5% in this tail, here I have 2.5% in the other tail, and this middle region is my 95% confidence interval. If I say that I want to construct a 95% confidence interval about the mean, what range of values, from here to here, will it cover? That is what I need to find out. That is my problem statement, and I have also given the population standard deviation.

Why is alpha 0.05? I have given the question as a 95% confidence interval, so alpha is 1 minus 0.95, which is 0.05. Alpha and the confidence level are interlinked; very simple.

Now, when the population standard deviation is given, what kind of test do we apply? I know the confidence interval formula is point estimate plus or minus margin of error. The point estimate is obviously your x̄. Whenever you have the population standard deviation you apply a z-test, so here you write z alpha by 2, multiplied by the standard deviation divided by root n: x̄ ± z_{α/2} · σ/√n. The term σ/√n is called the standard error. The second point is when we should use this formula to find the
confidence interval. Here I have taken a sample of 25, but usually for this formula the sample size should be greater than or equal to 30; just for an example I have taken 25. Don't fight with me, asking "Krish, why have you taken 25?" Take 30 if you like; we still have to do the same calculation. These two conditions suit this kind of problem statement: for a z-test, most of the time these two conditions need to be satisfied. The z-test is used to find the z-score. So this is the entire formula to find the confidence interval when the population standard deviation is given and the sample size is greater than or equal to 30.

Now let's solve this problem. First I will split the equation into two parts. One is the upper bound of the confidence interval: x̄ plus z at 0.05 divided by 2, multiplied by the standard deviation, which is 100, divided by root 25. Now you understand why I have taken 25: my calculation becomes easier, since root 25 is 5. Don't fight with me, guys; I don't have the energy to fight nowadays. Similarly, the lower bound of the confidence interval is x̄ minus z at 0.05 divided by 2, multiplied by 100 divided by root 25. Here 0.05 by 2 is 0.025, so I need z at 0.025. I hope everybody is getting this. How do I find this value? Open your browser and open a z-table. Let me open a z-table. In this one only the negative values are shown, so I'll
not use this z-table; I'll use the other one, which also has the positive values. Now, in the z-table, always understand this: when I say 0.025, that is the area in one tail. The entire area under the curve is 1, so if I subtract 0.025 from 1, the area to the left of the upper limit becomes 0.975. So 0.975 is what I have to look up in the z-table. Go to the table and find 0.9750: the row is 1.9 and the column on top is 0.06, which means the z value is 1.96. So this becomes my z-score: 1.96. Now go and calculate. What is my x̄? It is the mean of the sample, which is 520. The standard error is 100 divided by 5, which is 20. So the upper bound is 520 plus 1.96 multiplied by 20. Similarly, the lower bound is 520 minus 1.96 multiplied by 20.
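The confidence interval calculation above can be sketched in Python. This is a minimal sketch and not something shown in the session: the function name `z_confidence_interval` is made up for illustration, and the critical value comes from the standard library's `statistics.NormalDist` rather than from a printed z-table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, pop_sd, n, confidence=0.95):
    """Confidence interval for the mean when the population SD is known."""
    alpha = 1 - confidence
    # Area to the left of the upper limit is 1 - alpha/2 (0.975 for 95%).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = pop_sd / sqrt(n)
    margin_of_error = z * standard_error
    return sample_mean - margin_of_error, sample_mean + margin_of_error


# CAT example from the session: sigma = 100, n = 25, x bar = 520.
lower, upper = z_confidence_interval(520, 100, 25)
print(round(lower, 1), round(upper, 1))
```

With these inputs the critical value is very close to the table value of 1.96, so the bounds agree with the hand calculation of 520 plus or minus 1.96 times 20.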
Now go ahead and compute this: the upper bound is 559.2 and the lower bound is 480.8. That means when I define my confidence interval for this distribution with alpha 0.05, this upper value will be 559.2, this lower value will be 480.8, and my mean will be 520.

Here is one stats interview question: find the average size of sharks throughout the world. Can you solve this by taking your own example? One of my students solved this problem and gave a confidence interval by making assumptions. The interviewer said: you know the population standard deviation, you know the x̄ value, you know the n value; try to solve it with alpha as 0.05.

Understand that when I use 0.025, I am looking at just one tail. The entire area is 1, so 1 minus 0.025 is 0.975. After performing any experiment, if my value falls between these two limits, we accept (more precisely, fail to reject) the null hypothesis and go ahead with it. If it does not fall within this range, we need to reject the null hypothesis.

The next question is: what if the population standard deviation is not given? In that scenario you need to use the t-test. Let me show you one good example, and we will solve that also. Take the same question, but now the population standard deviation is not given and the sample standard deviation is given. I'll write down the question for you. The question is: on the quant test of
the CAT exam, a sample of 25 test takers has a mean score of 520 with a standard deviation of 80. The standard deviation given here is your sample standard deviation. Construct a 95% confidence interval about the mean. That is my question. Here you can see that the population standard deviation is not given, so in this case I have to use the t-test. First let's see what is given: your n is 25, your x̄ is 520, your sample standard deviation s is 80, and your alpha is 0.05. So I can write the condition: when the population standard deviation is not given, we use the t-test; when the population standard deviation is given, you use the z-test.

Let's go and compute it. Here also the same formula is used: point estimate plus or minus margin of error. But the margin of error formula changes. The formula is x̄ plus or minus, and instead of z alpha by 2 you write t alpha by 2, multiplied by s by root n: x̄ ± t_{α/2} · s/√n. Here s/√n is your standard error. Now substitute. You will have two things; one is the upper bound, x̄ plus t at 0.05 by 2, multiplied by s by root n. First things first: to calculate the t value you need something called the degrees of freedom, because the t-table asks for it, and the degrees of freedom
formula is the same n minus 1 that we saw in the sample variance, which we also use with respect to Bessel's correction. So this will be 25 minus 1, which is 24. Now I will go to my browser and open a t-table. First look at the degrees of freedom row: here is 24. I hope everybody can see it. Then look at the column: 0.025 for one tail, which is 0.05 for two tails. On the left-hand side you have 24 degrees of freedom, on top you have 0.025, and where they meet the value is 2.064. So t at 0.05 divided by 2 is 2.064.

Next, what is my x̄? 520. So the upper bound is 520 plus 2.064 multiplied by s by root n. What is s here? It is 80, and root 25 is 5, so this is 80 by 5, which is 16. The upper bound is 553.024. For the lower bound I compute 520 minus 2.064 multiplied by 80 by 5, and I get 486.976. So the lower bound of the confidence interval is about 486.98 and the upper bound is about 553.02.

With this we have finished confidence intervals. Wow, I've written so much today. Congratulations, everybody! Someone is asking why this is not two-tailed. It is two-tailed only; why are you getting confused? In the table, 0.025 is for one tail, and for two tails it is 0.05.

Now let's go ahead and do the first z-test. I hope everybody has understood why we use the z-test. The first question that we are going to
solve is the one-sample z-test, and we are going to perform hypothesis testing. What exactly is the one-sample z-test? I told you two conditions for the z-test. The first condition is that the population standard deviation is given. The second is that your sample size should be at least 30, that is, n greater than or equal to 30. Earlier I used n equal to 25 just to make the calculation easier, because root 25 is 5. Don't fight with me; I have no energy.

So let's see how to do hypothesis testing. Here is the problem statement: in the population, the average IQ is 100 with a standard deviation of 15. Researchers want to test a new medication to see if there is a positive or negative effect on intelligence, or no effect at all. A sample of 30 participants who have taken the medication has a mean IQ of 140. Did the medication affect intelligence? Just by reading the question you already know which test is used.

Now I'll show you how to perform hypothesis testing. The first step is to define the null hypothesis, H0. In H0 you say that the mean is 100, because we need to check whether the medication affected intelligence or not. So my null hypothesis is that the mean IQ is 100, and my alternate will
be that my mean is not equal to 100. I hope everybody agrees with this. So H0: mean = 100; this is your null hypothesis. The second step is to define the alternate hypothesis, H1: mean ≠ 100, because if my null hypothesis says the mean is equal to 100, the alternate says it is not equal to 100. One important thing I forgot to mention: my alpha here will be 0.05, which means my confidence level is 95%. This is also part of the question. (To answer a question: the population mean is 100, not 140; 140 is the sample mean. And you can apply this in any domain, guys, not only in one kind of problem.)

In the third step we state our alpha value: 0.05.

In the fourth step I need to state my decision rule, and in the decision rule you always need to draw this graph. Since my alpha is 0.05, what kind of test will this be? Understand the question: did the medication affect intelligence? We are asking whether the medication increased or decreased intelligence; it can be either, so this is definitely a two-tailed test. So here I have 2.5%, here I have 2.5%, and in the middle 95%. Is everybody clear with this? One more important thing: when I say 2.5%, if I want to find the critical value with the
help of the z-table, I have to look up 1 minus 0.025, which is 0.975. I need to find this value and this value. We just checked it for the confidence interval: for z we got 1.96 (the 2.064 was for t). So the critical values are plus 1.96 and minus 1.96. Now I know my decision rule: whatever experiment I perform, if the null hypothesis holds, the z-score I get should lie between minus 1.96 and plus 1.96.

Now I calculate my test statistic. This is a z-test (sorry, I said t-test; it is a z-test, not a t-test). What is the z-score formula? x minus mu divided by the standard deviation; I hope everybody remembers it. But understand one thing: the real formula for the z-test is x̄ minus mu divided by the standard deviation divided by root n, z = (x̄ − μ) / (σ/√n). The reason we did not write root n before is that for a single observation n is 1, and root of 1 is 1. But when we are working with a sample of many observations we have to use root n, and σ/√n is called the standard error. Just as there are reasons why we divide by n minus 1 in the sample variance, there are reasons for this; I don't know whether you have seen my video on that. So for a single observation root n is already 1, and for a sample we have to use this formula. So this is
basically my standard error formula. Why do we use it? I'll give you one example. Let's say I have a population of a thousand points and I take five different samples from it, each of 100 points. Actually, wait for it, guys: there is a topic called the central limit theorem; I will discuss that, and then I'll teach you this properly. For right now, understand that because we are working with sample data, the sample mean carries some error. If you were working with population data directly, you could use the standard deviation as it is. Dividing by root n reflects that as the sample size keeps increasing, the standard error becomes smaller and our sample means match the population mean more closely. So don't worry, give me some time and I will explain it; for now just take it as the standard deviation divided by root n.

Now let's calculate. What is my x̄? A sample of 30 participants has a mean IQ of 140, so x̄ is 140. Minus the population mean: the average IQ is 100, so here I write 100. Divided by the population standard deviation, which is given as 15.
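The test statistic and decision rule for this IQ problem can be sketched as follows. This is a minimal sketch under assumptions: the function name `one_sample_z_test` is hypothetical, and the critical value is computed with the standard library's `statistics.NormalDist` instead of being read from a z-table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_sd, n, alpha=0.05):
    """Two-tailed one-sample z-test; returns (z, critical value, reject H0?)."""
    standard_error = pop_sd / sqrt(n)          # sigma / sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Reject H0 when z falls outside [-z_critical, +z_critical].
    return z, z_critical, abs(z) > z_critical


# IQ example: x bar = 140, mu = 100, sigma = 15, n = 30.
z, z_critical, reject = one_sample_z_test(140, 100, 15, 30)
print(f"z = {z:.2f}, critical = +/-{z_critical:.2f}, reject H0: {reject}")
```

For these inputs z is roughly 14.6, far outside plus or minus 1.96, so the sketch reports that the null hypothesis is rejected. The same function can be reused for the variation with a sample mean of 110.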
So here I have 15 divided by root n. What is n, the sample size we have taken? 30, so it is root 30. This becomes 40 divided by 15, multiplied by root 30, and I hope I am right: it is about 14.61.

Finally, we state our decision. This is a very important step. I got about 14.61 from my z-test. What did our decision rule say? The value should be between minus 1.96 and plus 1.96. The condition is: if z is less than minus 1.96 or greater than plus 1.96, we reject the null hypothesis. Here 14.61 is greater than 1.96, so we reject the null hypothesis.

Now let's take one more thing out of this. When I reject my null hypothesis, that means I am accepting that the mean is not equal to 100. So tell me one important thing: did the medication improve intelligence or decrease it? Obviously it improved intelligence, guys, because the z value is positive and beyond plus 1.96. If I had got a z value below minus 1.96, it would have decreased intelligence. Rejecting the null hypothesis says that the medicine had an impact.

Now do the same problem with one value changed: the mean of these 30 participants will be 110. Try to solve it and tell me whether the null hypothesis is accepted or rejected. Do this from your side.

Next we will start our second test, which is called the
one-sample t-test. I hope you like the session so far. Guys, I like to teach in this way, writing everything out. I never prepare PPTs; you have probably seen from my YouTube videos that I hardly prepare any slides. I write it like this because it helps me practice and see what mistakes I am making, and that is how I'll become perfect at it. So tomorrow, if you call me for any session, "Krish, teach statistics or machine learning", I will just go and start explaining everything.

Now let's go to the second problem statement, the one-sample t-test. I hope you understood when to use which test: whenever you have the population standard deviation, you use the z-test; you really need to remember this. If the population standard deviation is unknown, you use the t-test. That is the basic difference between them.

Let's take the same question. In a population the average IQ is 100. A team of researchers tried a medication to see whether it has a positive or negative effect. A sample of 30 participants was taken, and they have a mean IQ of 140 with a sample standard deviation of 20.
So did the medication affect intelligence? First things first: what is your null hypothesis? H0: mean = 100. What is your H1? Mean ≠ 100. So the first and second steps are done. The third step in the t-test, which I have discussed before, is to calculate the degrees of freedom. Here I use n minus 1, so this is 30 minus 1, which is 29.

The fourth step is the decision rule, and it is very simple. I draw this graph. I know my alpha value is 0.05, and I know my question: did the medication affect intelligence? It can either increase or decrease, so it is a two-tailed test: 2.5% here, 2.5% here, and 95% in the middle. What are the critical values? Let's look them up in the t-table with 29 degrees of freedom: the value is 2.045. So here you have plus 2.045 and here minus 2.045. That is your decision rule: the t value you get should lie between these two; if it is greater than plus 2.045 or less than minus 2.045, you reject the null hypothesis.

Finally we go to the test statistic. The formula for the t-test is almost the same: t equals x̄ minus mu, divided by s by root n, t = (x̄ − μ) / (s/√n). The population standard deviation is not given, so we use the sample standard deviation s. Try to compute the values, guys: x̄ is 140, mu is 100, s is 20, and n is 30.
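The t statistic just set up can be checked with a few lines of Python. This is a sketch under assumptions: the function name `one_sample_t_statistic` is made up for illustration, and the critical value 2.045 is hard-coded from the t-table lookup above (29 degrees of freedom, two-tailed alpha of 0.05) because the standard library has no t-distribution.

```python
from math import sqrt

# Two-tailed critical value for alpha = 0.05 and 29 degrees of freedom,
# copied from the t-table lookup in the session (not computed here).
T_CRITICAL_DF29 = 2.045


def one_sample_t_statistic(sample_mean, pop_mean, sample_sd, n):
    """t = (x bar - mu) / (s / sqrt(n)) for a one-sample t-test."""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - pop_mean) / standard_error


# IQ example: x bar = 140, mu = 100, s = 20, n = 30.
t = one_sample_t_statistic(140, 100, 20, 30)
degrees_of_freedom = 30 - 1
reject = abs(t) > T_CRITICAL_DF29
print(f"t = {t:.2f}, df = {degrees_of_freedom}, reject H0: {reject}")
```

The statistic lands near 10.95, well beyond plus 2.045, so the decision rule rejects the null hypothesis.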
So compute it. If I do the entire calculation, the answer is about 10.95. Since 10.95 is greater than 2.045, the t value falls in the rejection region, so we reject the null hypothesis. When we reject the null hypothesis, that means my p-value is less than or equal to the significance level; that is, I am falling in this tail region or in that one. Since I am getting a positive 10.95, did intelligence increase or not? Obviously, the final conclusion is that the medication increased intelligence. So you reject the null hypothesis and accept the alternate. So, from my teaching, did your IQ increase or not?

Now let's see a real-world problem that you can do yourselves. A bank wants to open an ATM in a specific area. You have to formulate this problem and think about how hypothesis testing can be applied. You can assume any values you want; for example, you can work with the average amount of money people withdraw from the ATM, with a 95% confidence interval. This was an interview question at one bank, Standard Chartered. Think over what information is required and try to solve it.

Hello guys, I hope you're doing fine today. First of all, we will continue the discussion from where we left off. We will solve a chi-square problem. Second, there are some topics I forgot about: covariance, correlation, and the Pearson correlation coefficient. The fourth topic we are going to see is the Spearman rank correlation coefficient. Then we are also going to see practical implementations, okay.
So we are also going to check out some practical implementation. In the practical implementation we will perform the z-test and t-test, and also see how to perform the chi-square test. We will also see the F-test, which is the last topic and is also called the ANOVA test. I have kept the F-test for last because the calculation is quite complex.

Now let's discuss the chi-square test, which has quite an interesting problem statement. Let me define what exactly the chi-square test is. The chi-square test is a test of claims about population proportions. If someone asks you in an interview why the chi-square test is used, you can say that it is a non-parametric test that is performed on categorical data, which can be nominal or ordinal. That is how you define a chi-square test. In an interview, make sure you understand why a specific test is done; this is very important. They may give you a problem statement and ask for your plan to solve it, but you should definitely be able to give the definition as well.

Let's solve a specific chi-square problem. I'll take a very good example. This is my question: in the 2000 Indian census, the ages of the individuals in a small
town were found to be the following. There are three categories: less than 18 years, 18 to 35 years, and greater than 35 years. In the 2000 census, less than 18 was 20%, 18 to 35 was 30%, and greater than 35 was 50%. That is the information from the complete census. Then, in 2010, the ages of n = 500 individuals were sampled, and below are the results for the same three categories: less than 18 were 121 people, 18 to 35 were 288 people, and greater than 35 were 91 people. The question is: using alpha as 0.05, would you conclude that the population distribution of ages has changed in the last 10 years?

The question is very simple. The 2000 census gives the population information: 20% under 18, 30% from 18 to 35, and 50% over 35. In 2010 they picked 500 people as sample data and observed 121, 288, and 91 in those categories. Using alpha equal to 0.05, has the population distribution changed in the last 10 years? That is the problem we are going to solve.

Now you may be thinking: Krish, you have said that it is a non-parametric test that is performed on categorical
data, that is, nominal or ordinal data. So what exactly is a non-parametric test here? It usually comes up with respect to population proportions: whenever you are given proportions of data across categories, you cannot use the parametric tests we have seen, so you go with a non-parametric test. Here we have the original data for the population, then a sample, and we are trying to see the difference between the two, category by category. Just by looking at the data, do you think the population has changed? You might say: yes sir, the population may have changed. For 18 to 35 you can see a huge number, and for greater than 35 a very small number, whereas from the population proportions you would expect greater than 35 to be the largest group. In this scenario there are two possibilities: either the population distribution has changed, or it has not. So how do we approach this problem?

Let me start the answer. As the first step, let's make two tables, because these tables play a very important role, guys. The first table has the population information. Its columns are less than 18, 18 to 35, and greater than 35. This is the expected distribution; I say expected because it is the population information. Here less than 18 is 20%, 18 to 35 is 30%, and greater than 35 is 50%. This is what your distribution is expected to be, because in the 2000 census they found out
this data. Now, 10 years later, they took a sample of n = 500, and this is the observed one: less than 18 was 121, 18 to 35 was 288, and greater than 35 was 91. So you definitely have these two pieces of information, and the reason I am writing them down will become clear now. We will create one more table, for the expected counts. Suppose I take n = 500: based on the census percentages, what should our expected distribution be if I am picking 500 people? We split 500 according to those percentages. Less than 18 is 20 percent, so I multiply 500 by 0.2; for 18 to 35 it is 500 multiplied by 0.3; and for greater than 35 it is 500 multiplied by 0.5. This is my expected distribution based on the 2000 census. The observed counts are fine, but we really need the expected counts too. 500 multiplied by 0.2 is 100, so I write 100 here; the next one is 150, and the last one is 250.
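The expected-count step above can be sketched in a few lines of Python. This is only an illustration: the variable names are my own, NumPy is assumed to be installed, and the numbers are the ones from the census example.

```python
import numpy as np

# Population proportions from the 2000 census example (the expected distribution).
categories = ["<18", "18-35", ">35"]
expected_proportions = np.array([0.20, 0.30, 0.50])

# Observed counts from the sample taken ten years later.
observed = np.array([121, 288, 91])
n = observed.sum()  # total sample size

# Expected count per category = sample size * population proportion.
expected = n * expected_proportions

for name, obs, exp in zip(categories, observed, expected):
    print(f"{name:>6}: observed={obs}, expected={exp:.0f}")
```

The two arrays `observed` and `expected` are exactly the two rows of the table that the rest of the test works with.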
This was the distribution I needed to have for 500 people based on the 2000 census, but this is what is observed. So now let's focus on these two tables. We have got 100, 150, 250 as expected, and obviously there is a huge difference from the observed counts. Just by seeing this you might say, okay Krish, there is a huge difference, just reject the null hypothesis. But understand that alpha is given here: I want a 95 percent confidence level. And again, why did we multiply? Because we need the expected distribution for this sample of 500: if the distribution had not changed, a sample of 500 in 2010 should also give roughly 100, 150, 250. So this row is my observed and this row is my expected, across less than 18, 18 to 35, and greater than 35. Observed is 121, 288, 91, and expected is 100, 150, 250. These are my three categories.

Now let's understand the next steps. First of all, as you know, you have to define your null hypothesis and alternate hypothesis when you start hypothesis testing. My null hypothesis is that the observed data meets the distribution of the 2000 census. My alternate hypothesis is that the data does not meet the distribution of the 2000 census. I hope everybody is able to understand the null and the alternate hypothesis. The second step is my alpha value: alpha is 0.05, which means a 95 percent confidence level. The third step: whenever we do a chi-square test we also need to know the degrees of freedom. So how do we calculate
degrees of freedom? It will always be like this, guys: n minus 1. What is n here? n is the number of categories: one, two, three. So 3 minus 1 is 2. Age is now categorical, right, so that is perfectly fine. Your degrees of freedom are 2 and your alpha value is 0.05, so all you have to do is check the chi-square table to find your decision boundary. Is this a one-tailed or a two-tailed test? The chi-square statistic is a sum of squared differences, so any deviation from the expected distribution, whether a count is lower or higher, makes it larger; the rejection region is therefore in the right tail only. Remember, we pick n as 3 because there are three age categories. Now let me open a chi-square table; hopefully I get the answer quickly. Some tables ask you to look up the area on the left, in which case you subtract 0.05 from 1. See, 0.05 is here and the degrees of freedom, 2, are here, so the critical value is 5.99. We usually denote chi-square by χ², and my decision rule is: if χ² is greater than 5.99, I reject H0.

Now let's compute the chi-square statistic. The fifth step is to calculate the test statistic: χ² is equal to the summation of (f_o minus f_e) whole square divided by f_e. The notation can be written in different ways, but f_o means observed and f_e means expected. I am going to do the summation over all three categories. My first observed value is 121 and its expected value is 100, so the first term is (121 minus 100) whole square divided by 100. Then if I go to the second element over here you
can see 288 and 150, so the second term is (288 minus 150) whole square divided by 150. The third one is (91 minus 250) whole square divided by 250. Adding them up gives 4.41 + 126.96 + 101.124 = 232.494. That means my χ² is 232.494, which is obviously greater than 5.99, so we have to reject the null hypothesis, which makes sense because the population distribution has changed. So if you want to define the chi-square test: it tests a claim about population proportions, and it is a non-parametric test that is performed on categorical data, nominal or ordinal.

Okay, let's see one Python example. Let's say I want to perform a z-test, and this is my question: suppose the IQ in a certain population is normally distributed with a mean of mu = 100 and a standard deviation of 15. A researcher wants to know if a new drug affects IQ levels, so he recruits 20 patients to try it and records their IQ levels. I am going to show you the code in Python to determine whether the new drug causes a significant effect or not. I have these 20 records. For the z-test we use the statsmodels library: from statsmodels.stats.weightstats import ztest as ztest. So these are my 20 patients, and I have recorded the IQ after the medication is applied. To apply the z-test there is no need to do that much calculation: just call ztest, give it the data, and as the next parameter give the IQ value you are comparing against, that is 100, to decide whether to reject our null hypothesis or not. Here the null hypothesis is mean is equal to 100, and the alternate hypothesis is mean is not equal
to 100. Now when I execute this, I get an error, tuple index; what is the problem? I have to write value = 100. See, these are the two values I am getting: the first is the z-test statistic and the second is the p-value. What does this p-value mean? Many people were asking about the difference between the significance level and the p-value. The z-test gives us the z-score, and it also gives a p-value that can be compared directly with the significance level. Right now the p-value is 0.11. Let's say I take a significance level of 0.05. If the p-value is less than the significance level, the statistic is falling in the tail region, and we reject the null hypothesis. Just for understanding purposes, you can also take the first value, the real z-score, and do the remaining calculations by hand against the critical value if you want; otherwise the p-value lets you decide directly whether the null hypothesis gets rejected or not. Here, since 0.11 is not less than 0.05, we do not reject. Now suppose I use the mean as 110 instead. Do we reject or accept the null hypothesis in this case? You can see I am getting 0.002, which is obviously less than 0.05, so we reject the null hypothesis. To be precise: 0.002 is less than 0.05, so in that case we reject, whereas 0.11 is greater than 0.05, so in that case we fail to reject. Alpha can vary: you can have 0.01, you can have 0.10; it depends on the domain, and yes, in the medical domain it can be different.

So the rule is: if the p-value is less than the significance value, the statistic falls in the tail region, so we reject the null hypothesis; if it is greater, we fail to reject, or as I'll loosely say, we accept the null hypothesis. You can do the same for the t-test. What data do you require? Whatever the question describes. When I gave this problem statement, you can see I have written the same type of question: the IQ in a population, a drug tried across 20 patients, this is the 20 patients' data, and I am running the z-test on it. Someone asks: if the hypothesized mean IQ is 110 and the p-value is 0.002, does it mean that after medication the IQ will still be around 110? No: a p-value of 0.002 means the sample mean is significantly different from 110, so we reject the hypothesis that the mean is 110. With the hypothesized mean of 100 the p-value was 0.11, so we could not reject that the mean is still 100, which means we have no significant evidence that the medication had an effect. And yes, alpha and the significance value are the same thing.

Okay, let me define it once again on the graph. Initially I got 0.11 as my p-value, which is greater than 0.05, and then consider the p-value of 0.002, which is less than 0.05. If the p-value is less than the significance value, I am in the tail region, outside the 95 percent region, so we reject the null hypothesis. If the p-value is greater than the significance level, I am inside the 95 percent region, and we fail to reject the null hypothesis.

Now let's go ahead and discuss the next topic, which is called covariance. Let's say I have a dataset with two columns, x and y: this is my weight and this is my height. Let's say you have weight 50 with height 160 centimeters, then 60 with 170 centimeters, then 70 with 180 centimeters, and then 75 with 181 centimeters. What kind of relationship are you seeing here? When x is increasing, y is increasing, and similarly when x is decreasing, y is decreasing. Now suppose I have one more dataset: number of hours of study and number of hours of play. If I am studying for two hours, I am playing for six hours; if I am studying for three hours, I am playing for four hours; if I am studying for four hours, I am playing for three hours. What is the relationship here? When x is increasing y is decreasing, and when x is decreasing y is increasing. So these are the two conditions you can observe. But the main thing is that
how do I quantify this relationship between x and y through numbers? For that I can use a formula called covariance. Covariance is given by cov(x, y) = Σ from i = 1 to n of (x_i - x̄)(y_i - ȳ), divided by n. If you are working with a sample, the denominator will be n - 1; let's consider that we are working with a sample. Here x̄ is the mean of x and ȳ is the mean of y. When we calculate this we will get either a positive number, a negative number, or 0. What does a positive number indicate? A positive value indicates two things: when x is increasing y is also increasing, and when x is decreasing y is also decreasing. So a positive number quantifies the relationship between x and y in that way. Similarly, with a negative number, when x is decreasing y is increasing, and when x is increasing y is decreasing. So the first case is positive correlation, I'll say, and the second is negative correlation. If it is 0, that means there is no linear relationship between x and y. Let's understand this with respect to covariance using pictures. Suppose I have a data set
which looks like this, with x on one axis and y on the other. Will it have a positive or a negative correlation? Think it over. It will obviously have a positive one, because when x is increasing y is also increasing, and when x is decreasing y is also decreasing; both conditions are satisfied. So when you apply the formula you will get a positive value. Now suppose I have another graph, with x and y, where the data points slope the other way. What will you have here? A negative correlation. Sorry, I should not say correlation here because we have not started correlation yet; I'll call it negative covariance. And suppose I have another data set where the points are scattered with no pattern in x and y. Then what will the covariance be? It will be 0, or close to 0, because there is no relationship.
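To make the sign of the covariance concrete, here is a small sketch that applies the sample formula (n - 1 in the denominator) to the two toy datasets from the lecture. The helper name `sample_covariance` is my own, and NumPy is assumed to be available.

```python
import numpy as np

def sample_covariance(x, y):
    """Sum of (x_i - x_bar) * (y_i - y_bar), divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Weight vs. height: both increase together.
weight = [50, 60, 70, 75]
height = [160, 170, 180, 181]

# Hours of study vs. hours of play: one goes up, the other goes down.
study_hours = [2, 3, 4]
play_hours = [6, 4, 3]

cov_wh = sample_covariance(weight, height)
cov_sp = sample_covariance(study_hours, play_hours)
print("weight/height covariance:", cov_wh)  # positive
print("study/play covariance:", cov_sp)     # negative

# np.cov also uses the n - 1 denominator by default; entry [0, 1] is cov(x, y).
print(np.cov(weight, height)[0, 1], np.cov(study_hours, play_hours)[0, 1])
```

The first value comes out positive and the second negative, matching the two relationships described above.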
Now let's understand one basic disadvantage of covariance. With covariance you will definitely be able to see the direction, whether the relationship is positive or negative. The disadvantage is that there is no fixed range for the value: you may get +100 or +1000, or -200 or -2000. There is no limit on the magnitude, and it depends on the units of x and y. So if you have two pairs of variables, one with covariance +100 and the other with +1000, you cannot say which relationship is stronger, because it is just a magnitude. That is the reason we need to restrict these values to some range, and for that we use the Pearson correlation coefficient. It restricts all your values between -1 and +1: the more towards +1, the more positively correlated; the more towards -1, the more negatively correlated. Then what is the difference from covariance in terms of the formula? The Pearson correlation of x and y is very simple: ρ(x, y) = cov(x, y) / (σ_x · σ_y), that is, the covariance of x and y divided by the standard deviation of x multiplied by the standard deviation of y. Because of this division, all your values will be between -1 and +1.
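The difference between the two measures can be checked numerically. The sketch below is illustrative only: the `pearson` helper is my own, NumPy is assumed, and converting the weights from kilograms to grams is an assumption added to show the unit dependence.

```python
import numpy as np

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
    return cov / (x.std(ddof=1) * y.std(ddof=1))

weight_kg = np.array([50, 60, 70, 75])
height_cm = np.array([160, 170, 180, 181])
weight_g = weight_kg * 1000  # same data, different unit

cov_kg = np.cov(weight_kg, height_cm)[0, 1]
cov_g = np.cov(weight_g, height_cm)[0, 1]
r_kg = pearson(weight_kg, height_cm)
r_g = pearson(weight_g, height_cm)

print("covariance (kg):", cov_kg, " covariance (g):", cov_g)  # differs by a factor of 1000
print("pearson (kg):", r_kg, " pearson (g):", r_g)            # identical, and within [-1, 1]
print("np.corrcoef:", np.corrcoef(weight_kg, height_cm)[0, 1])
```

Changing the unit multiplies the covariance by 1000 while the Pearson coefficient stays the same, which is why the coefficient is the one to compare across pairs of variables.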
Now let me show you some examples on Wikipedia. If you search for Pearson correlation coefficient you will see this diagram. Look at this plot: all the points are on one straight line, and when x is decreasing y is increasing, and when x is increasing y is decreasing. So it is negatively correlated, and since all the points fall on the straight line, the value is -1. In the next plot the data points are spread out; you can still see a negative correlation, but not all the points are on the straight line, so the correlation will be between -1 and 0. Similarly, in this plot, when x is increasing y is also increasing, so we have a positive correlation; since the points do not all fall on a straight line, it is somewhere between 0 and 1. If they all fall on the straight line, it is +1.
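The same pattern as in those plots can be reproduced with made-up data. In this sketch the slopes, intercepts, and the hand-picked noise values are all assumptions chosen for illustration; only `np.corrcoef` from NumPy is used.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)

# Points exactly on a decreasing line and on an increasing line.
y_down = -2.0 * x + 5.0
y_up = 3.0 * x + 1.0

# The increasing line with some hand-picked scatter added.
noise = np.array([4.5, -6.0, 1.5, 7.5, -3.0, -7.5, 6.0, 0.0, -4.5, 3.0])
y_noisy = y_up + noise

r_down = np.corrcoef(x, y_down)[0, 1]
r_up = np.corrcoef(x, y_up)[0, 1]
r_noisy = np.corrcoef(x, y_noisy)[0, 1]

print("decreasing line:", r_down)   # -1 up to floating-point rounding
print("increasing line:", r_up)     # +1 up to floating-point rounding
print("noisy increasing:", r_noisy) # strictly between 0 and 1
```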
So it captures linear relationships very well: wherever the points follow a straight line, it picks that up, and that is the biggest advantage of Pearson correlation. In this plot the correlation is zero. Why? Because we cannot say that when x is increasing y is also increasing; the data is scattered here and there. Some more examples: here you can see 1, 0.8, 0.4, 0, -0.4, -0.8 and -1. Similarly, in the next row these are all 1, this one is 0, and these are all -1. And in the last row you can see more zeros, even though the points clearly form patterns; Pearson cannot identify what exactly those are. There is a lot of difference between covariance and correlation: with correlation your values will always be between -1 and +1 and nothing more than that.

Okay, let me search for one more thing, called Spearman rank correlation, and you will understand why we use it. One thing to remember first: Pearson captures linear relationships well; when the relationship is a straight line it gives you 1, and even when the distribution is curved it still tries to fit a straight line and reports that value. Now let's go to Spearman rank correlation on Wikipedia. Look at this graph, everybody. This obviously has a positive relationship, because when x is increasing y is increasing, and when I calculate the Pearson correlation it gives 0.88. You can see that at every point, when x is increasing y is definitely increasing; in this region it is just increasing by a small amount. This property has not been captured by Pearson correlation, and that is the reason
it is showing you 0.88, even though y always increases when x increases and we would like to get 1. That is where Spearman rank correlation comes in, because it also handles non-linear relationships, as long as they are monotonic. Pearson correlation is good at capturing linear relationships, as we have already seen; Spearman works well in the non-linear case too. So what is the formula? Only one thing changes. Say I am finding the Spearman rank correlation between x and y: instead of the covariance of x and y, the numerator is the covariance of rank of x and rank of y, and instead of the standard deviations of x and y, the denominator has the standard deviation of rank of x multiplied by the standard deviation of rank of y. So r_s = cov(R(x), R(y)) / (σ_R(x) · σ_R(y)). Now what are rank of x and rank of y? Let's take height and weight as the features. If the height is 170, the weight may be 75 kg; if the height is 160, the weight may be 62; for 150 the weight may be 60; and for 145 the weight may be 55.
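The Spearman formula above is just the Pearson formula applied to ranks, which can be sketched as follows. This is an illustration under a few assumptions: SciPy is available, the helper name `spearman_from_ranks` is my own, and the cubic curve is a made-up example of a monotonic but non-linear relationship. Note that `scipy.stats.rankdata` gives rank 1 to the smallest value; ranking from the largest instead gives the same coefficient as long as both variables are ranked the same way.

```python
import numpy as np
from scipy import stats

def spearman_from_ranks(x, y):
    """cov(rank(x), rank(y)) / (std(rank(x)) * std(rank(y)))."""
    rx = stats.rankdata(x)  # rank 1 = smallest value
    ry = stats.rankdata(y)
    cov = np.cov(rx, ry)[0, 1]
    return cov / (np.std(rx, ddof=1) * np.std(ry, ddof=1))

# Height/weight pairs from the lecture.
height = [170, 160, 150, 145]
weight = [75, 62, 60, 55]
rho_hw = spearman_from_ranks(height, weight)

# A curve that always increases but is not a straight line.
x = np.arange(1, 11, dtype=float)
y = x ** 3
pearson_r = np.corrcoef(x, y)[0, 1]
rho_curve = spearman_from_ranks(x, y)
rho_scipy, _ = stats.spearmanr(x, y)  # library version, for comparison

print("height/weight Spearman:", rho_hw)
print("curve: Pearson =", pearson_r, " Spearman =", rho_curve, " scipy =", rho_scipy)
```

On the cubic curve Pearson stays below 1 while Spearman reaches 1 (up to rounding), which is the behaviour the 0.88 example is pointing at.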
Now in this case how do I define my rank? This is my x and this is my y. Rank of x is very simple: you just assign ranks to the values. Let's add one more point, height 180 with weight 85. For height, 180 is the highest, so I give it rank 1; then 170 is the next highest with rank 2, then 160, then 150, then 145. Similarly I assign ranks for y: 85 gets rank 1, then 75 gets 2, then 62 gets 3, 60 gets 4, and 55 gets 5. These ranks are used in the calculation; that is why I said covariance of rank of x and rank of y divided by the standard deviation of rank of x times the standard deviation of rank of y. The ranks are taken and the raw values are completely ignored. So that is the entire Spearman rank correlation, and I hope you have understood. If someone asks you why you use the Spearman rank correlation coefficient, you should say that it captures non-linear, monotonic relationships.

Now let's do one more example: a t-test. Let's see whether we are able to get it working. Suppose I have these ages; I am initializing this list of ages. You can initialize whatever values you want, because we are just doing hypothesis testing. If you want these ages I can also paste them in the chat. So this is
the ages list. Now let's compute the mean of these ages: ages_mean = np.mean(ages). If I print ages_mean, it is 30.34. Let's do one simple thing: consider all these ages as my population. I will take a sample of ages and then use a t-test to check whether the sample is consistent with this mean; we use a t-test because here we don't know the population standard deviation. I am going to take my sample size as 10 and pick the sample from these ages with np.random.choice, giving it the ages and the sample size. (I got an error at first because of a typo in np.random.) Now if I show you age_sample, you can see which ages have been picked. Can the mean of this sample be taken as close to the population mean? That is what the t-test checks. So I write from scipy.stats import ttest_1samp; this is the one-sample t-test that we did yesterday. To ttest_1samp I give two things: my age_sample, and the mean I want to compare against. Here I am just going to give 30.
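The steps just described can be put together roughly like this. The actual ages list from the session is not shown in the transcript, so the `ages` array below is a hypothetical stand-in generated with a fixed seed; the p-value you get will therefore differ from the lecture's numbers.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)

# Hypothetical stand-in for the lecture's list of ages (the "population").
ages = rng.integers(18, 45, size=30)
ages_mean = np.mean(ages)

# Draw a sample of 10 ages (with replacement, like np.random.choice by default).
sample_size = 10
age_sample = rng.choice(ages, size=sample_size)

# One-sample t-test: is the sample mean consistent with a hypothesized mean of 30?
t_stat, p_value = ttest_1samp(age_sample, popmean=30)

print("population mean:", ages_mean)
print("sample mean:", np.mean(age_sample))
print("t statistic:", t_stat, " p-value:", p_value)
```

Re-running with a different seed, or changing `popmean`, shows how the p-value moves around from sample to sample.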
So here you can see that I am getting a p-value of 0.76. If you want to check, compute np.mean(age_sample): I am getting 31.5, which is a little bit away from 30. My alpha value is 0.05, and here the p-value is greater than alpha, so tell me whether the null hypothesis should be accepted or rejected: since 0.76 is greater than 0.05, we fail to reject it. Suppose I execute the same code and compare against 31: now I am getting 0.918. If I compare against 28, I get 0.48, and against 26, I get 0.27. If I re-run the sampling, I get 0.60, 0.45, 0.96, 0.67 for different random samples. Now let's say I have drawn a different sample whose mean is 24.3. If I execute the test, the p-value is 0.015. Tell me, is that greater than 0.05 or less than 0.05? It is less, so here we reject. Comparing against 31 gives 0.006; is that within the 95 percent region or not? Similarly, against 28 it is 0.085, and then 0.397. Here I have just taken a small example; in a real scenario you will usually have a huge data set to check all these things. So this was an example of the t-test.

Let's take one more example, with a problem statement. Suppose I take the ages of all the students in a college. Then I am going to take the ages of the students in one class, and
then I'll find the mean of those ages and check whether it is consistent with the population mean age of the college students. First of all, let's say I have this code; everybody focus on the code. This is a Poisson distribution. It says to start from age 18 with a mean parameter of 35, and we take our population size as 1500. Then we take class A with starting age 18, mean parameter 30, and a size of only 60. If I look at school_ages, here are my values, and similarly classA_ages gives my 60 data points. First let's find classA_ages.mean(): you can see it is 46.9. Now I am going to apply the t-test again with ttest_1samp: the first argument is classA_ages and the second argument is the mean to compare against, for which I give school_ages.mean(). If I press Shift+Tab you can see that the second parameter is popmean. Now you can see the p-value I am getting; tell me whether the null hypothesis needs to be accepted or rejected. If I look at the school ages mean, it is far away: the class mean is 46.9 and the school mean is 53.
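A sketch of the school-versus-class comparison described above is given below. It assumes the ages were generated with `scipy.stats.poisson.rvs` using `loc` as the starting age and `mu` as the mean parameter, which matches the description but is not confirmed by the transcript; the seed and the `decision` variable are my own additions, so exact numbers will differ from the session.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)  # fixed seed so the sketch is repeatable

# Poisson draws shifted by loc=18, so the expected mean is loc + mu.
school_ages = stats.poisson.rvs(loc=18, mu=35, size=1500, random_state=rng)  # mean near 53
classA_ages = stats.poisson.rvs(loc=18, mu=30, size=60, random_state=rng)    # mean near 48

# One-sample t-test of the class against the school-wide mean.
t_stat, p_value = stats.ttest_1samp(classA_ages, popmean=school_ages.mean())

alpha = 0.05
# Reject the null hypothesis when the p-value is below the significance level.
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print("school mean:", school_ages.mean(), " class A mean:", classA_ages.mean())
print("p-value:", p_value, "->", decision)
```

Because the class mean sits several standard errors below the school mean, the p-value comes out far below 0.05 and the null hypothesis is rejected.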
If you are considering alpha as 0.05, the p-value will obviously be very, very small, so you have to reject the null hypothesis, guys, not accept it, because the difference is far too large: you get a p-value less than the significance level. Let's see what happens if I put the comparison value nearer to the class A mean, say 47. Sorry, I have to give this as my sample mean. Like this you can check all these things and verify them. For the school mean the p-value is of the order of 10 raised to -13, so the null hypothesis should be rejected. We can also put this in an if condition: I write _, p_value = the t-test call, and then if p_value is less than 0.05, print that we reject H0.

Okay guys, one more final thing that I missed out. I need to show you correlation in code as well, so let's say I am using seaborn: df = sns.load_dataset, and let's use the iris data set. This is my df.head(). If I call df.corr(), this is the correlation matrix (with recent pandas versions, pass numeric_only=True so the species column is skipped): you can see, for example, that sepal length and petal length are positively correlated. There are various ways to look at correlation; I can also use sns.pairplot(df), which shows the relationships in a diagrammatic way, the entire correlation structure visualized.

Guys, one correction on what I said earlier: the rule should read the other way round from how I first put it. On the board I had written it correctly, p-value less than or equal to 0.05, but in the code I had written it wrong. If the p-value is less than the significance value we
reject okay code wise but this wise uh i&#39;ve written it correctly sorry this should also be rejected this should be reject this should be accept anywhere i made a mistake here again okay here also i think except your reject i made one mistake guys again i&#39;m going to repeat it c if p value is less than or equal to 0.05 in this particular case we reject the null hypothesis the reason why we do this because p p is basically defining the probability part right now in this particular case they are just saying that it is less than five percent probability that the null is correct it basically says that five percent probability the null hypothesis is correct this will be applicable in coding guys so don&#39;t worry in coding so this is what it is basically saying okay so if p is greater than or equal to 0.05 here we accept the null hypothesis which in turn is saying that more than five percent probability there is a chances of more than five percent probability for the null hypothesis correct so this is basically for alpha point zero five i hope now it is clear here i have written it anywhere wrong so that this gets solved this is fine because this is greater than 5.99 only the relationship between alpha p value and alpha is important guys see in coding we always get a p value so here also if i go and see with respect to the coding let&#39;s say in this particular case if i&#39;m getting 0.015 this basically indicates that what does it indicate we have to reject the null hypothesis so here also i will just write a condition saying that this condition will work over here it is always a confusing thing p value because p value i defined in that particular way everything is correct because see what what exactly p value basically specifies i&#39;ll talk about it i told you an example of mouse bar right if i have like this here if i say p value is 0.8 that basically means out of all the 100 touches 80 percentage of time you are going to touch over here out of all the 100 
criteria if you say p value is 0.01 over here one time you are saying now if you are saying p value is less than point zero five that basically means you have less than five percent probability for the null hypothesis to be true which is basically present over here p value less than or equal to point zero five basically is specifying your tail region and this is specifying your this confidence interval region when the p value is greater than or equal to point greater than point zero always understand this relationship between p value and significance value significance value specifies your confidence interval p value is from the test result if this is less than or equal to c i the confidence interval or in this particular case the significance value here you specifically reject the null hypothesis because it is just saying that less than five percent probability it is there so we reject the null hypothesis and in the other case we accept the null hypothesis okay when the p value is greater than alpha the first topic that we will discuss about p value and significance values so today i&#39;m going to talk about the exact relationship between this p value and significance value because from the tests that we were doing you know uh we were seeing that okay most of the tests from that test how do we derive this p value that is what i&#39;m actually going to discuss about it and practically also one example will be shown then we&#39;ll move towards distribution first we will discuss about central limit theorem central limit theorem uh then we are going to discuss about distributions like bernoulli&#39;s distribution bernoulli distribution then fifth we are basically going to discuss about binomial distribution and then sixth we&#39;ll also be seeing something called as pareto&#39;s distribution okay i&#39;ll include log normal also right log normal distribution poison uh pareto distribution there is something called as power law we will discuss about it one one final 
thing that is pending is called as f test which is also called as anova test this will take one hour time guys just to do this i will upload a separate video okay i&#39;ll upload a separate video on the same today i will show you how you will derive the p value so we are basically going to see how do we derive the p value and what is the relationship between p value and significance value so this all things we will be discussing let&#39;s take a problem statement the problem statement uh i&#39;m going to take it off as that test and uh let&#39;s let&#39;s take the let&#39;s write down the question before that everybody ready take up your book and pen i&#39;ve already discussed about permutation and combination guys in the previous session the question is nothing but it is very simple the average we&#39;ll do a z test problem and then we&#39;ll try to derive this the average weight of all residents in bangalore city in bangalore city is 168 pounds we take a sample now we take a sample okay one one data i have missed so over here the average weight of all residents in bangalore city is 168 pounds with a standard deviation 3.9 now what we are saying we take a sample 36 individuals and the mean is 169.5 pounds from this information we really need to check whether whether the sample is being able to tell us the weights are same or not okay so this is what it is given and our confidence interval is basically 95 percentage so over here you know what is my what is your mean my mean is 168 points the standard deviation is 3.9 the x bar is nothing but 169.5 and my end sample is greater than 36 and obviously my n sample is given my population standard deviation is given so i am going to basically use that test very good the alpha value is 0.05 1 minus 95 percent is 1 0.05 let&#39;s go ahead and solve this particular problem so what is your null hypothesis mean is equal to 168. what is the alternate hypothesis your mean is not equal to 168. 
Then we come to the second step, where we specify alpha as 0.05. In the third step we find the decision boundary. It is a two-tailed test, because the mean can be greater than 168 or less than 168. So on the graph I have 0.025 in this tail, 0.025 in this tail, and 95 percent in the middle. What is the boundary value from the z table? If you look up 1 minus 0.025, that is 0.9750, you get 1.96, so the boundaries are plus 1.96 and minus 1.96. I hope everybody is clear, because we have already done this in our previous session.

In the next step we calculate the z statistic. The formula is very simple, and I hope everybody remembers it: x bar minus mu, divided by the standard deviation over root n. Here that is 169.5 minus 168, divided by 3.9 over root 36, so 1.5 divided by 0.65, or 1.5 multiplied by 6 divided by 3.9, which is about 2.307. Now the decision rule: is 2.307 greater than 1.96? Yes, so we reject the null hypothesis.

We have done this many times, and it is one way of solving the problem. But where does the p-value come into existence? Initially I was checking the graph against plus 1.96 and minus 1.96, and 2.307 falls out here in the right tail, which is why we reject. To find the p-value, I remove those boundaries and mark my own z value instead: plus 2.307 on this side and minus 2.307 on the other, because the curve is symmetrical.

Next I take out my z table and find the area under the curve for my z score. The row is 2.3, and 2.307 rounds to 2.31, so I take the 0.01 column and get about 0.9896. This is the area under the curve to the left of plus 2.31. If I subtract it from one, I get the right tail: 1 minus 0.9896 is 0.0104. Because of symmetry the left tail, below minus 2.31, is also 0.0104, and the middle region between the two is 1 minus 0.0208, which is 0.9792.
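Written out as equations, the z statistic and the two-tailed p-value for this problem look like this. It is a sketch with rounded decimals; the symbol \Phi is an assumption of notation here and denotes the standard normal cumulative distribution function, which is the quantity the z table lists.

```latex
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
  = \frac{169.5 - 168}{3.9 / \sqrt{36}}
  = \frac{1.5}{0.65} \approx 2.307

p = 2\,\bigl(1 - \Phi(|z|)\bigr)
  \approx 2\,(1 - 0.9895)
  \approx 0.021
```

The factor of 2 appears because one minus the table value is the area of a single tail, and a two-tailed test counts both tails.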
If I add up all three areas, the two tails and the middle, I should get one, and if you add them up you will get one. The p-value is nothing but the sum of the areas of the two tails, because this is a two-tailed test. From the z table, the area to the left of z equal to 2.31 is about 0.9896, so each tail is 1 minus 0.9896, that is 0.0104, and the p-value is 0.0104 plus 0.0104, which is 0.0208.

Some of you are asking whether we should divide by two. No. One minus the table value is already the area of a single tail, and the other tail is symmetrical to it, so for a two-tailed test we add the two tails, which is the same as doubling one of them. You would only split a value in two if you were starting from the total area outside the middle region.

We already knew from the z score that we have to reject the null hypothesis, and we can verify it from the p-value as well. 0.0208 is less than 0.05, my significance value, so we reject the null hypothesis.

Always remember these two points. If the p-value is less than or equal to the significance value, we reject the null hypothesis. If the p-value is greater than the significance value, we fail to reject the null hypothesis. Now it is clear, guys, from yesterday's session, and you can try this out on every problem we have discussed over these days.

Let me repeat it once with the z table in front of us. For the decision boundary I check 1 minus 0.025, which is 0.9750; if you look in the table you will find 0.9750 at row 1.9 and column 0.06, so the boundary is 1.96. Our z score was 2.307. If I simply take 2.30, the table value is 0.98928. Earlier I read the wrong cell, which is where the confusion came from. Since 2.30 is greater than 1.96, we reject the null hypothesis using the z score.

Now the p-value. I mark plus 2.30 and minus 2.30 on the curve and find the tail areas. One minus 0.98928 is 0.01072, and that is the area of one tail. This is a two-tailed test, not a one-tailed test, so I add the same area for the other tail: 0.01072 plus 0.01072 is 0.02144. Then check whether this is less than or equal to alpha. It is, so we reject the null hypothesis.

Let's solve one more problem. The average age of a college is 24 years with a standard deviation of 1.5. We take a sample of 35 students, let's take it as 36, and find that the mean age is 25 years. With alpha as 0.05, that is a confidence level of 95 percent, do the ages differ?

H0 is that the mean is equal to 24, and H1 is that the mean is not equal to 24. You know the standard deviation is 1.5, n is 36, x bar is 25, and alpha is 0.05. Is this a two-tailed or a one-tailed test? It is two-tailed, so alpha is divided into two regions of 0.025 each, and the decision boundary is plus and minus 1.96.

Now the z score: x bar minus mu, divided by the standard deviation over root n. That is 25 minus 24, divided by 1.5 over 6, so 1 divided by 0.25, and the z score is 4. I first said 1.2; extremely sorry, it is 4. Since 4 is greater than 1.96, we reject the null hypothesis. For the p-value, this will be your plus 4 and this will be your minus 4.
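Both z-test problems above can be checked with a short script. This is a minimal sketch using only Python's standard library; the helper name two_tailed_z_test is made up for illustration, and NormalDist().cdf plays the role of the z table.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(sample_mean, pop_mean, pop_sd, n, alpha=0.05):
    """Return (z, critical value, p-value, decision) for a two-tailed z-test."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    # Decision boundary: alpha is split equally between the two tails.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # cdf(|z|) is the area to the left of |z|, so 1 - cdf is ONE tail.
    one_tail = 1 - std_normal.cdf(abs(z))
    p_value = 2 * one_tail  # add both tails; do not divide by two
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    return z, z_crit, p_value, decision


# Bangalore weights: mu = 168, sigma = 3.9, n = 36, sample mean = 169.5
weights = two_tailed_z_test(169.5, 168, 3.9, 36)
# College ages: mu = 24, sigma = 1.5, n = 36, sample mean = 25
ages = two_tailed_z_test(25, 24, 1.5, 36)

for name, (z, z_crit, p, decision) in [("weights", weights), ("ages", ages)]:
    print(f"{name}: z = {z:.3f}, boundary = +/-{z_crit:.2f}, p = {p:.5f}, {decision}")
```

The function computes the exact normal area instead of a rounded table entry, so its p-values can differ from hand calculations in the last decimal place.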
Now go to the z table and find the value for 4. It is about 0.99997. This is the area to the left of plus 4, so one tail is 1 minus 0.99997, about 0.00003, and since this is a two-tailed test the other tail is the same. The middle region is about 0.99994. My p-value is 0.00003 plus 0.00003, about 0.00006. This is far smaller than the significance value, so we reject the null hypothesis. With the sample we have taken, we conclude that the mean age is not 24.

Let's go ahead with the log normal distribution, guys. A log normal distribution usually has this kind of shape, with a long tail on the right. We have seen a lot of examples, like wealth distribution. Now suppose Y is a random variable that follows a log normal distribution with some mean and standard deviation. Then if I apply log of Y, the result should follow a normal distribution. If that condition is satisfied, we can say the data is log normally distributed. I discussed this in the previous session too, and there are many examples. Take the comment section: most people write short comments, and very few people write big comments. So this is one example, and again, I have uploaded a detailed video on this in my stats playlist.

Let's go to the next distribution, the Bernoulli distribution. Whenever you have a Bernoulli distribution, there are only two outcomes, which we can label zero and one. The distribution is defined by two values, p and q. Suppose my experiment is tossing a coin. Let's say the probability of heads is 0.5; this becomes my p. Then q is 1 minus p. When we toss a coin there are two choices, heads or tails, so if one outcome has probability p, the other has probability 1 minus p. Here p is 0.5, so q is 1 minus 0.5, which is 0.5. Remember this is for a single trial, not multiple trials.

Now suppose I do not have a fair coin, and the probability of heads is 0.3. That is p. The probability of tails is q, which is 1 minus 0.3, that is 0.7.

Here you can see that the Bernoulli distribution is named after the Swiss mathematician Jacob Bernoulli, and it is a discrete probability distribution. The figure shows three examples. In the first, the probability of x equal to 0 is 0.2 and the probability of x equal to 1 is 0.8. In the next it is 0.8 and 0.2. In the green one it is 0.5 and 0.5. Whenever we draw this kind of experiment as a graph, one axis has the outcomes and the other has the probability, marked 0.2, 0.4, 0.6, 0.8 and 1.

Let's take one coin with heads and tails. If the probability of heads is 0.5, I draw a line up to 0.5 for heads, and the line for tails is also at 0.5. If the coin is not fair and the probability of heads is 0.8, I draw the line for heads up to 0.8, and the line for tails becomes 0.2. Understand that this is not a probability density function; this is a probability mass function. In a probability density function it is completely different.
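The coin examples above can be written as a tiny function. This is a sketch in plain Python; bernoulli_pmf is a hypothetical helper name, with k = 1 standing for heads and k = 0 for tails.

```python
def bernoulli_pmf(k, p):
    """Probability of outcome k (0 or 1) in a single trial with success probability p."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if k == 1:
        return p        # success, e.g. heads
    if k == 0:
        return 1 - p    # failure, e.g. tails; this is q
    raise ValueError("a Bernoulli outcome is either 0 or 1")


# The three coins from the session: fair, P(heads) = 0.3, and P(heads) = 0.8.
for p_heads in (0.5, 0.3, 0.8):
    q_tails = bernoulli_pmf(0, p_heads)
    print(f"P(heads) = {p_heads}, P(tails) = {q_tails:.1f}")
```

The two probabilities always add up to one, which is why the single number p is enough to describe the whole distribution.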
A probability density function is for continuous variables, while this is for discrete, categorical outcomes. So what we have here is called the probability mass function, the PMF; before, we used to talk about the PDF. I hope everybody is able to follow this.

Now let's go to the Wikipedia page. Here you can see the probability mass function, and it is the same thing: the probability is q, equal to 1 minus p, if k is equal to 0, and p if k is equal to 1. That is how the PMF is defined, and any probability I want to find comes from this formula. This much is all we need to know about the Bernoulli distribution.

Now let's discuss the binomial distribution. Till now we discussed a single trial. Whenever we take multiple trials, it becomes a binomial distribution. Every individual trial is a Bernoulli trial, so a binomial distribution is the combination of many independent Bernoulli trials, each with the same probability of success. Suppose the probability of heads is 0.6 and tails is 0.4; I repeat that same trial many times and count the number of heads. Because the outcome is again discrete, the diagram is a probability mass function; only for a continuous variable do we have a probability density function.

The binomial distribution is given by two parameters, n and p. n is the number of trials, and p is the probability of success for each trial, with q equal to 1 minus p. And this is the formula for the probability mass function, which gives the probability of a given number of successes.

Now let's go to one very important distribution, the Pareto distribution. The Pareto distribution is not a Gaussian distribution; it looks something like this. It is a power law distribution. Everybody see this diagram of the power law; I'm going to take a snippet and paste it here, because this is very important.

What information can we take from the power law distribution? You have to remember the 80-20 rule. You can see that this part is about 80 percent of the entire value and this part is the remaining 20 percent. The x-axis may be something and the y-axis may be something else, but 80 percent of one quantity falls here and the remaining 20 percent falls there.

Let's take some examples. 80 percent of the wealth is held by 20 percent of the people. 80 percent of a company's projects are done by 20 percent of the people in a team. 80 percent of sales come from 20 percent of the most popular products. 80 percent of a cricket match is won by 20 percent of the team. 80 percent of the spam comments on YouTube videos are written by 20 percent of the people. 80 percent of the oil comes from 20 percent of the land. You can also consider salaries. Whenever you have this kind of distribution, it is called a power law distribution, and the Pareto distribution is the standard example.

Now listen to one thing, guys. The diagram looks something like this. If I extend this diagram and make it like this, what kind of distribution is this? Not normal; it looks like a log normal distribution, right-skewed, with the right-hand side extended by a lot. Probably I did not draw it properly, but that is the idea. So the log normal and Pareto distributions are related in shape: both are right-skewed with a long right tail.

You can also transform this kind of data to bring it closer to a normal distribution. For this you have to watch one of my videos on data transformation. The Box-Cox transformation is used to convert such data towards a normal distribution, and you will be able to see it from the video link I have given.

I can also show you the code, so this is how it looks. All the transformations are here: normalization, standardization, scaling, and so on. I have also discussed the Q-Q plot, which checks whether a feature looks normal. Then you have the logarithmic, reciprocal, square root, exponential, and Box-Cox transformations. In the initial stages we apply these to the features and compare the results. You have to follow that video, guys, because it would take me an hour to explain all of it here. And what kind of distribution does this image follow? It follows a Pareto distribution, a power law distribution, so we can use a Box-Cox transformation on this data. If you go through that, you are well covered.

Now the central limit theorem. It says that I can start with any distribution, whether it is normal, not normal, or anything else. From this data I take multiple samples, each with sample size n greater than or equal to 30, and let's say I take m such samples. For every sample I find the mean: x bar 1, x bar 2, and so on up to x bar m. If I plot all these sample means as a PDF, they will follow an approximately normal distribution.

Why do I say n should be greater than or equal to 30? Because the larger the sample size, the better the central limit theorem holds. n is the number of elements we pick in each sample, and m is the number of samples; m can be anything, but the bigger it is, the more clearly you will see the shape. So whatever distribution the data had initially, log normal, normal, or anything else, if we plot all the sample means we get this normal distribution.

One assignment for you all is the Poisson distribution. You have a lot of ideas about distributions now, so just go and search for it in Wikipedia and look at it in the same way. It is also a non-Gaussian distribution; unlike the ones above, it is a discrete distribution for counts. Just go and check it out.

That's it from my side. Have a great day ahead. Thank you, Mandal, bye bye. Keep on rocking, keep on learning. Thank you, guys.
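The central limit theorem described in this session can be checked with a small simulation. The sketch below uses only the standard library and assumes an exponential population (clearly not normal), a sample size n of 50, and m = 2000 samples; all of these numbers, and the helper skewness, are choices made for illustration.

```python
import random
from statistics import mean, stdev


def skewness(values):
    """Sample skewness: near 0 for symmetric data, positive for a long right tail."""
    mu = mean(values)
    sd = stdev(values)
    return sum(((v - mu) / sd) ** 3 for v in values) / len(values)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 50                   # sample size: elements per sample, at least 30
m = 2000                 # number of samples

# Raw data: exponential with mean 1, which is strongly right-skewed.
population_draws = [rng.expovariate(1.0) for _ in range(10_000)]

# Draw m samples of size n and keep the mean of each one.
sample_means = [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(m)]

print("skewness of raw data:    ", round(skewness(population_draws), 2))
print("skewness of sample means:", round(skewness(sample_means), 2))
print("mean of sample means:    ", round(mean(sample_means), 3))
print("std of sample means:     ", round(stdev(sample_means), 3))
```

In theory the exponential data has a skewness of 2, while the sample means should be close to symmetric, centred on the population mean of 1, with a spread near sigma over root n. Plotting a histogram of sample_means would show the bell shape directly.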
