Transcript for:
Mathematics for Machine Learning

Mathematics: a great friend in the disguise of a foe. Since the old ages, its importance isn't something that I need to reiterate, because all of you know what it has helped us achieve through our history. So what does mathematics really have to do with machine learning? Let's find out in today's session.

Hello everyone, this is archives from Edureka, and I welcome all of you to this interesting session on the mathematics for machine learning. Let's take a look at the topics we shall cover in today's session. We shall first understand why math is important in machine learning. From there we will move over to linear algebra, covering vectors, matrices, their operations and more. Once we are clear with that, we move ahead to multivariate calculus, where we will learn about differentiation, partial derivatives and how they can help us in reality. Once done, we move over to probability, where we will learn its basics and discuss the naive Bayes classifier. We shall then end the session with statistics: the basics, the types, hypothesis testing and a practical example. So I hope you are clear with the topics being covered today.

Now before we get started, subscribe to the Edureka YouTube channel and hit the bell icon to never miss an update from us on the trending technologies. Also, if you're looking for online training certification on machine learning, check out the link in the description box below.

So let's jump into the first topic for today: why mathematics in machine learning? Aspiring machine learning engineers often ask me, what is the use of mathematics in machine learning when we have computers to do it all? Well, that is true: our computers have become capable enough to do the math in split seconds, where we take minutes or even hours to perform the calculations. But in reality it is not the ability to solve the math; rather, it is the idea of how the math needs to be applied. You need to analyze the data and infer information from it so that you can create a
model that learns from the data. Math can help you in so many ways that it becomes mind-boggling that someone could hate the subject. Of course, doing math by hand is something I hate too, but knowing how I use math is enough to explain my love for it. So allow me to extend this love to you guys too, because I won't be teaching you just the mathematics of machine learning but also the various applications you can use it for in real life.

So what math do we need to learn for machine learning? Here is a pie chart which comprises all the needed math. Linear algebra covers a major part, followed by multivariate calculus. Statistics and probability also play a big role, and you need knowledge of algorithms and much more. This is what is needed to master machine learning. So now that we have developed this understanding, let's do some math.

We shall kick off with linear algebra. Linear algebra is used most widely when it comes to machine learning. It covers so many aspects, making it unavoidable if you want to learn mathematics for machine learning. Linear algebra helps you in optimizing data and in operations that can be performed on pixels, such as shearing, rotation and much more, so you can understand why linear algebra is such an important aspect when it comes to mathematics for machine learning.

So let's move over to the first topic in linear algebra: scalars. So what is a scalar? A scalar is basically a value; it represents something. Suppose we had a laptop on sale and it is priced at fifty thousand rupees. This fifty thousand rupees is the scalar value of that laptop. What are the operations that can be performed on scalars? It is just basic arithmetic: addition, subtraction, multiplication, division. All of those operations can be applied on scalars. For example, over here we are buying a laptop and the accessories. What is the total price? It will be the
addition of both the prices, so fifty thousand for the laptop and five thousand for the accessories brings it up to fifty-five thousand rupees. What happens if you're buying the laptop at a 50% discount? It is half the price: 50,000 divided by 2, which becomes 25,000 rupees for the laptop. So this is just a brief introduction, and it is all that is required from scalars.

Once we are clear with this, let's move over to vectors. Vectors can get a bit complicated, as they mean different things to people from different backgrounds. Computer science people interpret vectors as a list of numbers that represent something. Physicists consider a vector to be a scalar with a direction, and it is independent of the plane. Mathematicians take vectors to be a combination of both and try to generalize it for everybody. All of these standpoints are absolutely correct, and that's what makes it so confusing for anyone learning linear algebra for machine learning. In machine learning we usually consider vectors from the standpoint of a computer scientist, where the data is in tabular form consisting of rows and columns. And when our data is in the form of pixels or pictures, we consider them as vectors that are bound to the origin, transform them into matrices and perform operations that we shall discuss later.

So now that you have a brief idea about vectors, let's jump over to the operations that you need to know when working with vectors. Operations on vectors can be applied only when you know what kind of data you are working with. Suppose you have pixel data and you want to apply rotations but end up doing something wholly different: your model will not work, because it is doing all the wrong operations. It's important that you know what you're working with; only then will you be able to apply the required operations. The first operation that we have here is vector addition, so let's understand what vector addition does for us. So, for example,
vector addition is something that is completely different from the operations that we've been learning for scalars. It's not simple arithmetic; it is actually the total effect of both the vectors in a quantified form. Let me say that I want to walk forward by 50 meters, and that is one vector. From there I go right for 25 meters, and that is the second vector. So what is done by both these vectors together is me moving forward and then moving right. Let me say v1 is what I walked forward and v2 is what I moved right. So what is the addition of these? The addition is basically putting both the vectors head to tail and then finding the displacement. If you look at it over here, it is v1 plus v2: v1 is a distance and v2 is a distance, and v1 plus v2 is the displacement. So that is basically what vector addition is. I hope you've understood this.

So let's move over to the next operation, which is scalar multiplication. Whenever a vector is multiplied by a scalar value, it either grows or shrinks. What this means is that you have a scalar value which is either positive or negative, and whose size may be greater than 1 or less than 1. If its size is less than 1 it makes the vector shrink, otherwise it makes it grow, and a negative sign also reverses the vector's direction. So let me show you an example. Let's say I have a vector called v1. If I multiply it by a positive scalar value greater than 1, say k, then k into v1 will make the vector grow, whereas if I multiply this vector by a negative scalar value that is smaller than 1 in size, say minus k, it will reverse and shrink down. So minus k into v1
gives me a shrunk vector. So as you can see, this was my normal vector, and when I multiplied it by the positive constant k it grew, and when I multiplied it by the negative one it shrank down. So that is basically what scalar multiplication is.

The next vector operation is projection. So what does projection help us with? Let's say I have two vectors, v1 and v2. I do not know much about v1 and I know more about v2. If I can find a way to project the vector v1 onto v2, I'll be able to obtain information about v1. So let me just show you. I have a vector v1 and another vector v2; I know all about v2 but I do not know about v1. If I am able to project v1 onto v2, I will be able to analyze and learn the unknown features that vector v1 has. Actually, if you take this into deep learning, you can find unknown features of a vector, which can help you modify one image into so many images, so that you can basically simplify and modify that image into something that it is not. That comes under deep learning and we are not going to cover it, but this is a very important concept that you need to understand. As it says over here, projection is the shadow that one vector casts on the other vector. So whatever information the vector v1 has, I'll be able to somehow extract it along v2, because the projection falls onto v2.

That basically brings us to the end of all the vector operations that you need to remember for machine learning. Let's move over to matrices. So what is a matrix? A matrix is a composition, or a mixture, of numbers, symbols or expressions arranged in a rectangular array. It can be rectangular or it can be square; it depends on the order. So what do we use matrices for? We use matrices to convert our equations into the form of arrays. For example, if you've got an equation, you cannot simply put
that into your computer saying, okay, solve this and give it to me. You need to convert it into a list or an array so that you will be able to perform your operations on the array. That is the reason matrices are so important to us, and it is much easier for us to convert our equations into lists and arrays and then perform our operations on them.

Suppose, for example, you have two equations over here. How do you convert these two equations into matrices? First, what does an equation tell you? If you have 2x plus 2y is equal to 10, what this is trying to convey is that you have two variables, x and y, and if you keep giving these variables different numbers, you'll get a different value. Say the scalar value is 10 and you have 2x plus 2y, which is basically a vector. I need to find the points x and y, meaning that I am trying to find the direction of the vector, and if I am able to substitute values and find out how 2x plus 2y is equal to 10, I am finding all the information I need from that vector. That is why functions are very important: you need to understand what a function is trying to convey to you.

Let's say, for example, you have the equation of a straight line. What is it? It is y is equal to mx plus c. What does this mean? It means that the y coordinate is equal to some value m into x, plus a constant. If I plot this, I get the y coordinate and the x coordinate, and I get a straight line. Whatever numbers I put into it, I will always be getting a straight line; that is the reason it is called the equation of a straight line. So you have to understand what a function is trying to convey to you. Now that we are done with that, let me just tell you how you convert equations into matrices.
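As a rough sketch of what that conversion looks like in code, the snippet below stores two equations as a coefficient matrix, a column of unknowns and a column of right-hand sides, then multiplies the matrix back out. It assumes the two equations on the slide are 2x + 2y = 10 and 4x + y = 18; the second one is inferred from the matrix values read out in this session, and the helper names `matvec` and `as_equations` are made up for illustration.

```python
from fractions import Fraction

# Assumed system (second equation inferred from the matrix read out in the session):
#   2x + 2y = 10
#   4x + 1y = 18
A = [[2, 2],
     [4, 1]]           # coefficient matrix: one row per equation
unknowns = ["x", "y"]  # the column of unknowns
b = [10, 18]           # the column of right-hand sides


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector, row by column."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def as_equations(matrix, names, rhs):
    """Render each matrix row back into a readable equation string."""
    lines = []
    for row, value in zip(matrix, rhs):
        terms = " + ".join(f"{coef}{name}" for coef, name in zip(row, names))
        lines.append(f"{terms} = {value}")
    return lines


# Multiplying the matrix out gives back the original equations.
equations = as_equations(A, unknowns, b)
for line in equations:
    print(line)

# A trial point (x = 13/3, y = 2/3) that satisfies both equations at once.
point = [Fraction(13, 3), Fraction(2, 3)]
print(matvec(A, point) == b)
```

Plain Python lists are used here only to keep the sketch dependency-free; with NumPy the same check would be a single matrix-vector product.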
It's really simple: take out all the numbers. You have 2x plus 2y equal to 10, so 2x and 2y give the numbers 2 and 2, and those two become one row. 4 and 1 in the second equation become another row. x and y become one column, and 10 and 18 become another column. So let me just show it to you. As you can see, we have the matrix 2, 2, 4, 1, then x and y, which is equal to 10 and 18. If you multiply it out accordingly, you get back the same equations that we had earlier. This is how matrices are used in linear algebra.

So once you know how the equations are converted into matrices, let's move over to the matrix operations. What is the first operation? Simple addition. What does addition help you with? It is basically adding all the directions of two vectors. Why two vectors? Because I am converting vectors into matrices. You simply add the corresponding elements of both matrices, and you have to remember that if you are adding two matrices, they have to be of the same order. What do I mean by order? It is the number of rows into the number of columns. For example, the first matrix, 2, 2, 4 and 1, has 2 rows and 2 columns, so that is a 2 cross 2 matrix; that is the order of the matrix. The next one, 2, 3, 1, 4, also has 2 rows and 2 columns, which makes it of order 2 cross 2. If I had 3 rows and 2 columns, that would be 3 cross 2. So I hope you have understood what order is. Let's say I have these two matrices and I want to add them. What is 2 plus 2? It becomes 4. What is 2 plus 3? Five. What is 4 plus 1? It becomes 5. What is 1 plus 4? It becomes 5. So it is adding the corresponding elements of the matrices. I hope you understood that.

So let's move over to the next operation, matrix subtraction. It's the same thing as how you did it for addition: you just subtract the corresponding elements. If
you had 2 minus 2, it becomes 0; 2 minus 3 becomes minus 1; 4 minus 1 becomes 3; 1 minus 4 becomes minus 3. Simple; you can understand this very easily.

Moving ahead from matrix subtraction, we have matrix multiplication. What do you do in matrix multiplication? You are basically multiplying the rows of matrix 1 with the columns of matrix 2. You have to remember that the number of columns in matrix 1 has to be equal to the number of rows in matrix 2; only then will you be able to perform the matrix multiplication. For example, if I had 2 cross 2 matrices, it would be a11 into b11 plus a12 into b21, then a11 into b12 plus a12 into b22, and so on accordingly. Let me show you an example so that you can understand how it works. Suppose I have the 2 cross 2 matrices that I've already been using for matrix addition and subtraction. To perform the matrix multiplication, I multiply the first row into both the columns of matrix 2, so I have 2 into 2 plus 2 into 1, then 2 into 3 plus 2 into 4. The same thing goes for the second row. So what do I get? I get 4 plus 2, 6 plus 8, 8 plus 1 and 12 plus 4, which accordingly gives me 6, 14, 9 and 16. So this is matrix multiplication; it's really simple to understand. If you have any doubts, leave them in the comment section and we will get back to you as soon as possible.

Once we are done with matrix multiplication, let's move over to the transpose. The transpose is a really simple operation: all you do is convert the rows into columns, that's it. But why is it so important? The transpose is really important when you want to change the dimensionality of your data. Suppose all your data is in a row, you can iterate through it that way, or if all your data is in a column, you can do it accordingly. So the transpose is really important because it helps you flip the dimensions. So, for example,
if you're working with the pixels inside pictures and you swap the rows and the columns of your picture's data, you are basically changing the picture, and then you can analyze it more and get more information from it. So the transpose plays a very important role; you need to understand that. For example, I have 2, 2, 4 and 1. What happens is the first row becomes the first column and the second row becomes the second column, so as you can see here it becomes 2, 4, 2 and 1. And that is all the transpose is.

So let's move over to the next operation we have: the determinant of a matrix. Until now, all we have been doing is adding and subtracting matrices. So what is all of this? It is basically vectors being added, subtracted, multiplied and then flipped in their dimensions. You will understand this when you do all of it practically, because it is not something we do in our daily lives, but when you're performing machine learning, all of these operations come into the picture. So all that you've learned by now is adding vectors, subtracting vectors, multiplying two vectors, then the transpose, that is, the flipping of vectors, and now you're going to learn about determinants.

What is a determinant? All the matrices that we had till now are basically the directions and values of all the vectors. Now all these vectors are definitely going to have a scalar value associated with them, so that you can understand the weight and the depth of that particular matrix. What does the determinant help you with? It helps you understand the weight, or, you know, the sensitivity that the matrix can have on the data set. That is the reason the determinant is really important: it is the scalar value of the matrix, and it can help you find the eigenvalues of the matrix. What do I mean by eigenvalues? I'll teach that to you in the next part, because we are going to understand in depth
what eigenvalues and eigenvectors are. That is the reason determinants are so important. Let's say, for example, I had this particular matrix a, b, c, d, e, f, g, h, i. The determinant would be a into the determinant of e, f, h, i, minus b into the determinant of d, f, g, i, plus c into the determinant of d, e, g, h. What I get from this is aei plus bfg plus cdh minus afh minus bdi minus ceg. This is what gives me the scalar value of the matrix. So I hope you've understood how determinants are important in machine learning.

Next we will learn about the inverse of a matrix. How do I explain the inverse to you? There is a very simple example. Suppose I am walking on a road and I walk straight for 50 meters, but then I remember that I had to get something from the place I started, so I walk back 50 meters. What is the distance that I traveled? It is 100 meters, because I went 50 meters forward and 50 meters backward. But what is the work that I have done? It is zero. Why is that? Because I am at the same place that I started from. If I move from one place to another, that is work done; that is some displacement that my body has achieved. But I have not achieved that here, because I have gone straight and come straight back for 50 meters, making my work done zero. The inverse of a matrix works just the same way. Suppose you have a vector that moves in the forward direction; the inverse is the one that comes back in the negative direction, making all the work done zero. Sometimes no inverse of a matrix exists; that's because the matrix does not have all the information that is required to obtain its inverse. So I hope you have understood what an inverse is.

Let's see how you find the inverse of a 2 cross 2 matrix. This is how you're going to find it: it is finding the determinant and then, you know,
finding the inverse of the matrix from it: a and d are switched over, b and c become minus b and minus c, and everything is divided by the determinant. This gives you the inverse of the matrix. For 3 cross 3 and above, what do you do? You find the determinant of the matrix and then accordingly find all the different determinants inside of it. So that is basically how you find the inverse of a matrix.

Let's move over to how a vector can be used as a matrix. I've been telling you that vectors can be easily translated into matrices, and I've already shown it to you. Why is it so important? It's because matrices help you apply operations on the data very easily, and then you have various operations such as scaling, rotation, shearing and much more. All of these come under computer graphics. Performing these operations on your image or your vectors becomes really easy when you're working with matrices, so that's the reason matrices are so important. For example, I've shown you that if there was an equation v1 as 3x plus 4y, it would be 3 and 4, and then you have x and y; then you would have x plus 2y, which would become 1 and 2, and then x and y. That is how a vector as a matrix works.

What are some of the well-known operations? I'm assuming that there is going to be a 2 cross 2 matrix. Scaling is basically increasing the size. How do you increase the size? With sx and sy, which are the scaling factors that you apply to your x and y coordinates. Then you have shearing, which is basically moving or reshaping the object that you're working with, with m as the shearing factor. And then you have rotation: how do you rotate your object, and in which direction? All of that can be done using these particular matrices. So that is basically vector as a matrix.

Let's move over and understand how matrices can help you solve equations and obtain solutions much more easily. We have two methods over here: they are the row echelon
method and the inverse method. For our tutorial I have taken the row echelon method because it's much more comfortable for me; if you are more comfortable with the column echelon method, you can go ahead with that. And next we have the inverse method. So we are basically going to solve the equations: we are going to find the coordinates at which our equations give us those particular values.

Let's start with the first method, which is the row echelon method. I have some equations over here: 2x plus y minus z equal to 2, x plus 3y plus 2z equal to 1, and x plus y plus z equal to 2. I'm going to convert all of this into a matrix. Let me show you how the matrix looks: it is 2, 1, minus 1 and 2; then 1, 3, 2 and 1; and then 1, 1, 1 and 2. So all the information about the equations has been put into matrix form, and the answers that we are going to get are x is equal to 2, y is equal to minus 1 and z is equal to 1. Whenever I put these x, y and z values into the equations, I get those particular values 2, 1 and 2.

Now let me tell you why we are solving equations. If I am saying that there is this particular vector that goes like this or comes in this direction, I want to find the x, y and z coordinates so that I can obtain the information, visualize it and then learn more about that particular vector. Now, this is just 2x plus y minus z equal to 2. What if it was two boxes plus one candle minus one chocolate giving me a profit of two rupees? As you can see, these are simply equations, and these are the coordinates that we get: x is equal to 2, y is equal to minus 1 and z is equal to 1. But what is the importance of this? Let me give you a simple example so that you can
understand what we are actually trying to find out over here. Suppose I have a factory and I want to get a profit of two rupees, and I have boxes, candles and chocolates. If I take the first equation, I am saying that two boxes plus one candle minus one chocolate is going to give me two rupees profit. So what are x, y and z exactly? x, y and z are the investment costs: how much do I need to invest into each of them so that I get back what I really require? So for boxes I need to invest two rupees, for candles minus one, for chocolates one, or something like that. If I invest that much, I'm going to get two rupees profit. So this is like a real-life example being converted into an equation, and then we just need to solve it. That is the reason vectors are so important, and why they are converted into matrices and then solved very easily.

How does the row echelon method work? Step 1 is R1 divided by 2: when we divide R1 by 2 we get 1, 1/2, minus 1/2 and 1. Step 2 is R2 minus R1, so I get the matrix 1, 1/2, minus 1/2, 1; 0, 5/2, 5/2, 0; and 1, 1, 1, 2. Simple enough, right? The next step is R3 minus R1, so it is 1, 1/2, minus 1/2, 1; 0, 5/2, 5/2, 0; and 0, 1/2, 3/2, 1. The next step is 2 by 5 into R2, so you get 1, 1/2, minus 1/2, 1; 0, 1, 1, 0; and 0, 1/2, 3/2, 1. Step 5 is R1 minus half of R2, so I get 1, 0, minus 1, 1; 0, 1, 1, 0; and 0, 1/2, 3/2, 1. Step 6 is R3 minus half of R2, so I have 1, 0, minus 1, 1; 0, 1, 1, 0; and 0, 0, 1, 1. Step 7 is R1 plus R3, so I have 1, 0, 0, 2; 0, 1, 1, 0; and 0, 0, 1, 1. Step 8 is R2 minus R3, so I have 1, 0, 0, 2; 0, 1, 0, minus 1; and 0, 0, 1, 1. If you put all of this back into equation form, you will find that the first row, 1, 0, 0, 2, is x plus 0 plus 0 is equal to 2, so the value of x is
equal to 2, the value of y is equal to minus 1 and the value of z is equal to 1. So this is how the row echelon method works. And what do you actually have to achieve in row echelon form? The first row has to start with the value 1; the second row has to start with a 0 and then have a value of 1; and the third row should start with zeros and then have a 1. So row echelon or column echelon, whatever method you use, you have to make sure that the first nonzero element of each row is a 1 and that it is preceded only by zeros. As you can see here, the first row has 1, 0, 0, 2, then you have 0, 1, 0, minus 1, and then you have 0, 0, 1, 1. So that is how you perform the row echelon method to solve your matrix. If you remember, x is equal to 2, y is equal to minus 1 and z is equal to 1 were the answers that we wanted.

Once we are done with the row echelon method, let's move over to the inverse method. Suppose we have the equations 4x plus 3y is equal to minus 13 and minus 10x minus 2y is equal to 5. What does the inverse method tell us? Let me show it to you. This is the matrix that you come up with: it is 4, 3, minus 10 and minus 2, then you have x and y, which is equal to minus 13 and 5. For example, we know that A into A inverse is equal to zero work done, right? It is the identity matrix, I. It is like I do the work and then I undo the work, which is equal to no work done, so A into A inverse gives me the identity matrix. Now say A into B is equal to C, where A is all the numbers, B is my x and y, and C is the output values. If I multiply both sides by A inverse, I have A inverse into A into B is equal to A inverse into C, and I get B is equal to A inverse into C, because A inverse into A gets cancelled. So B is equal to A inverse into C, and that's what I'm going to do. So what is
the inverse of the 2 cross 2 matrix? The inverse is found by this formula, and the inverse is actually 1 by 22 into minus 2, minus 3, 10 and 4. So that's the inverse of the matrix. Let me show you how it works. I have A inverse into A into B is equal to A inverse into C. A inverse into A is obviously going to get cancelled out, because it's the identity matrix, so I'm just left with x and y is equal to 1 by 22 into minus 2, minus 3, 10 and 4, into minus 13 and 5. Then I just multiply, and what I get is 1 by 22 into 11 and minus 110. So x is equal to 1/2 and y is equal to minus 5. That is how I use the inverse method to solve the equations. I hope that was simple to understand.

Once we've done all of that, we are now capable enough to understand what an eigenvector is. Eigenvectors are really important when you are working with data. All of the vectors and matrices that we've been working with by now, think of them as the data you're working with in your machine learning algorithm. What is an eigenvector? An eigenvector does not change its direction when the transformation is applied to it. What does an eigenvector really give us? For example, I have some data. If I do a transformation on it, it should not change from what it was initially; it should not become something that is totally irrelevant to me. If I have data such that even when I apply the transformation to it, it does not change what it was supposed to be, that is an eigenvector. What do eigenvectors give me? They are the most sensitive parts of my data set, and I can use them and trust them for my analysis purposes. That is why eigenvectors are so important: they help you transform your data and make it much more useful. So, for example, let me show you what an eigenvector
looks like. I have a rectangle, and I have a lot of vectors, but for my example I'm just going to use two vectors, v1 and v2. These vectors are basically trying to explain all the data that is in the rectangle. What happens if I perform the shearing operation? When I perform the shearing operation, my v1 vector changes its direction completely. It was pointing in one direction, and after shearing it has completely changed its direction; all the information that it held initially has now been changed. Whereas my v2 has not changed its direction. Yes, it may have stretched, it may have been multiplied, but it gives me the same information that it was giving me earlier. So that becomes an eigenvector. Think of this now in the form of a matrix: if my matrix is changing so drastically, how will I be able to perform operations on it? It becomes really difficult. Think of the vector v2 as being just multiplied by 2. It does not make a difference. Why? Because even if I multiply it by 2, it just doubles the value, but the information that it is giving me is the same. So that's the reason eigenvectors are so important. And what are eigenvalues? The values by which these eigenvectors get multiplied are the eigenvalues. It's that simple to understand. So that is all you need to know about eigenvectors: the data which does not change direction under the transformation is basically the data that you should be learning with.

What are the applications that linear algebra helps us with? First and foremost you have principal component analysis, PCA. It is used for dimensionality reduction and it helps increase the quality of data. Then it helps when you are applying transformations, it helps in encoding, and it helps in singular value decomposition, where it is again reducing the dimensionality of a single
value data and then it is used for natural language processing for latent semantic analysis optimization of your deep learning models all of this can be achieved through linear algebra so it is a very important aspect of machine learning once we are done with all of this let's start our coding for PCA we will now be coding for principal component analysis let me show you how it really works so I am having all the programs already prepared for you guys so we do not extend this big tutorial even move further let me go to the presentation mode mode so what happens over here is I am having these important libraries which are import numpy is empty then I have the pipe lot as PLD and then from SQL under decomposition import the PCF so principal component analysis is already present I am just going to important from SQL on so I am going to get a range of around you know 200 300 how much ever it is after I am getting that particular range I am going to get data I am going to get all of that data and then I am going to put it into my particular list so what happens over here I have MP dot dot and in the range of 2 comma 2 and it is in the range of 2 comma 200 so in the range of 2 comma 200 whatever numbers that you can randomly generate please give it to me and then dot B why is it dot T it's because all of this which I am going to get is basically in the form of a column I do not want it in the form of a call him I want it in the form of a row so that is the reason I am using dot T V dot T is used to transform it so I'm just going to scatter that on the plot and then show it to you guys that is what this particular function does and then I am doing PCA is equal to PC a of n components is equal to 2 and then I am going to fit the data into it and then I have a function which is to draw the vector so I am going to do arrow props dictionary style all of that I have just given the styling over here then then I am going to annotate it and draw it accordingly so this is my draw 
vector function, and then I scatter plot everything that I had. How do I get the vector to draw? Using this formula: the vector times 3 times np.sqrt of its explained variance. So what are PCA's explained variance and components? If you remember, I fit the PCA on the data that I generated. The explained variance tells me how much of the variance in my data points lies along each component, and the components are the directions I really need to look at and perform my analysis on. After I do all of that, I'm going to get another vector, which I'll show you. Then I ask for one component only, fit the data, transform it, and perform the inverse transform, so that I get a straight line that I can look at to find all the important aspects of the data.

Let me run the program and show you the outputs. As you can see, this is all the random data that I generated using np.dot and the random generator. Then I had to find two components. What can I understand from this plot? This one is a short vector, which means it can give me only a small amount of information, whereas this vector is long. The longer the vector, the more information it can pass to me. Simple enough, right? So the information along this long vector is what is going to be best for my analysis, and that's what I use: I take all the data points projected along this vector. So as you can see here, all of
these data points are very important for my analysis. This is how I converted such a big data set into one line, and if I follow all these points I get almost the same amount of information that the whole data set would give me. As you can see, this was such a huge data set, but it has come down to this one line. Think about working with tons and tons of data: it becomes really difficult, and that's the reason you use PCA. And it's not just a matter of using PCA; make sure you reduce the data as much as you can while still keeping the information you need from it. You do not need all the data, but you need the information in that data. That is basically how PCA works. That was all we needed to learn from linear algebra, and I hope it was clear to you.

Now that we know all about vectors, let's move over to multivariate calculus. Multivariate calculus is one of the most important parts of the mathematics in machine learning. It helps solve the second most important problem we face in developing machine learning models. The first problem is obviously the pre-processing of the data; the next is the optimization of the model. Multivariate calculus helps us optimize the model, increase its performance, and get the most reliable results. So how does something that almost half the class hated help us solve such a problem? Let's break the ice surrounding it, but before that we need to understand the basics, so let's start with calculus. The first topic in calculus is differentiation. Differentiation is basically breaking a function down to see how its output changes as its input changes, so that you can analyze it in depth. Derivatives are very helpful in finding the sensitivity of a function to varying inputs. A good function gives you a good output, which can be described
using a rather easy equation; the same cannot be said for a bad function. For example, right here I have $y = e^x$. It is really easy to tell what the graph of $e^x$ is going to look like. But look at $y = \frac{1}{x}$: its graph breaks apart at $x = 0$, so it is much harder to describe how it behaves. Those are the good functions and the bad functions. Most of us already know all of this, but what are we really after? Let's understand that with a really simple example.

Let's assume we have a car that moves in a single direction only and is already in motion. If we plot a graph of its speed versus time, it tells us how the speed varies as time keeps increasing, and that the car comes to a halt after a certain point. Now, if we want to know the rate at which the speed varies with respect to time, it turns out that we are actually finding the acceleration. The acceleration of the car can be plotted as follows. What this means is that acceleration is the derivative of speed; to keep everything simple, the speed here is just a magnitude and does not carry any other factors. If there were no speed, there would be no acceleration to find; it's as simple as that. Now that we have the acceleration, we can tell whether the car had a varying or a constant change in its speed, whether it was moving or not, and much more, but in this case that is all we need.

Now suppose you want to find the rate of change over a certain range of the time span. We mark two points: x, and a second point that is a small amount more than x, which we denote by $x + \Delta x$. Let's mark this on the graph, as shown over here. We know that this range has many values in between, but what if we want to know the rate of change only between one point and the very next? We know that this is really
kind of impossible, because within a range there are countless numbers, as the function is continuous. So we let the step between the two inputs approach zero. Remember that this zero is only the smallest value we can imagine and not absolute zero; if we worked with an exact zero we would be dividing by zero and there would be nothing left to compute. If you are still confused, think of this number as 0.00000 and so on, with some digits far down the line; it is never really zero.

Now that we have understood what we are really trying to find, let's make it a general equation and write down the derivative formula: $f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$. Here $\Delta x$ tends to zero without being exactly zero. That is how the formula comes into existence. It gives us the rate of change from one point to the next, and it is such an important concept because it plays a huge role in the optimization of machine learning models. You need to understand that this is the first-order derivative only; if we differentiate the output of the first differentiation we get the second-order derivative, and so on. That is all the introduction we need.

Now that we have the basic idea and formula, let's understand some of the important rules of differentiation. The first rule is the power rule: whenever you have a function with a variable raised to some power, you can use this rule and solve the problem much faster. For example, let's say I have $f(x) = 3x^2$. Putting this into the formula, I have $f'(x) = \lim_{\Delta x \to 0} \frac{3(x + \Delta x)^2 - 3x^2}{\Delta x}$. Expanding $(x + \Delta x)^2$ gives $x^2 + 2x\,\Delta x + (\Delta x)^2$. The $+x^2$ and $-x^2$ cancel, I take $\Delta x$ out as a common factor, and it cancels with the $\Delta x$ below, which leaves $3(\Delta x + 2x)$. Now I put $\Delta x = 0$, which is not exactly zero but so small that I can ignore it, and I have $3 \cdot 2x = 6x$. That's how I had to solve the whole equation the long way. What about the power rule? The power rule says $\frac{d}{dx} x^n = n x^{n-1}$. For $3x^2$, I take out the 3 because I treat it as a constant, which leaves $x^2$. I bring the 2 down, which is $n$, and I have $x^{n-1}$ with $n - 1 = 2 - 1$, so it becomes $3 \cdot 2x = 6x$. I get exactly the same answer in three steps; I could even do it in one step, but I've written it out for simplicity.

The next rule is the sum rule: if a function is the sum of two terms, the sum of their derivatives is the derivative of the whole function. So for $x_1 + x_2$ it is $(x_1 + x_2)' = x_1' + x_2'$. A simple example is $3x^2 + 5x$. Solving it with the formula, the $3x^2$ part goes just as I showed you before, and the same goes for $5x$: I have $5(x + \Delta x) - 5x$, the $5x$ terms cancel, and the $\Delta x$ cancels as well. The $x^2$ and $-x^2$ cancel as before, the $\Delta x$ gets cancelled, and I have $6x + 5$. Now let me use the power rule instead: I have $3 \cdot 2x + 5$, so I
get $6x + 5$. This is basically how the sum rule works. Next I have the product rule. If I have a function which is a product, such as $x \cdot 2x$, the differentiation is very simple: $(x_1 x_2)' = x_1' x_2 + x_2' x_1$. That is how you find the derivative. The next rule is the chain rule: if I have a function of a function, $f(g(x))$, its derivative is $f'(g(x)) \cdot g'(x)$. It's really easy to understand. If you want to learn more about the common functions, my blog is linked in the description and you can go ahead and read more there.

Now that we have understood all the basics we needed from differentiation, we will move over to partial differentiation. Partial differentiation is an important concept that most of us ignored throughout our academics. What has partial differentiation helped us achieve? That might be the question most of you have right now, so let me give you an example. Suppose you are a car designer who deals with the exterior of the car, and you have been given the task of maximizing the performance of the car. How do you do that? You do not go into the car, you do not take out the engine, you do not tune the engine; you are not supposed to do any of that. You are only going to change the exterior. You know that the engine, the tires and everything else are also variables that go into making the car perform better, but you are not supposed to touch them, because you only deal with the exterior. Those variables, the engine and all of that, are kept constant.
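The idea in the analogy above, changing one variable while the others are held constant, can be sketched numerically. This is a minimal sketch and not code from the session: it borrows the function f(x, y, z) = x^2 + 3y + 4xz^2 that is worked through by hand a little later, and the helper name partial_derivative, the step size h and the evaluation point are assumptions made purely for illustration.

```python
def f(x, y, z):
    """Example function used later in the session: x^2 + 3y + 4xz^2."""
    return x**2 + 3*y + 4*x*z**2


def partial_derivative(func, point, index, h=1e-6):
    """Estimate d(func)/d(point[index]) with a central difference.

    Only the coordinate at `index` is nudged; every other coordinate
    keeps its value, which is what "held constant" means here.
    (Hypothetical helper, not part of any library.)
    """
    forward = list(point)
    backward = list(point)
    forward[index] += h
    backward[index] -= h
    return (func(*forward) - func(*backward)) / (2 * h)


# Illustrative evaluation point (an assumption, chosen arbitrarily).
point = (1.0, 2.0, 3.0)

df_dx = partial_derivative(f, point, 0)  # analytically 2x + 4z^2 = 38 at this point
df_dy = partial_derivative(f, point, 1)  # analytically 3
df_dz = partial_derivative(f, point, 2)  # analytically 8xz = 24 at this point

print(round(df_dx, 4), round(df_dy, 4), round(df_dz, 4))
```

Each call moves exactly one input and leaves the rest alone, so the three numbers are estimates of the three partial derivatives at that point; the hand calculation later in the session gives the exact expressions they approximate.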
If those are kept constant, that means those variables will be handled by somebody else. You are just going to look at the variable that you have to change, which is the exterior of the car. This is how partial differentiation comes into play: you change one variable and all the other variables become constants. It's that simple. So what do you do? You change the windshield, the body if necessary, the air vents and many more; I've just listed whatever I could find. You are changing only what is necessary, and that is basically what partial differentiation is.

So what equations does partial differentiation go with? As you can see, the function is $f(x, y, z)$. If you remember the previous slides, we were working with one variable only, x: whether it was $x^2$, $3x$ or $5x$, it was all in x. Partial differentiation is not like that; it has x, y, z, however many variables there are. That is what makes partial differentiation more realistic and easier to understand, because you change one of your variables and keep everything else constant. Say you want to differentiate with respect to x: y and z become constants, and you get $\frac{\partial f}{\partial x}$. With respect to y, x and z become constants, and you get $\frac{\partial f}{\partial y}$. With respect to z, x and y become constants, and you get $\frac{\partial f}{\partial z}$. So what is the complete differentiation? It combines all three partial derivatives that we've just found, each one going with the change in its own variable. This is how partial differentiation works.

Let me show you an example. If I have the function $f = x^2 + 3y + 4xz^2$, how do I find its derivatives? With respect to x, the $3y$ term is a constant, so it becomes zero, and I have $2x + 4z^2$. With respect to y, $x^2 + 4xz^2$ is a constant, so I just have 3. With respect to z, $x^2$ becomes a constant, $3y$ becomes a
constant, and $4xz^2$ is what I differentiate, so I have $8xz$. So what is the total differential? It is $df = (2x + 4z^2)\,dx + 3\,dy + 8xz\,dz$. I hope you've understood this; it's a really simple example. This is how partial differentiation works.

So where does multivariate calculus come into play? We have something called the Jacobian. What is the Jacobian? You differentiate a vector-valued function once, with respect to each of its variables, and put the results in the form of a matrix; that is the Jacobian. For a scalar function its entries point in the direction of steepest increase, which helps when searching for a maximum, and it also helps in linearizing a nonlinear function at a particular point. Differentiating the Jacobian again, that is, differentiating the original function twice, gives us the Hessian, which is used in minimizing errors. All of this is used in deep learning models, in gradient descent, for optimizing the weights. That is how multivariate calculus helps us in real life.

Let me show you how gradient descent works. I already have the code ready for you; let me go back to presentation mode. I import numpy, and then I have the sigmoid function, which returns 1 divided by (1 plus e to the power of minus sop). Then I have the error, which is (predicted minus target) squared. Then I have the derivative of the error, which I am going to use later: that is 2 times (predicted minus target). Then I have the derivative of the activation function, which is sigmoid times (1 minus sigmoid). And then I have one more derivative function
which just returns x. Then I have the function that updates the weight: I take the weight, the gradient and the learning rate, and return the weight minus the learning rate times the gradient, so that I keep moving toward the best weight I can find.

Here is my setup: I have x = 0.1, which is my input, a target of 0.3, and a learning rate of 0.01. Then I get a random number, which is the initial weight. What I do then is compute y from the weight and x, find the prediction and the error, find all the gradients, and then pass the gradient in and update the weight.

Let me show you how increasing the number of steps makes me reach my target much better. As you can see, my initial output was 0.50-something, and I have run this just 10 times. Let me change that to a thousand. My input is 0.1 and the target is 0.3, so I have to turn 0.1 into 0.3. Let me run this again: it is still at 0.50. Let me add another zero, which is ten thousand times, and run it. As you can see, it has now become 0.47, which is much better than before. Let me add another zero. This is going to take more time because of the number of steps to be performed, but as you can see, what was 0.5 previously has now come down to 0.36, which is much better than what we were getting. Let me add another zero and run it again: 0.46, 0.44, 0.40. As you can see, the output keeps reducing and gets very close to what we needed. The target was 0.3, and we have already reached 0.30 followed by more digits. If you remember, we had 10 steps, which would just give us
0.50-something. Now, with the number of times it has learned, it has become much better than what we were getting previously: it is 0.30000-something, and it is still reducing. This is how gradient descent works: it keeps repeating and learning. This is the final output we have achieved, 0.3000577, which is much better than what we started with. Gradient descent uses differentiation: these derivative functions over here are what give me my new gradient from the error, which I put into the update function to make my weight better, so that from my input of 0.1 I get the target of 0.3. So I hope you understood how gradient descent works: it keeps iterating, and it uses differentiation a lot. With that we have come to the end of all that was required from multivariate calculus.

Let's move over to the next topic, which is probability. What is probability? Probability measures how likely it is that an event will occur. In other words, it is how sure you are, the number you can put on your sureness, that a particular event is going to happen. It is the ratio of the desired outcomes to the total outcomes, and with that formula you will be able to work out a probability. Always remember that the probabilities of all the possible outcomes add up to one. If you have 0.6, that means there is a 60% probability that this thing will happen. If you have 0.6 of something happening and 0.4 of it not happening, they
always sum up to one. Remember, probabilities always sum up to one. One example I've given here is rolling a die. There are six possibilities, and each one is one outcome out of the six, so, for example, the probability of getting a 2 is $\frac{1}{6}$. That is basically what probability is.

What terminology do you need to understand when it comes to probability? There are three things: what a random experiment is, what a sample space is, and what events are. Let's understand these one by one. A random experiment is an experiment or process for which the outcome cannot be predicted with certainty. For example, if I give you a die and tell you to roll it and tell me the number you are going to get, you cannot be sure, because that's a random experiment; you have uncertainty about whether a particular outcome will happen or not. That process of rolling the die is a random experiment. What is the sample space? The entire set of possible outcomes of a random experiment is the sample space of that experiment. It's really simple: I tell you to roll the die and there are six outcomes, 1, 2, 3, 4, 5 or 6. That set, {1, 2, 3, 4, 5, 6}, is the sample space: all the possible outcomes of the random experiment. What is an event? An event is one or more outcomes of the experiment. If I tell you to roll the die and you get a 1, that is an event; you roll again and get a 6, that is an event. Whatever outcome you get from your random experiment is called an event. There are two types of events, joint events and disjoint events; let's understand both. Joint events have common outcomes; joint events can happen together. That is what is a joint
event. For example, a student can get 100 marks in statistics and 100 marks in probability. The outcome of a ball that is delivered can be a no-ball and a six as well. Since both are possible together, these are called joint events. Disjoint events do not have common outcomes: the outcome of a ball that is delivered cannot be a six and a hit wicket. A single card cannot be a king and a queen together, and a man cannot be alive and dead at the same time. Those are disjoint events; they cannot happen together.

Now that we've understood the types of events, we need to understand distributions. We have three topics here: the probability density function, the normal distribution, and the central limit theorem. Let's understand them one by one. The first is the PDF, the probability density function. The probability can be described using an equation, and the equation that describes a continuous probability distribution is called the probability density function. I hope that's clear; if not, let me simplify it even more. You have a function that describes the probability of something happening, and you can plot it as a graph. That is why it is a continuous probability distribution. As you can see here, you have a and b, and the area under the curve between a and b tells you how likely it is that the value falls in that range.

Now that you understand what a PDF is, let's look at its properties. The graph of the PDF will be continuous over a range. The area bounded by the curve of the density function and the x-axis is always 1. And the probability that a random variable assumes a value between a and b is equal to the area under the PDF bounded by a
and b. What this means is that the probability of the variable taking a value between a and b is equal to the area under the curve that is bounded by a and b. That is basically what a probability density function is.

Next we have the normal distribution. What is a normal distribution? It is a probability distribution that associates the normal random variable X with a cumulative probability. Its curve is given by this formula: $y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, where x is the normal random variable, $\mu$ is the mean and $\sigma$ is the standard deviation. This is what the normal distribution looks like. The graph depends on two factors, the mean and the standard deviation. The mean determines the location of the center of the graph, and the standard deviation determines its height and width. If you have a large deviation among your values, the graph becomes short and wide, whereas if the deviation is small you get a tall, narrow graph. That is all we need from the normal distribution.

Then there is the central limit theorem. It is a theorem about how sample means are distributed. Let me give you the statement: the central limit theorem states that the sampling distribution of the mean of any independent random variable will be normal or nearly normal if the sample size is large enough. What this basically means is that the means of your samples cluster around the mean of the population when your sample size is large enough. So what this means is that if
your sample size is very big, the distribution of the sample mean is going to be a normal distribution; it is as simple as that. As you can see, I have n = 1 here, and you can see how the graph changes as n increases to 3, to 10, to 50: it becomes nearly identical to a normal distribution. That is what the central limit theorem is.

Once we are done with that, we will look at the types of probability: marginal probability, joint probability and conditional probability. Let's understand each of these. What is marginal probability? It is the probability of the occurrence of a single event. If I flip a coin, it is going to be either heads or tails; the probability of that single event occurring is called a marginal probability. For example, I have a deck of cards, I randomly take out a card, and I want it to be a heart. The probability of that is $\frac{13}{52}$, and it can be expressed by this formula. Now let's go to joint probability. It is the measure of two events happening at the same time. With the same example, I have a deck of cards and I want to take out the ace of hearts: it has to be an ace and it has to be a heart, which gives me a probability of $\frac{1}{52}$. Next, conditional probability. A probability that depends on something that has already happened is called a conditional probability; the outcome of the event is based on the occurrence of a previous event or outcome. The conditional probability of an event B is the probability that B will occur given that an event A has already occurred. There are two formulas here. If A and B are dependent events, the conditional probability is given by $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$. If A and B are independent events, the expression is just $P(B \mid A) = P(B)$, the probability of B occurring. For dependent events, finding the probability of B requires that A has already occurred, so it is the probability of A and B divided by the probability of A. That is all the introduction we needed from probability.

We now have Bayes' theorem. What is Bayes' theorem? It shows the relation between one conditional probability and its inverse. This is the formula: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$. What does this mean? The probability of A occurring given that B has already occurred is equal to the probability of B occurring given A, times $P(A)$, which is the prior probability, something we already believe, divided by the probability of B occurring. Let me give you an example. Suppose you are a doctor running a clinic where you test whether a patient has a liver problem. Out of all the previous patients who came to you, 10% had liver problems, so $P(A) = 0.10$. Out of all those patients, around 15% were drinking, so $P(B) = 0.15$. So now we have $P(A)$, the prior probability, and $P(B)$. And out of the patients who had a liver problem, what fraction were drinking? It is 5%, which is $P(B \mid A)$. Hence Bayes' theorem says that the probability that a new patient who drinks has a liver problem is $\frac{0.05 \times 0.10}{0.15}$, which is equal to 3.33
percent. What this means is that if a new patient who drinks walks into the clinic, there is a 3.33 percent chance that this person also has a liver problem. This is basically how Bayes' theorem works.

What are the applications of probability? Probability helps you optimize your model, classification algorithms require probability, the loss can be calculated using probability, and models are built on probability. Let me show you how the naive Bayes classifier works. Here is what I have done: I have imported datasets and metrics, and from the naive Bayes module I have imported the Gaussian naive Bayes classifier. Gaussian is another name for the normal distribution. I chose the Gaussian one because I had visualized the data and saw that it was normally distributed. I load the iris dataset with load_iris, create the model, and fit it. Then I get the model's predictions and compute the metrics: precision, recall and all of that. My model has an accuracy of 96%, and this is the confusion matrix: for the first class it classified all 50 out of 50 correctly; for the second, 47 were correct and 3 were wrong; and for the third, again 3 were classified wrong and 47 correctly. So this is the naive Bayes classifier. As you can see, probability is being used in all of this, though the Gaussian naive Bayes classifier already does the work for us, so we do not need to worry about it much. So now that we have finished up everything that we needed to do
from probability, let's move over to the last topic for today: statistics. What is statistics? Statistics is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. What this means is that you are going to analyze and understand all the data you have collected and decide how you are going to present it. For example, your company has created a new drug that may cure cancer. How do you conduct a test to confirm the drug's effectiveness? You collect a large group of people, you give them the new drug, you interpret the results that you get, and then you present that out of, let's say, a thousand people, around 900 of them were cured, or 950 recovered. All of those parts come under statistics. I hope you've understood that.

What are the terminologies you need to understand when you work with statistics? There is population and there is sample. What is a population? It is the collection or set of individuals or objects whose properties are to be analyzed. All the data that you have collected is called the population. Out of all that data, you take a set for your analysis, and that set is called a sample: a subset of the population is called a sample. A well-chosen sample will contain most of the information about the population parameter. So how do you choose the sample? It needs to be chosen such that everything the population is trying to convey can be conveyed through the sample itself. That is population and sample.

Now that we've understood the basic terminologies, let's look at some of the sampling techniques that we have in statistics. Sampling can be classified into probabilistic and non-probabilistic approaches. In probabilistic
sampling we have random sampling, systematic sampling, and stratified sampling. In the non-probabilistic one we have snowball, quota, judgment, convenience, and so on. But for machine learning we do not need to look at the non-probabilistic approaches; we just need the probabilistic ones. So we will now cover random sampling, systematic sampling, and stratified sampling.

So what is random sampling? You take a sample randomly out of the population, so each member of the population has an equal chance of being selected in the sample. That is random sampling.

What is systematic sampling? In systematic sampling you follow a particular order in which the sample is taken. As you can see here, I have six groups, and from these six groups I am going to take all the even ones, so that will be two, four, and six. That is how systematic sampling works: I follow a system, an order, for how I take the sample.

What is stratified sampling? First of all, what is a stratum? A stratum is a subset of the population that shares at least one common characteristic, which here is gender. As you can see, I have the whole population, and I have found one property that is common to all of them, gender: they are either male or female. So I break the population into the male subset and the female subset, and once I have those subsets, I apply random sampling within each one and take out the samples I need. This is stratified sampling.

So what are the types of
statistics that we need to know? We have descriptive statistics and inferential statistics, so let's understand each of these.

What is descriptive statistics? Descriptive statistics focuses on the main characteristics of the data and provides a graphical summary of it. It uses the data to provide descriptions of the population, through numbers, calculations, graphs, or tables. As you can see here, I have my shirts, and there can be three sizes: the minimum one, the average one, and the maximum one, and all of that is based on the weight of the person. Suppose a person weighs 60: he gets the maximum one. If the person weighs 40, he gets the average one, and if the person weighs 20, he gets the minimum one. That is how I am describing my data. So that is descriptive statistics.

What is inferential statistics? Inferential statistics makes inferences and predictions about a population based on a sample taken from it. What this means is that I generalize from a sample to a large data set, apply probability, and draw a conclusion. It allows me to infer population parameters based on a statistical model built from the sample data. I have a whole population, and a sample of them tells me whether they wear the large, the medium, or the small size, so according to those sizes I supply t-shirts to everyone.

So these are the two types of statistics: descriptive and inferential. Descriptive statistics describes the data, whereas inferential statistics takes information from the data and analyzes it to draw the conclusions we want. Let's go in depth with descriptive statistics. What
is descriptive statistics? It is a method used to describe and understand the features of a specific data set by giving short summaries about the sample and measures of the data. Descriptive statistics is broken down into measures of center and measures of variability. The measures of center are the mean, the median, and the mode, whereas the measures of variability are the range, the interquartile range, the variance, and the standard deviation. Let's understand all of these.

For example, here you can see a data set of cars, which contains the variables car, mileage, cylinder type, displacement, horsepower, and rear axle ratio. I'm going to find the mean. What is the mean? The mean is the average of all the samples. For example, to find the mean of the horsepower, I just add the values and divide by the number of values, and I get 103.625. This is the mean of the horsepower.

What is the median? It is the measure of the central value of the sample set. Let me show you how to find it for the miles per gallon. To find the center value, you first arrange the values in ascending order, so you have 21, 21, 21.3, 22.8, 23, and so on. Since there is an even number of values, I take the two middle values, 22.8 and 23, add them, and divide by 2, which gives 22.9. That is the median of my sample.

Next, the mode. The value that recurs most often in the sample set is called the mode. Let me find the mode for the cylinder type: which recurs more, type 6 or type 4? I can see that cylinder type 6 is repeated five times and type 4 is repeated three times. So six
becomes the mode of the data. Now that we have understood what is required from measures of center, let's move over to the measures of spread. A measure of spread, also called a measure of dispersion, is used to describe the variability in a sample or population.

The first one is the range. The range measures how spread apart the values in a data set are: it is the maximum value minus the minimum value.

What is the interquartile range? Quartiles tell us about the spread of the data by breaking it into quarters, just as the median breaks it into halves. For example, if I have eight data points here, I can break them into quarters: one and two, three and four, five and six, and seven and eight. So how do I find the interquartile range for the data set shown here? The first quartile, Q1, comes between the 25th and the 26th values, so it is (45 + 45) / 2, which is 45. The second comes between the 50th and the 51st values, which is (58 + 59) / 2 = 58.5. And the third comes between the 75th and the 76th values, which is (71 + 71) / 2, which is also 71. The interquartile range is the measure of variability based on dividing the data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts, and the values that separate the parts are Q1, Q2, and Q3. The interquartile range is Q3 minus Q1, which here is 71 - 45 = 26. So if you look at this, you have the whole data, which is 100 percent, broken into portions of 25 percent each, and the quartiles Q1, Q2, and Q3 sit at the boundaries between them.

Now that we've understood that, let's move over to variance. Variance describes how much a
random variable differs from its expected value; it entails computing the squares of the deviations. That is how you find the variance: you take the sum of (xᵢ - x̄)² over i = 1 to n and divide it by n, where xᵢ is each value and x̄ is the mean. I hope that is simple to understand. What does the variance mean? It tells you how far the data is from what it is expected to be: how much the values differ from the mean, on average, in squared terms.

What is deviation? Deviation is the difference between each element and the mean: I have xᵢ and I have the mean, and the difference between them is the deviation.

What is the population variance? It is given by σ² = (1/N) × Σ (xᵢ - μ)², summed over i = 1 to N. And what is the sample variance? It is s² = (1/(n - 1)) × Σ (xᵢ - x̄)², summed over i = 1 to n. Note that the sample version divides by n - 1 rather than n.

What is the standard deviation? The standard deviation is the measure of the dispersion of a set of data from its mean. What it is trying to tell you is how far the data spreads out from the mean. It is given by σ = √((1/N) × Σ (xᵢ - μ)²), summed over i = 1 to N, which is simply the square root of the variance. So how do you find the standard deviation? Let me give you an example. Say Daenerys has 20 dragons, and you have the numbers for them on the slide. How do you work out the standard deviation? The mean is given by
adding all the numbers and dividing by 20, so μ is equal to 7. Then, for the standard deviation, we need (xᵢ - μ)² for each value: (9 - 7)² = 2² = 4, then (2 - 7)² = 25, and so on. You keep going and you get the following results: 4, 25, 4, 9, 25, 0, and so on. Once we have done that, we put them into the formula, and we get the variance σ² = 8.9, and the standard deviation is just the square root of 8.9, which is 2.983. What this means is that the values typically differ from the mean by about 2.983, so a value within 2.983 of the mean is an ordinary one, and a value much further away is unusual. That is how standard deviation works.

Now that we have understood all of that, we have a very classic example: information gain. What does information gain help us with? It is an idea that comes into the picture for decision trees. So let's understand the basic concepts that we require for information gain. The first is entropy. What is entropy? Entropy measures the impurity or uncertainty present in the data, and it is given by the formula H(S) = -Σ pᵢ log₂ pᵢ, summed over i = 1 to N, where S is the set of all instances in the data set, N is the number of distinct class values, and pᵢ is the event probability of class i.

What is information gain? Information gain is how much information a particular feature gives us about the final outcome. It is given by Gain(A, S) = H(S) - Σ (|Sⱼ| / |S|) × H(Sⱼ), summed over j = 1 to v, which is equal to H(S) - H(A, S), where H(S) is
the entropy of the whole data set, |Sⱼ| is the number of instances with the j-th value of attribute A, |S| is the total number of instances in the data set, v is the number of distinct values of attribute A, H(Sⱼ) is the entropy of the subset of instances with the j-th value, and H(A, S) is the entropy of attribute A. Those are all the requirements.

Let me give you an example of information gain. You can see the data set here, which covers 14 days along with the attributes recorded for each day. The goal is to predict whether the match will be played or not according to the weather conditions. We have five nos and nine yeses. Take the outlook, which can be sunny, overcast, or rainy: for sunny we have three nos and two yeses, for overcast we have four yeses and zero nos, and for rainy we have three yeses and two nos. So how do we find out which attribute contributes the most? Let's first find the entropy for the whole data, which has nine instances of yes and five instances of no: H(S) = -(9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.940. With two classes, entropy ranges from 0 to 1, so 0.940 means the data is highly impure: the yeses and nos are mixed almost evenly, and the class labels on their own do not tell us much.

The first step in information gain is to find the root variable. How? We take all of the attributes and find the information gain from each of them: we have outlook, windy, humidity, and temperature. For the information gain of windy, we have six instances of true and eight instances of false, so the gain is 0.940 minus 8/14 times the entropy of the false subset, minus 6/14 times the entropy of the true subset, as you can see on the slide. The gain that we get from windy is 0.048, which is really low. So for outlook, the same thing happens over
here: the information gain from outlook is 0.247. Then we have the information gain from humidity, which is 0.151, which is also quite good, and temperature, which is 0.029. You can pause and look at the steps if you want to. Out of all the gains that we have, the variable with the highest information gain is used to split the data. Outlook has 0.247, windy has 0.048, humidity has 0.151, and temperature has 0.029. So what do we choose? We choose outlook, because it has the highest information gain, so that becomes our root node. That is a simple example of how statistics is used in information gain.

Next, let's move over to the confusion matrix. What is a confusion matrix? It is a table that is often used to describe the performance of a classification model on a set of test data. How do you calculate the accuracy from it? It is (true positives + true negatives) divided by (true positives + true negatives + false positives + false negatives). If you are confused about what true positives, true negatives, false positives, and false negatives are, let me explain that right now. Let's say there are two possible predicted classes, yes and no. The classifier made a total of 165 predictions. Out of those 165 cases, the classifier predicted yes 110 times and no 55 times. In reality, 105 patients had the disease and 60 patients did not. So how do we put this into the table? As you can see in the table, we have n = 165, and then predicted no and predicted yes, actual no and actual yes. If our classifier predicted no and it was actually no, that is the correct output, and
if it predicted yes and it was actually yes, that is also the correct output. But if it predicted yes and it was actually no, it has made a mistake, and if it predicted no but it was actually yes, it has also made a mistake. What are the correct values here? 50 and 100 are the correct values, and 10 and 5 are the wrong values. What is a true positive? True positives are cases predicted yes where the patients actually have the disease. True negatives are cases predicted no where they did not have the disease. False positives are cases the classifier predicted yes where they did not have the disease, and false negatives are cases it predicted no where they actually had the disease. So this is basically what a confusion matrix is, and it is helpful in finding out the performance of the classifier you are working with.

Next comes inferential statistics. It is a method where we infer and understand what the data is trying to communicate to us. It is broken down into two parts: point estimation and interval estimation. Let us understand point estimation first. Point estimation is concerned with using sample data to measure a single value that serves as an approximate value, or best estimate, of an unknown population parameter. We have such a huge population in India, right? Out of that you randomly take some sample, you find the sample mean, and this sample should be good enough that its mean estimates the mean of the whole population. This is how you use point estimation to approximate a value for the whole population. What are the different methods? We have the method of moments, maximum likelihood, the Bayes estimator, and best unbiased estimators. In the method of moments, estimates are found by equating the first k sample moments to the corresponding k population moments. Then we have maximum likelihood, which uses a model to
maximize a likelihood function. The Bayes estimator minimizes the average risk, which is an expectation over the random variables. And then we have the best unbiased estimators, where the best estimator depends on the parameter you are trying to find. That is all we require from point estimation.

Let's move over to interval estimates. An interval, or range of values, is used to estimate the population parameter. So we have a point estimate, and around it we have an interval that gives us the lower confidence limit and the upper confidence limit. There are three things you need to remember here. First, the confidence interval: it is the measure of your confidence that the interval estimate contains the population mean μ. Statisticians use a confidence interval to describe the amount of uncertainty associated with a sample. Technically, it is a range of values so constructed that there is a specified probability of including the true value of a parameter within it. For example, let's take a sample of 10,000. Say we have a supermarket, and in that supermarket we have collected a lot of data. From that data I say that if item 1 and item 2 are kept together, they are going to be sold together, and I say that with 90% confidence. That 90% is my level of confidence. The estimate can lie in a range, say between 90 and 95%, and within that range it is highly likely that that particular thing is going to happen; that range is my confidence interval.

Second, what is the sampling error? It is the difference between the point estimate and the actual population parameter. When μ is estimated, the sampling error is μ - x̄, the population mean minus the point estimate.

Third, what is the margin of error? For a given level of confidence, it is the greatest possible distance between the point estimate and the value of the parameter it is estimating.
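As a rough sketch of how a margin of error turns a point estimate into an interval, here is a small Python example. It is only an illustration under stated assumptions: the formula margin = z_c × σ / √n (a z-based interval with a known population standard deviation) is not given in the session, the helper name `z_interval` is made up for this sketch, and the sample numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(sample_mean, sigma, n, confidence=0.90):
    """Return (lower, upper, margin) for a z-based interval around the mean.

    Assumes the population standard deviation `sigma` is known.
    """
    # Critical value z_c: the central `confidence` share of the standard
    # normal curve lies between -z_c and +z_c.
    z_c = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Margin of error: greatest expected distance between the point
    # estimate and the true mean at this level of confidence.
    margin = z_c * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin, margin


# Hypothetical numbers: 100 measurements, mean 69.0, known sigma 3.0.
lower, upper, margin = z_interval(sample_mean=69.0, sigma=3.0, n=100)

# Critical value for 90% confidence (5% left in each tail).
z_90 = NormalDist().inv_cdf(0.95)

print(round(z_90, 3))                    # 1.645
print(round(lower, 2), round(upper, 2))  # 68.51 69.49
```

The point estimate sits in the middle of the interval, and the margin of error is half the interval's width, so a larger sample or a lower level of confidence gives a narrower interval.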
What the margin of error basically means is that it is the greatest possible distance between what you have estimated and what the value actually is, at the level of confidence you are giving to it. How much error you can allow for the estimate is the margin of error. I hope you have understood that.

The level of confidence c is the probability that the interval estimate contains the population parameter. As you can see here, the area c under the curve between -z_c and +z_c is what we are allowing, and the margin of error is measured within it; whatever falls beyond -z_c or +z_c, in the tails, we do not take. Finding these interval estimates is our work here. For example, if the level of confidence is 90%, this means you are 90% confident that the interval contains the population mean. The remaining 10% is split into 0.05 in each tail, and the corresponding z-scores are -z_c = -1.645 and +z_c = +1.645. Anything below -1.645 or above +1.645 falls outside the interval. Those are the values we are concerned with when we work with interval estimates.

Now that we have understood what we needed from inferential and descriptive statistics, let's move over to hypothesis testing. What is a hypothesis, first of all? It is an assumption you have made about some event, which may or may not turn out to be true. Hypothesis testing is formally checking whether this hypothesis is accepted or rejected. How is hypothesis testing conducted? You first state the hypothesis, then you formulate an analysis plan for how you are going to test it, then you analyze the data you get, and then you interpret the results to understand whether this particular hypothesis has failed or whether it is a good
hypothesis. Okay, suppose I have four boys here: Nick, John, Bob, and Harry. They were caught doing mischief in class, and they now have to serve detention for almost two months. The detention is that they have to clean the classroom. John comes up with an idea: "I'm going to write all our names on chits and put them into a bowl, and whoever's name comes out has to clean the classroom." We are going to assume that the event is free of bias, so our hypothesis is that John is not cheating. Let's find out what actually happens. The probability of John not being picked on a given day is 3/4. Now, that probability keeps decreasing the longer he goes without being picked. If he is not picked for three days, it is 3/4 × 3/4 × 3/4, which is approximately 0.42. And what is the probability of John not being picked for 12 days? It is 3/4 to the power of 12, which is 0.032, or 3.2 percent, which is less than 0.05, or 5%. Our hypothesis was that John is not cheating, but the probability, at 3.2 percent, has come below the threshold value of five percent, so we reject it and conclude that John is cheating. So now all of us have come to know that John is cheating, because he has not even written his name on the chits. This is how hypothesis testing works.

So what is the null hypothesis? The hypothesis that we created in the beginning, that John was not cheating, was the null hypothesis: it says the result is no different from the assumption. And we have the alternate hypothesis, which says that our results disprove the assumption. We had a hypothesis that John is not cheating, but through our results we found out that John is cheating. So this is hypothesis testing, and we have a threshold value: if the probability that we are testing goes below this threshold value, it means that the
particular hypothesis has failed. So this is hypothesis testing.

Now we need to know something about the p and t values. What are they? The p-value and the t-value help us in our hypothesis testing. Say we want to study the height of students who are taller than 5 feet 7 inches. We take a sample of a hundred students and find that the mean height is 5 feet 9 inches. We make a hypothesis that, out of the hundred students, at least six are taller than 5 feet 9 inches. The p-value is a probability value attached to that hypothesis: it is the probability of seeing a result at least as extreme as the one we observed if the null hypothesis were true, so a small p-value counts as evidence against the null hypothesis. The t-value tests the hypothesis by measuring the difference between what we assumed and what we actually calculated from our results, relative to the variation in the sample. This is how the p-value and the t-value help in deciding between the null and the alternate hypothesis and in interpreting the results.

Let me show you some code where we find the mean, median, and mode, and how that becomes helpful to us. So let me go back to the presentation. Now, I have my data here, I import statistics as s, and I find the mean, median, and mode of it. Let me run this program. As you can see, the mean is 33.43, the median is 3, and the mode is 2: 2 has been repeated the most number of times, which is why it is the mode. The variance is [unclear] and the standard deviation is 73. That is how the
mean, median, and mode are calculated. Let me run the next program so that you can understand what we are trying to do; let me close this out. What's happening here is that I have the iris data, loaded with load_iris, and I turn it into a data frame. Simple enough. After that, I set the species column from the target and apply a lambda to it. Then I get the description of the data, and after that I take this data, draw a pair plot of it, and show the plot to you. So let me run the program and show you the figure first. I had four features here: petal length, petal width, sepal length, and sepal width. If you take a look at this data, you can see that there is a normal distribution here, a normal distribution here too, a normal distribution over here with some exceptions, and a normal distribution here as well, again with some exceptions. Looking at this, I can decide whether I want to apply the Gaussian naive Bayes classifier or something else. This is how statistics has helped me understand what my data looks like.

Now let me go back over here. As you can see, the count is 150. Then there is the mean, the standard deviation, and the maximum length and maximum width that you can find here. All of this is done using describe, which is a function that is already available in pandas. So that was a very simple example of how I understood that I should use the Gaussian naive Bayes classifier for my iris data, which I showed in the probability part. Now that we are done with this, let's go back to the presentation. With that program, we come to the end of statistics and also to the end of our tutorial. That was a lot to take in, but
trust me, if you are through with this, you know the basics of mathematics for machine learning. Why I say basics is because every problem is different, and solving them helps you master it, but if you have at least this much, you have a start on all those problems.

So let's summarize what we have learnt till now. We understood why mathematics is so important for machine learning and why you should learn it. We then covered linear algebra and how you can use it for various tasks such as PCA. Thereafter we looked at multivariate calculus, its rules, and how it helps in optimizing the model we have created. And then we had statistics and probability, which were taught separately but share a lot in common when it comes to how they work in machine learning, along with what a hypothesis is, what hypothesis testing is, and much more.

With that, we have reached the end of the session. I hope it was elaborate and precise, as this is all the math you would need when it comes to machine learning. I enjoyed sharing this information with you guys. If you have any doubts, drop them in the comment section below and I will get back to you as soon as possible. So until next time, take care and happy learning.

I hope you have enjoyed listening to this video. Please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist and subscribe to the Edureka channel to learn more. Happy learning.
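For anyone who wants to re-check the information gain numbers from the match example, a minimal Python sketch follows. It uses only the yes/no counts quoted in the session for the whole data set and for the outlook attribute. The function names `entropy` and `information_gain` are made up for this sketch and are not taken from the code shown in the video.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    # Skip empty classes: the term p * log2(p) is taken as 0 when p is 0.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted


# Counts are [yes, no], as quoted in the session.
whole = [9, 5]
outlook = [[2, 3], [4, 0], [3, 2]]  # sunny, overcast, rainy

h_s = entropy(whole)
gain_outlook = information_gain(whole, outlook)

print(round(h_s, 3), round(gain_outlook, 3))  # 0.94 0.247
```

Running the same `information_gain` call with the counts for windy, humidity, and temperature from the slide would let you compare all four attributes and confirm that outlook gives the largest gain.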