Transcript for:
Support Vector Machines Lecture Notes

The support vector machine (SVM) is a powerful supervised machine-learning model that can handle both linear and nonlinear classification. Given a set of labeled training data, SVM finds the optimum hyperplane that categorizes new examples. In one-dimensional space the hyperplane is a point, in two-dimensional space it is a line, and in three-dimensional space it is a plane that divides the space into two parts, with each class lying on either side. In the first section of this video we are going to go through the detailed process of finding the optimum hyperplane.

Let's start with the two-dimensional, linearly separable case. Figure 1 shows data that can be separated by a line; in this case the data are linearly separable. Figure 2 is an example of non-linearly-separable data, which means we cannot find a line to separate the data. So we say the data are linearly separable if we can find a line that separates the original data.

Let's have a closer look at Figure 1. In general there are many possible hyperplanes that separate the positive samples from the negative samples, but which one is the best? SVM chooses the hyperplane that maximizes the margin, sometimes called the street, around the hyperplane while separating the positive samples from the negative ones. That is why SVM is also called a maximum-margin classifier.

Now let's introduce another important concept: support vectors. Support vectors are the data points that lie closest to the hyperplane. They are the data points most difficult to classify, and they decide which hyperplane we should choose. Unlike some other machine-learning models such as linear regression or neural networks, where all data points influence the final optimization, in SVM only the difficult points, the support vectors, have an impact on the final decision boundary. Moving a support vector moves the decision boundary; moving the other vectors has no impact.

Just like any other machine-learning model, at inference time SVM predicts an output y from an input x. The input is a set of features x for each sample, and SVM finds a set of weights w, one for each feature; a linear combination of the input features and the weights then gives the prediction y. So far this is just like a neural network, but there is a major difference: SVM optimizes the weights in such a way that only the support vectors determine the weights and the decision boundary.

Now for the fun part of this section: we are going to go through the underlying mathematics SVM uses to find the hyperplane and maximize the margin. This part only requires some basic background in linear algebra and optimization theory. Staying with the 2D example, we have three important lines: the middle line of the street, h0, which is the hyperplane, and the two gutters of the street, h1 and h2. Assume there is a vector w perpendicular to these three lines. All the points on h0 have the same projection distance onto the direction of w, that is, w · x / ||w|| = c for some constant c. Multiplying both sides by ||w||, we obtain w · x = c ||w||. Substituting -b for c ||w||, we get the equation of the line h0: w · x + b = 0. Here we only have two dimensions, so this expands to w1 x1 + w2 x2 + b = 0, but in fact the expression works for any number of dimensions.
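To make the hyperplane equation concrete, here is a minimal NumPy sketch showing how the sign of w · x + b tells us which side of the hyperplane a sample falls on, which is exactly the prediction rule discussed next. The weight vector and bias are made-up values, not taken from the lecture.

```python
import numpy as np

# Minimal sketch of the hyperplane w . x + b = 0 in two dimensions.
# The weight vector and bias below are made-up values for illustration only.
w = np.array([2.0, -1.0])
b = -3.0

def predict(x):
    """Classify a sample by which side of the hyperplane it falls on."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(predict(np.array([3.0, 1.0])))  # 2*3 - 1*1 - 3 = 2  >= 0, so +1
print(predict(np.array([0.0, 2.0])))  # 2*0 - 1*2 - 3 = -5 <  0, so -1
```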
Once we have the expression for the hyperplane, we can use it to make predictions: if the expression evaluated on an unknown sample is greater than or equal to zero we predict a positive sample, otherwise we predict a negative sample. Similarly, we can write the expressions for the gutters h1 and h2 as w · x + b = k and w · x + b = -k, where k is just another constant; we can rescale w and b to make k equal to 1 for mathematical convenience.

As introduced before, SVM is a maximum-margin classifier, so we need a way to express the margin, or street width. Let's introduce two vectors: x-, a negative sample on the gutter h2, and x+, a positive sample on the gutter h1. Taking the difference of these two vectors and dotting it with the unit vector perpendicular to h0 gives an expression for the width of the street: width = (x+ - x-) · w / ||w||. From the gutter equations for h1 and h2 we get w · x+ = 1 - b for the positive sample and w · x- = -1 - b for the negative sample. Substituting these back into the street-width equation, the width comes out as 2 / ||w||.

To maximize this width we need to minimize ||w||. For mathematical convenience this is equivalent to minimizing 1/2 ||w||^2, which is the same as 1/2 w · w. However, we cannot make the street arbitrarily wide; we can only widen it under the condition that there are no data points between h1 and h2. Mathematically, every positive sample must obey the constraint w · x + b >= 1, and every negative sample must obey w · x + b <= -1. Now let's introduce another variable y to combine these two inequalities into one, which is more convenient mathematically: let y be 1 for all positive samples and -1 for all negative samples, so the two constraints become a single equation for all samples, y_i (w · x_i + b) >= 1.

Let's revisit the problem we are trying to solve: minimize the objective 1/2 w · w subject to the constraint above for all samples. This is a constrained optimization problem, and we have a powerful technique for it: Lagrange multipliers. Whenever we want to find an extremum of a function subject to constraints, we use Lagrange multipliers. This gives us a new expression whose extremum we can find without thinking about the constraints anymore. For this problem the new expression is L = 1/2 w · w - sum_i alpha_i [y_i (w · x_i + b) - 1], where each constraint gets its own multiplier alpha_i.

Next we find an extremum of this expression by taking its derivatives and setting them to zero. The first is the partial derivative of the Lagrangian L with respect to the vector w; setting it to zero gives w = sum_i alpha_i y_i x_i, summed over all samples. This is very interesting: it tells us that the vector w is a linear combination of the sample vectors. In fact, at the end of the optimization most of the alpha_i turn out to be zero, and the nonzero alpha_i correspond to the samples on the gutters, which are the support vectors. Differentiating the Lagrangian with respect to b gives sum_i alpha_i y_i, and at the extremum this must also equal zero.
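These two optimality conditions are easy to check numerically. Below is a minimal sketch, assuming scikit-learn is available, that fits a linear SVM on made-up toy data (the points and the large C value are my own choices, not from the lecture). In scikit-learn, dual_coef_ stores alpha_i * y_i for the support vectors, so we can verify that w = sum_i alpha_i y_i x_i, that sum_i alpha_i y_i = 0, and the street width 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data: three negative and three positive points.
X = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 3.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin formulation discussed here.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
# In scikit-learn, dual_coef_ holds alpha_i * y_i for the support vectors only.
w_from_alphas = clf.dual_coef_[0] @ clf.support_vectors_

print(np.allclose(w, w_from_alphas))           # w = sum_i alpha_i y_i x_i
print(np.isclose(clf.dual_coef_.sum(), 0.0))   # sum_i alpha_i y_i = 0
print("street width:", 2 / np.linalg.norm(w))  # margin width = 2 / ||w||
```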
Now we plug the derived expression for w back into the Lagrangian, and the magic happens: after combining terms, the equation can be rewritten as L = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j (x_i · x_j). What we have discovered is that the optimization depends only on the dot products of pairs of samples. We can also substitute the expression for w back into the decision rule, which becomes: if sum_i alpha_i y_i (x_i · u) + b >= 0 for an unknown vector u, then the prediction is a positive sample. So the decision rule also depends only on the dot products of the sample vectors with the unknown vector.

Our discussion so far has assumed that the samples are linearly separable and that there are no outliers in the data; the SVM formulation above is called hard-margin SVM. What if there are noise and outliers in the data set? Then the hard-margin SVM simply cannot tolerate them and fails to find a solution, because for the optimization problem to be solved all the constraints have to be satisfied. In this part we will talk about how to deal with this limitation using soft-margin SVM.

The trick is very simple: we add a slack variable zeta to each constraint, so the constraint becomes y_i (w · x_i + b) >= 1 - zeta_i. With these slack variables, minimizing the objective function can still satisfy the constraints even if some outliers do not meet the original constraint. The problem is that we could always choose zeta large enough that every example satisfies its constraint. One technique to handle this is regularization; for example, we can use L1 regularization to penalize large values of zeta. The regularized optimization problem becomes: minimize 1/2 w · w + C sum_i zeta_i subject to y_i (w · x_i + b) >= 1 - zeta_i, with zeta_i >= 0. The regularization parameter C determines how important the zeta terms are, in other words how much we want to avoid misclassifying each training example: a larger C emphasizes the penalty on zeta, while a smaller C diminishes it. If we set C to positive infinity, we recover the hard-margin SVM. So small values of C give a wider margin at the cost of some misclassifications, while large values of C give a smaller margin that is less tolerant of outliers. With this change, the soft-margin SVM can handle data that is non-linearly separable because of some outliers and noise.

But what if the non-linear separability is not caused by noise? What if the data are characteristically non-linearly separable, like the following case? In this part we will talk about a technique called the kernel trick to deal with this kind of problem. For this example it looks like there is no easy way to draw a line separating the positive samples from the negative samples. However, if we transform the data from one space to another, we can find a hyperplane that separates them. Let's revisit our Lagrangian: the optimization depends only on the dot products x_i · x_j. So we can use any transformation whose dot product in the new space can be computed directly from x_i and x_j; that function is called a kernel function, k(x_i, x_j). The kernel trick is simply to define a kernel function k and use it to replace x_i · x_j in the Lagrangian. This is a small but powerful trick.
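As a quick illustration of why this works, here is a small sketch with made-up vectors showing that the degree-2 polynomial kernel k(x_i, x_j) = (x_i · x_j)^2 equals the dot product of explicitly transformed vectors phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so the kernel lets us work in the transformed space without ever computing phi.

```python
import numpy as np

# The kernel trick in miniature: the degree-2 polynomial kernel
# k(xi, xj) = (xi . xj)^2 equals the dot product of the transformed vectors
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so phi never has to be computed explicitly.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(xi, xj):
    return np.dot(xi, xj) ** 2

# Made-up sample vectors.
xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

print(np.dot(phi(xi), phi(xj)))  # dot product computed in the transformed space
print(k(xi, xj))                 # same value, computed directly in the original space
```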
For this problem we could try the transformation that maps each sample x_i to (x1_i^2, x2_i^2). If we plot the samples in this new space, a clear separation becomes visible, and a line can be drawn to separate the positive samples from the negative samples. Here I will also show two popular kernels: the polynomial kernel and the RBF kernel. The polynomial kernel is defined as k(x_i, x_j) = (x_i · x_j + c)^d; it has two parameters, a constant c and the degree d. A larger value of d makes the decision boundary more complex and may result in overfitting. The RBF kernel, also called the Gaussian kernel, is defined as k(x_i, x_j) = exp(-gamma ||x_i - x_j||^2); it produces a more complex, Gaussian-shaped boundary and has one parameter, gamma. Here are some examples of SVM with the RBF kernel for different gamma values; the simulation is done with scikit-learn. A small value of gamma makes the model behave like a linear SVM without a kernel, while a large value of gamma makes the model heavily influenced by the support vectors and may result in overfitting.

I hope you like this video. Don't forget to subscribe, ring the notification bell, and stay tuned.
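As a closing note to these lecture notes, here is a minimal scikit-learn sketch comparing the explicit (x1^2, x2^2) mapping followed by a linear SVM against the RBF kernel at a few gamma settings. The synthetic ring data and the particular gamma values are my own choices for illustration, not taken from the lecture.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic "ring" data: an inner cluster labelled -1 and an outer ring labelled +1.
theta = rng.uniform(0.0, 2 * np.pi, 200)
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([-np.ones(100), np.ones(100)])

# Option 1: explicit transform to (x1^2, x2^2); the data become linearly separable there.
X_mapped = X ** 2
print("linear SVM on mapped features:",
      SVC(kernel="linear").fit(X_mapped, y).score(X_mapped, y))

# Option 2: let the RBF kernel do the mapping implicitly; gamma controls how
# locally the support vectors influence the decision boundary.
for gamma in [0.01, 1.0, 100.0]:
    acc = SVC(kernel="rbf", gamma=gamma).fit(X, y).score(X, y)
    print(f"gamma={gamma}: training accuracy = {acc:.2f}")
```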