Support Vector Machines (SVM)
Overview
- SVM is a supervised machine learning model capable of both linear and, through kernels, nonlinear classification.
- Given labeled training data, it finds the optimal hyperplane that separates the classes and uses it to categorize new examples.
Concepts of Hyperplane
- Hyperplane in 1D Space: A point.
- Hyperplane in 2D Space: A line.
- Hyperplane in 3D Space: A plane dividing the space into two parts.
- In general, a hyperplane in n-dimensional space is a flat (n-1)-dimensional subspace.
Linear Separability
- Linearly Separable Data: The classes can be separated by a straight line (a single hyperplane).
- Non-Linearly Separable Data: No straight line can separate the classes; a more complex boundary is needed (see the sketch after this list).
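To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the datasets and all parameters are arbitrary illustrations) that builds one linearly separable and one non-linearly separable toy dataset and fits an SVM to each:

```python
# Toy illustration: linearly separable blobs vs. non-linearly separable circles.
# All dataset parameters are arbitrary choices for this sketch.
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Two well-separated Gaussian blobs: a straight line splits them.
X_lin, y_lin = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
print("blobs, linear SVM accuracy:",
      SVC(kernel="linear").fit(X_lin, y_lin).score(X_lin, y_lin))

# Concentric circles: no straight line separates the classes.
X_circ, y_circ = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)
print("circles, linear SVM accuracy:",
      SVC(kernel="linear").fit(X_circ, y_circ).score(X_circ, y_circ))
print("circles, RBF-kernel SVM accuracy:",
      SVC(kernel="rbf").fit(X_circ, y_circ).score(X_circ, y_circ))
```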
Choosing the Best Hyperplane
- Many different hyperplanes can separate the classes.
- SVM selects the hyperplane that maximizes the margin, the width of the "street" around it.
- For this reason, SVM is often called the Maximum Margin Classifier.
Support Vectors
- Support vectors are the training points closest to the hyperplane and the hardest to classify.
- These points alone determine the final decision boundary.
- The remaining points have no influence on the boundary (see the sketch after this list).
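As a quick check, the sketch below (a scikit-learn example on arbitrary toy data) fits a linear SVM and lists which training points ended up as support vectors:

```python
# Sketch: inspect which training points become support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # arbitrary toy data
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors per class:", clf.n_support_)
print("indices of support vectors:", clf.support_)
print("first few coordinates:", clf.support_vectors_[:3])
# Refitting after removing a point that is NOT a support vector leaves the
# decision boundary unchanged; removing a support vector generally moves it.
```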
Prediction in SVM
- During inference, SVM predicts the output label (Y) from the input features (X).
- Training finds a weight (W) for each feature plus a bias (B); the prediction is the sign of W · X + B, i.e., which side of the hyperplane the point falls on.
- Unlike neural networks, the final SVM solution is determined only by the support vectors (see the sketch after this list).
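A minimal sketch of this (assuming a linear kernel so that the weight vector W is exposed explicitly; data and parameters are arbitrary) compares the manual sign of W · X + B with scikit-learn's own decision function:

```python
# Sketch: for a linear SVM, the prediction is determined by the sign of W · x + B.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # arbitrary toy data
clf = SVC(kernel="linear").fit(X, y)

W = clf.coef_[0]        # learned weight vector (one weight per feature)
B = clf.intercept_[0]   # learned bias
x_new = X[0]

score = W @ x_new + B                        # signed score: side of the hyperplane
print("manual score:", score, "-> side:", np.sign(score))
print("decision_function:", clf.decision_function([x_new])[0])
print("predicted label:", clf.predict([x_new])[0])
```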
Mathematical Foundation
Hyperplane Equation
- For a 2D example with three parallel lines (h0, h1, h2):
  - h0 (the separating hyperplane): W · x + B = 0, where W is a vector perpendicular to the hyperplane and B is the bias.
  - h1 and h2 are the margin boundaries passing through the closest samples on each side.
Margin and Constraints
- The margin (the width of the "street") is:
  2 / ||W||
- Maximizing the margin is equivalent to minimizing:
  1/2 * ||W||^2
  subject to the constraints:
  - For positive samples (y_i = +1): W · x_i + B ≥ 1
  - For negative samples (y_i = -1): W · x_i + B ≤ -1
  - Both cases combine into y_i * (W · x_i + B) ≥ 1 (see the sketch below).
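The sketch below (arbitrary toy data; a large C is used so the soft margin approximates a hard margin) recovers the margin width 2 / ||W|| from a fitted linear SVM:

```python
# Sketch: compute the margin width 2 / ||W|| of a fitted linear SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)  # toy data
clf = SVC(kernel="linear", C=1e3).fit(X, y)  # large C approximates a hard margin

W = clf.coef_[0]
print("||W|| =", np.linalg.norm(W))
print("margin width 2/||W|| =", 2.0 / np.linalg.norm(W))
```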
Lagrangian Multipliers
- To solve this constrained optimization problem, we introduce Lagrange multipliers α_i ≥ 0.
- They combine the objective and the constraints into a single expression:
  L(W, B, α) = 1/2 * ||W||^2 - Σ α_i * [y_i * (W · x_i + B) - 1]
Deriving Weights
- Setting the derivatives of L with respect to W and B to zero gives:
  W = Σ α_i * y_i * x_i   and   Σ α_i * y_i = 0
- α_i is non-zero only for the support vectors, so the sum runs over them alone (see the sketch below).
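This relationship can be checked numerically: in scikit-learn's SVC, the attribute dual_coef_ stores the products α_i * y_i for the support vectors, so summing them against the support vectors should reproduce W (sketch with arbitrary toy data):

```python
# Sketch: verify W = Σ α_i * y_i * x_i using the fitted dual coefficients.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # arbitrary toy data
clf = SVC(kernel="linear").fit(X, y)

# dual_coef_ holds alpha_i * y_i (one row); support_vectors_ holds the x_i.
W_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(W_from_dual, clf.coef_))   # True: both give the same W
```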
Soft Margin SVM
- Deals with non-separable data and outliers by introducing slack variables (ζ_i ≥ 0).
- The constraints are relaxed to y_i * (W · x_i + B) ≥ 1 - ζ_i, allowing some misclassifications.
- The objective becomes 1/2 * ||W||^2 + C * Σ ζ_i, where the regularization parameter
  C
  balances margin width against misclassification tolerance:
  - Small C: wider margin, accepts more misclassifications.
  - Large C: narrower margin, less tolerance for misclassified points and outliers (see the sketch below).
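The following sketch (overlapping toy classes; the C values are arbitrary choices) shows the trade-off: as C grows, the margin shrinks and fewer violations are tolerated:

```python
# Sketch: effect of the regularization parameter C on the soft margin.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)  # overlapping classes

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width = {margin:.3f}, support vectors = {clf.n_support_.sum()}")
```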
Handling Non-Linearly Separable Data
- Kernel Trick: Implicitly maps the data to a higher-dimensional space where it becomes linearly separable, without ever computing the mapping explicitly.
- In practice, the dot product X_i · X_j in the Lagrangian (dual) formulation is replaced by a kernel function K(X_i, X_j).
- Common Kernels (see the sketch after this list):
  - Polynomial Kernel (c is a constant term and D the degree; c is distinct from the regularization parameter C):
    K(X_i, X_j) = (c + X_i · X_j)^D
  - RBF (Gaussian) Kernel:
    K(X_i, X_j) = exp(-γ ||X_i - X_j||^2)
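Both kernels can be written out directly and passed to scikit-learn's SVC as callables that return the Gram matrix; the sketch below does this on the concentric-circles data (the constant term c, degree D, and γ are arbitrary choices, and the built-in kernel="poly" / kernel="rbf" options compute the same quantities):

```python
# Sketch: the polynomial and RBF kernels written out explicitly.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

def poly_kernel(A, B, c=1.0, D=3):
    # K(x_i, x_j) = (c + x_i · x_j)^D for every pair of rows in A and B
    return (c + A @ B.T) ** D

def rbf_kernel(A, B, gamma=0.5):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)
print("polynomial kernel accuracy:", SVC(kernel=poly_kernel).fit(X, y).score(X, y))
print("RBF kernel accuracy:", SVC(kernel=rbf_kernel).fit(X, y).score(X, y))
```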
Effects of Gamma in RBF Kernel
- Small γ: the kernel varies slowly, giving a smooth boundary that approximates a linear SVM (risk of underfitting).
- Large γ: the kernel is sharply peaked around each support vector, giving a highly flexible boundary that can overfit (see the sketch below).
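The sketch below (noisy circles data with half held out; all values are arbitrary) compares training and held-out accuracy for a small, a moderate, and a very large γ:

```python
# Sketch: how gamma changes an RBF SVM's fit (train vs. held-out accuracy).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, noise=0.15, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for gamma in (0.01, 1.0, 1000.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:>7}: train acc = {clf.score(X_tr, y_tr):.2f}, "
          f"test acc = {clf.score(X_te, y_te):.2f}")
```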
Conclusion
- Soft margin SVM handles noisy, overlapping data, and the kernel trick extends SVM to non-linearly separable data.
- Understanding the underlying mathematics and kernel trick is crucial for successful SVM application.