Support Vector Machines (SVM)
Overview
- SVM is a supervised machine learning model capable of both linear and, through kernels, nonlinear classification.
- Given labeled training data, it finds the optimal hyperplane that separates the classes and uses it to categorize new examples.
Concepts of Hyperplane
- Hyperplane in 1D Space: A point.
- Hyperplane in 2D Space: A line.
- Hyperplane in 3D Space: A plane dividing the space into two parts.
- In general, a hyperplane in n-dimensional space is a flat (n-1)-dimensional subspace.
Linear Separability
- Linearly Separable Data: The classes can be separated by a straight line (a single hyperplane).
- Non-Linearly Separable Data: No straight line can separate the classes; a more complex boundary is needed (see the sketch after this list).
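To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the datasets and all parameters are arbitrary illustrations) that builds one linearly separable and one non-linearly separable toy dataset and fits an SVM to each:

```python
# Toy illustration: linearly separable blobs vs. non-linearly separable circles.
# All dataset parameters are arbitrary choices for this sketch.
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Two well-separated Gaussian blobs: a straight line splits them.
X_lin, y_lin = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
print("blobs, linear SVM accuracy:",
      SVC(kernel="linear").fit(X_lin, y_lin).score(X_lin, y_lin))

# Concentric circles: no straight line separates the classes.
X_circ, y_circ = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)
print("circles, linear SVM accuracy:",
      SVC(kernel="linear").fit(X_circ, y_circ).score(X_circ, y_circ))
print("circles, RBF-kernel SVM accuracy:",
      SVC(kernel="rbf").fit(X_circ, y_circ).score(X_circ, y_circ))
```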
Choosing the Best Hyperplane
- Many different hyperplanes can separate the classes.
- SVM selects the hyperplane that maximizes the margin, the width of the "street" around it.
- For this reason, SVM is often called the Maximum Margin Classifier.
Support Vectors
- Support vectors are the training points closest to the hyperplane and the hardest to classify.
- These points alone determine the final decision boundary.
- The remaining points have no influence on the boundary (see the sketch after this list).
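As a quick check, the sketch below (a scikit-learn example on arbitrary toy data) fits a linear SVM and lists which training points ended up as support vectors:

```python
# Sketch: inspect which training points become support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # arbitrary toy data
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors per class:", clf.n_support_)
print("indices of support vectors:", clf.support_)
print("first few coordinates:", clf.support_vectors_[:3])
# Refitting after removing a point that is NOT a support vector leaves the
# decision boundary unchanged; removing a support vector generally moves it.
```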
Prediction in SVM
- During inference, SVM predicts the output label (Y) from the input features (X).
- Training finds a weight (W) for each feature plus a bias (B); the prediction is the sign of W · X + B, i.e., which side of the hyperplane the point falls on.
- Unlike neural networks, the final SVM solution is determined only by the support vectors (see the sketch after this list).
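A minimal sketch of this (assuming a linear kernel so that the weight vector W is exposed explicitly; data and parameters are arbitrary) compares the manual sign of W · X + B with scikit-learn's own decision function:

```python
# Sketch: for a linear SVM, the prediction is determined by the sign of W · x + B.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # arbitrary toy data
clf = SVC(kernel="linear").fit(X, y)

W = clf.coef_[0]        # learned weight vector (one weight per feature)
B = clf.intercept_[0]   # learned bias
x_new = X[0]

score = W @ x_new + B                        # signed score: side of the hyperplane
print("manual score:", score, "-> side:", np.sign(score))
print("decision_function:", clf.decision_function([x_new])[0])
print("predicted label:", clf.predict([x_new])[0])
```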
Mathematical Foundation
Hyperplane Equation
- For a 2D example with three parallel lines (h0, h1, h2):
  - h0 (the separating hyperplane): W · x + B = 0, where W is a vector perpendicular to the hyperplane and B is the bias.
  - h1 and h2 are the margin boundaries passing through the closest samples on each side.
Margin and Constraints
- The margin (the width of the "street") is:
  2 / ||W||
- Maximizing the margin is equivalent to minimizing:
  1/2 * ||W||^2
  subject to the constraints:
  - For positive samples (y_i = +1): W · x_i + B ≥ 1
  - For negative samples (y_i = -1): W · x_i + B ≤ -1
  - Both cases combine into y_i * (W · x_i + B) ≥ 1 (see the sketch below).
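The sketch below (arbitrary toy data; a large C is used so the soft margin approximates a hard margin) recovers the margin width 2 / ||W|| from a fitted linear SVM:

```python
# Sketch: compute the margin width 2 / ||W|| of a fitted linear SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)  # toy data
clf = SVC(kernel="linear", C=1e3).fit(X, y)  # large C approximates a hard margin

W = clf.coef_[0]
print("||W|| =", np.linalg.norm(W))
print("margin width 2/||W|| =", 2.0 / np.linalg.norm(W))
```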
Lagrangian Multipliers
- To solve this constrained optimization problem, we introduce Lagrange multipliers α_i ≥ 0.
- They combine the objective and the constraints into a single expression:
  L(W, B, α) = 1/2 * ||W||^2 - Σ α_i * [y_i * (W · x_i + B) - 1]
Deriving Weights
- Setting the derivatives of L with respect to W and B to zero gives:
  W = Σ α_i * y_i * x_i   and   Σ α_i * y_i = 0
- α_i is non-zero only for the support vectors, so the sum runs over them alone (see the sketch below).
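This relationship can be checked numerically: in scikit-learn's SVC, the attribute dual_coef_ stores the products α_i * y_i for the support vectors, so summing them against the support vectors should reproduce W (sketch with arbitrary toy data):

```python
# Sketch: verify W = Σ α_i * y_i * x_i using the fitted dual coefficients.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # arbitrary toy data
clf = SVC(kernel="linear").fit(X, y)

# dual_coef_ holds alpha_i * y_i (one row); support_vectors_ holds the x_i.
W_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(W_from_dual, clf.coef_))   # True: both give the same W
```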
Soft Margin SVM
- Deals with non-separable data and outliers by introducing slack variables (ζ_i ≥ 0).
- The constraints are relaxed to y_i * (W · x_i + B) ≥ 1 - ζ_i, allowing some misclassifications.
- The objective becomes 1/2 * ||W||^2 + C * Σ ζ_i, where the regularization parameter
  C
  balances margin width against misclassification tolerance:
  - Small C: wider margin, accepts more misclassifications.
  - Large C: narrower margin, less tolerance for misclassified points and outliers (see the sketch below).
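The following sketch (overlapping toy classes; the C values are arbitrary choices) shows the trade-off: as C grows, the margin shrinks and fewer violations are tolerated:

```python
# Sketch: effect of the regularization parameter C on the soft margin.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)  # overlapping classes

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width = {margin:.3f}, support vectors = {clf.n_support_.sum()}")
```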
Handling Non-Linearly Separable Data
- Kernel Trick: Implicitly maps the data to a higher-dimensional space where it becomes linearly separable, without ever computing the mapping explicitly.
- In practice, the dot product X_i · X_j in the Lagrangian (dual) formulation is replaced by a kernel function K(X_i, X_j).
- Common Kernels (see the sketch after this list):
  - Polynomial Kernel (c is a constant term and D the degree; c is distinct from the regularization parameter C):
    K(X_i, X_j) = (c + X_i · X_j)^D
  - RBF (Gaussian) Kernel:
    K(X_i, X_j) = exp(-γ ||X_i - X_j||^2)
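Both kernels can be written out directly and passed to scikit-learn's SVC as callables that return the Gram matrix; the sketch below does this on the concentric-circles data (the constant term c, degree D, and γ are arbitrary choices, and the built-in kernel="poly" / kernel="rbf" options compute the same quantities):

```python
# Sketch: the polynomial and RBF kernels written out explicitly.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

def poly_kernel(A, B, c=1.0, D=3):
    # K(x_i, x_j) = (c + x_i · x_j)^D for every pair of rows in A and B
    return (c + A @ B.T) ** D

def rbf_kernel(A, B, gamma=0.5):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)
print("polynomial kernel accuracy:", SVC(kernel=poly_kernel).fit(X, y).score(X, y))
print("RBF kernel accuracy:", SVC(kernel=rbf_kernel).fit(X, y).score(X, y))
```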
Effects of Gamma in RBF Kernel
- Small γ: the kernel varies slowly, giving a smooth boundary that approximates a linear SVM (risk of underfitting).
- Large γ: the kernel is sharply peaked around each support vector, giving a highly flexible boundary that can overfit (see the sketch below).
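The sketch below (noisy circles data with half held out; all values are arbitrary) compares training and held-out accuracy for a small, a moderate, and a very large γ:

```python
# Sketch: how gamma changes an RBF SVM's fit (train vs. held-out accuracy).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, noise=0.15, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for gamma in (0.01, 1.0, 1000.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:>7}: train acc = {clf.score(X_tr, y_tr):.2f}, "
          f"test acc = {clf.score(X_te, y_te):.2f}")
```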
Conclusion
- Soft margin SVM handles noisy, overlapping data, and the kernel trick extends SVM to non-linearly separable data.
- Understanding the underlying mathematics and kernel trick is crucial for successful SVM application.