Overview
This lecture introduced linear classifiers, the geometric interpretation of separating hyperplanes, and the perceptron learning algorithm, highlighting its properties, limitations, and connections to gradient descent.
Approaches to Linear Classification
- Linear classifiers can be built by comparing discriminant functions or directly modeling the separating hyperplane.
- A separating hyperplane consists of all points x such that ( f(x) = 0 ), where ( f(x) = \beta^T x + \beta_0 ) is an affine function of x.
- The sign of ( f(x) ) determines on which side of the hyperplane a point lies.
Properties of Separating Hyperplanes
- The vector ( \beta ) is normal (perpendicular) to the hyperplane.
- The equation ( \beta^T x + \beta_0 = 0 ) defines the hyperplane.
- The signed distance from a point x to the hyperplane is proportional to ( f(x) ).
- Dividing by the norm of ( \beta ) gives the actual signed distance: ( f(x) / \|\beta\| ).
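The geometry above can be sketched in a few lines of NumPy. The hyperplane parameters here are hand-picked for illustration, not taken from the lecture:

```python
import numpy as np

# Hypothetical hyperplane parameters for illustration.
beta = np.array([3.0, 4.0])   # normal vector to the hyperplane
beta0 = -5.0                  # intercept

def f(x):
    """f(x) = beta^T x + beta_0; its sign tells which side of the hyperplane x is on."""
    return beta @ x + beta0

def signed_distance(x):
    """Normalizing by ||beta|| turns f(x) into the actual signed distance."""
    return f(x) / np.linalg.norm(beta)

x = np.array([3.0, 4.0])
print(f(x))                # 3*3 + 4*4 - 5 = 20.0
print(signed_distance(x))  # 20 / 5 = 4.0
```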
Perceptron Learning Algorithm
- The perceptron algorithm adjusts the separating hyperplane to correctly classify training data.
- Misclassified points are identified when the sign of ( f(x) ) disagrees with the true label.
- The perceptron criterion (objective function) penalizes only misclassified points: it minimizes ( -\sum_{i \in M} y_i f(x_i) ), where M is the set of misclassified examples.
- The algorithm uses stochastic (random) gradient descent: after each misclassified point ( (x_i, y_i) ), update ( \beta \leftarrow \beta + \rho\, y_i x_i ) and ( \beta_0 \leftarrow \beta_0 + \rho\, y_i ).
- The step size ( \rho ) is usually set to 1 in the perceptron but is chosen smaller in general stochastic gradient descent.
- If the data are linearly separable, the perceptron converges in a finite number of updates, but which hyperplane it finds depends on the initialization and the order in which points are visited.
- If data are not linearly separable, the algorithm may not converge and can cycle indefinitely.
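The update loop above can be written as a short sketch. The function name and the toy dataset are illustrative choices, not from the lecture; labels are assumed to be in {-1, +1}:

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=100):
    """Stochastic-gradient perceptron.

    X: (n, d) inputs; y: labels in {-1, +1}. A point is misclassified
    when y_i * f(x_i) <= 0; the update is beta += rho * y_i * x_i,
    beta0 += rho * y_i, applied immediately after each mistake.
    """
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (beta @ xi + beta0) <= 0:  # misclassified (or on the boundary)
                beta += rho * yi * xi
                beta0 += rho * yi
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return beta, beta0

# Linearly separable toy data (AND-like labels).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, 1])
beta, beta0 = perceptron(X, y)
print(np.sign(X @ beta + beta0))  # matches y on separable data
```

Note that if the loop exhausts `max_epochs` without a clean pass (the non-separable case), it simply returns the current parameters, mirroring the cycling behavior described above.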
Limitations and Extensions
- Multiple hyperplanes may separate linearly separable data; the algorithm converges to one such hyperplane.
- Perceptron fails for non-linearly separable data (e.g., the XOR problem).
- Enlarging input space with feature transformations can make data more separable.
- The step size trades off convergence speed against stability: larger steps can make faster progress but may overshoot and oscillate, while smaller steps converge more reliably but slowly.
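The XOR example makes the feature-enlargement point concrete: XOR labels cannot be separated by any hyperplane in the original 2-D space, but adding the product feature ( x_1 x_2 ) makes them separable. The separating hyperplane below is hand-picked for illustration, not learned:

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, -1])  # XOR labels: not linearly separable in 2-D

# Enlarged feature space: (x1, x2, x1*x2).
Phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# A hyperplane that separates XOR in the enlarged space:
# f(x) = x1 + x2 - 2*x1*x2 - 0.5  (hand-picked, not learned).
beta = np.array([1., 1., -2.])
beta0 = -0.5
print(np.sign(Phi @ beta + beta0))  # matches the XOR labels
```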
Key Terms & Definitions
- Separating Hyperplane — A boundary that divides data into different classes; defined by ( \beta^T x + \beta_0 = 0 ).
- Perceptron Learning Algorithm — An iterative method to find a linear separator by updating parameters using misclassified examples.
- Stochastic Gradient Descent — Optimization method updating parameters based on individual data points rather than the whole dataset.
- Linearly Separable — Data are linearly separable if some hyperplane classifies every point correctly.
Action Items / Next Steps
- Review geometry and properties of hyperplanes.
- Prepare to discuss methods for defining an optimal separating hyperplane in the next class.