Support Vector Machines
1. Optimal Separation
2. Kernels
Any symmetric function that is positive definite can be used as a kernel. Positive definite here means that for any function f, the double integral ∫∫ K(x, y) f(x) f(y) dx dy is non-negative. This is a result of Mercer's theorem, which also implies that kernels can be combined (for example by adding or multiplying them) and the result will be another kernel.
However, there are three types of basis functions that are commonly used, and each has a convenient kernel that corresponds to it:
Polynomial Kernel
- Expands the feature space to include polynomials up to degree s in the elements of the input vector x.
- Example terms: x_3^3 (cubic term) or x_1 * x_4 (interaction term).
- Kernel formula: K(x, y) = (x · y + 1)^s
- For s = 1, this reduces to a linear kernel (dot product)
Sigmoid Kernel
- Uses hyperbolic tangent (tanh) functions, which resemble sigmoid activations.
- Parameters: κ (scaling) and δ (shift).
- Kernel formula: K(x, y) = tanh(κ (x · y) - δ)
- This kernel can model neural network-like behavior but is not always positive definite.
Radial Basis Function (RBF) Kernel
- Measures similarity based on the squared Euclidean distance between vectors, scaled by the width parameter σ.
- Kernel formula (Gaussian version): K(x, y) = exp(-||x - y||^2 / (2σ^2))
- Creates nonlinear decision boundaries and is widely used for its flexibility.
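The three kernel formulas above can be sketched directly in NumPy. The parameter names (`s`, `kappa`, `delta`, `sigma`) follow the notation in the notes:

```python
import numpy as np

def polynomial_kernel(x, y, s=2):
    # K(x, y) = (x . y + 1)^s
    return (np.dot(x, y) + 1.0) ** s

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    # K(x, y) = tanh(kappa * (x . y) - delta); not always positive definite
    return np.tanh(kappa * np.dot(x, y) - delta)

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.dot(x - y, x - y) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(polynomial_kernel(x, y))  # (0.5 - 2 + 1)^2 = 0.25
print(rbf_kernel(x, x))         # zero distance -> 1.0
```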
How to choose the kernel function:
- There is some theory for this, based on what is known as the Vapnik-Chervonenkis (VC) dimension.
- In practice, most people just experiment with different kernels and parameter values and pick whichever works best.
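The "experiment and pick what works" approach is usually done with cross-validation. A minimal sketch using scikit-learn (an assumption; the notes do not prescribe a library), with a made-up grid of kernels and regularization values:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy nonlinear dataset for illustration
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Hypothetical grid: try several kernels and C values
param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.1, 1, 10],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the kernel/C combination with best CV accuracy
```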
3. The Support Vector Machine Algorithm
What is a Support Vector?
- Training an SVM amounts to finding the separating hyperplane with the maximum margin.
- The hyperplane can be represented as w · x + b = 0, where
- w is the normal vector of the hyperplane,
- b is the bias.
- To determine w and b, we need to set up an optimization problem.
- A support vector is a data point that lies exactly on the margin.
Find the Support Vectors and the Separating Hyperplane
Suppose x_i is a potential support vector. We define its distance to the hyperplane as d(x_i) = |w · x_i + b| / ||w||. Below is why the support vectors have this property:
How to Compute Support Vectors
1. A support vector lies exactly on the margin
Select a point x_0 on the hyperplane, so that w · x_0 + b = 0.
The distance from x_i to the hyperplane is the length of the projection of (x_i - x_0) onto the normal vector of the hyperplane, which is w:
d(x_i) = |w · (x_i - x_0)| / ||w||
Because x_0 is on the hyperplane, w · x_0 = -b, so the distance from x_i to the hyperplane is
d(x_i) = |w · x_i + b| / ||w||
Because w and b can be rescaled without changing the hyperplane, we normalize so that the support vectors satisfy
y_i (w · x_i + b) = 1
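The two expressions for the distance can be checked numerically. The hyperplane and test point below are made up for illustration:

```python
import numpy as np

# Hyperplane w . x + b = 0 with ||w|| = 5 (made-up example)
w = np.array([3.0, 4.0])
b = -5.0
x_i = np.array([3.0, 4.0])

# A point on the hyperplane: x_0 = -b * w / ||w||^2 satisfies w . x_0 + b = 0
x_0 = -b * w / np.dot(w, w)

# Distance via projection of (x_i - x_0) onto the normal vector w
proj = abs(np.dot(w, x_i - x_0)) / np.linalg.norm(w)

# Distance via the closed form |w . x_i + b| / ||w||
direct = abs(np.dot(w, x_i) + b) / np.linalg.norm(w)

print(proj, direct)  # both give 4.0
```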
2. The width of margin need to be as wide as possible
- Margin
With the normalization y_i (w · x_i + b) = 1 for support vectors, the distance from a support vector to the separating hyperplane is 1 / ||w||.
Then the total margin (the separation between the two classes) is twice this distance: 2 / ||w||.
- Maximizing the Margin
To maximize the margin 2 / ||w||, we equivalently minimize ||w|| (a larger margin implies a smaller ||w||).
For mathematical convenience, we minimize (1/2) ||w||^2 (the squared norm avoids square roots and preserves convexity).
- Constraints for correct classification
All data points must satisfy y_i (w · x_i + b) >= 1, where y_i ∈ {-1, +1} is the class label.
3. The Optimization Problem
From the above derivation, we get the optimization problem:
minimize (1/2) ||w||^2 subject to y_i (w · x_i + b) >= 1 for all i
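The primal problem (minimize (1/2)||w||^2 subject to y_i (w · x_i + b) >= 1) is a small quadratic program and can be solved directly. A sketch using SciPy's general-purpose solver on a made-up, linearly separable toy dataset:

```python
import numpy as np
from scipy.optimize import minimize

# Toy 2-D data: two points per class, linearly separable (made up)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)  # minimize (1/2) ||w||^2

# Margin constraints: y_i (w . x_i + b) - 1 >= 0
constraints = [
    {"type": "ineq", "fun": lambda v, i=i: y[i] * (np.dot(v[:2], X[i]) + v[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)
print(np.all(y * (X @ w + b) >= 1 - 1e-6))  # all margin constraints hold
```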
4. Solve the Optimization Problem
Construct the Lagrangian function:
L(w, b, α) = (1/2) ||w||^2 - Σ_i α_i [ y_i (w · x_i + b) - 1 ], with α_i >= 0
Optimality Conditions
Take partial derivatives and set them to zero:
∂L/∂w = 0  =>  w = Σ_i α_i y_i x_i
∂L/∂b = 0  =>  Σ_i α_i y_i = 0
Dual Problem
Substitute w back into L and simplify to the dual form:
maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
subject to α_i >= 0 and Σ_i α_i y_i = 0
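The substitution step can be written out explicitly. Using w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0, the Lagrangian simplifies as follows:

```latex
\begin{aligned}
L &= \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i\,\big[\,y_i(w \cdot x_i + b) - 1\,\big] \\
  &= \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
     - \sum_i\sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
     - b\sum_i \alpha_i y_i + \sum_i \alpha_i \\
  &= \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
\end{aligned}
```

The b term vanishes because Σ_i α_i y_i = 0, leaving a problem in the α_i alone.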
Solve for α
Use quadratic programming to find the α_i. Only support vectors have α_i > 0.
Recover w and b
- w = Σ_i α_i y_i x_i (from the optimality conditions above)
- b = y_k - w · x_k for any support vector x_k (where α_k > 0)
5. Step-by-Step Solution of the Dual SVM Problem
Dual Problem Recap
Maximize: W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
subject to: α_i >= 0 and Σ_i α_i y_i = 0
(K(x_i, x_j) = x_i · x_j for a linear SVM.)
Solving the Dual
(a) Quadratic Programming (QP)
- Use a QP solver to find the α_i.
- Most α_i will be zero; the non-zero α_i correspond to support vectors.
(b) Recover w and b
Weight vector: w = Σ_i α_i y_i x_i
Bias term b:
Pick any support vector x_k (where α_k > 0):
b = y_k - Σ_i α_i y_i K(x_i, x_k)
Classification Rule
For a new point x, predict the class as:
f(x) = sign( Σ_i α_i y_i K(x_i, x) + b )
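The full pipeline (solve the dual, recover w and b, classify) can be sketched end to end. This uses SciPy's general solver rather than a dedicated QP library, and the toy dataset is made up:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T  # linear kernel Gram matrix: K_ij = x_i . x_j
n = len(y)

def neg_dual(a):
    # Negated dual objective: W(a) = sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j K_ij
    return -(a.sum() - 0.5 * np.sum(np.outer(a * y, a * y) * K))

cons = [{"type": "eq", "fun": lambda a: np.dot(a, y)}]  # sum_i a_i y_i = 0
bounds = [(0, None)] * n                                # a_i >= 0

res = minimize(neg_dual, np.zeros(n), bounds=bounds, constraints=cons)
alpha = res.x

# Recover w and b from the support vectors
w = (alpha * y) @ X                # w = sum_i a_i y_i x_i
k = int(np.argmax(alpha))          # index of a support vector (a_k > 0)
b = y[k] - np.dot(w, X[k])         # b = y_k - w . x_k

def classify(x_new):
    # f(x) = sign(w . x + b); equivalent to the kernel form for a linear SVM
    return np.sign(np.dot(w, x_new) + b)

print(classify(np.array([2.5, 2.0])), classify(np.array([-2.5, -2.0])))
```

For nonlinear kernels, w cannot be formed explicitly; classification would instead use the kernel expansion sign(Σ_i α_i y_i K(x_i, x) + b) directly.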