Machine Learning Algorithms - Support Vector Machine (SVM)

Support Vector Machines

1. Optimal Separation

2. Kernels

Any symmetric function that is positive definite (meaning that \int\int K(x, y) f(x) f(y) \, dx \, dy \ge 0 for any square-integrable function f) can be used as a kernel. This is a result of Mercer's theorem, which also says that it is possible to combine kernels together (for example, by sums and products) and the result will be another kernel.
However, there are three different types of basis functions that are commonly used, and they have nice kernels that correspond to them:

  • Polynomial Kernel

    • Expands the feature space to include polynomials up to degree s in the elements of the input vector x.
    • Example terms: x_3^3 (cubic term) or x_1 * x_4 (interaction term).
    • Kernel formula:
      K(x, y) = (1 + x^T y)^s
    • For s = 1, this reduces to a linear kernel (a dot product, up to an additive constant)
  • Sigmoid Kernel

    • Uses hyperbolic tangent (tanh) functions, which resemble sigmoid activations.
    • Parameters: k (scaling) and \delta (shift).
    • Kernel formula: K(x, y) = \tanh(k x^T y - \delta)
    • This kernel can model neural network-like behavior but is not always positive definite.
  • Radial Basis Function (RBF) Kernel:

    • Measures similarity based on the Euclidean distance between vectors, scaled by \sigma.
    • Kernel formula (Gaussian version): K(x, y) = e^{-\frac{||x - y||^2}{2\sigma^2}}
    • Creates nonlinear decision boundaries and is widely used for its flexibility.
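The three kernels above can be sketched directly in NumPy (the function names here are my own, not from any library):

```python
import numpy as np

def polynomial_kernel(x, y, s=2):
    """K(x, y) = (1 + x^T y)^s; s = 1 gives a linear kernel plus a constant."""
    return (1.0 + np.dot(x, y)) ** s

def sigmoid_kernel(x, y, k=1.0, delta=0.0):
    """K(x, y) = tanh(k x^T y - delta); not positive definite for all k, delta."""
    return np.tanh(k * np.dot(x, y) - delta)

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])
print(polynomial_kernel(x, y))   # (1 + 3)^2 = 16.0
print(rbf_kernel(x, x))          # identical points -> 1.0
```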

How to choose the kernel function:

  1. There is some theory, based on the Vapnik-Chervonenkis (VC) dimension, that can guide the choice.
  2. Most people just experiment with different kernels and parameter values and pick whichever works best (for example, via cross-validation).
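The trial-and-error approach in point 2 is usually automated with a cross-validated grid search; a minimal sketch using scikit-learn (assumed here, not part of these notes):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Try several kernels and regularization strengths, scored by 5-fold
# cross-validation; the data set is an illustrative toy problem.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
grid = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear", "poly", "rbf"], "C": [0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```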

3. The Support Vector Machine Algorithm

What is a Support Vector?

  1. Training an SVM model amounts to finding the separating hyperplane with the maximum margin.
  2. The hyperplane can be represented as wx + b = 0
    • w is the normal vector of the hyperplane,
    • b is the bias.
  3. To determine w and b, we need to set up an optimization problem.
  4. A support vector is a data point that lies exactly on the margin.
    Finding the Support Vectors and the Separating Hyperplane
    If x_i is a potential support vector, we scale w and b so that its distance to the hyperplane is \frac{|wx_i + b|}{||w||} = \frac{1}{||w||}, i.e. the support vectors satisfy |wx_i + b| = 1. Below is why the support vectors have this property:

How to Compute Support Vectors

1. A support vector x_i lies exactly on the margin

Select a point x_0 on the hyperplane; the vector from x_0 to x_i is

x_i - x_0

The distance from x_i to the hyperplane is the length of the projection of x_i - x_0 onto the normal vector of the hyperplane, which is w:

\frac{|w(x_i - x_0)|}{||w||} = \frac{|wx_i - wx_0|}{||w||}

Because x_0 is on the hyperplane, wx_0 + b = 0, so wx_0 = -b.

So the distance from x_i to the hyperplane is

\frac{|wx_i + b|}{||w||}

Because w and b can be rescaled without changing the hyperplane, we fix the scale by requiring that the support vectors satisfy

|wx_i + b| = 1
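The derivation above can be checked numerically; a small sketch with toy numbers, where w and b are chosen so that |wx_i + b| = 1 (i.e. x_i plays the role of a support vector):

```python
import numpy as np

w = np.array([3.0, 4.0])          # normal vector, ||w|| = 5
b = -9.0
x_i = np.array([2.0, 1.0])        # w.x_i + b = 10 - 9 = 1

# Closed form: |w.x_i + b| / ||w||
dist = abs(np.dot(w, x_i) + b) / np.linalg.norm(w)

# Projection form: pick x_0 on the hyperplane (so w.x_0 = -b) and
# project x_i - x_0 onto the normal direction w / ||w||
x_0 = -b * w / np.dot(w, w)
dist_proj = abs(np.dot(w, x_i - x_0)) / np.linalg.norm(w)

print(dist, dist_proj)            # both 0.2 = 1 / ||w||
```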
2. The margin needs to be as wide as possible
  1. Margin
    The distance from a support vector to the separating hyperplane is \frac{|wx_i + b|}{||w||}; since we defined |wx_i + b| = 1, this distance is \frac{1}{||w||}.
    Then the total margin (the separation between the classes) is twice this distance: \text{Margin} = \frac{2}{||w||}
  2. Maximizing the Margin
    To maximize the margin \frac{2}{||w||}, we equivalently minimize ||w|| (a larger \frac{2}{||w||} implies a smaller ||w||).
    For mathematical convenience, we minimize \frac{1}{2}||w||^2 (the squared norm avoids square roots and preserves convexity).
  3. Constraints for correct classification
    All data points must satisfy: y_i(wx_i + b) \ge 1, \forall i, where y_i \in \{-1, +1\} is the class label.
3. The Optimization Problem

From the derivation above, we obtain the optimization problem:

\begin{cases} \min\limits_{w,b} \frac{1}{2}||w||^2 \\ \text{subject to } y_i(wx_i + b) \ge 1, \forall i \end{cases}
4. Solving the Optimization Problem
  1. Construct the Lagrangian function

    L(w, b, \alpha) = \frac{1}{2}||w||^2 - \sum\limits_{i} \alpha_i [y_i(wx_i + b) - 1]
  2. Optimality Conditions
    Take partial derivatives and set to zero:

    \frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum\limits_i \alpha_i y_i x_i, \quad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum\limits_i \alpha_i y_i = 0
  3. Dual Problem
    Substitute w back into L and simplify to obtain the dual form:

    \max\limits_{\alpha} \sum\limits_{i} \alpha_i - \frac{1}{2} \sum\limits_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j

    subject to \alpha_i \ge 0 and \sum_i \alpha_i y_i = 0

  4. Solve for \alpha
    Use quadratic programming to find the \alpha_i. Only support vectors have \alpha_i > 0.

  5. Recover w and b

  • w = \sum_i \alpha_i y_i x_i (from step 2)
  • b = y_k - w^T x_k for any support vector x_k (where \alpha_k > 0)
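For a concrete toy instance, the whole pipeline (solving the dual for \alpha, then recovering w and b) can be sketched with SciPy's general-purpose SLSQP solver standing in for a dedicated QP solver; the two-point data set here is purely illustrative:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [2.0, 2.0]])       # two points, one per class
y = np.array([-1.0, 1.0])
G = (y[:, None] * y[None, :]) * (X @ X.T)    # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # Minimize the negative of the dual objective
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, x0=np.zeros(2), method="SLSQP",
               bounds=[(0, None)] * 2,                        # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x                                # -> approx. [0.25, 0.25]

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
k = int(np.argmax(alpha))                    # any support vector (alpha_k > 0)
b = y[k] - w @ X[k]                          # b = y_k - w^T x_k
print(w, b)                                  # -> approx. [0.5, 0.5], -1.0
```

For this data set the exact solution is w = (0.5, 0.5) and b = -1, which can be verified by hand from the margin conditions y_i(wx_i + b) = 1.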
5. Step-by-Step Solution of the Dual SVM Problem
  1. Dual Problem Recap
    Maximize:

    \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j),

    subject to:

    \alpha_i \geq 0 \quad \text{and} \quad \sum_i \alpha_i y_i = 0.

    (K(x_i, x_j) = x_i^T x_j for a linear SVM.)

  2. Solving the Dual
    (a) Quadratic Programming (QP)

    • Use a QP solver to find \alpha.
    • Most \alpha_i will be zero; the non-zero \alpha_i correspond to support vectors.

    (b) Recover w and b

    • Weight vector:

      w = \sum_i \alpha_i y_i x_i.
    • Bias term b:
      Pick any support vector x_k (where \alpha_k > 0):

      b = y_k - \sum_i \alpha_i y_i K(x_i, x_k).
  3. Classification Rule
    For a new point x:

    f(x) = \text{sign}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right).
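A sketch of this decision rule, using the known dual solution of a two-point toy problem (x_1 = (0, 0) with y_1 = -1, x_2 = (2, 2) with y_2 = +1, linear kernel), for which \alpha = (1/4, 1/4) and b = -1; the numbers are illustrative, not from any real data set:

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 2.0]])   # training points
y = np.array([-1.0, 1.0])                # labels
alpha = np.array([0.25, 0.25])           # dual solution of this toy problem
b = -1.0

def K(u, v):
    return u @ v                         # linear kernel; swap in RBF etc.

def f(x):
    # f(x) = sign( sum_i alpha_i y_i K(x_i, x) + b )
    return np.sign(sum(a_i * y_i * K(x_i, x)
                       for a_i, y_i, x_i in zip(alpha, y, X)) + b)

print(f(np.array([3.0, 3.0])))           # -> 1.0 (same side as (2, 2))
print(f(np.array([0.0, 0.5])))           # -> -1.0
```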

4. Extensions to the SVM

5. Quadratic Programming Solver