Machine Learning Algorithms - Support Vector Machine (SVM)

Support Vector Machines

1. Optimal Separation

2. Kernels

Any symmetric function that is positive definite (meaning that \int\int K(x, y) f(x) f(y) \, dx \, dy \ge 0 for any square-integrable function f) can be used as a kernel. This is a result of Mercer's theorem, which also says that it is possible to combine kernels together (for example, by sums and products) and the result will be another kernel.
However, there are three different types of basis functions that are commonly used, and they have nice kernels that correspond to them:

  • Polynomial Kernel

    • Expands the feature space to include polynomials up to degree s in the elements of the input vector x.
    • Example terms: x_3^3 (cubic term) or x_1 * x_4 (interaction term).
    • Kernel formula:
      K(x, y) = (1 + x^T y)^s
    • For s = 1, this reduces to a linear kernel (a dot product, up to an additive constant)
  • Sigmoid Kernel

    • Uses hyperbolic tangent (tanh) functions, which resemble sigmoid activations.
    • Parameters: k (scaling) and \delta (shift).
    • Kernel formula: K(x, y) = \tanh(k x^T y - \delta)
    • This kernel can model neural network-like behavior but is not always positive definite.
  • Radial Basis Function (RBF) Kernel:

    • Measures similarity based on the Euclidean distance between vectors, scaled by \sigma.
    • Kernel formula (Gaussian version): K(x, y) = e^{-\frac{||x - y||^2}{2\sigma^2}}
    • Creates nonlinear decision boundaries and is widely used for its flexibility.
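The three kernels above can be sketched directly in NumPy (the function names here are my own, not from any library):

```python
import numpy as np

def polynomial_kernel(x, y, s=2):
    """K(x, y) = (1 + x^T y)^s; s = 1 gives a linear kernel plus a constant."""
    return (1.0 + np.dot(x, y)) ** s

def sigmoid_kernel(x, y, k=1.0, delta=0.0):
    """K(x, y) = tanh(k x^T y - delta); not positive definite for all k, delta."""
    return np.tanh(k * np.dot(x, y) - delta)

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])
print(polynomial_kernel(x, y))   # (1 + 3)^2 = 16.0
print(rbf_kernel(x, x))          # identical points -> 1.0
```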

How to choose the kernel function:

  1. There is some theory, based on the Vapnik-Chervonenkis (VC) dimension, that can guide the choice.
  2. Most people just experiment with different kernels and parameter values and pick whichever works best (for example, via cross-validation).
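The trial-and-error approach in point 2 is usually automated with a cross-validated grid search; a minimal sketch using scikit-learn (assumed here, not part of these notes):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Try several kernels and regularization strengths, scored by 5-fold
# cross-validation; the data set is an illustrative toy problem.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
grid = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear", "poly", "rbf"], "C": [0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```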

3. The Support Vector Machine Algorithm

What is a Support Vector?

  1. Training an SVM model amounts to finding the separating hyperplane with the maximum margin.
  2. The hyperplane can be represented as wx + b = 0
    • w is the normal vector of the hyperplane,
    • b is the bias.
  3. To determine w and b, we need to set up an optimization problem.
  4. A support vector is a data point that lies exactly on the margin.
    Finding the Support Vectors and the Separating Hyperplane
    If x_i is a potential support vector, we scale w and b so that its distance to the hyperplane is \frac{|wx_i + b|}{||w||} = \frac{1}{||w||}, i.e. the support vectors satisfy |wx_i + b| = 1. Below is why the support vectors have this property:

How to Compute Support Vectors

1. A support vector x_i lies exactly on the margin

Select a point x_0 on the hyperplane; the vector from x_0 to x_i is

x_i - x_0

The distance from x_i to the hyperplane is the length of the projection of x_i - x_0 onto the normal vector of the hyperplane, which is w:

\frac{|w(x_i - x_0)|}{||w||} = \frac{|wx_i - wx_0|}{||w||}

Because x_0 is on the hyperplane, wx_0 + b = 0, so wx_0 = -b.

So the distance from x_i to the hyperplane is

\frac{|wx_i + b|}{||w||}

Because w and b can be rescaled without changing the hyperplane, we fix the scale by requiring that the support vectors satisfy

|wx_i + b| = 1
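The derivation above can be checked numerically; a small sketch with toy numbers, where w and b are chosen so that |wx_i + b| = 1 (i.e. x_i plays the role of a support vector):

```python
import numpy as np

w = np.array([3.0, 4.0])          # normal vector, ||w|| = 5
b = -9.0
x_i = np.array([2.0, 1.0])        # w.x_i + b = 10 - 9 = 1

# Closed form: |w.x_i + b| / ||w||
dist = abs(np.dot(w, x_i) + b) / np.linalg.norm(w)

# Projection form: pick x_0 on the hyperplane (so w.x_0 = -b) and
# project x_i - x_0 onto the normal direction w / ||w||
x_0 = -b * w / np.dot(w, w)
dist_proj = abs(np.dot(w, x_i - x_0)) / np.linalg.norm(w)

print(dist, dist_proj)            # both 0.2 = 1 / ||w||
```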
2. The margin needs to be as wide as possible
  1. Margin
    The distance from a support vector to the separating hyperplane is \frac{|wx_i + b|}{||w||}; since we defined |wx_i + b| = 1, this distance is \frac{1}{||w||}.
    Then the total margin (the separation between the classes) is twice this distance: \text{Margin} = \frac{2}{||w||}
  2. Maximizing the Margin
    To maximize the margin \frac{2}{||w||}, we equivalently minimize ||w|| (a larger \frac{2}{||w||} implies a smaller ||w||).
    For mathematical convenience, we minimize \frac{1}{2}||w||^2 (the squared norm avoids square roots and preserves convexity).
  3. Constraints for correct classification
    All data points must satisfy: y_i(wx_i + b) \ge 1, \forall i, where y_i \in \{-1, +1\} is the class label.
3. The Optimization Problem

From the derivation above, we obtain the optimization problem:

\begin{cases} \min\limits_{w,b} \frac{1}{2}||w||^2 \\ \text{subject to } y_i(wx_i + b) \ge 1, \forall i \end{cases}
4. Solving the Optimization Problem
  1. Construct the Lagrangian function

    L(w, b, \alpha) = \frac{1}{2}||w||^2 - \sum\limits_{i} \alpha_i [y_i(wx_i + b) - 1]
  2. Optimality Conditions
    Take partial derivatives and set to zero:

    \frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum\limits_i \alpha_i y_i x_i, \quad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum\limits_i \alpha_i y_i = 0
  3. Dual Problem
    Substitute w back into L and simplify to obtain the dual form:

    \max\limits_{\alpha} \sum\limits_{i} \alpha_i - \frac{1}{2} \sum\limits_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j

    subject to \alpha_i \ge 0 and \sum_i \alpha_i y_i = 0

  4. Solve for \alpha
    Use quadratic programming to find the \alpha_i. Only support vectors have \alpha_i > 0.

  5. Recover w and b

  • w = \sum_i \alpha_i y_i x_i (from step 2)
  • b = y_k - w^T x_k for any support vector x_k (where \alpha_k > 0)
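For a concrete toy instance, the whole pipeline (solving the dual for \alpha, then recovering w and b) can be sketched with SciPy's general-purpose SLSQP solver standing in for a dedicated QP solver; the two-point data set here is purely illustrative:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [2.0, 2.0]])       # two points, one per class
y = np.array([-1.0, 1.0])
G = (y[:, None] * y[None, :]) * (X @ X.T)    # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # Minimize the negative of the dual objective
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, x0=np.zeros(2), method="SLSQP",
               bounds=[(0, None)] * 2,                        # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x                                # -> approx. [0.25, 0.25]

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
k = int(np.argmax(alpha))                    # any support vector (alpha_k > 0)
b = y[k] - w @ X[k]                          # b = y_k - w^T x_k
print(w, b)                                  # -> approx. [0.5, 0.5], -1.0
```

For this data set the exact solution is w = (0.5, 0.5) and b = -1, which can be verified by hand from the margin conditions y_i(wx_i + b) = 1.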
5. Step-by-Step Solution of the Dual SVM Problem
  1. Dual Problem Recap
    Maximize:

    \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j),

    subject to:

    \alpha_i \geq 0 \quad \text{and} \quad \sum_i \alpha_i y_i = 0.

    (K(x_i, x_j) = x_i^T x_j for a linear SVM.)

  2. Solving the Dual
    (a) Quadratic Programming (QP)

    • Use a QP solver to find \alpha.
    • Most \alpha_i will be zero; the non-zero \alpha_i correspond to support vectors.

    (b) Recover w and b

    • Weight vector:

      w = \sum_i \alpha_i y_i x_i.
    • Bias term b:
      Pick any support vector x_k (where \alpha_k > 0):

      b = y_k - \sum_i \alpha_i y_i K(x_i, x_k).
  3. Classification Rule
    For a new point x:

    f(x) = \text{sign}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right).
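A sketch of this decision rule, using the known dual solution of a two-point toy problem (x_1 = (0, 0) with y_1 = -1, x_2 = (2, 2) with y_2 = +1, linear kernel), for which \alpha = (1/4, 1/4) and b = -1; the numbers are illustrative, not from any real data set:

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 2.0]])   # training points
y = np.array([-1.0, 1.0])                # labels
alpha = np.array([0.25, 0.25])           # dual solution of this toy problem
b = -1.0

def K(u, v):
    return u @ v                         # linear kernel; swap in RBF etc.

def f(x):
    # f(x) = sign( sum_i alpha_i y_i K(x_i, x) + b )
    return np.sign(sum(a_i * y_i * K(x_i, x)
                       for a_i, y_i, x_i in zip(alpha, y, X)) + b)

print(f(np.array([3.0, 3.0])))           # -> 1.0 (same side as (2, 2))
print(f(np.array([0.0, 0.5])))           # -> -1.0
```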

4. Extensions to the SVM

5. Quadratic Programming Solver