CS229 - Machine Learning - Lecture 7 - Support Vector Machines
Functional margin
If we use the classification scheme h(T,x) = sigmoid(T’x) > 0.5, then we are really just checking to see if T’x > 0. The greater T’x is away from 0, the more “confidence” we have in the prediction.
- Functional margin:
- Geometric margin: distance between the separating hyperplane and the nearest point in the dataset.
Notation for Geometric margin:
- class labels y in {-1, 1}
- h(w, b, x) must output {-1 or 1} so choose g(x) = 1{x>0}
- w is in R^n
- x is in R^n
- b is a real number
- h(w,b,x) = g(w’x + b)
- “Functional margin for ith example” gamhat_i = yi (w’xi + b)
This is the same as the distance of the prediction from the separating line - Margin for the training set is the worst case margin of all examples
Note that the ‘w’ term here can scale arbitrarily. So we also need a constraint that |w|=1. Or we could choose the constraint that gamhat=1.
Geometric Margin
Geometric margin gamma_i is just the distance between the points x_i and the decision boundary measured in the space of x.
- gam_i = (w’x+b) / ||w||
This doesnt change as w and b are scaled.
So we can write a linear programming problem to optimize the geometric margin gamma
- if |w|=1 then gamhat = gamma
- by scaling w,b we can set gamhat without changing gamma
So here’s a non convex optimization problem:
- input constants: xi, yi
- variables: w, b
- maximize gam
- constraint all i: yi(w’xi+b) > gam
- constraint: ||w||=1 {non-convex}
Optimal Margin Classifier
But if we want to make this into a convex optimization problem:
- input constants: xi, yi
- variables: w, b
- maximize gamhat/||w|| {same as geometric margin}
- constraint all i: yi(w’xi+b) > gamhat
- set gamhat = 1
Which is the same as minimizing ||w||, which makes this problem quadratic.
Lagrange Multipliers
Here’s a quick recap on lagrange multipliers.
Let’s say we want to minimize f(w) subject to some constraints:
- min f(w)
- st gi(w) <= 0 {i in 1..k} # NEGATIVE!
- st hi(w) == 0 {i in 1..l}
The “generalized Lagrangian” is
- L(w, alp, beta) = f(w) + Sum{i:1..k} alph_i gi(w) + Sum{i:1..l} beta_i hi(w)
where alpha_i and beta_i are the Lagrange multipliers. We seek solutions that meet these constraints:
- dL/dw = 0 # solution is a local max wrt
- dL/dalpha = 0 {true if and only if gi(w) = 0 }
- dL/dbeta = 0 {true if and only if hi(w) = 0 }
Lagrange Duality - Primal optimization problem
Now we pose the primal optimization problem. Find the value of w which minimizes the maximum value of L(w,alpha,beta) when we pick the worst possible value of the lagrange multipliers.
Theta_p* = Min_{w} {{ Max{alpha>=0,beta} L(w,alpha,beta) }} = min_w Theta_p(w,alpha,beta)
Note that this is a solution because if the constraints are violated, we can choose
- If gi(w) > 0 => Theta_p(w) = Infinity
- If hi(w) != 0 => Theta_p(w) = Infinity
so the only way to prevent L(w,alpha,beta) from blowing up is if the constraints are satisfied.
Lagrange Duality - Dual optimization problem
We can also search for the maximal value of the lagrange multipliers over the mins.
Theta_d* = Max_{alpha, beta} {{ Min{w} L(w,alpha,beta) }}
Note that
- Theta_d* <= Theta_p*
The dual problem is actually a little easier to solve.