CS229 - Machine Learning - Lecture 7 - Support Vector Machines

joel | Uncategorized | Tuesday, October 28th, 2008

Functional margin

If we use the classification scheme h(T,x) = sigmoid(T’x) > 0.5, then we are really just checking to see if T’x > 0. The greater T’x is away from 0, the more “confidence” we have in the prediction.

  • Functional margin:
  • Geometric margin: distance between the separating hyperplane and the nearest point in the dataset.

Notation for Geometric margin:

  • class labels y in {-1, 1}
  • h(w, b, x) must output {-1 or 1} so choose g(x) = 1{x>0}
  • w is in R^n
  • x is in R^n
  • b is a real number
  • h(w,b,x) = g(w’x + b)
  • “Functional margin for ith example” gamhat_i = yi (w’xi + b)
    This is the same as the distance of the prediction from the separating line
  • Margin for the training set is the worst case margin of all examples

Note that the ‘w’ term here can scale arbitrarily. So we also need a constraint that |w|=1. Or we could choose the constraint that gamhat=1.

Geometric Margin

Geometric margin gamma_i is just the distance between the points x_i and the decision boundary measured in the space of x.

  • gam_i = (w’x+b) / ||w||

This doesnt change as w and b are scaled.

So we can write a linear programming problem to optimize the geometric margin gamma

  • if |w|=1 then gamhat = gamma
  • by scaling w,b we can set gamhat without changing gamma

So here’s a non convex optimization problem:

  • input constants: xi, yi
  • variables: w, b
  • maximize gam
  • constraint all i: yi(w’xi+b) > gam
  • constraint: ||w||=1 {non-convex}

Optimal Margin Classifier

But if we want to make this into a convex optimization problem:

  • input constants: xi, yi
  • variables: w, b
  • maximize gamhat/||w|| {same as geometric margin}
  • constraint all i: yi(w’xi+b) > gamhat
  • set gamhat = 1

Which is the same as minimizing ||w||, which makes this problem quadratic.

Lagrange Multipliers

Here’s a quick recap on lagrange multipliers.

Let’s say we want to minimize f(w) subject to some constraints:

  • min f(w)
  • st gi(w) <= 0 {i in 1..k} # NEGATIVE!
  • st hi(w) == 0 {i in 1..l}

The “generalized Lagrangian” is

  • L(w, alp, beta) = f(w) + Sum{i:1..k} alph_i gi(w) + Sum{i:1..l} beta_i hi(w)

where alpha_i and beta_i are the Lagrange multipliers. We seek solutions that meet these constraints:

  • dL/dw = 0 # solution is a local max wrt
  • dL/dalpha = 0 {true if and only if gi(w) = 0 }
  • dL/dbeta = 0 {true if and only if hi(w) = 0 }

Lagrange Duality - Primal optimization problem

Now we pose the primal optimization problem. Find the value of w which minimizes the maximum value of L(w,alpha,beta) when we pick the worst possible value of the lagrange multipliers.

Theta_p* = Min_{w} {{ Max{alpha>=0,beta} L(w,alpha,beta) }} = min_w Theta_p(w,alpha,beta)

Note that this is a solution because if the constraints are violated, we can choose

  • If gi(w) > 0 => Theta_p(w) = Infinity
  • If hi(w) != 0 => Theta_p(w) = Infinity

so the only way to prevent L(w,alpha,beta) from blowing up is if the constraints are satisfied.

Lagrange Duality - Dual optimization problem

We can also search for the maximal value of the lagrange multipliers over the mins.

Theta_d* = Max_{alpha, beta} {{ Min{w} L(w,alpha,beta) }}

Note that

  • Theta_d* <= Theta_p*

The dual problem is actually a little easier to solve.

Leave a comment

»

RSS feed for comments on this post. TrackBack URI