Support Vector Machines II


Learning outcomes

  • Define soft margin SVMs
  • Project a given dataset to a higher-dimensional space

The SVM Formulation

  • The formulation: find the w that minimizes ½‖w‖², subject to the constraints t_n(wᵀx_n + w₀) ≥ 1 for every training point. We want the w for which the objective is minimal, but subject to the constraints.
  • We derived this expression; it's a problem formulation, not an algorithm.
  • We were able to describe the classification problem as a constrained optimization over a convex function.
  • Since the problem is convex, there are many efficient algorithms: we can hand it to a solver, and the solver tells us whether a solution exists (a solution exists exactly when the data is linearly separable; a minimal solver sketch follows this list).

  • Q: how many constraints are there in SVM?
    • A: As many as there are points in the training set; there is one constraint per data point.
    • The meaning of each constraint: every single point should be on the correct side of the plane and NOT on the boundary.
  • The meaning of the inequality above: if it is satisfied, the point x_n is correctly classified, because t_n and wᵀx_n + w₀ have the same sign, so their product is positive; and since the product must be at least 1, it is NOT zero, i.e. we require a strictly positive value.
  • To minimize the objective alone, w would have to be 0. But we don't want the vector to vanish; we want it to represent something that separates the dataset. We impose that by accepting a solution only if every single point is on the correct side.
    • Hence the solver may return nothing: it either finds the best possible classifier or nothing at all.
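
A minimal sketch (my own toy data and variable names, not the lecture's code) of the formulation above as a convex program, handed to cvxpy as the off-the-shelf solver: minimize ½‖w‖² subject to one constraint per training point.

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable dataset; labels t are in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)   # weight vector
w0 = cp.Variable()   # bias

# One constraint per training point: t_n (w^T x_n + w0) >= 1.
constraints = [t[n] * (X[n] @ w + w0) >= 1 for n in range(len(t))]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print(problem.status)     # 'infeasible' if the dataset is not linearly separable
print(w.value, w0.value)  # the separating hyperplane, if one exists
```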

Why SUPPORT VECTOR machine?

  • [Figures: the same decision boundary before and after removing all points except those closest to it.]

  • If we look at this dataset, what is making the difference: all of the points, or just a few? Even if we remove all the points except the closest ones, the boundary is the same.
  • The boundary is determined by the points that are closest to it; all the other points don't matter.
    • It would be great if we knew in advance which points are closest to the boundary, but we don't, so we solve over the entire dataset.
  • After you've solved the problem, you realize that the points closest to the boundary are the only ones that count; these points are called support vectors.
    • These are the vectors that support the decision boundary.
  • For the support vectors, the constraint t_n(wᵀx_n + w₀) ≥ 1 is fulfilled AT equality: t_n(wᵀx_n + w₀) = 1.
    • You can tell which points are support vectors because they satisfy the equation above at equality; these are the important ones that make the decision boundary what it is (see the sketch below).
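
A minimal sketch with scikit-learn (an assumption; the lecture doesn't prescribe a library, and the toy data is mine) showing that refitting on the support vectors alone leaves the boundary unchanged. A very large C approximates the hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])

# Fit a linear SVM; a huge C effectively forbids slack (hard margin).
clf = SVC(kernel="linear", C=1e6).fit(X, t)
print(clf.support_)  # indices of the support vectors

# Refit using only the support vectors: same hyperplane.
clf_sv = SVC(kernel="linear", C=1e6).fit(clf.support_vectors_, t[clf.support_])
print(clf.coef_, clf.intercept_)
print(clf_sv.coef_, clf_sv.intercept_)
```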

Back to linear separability

  • So far we discussed that the SVM is all or nothing: either there is an optimal solution or there is none. But what if my dataset isn't linearly separable? Do I throw it away?
  • [Figure: a dataset that is almost linearly separable, with a single point on the wrong side of the boundary.]
  • The idea is that this could still be acceptable, because there's only one point on the wrong side. How can we relax the constraints a little bit so that this is acceptable? Through the soft-margin SVM.

Slack variables

  • Right now we are enforcing the inequality t_n(wᵀx_n + w₀) ≥ 1: each point has to be on the correct side, and the product has to be at least 1.


  • When a point is correctly classified and far away from the boundary, t_n(wᵀx_n + w₀) is greater than 1. If you move the point closer to the boundary, at the margin (the first dotted line) it is equal to 1; those are the support vectors.
    • That is where the inequality is satisfied at equality.
  • If we move the point further, past the boundary to the margin on the other side, the quantity becomes -1. But we want to allow this! The point can be on the wrong side.
    • What we do is rewrite the constraint as t_n(wᵀx_n + w₀) ≥ 1 - ξ_n.
    • ξ_n is a small positive number, the slack, introduced with the additional constraint ξ_n ≥ 0.
    • The right-hand side of the constraint becomes 1 - ξ_n, and ξ_n has to be at least 0 because we want the slack values to be non-negative.
  • The magnitude of ξ_n tells us how far the point is on the wrong side (see the sketch after this list):
    • If the point is on the correct side and outside the margin, ξ_n = 0.
    • If the point moves beyond the margin, ξ_n grows more and more as the point gets further onto the wrong side.
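
A minimal sketch (boundary and points chosen by hand for illustration) of how the slack ξ_n = max(0, 1 - t_n(wᵀx_n + w₀)) behaves as a point moves from the correct side, onto the margin, and over to the wrong side.

```python
import numpy as np

w, w0 = np.array([1.0, 0.0]), 0.0    # assumed boundary: first coordinate = 0
t = 1.0                              # true label of the point

for x in [np.array([2.0, 0.0]),      # correct side, outside the margin
          np.array([1.0, 0.0]),      # exactly on the margin
          np.array([0.5, 0.0]),      # inside the margin
          np.array([-1.0, 0.0])]:    # wrong side of the boundary
    xi = max(0.0, 1.0 - t * (w @ x + w0))
    print(x[0], "->", xi)            # slacks: 0.0, 0.0, 0.5, 2.0
```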

Soft margin

  • We ALSO want to minimize the ξ_n! Because if we didn't minimize them, we could find a "solution" for any dataset whatsoever, and it wouldn't actually classify anything.
  • We are not only minimizing ‖w‖, which maximizes the margin, but also minimizing the sum of the ξ_n. ξ_n is larger the further the point is on the wrong side.
    • We take the sum of all the errors (the sum of the ξ_n, the slack variables) and add it to the minimization objective.
  • This is a very common practice when you have a multi-objective optimization and you squash the objectives into a single one, just as we added the sum of ξ_n to the ‖w‖ term.
  • We want to maximize the margin AND minimize the slack, and we can't fully do both, so we need to decide how much we care about each.
    • This is usually done with a constant, C.
    • In the limit, if C is very large or infinite, we recover the original SVM, since no slack is allowed.
    • C lets you tune how much error you want to allow, and you can try several values and see which classifier works best on the validation set (see the sketch after this list).
  • The SVM we are used to, without slack variables, is called the hard-margin SVM.
  • With slack variables it is the soft-margin SVM: the margin can be crossed, but we want the total error to be as small as possible.
  • Q: what is the role of the slack variables?
    • A: they allow some points to be within the margin or on the wrong side of the decision boundary.
  • For a linearly separable dataset the SVM already gives us the best classifier; the slack variables additionally let non-linearly-separable datasets work.
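
A minimal sketch of the soft-margin objective, continuing the cvxpy sketch from earlier (the toy data and the value of C are my own): minimize ½‖w‖² + C·Σξ_n, where C trades margin size against total slack.

```python
import cvxpy as cp
import numpy as np

# Toy data; the last point sits on the "wrong" side, so some slack is needed.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0], [1.5, 1.5]])
t = np.array([1.0, 1.0, -1.0, -1.0, -1.0])

C = 1.0                                # try several values against a validation set
w, w0 = cp.Variable(2), cp.Variable()
xi = cp.Variable(len(t), nonneg=True)  # one slack variable per point, xi_n >= 0

constraints = [t[n] * (X[n] @ w + w0) >= 1 - xi[n] for n in range(len(t))]
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
cp.Problem(objective, constraints).solve()

print(xi.value)   # nonzero only for points inside the margin or past the boundary
```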

Example: polynomial features

  • We need to use basis functions. We are going to expand 1D to 3D.
  • In 1D, the points are x = <-1>, <2>.
  • In 3D, we evaluate the points on <1, x, x²>; the basis functions can be polynomial, Gaussian, etc.
    • Now our points are <1, -1, 1> and <1, 2, 4> for the points x = -1, 2 above.
  • The basis functions need to be independent.
  • You can make the dimension as large as you want; you can make it hundreds or millions of dimensions.
  • The idea is to add more dimensions, with appropriate basis functions, until the dataset is linearly separable. Then you try a bunch of basis functions and see which gives the best validation error (a minimal sketch follows).
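
A minimal sketch (the 1-D dataset is my own) of lifting points into <1, x, x²> so that a linear SVM can separate a dataset that is not separable in 1-D.

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
t = np.array([1, 1, -1, -1, -1, 1, 1])   # not linearly separable in 1-D

def phi(x):
    """Basis expansion phi(x) = <1, x, x^2>, applied to a whole array of points."""
    return np.stack([np.ones_like(x), x, x ** 2], axis=1)

clf = SVC(kernel="linear", C=1e6).fit(phi(x), t)
print(clf.score(phi(x), t))   # 1.0: the lifted dataset is linearly separable
```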

Features

  • We start with a point with its own coordinates, and we create new features using a set of basis functions; we can make the dataset as high-dimensional as we want (see the sketch after this list).
    • Original point: x
    • Basis functions: Φ_i(x)
    • New point: Φ(x) = (Φ_1(x), …, Φ_M(x))
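
A minimal sketch of building Φ(x) from an arbitrary list of basis functions Φ_i (the specific functions here, a few polynomials plus a Gaussian bump, are just an illustration).

```python
import numpy as np

basis = [lambda x: np.ones_like(x),         # phi_1(x) = 1
         lambda x: x,                       # phi_2(x) = x
         lambda x: x ** 2,                  # phi_3(x) = x^2
         lambda x: np.exp(-(x - 1) ** 2)]   # phi_4(x) = a Gaussian bump at x = 1

def Phi(x):
    """Map each original point x to the new point (phi_1(x), ..., phi_M(x))."""
    return np.stack([phi(x) for phi in basis], axis=1)

x = np.array([-1.0, 2.0])
print(Phi(x))   # each row is one point in the 4-dimensional feature space
```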

Substitution

  • There are 2 constraints (one per point), and we turned the one-dimensional constraints into three-dimensional ones by using the basis functions.
  • If you have a new point that you want to classify, you compute the basis expansion of the point and use the high-dimensional classifier to classify it (see the sketch after this list).
  • We transformed our problem into a higher-dimensional one by adding more dimensions.
  • Q: but what's the catch? Is the classifier now infinitely powerful? There must be a catch.
    • A: for each new dimension we add a new weight, so we are also making the optimization problem high-dimensional, which means it is harder and slower to solve.
    • So there is a limit on how many dimensions we can add before solving takes forever.
  • Is there a better formulation? Yes, there is one that does not require generating the high-dimensional problem explicitly.
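
A minimal sketch (the weights are chosen by hand, not learned) of the two steps above: compute the basis expansion of the new point, then apply the linear classifier in the higher-dimensional space.

```python
import numpy as np

w = np.array([-2.5, 0.0, 1.0])   # assumed weights over the features <1, x, x^2>
w0 = 0.0                         # assumed bias

def phi(x):
    return np.array([1.0, x, x ** 2])    # basis expansion of a single 1-D point

x_new = 2.0
label = np.sign(w @ phi(x_new) + w0)     # sign of w^T phi(x_new) + w0
print(label)                             # 1.0: the new point is classified as positive
```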
