Support Vector Machines

7 minute read

13. Support Vector Machines

Learning outcomes

  • Derive the formulation of support vector machines as a constrained optimisation problem

Multiple separation boundaries

  • The perceptron was rather limited: it could only find straight-line (linear) classifiers. With XOR, however, we did manage to classify the data by combining multiple separation boundaries. The key idea is that it is always possible to transform any set of data so that the classes within it can be separated linearly.
  • [Figure: three candidate separation boundaries for the same two-class dataset]
  • In the image of three different separation boundaries, we tend to choose the line that runs through the middle of the data points, i.e. the one that is equidistant from the data of both classes – but that is not the easiest one to derive.
  • BUT what criterion do we use to choose a line?

The best discrimination boundary

  • [Figure: the best discrimination boundary for the dataset]
  • What’s special about this line?
    • A: two things: it classifies every point correctly, AND it is as far as possible from all the points.
  • So far we have looked at unconstrained optimisation: we had an error we wanted to minimise, but there were no constraints on the feasible region, so the parameters (weights) could take any values.
  • Now we have constrained optimisation – solutions may only come from the feasible region – so we can only accept a solution that satisfies:
    1. It classifies every point correctly
    2. It is the one single hyperplane that is as far as possible from all the points

    -> we are going to encode these requirements as constraints (the general shape of such a problem is sketched below)
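
As a point of reference for the rest of the derivation, this is the general shape of a constrained optimisation problem (the specific objective and constraints for the SVM are built up in the following sections):

```latex
\min_{\mathbf{w},\, w_0} \; f(\mathbf{w}, w_0)
\qquad \text{subject to} \qquad
g_n(\mathbf{w}, w_0) \ge 0 \quad \text{for every training point } n
```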

The constraints

  • How do we decide whether or not the classifier is good? We set constraints that say that the classifier should get the answer right.
  • We set the target answers for our classes to -1 and 1 (rather than 0 and 1) because the maths comes out more easily.
  • The target multiplied by the output will be positive if the two are the same, and negative otherwise.
    • The classifier outputs 1 if wᵀx + w0 > 0, and -1 if wᵀx + w0 < 0.
  • Example line: x1 + x2 − 4 = 0
  • Steps (a worked check of these steps in code follows this list)
    1. Substitute the points <1,1>, <3,3>, <2,1> into the equation of the line
    2. Evaluate the resulting value to see whether it is > 0 or < 0, and hence output 1 or -1
    3. Then rescale the weights accordingly to minimise
  • If we evaluate the linear function on each point, we get something that is proportional to the distance from the line.
    • For example, when a point is classified correctly: evaluating the linear function on <1,1> gives -2, so the classifier outputs -1, and the point’s class is -1 – correctly classified.
    • A misclassified point: <2,1> has class 1, but evaluating gives -1, which is negative, so it doesn’t match.
  • y(x) can only be 1 or -1. Remember that if a point belongs to class 1 and is classified correctly, the linear function is positive, and if it belongs to class -1 and is classified correctly, the function is negative.
  • t·y(x) = 1 for a correctly classified point, because it is either (-1)·(-1) or 1·1 – t·y(x) is always positive. This gives us a condition to check whether a point has been classified correctly.
  • What happens if we reformulate the condition above as t(wᵀx + w0) ≥ 0?
    • This also accepts zero, which means the classifier is undecided – the point lies exactly on the boundary. We don’t like that, so we are going to do something about it.
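
A minimal sketch of the check described above, using the example line x1 + x2 − 4 = 0 and the three points from the notes; the class labels t = (-1, +1, +1) are assumed here purely for illustration:

```python
import numpy as np

# Example line x1 + x2 - 4 = 0, i.e. w = (1, 1), w0 = -4
w = np.array([1.0, 1.0])
w0 = -4.0

# The three points from the notes; class labels assumed for illustration
X = np.array([[1.0, 1.0],
              [3.0, 3.0],
              [2.0, 1.0]])
t = np.array([-1.0, 1.0, 1.0])

a = X @ w + w0          # linear function a(x) = w^T x + w0  -> [-2, 2, -1]
y = np.sign(a)          # classifier output: +1 or -1
correct = t * a > 0     # positive product means correctly classified

for point, target, value, ok in zip(X, t, a, correct):
    print(point, "target", target, "a(x) =", value,
          "correct" if ok else "misclassified")
# <1,1> and <3,3> come out correct; <2,1> is misclassified by this particular line
```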

Canonical form

  • 0 means that the classifier is undecided, and this should be avoided.
  • We can avoid that by requiring t(wᵀx + w0) to be strictly greater than 0, i.e. there exists some ε > 0 such that t(wᵀx + w0) ≥ ε.
  • But for ε we particularly like 1, so let’s just say ε = 1. Then we get the form:
  • Canonical form of the constraints: t(wᵀx + w0) ≥ 1
    • x – the vector of variables (the input) – a single point!!
    • w – the vector of weights – for the line equation x1 + x2 − 4 = 0, reordered, the weights are <-4, 1, 1>
    • w0 is the component not multiplied by any variable (here -4)
    • t is the class
  • This has to be true for every point: every point is classified correctly AND no point lies on the boundary.

Example of constraint

  • The particular value of ε doesn’t matter – it doesn’t need to be 1.
  • [Figure: example dataset together with the line x1 + x2 − 4 = 0]
  • The points evaluate to 2 and 3, so for this dataset every point satisfies t(wᵀx + w0) ≥ 2.
  • We can rescale: instead of 2, we can make the smallest value of t(wᵀx + w0) equal to 1, simply by dividing by 2. We keep exactly the same line; the only thing we change is the weight vector (a small sketch of this rescaling follows this list).
  • If we rescale the weight vector from <-4, 1, 1> so that <1, 1> becomes <0.5, 0.5>, the line hasn’t changed, and now the tightest constraint equals 1, which is the canonical form.
    • New weights: w = <-2, 0.5, 0.5>
  • In this case we did the opposite: we started with the decision boundary and modified the weights to satisfy the constraint. In practice we don’t have the line yet: we start with the constraints and then compute the weights that satisfy them.
  • The important thing is that there will always be a weight vector that satisfies the constraint in this form, so the actual number doesn’t really matter and we will only ever see the constraint written this way.
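
A small sketch of the rescaling step: divide the weights by the tightest constraint value so that the closest point satisfies the constraint with equality. The two points and their labels are assumed here, chosen so that the constraint values come out as 2 and 3 as in the notes:

```python
import numpy as np

# Same line x1 + x2 - 4 = 0; a tiny dataset this line classifies correctly
w = np.array([1.0, 1.0])
w0 = -4.0
X = np.array([[1.0, 1.0],
              [3.0, 4.0]])      # points assumed for illustration
t = np.array([-1.0, 1.0])       # labels assumed for illustration

margins = t * (X @ w + w0)      # -> [2, 3]: every point is >= 2
scale = margins.min()           # tightest constraint value (2)

# Dividing all the weights by that value keeps the line identical but
# makes the tightest constraint equal to 1: the canonical form.
w_canonical = w / scale         # (0.5, 0.5)
w0_canonical = w0 / scale       # -2.0
print(w_canonical, w0_canonical)
print(t * (X @ w_canonical + w0_canonical))   # -> [1.0, 1.5]
```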

Constraints

  • How do we know whether a point is on the correct side? This is the answer! It’s correct when it satisfies:
  • t(wᵀx + w0) ≥ 1
  • We consider a vector w valid only if this inequality is true for EVERY point in the dataset (a one-line check is sketched after this list).
  • There will be many weight vectors that satisfy this – but we want the best of them all.
    • Which is the one that classifies every point correctly AND is furthest away from the points.
  • For the point closest to the boundary, the constraint is satisfied with equality!! So t(wᵀx + w0) = 1 – very important.
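
A one-line check of that validity condition, continuing the toy example above (the function name and data are just illustrative):

```python
import numpy as np

def is_valid(w, w0, X, t):
    """w is valid only if t_n * (w^T x_n + w0) >= 1 for EVERY point."""
    return bool(np.all(t * (X @ w + w0) >= 1))

# Toy data from the previous sketch (assumed labels)
X = np.array([[1.0, 1.0], [3.0, 4.0]])
t = np.array([-1.0, 1.0])

print(is_valid(np.array([0.5, 0.5]), -2.0, X, t))   # True  (canonical weights)
print(is_valid(np.array([1.0, 1.0]), -4.0, X, t))   # True  (same line, larger scale)
```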

The margin

  • [Figure: the margin and the support vectors around the decision boundary]
  • Margin: the distance between the point closest to the decision boundary and the boundary itself
  • We want the maximum-margin classifier
  • Support vectors: The points in each class that lie closest to the classification line

Recall from the perceptron

  • If we evaluate the linear function at any point, we get something that is proportional to the distance from the line and to the norm of the weight vector: wᵀx + w0 = d‖w‖, i.e. a(x) = d‖w‖, where a(x) is the linear function and d is the (signed) distance. (A quick numerical check is sketched below.)
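
A small sketch that uses this relation to compute the geometric distances, the margin and the support vectors, continuing the same assumed toy data:

```python
import numpy as np

# Canonical weights from before: the line x1 + x2 - 4 = 0 rescaled by 1/2
w = np.array([0.5, 0.5])
w0 = -2.0
X = np.array([[1.0, 1.0], [3.0, 4.0]])   # assumed toy points
t = np.array([-1.0, 1.0])                # assumed labels

a = X @ w + w0                    # a(x) = w^T x + w0 = d * ||w||
d = a / np.linalg.norm(w)         # signed distance of each point from the line
distance = t * d                  # multiply by t to get the magnitude

margin = distance.min()           # distance of the closest point(s)
support_vectors = X[np.isclose(distance, margin)]

print(d)                 # [-1.414...,  2.121...]
print(margin)            # 1.414... = 1 / ||w|| for the canonical weights
print(support_vectors)   # the point(s) closest to the boundary
```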

The margin

  • Margin is what we intend to maximize.
  • d = t·a(x) / ‖w‖ = t(wᵀx + w0) / ‖w‖
  • So we can say t·a(x) = d‖w‖, and to isolate the distance we divide by ‖w‖.
  • d can be positive or negative depending on which side of the line the point lies on – to make sure we get just the magnitude, we multiply by t.
    • This gives the absolute value because if the point is on the same side as the gradient, both t and d are positive; on the other side, both t and d are negative.
  • Evaluating the linear function on any point gives something proportional to the distance. If we get closer to the line the distance has to decrease, but we constrain the smallest value to be 1.
  • Q: why do we divide by the norm?
    • Because we want the analytical expression for the distance, so we need to remove the dependence on the scale of w (summarised in the one-liner after this list).
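
Putting that together in one line (this just restates the bullets above):

```latex
t \, a(\mathbf{x}) = d \,\|\mathbf{w}\|
\quad\Longrightarrow\quad
d = \frac{t \,(\mathbf{w}^{T}\mathbf{x} + w_0)}{\|\mathbf{w}\|}
```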

Maximum margin

  • [Recap of the steps] Take the closest point and evaluate it on the equation of the line, then check whether y(x) is 1 or -1. For the points that are closest to the boundary, the constraint ACTUALLY equals one, and for points further away the constraint is greater than 1.
    • For the point closest to the line, t(wᵀx + w0) = 1.
  • margin = t(wᵀx + w0) / ‖w‖ = 1 / ‖w‖ for the closest point
  • [Explanation of the formula] If we substitute the closest point into t(wᵀx + w0), we get something proportional to the distance, and when we divide it by the norm of w it is the same as 1/‖w‖.
  • So to find the maximum margin, i.e. the distance to the closest point, we need to minimise ‖w‖.
  • Why is this interesting?
    • Because we said we wanted a plane that is as far away as possible from the points, and we need an analytical expression for that distance because we need to maximise it.
    • We can minimise the norm of w, which maximises 1/‖w‖, which essentially maximises the distance (see the summary after this list).
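
A short summary of the argument above: for the closest point the constraint holds with equality, so the margin simplifies, and the maximisation can be turned into a minimisation:

```latex
t\,(\mathbf{w}^{T}\mathbf{x} + w_0) = 1
\;\Longrightarrow\;
\text{margin} = \frac{t\,(\mathbf{w}^{T}\mathbf{x} + w_0)}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|},
\qquad
\max_{\mathbf{w}}\; \frac{1}{\|\mathbf{w}\|} \;\equiv\; \min_{\mathbf{w}}\; \|\mathbf{w}\|
```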

The SVM Formulation

  • It’s an optimisation formulation that says: minimise the norm of the weight vector w.
  • Since the norm contains a square root and we want to differentiate the objective, we square it; the factor of ½ makes the derivative cleaner.
    • Objective: minimise (1/2)‖w‖²
  • Minimising the norm on its own would just shrink the weights towards zero, so we prevent this by introducing the constraints, which must hold for every single point:
    • Minimise (1/2)‖w‖² subject to the constraint: t(wᵀx + w0) ≥ 1 for every point in the dataset
  • This is a quadratic objective with linear constraints – a quadratic programme – and such problems have properties that make them easy to solve; there are many algorithms that do this.
  • Unlike the perceptron and the MLP, the SVM finds ONE optimal solution – there is no need to deal with local minima. It’s all or nothing: if the dataset is linearly separable, we get the optimal solution, and it is unique. (A sketch of solving this quadratic programme directly is shown below.)
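
A minimal sketch of solving the quadratic programme above, using cvxpy as an off-the-shelf QP solver (this is not the solver from the lecture; the toy points and labels are assumed for illustration and happen to be linearly separable):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (labels assumed for illustration)
X = np.array([[1.0, 1.0],
              [3.0, 3.0],
              [2.0, 1.0]])
t = np.array([-1.0, 1.0, 1.0])

w = cp.Variable(2)
w0 = cp.Variable()

# minimise (1/2) ||w||^2  subject to  t_n (w^T x_n + w0) >= 1 for every point
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(t, X @ w + w0) >= 1]
cp.Problem(objective, constraints).solve()

print("w  =", w.value, " w0 =", w0.value)
print("constraint values:", t * (X @ w.value + w0.value))  # all >= 1; equal to 1 at the support vectors
```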

Derive the formulation of support vector machines as a constrained optimisation problem

  • [Figure: summary slide deriving the SVM as a constrained optimisation problem]
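
The original figure is missing from these notes; as a stand-in, the formulation that the derivation above arrives at (reconstructed from the preceding sections, not from the slide itself) is:

```latex
\min_{\mathbf{w},\, w_0} \;\; \tfrac{1}{2}\,\|\mathbf{w}\|^{2}
\qquad \text{subject to} \qquad
t_n\,(\mathbf{w}^{T}\mathbf{x}_n + w_0) \;\ge\; 1, \quad n = 1, \dots, N
```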
