Support Vector Machines IV


16. Support Vector Machines IV

Learning outcomes

  • Derive the dual formulation of the Support Vector Machine
  • Explain the kernel trick
  • Apply dual SVMs and the kernel trick to datasets

The dual problem

  • We took a journey between the original variable space and the Lagrangian space.
  • We saw how to add variables (the Lagrange multipliers) to the objective function of a constrained optimization problem to obtain the Lagrangian, which is a special function because its stationary points are solutions of the original constrained problem.
  • The Lagrangian can be used to derive the dual formulation, and the solution of the dual coincides with the solution of the primal problem.

KKT Conditions

  • For an unconstrained problem, the only condition for a point to be a minimum is that the gradient is zero.
  • For a constrained problem the constraints must play a role, and the conditions below are necessary at a minimum of a problem with inequality constraints:
    1. Stationarity: the gradient of the original objective function and the gradients of the constraints have to be parallel! This is the main idea.
    2. Complementary slackness: λ_i g_i(x*) = 0 for every constraint, i.e. either the constraint is active (satisfied at equality) or its Lagrange multiplier is zero, so the constraint disappears from the Lagrangian when it is inactive.
    3. Dual feasibility: the gradients not only have to be parallel, they also have to point in compatible directions, which is why the Lagrange multipliers have to be non-negative, λ_i ≥ 0. (All three conditions are written out together below.)
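
Written out together, hedging only on the sign convention (the notes write every constraint as g_i(x) ≤ 0), the KKT conditions for minimizing f(x) are:

```latex
% KKT conditions for  min_x f(x)  subject to  g_i(x) <= 0
\begin{aligned}
\nabla f(x^\ast) + \sum_i \lambda_i \nabla g_i(x^\ast) &= 0
  && \text{(stationarity: the gradients are parallel)} \\
\lambda_i \, g_i(x^\ast) &= 0
  && \text{(complementary slackness)} \\
\lambda_i \ge 0, \qquad g_i(x^\ast) &\le 0
  && \text{(dual and primal feasibility)}
\end{aligned}
```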

Example

  • The example problem: minimize x^2 with x constrained between -3 and -2 (written out below).
  • So the feasible range is between -3 and -2; what is the optimal value of x?
  • The optimum lies on the edge of a constraint (the unconstrained minimum x = 0 is outside the feasible range), so it is not going to be in the middle: it is either at -3 or at -2.
  • Let's apply the KKT conditions!
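
The problem itself was on a slide; reconstructed from the numbers used below (feasible range [-3, -2], objective x^2), it reads:

```latex
% Toy problem reconstructed from the notes
\min_x \; x^2
\quad \text{s.t.} \quad -x - 3 \le 0 \;\;(x \ge -3), \qquad x + 2 \le 0 \;\;(x \le -2)
```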

Stationary point for the Lagrangian I

Solving the KKT system for the Lagrangian L(x, λ1, λ2) = x^2 + λ1(-x - 3) + λ2(x + 2):

  • Stationarity imposes that the gradient with respect to x is zero (2x - λ1 + λ2 = 0), complementary slackness imposes that λ times the constraint is zero for both constraints (λ1(-x - 3) = 0 and λ2(x + 2) = 0), and both lambdas have to be non-negative.
  • We have 2 constraints, so there are two cases for which one is active:
    • x = -3
    • x = -2

Stationary point for the Lagrangian II: constraint x = -3

  • Let's assume the first constraint is active, i.e. x = -3. What does this imply?
    • If I substitute x = -3 into λ2(x + 2) = 0, I get λ2(-3 + 2) = 0, hence λ2 = 0.
    • Substituting x and λ2 into the stationarity condition gives 2(-3) - λ1 = 0, so λ1 = -6!
    • This violates the constraint λ1 ≥ 0, hence x = -3 is not the optimal solution.
  • (Figure from the slides: the gradients of the objective and of the constraint at x = -3.)
  • The gradient of the constraint function -x - 3 is -1, so it points to the left.
  • The gradient of the objective function x^2 is 2x, which at x = -3 is -6; the anti-gradient (the descent direction) is therefore +6, pointing to the right.
  • They point in opposite directions, but they would have to point the same way for x = -3 to be a solution.

Stationary point for the Lagrangian III: constraint x = -2

  • Assume the other constraint is active, i.e. x = -2.
  • After substituting, we get x = -2, λ1 = 0, and λ2 = 4. This IS a valid solution!
  • This is a stationary point of the Lagrangian and the solution of the original constrained problem.
  • Here the gradient of the active constraint x + 2 ≤ 0 points to the right, and so does the anti-gradient of the objective, so they have the same direction. I cannot follow the descent direction without breaking the constraint: if I keep moving right towards x = 0, I leave the feasible region. (A quick symbolic check of both cases is sketched below.)
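
A minimal sympy sketch of the case analysis above, assuming the reconstructed toy problem min x^2 subject to -3 ≤ x ≤ -2; it enumerates the stationary points of the Lagrangian and keeps only the one satisfying all KKT conditions:

```python
import sympy as sp

x, l1, l2 = sp.symbols('x lambda1 lambda2', real=True)
L = x**2 + l1*(-x - 3) + l2*(x + 2)        # Lagrangian of the toy problem

kkt = [sp.Eq(sp.diff(L, x), 0),            # stationarity: 2x - l1 + l2 = 0
       sp.Eq(l1*(-x - 3), 0),              # complementary slackness, constraint 1
       sp.Eq(l2*(x + 2), 0)]               # complementary slackness, constraint 2

for sol in sp.solve(kkt, [x, l1, l2], dict=True):
    ok = (sol[l1] >= 0 and sol[l2] >= 0    # dual feasibility
          and -3 <= sol[x] <= -2)          # primal feasibility
    print(sol, '<- valid KKT point' if ok else '<- rejected')
# only {x: -2, lambda1: 0, lambda2: 4} survives, matching the analysis above
```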

Dual Problem

  • We did all of that in order to compute the dual problem; the goal is to get rid of x.
    • [16 pdf 10] To do that we set the derivative with respect to x to zero, and we get an expression for x only in terms of the Lagrange multipliers.
    • Then we take this and substitute it back into the original Lagrangian to make x disappear.
    • Reorganize, and we end up with a function of λ only (the derivation is sketched below).
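
For the toy problem this recipe gives (a sketch, using the Lagrangian reconstructed above):

```latex
\begin{aligned}
L(x,\lambda_1,\lambda_2) &= x^2 + \lambda_1(-x-3) + \lambda_2(x+2) \\
\frac{\partial L}{\partial x} = 2x - \lambda_1 + \lambda_2 = 0
  \;&\Rightarrow\; x = \tfrac{\lambda_1 - \lambda_2}{2} \\
g(\lambda_1,\lambda_2) = L\!\left(\tfrac{\lambda_1-\lambda_2}{2},\,\lambda_1,\,\lambda_2\right)
  &= -\tfrac{(\lambda_1-\lambda_2)^2}{4} - 3\lambda_1 + 2\lambda_2
\end{aligned}
```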

Complementary slackness & Dual optimality

  • We get an expression in λ only, g(λ1, λ2) = -(λ1 - λ2)^2 / 4 - 3λ1 + 2λ2, and we want to maximize it!
  • It is still a constrained problem, but only in λ: we solve it while only allowing the lambdas to take non-negative values.
  • (Figure from the slides: slices of the dual function along λ1 and λ2.)
    • For λ1 the constraint is active, because we have to force λ1 ≥ 0!
      • If I slice this function along λ1 (holding λ2 at 0), its maximum would be at λ1 = -6, but we have to stop at 0 because of the constraint, hence it is active.
    • For λ2 the constraint is inactive, because its maximum is already at a positive value, on the right side of zero.
  • So the solution of the dual problem is (λ1, λ2) = (0, 4).
  • These are the same numbers that came out of the Lagrangian stationary point, which was in terms of all the variables: x, λ1, λ2.
  • But in the dual problem we only work with (0, 4); to recover the solution of the primal problem we use x = (λ1 - λ2)/2, so x = (0 - 4)/2 = -2.
  • This happens because of how the Lagrangian is built. (A numerical check is sketched below.)
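
A small numerical check of this, maximizing the dual above with a bound-constrained optimizer and then recovering x; the function name and starting point are just illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def neg_dual(lam):
    l1, l2 = lam
    # negative of g(l1, l2) = -(l1 - l2)^2/4 - 3*l1 + 2*l2, since we minimize
    return (l1 - l2)**2 / 4 + 3*l1 - 2*l2

res = minimize(neg_dual, x0=[1.0, 1.0], bounds=[(0, None), (0, None)])
l1, l2 = res.x
print('lambda* =', np.round(res.x, 3))      # expect [0. 4.]
print('x* = (l1 - l2)/2 =', (l1 - l2) / 2)  # expect -2.0, the primal solution
```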

Duality and SVM

  • We are going to do the same for the SVM formulation.
  • The primal: minimize (1/2)||w||^2 subject to t_n (w^T x_n + w0) ≥ 1 for every data point n.
  • The original problem, which we will call the primal problem, was a minimisation problem.

Follow the Duality Recipe – refer to slides!

  • The parabola example from before is kept alongside on the slides, to make this less scary!
  1. First, compile the constraints into the Lagrangian by introducing Lagrange multipliers.

    1. For every constraint we add one term: (Lagrange multiplier) times (the constraint).
    2. Move everything in each constraint to one side so that it reads <= 0.
    3. We have one constraint per data point, so one Lagrange multiplier per point.
  2. Solve for the optimal primal variables, i.e. for w and w0.

    1. In the parabola example this step left us with an equation in λ only.
    2. We take the gradient of the Lagrangian with respect to the original variables, w and w0, and look for an expression only in terms of the Lagrange multipliers.
    3. The derivative of L(w, w0, λ) with respect to w picks out whatever multiplies w, i.e. the terms with x_n, t_n and λ_n.
    4. Solving for w gives w = Σ_n λ_n t_n x_n.
    5. Setting the derivative with respect to w0 to zero gives Σ_n λ_n t_n = 0; w0 itself does not appear, so we cannot express it in terms of the multipliers alone and we carry this condition along as a constraint.
  3. Substitute back for w and w0.

    1. Just as we substituted x in the parabola example, after substituting we are left with an expression in λ only.
    2. This is not in the book: these are the steps of how to go from the primal to the dual.
    3. We take w expressed in terms of λ and substitute it back; wherever we see w, we substitute.
    4. Importantly, the term w0 Σ_n λ_n t_n that comes out of the various multiplications disappears because of the constraint Σ_n λ_n t_n = 0 we found before.
    5. Also, the remaining quadratic term Σ_n Σ_m λ_n λ_m t_n t_m x_n^T x_m is the same as the square of the norm of w, i.e. w^T w. (The full derivation is written out below.)
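
Written out (the standard hard-margin derivation, in the notation the notes use: targets t_n ∈ {-1, +1}, multipliers λ_n):

```latex
\begin{aligned}
L(\mathbf{w}, w_0, \boldsymbol\lambda)
  &= \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
     - \sum_n \lambda_n \left[ t_n(\mathbf{w}^\top\mathbf{x}_n + w_0) - 1 \right] \\
\frac{\partial L}{\partial \mathbf{w}} = 0
  \;&\Rightarrow\; \mathbf{w} = \sum_n \lambda_n t_n \mathbf{x}_n, \qquad
\frac{\partial L}{\partial w_0} = 0
  \;\Rightarrow\; \sum_n \lambda_n t_n = 0 \\
\tilde{L}(\boldsymbol\lambda)
  &= \sum_n \lambda_n
     - \tfrac{1}{2}\sum_n\sum_m \lambda_n\lambda_m t_n t_m \,\mathbf{x}_n^\top\mathbf{x}_m,
  \qquad \lambda_n \ge 0,\;\; \sum_n \lambda_n t_n = 0
\end{aligned}
```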

Formulations

  • The dual formulation: maximize Σ_n λ_n - (1/2) Σ_n Σ_m λ_n λ_m t_n t_m x_n^T x_m.
  • We want to maximize this subject to λ_n ≥ 0 and Σ_n λ_n t_n = 0.
  • Once we have done this, we have two options:
    • Compute the dual formulation, optimize it to find the λs, and use the expression for w we found to retrieve the separating boundary in terms of the Lagrange multipliers, so that we can plot the hyperplane in the original dimensionality.
    • Or, once you have solved the problem in the dual space and you have the lambdas, you do not have to go back to the original space: you can classify using the lambdas directly.
  • After the optimization is over we know the λs, and t and x come from the data, so we know w as well!

  • We can get w* in terms of λ from w* = Σ_n λ_n t_n x_n, and in the constraint t_n (w^T x_n + w0) = 1 we then know everything except w0.
  • Not knowing w0 is not a problem, because every constraint that corresponds to a support vector is satisfied at equality. Therefore, we can get w0 from any one of the support-vector constraints.

  • But in reality there will be noise, hence we use all the equations of the support vectors to solve for w0 and take the average:
    • w0 = (1/N_S) Σ_{n∈S} ( t_n - Σ_{m∈S} λ_m t_m x_m^T x_n ), where S is the set of support vectors.
    • Each one gives a slightly different estimate of w0, and we take the average.
    • A very important fact: the hyperplane is entirely defined by the support vectors!
    • The support vectors are the only points whose Lagrange multipliers are non-zero, since their constraints are active (and active constraints determine the solution).
    • The hyperplane is a linear combination of the support vectors, weighted by the multipliers times the class labels.
      • Hence we can throw away the rest of the dataset.
  • Non-parametric vs parametric:
    • Parametric: you decide the structure at the beginning and you only learn the values of the parameters.
      • e.g. the primal formulation, as we have to start with a fixed dimension: the number of features is predetermined.
    • Non-parametric: we use points in the dataset to classify, each point becomes a parameter, and we do not know in advance how much data we will have.
      • e.g. the dual formulation: how many parameters does it have? As many non-zero lambdas as are left, i.e. as many as the support vectors, because all the others are zero. The parameters are the non-zero Lagrange multipliers.
      • All non-parametric methods end up memorizing some points; the SVM is exceptional because it only needs to memorize the support vectors.
  • The SVM is optimal because it is the maximum-margin classifier, so you cannot do anything better. (A small sketch of recovering w and w0 from the dual variables follows below.)
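
A sketch of this with scikit-learn's SVC, on a hypothetical linearly separable toy dataset (the blob centres, the large C used to approximate the hard margin, and the variable names are all assumptions, not from the lecture):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two well-separated blobs, labels mapped to {-1, +1}
X, t = make_blobs(n_samples=40, centers=[[-2, -2], [2, 2]],
                  cluster_std=0.6, random_state=0)
t = 2 * t - 1

clf = SVC(kernel='linear', C=1e6).fit(X, t)   # large C ~ hard margin

# dual_coef_ holds lambda_n * t_n for the support vectors only
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()   # w = sum_n lambda_n t_n x_n
print('number of support vectors:', len(clf.support_vectors_))
print('w from the duals:', w)
print('w from sklearn  :', clf.coef_.ravel())         # should match

# w0 averaged over the support-vector constraints t_n (w.x_n + w0) = 1
t_sv = t[clf.support_]
w0 = np.mean(t_sv - clf.support_vectors_ @ w)
print('w0 (averaged):', w0, '  sklearn intercept:', clf.intercept_[0])
```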

Dual problem

  • This is the dual formulation we derived; we can classify using the dual directly, instead of going back to the original space, via y(x) = sign( Σ_n λ_n t_n x_n^T x + w0 ).
  • The interesting point is that in the dual formulation the input vectors only ever appear inside dot products.
  • We did all of this so that we can increase the dimensionality using basis functions.

Experience the power of kernels

  • If we substitute x for its expanded version that uses the basis functions, we only ever have basis-function vectors multiplying each other in dot products.
  • We take two points and project them through the basis expansion (the sketch below assumes concrete values, since the slide's numbers are not in these notes).
  1. Computing the dot product in the expanded space:
    • the dot product is 4.
  2. Evaluating the kernel on the original points:
    • the result is also 4!
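
The slide's concrete points are not recorded in the notes; as an assumption, take a = b = (1, 1) with the quadratic feature map φ(x) = (x1^2, √2·x1·x2, x2^2), which reproduces the value 4 both ways:

```python
import numpy as np

a = np.array([1.0, 1.0])
b = np.array([1.0, 1.0])

def phi(x):
    # explicit quadratic feature map
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

print(phi(a) @ phi(b))   # dot product in the expanded space -> 4.0
print((a @ b)**2)        # kernel k(a, b) = (a.b)^2 on the original points -> 4.0
```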

Example: Polynomial features

  • The kernel is constructed so that the dot product is computed without ever having to work in the expanded space.
  • We can construct features so that their dot product can be computed in terms of the original values instead of the expanded values: this is the kernel trick!

Kernels

  • This can be done with different sets of features.
  • Polynomial: k(x, z) = (x^T z)^d (or (x^T z + c)^d with an offset). By changing the exponent d we are implicitly computing the features that correspond to a polynomial of that order.
    • We can compute the dot product of a very high-dimensional space without expanding it; all you have to do is increase the exponent, and it takes very little extra time.
  • Gaussian: k(x, z) = exp(-||x - z||^2 / (2σ^2)).
    • This corresponds to the dot product of infinite-dimensional feature vectors.
  • Sigmoid: k(x, z) = tanh(a x^T z + b). The explicit expansion would be enormous, but we never have to compute it; we just evaluate the kernel (see the sketch below).
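
A minimal sketch of these kernels as plain functions (the constants c and σ and the inhomogeneous polynomial form are illustrative choices); the point is that a degree-50 kernel costs the same single dot product as a degree-2 one:

```python
import numpy as np

def poly_kernel(x, z, d, c=1.0):
    # k(x, z) = (x.z + c)^d: one dot product and one power, however big
    # the implicit feature space is
    return (x @ z + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2)): dot product of
    # infinite-dimensional feature vectors
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x, z = rng.standard_normal(10), rng.standard_normal(10)
print(poly_kernel(x, z, d=2), poly_kernel(x, z, d=50))  # same cost for both
print(rbf_kernel(x, z))
```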

Constructing kernels

  • There are rules for combining kernel functions to create new kernels (for example, sums and products of valid kernels are valid kernels). A valid kernel means that there exists some feature vector whose dot product it computes; we do not even need to know explicitly what those features are.
  • The important thing is that we can compute the DOT product.

Kernels

  • Therefore, we take the dual formulation and substitute the dot products with kernels, and now we are classifying in however many dimensions the kernel implies WITHOUT ever going there!
  • This is the key, because it is not a linear classifier in the original space any more: it is linear in the expanded space, so in the original dimensions the decision boundary can be non-linear.
  • It is a hyperplane in that huge space, but you do not have to go there: you can compute the dot product in that space through the kernel, without expanding anything.
  • What was limited as a linear method in the original space, which means the data had to be linearly separable to begin with, can now be projected through the kernel into a space with up to infinitely many dimensions, where it becomes linearly separable.

Spaces

  • If the dataset is not linearly separable in its low-dimensional space, what we are doing is projecting it, through the kernel, into a high-dimensional space where it is linearly separable, and then classifying there (see the sketch below).
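
A quick sketch of exactly this with scikit-learn: concentric circles are not linearly separable in 2D, but an RBF-kernel SVM separates them by working implicitly in a higher-dimensional space (the dataset and gamma are illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, t = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel='linear').fit(X, t)
rbf = SVC(kernel='rbf', gamma=2.0).fit(X, t)

print('linear kernel training accuracy:', linear.score(X, t))  # poor: not separable in 2D
print('RBF kernel training accuracy   :', rbf.score(X, t))     # close to 1.0
```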

History

  • 1963: the original SVM, by Vapnik and Chervonenkis.
  • 1992: the idea (Boser, Guyon and Vapnik) of applying the kernel trick to the dual formulation changed everything; now we can project into effectively infinite dimensions and classify almost anything.
  • 1995: Cortes and Vapnik's proposal of the soft-margin SVM.
  • The two things together, the soft-margin SVM and the dual formulation with the kernel trick, give an extremely powerful classifier: it is maximum margin and, given a suitable kernel, it can always separate the data.

Comparison of different methods

  1. Neural networks
    • Can be used for everything, discriminative and generative, and are super powerful because you can add as many layers as you want.
    • But they learn through gradient descent, which can get stuck in local minima and forces you to do random restarts, and they can easily overfit.
    • When to use: for big data.
  2. Support Vector Machines
    • Always give you the optimal solution.
    • But it is a difficult problem to solve, and the difficulty grows with the number of constraints (one per data point).
    • Hence they cannot easily be used for very large datasets.
    • When to use: when the dataset is not too big; it will give you the best result since it is the maximum-margin classifier, and with kernels (and their effectively infinite dimensions) it can classify almost anything.
  3. Trees
    • Easy to understand, and they work for non-metric data (the others only work for continuous inputs).
    • But they do not control the number of features, and can grow into a large tree which is no longer intelligible.
    • Through boosting, trees become powerful.
    • When to use: for non-metric data.
  • In the end, we have to find the method that works for our situation:
    • Non-metric data: trees.
    • Small dataset: SVM.
    • Everything else: neural networks.
