Elements of Local Optimisation



Learning outcomes

  • Define an error function and design an algorithm that minimises it
  • Describe the difference between zero-, first- and second-order optimisation
  • Apply gradient descent to a given objective function (this comes up constantly)
  • Choose an appropriate step size for gradient descent

Optimisation Goal

  • Optimisation is about finding the minimum of a function.
  • For instance, f(x) = 2x^2 + x has a single minimum. What is special about this minimum is that the derivative at the stationary point is zero, so the tangent line there is horizontal.
  • The derivative at a point is the slope of the tangent line going through that point.
    • It tells you in which direction the function grows.
    • When the derivative is 0, the tangent line is horizontal and the function is locally neither growing nor shrinking.
      • That is the point we are looking for: either a minimum or a maximum.
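The stationary-point idea above can be checked numerically. A minimal sketch (the finite-difference helper is my own illustration, not from the notes): for f(x) = 2x^2 + x, f'(x) = 4x + 1, so the stationary point is x = -1/4, where the derivative vanishes.

```python
def f(x):
    return 2 * x**2 + x

def numerical_derivative(f, x, h=1e-6):
    # central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x_star = -0.25  # analytic stationary point: f'(x) = 4x + 1 = 0
print(numerical_derivative(f, x_star))  # approximately 0: the tangent is horizontal
```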

Local methods

  • Ski analogy: we don't know the function we are minimising. The only thing we have is local information, i.e. the view around us.
  • We don't know what the whole mountain looks like, but we can see our surroundings, and since we want to get to the bottom, the local information gives us a direction to take downhill.
  • The point is, the steepest direction gets you down the quickest, so ideally it is the best choice.
  • By looking at the surrounding area, you can do three things:
    • Zero order: take any direction that goes down, no matter the steepness.
    • First order: requires the first derivative of the function. Derivatives describe local behaviour: they tell you the direction around the point in which the function grows fastest. The thing that points in the direction of fastest ascent is the gradient.
      • Gradient of the function = vector of partial derivatives
      • We want the anti-gradient, i.e. the negative of the gradient, -∇f
    • Second order: computes the optimal step, but is computationally expensive.

Gradient descent

0. Zero order

  • We don't even compute the derivative; we just pick any direction that looks like it will improve the function → inefficient

1. First order – gradient descent – the most commonly used

  • We are given an initial point x_t, and we want to reach a better point.
  • For a function of one variable you can only go left or right. In more dimensions, the anti-gradient gives the direction of steepest descent.
  • The step size is also called the learning rate.
  • We start with a random x_t, which gives some initial error, and we change the parameters to minimise it.
    • Every time we start randomly, we start at a different initial state.
  • Update rule: next point = current point - η (step parameter: step size) × gradient at the current point, i.e. x_{t+1} = x_t - η ∇f(x_t)
  • Continue until the gradient is small enough (close to 0), which means we are close to a minimum or maximum.
    • Computing the gradient is cheap enough to do at every step.
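The loop above can be sketched in a few lines. This is a minimal illustration on the 1-D example f(x) = 2x^2 + x (the starting point and tolerance are my own choices, not from the notes):

```python
def grad(x):
    return 4 * x + 1  # f'(x) for f(x) = 2x^2 + x

x = 3.0    # some initial point x_t
eta = 0.1  # step size / learning rate
for _ in range(200):
    g = grad(x)
    if abs(g) < 1e-8:    # stop once the gradient is close to zero
        break
    x = x - eta * g      # x_{t+1} = x_t - eta * f'(x_t)
print(x)  # converges near the minimum at x = -0.25
```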

2. Second order method – Newton's method

  • Uses the second-order expansion of the function → Taylor expansion
  • If we compute the second derivative, we can actually solve for the optimal step analytically.
  • Computing this is computationally expensive → instead of computing the optimal step, it is often faster to take a suboptimal one.
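A 1-D sketch of the Newton step (my own illustration on the same example function): the second derivative scales the step, and for a quadratic the "optimal step" lands on the minimum in a single iteration.

```python
def fp(x):
    return 4 * x + 1  # f'(x) for f(x) = 2x^2 + x

def fpp(x):
    return 4.0        # f''(x) is constant for a quadratic

x = 3.0
for _ in range(10):
    x = x - fp(x) / fpp(x)  # Newton step: x_{t+1} = x_t - f'(x_t)/f''(x_t)
print(x)  # for a quadratic, the very first step already lands on x = -0.25
```

In higher dimensions the second derivative becomes the Hessian matrix, and inverting it is what makes the method expensive.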

Preliminaries - Partial derivatives

  • Take the partial derivative with respect to each variable (x, y, z) and combine them into a vector: this is the gradient.
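As a sketch of "partial derivatives combined into a vector", here is a numerical gradient for a made-up example function g(x, y, z) = x^2 + xy + z (the function and helper are my own illustration):

```python
def g(x, y, z):
    return x**2 + x*y + z

def partial(f, args, i, h=1e-6):
    # central difference along coordinate i, other coordinates held fixed
    lo, hi = list(args), list(args)
    lo[i] -= h
    hi[i] += h
    return (f(*hi) - f(*lo)) / (2 * h)

point = (1.0, 2.0, 3.0)
gradient = [partial(g, point, i) for i in range(3)]
print(gradient)  # analytic gradient (2x + y, x, 1) = (4, 1, 1) at this point
```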

Question example – The current point is (1, 0); compute the next point following gradient descent on the function

f(x, y) = x^3 + 2y^2 - y with step size 0.1.

  • ∇f(x, y) = (3x^2, 4y - 1), so ∇f(1, 0) = (3, -1). Next point: (1, 0) - 0.1 · (3, -1) = (0.7, 0.1).
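The same single gradient-descent step, checked in code:

```python
def grad_f(x, y):
    # vector of partial derivatives of f(x, y) = x^3 + 2y^2 - y
    return (3 * x**2, 4 * y - 1)

x, y = 1.0, 0.0
eta = 0.1
gx, gy = grad_f(x, y)
x_next, y_next = x - eta * gx, y - eta * gy
print((x_next, y_next))  # (0.7, 0.1)
```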

Learning rate (η) – this decides how fast the network learns.

  • The learning rate controls how much to change the weights by.
  • Small learning rate – the weights need to see the inputs more often before they change significantly, so learning takes longer. BUT it is more stable and resistant to errors and inaccuracies.

In 3D

  • [Figure: gradient descent runs with step sizes η = 0.1 and 0.14]
  • We get a decrease until we approach 0, although we will never hit 0 exactly since we are in a continuous space.
    • We need some threshold at which we are happy to stop.
  • When you increase the step size too much, you can overshoot to the other side, so the iterates bounce back and forth and never reach the minimum.
  • → The step size has to be small enough not to diverge, but big enough not to take too long.
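The overshoot-and-diverge behaviour is easy to see on f(x) = x^2, where the update is x ← (1 - 2η)x, so the iterates shrink only when |1 - 2η| < 1, i.e. η < 1 (the specific step sizes below are my own illustration):

```python
def run(eta, x0=1.0, steps=50):
    # gradient descent on f(x) = x**2, whose gradient is 2x
    x = x0
    for _ in range(steps):
        x = x - eta * 2 * x
    return x

print(run(0.1))  # small step: converges toward the minimum at 0
print(run(1.1))  # too large: each step overshoots further, iterates blow up
```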

Local Minima

  • We don't know the shape of the whole function – it may have more than one minimum.
    • Global minimum – the lowest value over the whole function
    • Local minimum – the first stationary point the descent stops at
  • !! The initial point is important: depending on the region you start in, you end up in a different minimum. So start at a random initial state, run gradient descent, find a minimum and evaluate it; repeat from new random starts; then keep the best result. !!
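The random-restart recipe above, sketched on a made-up function with two local minima (the function, step size, and number of restarts are my own illustrative choices):

```python
import random

def f(x):
    return x**4 - 3 * x**2 + x   # has two local minima of different depths

def grad(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, eta=0.01, steps=2000):
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

random.seed(0)
starts = [random.uniform(-2, 2) for _ in range(10)]
minima = [descend(x0) for x0 in starts]
best = min(minima, key=f)        # keep the restart that found the lowest value
print(best, f(best))  # the deeper minimum, near x ≈ -1.3
```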
