Elements of Local Optimisation

4 minute read

Updated: February 11, 2020

5. Elements of Local Optimisation

Learning outcomes

We define an error function and we try to minimize them by defining an algorithm that minimizes it
Describe the difference btwn zero, first and second order optimisation
Apeopley gradient descent to a given object function we gonna use this all the time
Choose and appropriate Step size for gradient descent.

Optimisation Goal

It is about finding the minimum of the function
For instance, we have 2x² + x has a single minimum what’s special about this minimum is the fact that the derivate of the stationary points are zero. Hence, the tangential slope is a straight line
Derivative of a point is slope of the tangential line going through that point
- Interesting is that it gives you an idea where the function grows
- When the gradient is 0 – tangential line is straight and the function hasn’t grown
  - That is the point we are looking for !! either a min or max.

Local methods

Analogy of ski – we don’t know the function we are minimizing. Only thing we have is the local info aka the view around us.
We dunno what the mountain looks like but we can see around us, and knowing we wanna get to bottom, we can look around for local info - It will give you a direction to take to get down.
The point is, it’s the steepest that gets you the quickest – ideally is the best
By looking at the surrounding area, you can do 3 things
- Zero order – any direction that’s going down, no matter the steepness
- First order – requires first derivative of the function– derivatives are about the local behaviour. It tells you what the direction around the point is where the function is growing the fastest – the thing that points in the direction of the fastest ascent is gradient
  - Gradient of the function = vector of partial derivative
  - We want anti gradient, which is gradient-1
Second order - it computes the optimal step, but is computationally expensive

<Gradient descent>

0. Zero order

We don’t even compute the derivative; we just pick any direction that looks like its gonna improve → inefficient

1. First order – gradient descent – use the most

We are given an initial point xt, and we want to reach the better point.
In 2d you can go left or right. In more direction, the anti -gradient will be the direction of the steepest
step = learning rate.

We start with random xt, and it will give some initial error, and we change parameters to minimize this
- every time I start randomly, I start at a different initial state.

The next point = current point – n (step parameter: step size)* substitute current point in derivative of f(x_t)
Continue until the gradient is small enough – close to 0 – which is close to min or max
- Computing gradient is cheap enough to do it

2. Second order method -newtons method

Uses the second order expansion of the function, → Taylor’s expansion
If we compute the second order derivative, you can actually solve the eq for the optimal step
Computing this is computationally expensive → instead of computing the optimal step, its faster to do suboptimal

Preliminaries - Partial derivatives

after doing the partial derivation of each one in respect to x, y,z combine it to a vector this is the gradient

Question Example - The current point is <1,0>, compute the next point following gradient descent on the function

f(x,y) = x3 + 2y2 - y with step size 0.1.

Learning rate (n) – this decides how fast the network learns.

The learning rate controls how much to change the weights by.
Small learning rate – weights need to see the inputs more often before they change significantly – takes longer to learn. BUT its more stable and resistant to errors and inaccuracies.

In3D

step size n of 0.1 and 0.14
We get a decrease until we reach 0, although we will never hit 0 since it is a continuous space
- Need some sort of threshold that makes me happy stop!
When you increase the step size - you could just end up to the other side too quickly, so it goes back and forth, and it doesn’t reach the minimum
→ Step size has to be small enough to not diverge, but big enough to not take too long

Local Minima

We don’t know the shape of the whole function – it may have more than one minimum
- Global min – to find global one over the generic graph
- Local min – the first stationary point that it stops at
!! Initial point is important because for each area you are in, you gonna get a diff min every time I start randomly, we start at a diff initial state. Then do gradient descent, find minimum and evaluate it, and do it again. Then we look at what’s the best. !!!!