7. Perceptron III
Learning outcomes
- Define an appropriate error function for the perceptron
- Derive the corresponding update algorithm
- Describe the difference between gradient descent and stochastic gradient descent.
Number of mistakes
- The number of mistakes is locally constant everywhere, so it gives no information as we move: if we shift the decision boundary a little, it does not tell us whether we are improving or not.
- We need something that gives a better idea of how much we are mistaken by, not just how many mistakes we make.
Towards a better error function
- In the previous lecture, we saw that the signed distance of a point x from the decision surface is proportional to wᵀx.
- We can move from counting mistakes to summing an error term for each misclassified point, so that the total is proportional to the total distance of the misclassified points from the surface:
  E(w) = Σ_{x ∈ M} wᵀx, where M is the set of misclassified points.
- This is close to what we want, but not exactly. The term wᵀx is positive or negative depending on whether the misclassified point is on the same side as w or on the other side. The problem is that two errors can then cancel each other out, giving a total of 0.
- We fix this in an easy way on the next slide.
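The cancellation problem above can be seen with two made-up points (the numbers here are purely illustrative, not from the lecture): both are misclassified, yet their signed contributions wᵀx sum to zero.

```python
import numpy as np

w = np.array([1.0, -1.0])
x1 = np.array([2.0, 0.0])   # misclassified point with w @ x1 = +2
x2 = np.array([0.0, 2.0])   # misclassified point with w @ x2 = -2

signed = w @ x1 + w @ x2    # summed signed "error"
print(signed)               # 0.0: two real mistakes cancel each other out
```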
The perceptron criterion
- t = the value we want (the target), y = the actual output of the neuron.
- Suppose wᵀx > 0: the neuron outputs y = 1 when we actually wanted t = 0, so y − t = 1.
  - wᵀx > 0
  - y − t = 1
  - wᵀx (y − t) = positive × 1 → positive
- Suppose wᵀx < 0: the neuron outputs y = 0 when we actually wanted t = 1, so y − t = −1.
  - wᵀx < 0
  - y − t = −1
  - wᵀx (y − t) = negative × −1 → positive
- If we are not making a mistake, y − t = 0 and the term vanishes; if there is a mistake, the term is positive.
- So wᵀx (y − t) is always positive in case of a mistake, and the cancelling no longer happens!
New function! E(w) = Σ_{x ∈ M} wᵀx (y − t), summed over the misclassified points M.
- No mistake contributes 0; every mistake contributes a strictly positive term, so errors no longer cancel.
- It is proportional to the distance of the misclassified points from the surface.
- Because it is proportional to that distance, the error changes as the decision boundary changes, which is exactly the information gradient descent needs.
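The perceptron criterion above can be sketched in a few lines of NumPy (the function name and toy data are my own, chosen for illustration):

```python
import numpy as np

def perceptron_criterion(w, X, t):
    """Sum of w^T x (y - t) over all points.

    Correctly classified points contribute 0 (since y - t = 0);
    every misclassified point contributes a strictly positive term."""
    y = (X @ w > 0).astype(float)   # step-function output of the neuron
    return np.sum((X @ w) * (y - t))

w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.0],    # w^T x = +2, so y = 1
              [0.0, 2.0]])   # w^T x = -2, so y = 0
t = np.array([0.0, 1.0])     # both targets disagree with the outputs
print(perceptron_criterion(w, X, t))  # 4.0: both mistakes count positively
```

Note how the same two points that cancelled to 0 under the naive signed sum now give a strictly positive error.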
Question
- Given the perceptron error, what is the gradient with respect to w?
- Take the error for a single point and expand w into its components:
  wᵀx (y − t) = w₀x₀(y − t) + w₁x₁(y − t) + … + wₙxₙ(y − t)
              = w₀x₀y − w₀x₀t + w₁x₁y − w₁x₁t + …
- When we take the partial derivative with respect to one wᵢ, all the other terms in the sum disappear:
  ∂E/∂wᵢ = xᵢy − xᵢt = xᵢ(y − t)
- Therefore, the gradient is the vector of what multiplies each component of w:
  ∇w E = x (y − t)
- Plugging this into the gradient-descent update gives w ← w − η x (y − t): this is the perceptron algorithm.
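The single-point update derived above can be sketched as follows (a minimal NumPy version; the function name, learning rate, and example numbers are my own):

```python
import numpy as np

def perceptron_update(w, x, t, eta=1.0):
    """One perceptron step: the gradient of w^T x (y - t) w.r.t. w
    is x (y - t), so gradient descent gives w <- w - eta * x * (y - t)."""
    y = float(x @ w > 0)            # current output of the neuron
    return w - eta * x * (y - t)

w = np.array([1.0, -1.0])
x = np.array([0.0, 2.0])            # w^T x = -2, so the neuron outputs y = 0
t = 1.0                             # we wanted 1: a mistake
w = perceptron_update(w, x, t)      # w - 1.0 * x * (0 - 1) = w + x
print(w)                            # [1. 1.]
```

For a correctly classified point y − t = 0, so the update leaves w unchanged, matching the case analysis above.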
Gradient descent
- Combining the new perceptron error function with gradient descent: the total error is the sum of the error components due to each misclassified point,
  E(w) = Σ_{x ∈ M} wᵀx (y − t).
- The gradient is therefore the sum of the per-point gradients:
  ∇w E = Σ_{x ∈ M} x (y − t).
Stochastic gradient descent
- In practice we do not compute the full sum, because it takes too long; we take one point at a time.
- Minimizing the total error is the same as minimizing the average error, because multiplying the objective function by a constant (here 1/N) does not change where its minimum is.
- The average error is an EXPECTED error, with each point in the dataset having equal probability 1/N.
- You pick one point from the dataset (each point chosen with probability 1/N), do a gradient step using that point alone, then pick another point, and repeat until you reach a minimum. This is NOISY: for some points the error increases, but on average it decreases.
- Stochastic gradient descent: the gradient is computed with respect to a single point extracted from a distribution. If all points have the same probability, it converges, in expectation, to the same solution as gradient descent on the average error.
- Faster! Each update does not require computing the sum over all misclassified points.
- But it requires more iterations, since it updates one point at a time.
- It improves the function in expectation, meaning on average; real gradient descent improves at every single step.
- In practice, we do something in between: one point is not accurate, the whole dataset is too slow, so we take a minibatch!
- Use a certain number of misclassified points (the size of the minibatch).
- This gives a better estimate of the gradient than a single point, but is still faster to compute than the full sum.
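The minibatch idea above can be sketched like this (function name, batch size, learning rate, and toy data are my own assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_sgd_step(w, X, t, batch_size=4, eta=0.1):
    """One minibatch step: sample `batch_size` points uniformly
    (probability 1/N each), sum their per-point gradients x (y - t),
    and move against that estimate of the full gradient."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, tb = X[idx], t[idx]
    yb = (Xb @ w > 0).astype(float)  # step-function outputs on the batch
    grad = Xb.T @ (yb - tb)          # sum over the minibatch of x (y - t)
    return w - eta * grad

w0 = np.zeros(2)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
t = np.ones(4)                       # all targets are 1; w0 outputs 0
w1 = minibatch_sgd_step(w0, X, t, batch_size=4)
print(w1)                            # [0.4 0.4]
```

With `batch_size=1` this reduces to plain stochastic gradient descent, and with `batch_size=len(X)` it is ordinary gradient descent on the full sum.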
Learning graphically
- We have to sample uniformly from the whole dataset, each point with probability 1/N.
- Otherwise, reducing the error on one point could make the error worse on another point.
- IMPORTANT: the samples should be i.i.d., independent and identically distributed.
- Identically distributed: you do not change the distribution throughout training; you can fix it at the beginning.
- Independent: do not go through the points in the same fixed order, region by region, as this will not reduce the global error: the error at the bottom improves while the error at the top stays the same. If instead you take bits from everywhere, the solution is good everywhere.