7. Perceptron III
Learning outcomes
- Define an appropriate error function for the perceptron
- Derive the corresponding update algorithm
- Describe the difference between gradient descent and stochastic gradient descent.
Number of mistakes
- The number of mistakes is locally constant everywhere, so it gives no information as we move: if we shift the decision boundary a little, it does not tell us whether we are improving or not.
- We need something that gives a better idea of how much we are mistaken by, not just how many mistakes we make.
Towards a better error function
- In the previous lecture, we saw that the signed distance of a point x from the decision surface is proportional to wᵀx.
- We can move from counting mistakes to summing an error term for each misclassified point, so that the total is proportional to the total distance of the misclassified points from the surface:
  E(w) = Σ_{x ∈ M} wᵀx, where M is the set of misclassified points.
- This is close to what we want, but not exactly. The term wᵀx is positive or negative depending on whether the misclassified point is on the same side as w or on the other side. The problem is that two errors can then cancel each other out, giving a total of 0.
- We fix this in an easy way on the next slide.
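The cancellation problem above can be seen with two made-up points (the numbers here are purely illustrative, not from the lecture): both are misclassified, yet their signed contributions wᵀx sum to zero.

```python
import numpy as np

w = np.array([1.0, -1.0])
x1 = np.array([2.0, 0.0])   # misclassified point with w @ x1 = +2
x2 = np.array([0.0, 2.0])   # misclassified point with w @ x2 = -2

signed = w @ x1 + w @ x2    # summed signed "error"
print(signed)               # 0.0: two real mistakes cancel each other out
```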
The perceptron criterion
- t = the value we want (the target), y = the actual output of the neuron.
- Suppose wᵀx > 0: the neuron outputs y = 1 when we actually wanted t = 0, so y − t = 1.
  - wᵀx > 0
  - y − t = 1
  - wᵀx (y − t) = positive × 1 → positive
- Suppose wᵀx < 0: the neuron outputs y = 0 when we actually wanted t = 1, so y − t = −1.
  - wᵀx < 0
  - y − t = −1
  - wᵀx (y − t) = negative × −1 → positive
- If we are not making a mistake, y − t = 0 and the term vanishes; if there is a mistake, the term is positive.
- So wᵀx (y − t) is always positive in case of a mistake, and the cancelling no longer happens!
New function! E(w) = Σ_{x ∈ M} wᵀx (y − t), summed over the misclassified points M.
- No mistake contributes 0; every mistake contributes a strictly positive term, so errors no longer cancel.
- It is proportional to the distance of the misclassified points from the surface.
- Because it is proportional to that distance, the error changes as the decision boundary changes, which is exactly the information gradient descent needs.
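The perceptron criterion above can be sketched in a few lines of NumPy (the function name and toy data are my own, chosen for illustration):

```python
import numpy as np

def perceptron_criterion(w, X, t):
    """Sum of w^T x (y - t) over all points.

    Correctly classified points contribute 0 (since y - t = 0);
    every misclassified point contributes a strictly positive term."""
    y = (X @ w > 0).astype(float)   # step-function output of the neuron
    return np.sum((X @ w) * (y - t))

w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.0],    # w^T x = +2, so y = 1
              [0.0, 2.0]])   # w^T x = -2, so y = 0
t = np.array([0.0, 1.0])     # both targets disagree with the outputs
print(perceptron_criterion(w, X, t))  # 4.0: both mistakes count positively
```

Note how the same two points that cancelled to 0 under the naive signed sum now give a strictly positive error.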
Question
- Given the perceptron error, what is the gradient with respect to w?
- Take the error for a single point and expand w into its components:
  wᵀx (y − t) = w₀x₀(y − t) + w₁x₁(y − t) + … + wₙxₙ(y − t)
              = w₀x₀y − w₀x₀t + w₁x₁y − w₁x₁t + …
- When we take the partial derivative with respect to one wᵢ, all the other terms in the sum disappear:
  ∂E/∂wᵢ = xᵢy − xᵢt = xᵢ(y − t)
- Therefore, the gradient is the vector of what multiplies each component of w:
  ∇w E = x (y − t)
- Plugging this into the gradient-descent update gives w ← w − η x (y − t): this is the perceptron algorithm.
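The single-point update derived above can be sketched as follows (a minimal NumPy version; the function name, learning rate, and example numbers are my own):

```python
import numpy as np

def perceptron_update(w, x, t, eta=1.0):
    """One perceptron step: the gradient of w^T x (y - t) w.r.t. w
    is x (y - t), so gradient descent gives w <- w - eta * x * (y - t)."""
    y = float(x @ w > 0)            # current output of the neuron
    return w - eta * x * (y - t)

w = np.array([1.0, -1.0])
x = np.array([0.0, 2.0])            # w^T x = -2, so the neuron outputs y = 0
t = 1.0                             # we wanted 1: a mistake
w = perceptron_update(w, x, t)      # w - 1.0 * x * (0 - 1) = w + x
print(w)                            # [1. 1.]
```

For a correctly classified point y − t = 0, so the update leaves w unchanged, matching the case analysis above.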
Gradient descent
- Combining the new perceptron error function with gradient descent: the total error is the sum of the error components due to each misclassified point,
  E(w) = Σ_{x ∈ M} wᵀx (y − t).
- The gradient is therefore the sum of the per-point gradients:
  ∇w E = Σ_{x ∈ M} x (y − t).
Stochastic gradient descent
- In practice we do not compute the full sum, because it takes too long; we take one point at a time.
- Minimizing the total error is the same as minimizing the average error, because multiplying the objective function by a constant (here 1/N) does not change where its minimum is.
- The average error is an EXPECTED error, with each point in the dataset having equal probability 1/N.
- You pick one point from the dataset (each point chosen with probability 1/N), do a gradient step using that point alone, then pick another point, and repeat until you reach a minimum. This is NOISY: for some points the error increases, but on average it decreases.
- Stochastic gradient descent: the gradient is computed with respect to a single point extracted from a distribution. If all points have the same probability, it converges, in expectation, to the same solution as gradient descent on the average error.
- Faster! Each update does not require computing the sum over all misclassified points.
- But it requires more iterations, since it updates one point at a time.
- It improves the function in expectation, meaning on average; real gradient descent improves at every single step.
- In practice, we do something in between: one point is not accurate, the whole dataset is too slow, so we take a minibatch!
- Use a certain number of misclassified points (the size of the minibatch).
- This gives a better estimate of the gradient than a single point, but is still faster to compute than the full sum.
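The minibatch idea above can be sketched like this (function name, batch size, learning rate, and toy data are my own assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_sgd_step(w, X, t, batch_size=4, eta=0.1):
    """One minibatch step: sample `batch_size` points uniformly
    (probability 1/N each), sum their per-point gradients x (y - t),
    and move against that estimate of the full gradient."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, tb = X[idx], t[idx]
    yb = (Xb @ w > 0).astype(float)  # step-function outputs on the batch
    grad = Xb.T @ (yb - tb)          # sum over the minibatch of x (y - t)
    return w - eta * grad

w0 = np.zeros(2)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
t = np.ones(4)                       # all targets are 1; w0 outputs 0
w1 = minibatch_sgd_step(w0, X, t, batch_size=4)
print(w1)                            # [0.4 0.4]
```

With `batch_size=1` this reduces to plain stochastic gradient descent, and with `batch_size=len(X)` it is ordinary gradient descent on the full sum.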
Learning graphically
- We have to sample uniformly from the whole dataset, each point with probability 1/N.
- Otherwise, reducing the error on one point could make the error worse on another point.
- IMPORTANT: the samples should be i.i.d., independent and identically distributed.
- Identically distributed: you do not change the distribution throughout training; you can fix it at the beginning.
- Independent: do not go through the points in the same fixed order, region by region, as this will not reduce the global error: the error at the bottom improves while the error at the top stays the same. If instead you take bits from everywhere, the solution is good everywhere.