7. Perceptron III
Learning outcomes
- Define an appropriate error function for the perceptron
- Derive the corresponding update algorithm
- Describe the difference between gradient descent and stochastic gradient descent.
Number of mistakes
- The number of mistakes is locally constant everywhere, so it gives no information as we move: if we shift the decision boundary a little, it does not tell us whether we are improving or not.
- We need something that gives a better idea of how much we are mistaken by (see the sketch below).
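A tiny numeric sketch of this flatness (data and numbers invented for illustration): counting mistakes gives a piecewise-constant function of w, so small moves of the boundary leave it unchanged and its gradient is zero almost everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # 100 points, 2 features
t = (X[:, 0] + X[:, 1] > 0).astype(int)  # targets in {0, 1}

def mistakes(w):
    y = (X @ w > 0).astype(int)          # step-activation outputs
    return int(np.sum(y != t))

w = np.array([1.0, -0.5])
# Tiny nudges to w almost never change the count, so the "error" is flat:
print(mistakes(w), mistakes(w + 1e-6), mistakes(w - 1e-6))
```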
Towards a better error function
- In the previous lecture we went from the number of mistakes to a sum over points.
- We can move from counting mistakes to summing an error for each misclassified point; that way the total of the sum is the total error.
- This is close to what we want, but not exactly. The signed distance D is positive or negative depending on whether the misclassified point is on the same side as w or on the other side. The problem is that two errors can cancel each other out, giving a total of 0 (see the numeric example below).
- We fix this in an easy way on the next slide.
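A minimal numeric illustration of the cancellation (the values are invented for the example): the raw signed quantities wᵀx of two misclassified points can sum to zero even though both points are wrong.

```python
import numpy as np

w = np.array([1.0, 0.0])
# Two misclassified points on opposite sides of the boundary:
x1, t1 = np.array([ 2.0, 0.0]), 0   # w.x1 = +2: output 1, wanted 0
x2, t2 = np.array([-2.0, 0.0]), 1   # w.x2 = -2: output 0, wanted 1

print((w @ x1) + (w @ x2))          # 0.0, despite two mistakes
```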
The perceptron criterion
- t = the value we want (target), y = the actual output of the neuron.
- Whenever wᵀx > 0, the neuron outputs 1; if we actually wanted 0, then y − t = 1:
- wᵀx > 0
- y − t = 1
- wᵀx (y − t) = positive × 1, which is positive
- Say wᵀx < 0, so the neuron outputs 0; if we actually wanted 1, then y − t = −1:
- wᵀx < 0
- y − t = −1
- wᵀx (y − t) = negative × (−1), which is positive
- If we are not making a mistake, y − t = 0; if there is a mistake, wᵀx (y − t) is positive:
- wᵀx (y − t) = positive × 1
- wᵀx (y − t) = negative × (−1)
- So wᵀx (y − t) is always positive in case of a mistake!! Cancelling doesn't happen anymore.
New function!
- E(w) = Σ_{n ∈ M} wᵀxn (yn − tn), summed over the set M of misclassified points: the perceptron criterion.
- No mistake = 0, mistake = strictly positive; no cancelling of errors anymore.
- It is proportional to the distance of the misclassified points from the surface, so it changes as the decision boundary changes (a code sketch follows below).
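A sketch of this criterion as code, assuming targets t in {0, 1} and a step activation as above (the function name is mine, not from the lecture):

```python
import numpy as np

def perceptron_error(w, X, t):
    """Perceptron criterion: sum of w.x * (y - t) over misclassified points."""
    y = (X @ w > 0).astype(int)   # step-activation outputs
    mis = y != t                  # mask of misclassified points
    return float(np.sum((X[mis] @ w) * (y[mis] - t[mis])))
```

Every term in the sum is positive, so the error is zero exactly when no point is misclassified.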
Question
- Given the perceptron error, what is the gradient with respect to w?
- We take the per-point error wᵀx (y − t) and expand w and x into their components:
- wᵀx (y − t) = w0x0 (y − t) + w1x1 (y − t) + … + wnxn (y − t)
- = w0x0y − w0x0t + w1x1y − w1x1t + …
- We take the derivative with respect to each weight wi; all the other terms in the sum disappear.
- The partial derivatives are x0y − x0t, x1y − x1t, …
- Hence, x0 (y − t), x1 (y − t), …
- Therefore, the gradient is the vector of whatever multiplies each weight: ∇E = x (y − t).
- Stepping against this gradient, w ← w − η x (y − t), is exactly the perceptron algorithm (see the sketch below).
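The per-point gradient x (y − t) turns directly into the classic update rule. A minimal sketch, where the learning rate eta is a parameter I have added for generality:

```python
import numpy as np

def perceptron_step(w, x, t, eta=1.0):
    """One update: w <- w - eta * x * (y - t); a no-op when y == t."""
    y = int(x @ w > 0)            # step-activation output
    return w - eta * x * (y - t)
```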
Gradient descent
- Combining the new error function we derived for the perceptron with gradient descent:
- E(w) = Σ_{n ∈ M} wᵀxn (yn − tn) is the sum of the error components due to each misclassified point.
- The gradient is therefore the sum of the components due to each single point: ∇E = Σ_{n ∈ M} xn (yn − tn) (see the sketch below).
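A sketch of one full (batch) gradient-descent step, summing the per-point gradients xn (yn − tn) over all misclassified points (the function name and eta are illustrative choices):

```python
import numpy as np

def batch_gradient_step(w, X, t, eta=0.1):
    """One step of gradient descent on the perceptron criterion."""
    y = (X @ w > 0).astype(int)
    mis = y != t                          # misclassified points only
    grad = X[mis].T @ (y[mis] - t[mis])   # sum of x_n * (y_n - t_n)
    return w - eta * grad
```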
Stochastic gradient descent
- In practice we don't compute the full sum, because it takes too long; we take one point at a time.
- Minimizing the total error is the same as minimizing the average error, since multiplying the objective function by a constant does not change the location of its minimum.
- Average error = EXPECTED error, with each point in the dataset having equal probability 1/N.
- We take one point from the dataset (each chosen with probability 1/N), do a gradient step with that point, then take another point, and repeat until we find a minimum. IT'S NOISY: for some points the error increases, but on average it decreases.
- Stochastic gradient descent: the gradient is computed with respect to a single point extracted from a distribution. If all points have the same probability, it converges, in expectation, to the same solution as gradient descent on the average error.
- Faster! Each update does not require computing the sum over all misclassified points.
- But it requires more iterations, since each update uses a single point.
- It improves the objective function in expectation, meaning on average; real gradient descent improves it at every single step.
- In practice we do something in between: one point is not accurate, the whole dataset takes too long, so we take a minibatch!
- We use a certain number of misclassified points (the size of the minibatch).
- This gives a better estimate of the gradient than a single point, but is still fast to compute (see the sketch after this list).
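A sketch of the stochastic and minibatch variants under the same setup; points are drawn uniformly at random, and the batch size and eta are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(w, X, t, eta=0.1):
    """SGD: gradient from a single point drawn with probability 1/N."""
    n = rng.integers(len(X))
    y = int(X[n] @ w > 0)
    return w - eta * X[n] * (y - t[n])

def minibatch_step(w, X, t, batch=32, eta=0.1):
    """Compromise: gradient summed over a small random subset."""
    idx = rng.choice(len(X), size=batch, replace=False)
    Xb, tb = X[idx], t[idx]
    y = (Xb @ w > 0).astype(int)
    mis = y != tb                          # misclassified points in the batch
    return w - eta * (Xb[mis].T @ (y[mis] - tb[mis]))
```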
Learning graphically
- We have to sample uniformly from the whole dataset, with probability 1/N each.
- Otherwise, using the error of one point can make the error worse on another point.
- IMPORTANT: the samples must be independent and identically distributed (IID).
- Identically distributed: you don't change the distribution throughout training; you can fix it at the beginning.
- Independent: you cannot go through the points in the same fixed order; that will not reduce the global error, since the error at the bottom improves while the error at the top stays put. If instead you take bits from everywhere, it gets good everywhere (a common practical recipe is sketched below).
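One common way to approximate IID draws in practice is to reshuffle the whole dataset on every pass instead of visiting points in a fixed order; a sketch under the same setup as above (the epoch loop and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def train(w, X, t, eta=0.1, epochs=10):
    """Per-point updates, visiting points in a fresh random order each epoch."""
    for _ in range(epochs):
        for n in rng.permutation(len(X)):
            y = int(X[n] @ w > 0)
            w = w - eta * X[n] * (y - t[n])   # perceptron update
    return w
```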