Perceptron III


7. Perceptron III

Learning outcomes

  • Define an appropriate error function for the perceptron
  • Derive the corresponding update algorithm
  • Describe the difference between gradient descent and stochastic gradient descent.

Number of mistakes

  • The number of mistakes is locally constant everywhere, so it gives no information as we move: if we shift the decision boundary a little, it doesn't tell us whether we are doing better or worse.
  • We need something that gives a better idea of how much we are mistaken by.

Towards a better error function

  • In the previous lecture we saw that $\mathbf{w}^T\mathbf{x}$ is proportional to the signed distance of a point from the decision boundary; now we sum this over the misclassified points: $E(\mathbf{w}) = \sum_{n \in \mathcal{M}} \mathbf{w}^T\mathbf{x}_n$.
  • We move from counting mistakes to summing an error for each point, so that the total is the total error.
  • This is close to what we want, but not exactly: the distance $\mathbf{w}^T\mathbf{x}_n$ is positive or negative depending on whether the misclassified point is on the same side as $\mathbf{w}$ or on the other side. Two errors can therefore cancel each other out and give a total of 0.
  • We fix this in an easy way on the next slide.

The perceptron criterion

  • $t$ = the target value we want, $y$ = the actual output of the neuron.
  • Whenever $\mathbf{w}^T\mathbf{x} > 0$, the neuron outputs 1; if we actually wanted 0, then $y - t = 1$:
    • $\mathbf{w}^T\mathbf{x} > 0$
    • $y - t = 1$
    • $\mathbf{w}^T\mathbf{x}\,(y - t)$ = positive × 1
  • Say $\mathbf{w}^T\mathbf{x} < 0$, so the neuron outputs 0 when we actually wanted 1; then $y - t = -1$:
    • $\mathbf{w}^T\mathbf{x} < 0$
    • $y - t = -1$
    • $\mathbf{w}^T\mathbf{x}\,(y - t)$ = negative × -1
  • If we are not making a mistake, $y - t = 0$ and the term vanishes; if there is a mistake, the product is positive:
    • $\mathbf{w}^T\mathbf{x}\,(y - t)$ = positive × 1
    • $\mathbf{w}^T\mathbf{x}\,(y - t)$ = negative × -1
    • So $\mathbf{w}^T\mathbf{x}\,(y - t)$ is always positive when we make a mistake!! Cancelling no longer happens.
  • The new error function: $E(\mathbf{w}) = \sum_{n \in \mathcal{M}} \mathbf{w}^T\mathbf{x}_n\,(y_n - t_n)$, where $\mathcal{M}$ is the set of misclassified points.
    • No mistake = 0; mistake = strictly positive, so errors no longer cancel.
    • It is proportional to the distance of the misclassified points from the decision surface.
  • Because it is proportional to distance, it changes as the decision boundary changes, which is exactly the information the number of mistakes could not give us (a small code sketch follows).
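
A minimal NumPy sketch of this criterion (the 0/1 step convention, names, and example values are mine, following the notes):

```python
import numpy as np

def perceptron_criterion(w, X, t):
    """Sum of w.x_n * (y_n - t_n): zero when every point is
    classified correctly, strictly positive per mistake."""
    y = (X @ w > 0).astype(float)     # neuron output: 1 if w.x > 0, else 0
    return np.sum((X @ w) * (y - t))  # correct points contribute 0 (y = t)

X = np.array([[1.0, 2.0], [1.0, -1.0]])  # one input vector per row
t = np.array([0.0, 1.0])                 # targets
w = np.array([0.5, 1.0])
print(perceptron_criterion(w, X, t))     # 3.0: both points are misclassified
```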

Question

  • Given the perceptron error for a single point, $E(\mathbf{w}) = \mathbf{w}^T\mathbf{x}\,(y - t)$, what is the gradient with respect to $\mathbf{w}$?
  1. Take the error and expand $\mathbf{w}^T\mathbf{x}$ into its components:
    • $w_0 x_0 (y - t) + w_1 x_1 (y - t) + \dots + w_n x_n (y - t)$
    • $= w_0 x_0 y - w_0 x_0 t + w_1 x_1 y - w_1 x_1 t + \dots$
  2. Take the derivative with respect to a single component $w_i$; all the other terms in the sum disappear:
    • $\partial E / \partial w_i = x_i y - x_i t$
    • Hence $\partial E / \partial w_i = x_i (y - t)$.
  3. Therefore the gradient is the vector whose components are the coefficients multiplying each $w_i$:
    • $\nabla_{\mathbf{w}} E = \mathbf{x}\,(y - t)$, so the update $\mathbf{w} \leftarrow \mathbf{w} - \eta\,\mathbf{x}\,(y - t)$ is exactly the perceptron algorithm (checked numerically below).
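
A quick finite-difference check of the derivation (illustrative; $y - t$ is held fixed at the current prediction, since the step activation has zero derivative almost everywhere):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
w = rng.normal(size=3)
y, t = 1.0, 0.0                       # a misclassified point: y - t = 1

analytic = x * (y - t)                # the gradient derived above
eps = 1e-6
numeric = np.array([                  # (E(w + eps*e_i) - E(w)) / eps
    (((w + eps * np.eye(3)[i]) @ x) * (y - t) - (w @ x) * (y - t)) / eps
    for i in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```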

Gradient descent

  • Combining the new error function for the perceptron with gradient descent (a sketch follows this list):
  • $E(\mathbf{w}) = \sum_{n \in \mathcal{M}} \mathbf{w}^T\mathbf{x}_n\,(y_n - t_n)$ = the sum of the error components due to each misclassified point.
  • The gradient is likewise the sum of the contributions of each single point: $\nabla_{\mathbf{w}} E = \sum_{n \in \mathcal{M}} \mathbf{x}_n\,(y_n - t_n)$.
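
A minimal sketch of batch gradient descent on this error (the function name, learning rate, and toy data are illustrative):

```python
import numpy as np

def perceptron_gd(X, t, lr=0.1, max_epochs=100):
    """Each update sums the gradient contributions x_n (y_n - t_n)
    over all points; correctly classified points contribute 0."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        y = (X @ w > 0).astype(float)
        if np.all(y == t):            # no mistakes left: the error is 0
            break
        w -= lr * X.T @ (y - t)       # full-batch gradient step
    return w

# Tiny separable example: AND, with a bias input x0 = 1 in the first column
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)
w = perceptron_gd(X, t)
print((X @ w > 0).astype(float))      # [0. 0. 0. 1.]
```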

Stochastic gradient descent

  • In practice we don't compute the full sum because it takes too long; we take one point at a time.
  • Minimizing the total error is the same as minimizing the average error, since multiplying the objective function by a constant does not change where its minimum is.
  • Average error = EXPECTED error, with each point in the dataset having equal probability $1/N$.
    • You take one point from the dataset (chosen with probability $1/N$), do a gradient step using that point, take another point, and repeat until you find a minimum. It is NOISY: for some points the error increases, but on average it decreases.
  • Stochastic gradient descent: the gradient is computed with respect to a single point drawn from a distribution. If all points have the same probability, it converges, in expectation, to the same solution as gradient descent on the average error.
    • Faster! Each update does not require computing the sum over all misclassified points.
    • But it requires more iterations, since updates are per point.
    • It improves the objective in expectation, meaning on average.
      • True gradient descent improves it at every single step.
  • In practice we do something in between: one point is not accurate, the whole dataset is too slow, so we take a minibatch (see the sketch after this list)!
    • Use a certain number of misclassified points (the size of the minibatch).
    • This gives a better estimate of the gradient than a single point, but is still faster to compute than the full sum.
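
A sketch of the stochastic/minibatch variant under the same conventions (`batch_size=1` gives pure SGD; `batch_size=len(X)` recovers batch gradient descent):

```python
import numpy as np

def perceptron_sgd(X, t, lr=0.1, batch_size=1, max_steps=1000, seed=0):
    """Each step uses a small uniform sample of points: a noisy
    but cheap estimate of the full gradient."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(max_steps):
        if np.all((X @ w > 0).astype(float) == t):
            break                                  # nothing left to fix
        idx = rng.choice(len(X), size=batch_size)  # each point has prob 1/N
        Xb, tb = X[idx], t[idx]
        yb = (Xb @ w > 0).astype(float)
        w -= lr * Xb.T @ (yb - tb)                 # minibatch gradient estimate
    return w
```

A single step here can make the total error worse; only in expectation does it decrease, which is why more iterations are needed than with the full-batch version.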

Learning graphically

  • We have to sample uniformly from the whole dataset, with probability $1/N$ per point.
    • Otherwise, reducing the error on some points can make the error worse on other points.
  • Important: the samples should be independent and identically distributed (i.i.d.).
    • Identically distributed: you don't change the distribution during training; you can decide it at the beginning.
    • Independent: don't visit the points in the same fixed order (for example, one region of the data first). That does not minimize the global error: the error on the bottom points improves while the error on the top points stays put. If instead you take bits from everywhere, the solution is good everywhere (a small sketch follows).
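
A tiny sketch of the sampling-order point (illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

fixed_order = np.arange(n)     # same order every epoch: consecutive updates are
                               # correlated, one part of the data dominates
shuffled = rng.permutation(n)  # fresh shuffle each epoch approximates independent
                               # uniform draws with probability 1/N per point
print(fixed_order, shuffled)
```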
