Neural Networks: Perceptron II
6. Neural Networks: Perceptron II
Learning outcomes
- Define linear separability
- Justify whether a given error function is suitable for gradient descent
- To find the parameters of the McCulloch and Pitts perceptron model:
- To minimize an error function:
- we want to measure error and find a parameter that gives the lowest error
- for each point in the training set, we have
- y = the output of the classifier (0 or 1)
- t = desired output
- if the output is the desired class, the difference y - t will be 0. If there’s a mistake, it will be 1 or -1. So if we take the absolute value and sum over the training set, that’s the exact number of mistakes.
- it seems like a good thing to minimize, but it’s not.
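A minimal sketch of this error, assuming made-up outputs and targets for five training points (the function name `count_mistakes` and the data are only for illustration):

```python
def count_mistakes(outputs, targets):
    """Sum |y - t| over the training set; with 0/1 labels this counts mistakes."""
    return sum(abs(y - t) for y, t in zip(outputs, targets))

# Hypothetical classifier outputs (y) and desired outputs (t).
y = [1, 0, 1, 1, 0]
t = [1, 1, 1, 0, 0]

mistakes = count_mistakes(y, t)
print(mistakes)  # 2: the second and fourth points are wrong
```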
Bias input
The threshold θ decides whether the neuron fires or not. But the threshold should be adjustable, so that we can change the value at which the neuron fires, which is why we move θ to the other side of the inequality.
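- Previously, using McCulloch and Pitts, the neuron fires when x1w1 + x2w2 + … + xnwn ≥ θ. Moving θ to the other side gives x1w1 + x2w2 + … + xnwn + b ≥ 0 with b = -θ, aka w*x + b ≥ 0
- x = -1 is the new entry of the input vector, θ is the new entry of the weight vector, and the new threshold = 0
- Every other term is a weight times an input (w*x), so why can’t -1*θ be one as well?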
- Bias input - any constant, -1 in this case
- Bias weight - the weight that multiplies the constant bias input; θ in this case, but it can really be anything
- For the middle function, we take θ and make it part of h
- θ can change: if we want the bias term to contribute 2, θ has to be -2 because -1 * -2 = 2, which is the desired value. So it doesn’t matter what the bias input is, because the bias weight can compensate!
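A small sketch of the bias trick, assuming hypothetical weights and a hypothetical threshold, and assuming the neuron fires when the sum is ≥ the threshold. It compares the original threshold form with the bias form (threshold 0) for a bias input of -1 and of +1:

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def fires_with_threshold(x, w, theta):
    """Original form: output 1 when the weighted sum reaches theta."""
    return 1 if dot(x, w) >= theta else 0

def fires_with_bias(x, w, bias_input, bias_weight):
    """Bias form: append a constant input and its weight; the threshold is 0."""
    return 1 if dot(list(x) + [bias_input], list(w) + [bias_weight]) >= 0 else 0

w = [2.0, -1.0]   # hypothetical weights
theta = 0.5       # hypothetical threshold
points = [(x1, x2) for x1 in (0, 1, 2) for x2 in (0, 1, 2)]

original = [fires_with_threshold(p, w, theta) for p in points]
bias_minus_one = [fires_with_bias(p, w, -1.0, theta) for p in points]  # -1 * theta
bias_plus_one = [fires_with_bias(p, w, 1.0, -theta) for p in points]   # +1 * -theta

print(original == bias_minus_one == bias_plus_one)  # True
```

Both bias forms give the same outputs as the threshold form, since only the product of bias input and bias weight matters.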
Vectors: recap (look at the worked-out solution!)
- How to find a direction vector (parallel to the line)
- We take any two points on that line and subtract them; we get a vector that connects both.
- So the red ones are the two vectors, and the blue one is parallel to the line
- Or it is <b, -a> for the line ax + by + c = 0
- Unit vector (direction vector): if we divide a vector by its norm, its length = 1
- This is a vector that has the same direction, but rescaled so that its norm = 1
- u and v are parallel
- How to find a normal vector (orthogonal to the line)
- It is given by the coefficients a and b of the line ax + by + c = 0, so <a, b>!
- Two vectors are orthogonal iff their dot product = 0.
- u*v = 0 means orthogonal
- u’v = ||u|| ||v|| cos(angle) = 0 because cos 90° = 0
Question: Compute a vector parallel to the straight line, and one orthogonal to the straight line 1/2x-y+1=0.
- a = 1/2, b = -1, c = 1
- Parallel = <2, 2> - <0, 1> = <2, 1>: take two points on the line and subtract
- Orthogonal = <1/2, -1>: take the coefficients <a, b> as they are
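The answer to this question can be checked numerically. A short sketch, using the same two points on the line as above (variable names are only for illustration):

```python
import math

a, b, c = 0.5, -1.0, 1.0          # coefficients of 1/2 x - y + 1 = 0

p1, p2 = (2.0, 2.0), (0.0, 1.0)   # two points on the line
on_line = [a * x + b * y + c == 0 for x, y in (p1, p2)]

direction = (p1[0] - p2[0], p1[1] - p2[1])   # parallel to the line
normal = (a, b)                              # orthogonal to the line

# Orthogonal iff the dot product is 0.
dot = direction[0] * normal[0] + direction[1] * normal[1]

# <b, -a> is parallel to the direction vector iff their 2D cross product is 0.
alt = (b, -a)
cross = direction[0] * alt[1] - direction[1] * alt[0]

# Unit vector: divide by the norm.
norm = math.hypot(direction[0], direction[1])
unit = (direction[0] / norm, direction[1] / norm)

print(direction, dot, cross)  # (2.0, 1.0) 0.0 0.0
```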
The decision boundary of the perceptron is the function:
- f(x) = w*x = 0. When the dot product equals zero, it’s the decision boundary
- When the dot product is > 0, it outputs 1; when < 0, it outputs 0. So where the dot product is exactly zero, that’s the decision boundary
- Because the perceptron is nothing more than a dot product, the points where it = 0 are where it makes the decision
- Allows you to do simple classification based on a linear decision boundary
<==> ax + by + c = 0 (c is the weight of the fake input)
- If we plot this, it is a straight line
- Orthogonal
- The weights of the variables are the components of the vector that is orthogonal to the line
- It is the gradient of f, the function that defines the decision boundary
- The perceptron is a linear classifier because the decision boundary is a straight line
- The goal is to find the optimal parameters, the weights for x and y and the bias, that separate the data points best.
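As an illustration of classifying with a linear decision boundary, here is a sketch of a perceptron with hand-picked weights that implements logical AND. The weights are an assumption chosen for the example, not values from the lecture, and outputting 0 exactly on the boundary is an arbitrary choice:

```python
def perceptron_output(weights, x, y):
    """Output 1 when a*x + b*y + c*1 > 0, else 0 (c multiplies the fake input 1)."""
    a, b, c = weights
    return 1 if a * x + b * y + c * 1 > 0 else 0

# Hypothetical weights: decision boundary is the line x + y - 1.5 = 0.
and_weights = (1.0, 1.0, -1.5)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron_output(and_weights, x, y) for x, y in inputs]
print(outputs)  # [0, 0, 0, 1]
```

Only (1, 1) lies on the positive side of the line, so only that input produces 1.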
Linear separability
- Because the perceptron is a linear classifier and the decision boundary is a hyperplane, we can only correctly classify problems in which a hyperplane can separate the classes.
- Given some data, the perceptron tries to find a straight line that divides the classes. When such a straight line exists, the data is called linearly separable. This straight line is the decision boundary
- Line in 2D, plane in 3D, and hyperplane in higher dimensions
- XOR is not linearly separable. The perceptron keeps on cycling through wrong solutions.
- The inputs are 0 or 1, and the output is 1 when the inputs are different.
- There is no straight line that has the circles on one side and the crosses on the other side, so it is not linearly separable
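A rough sketch that illustrates (but does not prove) this: a brute-force search over a small, arbitrarily chosen grid of weights and biases. For AND some line classifies all four points, while for XOR the best any line on the grid manages is three out of four:

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def predict(w1, w2, b, x1, x2):
    """Perceptron output: 1 when w1*x1 + w2*x2 + b > 0, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def best_correct(targets, grid):
    """Largest number of correctly classified points over all grid settings."""
    best = 0
    for w1, w2, b in product(grid, repeat=3):
        correct = sum(predict(w1, w2, b, x1, x2) == t
                      for (x1, x2), t in zip(INPUTS, targets))
        best = max(best, correct)
    return best

grid = [i / 2 for i in range(-6, 7)]        # -3.0 to 3.0 in steps of 0.5

and_best = best_correct([0, 0, 0, 1], grid)  # AND targets
xor_best = best_correct([0, 1, 1, 0], grid)  # XOR targets
print(and_best, xor_best)  # 4 3
```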
Number of mistakes as Error
- Optimization is about minimization. We minimize the error/loss. We defined a while ago an intuitive error function, the sum of |y - t| over the training points, which gives the exact number of mistakes. Intuitively we want to minimize this. BUT if we make local changes, it may not change the number of mistakes! Even if I move the decision boundary a little closer to a misclassified point, it is still an error.
- It is piecewise constant: the gradient is zero wherever it exists, hence it doesn’t tell us ‘how much’ we are off
- A better way is to look at how we move the decision boundary
- We need a different error function that tells us whether we are doing better or not, giving local information
- we make a small change, and we want to know whether this change decreases the error or not.
- we want an error function whose value decreases as the decision boundary of the current solution gets closer to a misclassified point.
Towards a better error function
- We want the information of ‘how far’ a point is on the wrong side.
If we know the distance of the misclassified point, we can tell whether, by moving the decision boundary a little, we are decreasing or increasing the distance of the misclassified point from the boundary.
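A toy sketch of the difference, assuming a 1D classifier that outputs 1 when x - theta > 0 and four made-up training points. Nudging theta leaves the number of mistakes unchanged, while the distance of the misclassified point to the boundary shrinks:

```python
xs = [0.0, 1.0, 2.0, 3.0]   # hypothetical 1D training points
ts = [0, 0, 1, 1]           # desired outputs

def classify(x, theta):
    """Output 1 on the positive side of the boundary x = theta."""
    return 1 if x - theta > 0 else 0

def num_mistakes(theta):
    """Sum of |y - t|: piecewise constant in theta."""
    return sum(abs(classify(x, theta) - t) for x, t in zip(xs, ts))

def misclassified_distance(theta):
    """Total distance to the boundary over the misclassified points."""
    return sum(abs(x - theta) for x, t in zip(xs, ts) if classify(x, theta) != t)

thetas = [0.2, 0.3, 0.4]    # small moves of the boundary
mistakes = [num_mistakes(th) for th in thetas]
distances = [misclassified_distance(th) for th in thetas]

print(mistakes)   # [1, 1, 1]: no information about whether we are improving
print(distances)  # decreasing: the boundary is approaching the point x = 1.0
```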
- x can be represented as a sum of two vectors, like a = b + c: x = xp + d * (w / ||w||), where xp is the projection of x onto the decision boundary and d is the distance from x to the boundary
- w = gradient of the function that defines the decision boundary; it is orthogonal to the boundary and points in the direction in which the function grows
- To isolate d, the distance to the hyperplane, we have to make sure that the direction is a unit vector with length 1! To get a unit vector, we divide w by its norm: w / ||w|| has length 1 and the same direction as w
- Substituting into h: hw(x) = wT x + b = (wT xp + b) + d * (wT w) / ||w||
- The crossed-out term, wT xp + b, is the equation of the line evaluated at xp; because xp is on the line, it is 0
- wT w is the dot product of w with itself = ||w||^2
- So it clears out to **d ||w||**
- **hw(x) = d ||w||** FINAL EQ! If we evaluate any point with the equation h() of the decision boundary, we get a number that’s proportional to the distance of the point to the decision boundary
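A quick numeric check of this result, assuming a hypothetical line with w = <3, 4> and b = -5 (so ||w|| = 5) and an arbitrary point x. The projection xp is recovered by stepping back from x along the unit normal:

```python
import math

w = (3.0, 4.0)   # hypothetical weights (gradient of h)
b = -5.0         # hypothetical bias

def h(x):
    """h_w(x) = w . x + b"""
    return w[0] * x[0] + w[1] * x[1] + b

norm_w = math.hypot(w[0], w[1])   # ||w|| = 5.0

x = (3.0, 4.0)
d = h(x) / norm_w                 # signed distance, from h_w(x) = d ||w||

# Step back from x by d along the unit normal to reach the projection xp.
xp = (x[0] - d * w[0] / norm_w, x[1] - d * w[1] / norm_w)
geometric_distance = math.hypot(x[0] - xp[0], x[1] - xp[1])

print(h(x), d)  # 20.0 4.0
```

Here h(xp) comes out as (approximately) 0, so xp is on the boundary, and the geometric distance from x to xp matches d.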
- Points satisfying the inequality f(x1, x2, …, xn) ≥ 0 lie on one side of the surface. Which side?
- Half of this space lies on one side of the surface: on one half the function is > 0, and on the other half, < 0.
- There’s a hyperplane that cuts the space in half.
- But which one of the two halves is the part where the function is > 0, and which is < 0?
- Important because depending on the part, we are going to classify it as 0 or 1! We don’t want the classifier to give the opposite classes, even if the boundary is just as good.
- We do this by evaluating some points.
- split between negative and positive
- If we plot this outcome,
- Points on the bottom half evaluate to positive, so the bottom is positive.
- Points on the surface, e.g. f(0,1) = 0, are on the decision boundary
- Points on the top half evaluate to negative, so the top side is negative
- How do we know this without having to plot it every time?
- IMP: the positive side is always the side the gradient points to! The gradient is the direction of steepest ascent, where the function is growing the fastest. So by looking at the gradient and seeing where it points, we can tell which side is positive or negative.
- Just by looking at the y component, you can tell that it’s pointing down.
- !!! The further away we get from the line, the larger the output becomes. f(4,0) is further away than f(0,0). The function grows as you move in the direction of the gradient.
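A short sketch of this check, assuming the function is f(x, y) = 1/2 x - y + 1 (borrowed from the earlier question about the line 1/2x - y + 1 = 0, which is consistent with f(0,1) = 0 above):

```python
def f(x, y):
    """Assumed example function: f(x, y) = 1/2 x - y + 1."""
    return 0.5 * x - y + 1

gradient = (0.5, -1.0)   # coefficients of x and y; the y component points down

below = f(0, 0)      # a point below the line
on_line = f(0, 1)    # a point on the line
above = f(0, 2)      # a point above the line
further = f(4, 0)    # further from the line, on the positive side

# Step from a boundary point along the gradient, and against it.
start = (0.0, 1.0)
with_gradient = f(start[0] + gradient[0], start[1] + gradient[1])
against_gradient = f(start[0] - gradient[0], start[1] - gradient[1])

print(below, on_line, above, further)   # 1.0 0.0 -1.0 3.0
print(with_gradient, against_gradient)  # 1.25 -1.25
```

Moving along the gradient lands on the positive side, and moving against it lands on the negative side.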
Question: Is the inequality 1/2 x−y+z+1 ≥ 0 satisfied by the points above or below the corresponding plane?
- The gradient is <1/2, -1, 1>. We have an x component of 1/2 going to the right, a y component of -1 going in the negative direction, and a z component of +1 going up. Because the gradient points up along z, the function is positive above the plane
- If we want to swap the positive and negative sides without changing the plane itself, we multiply everything (the function) by -1, which reverses the gradient
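The same kind of check works for this question. A sketch that evaluates one point above and one point below the plane, and then the function multiplied by -1 (the test points are chosen arbitrarily):

```python
def g(x, y, z):
    """g(x, y, z) = 1/2 x - y + z + 1"""
    return 0.5 * x - y + z + 1

def g_flipped(x, y, z):
    """Same plane, but with the positive and negative sides swapped."""
    return -g(x, y, z)

# Above (x, y) = (0, 0) the plane passes through z = -1.
on_plane = g(0, 0, -1)
above = g(0, 0, 0)     # higher z than the plane
below = g(0, 0, -2)    # lower z than the plane

print(on_plane, above, below)                      # 0.0 1.0 -1.0
print(g_flipped(0, 0, 0), g_flipped(0, 0, -2))     # -1.0 1.0
```

The inequality ≥ 0 holds above the plane for g and below the plane for the flipped function, while the plane itself (where the value is 0) stays the same.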
Designing a perceptron