Multi-Layer Neural Networks
10. Multi-Layer Neural Networks
An MLP is a universal function approximator: whatever function is embedded in the data, an MLP with enough hidden neurons can approximate it.
Learning outcomes
- Define an appropriate error to minimise for feed-forward neural networks
- Derive the update rule for the weights of the network through backpropagation
- Apply NNs to real-world data sets
A different activation function
- Replacing the step function with a smooth, rounded activation function (the sigmoid σ) makes the neuron differentiable. We set β = 1, so it can be ignored.
- The derivative of the sigmoid is σ′ = σ(1 − σ).
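As a quick sanity check of the derivative above, here is a minimal sketch that compares σ(1 − σ) with a finite-difference estimate. The function names and the `beta` parameter are illustrative choices, not something fixed by the notes.

```python
import math

def sigmoid(a, beta=1.0):
    """Logistic sigmoid; beta controls the steepness (the notes fix beta = 1)."""
    return 1.0 / (1.0 + math.exp(-beta * a))

def sigmoid_derivative(a):
    """Derivative for beta = 1, written in terms of the sigmoid itself."""
    s = sigmoid(a)
    return s * (1.0 - s)

def numerical_derivative(f, a, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(a + h) - f(a - h)) / (2.0 * h)

# The two columns should agree to several decimal places.
for a in (-2.0, 0.0, 1.5):
    print(a, sigmoid_derivative(a), numerical_derivative(sigmoid, a))
```

The derivative is largest at a = 0, where it equals 0.25, and shrinks towards zero for large positive or negative inputs.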
Gradient descent (again)
- Now that the error function is differentiable, we can compute its gradient and do gradient descent.
- The derivation of this gradient is called backpropagation of the error, the backbone of deep learning.
Example
- We consider a network with 2 neurons, a single input and biases.
- Input: x1, with weight v1
- Bias: x0, with weight v0
- Thanks to the sigmoid, the network now outputs a value between 0 and 1. If it is closer to 1 we classify the input as class 1, otherwise as class 0.
- To reduce the error we need to change the weights.
- Which weight do we change, and by how much? The gradient is the key: we need to compute the gradient of the error with respect to the weights at the current point.
Example II
- The output of the whole network is the sigmoid of the last neuron applied to its input.
- We want the derivative of the error with respect to each weight, since this is the gradient component for that weight. We obtain it by applying the chain rule.
- For w0, this means finding the derivative of the error with respect to a and the derivative of a with respect to w0, and multiplying them.
- We chose the mean squared error: take the output of the network minus the target (y − t), square it, and multiply by ½ so that the constant cancels when differentiating.
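The chain rule for w0 can be written out as follows. This is a sketch that assumes the input of the output neuron is a = w0 z0 + w1 z1, with z0 the bias input and z1 the output of the hidden neuron, as in the next example.

```latex
\begin{aligned}
E &= \tfrac{1}{2}(y - t)^2, \qquad y = \sigma(a), \qquad a = w_0 z_0 + w_1 z_1 \\
\frac{\partial E}{\partial a} &= \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial a} = (y - t)\, y\,(1 - y) \\
\frac{\partial E}{\partial w_0} &= \frac{\partial E}{\partial a}\,\frac{\partial a}{\partial w_0} = (y - t)\, y\,(1 - y)\, z_0
\end{aligned}
```

The factor 2 from differentiating the square cancels against the ½, which is why the ½ is there.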
Example III
- The gradient is the vector of partial derivatives.
- In this case, the gradient consists of the derivatives of the error with respect to w0, w1, v0 and v1.
- We can evaluate it at the current point to figure out how to change the weights.
- To get numeric values we substitute y = output of the network, z0 = bias input (so 1), and z1 = output of the hidden neuron.
- From the previous example: y = 0.879, t = 0, z0 = 1, z1 = 0.982.
- Substituting these values gives the numeric gradient.
- The vector suggests that the output layer's weights change more than the hidden layer's weights, because the output layer's weights have a greater effect on the error.
- We follow the ANTI-gradient: the gradient points in the direction in which the function grows, but we want to minimise the error, so the gradient must be multiplied by -1.
- Drawback of backpropagation: the vanishing gradient. After a certain number of layers, the gradient with respect to the weights near the inputs is close to zero, so those weights barely affect the error and a gradient step barely updates them.
- The effect of the inner layer gets smaller and smaller (here -0.002) and eventually vanishes.
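The substitution can be sketched numerically. The output-layer components use only the values quoted in the notes (y, t, z0, z1). The hidden-layer components also depend on w1 and on the input x1, which are not listed here, so the values used for them below are hypothetical and serve only to show how much smaller those components are.

```python
# Values quoted in the notes for the current point.
y, t = 0.879, 0.0      # network output and target
z0, z1 = 1.0, 0.982    # bias input and hidden-neuron output

# dE/da for the output neuron: (y - t) times the sigmoid derivative y(1 - y).
delta_out = (y - t) * y * (1.0 - y)
dE_dw0 = delta_out * z0
dE_dw1 = delta_out * z1

# HYPOTHETICAL values: the notes do not give w1, x0 or x1 at this point.
w1, x0, x1 = 1.0, 1.0, 1.0
# The hidden neuron adds one more sigmoid-derivative factor, z1(1 - z1).
delta_hidden = delta_out * w1 * z1 * (1.0 - z1)
dE_dv0 = delta_hidden * x0
dE_dv1 = delta_hidden * x1

gradient = [dE_dw0, dE_dw1, dE_dv0, dE_dv1]
step = [-g for g in gradient]   # anti-gradient: the direction that reduces E
print([round(g, 4) for g in gradient])
```

Each extra layer multiplies in another factor of the form z(1 − z), which is at most 0.25, and this is what drives the gradient towards zero for weights near the inputs.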
Summary
- We really just have to look at the topology of the network to understand how each weight affects the error.
- The chain rule is the key to backpropagation.
- The path from a weight to the output determines the chain.
Backpropagation of errors, notation
- We do the same thing, but in a more general form.
- We need 2 update rules: one for output neurons, one for hidden neurons.
- The outputs of neurons are the z's; the inputs to the activation functions are the a's.
- The input of a layer is the output of the previous layer.
- We consider an MLP with one hidden layer and one output layer. The network may have many hidden layers, but we only look at one because they are all treated equally and have the same update rule.
- Read the network from right to left.
- Output of the rightmost neuron = the activation function applied to its input: zk = σ(ak).
- Input of the activation function = sum of all the stimuli multiplied by their corresponding weights: ak = Σj wjk zj.
- The stimuli of the rightmost neuron are the outputs of the hidden neurons. This repeats for however many hidden layers there are.
- The input of the leftmost neurons is the input of the whole network.
Forward pass
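- Backpropagation of errors is a 2-pass algorithm.
- Forward pass: we compute all the outputs (the z's) from the weighted terms such as w0z0, going from the inputs towards the output.
- Backward pass: we then work backwards, starting with the derivative with respect to a, then w1, and so on.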
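The forward pass for one hidden layer and one output layer could look like the sketch below. The list-of-lists weight layout and the names `V`, `W` and `forward` are assumptions made for illustration. Note that `W[k][j]` stores the weight the notes call wjk.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, V, W):
    """Forward pass through one hidden layer and one output layer.

    x: list of inputs (without the bias).
    V[j][i]: weight from input i to hidden neuron j; index 0 is the bias weight.
    W[k][j]: weight from hidden neuron j to output neuron k; index 0 is the bias weight.
    Returns the outputs (z's) of every layer, which the backward pass reuses.
    """
    z_in = [1.0] + list(x)                                   # prepend the bias input
    a_hidden = [sum(v * z for v, z in zip(row, z_in)) for row in V]
    z_hidden = [1.0] + [sigmoid(a) for a in a_hidden]        # bias for the next layer
    a_out = [sum(w * z for w, z in zip(row, z_hidden)) for row in W]
    z_out = [sigmoid(a) for a in a_out]
    return z_in, z_hidden, z_out

# Hypothetical weights: one input, one hidden neuron, one output neuron.
z_in, z_hidden, z_out = forward([1.0], V=[[0.1, 0.2]], W=[[0.3, 0.4]])
print(z_hidden, z_out)
```

Keeping every layer's outputs is deliberate: the backward pass needs exactly these z values.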
Backward pass, output neuron
- The error is 'backpropagated' layer by layer.
- We investigate in this order: output neuron (ak, wjk), then hidden neuron (aj, wij).
- We first see what happens to the error as the input of the activation function changes.
1) We compute the derivative of the error with respect to the input of the sigmoid, that is ak. The derivative of the sigmoid is σ(1 − σ), hence the factor zk(1 − zk).
2) Then we look inside the box and differentiate with respect to the actual weight that we want, wjk.
- This gives the gradient with respect to the weights of the output layer, and from it an update rule.
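The two steps for an output neuron can be sketched as follows, assuming the squared error and sigmoid outputs used earlier in these notes. The function names and the learning rate `eta` are illustrative assumptions.

```python
def output_deltas(z_out, targets):
    """Step 1: delta_k = dE/da_k = (z_k - t_k) * z_k * (1 - z_k)."""
    return [(z - t) * z * (1.0 - z) for z, t in zip(z_out, targets)]

def output_gradient(deltas, z_hidden):
    """Step 2: dE/dw_jk = delta_k * z_j; row k holds the weights into output k."""
    return [[d * z for z in z_hidden] for d in deltas]

def update(weights, grads, eta=0.1):
    """Gradient-descent update rule: w <- w - eta * dE/dw."""
    return [[w - eta * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grads)]

# Values from the earlier worked example: y = 0.879, t = 0, z0 = 1, z1 = 0.982.
deltas = output_deltas([0.879], [0.0])
grads = output_gradient(deltas, [1.0, 0.982])
new_W = update([[0.5, 0.5]], grads)   # the starting weights 0.5 are hypothetical
print(deltas, grads, new_W)
```

Because the output is above its target, both gradient components are positive and the update lowers both weights.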
Backward pass, hidden neuron & Computing delta
- What we have not considered yet is that a hidden neuron can be connected to more than one neuron in the next layer.
- So the derivative of the error is the sum of all these components, one per neuron connected to the neuron we care about.
- The weights in the hidden layer affect the error through all the output neurons they are connected to.
1) We start by finding the derivative of the error with respect to the input of the sigmoid, aj.
- There are now multiple downstream neurons, hence a sum, multiplied by the sigmoid derivative.
2) We go inside the box as before. What differs is the delta, which takes all the outgoing weights into account.
- This completes the gradient with respect to all the weights, so we get a second update rule.
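A minimal sketch of the full backward pass, including the hidden-neuron delta, is given below together with a finite-difference check. The weight layout, function names and all numeric values are assumptions chosen for illustration. `W[k][j + 1]` is the weight from hidden neuron j to output neuron k, because slot 0 holds the bias.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, V, W):
    z_in = [1.0] + list(x)
    z_hid = [1.0] + [sigmoid(sum(v * z for v, z in zip(row, z_in))) for row in V]
    z_out = [sigmoid(sum(w * z for w, z in zip(row, z_hid))) for row in W]
    return z_in, z_hid, z_out

def error(x, t, V, W):
    z_out = forward(x, V, W)[2]
    return 0.5 * sum((z - tk) ** 2 for z, tk in zip(z_out, t))

def backward(x, t, V, W):
    z_in, z_hid, z_out = forward(x, V, W)
    # Output neurons: nothing follows them, so delta uses the error directly.
    delta_out = [(z - tk) * z * (1 - z) for z, tk in zip(z_out, t)]
    # Hidden neurons: sum the deltas of every output neuron they feed into.
    delta_hid = []
    for j in range(len(V)):
        back = sum(delta_out[k] * W[k][j + 1] for k in range(len(W)))
        zj = z_hid[j + 1]
        delta_hid.append(zj * (1 - zj) * back)
    grad_W = [[d * z for z in z_hid] for d in delta_out]
    grad_V = [[d * z for z in z_in] for d in delta_hid]
    return grad_V, grad_W

def numeric_grad_V(x, t, V, W, j, i, h=1e-5):
    """Central finite difference of the error with respect to V[j][i]."""
    Vp = [row[:] for row in V]
    Vm = [row[:] for row in V]
    Vp[j][i] += h
    Vm[j][i] -= h
    return (error(x, t, Vp, W) - error(x, t, Vm, W)) / (2 * h)

# Hypothetical network: 2 inputs, 2 hidden neurons, 2 output neurons.
V = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
W = [[0.2, -0.3, 0.1], [-0.1, 0.4, 0.25]]
x, t = [1.0, 0.5], [0.0, 1.0]
grad_V, grad_W = backward(x, t, V, W)
numeric = numeric_grad_V(x, t, V, W, j=0, i=1)
print(grad_V[0][1], numeric)   # the two values should agree closely
```

Comparing against finite differences is a common way to catch indexing mistakes in a hand-written backward pass.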
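Difference between delta functions
- The delta of a hidden neuron takes into account everything that follows it, whereas for an output neuron nothing follows.
- Output neuron: nothing follows, so delta uses the error (output − target) directly.
- Hidden neuron: delta sums the dependencies over the neurons it feeds into.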
Gradient descent (again)
- If we compare the update rule of the single perceptron with what we derived for the MLP, they are not that different.
- Perceptron: the component of the gradient is (output − target).
- MLP: the component of the gradient is delta.
- We have two deltas: one for output neurons and one for hidden neurons.
- The only difference is that the delta of a hidden neuron depends on its connections.
Local Minima
- Gradient descent is a local optimisation method, meaning it converges to a local minimum that depends on the initial point.
- To find different solutions we need to start multiple times from different initial weights: we restart by randomly initialising the weights and run backpropagation until it converges somewhere.
- Different initialisations give different error values, and we keep the best model out of those.
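Random restarts can be sketched as below: train the same small network several times from different random weights and keep the run with the lowest error. The XOR data set, the network size, the learning rate and the fixed number of epochs (used instead of a convergence test) are all assumptions made to keep the example short.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def predict(x, V, w):
    """One hidden layer, one sigmoid output; slot 0 of each weight list is the bias."""
    z_in = [1.0] + list(x)
    z_hid = [1.0] + [sigmoid(sum(v * z for v, z in zip(row, z_in))) for row in V]
    y = sigmoid(sum(wj * zj for wj, zj in zip(w, z_hid)))
    return z_in, z_hid, y

def total_error(data, V, w):
    return sum(0.5 * (predict(x, V, w)[2] - t) ** 2 for x, t in data)

def train(data, n_hidden, epochs, eta, rng):
    n_in = len(data[0][0])
    # Random initial point: this is what changes between restarts.
    V = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, t in data:
            z_in, z_hid, y = predict(x, V, w)
            delta_out = (y - t) * y * (1 - y)
            # Hidden deltas use the output weights from before this update.
            delta_hid = [z_hid[j + 1] * (1 - z_hid[j + 1]) * delta_out * w[j + 1]
                         for j in range(n_hidden)]
            w = [wj - eta * delta_out * zj for wj, zj in zip(w, z_hid)]
            V = [[v - eta * d * z for v, z in zip(row, z_in)]
                 for row, d in zip(V, delta_hid)]
    return total_error(data, V, w), V, w

def train_with_restarts(data, restarts=5, seed=0):
    rng = random.Random(seed)
    runs = [train(data, n_hidden=2, epochs=500, eta=0.5, rng=rng)
            for _ in range(restarts)]
    best = min(runs, key=lambda run: run[0])    # keep the lowest-error model
    return best, [run[0] for run in runs]

# Hypothetical data set: XOR of two binary inputs.
XOR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
best, errors = train_with_restarts(XOR)
print("errors per restart:", errors)
print("best error:", best[0])
```

The per-restart errors typically differ, which shows how much the result depends on the starting weights.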
Using MLPs
- Can be used for:
1) Regression: to approximate any function with a single output, where the output neuron has the identity activation function.
- The last layer then has a linear activation function: the weighted sum of its inputs, without a sigmoid, because we don't want the output to be compressed into the range between 0 and 1.
2) Classification: supervised learning where the MLP has one output neuron per class, and the input is assigned to the class with the highest output.
3) Compression: the inputs are presented as the targets at the output, and the MLP learns to reproduce its inputs. This is an unsupervised learning technique (there are no labels), and the hidden layers learn a “compressed” representation of the input.
- We have as many output neurons as input neurons, and the input is used as the target during training. We want the network to regenerate the input as it was.
- Normally we force the hidden layers to be smaller than the input. If the network can reconstruct the inputs through these lower-dimensional layers, the hidden activations are a compressed representation: the input starts in a certain number of dimensions and is re-encoded in fewer.
- It is unsupervised because no label is needed: the inputs themselves are given as the desired outputs.
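The three uses differ mainly in how the output layer is read and what the target is. The helper names below are hypothetical and only illustrate that difference.

```python
def regression_output(a_out):
    """Identity activation: the output is the weighted sum itself, unsquashed."""
    return a_out

def classify(z_out):
    """Return the index of the output neuron with the highest value."""
    return max(range(len(z_out)), key=lambda k: z_out[k])

def reconstruction_error(x, x_hat):
    """Compression: the target is the input itself, so no labels are needed."""
    return 0.5 * sum((xh - xi) ** 2 for xi, xh in zip(x, x_hat))

print(regression_output(3.2))                          # value outside [0, 1] is fine
print(classify([0.1, 0.7, 0.2]))                       # index of the winning class
print(reconstruction_error([1.0, 0.0], [0.9, 0.2]))    # small if the input is reproduced well
```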
Training “recipe”
- If the features are not being learned, start by choosing them, i.e. the input variables.
- The data is normally normalised, either to have zero mean and unit variance or to lie between 0 and 1.
- This is because the neural network works on the actual numbers, so their scale matters, and a feature with much larger values dominates. E.g. classifying people based on their height (in m) and their distance from Leeds (in km): the distances are much larger numbers, so they dominate and height barely matters. After normalisation both features are on a comparable scale.
- Then create training, validation and test sets.
- Decide on an architecture: how many layers and how large each one is.
- Then train and test.
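The two normalisation options from the recipe can be sketched with the standard library. The feature values below are made up to mirror the height and distance example.

```python
import statistics

def standardise(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical features on very different scales.
heights_m = [1.60, 1.75, 1.90]
distances_km = [5.0, 120.0, 300.0]

print(standardise(heights_m), standardise(distances_km))
print(min_max(heights_m), min_max(distances_km))
```

After either transformation the two features span comparable ranges, so neither dominates purely because of its units. Both helpers assume the feature is not constant, since a constant feature would cause a division by zero.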