Linear Models for Regression
Updated:
11. Linear Models for Regression
Learning outcomes
- Approximate a function through linear regression with different basis functions
- Compute the pseudo-inverse of a matrix
- Derive the bias-variance decomposition
Back to regression
- What is regression? – problem of approximating a function from a set of sample points. We want to estimate something that is not binary. E.g. the number of mm of rain in past years.
- We have pairs of inputs and outputs – inputs are the features that we gonna measure, and the output is the value of the function that we have assembled
Linear Regression
- There’s an unknown function that give a target value for input
- The model used for the function iz linear in the parameters. But no need for the input to be linear bc non-linear functions can be obtained with the use of basis functions.
- We wanna approximate the function with a model – we do it with a model. We commit to a linear model and learn the parameters
- What if I approximate the points I have with a linear function?
- Doesn’t have to be linear in the input. But has to be linear in the parameters - in the w’s. we can make this function non-linear on its inputs by using basis functions.
Basis functions – example of basis functions
- Polynomial basis – where each input is given to an exponential of some coordinate. It’s not linear anymore its polynomial! But its linear in the parameters
- Gaussian basis – simple but v imp
- We take that input and give it to a gaussian, with a certain mean and variant.
- A bump somewhere in the domain, and we compose these bumps by taking a little bit or a lot, depending on the gradient.
- Spreading gaussian all over the input space which will have high and low weights, and by composing them we get a mountain we get a learned function
- Sigmoid basis – we can do the same for sigmoid - later
System of equations
- We have a set of samples, each input vector may have a number of
features, and we have a single target for each point.
- We create a fake basis function of just 1, Φ= 1, that way linear reg
function will just be a vector of weight * vector of basis
function.
- We can compute the value of each basis functions on each input point
we call that the I,j. value of basis function I, and we can have as
many as basis functions.
- For each point xj ,
- The more parameters you have, the higher presentation of your approximator, which means you need more data otherwise ur gonna overfit a lot. This is why you choose a structure of a function you are learning, which will give corresponding number of parameters.
- Q: What is it we are asking?
- A: we are now representing as vector of weights * vector of basis function. We would like the approximator xij to match the target! wTΦ = t
- The target = approximation happens if we have n points in the dataset, it corresponds to n equations, aka if we have exactly as many parameters as unknowns. but usually not possible.
An example – data & Choose model
- 6 points. And we use sigmoid as a basis function.
- The first step is choosing the basis functions - How many and what
kind of sigmoid? – all up to us!
- In this case, 3 sigmoids +bias = function.
- Q: How well can it match my input dataset? Can I combine 3 sigmoid through the weights to approximate as close as possible?
Equations
- It turns into a number of linear system of equations. For each point
we ask the approximation to be close to the target. correct on that
point. Which is impossible!
- <-2.18, 18.28> - 2.18 is the input, and 18.28 is the value we wanna get from the function.
- For each point we get an eq like this: the model we are tyrna learn.
- We just give input of the dataset, and the output is now a number!
- Now we have linear system of eqs. The basis function just turns into numbers. The unknowns in the system are the w’s -weights.
In matrix form
- If it’s linear systems, we can represent it as vector of unknowns * matrix of coefficients
- If Φ was a square matrix, meaning we’d have exactly the same number of points as weights (unknowns), it will be system with n variables and n eqs can compute exact solution! Meaning itd pass exactly through each point.
- Could be done by inverting the matrix Φ.
Exact solution
- If I choose any 4 points, we can get the regression function to go through them exactly.
- This is the function we got by combining 3 fx – how much of each basis function do we want?)
- We won’t have as many functions as the dataset – its not our goal! It is to generalize well to an unseen dataset
Overdetermined vector
- In general, the vector of weights will be overdetermined - we will have more eqs than unknowns. we can’t satisfy the eq exactly.
- We need to find a way to define an error and find representation that minimizes it.
- - t is the target and we want the computed
output of Φ to be as close to the t
- We said we wanted the approx. function (weight* input) to be equal to the target. We want to get as close as possible. So why don’t we measure the distance btwn the RHS and LHS of the eq. 0 error = equality but isn’t possible.
- So we want to minimize the difference btwn RHS and LHS as much as possible.
- Q: What’s a reasonable error for regression?
- A: Mean squared error!
- The subtraction bit (Φw -t) is what we want to minimize.
- So we minimise the total error: the square of the norm of the difference between the left and right side of the equation for each point, summed over all the points.
- Square it coz positive and neg error cancels out, and we stick ½ bc it will go away with the derivative
Sum-of-squares error
-
Derivative of in terms of w is
- Now we have Φ is a matrix, instead of the sum – the one that we solved using basis func
-
We want to minimize the difference where t and w is a vector =equivalent to minimizing the square of the norm of the vector of coefficients of our system * weights – (Φw -t)2
Least squares solution
- Now what do we do with this? We don’t do gradient descent. But we do vector! We don’t have to find a local minimum method, instead we can find min in one step
- We know that necessary condition for point to be a min = gradient of the point is zero.
- Our error function is quadratic in the input. It’s a convex. And optimising convex is easier bc there’s a single minimum.
- If we solve Φ(Φw-t)=0 that’s the only minimum!! Where the gradient is zero.
- transpose by multiplied by itself = square matrix. And if a square matrix is invertible, then it has a solution!!!!
- We take the square matrix and invert it, and
multiply it by Φt. - We end up with a massive matrix - pseudoinverse of Φ – it works like
an inverse of a normal systems of eq
- If we have number of unknowns = eq, we can solve the system exactly by inverting the matrix
- But if we have overdetermined – we have to use another
matrix; the form is the same tho.
- pseudoinverse
- Hence a pseudoinverse gives us a solution of mean squared error for linear regression for a given dataset.
Least squares solution II
- We compute our function with coeff, and Φ is nothing else but a value of each basis function on each input point. Matrix of numbers. We do matrix mult and solve, and we get a certain value for the vector of weights.
- And that function is the one that minimizes the error over the whole dataset. But this function doesn’t go through any of the point – its closeeee to every point.
- The distance from function we learn and the point is now a minimum!! minimizes total/ average error
Comparison
-
original and least square functions
- Original function was not a function of sigmoid – so we can’t match exactly
- But interesting is that by using sigmoid we could almost exactly reproduce the original fx to data points
Exercise
- {(0,0), (0,1), (1,0)} – find the least square solution for the regression of the function y = w1+w2*x2
- Remember we want overdetermined system, we can find the EXACT minimum!! So we want to have less parameters than the eqs
- Steps
- Evaluate the bases on the points: Φ = ?
- Φ is a value of output on EACH point!!! But not in the matrix of features
- Evaluate the bases on the points: Φ = ?
-
Compute ΦT Φ = ?
-
Invert (ΦT Φ)-1 = ?
- Compute the pseudo inverse Φp = (ΦT Φ)-1 ΦT = ?
- Compute w! w = Φpt = ? optimal
solution!
- but the problem is the pseudo inverse can be incredibly large
Result
- -0.5x2 + 0.5 we got the values from the previous steps!! [0.5, -0.5]
- The function is a parabola -
- The function minimizes the average error bc - It goes through exactly one point and it has exactly the same distance for 2 points.
Sequential learning
- Least-squares problem – one with the pseudoinverse, so only one
minimum or no minimum
- Pseudoinverse is expensive because the matrix could be enormous, and we do the whole data set at once– what is possible to do is to use stochastic gradient descent.
- Sequential learning - The errors are computed, and the weights are
updated after each input.
- Not the most efficient, but is simple to implement
- Instead of using all the points at once, we can take point one by one and do one step in the direction of gradient for each point same as stochastic gradient descent.
- Advantage of pseudo-inverse is that we don’t have to worry about the global minimum – with neural network, since it is a non-linear, we have to start on random initial points, but with this since we have a single min, we don’t have to restart
Bias and variance
- Important concept in ML independent from linear function
approximator – bias and variance
- This applies to classifiers, regressions, etc
- It’s all about finding the balance without overfitting.
- Bias of an approximator – how far away from the value ur
approximating the function is. Average error across all training set
- Unbiased – given enough training time, it will converge exactly to what you want
- But its almost impossible, it will converge with error
- Bias means we won’t hit the function we want exactly, but it will be a little bit away from what we want
- Variance – how much my classifiers changes when I change the
dataset in the input. Average distance between the errors across all
datasets,
- If I take the 10000 of the data, and train the classifier, and select different 10000 and train that, how different are these classifiers?
- Out of different subset of the data.
- Low bias and low variance (both 0) – very little error, and doesn’t depend too much on our input points.
- Worst is high variance and bias. – it’s far away from the target and it changes every time we change the dataset. It is very subject to the particular points we are using for training
Bias- variance decomposition
- The intermediate steps are less important – but important is what
does this mean?
- Won’t ask the proof in the exam but shouldn’t throw it away
- We start with the output of the classifier minus the function we are actually going to approximate.
- We want to compute the Average error over all possible
dataset– expected value over all possible dataset. –
theoretical concept
- If I change my dataset in all possible ways what would my average error be?
- It starts by subtracting and adding the expected value over all possible dataset of our approximator, and then
Bias-variance decomposition
- We get to the fact that average error over
all possible dataset decomposes into two terms:
- Variance - the average distance between the output of the
function and the mean of the function
- The mean of all the points will be somewhere in the middle – so what is the average distance between each point and this mean
- Large – large variance – large cloud of points- it measures how large this cloud is.
- Bias – average distance of the mean from the true center
- Average distance between the center of the cloud and the target
- Variance - the average distance between the output of the
function and the mean of the function
- Error we get is a combo of this two- variance and bias
- Q: why is this imp?
- We can’t minimize both at the same time. For a given dataset, what we do is we keep training, until the error on the validation set increases.
- In terms of bias and variance decomposition, as we keep training to reduce the error, it means that we are reducing the variance, reducing the global error, but at some point, we do that in the expense of the variance. So if I change my dataset, the classifier we get changes. And in fact, if we change the dataset we get a diff result.
- We need to find the optimal!!
Bias – Variance
- Sampled same original function with some noise,
- If we do the same linear regression for diff subset of points, we get these diff lines. If I change the dataset, we get a diff approximation.
- And the difference between these lines are the variance. Bc I trained diff subset and we get diff function.
- The diff of mean of all these lines and the original is the bias. – this graph is almost unbiased.
- Is mostly unbiased, so the difference is mostly in variance – now we found the min of the dataset.
On training and validation error
- Keep in mind the idea of the target!! - If we change the dataset, we get difference bias and variance.
- Imagine each point here is a classifier trained on a diff dataset. And we have training set and a validation set. 2 points – if we keep training, the 2 points get closer to the target. At some point, they get further apart tho – that means variance is starting to increase.
- Error is about balancing the right amount of bias that doesn’t get the variance increasing, and the variance that doesn’t give you a very large error (big bias)
- Overfitting increases the variance – bc it reduces the error as much as possible on the training set which means the bias is reduced but it happens with the expense of the variance.
- So when we use a different set, such as the validation set- we are likely to see a large variation in the error.
Leave a comment