Linear regression like a rock star

21 September 2019

If you studied economics at any point, or have used Excel, you probably know what linear regression is. It is extremely useful. In order to understand it better, we are going to go on a (non-rigorous) journey through finding the two coefficients for linear regression on one variable with a toy dataset.

We are going to calculate the coefficients for linear regression with one variable by:

  1. Trying various coefficients and seeing which are best
  2. Finding an exact solution
  3. Finding a precise-enough solution numerically with gradient descent

I like to think that in a parallel universe, this is how Keith Richards is explaining linear regression.

1. Trying various coefficients and seeing which are best

Let’s plot some random numbers for $x$, and use those same numbers with a bit of jitter for $y$, so we have an approximately linear relationship.

x, y

[[7.  7.4]
 [4.  4.1]
 [8.  8.5]
 [5.  5.3]
 [7.  7.1]]


In high school we were told that the equation of a straight line was $y = mx + c$, and that linear regression is the line in this form which ‘best fits’ the data we have for $x, y$.

Now we’re in machine learning land, let’s use $\hat y = \theta_0 + \theta_1 x$, where $\theta_0, \theta_1$ are the intercept and the gradient respectively. The hat in $\hat y$ denotes that it is an estimate, and the points along the regression line will be $x, \hat y$.

We need some way to come up with $\hat y$; let’s call this $h(x)$.

And so $\hat y = h(x) = \theta_0 + \theta_1 x$.

Note that rather than regarding $\hat y$ as our prediction for $y$, we can regard it as the $y$ that goes with a given $x$ on the linear regression line.

So far so good, but we have no idea how $h(x)$ should work to supply us with sensible $\hat y$ values. One way forward is to define how we will know when $h(x)$ has done the optimal job for us. We know that this will be when the line best fits the data $x, y$, and we know that $x$ will be an input into our $h(x)$ function that supplies us with $\hat y$. We also know that $h(x)$ is parameterised by $\theta_0$ and $\theta_1$, so really all we have to do is pick the best values for each of those.

Luckily for us, there is a standard function for exactly this. We will calculate the mean squared error (MSE), and we will denote this as the function $J$. $J(\theta_0, \theta_1)$ will be evaluated, and the lower the output (the MSE), the better. We call $J(\theta_0, \theta_1)$ our ‘loss’ function, because we want to find the values for $\theta_0, \theta_1$ that minimise it.

The MSE is simply the mean of the squares of all the errors, where the error is defined as $\hat{y}_i - y_i = h_\theta(x_i) - y_i$ for the pair of $(x, y)$ values at index $i$, i.e. for any given pair in our dataset. The result is then also halved by convention, which has no effect on where the minimum is.

$$J(\theta_0, \theta_1) = \dfrac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2 = \dfrac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2$$
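As a sketch, the loss can be written in code directly from the formula above (the conventional halving included):

```python
import numpy as np

def J(theta0, theta1, x, y):
    """Mean squared error, halved by convention."""
    y_hat = theta0 + theta1 * x          # h(x) for every point
    return np.mean((y_hat - y) ** 2) / 2

# A perfect fit gives a loss of exactly zero.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
print(J(1.0, 2.0, x, y))  # 0.0
```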

All we have to do now is minimise $J(\theta_0, \theta_1)$. To do this, let’s first take a brute-force approach, plugging in values and seeing what that looks like. This approach of minimising $J$ is called ordinary least squares (OLS).

OLS has a couple of nice properties: a perfect fit will give us $J(\theta_0, \theta_1) = 0$, and there is a single minimum, so we will be able to land on one set of parameters.


[Chart: the loss $J$ evaluated over a grid of candidate $\theta_0, \theta_1$ values]

Such a chart is not that easy to read values from, but that’s ok, because we can ask the computer what the best values for $\theta_0, \theta_1$ are:

(0.020408163265305923, 1.0408163265306123)
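The exact grid searched in the post isn’t shown (the reported values land on grid points, hence the odd decimals); here is a brute-force sketch over an assumed grid:

```python
import numpy as np

x = np.array([7.0, 4.0, 8.0, 5.0, 7.0])
y = np.array([7.4, 4.1, 8.5, 5.3, 7.1])

def loss(t0, t1):
    return np.mean((t0 + t1 * x - y) ** 2) / 2

# Try every (theta0, theta1) pair on the grid and keep the lowest loss.
t0_grid = np.linspace(-1.0, 1.0, 201)
t1_grid = np.linspace(0.0, 2.0, 201)
best = min((loss(t0, t1), t0, t1) for t0 in t0_grid for t1 in t1_grid)
print(best[1], best[2])
```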

Let’s plot $y = \theta_0 + \theta_1 x$ as our trend line given those values of $\theta_0, \theta_1$ and see how it looks.


[Chart: the data with the brute-force trend line]

Not bad! But it doesn’t look exactly right. Maybe there is a better way to find these parameters for our trend line.

2. A. Finding an exact solution

We can find an exact solution using the magic of linear algebra, in particular matrix inversion.

$$\boldsymbol{\theta} = (\mathbf{X}^\top \cdot \mathbf{X})^{-1} \cdot \mathbf{X}^\top \cdot \mathbf{y}$$

By evaluating the right-hand side of the above, we are left with a vector $\boldsymbol\theta$ which contains our coefficients.

This is nice because it is quick for the computer to calculate, although it won’t always be possible to use this method, because not all matrices can be inverted: in particular, $\mathbf{X}^\top \mathbf{X}$ may be singular.

To do this, we first need to prepend a column vector of $1$s to our column vector $\mathbf x$ to create a matrix $\mathbf X$. This means that when we do our calculations, we will end up with a value for $\theta_0$ in addition to $\theta_1$.

We will be working with:

y =
[[7.4]
 [4.1]
 [8.5]
 [5.3]
 [7.1]]

X =
[[1 7]
 [1 4]
 [1 8]
 [1 5]
 [1 7]]

Which gives us for $\boldsymbol\theta$:

[[-0.133]
 [ 1.067]]


When we plot the regression line with these $\theta$, it looks pretty good.
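A minimal numpy sketch of the normal-equation route (using `np.linalg.solve` rather than forming the inverse explicitly, which is numerically safer but equivalent here):

```python
import numpy as np

x = np.array([7.0, 4.0, 8.0, 5.0, 7.0])
y = np.array([7.4, 4.1, 8.5, 5.3, 7.1])

# Prepend a column of 1s so theta_0 (the intercept) is estimated too.
X = np.column_stack([np.ones_like(x), x])

# theta = (X^T X)^{-1} X^T y, computed via solve() instead of inv().
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(theta, 3))  # [-0.133  1.067]
```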

2. B. Finding an exact solution (again)

We know that we want to find the minimum of our loss function with respect to the parameters … which sounds a lot like a calculus problem. We will not always be able to invert the matrix per the previous section, and this method gives us an exact solution without needing to.

We can take the partial derivative of $J$ with respect to each of $\theta_0, \theta_1$ and set each of these to $0$.

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2$$

$$\dfrac{\partial \, J(\theta_0, \theta_1)}{\partial \, \theta_0} = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)}) = 0$$

$$\dfrac{\partial \, J(\theta_0, \theta_1)}{\partial \, \theta_1} = \frac{1}{m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)}) \, x^{(i)} = 0$$

[ … ]

$$\theta_0 = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{m \sum x^2 - (\sum x)^2}$$

$$\theta_1 = \frac{m(\sum xy) - (\sum x)(\sum y)}{m \sum x^2 - (\sum x)^2}$$

And this gives us the same answer …

[-0.133  1.067]


These $\theta$ provide us with a chart which looks reassuringly similar.
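The closed-form expressions above can be checked directly against the data:

```python
import numpy as np

x = np.array([7.0, 4.0, 8.0, 5.0, 7.0])
y = np.array([7.4, 4.1, 8.5, 5.3, 7.1])
m = len(x)

# Shared denominator: m * sum(x^2) - (sum(x))^2
denom = m * np.sum(x**2) - np.sum(x) ** 2
theta0 = (np.sum(y) * np.sum(x**2) - np.sum(x) * np.sum(x * y)) / denom
theta1 = (m * np.sum(x * y) - np.sum(x) * np.sum(y)) / denom
print(np.round([theta0, theta1], 3))  # [-0.133  1.067]
```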

3. Gradient descent

We can also find the optimal values for $\theta_0, \theta_1$ using gradient descent.

To do this, we first set them to some arbitrary values, and then alter them based on their effect on $J$.

More precisely, we pick a value for $\alpha$, the learning rate, which is kept constant and used to control the magnitude of the update. We subtract $\alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ from our given parameter $\theta_j$. We do this for each $\theta_j$. We then repeat this exercise, and as we get closer to the optimal values, the subtracted value reduces in magnitude, until such time as any $\theta_j$ barely changes at all.

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i) \, x_i$$

Note that in the above, $:=$ means ‘is assigned the value of’. Further, both of $\theta_0, \theta_1$ need to be updated simultaneously, each using the old values of both, before the process is repeated.

If this sounds a bit similar to Newton-Raphson from school days, it is.

If we initialise both to one, the below chart shows their convergence towards the optimal values.



[Chart: $\theta_0$ and $\theta_1$ converging towards their optimal values over iterations]

Looks reassuringly similar, again.
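The update rules above can be sketched as a loop; the learning rate and iteration count here are assumptions, since the post’s own settings aren’t shown:

```python
import numpy as np

x = np.array([7.0, 4.0, 8.0, 5.0, 7.0])
y = np.array([7.4, 4.1, 8.5, 5.3, 7.1])

alpha = 0.02        # learning rate (assumed)
t0, t1 = 1.0, 1.0   # initialise both parameters to one

for _ in range(20_000):
    err = t0 + t1 * x - y          # h(x_i) - y_i for every point
    # Compute both updates from the current values, then apply together.
    new_t0 = t0 - alpha * np.mean(err)
    new_t1 = t1 - alpha * np.mean(err * x)
    t0, t1 = new_t0, new_t1

print(round(t0, 3), round(t1, 3))
```

Note the two updates are computed before either parameter is overwritten, matching the simultaneous-update requirement above.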



  • $x^{(i)}$ is an ‘input’ variable, a.k.a. feature

  • $y^{(i)}$ is an ‘output’ variable, a.k.a. target

  • The pair $(x^{(i)}, y^{(i)})$ is a training example

  • The list of $m$ training examples $(x^{(i)}, y^{(i)})$ for $i = 1, \ldots, m$ is a training set

  • Note: $^{(i)}$ is simply an index, and nothing to do with exponentiation


“I’ll just run the numbers once I finish my Jack Daniels”