If you studied economics at any point or have used Excel you probably know what linear regression is. It is extremely useful. In order to understand it better, we are going to go on a (nonrigorous) journey through finding the two coefficients for linear regression on one variable with a toy dataset.
We are going to calculate the coefficients for linear regression with one variable by:
 Trying various coefficients and seeing which are optimal
 Finding an exact solution
 Finding a precise-enough solution numerically with gradient descent
I like to think that in a parallel universe, this is how Keith Richards is explaining linear regression.
1. Trying various coefficients and seeing which are best
Let’s plot some random numbers for $x$, and use those same numbers with a bit of jitter for $y$, so we have an approximately linear relationship.
x, y

[[7. 7.4]
[4. 4.1]
[8. 8.5]
[5. 5.3]
[7. 7.1]]
In high school we were told that the equation of a straight line was $y=mx+c$, and that linear regression is the line in this form which ‘best fits’ the data we have for $x,y$.
Now we’re in machine learning land let’s use $\hat y = \theta_0 + \theta_1 x$, where $\theta_0, \theta_1$ are the intercept and the gradient respectively. The hat in $\hat y$ denotes that it is an estimate, and the points along the regression line will be $x, \hat y$.
We need some way to come up with $\hat y$; let’s call this $h(x)$.
And so $\hat y = h(x) = \theta_0 + \theta_1 x$.
Note that rather than regarding $\hat y$ as our prediction for $y$, we can regard it as the $y$ that goes with a given $x$ on the linear regression line.
So far so good, but we don’t yet know how $h(x)$ should produce sensible $\hat y$ values. One way forward is to define how we will know when $h(x)$ has done the optimal job for us. We know that this will be when the line best fits the data $x,y$, and that $x$ is the input to the $h(x)$ function that supplies us with $\hat y$. We also know that $h(x)$ is parameterised by $\theta_0$ and $\theta_1$, so really all we have to do is pick the best value for each of those.
Luckily for us, we have this function on hand. We will calculate the mean squared error (MSE) and we will denote this as the function $J$. $J(\theta_0, \theta_1)$ will be evaluated, and the lower the output (the MSE), the better. We call $J(\theta_0, \theta_1)$ our ‘loss’ function, because we want to find the values for $\theta_0, \theta_1$ that minimise it.
The MSE is simply the mean of the square of all the errors, where the error is defined as $\hat{y}_{i} - y_{i} = h_\theta (x_{i}) - y_{i}$ for the pair of $(x,y)$ values at index $i$, i.e. for any given pair in our dataset. This is then also halved by convention, which has no effect on where the minimum is.
$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i} - y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$
All we have to do now is minimise $J(\theta_0, \theta_1)$. To do this, let’s first take a brute force approach, plugging in values and seeing what that looks like. This approach of minimising $J$ is called ordinary least squares (OLS).
OLS has a couple of nice properties: a perfect fit will give us $J(\theta_0, \theta_1)=0$, and there will be one minimum, so we will be able to land on one set of parameters.
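A minimal sketch of this brute-force search, using the five $(x, y)$ pairs printed earlier. The grid bounds and resolution are assumptions, so the winning pair will only approximate the true optimum:

```python
import numpy as np

x = np.array([7., 4., 8., 5., 7.])
y = np.array([7.4, 4.1, 8.5, 5.3, 7.1])

def J(theta0, theta1):
    """Halved mean squared error of the line theta0 + theta1 * x."""
    residuals = theta0 + theta1 * x - y
    return np.mean(residuals ** 2) / 2

# Brute force: evaluate J over a grid of candidate parameters and
# keep the pair with the lowest loss. Grid bounds/steps are assumptions.
candidates = [(t0, t1)
              for t0 in np.linspace(-1, 1, 101)
              for t1 in np.linspace(0, 2, 101)]
best = min(candidates, key=lambda pair: J(*pair))
print(best, J(*best))
```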
Such a chart is not that easy to read the values from, but that’s ok because we can ask the computer what the best values for $\theta_0, \theta_1$ are:
(0.020408163265305923, 1.0408163265306123)
Let’s plot $y = \theta_0 + \theta_1 x$ as our trend line given those values of $\theta_0, \theta_1$ and see how it looks.
Not bad! But it doesn’t look exactly right. Maybe there is a better way to find these parameters for our trend line.
2. A. Finding an exact solution
We can find an exact solution using the magic of linear algebra and particularly matrix inversion.
$\mathbf \theta = (\mathbf X^\top \cdot \mathbf X)^{-1} \cdot \mathbf X^\top \cdot \mathbf y$
By evaluating the right hand side of the above, we are left with a vector $\mathbf \theta$ which contains our coefficients.
This is nice because it is quick for the computer to calculate, although it won’t always be possible to use this method because $\mathbf X^\top \mathbf X$ is not always invertible.
To do this, we first need to prepend a column vector of $1$s to our column vector $\mathbf x$ to create a matrix $\mathbf X$. This means that when we do our calculations, we will end up with a value for $\theta_0$ in addition to $\theta_1$.
We will be working with:
y =
[[7.4]
[4.1]
[8.5]
[5.3]
[7.1]]
X =
[[1 7]
[1 4]
[1 8]
[1 5]
[1 7]]
Which gives us for $\mathbf \theta$
[[-0.133]
[ 1.067]]
When we plot the regression line with these $\theta$ it looks pretty good.
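Using the toy dataset above, a minimal numpy sketch of the normal equation might look like this (`np.linalg.inv` mirrors the formula directly, though in practice `np.linalg.lstsq` or `np.linalg.solve` is more numerically stable):

```python
import numpy as np

x = np.array([7., 4., 8., 5., 7.])
y = np.array([[7.4], [4.1], [8.5], [5.3], [7.1]])

# Prepend a column of ones so the intercept theta_0 is estimated too.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta.round(3))  # [[-0.133]
                       #  [ 1.067]]
```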
2. B. Finding an exact solution (again)
We know that we want to find the minimum of our loss function with respect to its parameters … which sounds a lot like a calculus problem. As noted, we will not always be able to invert the matrix, whereas this approach always works, and it too will give us an exact solution.
We can take the partial derivative of $J$ with respect to each of $\theta_0, \theta_1$ and set each of these to 0.
$J(\theta_0, \theta_1) = \frac {1}{2m} \sum_{i=1}^{i=m} (h_{\theta}(x^{(i)}) - y^{(i)})^2 = \frac {1}{2m} \sum_{i=1}^{i=m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2$
$\dfrac{\partial \, J(\theta_0, \theta_1)}{\partial \, \theta_0} = \frac{1}{m} \sum_{i=1}^{i=m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)}) = 0$
$\dfrac{\partial \, J(\theta_0, \theta_1)}{\partial \, \theta_1} = \frac{1}{m} \sum_{i=1}^{i=m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)}) x^{(i)} = 0$
[ … ]
$\theta_0 = \frac{(\sum y) (\sum x^2) - (\sum x) (\sum xy)}{m \sum x^2 - (\sum x)^2}$
$\theta_1 = \frac{m(\sum xy) - (\sum x)(\sum y)}{m \sum x^2 - (\sum x)^2}$
And this gives us the same answer …
[-0.133  1.067]
These $\theta$ provide us with a chart which looks reassuringly similar.
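The closed-form sums translate directly into code; a sketch using the same toy dataset:

```python
import numpy as np

x = np.array([7., 4., 8., 5., 7.])
y = np.array([7.4, 4.1, 8.5, 5.3, 7.1])
m = len(x)

# Closed-form solution from setting both partial derivatives to zero.
denom = m * np.sum(x**2) - np.sum(x)**2
theta0 = (np.sum(y) * np.sum(x**2) - np.sum(x) * np.sum(x * y)) / denom
theta1 = (m * np.sum(x * y) - np.sum(x) * np.sum(y)) / denom
print(round(theta0, 3), round(theta1, 3))  # -0.133 1.067
```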
3. Gradient descent
We can also find the optimal values for $\theta_0, \theta_1$ using gradient descent.
To do this, we first set them to some arbitrary values, and then alter them based on the effect on $J$.
More precisely, we pick a value for $\alpha$, the learning rate, which is kept constant and used to control the magnitude of the update. We subtract $\alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ from our given parameter $\theta_j$. We do this for each $\theta_j$. We then repeat this exercise, and as we get closer to the optimal values, the subtracted value reduces in magnitude, until such time as any $\theta_j$ barely changes at all.
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_\theta(x_{i}) - y_{i})$
$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x_{i}) - y_{i}) x_{i}$
Note that in the above $:=$ means ‘is assigned the value of’. Further, both of $\theta_0,\theta_1$ need to be updated before the process is repeated.
If this sounds a bit similar to Newton-Raphson from school days, it is.
If we initialise both to one, the below chart shows their convergence towards the optimal value.
Looks reassuringly similar, again.
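As a sketch of the whole procedure on the toy dataset, with both parameters initialised to one as above. The learning rate and iteration count are assumptions, chosen so that the run converges:

```python
import numpy as np

x = np.array([7., 4., 8., 5., 7.])
y = np.array([7.4, 4.1, 8.5, 5.3, 7.1])

# Both parameters start at 1, per the text. Learning rate and
# iteration count are assumptions.
theta0, theta1 = 1.0, 1.0
alpha = 0.04

for _ in range(50_000):
    errors = theta0 + theta1 * x - y
    # Compute both gradients before updating either parameter,
    # so the update is simultaneous.
    grad0 = np.mean(errors)
    grad1 = np.mean(errors * x)
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(round(theta0, 3), round(theta1, 3))  # -0.133 1.067
```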
Notes
Notation

$x^{(i)}$ is an ‘input’ variable a.k.a. feature

$y^{(i)}$ is an ‘output’ variable a.k.a. target

Pair $(x^{(i)}, y^{(i)})$ is a training example

List of $m$ training examples $(x^{(i)}, y^{(i)})$ for $i=1,\ldots,m$ is a training set

Note: $^{(i)}$ is simply an index and not to do with exponentiation
“I’ll just run the numbers once I finish my Jack Daniels”