Training a Simple Linear Regression Model From Scratch

Hey everyone, welcome to my first blog post! This is going to be a walkthrough on training a simple linear regression model in Python. I’ll show you how to do it from scratch, without using any machine learning tools or libraries. We’ll only use NumPy and Matplotlib for matrix operations and data visualization.

Problem & Dataset

We’ll look at a regression problem from a very popular machine learning course taught by Andrew Ng. Our objective in this problem will be to train a model that accurately predicts the profits of a food truck.

The first column in our dataset file contains city populations and the second column contains food truck profits in each city, both in $10,000$s. Here are the first few training examples:

food_truck_data.txt
1
2
3
4
5
6
6.1101,17.592
5.5277,9.1302
8.5186,13.662
7.0032,11.854
5.8598,6.8233
...

We’re going to use this dataset as a training sample to build our model. Let’s begin by loading it:

1
2
3
4
5
6
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('food_truck_data.txt', delimiter=",")
x = data[:, 0] # city populations
y = data[:, 1] # food truck profits

Both $x$ and $y$ are one dimensional arrays, because we have one feature (population) and one target variable (profit) in this problem. Therefore we can conveniently visualize our dataset using a scatter plot:

1
2
3
4
5
6
7
fig, ax = plt.subplots()
ax.scatter(x, y, marker="x", c="red")
plt.title("Food Truck Dataset", fontsize=16)
plt.xlabel("City Population in 10,000s", fontsize=14)
plt.ylabel("Food Truck Profit in 10,000s", fontsize=14)
plt.axis([4, 25, -5, 25])
plt.show()

Hypothesis Function

Now we need to come up with a straight line which accurately represents the relationship between population and profit. This is called the hypothesis function and it’s formulated as:

$h_\theta(x) = \theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2 + … + \theta_nx_n$

where $x$ corresponds to the feature matrix and $\theta$ corresponds to the vector of model parameters.

Since we have a single feature $x_1,$ we’ll only have two model parameters $\theta_0$ and $\theta_1$ in our hypothesis function:

$h_\theta(x) = \theta_0 + \theta_1x_1$

As you may have noticed, the number of model parameters is equal to the number of features plus $1$. That’s because each feature is weighted by a parameter to control its impact on the hypothesis $h_\theta(x)$. There is also an independent parameter $\theta_0$ called the intercept term, which defines the point where the hypothesis function intercepts the $y$-axis as demonstrated below:



The predictions of a hypothesis function can easily be evaluated in Python by computing the cross product of $x$ and $\theta^T.$ At the moment we have our $x$ and $y$ vectors but we don’t have our model parameters yet. So let’s create those as well and initialize them with zeros:

1
theta = np.zeros(2)

Also, we have to make sure that the matrix dimensions of $x$ and $\theta^T$ are compatible with each other for cross product. Currently $x$ has $1$ column but $\theta^T$ has $2$ rows. The dimensions don’t match because of the additional intercept term $\theta_0.$

We can solve this issue by prepending a column to $x$ and set it to all ones. This is essentially equivalent to creating a new feature $x_0 = 1.$ This extra column won’t affect the hypothesis whatsoever because $\theta_0$ is going to be multiplied by $1$ in the cross product.

Let’s create a new variable $X$ to store the extended $x$ matrix:

1
2
X = np.ones(shape=(len(x), 2))
X[:, 1] = x

Finally, we can compute the predictions of our hypothesis as follows:

1
predictions = X @ theta

Of course, the predictions are currently all zeros because we haven’t trained our model yet.

Cost Function

The objective in training a linear regression model is to minimize a cost function, which measures the difference between actual $y$ values in the training sample and predictions made by the hypothesis function $h_\theta(x)$.

Such a cost function can be formulated as;

$J(\theta) = \dfrac{1}{2m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$

where $m$ is the number of training examples.

Here’s its Python version:

1
2
3
4
def cost(theta, X, y):
predictions = X @ theta
squared_errors = np.square(predictions - y)
return np.sum(squared_errors) / (2 * len(y))

Now let’s take a look at the cost of our initial untrained model:

1
print('The initial cost is:', cost(theta, X, y))
The initial cost is: 32.0727338775

Gradient Descent Algorithm

Since our hypothesis is based on the model parameters $\theta$, we must somehow adjust them to minimize our cost function $J(\theta)$. This is where the gradient descent algorithm comes into play. It’s an optimization algorithm which can be used in minimizing differentiable functions. Luckily our cost function $J(\theta)$ happens to be a differentiable one.

So here’s how the gradient descent algorithm works in a nutshell:

In each iteration, it takes a small step in the opposite gradient direction of $J(\theta)$. This makes the model parameters $\theta$ gradually come closer to the optimal values. This process is repeated until eventually the minimum cost is achieved.

More formally, gradient descent performs the following update in each iteration:

$\theta_j := \theta_j - \alpha\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j$

The $\alpha$ term here is called the learning rate. It allows us to control the step size to update $\theta$ in each iteration. Choosing a too large learning rate may prevent us from converging to a minimum cost, whereas choosing a too small learning rate may significantly slow down the algorithm.

Here’s a generic implementation of the gradient descent algorithm:

1
2
3
4
5
6
7
8
9
def gradient_descent(X, y, alpha, num_iters):
num_features = X.shape[1]
theta = np.zeros(num_features) # initialize model parameters
for n in range(num_iters):
predictions = X @ theta # compute predictions based on the current hypothesis
errors = predictions - y
gradient = X.transpose() @ errors
theta -= alpha * gradient / len(y) # update model parameters
return theta # return optimized parameters

Now let’s use this function to train our model and plot the hypothesis function:

1
2
3
4
theta = gradient_descent(X, y, 0.02, 600)   # run GD for 600 iterations with learning rate = 0.02
predictions = X @ theta # predictions made by the optimized model
ax.plot(X[:, 1], predictions, linewidth=2) # plot the hypothesis on top of the training data
fig

Debugging

Our linear fit looks pretty good, right? The algorithm must have successfully optimized our model.

Well, to be honest, it’s been fairly easy to visualize the hypothesis because there’s only one feature in this problem.

But what if we had multiple features? Then it wouldn’t be possible to simply plot the hypothesis to see whether the algorithm has worked as intended or not.

Fortunately, there’s a simple way to debug the gradient descent algorithm irrespective of the number of features:

  1. Modify the gradient descent function to make it record the cost at the end of each iteration.
  2. Plot the cost history after the gradient descent has finished.
  3. Pat yourself on the back if you see that the cost has monotonically decreased over time.

Here’s the modified version of our gradient descent function:

1
2
3
4
5
6
7
8
9
10
11
def gradient_descent(X, y, alpha, num_iters):
cost_history = np.zeros(num_iters) # create a vector to store the cost history
num_features = X.shape[1]
theta = np.zeros(num_features)
for n in range(num_iters):
predictions = X @ theta
errors = predictions - y
gradient = X.transpose() @ errors
theta -= alpha * gradient / len(y)
cost_history[n] = cost(theta, X, y) # compute and record the cost
return theta, cost_history # return optimized parameters and cost history

Now let’s try learning rates $0.01$, $0.015$, $0.02$ and plot the cost history for each one:

1
2
3
4
5
6
7
8
9
10
11
12
13
plt.figure()
num_iters = 1200
learning_rates = [0.01, 0.015, 0.02]
for lr in learning_rates:
_, cost_history = gradient_descent(X, y, lr, num_iters)
plt.plot(cost_history, linewidth=2)
plt.title("Gradient descent with different learning rates", fontsize=16)
plt.xlabel("number of iterations", fontsize=14)
plt.ylabel("cost", fontsize=14)
plt.legend(list(map(str, learning_rates)))
plt.axis([0, num_iters, 4, 6])
plt.grid()
plt.show()

It appears that the gradient descent algorithm worked correctly for these particular learning rates. Notice that it takes more iterations to minimize the cost as the learning rate decreases.

Now let’s try a larger learning rate and see what happens:

1
2
3
4
5
6
7
8
9
10
learning_rate = 0.025
num_iters = 50
_, cost_history = gradient_descent(X, y, learning_rate, num_iters)
plt.plot(cost_history, linewidth=2)
plt.title("Gradient descent with learning rate = " + str(learning_rate), fontsize=16)
plt.xlabel("number of iterations", fontsize=14)
plt.ylabel("cost", fontsize=14)
plt.axis([0, num_iters, 0, 6000])
plt.grid()
plt.show()



Doesn’t look good… That’s what happens when the learning rate is too large. Even though the gradient descent algorithm takes steps in the correct direction, these steps are so huge that it’s going to overshoot the target and the cost diverges from the minimum value instead of converging to it.

Right now we can safely set the learning rate to $0.02$, because it allows us to minimize the cost and it requires relatively fewer iterations to converge.

Prediction

Now that we’ve learned how to train our model, we can finally predict the food truck profit for a particular city:

1
2
3
4
theta, _ = gradient_descent(X, y, 0.02, 600)    # train the model
test_example = np.array([1, 7]) # pick a city with 70,000 population as a test example
prediction = test_example @ theta # use the trained model to make a prediction
print('For population = 70,000, we predict a profit of $', prediction * 10000);
For population = 70,000, we predict a profit of $ 45905.6621788

If you’re still here, you should subscribe to get updates on my future articles. Also feel free to leave a comment below and let me know what you think!