Learning Curves in Linear & Polynomial Regression

Learning curves are very useful for analyzing the bias-variance characteristics of a machine learning model. In this post, I’m going to talk about how to make use of them in a case study of a regression problem. We’re going to start with a simple linear regression model and improve it as much as we can by taking advantage of learning curves.

Introduction to Learning Curves

In a nutshell, learning curves show how the training and validation errors change with respect to the number of training examples used while training a machine learning model.

If a model is balanced, both errors converge to small values as the training sample size increases.
If a model has high bias, it ends up underfitting the data. As a result, both errors fail to decrease no matter how many examples there are in the training set.
If a model has high variance, it ends up overfitting the training data. In that case, increasing the training sample size decreases the training error but it fails to decrease the validation error.

The figure below demonstrates each of those cases:

Problem Definition and Dataset

After this incredibly brief introduction, let me introduce you to today’s problem where we’ll get to see learning curves in action. It’s another problem from Andrew Ng’s machine learning course, in which the objective is to predict the amount of water flowing out of a dam, given the change of water level in a reservoir.

The dataset file we’re about to read contains historical records on the change in water level and the amount of water flowing out of the dam. The reason that it’s a .mat file is because this problem is originally a MATLAB assignment. Fortunately, it’s pretty easy to load .mat files in Python using the loadmat function from SciPy. We’ll also need NumPy and Matplotlib for matrix operations and data visualization:

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt # we'll need this later
import scipy.io as sio

dataset = sio.loadmat("water.mat")
x_train = dataset["X"]
x_val = dataset["Xval"]
x_test = dataset["Xtest"]

# squeeze the target variables into one-dimensional arrays
y_train = dataset["y"].squeeze()
y_val = dataset["yval"].squeeze()
y_test = dataset["ytest"].squeeze()

The dataset is divided into three samples:

The training sample consists of x_train and y_train.
The validation sample consists of x_val and y_val.
The test sample consists of x_test and y_test.

Notice that we have to explicitly convert the target variables (y_train, y_val and y_test) to one dimensional vectors, because they are stored as matrices inside the .mat file.

Let’s plot the training sample to see what it looks like:

fig, ax = plt.subplots()
ax.scatter(x_train, y_train, marker="x", s=40, c='red')
plt.xlabel("change in water level", fontsize=14)
plt.ylabel("water flowing out of the dam", fontsize=14)
plt.title("Training sample", fontsize=16)
plt.show()

The Game Plan

Alright, it’s time to come up with a strategy. First of all, it’s clear that there’s a nonlinear relationship between $x$ and $y$. Normally we would rule out any linear model because of that. However, we are going to begin by training a linear regression model so that we can see how the learning curves of a model with high bias look like.

Then we’ll train a polynomial regression model which is going to be much more flexible than linear regression. This will let us see the learning curves of a model with high variance.

Finally, we’ll add regularization to the existing polynomial regression model and see how a balanced model’s learning curves look like.

Linear Regression

I’ve already shown you in the previous post how to train a linear regression model using gradient descent. Before proceeding any further, I strongly encourage you to take a look at it if you don’t have at least a basic understanding of linear regression.

Here I’ll show you an easier way to train a linear regression model using an optimization function called fmin_cg from scipy.optimize. You can check out the detailed documentation here. The cool thing about this function is that it’s faster than gradient descent and also you don’t have to select a learning rate by trial and error.

fmin_cg needs a function that returns the cost and another one that returns the gradient of the cost for a given hypothesis. We have to pass those to fmin_cg as function arguments. Fortunately, we can reuse some code from the previous post:

We can completely reuse the cost function because it’s independent of the optimization method that we use.
From the gradient_descent function, we can borrow the part where the gradient of the cost function is evaluated.

So here’s (almost) all we need in order to train a linear regression model:

def cost(theta, X, y):
    predictions = X @ theta
    return np.sum(np.square(predictions - y)) / (2 * len(y))

def cost_gradient(theta, X, y):
    predictions = X @ theta
    return X.transpose() @ (predictions - y) / len(y)

def train_linear_regression(X, y):
    theta = np.zeros(X.shape[1]) # initialize model parameters with zeros
    return opt.fmin_cg(cost, theta, cost_gradient, (X, y), disp=False)

If you look at our cost function, there we evaluate the cross product of the feature matrix $X$ and the vector of model parameters $\theta$. Remember, this is only possible if the matrix dimensions match. Therefore we also need a tiny utility function to insert an additional first column of all ones to a raw feature matrix such as x_train.

def insert_ones(x):
    X = np.ones(shape=(x.shape[0], x.shape[1] + 1))
    X[:, 1:] = x
    return X

Now let’s train a linear regression model and plot the linear fit on top of the training sample:

X_train = insert_ones(x_train)
theta = train_linear_regression(X_train, y_train)
hypothesis = X_train @ theta
ax.plot(X_train[:, 1], hypothesis, linewidth=2)
fig

Learning Curves for Linear Regression

The above plot clearly shows that linear regression is not suitable for this task. Let’s also look at its learning curves and see if we can draw the same conclusion.

While plotting learning curves, we’re going to start with $2$ training examples and increase them one by one. In each iteration, we’ll train a model and evaluate the training error on the existing training sample, and the validation error on the whole validation sample:

def learning_curves(X_train, y_train, X_val, y_val):
    train_err = np.zeros(len(y_train))
    val_err = np.zeros(len(y_train))
    for i in range(1, len(y_train)):
        theta = train_linear_regression(X_train[0:i + 1, :], y_train[0:i + 1])
        train_err[i] = cost(theta, X_train[0:i + 1, :], y_train[0:i + 1])
        val_err[i] = cost(theta, X_val, y_val)
    plt.plot(range(2, len(y_train) + 1), train_err[1:], c="r", linewidth=2)
    plt.plot(range(2, len(y_train) + 1), val_err[1:], c="b", linewidth=2)
    plt.xlabel("number of training examples", fontsize=14)
    plt.ylabel("error", fontsize=14)
    plt.legend(["training", "validation"], loc="best")
    plt.axis([2, len(y_train), 0, 100])
    plt.grid()

In order to use this function, we have to resize x_val just like we did x_train:

1
2
3

X_val = insert_ones(x_val)
plt.title("Learning Curves for Linear Regression", fontsize=16)
learning_curves(X_train, y_train, X_val, y_val)

As expected, we were unable to sufficiently decrease either the training or the validation error.

Polynomial Regression

Now it’s time to introduce some nonlinearity with polynomial regression.

Feature Mapping

In order to train a polynomial regression model, the existing feature(s) have to be mapped to artificially generated polynomial features. Then the rest is pretty much the same drill.

In our case we only have a single feature $x_1$, the change in water level. Therefore we can simply compute the first several powers of $x_1$ to artificially obtain new polynomial features. Let’s create a simple function for this:

def poly_features(x, degree):
    X_poly = np.zeros(shape=(len(x), degree))
    for i in range(0, degree):
        X_poly[:, i] = x.squeeze() ** (i + 1);
    return X_poly

Now let’s generate new feature matrices for training, validation and test samples with 8 polynomial features in each:

1
2
3

x_train_poly = poly_features(x_train, 8)
x_val_poly = poly_features(x_val, 8)
x_test_poly = poly_features(x_test, 8)

Feature Normalization

Ok, we have our polynomial features but we also have a tiny little problem. If you take a closer look at one of the new matrices, you’ll see that the polynomial features are very imbalanced at the moment. For instance, let’s look at the first few rows of the x_train_poly matrix:

1	print(x_train_poly[:4, :])

[[ -1.59367581e+01   2.53980260e+02  -4.04762197e+03   6.45059724e+04
   -1.02801608e+06   1.63832436e+07  -2.61095791e+08   4.16102047e+09]
 [ -2.91529792e+01   8.49896197e+02  -2.47770062e+04   7.22323546e+05
   -2.10578833e+07   6.13900035e+08  -1.78970150e+10   5.21751305e+11]
 [  3.61895486e+01   1.30968343e+03   4.73968522e+04   1.71527069e+06
    6.20748719e+07   2.24646160e+09   8.12984311e+10   2.94215353e+12]
 [  3.74921873e+01   1.40566411e+03   5.27014222e+04   1.97589159e+06
    7.40804977e+07   2.77743990e+09   1.04132297e+11   3.90414759e+12]]

As the polynomial degree increases, the values in the corresponding columns exponentially grow to the point where they differ by orders of magnitude.

The thing is, the cost function will generally converge much more slowly when the features are imbalanced like this. So we need to make sure that our features are on a similar scale before we begin to train our model. We’re going to do this in two steps:

Subtract the mean value of each column from itself and make the new mean $0$.
Divide the values in each column by their standard deviation and make the new standard deviation $1$.

It’s important that we use the mean and standard deviation values from the training sample while normalizing the validation and test samples.

train_means = x_train_poly.mean(axis=0)
train_stdevs = np.std(x_train_poly, axis=0, ddof=1)

x_train_poly = (x_train_poly - train_means) / train_stdevs
x_val_poly = (x_val_poly - train_means) / train_stdevs
x_test_poly = (x_test_poly - train_means) / train_stdevs

X_train_poly = insert_ones(x_train_poly)
X_val_poly = insert_ones(x_val_poly)
X_test_poly = insert_ones(x_test_poly)

Finally we can train our polynomial regression model by using our train_linear_regression function and plot the polynomial fit. Note that when the polynomial features are simply treated as independent features, training a polynomial regression model is no different than training a multivariate linear regression model:

def plot_fit(min_x, max_x, means, stdevs, theta, degree):
    x = np.linspace(min_x - 5, max_x + 5, 1000)
    x_poly = poly_features(x, degree)
    x_poly = (x_poly - means) / stdevs
    x_poly = insert_ones(x_poly)
    plt.plot(x, x_poly @ theta, linewidth=2)
    plt.show()

theta = train_linear_regression(X_train_poly, y_train)
plt.scatter(x_train, y_train, marker="x", s=40, c='red')
plt.xlabel("change in water level", fontsize=14)
plt.ylabel("water flowing out of the dam", fontsize=14)
plt.title("Polynomial Fit", fontsize=16)
plot_fit(min(x_train), max(x_train), train_means, train_stdevs, theta, 8)

What do you think, seems pretty accurate right? Let’s take a look at the learning curves.

1 2	plt.title("Learning Curves for Polynomial Regression", fontsize=16) learning_curves(X_train_poly, y_train, X_val_poly, y_val)

Now that’s overfitting written all over it. Even though the training error is very low, the validation error miserably fails to converge.

It appears that we need something in between in terms of flexibility. Although we can’t make linear regression more flexible, we can decrease the flexibility of polynomial regression using regularization.

Regularized Polynomial Regression

Regularization lets us come up with simpler hypothesis functions that are less prone to overfitting. This is achieved by penalizing large $\theta$ values during the training stage.

Here’s the regularized cost function:

$J(\theta) = \dfrac{1}{2m}\Big(\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2\Big) +
\dfrac{\lambda}{2m}\Big(\sum\limits_{j=1}^{n}\theta_j^2\Big)$

And its gradient becomes:

$\dfrac{\partial J(\theta)}{\partial \theta_0} =
\dfrac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}
\quad \qquad \qquad \qquad for ,, j = 0$

$\dfrac{\partial J(\theta)}{\partial \theta_j} =
\Big(\dfrac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\Big) + \dfrac{\lambda}{m}\theta_j
,,\quad \qquad for ,, j \geq 0$

Notice that we are not penalizing the intercept term $\theta_0.$ That’s because it doesn’t have anything to do with the model’s flexibility.

Of course we’ll need to reflect these changes to the corresponding Python implementations by introducing a regularization parameter lamb:

def cost(theta, X, y, lamb=0):
    predictions = X @ theta
    squared_errors = np.sum(np.square(predictions - y))
    regularization = np.sum(lamb * np.square(theta[1:]))
    return (squared_errors + regularization) / (2 * len(y))

def cost_gradient(theta, X, y, lamb=0):
    predictions = X @ theta
    gradient = X.transpose() @ (predictions - y)
    regularization = lamb * theta
    regularization[0] = 0 # don't penalize the intercept term
    return (gradient + regularization) / len(y)

We also have to slightly modify train_linear_regression and learning_curves:

def train_linear_regression(X, y, lamb=0):
    theta = np.zeros(X.shape[1])
    return opt.fmin_cg(cost, theta, cost_gradient, (X, y, lamb), disp=False)

def learning_curves(X_train, y_train, X_val, y_val, lamb=0):
    train_err = np.zeros(len(y_train))
    val_err = np.zeros(len(y_train))
    for i in range(1, len(y_train)):
        theta = train_linear_regression(X_train[0:i + 1, :], y_train[0:i + 1], lamb)
        train_err[i] = cost(theta, X_train[0:i + 1, :], y_train[0:i + 1])
        val_err[i] = cost(theta, X_val, y_val)
    plt.plot(range(2, len(y_train) + 1), train_err[1:], c="r", linewidth=2)
    plt.plot(range(2, len(y_train) + 1), val_err[1:], c="b", linewidth=2)
    plt.xlabel("number of training examples", fontsize=14)
    plt.ylabel("error", fontsize=14)
    plt.legend(["Training", "Validation"], loc="best")
    plt.axis([2, len(y_train), 0, 100])
    plt.grid()

Alright we’re now ready to train a regularized polynomial regression model. Let’s set $\lambda = 1$ and plot our polynomial hypothesis on top of the training sample:

theta = train_linear_regression(X_train_poly, y_train, 1)
plt.scatter(x_train, y_train, marker="x", s=40, c='red')
plt.xlabel("change in water level", fontsize=14)
plt.ylabel("water flowing out of the dam", fontsize=14)
plt.title("Regularized Polynomial Fit", fontsize=16)
plot_fit(min(x_train), max(x_train), train_means, train_stdevs, theta, 8)

It is clear that this hypothesis is much less flexible than the unregularized one. Let’s plot the learning curves and observe its bias-variance tradeoff:

1 2	plt.title("Learning Curves for Regularized Polynomial Regression", fontsize=16) learning_curves(X_train_poly, y_train, X_val_poly, y_val, 1)

This is apparently the best model we’ve come up so far.

Choosing the Optimal Regularization Parameter

Although setting $\lambda = 1$ has significantly improved the unregularized model, we can do even better by optimizing $\lambda$ as well. Here’s how we’re going to do it:

Select a set of $\lambda$ values to try out.
Train a model for each $\lambda$ in the set.
Find the $\lambda$ value that yields the minimum validation error.

lambda_values = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10];
val_err = []
for lamb in lambda_values:
    theta = train_linear_regression(X_train_poly, y_train, lamb)
    val_err.append(cost(theta, X_val_poly, y_val))
plt.plot(lambda_values, val_err, c="b", linewidth=2)
plt.axis([0, len(lambda_values), 0, val_err[-1] + 1])
plt.grid()
plt.xlabel("lambda", fontsize=14)
plt.ylabel("error", fontsize=14)
plt.title("Validation Curve", fontsize=16)
plt.show()

Looks like we’ve achieved the lowest validation error where $\lambda = 3$.

Evaluating Test Errors

It’s good practice to evaluate an optimized model’s accuracy on a separate test sample other than the training and validation samples. So let’s train our models once again and compare test errors:

X_test = insert_ones(x_test)
theta = train_linear_regression(X_train, y_train)
test_error = cost(theta, X_test, y_test)
print("Test Error =", test_error, "| Linear Regression")

theta = train_linear_regression(X_train_poly, y_train)
test_error = cost(theta, X_test_poly, y_test)
print("Test Error =", test_error, "| Polynomial Regression")

theta = train_linear_regression(X_train_poly, y_train, 3)
test_error = cost(theta, X_test_poly, y_test)
print("Test Error =", test_error, "| Regularized Polynomial Regression (at lambda = 3)")

Test Error = 32.5057492449 | Linear Regression
Test Error = 17.2624144407 | Polynomial Regression
Test Error = 3.85988782246 | Regularized Polynomial Regression (at lambda = 3)

If you’re still here, you should subscribe to get updates on my future articles.