One-vs-All Classification Using Logistic Regression

Previously, we talked about how to build a binary classifier by implementing our own logistic regression model in Python. In this post, we're going to build upon that existing model and turn it into a multi-class classifier using an approach called one-vs-all classification.

One-vs-All Classification

First of all, let me briefly explain the idea behind one-vs-all classification. Say we have a classification problem and there are $N$ distinct classes. In this case, we’ll have to train a multi-class classifier instead of a binary one.

One-vs-all classification is a method that trains $N$ distinct binary classifiers, each designed to recognize one particular class. Those $N$ classifiers are then used together for multi-class classification: each one estimates how likely an example is to belong to its class, and we pick the class with the highest estimate.

We already know from the previous post how to train a binary classifier using logistic regression. So all we really have to do now is train $N$ binary classifiers instead of just one. And that’s pretty much it.
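To make the idea concrete before we move on to the digits, here’s a minimal sketch on made-up data. Everything in it is an assumption for illustration only: the three 2D clusters, the names toy_x, toy_y and train_binary, and the use of plain gradient descent as the optimizer.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_binary(X, t, lr=0.1, steps=2000):
    """Fit one logistic regression model with plain batch gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - t) / len(t)
        theta -= lr * grad
    return theta

# Hypothetical toy data: three well-separated 2D clusters, one per class.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
toy_y = np.repeat(np.arange(3), 30)
toy_x = centers[toy_y] + rng.normal(scale=0.5, size=(90, 2))
toy_X = np.column_stack([np.ones(len(toy_x)), toy_x])  # intercept column

# One binary classifier per class: "is it class c or not?"
thetas = np.array([train_binary(toy_X, (toy_y == c).astype(float))
                   for c in range(3)])

# For each example, pick the class whose classifier is the most confident.
toy_pred = sigmoid(toy_X @ thetas.T).argmax(axis=1)
toy_accuracy = np.mean(toy_pred == toy_y)
print("Toy training accuracy:", toy_accuracy)
```

The structure is the same as what we’ll build for the digits: one parameter vector per class, stacked into a matrix, followed by an argmax over the per-class estimates.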

Problem & Dataset

We’re going to use this one-vs-all approach to solve a multi-class classification problem from the machine learning course taught by Andrew Ng. The goal in this problem is to identify digits from 0 to 9 by looking at 20x20 pixel drawings.

Here the number of classes $N$ is equal to 10, which is the number of different digits. We’re going to treat each pixel as an individual feature, which adds up to 400 features per image. Here are some examples from our training sample of 5000 images:

The training data is stored in a file called digits.mat. It’s a .mat file because this problem was originally a Matlab assignment. No big deal, since it’s pretty easy to import a .mat file in Python using the loadmat function from the scipy.io module. Here’s how to do it:

import numpy as np
import scipy.io as sio
import scipy.optimize as opt

data = sio.loadmat("digits.mat")
x = data['X'] # the feature matrix is labeled with 'X' inside the file
y = np.squeeze(data['y']) # the target variable vector is labeled with 'y' inside the file
np.place(y, y == 10, 0) # replace the label 10 with 0
numExamples = x.shape[0] # 5000 examples
numFeatures = x.shape[1] # 400 features
numLabels = 10 # digits from 0 to 9

Let me point out two things here:

We’re using the squeeze function on the y array in order to explicitly make it one dimensional. We’re doing this because y is stored as a 2D matrix in the .mat file although it’s actually a 1D array.
We’re replacing the label 10 with 0. This label actually stands for the digit 0, but it was stored as 10 because Matlab arrays are indexed starting from 1, so there is no index 0.
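Here’s a small sketch of those two steps on a made-up array, so we don’t need the actual file. The name raw_y is a hypothetical stand-in for data['y'], and the values are invented. It also shows why the 2D shape would be a problem later on: subtracting a column vector from a 1D array silently broadcasts into a full matrix.

```python
import numpy as np

# Hypothetical stand-in for data['y']: a column vector of shape (5, 1).
raw_y = np.array([[10], [1], [2], [10], [9]])

# A (5,) array minus a (5, 1) array broadcasts to (5, 5), which is
# not what we want when computing predictions - y.
broadcast_shape = (np.zeros(5) - raw_y).shape

flat_y = np.squeeze(raw_y)          # shape (5, 1) -> shape (5,)
np.place(flat_y, flat_y == 10, 0)   # in place: every 10 becomes 0

# np.squeeze returns a view, so the in-place replacement above
# changes raw_y as well.
print(broadcast_shape, flat_y.shape, flat_y)
```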

Logistic Regression Recap

Remember the sigmoid, cost and cost_gradient functions that we came up with while training a logistic regression model in the previous post? We can reuse them here exactly as they are, because the classifiers we’re going to train in this problem are nothing but logistic regression models.

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    predictions = sigmoid(X @ theta)
    predictions[predictions == 1] = 0.999 # avoid log(1 - 1) = log(0), which is -inf
    error = -y * np.log(predictions) - (1 - y) * np.log(1 - predictions)
    return np.sum(error) / len(y)

def cost_gradient(theta, X, y):
    predictions = sigmoid(X @ theta)
    return X.transpose() @ (predictions - y) / len(y)
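If you want to convince yourself that the gradient really matches the cost, one common sanity check is to compare it against a finite-difference approximation. The sketch below is only an illustration: numerical_gradient, X_check, y_check and theta_check are hypothetical names, the data is random, and the cost is a simplified copy without the 0.999 guard.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    p = sigmoid(X @ theta)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

def cost_gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Small random problem: 20 examples, intercept plus 3 features.
rng = np.random.default_rng(1)
X_check = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y_check = rng.integers(0, 2, size=20).astype(float)
theta_check = rng.normal(scale=0.1, size=4)

analytic = cost_gradient(theta_check, X_check, y_check)
numeric = numerical_gradient(lambda t: cost(t, X_check, y_check), theta_check)
max_gap = np.max(np.abs(analytic - numeric))

# With all parameters at zero every prediction is 0.5, so the cost is log(2).
cost_at_zero = cost(np.zeros(4), X_check, y_check)
```

If max_gap comes out tiny, the two functions agree with each other, which is exactly what the optimizer relies on.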

Training Stage

The final thing we have to do before starting to train our multi-class classifier is to add an initial column of ones to our feature matrix to take into account the intercept term:

X = np.ones(shape=(x.shape[0], x.shape[1] + 1))
X[:, 1:] = x
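If you prefer a one-liner, the same intercept column can also be attached with np.hstack. Here’s a minimal sketch on a made-up 3x2 matrix (x_small and X_small are illustrative names, not part of the original code):

```python
import numpy as np

# Hypothetical feature matrix: 3 examples, 2 features.
x_small = np.array([[0.2, 0.7],
                    [0.5, 0.1],
                    [0.9, 0.4]])

# Prepend a column of ones so the first parameter acts as the intercept.
X_small = np.hstack([np.ones((x_small.shape[0], 1)), x_small])
print(X_small.shape)
```

Both versions should produce the same matrix, so which one to use is a matter of taste.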

Now we’re ready to train our classifiers. Let’s create an array to store the model parameters $\theta$ for each classifier. Note that we need 10 sets of model parameters, each consisting of 401 parameters including the intercept term:

classifiers = np.zeros(shape=(numLabels, numFeatures + 1))

Then we’re going to train 10 binary classifiers, one for each digit, inside a for loop:

for c in range(0, numLabels):
    label = (y == c).astype(int)
    initial_theta = np.zeros(X.shape[1])
    classifiers[c, :] = opt.fmin_cg(cost, initial_theta, cost_gradient, (X, label), disp=0)

Here we create a label vector in each iteration. We set its values to 1 where the corresponding values in y are equal to the current digit, and we set the rest of its values to 0. Hence the label vector acts as the target variable vector y of the binary classifier that we train for the current digit.
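To see what the label vector looks like, here’s a tiny sketch with a made-up target vector (y_small is hypothetical). The second part builds the label vectors for all ten digits at once, which is just another way of looking at the same thing:

```python
import numpy as np

# Hypothetical targets for five images.
y_small = np.array([3, 0, 3, 7, 1])

# Label vector for the classifier of digit 3: 1 where y is 3, 0 elsewhere.
label_for_3 = (y_small == 3).astype(int)

# All ten label vectors side by side: column c is the label vector for digit c.
all_labels = (y_small[:, None] == np.arange(10)).astype(int)
print(label_for_3)
```

Each row of all_labels contains exactly one 1, because every image belongs to exactly one digit.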

Predictions

We can evaluate the probability estimates of our trained classifiers for each class as follows:

classProbabilities = sigmoid(X @ classifiers.transpose())

This will give us a matrix of 5000 rows and 10 columns, where each row belongs to one image and column $c$ holds the estimated probability that the image shows the digit $c$.

However, we may need the final predictions of the optimized classifier instead of numerical probability estimates. We can find our model’s predictions by simply selecting the label with the highest probability in each row:

predictions = classProbabilities.argmax(axis=1)

Now we have our model’s predictions as a vector with 5000 elements labeled from 0 to 9.
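Two details of this step are easy to miss, so here’s a small sketch with invented scores for three images and three classes (the scores array is hypothetical). First, the ten classifiers are trained independently, so the values in a row don’t have to sum to 1. Second, the winning class can have an estimate below 0.5 and still be selected.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical raw scores (X @ theta) for 3 images and 3 classes.
scores = np.array([[ 2.0, -1.0, -3.0],
                   [-4.0, -0.5, -2.0],
                   [-1.0, -2.0,  3.0]])

probs = sigmoid(scores)
picked = probs.argmax(axis=1)

# The sigmoid is monotonic, so argmax on the raw scores gives the same answer.
same_as_scores = np.array_equal(picked, scores.argmax(axis=1))

row_sums = probs.sum(axis=1)  # not necessarily equal to 1
print(picked, row_sums)
```

In the second row every classifier says "probably not my class", yet argmax still returns the least unlikely one.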

Accuracy

Finally, we can compute our model’s training accuracy by computing the percentage of successful predictions:

print("Training accuracy:", str(100 * np.mean(predictions == y)) + "%")

Training accuracy: 94.54%

An accuracy of 94.5% on the training set isn’t bad at all considering we have 10 classes and 400 features per image. Still, we could do even better if we decided to use a nonlinear model such as a neural network.
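A single accuracy number hides which digits the model struggles with. As a rough sketch, the same np.mean trick can be applied per class. The arrays below are made up for illustration and are not results from the digits model:

```python
import numpy as np

# Hypothetical true labels and predictions for 10 examples and 3 classes.
true_labels = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
guesses     = np.array([0, 1, 1, 1, 1, 2, 2, 0, 2, 2])

# Overall accuracy: fraction of examples predicted correctly.
overall = np.mean(guesses == true_labels)

# Per-class accuracy: restrict to the examples of each class in turn.
per_class = {
    int(c): float(np.mean(guesses[true_labels == c] == c))
    for c in np.unique(true_labels)
}
print(overall, per_class)
```

Running the same dictionary comprehension on predictions and y would give one accuracy value per digit.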
