Calculate Top N Accuracy Score of a Multi-Class Classifier using Scikit-Learn

May 9, 2019

Hello world!

So I had a client recently that had a need for a multi-class recommender system that would display the top 5 recommendations as opposed to just the top most preferred recommendation. This particular use case was a natural language processing problem that categorized job descriptions into one of over 1,000 categories. I decided to use the trusty Scikit-Learn library for this task as it is well maintained and runs very efficiently.

Everything seemed to be going well until it came time to compare, validate and QA my models using accuracy rates. Scikit-Learn’s accuracy_score function appeared to only calculate the accuracy score based on the top result rather than the top N results, so I had to jury-rig an alternative solution using the predict_proba function. Of course, I used other measures to validate my model, but this one was particularly important to our client.

The solution I came up with seemed to run fine and provided reasonable results, but then I noticed that the top 5 accuracy rates were actually LESS than the top 1 accuracy rates! For example, the top 1 accuracy rate might be 70% and then the top 5 accuracy rate (which inherently includes the top result as well) would be 52%. “How could that be?” I wondered. So I did some digging, posting, crowdsourcing, and then… several agonizing days later… eureka! I finally discovered that the index of the prediction results was automatically being reset to start at zero when I ran the data through the predictor function. So when I tried to merge that output back to the original dataset to calculate the accuracy, the indices no longer matched, and the wrong predictions were being paired with the input labels. That meant the accuracy score could not tally correctly, which is why my results were so off.
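To make the index problem concrete, here is a minimal sketch with made-up toy data (the labels, index values, and variable names are my own illustration, not the client data). Every prediction is correct by position, but merging on mismatched indexes pairs each label with the wrong prediction:

```python
import pandas as pd

# Hypothetical labels that kept their shuffled index after a train/test split
labels = pd.Series(["a", "b", "c"], index=[2, 0, 1], name="label")

# Predictions come back as a plain array, so a new DataFrame gets a fresh 0..n-1 index
preds = pd.DataFrame({"pred": ["a", "b", "c"]})

# Merging on index pairs rows by index value, not by position
bad = pd.merge(labels, preds, left_index=True, right_index=True)
bad_acc = (bad["label"] == bad["pred"]).mean()

# Resetting the label index first lines the rows up by position again
good = pd.merge(labels.reset_index(drop=True), preds, left_index=True, right_index=True)
good_acc = (good["label"] == good["pred"]).mean()

print("accuracy with mismatched indexes:", bad_acc)
print("accuracy after reset_index:", good_acc)
```

In this toy case the mismatched merge scores 0% while the aligned merge scores 100%, which is the same kind of distortion described above.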

So, in hopes of saving other users the same heartache that I went through, I decided to post my solution. If I can help at least one other person, I will consider writing this post time well spent.

C’est la vie!

Here we go!

The remainder of this post will walk you through the setup and implementation of the use case from start to finish.

Let’s begin by importing the necessary dependencies:

# For basic data wrangling
import pandas as pd
import numpy as np

# For modeling
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

Import your data:

df = pd.read_csv('data.csv', low_memory=False, thousands=',', encoding='latin-1')

# select only the columns you need
df = df[['CODE','JOB_DUTIES']]
df = df.rename(columns={"CODE": "label", "JOB_DUTIES": "text"})

Review the data for quality assurance purposes:

print("Quick preview of the data:")
print(df.head())
print(" ")
print("Here is what one full job description looks like (sample):")
print("")
print(df.iloc[733,1])

Clean the text:

# Remove white space
df['text'] = df['text'].str.replace(r'\n', ' ', regex=True).str.strip()
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)
df['text'] = df['text'].str.replace('/', ' ', regex=True).str.strip()

# lower case everything
df['text'] = df['text'].str.lower()

# Remove stopwords (requires the NLTK stopwords corpus: nltk.download('stopwords'))
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

# eliminate all other characters/numbers except alphas and spaces
list1 = []
whitelist = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')
for index, row in df.iterrows():
    list1.append(''.join(filter(whitelist.__contains__, row['text'])).strip())

# newdf has a fresh 0..n-1 index, so reset df's index before merging on index
newdf = pd.DataFrame({'text clean': list1})
df = df.reset_index()
df = pd.merge(df, newdf, left_index=True, right_index=True)

# keep only the label and the cleaned text, renamed back to 'text'
df = df[['label', 'text clean']]
df['text'] = df['text clean']
df = df[['label', 'text']]

Take a look at the cleaned data:

# Validate
print("Total row/col count:", df.shape)
print(" ")
print("Quick preview of the cleaned data:")
print(df.head())
print(" ")
print("Now the same full clean job description (sample):")
print("")
print(df.iloc[733,1])

Next let’s split our data into training and validation sets:

# split the dataset into training and validation datasets 
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['text'], df['label'])

# Reset the indexes on the validation sets (this is important later on)
valid_y = valid_y.reset_index(drop=True)
valid_x = valid_x.reset_index(drop=True)

# We will also copy the validation datasets to a dataframe to be able to merge later on
valid_x_df = pd.DataFrame(valid_x)
valid_y_df = pd.DataFrame(valid_y)

Validate above cell:

print("train_x row/col count:", train_x.shape)
print("train_x head:")
print(train_x.head())
print(" ")
print("valid_x row/col count:", valid_x.shape)
print("valid_x head:")
print(valid_x.head())
print(" ")
print("train_y row count:", train_y.shape)
print("train_y first 10 values:")
print(train_y[0:10])
print(" ")
print("valid_y row count:", valid_y.shape)
print("valid_y first 10 values:")
print(valid_y[0:10])

Feature Engineering

The next step is the feature engineering step. In this step, raw text data will be transformed into feature vectors and new features will be created using the existing dataset. We will use count vectors for the purpose of this post to obtain relevant features from our dataset.

What are count vectors?

A count vector representation is a matrix view of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document.
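Here is a tiny sketch of what that looks like in practice. The three "documents" are made up for illustration; the vectorizer settings mirror the ones used later in this post:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus: each string is one document
docs = ["manage sales team", "manage software team", "sales sales sales"]

vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
counts = vect.fit_transform(docs)  # sparse matrix: rows = documents, columns = terms

# Column order follows the learned vocabulary (term -> column index)
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
dense = counts.toarray()

print(vocab)  # ['manage', 'sales', 'software', 'team']
print(dense)
# [[1 1 0 1]
#  [1 0 1 1]
#  [0 3 0 0]]
```

The third row shows the "frequency" part: the word "sales" appears three times in that document, so its cell holds a 3.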

Normally, you will want to test several different types of features in various different models, but for this post, that is not the focus, so we will just keep things simple.

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(train_x)

# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)

Create Function to Train and Validate our Model

The function below will allow us to simply pass in our classifier, feature vectors that we created above and our original dataset and BOOM! We will have our results for both a top 1 accuracy rate and the top 5 accuracy rate.

def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x):

    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict class probabilities on the validation dataset
    # (the classifier must support predict_proba; for svm.SVC, pass
    # probability=True when you construct it, before fitting)
    n = 5
    probas = classifier.predict_proba(feature_vector_valid)

    # Identify the column indexes of the top n predictions
    # (argsort is ascending, so the LAST column holds the best prediction)
    top_n_predictions = np.argsort(probas, axis=1)[:, -n:]

    # then find the associated SOC code for each prediction
    top_class = classifier.classes_[top_n_predictions]

    # cast to a new dataframe (its index starts at zero)
    top_class_df = pd.DataFrame(data=top_class)

    # merge it up with the validation labels and descriptions
    # (valid_y and valid_x must have had their indexes reset so the rows line up)
    results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
    results = pd.merge(results, top_class_df, left_index=True, right_index=True)

    # Top 5 results conditions and choices: the label matches any of the 5 columns
    top5_conditions = [
        (results.iloc[:,0] == results[0]),
        (results.iloc[:,0] == results[1]),
        (results.iloc[:,0] == results[2]),
        (results.iloc[:,0] == results[3]),
        (results.iloc[:,0] == results[4])]
    top5_choices = [1, 1, 1, 1, 1]

    # Fetch Top 1 Result (column 4 is the highest-probability class)
    top1_conditions = [(results.iloc[:,0] == results[4])]
    top1_choices = [1]

    # Create the success columns
    results['Top 5 Successes'] = np.select(top5_conditions, top5_choices, default=0)
    results['Top 1 Successes'] = np.select(top1_conditions, top1_choices, default=0)

    # Convert to percentages before rounding to avoid floating point noise
    top5_rate = round(sum(results['Top 5 Successes']) / results.shape[0] * 100, 1)
    top1_rate = round(sum(results['Top 1 Successes']) / results.shape[0] * 100, 1)

    # Print the results
    print("Results: ")
    print("Top 5 Accuracy Rate = ", top5_rate)
    print("Top 1 Accuracy Rate = ", top1_rate)
    return top5_rate
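If you would rather skip the merge step entirely, the same top N check can be sketched with plain NumPy, which avoids the index alignment issue because everything is compared by row position. This is an alternative sketch, not the function used in this post; the helper name top_n_accuracy and the toy probabilities are my own assumptions for illustration:

```python
import numpy as np

def top_n_accuracy(probas, classes, y_true, n=5):
    """Fraction of rows whose true label is among the n highest-probability classes."""
    # Column indexes of the n largest probabilities in each row
    top_n_idx = np.argsort(probas, axis=1)[:, -n:]
    # Map column indexes back to class labels
    top_n_labels = np.asarray(classes)[top_n_idx]
    # A row is a hit if the true label appears anywhere among its top n labels
    hits = (top_n_labels == np.asarray(y_true)[:, None]).any(axis=1)
    return hits.mean()

# Hypothetical predict_proba output for 4 samples and 3 classes
classes = np.array(["a", "b", "c"])
probas = np.array([[0.70, 0.20, 0.10],
                   [0.10, 0.30, 0.60],
                   [0.20, 0.50, 0.30],
                   [0.40, 0.35, 0.25]])
y_true = ["a", "b", "c", "c"]

top1 = top_n_accuracy(probas, classes, y_true, n=1)
top2 = top_n_accuracy(probas, classes, y_true, n=2)
print(top1, top2)  # 0.25 0.75
```

With a fitted classifier you would pass classifier.predict_proba(xvalid_count), classifier.classes_ and valid_y. Note that the top N rate can never be lower than the top 1 rate here, which is a handy sanity check.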

Model Building and Validation Time!

The final step in the text classification framework is to train a classifier using the features created in the previous step. There are many different choices of machine learning models which can be used to train a final model. We will use a Naive Bayes classifier for the sake of demonstration here.

What is Naive Bayes?

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. You can read more here: https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
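For our purposes, the important part is that a fitted Naive Bayes model can return a probability for every class, not just the single best guess. Here is a small sketch on a made-up count matrix (the data and class names are hypothetical) showing how predict_proba columns line up with classes_:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count matrix: rows are documents, columns are term counts
X = np.array([[3, 0, 0], [2, 1, 0], [0, 3, 0], [0, 2, 1], [0, 0, 3], [1, 0, 2]])
y = np.array(["x", "x", "y", "y", "z", "z"])

clf = MultinomialNB().fit(X, y)

# One row per document, one column per class, in the order given by clf.classes_
probas = clf.predict_proba(X)
print(clf.classes_)
print(probas.round(3))

# The highest-probability column maps back to the same label that predict() returns
best_labels = clf.classes_[probas.argmax(axis=1)]
print(best_labels)
```

Sorting each row of that probability matrix is what lets us pull out the top N classes instead of only the top one.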

Run it!

Now you can simply run the above function by calling:

train_model(naive_bayes.MultinomialNB(),xtrain_count,train_y, xvalid_count,valid_y, valid_x)

And your results would look something like the following:

Top 5 Accuracy Rate =  98.1
Top 1 Accuracy Rate =  76.4
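As a cross-check, newer releases of Scikit-Learn (0.24 and later, which came out after this post was written) include a top_k_accuracy_score function in sklearn.metrics. The sketch below uses made-up probabilities and integer labels purely for illustration; with a real model you would pass the output of predict_proba and labels=classifier.classes_:

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

# Hypothetical predict_proba output for 4 samples and 3 classes (0, 1, 2)
probas = np.array([[0.70, 0.20, 0.10],
                   [0.10, 0.30, 0.60],
                   [0.20, 0.50, 0.30],
                   [0.40, 0.35, 0.25]])
y_true = np.array([0, 1, 2, 2])

# labels tells the scorer which class each probability column belongs to
top1 = top_k_accuracy_score(y_true, probas, k=1, labels=[0, 1, 2])
top2 = top_k_accuracy_score(y_true, probas, k=2, labels=[0, 1, 2])
print(top1, top2)  # 0.25 0.75
```

If the built-in score and the hand-rolled one disagree, that is a strong hint that something (like the indexes) is out of alignment.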

And that’s all she wrote! I hope you enjoyed this post. Please write any questions or comments below and I’ll do my best to respond.