Ultimate Guide: F1 Score In Machine Learning

While you may be more familiar with choosing Precision and Recall for your machine learning algorithms, there is a statistic that takes advantage of both. 

The F1 Score is a statistic that also measures the performance of a machine learning algorithm. Understanding when and how to use this scoring metric can be tricky, but our guide will help.

After reading this guide, you’ll have learned the following:

  • Why we use the F1 Score in Data Science
  • When You Should Use It
  • Python Code To Calculate it
  • Opinions on What A Good Score Is
  • Exploring How it Helps Class Imbalance
  • Discussion on Types Of Problems You Can Use It For

Never too late to add another tool to your data science toolkit!


Why Do We Use The F1 Score In Data Science?

There are a lot of different ways to evaluate the performance of a machine learning model.

One standard evaluation metric utilizes the F1 Score, which balances Precision and Recall.

Recall and Precision are a little tricky to understand, but basically:

Precision is the ability of our algorithm not to label a negative sample as positive.

Recall is the ability of our algorithm to find all positive labels.

See, that wasn’t so bad!


While you may have to re-read that a couple of times to understand, these two work together very well to give us a complete picture of how our classification algorithm is doing – irrespective of the number of samples.

This is why we use the F1 Score; combining Precision and Recall into one metric is an excellent way to get a general idea of how well a model performs, irrespective of sample counts.
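As a quick illustration (a minimal sketch with made-up labels), scikit-learn exposes all three of these metrics directly:

from sklearn.metrics import precision_score, recall_score, f1_score

# hypothetical true labels and predictions for a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))  # how many predicted positives were correct
print(recall_score(y_true, y_pred))     # how many actual positives we found
print(f1_score(y_true, y_pred))         # balances the two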

While other metrics can be used, the F1 Score is a great place to start and will give you a good idea of how your classifier is doing.


When do you use the F1 Score over other evaluation metrics?

Generally speaking, the F1 Score is best used when you want to strike a balance between Precision and Recall.


If you care about minimizing false positives and negatives, then the F1 Score may be a good choice.

Think about scenarios where a false negative is just as bad as a false positive; these would be great scenarios for utilizing the F1 Score.


When should you not use the F1 Score as an evaluation metric?

The F1 Score is not always the best metric to use. In certain scenarios, Precision or Recall may matter more than the other, in which case you would want to use a different evaluation metric.

For example, think about hiring at a big technology company.

These companies take hiring, and its impact on their culture, seriously and test candidates heavily because they want to avoid hiring the wrong person.

In this scenario, they’re okay with declining some “good” candidates (false negatives) to make sure that they don’t hire any candidates that could be detrimental to their work culture (false positives).

In the above situation, you’d want to maximize Precision since a Type 1 error (a false positive) is much more costly (in the business’s eyes) than a Type 2 error (a false negative).



How do I calculate the F1 Score in Python?

Remember, our formula for the F1 Score is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
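If you want to sanity-check the formula by hand, here’s a tiny illustrative calculation with made-up Precision and Recall values:

# hypothetical Precision and Recall values
precision = 0.75
recall = 0.60

# harmonic mean of the two
f1 = 2 * (precision * recall) / (precision + recall)

print(round(f1, 4))  # 0.6667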

Here’s a quick example of using the F1 Score to evaluate the training predictions of my RandomForestClassifier.

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# read in our data
df = pd.read_csv('cars.csv')

# grab some data to do a quick classification
df = df[['transmission','feature_1','feature_2',
        'feature_3','feature_4','feature_5',
        'feature_6','feature_7','feature_8',
        'feature_9']]

# original df (once filtered)
df


# define our encoder
label_encoder = preprocessing.LabelEncoder()

# apply our label encoder to each column
df_labels = df.apply(label_encoder.fit_transform)

# we can see the labels
df_labels


# separate x and y
x = df_labels.iloc[:,1:].values

y = df_labels.iloc[:,0].values

# build classifier
clf = RandomForestClassifier(max_depth=5, random_state=32).fit(x,y)

# get some predictions
predictions = clf.predict(x)

# our training score
score = f1_score(y, predictions)

print(f'\nOur Training F1 Score For Our Classifier: {round(score, 4)}')



What is a good F1 score?

The closer your F1 Score gets to 1, the better.

I like to see values above 0.5 before feeling confident in my model.

An F1 Score of 0 is terrible because it means either your Precision or your Recall (or both) is 0.


Since your F1 Score is the harmonic mean of Precision and Recall, the closer you get to 1, the closer your classifier is to becoming “perfect.”

If your F1 Score isn’t where you’d like to see it, you’ll need to dive deep into your Precision and Recall to figure out what went wrong.

I usually set up a confusion matrix to see which type of error (type 1 or type 2) is most prominent.

Once identified, I’ll go back to my data/model and make modifications to improve that specific type of error.
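For reference, here’s a minimal sketch (with made-up labels) of how to pull those error counts out of scikit-learn’s confusion matrix:

from sklearn.metrics import confusion_matrix

# hypothetical true labels and predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# for binary labels, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f'Type 1 errors (false positives): {fp}')
print(f'Type 2 errors (false negatives): {fn}')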


Is The F1 Score The Same As Accuracy?

The F1 Score is not the same as accuracy, and that difference is great for our classification algorithms.

This is because True Negatives are not monitored in the F1 Score.

If we remember from above, the F1 Score is built entirely from Precision and Recall.

Below are the formulas for Precision and Recall:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)


We quickly notice that True Negatives (TN) are nowhere to be found.

This means that our F1 Score does not care how accurately we predict True Negatives, and the number of True Negatives in our data will not influence our Score.

This is not true for accuracy; since accuracy is just the number of correctly classified points divided by the size of the test set, the number of True Negatives will influence our accuracy by improving the Score.
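A quick numerical sketch (with made-up counts) shows the difference: piling on more True Negatives pushes accuracy up while the F1 Score stays exactly where it was.

# hypothetical confusion-matrix counts
tp, fp, fn = 10, 5, 5

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

# accuracy with 80 True Negatives vs. 800 True Negatives
print((tp + 80) / (tp + fp + fn + 80))    # 0.9
print((tp + 800) / (tp + fp + fn + 800))  # roughly 0.988
print(round(f1, 4))                       # 0.6667 either way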


Does the F1 score help with class imbalance in machine learning?

As a machine learning engineer, it’s important to be careful when addressing “class imbalance.”

Class imbalance can mean many things, and it’s important to be clear about your definition.

Class imbalance could refer to imbalances between classes in multi-class classification problems, in which case F1 wouldn’t be helpful.

This type of class imbalance would need to be addressed through feature engineering.

Class imbalance could also refer to the overall imbalance of True Negatives vs. True Positives, in which case F1 would be helpful by still giving you an accurate scoring metric.

For example, in fraud detection datasets, you’ll usually have far fewer “fraud” examples than “non-fraud” examples.


If you used an accuracy metric, your model would always look great since it could call everything “non-fraud” and still score very high.

The F1 Score eliminates this problem by leaving True Negatives out of the scoring metric.
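To make that concrete, here’s a minimal sketch with a made-up, heavily imbalanced label set, where a model that calls everything “non-fraud” scores high on accuracy but gets an F1 Score of 0:

from sklearn.metrics import accuracy_score, f1_score

# hypothetical labels: 98 non-fraud (0) and 2 fraud (1)
y_true = [0] * 98 + [1] * 2

# a "model" that simply predicts non-fraud every time
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.98, looks great
print(f1_score(y_true, y_pred))        # 0.0, tells the real story

(scikit-learn will warn that Precision is undefined here, since nothing was predicted positive, and report the F1 Score as 0.)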

When thinking about class imbalance, make sure you’re clear about which definition you’re using. Otherwise, you could end up getting the wrong results.


Is F1 score a good metric to use for classification?

The F1 Score is an excellent metric to use for classification because it considers both the Precision and Recall of your classifier.

In other words, it balances the two types of errors that can be made (Type 1 and Type 2).

This is crucial in situations where the cost of each type of error is the same. 

For example, if you were trying to predict whether or not it was going to rain on an upcoming day, and somebody had to change their plans whenever you were wrong, being wrong in either direction would be equally detrimental.

In this case, it’s essential to have both high Precision and Recall, so the F1 Score is an excellent metric to use.

For another example, if you were classifying some disease, missing a scenario where someone has the disease is much worse than accidentally diagnosing someone who doesn’t have it.

In this case, Recall would be a better scoring metric than the F1 Score because you’d want to maximize it, ensuring that you’re correctly classifying as many positive cases as possible.


Frequently Asked Questions:

Below, we hope to answer some of the “quick” questions that come up with the F1 Score.

Send us an email if you have any more we should add.


Can you use the F1 Score in regression problems?

Many make the mistake of assuming the F1 Score can be used in regression problems, but it is strictly a classification scoring metric.

This confusion usually comes from seeing the F1 Score used with logistic regression, which, despite its name, is a classification algorithm.


Is the F1 Score better than using Precision and Recall?

The F1 Score can be better than using Precision and Recall in scenarios where these two need to be balanced against each other. The business problem you are solving will dictate which metric is better.


Can the F1 Score be used in multi-class problems?

The F1 Score can be applied in multi-class problems, but it will have to be used class by class. You will have to look at the F1 Score for each class to see how well your classifier does for each.
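As a quick sketch (with made-up multi-class labels), passing average=None to scikit-learn’s f1_score returns one score per class:

from sklearn.metrics import f1_score

# hypothetical true labels and predictions for three classes
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 1, 2, 0, 1, 1]

# one F1 Score per class
print(f1_score(y_true, y_pred, average=None))

# or collapse them into a single number with macro averaging
print(f1_score(y_true, y_pred, average='macro'))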

Stewart Kaplan