Machine Learning: High Training Accuracy And Low Test Accuracy

Have you ever trained a machine learning model and been really excited because it had a high accuracy score on your training data, but disappointed when it didn’t perform as well on your test data? (We’ve all been there.)

This is a common problem that ALL data scientists face.

But don’t worry; we know just the fix!

In this post, we’ll talk about what it means to have high training accuracy and low test accuracy and how you can fix it.

However, we want to emphasize that relying on a single train test split is probably the wrong way to evaluate your model, and another technique, cross-validation, could give you much better insight into how your model really performs.

So, stay tuned and get ready to become an expert in machine learning!

Why Do We Need To Score Machine Learning Models?

Like in sports, where you keep score to track how you’re doing, in machine learning, we also need to score our models to see how well they perform.

This is important because you need to track your model’s performance to know if it’s making any decent predictions.

To score our models, we use metrics.

Metrics are tools that help machine learning engineers and data scientists measure the performance of our models.

There are TONS of different metrics, so it’s essential to understand which metrics are best for your problem.

Hint: accuracy is not always the best fit!

For example, if you’re building a model to predict whether a patient has a particular disease, you might use metrics like accuracy, precision, and recall to measure its performance.

On the other hand, if you’re building a model to predict the price of a house, you might use metrics like mean absolute error or root mean squared error.
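
To make the two families of metrics concrete, here is a minimal sketch in plain Python. The labels and house prices are made-up numbers we chose for illustration, not results from a real model.

```python
import math

# Hypothetical labels for a disease classifier: 1 = has disease, 0 = healthy.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are real
recall = tp / (tp + fn)              # share of real positives that were caught

# Hypothetical house prices, in thousands of dollars.
price_true = [200.0, 350.0, 500.0, 275.0]
price_pred = [210.0, 330.0, 520.0, 275.0]

errors = [p - t for t, p in zip(price_true, price_pred)]
mae = sum(abs(e) for e in errors) / len(errors)             # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error

print(f"accuracy={accuracy}, precision={precision}, recall={recall}")
print(f"MAE={mae}, RMSE={rmse}")
```

Notice that RMSE comes out larger than MAE here, because squaring the errors punishes the bigger misses more heavily.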

What Does High Training Accuracy and Low Test Accuracy Mean?

When you train a machine learning model, you split your data into training and test sets.

The model learns from the training set, and then you use the test set to see how well the model actually performs on new data.

If you find that your model has high accuracy on the training set but low accuracy on the test set, this usually means that you have overfit your model.

Overfitting occurs when a model too closely fits the training data and cannot generalize to new data.

In other words, your model has memorized the training data but fails to predict accurately on data it has yet to see.
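
Here is a rough sketch of what this looks like in practice, assuming scikit-learn is installed. The synthetic dataset, the 20% label noise, and the choice of an unconstrained decision tree are all our own assumptions, picked because such a tree can memorize noisy training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (flip_y assigns 20% of labels at random).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

# Hold out 30% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A tree with no depth limit keeps splitting until it fits the training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The exact test score depends on the data and the seed, but the training score should sit well above it, which is the overfitting signature described above.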

This can have a few different causes.

First, it could simply mean that accuracy isn’t the right metric for your problem.

For example, suppose you’re building a model to predict whether a patient has a certain disease. In that case, accuracy might not be the best metric to use because you want to be sure that you catch all instances of the disease, even if that means having some false positive results. In scenarios like this, accuracy can be misleading because your dataset contains very few actual positive cases, so a model can score well while missing most of them.
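
A tiny illustration of the problem, using made-up numbers: imagine 100 patients of whom only 5 have the disease, and a lazy "model" that predicts healthy for everyone.

```python
# Hypothetical imbalanced dataset: 5 sick patients (1) and 95 healthy ones (0).
y_true = [1] * 5 + [0] * 95

# A useless model that always predicts "healthy".
y_pred = [0] * 100

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"accuracy: {accuracy}")  # looks great
print(f"recall:   {recall}")    # catches no sick patients at all
```

The accuracy is 0.95 even though the model never finds a single case of the disease, which is why recall is often the more useful number for problems like this one.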

Another cause of high training accuracy and low test accuracy is that you simply need a better model. This could be because your model is too complex, so it fits noise in the training data, or because it’s not capturing the underlying patterns in the data.

In this case, you should try a different model or change the model parameters you’re using.
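
As one example of changing a model parameter, the sketch below compares a decision tree with no depth limit against one capped at `max_depth=3`. It assumes scikit-learn is installed; the synthetic data and the depth of 3 are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption made for this illustration).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Map each max_depth setting to its (train accuracy, test accuracy) pair.
results = {}
for depth in (None, 3):  # None means the tree can grow without limit
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    gap = train_acc - test_acc
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}, gap={gap:.2f}")
```

A simpler tree can no longer memorize every training row, so its training accuracy drops. What you are looking for is whether the gap between the two scores shrinks without the test score getting worse.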

Should Training Accuracy Be Higher Than Testing Accuracy?

In machine learning, it’s typical for the training accuracy to be a bit higher than the testing accuracy. This is because the model was fit to the training data, so it’s expected to perform slightly better on data it has already seen.

However, if the difference between the training and testing accuracy is too large, this could indicate a problem.

You generally want the difference between the training and testing accuracy to be as small as possible. If the difference is too large, it could mean your model is not performing well on new data and needs improvement.

It’s important to remember that slight overfitting is impossible to avoid entirely. However, if you see a large difference between the training and testing accuracy, it’s a sign that you may need to make changes to your model or the data you’re using to train it.

However, in the next section, we argue that you should completely change how you do this WHOLE process.

Should I Even Be Testing My Model This Way?

When building a machine learning model, you’ve probably been told a thousand times that it’s essential to split your data into a training set and a test set to see how well your model is performing. (This is called a train test split.)

However, a train test split only uses a single random subset of your data as the test set…

This means that you’re only getting a single score for your model, which might not represent how your model would perform over all of the data.

Think about it this way: what if you evaluated your model on a different “test” set and got a completely different score? Which one would you report to your manager?
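
You can see this for yourself with a quick experiment. The sketch below, which assumes scikit-learn is installed, repeats the same train test split with five different random seeds on the same synthetic dataset. The dataset, the model, and the seeds are all arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# One fixed synthetic dataset (an assumption made for this illustration).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

scores = []
for seed in range(5):
    # Only the random split changes between runs; data and model stay the same.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

for seed, score in enumerate(scores):
    print(f"split seed {seed}: test accuracy = {score:.3f}")
print(f"spread between best and worst split: {max(scores) - min(scores):.3f}")
```

Any spread you see here comes purely from which rows happened to land in the test set, not from a change in the model.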

Cross Validation is Superior To Train Test Split

Cross-validation is a method that solves this problem by giving all of your data a chance to be both the training set and the test set.

In cross-validation, you split your data into multiple subsets and then use each subset as the test set while using the remaining data as the training set. This means you’re getting a score for your model on all the data, not just one random subset.

The score from cross-validation is a much better representation of your model’s performance than a single train test split score.

This is because the cross-validation score is the average test score from each subset of your entire dataset, not just one random part.

This gives you a more accurate picture of how well your model is actually performing and helps you make better decisions about your model.
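
A minimal sketch of this with scikit-learn's `cross_val_score` might look like the following. The synthetic dataset, the shallow decision tree, and the choice of 5 folds are our assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption made for this illustration).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

model = DecisionTreeClassifier(max_depth=3, random_state=0)

# cv=5 splits the data into 5 folds; each fold is the test set exactly once
# while the model is trained on the other 4 folds.
fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

for i, score in enumerate(fold_scores, start=1):
    print(f"fold {i}: accuracy = {score:.3f}")
print(f"mean accuracy: {fold_scores.mean():.3f}")
print(f"std deviation: {fold_scores.std():.3f}")
```

Reporting the mean together with the standard deviation tells your audience not just how well the model does on average, but also how much that score moves from one subset to another.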

Can You Always Use Cross Validation?

Standard cross-validation assumes your data points are independent of each other. This means things like time-series data or other non-independent data are off-limits for this kind of cross-validation, because the folds would let the model train on information from the future or from closely related rows. While you can write a book on this topic (and we won’t cover it here), we wanted to emphasize this before Cross Validation becomes your only go-to modeling method.