Stop Overfitting, Add Bias: Generalization In Machine Learning

It’s a common misconception during model building that the goal is to get the most accurate model possible on your training data.

In reality, this is often not the case.

Creating a model that knows the training data too well can lead to decreased performance when it is time to put it into production.

In this 3-minute read, we’ll discuss generalization in machine learning and how you can avoid overfitting your models so they perform better on data they haven’t seen.

Short and Sweet!


What is Generalization in Machine Learning?

When humans learn something, we don’t just try to memorize every single example we see.

We simply can’t, as we do not have the brain power (or at least I don’t) to remember a lot of information without becoming overwhelmed.

Instead, we try to find patterns and generalize from them.

For instance, if we see a bunch of different animals with two eyes, we might conclude that two eyes are a general rule for animals.

We’ve generalized from the examples we’ve seen.

Machine learning works in a similar way.

Instead of trying to remember everything, which is error-prone, doesn’t work well for complex problems, and doesn’t carry over to new unseen data, machine learning algorithms find patterns within our training data and apply those patterns to new unseen data.

If our algorithm just memorized our training data, it would be great at creating predictions for our training data – but it would be awful at making predictions for anything it hadn’t seen before.
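
To make the memorization point concrete, here’s a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data is synthetic (a hypothetical linear trend plus noise), the "memorizer" is a 1-nearest-neighbor lookup, and the "generalizer" is a simple least-squares line.

```python
import random

random.seed(0)

def true_signal(x):
    # Hypothetical ground truth: a simple linear trend.
    return 2.0 * x + 1.0

def make_data(n):
    # Noisy observations of the trend at random x positions.
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, true_signal(x) + random.gauss(0, 2.0)) for x in xs]

def mse(model, data):
    # Mean squared error of a model on a list of (x, y) pairs.
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train, test = make_data(100), make_data(1000)

def memorizer(x):
    # "Memorizer": returns the label of the closest training point.
    return min(train, key=lambda p: abs(p[0] - x))[1]

# "Generalizer": least-squares straight line fitted to the training data.
mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def line(x):
    return slope * x + intercept

print("memorizer train/test MSE:", mse(memorizer, train), mse(memorizer, test))
print("line      train/test MSE:", mse(line, train), mse(line, test))
```

The memorizer scores a perfect zero error on the training set because it simply looks up the answer, yet it does worse than the plain line on the fresh test set because it has also memorized the noise.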

A model’s ability to handle new and unseen data accurately is called generalization.

Generalization is an essential concept in machine learning because it allows us to take what the algorithm has learned and apply it to new situations.

 


Bias Vs. Variance Tradeoff (Underfitting Vs. Overfitting)

When building machine learning models (for production!!), our goal is to find the right balance between bias (generalizability) and variance (fitting to the current training set) to create a model that both understands our current training data and can handle new unseen data.

An easier way to understand this is to think of variance as how closely your model clings to your current training data and bias as the simplifying assumptions that let it predict all things in the universe.

As you push the accuracy on your current training data higher and higher, you tend to lose the ability to predict all things in the universe.

And vice versa, as you increase your model’s ability to predict all things in the universe, you’ll lose current accuracy on your specific training set.

While this looks pretty theoretical, there are formulaic ways to ensure we’ve reached a point where our model is accurate and generalizable.

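The expected squared error of a model splits into three parts:

$$\text{MSE} = \text{Bias}^2 + \text{Variance} + \sigma^2$$

Here $\sigma^2$ is the irreducible noise in the data, which no model can remove. If we want to minimize our mean squared error (our best accuracy), we can’t just drive bias or variance to zero on its own; we have to find the point where the sum of our squared bias and our variance is smallest.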
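
The bias and variance pieces of the error can also be estimated numerically. Here’s a rough sketch, assuming a made-up true function (a sine wave), an assumed noise level, and two toy models: a constant predictor (very simple) and a 1-nearest-neighbor predictor (very flexible). It refits each model on many fresh training sets and looks at the spread of predictions at a single point.

```python
import math
import random
import statistics

random.seed(1)

NOISE_SD = 0.3  # assumed noise level in the labels
X0 = 0.25       # the single point where we inspect predictions

def f(x):
    # Hypothetical true function.
    return math.sin(2 * math.pi * x)

def sample_training_set(n=30):
    xs = [random.random() for _ in range(n)]
    return [(x, f(x) + random.gauss(0, NOISE_SD)) for x in xs]

def constant_model(train):
    # Very simple model: always predicts the mean label.
    mean_y = statistics.fmean(y for _, y in train)
    return lambda x: mean_y

def nearest_neighbor_model(train):
    # Very flexible model: copies the label of the closest training point.
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def decompose(fit, trials=2000):
    # Refit on many fresh training sets and collect the predictions at X0.
    preds = [fit(sample_training_set())(X0) for _ in range(trials)]
    bias_sq = (statistics.fmean(preds) - f(X0)) ** 2
    variance = statistics.pvariance(preds)
    # Error against the noise-free target; the full MSE adds NOISE_SD ** 2.
    error = statistics.fmean((p - f(X0)) ** 2 for p in preds)
    return bias_sq, variance, error

for name, fit in [("constant", constant_model), ("1-NN", nearest_neighbor_model)]:
    bias_sq, variance, error = decompose(fit)
    print(f"{name:8s} bias^2={bias_sq:.3f} variance={variance:.3f} error={error:.3f}")
```

In this setup the constant model should show high bias and low variance, and the nearest-neighbor model the opposite. For both, the error column equals bias squared plus variance.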


Why Do We Want To Generalize Our Models?

It’s a common misconception that all models need to be generalizable. 

In reality, only production models or models used in diverse situations must generalize well.

Models built for one-off analysis or reports can sometimes benefit from high variance.

Think about it this way: if your boss wants to know how well the marketing department is doing and wants you to build a model to forecast how they’ll do next year, it would be silly to build a model that works for the marketing department and the engineering department!

You’d just build a model for the marketing department!

While this model may have high variance, it answers the question you initially had to solve.

On the flip side, if your boss wanted you to build a model that could predict how any business unit was doing at any time, it would make no sense to only use the marketing department, as your model wouldn’t be able to make good predictions for new unseen departments.


How To Know If Our Machine Learning Model Is Good For Production?

Knowing if a model is good enough for production is (literally) the million-dollar question. Generally, you’ll want a model that holds up under diverse testing (held-out data, different time periods, different segments) before moving it into production.

Once in production, you’ll want to slowly increase user adoption and keep a close eye on model drift and prediction accuracy.
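
As a rough idea of what watching drift and accuracy could look like, here’s a small sketch. The helper names (`drift_score`, `rolling_accuracy`), the sample values, and the alert threshold are all hypothetical; real monitoring setups usually track many features and use more careful statistics.

```python
import statistics

def drift_score(training_values, production_values):
    """How many training standard deviations the production mean has moved."""
    train_mean = statistics.fmean(training_values)
    train_sd = statistics.stdev(training_values)
    return abs(statistics.fmean(production_values) - train_mean) / train_sd

def rolling_accuracy(predictions, actuals, window=50):
    """Share of correct predictions among the most recent `window` cases."""
    recent = list(zip(predictions, actuals))[-window:]
    return sum(p == a for p, a in recent) / len(recent)

# Hypothetical feature values: the production data has shifted upward.
training_ages = [30, 32, 35, 31, 29, 34, 33, 30, 36, 28]
production_ages = [41, 44, 39, 45, 42, 40, 43]

DRIFT_THRESHOLD = 2.0  # assumed alert threshold, tune it for your use case

score = drift_score(training_ages, production_ages)
if score > DRIFT_THRESHOLD:
    print(f"Possible drift: production mean is {score:.1f} training SDs away")
```

A check like this only says the inputs look different from what the model was trained on. Pairing it with a rolling accuracy number tells you whether the shift is actually hurting predictions.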

The only way to know how your model will do in production is by putting it into production.


How to Generalize (Add Bias) To Your Machine Learning Models

There are a couple of ways to add bias to a model.

One of my favorite ways is utilizing bootstrap sampling; this will drive your model more toward the underlying population parameters and away from only focusing on your training data.
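
One common way to put bootstrap sampling to work is bagging: fit the same model on many resampled copies of the training set and average the predictions. The sketch below assumes that reading of the technique, uses synthetic data (a hypothetical `y = 3x` trend plus noise), and picks a deliberately jumpy 1-nearest-neighbor base model so the smoothing effect is easy to see.

```python
import random
import statistics

random.seed(2)

def bootstrap_sample(data):
    # Draw len(data) items with replacement from the original data.
    return random.choices(data, k=len(data))

def nearest_neighbor_predict(train, x):
    # High-variance base model: copy the label of the closest point.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def bagged_predict(train, x, n_models=25):
    # Fit the base model on many bootstrap resamples and average the results.
    preds = [nearest_neighbor_predict(bootstrap_sample(train), x)
             for _ in range(n_models)]
    return statistics.fmean(preds)

def make_data(n=30):
    # Hypothetical data: y = 3x plus Gaussian noise.
    xs = [random.random() for _ in range(n)]
    return [(x, 3 * x + random.gauss(0, 1.0)) for x in xs]

# Compare how much each predictor moves when the training set changes.
single, bagged = [], []
for _ in range(300):
    train = make_data()
    single.append(nearest_neighbor_predict(train, 0.5))
    bagged.append(bagged_predict(train, 0.5))

print("variance of single model:", statistics.pvariance(single))
print("variance of bagged model:", statistics.pvariance(bagged))
```

The bagged predictor should swing less from one training set to the next, which is the "less focused on this particular sample" behavior described above.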

Things like feature selection also lower variance (and increase bias) in your models since you’re lowering the amount of emphasis on the training data.

Imputing nulls with something other than the raw column means can also lower variance, as you won’t be as highly dependent on the data distributions of the sample for unknown values.


Do classical models generalize better than deep learning models?

Since the true parameters of the underlying population don’t care about the algorithm you use, the type of modeling you do usually has less effect on the generalizability of your models than you might expect.

While something like a linear regression will have more bias than a CNN or any other deep learning architecture, using one or the other won’t be a one-stop fix for poor production performance.


Dylan Kaplan