Something that isn’t talked about enough but silently haunts all machine learning practitioners.

Data leakage.

While data leakage is rarely discussed, I’ve seen it destroy production models.

This blog post will discuss detecting and preventing data leakage in machine learning models.

We will also explore some real-world examples of data leakage in machine learning.

**Table of Contents**show

What is Data Leakage in Machine Learning

Data leakage can be defined simply in one sentence: leaking training information into your testing set.

Remember, when we’re building models, our test set must be independent, and to our model, it should not exist until it sees it during testing.

If we somehow leak information from our training set into it, we’ve broken independence and allowed our model to cheat its way to a higher metric.

The Five Type Of Data Leakage

While there are probably more than these four, these are the four that I commonly run into in my day-to-day work as a machine learning engineer.

1.) Splitting Non-Independent Data

Every machine learning engineer knows that you need to scale your data.

But what only some machines learning engineers know is that when you do your scaling is nearly as crucial as when how you do it.

**Scaling data must be scaled after you’ve split up your training and testing data. **

Since most scaling functions use global arithmetic (mean, min-max scaling, etc.), you’ll accidentally leak that information into your training set.

This will bias your algorithm toward the mean of the entire dataset and not just your training data, giving it a heads-up on the testing data it’s about to see.

Remember, our job is to ensure that our test set is unseen to our algorithm, and scaling before splitting is one way to fail.

**Note: Do not create another scaler for your test set; use the one created for your training set also for your test set. This will give you the best predictor of production accuracy.**

2.) Splitting Non-Independent Data

Since you’ve started machine learning, you’ve been taught to split your data into training and testing portions, train your model on your training set, and then evaluate it on the testing set.

This is correct…if you do it right.

When testing with data that isn’t independent (like time series or other correlated data), **you can not randomly shuffle your data when you split it.**

While I could write a book for the reasons why it’s easy to see why in this picture below.

Let’s say you split the data above, and T4 (the orange one) was placed into the testing set, and the others in the training set.

Without doing any learning, you can make a pretty good guess at what that value is.

Since you can see both T3 and T5, you know that value has to fall somewhere between those.

Now, if you were to re-do the same drill as above, only on this image below.

Your guess would have been much different than the one you initially had.

This is how you must test non-independent data, train up until point T, and then test with all data afterward.

This problem arises in both shuffling train and test splits and cross-validation.

3.) Imputing Missing Values Before Splitting

By the same logic above, if we impute missing values based on global statistics from the whole dataset, we arbitrarily bias our training set towards our test set.

For all the same reasons as above, you must split your test set out and THEN build imputing models (even if you use mean or median) to calculate the missing values.

4.) Not Removing Duplicates

Imagine yourself as a machine learning algorithm; your goal was to determine how long it would take to walk across a bridge.

Now, imagine that you had this exact answer in your training set that you were given to learn all about this bridge.

You’d know the exact answer when presented with this same bridge in the test set without guessing!

You pull that information from your training set and push out the exact number!

Duplicate data gives us an unrealistic impression of how our machine learning algorithm is actually doing, as these duplicates will end up in both the training and test sets – giving our algorithm an edge in predictions.

5.) Logic Leak

While I’m not sure what to call this, a logic leak is a good name.

Imagine you were building a classifier that was trying to guess if someone was going to move into an apartment.

Your algorithm was given data from many banks all over the world.

One bank, in particular (we will call it BANK5), sends over a dataset that only had participants that ended up moving into the apartment.

Once you split up your dataset, your algorithm will learn that anything with the BANK5 variable automatically means “move-in” and inflate your accuracy on your test set.

Not only that, but since the bank didn’t send over any applicants that didn’t move into the apartment, your algorithm will not have the chance to balance out the positive values from the biased training set.

Finding data leakage like this is really hard, and many times is missed.

One strategy you can use if you run into this situation is to throw out the BANK indicator.

Allow your algorithm to learn the other parameters and treat all the separate bank data as if it came from one bank.

How To Detect Data Leakage During Data Science Pipelines

Hunting down data leakage can be tricky, but not if you’re equipped with the proper process.

Ask yourself these 5 questions.

- Did I Scale After I split
- Did I Impute After I split
- Did I delete duplicates?
- Is the data correlated in any way?
- Is this data time series?

If you answered yes to the first three and no to the bottom two, you’re on the right track.

If the bottom two are giving you trouble, you can use a holdout testing process with your data.

This involves splitting your dataset into training and testing sets without shuffling.

For example:

Train your model on 50% of the data, and test it on the 51% percentile.

Train your model on 85% of the data, and test it on the 86-90% chunks.

Use these scores for your metrics and decide whether your model is good.

Sadly, there is one final way of identifying data leakage – when you witness a reduction in production accuracy compared to the actual results seen during testing.

I’ve personally done this one a couple of times.

If you put a model in production and it’s not scoring near the metrics you had during training and testing, you’ve accidentally leaked data.

Head back to the drawing board and work out what went wrong; use the checklist above to find it.

Techniques To Minimize Data Leakage When Building Models

Minimizing data leakage when building models is simple; use the checklist below.

- Scale After Split.
- Impute After Split.
- Delete duplicates.
- Check For Correlated/Time-series Data.
- Check if your parameters are inherently biased one way.

If you’ve handled the items above, you should be clear of data leakage when building your models.

How Much Does Data Leakage Effect Machine Learning Models (With Example Code)

Below, I build two models, one shuffling time-series data and one not shuffling time-series data.

As we know, shuffling time-series data will inflate our metrics since we accidentally leak data, showing us a value much better than what we will get in production.

Let’s test it out.

We use this dataset below for both tests.

Note: Random Forests uses bootstrap sampling, if you don’t know what that is, check out that linked article.

```
import pandas as pd
import numpy as np
# read in our data
df = pd.read_csv('stock-data.csv')
# not going to deal with this atm
df = df.dropna()
df.head()
```

We will try to predict tomorrow’s close based on the day’s open, high, low, close, and adj close.

```
# build a pipeline with random splititng
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X = df.iloc[:,2:].values
y = df.iloc[:,-1].values
# note that shuffle is set to true
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=32, shuffle=True)
# same random state for both
rf = RandomForestRegressor(max_depth=5, random_state=32)
# fit the model
fitted_rf = rf.fit(X, y)
# get our predictions
y_pred = fitted_rf.predict(X_test)
#print the mean_squared_error
mean_squared_error(y_test, y_pred, squared=False)
```

Below is our mean squared error calculation for our test set.

From what we know from earlier, shuffling our time-series data should inflate our metrics.

In this scenario, our mean squared error **should be lower** than if we do not shuffle it.

Let’s test that theory by turning off shuffle within our splitting function.

**Note**: everything is the same in both pipelines besides the shuffle=false param.

```
# build a pipeline with WITHOUT random splititng
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X = df.iloc[:,2:].values
y = df.iloc[:,-1].values
# note that shuffle is set to False
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=32, shuffle=False)
# same random state for both
rf = RandomForestRegressor(max_depth=5, random_state=32)
# fit the model
fitted_rf = rf.fit(X, y)
# get our predictions
y_pred = fitted_rf.predict(X_test)
#print the mean_squared_error
mean_squared_error(y_test, y_pred, squared=False)
```

Below is our mean squared error calculation for our test set.

Just as we thought!

We lost nearly 20% of our evaluation metric just by correcting the shuffling method!

Remember this when you’re building your models and reporting your metrics.

Data leakage is a pain but must be taken seriously if you want your models to show stable and high performance in production.