Sampling Methods: Bootstrapping in Machine Learning

Bootstrapping is a resampling method, widely used in machine learning, that builds new datasets by sampling from an existing one with replacement.

It is widespread because it is flexible: it does not require anything other than your training dataset (which I presume you already have).

This blog will dive deeply into Bootstrapping, its use, its history, how to code it, and some advantages over another famous resampling method: cross-validation.

A lot of value for only 8 minutes of your time!

This one is a tad dense, but we’ll get through it!


What is The Bootstrap?

Bootstrap Sampling comes from the ideas around the Bootstrap itself. The Bootstrap is a flexible and powerful statistical tool that uses a single sample to estimate the true parameters of the population it came from, along with how uncertain those estimates are.

For example, if you were interested in a confidence interval for your population mean, the Bootstrap would give you a good estimate.

While that seems a bit wordy, it’s easiest to see with an example.

Let’s say we have this population with 70 values in it, and we want to estimate the mean.

Hint: The mean of the population is 7.264

Now imagine that we only had a sample of 10 values drawn at random from that population.

Those values are shown below:

[Image: sample for bootstrap]

If we take the mean of just this set, we get 9.1.

We can quickly see that this isn’t a great representation of our population mean.

That is where the idea around the Bootstrap comes in!

Now, let’s draw 5 random samples from our current set, each of length N (where N is the length of our original sample), with replacement (so the same value can appear more than once).

Those numbers can be seen here, with their respective means underneath.

[Image: all samples with means calculated]

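Now, if we take the mean of the means of our 5 bootstrap samples, we get 8.1.

In this example, that lands closer to the population mean than the 9.1 from our single sample. With only 5 resamples that is partly luck: over many resamples, the average of the bootstrap means settles near the original sample mean. The real payoff is the spread of those means, which tells us how uncertain our estimate is.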
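
The procedure above can be sketched in a few lines of NumPy. This is a minimal sketch rather than the exact computation behind the figures: the ten values are made up (chosen only so that their mean is 9.1, like the sample above), and the number of resamples and the seed are arbitrary. Besides the mean of the bootstrap means, it reports the spread of those means and a percentile confidence interval, which is where the Bootstrap tends to be most useful.

```python
import numpy as np

# Hypothetical sample of 10 values (NOT the values from the figure above);
# chosen only so that the sample mean is 9.1.
sample = np.array([12, 7, 9, 11, 8, 10, 6, 13, 9, 6])

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
n_resamples = 2000                   # arbitrary; more resamples give smoother estimates

# Each row is one bootstrap sample: N draws from `sample` with replacement.
boot_samples = rng.choice(sample, size=(n_resamples, sample.size), replace=True)
boot_means = boot_samples.mean(axis=1)

# The spread of the bootstrap means estimates the standard error of the sample mean.
std_error = boot_means.std(ddof=1)

# 95% percentile confidence interval for the population mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"sample mean:             {sample.mean():.2f}")
print(f"mean of bootstrap means: {boot_means.mean():.2f}")
print(f"bootstrap std. error:    {std_error:.2f}")
print(f"95% interval:            ({ci_low:.2f}, {ci_high:.2f})")
```

Note that the mean of the bootstrap means stays close to the sample mean; it is the interval around it that carries the extra information.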


What is Bootstrap Sampling?

Bootstrap sampling is a type of resampling where we create many new datasets by drawing from our original dataset (which stands in for the population) with replacement.

Each bootstrap data set is the same size as our original dataset. As a result, some observations may appear more than once in a given bootstrap data set – and some not at all.

Here’s an image that illustrates this idea, with n=3 observations:

[Image: example of bootstrap]
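
The same idea can be sketched in code. The snippet below is only an illustration under assumed values: it identifies a hypothetical dataset of n=3 observations by row index, draws a few bootstrap datasets, and prints which rows repeat and which are left out. The last part checks the well-known result that, for large n, about 63.2% (1 - 1/e) of the rows appear in any one bootstrap dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

# Hypothetical dataset with n = 3 observations, identified by row index.
n = 3
rows = np.arange(n)

for b in range(3):
    # Draw n row indices with replacement: one bootstrap dataset.
    drawn = rng.integers(0, n, size=n)
    # Rows that were never drawn are left out of this dataset.
    left_out = np.setdiff1d(rows, drawn)
    print(f"bootstrap dataset {b}: rows {drawn.tolist()}, left out: {left_out.tolist()}")

# For large n, roughly 63.2% of the rows show up in any one bootstrap dataset.
n_large = 100_000
drawn_large = rng.integers(0, n_large, size=n_large)
unique_fraction = np.unique(drawn_large).size / n_large
print(f"fraction of distinct rows drawn: {unique_fraction:.3f}")
```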


What is The Point of Bootstrap Sampling?

Since we know that the general idea of the Bootstrap is to estimate a population parameter, what if we could use the same idea to estimate the standard error of our model's predictions?

This does not give us the Bayes error rate itself, which is the lowest possible prediction error that any model could achieve (the irreducible error), but it does estimate the error our own model is likely to make on new data.

In simpler terms, we can use bootstrap sampling to estimate the prediction error of our machine learning algorithm, which gives us insight into how our algorithm would perform outside of just our original dataset.
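
One common way to do this is out-of-bag evaluation: fit the model on each bootstrap sample and score it on the rows that sample happened to leave out. The sketch below assumes synthetic data and uses a straight line fitted with `np.polyfit` as a stand-in model; the data, the model, and the number of resamples are all assumptions for illustration, not a prescribed setup.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=n)

n_resamples = 200  # arbitrary
oob_errors = []

for _ in range(n_resamples):
    # Bootstrap sample of row indices, drawn with replacement.
    train_idx = rng.integers(0, n, size=n)
    # Rows not drawn act as a held-out set for this resample.
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[train_idx] = False
    if not oob_mask.any():
        continue  # vanishingly unlikely for n = 200, but guard anyway

    # Stand-in model: a straight line fitted by least squares on the bootstrap sample.
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

    # Mean squared error on the rows the model never saw.
    predictions = slope * x[oob_mask] + intercept
    oob_errors.append(np.mean((y[oob_mask] - predictions) ** 2))

oob_errors = np.array(oob_errors)
print(f"estimated prediction MSE: {oob_errors.mean():.3f}")
print(f"spread across resamples:  {oob_errors.std(ddof=1):.3f}")
```

Because the noise in this synthetic data has variance 1, the estimated error should land near 1; that floor is the irreducible part that no model can remove.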


Where does the name “Bootstrapping” come from in machine learning?

The term “bootstrap” is derived from the expression “to pull oneself up by one’s bootstraps.” This idea is based on the eighteenth-century classic “The Surprising Adventures of Baron Munchausen” by Rudolph Erich Raspe.

Many attribute this naming to this crucial part of the book.

“The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.”

Note that this usage differs from the term “bootstrap” in computer science, which refers to “booting” a computer, although the derivation is similar.

References:

https://www.gutenberg.org/files/3154/3154-h/3154-h.htm 

https://hastie.su.domains/MOOC-Slides/cv_boot.pdf


Implementing Bootstrap Sampling in Python Pandas

import pandas as pd

# load our dataset and keep the first 100 rows
df = pd.read_csv('cars.csv').head(100)

# number of bootstrap samples to draw; here it matches the number
# of rows (100), but it can be any number you choose
n_bootstraps = len(df)

# create a list to store all of our bootstrap dataframes
dataframes = []

for _ in range(n_bootstraps):

    # each bootstrap dataframe has the same length as the original
    # and is drawn with replacement, so rows can repeat
    dataframes.append(df.sample(n=len(df), replace=True))

# preview the first five rows of each bootstrap dataframe
print([sample.head() for sample in dataframes])


Advantages of Bootstrapping In Machine Learning

Bootstrapping has tons of advantages in Machine Learning. Here are a few.


Improve Model Real-World Accuracy

Since we can train on many resampled versions of our data, bootstrapping helps our model generalize to the underlying population.

We now know this happens by resampling your data with replacement, which means some data points will be repeated in each new dataset and others left out, so each model sees a slightly different view of the data.

Averaging over those models reduces the variance of your estimates, which tends to lead to more accurate predictions and models that hold up better in production.


Increase Training Data Size

Let’s be honest; sometimes, we’re given datasets with insufficient data.

This is one of the main advantages of bootstrap sampling.

We can take a small dataset, whose estimates have extremely high variance because of the low number of data points, and perform our sampling.

This does not add new information, but it gives us many training sets to fit and average over, which trades a little bias for lower variance and helps the model converge and provide stable insights.


Disadvantages of Bootstrapping in Machine Learning

There are some disadvantages to bootstrapping that should be considered before using this method.


Independent Population

When sampling with replacement, there is an underlying assumption that your data points are independent.

Some data, like time series data, violate this assumption, so the traditional bootstrap sampling method would not work.

There are ways around this, like the Block Bootstrap.

https://www.youtube.com/watch?v=-M1UtvoajUY
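
As a rough illustration, here is a minimal sketch of one common variant, the moving block bootstrap: instead of drawing single points, it draws contiguous blocks, so the ordering inside each block is preserved. The function name, the block size, and the toy series are all assumptions made for this example, and the sketch assumes the block size is no larger than the series.

```python
import numpy as np

def moving_block_bootstrap(series, block_size, rng):
    """Resample a 1-D series by stitching together randomly chosen contiguous blocks."""
    series = np.asarray(series)
    n = series.size
    n_blocks = -(-n // block_size)  # ceiling division
    # Random start positions; every block fits fully inside the series.
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    blocks = [series[s:s + block_size] for s in starts]
    # Concatenate and trim so the result has the original length.
    return np.concatenate(blocks)[:n]

rng = np.random.default_rng(seed=7)  # arbitrary seed

# Hypothetical time series: the values 0..11 stand in for 12 ordered observations.
series = np.arange(12)
resampled = moving_block_bootstrap(series, block_size=4, rng=rng)
print(resampled)
```

Within each block of four, neighboring values still follow one another, which is the dependence a plain bootstrap would destroy.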


Computational Limits

Since the approach shown earlier creates N new datasets (one per row of the original), it is sometimes impossible to fit them all into RAM.

If our original dataset had 5,000 rows, we would create 5,000 new datasets of 5,000 rows each: 25 million rows in total.

While the number of bootstrap samples can be lowered, there is a computational and memory cap on bootstrap sampling that many run into.
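
One way to soften the memory cost, sketched below, is to generate bootstrap samples one at a time and keep only the statistic you care about. The helper name `bootstrap_samples`, the made-up `mpg` column, and the synthetic data are assumptions for illustration; only one resampled dataframe needs to exist at any moment.

```python
import numpy as np
import pandas as pd

def bootstrap_samples(df, n_resamples, seed=None):
    """Yield bootstrap samples one at a time instead of storing them all."""
    rng = np.random.default_rng(seed)
    for _ in range(n_resamples):
        # Row positions drawn with replacement, same length as the original.
        idx = rng.integers(0, len(df), size=len(df))
        yield df.iloc[idx]

# Hypothetical stand-in for the dataset; the column name "mpg" is made up.
data_rng = np.random.default_rng(0)
df = pd.DataFrame({"mpg": data_rng.normal(25, 5, size=5_000)})

# Keep only the statistic from each resample; each sample can be
# garbage-collected as soon as the next one is drawn.
boot_means = [
    sample["mpg"].mean()
    for sample in bootstrap_samples(df, n_resamples=500, seed=1)
]

print(f"{len(boot_means)} means stored, std. error ~ {np.std(boot_means, ddof=1):.3f}")
```

This trades memory for recomputation: the samples cannot be revisited later unless the same seed is used to regenerate them.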

Despite these drawbacks, bootstrapping can help create powerful models that genuinely represent real-world population parameters.


Should you only use bootstrapping methods in data science when your dataset is small?

Bootstrapping is great when you have a small dataset, but it can also be used for bigger datasets.

The main downside to using this technique with larger datasets is memory: the resampled datasets together can be larger than the RAM we have available.


What is the difference between Bootstrapping and Cross-Validation?

Bootstrapping and Cross-Validation are both sampling methods used in statistical inference.

In general, statistical inference uses data from a sample to make estimates or predictions about a population.

Both bootstrapping and cross-validation are used to estimate the error of our model or, more simply, how our machine-learning algorithm will do in a production system on unseen data.

The main difference between the two methods is that bootstrapping is a resampling technique, while cross-validation is a partitioning technique.

Bootstrapping involves random sampling with replacement from the training data set to create multiple new training sets.

When the models trained on these sets are averaged, this lowers the variance of our machine-learning model. This is great in situations where we are overfitting, but less helpful in situations where our variance is already low.

Cross-validation involves partitioning the training data set into multiple subsets (folds) and training the algorithm on all but one of them.

The performance of the algorithm is then evaluated on the held-out fold, and the process repeats until every fold has been held out once.

This means that cross-validation will take full advantage of our dataset, but without the resampling techniques from bootstrapping, it is confined to only training on the samples we possess.
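
To make the partitioning idea concrete, here is a minimal sketch of k-fold cross-validation written with NumPy only. The synthetic data, the straight-line model fitted with `np.polyfit`, and the choice of k=5 are assumptions for illustration. Unlike a bootstrap sample, the folds are disjoint: every row lands in exactly one fold and is held out exactly once.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=n)

k = 5  # number of folds, a common but arbitrary choice
folds = np.array_split(rng.permutation(n), k)  # disjoint partition of the row indices

fold_errors = []
for i, test_idx in enumerate(folds):
    # Train on every fold except fold i ...
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    # ... and evaluate on the held-out fold i.
    predictions = slope * x[test_idx] + intercept
    fold_errors.append(np.mean((y[test_idx] - predictions) ** 2))

print(f"{k}-fold CV estimate of MSE: {np.mean(fold_errors):.3f}")
```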


Compared to cross-validation, is Bootstrapping better for estimating the test error, standard deviation, or bias for our parameter estimates?

Bootstrapping is often better than cross-validation for estimating population parameters because the distribution of a statistic across the new datasets, built from your sample with replacement, approximates its true sampling distribution.

This will give you an edge in modeling, as the data you’re training on closely resembles real-life production data.

While this is great, it is not a one-size-fits-all method.

With the bias-variance trade-off, bootstrap sampling with replacement lowers variance.

This is great in situations where your model is overfitting but terrible when your model is underfitting.

This is because when you lower variance, bias tends to increase.

If your machine learning algorithm needs help understanding your training set, cross-validation may be a better fit.

Are there any advantages of bootstrapping over cross-validation?

The main advantage of bootstrapping over cross-validation is that bootstrapping will lower the variance in our machine-learning algorithm.

Trading a little extra bias for lower variance will often lead to better generalization in production systems.

This is great in overfitting models and something I do a ton during modeling.

However, we don’t always want to lower our variance; increasing our bias will sometimes lead to underfitting models.


Other Quick Machine Learning Tutorials

At EML, we have a ton of cool data science tutorials that break things down so anyone can understand them.

Below we’ve listed a few that are similar to this guide:

Dylan Kaplan