get_dummies in machine learning

A Simple Explanation of Using get_dummies in Machine Learning

If you’re new to machine learning or want a quick refresher on what get_dummies does, you’ve found the right place.

This blog post will quickly give an overview of what get_dummies is and how it can be used correctly in machine learning.

We’ll also provide some Python coding examples to see how it works in practice.

This is one to take advantage of!

How to use get_dummies in machine learning

import pandas as pd

df = pd.read_csv('cars.csv')[['color','transmission','model_name']]

df

(Image: the first few rows of the cars DataFrame, df)

# separate your x and y variables
x = df[['color','transmission']]

y = df[['model_name']]

# use pd.get_dummies to convert your predictor variables
x = pd.get_dummies(x, drop_first=True)

x

Notice the use of drop_first=True, and how we only have one transmission column even though there were two distinct transmission values in the original data.

(Image: the dummy-variable columns, named after the original categories, after the transformation)
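If you don’t have the cars.csv file handy, here’s a minimal, self-contained sketch that reproduces the same transformation. The data below is made up for illustration and isn’t the original dataset:

import pandas as pd

# hypothetical stand-in for cars.csv
df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green'],
    'transmission': ['manual', 'automatic', 'automatic', 'manual'],
    'model_name': ['civic', 'accord', 'corolla', 'camry']
})

# dtype=int forces 0/1 output (recent Pandas versions default to booleans)
x = pd.get_dummies(df[['color', 'transmission']], drop_first=True, dtype=int)

print(x.columns.tolist())
# ['color_green', 'color_red', 'transmission_manual']

Because drop_first=True drops the first (alphabetical) category of each column, ‘blue’ and ‘automatic’ are represented implicitly by rows of all zeros.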

What’s the point of get_dummies in machine learning?

In machine learning, we often have to deal with data that is all over the place.

This means we’ll have a mixture of variables, some numerical and some categorical features, within our dataset.

However, most machine learning algorithms can’t handle categorical variables directly, since distance metrics (like Euclidean distance) only work with numbers.

In a dataset about groceries, we might have some variables containing information about different fruit types.

In this case, the variable would be categorical because each value is a type of fruit rather than a number.


However, MOST (like 99%) machine learning algorithms can only work with numerical data. A few specialized algorithms, like k-modes, can handle categorical variables in their original form.

Therefore, we need to use a function from Pandas called get_dummies to convert our categorical data into a numeric form.

get_dummies works by creating a new column for each category (using the category as the column name), assigning a 1 to the rows that belong to that category and a 0 to the rows that don’t.

This process of converting categorical variables into numeric form is essential for building machine learning models.
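As a quick sketch of that behavior with a made-up fruit column (mirroring the grocery example above):

import pandas as pd

fruit = pd.DataFrame({'fruit': ['banana', 'apple', 'banana']})

# dtype=int forces 0/1 output (recent Pandas versions default to booleans)
print(pd.get_dummies(fruit, dtype=int))
#    fruit_apple  fruit_banana
# 0            0             1
# 1            1             0
# 2            0             1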

Why do we need to convert Categorical Variables For Modeling?

Categorical variables have a fixed number of categories or distinct groups.

For example, Fruit (Banana, Apple, Etc.), Marital Status (married, single, divorced), and House Ownership (own, rent, mortgage) are all examples of categorical variables.


Categorical variables are often used in statistical models and data science as predictor variables.

In these models, the categorical variable is dummy coded so that each category is represented by a separate binary variable (also called an indicator variable).

This breaks the categorical variable into a series of ‘dummy’ variables that can then be used in the model, without implying any ordering or distance between the categories.

It is also worth noting that some machine learning algorithms, such as decision trees and random forests, can handle categorical variables without needing to be converted into dummy variables first.

Many other machine learning algorithms, however, cannot, and for them categorical variables must be transformed into dummy/indicator variables before modeling.

Are there any downsides to using Pandas get_dummies?

Dummy variables tend to be the default, go-to representation for categorical variables in nearly every model.

However, using dummy variables in models that make column-based decisions can create problems: splitting one column into many sparse binary columns dilutes the variance the original column carried, potentially weakening a splitting criterion.


You’ll run into this with tree-based methods, like decision trees and random forest models.

Additionally, when performing get_dummies, you may run into multicollinearity if you do not end up with n-1 dummy columns, where n is the number of categories in your column.

This is because the full set of n dummy columns always sums to 1, so any one column is perfectly predictable from the others (with just two categories, the two columns are exact inverses of each other).

Many of the models you use assume the predictors are independent of one another, and multicollinearity violates that assumption.

This can quickly cause problems because it can reduce the model’s accuracy.
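Here’s a minimal sketch of that trap, again with made-up data: keep all n dummy columns and every row sums to 1, so any single column is fully determined by the others; drop_first=True removes the redundant column:

import pandas as pd

ownership = pd.Series(['own', 'rent', 'mortgage', 'own'])

full = pd.get_dummies(ownership, dtype=int)  # all n = 3 columns
print(full.sum(axis=1).unique())  # [1] -> the columns are perfectly collinear

reduced = pd.get_dummies(ownership, drop_first=True, dtype=int)  # n - 1 columns
print(reduced.columns.tolist())  # ['own', 'rent'] ('mortgage' was dropped)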

Is Sklearn’s One-Hot Encoder better than Pandas get_dummies?

I think Sklearn’s one-hot encoder is much better than Pandas’ get_dummies() function.

While get_dummies is a little easier to use, it gives you no way to save your “dummy columns” for later; the encoding is stateless.

While this doesn’t seem like a huge deal, in production systems, it can create massive problems.

Suppose you created a model that is now in production, and the incoming data contains a new category value. In that case, the size of your vector (that your ML model makes predictions from) is now different, since get_dummies will create columns for whatever categories appear in the data.

This will make it impossible for your machine learning model to make a prediction, and it will raise an error instead.

This is the last thing we want in a production system and something that pops up ALL THE TIME when using get_dummies.
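As a rough sketch of the Sklearn alternative (the data and file name below are placeholders, not anything from a real system): fit the encoder once on training data, persist it, and let handle_unknown='ignore' encode unseen categories as all zeros so the vector width never changes:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import joblib

# hypothetical training data
x_train = pd.DataFrame({'color': ['red', 'blue'],
                        'transmission': ['manual', 'automatic']})

# sparse_output is the sklearn 1.2+ name; older versions use sparse=False
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc.fit(x_train)

joblib.dump(enc, 'encoder.joblib')  # persist the fitted category mapping

# later, in production, a brand-new category shows up
enc = joblib.load('encoder.joblib')
x_new = pd.DataFrame({'color': ['green'], 'transmission': ['manual']})
print(enc.transform(x_new))
# [[0. 0. 0. 1.]] -> 'green' encodes as all zeros, so the vector keeps its width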

 

Other Quick Data Science Tutorials

At EML, we have a ton of fun data science tutorials that break things down so anyone can understand them.


Stewart Kaplan