One common task you will encounter while working in machine learning is cleaning clunky datasets.
This involves dealing with inconsistencies and errors in the data, as well as values that wouldn’t “make sense” to our models in their current form.
This blog post will discuss how to deal with zip codes so you can get them in a form your machine-learning models will love.
We will provide 3 Python code options to help you get started and explain some logic for each option below.
1.) Take The First 2-3 Digits Of The Zip Code
In many data sets, zip codes are included to add geographic information to specific regions.
Because there are so many different zip codes, this variable (if expanded into one column per zip code) can create far more columns and categories than desired.
A way to get around this is by creating your own encoding. Take the first few digits of each zip code and use that to create categories.
For example, many Florida zip codes start with “33”; we could keep the first two digits of each zip code, giving us a column representing some regions in Florida.
You could even go further and keep the first three digits, which usually will (loosely) break down each state into cities and towns.
This method is excellent because it provides some generalizability: if we see a new zip code that starts with “33”, our model can place it in the same Florida region as the other “33” zip codes. We do not have to worry about it being the first time we’ve seen that zip code.
import pandas as pd
# dataset we used
# https://www.kaggle.com/datasets/danofer/zipcodes-county-fips-crosswalk
# read in our csv, keeping ZIP as text so leading zeros are not lost
zips = pd.read_csv('zipcodes.csv', dtype={'ZIP': str})
# pad each zip code to five digits, then keep the first two
zips['ZIP2'] = zips['ZIP'].str.zfill(5).str[:2]
zips
print(f'Original Uniques {zips.ZIP.nunique()} vs New Uniques {zips.ZIP2.nunique()}')
Now that we’ve cut down our categories from ~39,000 to about 90, we can use a hashing encoder, a target encoder, or dummy variables to feed our model values it can work with.
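The three-digit variant mentioned above, followed by dummy encoding, can be sketched as follows. This is a minimal illustration, not the exact pipeline used for the dataset above: the helper name `zip_prefix`, the column name `ZIP3`, and the sample rows are all assumptions made for the example.

```python
import pandas as pd

def zip_prefix(zip_code, n_digits=3):
    """Return the first n_digits of a zip code, restoring leading zeros."""
    return str(zip_code).strip().zfill(5)[:n_digits]

# Made-up sample rows; 1001 stands in for "01001" after being read as an integer
train = pd.DataFrame({"ZIP": [33101, 33612, 32801, 1001, 10001]})
train["ZIP3"] = train["ZIP"].apply(zip_prefix)

# One dummy column per three-digit prefix
prefix_dummies = pd.get_dummies(train["ZIP3"], prefix="ZIP3")
print(prefix_dummies.columns.tolist())

# A zip code that never appeared in training still maps to a known prefix
unseen_prefix = zip_prefix(33109)
print(unseen_prefix, unseen_prefix in set(train["ZIP3"]))
```

Padding with `zfill(5)` matters for zip codes in the Northeast, which lose their leading zero whenever the column is parsed as a number.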
Reference: https://www.unitedstateszipcodes.org/
2.) Latitude and Longitude Centering For Easy Model Prediction
If you’re looking for another way to deal with your zip code data, you can map each zip code to a latitude and longitude and add those columns to your data frame.
This can be especially useful if you’re working with tree-based models, which handle numerical data in this form well.
Under the hood, a tree model splits on latitude and longitude thresholds, carving the map into rectangular regions, so nearby locations tend to land in the same leaf without needing a separate category for every zip code.
If your model later receives latitude and longitude coordinates it has never seen, it can still handle them, because the coordinates fall into one of those learned regions instead of an unknown category.
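To make the threshold idea concrete, here is a small sketch that assumes scikit-learn is installed. The coordinates and target values are made up for illustration, and `DecisionTreeRegressor` is just one possible tree model, not necessarily the one you would use in practice.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up (latitude, longitude) pairs and target values, for illustration only
X = np.array([
    [25.8, -80.2],   # Florida-like coordinates
    [27.9, -82.5],
    [28.5, -81.4],
    [40.7, -74.0],   # New York-like coordinates
    [42.7, -73.8],
    [43.0, -76.1],
])
y = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Each learned rule is a threshold on latitude or longitude
print(export_text(tree, feature_names=["lat", "lng"]))

# A coordinate that was not in the training data
new_point = np.array([[26.7, -80.1]])
prediction = tree.predict(new_point)
print(prediction)
```

The new point is not in the training data, yet it falls on the same side of the learned threshold as the other southern coordinates, so the tree assigns it their value.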
You might think you’ll need an API for this, but plenty of pre-built zip-code-to-coordinate “maps” are available online.
In the code below, I use a simple text file from GitHub to map longitude and latitude to my data frame:
import pandas as pd
# here's a list of all zip codes
# with latitude and longitude values
# https://gist.github.com/erichurst/7882666
# load in our data
zips = pd.read_csv('zipcodes.csv')
# load in our map from above
zipCodeMap = pd.read_csv('zip_codes.txt')
# merge our dataframes together with a join
mappedZipCodes = zips.merge(zipCodeMap, on='ZIP', how='left')
# here is our final result
mappedZipCodes
From here, you can drop the original “ZIP” column and proceed with modeling.
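One thing worth checking before you drop the column: a left join leaves missing coordinates for any zip code that is absent from the map. The sketch below shows one way to spot those rows using the `indicator` option of `merge`. The tiny data frames are made up, and the `LAT` and `LNG` column names are assumptions about the map file's header.

```python
import pandas as pd

# Made-up rows standing in for the two files above
zips = pd.DataFrame({"ZIP": [33101, 10001, 99999]})
zip_code_map = pd.DataFrame({
    "ZIP": [33101, 10001],
    "LAT": [25.78, 40.75],     # approximate values, for illustration only
    "LNG": [-80.19, -73.99],
})

# indicator=True adds a "_merge" column marking where each row was found
mapped = zips.merge(zip_code_map, on="ZIP", how="left", indicator=True)

# Rows present only in the left frame have no coordinates (NaN)
unmatched = mapped.loc[mapped["_merge"] == "left_only", "ZIP"].tolist()
print(unmatched)

# Keep matched rows only, then drop the helper columns before modeling
features = mapped[mapped["_merge"] == "both"].drop(columns=["ZIP", "_merge"])
print(features)
```

Whether you drop the unmatched rows, as done here, or fill them in some other way depends on how many there are and how your model treats missing values.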
Reference: https://gist.github.com/erichurst/7882666
3.) Using Dummy Variables To Transform Categorical Data
In some data sets, it might make more sense to treat zip codes as categorical variables instead of numerical ones.
This would allow you to use dummy variables.
If you had data on home prices in different zip codes, you could create a dummy variable for each zip code.
This differs from the first option we discussed, which shrank the potential feature space considerably by truncating the values.
Creating a new variable for each zip code can quickly cause dimensionality problems as the number of columns grows.
This is where your feature space becomes so large and sparse that models struggle to find any meaningful pattern in it.
There is another problem with handling zip codes this way since you’ll only have columns for the zip codes that you’ve “seen” in your training set.
This creates a model that can’t handle unseen zip codes.
While models like this aren’t necessarily “bad,” they should only be used in situations where you know there will not be any new zip codes coming down the pipeline.
While there are a lot of negatives, if you’re performing EDA and see you only have 1 or 2 unique zip codes and can guarantee there won’t be any new ones down the road, this route quickly becomes very practical.
import pandas as pd
# load in our data and sample 8 rows to simulate a small dataset
zips = pd.read_csv('zipcodes.csv').sample(n=8)
# create dummy variables only for zip
dummy_zips = pd.get_dummies(zips['ZIP'])
# join the dummy columns back to the original data on the index
combined = pd.concat([dummy_zips, zips], axis=1)
combined
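The unseen-zip-code limitation described above shows up as soon as new data arrives. Here is a minimal sketch with made-up zip codes, assuming you want the new data to have exactly the same dummy columns as the training data:

```python
import pandas as pd

train = pd.DataFrame({"ZIP": ["33101", "33101", "10001"]})
new_data = pd.DataFrame({"ZIP": ["10001", "60601"]})  # 60601 was never seen in training

train_dummies = pd.get_dummies(train["ZIP"], prefix="ZIP", dtype=int)
new_dummies = pd.get_dummies(new_data["ZIP"], prefix="ZIP", dtype=int)

# Align the new data to the training columns: columns for unseen zip codes
# are dropped, and training columns missing from the new data are filled with 0
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned)

# Rows that are all zeros are zip codes the model has no column for
no_column = aligned.sum(axis=1) == 0
print(new_data.loc[no_column, "ZIP"].tolist())
```

The all-zero row is the practical cost of this approach: the model receives no location signal at all for that record.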
Handling Zip Codes The Right Way For Model Improvement
We hope you found this post helpful – zip codes in machine learning can be a little tricky, but with the right tools in the toolbox, you should know how to fix your problem and continue modeling quickly.
Let us know in the comments below if you have any other tips or tricks on working with zip codes.
We love hearing from our readers and sharing new ideas with everyone.
Other Quick Machine Learning Tutorials
At EML, we have a ton of data science tutorials that break things down so anyone can understand them.
Below we’ve listed a few that are similar to this guide:
- Instance-Based Learning in Machine Learning
- Bootstrapping In Machine Learning
- Generalization In Machine Learning
- Epoch In Machine Learning
- Understanding The Hypothesis In Machine Learning
- get_dummies() in Machine Learning
- Verbose in Machine Learning
- X and Y in Machine Learning
- Types of Data For Machine Learning
- F1 Score in Machine Learning