Pytorch Lightning vs TensorFlow Lite [Know This Difference]

In this blog post, we’ll dive deep into the fascinating world of machine learning frameworks – We’ll explore two famous and influential players in this arena: TensorFlow Lite and PyTorch Lightning. While they may seem like similar tools at first glance, they cater to different use cases and offer unique benefits.

PyTorch Lightning is a high-performance wrapper for PyTorch, providing a convenient way to train models on multiple GPUs. TensorFlow Lite is designed to put pre-trained TensorFlow models onto mobile phones, reducing server and API calls since the model runs on the mobile device itself.

While this is just the general difference between the two, this comprehensive guide will highlight a few more critical differences between TensorFlow Lite and PyTorch Lightning to really drive home when and where you should be using each one.

We’ll also clarify whether PyTorch Lightning is the same as PyTorch and if it’s slower than its parent framework.

So, buckle up and get ready for a thrilling adventure into machine learning – and stay tuned till the end for an electrifying revelation that could change how you approach your next AI project!



Understanding The Difference Between PyTorch Lightning and TensorFlow Lite

Before we delve into the specifics of each framework, it’s crucial to understand the fundamental differences between PyTorch Lightning and TensorFlow Lite.

While both tools are designed to streamline and optimize machine learning tasks, they serve distinct purposes and cater to different platforms.


PyTorch Lightning: High-performance Wrapper for PyTorch

PyTorch Lightning is best described as a high-performance wrapper for the popular PyTorch framework.

It provides an organized, flexible, and efficient way to develop and scale deep learning models.

With Lightning, developers can leverage multiple GPUs and distributed training with minimal code changes, allowing faster model training and improved resource utilization.


This powerful tool simplifies the training process by automating repetitive tasks and eliminating boilerplate code, enabling you to focus on the core research and model development.

Moreover, PyTorch Lightning maintains compatibility with the PyTorch ecosystem, ensuring you can seamlessly integrate it into your existing projects.
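
To make this concrete, here is a minimal, illustrative sketch of what a Lightning workflow looks like (assuming a recent version of PyTorch Lightning; the module name, toy data, and hyperparameters are made up for the example):

import torch
from torch import nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

# A toy LightningModule: the training logic lives in small hooks
# instead of a hand-written training loop.
class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Random data stands in for a real dataset.
X = torch.randn(256, 10)
y = torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=32)

# The Trainer handles the loop, logging, and device placement.
trainer = pl.Trainer(max_epochs=5)
trainer.fit(LitClassifier(), loader)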


TensorFlow Lite: ML on Mobile and Embedded Devices

On the other hand, TensorFlow Lite is a lightweight, performance-optimized framework designed specifically for deploying machine learning models on mobile and embedded devices.

It enables developers to bring the power of AI to low-power, resource-constrained platforms with limited internet connectivity.

TensorFlow Lite relies on high-performance C++ code to ensure efficient execution on various hardware, including CPUs, GPUs, and specialized accelerators like Google’s Edge TPU.

It’s important to note that TensorFlow Lite is not meant for training models but rather for running pre-trained models on mobile and embedded devices.


What Do You Need To Use TensorFlow Lite

To harness the power of TensorFlow Lite for deploying machine learning models on mobile and embedded devices, there are a few essential components you’ll need to prepare. 

Let’s discuss these prerequisites in detail:


A Trained Model

First and foremost, you’ll need a trained machine-learning model.

This model is usually developed and trained on a high-powered machine or cluster using TensorFlow or another popular framework like PyTorch or Keras.

The model’s architecture and hyperparameters are fine-tuned to achieve optimal performance on a specific task, such as image classification, natural language processing, or object detection.



Model Conversion

Once you have a trained model, you must convert it into a format compatible with TensorFlow Lite.

The conversion process typically involves quantization and optimization techniques to reduce the model size and improve its performance on resource-constrained devices.

TensorFlow Lite provides a converter tool to transform models from formats such as TensorFlow SavedModel or Keras HDF5 into the TensorFlow Lite FlatBuffer format (models from other ecosystems, such as ONNX, generally need to be converted to TensorFlow first).

More information is available in the official TensorFlow Lite converter documentation.
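
As a rough sketch (not the full workflow), converting a SavedModel with the TensorFlow Lite converter looks something like this; the directory and file names are placeholders:

import tensorflow as tf

# Assume "my_saved_model" is a directory containing a trained
# TensorFlow SavedModel (the path is illustrative).
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")

# Optional: apply default optimizations (e.g., post-training quantization)
# to shrink the model for mobile / embedded targets.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()

# Write the FlatBuffer out so it can be bundled with a mobile app.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)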


Checkpoints

During the training process, it’s common practice to save intermediate states of the model, known as checkpoints.

Checkpoints allow you to resume training from a specific point if interrupted, fine-tune the model further, or evaluate the model on different datasets. 

When using TensorFlow Lite, you can choose the best checkpoint to convert into a TensorFlow Lite model, ensuring you deploy your most accurate and efficient version.
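
As a hedged sketch of how this often looks with tf.keras, a ModelCheckpoint callback keeps only the best-performing weights, which you could later feed to the converter (the tiny model, file name, and random data here are stand-ins, not a real training setup):

import numpy as np
import tensorflow as tf

# Tiny stand-in model and random data, just to show the callback wiring.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x_train = np.random.rand(200, 8)
y_train = np.random.rand(200, 1)

# Keep only the checkpoint with the best (lowest) validation loss.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "best_model.keras", monitor="val_loss", save_best_only=True
)

model.fit(x_train, y_train, validation_split=0.2, epochs=10,
          callbacks=[checkpoint_cb])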


When would you use Pytorch Lightning Over Regular Pytorch?

While PyTorch is a compelling and flexible deep learning framework, there are specific scenarios where using PyTorch Lightning can provide significant benefits.

Here are a few key reasons to consider PyTorch Lightning over regular PyTorch:


Minimize Boilerplate Code

Developing deep learning models often involves writing repetitive and boilerplate code for tasks such as setting up training and validation loops, managing checkpoints, and handling data loading.

PyTorch Lightning abstracts away these routine tasks, allowing you to focus on your model’s core logic and structure.

This streamlined approach leads to cleaner, more organized code that is easier to understand and maintain across a team of machine learning engineers.



Cater to Advanced PyTorch Developers

While PyTorch Lightning is built on top of PyTorch, it offers additional features and best practices that can benefit advanced developers.

With built-in support for sophisticated techniques such as mixed-precision training, gradient accumulation, and learning rate schedulers, PyTorch Lightning can further enhance the development experience and improve model performance.


Enable Multi-GPU Training

Scaling deep learning models across multiple GPUs or even multiple nodes can be a complex task with regular PyTorch.

PyTorch Lightning simplifies this process by providing built-in support for distributed training with minimal code changes.

This allows you to leverage the power of multiple GPUs or even a cluster of machines to speed up model training and reduce overall training time.
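
For illustration, and assuming a recent version of PyTorch Lightning, multi-GPU training is mostly a matter of Trainer flags (the exact values here are placeholders):

import pytorch_lightning as pl

# With Lightning, scaling out is mostly a Trainer configuration change;
# the LightningModule and DataLoaders stay the same.
trainer = pl.Trainer(
    accelerator="gpu",   # train on GPUs
    devices=2,           # number of GPUs on this machine
    strategy="ddp",      # distributed data parallel across those GPUs
    max_epochs=10,
)
# trainer.fit(model, train_dataloader)  # same call as single-GPU training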


Reduce Error Chances in Your Code

By adopting PyTorch Lightning, you can minimize the risk of errors in your code due to its structured approach and automated processes.

Since the framework handles many underlying tasks, you'll be less likely to introduce bugs related to training, validation, or checkpoint management. Think about it: with PyTorch Lightning you're simply writing less code, and when you write less code, you naturally make fewer errors.

Additionally, the standardized design of PyTorch Lightning promotes code reusability and modularity, making it easier to share, collaborate, and troubleshoot your models.

Is SVG a Machine Learning Algorithm Or Not? [Let's Put This To Rest]

This post will help break down the myths surrounding a unique but commonly misunderstood machine-learning algorithm called SVG. One of the most debated (and silly) topics is whether SVG is a machine-learning algorithm or not.

Believe it or not, SVG is a machine-learning algorithm, and we’re here to both prove it and clarify the confusion surrounding this notion.

Some might wonder how SVG, a widely known design-based algorithm, could be related to machine learning. 

Well, hold on to your hats because we’re about to dive deep into the fascinating world of SVG, fonts, design, and machine learning.

In this post, we’ll explore the connections between these two seemingly unrelated fields, and we promise that by the end, you’ll have a whole new appreciation for SVG and its unique role in machine learning. 

Stay tuned for an exciting journey that will challenge your preconceptions and shed light on the hidden depths of SVG!



What Is SVG, and where did it come from?

The origins of Scalable Vector Graphics (SVG) can be traced back to a groundbreaking research paper that aimed to model fonts’ drawing process using sequential generative vector graphics models.

This ambitious project sought to revolutionize our understanding of vision and imagery by focusing on identifying higher-level attributes that best summarized various aspects of an object rather than exhaustively modeling every detail.

In plain English, SVG works as a machine learning algorithm using mathematical equations to create vector-based images.

Unlike raster graphics that rely on a grid of pixels to represent images, vector graphics are formed using paths defined by points, lines, and curves.

These paths can be scaled, rotated, or transformed without any loss of quality, making them highly versatile and ideal for graphic design applications.


SVG’s machine learning aspect comes into play through its ability to learn a dataset’s statistical dependencies and richness, such as an extensive collection of fonts.

By analyzing these patterns, the SVG algorithm can create new font designs or manipulate existing ones to achieve desired styles or effects.

This is made possible by exploiting the latent representation of the vector graphics, which allows for systematic manipulation and style propagation.

It also plays off of traditional epoch-based training, where each new "design" can come from an entire training pass over the data. While formal machine learning has low expectations for a model's earliest outputs, these seemingly under-trained representations can still produce unique designs.

SVG is a powerful tool for creating and manipulating vector graphics and a sophisticated machine-learning algorithm. 

Its applications in the design world are vast.

It continues to revolutionize the way we approach graphic design by enabling designers to create, modify, and experiment with fonts and other visual elements more efficiently and effectively than ever before.


Why The Internet Is Wrong, and SVG is a machine learning algorithm.

Despite the clear evidence provided by the research paper authored by Raphael Gontijo Lopes, David Ha, Douglas Eck, and Jonathon Shlens, a quick Google search may lead you to believe that SVG is not a machine-learning algorithm.

However, this widely circulated misconception couldn’t be further from the truth.

As stated in the paper, SVG employs a class-conditioned, convolutional variational autoencoder, which is undeniably a machine learning algorithm. Variational autoencoders (VAEs) are a type of generative model that learn to encode data into a lower-dimensional latent space and then decode it back to its original form.

In the case of SVG, this algorithm captures the essence of fonts and other vector graphics, enabling the creation and manipulation of these designs more efficiently.
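
To give a feel for the general idea (this is a generic illustration, not the architecture from the paper), a variational autoencoder encodes an input to a latent distribution, samples from it, and decodes back:

import torch
from torch import nn

# A generic (not the paper's) variational autoencoder skeleton:
# encode an input to a latent distribution, sample from it, decode back.
class TinyVAE(nn.Module):
    def __init__(self, in_dim=64, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU())
        self.to_mu = nn.Linear(32, latent_dim)
        self.to_logvar = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a latent vector z.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

vae = TinyVAE()
x = torch.randn(4, 64)   # random stand-in for flattened glyph features
recon, mu, logvar = vae(x)
# Training would minimize reconstruction error plus a KL divergence term.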

The SVG algorithm is not just any ordinary machine learning algorithm; it can be considered state-of-the-art.

By harnessing the power of convolutional neural networks (CNNs) and VAEs, SVG has demonstrated remarkable capabilities in capturing intricate patterns and dependencies within large datasets of fonts and other graphics.

This makes it an invaluable tool for graphic designers and researchers, as it facilitates generating new designs and exploring creative possibilities.

So, the next time you come across information suggesting that SVG is not a machine learning algorithm, remember the groundbreaking research by Lopes, Ha, Eck, and Shlens that proves otherwise.

In fact, SVG is not only a machine learning algorithm but a state-of-the-art one with the potential to revolutionize how we approach graphic design and push the boundaries of our creative capabilities.



Link To The Paper:

https://arxiv.org/abs/1904.02632 


Why You Should Be Careful Trusting Anything You See

The misconception surrounding SVG being unrelated to machine learning is a prime example of why it’s essential to approach information on the internet with a critical eye.

While the internet is an invaluable resource for knowledge and learning, it’s also rife with misinformation and half-truths.

Before accepting anything you read or see online as fact, make sure to verify its accuracy by cross-referencing multiple sources or consulting reputable research papers and experts in the field.

Being vigilant in your quest for accurate information will help you avoid falling prey to misconceptions, form well-informed opinions, and make better decisions in other aspects of life.

How To Choose The Right Algorithm For Machine Learning [Expert Guide]

I’ll be honest; choosing the right algorithm for machine learning can be one of the most challenging parts of our jobs.

Don’t worry; we’re here to help.

In this article, we'll break down the process of selecting the perfect algorithm for your project in a simple, effective, easy-to-understand way.

We’ll start by taking a high-level look at the world of machine learning algorithms and what to consider before you even touch that keyboard. 

Then, we’ll review critical considerations and KPIs to help you know you’ve made the right choice.

By the end of this article, you’ll have a solid understanding of what to look for when choosing a machine learning algorithm and feel confident in your ability to make the best choice for your project.

If you want a future in this field, this is a MUST-READ.



The Two Main Pillars of Machine Learning

Regarding machine learning, there are two main pillars:

Unsupervised learning and Supervised learning. Understanding these two distinct pillars is critical in choosing the right algorithm for your project.

Unsupervised learning is a type of machine learning where the algorithm is trained on a dataset without any specific target variable.

The algorithm must then find patterns and relationships within the data on its own.

This approach is used when you don’t have a target variable or are interested in clusters and groups within your data that aren’t extremely obvious.

For example, an unsupervised approach is excellent when looking for marketing groups and segments within a customer base to increase sales.

Conversely, supervised learning is a type of machine learning where the algorithm is trained on a labeled dataset with a particular target variable. 

This means the algorithm knows what it’s trying to both predict and improve on, allowing our algorithm a path to convergence.

Supervised learning is often preferred over unsupervised learning simply due to the information gain.



Let’s run through an example.

Say you have four columns of data plus a "target variable." Since our unsupervised algorithm does not use the target variable, it only gets to take advantage of the four columns.

Conversely, our supervised algorithm will have the four columns of data plus the target variable.

This means our supervised algorithm has 25% more data to work with!

It’s important to note that your dataset and problem usually dictate which machine learning pillar you should use. 

Remember, it’s best to utilize supervised algorithms whenever possible, as they provide more information and can help you achieve better results.

In summary, the two main pillars of machine learning are unsupervised and supervised learning.

While unsupervised learning helps uncover hidden patterns in data, supervised learning is preferred because it can converge on a target variable and provide the underlying algorithms with more information.


One Pillar Has Two Categories; The Other Has None

Under the umbrella of supervised learning, there are two main categories: regression and classification.

Regression is a type of supervised learning where the target variable is continuous, meaning it can take on any value within a range (note that the range can be unbounded, such as 0 to infinity).

The algorithm is trained to predict the target variable’s value based on the input variables’ values.

For example, using historical data on housing prices and their respective features, a regression algorithm can predict the price of a future house based on its features.
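
As a quick illustrative sketch, here is roughly what that looks like with scikit-learn, using the built-in California housing data as a stand-in for your own housing dataset:

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# California housing: features describing a neighborhood, target = median house value.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
print("Predicted prices for the first 3 test houses:", reg.predict(X_test[:3]))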



On the other hand, classification is a type of supervised learning where the target variable is categorical, meaning it can only take on a limited number of values or categories. 

The algorithm is trained to predict the target variable’s category based on the input variables’ values. 

For example, in one of the most classical machine learning problems, a classification algorithm is trained on data about flower species and their respective features so it can predict the species of a new flower from its measurements.
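
A minimal sketch of that classic problem with scikit-learn's built-in iris data (the model choice here is just an example, not a recommendation):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris: flower measurements in X, species label in y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Predicted species for the first 3 test flowers:", clf.predict(X_test[:3]))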

It’s worth noting that these two categories only exist in supervised learning, as we have a target variable to learn from and optimize for.

This allows us to predict future values or groups based on the information we’ve learned from the target variable.

In unsupervised learning, we don’t have a target variable to tell us if we’re doing a good job with our predictions.

Our algorithms have nothing to optimize for; they only find patterns and relationships within the data.

This means unsupervised learning differs from supervised learning, requiring an almost different philosophical approach to choosing an algorithm.


What To Do Before You Start Coding Your Algorithm

Before you start coding your machine learning algorithm, sit down and ensure you understand your business problem and are being realistic with your data.

This will help you choose the correct algorithm for your project and ensure you get the best possible results.

When it comes to understanding your business problem, it’s essential to determine whether you’re trying to optimize toward a target (supervised learning) or looking for a new way to look at your data (unsupervised learning). 

For example, if you’re trying to predict future sales or which group a new member would belong to, you’ll need a target variable, and supervised learning would be the best approach.

On the other hand, unsupervised learning would be the better option if you’re looking to build up groups and clusters without guiding the algorithm.

Be realistic with your data. 

Supervised algorithms are immediately not an option if you don’t have a target variable. 


In this case, unsupervised learning is the only option available.

In summary, before you start coding your machine learning algorithm, understand your business problem and be realistic with your data.

Use your data as a guiding light, and make sure you choose the right approach based on your specific needs and the information available.


Quick Guide To Choosing The Right Machine Learning Algorithm

Here’s a quick mental map that I use to choose the right algorithm.


Understand your business problem: What are you trying to solve?

Understanding your business problem is the first step in choosing the right algorithm.

Before exploring different algorithms, you need to understand what you’re trying to achieve.


Explore your data: What columns and data do you have that's usable?

You need to have a good understanding of the data you have available to you.

This will help you choose an algorithm that is well-suited to your specific needs and can take advantage of the data you have.


Determine if it's a supervised or unsupervised problem: Once you have explored your data, you need to figure out if you're dealing with a supervised or unsupervised problem.

This will help you narrow your options and choose the right approach for your problem.


Determine if it's regression or classification: If it's a supervised problem, you need to figure out if it's regression or classification.

Are you predicting a continuous value or putting things into predetermined categories?


Find a group of algorithms to test: Use what you now know about your problem to find a group of candidate algorithms within that category (such as supervised regression or unsupervised NLP problems).

This will help you narrow your options and find the right algorithm for your needs.

Note: As you've noticed, we only tell you to find the group; we have deliberately not recommended any specific data science algorithms.

Finding the right machine-learning model is an iterative process.

Anyone suggesting “regression trees are best when doing X” does not understand machine learning and how algorithms work.


Assess each algorithm in the group: Test each algorithm in the group and assess its performance.

This will help you determine which algorithm performed the best and is the best choice for your specific problem.

Select the machine learning algorithm: Based on your results, select the machine learning algorithm that best suits your business problem.

This will be the algorithm you use to solve your problem and achieve your goals.
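
As a rough sketch of what this assess-and-select loop can look like in scikit-learn (the candidate models and dataset here are arbitrary examples, not recommendations):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A small "group" of candidate classifiers assessed on the same data.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")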



What To Watch Out For When Choosing Your Algorithm

When choosing a machine learning algorithm, there are several things to keep in mind as you pick out that perfect fit.


First, don’t fall in love with an approach before it’s tested. 

Even if a particular algorithm looks good on paper or has worked well for others, it may not work the same for you.

It’s important to test multiple algorithms and compare their results to find the best one for your business needs.


Second, remember that your data and problem choose the algorithm, not you. 

You may have a favorite algorithm you’re excited to use, but it’s not the right choice if it doesn’t fit your data and problem well. 

Make sure to choose an algorithm that is well-suited to accomplish your goals!


Third, be aware that all algorithms seem good before they’re tested. 

Only after testing will you know how well an algorithm will perform on your problem. 

Don't be swayed by an algorithm's hype or popularity; test it and compare its results to other algorithms.


Fourth, don’t assume that a higher accuracy means a better algorithm. 

While accuracy is important, it’s not the only factor to consider.

Other factors such as speed, interpretability, and scalability also play a role in determining the best algorithm for your needs.


Fifth, ensure your data source is “tapped,” meaning you can’t get any more data. 

If you can obtain additional data, you can improve the performance of your algorithm or choose an altogether different algorithm that could perform much better (remember our unsupervised vs. supervised talk above).


Finally, remember that sometimes the best answer is the most straightforward answer. 

Don’t get caught up in using complex algorithms just to use a complex algorithm.

The simplest solution is often the best, especially if it provides the desired results with a lower risk of overfitting or over-complication.


How To Know You've Chosen The Right Learning Model For Your Problem

Ultimately, the best way to know if you’ve picked the right machine learning algorithm for your problem is if you’ve successfully solved the problem you initially set out to solve.

If your algorithm provides the desired results and you can achieve your goals, you’ve likely made the right choice.

On the other hand, if your algorithm is not providing the results you need, it’s time to go back and reassess.

It’s important to remember that machine learning algorithms are not one-size-fits-all solutions.

What works well for one problem may not work well for another.

This is why it’s important to test multiple algorithms and choose the best fit for your needs.


Machine Learning Algorithm In Sorting?? [With Code!!]

Every computer science student had to deal with sorting algorithms while learning how to code.

While traditional sorting algorithms have been in use for decades, the rise of machine learning has given birth to a new type of sorting algorithm that has brought just as much commercial value as the originals.

In fact, machine learning algorithms have found applications in a wide range of sorting problems, from sorting images and videos to sorting fruit on a conveyor belt.

In this blog post, we’ll explore some of the machine learning algorithms that are being used behind the scenes for sorting, including their strengths and weaknesses. 

Whether you’re a seasoned machine learning expert or just getting started with this exciting field, this post will give you a better understanding of how machine learning is transforming the world of sorting.

So, let’s dive in and discover the algorithms that are powering the future of sorting!



Understanding The Machine Learning Problem

When it comes to using machine learning for sorting, it’s essential to consider the approach you want to take.

One option is to use computer-generated rules to sort your items, which involves training a machine-learning model to recognize patterns in your data and make decisions based on those patterns (unsupervised learning). 

This approach is often used when you don’t have pre-existing rules or knowledge about the data you’re sorting and want the model to create them without any form of bias.

On the other hand, you can also use human-generated rules to sort your items (supervised learning). 

This approach involves defining specific criteria for sorting items based on prior knowledge or expertise in the field. 

For example, you might sort medical records based on the patient’s age, symptoms, or images based on their color, brightness, or other visual features.



This can be useful when you know the data and want to ensure that the sorting process aligns with your personalized end goal.

In either case, the key is to choose the approach that best fits your needs and business goals. 

Whether you choose a machine learning-based or human-made-rule-based approach, both have their strengths and weaknesses, and it’s important to evaluate each approach in the context of your specific use case.

By taking the time to carefully consider your options and choose the best approach for your needs, you can ensure that your sorting process is effective and efficient and ultimately achieves the results you’re looking for.


Supervised Learning: How Support Vector Machines Sort

Support Vector Machines (SVMs) are a type of supervised machine learning algorithm used for classification and regression analysis. 

In SVMs, a model is trained on a labeled dataset to find the optimal boundary that separates different data classes. 

This boundary is based on human-made rules, as the algorithm relies on pre-existing knowledge of the data to classify new instances.

Once this boundary is created, more items can then be sorted into these categories, allowing businesses to quickly and efficiently sort new-found information.

SVMs are often used in image recognition, natural language processing, and other applications requiring classification.

They’re even being used to sort fruit!


SVM algorithm in Python

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Create an SVM classifier
clf = SVC(kernel='linear', C=1, gamma='auto')

# Train the classifier using the training data
clf.fit(X_train, y_train)
# Make predictions on the testing data
y_pred = clf.predict(X_test)

from sklearn.metrics import accuracy_score

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}".format(accuracy))


Unsupervised Learning: How Clustering Algorithms Sort

Clustering algorithms are a type of unsupervised machine learning algorithm used to sort data into clusters based on some machine-discovered similarity. 

Unlike supervised algorithms, clustering does not rely on pre-existing categories or labels to sort the data. 

Instead, the algorithm automatically groups the data based on shared characteristics without humans’ prior knowledge or input. 

This makes clustering helpful in discovering patterns and relationships in data that may not be immediately apparent to humans. 

It has applications in fields such as marketing, customer segmentation, and anomaly detection. 

Some common clustering algorithms include 

  • K-Means (Most Famous)
  • Hierarchical Clustering 
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)


K-Means algorithm in Python

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate a random dataset with 100 samples and 2 features
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)

# Plot the original dataset
plt.scatter(X[:, 0], X[:, 1])
plt.show()
# Create a K-means model with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=0)

# Train the model on the dataset
kmeans.fit(X)

# Make predictions on new, unseen data
new_data = [[-3, 0], [3, 0]]
predicted_labels = kmeans.predict(new_data)
print("Predicted labels for new data:", predicted_labels)

# Visualize the clustering results
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, marker='*', c='red')
plt.show()


How Algorithms Are Helping Commercial Businesses Sort Items

Algorithms have become an essential tool for commercial businesses to sort and classify items, enabling them to streamline their operations and improve efficiency.

By analyzing large amounts of data and identifying patterns, algorithms can help businesses quickly and accurately categorize items, reducing errors and staffing counts, all while saving time.

One example of how algorithms are helping businesses sort items is in e-commerce, where machine learning algorithms are used to sort and recommend products to customers based on their preferences and behavior.

This helps businesses increase sales and improve customer satisfaction by providing personalized recommendations and a better shopping experience.

Algorithms are also used in supply chain management, where they can help businesses manage inventory, track shipments, and optimize logistics.

By analyzing data on product demand, shipping times, and supplier reliability, algorithms can help businesses make better decisions about when and where to source products, reducing costs and minimizing delays.

Overall, the use of algorithms in sorting items is just one example of how technology transforms businesses, enabling them to work more efficiently and effectively in a rapidly changing marketplace and world.


ML101: Noise In Machine Learning [Full Code]

We’ve all been there, cleaned up our dataset, and realized it’s incredibly noisy.

Should you get rid of the noise? Why is it even there? And what even is noise?

In this blog post, we’ll look at what noise is, why it matters in machine learning, and whether or not we want it in our systems.

I’ll even throw in some code to get you on your way!


What is Noise in Machine Learning

Noise in Machine Learning is like the static you hear on an old-fashioned TV set: unwanted data mixed in with the clean signals, making it hard to interpret and process “the good stuff.”

Noise can also adversely affect a Machine Learning model’s accuracy, hindering the algorithms from learning the authentic patterns and insights in the data, as the noise masks these.


While many focus on the more common types of noise in Machine Learning, like outliers, corrupted data points, and missing values, this noise is easy to detect and handle.

The best way to manage this type of noise is by understanding the problem context and implementing necessary preprocessing techniques like outlier/anomaly detection and other standard procedures.

The actual problem with noise arises from the randomness of the world, which is much harder to detect.

While many will tell you this noise will ruin your models, I’d argue it can sometimes enhance them.

If you handle it correctly.


Does every real-world dataset have noise in it?

All real-world datasets have noise, even if the dataset seems perfect.

While many machine learning practitioners will argue that this noise needs to be removed, I’d argue it needs to be understood first.


Do we want to remove all noise from data in Machine Learning?

You should be very careful of removing noise in machine learning models, as it’s tough to distinguish between noise in your data and nuances in a system.


For example, let's say you're building a system to predict whether someone will sign up for your SaaS product.

You have data on a bunch of unfinished and finished signup forms, along with whether those people converted.

Now, when you look at your dataset, the unfinished signup forms are half finished, generally incomplete, and sometimes contain inaccurate information.

The first thing many machine learning practitioners will do is throw out this incomplete data.

That makes sense, right?

It’s “bad” data and doesn’t seem to provide anything (it’s really noisy data).

Now, what if I told you that after someone creates a half-filled form, they come back within a week and sign up for the product at a 60% rate?

Well, now wait, those half-finished signup forms are no longer bad and noisy data; they’re highly predictive and will allow us much higher accuracy for a forecasting model.

This is the problem with removing noise; understanding a problem deep enough to remove all noise takes a ton of business context that I think only some of us have.

That’s okay – there are other things we can do.


How to remove noise from our learning models in Python

Instead of feeding your algorithm noisy data, you can use a LOWESS curve to create smoothed points to feed it.

Here is an example in Python, using statsmodels.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
plt.style.use('seaborn-whitegrid')

# example of a noisy parameter
x = [val for val in range(0,100)]
y = [np.sin(val) + (np.random.normal(val) + random.randint(0,25)) for val in x]

plt.scatter(x, y, 1)

We can see how noisy our data points are in the resulting scatter plot.


import statsmodels.api as sm

# lowess expects (endog, exog) = (y, x); it returns the sorted x values
# in column 0 and the smoothed y values in column 1.
z = sm.nonparametric.lowess(y, x, frac=1/3, it=3)

plt.scatter(x, y, 1)
plt.plot(z[:, 0], z[:, 1], linewidth=4)

plt.show()

Here are our new “Z” points, which our lowess curve has smoothed!


Is it possible to label noise before modeling?

Noise can be labeled before modeling using the LOWESS technique shown above.

However, this is usually wasted time.


You'll get much better model improvements by focusing on a cleaner dataset or on understanding the problem more deeply.


Are there ever scenarios where you want to add noise in machine learning?

Noise is constantly added to datasets.

In image detection, it’s very standard to rotate and flip images to try and trick our algorithm.

This will give us more images to model and a more general algorithm that will be much more robust in production.

Adding noise is a way of generalizing models.
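
For illustration, here is a tiny NumPy sketch of this kind of augmentation; the image is random data standing in for a real training example:

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))          # stand-in for a single grayscale image

# Simple augmentations: flips, rotations, and a little Gaussian pixel noise.
augmented = [
    np.fliplr(image),                             # horizontal flip
    np.flipud(image),                             # vertical flip
    np.rot90(image),                              # 90-degree rotation
    image + rng.normal(0, 0.05, image.shape),     # additive noise
]

print(f"1 original image -> {len(augmented)} extra training examples")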


What is the difference between Noise and Error In Data Science?

Noise is something that datasets have; error is something that models deal with.

Noise is before we begin modeling, and error is after.


Are Noise And Outliers The Same In Data Science?

There is an argument there, but traditionally, noise is the randomness of the world being injected into your dataset.

While outliers could be seen as noise, they’re usually signs of something else, like process failure and data collection issues.



One Step At A Time: Epoch In Machine Learning

Epochs in machine learning can be confusing for newcomers.

This guide will break down epochs and explain what they are, how they work, and why they’re important.

We’ll also dive deep into Epochs’ relationship with Batch size and iterations.

Deep learning is a complex subject, but by understanding epochs, you’re on your way to mastering it!



What is an Epoch in Machine Learning?

When we train a neural network on our training dataset, we perform forward and back propagation using gradient descent to update the weights.

To do this, forward and back propagation require inputting the data into the neural network.

There are three options to do this:

1.) One by One

2.) Mini Batches

3.) Entire Batch

Feeding your neural network data one by one will update the weights each time using gradient descent.

If you feed your neural network data with mini-batches, after every mini-batch, it’ll update the weights using gradient descent.



And finally, if you feed your neural network your entire dataset at once, it'll update the weights once per pass using gradient descent.

However, no matter how you feed the data to your neural network, once it’s seen the entire dataset, this is one Epoch.

 



Using one by one will take N updates (where N is the # of rows in your dataset) for one Epoch.

In contrast, using the entire dataset will only take one update for one Epoch.


Why do we use more than one Epoch?

Since one epoch means our machine learning algorithm has seen the entire dataset only once, a single pass usually isn't enough exposure for it to learn the hidden trends within the data.

This is why we use more than one epoch: to give our algorithm enough passes over the data to train properly.


How to Choose The Right Number of Epochs

There’s no magic number when choosing the correct number of epochs for training your machine-learning algorithm.

You'll generally set a number high enough that your algorithm can learn your dataset, but not so high that you're wasting resources and overfitting.

The best way to find the perfect balance is through trial and error.

Start with a relatively high number of epochs and gradually decrease until you find the sweet spot.

It might take a little time, but getting the best results is worth it.



What Is the Difference Between Epoch and Batch In Machine Learning?

An epoch is running through the entire dataset once, and batch size is just how many “chunks” we do it in.

If we have a dataset of 1000 points and a batch size of 10, we’re going to train our model on 10 points at a time, update our weights 10 points at a time, and do that 100 times.

That’s one Epoch. 

If we want to run more epochs, we keep going until we hit our stopping criterion.

Pretty simple, right?

We do this because we don’t try to process too much data at once and overload our RAM. If you’re only processing 10 points at a time, you’ll be safe from memory overload.
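
A quick back-of-the-envelope sketch using the same numbers as above:

import math

dataset_size = 1000
batch_size = 10
epochs = 5

updates_per_epoch = math.ceil(dataset_size / batch_size)

print(f"{updates_per_epoch} weight updates per epoch")               # 100
print(f"{updates_per_epoch * epochs} updates over {epochs} epochs")  # 500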

There are other benefits, too – like stopping training if the validation loss isn’t improving after a certain number of epochs or if the training loss starts increasing (which would mean you’re overfitting).


Do Different Frameworks Have Different Meaning For Epoch?

Whether you’re using TensorFlow, PyTorch, or whatever new deep learning framework comes out in the future – an epoch is one run through the entire dataset.

The meaning of "epoch" transcends any single framework and is taught as a fundamental part of deep learning.

Whenever someone references an epoch, they're talking about the model seeing the entire training set once.



Does an Epoch Exist Outside of Machine Learning?

While the idea behind Epoch does exist outside of machine learning, you won’t hear it called “epoch.”

For example, if you’re working as a data scientist and are building a visualization for a chart – you’d use the entire dataset.

This would be an “epoch,” but your boss would never reference it that way, as it’s not standard jargon outside machine learning.

So an epoch does exist outside of machine learning, but the terminology would never be used outside of machine learning (and mostly deep learning contexts).


What is an iteration in machine learning?

An iteration is how many updates it takes to complete one Epoch.

In other words, it’s the number of times the model weights are updated during training.

The term comes from the Latin word iter, meaning “to go through or do again.”

Higher iterations, in my experience, improve accuracy but take longer to train because the model has to update the weights much more often.

There are trade-offs between iterations and batch size.

As you increase the batch size, the number of iterations per epoch goes down.

For example, you might want to use fewer iterations if you train a model on a small dataset, as you’ll be able to fit most of the dataset into memory.

But if you're training a model on a large dataset, you might want to increase iterations (which would lower batch size) because it would take too long to train the model otherwise, and the data probably wouldn't fit into RAM.

Experimenting with different values and seeing what works best for your situation is essential.



Do Many Epochs Help Our Gradient Descent Optimization Algorithm Converge?

Increasing the number of Epochs your algorithm sees will help until a certain point. Once that point is reached, you’ll start overfitting your dataset.

As a machine learning engineer, your job is to find that sweet spot where your algorithm is seeing enough data but not so much that it’s started focusing on the noise of the training data.

A validation dataset will help track the loss and create programmatic stops if the loss starts to increase.
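
As an illustrative sketch with tf.keras (the tiny model and random data are placeholders), an EarlyStopping callback watches the validation loss and stops training once it stops improving:

import numpy as np
import tensorflow as tf

# Tiny model and random data, just to show validation-based early stopping.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

X = np.random.rand(500, 20)
y = np.random.randint(0, 2, size=(500, 1))

# Stop once validation loss has not improved for 3 consecutive epochs,
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

model.fit(X, y, validation_split=0.2, epochs=100, batch_size=32,
          callbacks=[early_stop])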



Why It Matters: Types of Data For Machine Learning

Data is the heart and soul of machine learning.

Without data, data scientists cannot create impactful machine-learning algorithms (duh).

While this seems pretty standard, the type of data makes a huge difference when trying to perform data analysis.

And sometimes, when we receive data that we’re not used to or haven’t dealt with before – it can cause problems.

In this guide, we will look at the different types of data for machine learning models.

After this reading this short article, you’ll understand the following:

  • ALL of the Different Types Of Data in Machine Learning
  • Discrete Vs. Continuous
  • Ordinal Vs. Numerical Vs. Nominal
  • Timeseries Vs. Cross-Sectional
  • Big Data Vs. Standard
  • Streamed Vs. Batch Dataset
  • Answers To Some Common Data Type Questions At The End

Let’s get to it!



Different Types Of Data In Machine Learning

When doing machine learning, you’ll encounter various data types.

Discrete data is a countable data type, like the number of children in a family (whole number).

Continuous data is a data type that can be measured, like height or weight.

Ordinal data can be ranked, like 3rd place or runner-up.

Numerical data is data that can be quantified, like money or age.

Nominal data is data that can be categorized, like colors or countries.

Time series data is collected over time, like monthly sales figures.

Cross-sectional data is data collected at one point in time across various individuals, like census data.

Streamed data is collected in real-time, like social media posts.

Batch data is collected in chunks, like customer purchase records.

Big data is large sets of structured and unstructured data, like weather patterns or satellite imagery (usually streamed).

Standard data is small sets of well-defined structured data, like death certificates or tax returns.

As you can see, there’s a wide variety of different data types that you’ll encounter when doing machine learning.

Whether you’re trying to classify fraud, predict salary or build an awesome visualization, understanding their differences is essential for successfully building a model that works best for you and your customers.



Discrete Vs. Continuous

Discrete data is often considered data that can be counted, like the number of students in a class or crayons on the floor (finite).

However, discrete data can also take on a non-numeric form, like the color of someone’s eyes.

Continuous data, on the other hand, is always numeric and can represent any value within a specific range, like height or weight.

For example, you could weigh 125.3 or 125.325 pounds, etc. – you’ll never get the “exact” weight, as you’ll always be sacrificing some form of precision.

The two types of data in machine learning are often used interchangeably (which would upset your old statistics teacher!!).

For the scope of machine learning models, treating continuous variables as discrete variables is usually your only option and makes modeling much more straightforward.

We can see below that even though we have a mix of continuous (weight and height) and discrete (lap on track), treating them as discrete will allow us to create models.

import pandas as pd

# create our data
data ={'weight':[125,135,160], 'height':[62,50,49], 'laps on track':[6,2,6]}

# make it a data frame
data = pd.DataFrame(data)

# show
data



Ordinal Vs. Numerical Vs. Nominal

Ordinal data is a type of data where the values have a natural order.

For example, if you were to ask people to rate their satisfaction with a product on a scale of 1 to 5, the resulting data would be ordinal.

Numerical data is data that can be measured and quantified, with values usually derived from counting or measuring things.

For example, if you were studying the effects of a new medication, you would likely use numerical data to track changes in a patient’s blood pressure or heart rate. 

For example, if you were tracking the cost of a stock at closing each day of the week, you’d use numerical data to list that number in your dataset.

Nominal data is categorical data that does not have a natural order.

For example, if we had a variable in our dataset that was the colors in a crayon box, the resulting data would be nominal – since it has no order.

Nominal data is seen throughout machine learning and is called “categorical data.”

In our dataset below, popcorn price would be numerical data, favorite movie genre would be nominal data, and movie rating would be ordinal.

import pandas as pd

# create our data
data ={'popcorn_price':[8.99,9.50,9.25], 'favorite_movie_genre':['horror','scifi','action'], 'movie_rating':[6,8,3]}

# make it a data frame
data = pd.DataFrame(data)

# show
data


Timeseries Vs. Cross-Sectional

Timeseries data tracks the same entity (or entities) over time, while cross-sectional data sets track different entities at the same point in time.

Timeseries data sets are ideal for tracking trends over time. Since they follow the same entity, they can provide a clear picture of how that entity is changing over time.

On the other hand, cross-sectional data sets are better suited for answering questions about causation at that point in time. By tracking different entities simultaneously, cross-sectional data sets can help us identify relationships between variables.

Both of these “types” have their ups and downs.

For example, time series data sets can bring some challenges. In machine learning, assumptions of “independence” are violated simply by the data being time series.

While cross-sectional datasets can fall into the trap of “data fatigue.”

Since we're only supplied data for a specific point in time, changes happening before or after are not considered during modeling. This can sometimes lead to short-sighted models, or models that fatigue as time goes on.

Below, we have an example of a time series dataset.

import pandas as pd

# create our data
data ={'day':[1,1,2,2], 'id':['1','2','1','2'], 'price':[5,6,6,7]}

# make it a data frame
data = pd.DataFrame(data)

# show
data



Streamed Vs. Batch Dataset

Streamed datasets are continuous, meaning new data is constantly being added in real time.

You’ll usually see this type of data from systems built at scale that are always running, like social media companies.

On the other hand, Batch datasets are finite; they only contain a set amount of data typically collected at specific intervals.

Often, these types of datasets will be “blended” together.

Most data scientists will use reservoir sampling if your system is pushing out streamed data.

This will create a batched dataset with the same distributions as your streamed data.

This will allow you to create models for continuous real-time systems (streamed data) utilizing your self-created batched dataset.

More Reading:

https://en.wikipedia.org/wiki/Reservoir_sampling
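
For illustration, here is a small sketch of reservoir sampling (Algorithm R) in plain Python; the stream of integers stands in for real streamed events:

import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)      # each item survives with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 "events" from a stream of 10,000 without storing them all.
print(reservoir_sample(range(10_000), k=5))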


Big Data Vs. Standard

Big data and standard data are terms often used interchangeably, but there are some slight differences between them.

Standard data, such as data in a database, is typically collected in a structured format.

This type of data is easy to analyze and can be used to answer specific questions.

As a data scientist, don’t be shocked if 90% of your work is with standard data.

On the other hand, big data is often unstructured and can come from various sources, like text files sitting inside Amazon Web Services' S3 service.

This makes it more difficult to analyze but allows for incredible models (deep learning) since the data quantity is so high.

Big data is also growing faster than standard data, making it much more costly to store than standard data kept in SQL databases.



 

Frequently Asked Questions

What type of data does machine learning need?

Machine learning algorithms need data in a format they can understand. Most of the time, you'll want to feed your algorithms numerical variables that allow them to converge. Some algorithms can handle categorical data directly (like K-Modes), but most need it encoded first.


Is machine learning required for data analytics or data science?

Machine learning is not required for data analytics or data science and only makes up a small portion of those roles’ job flow. Most time is spent in these roles cleaning and presenting insights into business problems, only utilizing machine learning if the situation warrants it.


Why is having the right dataset important for machine learning algorithms?

Like an engine to a car, data makes or breaks machine learning algorithms. Think about it this way: if someone handed you a clean list of numbers to memorize and then asked you what those numbers were, you'd have a good chance of answering correctly. If someone handed you a blurry, broken list, you'd have no chance of answering the question.


Do data types in machine learning datasets matter?

Data types do not matter in machine learning as long as they are handled correctly. If categorical data is treated as ordinal, you’ll have problems. If feature engineering and initial exploration are conducted correctly, the data type will not matter in a machine-learning problem.

A Beginner's Guide to X and Y in Machine Learning

Machine learning is a vast and complex field that covers many different concepts.

While there is some jargon to get up and running in machine learning, some of the ideas behind this jargon are very simple.

This guide will focus on some well-known data science jargon, focusing on X and Y in machine learning.

By the end of this, you’ll know what they are, what they do, what they mean, and how you can use them to improve your conversations with machine learning professionals.

Let’s get started!

What Are X And Y In Machine Learning

X and Y are jargon terms in Machine Learning.

“X” are the variables we will use to predict/classify our “Y” variable.

There can be many variables in our “X” set, but there will only be one variable in our “Y” set.

Machine Learning Example of X and Y

Below, we have a dataset, and our goal is to build a machine-learning model that can classify if a car is an automatic transmission or a manual one.

Here is that dataset:

(A table of car data: a Transmission column followed by Feature_1 through Feature_9.)

We will need to split these two up, and we will use the code below to do that.

# x is everything but the first column
x = df.iloc[:,1:].values
# our target is our first column
y = df.iloc[:,0].values

Now that we’ve run that code on our data, we have the following.

X = Feature_1, Feature_2, Feature_3, Feature_4, Feature_5, Feature_6, Feature_7, Feature_8, Feature_9

Y = Transmission

We will now use this data to build our models!

Why are variables called x and y in machine learning?

When people think of data science and machine learning, their minds usually drift off, thinking of complicated machine learning algorithms and computer code.

Many fail to realize that these fields are deeply rooted in statistics and mathematics.

The letters “X” and “Y” commonly represent variables in equations in these disciplines.

I’m sure you can remember when you first learned mathematics and explored the equation of a line.

Y = mX + B

Where X is the input and Y is the output.


It's no different today in machine learning algorithms; our "X" typically represents the independent variables, while our "Y" represents the dependent variable.

Understanding how these variables interact is the heartbeat of machine learning. 

Once an understanding is established, data scientists and machine learning engineers can design prediction models and systems to replicate this relationship.

While in recent years, the computing power boom has allowed for much more complex models to be developed, at their core, these fields are still based on the same fundamental principles of statistics and mathematics.


Are X and Y Both In The Training And Test Sets in Data Science?

When working in data science or machine learning, we must have a way to test our algorithms with data outside of the data we used to train them on, to see how they're truly performing.

Otherwise, we risk overfitting our data, which means that our algorithm will do well on the data it’s seen before but won’t be able to generalize to new unseen data.

This is a considerable risk because our model would never perform well in production.

One way to do this is to split our data into two parts: a training split and a testing split.

The training split is the data we’ll use to train our machine-learning algorithm.

The testing split is the data we’ll use to test our algorithm.

To do this, we’ll need to split both X and Y. 

Usually, we’ll use 80% of both X and Y to train the model and hold out 20% to test the model.
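In practice, this split is usually a single line of code. Here’s a minimal sketch using scikit-learn’s train_test_split, where the 80/20 ratio and the random seed are just common, hypothetical defaults:

from sklearn.model_selection import train_test_split

# x and y are the feature matrix and target vector from earlier;
# hold out 20% of both for testing
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)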

The image below explains how the splitting is usually done.

x and y


Do Unsupervised Machine Learning Algorithms Have X and Y?

Remember, in supervised learning, we have a clear target that we are trying to achieve.

Like in the example above, we were trying to predict the type of transmission of the car. In unsupervised learning, there is no such target.

Instead, the goal is to gain insights into the dataset. More specifically, we’re exploring the “X” without having a “Y.”



There are a million different ways we could do this, and we might want to cluster our data points into groups or perform PCA to lower the dimensions of our dataset.

Many different algorithms can be utilized in unsupervised learning, and the choice of algorithm will depend on the nature of the data and the insights we hope to gain.
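For example, here is a minimal clustering sketch. Notice that only the “X” matrix is passed in, and the number of clusters is a hypothetical choice:

from sklearn.cluster import KMeans

# cluster the feature matrix x into 3 groups; no y is involved
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(x)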

As stated above, our dataset will not have a traditional “Y” variable. 

Without a target to predict, there is no way to measure the performance of our algorithm.

It is difficult to know when our model is “done” or “good.”

This makes unsupervised learning much more about exploration and insights than optimizing KPIs and offline metrics.


Do Supervised Machine Learning Algorithms Have X and Y?

All supervised machine learning algorithms have an X and a Y.

Our “X” set will comprise our independent variables, and a data frame or a matrix will usually represent it.

Our “Y” set will have our dependent variable, again as either a data frame or a matrix.

Supervised learning means we’re training algorithms using labeled data. Labeled data means data that has a target “Y.”

This is why all supervised algorithms have both X and Y… the “supervision” literally comes from the labels.
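As a quick sketch, training any supervised model follows the same pattern of pairing X with Y. Logistic regression is just one hypothetical choice, and the split variables come from the earlier example:

from sklearn.linear_model import LogisticRegression

# fit a classifier on the labeled training data from the split above,
# then score it on the held-out test set
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))  # accuracy on unseen data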



Final Thoughts, X and Y in Machine Learning

So, X and Y variables.

You’ve probably heard of them before, even if you didn’t know what they are.

But in today’s data-driven age, they are fundamental to building machine learning models (supervised or unsupervised) that help predict the data of tomorrow.

Hopefully, this article helped clear up any questions you may have had about X and Y variables- let us know in the comments below if it was your first time using them!

And as always, happy coding!

 

Other Quick Data Science Tutorials

At EML, we have many fun and exciting data science walkthroughs that break things down so anyone can understand them.

Below we’ve listed a few that are similar to this guide:

]]>
Best Guesses: Understanding The Hypothesis in Machine Learning https://enjoymachinelearning.com/blog/hypothesis-in-machine-learning/ Thu, 22 Feb 2024 14:23:53 +0000 https://enjoymachinelearning.com/?p=1521 Read more

]]>
Machine learning is a vast and complex field that has inherited many terms from across mathematics and statistics.

It can sometimes be challenging to get your head around all the different terminologies, never mind trying to understand how everything comes together.

In this blog post, we will focus on one particular concept: the hypothesis.

While you may think this is simple, there is a little caveat in machine learning: the term has both a statistics side and a learning side.

Don’t worry; we’ll do a full breakdown below.

You’ll learn the following:

  • What Is a Hypothesis in Machine Learning?
  • Is This any different than the hypothesis in statistics?
  • What is the difference between the alternative hypothesis and the null?
  • Why do we restrict hypothesis space in artificial intelligence?
  • Example code performing hypothesis testing in machine learning



What Is a Hypothesis in Machine Learning?

In machine learning, the term ‘hypothesis’ can refer to two things.

First, it can refer to the hypothesis space, the set of all possible hypotheses (candidate rules) an algorithm could use to predict or classify a new instance.

Second, it can refer to the traditional null and alternative hypotheses from statistics.

Since machine learning works so closely with statistics, most of the time when someone references the hypothesis, they mean hypothesis tests from statistics.


Is This Any Different Than The Hypothesis In Statistics?

In statistics, the hypothesis is an assumption made about a population parameter.

The statistician’s goal is to test whether the evidence is strong enough to reject it.


This will take the form of two different hypotheses, one called the null, and one called the alternative.

Usually, you’ll establish your null hypothesis as an assumption that the population parameter equals some value.

For example, in Welch’s T-Test Of Unequal Variance, our null hypothesis is that the two means we are testing (population parameter) are equal.

This means our null hypothesis is that the two population means are the same.

We run our statistical tests, and if our p-value is significant (very low), we reject the null hypothesis.

This would mean the population means of the two samples you are testing are unequal.

Usually, statisticians will use the significance level of .05 (a 5% risk of being wrong) when deciding what to use as the p-value cut-off.


What Is The Difference Between The Alternative Hypothesis And The Null?

The null hypothesis is our default assumption; the test looks for evidence strong enough to reject it.

The alternate hypothesis is usually the opposite of our null and is much broader in scope.

For most statistical tests, the null and alternative hypotheses are already defined.

You are then just trying to find “significant” evidence we can use to reject our null hypothesis.


These two hypotheses are easy to spot by their specific notation. The null hypothesis is usually denoted by H₀, while H₁ denotes the alternative hypothesis.


Example Code Performing Hypothesis Testing In Machine Learning

Since there are many different hypothesis tests in machine learning and data science, we will focus on one of my favorites.

This test is Welch’s T-Test Of Unequal Variance, where we are trying to determine if the population means of these two samples are different.

There are a couple of assumptions for this test, but we will ignore those for now and show the code.

You can read more about this here in our other post, Welch’s T-Test of Unequal Variance.

from scipy import stats

# df is a pandas DataFrame holding the two columns we want to compare
def welchsttest(M1, M2):

    # remember, this is Welch's test, so we do not assume equal variance
    T, p_value = stats.ttest_ind(M1, M2, equal_var=False)

    print(f'T value {T},\n\np-value {round(p_value, 5)}\n')

    if p_value < .05:
        print('Reject Null Hypothesis')
    else:
        print('Fail To Reject Null')


welchsttest(df['price'], df['sqft'])

We see that our p-value is very low, and we reject the null hypothesis.

welch t test result with p-value


What Is The Difference Between The Biased And Unbiased Hypothesis Spaces?

The difference between the biased and unbiased hypothesis spaces is how many possible example combinations your algorithm can account for.

The unbiased space covers every possible combination, while the biased space only covers the training examples you’ve supplied.

Since neither of these is optimal (one is too small, one is much too big), your algorithm creates generalized rules (inductive learning) to be able to handle examples it hasn’t seen before.

Here’s an example of each:


Example of The Biased Hypothesis Space In Machine Learning

The biased hypothesis space in machine learning is a restricted subspace in which your algorithm only considers the training examples it has actually seen when making predictions.

This is easiest to see with an example.

Let’s say you have the following data:

Happy and Sunny and Stomach Full = True

Whenever your algorithm sees those three together in the biased hypothesis space, it’ll automatically default to true.

This means when your algorithm sees:

Sad and Sunny And Stomach Full = False

It’ll automatically default to False since it didn’t appear in our subspace.

This is a greedy approach, but it has some practical applications.
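Here is a toy sketch of that greedy, lookup-table behavior; the feature names and encoding are hypothetical:

# a "biased" lookup-table learner: it only knows the exact examples it
# has seen and defaults to False for anything else
seen_examples = {
    ("Happy", "Sunny", "Stomach Full"): True,
}

def biased_predict(mood, weather, stomach):
    # an unseen combination falls back to False
    return seen_examples.get((mood, weather, stomach), False)

print(biased_predict("Happy", "Sunny", "Stomach Full"))  # True
print(biased_predict("Sad", "Sunny", "Stomach Full"))    # False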



Example of the Unbiased Hypothesis Space In Machine Learning

The unbiased hypothesis space is a space where all combinations are stored.

We can re-use our example from above:

Happy and Sunny and Stomach Full = True

This would start to break down as:

Happy = True

Happy and Sunny = True

Happy and Stomach Full = True

… etc

Let’s say you have four options for each of the three attributes.

4 x 4 x 4 = 64 possible instances

Since the unbiased hypothesis space stores every possible subset of those instances, it would contain 2^64 hypotheses (roughly 1.8 x 10^19) just for our little three-attribute problem.

This is practically impossible; the space would become enormous.
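A quick back-of-the-envelope check of those numbers:

# back-of-the-envelope check of the combinatorics above
n_instances = 4 ** 3             # 3 attributes with 4 options each
n_hypotheses = 2 ** n_instances  # every subset of instances is a candidate
print(n_instances)   # 64
print(n_hypotheses)  # 18446744073709551616, roughly 1.8e19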


So while it would be highly accurate, this has no scalability.

More reading on this idea can be found in our post, Inductive Bias In Machine Learning.


Why Do We Restrict Hypothesis Space In Artificial Intelligence?

We have to restrict the hypothesis space in machine learning. Without any restrictions, our domain becomes much too large, and we lose any form of scalability.

This is why our algorithm creates generalized rules to handle the new examples it will see in production.

This gives our algorithms a generalized approach that will be able to handle all new examples that are in the same format.
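As a contrast to the lookup-table sketch earlier, here is a minimal example of a restricted, generalizing learner handling a combination it never saw; the 0/1 encoding of [happy, sunny, stomach_full] is hypothetical:

from sklearn.tree import DecisionTreeClassifier

# train on a few toy examples encoded as [happy, sunny, stomach_full] flags
X_train = [[1, 1, 1], [0, 0, 0], [1, 0, 1]]
y_train = [True, False, True]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict([[0, 1, 1]]))  # an unseen combination still gets a prediction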


Other Quick Machine Learning Tutorials

At EML, we have a ton of cool data science tutorials that break things down so anyone can understand them.

Below we’ve listed a few that are similar to this guide:

]]>
How To Deal With Zip Codes In Machine Learning [Python Code] https://enjoymachinelearning.com/blog/zip-codes-in-machine-learning/ Thu, 22 Feb 2024 14:13:56 +0000 https://enjoymachinelearning.com/?p=1446 Read more

]]>
One common task you will encounter while working in machine learning is clunky datasets that need cleaning.

This involves dealing with inconsistencies and errors in the data, as well as data that wouldn’t “make sense” to our models in its current form.

This blog post will discuss how to deal with zip codes so you can get them in a form your machine-learning models will love.

We will provide 3 Python code options to help you get started and explain some logic for each option below.



1.) Take The First 2-3 Digits Of The Zip code

In many data sets, zip codes are included to add geographic information to specific regions.

Because there are so many different zip codes, this variable (if expanded) can sometimes create more columns and categories than desired.

A way to get around this is by creating your own encoding. Take the first few digits of each zip code and use that to create categories.

For example, many Florida zip codes start with “33”; we could keep the first two digits of each zip code, giving us a column representing some regions in Florida.

You could even go further and keep the first three digits, which usually will (loosely) break down each state into cities and towns.

This method is excellent because it would provide some generalizability – if we were to see a new zip code that started with “33”, our model could easily classify it as the central part of Florida. We do not have to worry about it being the first time we’ve seen that zip code.

import pandas as pd

# dataset we used
# https://www.kaggle.com/datasets/danofer/zipcodes-county-fips-crosswalk

# read in our csv
zips = pd.read_csv('zipcodes.csv')

# a simple lambda function that keeps the first two digits
# note: if ZIP is stored as an integer, leading zeros are already lost;
# padding first with str(x).zfill(5) is safer for Northeastern zip codes
zips['ZIP2'] = zips['ZIP'].apply(lambda x: str(x)[0:2])

zips

 

zip code data frame

 

print(f'Original Uniques {zips.ZIP.nunique()} vs New Uniques {zips.ZIP2.nunique()}')

unique zip code comparison

Now that we’ve cut down our categories from ~39,000 to 90, we can utilize a HashEncoder, TargetEncoder, or create Dummy Variables to feed our model values/variables that it likes.
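As a small follow-on sketch, dummy variables on the reduced ZIP2 column are now perfectly manageable (this continues from the zips data frame built above):

# with only ~90 two-digit prefixes, one-hot encoding stays manageable
zip2_dummies = pd.get_dummies(zips['ZIP2'], prefix='zip2')
print(zip2_dummies.shape)  # roughly 90 columns instead of ~39,000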

Reference: https://www.unitedstateszipcodes.org/ 


2.) Latitude and Longitude Centering For Easy Model Prediction

If you’re looking for another way to deal with your zip code data, you should map latitude and longitude into your data frame.

This can be especially useful if you’re working with tree-based models, which can understand numerical data in this form.

Under the hood, tree models split on the latitude and longitude values, effectively carving the map into rectangular regions, which lets your model converge faster and with less data than treating zip codes as raw categories.

Now, if your model receives new latitude and longitude coordinates, it can handle them easily, since distance is a numerical signal rather than a categorical one.

While you probably think you’ll need an API, tons of pre-built “maps” are available online.

In the code below, I use a simple text file from GitHub to map longitude and latitude to my data frame:

import pandas as pd

# here's a list of all zip codes
# with latitude and longitude values
# https://gist.github.com/erichurst/7882666

# load in our data
zips = pd.read_csv('zipcodes.csv')

# load in our map from above
zipCodeMap = pd.read_csv('zip_codes.txt')

# merge our dataframes together with a join
mappedZipCodes = zips.merge(zipCodeMap, on='ZIP', how='left')

# here is our final result
mappedZipCodes

 

adding lat and lon to zip code data

You can quickly proceed with modeling, dropping the original “ZIP” category.

Reference: https://gist.github.com/erichurst/7882666


3.) Using Dummy Variables To Transform Categorical Data

In some data sets, it might make more sense to treat zip codes as categorical variables instead of numerical ones.

This would allow you to use dummy variables.

If you had data on home prices in different zip codes, you could create a dummy variable for each zip code.

This is slightly different from the first option we discussed, which drastically cut down the potential feature space by truncating the zip code values.

Creating new variables for each zip code can quickly cause dimensionality problems from the expansion.

This is where your feature space becomes so large and sparse that models can’t find any meaning in it.

There is another problem with handling zip codes this way since you’ll only have columns for the zip codes that you’ve “seen” in your training set.

This creates a model that can’t handle unseen zip codes.

While models like this aren’t necessarily “bad,” they should only be used in situations where you know there will not be any new zip codes coming down the pipeline.

While there are a lot of negatives, if you’re performing EDA and see you only have 1 or 2 unique zip codes and can guarantee there won’t be any new ones down the road, this route quickly becomes very practical.

import pandas as pd

# load in our data and keep a small sample of 8 rows for the example
zips = pd.read_csv('zipcodes.csv').sample(n=8)

# create dummy variables only for zip
dummy_zips = pd.get_dummies(zips['ZIP'])

# bring those rows back in on the index
combined = pd.concat([dummy_zips, zips], axis=1)

combined

 

dummy variable zip code data frame


Handling Zip Codes The Right Way For Model Improvement

We hope you found this post helpful – zip codes in machine learning can be a little tricky, but with the right tools in the toolbox, you should know how to fix your problem and continue modeling quickly.

Let us know in the comments below if you have any other tips or tricks on working with zip codes.

We love hearing from our readers and sharing new ideas with everyone.


Other Quick Machine Learning Tutorials

At EML, we have a ton of data science tutorials that break things down so anyone can understand them.

Below we’ve listed a few that are similar to this guide:

]]>