Data is the heart and soul of machine learning.
Without data, data scientists cannot create impactful machine-learning algorithms (duh).
While this seems pretty standard, the type of data makes a huge difference when trying to perform data analysis.
And sometimes, when we receive data that we’re not used to or haven’t dealt with before – it can cause problems.
In this guide, we will look at the different types of data for machine learning models.
After reading this short article, you’ll understand the following:
- ALL of the Different Types Of Data in Machine Learning
- Discrete Vs. Continuous
- Ordinal Vs. Numerical Vs. Nominal
- Timeseries Vs. Cross-Sectional
- Big Data Vs. Standard
- Streamed Vs. Batch Dataset
- Answers To Some Common Data Type Questions At The End
Let’s get to it!
Different Types Of Data In Machine Learning
When doing machine learning, you’ll encounter various data types.
Discrete data is a countable data type, like the number of children in a family (whole number).
Continuous data is a data type that can be measured, like height or weight.
Ordinal data can be ranked, like 3rd place or runner-up.
Numerical data is data that can be quantified, like money or age.
Nominal data is data that can be categorized, like colors or countries.
Time series data is collected over time, like monthly sales figures.
Cross-sectional data is data collected at one point in time across various individuals, like census data.
Streamed data is collected in real-time, like social media posts.
Batch data is collected in chunks, like customer purchase records.
Big data is large sets of structured and unstructured data, like weather patterns or satellite imagery (usually streamed).
Standard data is small sets of well-defined structured data, like death certificates or tax returns.
As you can see, there’s a wide variety of different data types that you’ll encounter when doing machine learning.
Whether you’re trying to classify fraud, predict salaries, or build an awesome visualization, understanding these differences is essential for building a model that works best for you and your customers.
Discrete Vs. Continuous
Discrete data is often considered data that can be counted, like the number of students in a class or crayons on the floor (finite).
However, discrete data can also take on a non-numeric form, like the color of someone’s eyes.
Continuous data, on the other hand, is always numeric and can represent any value within a specific range, like height or weight.
For example, a scale might read 125.3 or 125.325 pounds, and so on; you’ll never record the “exact” weight, because every measurement sacrifices some precision.
These two types of data are often used interchangeably in machine learning (which would upset your old statistics teacher!).
For the scope of machine learning models, treating continuous variables as discrete (stored at some finite precision) is usually your only option, and it makes modeling much more straightforward.
We can see below that even though we have a mix of continuous (weight and height) and discrete (laps on track) columns, treating them all as discrete still lets us build models.
import pandas as pd
# create our data
data = {'weight': [125, 135, 160], 'height': [62, 50, 49], 'laps on track': [6, 2, 6]}
# make it a data frame
data = pd.DataFrame(data)
# show
data
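To make the “treat continuous as discrete” idea concrete, one common approach is binning a continuous column into discrete ranges with pandas’ `pd.cut`. This is only a sketch; the bin edges and the `light`/`heavy` labels below are arbitrary choices for illustration, not part of the original example.

```python
import pandas as pd

# the same illustrative data as above
data = pd.DataFrame({
    'weight': [125, 135, 160],
    'height': [62, 50, 49],
    'laps on track': [6, 2, 6],
})

# bin the continuous 'weight' column into discrete ranges
# (bin edges and labels are arbitrary, chosen just for illustration)
data['weight_bin'] = pd.cut(data['weight'], bins=[120, 140, 170], labels=['light', 'heavy'])

print(data)
```

After this step, `weight_bin` is a discrete (categorical) column the model can treat like any other label.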
Ordinal Vs. Numerical Vs. Nominal
Ordinal data is a type of data where the values have a natural order.
For example, if you were to ask people to rate their satisfaction with a product on a scale of 1 to 5, the resulting data would be ordinal.
Numerical data is data that can be measured and quantified; its values are usually derived from counting or measuring things.
For example, if you were studying the effects of a new medication, you would likely use numerical data to track changes in a patient’s blood pressure or heart rate.
Similarly, if you were tracking the cost of a stock at closing each day of the week, you’d record that number in your dataset as numerical data.
Nominal data is categorical data that does not have a natural order.
For example, if we had a variable in our dataset that was the colors in a crayon box, the resulting data would be nominal – since it has no order.
Nominal data is seen throughout machine learning and is called “categorical data.”
In our dataset below, popcorn price would be numerical data, favorite movie genre would be nominal data, and movie rating would be ordinal.
import pandas as pd
# create our data
data = {'popcorn_price': [8.99, 9.50, 9.25], 'favorite_movie_genre': ['horror', 'scifi', 'action'], 'movie_rating': [6, 8, 3]}
# make it a data frame
data = pd.DataFrame(data)
# show
data
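When it’s time to model, nominal columns usually need encoding, while numerical and ordinal columns can stay as-is. A minimal sketch using pandas’ `get_dummies` on the movie dataset above (the `encoded` name is just for illustration):

```python
import pandas as pd

# the same illustrative movie data as above
data = pd.DataFrame({
    'popcorn_price': [8.99, 9.50, 9.25],
    'favorite_movie_genre': ['horror', 'scifi', 'action'],
    'movie_rating': [6, 8, 3],
})

# one-hot encode the nominal column; the numerical (popcorn_price)
# and ordinal (movie_rating) columns pass through unchanged
encoded = pd.get_dummies(data, columns=['favorite_movie_genre'])

print(encoded.columns.tolist())
```

Each genre becomes its own 0/1 column, which is exactly what “categorical data needs help” means in practice: the model never sees the raw strings.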
Timeseries Vs. Cross-Sectional
Timeseries data tracks the same entity (or entities) over time, while cross-sectional data sets track different entities at the same point in time.
Timeseries data sets are ideal for tracking trends over time. Since they follow the same entity, they can provide a clear picture of how that entity is changing over time.
On the other hand, cross-sectional data sets are better suited for comparing entities at a single point in time. By observing many entities simultaneously, cross-sectional data sets can help us identify relationships between variables.
Both of these “types” have their ups and downs.
For example, time series data sets can bring some challenges. Many machine learning algorithms assume observations are “independent,” and time series data violates that assumption by construction, since each observation depends on the ones before it.
Cross-sectional datasets, meanwhile, can fall into the trap of “data fatigue.”
Since we’re only supplied data for a specific point in time, changes happening before or after are not considered during modeling. This can sometimes lead to short-sighted models, or models that fatigue as time goes on.
Below, we have an example of a time series dataset.
import pandas as pd
# create our data
data = {'day': [1, 1, 2, 2], 'id': ['1', '2', '1', '2'], 'price': [5, 6, 6, 7]}
# make it a data frame
data = pd.DataFrame(data)
# show
data
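One common way to work with a time series like this is to build features from each entity’s own past, for example the previous day’s price. A small sketch using pandas’ `groupby` and `shift`; the `prev_price` column name is made up for illustration:

```python
import pandas as pd

# the same illustrative time series data as above
data = pd.DataFrame({
    'day': [1, 1, 2, 2],
    'id': ['1', '2', '1', '2'],
    'price': [5, 6, 6, 7],
})

# sort by entity and time, then create a "previous day's price"
# feature computed separately for each entity
data = data.sort_values(['id', 'day'])
data['prev_price'] = data.groupby('id')['price'].shift(1)

print(data)
```

The first day for each entity has no history, so `prev_price` is missing there; this is also a concrete view of why time series observations aren’t independent: each row carries information about the one before it.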
Streamed Vs. Batch Dataset
Streamed datasets are continuous, meaning new data is constantly being added in real time.
You’ll usually see this type of data from systems built at scale that are always running, like social media companies.
On the other hand, batch datasets are finite; they contain only a set amount of data, typically collected at specific intervals.
Often, these types of datasets will be “blended” together.
If your system is pushing out streamed data, most data scientists will use reservoir sampling.
This creates a batched dataset with the same distribution as your streamed data.
You can then build models for continuous real-time systems (streamed data) from your self-created batched dataset.
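The sampling step above can be sketched in a few lines. This is a minimal version of the classic Algorithm R; the `range` “stream,” the sample size, and the seed are all made up for illustration (a real stream would be unbounded):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of size k from a stream of unknown
    length (Algorithm R). Every item seen so far has an equal chance
    of ending up in the reservoir."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # pick a slot among all i + 1 items seen
            if j < k:
                reservoir[j] = item     # replace with decreasing probability
    return reservoir

# simulate a stream with a simple range, just for illustration
batch = reservoir_sample(range(1_000_000), k=100, seed=42)
print(len(batch))  # 100
```

The key property is that the reservoir never grows past `k` items, so memory stays constant no matter how long the stream runs.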
More Reading:
https://en.wikipedia.org/wiki/Reservoir_sampling
Big Data Vs. Standard
Big data and standard data are terms often used interchangeably, but there are some slight differences between them.
Standard data, such as data in a database, is typically collected in a structured format.
This type of data is easy to analyze and can be used to answer specific questions.
As a data scientist, don’t be shocked if 90% of your work is with standard data.
On the other hand, big data is often unstructured and can come from various sources, like text files sitting in Amazon Web Services’ S3 service.
This makes it more difficult to analyze, but the sheer quantity of data allows for incredible models (deep learning).
Big data is also growing faster than standard data, making it much more costly to store than standard data in SQL databases.
Other Quick Machine Learning Tutorials
At EML, we have a ton of cool data science tutorials that break things down so anyone can understand them.
Below we’ve listed a few that are similar to this guide:
- Instance-Based Learning in Machine Learning
- Verbose in Machine Learning
- Noise In Machine Learning
- Hypothesis in Machine Learning
- Bootstrapping In Machine Learning
- Inductive Bias in Machine Learning
- Epoch In Machine Learning
- Understanding The Hypothesis In Machine Learning
- Zip Codes In Machine Learning
- get_dummies() in Machine Learning
- X and Y in Machine Learning
- F1 Score in Machine Learning
- Generalization In Machine Learning
Frequently Asked Questions
What type of data does machine learning need?
Machine learning algorithms need data in a format they can understand. Most of the time, you’ll want to feed your algorithms discrete numerical variables so they can converge. Some algorithms can handle categorical data directly (like k-modes), but most need help.
Is machine learning required for data analytics or data science?
Machine learning is not required for data analytics or data science, and it makes up only a small portion of those roles’ workflow. Most of the time in these roles is spent cleaning data and presenting insights into business problems, with machine learning used only when the situation warrants it.
Why is having the right dataset important for machine learning algorithms?
Like an engine to a car, data makes or breaks machine learning algorithms. Think about it this way: if someone handed you a list of numbers to memorize and then asked you what those numbers were, you’d have a good chance of answering correctly. If someone handed you a blurry, broken list, you’d have no chance of answering the question.
Do data types in machine learning datasets matter?
Data types do not matter in machine learning as long as they are handled correctly. If categorical data is treated as ordinal, you’ll have problems. If feature engineering and initial exploration are conducted correctly, the data type will not matter in a machine-learning problem.