A Beginner’s Guide to X and Y in Machine Learning

Machine learning is a vast and complex field that covers many different concepts.

Machine learning comes with a fair amount of jargon, but many of the ideas behind that jargon are very simple.

This guide covers one well-known piece of data science jargon: X and Y in machine learning.

By the end of this, you’ll know what they are, what they do, what they mean, and how you can use them to improve your conversations with machine learning professionals.

Let’s get started!

What Are X And Y In Machine Learning?

X and Y are jargon terms in Machine Learning.

“X” is the set of variables (often called features) that we use to predict or classify our “Y” variable (often called the target or label).

There can be many variables in our “X” set, but there will usually be only one variable in our “Y” set.

Machine Learning Example of X and Y

Below, we have a dataset, and our goal is to build a machine-learning model that can classify whether a car has an automatic or a manual transmission.

Here is that dataset:

We will need to separate the X columns from the Y column, and we will use the code below to do that.

```python
# assumes df is a pandas DataFrame that has already been loaded
# x is everything but the first column
x = df.iloc[:, 1:].values
# our target (y) is the first column
y = df.iloc[:, 0].values
```

Now that we’ve run that code on our data, we have the following.

X = Feature_1, Feature_2, Feature_3, Feature_4, Feature_5, Feature_6, Feature_7, Feature_8, Feature_9

Y = Transmission
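As a self-contained illustration of the split above, here is a minimal sketch using a small, made-up DataFrame. The feature values and the three-feature layout are assumptions for demonstration only; the real dataset has nine features.

```python
import pandas as pd

# Hypothetical toy data: the target is the first column, as in the article.
df = pd.DataFrame(
    {
        "Transmission": ["automatic", "manual", "automatic", "manual"],
        "Feature_1": [21.0, 22.8, 18.7, 30.4],
        "Feature_2": [6, 4, 8, 4],
        "Feature_3": [110, 93, 175, 52],
    }
)

# x is every column except the first; y is the first column.
x = df.iloc[:, 1:].values
y = df.iloc[:, 0].values

print(x.shape)  # (4, 3): four rows, three features
print(y.shape)  # (4,): one label per row
```

Notice that x keeps one row per car and one column per feature, while y holds exactly one label per row.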

We will now use this data to build our models!

Why are variables called x and y in machine learning?

When people think of data science and machine learning, their minds usually drift to complicated algorithms and computer code.

Many fail to realize that these fields are deeply rooted in statistics and mathematics.

The letters “X” and “Y” commonly represent variables in equations in these disciplines.

I’m sure you can remember when you first learned mathematics and explored the equation of a line.

Y = mX + B

Here, X is the input, Y is the output, m is the slope of the line, and B is the intercept.

It’s no different today in machine learning algorithms: our “X” typically represents the independent variables, while our “Y” represents the dependent variable.
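To see the line equation in action, here is a minimal sketch that recovers m and B from data. The data points are made up (generated from a line with slope 2 and intercept 1), and using NumPy’s `polyfit` is just one convenient way to fit a straight line.

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1 (m = 2, b = 1).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial (a straight line) to recover m and b.
# polyfit returns coefficients from the highest power down: [m, b].
m, b = np.polyfit(x, y, deg=1)

print(m, b)
```

Given only the inputs (x) and outputs (y), the fit finds the relationship between them, which is the same idea a machine learning model applies at a larger scale.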

Understanding how these variables interact is the heartbeat of machine learning.

Once an understanding is established, data scientists and machine learning engineers can design prediction models and systems to replicate this relationship.

While the recent boom in computing power has allowed much more complex models to be developed, at their core these fields are still based on the same fundamental principles of statistics and mathematics.

Are X and Y Both In The Training And Test Sets in Data Science?

When working in data science or machine learning, we must have a way to test our algorithms on data outside of the data we used to train them, to see how they’re truly performing.

Otherwise, we risk overfitting our data without noticing, which means that our algorithm will do well on the data it’s seen before but won’t be able to generalize to new, unseen data.

This is a considerable risk because an overfit model is likely to perform poorly in production.

One way to do this is to split our data into two parts: a training split and a testing split.

The training split is the data we’ll use to train our machine-learning algorithm.

The testing split is the data we’ll use to test our algorithm.

To do this, we’ll need to split both X and Y.

A common choice is to use 80% of the rows of both X and Y to train the model and hold out the remaining 20% to test it.

The image below explains how the splitting is usually done.
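In code, an 80/20 split might look like the sketch below. It assumes scikit-learn is available and uses its `train_test_split` helper; the article does not name a library, and the data here is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, and one label per row.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the rows for testing. X and y are split together
# so each row of X stays paired with its label in y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```

Setting `random_state` makes the shuffle reproducible, so the same rows land in the test split on every run.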

Do Unsupervised Machine Learning Algorithms Have X and Y?

Remember, in supervised learning, we have a clear target that we are trying to achieve.

In the example above, for instance, we were trying to predict the car’s transmission type. In unsupervised learning, there is no such target.

Instead, the goal is to gain insights into the dataset. More specifically, we’re exploring the “X” without having a “Y.”

There are many ways we could do this: we might cluster our data points into groups, or perform PCA (principal component analysis) to reduce the number of dimensions in our dataset.

Many different algorithms can be utilized in unsupervised learning, and the choice of algorithm will depend on the nature of the data and the insights we hope to gain.
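As a rough sketch of working with X alone, the example below clusters a handful of made-up points. It assumes scikit-learn’s `KMeans`, which is only one of many possible algorithms, and the choice of two clusters is an assumption that suits this toy data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two well-separated groups of 2-D points.
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.1], [7.9, 8.3], [8.2, 7.8],
])

# Note that fit_predict receives only X; there is no y to pass.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.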

As stated above, our dataset will not have a traditional “Y” variable.

Without a target to compare against, there is no direct way to measure the accuracy of our algorithm.

It is difficult to know when our model is “done” or “good.”

This makes unsupervised learning much more about exploration and insights than optimizing KPIs and offline metrics.

Do Supervised Machine Learning Algorithms Have X and Y?

All supervised machine learning algorithms have an X and a Y.

Our “X” set will comprise our independent variables, usually represented as a data frame or a matrix.

Our “Y” set will hold our dependent variable, usually as a single column (a series or a vector).

Supervised learning means we’re training algorithms using labeled data. Labeled data means data that has a target “Y.”

This is why all supervised algorithms have both X and Y: the “supervision” in the name comes from the labels in Y, which tell the algorithm the right answer for each row of X.
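The sketch below shows a supervised model receiving both X and Y. It assumes scikit-learn’s `LogisticRegression` purely as an example classifier, and the single-feature data is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled data: one feature (X) and a binary label (y).
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A supervised model needs both X and y to learn the mapping.
model = LogisticRegression()
model.fit(X, y)

# After training, the model predicts y for rows of X it has not seen.
predictions = model.predict(np.array([[0.0], [13.0]]))
print(predictions)
```

Compare this with an unsupervised algorithm, where the fitting step is given X only.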

Final Thoughts, X and Y in Machine Learning

So, X and Y variables.

You’ve probably heard of them before, even if you didn’t know what they are.

But in today’s data-driven age, understanding them is useful for building machine learning models, whether supervised (with both X and Y) or unsupervised (with X only), to help with predicting the data of tomorrow.