We’ve all been there: you’ve worked night and day to build an accurate model for your dataset, and you finally have a prediction – but it’s still on the standardized scale. What do you do? How do you reverse the standardization?
In this 2-minute guide, we’ll go over how you can recover the real target value from your model’s standardized prediction.
If you’re just here for a quick code chunk, here is a Python function to get you on your way.
Reverse Standardization In Python For Model Prediction
import pandas as pd

# example data
df = pd.read_csv('ds_salaries.csv')

# let's say your model gave you a standardized output for a salary;
# here's how you can reverse engineer it from the target column
def reverse_standardization_pred(col, prediction):
    # calculate the mean of the original column
    mean = sum(col) / len(col)
    # calculate the (population) variance
    var = sum((val - mean) ** 2 for val in col) / len(col)
    # calculate the standard deviation
    std = var ** 0.5
    # undo the standardization: real value = prediction * std + mean
    real_val = prediction * std + mean
    return real_val

# your model predicted a salary; here's how to reverse it,
# where the .25 is your **model's** (standardized) prediction
real_salary = reverse_standardization_pred(df['salary_in_usd'], .25)
print(f'Unstandardized salary data was: ${round(real_salary, 2)}')
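As a quick sanity check, you can standardize a known salary by hand and confirm the function maps it back. This is only a minimal sketch, assuming the same ds_salaries.csv file and the function defined above are already loaded.

# pick a real salary, standardize it by hand, then reverse it
known_salary = df['salary_in_usd'].iloc[0]
mean = df['salary_in_usd'].mean()
std = df['salary_in_usd'].std(ddof=0)  # population std, matching the function above
z_score = (known_salary - mean) / std

recovered = reverse_standardization_pred(df['salary_in_usd'], z_score)
print(abs(recovered - known_salary) < 1e-6)  # True: the round trip recovers the original salary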
Why You Should Standardize Variables
Standardizing variables puts every column on a common scale (a mean of zero and a standard deviation of one), which keeps features measured in large units from dominating features measured in small ones and makes results easier to compare and interpret.
Some machine learning models, like lasso and ridge regression, depend on scaled data because their regularization penalty treats every coefficient the same way.
Models trained with gradient descent also tend to converge faster when the inputs are standardized. A minimal sketch of this workflow follows below.
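In practice, you usually won’t compute the mean and standard deviation by hand; a scaler object can remember them for you. Here is a minimal sketch using scikit-learn’s StandardScaler, assuming the same ds_salaries.csv file as above (scikit-learn isn’t used elsewhere in this post, so treat this as one possible approach rather than the only one).

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('ds_salaries.csv')

# StandardScaler expects a 2-D array, hence the double brackets
scaler = StandardScaler()
scaled_salary = scaler.fit_transform(df[['salary_in_usd']])

# ...train a model on scaled_salary and get a standardized prediction...
standardized_pred = [[.25]]  # hypothetical model output

# inverse_transform undoes the scaling using the stored mean and std
real_pred = scaler.inverse_transform(standardized_pred)
print(real_pred[0][0])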
Why We Can’t Rebuild A Dataset From Standardized Data (Without The Old Data)
We can’t rebuild a dataset from standardized data without the old data because of how standardization is done in the first place.
Let’s take a look.
The formula for standardization is the following (for each data point):
z = (x − μ) / σ
where μ is the column’s mean and σ is its standard deviation.
Once we’ve applied this transformation to our data, we have a standardized column with a mean of zero and a standard deviation of one.
If we wanted to reverse engineer this column (without the old data), the formula would be the following:
x = z * σ + μ
But the only standard deviation and mean we have left are the standardized column’s own: σ = 1 and μ = 0. Using this formula, every point maps to itself since we multiply it by 1 and add 0.
This is why (without the old standard deviation and mean) we cannot reverse standardize the data.
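A tiny numeric check makes this concrete. The column values below are made up purely for illustration.

# a toy column, standardized by hand
col = [10, 20, 30]
mean = sum(col) / len(col)                                    # 20
std = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5   # ~8.165

standardized = [(v - mean) / std for v in col]                # [-1.22..., 0.0, 1.22...]

# trying to "reverse" using only the standardized column's own stats
new_mean = sum(standardized) / len(standardized)              # 0.0
new_std = (sum((v - new_mean) ** 2 for v in standardized) / len(standardized)) ** 0.5  # 1.0

recovered = [z * new_std + new_mean for z in standardized]
print(recovered)  # the same z-scores come back (up to floating-point noise), not 10, 20, 30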
Other Articles in our Machine Learning 101 Series
We have many quick guides that go over some of the fundamental parts of machine learning. Some of those guides include:
- Heatmaps In Python: Visualizing data is key in data science; this post walks through eight different libraries for plotting heatmaps.
- Welch’s T-Test: Do you know the difference between Student’s t-test and Welch’s t-test? Don’t worry, we explain it in-depth here.
- Parameter Versus Variable: Commonly misunderstood – these two aren’t the same thing. This article will break down the difference.
- Criterion Vs. Predictor Variables: Now that you can derive your output, make sure you can understand the business context and create accurate models from your dataset.
- Normal Distribution vs. Uniform Distribution: Now that you can do a full ML pipeline, you should explore variable distributions to improve your models.
- Gini Index vs. Entropy: Learn how decision trees make splitting decisions. These two are the workhorses of top-performing tree-based methods.
- CountVectorizer vs. TFIDFVectorizer: Two fundamental NLP Algorithms. I’d take a look at these once you’re ready to start working with language models.
- Feature Selection With SelectKBest Using Scikit-Learn: Feature selection is tough; we make it easy for both regression and classification in this guide.