Part 2

Dataset2API Project 1: Data Science Salary API, Feature Engineering [2/4]

This blog post will discuss Feature Engineering for our first project in our Dataset2API series.

This project focuses on creating a salary API that will allow users to query for average salaries based on job titles, locations, years of experience, and other attributes.

We will use Python and Pandas for data analysis and manipulation, Sklearn and XGBoost for modeling, and Flask (Restful) for web development.

Complete code is available in this GitHub repo, but each post will walk through the process I personally go through to take a dataset and a problem to an API.

This is part two of Project 1; you can view the other posts in this project here:

 

Feature Engineering: Creating Data Ready for Modeling

After performing EDA on our data, we found some things we need to address before we can begin modeling.

Here are the notes I took during the EDA process:

  1. Of our 12 columns, we will eliminate two, as salary_in_usd has made them redundant.
  2. employee_residence and company_location seem highly correlated, and it is awkward to provide both to an API, so I'll decide which of the two to keep.
  3. employee_residence/company_location and job_title have way too many unique values for OHE. I will write functions to slim these columns down to a lower cardinality.
  4. salary_in_usd isn't normally distributed and seems to have a few outliers. We will need to deal with that $600k salary point, as it may skew predictions high.
  5. Some of the data is from 2020 and 2021; since I'm building this API in 2022, I want to move those numbers to 2022 salary numbers (by shifting the means of those distributions).

 

Making a Decision Between Two Columns With Pandas Correlation

Since employee_residence and company_location are highly correlated, it seems a bit awkward to provide both to our API.

Think about the end user, who will probably be an individual who is interested in finding out how much they should be making.

Since we're focused on ease of use and speed, requiring both feels a little bloated, so we will keep only one of the two.

We use correlation with our target variable (salary_in_usd) to determine which of the two categorical columns to keep.

We’ll have to convert these categorical columns with a label encoder before we can compare, as the .corr() method in Pandas cannot handle non-numeric values.

# import our LabelEncoder
from sklearn.preprocessing import LabelEncoder

# select the two candidate columns plus our target
residence = df[['company_location','employee_residence','salary_in_usd']]

# label-encode each column so .corr() can run
# (note: this rank-encodes salary_in_usd as well)
residence = residence.apply(LabelEncoder().fit_transform).corr()

residence

[Image: correlation of the two columns with salary_in_usd]

We see that employee_residence is slightly more correlated with our salary column than company_location.

From this, we decide to move forward with employee_residence and will drop company_location.
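
If you want to apply that decision right away, the drop is a one-liner (a minimal sketch; df is our working dataframe):

# drop the losing column from our working dataframe
df = df.drop(columns=['company_location'])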


Handling Categorical Columns With Too Many Unique Values In Pandas

We need to handle our employee_residence column, which has 57 unique values!

To do this, we will write a simple function that keeps only the top 5-10 values, which for our dataset hold about 80% of the rows.

If a value does not fall within those top 5-10, we'll give it the blanket tag "OTHR_RES."

While this will cost us some accuracy, it will make our system much more dynamic and able to handle data from countries that we haven't seen before.
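
Before writing that function, it's worth checking how much of the data the most common residences actually cover. A quick cumulative value count does it (a sketch against our df):

# share of rows covered by the most common residence values
share = df['employee_residence'].value_counts(normalize=True).cumsum()

print(share.head(10))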

Employee Residence Column

def convertResidence(text):

    # countries we keep as-is; everything else gets a catch-all tag
    approved = ['US','GB','IN','CA','DE','FR','ES','GR','JP']

    if text.upper() in approved:
        return text

    return 'OTHR_RES'

# .copy() avoids pandas' SettingWithCopyWarning when we add a column
residence_converted = df[['employee_residence']].copy()

residence_converted['conv'] = df.employee_residence.apply(convertResidence)

residence_converted

[Image: employee_residence values before and after conversion]

Most values stay the same, but a few did convert to our "OTHR_RES" tag.
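
To quantify how the buckets came out, a quick value_counts on the new column works:

# how many rows landed in each residence tag
print(residence_converted['conv'].value_counts())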


Job Title Column

Now, we need to do this same process for our job_title column, which has tons of unique values.

We’ll use a text search to find bits and pieces of the job roles to give us six different tags.

These tags will easily be One Hot Encoded (OHE) for our modeling portion.

def ConvertRole(text):

    # lowercase once so each check is simpler
    text = text.lower()

    # leadership roles
    if 'lead' in text or 'manager' in text or 'director' in text or 'head' in text:
        return 'LDR'

    # machine learning roles
    if 'machine' in text or 'ai ' in text or 'vision' in text:
        return 'ML'

    # data science roles
    if 'scientist' in text or 'analytics' in text or 'science' in text:
        return 'DS'

    # analyst roles
    if 'analyst' in text:
        return 'AL'

    # data engineering roles
    if 'engineer' in text:
        return 'DE'

    return 'OTHR_ROLE'

# .copy() again to avoid SettingWithCopyWarning
roles_converted = df[['job_title']].copy()

roles_converted['conv'] = df.job_title.apply(ConvertRole)

roles_converted

[Image: job titles before and after conversion]

We can see on the right how our conversion went.
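
As a preview of the OHE step, six tags encode into just six 0/1 columns with pandas (a quick sketch, not the final modeling code):

import pandas as pd

# one-hot encode the role tags; one indicator column per tag
role_dummies = pd.get_dummies(roles_converted['conv'], prefix='role')

print(role_dummies.head())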

Remote Percentage Column

Finally, we need to do this same process for our remote_ratio column.

This column is currently an integer, and I want to transform it into cleaner categories for easier understanding.

We'll bucket the percentage into three tags: Remote, Hybrid, and Office.

These tags will easily be One Hot Encoded (OHE) for our modeling portion.

def FixRemoteRatio(percentage):

    # mostly/fully remote
    if percentage > 50:
        return 'Remote'

    # mostly/fully in office
    if percentage < 50:
        return 'Office'

    # exactly 50% remote
    return 'Hybrid'

df['remote_converted'] = df.remote_ratio.apply(FixRemoteRatio)

df[['remote_ratio','remote_converted']]

[Image: remote_ratio before and after conversion]

Now we can see our remote ratio converted over.

 


Handling Outliers In Salary Data With IQR

We know from our EDA process that we found some outliers in our salary_in_usd column, and that our data wasn't normally distributed as we'd like.

Let’s see if we can fix that.

import numpy as np

salary_converted = df[['salary_in_usd']].copy()

# find our quartiles
q1, q3 = np.percentile(salary_converted['salary_in_usd'], [25, 75])

# interquartile range
IQR = q3 - q1

# lower barrier
Q1_Barrier = q1 - 1.5 * IQR

# upper barrier
Q3_Barrier = q3 + 1.5 * IQR

# mark outliers: 1 if below the lower barrier or above the upper barrier
salary_converted['salary_in_usd'+'_Outlier'] = np.where(
    (salary_converted['salary_in_usd'] < Q1_Barrier) |
    (salary_converted['salary_in_usd'] > Q3_Barrier),
    1, 0)

salary_converted

[Image: salary data with outlier flags]

From this random sample, we can see that making $423,000 a year is an outlier for our data.

We will be removing any values that we detect as an outlier.


Transforming Target Salary Data Into A Normal Distribution

For a numerical target variable, I always like to get the data as close to a normal distribution as possible.

This is simply because many models perform better when data follows a normal distribution.

Normalizing the column variables gives equal weight/importance to each variable, so no single variable drags model performance in one direction just because the numbers aren't equally scaled.

Normalizing the target variable confines the spread of values, which generally improves accuracy.

Note that StandardScaler standardizes to zero mean and unit variance; it doesn't change the shape of the distribution, but with the outliers removed, the result is close enough to normal for our purposes.
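
Under the hood, StandardScaler is just the z-score formula; here's a tiny worked example (the numbers are illustrative, not from our dataset):

import numpy as np

# z = (x - mean) / std, computed per column
x = np.array([50_000, 100_000, 150_000])
z = (x - x.mean()) / x.std()

print(z)  # [-1.2247...  0.  1.2247...]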

from sklearn.preprocessing import StandardScaler

# remove outliers
normalize_salary = salary_converted[salary_converted['salary_in_usd_Outlier'] < 1].copy()

# create scaler
scaler = StandardScaler().fit(normalize_salary[['salary_in_usd']])

# apply scaler
normalize_salary['Normalized_Salary'] = scaler.transform(normalize_salary[['salary_in_usd']])

# plot
normalize_salary.Normalized_Salary.hist()

[Image: histogram of the normalized salary data]

While this is not perfectly normal, I think this will do the job.
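
One practical note for the API later: we'll want to hang on to this scaler, because inverse_transform turns a scaled prediction back into dollars (a sketch; 0.5 is just an example value, not a real prediction):

# convert a scaled value back into USD
pred_usd = scaler.inverse_transform([[0.5]])

print(pred_usd)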


Handling Old Data And Moving It to the Current Year

Our data is almost ready! We’ve cleaned up most of the columns and have a normal target variable.

There is one thing that is bothering me: some of our data is from 2020 and 2021 (I’m building this in 2022).

When querying an API, you don’t care what the salary was in 2021; you want to know what the salary is for right now (2022).

I'm going to check whether the average salary has been increasing year over year (YoY), which is my guess.

# I'm interested in whether salaries are increasing YoY
from matplotlib import pyplot as plt

# (implied bridge step: write the outlier-free scaled salaries back into df
# under the column name we query below, dropping the outlier rows)
df['normalized_salary'] = normalize_salary['Normalized_Salary']
df = df.dropna(subset=['normalized_salary'])

fig = plt.figure()
ax = plt.axes()

_2020 = df.query('work_year == 2020')['normalized_salary']
_2021 = df.query('work_year == 2021')['normalized_salary']
_2022 = df.query('work_year == 2022')['normalized_salary']

# scatter each year's salaries against their row position
plt.scatter(range(len(_2020)), _2020.values, color='y')
plt.scatter(range(len(_2021)), _2021.values, color='g')
plt.scatter(range(len(_2022)), _2022.values, color='r')

# compare the yearly means
print(np.mean(_2020.values))
print(np.mean(_2021.values))
print(np.mean(_2022.values))

[Image: per-year salary scatter plot, with the three yearly means printed above]

Right away, we see from the top three numbers that salary is lowest in 2020 and highest in 2022.

I want to “move” the 2020 and 2021 salary data into 2022.


Changing the Salary Means for 2020 and 2021 to Be Equal to 2022

Let’s change the 2020 and 2021 salaries to be “equal” to the salaries for 2022.

# I want my estimator to work for the current year, so let's move
# these values all to 2022
current_year = np.mean(_2022.values)

change_2020 = np.mean(_2020.values)

change_2021 = np.mean(_2021.values)

# mean shift needed to bring each year up to the 2022 mean
increase_2020 = current_year - change_2020
increase_2021 = current_year - change_2021


def increaseSalaries(year, value):

    # shift 2020 salaries up by the 2020-to-2022 mean difference
    if year == 2020:
        return value + increase_2020

    # shift 2021 salaries up by the 2021-to-2022 mean difference
    elif year == 2021:
        return value + increase_2021

    # 2022 salaries stay as-is
    return value


df['final_salary'] = df.apply(lambda x: increaseSalaries(x.work_year, x.normalized_salary), axis=1)

[Image: salary column before (left) and after (right) the mean shift]

On the left, we have the beginning salary, and on the right, we have the final salary that we will be taking moving forward.

Let's double-check that everything is how we want it:

fig = plt.figure()
ax = plt.axes()

_2020 = df.query('work_year == 2020')['final_salary']
_2021 = df.query('work_year == 2021')['final_salary']
_2022 = df.query('work_year == 2022')['final_salary']

# same per-year scatter as before, now on the shifted salaries
plt.scatter(range(len(_2020)), _2020.values, color='y')
plt.scatter(range(len(_2021)), _2021.values, color='g')
plt.scatter(range(len(_2022)), _2022.values, color='r')

print(np.mean(_2020.values))
print(np.mean(_2021.values))
print(np.mean(_2022.values))

[Image: final per-year salary scatter plot with the now-equal yearly means]

Now that the means for each year are all equivalent, all data has been moved to the current year.
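
If you'd rather verify programmatically than eyeball the printout, np.isclose handles the floating-point comparison (a quick optional sanity check):

# confirm the yearly means now match, within floating-point tolerance
print(np.isclose(np.mean(_2020.values), np.mean(_2022.values)))
print(np.isclose(np.mean(_2021.values), np.mean(_2022.values)))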

 

Drop Unused Columns and See the Feature-Engineered Dataset

Now that we’ve converted everything over, we can drop the columns we aren’t using and visualize our final dataset.

This will be the dataset that we will work with for modeling.

# drop the superseded columns: final_salary replaces normalized_salary,
# remote_converted replaces remote_ratio, and work_year no longer matters
# now that all salaries have been shifted to 2022
df.drop(['normalized_salary', 'work_year', 'remote_ratio'], axis=1, inplace=True)

df

[Image: final feature-engineered dataset]

In our next post, we’ll start building our model. This data needs to be encoded before we can model it.

We explore multiple models and build out some parameter searching to get the best model possible.

Check it out here

Or, if you are interested in a different part of this process, here are the links to the other posts.

Stewart Kaplan
