
Dataset2API Project 1: Data Science Salary API, EDA [1/4]

This blog post will discuss EDA (exploratory data analysis) for our first project in our Dataset2API series.

This project focuses on creating a salary API that will allow users to query for average salaries based on job titles, locations, years of experience, and other attributes.

We will use Python and Pandas for data analysis and manipulation, scikit-learn and XGBoost for modeling, and Flask (with Flask-RESTful) for web development.

Complete code is available here in this GitHub repo, but each post will walk through the process I personally go through to get from a dataset and a problem to a working API.

This is part one of Project 1; you can view the other posts in this project here:


Our Goal and Dataset For Data Science Salary API

We will be building an API that covers the general roles under the umbrella of data science.

We needed a dataset that covered many different roles, was up to date, and had enough data to build this API.

Fortunately, we were able to find just such a dataset on Kaggle.

The dataset consists of roughly 600 different data science role and salary combinations, which we hope will be enough to cover the needs of our API.

This was a huge relief for us, as finding a suitable dataset was one of our biggest challenges in building this API.

We do have some concerns about the small amount of data, but we have some techniques we can deploy that we hope will give us an accurate API.

We hope you follow along with these posts to build your first API, or if you’re already a pro, use some of the tips and tricks discussed here to upgrade your current APIs!

Dataset Link

https://www.kaggle.com/datasets/whenamancodes/data-science-fields-salary-categorization

Exploratory Data Analysis (EDA) For Data Science Dataset

When starting EDA, one of the first things I like to do is go column by column and note the amount of work needed to get my dataset to an API-ready level.

Here is a quick picture of the data

original dataset
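If you're following along, loading the data with pandas looks something like this (a minimal sketch; the CSV filename below is just a placeholder, so use whatever name your Kaggle download has):

import pandas as pd

# Load the Kaggle CSV (filename is a placeholder; adjust to your download)
df = pd.read_csv("ds_salaries.csv")

# Peek at the first few rows, like the screenshot above
print(df.head())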

Right away, I notice a salary_in_usd column and a regular “salary” column.

Since I want to compare apples to apples, I’ll eventually delete the salary and salary_currency columns and focus only on the salary_in_usd column.
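When we get to that step, the cleanup could be as simple as this (a sketch, assuming the column names match what’s shown above):

# Drop the redundant currency columns and keep only salary_in_usd
df = df.drop(columns=["salary", "salary_currency"])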

Also, I notice an employee_residence and a company_location column; just from the sheer number of unique values these columns will have, I’m betting I’ll only use one of the two for modeling.

Besides the above, nothing else seems to jump off the page; let’s see how much data we have.

print(df.shape)

our original dataset shape

While we seem to have a good chunk of columns (12), building an API with only 600 rows of data will be tough.

The first thought I have when I see this is, “We probably shouldn’t delete any rows under any circumstances.”

Seeing such a low number will change how I approach dropping NAs and removing other data points.


Checking Unique Value Counts And Distributions For All Columns With Pandas

Seeing lots of categorical data makes me believe we’ll need to OHE (one-hot encode) our data down the line. When a column has too many unique categories, we run into problems with OHE.

Below, I check each categorical column and see how many unique values exist for each of those columns.

Note we’re not focused on the count of each individual value, just how many unique values there are.
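If you’d rather get that information in one pass instead of column by column, a quick sketch (assuming the categorical columns are stored as object dtype):

# Number of unique values per categorical column, all at once
print(df.select_dtypes(include="object").nunique())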

df.work_year.value_counts()

unique values for column year

Only three unique categories for our year column; we can easily handle this later.

df.experience_level.value_counts()

unique values for experience level

Only four unique categories for our experience column; we can easily handle this later.

df.employment_type.value_counts()

Employment type unique values

Only four unique categories for our employment type column; we can easily handle this later.

df.job_title.value_counts()

data title unique values

We have 50 unique values, and if we were to OHE this column, we would increase our feature space by 50 columns!

This would be too much (especially for 600 rows of data). We’ll take note of this and will deal with it later.
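To see that blow-up for yourself, here’s a quick sketch with pandas (pd.get_dummies is just one way to one-hot encode; nothing here is final):

import pandas as pd

# One-hot encoding job_title alone adds ~50 new columns
job_title_ohe = pd.get_dummies(df["job_title"], prefix="job_title")
print(job_title_ohe.shape)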

df.salary_in_usd.hist()

Salary data distribution

Our salary_in_usd column does not have a normal distribution (which we’d like to see) and seems to have some outliers around the 600k salary point.

We take note of this and continue with our EDA.
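A couple of quick checks I like to run to quantify what the histogram suggests (a sketch; the 500k cutoff below is just an arbitrary threshold for eyeballing the outliers):

# Spread and skew of the salary column
print(df["salary_in_usd"].describe())
print(df["salary_in_usd"].skew())

# Rows near that ~600k point
print(df[df["salary_in_usd"] > 500_000])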

df.employee_residence.value_counts()

countries

We have 57 unique values, and if we were to OHE this column, we would increase our feature space by 57 columns!

We’ll take note of this and will deal with it later. 

Note: Both employee_residence and company_location were similar in the number of unique values they had.
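One quick way to sanity-check that overlap (a sketch, not part of the original analysis) is to see how often the two columns agree row by row:

# How often does the employee live where the company is located?
match_rate = (df["employee_residence"] == df["company_location"]).mean()
print(f"{match_rate:.1%} of rows have matching employee residence and company location")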

df.company_size.value_counts()

company size unique values

Our company_size column looks great, with only three unique values; OHE can easily handle this later.


What We Learned From EDA And The Next Steps

Now that we’ve performed EDA and have a general understanding of our dataset, we need to move into feature engineering.


Notes that I took during this EDA process:

  1. Of our 12 columns, we will eliminate two, since salary_in_usd has made the salary and salary_currency columns redundant.
  2. employee_residence and company_location seem highly correlated, and it would be awkward to provide both to an API; decide which of the two I want to keep.
  3. employee_residence/company_location and job_title have way too many unique values for OHE. I will write functions to slim these columns down to a lower cardinality.
  4. salary_in_usd isn’t normally distributed and seems to have a few outliers. We will need to deal with that 600k salary point, as it may skew predictions high.
  5. Some of the data is from 2020 and 2021; since I’m building this API in 2022, I want to move these numbers to 2022 salary levels (by shifting the mean of those distributions; see the sketch below).
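As a rough sketch of what that last adjustment could look like (assuming work_year holds plain integer years and that a simple mean shift is good enough; the actual approach comes in the next post):

# Shift 2020/2021 salaries so each year's mean matches the 2022 mean
mean_2022 = df.loc[df["work_year"] == 2022, "salary_in_usd"].mean()
for year in (2020, 2021):
    year_mask = df["work_year"] == year
    shift = mean_2022 - df.loc[year_mask, "salary_in_usd"].mean()
    df.loc[year_mask, "salary_in_usd"] += shift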

 

In our next post, I’ll handle the problems above and prepare our data for modeling.

Check it out here

Or, if you are interested in a different part of this process, here are the links to the other posts.

Stewart Kaplan