Dataset2API Project 1: Data Science Salary API, Building API [4/4]

This blog post will discuss the final portion of our first project in our Dataset2API series, where we will build a working API.

This project focuses on creating a salary API that will allow users to query for average salaries based on job titles, locations, years of experience, and other attributes.

We will use Python and Pandas for data analysis and manipulation, Sklearn and XGBoost for modeling, and Flask (Restful) for web development.

Complete code is available here in this GitHub repo, but each post will walk through the process I personally go through to take a dataset and a problem to an API.

This is part four of Project 1; you can view the other posts in this project here:


Building Our Data Science Salary API

Now that everything is in order, we can finally start building our API.

Before we start, here is the complete code in case you want to copy any piece of it.

We will break down each of the sections below.


from flask import Flask, jsonify, request,  make_response
from flask_restful import Resource, Api
import joblib
import pandas as pd

# load in our model
loaded_model = joblib.load('finalized_model.sav')

# load in our encoder
encoder = joblib.load('OHEencoder.sav')

# creating the flask app
app = Flask(__name__)

# creating an API object
api = Api(app)


class Prediction(Resource):

    def get(self):
        
        '''
        returns the structure of the request needed for our
        post request
        '''

        # tell how to make requests to our API (during posts)
        return make_response(jsonify({
                                      'experience_level': ['SE','MI','EN','EX'],
                                      'employment_type' : ['FT','PT','CT','FL'],
                                      'company_size': ['S', 'M', 'L'],
                                      'role' : ['**Job Title**'],
                                      'residence' : ['2 Letter Country Code (US, GB) etc'],
                                      'remote%' : ['0','50','100']
                                      }), 201)
  
    def post(self):

        '''
        retrieves payload
        sends back model prediction
        '''
        
        # grab the payload data sent
        data = request.get_json()

        # make sure we have all of our columns
        if 'experience_level' not in data \
            or 'employment_type' not in data \
            or 'company_size' not in data \
            or 'role' not in data \
            or 'residence' not in data \
            or 'remote%' not in data:

            return make_response(jsonify({'message' : 'Missing A Category'}), 400)

        # convert the roles
        # the exact same way we did
        # in training
        def convertJob(text):
            
            '''
            converts job titles to form model can understand
            '''
    
            if 'lead' in text.lower() or 'manager' in text.lower() or 'director' in text.lower() or 'head' in text.lower():
                return 'LDR'
            
            elif 'machine' in text.lower() or 'ai ' in text.lower() or 'vision' in text.lower():
                return 'ML'
            
            if 'scientist' in text.lower() or 'analytics' in text.lower() or 'science' in text.lower():
                return 'DS'
            
            if 'analyst' in text.lower():
                return 'AL'
            
            if 'engineer' in text.lower():
                return 'DE'

            
            return 'OTHR_ROLE'

        # convert residence
        # the exact same way we did
        # in training
        def convertResidence(text):

            '''
            converts user input of residence so model can understand
            
            '''

            if len(text) != 2:
                return 'OTHR_RES'  # keep the same label used during training
    
            approved = ['US','GB','IN','CA','DE','FR','ES','GR','JP']
            
            if text.upper() in approved:
                return text
            
            return 'OTHR_RES'

        # convert remote work
        # the exact same way we did
        # in training
        def ConvertRemote(percentage):
            
            if int(percentage) > 50:
                return 'Remote'
            
            if int(percentage) < 50:
                return 'Office'
            
            return 'Hybrid'

        # build out a prediction dictionary, using our functions 
        # that we used during training
        user_dict = {
            'experience_level': data['experience_level'],
            'employment_type' : data['employment_type'],
            'company_size' : data['company_size'],
            'roles_converted' : convertJob(data['role']),
            'residence_converted' : convertResidence(data['residence']),
            'remote_converted' : ConvertRemote(data['remote%'])
        }

        # convert our dictionary to a dataframe
        df = pd.DataFrame([user_dict])


        # use our encoder from training
        encoded_df = pd.DataFrame(encoder.transform(df).toarray())

        # now use our model from training for a prediction
        pred = loaded_model.predict(encoded_df)


        # return our prediction in a JSON
        return make_response(jsonify({'prediction' : str(pred[0])}), 201)


    


  
api.add_resource(Prediction, '/pred')
  
  
# driver function
if __name__ == '__main__':

    ## load in model on start
    app.run(debug = True)

 

Correct Python Packages

You’ll want to ensure you install a virtual environment in the project directory for the packages.

Whatever packages you use for training, use the identical versions on your APIs.

This will help avoid any weird interactions.
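
A quick sanity check (just a sketch, not part of the API itself) is to print the versions in the training environment and confirm the API's virtual environment matches:

# hypothetical helper: print the versions used during training so the same
# ones can be pinned in the API's virtual environment
import flask
import flask_restful
import joblib
import sklearn
import xgboost
import pandas as pd

for name, module in [('flask', flask), ('flask_restful', flask_restful),
                     ('joblib', joblib), ('scikit-learn', sklearn),
                     ('xgboost', xgboost), ('pandas', pd)]:
    # getattr guards against packages that don't expose __version__
    print(name, getattr(module, '__version__', 'unknown'))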

The first part of our API, our imports, is here:

from flask import Flask, jsonify, request,  make_response
from flask_restful import Resource, Api
import joblib
import pandas as pd


We use a very lean setup for this simple API.


Loading in our Model and Encoder, And Starting The API

You’ll have to find a way to add your encoder and model (from training) into the same folder in which you’re running your API.

Usually, the easiest way to do this is through a GitHub repo.

Also, Flask-RESTful APIs start with the two lines shown below: creating the Flask app and wrapping it in an Api object.

# load in our model
loaded_model = joblib.load('finalized_model.sav')

# load in our encoder
encoder = joblib.load('OHEencoder.sav')

# creating the flask app
app = Flask(__name__)

# creating an API object
api = Api(app)

 

Rest API Get Request

One thing I like to do for my models is to create a GET request that returns the "structure" expected by our POST request.

This simple endpoint lets users quickly understand what data they need to provide in their POST request.

Here is the example script.

    def get(self):
        
        '''
        returns the structure of the request needed for our
        post request
        '''

        # tell how to make requests to our API (during posts)
        return make_response(jsonify({
                                      'experience_level': ['SE','MI','EN','EX'],
                                      'employment_type' : ['FT','PT','CT','FL'],
                                      'company_size': ['S', 'M', 'L'],
                                      'role' : ['**Job Title**'],
                                      'residence' : ['2 Letter Country Code (US, GB) etc'],
                                      'remote%' : ['0','50','100']
                                      }), 201)

If a user pings our GET request, they'll immediately know how to talk to our API.

Restful API Post Request

Next, we have our post request.

This will be the main piece of our project.

When the user sends the correct data in the payload, our model will return a prediction.

The API starts with some payload checking to ensure all the correct data is passed.

 

def post(self):

        '''
        retrieves payload
        sends back model prediction
        '''
        
        # grab the payload data sent
        data = request.get_json()

        # make sure we have all of our columns
        if 'experience_level' not in data \
            or 'employment_type' not in data \
            or 'company_size' not in data \
            or 'role' not in data \
            or 'residence' not in data \
            or 'remote%' not in data:

            return make_response(jsonify({'message' : 'Missing A Category'}), 400)


Once this check passes, we continue on in our API.

We use the same functions as before to convert our data to a form our model can read.

If you’ve been following along, you’ll recognize the functions below.

# convert the roles
        # the exact same way we did
        # in training
        def convertJob(text):
            
            '''
            converts job titles to form model can understand
            '''
    
            if 'lead' in text.lower() or 'manager' in text.lower() or 'director' in text.lower() or 'head' in text.lower():
                return 'LDR'
            
            elif 'machine' in text.lower() or 'ai ' in text.lower() or 'vision' in text.lower():
                return 'ML'
            
            if 'scientist' in text.lower() or 'analytics' in text.lower() or 'science' in text.lower():
                return 'DS'
            
            if 'analyst' in text.lower():
                return 'AL'
            
            if 'engineer' in text.lower():
                return 'DE'

            
            return 'OTHR_ROLE'

        # convert residence
        # the exact same way we did
        # in training
        def convertResidence(text):

            '''
            converts user input of residence so model can understand
            
            '''

            if len(text) != 2:
                return 'OTHR_RES'  # keep the same label used during training
    
            approved = ['US','GB','IN','CA','DE','FR','ES','GR','JP']
            
            if text.upper() in approved:
                return text
            
            return 'OTHR_RES'

        # convert remote work
        # the exact same way we did
        # in training
        def ConvertRemote(percentage):
            
            if int(percentage) > 50:
                return 'Remote'
            
            if int(percentage) < 50:
                return 'Office'
            
            return 'Hybrid'


Now that these functions are set up and ready to be used, we call them while building the dictionary that will become our prediction row.

        # build out a prediction dictionary, using our functions 
        # that we used during training
        user_dict = {
            'experience_level': data['experience_level'],
            'employment_type' : data['employment_type'],
            'company_size' : data['company_size'],
            'roles_converted' : convertJob(data['role']),
            'residence_converted' : convertResidence(data['residence']),
            'remote_converted' : ConvertRemote(data['remote%'])
        }

We take the payload data (stored in the dictionary called data above) and either use each value directly or pass it through one of the cleaning functions.

Finally, we build out a data frame, use our encoder from training to build our array with the same columns as during training, get a prediction, and return it to the user.

        # convert our dictionary to a dataframe
        df = pd.DataFrame([user_dict])


        # use our encoder from training
        encoded_df = pd.DataFrame(encoder.transform(df).toarray())

        # now use our model from training for a prediction
        pred = loaded_model.predict(encoded_df)


        # return our prediction in a JSON
        return make_response(jsonify({'prediction' : str(pred[0])}), 201)


While this is the "meat" of the project, the rest just involves registering the resource with the API and adding the driver code that runs it.

api.add_resource(Prediction, '/pred')
  
  
# driver function
if __name__ == '__main__':

    ## load in model on start
    app.run(debug = True)

 

Using Python Requests Module To Test Our API

You can’t call an API finished without testing!

Here is our test script:

import requests
import json

def run_test():
    
    # test get
    _test_get = requests.get('http://127.0.0.1:5000/pred')
    
    # print our json
    print(_test_get.json())


    print('\n\n\n\n')



    # test post
    payload = {
                'experience_level': 'EX',
                'employment_type' : 'PT',
                'company_size': 'S',
                'role' : 'Financial Analyst',
                'residence' : 'US',
                'remote%' : '0'
    }

    r = requests.post('http://127.0.0.1:5000/pred', json=payload)

    print(r.json(), '\n\n\n')


if __name__ == '__main__':

    # example tests
    run_test()

For our first test, let's see how much a fully remote financial analyst gets paid.

We send this data within our payload:

    payload = {
                'experience_level': 'EX',
                'employment_type' : 'PT',
                'company_size': 'S',
                'role' : 'Financial Analyst',
                'residence' : 'US',
                'remote%' : '100'
    }

example for our project

Quite low.

Let's see how a full-time remote machine learning engineer does.

    payload = {
                'experience_level': 'EX',
                'employment_type' : 'FT',
                'company_size': 'L',
                'role' : 'Machine Learning Engineer',
                'residence' : 'US',
                'remote%' : '100'
    }

prediction data for machine learning engineer

Machine learning engineers are paid much closer to the top of the distribution!


Now, wouldn't it be nice to know what salary numbers those normalized values correspond to?

It's pretty straightforward; head over to our guide on reverse standardization, and we'll show you how to convert those numbers back into salary data!
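
If you kept the StandardScaler that was fit on the target during feature engineering, a minimal sketch of that reverse step looks like this (the 'salary_scaler.sav' filename is an assumption for illustration; this project's code does not save it):

import joblib
import numpy as np

# hypothetical file: assumes the target StandardScaler was dumped during feature engineering
salary_scaler = joblib.load('salary_scaler.sav')

# placeholder normalized prediction, standing in for the API's output
pred = -0.25

# inverse_transform undoes the (x - mean) / std scaling and returns dollars
dollars = salary_scaler.inverse_transform(np.array([[pred]]))[0][0]
print(f"Estimated salary: ${dollars:,.0f}")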


Next Steps

This is, sadly, the last post in this project.

If you're interested in a previous section, you can check it out here:

If you're interested in doing more with this API, try adding security (authorization and authentication) and deploying it!
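
As a starting point for the security piece, here is a minimal sketch of an API-key check (the header name, key value, and helper are assumptions for illustration, not production-grade auth):

from flask import request, jsonify, make_response

# hypothetical key store; in practice load keys from an environment variable
# or a secrets manager rather than hard-coding them
VALID_API_KEYS = {'demo-key-123'}

def require_api_key():

    '''
    returns a 401 response if the X-API-Key header is missing or invalid,
    otherwise returns None
    '''

    key = request.headers.get('X-API-Key')

    if key not in VALID_API_KEYS:
        return make_response(jsonify({'message' : 'Invalid or missing API key'}), 401)

    return None

# inside Prediction.post(), call it before doing any other work:
#     auth_error = require_api_key()
#     if auth_error:
#         return auth_error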

Dataset2API Project 1: Data Science Salary API, Modeling [3/4]

This blog post will discuss the modeling portion of our first project in our Dataset2API series.

This project focuses on creating a salary API that will allow users to query for average salaries based on job titles, locations, years of experience, and other attributes.

We will use Python and Pandas for data analysis and manipulation, Sklearn and XGBoost for modeling, and Flask (Restful) for web development.

Complete code is available here in this GitHub repo, but each post will walk through the process I personally go through to take a dataset and a problem to an API.

This is part three of Project 1; you can view the other posts in this project here:


Modeling, Building The Best Possible Model For Our API

After performing feature engineering on our dataset, we’re ready to move on to modeling.

Here is a picture of the dataset from when we finished feature engineering.

finished dataset for modeling

All values are converted into categorical columns, and our target variable “final_salary” is normalized.


Performing One Hot Encoding For Our Model Data

Sadly, with the form that our data is in, our machine learning algorithms can’t understand it.

We’ll need to convert these using One Hot Encoding.

In Python, there are two common ways to perform One Hot Encoding: Sklearn's OneHotEncoder or Pandas' get_dummies().


Why We Use Sklearn's One Hot Encoding Over Pandas Get Dummies

While the pandas get_dummies() method is cleaner and easier to perform, it is not great for models designed for production or use in APIs.

This is simply because we can’t save the fitted encoder.

When we feed our API data on the backend, we want the input array to be precisely the same as our training array.

Since get_dummies() has no fitted encoder to save, it's of little use in a production environment, as it can't reproduce the same column layout on new data.

This subtle difference is why we use Sklearn’s OHE over Pandas get_dummies method.
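
To make that concrete, here is a small sketch (toy data, not this project's dataframe) showing why the fitted encoder matters: get_dummies on a single new row only produces the columns present in that row, while a fitted OneHotEncoder reproduces the full training layout.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'company_size': ['S', 'M', 'L', 'M']})
new_row = pd.DataFrame({'company_size': ['M']})

# get_dummies: 3 columns for training, but only 1 for the new row
print(pd.get_dummies(train).shape)    # (4, 3)
print(pd.get_dummies(new_row).shape)  # (1, 1)

# a fitted OneHotEncoder remembers all 3 categories, so the layout matches
enc = OneHotEncoder(handle_unknown='ignore').fit(train)
print(enc.transform(new_row).toarray().shape)  # (1, 3)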

# need to get our data model ready; we fit and save the encoder so we
# can reproduce the same columns at prediction time
from sklearn.preprocessing import OneHotEncoder

#creating instance of one-hot-encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# I'll need this later
save_encoder = encoder.fit(df[[val for val in df.columns if val != 'final_salary']])

# actually perform it
encoded_df = pd.DataFrame(save_encoder.transform(df[[val for val in df.columns if val != 'final_salary']]).toarray())

# add our target column back on
encoded_df['final_salary'] = [val for val in df.final_salary]

# set up our data for modeling
model_data = encoded_df.copy()

model_data

modeling data after one hot encoding

Now that our categorical columns have all been converted, and our target column is normalized, we can start modeling!


Modeling Approach

Since we know this is a regression problem, we will try several well-known regression models to see which gives us the lowest error.

For each model below, we will use cross-validated mean squared error as our scoring metric.

We will run RandomizedSearchCV (250 iterations for XGBoost, 100 for the other models) and select the model that scores the lowest cross-validated error.

Note: we use the scorer called "neg_mean_squared_error" because Sklearn's scoring API always maximizes, so it exposes the negated mean squared error rather than a plain mean_squared_error scorer.

The only difference is the sign; we still want the value closest to zero.
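
In practice, that just means the search reports a negative number; flipping the sign recovers the usual MSE. A tiny sketch, using the hyperParamModel search object fitted in the sections below:

# after any of the hyperParamModel.fit(X, y) calls below
best_mse = -hyperParamModel.best_score_     # negate the scorer to get plain MSE
print(f"Cross-validated MSE of best model: {best_mse:.4f}")
print(hyperParamModel.best_params_)         # the winning hyperparameters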


Performing Hyperparameter Search With RandomizedSearchCV and XGBoost For Regression Problem

import xgboost as xgb
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV


# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# we will be using a gradient boosted tree with cross validation
xgb_model = xgb.XGBRegressor()

# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
params = {
    "colsample_bytree": uniform(0.4, .03),
    "gamma": uniform(0, 1),
    "learning_rate": uniform(0.1, .05), 
    "max_depth": randint(2, 8),
    #"min_samples_split": randint(2,10),
    "n_estimators": randint(100, 200),
    "subsample": uniform(.2, .05)
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(xgb_model, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=250,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)

We see how well the best model did below.

xgboost hyperparameter score


Performing Hyperparameter Search With RandomizedSearchCV and RandomForest For Regression Problem

from sklearn.ensemble import RandomForestRegressor
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# we will be using a random forest regressor with cross validation
rf = RandomForestRegressor(bootstrap='True', warm_start='False')

# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
params = {
     'max_depth': randint(1,15),
     'max_leaf_nodes': randint(2,10),
     'min_impurity_decrease': uniform(0,1),
     'min_samples_leaf': randint(1,10),
     'min_samples_split': randint(2,10),
     'min_weight_fraction_leaf': uniform(0,.3),
     'n_estimators': randint(50,200),
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(rf, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=100,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)

We see how well the best model did below.

randomforest score after hyperparam search

With the low amount of data, I thought the underlying bootstrapping would prove strong here, but it still couldn't beat XGBoost.


Performing Hyperparameter Search With RandomizedSearchCV and Support Vector Regression (SVR) For Regression Problem

from sklearn.svm import SVR
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

# Choose regression method and set hyperparameter
svr_rbf=SVR(kernel='rbf')


# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
params = {
    'C': uniform(0,1),
    'epsilon': uniform(0,1)
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(svr_rbf, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=100,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)

We can see how well this model did below.

Support Vector Regression Model

I was really surprised by how well this model performed.

The accuracy was almost equivalent to our XGBoost model, and I would actually prefer this model in some situations.


Performing Hyperparameter Search With RandomizedSearchCV and KNearestNeighbors Regression (KNN) For Regression Problem

from sklearn.neighbors import KNeighborsRegressor
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

# Choose regression method and set hyperparameter
knn = KNeighborsRegressor()


# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
params = {
    'n_neighbors': randint(1,25),
    'leaf_size':randint(10,125)
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(knn, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=100,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)

We can see how well this model did below:

KNN regression model

Our KNN did decently, on par with our random forest model.


Model Choosing Criteria

Since our XGBoost and SVR models were similar in test accuracy, there wasn’t a clear-cut choice on which model to choose.

If I were concerned with model size, I would have chosen our SVR model, as XGBoost models are usually much bigger.

Since I do not have those concerns, I chose our XGBoost model, which had a slight edge in test accuracy.


Saving our Encoder and Model

We need to save our encoder (from one hot encoding) and our model.

import joblib

# save our model to move to production
filename = r'[input your destination here]\p1\finalized_model.sav'
joblib.dump(hyperParamModel, filename)

# save our OHE fit to move to production
filename = r'[input your destination here]\p1\OHEencoder.sav'
joblib.dump(save_encoder, filename)

We’re now ready to build our API.


Next Steps

In our next post, we’ll build our API with our model.

This will be a great introduction to flask_restful and an overall review of how to create a great API.

Check it out here

Or, if you are interested in a different part of this process, here are the links to the other posts.

Dataset2API Project 1: Data Science Salary API, Feature Engineering [2/4]

This blog post will discuss Feature Engineering for our first project in our Dataset2API series.

This project focuses on creating a salary API that will allow users to query for average salaries based on job titles, locations, years of experience, and other attributes.

We will use Python and Pandas for data analysis and manipulation, Sklearn and XGBoost for modeling, and Flask (Restful) for web development.

Complete code is available here in this GitHub repo, but each post will walk through the process I personally go through to take a dataset and a problem to an API.

This is part two of Project 1; you can view the other posts in this project here:

 

Feature Engineering, Creating Data Ready For Modeling

After performing EDA on our data, we found some things we need to address before we can begin modeling.

Here are the notes I took during EDA


Notes That I took during this EDA Process:

  1. Of our 12 columns, we will eliminate two as our salary_in_usd has made them redundant.
  2. employee_residence and company_location seem highly correlated, and it is awkward to provide both to an API – decide which of the two I want to keep
  3. employee_residence/company_location, and job_title have way too many unique values for OHE. I will write functions to slim these columns down to a lower cardinality.
  4. Salary_in_usd isn’t normal and seems to have a few outliers. We will need to deal with that 600k salary point as it may skew predictions high.
  5. Some of the data is from 2021 and 2020; since I’m building this API in 2022, I want to move these numbers to 2022 salary numbers (by shifting the mean of those distributions).

 

Making a Decision Between Two Columns With Pandas Correlation

Since employee_residence and company_location are highly correlated, it seems a bit awkward to provide both to our API.

Think about the end user, who will probably be an individual who is interested in finding out how much they should be making.

Since we're focused on ease of use and speed, requiring both feels a little bloated, so we will only keep one of the two.

We use correlation with our target variable (salary_in_usd) to determine which of the two categorical columns to keep.

We’ll have to convert these categorical columns with a label encoder before we can compare, as the .corr() method in Pandas cannot handle non-numeric values.

# import our labelEncoder
from sklearn.preprocessing import LabelEncoder

# select from our dataframe which ones we want to keep
residence = df[['company_location','employee_residence','salary_in_usd']]

# transform all categorical columns with a fresh labelencoder for each column
residence = residence.apply(LabelEncoder().fit_transform).corr()

residence

correlation of two columns

We see that employee_residence is a little more correlated with our salary column than company_location

From this, we decide to move forward with employee_residence and will drop company_location


Handling Categorical Columns With Too many Features In Pandas

We need to handle our employee_residence column, which has 57 unique values!

To do this, we will create a simple function that keeps only the top 5-10 values, which for our dataset hold about 80% of the rows.

If a value does not fall within that top group, we'll give it a blanket "OTHER" tag.

While this will cost us some accuracy, it will make our system much more dynamic and able to handle data from countries we haven't seen before.
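
A quick way to sanity-check that ~80% figure (a small sketch against the same dataframe) is to look at the cumulative share of rows covered by the most common countries:

# cumulative share of rows covered by the most frequent countries;
# shows roughly how many top values are needed before ~80% coverage
coverage = df.employee_residence.value_counts(normalize=True).cumsum()
print(coverage.head(10))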

Employee Residence Column

def convertResidence(text):
    
    approved = ['US','GB','IN','CA','DE','FR','ES','GR','JP']
    
    if text.upper() in approved:
        return text
    
    return 'OTHR_RES'

residence_converted = df[['employee_residence']]

residence_converted['conv'] = df.employee_residence.apply(convertResidence)

residence_converted

converted countries using pandas function

Most values stay the same, but a few did convert to our "OTHR_RES" tag.


Job Title Column

Now, we need to do this same process for our job_title column, which has tons of unique job titles.

We’ll use a text search to find bits and pieces of the job roles to give us six different tags.

These tags will easily be One Hot Encoded (OHE) for our modeling portion.

def ConvertRole(text):
    
    if 'lead' in text.lower() or 'manager' in text.lower() or 'director' in text.lower() or 'head' in text.lower():
        return 'LDR'
    
    elif 'machine' in text.lower() or 'ai ' in text.lower() or 'vision' in text.lower():
        return 'ML'
    
    if 'scientist' in text.lower() or 'analytics' in text.lower() or 'science' in text.lower():
        return 'DS'
    
    if 'analyst' in text.lower():
        return 'AL'
    
    if 'engineer' in text.lower():
        return 'DE'

    
    return 'OTHR_ROLE'

roles_converted = df[['job_title']]

roles_converted['conv'] = df.job_title.apply(ConvertRole)

roles_converted

roles after converted

We can see on the right how our conversion went.

Remote Percentage Column

Finally, we need to do this same process for our remote_ratio column.

This column is currently an integer, and I want to transform it into cleaner categories for easier understanding.

We'll use simple numeric thresholds to bucket the ratio into three tags: Remote, Office, and Hybrid.

These tags will easily be One Hot Encoded (OHE) for our modeling portion.

def FixRemoteRatio(percentage):
    
    if percentage > 50:
        return 'Remote'
    
    if percentage < 50:
        return 'Office'
    
    return 'Hybrid'

df['remote_converted'] = df.remote_ratio.apply(FixRemoteRatio)


df[['remote_ratio','remote_converted']]

remote ratio converted

Now we can see our remote ratio converted over.

 


Handling Outliers In Salary Data With IQR

We know from our EDA process that there were some outliers in our salary_in_usd column and that its distribution wasn't as normal as we'd like.

Let’s see if we can fix that.

import numpy as np

salary_converted = df[['salary_in_usd']]

# find our quartiles
q1, q3 = np.percentile(salary_converted['salary_in_usd'],[25,75])

# find our range
IQR = q3 - q1

# lower barrier
Q1_Barrier = q1 - 1.5 * IQR

# upper barrier
Q3_Barrier = q3 + 1.5 * IQR

# mark outliers: 1 if below the lower barrier or above the upper barrier, else 0
salary_converted['salary_in_usd' + '_Outlier'] = np.where(
    (salary_converted['salary_in_usd'] < Q1_Barrier) |
    (salary_converted['salary_in_usd'] > Q3_Barrier),
    1, 0)

salary_converted

converted salary data

From this random sample, we can see that making $423,000 a year is an outlier for our data.

We will be removing any values that we detect as an outlier.


Transforming Target Salary Data Into A Normal Distribution

For a numerical target variable, I always like to get the data as close to a normal distribution as I can and put it on a standard scale.

This is simply because many models perform better when the data is scaled and roughly normally distributed.

Scaling the feature variables gives equal weight/importance to each variable, so no single variable dominates the model just because its numbers are on a larger scale.

Scaling the target variable confines the spread of values, which in practice often improves accuracy.

from sklearn.preprocessing import StandardScaler

# remove outliers
normalize_salary = salary_converted[salary_converted['salary_in_usd_Outlier'] < 1]

# create scaler
scaler = StandardScaler().fit(normalize_salary[['salary_in_usd']])

# apply scaler
normalize_salary['Normalized_Salary'] = scaler.transform(normalize_salary[['salary_in_usd']])

# plot
normalize_salary.Normalized_Salary.hist()

normalized data for salary

While this is not perfectly normal, I think this will do the job.


Handling Old Data And Moving it to the Current Year

Our data is almost ready! We’ve cleaned up most of the columns and have a normal target variable.

There is one thing that is bothering me: some of our data is from 2020 and 2021 (I’m building this in 2022).

When querying an API, you don’t care what the salary was in 2021; you want to know what the salary is for right now (2022).

I’m going to check and see if the average salary has been increasing (which is my guess) YoY (Year over Year).

# im interested in if salaries are increasing YoY
from matplotlib import pyplot as plt


fig = plt.figure()
ax = plt.axes()

_2020 = df.query('work_year == 2020')['normalized_salary']
_2021 = df.query('work_year == 2021')['normalized_salary']
_2022 = df.query('work_year == 2022')['normalized_salary']



plt.scatter([val for val in range(len(_2020.values))], _2020.values, color='y')
plt.scatter([val for val in range(len(_2021.values))], _2021.values, color='g')
plt.scatter([val for val in range(len(_2022.values))], _2022.values, color='r')


print(np.mean(_2020.values))
print(np.mean(_2021.values))
print(np.mean(_2022.values))

salary data means and plot

Right away, we see from the top three numbers that salary is lowest in 2020 and highest in 2022.

I want to “move” the 2020 and 2021 salary data into 2022.


Changing The Salary Means For 2020 and 2021 to be equal to 2022

Let’s change the 2020 and 2021 salaries to be “equal” to the salaries for 2022.

# i want my estimator to work for the current year, so lets move
# these values all to 2022
current_year = np.mean(_2022.values)

change_2020 = np.mean(_2020.values)

change_2021 = np.mean(_2021.values)

# difference between each year's mean and the 2022 mean
increase_2020 = current_year - change_2020
increase_2021 = current_year - change_2021


def increaseSalaries(year, value):
    
    if year == 2020:
        return value + increase_2020
    
    elif year == 2021:
        return value + increase_2021
    
    return value
    
    
df['final_salary'] = df.apply(lambda x: increaseSalaries(x.work_year, x.normalized_salary), axis=1)

salary data before and after

On the left, we have the original salary, and on the right, we have the final salary that we will be using moving forward.

Let's double-check that everything is how we want it.

fig = plt.figure()
ax = plt.axes()

_2020 = df.query('work_year == 2020')['final_salary']
_2021 = df.query('work_year == 2021')['final_salary']
_2022 = df.query('work_year == 2022')['final_salary']



plt.scatter([val for val in range(len(_2020.values))], _2020.values, color='y')
plt.scatter([val for val in range(len(_2021.values))], _2021.values, color='g')
plt.scatter([val for val in range(len(_2022.values))], _2022.values, color='r')


print(np.mean(_2020.values))
print(np.mean(_2021.values))
print(np.mean(_2022.values))

final data from salary

Now that the means for each year are all equivalent, all data has been moved to the current year.

 

Drop Unused Columns and View the Feature-Engineered Dataset

Now that we’ve converted everything over, we can drop the columns we aren’t using and visualize our final dataset.

This will be the dataset that we will work with for modeling.

df.drop(['normalized_salary', 'work_year', 'remote_ratio'], axis=1, inplace=True)

df

final dataset before modeling

In our next post, we’ll start building our model. This data needs to be encoded before we can model it.

We explore multiple models and build out some hyperparameter searching to get the best model possible.

Check it out here

Or, if you are interested in a different part of this process, here are the links to the other posts.

Dataset2API Project 1: Data Science Salary API, EDA [1/4]

This blog post will discuss EDA (exploratory data analysis) for our first project in our Dataset2API series.

This project focuses on creating a salary API that will allow users to query for average salaries based on job titles, locations, years of experience, and other attributes.

We will use Python and Pandas for data analysis and manipulation, Sklearn and XGBoost for modeling, and Flask (Restful) for web development.

Complete code is available here in this GitHub repo, but each post will walk through the process I personally go through to take a dataset and a problem to an API.

This is part one of Project 1; you can view the other posts in this project here:


Our Goal and Dataset For Data Science Salary API

We will be building an API that covers the general roles under the umbrella of data science.

We needed a dataset that covered many different roles, was up to date, and had enough data to build this API.

Fortunately, we were able to find just such a dataset on Kaggle.

The dataset consists of roughly 600 data science role/salary combinations, which we hope will be enough to cover the needs of our API.

This was a huge relief for us, as finding a suitable dataset was one of our biggest challenges in building this API.

We do have some concerns about the small amount of data, but we have some techniques we can deploy that we hope to give us an accurate API.

We hope you follow along with these posts to build your first API, or if you’re already a pro, use some of the tips and tricks discussed here to upgrade your current APIs!

Dataset Link

https://www.kaggle.com/datasets/whenamancodes/data-science-fields-salary-categorization

Exploratory Data Analysis (EDA) For Data Science Dataset

When starting EDA, one of the first things I like to do is go column by column and note how much work is needed to get my dataset to API level.

Here is a quick picture of the data

original dataset

Right away, I notice both a salary_in_usd and a regular "salary" column.

Since I want to compare apples to apples, I'll be deleting the salary and salary_currency columns and focusing only on the salary_in_usd column.

Also, I notice an employee_residence and a company_location column; just from the sheer number of unique values these columns will have, I'm betting I'll only use one of the two for modeling.

Besides that, nothing else seems to jump off the page; let's see how much data we have.

print(df.shape)

our original dataset shape

While we seem to have a good chunk of columns (12), building an API with only 600 rows of data will be tough.

The first thought I have when I see this is, "We probably shouldn't delete any rows under any circumstances."

Seeing this low of a number will change how I approach deleting NAs and getting rid of other data points.


Checking Unique Value Counts And Distributions For All Columns With Pandas

Seeing lots of categorical data makes me believe we'll need to OHE (One Hot Encode) our data down the line. When a column has too many unique categorical values, we run into problems with OHE.
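
Before going column by column, a quick way to survey cardinality across the whole dataframe at once (a small sketch) is:

# number of unique values per column, highest first - a fast way to spot
# columns that would explode the feature space under one hot encoding
print(df.nunique().sort_values(ascending=False))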

Below, I check each categorical column and see how many unique values exist for each of those columns.

Note we’re not focused on the number of values in each of the values, just how many unique values there are.

df.work_year.value_counts()

unique values for column year

Only 3 Unique Categories for our year column; we can easily handle this later.

df.experience_level.value_counts()

unique values for experience level

Only 4 Unique Categories for our experience column; we can easily handle this later.

df.employment_type.value_counts()

Employment type unique values

Only 4 Unique Categories for our employment type column; we can easily handle this later.

df.job_title.value_counts()

data title unique values

We have 50 unique values, and if we were to OHE this column, we would increase our feature space by 50 columns!

This would be too much (especially for 600 rows of data). We’ll take note of this and will deal with it later.

df.salary_in_usd.hist()

Salary data distribution

Our salary_in_usd column does not have a normal distribution (which we’d like to see) and seems to have some outliers around the 600k salary point.

We take note of this and continue with our EDA.

df.employee_residence.value_counts()

countries

We have 57 unique values, and if we were to OHE this column, we would increase our feature space by 57 columns!

We’ll take note of this and will deal with it later. 

Note: Both employee_residence and company_location were similar in the number of unique values they had.

df.company_size.value_counts()

company size unique values

Our column company size looks great, with only three unique values – OHE can easily handle this later.


What We Learned From EDA And The Next Steps

Now that we’ve performed EDA and have a general understanding of our dataset, we need to move into feature engineering.


Notes That I took during this EDA Process:

  1. Of our 12 columns, we will eliminate two as our salary_in_usd has made them redundant.
  2. employee_residence and company_location seem highly correlated, and it is awkward to provide both to an API – decide which of the two I want to keep
  3. employee_residence/company_location, and job_title have way too many unique values for OHE. I will write functions to slim these columns down to a lower cardinality.
  4. Salary_in_usd isn’t normal and seems to have a few outliers. We will need to deal with that 600k salary point as it may skew predictions high.
  5. Some of the data is from 2021 and 2020; since I’m building this API in 2022, I want to move these numbers to 2022 salary numbers (by shifting the mean of those distributions).

 

In our next post, I’ll handle the problems above and prepare our data for modeling.

Check it out here

Or, if you are interested in a different part of this process, here are the links to the other posts.
