Part 3

Dataset2API Project 1: Data Science Salary API, Modeling [3/4]

This blog post will discuss the modeling portion of our first project in our Dataset2API series.

This project focuses on creating a salary API that will allow users to query for average salaries based on job titles, locations, years of experience, and other attributes.

We will use Python and Pandas for data analysis and manipulation, Sklearn and XGBoost for modeling, and Flask (Restful) for web development.

Complete code is available here in this GitHub repo, but each post will walk through the process I personally go through to take a dataset and a problem to an API.

This is part three of Project 1; you can view the other posts in this project here:


Modeling, Building The Best Possible Model For Our API

After performing feature engineering on our dataset, we’re ready to move on to modeling.

Here is a picture of the dataset from when we finished feature engineering.

[Image: finished dataset for modeling]

All values are converted into categorical columns, and our target variable “final_salary” is normalized.


Performing One Hot Encoding For Our Model Data

Sadly, our machine learning algorithms can’t work with categorical columns in this form.

We’ll need to convert them into numeric columns using One Hot Encoding.

In Python, there are two common ways to perform One Hot Encoding: Sklearn’s OneHotEncoder or Pandas’ get_dummies().


Why We Use Sklearn’s One Hot Encoding Over Pandas Get Dummies

While the pandas get_dummies() method is cleaner and easier to perform, it is not great for models designed for production or use in APIs.

This is simply because get_dummies() gives us no fitted encoder to save.

When we feed our API data on the backend, we want the input array to have precisely the same column layout as our training array.

Since get_dummies() builds its columns from whatever values it is handed, it can’t reproduce the training layout at prediction time, which makes it useless to us in a production environment.

This subtle difference is why we use Sklearn’s OHE over Pandas get_dummies method.
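To make the difference concrete, here is a minimal sketch on a made-up two-column frame. The column names and values (title, location) are illustrative only and are not taken from our dataset; the sketch assumes a single incoming request that contains a location the encoder never saw during training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy "training" data (hypothetical values, not the real dataset)
train = pd.DataFrame({
    "title": ["analyst", "engineer", "scientist"],
    "location": ["NY", "CA", "NY"],
})

# a single incoming request, with a location never seen in training
request = pd.DataFrame({"title": ["engineer"], "location": ["TX"]})

# get_dummies only builds columns for the values present in each frame
dummies_train = pd.get_dummies(train)
dummies_request = pd.get_dummies(request)

# the fitted encoder remembers the training categories
encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_request = encoder.transform(request).toarray()

print("get_dummies train columns:  ", dummies_train.shape[1])
print("get_dummies request columns:", dummies_request.shape[1])
print("encoder request columns:    ", encoded_request.shape[1])
```

With get_dummies() the request ends up with fewer columns than the training frame, so a model trained on the training layout could not score it. The fitted encoder always returns the training layout, and with handle_unknown='ignore' the unseen location is encoded as all zeros instead of raising an error.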

# need to get our data model ready, since we
# won't see any new
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# every column except the target gets one-hot encoded
feature_cols = [val for val in df.columns if val != 'final_salary']

# creating instance of one-hot-encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# fit once and keep the fitted encoder, we'll need it later for the API
save_encoder = encoder.fit(df[feature_cols])

# actually perform the encoding
encoded_df = pd.DataFrame(save_encoder.transform(df[feature_cols]).toarray())

# add the target back as the last column
encoded_df['final_salary'] = df['final_salary'].to_numpy()

# set up our data for modeling
model_data = encoded_df.copy()

model_data

[Image: modeling data after one hot encoding]

Now that our categorical columns have all been converted, and our target column is normalized, we can start modeling!


Modeling Approach

Since we know this is a regression problem, we will try several well-known regression models to see which gives us the lowest error.

For each model below, our scoring metric is the mean squared error, cross-validated over 5 folds.

We will run RandomizedSearchCV on each model (250 iterations for XGBoost, 100 for the others) and select the model with the lowest cross-validated error.

Note: we use a scorer called “neg_mean_squared_error” because Sklearn scorers follow a higher-is-better convention, so the mean squared error is reported with its sign flipped.

The only difference with this scorer is the negative sign; we still want the value closest to zero.
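As a small illustration of the sign convention, the sketch below runs a cross-validated score on synthetic data and flips it back into an ordinary mean squared error. The data from make_regression and the LinearRegression model are stand-ins chosen only to keep the example quick; they are not part of this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic regression data, used purely for illustration
X_demo, y_demo = make_regression(
    n_samples=100, n_features=5, noise=10.0, random_state=0
)

# each fold's score is the NEGATIVE mean squared error
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_mean_squared_error", cv=5,
)

# flip the sign to get the usual MSE, then take the root for RMSE
mse = -neg_mse.mean()
rmse = np.sqrt(mse)

print("fold scores (all <= 0):", neg_mse)
print("mean MSE:", mse)
print("RMSE:", rmse)
```

Because the scores are negated, "higher is better" and "closest to zero is better" describe the same model.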


Performing Hyperparameter Search With RandomizedSearchCV and XGBoost For Regression Problem

import xgboost as xgb
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV


# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# we will be using a gradient boosted tree with cross validation
xgb_model = xgb.XGBRegressor()

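# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
# note: scipy's uniform(loc, scale) samples from [loc, loc + scale],
# and randint(low, high) excludes high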
params = {
    "colsample_bytree": uniform(0.4, .03),
    "gamma": uniform(0, 1),
    "learning_rate": uniform(0.1, .05), 
    "max_depth": randint(2, 8),
    #"min_samples_split": randint(2,10),
    "n_estimators": randint(100, 200),
    "subsample": uniform(.2, .05)
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(xgb_model, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=250,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)
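The ranges in the params dictionary are easy to misread, so here is a short sketch of what two of those scipy distributions actually cover. It only inspects the distributions themselves and makes no claim about which ranges are best for this dataset.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) is uniform on [loc, loc + scale], not on [loc, scale]
colsample = uniform(0.4, 0.03)
col_low, col_high = colsample.support()

# randint(low, high) draws integers from low up to high - 1
depth = randint(2, 8)
depth_low, depth_high = depth.support()

# rvs is the sampling method a randomized search relies on
colsample_draws = colsample.rvs(size=1000, random_state=32)
depth_draws = depth.rvs(size=1000, random_state=32)

print(f"colsample_bytree range: {col_low} to {col_high}")
print(f"max_depth range: {depth_low} to {depth_high}")
print(f"sampled colsample_bytree: {colsample_draws.min():.3f} to {colsample_draws.max():.3f}")
print(f"sampled max_depth values: {sorted(set(depth_draws.tolist()))}")
```

So uniform(0.4, .03) searches colsample_bytree only between 0.40 and 0.43. If a wider band is wanted, widen the second argument; for example, uniform(0.4, 0.5) covers 0.4 to 0.9.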

We see how well the best model did below.

[Image: xgboost hyperparameter score]
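As a rough sketch of how that score can be read off a fitted search, the example below uses synthetic data and Sklearn’s GradientBoostingRegressor as a stand-in for XGBoost, so it runs without extra dependencies. The names demo_search, X_demo and y_demo are hypothetical, and the tiny n_iter and cv values are only there to keep it fast.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# synthetic data, used purely for illustration
X_demo, y_demo = make_regression(
    n_samples=120, n_features=8, noise=5.0, random_state=0
)

demo_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in for XGBRegressor
    param_distributions={
        "learning_rate": uniform(0.05, 0.2),
        "max_depth": randint(2, 5),
        "n_estimators": randint(20, 60),
    },
    n_iter=5,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=32,
    return_train_score=True,
)
demo_search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the best candidate;
# negate it to get a plain mean squared error
best_test_mse = -demo_search.best_score_

# the matching train score is available because return_train_score=True
best_train_mse = -demo_search.cv_results_["mean_train_score"][demo_search.best_index_]

print("best params:", demo_search.best_params_)
print("best CV test MSE:", best_test_mse)
print("matching train MSE:", best_train_mse)
```

The same three attributes (best_params_, best_score_ and cv_results_) are available on hyperParamModel after it has been fitted.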


Performing Hyperparameter Search With RandomizedSearchCV and RandomForest For Regression Problem

from sklearn.ensemble import RandomForestRegressor
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# we will be using a random forest with cross validation
# bootstrap and warm_start take booleans, not strings
rf = RandomForestRegressor(bootstrap=True, warm_start=False)

# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
params = {
     'max_depth': randint(1,15),
     'max_leaf_nodes': randint(2,10),
     'min_impurity_decrease': uniform(0,1),
     'min_samples_leaf': randint(1,10),
     'min_samples_split': randint(2,10),
     'min_weight_fraction_leaf': uniform(0,.3),
     'n_estimators': randint(50,200),
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(rf, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=100,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)

We see how well the best model did below.

[Image: randomforest score after hyperparam search]

With such a small amount of data, I thought the underlying bootstrap method would prove strong here, but it still couldn’t beat XGBoost.


Performing Hyperparameter Search With RandomizedSearchCV and Support Vector Regression (SVR) For Regression Problem

from sklearn.svm import SVR
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

# Choose regression method and set hyperparameter
svr_rbf=SVR(kernel='rbf')


# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
params = {
    'C': uniform(0,1),
    'epsilon': uniform(0,1)
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(svr_rbf, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=100,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)

We can see how well this model did below.

[Image: Support Vector Regression Model]

I was really surprised by how well this model performed.

Its cross-validated error was almost equivalent to our XGBoost model’s, and I would actually prefer this model in some situations.


Performing Hyperparameter Search With RandomizedSearchCV and KNearestNeighbors Regression (KNN) For Regression Problem

from sklearn.neighbors import KNeighborsRegressor
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

# Choose regression method and set hyperparameter
knn = KNeighborsRegressor()


# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
params = {
    'n_neighbors': randint(1,25),
    'leaf_size':randint(10,125)
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(knn, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=100,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)

We can see how well this model did below:

[Image: KNN regression model]

Our KNN did decently, on par with our random forest model.


Model Choosing Criteria

Since our XGBoost and SVR models had similar cross-validated error, there wasn’t a clear-cut choice between them.

If I were concerned with model size, I would have chosen our SVR model, as XGBoost models are usually much bigger.

Since I do not have those concerns, I chose our XGBoost model, which had a slight edge in cross-validated error.
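One way to make this comparison less manual is to keep each fitted search under its own name and tabulate score and size side by side (the code above reuses hyperParamModel for every model, so only the last search survives). This is a sketch under assumptions: it uses synthetic data, only the SVR and KNN searches with made-up parameter ranges, and pickled size as a rough proxy for model size. The names candidates, searches and summary are hypothetical.

```python
import pickle

from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# synthetic data, used purely for illustration
X_demo, y_demo = make_regression(
    n_samples=120, n_features=8, noise=5.0, random_state=0
)

# one entry per model: (estimator, parameter distributions)
candidates = {
    "svr": (SVR(kernel="rbf"), {"C": uniform(0.1, 10), "epsilon": uniform(0, 1)}),
    "knn": (KNeighborsRegressor(), {"n_neighbors": randint(1, 25)}),
}

# fit every search and keep it under its own key
searches = {}
for name, (estimator, params) in candidates.items():
    search = RandomizedSearchCV(
        estimator,
        param_distributions=params,
        n_iter=5,
        scoring="neg_mean_squared_error",
        cv=3,
        random_state=32,
    )
    searches[name] = search.fit(X_demo, y_demo)

# cross-validated MSE plus serialized size of the refit best estimator
summary = {
    name: {
        "cv_mse": -search.best_score_,
        "size_bytes": len(pickle.dumps(search.best_estimator_)),
    }
    for name, search in searches.items()
}

best_name = min(summary, key=lambda name: summary[name]["cv_mse"])
for name, row in summary.items():
    print(name, row)
print("lowest CV error:", best_name)
```

With the searches stored separately, the final choice can be saved as searches[best_name], or overridden by hand when size matters more than a small difference in error.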


Saving our Encoder and Model

We need to save our encoder (from one hot encoding) and our model.

import joblib

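# save our model to move to production
# note: hyperParamModel is reassigned by every search above, so make sure
# it holds the fitted XGBoost search (re-run that cell if needed) before saving
filename = r'[input your destination here]\p1\finalized_model.sav'
joblib.dump(hyperParamModel, filename)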

# save our OHE fit to move to production
filename = r'[input your destination here]\p1\OHEencoder.sav'
joblib.dump(save_encoder, filename)
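Before moving on, it can be worth checking that both files load back and work together. The sketch below does a save-and-load round trip in a temporary directory with a toy encoder and a Ridge model standing in for the real ones. The data, the Ridge stand-in, and the way the request row is built are all assumptions for illustration, not the code the API will use.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

# toy training data (hypothetical values, not the real dataset)
train = pd.DataFrame({
    "title": ["analyst", "engineer", "scientist", "engineer"],
    "location": ["NY", "CA", "NY", "NY"],
})
salary = [0.2, 0.8, 0.5, 0.9]

# fit a stand-in encoder and model
demo_encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
demo_model = Ridge().fit(demo_encoder.transform(train).toarray(), salary)

# save both objects, then load them back as a fresh process would
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "finalized_model.sav"
    encoder_path = Path(tmp) / "OHEencoder.sav"
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_encoder, encoder_path)

    loaded_model = joblib.load(model_path)
    loaded_encoder = joblib.load(encoder_path)

# encode a new request with the loaded encoder, then predict
request = pd.DataFrame({"title": ["engineer"], "location": ["CA"]})
encoded_request = loaded_encoder.transform(request).toarray()
prediction = loaded_model.predict(encoded_request)

print("encoded request shape:", encoded_request.shape)
print("prediction:", prediction)
```

The important detail is that the request goes through the loaded encoder before it reaches the loaded model, so the model always sees the same column layout it was trained on.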

We’re now ready to build our API.


Next Steps

In our next post, we’ll build our API with our model.

This will be a great introduction to flask_restful and an overall review of how to create a great API.

Check it out here

Or, if you are interested in a different part of this process, here are the links to the other posts.

Stewart Kaplan