Dataset2API Project 1: Data Science Salary API, Modeling [3/4]

This blog post will discuss the modeling portion of our first project in our Dataset2API series.

This project focuses on creating a salary API that will allow users to query for average salaries based on job titles, locations, years of experience, and other attributes.

We will use Python and Pandas for data analysis and manipulation, Sklearn and XGBoost for modeling, and Flask (Restful) for web development.

Complete code is available here in this GitHub repo, but each post will walk through the process I personally go through to take a dataset and a problem to an API.

This is part three of Project 1; you can view the other posts in this project here:


Modeling, Building The Best Possible Model For Our API

After performing feature engineering on our dataset, we’re ready to move on to modeling.

Here is a picture of the dataset from when we finished feature engineering.

All values are converted into categorical columns, and our target variable “final_salary” is normalized.

Performing One Hot Encoding For Our Model Data

Unfortunately, our machine learning algorithms can’t work with the data in its current form, because the categorical columns are not numeric.

We’ll need to convert these columns using One Hot Encoding.

In Python, there are two common ways to perform One Hot Encoding: Sklearn’s OneHotEncoder or Pandas’ get_dummies().

Why We Use Sklearn’s One Hot Encoding Over Pandas Get Dummies

While the pandas get_dummies() method is cleaner and easier to perform, it is not great for models designed for production or use in APIs.

This is simply because get_dummies() never gives us a fitted encoder that we can save.

When we feed our API data on the backend, we want the input array to have precisely the same columns, in the same order, as our training array.

Since get_dummies() does not keep a fitted encoder, it is of little use to us in a production environment, as it can’t reproduce the same column layout for new data.

This subtle difference is why we use Sklearn’s OHE over Pandas get_dummies method.
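
To make the difference concrete, here is a minimal sketch. The two small DataFrames (`train` and `request`) and their values are made up for illustration and are not part of the project's dataset. It shows that get_dummies() builds its columns from whatever frame it is handed, while a fitted OneHotEncoder keeps the training layout.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and a single incoming API request
train = pd.DataFrame({
    "title": ["data scientist", "data engineer", "data analyst"],
    "location": ["NY", "CA", "NY"],
})
request = pd.DataFrame({"title": ["data engineer"], "location": ["TX"]})

# get_dummies only knows the categories present in the frame it receives,
# so the request ends up with a different number of columns than training
train_dummies = pd.get_dummies(train)
request_dummies = pd.get_dummies(request)
print(train_dummies.shape[1], request_dummies.shape[1])

# A fitted OneHotEncoder remembers the training categories, so the request
# is encoded into the same layout; the unseen location "TX" becomes all zeros
encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
train_encoded = encoder.transform(train).toarray()
request_encoded = encoder.transform(request).toarray()
print(train_encoded.shape, request_encoded.shape)
```

Because the encoder is an ordinary fitted object, it can be saved to disk and reused by the API, which is what we rely on later in this post.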

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# every column except the target gets one-hot encoded
feature_cols = [col for col in df.columns if col != 'final_salary']

# creating instance of one-hot-encoder; handle_unknown='ignore' encodes
# categories that were not seen during fitting as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')

# fit and keep the encoder, we'll need to save it later
save_encoder = encoder.fit(df[feature_cols])

# actually perform the encoding
encoded_df = pd.DataFrame(save_encoder.transform(df[feature_cols]).toarray())

# add the target back as the last column
encoded_df['final_salary'] = df['final_salary'].values

# set up our data for modeling
model_data = encoded_df.copy()

model_data

Now that our categorical columns have all been converted, and our target column is normalized, we can start modeling!

Modeling Approach

Since we know this is a regression problem, we will try several well-known regression models to see which gives us the lowest error.

For each model below, we will use the cross-validated mean squared error on the held-out folds as our scoring metric.

We will run RandomizedSearchCV for each model (250 iterations for XGBoost, 100 for the others) and select the model that scores the lowest test error.

Note: we use a scorer called “neg_mean_squared_error” because Sklearn scorers follow the convention that higher is better, so the mean squared error is exposed in negated form.

The only difference with this scorer is the negative sign, so we still want the value closest to zero.
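
As a quick illustration of the sign convention, here is a small sketch on synthetic data. The data, the LinearRegression model, and the `_demo` names are stand-ins chosen for the example, not the project's model or dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic regression data, purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# The scorer returns the negated MSE for each of the 5 folds
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to get ordinary MSE values, then take the root for RMSE
mse_per_fold = -scores
rmse = np.sqrt(mse_per_fold.mean())

# The plain metric function is never negative
fitted = LinearRegression().fit(X_demo, y_demo)
plain_mse = mean_squared_error(y_demo, fitted.predict(X_demo))

print(scores)
print(mse_per_fold.mean(), rmse, plain_mse)
```

A score of -0.01 is therefore better than a score of -0.05, which is why the search keeps the candidate whose score is closest to zero.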

Performing Hyperparameter Search With RandomizedSearchCV and XGBoost For Regression Problem

import xgboost as xgb
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV


# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# we will be using a gradient boosted tree with cross validation
xgb_model = xgb.XGBRegressor()

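# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
# note: scipy's uniform(loc, scale) samples from [loc, loc + scale],
# and randint(low, high) excludes high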
params = {
    "colsample_bytree": uniform(0.4, .03),
    "gamma": uniform(0, 1),
    "learning_rate": uniform(0.1, .05), 
    "max_depth": randint(2, 8),
    #"min_samples_split": randint(2,10),
    "n_estimators": randint(100, 200),
    "subsample": uniform(.2, .05)
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(xgb_model, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=250,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)
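
One detail worth checking when writing these distributions is how scipy reads their arguments. The sketch below samples from two of the distributions used above and prints the observed ranges; the last distribution (`uniform(0.4, 0.6)`) is a hypothetical alternative, not something used in this project.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) is uniform on [loc, loc + scale], so this covers 0.1 to 0.15
learning_rate_samples = uniform(0.1, 0.05).rvs(size=1000, random_state=0)
print(learning_rate_samples.min(), learning_rate_samples.max())

# randint(low, high) draws integers from low up to high - 1, so this covers 2 to 7
max_depth_samples = randint(2, 8).rvs(size=1000, random_state=0)
print(max_depth_samples.min(), max_depth_samples.max())

# Hypothetical alternative: to search from 0.4 to 1.0, the scale is 1.0 - 0.4
wide_samples = uniform(0.4, 0.6).rvs(size=1000, random_state=0)
print(wide_samples.min(), wide_samples.max())
```

In other words, the second argument of uniform is the width of the interval, not its upper bound, so the ranges searched above are fairly narrow.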

We see how well the best model did below.

Performing Hyperparameter Search With RandomizedSearchCV and RandomForest For Regression Problem

from sklearn.ensemble import RandomForestRegressor
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# we will be using a random forest with cross validation
# (bootstrap and warm_start take booleans, not strings)
rf = RandomForestRegressor(bootstrap=True, warm_start=False)

# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
params = {
     'max_depth': randint(1,15),
     'max_leaf_nodes': randint(2,10),
     'min_impurity_decrease': uniform(0,1),
     'min_samples_leaf': randint(1,10),
     'min_samples_split': randint(2,10),
     'min_weight_fraction_leaf': uniform(0,.3),
     'n_estimators': randint(50,200),
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(rf, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=100,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)

We see how well the best model did below.
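
For readers following along without the screenshots, here is a rough sketch of how the results of a fitted search can be inspected. It runs a much smaller search on synthetic data; the data, the parameter ranges, and the names `search` and `results` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=120)

# A deliberately tiny search so the example runs quickly
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "max_depth": randint(1, 6),
        "n_estimators": randint(10, 30),
    },
    n_iter=5,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=32,
    return_train_score=True,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE of the best candidate across folds
best_mse = -search.best_score_
print(search.best_params_, best_mse)

# cv_results_ holds train and test scores for every sampled candidate
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_train_score", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Comparing mean_train_score with mean_test_score for the top candidates is a simple way to spot overfitting before picking a model.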

With so little data, I expected the bootstrap sampling behind random forests to be a strength here, but it still couldn’t beat XGBoost.

Performing Hyperparameter Search With RandomizedSearchCV and Support Vector Regression (SVR) For Regression Problem

from sklearn.svm import SVR
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

# Choose regression method and set hyperparameter
svr_rbf=SVR(kernel='rbf')


# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
params = {
    'C': uniform(0,1),
    'epsilon': uniform(0,1)
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(svr_rbf, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=100,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)

We can see how well this model did below.

I was really surprised by how well this model performed.

Its test error was almost equivalent to that of our XGBoost model, and I would actually prefer this model in some situations.

Performing Hyperparameter Search With RandomizedSearchCV and KNearestNeighbors Regression (KNN) For Regression Problem

from sklearn.neighbors import KNeighborsRegressor
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV

# Choose regression method and set hyperparameter
knn = KNeighborsRegressor()


# split in X and Y
X = model_data.iloc[:, 0 : -1]
y = model_data.iloc[:, -1]

# set some params, we use distributions and allow our 
# hyperparam search to find us the best parameters
params = {
    'n_neighbors': randint(1,25),
    'leaf_size':randint(10,125)
}

# we use our params
# with a kfold 5 cross validation
# to find the best model for our predictions
# we will use mean_squared_error, a good general
# scoring metric for cross validation models

hyperParamModel = RandomizedSearchCV(knn, 
                            param_distributions=params, 
                            random_state=32,
                            n_iter=100,
                            scoring='neg_mean_squared_error',
                            cv=5, 
                            verbose=3, 
                            n_jobs=-1,
                            error_score='raise',
                            return_train_score=True)

hyperParamModel.fit(X, y)

We can see how well this model did below:

Our KNN did decently, on par with our random forest model.

Model Choosing Criteria

Since our XGBoost and SVR models had similar test errors, there wasn’t a clear-cut choice between them.

If I were concerned with model size, I would have chosen our SVR model, as XGBoost models are usually much bigger.

Since I do not have those concerns, I chose our XGBoost model, which had a slight edge in test accuracy.

Saving our Encoder and Model

We need to save our encoder (from one hot encoding) and our model.

import joblib

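# save our model to move to production
# note: hyperParamModel is reassigned by each search above, so make sure
# it holds the fitted XGBoost search before saving
filename = r'[input your destination here]\p1\finalized_model.sav'
joblib.dump(hyperParamModel, filename)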

# save our OHE fit to move to production
filename = r'[input your destination here]\p1\OHEencoder.sav'
joblib.dump(save_encoder, filename)
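
To check that the saved files are usable, it helps to load them back and push one request through both. The sketch below is self-contained and makes several assumptions: the tiny training frame is made up, a RandomForestRegressor stands in for the tuned XGBoost search, and the files go to a temporary directory rather than the destination above.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with an already normalized salary target
train = pd.DataFrame({
    "title": ["data scientist", "data engineer", "data analyst", "data engineer"],
    "location": ["NY", "CA", "NY", "TX"],
})
salary = [0.9, 0.7, 0.4, 0.6]

# Fit the encoder and a stand-in model (not the tuned XGBoost model)
encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(encoder.transform(train).toarray(), salary)

# Save both objects, then load them back as the API would on startup
with tempfile.TemporaryDirectory() as tmp_dir:
    encoder_path = os.path.join(tmp_dir, "OHEencoder.sav")
    model_path = os.path.join(tmp_dir, "finalized_model.sav")
    joblib.dump(encoder, encoder_path)
    joblib.dump(model, model_path)
    loaded_encoder = joblib.load(encoder_path)
    loaded_model = joblib.load(model_path)

# Encode a new request with the loaded encoder, then predict
request = pd.DataFrame({"title": ["data engineer"], "location": ["CA"]})
prediction = loaded_model.predict(loaded_encoder.transform(request).toarray())
print(prediction)
```

The important part is the order of operations: the request goes through the loaded encoder first, so the model always sees the same column layout it was trained on.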

We’re now ready to build our API.

Next Steps

In our next post, we’ll build our API with our model.

This will be a great introduction to flask_restful and an overall review of how to create a great API.

Check it out here

Part 4: API Building

Or, if you are interested in a different part of this process, here are the links to the other posts.

Author

Stewart Kaplan

Stewart Kaplan has years of experience as a Senior Data Scientist. He enjoys coding and teaching and has created this website to make Machine Learning accessible to everyone.
