Part 4

Dataset2API Project 1: Data Science Salary API, Building API [4/4]

This blog post will discuss the final portion of our first project in our Dataset2API series, where we will build a working API.

This project focuses on creating a salary API that will allow users to query for average salaries based on job titles, locations, years of experience, and other attributes.

We will use Python and pandas for data analysis and manipulation, scikit-learn and XGBoost for modeling, and Flask (with Flask-RESTful) for web development.

Complete code is available here in this GitHub repo, but each post will walk through the process I personally go through to take a dataset and a problem to an API.

This is part four of Project 1; you can view the other posts in this project here:


Building Our Data Science Salary API

Now that everything is in order, we can finally start building our API.

Before we start, here is the complete code in case you want to copy any part of it.

We will break down each of the sections below.


from flask import Flask, jsonify, request,  make_response
from flask_restful import Resource, Api
import joblib
import pandas as pd

# load in our model
loaded_model = joblib.load('finalized_model.sav')

# load in our encoder
encoder = joblib.load('OHEencoder.sav')

# creating the flask app
app = Flask(__name__)

# creating an API object
api = Api(app)


class Prediction(Resource):

    def get(self):
        
        '''
        returns the structure of the request needed for our
        post request
        '''

        # tell how to make requests to our API (during posts)
        return make_response(jsonify({
                                      'experience_level': ['SE','MI','EN','EX'],
                                      'employment_type' : ['FT','PT','CT','FL'],
                                      'company_size': ['S', 'M', 'L'],
                                      'role' : ['**Job Title**'],
                                      'residence' : ['2-letter country code (US, GB, etc.)'],
                                      'remote%' : ['0','50','100']
                                      }), 200)
  
    def post(self):

        '''
        retrieves payload
        sends back model prediction
        '''
        
        # grab the payload data sent
        data = request.get_json()

        # make sure we have all of our columns
        if 'experience_level' not in data \
            or 'employment_type' not in data \
            or 'company_size' not in data \
            or 'role' not in data \
            or 'residence' not in data \
            or 'remote%' not in data:

            return make_response(jsonify({'message' : 'Missing A Category'}), 400)

        # convert the roles
        # the exact same way we did
        # in training
        def convertJob(text):
            
            '''
            converts job titles to form model can understand
            '''

            # lowercase once so every keyword check is case-insensitive
            lowered = text.lower()

            if 'lead' in lowered or 'manager' in lowered or 'director' in lowered or 'head' in lowered:
                return 'LDR'

            if 'machine' in lowered or 'ai ' in lowered or 'vision' in lowered:
                return 'ML'

            if 'scientist' in lowered or 'analytics' in lowered or 'science' in lowered:
                return 'DS'

            if 'analyst' in lowered:
                return 'AL'

            if 'engineer' in lowered:
                return 'DE'

            return 'OTHR_ROLE'

        # convert residence
        # the exact same way we did
        # in training
        def convertResidence(text):

            '''
            converts user input of residence so model can understand
            '''

            # anything that is not a 2-letter code falls into the catch-all bucket
            if len(text) != 2:
                return 'OTHR_RES'

            approved = ['US','GB','IN','CA','DE','FR','ES','GR','JP']

            # return the uppercased code so 'us' and 'US' encode identically
            if text.upper() in approved:
                return text.upper()

            return 'OTHR_RES'

        # convert remote work
        # the exact same way we did
        # in training
        def ConvertRemote(percentage):
            
            if int(percentage) > 50:
                return 'Remote'
            
            if int(percentage) < 50:
                return 'Office'
            
            return 'Hybrid'

        # build out a prediction dictionary, using our functions 
        # that we used during training
        user_dict = {
            'experience_level': data['experience_level'],
            'employment_type' : data['employment_type'],
            'company_size' : data['company_size'],
            'roles_converted' : convertJob(data['role']),
            'residence_converted' : convertResidence(data['residence']),
            'remote_converted' : ConvertRemote(data['remote%'])
        }

        # convert our dictionary to a dataframe
        df = pd.DataFrame([user_dict])


        # use our encoder from training
        encoded_df = pd.DataFrame(encoder.transform(df).toarray())

        # now use our model from training for a prediction
        pred = loaded_model.predict(encoded_df)


        # return our prediction in a JSON
        return make_response(jsonify({'prediction' : str(pred[0])}), 201)


    


  
api.add_resource(Prediction, '/pred')
  
  
# driver function
if __name__ == '__main__':

    # start Flask's development server (debug mode is for local use only)
    app.run(debug = True)

 

Correct Python Packages

You’ll want to create a virtual environment in the project directory for the packages.

Whatever packages you use for training, use the identical versions in your API.

This helps avoid version mismatches when the saved model and encoder are loaded.
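One way to enforce this is to check installed versions when the API starts. The sketch below uses only the standard library; the function name `find_version_mismatches` and any version numbers you pass to it are illustrative assumptions, not part of the original project.

```python
from importlib import metadata


def find_version_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed)} for every package that differs.

    `expected` maps package names to the versions recorded at training time.
    A package that is not installed is reported with installed=None.
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches
```

Calling it with a dictionary such as `{"scikit-learn": "<version used in training>"}` returns an empty dictionary when everything matches, so the API can log or refuse to start on anything else.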

The first part of our API, the imports, is here:

from flask import Flask, jsonify, request,  make_response
from flask_restful import Resource, Api
import joblib
import pandas as pd


We use a very lean setup for this simple API.


Loading in our Model and Encoder, And Starting The API

Your encoder and model files (saved during training) need to sit in the same folder you run the API from, because the code loads them by relative filename.

Usually, the easiest way to do this is through a GitHub repo.

Also, Flask-RESTful APIs start with the same two lines shown below: create the Flask app, then wrap it in an Api object.

# load in our model
loaded_model = joblib.load('finalized_model.sav')

# load in our encoder
encoder = joblib.load('OHEencoder.sav')

# creating the flask app
app = Flask(__name__)

# creating an API object
api = Api(app)
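Because `joblib.load` is given bare filenames, it depends on the folder the API is launched from. As a hedged sketch (the helper `resolve_artifacts` is hypothetical, not from the original code), you could resolve the files against a known folder and fail with a clear message if one is missing:

```python
from pathlib import Path


def resolve_artifacts(base_dir, filenames):
    """Return {filename: absolute path}, or raise if any file is missing."""
    base = Path(base_dir).resolve()
    paths = {name: base / name for name in filenames}
    missing = [name for name, path in paths.items() if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing artifacts in {base}: {', '.join(missing)}")
    return paths


# Possible usage inside the API script (left commented out because it needs the real files):
# paths = resolve_artifacts(Path(__file__).parent, ['finalized_model.sav', 'OHEencoder.sav'])
# loaded_model = joblib.load(paths['finalized_model.sav'])
```

This way a missing file produces one readable error at startup rather than a failure on the first request.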

 

Rest API Get Request

One thing I like to do for my models is create a GET request that holds the “structure” for our POST request.

This simple function lets users quickly understand what data they need to provide to our post request.

Here is the example script.

    def get(self):
        
        '''
        returns the structure of the request needed for our
        post request
        '''

        # tell how to make requests to our API (during posts)
        return make_response(jsonify({
                                      'experience_level': ['SE','MI','EN','EX'],
                                      'employment_type' : ['FT','PT','CT','FL'],
                                      'company_size': ['S', 'M', 'L'],
                                      'role' : ['**Job Title**'],
                                      'residence' : ['2-letter country code (US, GB, etc.)'],
                                      'remote%' : ['0','50','100']
                                      }), 200)

If a user pings our GET request, they’ll immediately know how to talk with our API.

Restful API Post Request

Next, we have our post request.

This will be the main piece of our project.

When the user sends the correct data in the payload, our model will return a prediction.

The API starts with some payload checking to ensure all the correct data is passed.

 

    def post(self):

        '''
        retrieves payload
        sends back model prediction
        '''
        
        # grab the payload data sent
        data = request.get_json()

        # make sure we have all of our columns
        if 'experience_level' not in data \
            or 'employment_type' not in data \
            or 'company_size' not in data \
            or 'role' not in data \
            or 'residence' not in data \
            or 'remote%' not in data:

            return make_response(jsonify({'message' : 'Missing A Category'}), 400)
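
The check above only confirms that each key is present. A slightly stricter variant, sketched below under the assumption that the allowed values are the ones listed in the GET response, also reports which field is wrong. The names `REQUIRED_FIELDS` and `validate_payload` are made up for this example.

```python
# Allowed values per field; None means free text that is cleaned later.
REQUIRED_FIELDS = {
    "experience_level": ("SE", "MI", "EN", "EX"),
    "employment_type": ("FT", "PT", "CT", "FL"),
    "company_size": ("S", "M", "L"),
    "role": None,
    "residence": None,
    "remote%": None,  # checked separately below
}


def validate_payload(data):
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(data, dict):
        return ["payload must be a JSON object"]
    problems = []
    for field, allowed in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif allowed is not None and data[field] not in allowed:
            problems.append(f"invalid value for {field}: {data[field]!r}")
    if "remote%" in data:
        try:
            int(data["remote%"])
        except (TypeError, ValueError):
            problems.append("remote% must be a whole number such as 0, 50, or 100")
    return problems
```

Inside `post`, the returned list could be sent back in the 400 response so the caller knows exactly what to fix.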


Once the payload passes this check, the rest of the method runs.

We use the same functions as in training to convert our data to a form our model can read.

If you’ve been following along, you’ll recognize the functions below.

        # convert the roles
        # the exact same way we did
        # in training
        def convertJob(text):
            
            '''
            converts job titles to form model can understand
            '''

            # lowercase once so every keyword check is case-insensitive
            lowered = text.lower()

            if 'lead' in lowered or 'manager' in lowered or 'director' in lowered or 'head' in lowered:
                return 'LDR'

            if 'machine' in lowered or 'ai ' in lowered or 'vision' in lowered:
                return 'ML'

            if 'scientist' in lowered or 'analytics' in lowered or 'science' in lowered:
                return 'DS'

            if 'analyst' in lowered:
                return 'AL'

            if 'engineer' in lowered:
                return 'DE'

            return 'OTHR_ROLE'

        # convert residence
        # the exact same way we did
        # in training
        def convertResidence(text):

            '''
            converts user input of residence so model can understand
            '''

            # anything that is not a 2-letter code falls into the catch-all bucket
            if len(text) != 2:
                return 'OTHR_RES'

            approved = ['US','GB','IN','CA','DE','FR','ES','GR','JP']

            # return the uppercased code so 'us' and 'US' encode identically
            if text.upper() in approved:
                return text.upper()

            return 'OTHR_RES'

        # convert remote work
        # the exact same way we did
        # in training
        def ConvertRemote(percentage):
            
            if int(percentage) > 50:
                return 'Remote'
            
            if int(percentage) < 50:
                return 'Office'
            
            return 'Hybrid'
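
Since these helpers must behave identically in training and in the API, one option is to keep them in a single shared module that both import. The sketch below is an illustrative rewrite, not the original code: `ROLE_RULES`, `convert_job`, and `convert_remote` are names chosen here, and the rules copy the keyword order from `convertJob`.

```python
# Ordered rules: the first rule with a matching keyword wins,
# mirroring the order of the if-statements in convertJob.
ROLE_RULES = [
    ("LDR", ("lead", "manager", "director", "head")),
    ("ML", ("machine", "ai ", "vision")),
    ("DS", ("scientist", "analytics", "science")),
    ("AL", ("analyst",)),
    ("DE", ("engineer",)),
]


def convert_job(title):
    """Map a free-text job title to one of the role codes used in training."""
    lowered = title.lower()
    for code, keywords in ROLE_RULES:
        if any(keyword in lowered for keyword in keywords):
            return code
    return "OTHR_ROLE"


def convert_remote(percentage):
    """Map a remote-work percentage (string or int) to a work-style label."""
    value = int(percentage)
    if value > 50:
        return "Remote"
    if value < 50:
        return "Office"
    return "Hybrid"


print(convert_job("Machine Learning Engineer"))  # ML
print(convert_job("Financial Analyst"))          # AL
print(convert_job("Lead Data Scientist"))        # LDR
print(convert_remote("100"))                     # Remote
```

Note that rule order matters: a "Lead Data Scientist" is labeled `LDR` because the leadership keywords are checked first.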


Now that we have these functions set up, we call them while building a dictionary that holds the inputs for our prediction.

        # build out a prediction dictionary, using our functions 
        # that we used during training
        user_dict = {
            'experience_level': data['experience_level'],
            'employment_type' : data['employment_type'],
            'company_size' : data['company_size'],
            'roles_converted' : convertJob(data['role']),
            'residence_converted' : convertResidence(data['residence']),
            'remote_converted' : ConvertRemote(data['remote%'])
        }

Each value from the payload (stored in the dictionary called data above) is either used directly or sent to one of our functions for cleaning.

Finally, we build out a data frame, use our encoder from training to build our array with the same columns as during training, get a prediction, and return it to the user.

        # convert our dictionary to a dataframe
        df = pd.DataFrame([user_dict])


        # use our encoder from training
        encoded_df = pd.DataFrame(encoder.transform(df).toarray())

        # now use our model from training for a prediction
        pred = loaded_model.predict(encoded_df)


        # return our prediction in a JSON
        return make_response(jsonify({'prediction' : str(pred[0])}), 201)
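
To see why the encoder from training has to be reused, here is a small pure-Python illustration of one-hot encoding with a stored category order. It is not scikit-learn's implementation, and `fit_categories` and `one_hot` are names invented for this sketch; it only shows that the column layout is fixed by what was seen at training time.

```python
def fit_categories(rows, columns):
    """Record the sorted set of values seen in each column during training."""
    return {col: sorted({row[col] for row in rows}) for col in columns}


def one_hot(row, categories):
    """Encode one row as a flat 0/1 list, using the stored category order."""
    encoded = []
    for col, values in categories.items():
        if row[col] not in values:
            raise ValueError(f"unknown category {row[col]!r} in column {col!r}")
        encoded.extend(1 if row[col] == value else 0 for value in values)
    return encoded


training_rows = [
    {"company_size": "S", "remote_converted": "Office"},
    {"company_size": "L", "remote_converted": "Remote"},
    {"company_size": "M", "remote_converted": "Hybrid"},
]
categories = fit_categories(training_rows, ["company_size", "remote_converted"])

# Columns are: L, M, S, Hybrid, Office, Remote
print(one_hot({"company_size": "L", "remote_converted": "Remote"}, categories))
# [1, 0, 0, 0, 0, 1]
```

A value the encoder never saw has no column to land in, which is why the cleaning functions funnel unexpected input into catch-all labels such as `OTHR_ROLE` before encoding.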


While this is the “meat” of the project, the rest involves registering the resource with the API and adding the driver code that starts the server.

api.add_resource(Prediction, '/pred')
  
  
# driver function
if __name__ == '__main__':

    # start Flask's development server (debug mode is for local use only)
    app.run(debug = True)

 

Using Python Requests Module To Test Our API

You can’t call an API finished without testing!

Here is our test script:

import requests
import json

def run_test():
    
    # test get
    _test_get = requests.get('http://127.0.0.1:5000/pred')
    
    # print our json
    print(_test_get.json())


    print('\n\n\n\n')



    # test post
    payload = {
                'experience_level': 'EX',
                'employment_type' : 'PT',
                'company_size': 'S',
                'role' : 'Financial Analyst',
                'residence' : 'US',
                'remote%' : '100'
    }

    r = requests.post('http://127.0.0.1:5000/pred', json=payload)

    print(r.json(), '\n\n\n')


if __name__ == '__main__':

    # example tests
    run_test()
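
The script above only prints the responses. If you would like it to fail loudly when something is off, a small checker like the one below could be called on each POST response. The function name `check_prediction_response` is an assumption for this sketch; it relies only on the fact that the API returns the prediction as a string under the `prediction` key.

```python
def check_prediction_response(status_code, body):
    """Raise AssertionError if the response does not look like a prediction.

    Returns the prediction as a float when everything checks out.
    """
    assert 200 <= status_code < 300, f"unexpected status: {status_code}"
    assert isinstance(body, dict) and "prediction" in body, f"unexpected body: {body}"
    # the API sends the prediction as a string, so confirm it parses as a number
    return float(body["prediction"])


# Possible usage in run_test (commented out because it needs the API running):
# value = check_prediction_response(r.status_code, r.json())
```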

For our first test, let’s see how much a part-time, entirely remote financial analyst at a small company gets paid.

We send this data within our payload:

    payload = {
                'experience_level': 'EX',
                'employment_type' : 'PT',
                'company_size': 'S',
                'role' : 'Financial Analyst',
                'residence' : 'US',
                'remote%' : '100'
    }

(Image: example for our project)

Quite low.

Let’s see how a full-time remote machine-learning engineer does.

    payload = {
                'experience_level': 'EX',
                'employment_type' : 'FT',
                'company_size': 'L',
                'role' : 'Machine Learning Engineer',
                'residence' : 'US',
                'remote%' : '100'
    }

(Image: prediction data for machine learning engineer)

Machine learning engineers are paid much closer to the top of the distribution!


Now, wouldn’t it be nice to know what salaries those normalized values correspond to?

It’s pretty straightforward; head over to our guide on reverse standardization, and we’ll show you how to change those numbers back into salary data!
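
As a rough preview, and assuming the salaries were standardized as z-scores (z = (salary - mean) / std) during training, reversing it is one line of arithmetic: salary = z * std + mean. The mean and standard deviation below are made-up placeholders; you would use the values computed from your own training data.

```python
def to_salary(z_score, mean, std):
    """Undo z-score standardization: salary = z * std + mean."""
    return z_score * std + mean


# Hypothetical training statistics, for illustration only.
example_mean = 100_000
example_std = 50_000

print(to_salary(1.5, example_mean, example_std))   # 175000.0
print(to_salary(-0.5, example_mean, example_std))  # 75000.0
```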


Next Steps

This is, sadly, the last post in this project.

If you’re interested in a previous section, you can check those out here:

If you’re interested in doing more with this API, try adding security (authentication and authorization) and then deploy it!
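
As a starting point for the security piece, here is a minimal sketch of an API-key check using only the standard library. The header name `X-API-Key` and the function `is_authorized` are assumptions made for this example, and a real deployment would also need HTTPS and a safe place to store the key.

```python
import hmac


def is_authorized(headers, expected_key, header_name="X-API-Key"):
    """Return True when the request headers carry the expected API key."""
    supplied = headers.get(header_name)
    if not supplied or not expected_key:
        return False
    # compare_digest takes the same time whether or not the keys match early on
    return hmac.compare_digest(supplied.encode(), expected_key.encode())


print(is_authorized({"X-API-Key": "s3cret"}, "s3cret"))  # True
print(is_authorized({"X-API-Key": "wrong"}, "s3cret"))   # False
print(is_authorized({}, "s3cret"))                       # False
```

Inside `post`, you could pass `request.headers` as the first argument and return a 401 response when the check fails.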

Stewart Kaplan