The BLS API: Synchronous and Asynchronous Test Results

I am a big fan of using APIs to pull data.  I am the creator of the R package blsAPI found on CRAN and GitHub.  As I mentioned in my previous post, I have started playing around with Python’s asyncio library.  I decided to try it out on the BLS’s API.

I set up a little test where I would request unemployment data (i.e., the number unemployed, the unemployment rate and the number in the labor force) for all 50 states and D.C., both synchronously and asynchronously.  The test was a bit of a hello world, as it really didn’t do much more than request the data.  I just wanted to get a sense of the performance gains.  My first run through all 153 requests (51 areas times 3 measures) took a little more than 2 seconds asynchronously and roughly 20 seconds synchronously.  That is a tenfold difference!  That’s huge!
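
To get a feel for where the speedup comes from without touching the BLS servers, here is a minimal sketch. It assumes a stand-in function, fake_fetch, that simply sleeps for a fixed delay in place of a real request. The function names, the delay and the number of requests are all illustrative and are not part of the actual test.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.05      # pretend each request takes 50 ms (assumption)
N_REQUESTS = 8    # small number so the sketch runs quickly


def fake_fetch(seriesid):
    """Stand-in for a blocking HTTP request; sleeps instead of calling the API."""
    time.sleep(DELAY)
    return seriesid, {'status': 'ok'}


def run_sync(seriesids):
    """Call fake_fetch one request at a time."""
    return dict(fake_fetch(s) for s in seriesids)


async def run_async(seriesids):
    """Hand each blocking call to a thread pool and wait for all of them."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=len(seriesids)) as pool:
        futures = [loop.run_in_executor(pool, fake_fetch, s) for s in seriesids]
        return dict(await asyncio.gather(*futures))


seriesids = ['SERIES%02d' % i for i in range(N_REQUESTS)]

start = time.perf_counter()
sync_results = run_sync(seriesids)
sync_time = time.perf_counter() - start

start = time.perf_counter()
async_results = asyncio.run(run_async(seriesids))
async_time = time.perf_counter() - start

print('Synchronous:  %.2f seconds' % sync_time)
print('Asynchronous: %.2f seconds' % async_time)
```

The sequential version waits roughly the delay times the number of requests, while the threaded version waits for about one delay because the sleeps overlap. Real requests will not overlap this cleanly, so treat the ratio as an upper bound rather than a prediction.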

But before getting too excited I decided to run a few more tests to see if these results are typical.  Not wanting to adversely affect the BLS’s servers, I limited the number of runs to 100 each.  The test code is found at the end of this post for those interested in trying it out for themselves.

Here’s a screenshot of the test’s output to the console:

As you can see, the results are pretty close to what I initially observed.  The synchronous code took roughly 21.8 seconds on average to complete, while the asynchronous code took about 2.5 seconds.  Here’s a visualization of the full data set:

import requests
import json
import time
import asyncio
import pandas as pd

test_runs = 100

# BLS API Parameters
BLS_API_key = 'TYPE YOUR OWN KEY HERE'
headers = {'Content-type': 'application/json'}

BLS_LAUS_state_area_codes = ['ST0100000000000', 'ST0200000000000', 'ST0400000000000', 'ST0500000000000', 'ST0600000000000', 'ST0800000000000', 'ST0900000000000', 'ST1000000000000', 'ST1100000000000', 'ST1200000000000', 'ST1300000000000', 'ST1500000000000', 'ST1600000000000', 'ST1700000000000', 'ST1800000000000', 'ST1900000000000', 'ST2000000000000', 'ST2100000000000', 'ST2200000000000', 'ST2300000000000', 'ST2400000000000', 'ST2500000000000', 'ST2600000000000', 'ST2700000000000', 'ST2800000000000', 'ST2900000000000', 'ST3000000000000', 'ST3100000000000', 'ST3200000000000', 'ST3300000000000', 'ST3400000000000', 'ST3500000000000', 'ST3600000000000', 'ST3700000000000', 'ST3800000000000', 'ST3900000000000', 'ST4000000000000', 'ST4100000000000', 'ST4200000000000', 'ST4400000000000', 'ST4500000000000', 'ST4600000000000', 'ST4700000000000', 'ST4800000000000', 'ST4900000000000', 'ST5000000000000', 'ST5100000000000', 'ST5300000000000', 'ST5400000000000', 'ST5500000000000', 'ST5600000000000']
measures = ['03', '04', '05']

# Translate the area codes to series ids
seriesids = list()
for BLS_LAUS_state_area_code in BLS_LAUS_state_area_codes:
    for measure in measures:
        seriesids.append('LAS' + BLS_LAUS_state_area_code + measure)

number_of_requests = len(seriesids)


def fetch(seriesid):
    data = json.dumps({'seriesid': [seriesid], 'registrationkey': BLS_API_key})
    response = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
    return response.json()

def async_fetch(seriesid):
    data = json.dumps({'seriesid': [seriesid], 'registrationkey': BLS_API_key})
    headers = {'Content-type': 'application/json'}
    response = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
    return (seriesid, response.json())

async def fetch_all(seriesids):
    loop = asyncio.get_event_loop()
    futures = [
        loop.run_in_executor(
            None, 
            async_fetch, 
            seriesid
        )
        for seriesid in seriesids
    ]
    for d in await asyncio.gather(*futures):
        temp[d[0]] = d[1]
        
temp = dict()
results = dict()

for i in range(test_runs):
    test_run = i + 1

    # Test A - Synchronous Requests
    start_time = time.time()
    
    for seriesid in seriesids:
        temp[seriesid] = fetch(seriesid)
    
    synchronous_time = time.time() - start_time
    
    print("Test " + str(test_run) + "-A " + str(number_of_requests) + " Synchronous Requests: %s seconds" % (synchronous_time))
    
    # Test B - Asynchronous Requests
    start_time = time.time()
    
    loop = asyncio.get_event_loop()
    loop.run_until_complete(fetch_all(seriesids))
    
    asynchronous_time = time.time() - start_time
    
    print("Test " + str(test_run) + "-B " + str(number_of_requests) + " Asynchronous Requests: %s seconds" % (asynchronous_time))
    results[test_run] = {'Synchronous': synchronous_time, 'Asynchronous': asynchronous_time}

loop.close()
df = pd.DataFrame.from_dict(results, orient='index')
# Write the timings to Excel; the context manager saves and closes the file
with pd.ExcelWriter('BLS Test.xlsx') as writer:
    df.to_excel(writer, sheet_name='Results')
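
Once the runs finish, the timings can be boiled down to a couple of averages and a speedup ratio. The following is a minimal sketch that assumes the same shape as the results dictionary above (run number mapped to a 'Synchronous' and an 'Asynchronous' time). The summarize helper is hypothetical, and the example input is just the two averages reported in this post entered as a single stand-in "run", not the full data set.

```python
from statistics import mean


def summarize(results):
    """Average each timing across runs and compute the speedup ratio."""
    sync_mean = mean(r['Synchronous'] for r in results.values())
    async_mean = mean(r['Asynchronous'] for r in results.values())
    return {
        'Synchronous mean': sync_mean,
        'Asynchronous mean': async_mean,
        'Speedup': sync_mean / async_mean,
    }


# Stand-in input: the two averages reported in this post, entered as one "run".
example = {1: {'Synchronous': 21.8, 'Asynchronous': 2.5}}
summary = summarize(example)
print('Speedup: %.1fx' % summary['Speedup'])
```

With the reported averages, the ratio works out to a bit under nine, slightly below the tenfold difference from the first run.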

R blsAPI Package Updated

You may not know it, but I maintain a small R package that allows users to pull data from the Bureau of Labor Statistics (BLS) API.  James Morris recently pointed out a bug in my package.  I have resolved the issue, and the latest version has been submitted to CRAN and is available on GitHub.


BLS Featuring My R API Wrapper

I was in the process of cleaning up my package for submission to CRAN when I learned that the BLS has released v2 of their API service.  This version requires a key but allows more requests, plus annual average calculations, which is cool.
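
For a rough idea of what a v2 request body looks like, here is a minimal sketch in Python. It assumes the v2 field names as I understand them from the BLS request signature (seriesid, startyear, endyear, annualaverage and registrationkey). The build_payload helper is hypothetical, and the key is a placeholder. Nothing is sent over the network.

```python
import json


def build_payload(seriesids, registrationkey, startyear=None, endyear=None,
                  annualaverage=False):
    """Build the JSON body for a v2 request (hypothetical helper)."""
    payload = {'seriesid': list(seriesids), 'registrationkey': registrationkey}
    # Optional fields are only included when they are actually set
    if startyear is not None:
        payload['startyear'] = str(startyear)
    if endyear is not None:
        payload['endyear'] = str(endyear)
    if annualaverage:
        payload['annualaverage'] = True
    return json.dumps(payload)


body = build_payload(['LAUCN040010000000005'], 'TYPE YOUR OWN KEY HERE',
                     startyear=2010, endyear=2012, annualaverage=True)
print(body)
```

The resulting string is what would be posted to the v2 endpoint with a JSON content type header.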

I was shocked and gratified to see that under the Sample Code: R page they were featuring my work with this acknowledgement:

bls_api

My submission to CRAN has not been accepted yet, but I’m still working on it.  In the meantime it is available through GitHub.


BLS API Wrapper for R

I have created my first R package! It is called blsAPI and is available through GitHub.  It allows people to request series from the BLS’s API.

To use the function you need to specify the series id(s) and, optionally, the start and end years.  The following are some examples of how you could use this package (these examples are taken from http://www.bls.gov/developers/api_signature.htm):

Single Series

response <- blsAPI('LAUCN040010000000005')
json <- fromJSON(response)

Multiple Series

payload <- list('seriesid'=c('LAUCN040010000000005','LAUCN040010000000006'))
response <- blsAPI(payload)
json <- fromJSON(response)

One or More Series, Specifying Years

payload <- list('seriesid'=c('LAUCN040010000000005','LAUCN040010000000006'), 'startyear'='2010', 'endyear'='2012')
response <- blsAPI(payload)
json <- fromJSON(response)
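
For anyone following along in Python rather than R, the parsed response can be flattened into simple rows. This is a minimal sketch that assumes the response nests its observations under Results, then series, then data, with seriesID, year, period and value fields. The flatten helper is hypothetical, and the sample below is hand-written with placeholder values, not real BLS data.

```python
def flatten(response):
    """Turn a parsed BLS response into a list of flat row dicts."""
    rows = []
    results = response.get('Results') or {}   # tolerate a missing or empty block
    for series in results.get('series', []):
        for obs in series.get('data', []):
            rows.append({
                'seriesID': series.get('seriesID'),
                'year': obs.get('year'),
                'period': obs.get('period'),
                'value': obs.get('value'),
            })
    return rows


# Hand-written sample in the assumed shape; the values are placeholders.
sample = {
    'status': 'REQUEST_SUCCEEDED',
    'Results': {'series': [{
        'seriesID': 'LAUCN040010000000005',
        'data': [
            {'year': '2012', 'period': 'M02', 'value': '0'},
            {'year': '2012', 'period': 'M01', 'value': '0'},
        ],
    }]},
}
rows = flatten(sample)
print(len(rows), 'rows')
```

Each row can then be written to a CSV file or loaded into a data frame.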

Getting and Cleaning Data Course Complete

After a short delay in grading I received notification that I have completed the Getting and Cleaning Data Course.

Getting and Cleaning Data Course Record

This class explained how to pull in data from a variety of formats (e.g., Excel, XML, JSON and MySQL).  It introduced me to the data.table package in R, which is awesome.  It was not always taught in a very clear way.

I was taking Core Concepts in Data Analysis at the same time, which was very challenging.  That is why my blog has fallen silent for the last little bit (I was too busy to write).  I am waiting to see how that class turns out.

I have been using the skills gained in this class to write scripts to pull data from the BEA’s API using R.  Hopefully I will have more to share on this in a later post.


Quick Update

My blog has been quiet lately because I have been really busy.  I bit off a little more than I could chew in the Coursera courses.  As they are coming to a close I feel a lot less pressure (it’s hard working full-time with a family while carrying a course load).

I have, however, learned a lot that I am putting into practice on the job.  I have constructed cost projection models using R, which was a lot easier than before because I was able to program a function to do what I used to do by eye.  I have also written and used R scripts that call Google’s geocoding API.

I am confident that I am moving on from being an R novice; however, I am not a master.  At least not yet.