Using Census Geocoder with Python

I often geocode data at work. The U.S. Census Bureau offers a geocoder that is open to the public. The following is a simple script that geocodes a single address using Python.

import requests
import json

def get_url(address):
    # Convert spaces to plus signs
    address = address.replace(' ', '+')
    # Convert comma to %2C
    address = address.replace(',', '%2C')
    # The geocoder endpoint URL belongs in the empty string below; it is missing here
    url = '' + address + '&benchmark=9&format=json'
    return url

address = '1600 Pennsylvania Avenue NW, Washington, DC 20500'

# Request the geocoder result once and reuse the response
response = requests.get(get_url(address))

# Parse the JSON body into a dictionary
data = json.loads(response.text)

coordinates = data['result']['addressMatches'][0]['coordinates']

lat = coordinates['y']
lng = coordinates['x']
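
The script above assumes the geocoder returns at least one match, so an address with no match would raise an IndexError at the `[0]` lookup. The following is a minimal sketch of a more defensive version, not part of the original script. The helper names `encode_address` and `extract_coordinates` and the two sample responses are hypothetical; the samples only mirror the keys the script reads (`result`, `addressMatches`, `coordinates`, `x`, `y`). It also checks that the standard library's `quote_plus` gives the same encoding as the two `replace` calls for this address.

```python
from urllib.parse import quote_plus

def encode_address(address):
    # quote_plus turns spaces into '+' and commas into '%2C'
    return quote_plus(address)

def extract_coordinates(data):
    # Return (lat, lng) for the first match, or None when nothing matched
    matches = data.get('result', {}).get('addressMatches', [])
    if not matches:
        return None
    coordinates = matches[0]['coordinates']
    return coordinates['y'], coordinates['x']

address = '1600 Pennsylvania Avenue NW, Washington, DC 20500'
manual = address.replace(' ', '+').replace(',', '%2C')
encoded = encode_address(address)

# Hypothetical responses shaped like the keys used in the script above
sample_match = {'result': {'addressMatches': [{'coordinates': {'x': -77.0, 'y': 38.9}}]}}
sample_empty = {'result': {'addressMatches': []}}

print(encoded == manual)
print(extract_coordinates(sample_match))
print(extract_coordinates(sample_empty))
```

Returning None for an unmatched address lets the caller decide whether to skip the record or retry with a cleaned-up address.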

Fall 2018 Retrospective

As the semester comes to a close, I find myself reflecting on what I did and learned.  I took two courses in Fall 2018: one focused on statistics and probability, the other on data acquisition and management.

Both courses allowed me to deepen my understanding of these data science fundamentals.  They also offered the chance to extend my knowledge.  I learned from both courses and am proud of my work (we’ll see if my professors share my assessment shortly).

In data acquisition and management, we completed a series of projects. The goal of the projects was to cement the concepts covered in the textbooks. I enjoyed the practical nature of the course. As part of the coursework I learned how to work with documents. I used this newfound knowledge to develop a recommender system for work.

I also have access to DataCamp. I have been working through the lessons as time permits. At the time of this post, I stand at the top of the all-time leaderboard (behind the professor). Now that finals are over, I will spend some more time with this platform.

I developed some soft skills through these courses. My presentation and writing skills have improved. I have room to grow in both of these areas. And my work is available to all through my GitHub repository. All in all, I am happy with the way this semester turned out.

Yes, I Support Vector Machines

Recently, I developed a model that classifies an email as spam or not-spam (a.k.a. ham).  It uses the Apache SpamAssassin public corpus as the data set.  I processed about 2,800 email messages and created a matrix with 71,500 terms.  75% of the email messages were fed into a support vector machine for training.  The remaining 25% (697 messages) were used to evaluate the model’s accuracy.  The support vector machine model was right all but 22 times, which gives it a 97% accuracy rate.
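
The accuracy figure can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted above. The variable names are mine, and the implied corpus size assumes the 697 test messages are exactly 25% of the data.

```python
test_messages = 697   # the 25% held out for evaluation
misclassified = 22    # times the model was wrong

correct = test_messages - misclassified
accuracy = correct / test_messages

# Assumption: 697 messages are exactly 25% of the corpus
implied_total = test_messages / 0.25

print(f"correct: {correct}")
print(f"accuracy: {accuracy:.1%}")
print(f"implied corpus size: {implied_total:.0f}")
```

The implied total lines up with the "about 2,800" messages mentioned above.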

For those interested in seeing how, please visit this write-up.

Asyncio and IPython Woes

When I write something in Python I use something like Spyder or JupyterLab.  I have recently been trying to learn asyncio, which has been a struggle.  Every time I ran my script I got the dreaded RuntimeError: Event loop is running.  I figured out today that the issue lies with IPython.  IPython starts up an event loop of its own, which makes it impossible to run even the simplest of scripts with asyncio functionality.  This impacts both Spyder and JupyterLab.  It is nice to know it wasn’t my programming that was throwing the errors. It was the environment.
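
The same kind of error can be reproduced outside IPython. The following is a minimal sketch, assuming Python 3.8 or later and a plain interpreter (run it as a script, not in Spyder or Jupyter). The coroutine names are hypothetical. It asks an event loop that is already running to run something to completion, which is roughly the situation IPython puts a script in.

```python
import asyncio

async def answer():
    return 42

async def inside_running_loop():
    # We are inside a coroutine, so the loop is already running,
    # much like code executed in a session that owns its own loop.
    loop = asyncio.get_running_loop()
    coro = answer()
    try:
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return None

message = asyncio.run(inside_running_loop())
print(message)
```

The exact wording of the message varies between Python versions. Inside a loop that is already running, the usual approach is to `await` the coroutine directly instead of calling `run_until_complete`.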

The BLS API: Synchronous and Asynchronous Test Results

I am a big fan of using APIs to pull data.  I am the creator of the R package blsAPI found on CRAN and GitHub.  As I mentioned in my previous post, I have started playing around with Python’s asyncio library.  I decided to try it out on the BLS’s API.

I set up a little test where I would request unemployment data (i.e., the number unemployed, the unemployment rate and the number in the labor force) for all 50 states and D.C., both synchronously and asynchronously.  The test was a bit of a hello world as it really didn’t do much more than request the data.  I just wanted to get a sense of the productivity gains.  My first run through all 153 requests took a little more than 2 seconds asynchronously and roughly 20 seconds synchronously.  That is a tenfold difference!  That’s huge!
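
To see where a gain like this comes from without touching the BLS servers, here is a small simulation. It is a sketch under stated assumptions, not the test itself: `fake_request` is a hypothetical stand-in that just sleeps for a fixed latency, and the latency and request count are made-up values. Run it as a plain script outside IPython.

```python
import asyncio
import time

LATENCY = 0.02  # assumed per-request wait, in seconds
N = 10          # assumed number of requests

async def fake_request(i):
    # Stand-in for a network call: wait, then return the request number
    await asyncio.sleep(LATENCY)
    return i

async def sequential():
    # Each request waits for the previous one to finish
    return [await fake_request(i) for i in range(N)]

async def concurrent():
    # All requests wait at the same time
    return await asyncio.gather(*(fake_request(i) for i in range(N)))

def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start

seq_result, seq_time = timed(sequential)
con_result, con_time = timed(concurrent)

print("sequential: %.3f seconds" % seq_time)
print("concurrent: %.3f seconds" % con_time)
```

The sequential version pays the latency once per request, while the concurrent version pays it roughly once in total, which is why the waiting time dominates the difference.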

But before getting too excited I decided to run a few more tests to see if these results are typical.  Not wanting to adversely affect the BLS’s servers, I limited the number of runs to 100 each.  The test code is found at the end of this post for those interested in trying it out for themselves.

Here’s a screenshot of the test’s output to the console:

As you can see, the results are pretty close to what I initially observed. The synchronous code took roughly 21.8 seconds on average to complete, while the asynchronous code took about 2.5 seconds.  Here’s a visualization of the full data set:

import requests
import json
import time
import asyncio
import pandas as pd

test_runs = 100

# BLS API Parameters
BLS_API_key = ''  # placeholder: set this to your own BLS API registration key
headers = {'Content-type': 'application/json'}

BLS_LAUS_state_area_codes = ['ST0100000000000', 'ST0200000000000', 'ST0400000000000', 'ST0500000000000', 'ST0600000000000', 'ST0800000000000', 'ST0900000000000', 'ST1000000000000', 'ST1100000000000', 'ST1200000000000', 'ST1300000000000', 'ST1500000000000', 'ST1600000000000', 'ST1700000000000', 'ST1800000000000', 'ST1900000000000', 'ST2000000000000', 'ST2100000000000', 'ST2200000000000', 'ST2300000000000', 'ST2400000000000', 'ST2500000000000', 'ST2600000000000', 'ST2700000000000', 'ST2800000000000', 'ST2900000000000', 'ST3000000000000', 'ST3100000000000', 'ST3200000000000', 'ST3300000000000', 'ST3400000000000', 'ST3500000000000', 'ST3600000000000', 'ST3700000000000', 'ST3800000000000', 'ST3900000000000', 'ST4000000000000', 'ST4100000000000', 'ST4200000000000', 'ST4400000000000', 'ST4500000000000', 'ST4600000000000', 'ST4700000000000', 'ST4800000000000', 'ST4900000000000', 'ST5000000000000', 'ST5100000000000', 'ST5300000000000', 'ST5400000000000', 'ST5500000000000', 'ST5600000000000']
measures = ['03', '04', '05']

# Translate the area codes to series ids
seriesids = list()
for BLS_LAUS_state_area_code in BLS_LAUS_state_area_codes:
    for measure in measures:
        seriesids.append('LAS' + BLS_LAUS_state_area_code + measure)

number_of_requests = len(seriesids)

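def fetch(seriesid):
    data = json.dumps({'seriesid': [seriesid], 'registrationkey': BLS_API_key})
    # The BLS API endpoint URL belongs in the empty string; it is missing here
    response = requests.post('', data=data, headers=headers)
    return response.json()

def async_fetch(seriesid):
    data = json.dumps({'seriesid': [seriesid], 'registrationkey': BLS_API_key})
    headers = {'Content-type': 'application/json'}
    # The BLS API endpoint URL belongs in the empty string; it is missing here
    response = requests.post('', data=data, headers=headers)
    # Return the series id with the payload so results can be keyed after gathering
    return (seriesid, response.json())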

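async def fetch_all(seriesids):
    loop = asyncio.get_event_loop()
    # Run each blocking request in the default thread pool executor
    futures = [
        loop.run_in_executor(None, async_fetch, seriesid)
        for seriesid in seriesids
    ]
    for d in await asyncio.gather(*futures):
        temp[d[0]] = d[1]

temp = dict()
results = dict()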

for i in range(test_runs):
    test_run = i + 1

    # Test A - Synchronous Requests
    start_time = time.time()
    for seriesid in seriesids:
        temp[seriesid] = fetch(seriesid)
    synchronous_time = time.time() - start_time
    print("Test " + str(test_run) + "-A " + str(number_of_requests) + " Synchronous Requests: %s seconds" % (synchronous_time))
    # Test B - Asynchronous Requests
    start_time = time.time()
    loop = asyncio.get_event_loop()
    # Run all requests concurrently and wait for them to finish
    loop.run_until_complete(fetch_all(seriesids))
    asynchronous_time = time.time() - start_time
    print("Test " + str(test_run) + "-B " + str(number_of_requests) + " Asynchronous Requests: %s seconds" % (asynchronous_time))
    results[test_run] = {'Synchronous': synchronous_time, 'Asynchronous': asynchronous_time}

df = pd.DataFrame.from_dict(results, orient='index')
writer = pd.ExcelWriter('BLS Test.xlsx')