Starting Graduate School

I am excited to be beginning a Master's in Data Science program through the City University of New York School of Professional Studies. I hope to post some of my course findings here.


Asyncio and IPython Woes

When I write something in Python I use something like Spyder or JupyterLab. I have recently been trying to learn asyncio, which has been a struggle. Every time I ran my script I got the dreaded RuntimeError: This event loop is already running. I figured out today that the issue lies with IPython. According to https://github.com/ipython/ipython/issues/11030, IPython starts up its own event loop, which makes it impossible to run even the simplest script that uses asyncio functionality. This affects both Spyder and JupyterLab. It is nice to know it wasn't my programming that was throwing the errors; it was the environment.
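
To make the failure concrete, here is a minimal sketch (mine, not from the original post) of the kind of script that hits this error, together with one possible workaround using the third-party nest_asyncio package; awaiting the coroutine directly at the top of a cell is another option in recent IPython.

import asyncio

async def say_hello():
    await asyncio.sleep(1)
    return 'hello'

# In a plain Python interpreter the commented-out line below works, but inside
# IPython, Spyder or JupyterLab it raises "RuntimeError: This event loop is
# already running" because IPython has already started an event loop of its own.
# result = asyncio.get_event_loop().run_until_complete(say_hello())

# One workaround (assumes the nest_asyncio package is installed):
import nest_asyncio
nest_asyncio.apply()  # allow run_until_complete() on the already-running loop
result = asyncio.get_event_loop().run_until_complete(say_hello())
print(result)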

The BLS API: Synchronous and Asynchronous Test Results

I am a big fan of using APIs to pull data.  I am the creator of the R package blsAPI found on CRAN and GitHub.  As I mentioned in my previous post, I have started playing around with Python’s asyncio library.  I decided to try it out on the BLS’s API.

I set up a little test where I would request unemployment data (i.e. the number unemployed, the unemployment rate and the number in the labor force) for all 50 states and D.C., both synchronously and asynchronously. The test was a bit of a hello world, as it really didn't do much more than request the data; I just wanted to get a sense of the performance gains. My first run through all 153 requests (51 areas times 3 series) took a little more than 2 seconds asynchronously and roughly 20 seconds synchronously. That is a tenfold difference! That's huge!

But before getting too excited I decided to run a few more tests to see if these results are typical.  Not wanting to adversely affect the BLS’s servers, I limited the number of runs to 100 each.  The test code is found at the end of this post for those interested in trying it out for themselves.

Here’s a screenshot of the test’s output to the console:

As you can see, the results are pretty close to what I initially observed. The synchronous code took roughly 21.8 seconds on average to complete, while the asynchronous code averaged about 2.5 seconds. Here's a visualization of the full data set:

import requests
import json
import time
import asyncio
import pandas as pd

test_runs = 100

# BLS API Parameters
BLS_API_key = 'TYPE YOUR OWN KEY HERE'
headers = {'Content-type': 'application/json'}

BLS_LAUS_state_area_codes = ['ST0100000000000', 'ST0200000000000', 'ST0400000000000', 'ST0500000000000', 'ST0600000000000', 'ST0800000000000', 'ST0900000000000', 'ST1000000000000', 'ST1100000000000', 'ST1200000000000', 'ST1300000000000', 'ST1500000000000', 'ST1600000000000', 'ST1700000000000', 'ST1800000000000', 'ST1900000000000', 'ST2000000000000', 'ST2100000000000', 'ST2200000000000', 'ST2300000000000', 'ST2400000000000', 'ST2500000000000', 'ST2600000000000', 'ST2700000000000', 'ST2800000000000', 'ST2900000000000', 'ST3000000000000', 'ST3100000000000', 'ST3200000000000', 'ST3300000000000', 'ST3400000000000', 'ST3500000000000', 'ST3600000000000', 'ST3700000000000', 'ST3800000000000', 'ST3900000000000', 'ST4000000000000', 'ST4100000000000', 'ST4200000000000', 'ST4400000000000', 'ST4500000000000', 'ST4600000000000', 'ST4700000000000', 'ST4800000000000', 'ST4900000000000', 'ST5000000000000', 'ST5100000000000', 'ST5300000000000', 'ST5400000000000', 'ST5500000000000', 'ST5600000000000']
measures = ['03', '04', '05']

# Translate the area codes to series ids
seriesids = list()
for BLS_LAUS_state_area_code in BLS_LAUS_state_area_codes:
    for measure in measures:
        seriesids.append('LAS' + BLS_LAUS_state_area_code + measure)

number_of_requests = len(seriesids)


def fetch(seriesid):
    data = json.dumps({'seriesid': [seriesid], 'registrationkey': BLS_API_key})
    response = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
    return response.json()

# Despite the name, this is a regular blocking function; the speed-up comes from
# running many of these calls concurrently in a thread pool (see fetch_all).
def async_fetch(seriesid):
    data = json.dumps({'seriesid': [seriesid], 'registrationkey': BLS_API_key})
    response = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
    return (seriesid, response.json())

async def fetch_all(seriesids):
    loop = asyncio.get_event_loop()
    futures = [
        loop.run_in_executor(
            None, 
            async_fetch, 
            seriesid
        )
        for seriesid in seriesids
    ]
    for d in await asyncio.gather(*futures):
        temp[d[0]] = d[1]
        
temp = dict()
results = dict()

for i in range(test_runs):
    test_run = i + 1

    # Test A - Synchronous Requests
    start_time = time.time()
    
    for seriesid in seriesids:
        temp[seriesid] = fetch(seriesid)
    
    synchronous_time = time.time() - start_time
    
    print("Test " + str(test_run) + "-A " + str(number_of_requests) + " Sychronous Requests: %s seconds" % (sychronous_time))
    
    # Test B - Asynchronous Requests
    start_time = time.time()
    
    loop = asyncio.get_event_loop()
    loop.run_until_complete(fetch_all(seriesids))
    
    asynchronous_time = time.time() - start_time
    
    print("Test " + str(test_run) + "-B " + str(number_of_requests) + " Asychronous Requests: %s seconds" % (asychronous_time))
    results[test_run] = {'Synchronous': synchronous_time, 'Asynchronous': aynchronous_time}

loop.close()
df = pd.DataFrame.from_dict(results, orient='index')
with pd.ExcelWriter('BLS Test.xlsx') as writer:
    df.to_excel(writer, sheet_name='Results')
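
The visualization mentioned above isn't generated by the script itself; the script only writes the timings out to a spreadsheet. Purely as a sketch (mine, not the original charting code), the saved results could be plotted along these lines, assuming matplotlib and an Excel reader engine such as openpyxl are installed:

import pandas as pd
import matplotlib.pyplot as plt

# Read the timings written out by the test script and plot both series.
timings = pd.read_excel('BLS Test.xlsx', sheet_name='Results', index_col=0)
timings.plot(xlabel='Test run', ylabel='Seconds',
             title='Synchronous vs. asynchronous BLS API requests')
plt.show()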

How I Scraped a Quarter of a Million URLs in Three Quarters of an Hour (Roughly)

I do a lot of web scraping at my job. I recently heard about Python's asyncio (yeah, I know it's been out for a while) and thought it could really make a difference. I dove into the documentation and the examples published by others. Most of the examples were not very helpful; I mean, really, what's the point of asynchronously printing? After much trial and error I finally was able to code up an asynchronous web scraper.

Our county has a property portal where you can look up the property records for any property in the county. I knew that you could search by address or by property tax map identifier. NYS GIS maintains a set of tax parcel centroids that includes these identifiers, so I used them as my universe of locations (although they are current as of 2016, while the property portal is current as of a couple of months ago).

I ran my script overnight, and when I came in I had a database of 227,062 webpages that it scraped in about 49 minutes. I want to publish my code so that others trying to build a web scraper that uses asyncio can have a working model. It downloads the tax parcel centroids and extracts them, builds a list of URLs to scrape, then scrapes the pages and saves the HTML to a SQLite database. I will write another script to comb through these pages and pull out the data I'm interested in.

# -*- coding: utf-8 -*-
"""
Created on Mon May 14 08:32:20 2018

@author: Michael Silva
"""
import urllib.request
import zipfile
import shutil
from dbfread import DBF
import sqlite3 as db
import requests 
import asyncio
import time

start_time = time.time()
    
print('Setting up db')
db_name = "Monroe Real Property Data.db"
con = db.connect(db_name)
con.row_factory = lambda cursor, row: row[0]
c = con.cursor()
c.execute('DROP TABLE IF EXISTS `scrapped_data`')
c.execute('CREATE TABLE `scrapped_data` (`url`,`html`)')
con.commit()

print("--- %s seconds ---" % (time.time() - start_time))

    
print('Getting the Monroe County Tax Parcels Centroids')
url = 'http://gis.ny.gov/gisdata/fileserver/?DSID=1300&file=Monroe-Tax-Parcels-Centroid-Points-SHP.zip'
zip_file_name = url.split('file=')[1]
dbf_file_name = 'Monroe_2016_Tax_Parcel_Centroid_Points_SHP.dbf'

with urllib.request.urlopen(url) as response, open(zip_file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
with zipfile.ZipFile(zip_file_name) as zf:
    zf.extractall()
print("--- %s seconds ---" % (time.time() - start_time))

print('Building list of urls to scrape')
urls_to_scrape = list()

for record in DBF(dbf_file_name):
    try: 
        if record['PROP_CLASS'][0] == '2':
            scrape_me = 'https://www.monroecounty.gov/etc/rp/report.php?a='+record['SWIS']+'-'+record['SBL']
            urls_to_scrape.append(scrape_me)
    except IndexError:
        continue
print("--- %s seconds ---" % (time.time() - start_time))

def scrape(url):
    # Blocking HTTP GET; the concurrency comes from the thread pool in scrape_all().
    print('Scraping ' + url)
    return requests.get(url)
                
async def scrape_all(urls_to_scrape):
    loop = asyncio.get_event_loop()
    futures = [
        loop.run_in_executor(
            None, 
            scrape, 
            url
        )
        for url in urls_to_scrape
    ]
    for response in await asyncio.gather(*futures):
        print('Saving ' + response.url)
        c.execute('INSERT INTO `scrapped_data` VALUES (?, ?)', (response.url, response.content))


loop = asyncio.get_event_loop()
loop.run_until_complete(scrape_all(urls_to_scrape))
loop.close()

print("--- %s seconds ---" % (time.time() - start_time))

print('Finalizing database')
con.commit()    
con.close()
print("--- %s seconds ---" % (time.time() - start_time))

Fixing FIPS Codes Mangled by Pandas

I have been working with data produced by the U.S. Census Bureau. They use FIPS codes to identify geographies. When I read the data in with pandas, the FIPS codes get converted into numbers and lose their leading zeros. After trying to force the type back to string (which doesn't restore the leading zeros), I decided to create a workaround. Here it is:

def fix_fips(fips, total_length):
    """Takes a broken FIPS code and repairs it by restoring leading zeros."""
    fips = str(fips)
    current_length = len(fips)
    if current_length < total_length:
        number_of_leading_zeros = total_length - current_length
        fips = '0' * number_of_leading_zeros + fips
    return fips

So say I have some state-level data with a two-character FIPS code read into a pandas DataFrame. I would correct the mangled column like this:

df['State FIPS code'] = df['State FIPS code'].apply(fix_fips, args=(2,))
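
For what it's worth, the same repair can also be done with pandas' vectorized string methods. This is just a sketch of an equivalent one-liner under the same assumption (a state FIPS column that was read in as integers):

# Equivalent vectorized repair: cast to string, then left-pad with zeros to two characters.
df['State FIPS code'] = df['State FIPS code'].astype(str).str.zfill(2)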

Hope this helps other data ninjas out there!

Seeing the Tree from the Random Forest

Introduction

It is often said that ensemble methods are excellent at prediction but impossible to interpret. I was thinking about this issue and wondered whether there could be a way to create something very interpretable (say, a decision tree) that represents the random forest. It wouldn't be a perfect representation, but it would offer some insight into what may be going on under the hood. Could this be a way to reverse engineer a machine learning model?

Test Case – Wine Classification

I decided to use the University of California, Irvine's wine quality databases as my data set. The classification problem is to determine whether or not a wine is a white wine.

# Read in the white wine quality data (4898 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
df = pd.read_csv(url, sep=';')
# Add the flag that it is a white wine
df['white'] = 1

# Read in the red wine quality data (1599 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
temp = pd.read_csv(url, sep=';')
# Add the flag that it is not a white wine
temp['white'] = 0

# Build a data frame with both red and white wines
df = pd.concat([df, temp])

I would break the data into training and test sets, construct a simple classification model, then run the training data back through the model to get the random forest's predictions.

# Make this reproducible
np.random.seed(123)

# Let's create the target
target = df['white'].values

# Let's get the list of features
features = df.columns[:11]

# Break into train and test sets
training_data, test_data, training_target, test_target  = train_test_split(df[features], target, test_size=.25)

# Let's build the random forest with 500 trees
rf = RandomForestClassifier(n_estimators = 500, oob_score = True, n_jobs = -1)
rf.fit(training_data, training_target)

training_predictions = rf.predict(training_data)

I would then use the training data with the predictions to construct a decision tree.

# Build a decision tree to illustrate what is going on with the random forest
print('Creating decision tree')
dt = DecisionTreeClassifier()
dt = dt.fit(training_data, training_predictions)

I would evaluate how faithful the decision tree is by comparing its predictions on the test data against the random forest's predictions on the same test data.

print('Testing Decision Tree of Random Forest')
rf_predictions = rf.predict(test_data)
dt_predictions = dt.predict(test_data)
print('Accuracy Score: ' + str(accuracy_score(dt_predictions, rf_predictions)))
print('Confusion Matrix')
print(confusion_matrix(dt_predictions, rf_predictions))

Findings

The random forest model produced by the above code had a 99% accuracy rate in classifying a wine as white.

The decision tree built from the random forest's predictions agreed with the random forest model 98% of the time. The final decision tree is presented below:

[Figure: the final decision tree fitted to the random forest's predictions]

Discussion

This was a first attempt at seeing whether it is possible to build a model of a model in order to provide some insight into what is going on inside. This examination shows promise, but I would offer the following warnings:

Simplicity of Task

This was intentionally a very simple classification task. One reason the results are so positive might be the simplicity of the task.

Value of Interpretability

The actual decision tree produced by this process would need to be taken with a grain of salt. Because decision trees tend to overfit, I would worry about what people would do with the tree in practice. The first cut point in the decision tree is total sulfur dioxide ≤ 67.5; how much faith should be put in that exact number is questionable. I personally would treat the output as a rule of thumb and not get too hung up on the exact value.
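
As an aside that was not part of the original analysis, the cut points under discussion can be read directly off the fitted tree. Here is a minimal sketch using scikit-learn's export_text, assuming the dt and features objects from the code above:

from sklearn.tree import export_text

# Print the tree's decision rules (e.g. the top split on total sulfur dioxide),
# limited to the first couple of levels for readability.
print(export_text(dt, feature_names=list(features), max_depth=2))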

Complete Source Code

The following is the source code used to produce this analysis.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

# Make this reproducible
np.random.seed(123)

# Read in the white wine quality data (4898 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
df = pd.read_csv(url, sep=';')
# Add the flag that it is a white wine
df['white'] = 1

# Read in the red wine quality data (1599 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
temp = pd.read_csv(url, sep=';')
# Add the flag that it is not a white wine
temp['white'] = 0

# Build a data frame with both red and white wines
df = pd.concat([df, temp])

# Let's create the target
target = df['white'].values

# Let's get the list of features
features = df.columns[:11]

# Break into train and test sets
training_data, test_data, training_target, test_target  = train_test_split(df[features], target, test_size=.25)

# Let's build the random forest with 500 trees
rf = RandomForestClassifier(n_estimators = 500, oob_score = True, n_jobs = -1)
rf.fit(training_data, training_target)

# Use it on the test data
rf_predictions = rf.predict(test_data)

# Assess the accuracy
print('Testing Random Forest')
print('Accuracy Score: ' + str(accuracy_score(test_target, rf_predictions)))
print('Confusion Matrix')
print(confusion_matrix(test_target, rf_predictions)) 

# Build a decision tree to illustrate what is going on with the random forest
print('Creating decision tree')
training_predictions = rf.predict(training_data)
dt = DecisionTreeClassifier()
dt = dt.fit(training_data, training_predictions)

print('Testing Decision Tree of Random Forest')
dt_predictions = dt.predict(test_data)
print('Accuracy Score: ' + str(accuracy_score(dt_predictions, rf_predictions)))
print('Confusion Matrix')
print(confusion_matrix(dt_predictions, rf_predictions))

Random Forests Are Cool

The Problem

I was recently trying to merge some data work a co-worker of mine did into my own data, but I quickly hit a problem: the work didn't share any common key to allow for joining. Instead I joined on addresses, which was not perfect but got a lot done. I had roughly 4,000 records, and about 1,200 of them were missing a vital variable. To illustrate the problem, direct your attention to the figure below:

[Figure: map of the records, with the dark blue points indicating missing data]

The dark blue points are missing data. It is clear to me that, for example, all the dark blue points surrounded by yellow should also be yellow. How can I correct this, especially since I have very little time to do it?

The Solution: Random Forest

Realizing that this is a classification problem, I decided to try out a random forest. I quickly wrangled the data into training and test sets and, using the caret package in R, produced a cross-validated random forest. The model had 99.5% accuracy on the test data (3 misclassified out of 661).
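
The modeling itself was done with caret in R, and the original R code isn't reproduced here. Purely for illustration, the following is a rough Python sketch of the same idea (cross-validate a random forest on the labeled records, check it on a held-out test set, then predict the missing values), using made-up stand-in data rather than the real data set:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# Hypothetical stand-in data: points with a categorical label, roughly 30% of which is missing.
rng = np.random.default_rng(123)
n = 4000
points = pd.DataFrame({'x': rng.uniform(size=n), 'y': rng.uniform(size=n)})
points['category'] = np.where(points['x'] + points['y'] > 1, 'orange', 'yellow')
missing = rng.random(n) < 0.3
labeled, unlabeled = points[~missing].copy(), points[missing].copy()

# Split the labeled records, cross-validate the forest, then score the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    labeled[['x', 'y']], labeled['category'], test_size=0.25, random_state=123)
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=123)
print(cross_val_score(rf, X_train, y_train, cv=5))
rf.fit(X_train, y_train)
print(accuracy_score(y_test, rf.predict(X_test)))

# Fill in the missing category for the unlabeled records.
unlabeled = unlabeled.assign(category=rf.predict(unlabeled[['x', 'y']]))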

I used the model to predict the values for those records missing the feature.  The results were excellent as shown below:

[Figure: map with the predicted values filled in for the previously missing records]

All Models Are Wrong, But Some Are Useful

Although this random forest model did a great job, I took a closer look at the area where the orange and pinkish dots come together. The animated image below shows the dark blue dots that needed a prediction, followed by what the model generated. I circled one dot in red that is an error; it should be orange. But given that the data was going to be used at such a high level, this error is allowable (along with the handful of others that are probably out there). The random forest was a great tool for getting the job done.

[Animated figure: the dark blue dots needing predictions, followed by the model's predictions, with one misclassified point circled in red]