Yes, I Support Vector Machines

Recently, I developed a model that classifies an email as spam or not-spam (a.k.a. ham).  It uses the Apache SpamAssassin public corpus as the data set.  I processed about 2,800 email messages and created a document-term matrix with 71,500 terms.  I fed 75% of the messages into a support vector machine and used the remaining 25% (697 messages) to evaluate the model's accuracy.  The model was wrong only 22 times, which works out to roughly a 97% accuracy rate.
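
As a rough illustration of the approach (not the exact pipeline from the write-up linked below), here is a minimal sketch using scikit-learn; the toy messages, vectorizer settings, and random seed are all assumptions on my part:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Placeholder messages; the real model used the Apache SpamAssassin corpus
emails = [
    'Win a free prize now, click here',
    'Meeting moved to 3pm, agenda attached',
    'Cheap meds, limited time offer!!!',
    'Can you review the quarterly report draft?',
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Turn the messages into a document-term matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

# Hold out 25% of the messages for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)

# Train the support vector machine and score it on the held-out messages
model = LinearSVC()
model.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))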

For those interested in seeing how, please visit this write-up.


Top Skills of a Data Scientist

I recently completed a project in which our team scraped job postings to determine the top skills of a data scientist.  The word cloud above summarizes our findings.  You can learn more about this project by reading our write-up or visiting our GitHub repository.

Asyncio and IPython Woes

When I write something in Python, I use something like Spyder or JupyterLab.  I have recently been trying to learn asyncio, which has been a struggle.  Every time I ran my script I got the dreaded RuntimeError: This event loop is already running.  I figured out today that the issue lies with IPython.  According to https://github.com/ipython/ipython/issues/11030, IPython starts up its own event loop, which makes it impossible to run even the simplest scripts with asyncio functionality.  This affects both Spyder and JupyterLab.  It is nice to know it wasn't my programming that was throwing the errors; it was the environment.
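
For anyone hitting the same wall, here is a minimal sketch of the symptom and one workaround.  The nest_asyncio package is an assumption on my part (a third-party library that patches the running loop so it can be re-entered); the other option is simply to run the script with plain python from a terminal instead of inside IPython:

import asyncio

async def say_hello():
    await asyncio.sleep(1)
    return 'hello'

# Inside Spyder or JupyterLab the line below raises
# "RuntimeError: This event loop is already running"
# because IPython has already started an event loop:
#
#     asyncio.get_event_loop().run_until_complete(say_hello())

# Workaround (assumed third-party package: pip install nest_asyncio)
import nest_asyncio
nest_asyncio.apply()
print(asyncio.get_event_loop().run_until_complete(say_hello()))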

The BLS API: Synchronous and Asynchronous Test Results

I am a big fan of using APIs to pull data.  I am the creator of the R package blsAPI found on CRAN and GitHub.  As I mentioned in my previous post, I have started playing around with Python’s asyncio library.  I decided to try it out on the BLS’s API.

I set up a little test where I would request unemployment data (i.e. the number unemployed, the unemployment rate and the size of the labor force) for all 50 states and D.C., both synchronously and asynchronously.  The test was a bit of a hello world, as it really didn't do much more than request the data.  I just wanted to get a sense of the performance gains.  My first run through all 153 requests (51 areas times 3 measures) took a little more than 2 seconds asynchronously and roughly 20 seconds synchronously.  That is a ten-fold difference!  That's huge!

But before getting too excited I decided to run a few more tests to see if these results are typical.  Not wanting to adversely affect the BLS’s servers, I limited the number of runs to 100 each.  The test code is found at the end of this post for those interested in trying it out for themselves.

Here’s a screenshot of the test’s output to the console:

As you can see, the results are pretty close to what I initially observed.  The synchronous code took roughly 21.8 seconds on average to complete, while the asynchronous code took about 2.5 seconds on average.  Here's a visualization of the full data set:

import requests
import json
import time
import asyncio
import pandas as pd

test_runs = 100

# BLS API Parameters
BLS_API_key = 'TYPE YOUR OWN KEY HERE'
headers = {'Content-type': 'application/json'}

BLS_LAUS_state_area_codes = ['ST0100000000000', 'ST0200000000000', 'ST0400000000000', 'ST0500000000000', 'ST0600000000000', 'ST0800000000000', 'ST0900000000000', 'ST1000000000000', 'ST1100000000000', 'ST1200000000000', 'ST1300000000000', 'ST1500000000000', 'ST1600000000000', 'ST1700000000000', 'ST1800000000000', 'ST1900000000000', 'ST2000000000000', 'ST2100000000000', 'ST2200000000000', 'ST2300000000000', 'ST2400000000000', 'ST2500000000000', 'ST2600000000000', 'ST2700000000000', 'ST2800000000000', 'ST2900000000000', 'ST3000000000000', 'ST3100000000000', 'ST3200000000000', 'ST3300000000000', 'ST3400000000000', 'ST3500000000000', 'ST3600000000000', 'ST3700000000000', 'ST3800000000000', 'ST3900000000000', 'ST4000000000000', 'ST4100000000000', 'ST4200000000000', 'ST4400000000000', 'ST4500000000000', 'ST4600000000000', 'ST4700000000000', 'ST4800000000000', 'ST4900000000000', 'ST5000000000000', 'ST5100000000000', 'ST5300000000000', 'ST5400000000000', 'ST5500000000000', 'ST5600000000000']
measures = ['03', '04', '05']

# Translate the area codes to series ids
seriesids = list()
for BLS_LAUS_state_area_code in BLS_LAUS_state_area_codes:
    for measure in measures:
        seriesids.append('LAS' + BLS_LAUS_state_area_code + measure)

number_of_requests = len(seriesids)


# Synchronous version: one blocking request per series id
def fetch(seriesid):
    data = json.dumps({'seriesid': [seriesid], 'registrationkey': BLS_API_key})
    response = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
    return response.json()

# Worker for the asynchronous test: still a plain blocking call, but
# fetch_all() runs many of these at once in a thread pool executor
def async_fetch(seriesid):
    data = json.dumps({'seriesid': [seriesid], 'registrationkey': BLS_API_key})
    response = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
    return (seriesid, response.json())

# Fan the requests out to the default executor and gather the responses
async def fetch_all(seriesids):
    loop = asyncio.get_event_loop()
    futures = [
        loop.run_in_executor(
            None, 
            async_fetch, 
            seriesid
        )
        for seriesid in seriesids
    ]
    for d in await asyncio.gather(*futures):
        temp[d[0]] = d[1]
        
temp = dict()
results = dict()

for i in range(test_runs):
    test_run = i + 1

    # Test A - Synchronous Requests
    start_time = time.time()
    
    for seriesid in seriesids:
        temp[seriesid] = fetch(seriesid)
    
    synchronous_time = time.time() - start_time
    
    print("Test " + str(test_run) + "-A " + str(number_of_requests) + " Sychronous Requests: %s seconds" % (sychronous_time))
    
    # Test B - Asynchronous Requests
    start_time = time.time()
    
    loop = asyncio.get_event_loop()
    loop.run_until_complete(fetch_all(seriesids))
    
    asynchronous_time = time.time() - start_time
    
    print("Test " + str(test_run) + "-B " + str(number_of_requests) + " Asychronous Requests: %s seconds" % (asychronous_time))
    results[test_run] = {'Synchronous': synchronous_time, 'Asynchronous': aynchronous_time}

loop.close()
df = pd.DataFrame.from_dict(results, orient='index')
writer = pd.ExcelWriter('BLS Test.xlsx')
df.to_excel(writer,'Results')
writer.save()
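
For anyone who wants to chart the timings themselves, here is a minimal plotting sketch (assuming matplotlib is installed; this is not necessarily how the visualization above was produced) that can be appended after the Excel export:

import matplotlib.pyplot as plt

# df has one row per test run and 'Synchronous'/'Asynchronous' columns
ax = df.plot()
ax.set_xlabel('Test run')
ax.set_ylabel('Seconds to complete ' + str(number_of_requests) + ' requests')
ax.set_title('Synchronous vs. asynchronous BLS API requests')
plt.savefig('BLS Test.png')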

How I Scraped a Quarter of a Million URLs in Three Quarters of an Hour (Roughly)

I do a lot of web scraping at my job. I recently heard about Python's asyncio (yeah, I know it's been out for a while) and thought it could really make a difference. I dove into the documentation and the examples published by others. Most of the examples were not very helpful. I mean, really, what's the point of asynchronously printing? After much trial and error I was finally able to code up an asynchronous web scraper.

Our county has a property portal where you can look up the property records for any property in the county. I knew that you could search by address or by property tax map identifiers. NYS GIS maintains a set of tax parcel centroids which includes these identifiers.  I used them as my universe of locations (although the centroids are current as of 2016, while the property portal is current as of a couple of months ago).

I ran my script overnight, and when I came in I had a database of 227,062 web pages that it scraped in about 49 minutes.  I want to publish my code so that others trying to build a web scraper that uses asyncio can have a working model. It downloads the tax parcel centroids and extracts them, builds a list of URLs to scrape, then scrapes them and saves the HTML to a SQLite database. I will write another script to comb through these pages and pull out the data I'm interested in.

# -*- coding: utf-8 -*-
"""
Created on Mon May 14 08:32:20 2018

@author: Michael Silva
"""
import urllib.request
import zipfile
import shutil
from dbfread import DBF
import sqlite3 as db
import requests
import asyncio
import time

start_time = time.time()
    
print('Setting up db')
db_name = "Monroe Real Property Data.db"
con = db.connect(db_name)
con.row_factory = lambda cursor, row: row[0]
c = con.cursor()
c.execute('DROP TABLE IF EXISTS `scrapped_data`')
c.execute('CREATE TABLE `scrapped_data` (`url`,`html`)')
con.commit()

print("--- %s seconds ---" % (time.time() - start_time))

    
print('Getting the Monroe County Tax Parcels Centroids')
url = 'http://gis.ny.gov/gisdata/fileserver/?DSID=1300&file=Monroe-Tax-Parcels-Centroid-Points-SHP.zip'
zip_file_name = url.split('file=')[1]
dbf_file_name = 'Monroe_2016_Tax_Parcel_Centroid_Points_SHP.dbf'

with urllib.request.urlopen(url) as response, open(zip_file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
with zipfile.ZipFile(zip_file_name) as zf:
    zf.extractall()
print("--- %s seconds ---" % (time.time() - start_time))

print('Building list of urls to scrape')
urls_to_scrape = list()

for record in DBF(dbf_file_name):
    try: 
        if record['PROP_CLASS'][0] == '2':
            scrape_me = 'https://www.monroecounty.gov/etc/rp/report.php?a='+record['SWIS']+'-'+record['SBL']
            urls_to_scrape.append(scrape_me)
    except IndexError:
        continue
print("--- %s seconds ---" % (time.time() - start_time))

# Blocking request for one property page; scrape_all() runs these in a thread pool
def scrape(url):
    print('Scraping ' + url)
    return requests.get(url)
                
# Fan the requests out to the default executor, then save each response as it is gathered
async def scrape_all(urls_to_scrape):
    loop = asyncio.get_event_loop()
    futures = [
        loop.run_in_executor(
            None, 
            scrape, 
            url
        )
        for url in urls_to_scrape
    ]
    for response in await asyncio.gather(*futures):
        print('Saving ' + response.url)
        c.execute('INSERT INTO `scrapped_data` VALUES (?, ?)', (response.url, response.content))


loop = asyncio.get_event_loop()
loop.run_until_complete(scrape_all(urls_to_scrape))
loop.close()

print("--- %s seconds ---" % (time.time() - start_time))

print('Finalizing database')
con.commit()    
con.close()
print("--- %s seconds ---" % (time.time() - start_time))

Fixing FIPS Codes Mangled by Pandas

I have been working with data produced by the U.S. Census Bureau.  They use FIPS codes to identify geographies.  When I read the data in using pandas, the FIPS codes get converted into numbers and lose their leading zeros.  After trying to force the type to string (which didn't work for me), I decided to create a workaround.  Here it is:

def fix_fips(fips, total_length):
    """Takes a broken FIPS and repairs it"""
    fips = str(fips)
    current_length = len(fips)
    if current_length < total_length:
        number_of_leading_zeros = total_length - current_length
        leading_zeros = '0' * number_of_leading_zeros
        fips = leading_zeros + fips
    return fips

So say I have some state-level data with a two-character FIPS code read into a pandas DataFrame. I would correct the mangled column with:

df['State FIPS code'] = df['State FIPS code'].apply(fix_fips, args=(2,))
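
For example (the values here are just an illustration), a Colorado row that pandas read as the integer 8 comes back as the expected string, and the same function handles five-character county codes:

fix_fips(8, 2)     # returns '08' (Colorado)
fix_fips(8031, 5)  # returns '08031' (Denver County, CO)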

Hope this helps other data ninjas out there!