Random Forests Are Cool

The Problem

I was recently trying to merge some data work a co-worker had done with my own data, but I quickly ran into a problem.  The two data sets didn’t share a common key to join on.  Instead I joined on addresses, which was not perfect but got a lot done.  I had roughly 4,000 records, and about 1,200 of them were missing a vital variable.  To illustrate the problem, direct your attention to the figures below:


The dark blue points are missing data.  It is clear to me that all the dark blue points surrounded by yellow for example should also be yellow.  How can I correct this?  Especially since I have very little time to do it?

The Solution: Random Forest

Realizing that this is a classification problem, I decided to try out a random forest.  I quickly wrangled the data into training and test sets and, using the caret package in R, produced a cross-validated random forest.  The model had 99.5% accuracy on the test data (3 misclassified out of 661).
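A minimal sketch of that workflow is shown here. It runs on synthetic points rather than the real records, so the column names (x, y, zone), the split proportions, and the tuning grid are all assumptions made for illustration. It also assumes the caret and randomForest packages are installed.

```r
library(caret)  # train() with method = "rf" also needs the randomForest package

set.seed(42)

# Synthetic stand-in for the real data: points with coordinates and a
# categorical label that depends on location (hypothetical columns)
n <- 600
points <- data.frame(x = runif(n), y = runif(n))
points$zone <- factor(ifelse(points$x + points$y > 1, "yellow", "orange"))

# Pretend 180 of the records are missing the label
missing <- sample(n, size = 180)
labeled <- points[-missing, ]
unlabeled <- points[missing, c("x", "y")]

# Split the labeled records into training and test sets
in_train <- createDataPartition(labeled$zone, p = 0.8, list = FALSE)
training <- labeled[in_train, ]
testing <- labeled[-in_train, ]

# Fit a random forest, tuning mtry with 5-fold cross-validation
fit <- train(zone ~ x + y, data = training, method = "rf",
             tuneGrid = data.frame(mtry = 1:2),
             trControl = trainControl(method = "cv", number = 5))

# Check accuracy on the held-out test set
test_pred <- predict(fit, newdata = testing)
accuracy <- mean(test_pred == testing$zone)

# Fill in the missing labels with the model's predictions
unlabeled$zone <- predict(fit, newdata = unlabeled)
```

The held-out test set is what makes an accuracy figure like the one above meaningful, since the model never sees those records during training.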

I used the model to predict the values for those records missing the feature.  The results were excellent as shown below:


All Models Are Wrong, But Some Are Useful

Although this random forest model did a great job, I took a closer look at the area where the orange and pinkish dots come together.  This animated image shows the dark blue dots that needed a prediction, followed by what the model generated.  I circled one dot in red that is an error.  It should be orange.  But given that the data was going to be used at such a high level, this error is allowable (along with the handful of others that are probably out there).  The random forest was a great tool to get the job done.



Visualizing EMS Service Delivery Options

I put together a neat little visualization that lets the user do a back-of-the-envelope calculation for an EMS service delivery option.  You can try it out at https://msilva-cgr.shinyapps.io/essex-county-ems-options/.  It is a Shiny app.  It is slow because, given the time crunch, it does a lot of calculations on the fly.  If I were to do it again, I would preprocess the data and save the results, which would cut out the costly computations.  But in any event it is a really neat tool.
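Here is a rough sketch of the preprocess-and-save idea, using only base R. The estimate_cost() function, its inputs, and the dollar figures are made up for illustration and have nothing to do with the real app's calculations.

```r
# Hypothetical expensive calculation for one combination of inputs
# (the dollar figures are invented placeholders)
estimate_cost <- function(ambulances, staff_per_shift) {
  ambulances * 150000 + staff_per_shift * 3 * 60000
}

# Done once, offline: precompute every combination the inputs allow
scenarios <- expand.grid(ambulances = 1:10, staff_per_shift = 2:8)
scenarios$cost <- mapply(estimate_cost,
                         scenarios$ambulances, scenarios$staff_per_shift)

cache_file <- file.path(tempdir(), "scenarios.rds")
saveRDS(scenarios, cache_file)

# At app start-up: load the table once, then answer each request with a lookup
cached <- readRDS(cache_file)
lookup_cost <- function(ambulances, staff_per_shift) {
  cached$cost[cached$ambulances == ambulances &
                cached$staff_per_shift == staff_per_shift]
}

lookup_cost(4, 3)
```

The trade-off is that the app can only answer for input combinations that were precomputed, so this works best when the inputs are sliders or drop-downs with a limited set of values.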


Does Investment in Public Libraries Increase Usage?

I recently mentioned that I had been exploring some data on public libraries.  Here’s the reason why.  A recent local newspaper article chronicled the role libraries are playing today.  It highlighted the fact that some local libraries have undergone major renovations recently.  The article claims:

The surge in popularity mirrors what other communities have seen. When they invest in libraries, the number of people using them goes up.

The claim seemed to rely on anecdotal evidence, so I decided to examine it using data.


I want to preface this by admitting that I am a big fan of libraries.  I have fond memories of summer reading programs in my childhood.  My very first exposure to the Internet happened in a public library.  I used to rollerblade to the local public library as a teenager to do my homework (even though I had my own desk at home).  When my parents moved and I visited them, one of the local attractions I wanted to see was their public library.  I love them.  However, I love claims backed with data more than anecdote, especially when they touch something close to me.


I used data from the annual report for public and association libraries to evaluate the claims.  I looked at the data from 1991-2014.  As always, for those who care to replicate my analysis, you can check out the GitHub repository.

I examined the change in library “usage” in terms of circulation and visits.  I wanted to see if the investment in libraries spurred an increase in usage that died out over time, so I looked at the difference across windows ranging from one year before and after the investment up to ten years before and after.
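One way such a before-and-after window could be computed is sketched below in base R. The data frames, column names, and the exact definition of the change (usage k years after the renovation minus usage k years before) are assumptions for illustration; the GitHub repository holds the actual analysis.

```r
# Hypothetical long-format data: one row per library per year
usage <- data.frame(
  library = rep(c("A", "B"), each = 7),
  year = rep(2000:2006, times = 2),
  circulation = c(100, 102, 101, 105, 110, 112, 111,
                  200, 198, 205, 203, 204, 206, 210)
)

# Hypothetical renovation years
renovations <- data.frame(library = c("A", "B"), renovated = c(2003, 2003))

# Change in circulation between `window` years before and `window` years
# after the renovation year, one value per library (NA if a year is missing)
window_change <- function(usage, renovations, window) {
  sapply(seq_len(nrow(renovations)), function(i) {
    rows <- usage[usage$library == renovations$library[i], ]
    yr <- renovations$renovated[i]
    before <- rows$circulation[rows$year == yr - window]
    after <- rows$circulation[rows$year == yr + window]
    if (length(before) == 1 && length(after) == 1) after - before else NA_real_
  })
}

# One column per window length (1 to 3 years here; the post goes up to 10)
changes <- sapply(1:3, function(w) window_change(usage, renovations, w))
changes
```

Each row of changes is a library and each column a window length, which makes it easy to see whether any early bump fades as the window widens.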

Just under 500 libraries had a renovation over the time period.  There were also about 200 libraries in New York State that didn’t have major renovations.  I was able to use these libraries as a control group.  If there was a statistically significant difference between these two groups, there would be data to back up the newspaper article’s claim.
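As a sketch of what comparing the two groups might look like, the snippet below runs a Welch two-sample t-test in base R. The choice of test is an assumption, since the post does not say which one was used, and the numbers are simulated rather than taken from the library data.

```r
set.seed(1)

# Simulated (not real) changes in circulation for each group
renovated_change <- rnorm(500, mean = 0, sd = 10)
control_change <- rnorm(200, mean = 0, sd = 10)

# Welch two-sample t-test: is the mean change different between the groups?
result <- t.test(renovated_change, control_change)

# A p-value below the chosen threshold (0.05 here) would count as a
# statistically significant difference
significant <- result$p.value < 0.05
result$p.value
```

The same comparison would be repeated for each usage measure (circulation and visits) and each window length.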


After looking at circulation and visitation over the various time frames, I found no statistically significant difference between the libraries that were renovated and those that were not, over either the short term or the long term.  So the bottom line is that the claim that investment increases library usage is not supported by the data.


New York State Public Libraries Circulation Visualization

I have recently been exploring data on the public libraries of New York State for a side project (more on that in a later post, hopefully).  I have also started a Data Visualization course on Coursera and have decided to feature some visualizations of this data set.

About the Data

The data used in this analysis comes from the Annual Report for Public and Association Libraries produced for the New York State Education Department (NYSED).  You can access the data at http://collectconnect.baker-taylor.com/ using “new york” as the username and “pals” as the password.  Load the saved list named “All Libraries as of 15 March 2016” and select the “Total Circulation” data element.

Visualization Decisions

For this visualization I decided to use all data from 2000 to 2014 (the latest data available).  I summed the library-level circulation data to get the aggregate circulation for New York State public libraries.  I used colorblind-safe colors from the ColorBrewer palette.  I adjusted the scale on the Y-axis to be in millions.  I used R to generate the following visualization:


What It Tells Us

Book circulation generally increased until 2010, when the decade-long trend reversed.  There is an exceptionally precipitous drop from 2013 to 2014.

This raises the question: why is this changing?  Is it because of a change in the population?  Is it due to a change in the number of libraries reporting (which might explain the 2013-2014 drop)?  Is it due to a rise in digital media sources as a substitute for books?  Is it due to a lack of public support for or investment in libraries?  I plan to look at that last question in a future post.

Source Code


library(dplyr)     # %>%, mutate, filter, group_by, summarise
library(tidyr)     # gather
library(ggplot2)
library(ggthemes)  # theme_hc

# Reshape to long format, clean the values, and total circulation by year
book_circulation <- read.csv('https://goo.gl/fyybwi', na.strings = 'N/A', stringsAsFactors = FALSE) %>%
  gather(Year, measurement, X1991:X2014) %>%
  mutate(Year = as.numeric(substr(Year, 2, 5))) %>%
  mutate(measurement = as.numeric(gsub(',', '', measurement))) %>%
  filter(Year > 1999) %>%
  filter(!is.na(measurement)) %>%
  group_by(Year) %>%
  summarise(Circulation = sum(measurement)) %>%
  mutate(Circulation = Circulation / 1000000)  # convert to millions

# Bar chart of statewide circulation by year
ggplot(book_circulation, aes(Year, Circulation)) +
  geom_bar(stat = 'identity', fill = "#9ecae1", colour = "#3182bd") +
  ylab('Book Circulation (in millions)') +
  ggtitle('Book Circulation in NYS Public Libraries, 2000-2014') +
  theme_hc()

blsAPI Updated to Deliver QCEW Data

I have previously posted that I developed an R package to facilitate pulling data from the BLS API.  David Hiles asked that I incorporate pulling in QCEW data that is not available through the standard API.  It was a great idea, so I did it.  The update is now available on CRAN and in the GitHub repository.

So if you install or update this R package, you will have a blsQCEW() function.  You pass in the type of data you are looking for.  Valid options are: Area, Industry and Size.  Other parameters are needed, but they depend on the type of request you are making.

Area Data Request

Area requests require year, quarter, and area parameters.  The area codes are defined by the BLS and available here: http://www.bls.gov/cew/doc/titles/area/area_titles.htm.  Here’s a code example for an area request:

# Request QCEW data for the first quarter of 2013 for the state of Michigan
MichiganData <- blsQCEW('Area', year='2013', quarter='1', area='26000')

Industry Data Request

Industry requests require year, quarter, and industry parameters.  Some industry (NAICS) codes contain hyphens, but the open data access uses underscores instead of hyphens.  So 31-33 becomes 31_33.  For all industry codes and titles see: http://www.bls.gov/cew/doc/titles/industry/industry_titles.htm.  Here’s a code example for making a construction industry request:

# Request Construction data for the first quarter of 2013
Construction <- blsQCEW('Industry', year='2013', quarter='1', industry='1012')
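If your NAICS codes still have hyphens in them, a small helper along these lines can convert them before making the request.  The function name to_qcew_industry() is hypothetical and is not part of the blsAPI package.

```r
# Convert NAICS codes to the underscore form used by the open data access
# (hypothetical helper, not part of blsAPI)
to_qcew_industry <- function(naics) {
  gsub("-", "_", naics, fixed = TRUE)
}

to_qcew_industry(c("31-33", "44-45", "1012"))
# [1] "31_33" "44_45" "1012"
```

The converted code can then be passed as the industry parameter of blsQCEW().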

Size Data Request

Data by size is only available for the first quarter of each year. To make this type of request, you only need to provide the size and the year parameters. The size codes are available here: http://www.bls.gov/cew/doc/titles/size/size_titles.htm.  Here’s a code example:

# Request data for the first quarter of 2013 for establishments with 100 to 249 employees
SizeData <- blsQCEW('Size', year='2013', size='6')

I also want to mention that the blsAPI() function has been changed to return data either as a JSON string or as a data frame. I hope others will find these improvements helpful.