Status

Bringing PyStan to Anaconda

I’ve recently learned how to create Python packages for the Anaconda distribution. I have created PyStan packages for 64-bit Linux and Windows systems. Anyone is free to download them with

conda install pystan -c mikesilva

This will save others the effort of downloading and compiling the program. Compiling is not the biggest hurdle in the world, but it may be large enough to keep some people from using this great software package. I’m happy to make it available so we can focus our efforts on using the tool rather than building it.


Did I Just Slip Into the World of Big Data?

I was recently trying to create a random forest classifier on a data set using R.  As you can see, I ran into some problems:

big-data

The data set was half a gig with nearly one hundred rows.  I reduced it to three columns (the class and two features) but still couldn’t fit the model in R.  I was able to generate a random forest classifier in Python; however, I wanted the R output so I could develop an API that would use the random forest model.


Getting Back Into the Swing of Things

I have not posted anything to my blog for a while.  I took a bit of a break over the holidays to replenish my drive.  That’s not to say I haven’t been working with data.

At work I developed a Huff Model for casino gambling to model the changes that Atlantic City is experiencing as competing venues come online.  I also have been toying around with a new (to me) laptop.  I have installed Ubuntu on it and have enjoyed setting it up with the necessary data science toys.  I have set it up to allow me to use R from Python (which I think is really cool).
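For readers who have not met the Huff Model mentioned above, here is a minimal sketch of its standard textbook form: the probability that a customer at one origin chooses a venue is that venue's attractiveness divided by a power of its distance, normalized over all competing venues. The function name, the exponent defaults, and the numbers are assumptions made for illustration; they are not taken from the casino model itself.

```python
def huff_probabilities(attractiveness, distances, alpha=1.0, beta=2.0):
    """Return the share of visits each venue captures from one origin.

    attractiveness: dict of venue -> size or appeal measure (hypothetical units)
    distances: dict of venue -> distance or travel time from the origin (> 0)
    alpha, beta: exponents; the defaults are illustrative assumptions
    """
    # Utility of each venue: attractiveness discounted by distance.
    utilities = {
        venue: (attractiveness[venue] ** alpha) / (distances[venue] ** beta)
        for venue in attractiveness
    }
    total = sum(utilities.values())
    # Normalize so the shares across all venues sum to one.
    return {venue: u / total for venue, u in utilities.items()}


# Made-up numbers for illustration only; not data from the actual model.
venues = {"Casino A": 100.0, "Casino B": 50.0}
miles = {"Casino A": 20.0, "Casino B": 10.0}
shares = huff_probabilities(venues, miles)
```

Adding a new venue to both dictionaries and calling the function again shows how the existing venues' shares shrink, which is the kind of change the model is meant to capture when competitors come online.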

I am beginning the new Coursera course on Machine Learning for Data Analysis.  So expect postings related to my course project in the near future.


U.S. Labor Markets: A Network Approach

I have been busy performing a network analysis to identify labor markets.  I have previously done this with Florida and thought it would be interesting to try this with the whole United States.

Network Analysis

I used census commuting data to build my network then used Gephi to analyze the network graph.  I came up with 71 labor markets.  Here is a visualization of the network:

graph
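As a rough sketch of the data preparation step, commuting records can be aggregated into a weighted edge list and written as a CSV for Gephi. The record layout (home county, work county, commuter count), the function names, and the choice to drop people who live and work in the same county are all assumptions for this example; the Source, Target, Weight header follows the column names Gephi's spreadsheet import conventionally expects.

```python
import csv
import io
from collections import defaultdict


def build_edge_list(flows):
    """Aggregate (home, work, commuters) records into weighted edges.

    Self-loops (same home and work county) are dropped, which is an
    assumption made for this sketch rather than a documented choice.
    """
    weights = defaultdict(int)
    for home, work, commuters in flows:
        if home != work:
            weights[(home, work)] += commuters
    return sorted((home, work, n) for (home, work), n in weights.items())


def write_gephi_csv(edges, handle):
    """Write edges with a Source, Target, Weight header row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)


# Hypothetical commuting records, not actual census figures.
flows = [
    ("County A", "County B", 120),
    ("County A", "County B", 30),
    ("County B", "County A", 45),
    ("County C", "County C", 900),
]
edges = build_edge_list(flows)
buffer = io.StringIO()
write_gephi_csv(edges, buffer)
```

Swapping the in-memory buffer for a file opened with `newline=""` would produce a file that can be loaded into Gephi as an edges table.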

Findings

I translated the communities discovered from the graph into the following map (for those wishing to know more please visit my GitHub repository):

map

Discussion

At first blush I think I’m on to something.  I live in Upstate New York and find it interesting to see the division between upstate New York (in purple) and downstate (in green).  It seems to be quite accurate (I lived in NYC and this conforms with my sense of where downstate ends and upstate begins). What do you think?

Caveats

A couple of things to keep in mind with this map.  The first is that this is based on a network, so there is that six-degrees-of-separation type of thing underlying this map.  Look at the LA area (in an admittedly ugly yellow-brown color).  That region includes:

  • Southern California
  • Arizona
  • Hawaii and
  • Parts of Nevada, Utah and New Mexico.

How can Utah be connected with Hawaii?  Well, people in southern Utah can be connected with people in Las Vegas, and Las Vegas can be connected with eastern California, and eastern California is connected with western California, which is connected with Hawaii.  You can see it in the visualization of the graph above (look for chains of nodes).  So some of these far-flung empires are due to chains of connections rather than direct links.
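The chaining effect can be illustrated with a breadth-first search over a toy graph built from the links named above. This only demonstrates reachability through intermediate nodes; it is not the community detection that Gephi runs, and the node names and links are simplified assumptions rather than actual commuting data.

```python
from collections import deque


def find_chain(graph, start, goal):
    """Return the shortest chain of nodes from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Simplified, hypothetical links taken from the example in the text.
links = [
    ("Southern Utah", "Las Vegas"),
    ("Las Vegas", "Eastern California"),
    ("Eastern California", "Western California"),
    ("Western California", "Hawaii"),
]

# Build an undirected adjacency map from the link pairs.
graph = {}
for a, b in links:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

chain = find_chain(graph, "Southern Utah", "Hawaii")
```

Here `chain` passes through every intermediate place even though no direct link joins Southern Utah and Hawaii, which is how two distant areas can end up in the same region.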

The other thing to keep in mind is that the borders are fuzzy, not hard.  One of my primary motivations for doing this in the first place was to see if I could tease out the labor markets, which may or may not follow political boundaries.  I like seeing Connecticut and part of New Jersey joined with New York City.  It makes total sense.  However, this is not to say that nobody in Connecticut works in the Boston area.  Some do, because the boundaries are not hard.

Further Work

Now that I have these markets identified, I think it would be interesting to see if I could tease out some specializations.  Since each area represents a network of people, and knowledge spreads through networks, it would be interesting to see where the knowledge base is deepest.  The New York City market could be highly specialized in finance, for example.  What other specializations occur?

Another thing that would be interesting is to apply a contagion model to unemployment.  Does a decrease in unemployment “infect” neighbors and pull down their level of unemployment?

I would also like to put together some dot maps showing the working population in these markets.


Summer Learning

Over the summer I made the conscious decision to take a much-needed break.  I had to walk away from a couple of Coursera courses, telling myself it was okay.  I could take them next time around.

I did, however, continue with a course I started earlier called Model Thinking.  It is a high-level introduction to a variety of models.  I personally enjoyed this course a lot.  It exposed me to a variety of models that I had never heard of but could see applications for in my day-to-day life.  It had some math, but nothing difficult enough to scare people away.  There currently isn’t a textbook for the class (which is great for auditory learners like myself); however, one is in the works.

What I like most about the course is that it provides me with a set of tools I can work with.  As you can tell from previous posts, I am a big fan of using the right tool for the job and not being a one-trick pony.  This course was great in that I now have lots of different and competing ways of looking at the world.


My First Python Scraping with Beautiful Soup

I recently needed to scrape a cost of living calculator for data.  To save time I wrote a Python program that would pull the data for all the cities.  It was my first time scraping a website in Python.  I used Beautiful Soup as I had heard other data scientists mention using it in a podcast.  The documentation provided by the developers is well written and easy to follow.  My code is hosted on GitHub in the cgr-work repository.
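For a sense of what this kind of scraping looks like, here is a minimal sketch that parses a small HTML table with Beautiful Soup. The HTML, the table id, and the function name are invented for illustration; the real calculator's markup and the code in the repository will differ, and a real scraper would first fetch each city's page rather than parse an inline string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one page of a cost of living site.
SAMPLE_HTML = """
<table id="cost-of-living">
  <tr><th>City</th><th>Index</th></tr>
  <tr><td>City A</td><td>100.0</td></tr>
  <tr><td>City B</td><td>87.5</td></tr>
</table>
"""


def parse_cost_table(html):
    """Return a dict of city -> index value from the sample table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="cost-of-living")
    rows = {}
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        # The header row has only <th> cells, so it is skipped here.
        if len(cells) == 2:
            city = cells[0].get_text(strip=True)
            rows[city] = float(cells[1].get_text(strip=True))
    return rows


data = parse_cost_table(SAMPLE_HTML)
```

Looping this function over the pages for each city, with a polite delay between requests, would be one way to collect the data for all the cities.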


Finished Process Mining: Data Science in Action

Over the last few weeks I have been taking the Coursera Process Mining: Data Science in Action course.  This was a very interesting class but was not exactly what I expected.  I expected the class to be Python or R based.  It wasn’t.  Instead it introduced me to Disco and ProM, two software packages that can mine event data to discover the process structure.  ProM is open source and programmed in Java.  It has an academic feel to it.  Disco is a commercial option and appears more polished but is limited in what it can do for you.

The professor is Wil van der Aalst of the Eindhoven University of Technology in the Netherlands.  He is an engaging professor who did a great job covering the material.  I did find the peer-reviewed project a bit of a heavy load as I was trying to get the hang of ProM, but I learned a lot from doing it (I just wish I had more time to learn more).  I would recommend this course to any would-be data scientist, as it gives a very practical application of data science.