Random Forests Are Cool

The Problem

I was recently trying to merge a co-worker's data work with my own, but I quickly hit a problem: the two datasets didn't share a common key to join on. Instead I joined on addresses, which wasn't perfect but got a lot done. I had roughly 4,000 records, and about 1,200 of them were missing a vital variable. To illustrate the problem, direct your attention to the figures below:
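The post doesn't show the join itself, but matching on addresses usually means normalizing them first so near-identical strings line up. A minimal sketch of that idea in Python with pandas (the frame and column names here are hypothetical, not from the original data):

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; column names are assumptions.
mine = pd.DataFrame({"address": ["12 Oak St.", "9 Elm Ave"], "value": [1, 2]})
theirs = pd.DataFrame({"address": ["12 oak st", "9 elm ave"], "category": ["A", "B"]})

def normalize(addr):
    # Lowercase and strip punctuation so near-identical addresses match.
    return addr.lower().replace(".", "").replace(",", "").strip()

mine["key"] = mine["address"].map(normalize)
theirs["key"] = theirs["address"].map(normalize)

# Left join keeps all of my records; unmatched rows get NaN for `category`,
# which is exactly how the missing-variable problem below arises.
merged = mine.merge(theirs[["key", "category"]], on="key", how="left")
print(merged)
```

Any record whose address fails to match ends up with a missing value, which is what the rest of the post sets out to repair.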


The dark blue points are missing data.  It is clear to me that the dark blue points surrounded by yellow, for example, should also be yellow.  How can I correct this, especially since I have very little time to do it?

The Solution: Random Forest

Realizing that this is a classification problem, I decided to try out a random forest.  I quickly wrangled the data into training and test sets and, using the caret package in R, produced a cross-validated random forest.  The model had 99.5% accuracy on the test data (3 misclassified out of 661).
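The original workflow used caret in R; an equivalent sketch of the same steps (train/test split, cross-validation, held-out accuracy) in Python with scikit-learn, on synthetic data standing in for the real records:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: two spatial coordinates and a
# class label that depends on location, mimicking the colored regions.
X = rng.uniform(0, 10, size=(4000, 2))
y = (X[:, 0] > 5).astype(int) + 2 * (X[:, 1] > 5).astype(int)  # four regions

# Hold out a test set the same size as the post's (661 records).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=661, random_state=0
)

rf = RandomForestClassifier(n_estimators=500, random_state=0)

# Cross-validate on the training set, then fit and score on held-out data.
cv_scores = cross_val_score(rf, X_train, y_train, cv=5)
rf.fit(X_train, y_train)
test_acc = rf.score(X_test, y_test)
print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_acc:.3f}")
```

Cross-validating on the training data keeps the test set untouched, so the final accuracy number is an honest estimate of how the model will do on the records that actually need predictions.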

I used the model to predict the values for the records missing the variable.  The results were excellent, as shown below:
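The imputation step amounts to training on the rows where the label is known and filling in only the rows where it is missing. A minimal sketch, again in Python with scikit-learn on synthetic data (column names, sizes, and labels are illustrative, not the author's):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Synthetic frame: coordinates plus a label that is missing for ~1,200 rows,
# mirroring the shape of the problem in the post.
df = pd.DataFrame(rng.uniform(0, 10, size=(4000, 2)), columns=["x", "y"])
df["label"] = np.where(df["x"] > 5, "orange", "yellow")
missing = rng.choice(4000, size=1200, replace=False)
df.loc[missing, "label"] = None

# Train only on records whose label is known.
known = df["label"].notna()
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(df.loc[known, ["x", "y"]], df.loc[known, "label"])

# Fill only the missing labels with the model's predictions.
df.loc[~known, "label"] = rf.predict(df.loc[~known, ["x", "y"]])
print(df["label"].isna().sum())  # count of still-missing labels
```

Keeping the known labels untouched and overwriting only the missing ones means the imputation can't degrade the data you already trust.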


All Models Are Wrong, But Some Are Useful

Although this random forest model did a great job, I took a closer look at the area where the orange and pinkish dots come together.  The animated image shows the dark blue dots that needed a prediction, followed by what the model generated.  I circled one dot in red that is an error; it should be orange.  But given that the data was going to be used at such a high level, this error is allowable (along with the handful of others that are probably out there).  The random forest was a great tool for getting the job done.