I was recently trying to merge some data work a co-worker had done with my own data, but I quickly ran into a problem: the two data sets shared no common key to join on. Instead I joined on addresses, which was not perfect but got a lot done. I had roughly 4,000 records, and about 1,200 of them were missing a vital variable. To illustrate the problem, direct your attention to the figures below:
The dark blue points are missing data. It is clear to me that all the dark blue points surrounded by yellow, for example, should also be yellow. How can I correct this, especially when I have very little time to do it?
The Solution: Random Forest
Realizing that this is a classification problem, I decided to try a random forest. I quickly wrangled the data into training and test sets and, using the caret package in R, produced a cross-validated random forest. The model had 99.5% accuracy on the test data (3 misclassified out of 661).
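For readers who want to see the moving parts, here is a minimal sketch of the idea in plain Python rather than the caret workflow described above. Everything in it is an assumption for illustration: the three synthetic clusters stand in for the coloured groups on the map, the function names are made up, and it skips the cross-validation and tuning that caret takes care of. It only shows the core recipe: bootstrap the training rows, grow a tree on each sample with a random feature at every split, and let the trees vote.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score, n = None, gini(labels), len(rows)
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, rng, depth=0, max_depth=8):
    """Grow one tree; a leaf is a label, an inner node is (f, t, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    n_features = len(rows[0])
    # Consider a random subset of features at each split (sqrt rule).
    features = rng.sample(range(n_features), max(1, int(n_features ** 0.5)))
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    f, t = split
    left = [i for i, row in enumerate(rows) if row[f] <= t]
    right = [i for i, row in enumerate(rows) if row[f] > t]
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       rng, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       rng, depth + 1, max_depth))


def predict_tree(node, row):
    """Walk from the root to a leaf and return that leaf's label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Majority vote across all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


# Synthetic stand-in data (an assumption): three clusters of map coordinates.
rng = random.Random(42)
centres = {"yellow": (0.0, 0.0), "orange": (5.0, 5.0), "pink": (0.0, 5.0)}
data = [((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)), label)
        for label, (cx, cy) in centres.items() for _ in range(100)]
rng.shuffle(data)
train, test = data[:240], data[240:]  # 80/20 split

forest = fit_forest([row for row, _ in train], [label for _, label in train])
accuracy = sum(predict_forest(forest, row) == label
               for row, label in test) / len(test)
print(f"held-out accuracy: {accuracy:.3f}")
```

The held-out accuracy here plays the same role as the figure quoted above: it is measured only on rows the forest never saw during fitting, so it is a fair guess at how well the predictions for the unlabelled records will hold up.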
I then used the model to predict the missing variable for the records that lacked it. The results were excellent, as shown below:
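The fill-in step itself is simple bookkeeping, and a rough sketch may help. In the Python below the records, the field names (`x`, `y`, `zone`, `imputed`) and the `predict` function are all hypothetical. In particular, `predict` is only a nearest-labelled-neighbour placeholder standing in for whatever fitted classifier is available, not the random forest itself. The point is to fill in only the records whose label is missing and to flag them, so predicted values can be told apart from observed ones later.

```python
import math

# Hypothetical records: "zone" is None where the variable is missing.
records = [
    {"id": 1, "x": 0.1, "y": 0.2, "zone": "yellow"},
    {"id": 2, "x": 0.3, "y": 0.1, "zone": "yellow"},
    {"id": 3, "x": 5.1, "y": 4.9, "zone": "orange"},
    {"id": 4, "x": 4.8, "y": 5.2, "zone": "orange"},
    {"id": 5, "x": 0.2, "y": 0.3, "zone": None},
    {"id": 6, "x": 5.0, "y": 5.0, "zone": None},
]

labelled = [r for r in records if r["zone"] is not None]


def predict(record):
    """Placeholder for a fitted model: label of the nearest labelled record."""
    nearest = min(
        labelled,
        key=lambda r: math.dist((r["x"], r["y"]), (record["x"], record["y"])),
    )
    return nearest["zone"]


# Fill only the gaps, and record which values came from the model.
filled = []
for r in records:
    missing = r["zone"] is None
    zone = predict(r) if missing else r["zone"]
    filled.append({**r, "zone": zone, "imputed": missing})

for r in filled:
    print(r["id"], r["zone"], "(predicted)" if r["imputed"] else "")
```

Keeping an `imputed` flag costs nothing and makes it easy to plot or audit just the predicted points afterwards.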
All Models Are Wrong, But Some Are Useful
Although this random forest model did a great job, I took a closer look at the area where the orange and pinkish dots come together. This animated image shows the dark blue dots that needed a prediction, followed by what the model generated. I circled in red one dot that is an error; it should be orange. But given that the data was going to be used at such a high level, this error is acceptable (along with the handful of others that are probably out there). The random forest was a great tool for getting the job done.