Seeing the Tree from the Random Forest

Introduction

It is often said that ensemble method are excellent at prediction but impossible to interpret.  I was thinking about this issue and wondered couldn’t there be a way to create something that is very interpretable (say a decision tree) that represents the random forest.  It wouldn’t be a perfect representation but it would offer some insight into what may be going on under the hood.  Could this be a possible way to reverse engineer a machine learning model?

Test Case – Wine Classification

I decided that I would use the University of California Irvine’s databases on wine qualities as a data set.  The classification problem is to determine whether or not a wine is a white wine.

# Read in the white wine quality data (4898 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
df = pd.read_csv(url, sep=';')
# Add the flag that it is a white wine
df['white'] = 1

# Read in the red wine quality data (1599 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
temp = pd.read_csv(url, sep=';')
# Add the flag that it is not a white wine
temp['white'] = 0

# Build a data frame with both red and white wines
df = df.append(temp)

I would break the data into training and test sets, construct a simple classification model, then run the training back through the model to get the random forest’s predictions.

# Make this reproducible
np.random.seed(123)

# Let's create the target
target = df['white'].values

# Let's get the list of features
features = df.columns[:11]

# Break into train and test sets
training_data, test_data, training_target, test_target  = train_test_split(df[features], target, test_size=.25)

# Let's build the random forest with 500 trees
rf = RandomForestClassifier(n_estimators = 500, oob_score = True, n_jobs = -1)
rf.fit(training_data, training_target)

training_predictions = rf.predict(training_data)

I would then use the training data with the predictions to construct a decision tree.

# Build a decision tree to illustrate what is going on with the random forest
print('Creating decision tree')
dt = DecisionTreeClassifier()
dt = dt.fit(training_data, training_predictions)

I would evaluate the effectiveness of the model by comparing the predictions of the test data made by the random forest against those of the decision tree representing the random forest.

print('Testing Decision Tree of Random Forest')
dt_predictions = dt.predict(test_data)
print('Accuracy Score: ' + str(accuracy_score(dt_predictions, rf_predictions)))
print('Confusion Matrix')
print(confusion_matrix(dt_predictions, rf_predictions))

Findings

The random forest model produced by the above code had a 99% accuracy rate in classifying a wine as white.

The decision tree built off of the random forest model agreed with the random forest model 98% of the time. The final decision tree is presented below:

rf

Discussion

This was a first attempt at seeing if it were possible to make a model of a model in order to provide some insight into what is the process.  This examination shows promise but I would offer the following warnings:

Simplicity of Task

This was intentionally a very simple classification task. One reason why the results are so positive might be due to the simplicity of the task.

Value of Interpretability

The actual decision tree produced by the process would need to be taken with a grain of salt. As decision trees tend to overfit I would worry about the practical aspect of what people would do with the decision tree. The first cut point in the decision tree is the total sulfur dioxide ≤ 67.5. How much faith should be put in that exact number is questionable. I personally would treat the output as a rule-of-thumb but not get too hung up on the exact value.

Complete Source Code

The following is the source code used to produce this analysis.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

# Make this reproducible
np.random.seed(123)

# Read in the white wine quality data (4898 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
df = pd.read_csv(url, sep=';')
# Add the flag that it is a white wine
df['white'] = 1

# Read in the red wine quality data (1599 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
temp = pd.read_csv(url, sep=';')
# Add the flag that it is not a white wine
temp['white'] = 0

# Build a data frame with both red and white wines
df = df.append(temp)

# Let's create the target
target = df['white'].values

# Let's get the list of features
features = df.columns[:11]

# Break into train and test sets
training_data, test_data, training_target, test_target  = train_test_split(df[features], target, test_size=.25)

# Let's build the random forest with 500 trees
rf = RandomForestClassifier(n_estimators = 500, oob_score = True, n_jobs = -1)
rf.fit(training_data, training_target)

# Use it on the test data
rf_predictions = rf.predict(test_data)

# Assess the accuracy
print('Testing Random Forest')
print('Accuracy Score: ' + str(accuracy_score(test_target, rf_predictions)))
print('Confusion Matrix')
print(confusion_matrix(test_target, rf_predictions)) 

# Build a decision tree to illustrate what is going on with the random forest
print('Creating decision tree')
training_predictions = rf.predict(training_data)
dt = DecisionTreeClassifier()
dt = dt.fit(training_data, training_predictions)

print('Testing Decision Tree of Random Forest')
dt_predictions = dt.predict(test_data)
print('Accuracy Score: ' + str(accuracy_score(dt_predictions, rf_predictions)))
print('Confusion Matrix')
print(confusion_matrix(dt_predictions, rf_predictions))
Advertisements

Random Forests Are Cool

The Problem

I recently was trying to merge in some data work a co-worker of mine did with my data but I quickly had a problem.  The work didn’t share any common key to allow for joining.  Instead I joined on addresses which was not perfect but got a lot done.  I had roughly 4,000 records and about 1,200 of them were missing a vital variable.  To illustrate the problem direct your attention to the figures below:

rf1

The dark blue points are missing data.  It is clear to me that all the dark blue points surrounded by yellow for example should also be yellow.  How can I correct this?  Especially since I have very little time to do it?

The Solution: Random Forest

Realizing that this is a classification problem I decided to try out a random forest.  I quickly wrangled the data into training and test sets and using the caret package in R produced a cross-validated random forest.  The model had a 99.5% accuracy on the test data (3 misclassified out of 661).

I used the model to predict the values for those records missing the feature.  The results were excellent as shown below:

rf3

All Models Are Wrong, But Some Are Useful

Although this random forest model did a great job I took a closer look in the area where the orange and pinkish dots come together.  This animated image shows the dark blue dots that needed a prediction followed by what the model generated.  I circled one dot in red that is an error.  It should be orange.  But given that the data was going to be used at such a high level this error is allowable (along with the handful of others that are probably out there).  The random forest was a great tool to use to get the job done.

rf

Monte Carlo Simulation of Public Sector Spending

I recently finished reading Fooled by Randomness and have been enamored with Monte Carlo simulations.  I recently, as a pilot, tried to simulate spending of New York State County Sheriff department spending.

I used data published by the Office of the State Comptroller which has  spending by chart of account codes.  I sampled all year over year spending and used it to project spending 5 years out.  I ran 10,000 simulations and compared my results to the actual results.

simulation  The jury is still out on this one.  I can say that the further out you get the less reliable the simulations are (which I think is wonderful).  I can also say that the Monte Carlo simulation doesn’t work equally well for all account codes.  I still have some tweaking to do before making anything public. But is has definitely been educational.

Status

Bringing Pystan to Anaconda

I’ve recently learned how to create python packages for the Anaconda distribution. I have created pystan packages for 64 bit Linux and Windows systems. Anyone is free to download them with

conda install pystan -c mikesilva

This will save others the effort of downloading and compiling the program. This is not the biggest hurdle in the world but may be large enough to prevent others from using this great software package. I’m happy to make it available so we can focus our efforts on using the tool rather than building the tool.

Francis Meets Stan

About a year ago I was introduced to Stan.  Stan is a free and open-source probabilistic programming language and Bayesian inference engine.  After watching the webinar I wanted to try it out but lacked the free time.  I finally got to it this past weekend.  Here’s my tale.

I wanted to do a regression on a well-known data set.  I decided to use the data set compiled by Francis Galton on the relationship between the height of a child and their parents.  Galton observed “regression towards mediocrity” or what would be referred to today as regression to the mean.

I analysed the Galton data published in the R HistData package.  I was going to use the pystan interface so I exported the data from R into a CSV.  I did not make any adjustments to the data.

OLS Regression Model

I ran an ordinary least squares (OLS) regression using the StatsModel library as a sort of baseline.  Here are the results of that regression:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  child   R-squared:                       0.210
Model:                            OLS   Adj. R-squared:                  0.210
Method:                 Least Squares   F-statistic:                     246.8
Date:                Mon, 27 Mar 2017   Prob (F-statistic):           1.73e-49
Time:                        11:09:52   Log-Likelihood:                -2063.6
No. Observations:                 928   AIC:                             4131.
Df Residuals:                     926   BIC:                             4141.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     23.9415      2.811      8.517      0.000      18.425      29.458
parent         0.6463      0.041     15.711      0.000       0.566       0.727
==============================================================================
Omnibus:                       11.057   Durbin-Watson:                   0.046
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               10.944
Skew:                          -0.241   Prob(JB):                      0.00420
Kurtosis:                       2.775   Cond. No.                     2.61e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.61e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Stan Model

Then I tried out the same process with Stan.  It took a couple of hours of the read the documentation, try, fail, pull my hair out cycle before I was successful.  Here was my Stan code:

stan_model_code = """
data {
    int<lower=0> N; // number of cases
    vector[N] x; // predictor (covariate)
    vector[N] y; // outcome (variate)
}
parameters {
    real alpha; // intercept
    real beta; // slope
    real<lower=0> sigma; // outcome noise
}
model {
    y ~ normal(x * beta + alpha, sigma);
}
"""

stan_data = {
        'N': len(df['child'].values),
        'x': df['parent'].values,
        'y': df['child'].values
        }

stan_model = pystan.stan(model_name='galton', model_code=stan_model_code, data=stan_data, iter=1000, chains=4)

I got hung up on the data section.  I didn’t use the vector type which was throwing errors.  Here’s the output from Stan:

Inference for Stan model: galton_2a77bd156aec196d5a464494a175b11a.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha  24.24    0.11   2.67   18.9  22.45   24.2  26.17  29.37    607   1.01
beta    0.64  1.6e-3   0.04   0.57   0.61   0.64   0.67   0.72    606   1.01
sigma   2.24  1.9e-3   0.05   2.14   2.21   2.24   2.27   2.35    732    1.0
lp__   -1211    0.05   1.19  -1214  -1212  -1211  -1210  -1210    643    1.0

Samples were drawn using NUTS at Mon Mar 27 11:11:15 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Comparison of Results

I’ve pulled out all the results and put them in a side by side comparison.  The results are very similar (as one would expect).  The intercept is around 24 (18-29) with the coefficient for the parent variable at about 0.6 (0.5-0.7).

OLS Stan
Mean 2.5% 97.5% Mean 2.5% 97.5%
Intercept (Alpha) 23.9415 18.425 29.458 24.24 18.9 29.37
Parent (Beta) 0.6463 0.566 0.727 0.64 0.57 0.72

Now that I’ve use Stan I am confident I will be using this Bayesian modeling tool in the future.  As always my source code is available on my GitHub repository.

Image

Visualizing EMS Service Delivery Options

I put together a neat little visualization that allows the user to do a back-of-the-envelope calculation of what an EMS service delivery option.  You can try it out at https://msilva-cgr.shinyapps.io/essex-county-ems-options/.  It is a shiny app.  Because of the time crunch it is slow because it is doing a lot of calculations on the fly.  If I were to do it again I would have preprocessed the data and saved the results.  It would cut out the costly computations.  But in any event it is a really neat tool.

essex

How to Calculate Cosine Similarity in Excel

I often use cosine similarity at my job to find peers.  Cosine similarity is a measure of distance between two vectors.  While there are libraries in Python and R that will calculate it sometimes I’m doing a small scale project and so I use Excel.  Here’s how to do it.

First the Theory

I will not go into depth on what cosine similarity is as the web abounds in that kind of content.  Suffice it to say the formula for cosine similarity (for those of you who are mathematically inclined) is:

Cosine Similarity

So for the numerator we need to find the dot product of the two vectors.  That is done using the SUMPRODUCT function in Excel.  For each term in the denominator you need to find the square root of the sum of squares.  This is done by using the SUMSQ function nested in a SQRT function.

Now the Practice

I have put together a little template which shows how to calculate the cosine similarity.  I first organize the data in the spreadsheet so the attributes (or features or variables) go across the columns and the geographies go across each row.  I usually put who we are trying to find as a peer at the top of the spreadsheet.  I scale the data so that it ranges between zero and one. This way differences in one attribute doesn’t overpower the others simply due to a difference in scale.  With all that done I am ready to compute the cosine similarity.

Let’s suppose you have four attributes (A, B, C and D) for the baseline and five peers.  It would look like this in Excel:

Cosine Similarity Data

In order to calculate the cosine similarity of Peer 1 and the Baseline, I would divide the dot product (=SUMPRODUCT(B$2:E$2,B3:E3)) by the square root of the sum of squares multiplied together (=SQRT(SUMSQ(B3:E3))*SQRT(SUMSQ($B$2:$E$2))).  So if you want it all in one hairy formula in cell F3 it would be:

=SUMPRODUCT(B$2:E$2,B3:E3)/(SQRT(SUMSQ(B3:E3))*SQRT(SUMSQ($B$2:$E$2)))