Seeing the Tree from the Random Forest

Introduction

It is often said that ensemble method are excellent at prediction but impossible to interpret.  I was thinking about this issue and wondered couldn’t there be a way to create something that is very interpretable (say a decision tree) that represents the random forest.  It wouldn’t be a perfect representation but it would offer some insight into what may be going on under the hood.  Could this be a possible way to reverse engineer a machine learning model?

Test Case – Wine Classification

I decided that I would use the University of California Irvine’s databases on wine qualities as a data set.  The classification problem is to determine whether or not a wine is a white wine.

# Read in the white wine quality data (4898 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
df = pd.read_csv(url, sep=';')
# Add the flag that it is a white wine
df['white'] = 1

# Read in the red wine quality data (1599 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
temp = pd.read_csv(url, sep=';')
# Add the flag that it is not a white wine
temp['white'] = 0

# Build a data frame with both red and white wines
df = df.append(temp)

I would break the data into training and test sets, construct a simple classification model, then run the training back through the model to get the random forest’s predictions.

# Make this reproducible
np.random.seed(123)

# Let's create the target
target = df['white'].values

# Let's get the list of features
features = df.columns[:11]

# Break into train and test sets
training_data, test_data, training_target, test_target  = train_test_split(df[features], target, test_size=.25)

# Let's build the random forest with 500 trees
rf = RandomForestClassifier(n_estimators = 500, oob_score = True, n_jobs = -1)
rf.fit(training_data, training_target)

training_predictions = rf.predict(training_data)

I would then use the training data with the predictions to construct a decision tree.

# Build a decision tree to illustrate what is going on with the random forest
print('Creating decision tree')
dt = DecisionTreeClassifier()
dt = dt.fit(training_data, training_predictions)

I would evaluate the effectiveness of the model by comparing the predictions of the test data made by the random forest against those of the decision tree representing the random forest.

print('Testing Decision Tree of Random Forest')
dt_predictions = dt.predict(test_data)
print('Accuracy Score: ' + str(accuracy_score(dt_predictions, rf_predictions)))
print('Confusion Matrix')
print(confusion_matrix(dt_predictions, rf_predictions))

Findings

The random forest model produced by the above code had a 99% accuracy rate in classifying a wine as white.

The decision tree built off of the random forest model agreed with the random forest model 98% of the time. The final decision tree is presented below:

rf

Discussion

This was a first attempt at seeing if it were possible to make a model of a model in order to provide some insight into what is the process.  This examination shows promise but I would offer the following warnings:

Simplicity of Task

This was intentionally a very simple classification task. One reason why the results are so positive might be due to the simplicity of the task.

Value of Interpretability

The actual decision tree produced by the process would need to be taken with a grain of salt. As decision trees tend to overfit I would worry about the practical aspect of what people would do with the decision tree. The first cut point in the decision tree is the total sulfur dioxide ≤ 67.5. How much faith should be put in that exact number is questionable. I personally would treat the output as a rule-of-thumb but not get too hung up on the exact value.

Complete Source Code

The following is the source code used to produce this analysis.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

# Make this reproducible
np.random.seed(123)

# Read in the white wine quality data (4898 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
df = pd.read_csv(url, sep=';')
# Add the flag that it is a white wine
df['white'] = 1

# Read in the red wine quality data (1599 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
temp = pd.read_csv(url, sep=';')
# Add the flag that it is not a white wine
temp['white'] = 0

# Build a data frame with both red and white wines
df = df.append(temp)

# Let's create the target
target = df['white'].values

# Let's get the list of features
features = df.columns[:11]

# Break into train and test sets
training_data, test_data, training_target, test_target  = train_test_split(df[features], target, test_size=.25)

# Let's build the random forest with 500 trees
rf = RandomForestClassifier(n_estimators = 500, oob_score = True, n_jobs = -1)
rf.fit(training_data, training_target)

# Use it on the test data
rf_predictions = rf.predict(test_data)

# Assess the accuracy
print('Testing Random Forest')
print('Accuracy Score: ' + str(accuracy_score(test_target, rf_predictions)))
print('Confusion Matrix')
print(confusion_matrix(test_target, rf_predictions)) 

# Build a decision tree to illustrate what is going on with the random forest
print('Creating decision tree')
training_predictions = rf.predict(training_data)
dt = DecisionTreeClassifier()
dt = dt.fit(training_data, training_predictions)

print('Testing Decision Tree of Random Forest')
dt_predictions = dt.predict(test_data)
print('Accuracy Score: ' + str(accuracy_score(dt_predictions, rf_predictions)))
print('Confusion Matrix')
print(confusion_matrix(dt_predictions, rf_predictions))
Advertisements

Monte Carlo Simulation of Public Sector Spending

I recently finished reading Fooled by Randomness and have been enamored with Monte Carlo simulations.  I recently, as a pilot, tried to simulate spending of New York State County Sheriff department spending.

I used data published by the Office of the State Comptroller which has  spending by chart of account codes.  I sampled all year over year spending and used it to project spending 5 years out.  I ran 10,000 simulations and compared my results to the actual results.

simulation  The jury is still out on this one.  I can say that the further out you get the less reliable the simulations are (which I think is wonderful).  I can also say that the Monte Carlo simulation doesn’t work equally well for all account codes.  I still have some tweaking to do before making anything public. But is has definitely been educational.

Status

Bringing Pystan to Anaconda

I’ve recently learned how to create python packages for the Anaconda distribution. I have created pystan packages for 64 bit Linux and Windows systems. Anyone is free to download them with

conda install pystan -c mikesilva

This will save others the effort of downloading and compiling the program. This is not the biggest hurdle in the world but may be large enough to prevent others from using this great software package. I’m happy to make it available so we can focus our efforts on using the tool rather than building the tool.

Francis Meets Stan

About a year ago I was introduced to Stan.  Stan is a free and open-source probabilistic programming language and Bayesian inference engine.  After watching the webinar I wanted to try it out but lacked the free time.  I finally got to it this past weekend.  Here’s my tale.

I wanted to do a regression on a well-known data set.  I decided to use the data set compiled by Francis Galton on the relationship between the height of a child and their parents.  Galton observed “regression towards mediocrity” or what would be referred to today as regression to the mean.

I analysed the Galton data published in the R HistData package.  I was going to use the pystan interface so I exported the data from R into a CSV.  I did not make any adjustments to the data.

OLS Regression Model

I ran an ordinary least squares (OLS) regression using the StatsModel library as a sort of baseline.  Here are the results of that regression:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  child   R-squared:                       0.210
Model:                            OLS   Adj. R-squared:                  0.210
Method:                 Least Squares   F-statistic:                     246.8
Date:                Mon, 27 Mar 2017   Prob (F-statistic):           1.73e-49
Time:                        11:09:52   Log-Likelihood:                -2063.6
No. Observations:                 928   AIC:                             4131.
Df Residuals:                     926   BIC:                             4141.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     23.9415      2.811      8.517      0.000      18.425      29.458
parent         0.6463      0.041     15.711      0.000       0.566       0.727
==============================================================================
Omnibus:                       11.057   Durbin-Watson:                   0.046
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               10.944
Skew:                          -0.241   Prob(JB):                      0.00420
Kurtosis:                       2.775   Cond. No.                     2.61e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.61e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Stan Model

Then I tried out the same process with Stan.  It took a couple of hours of the read the documentation, try, fail, pull my hair out cycle before I was successful.  Here was my Stan code:

stan_model_code = """
data {
    int<lower=0> N; // number of cases
    vector[N] x; // predictor (covariate)
    vector[N] y; // outcome (variate)
}
parameters {
    real alpha; // intercept
    real beta; // slope
    real<lower=0> sigma; // outcome noise
}
model {
    y ~ normal(x * beta + alpha, sigma);
}
"""

stan_data = {
        'N': len(df['child'].values),
        'x': df['parent'].values,
        'y': df['child'].values
        }

stan_model = pystan.stan(model_name='galton', model_code=stan_model_code, data=stan_data, iter=1000, chains=4)

I got hung up on the data section.  I didn’t use the vector type which was throwing errors.  Here’s the output from Stan:

Inference for Stan model: galton_2a77bd156aec196d5a464494a175b11a.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha  24.24    0.11   2.67   18.9  22.45   24.2  26.17  29.37    607   1.01
beta    0.64  1.6e-3   0.04   0.57   0.61   0.64   0.67   0.72    606   1.01
sigma   2.24  1.9e-3   0.05   2.14   2.21   2.24   2.27   2.35    732    1.0
lp__   -1211    0.05   1.19  -1214  -1212  -1211  -1210  -1210    643    1.0

Samples were drawn using NUTS at Mon Mar 27 11:11:15 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Comparison of Results

I’ve pulled out all the results and put them in a side by side comparison.  The results are very similar (as one would expect).  The intercept is around 24 (18-29) with the coefficient for the parent variable at about 0.6 (0.5-0.7).

OLS Stan
Mean 2.5% 97.5% Mean 2.5% 97.5%
Intercept (Alpha) 23.9415 18.425 29.458 24.24 18.9 29.37
Parent (Beta) 0.6463 0.566 0.727 0.64 0.57 0.72

Now that I’ve use Stan I am confident I will be using this Bayesian modeling tool in the future.  As always my source code is available on my GitHub repository.

Image

Visualizing U.S. Commuters

I recently analyzed the patterns of U.S. Commuters and created a visualization that summarizes these patterns at the state level.

About the Data

This data is from the Census Transporation Planning Products (CTPP). For those who don’t know, the CTPP is derived from the 2006–2010 5-year American Community Survey (ACS) data. You can learn more about this data set by visiting the home page. In an effort to make this easily reproducible, you can download the csv used in this analysis. I chose this data source because I have used it in the past and am familiar with it.

Steps Taken

I created a directed network graph from this data. I used python’s networkx and pandas packages and the complete source code is provided below. I excluded Puerto Rico from the data set because I wanted to analyze state level patterns. I did leave the District of Columbia in thus there are 51 States in the analysis.

I wanted to use Gephi to generate the visualization because I have done so in the past. After creating the directed network graph using Python, I imported the data into Gephi and used the community detection algorithm using the default settings (Randomized checked, Use weights checked, Resolution = 1.0). This resulted in 7 communities being detected.

I grouped the states by these communities and colored them and placed them in a circular layout with straight edges between the nodes. I varied the width based on the edge weight.

Visualization

Commuters Network

Observations

There are a couple of things that are of interest. The first thing to acknowledge is the proliferation of edges in this network.  Almost all of the states are connected with the other states.  This results in the “spirograph” like effect in the visualization.  However I don’t find that to be the most interesting aspect.

The sub-networks highlighted in this visualization are particularly interesting.  For example one readily sees that a lot of people in New Jersey work in New York.  As a former New Yorker there is no surprises there.  You also see the Capital Beltway connections in the visualization.  Residents of Maryland and Virginia find work in D.C.  The community detection algorithm highlights this finding.  Since the states are arranged by “community”, seeing the cross-community connections are interesting.  For example the Connecticut to New York have a connection.

Source Code

The following is how I generated the directed network graph file for Gephi from the CSV.

import pandas as pd
import networkx as nx

df = pd.read_csv('Table 1 Commuting Flows Available at State to POW State and County to POW County only.csv', skiprows=[0,1], thousands=',')
# Only pull the Estimates
df = df[df['Output']=='Estimate']
# Only pull the first 4 columns
df = df[df.columns[:4]]
# Drop the N/A's
df = df.dropna()
# Drop the Output column
df = df.drop('Output', 1)
# Rename the columns
df.columns = ['from','to','weight']
# Drop the Puerto Rico records
df = df[df['from'] != 'Puerto Rico']
df = df[df['to'] != 'Puerto Rico']
# Remove the people who work where they live
commuters = df[df['from'] != df['to']]
# Build the network graph
G = nx.from_pandas_dataframe(commuters, 'from','to', ['weight'], create_using=nx.DiGraph())
# Write the graph so I can use Gephi
nx.write_gexf(G,'Commuters.gexf')

Searching for a Needle in a Haystack with Python

I recently was working with New York State libraries data (hopefully more to come on this) and was trying to match up two data sets on the library name.  But due to minor variations in the names (i.e. an abbreviation or punctuation marks) this is kinda a messy process.  I found fuzzywuzzy which is a Python library that can help out.  Here’s how I constructed my search:

from fuzzywuzzy import fuzz
import pandas as pd

haystack = pd.read_csv('web_list.csv')
haystack = haystack['Libraries'].values.tolist()

needles = pd.read_csv('excel_list.csv')
needles = needles['Library'].values.tolist()

best_matches = []

# Look for best match
for needle in needles:
    print('Searching for: '+needle)
    best_match_needle = ''
    best_match_hay = ''
    best_match_ratio = 0
    search_for_match = True
    while search_for_match:
        for hay in haystack:
            match_score = fuzz.partial_ratio(needle, hay)
            if match_score > best_match_ratio:
                best_match_ratio = match_score
                best_match_needle = needle
                best_match_hay = hay
            if match_score == 100:
                search_for_match = False
        # Looped through haystack so stop searching
        search_for_match = False
    # Append best match search results
    row = {'Searched':best_match_needle, 'Found':best_match_hay, 'Ratio':best_match_ratio}
    best_matches.append(row)

df = pd.DataFrame(best_matches)
writer = pd.ExcelWriter('Best Matches.xlsx', engine='xlsxwriter')
df.to_excel(writer,'Sheet1', index=False)
writer.save()

The results are pretty good.  I tried the fuzzywuzzy processes but I was getting a lot of goofy results.  My only caution is that the ratio in my implementation should not be strongly trusted.  I had a 100 for “Babylon Public Library” being a match with “North Babylon Public Library.”

K Means Clustering (Week 4 Assignment)

Introduction

A k-means clustering analysis was conducted to divide countries into subgroups based on variables that were determined to have an impact on economic well-being. The data for the analysis is and extract from the GapMinder project. The GapMinder project collects country-level time series data on health, wealth and development. The data set for this analysis only has one year of data for 213 countries.

Potential Clustering Variables

A random forest analysis was preformed on a training set (N=111) to evaluate a series of explanatory variables in predicting a categorical GDP per person binned into thirds. The following explanatory variables were evaluated:

  • Alcohol Consumption – 2008 recorded and estimated average alcohol consumption, adult (15+) per capita as collected by the World Heath Organization
  • CO2 Emissions – Total amount of CO2 emission in metric tons from 1751 to 2006 as collected by CDIAC
  • Female Employment Rate – Percentage of female population, age above 15, that has been
    employed during 2007 as collected by the International Labour Organization
  • Internet Use Rate – 2010 Internet users per 100 people as collected by the World Bank
  • Life Expectancy – 2011 life expectancy at birth (in years) as collected by various sources
  • Polity Score – 2009 Democracy score as collected by the Polity IV Project
  • Employment Rate – Percentage of total population, age above 15, that has been employed during 2009 as collected by the International Labour Organization
  • Urbanization Rate – 2008 Urban population (% total population) as collected by the World Bank

The following table summarizes the variables importance in classifying countries into high middle and low GDP per person countries:

Table 1. Variables Importance

Variable Importance
Life Expectancy 35%
Internet Use Rate 21%
Urbanization Rate 11%
Female Employ 8%
CO2 Emissions 8%
Employ Rate 6%
Alcohol Consumption 6%
Polity Score 5%

The explanatory variables with a relative importance scores greater than or equal to 10% (life expectancy, internet use rate and urbanization rate) we selected as clustering variables.

K-Means Clustering

All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. Data were randomly split into a training set that included 70% of the observations (N=121) and a test set that included 30% of the observations (N=52). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The mean Euclidean distance was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for selecting the number of clusters. The following figure illustrates the elbow curve

Figure 1. Selecting K with the Elbow Curve Method

elbow

Possible levels of k include 2, 3, 4 and 7. Each level was evaluated to see if there was a significant difference between levels of economic well-being (GDP per person). Breaking the data into 2 clusters proved to be a good choice however I hasten to note that 7 was not a bad choice. My preference towards less complication than was needed lead to selecting k=2.

Findings

Cluster number zero has a much lower economic well-being per person than cluster one.

Table 2. GDP Per Person by Cluster

Cluster Average GDP per Person
Zero $1,373
One $15,684

The following figure shows the clusters (zero in blue, one in green) and the clustering variables:

Figure 2. Clustering Variables Scaterplots

k-means pairplot

Discussion

This examination was done as an exercise in doing K-means clustering. The data in the GapMinder dataset does not lend itself to clustering. The scatter plots above illustrate the lack of distinct clusters. Caution should be exercised in using/interpreting these results.

Source Code

As always this project can be found in this GitHub repository.

# Import libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
sns.set_style('whitegrid')
sns.set_context('talk')

# Eliminate false positive SettingWithCopyWarning
pd.options.mode.chained_assignment = None

# Make results reproducible
np.random.seed(1234567890)

df = pd.read_csv('gapminder.csv')

variables = ['incomeperperson', 'alcconsumption', 'co2emissions', 'femaleemployrate', 
                'internetuserate', 'lifeexpectancy','polityscore','employrate','urbanrate'] 

# convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')

# listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
print(subset.shape)
print("\n")
"""
"===============  Random Forest to Select Clustering Variables  ===============
"""
n_estimators=25

subset['incomequartiles'] = pd.cut(subset['incomeperperson'], 3, labels=['0%-33%','34%-66%','67%-100%'])
subset['incomequartiles'] = subset['incomequartiles'].astype('category')

variables.pop(0)

predictors = subset[variables]
targets = subset['incomequartiles']

#Split into training and testing sets+
training_data, test_data, training_target, test_target  = train_test_split(predictors, targets, test_size=.25)

# Build the random forest classifier
classifier=RandomForestClassifier(n_estimators=n_estimators)
classifier=classifier.fit(training_data,training_target)

predictions=classifier.predict(test_data)

# Fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(training_data,training_target)

# Display the relative importance of each attribute
feature_name = list(predictors.columns.values)
feature_importance = list(model.feature_importances_)
features = pd.DataFrame({'name':feature_name, 'importance':feature_importance}).sort_values(by='importance', ascending=False)
print(features.head(len(feature_name)))

"""
" =============================  Data Management  =============================
"""
variables = ['incomeperperson', 'lifeexpectancy', 'internetuserate', 'urbanrate'] 

# convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')

# listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
print(subset.shape)

subset['incomequartiles'] = pd.cut(subset['incomeperperson'], 3, labels=['0%-33%','34%-66%','67%-100%'])
subset['incomequartiles'] = subset['incomequartiles'].astype('category')

# Remove the first variable from the list since the target is derived from it
variables.pop(0)

# Center and scale data
for variable in variables:
    subset[variable]=preprocessing.scale(subset[variable].astype('float64'))
    
features = subset[variables]
targets = subset[['incomeperperson']]

"""
" ==================  Split Data into Training and Test Sets  ==================
"""
# Split into training and testing sets
training_data, test_data, training_target, test_target  = train_test_split(features, targets, test_size=.3)
print('Size of training data')
print(training_data.shape)

"""
" =====================  Determine the Number of Clusters  ====================
"""
# Identify number of clusters using the elbow method
clusters=range(1,10)
meandist=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(training_data)
    clusassign=model.predict(training_data)
    meandist.append(sum(np.min(cdist(training_data, model.cluster_centers_, 'euclidean'), axis=1)) / training_data.shape[0])

# Visualize the elbow
k = 2

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(clusters, meandist)
ax.plot(clusters[(k-1)], meandist[(k-1)], marker='o', markersize=12, 
    markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of Clusters')
plt.ylabel('Average Distance')
plt.title('Selecting K with the Elbow Method')
plt.show()

"""
" ==========================  Visualize the Clusters  =========================
"""
model=KMeans(n_clusters=k)
model.fit(training_data)
training_data['cluster'] = model.labels_

my_cmap = plt.cm.get_cmap('brg')
my_cmap.set_under('w')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(training_data.iloc[:,0], training_data.iloc[:,1], training_data.iloc[:,2], c=training_data['cluster'], cmap=my_cmap)
ax.set_xlabel(training_data.columns.values[0])
ax.set_ylabel(training_data.columns.values[1])
ax.set_zlabel(training_data.columns.values[2])
plt.show()
 
sns.pairplot(training_data, hue ='cluster');

"""
" ====================  Examine Differences Between Clusters  =================
"""

training_target['cluster'] = model.labels_
income_model = smf.ols(formula='incomeperperson ~ C(cluster)', data=training_target).fit()
print (income_model.summary())

print ('means for features by cluster')
m1= training_target.groupby('cluster').mean()
print (m1)

print ('standard deviations for features by cluster')
m2= training_target.groupby('cluster').std()
print (m2)

mc1 = multi.MultiComparison(training_target['incomeperperson'], training_target['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())