Seeing the Tree from the Random Forest

Introduction

It is often said that ensemble methods are excellent at prediction but nearly impossible to interpret.  I was thinking about this issue and wondered: couldn’t there be a way to create something very interpretable (say, a decision tree) that represents the random forest?  It wouldn’t be a perfect representation, but it would offer some insight into what may be going on under the hood.  Could this be a way to reverse engineer a machine learning model?

Test Case – Wine Classification

I decided to use the University of California Irvine Wine Quality datasets.  The classification problem is to determine whether or not a wine is a white wine.

```
# Read in the white wine quality data (4898 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
df = pd.read_csv(url, sep=';')
# Add the flag that it is a white wine
df['white'] = 1

# Read in the red wine quality data (1599 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
temp = pd.read_csv(url, sep=';')
# Add the flag that it is not a white wine
temp['white'] = 0

# Build a data frame with both red and white wines
df = pd.concat([df, temp], ignore_index=True)
```

I would break the data into training and test sets, construct a simple classification model, then run the training data back through the model to get the random forest’s predictions.

```
# Make this reproducible
np.random.seed(123)

# Create the target
target = df['white'].values

# Get the list of features (the 11 physicochemical measurements)
features = df.columns[:11]

# Break into train and test sets
training_data, test_data, training_target, test_target = train_test_split(
    df[features], target, test_size=0.25, random_state=123)

# Build the random forest with 500 trees
rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            n_jobs=-1, random_state=123)
rf.fit(training_data, training_target)

# Run the training data back through the forest
training_predictions = rf.predict(training_data)
```

I would then use the training data with the predictions to construct a decision tree.

```
# Build a decision tree to illustrate what is going on with the random forest
print('Creating decision tree')
dt = DecisionTreeClassifier()
dt = dt.fit(training_data, training_predictions)
```
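To actually look at the fitted surrogate tree, scikit-learn can render it as plain text with `export_text` (or write a graphviz file with `export_graphviz`, as in the full source below).  Here is a minimal self-contained sketch of the same surrogate-tree idea on synthetic stand-in data — the arrays and the feature names `f0`/`f1`/`f2` are illustrative assumptions, not the wine data:

```
# Sketch: fit a small surrogate tree to a forest's predictions and
# print it as text.  Synthetic data stands in for the wine set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(123)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=123).fit(X, y)

# Fit the surrogate on the forest's predictions, not the true labels
surrogate = DecisionTreeClassifier(max_depth=3, random_state=123)
surrogate.fit(X, rf.predict(X))

# export_text gives a quick text rendering of the split structure
print(export_text(surrogate, feature_names=['f0', 'f1', 'f2']))
```

Capping `max_depth` in the sketch keeps the printout readable; without a cap (as in the article's code) the surrogate grows until it reproduces the forest's labels almost exactly.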

I would evaluate the effectiveness of the model by comparing the predictions of the test data made by the random forest against those of the decision tree representing the random forest.

```
print('Testing Decision Tree of Random Forest')
rf_predictions = rf.predict(test_data)
dt_predictions = dt.predict(test_data)
print('Accuracy Score: ' + str(accuracy_score(rf_predictions, dt_predictions)))
print('Confusion Matrix')
print(confusion_matrix(rf_predictions, dt_predictions))
```

Findings

The random forest model produced by the above code had a 99% accuracy rate in classifying a wine as white.

The decision tree built off of the random forest model agreed with the random forest model 98% of the time. The final decision tree is presented below:

Discussion

This was a first attempt at seeing whether it is possible to make a model of a model in order to provide some insight into the underlying process.  The examination shows promise, but I would offer the following warnings:

This was intentionally a very simple classification task. One reason the results are so positive may simply be the simplicity of the task.

Value of Interpretability

The actual decision tree produced by this process needs to be taken with a grain of salt. Because decision trees tend to overfit, I would worry about what people might do with the tree in practice. The first cut point in the decision tree is total sulfur dioxide ≤ 67.5. How much faith should be put in that exact number is questionable. I personally would treat the output as a rule of thumb and not get too hung up on the exact value.
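The cut point can also be read programmatically from the fitted tree’s `tree_` structure, which makes it easy to check how much it moves when the training sample changes. A self-contained sketch on synthetic data — the true boundary of 0.5 here is a constructed example, not the wine figure:

```
# Sketch: read the root split of a surrogate tree directly from
# sklearn's tree_ arrays.  Synthetic data, not the wine set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0.5).astype(int)   # true boundary at 0.5 by construction

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
dt = DecisionTreeClassifier(max_depth=2, random_state=0)
dt.fit(X, rf.predict(X))

root_feature = dt.tree_.feature[0]      # index of the feature at the root
root_threshold = dt.tree_.threshold[0]  # the learned cut point
print(f'root split: feature {root_feature} <= {root_threshold:.3f}')
```

Refitting on a resampled training set will typically move that threshold a little each time, which is exactly why the 67.5 above is best read as a rule of thumb rather than a precise boundary.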

Complete Source Code

The following is the source code used to produce this analysis.

```
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

# Make this reproducible
np.random.seed(123)

# Read in the white wine quality data (4898 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
df = pd.read_csv(url, sep=';')
# Add the flag that it is a white wine
df['white'] = 1

# Read in the red wine quality data (1599 records and 12 features)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
temp = pd.read_csv(url, sep=';')
# Add the flag that it is not a white wine
temp['white'] = 0

# Build a data frame with both red and white wines
df = pd.concat([df, temp], ignore_index=True)

# Create the target
target = df['white'].values

# Get the list of features (the 11 physicochemical measurements)
features = df.columns[:11]

# Break into train and test sets
training_data, test_data, training_target, test_target = train_test_split(
    df[features], target, test_size=0.25, random_state=123)

# Build the random forest with 500 trees
rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            n_jobs=-1, random_state=123)
rf.fit(training_data, training_target)

# Use it on the test data
rf_predictions = rf.predict(test_data)

# Assess the accuracy
print('Testing Random Forest')
print('Accuracy Score: ' + str(accuracy_score(test_target, rf_predictions)))
print('Confusion Matrix')
print(confusion_matrix(test_target, rf_predictions))

# Build a decision tree to illustrate what is going on with the random forest
print('Creating decision tree')
training_predictions = rf.predict(training_data)
dt = DecisionTreeClassifier()
dt = dt.fit(training_data, training_predictions)

print('Testing Decision Tree of Random Forest')
dt_predictions = dt.predict(test_data)
print('Accuracy Score: ' + str(accuracy_score(rf_predictions, dt_predictions)))
print('Confusion Matrix')
print(confusion_matrix(rf_predictions, dt_predictions))

# Export the fitted tree for rendering (e.g. with graphviz's dot tool)
export_graphviz(dt, out_file='tree.dot', feature_names=features)
```