Running a Random Forest (Week 2 Assignment)


A random forest analysis was performed to evaluate a series of explanatory variables in predicting a binary categorical variable. The data for the analysis is an extract from the GapMinder project, which collects country-level time series data on health, wealth and development. The data set for this analysis has only one year of data for 213 countries.

High Income (Response Variable)

The 2010 Gross Domestic Product (GDP) per capita was classified as high income for cases where the absolute deviation from the median divided by the mean absolute deviation is greater than 3. The GDP per capita is measured in constant 2000 U.S. dollars and originally came from the World Bank’s World Development Indicators.
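To make the threshold rule concrete, here is a minimal sketch on made-up income figures (not GapMinder data). It assumes, as the source code at the end of this post does, that deviations are measured from the median and scaled by the mean of the absolute deviations. The variable names are my own and only illustrative.

```python
from statistics import fmean, median

# Hypothetical GDP per capita values, invented for illustration only
incomes = [400, 800, 1200, 2500, 5000, 9000, 40000]

# Absolute deviation of each value from the median income
center = median(incomes)
abs_dev = [abs(x - center) for x in incomes]

# Scale by the mean of the absolute deviations (the "MAD" used in this post)
mad = fmean(abs_dev)

# Flag a country as high income when its scaled deviation exceeds 3
threshold = 3
high_income = [d / mad > threshold for d in abs_dev]

for income, flag in zip(incomes, high_income):
    print(income, "Yes" if flag else "No")
```

In this toy list only the last value is flagged. Because the rule uses the absolute deviation, it relies on income being strongly right-skewed so that only very large values end up far from the median.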

Explanatory Variables

The following explanatory variables were evaluated:

  • Alcohol Consumption – 2008 recorded and estimated average alcohol consumption, adult (15+) per capita as collected by the World Health Organization
  • CO2 Emissions – Total amount of CO2 emissions in metric tons from 1751 to 2006 as collected by CDIAC
  • Female Employment Rate – Percentage of female population, age above 15, that has been employed during 2007 as collected by the International Labour Organization
  • Internet Use Rate – 2010 Internet users per 100 people as collected by the World Bank
  • Life Expectancy – 2011 life expectancy at birth (in years) as collected by various sources
  • Polity Score – 2009 Democracy score as collected by the Polity IV Project
  • Employment Rate – Percentage of total population, age above 15, that has been employed during 2009 as collected by the International Labour Organization
  • Urbanization Rate – 2008 Urban population (% total population) as collected by the World Bank


The explanatory variables with the highest relative importance scores were life expectancy, internet use rate and urbanization rate.

Table 1 – Variable Importance

Variable                  Importance
Life Expectancy           40%
Internet Use Rate         21%
Urbanization Rate         10%
CO2 Emissions              7%
Female Employment Rate     7%
Alcohol Consumption        7%
Employment Rate            5%
Polity Score               4%

The random forest classified 97% of the test set correctly. Growing more than 3 trees added little to the overall accuracy of the model.
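The accuracy score is the share of test-set countries whose predicted class matches the actual class, and the confusion matrix breaks that down by class. As a rough illustration, the sketch below computes both by hand from made-up labels (these are not the GapMinder results, and the variable names are my own).

```python
from collections import Counter

# Made-up actual and predicted labels for ten test countries
actual    = ["No", "No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes"]
predicted = ["No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "No"]

# Count each (actual, predicted) pair; this is the confusion matrix in dictionary form
confusion = Counter(zip(actual, predicted))

# Accuracy is the fraction of countries where prediction and actual label agree
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(confusion)
print(accuracy)
```

With few high income countries in the data, the per-class counts are worth reading alongside the single accuracy figure.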


These findings suggest that my previous work looking at the relationship between the level of democratization and economic well-being may have been confounded by other variables. The level of democratization was assumed to be a cause, while some of these explanatory variables (e.g. life expectancy, internet use rate) are more likely outcomes that are correlated with the level of income.

Source Code

As always, my project is available in its GitHub repository.

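# Import libraries needed
import pandas as pd
import numpy as np
import sklearn as sk
import sklearn.metrics  # makes sk.metrics available
import matplotlib.pyplot as plt
# train_test_split now lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier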

# Make results reproducible
np.random.seed(12345)  # assumed seed value; the original line is missing from this listing

df = pd.read_csv('gapminder.csv')

variables = ['incomeperperson', 'alcconsumption', 'co2emissions', 'femaleemployrate', 
                'internetuserate', 'lifeexpectancy','polityscore','employrate','urbanrate']
# convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')
# listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
print(subset.shape)

# =============================  Data Management  =============================
# Identify countries with a high level of income using the MAD (mean absolute deviation) method
subset['absolute_deviations'] = np.absolute(subset['incomeperperson'] - np.median(subset['incomeperperson']))
MAD = np.mean(subset['absolute_deviations'])

# This function converts the income per person absolute deviations to a high income flag
def high_income_flag(absolute_deviations):
    threshold = 3
    if (absolute_deviations/MAD) > threshold:
        return "Yes"
    else:
        return "No"

subset['High Income'] = subset['absolute_deviations'].apply(high_income_flag)
subset['High Income'] = subset['High Income'].astype('category')

# ===========================  Build Random Forest  ===========================
# Remove the first variable from the list since the target is derived from it
variables.pop(0)

predictors = subset[variables]
targets = subset['High Income']

# Split into training and testing sets
training_data, test_data, training_target, test_target  = train_test_split(predictors, targets, test_size=.4)

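# Build the random forest classifier
classifier = RandomForestClassifier(n_estimators=25)  # assumed tree count; the original value is not shown
classifier = classifier.fit(training_data, training_target)
predictions = classifier.predict(test_data)

# =========================  Evaluate Random Forest  ==========================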

print('Classification Report')
print(sk.metrics.classification_report(test_target, predictions))

print('Confusion Matrix')
print(sk.metrics.confusion_matrix(test_target, predictions))

print('Accuracy Score')
print(sk.metrics.accuracy_score(test_target, predictions))

# Fit an Extra Trees model to the data
model = ExtraTreesClassifier().fit(training_data, training_target)

# Display the relative importance of each attribute
feature_name = list(predictors.columns.values)
feature_importance = list(model.feature_importances_)
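features = pd.DataFrame({'name':feature_name, 'importance':feature_importance}).sort_values(by='importance', ascending=False)
print(features)

# ========================  Evaluate Number of Trees  =========================
n_estimators = 25  # assumed maximum number of trees; the original value is not shown
trees = range(1, n_estimators + 1)
accuracy = np.zeros(n_estimators)

# Refit the forest with 1, 2, ..., n_estimators trees and record the test accuracy
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees).fit(training_data, training_target)
    predictions = classifier.predict(test_data)
    accuracy[idx] = sk.metrics.accuracy_score(test_target, predictions)

plt.plot(trees, accuracy)
plt.show()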
