Machine Learning for Data Analysis Week 1

Introduction

Previous research using the GapMinder data set has suggested that high income per person is a function of a country's level of democratization and urbanization.  I will perform a decision tree analysis to test the nonlinear relationships between the binary categorical response variable (high income) and these two explanatory variables.

Figure 1 – High Income by Urbanization and Level of Democracy

About the Data

The data come from the GapMinder project, which collects country-level time series data on health, wealth, and development. The data set for this course contains only one year of data for 213 countries. There are 155 countries with complete data for the following variables of interest.

Measures

High Income (Response Variable)

The 2010 gross domestic product per capita was classified as high income for cases where the absolute deviation from the median, divided by the mean absolute deviation (MAD), is greater than 3.  GDP per capita is measured in constant 2000 U.S. dollars and originally came from the World Bank's World Development Indicators.
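The cutoff rule can be sketched in a few lines. The income values below are made up for illustration and are not the real GapMinder data:

```python
import numpy as np

# Hypothetical GDP-per-capita values (constant 2000 US$); not the actual data set
income = np.array([300.0, 500.0, 1200.0, 2500.0, 40000.0])

# Absolute deviation of each country from the median income
abs_dev = np.abs(income - np.median(income))

# "MAD" here is the mean of those absolute deviations, as described above
mad = np.mean(abs_dev)

# A country is flagged high income when its deviation exceeds 3 * MAD
high_income = abs_dev / mad > 3
print(high_income.tolist())  # only the outlier at $40,000 is flagged
```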

Is Full Democracy (Explanatory Variable)

The level of democracy is measured by the 2009 polity score developed by the Polity IV Project. This value ranges from -10 (autocracy) to 10 (full democracy).  The following plot shows the relationship between the response variable and the Polity IV Score.

Figure 2 – Income per Person by Level of Democracy


I collapsed these 21 categories into two categories:

  • Full Democracy (polity score = 10)
  • Not a Full Democracy (polity score = -10 to 9)

Thirty-two of the countries are full democracies and the remaining 123 are not.
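The collapse into a binary flag is a one-line recode. The scores below are hypothetical stand-ins (the real data set has 155 countries):

```python
import pandas as pd

# Hypothetical polity scores for five countries
polity = pd.Series([10, 9, -10, 10, 0])

# A country counts as a full democracy only when its polity score is exactly 10
is_full_democracy = (polity == 10).astype(int)
print(is_full_democracy.tolist())  # 1 marks a full democracy
```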

Urbanization Rate Quartile (Explanatory Variable)

The urbanization rate is measured by the share of the 2008 population living in an urban setting. The urban population is defined by national statistical offices.  This data was originally produced by the World Bank.  This variable was binned into quartiles.  The following plot shows the relationship between the response variable and the level of urbanization:

Figure 3 – Income per Person by Urbanization Rate


Decision Tree Model

The sample of 155 countries was divided into training and test sets using a 60/40 split.  The following image is the decision tree that my model generated on the training set (apologies for the blurriness; click the image for a clearer version).

Figure 4 – High Income per Person Decision Tree Model


Please keep in mind that the high income response variable is coded “No” or “Yes.”  Consequently, the value in the topmost box ([77, 16]) represents 77 countries with a “No” and 16 with a “Yes.”

The binary “is full democracy” (1 if country is a full democracy or 0 if they are not) was the first variable to separate the sample into two subgroups.  From there the level of urbanization was used to further break down the sample.  Roughly 7% of the countries that are not full democracies (Is Full Democracy <= 0.5 is true) are classified as high income countries.  These high income countries are all in the third and fourth urbanization rate quartiles.

Fifty-two percent of the “full democracy” countries are classified as high income.  Of those, 73% are in the third and fourth urbanization quartiles and the remaining 27% are in the second quartile.

Model Accuracy

The model was tested on a holdout set of 62 observations.  It classified 97% of the test set correctly, with precision and recall of 75% for the high income class and 98% for the not-high-income class.

Table 1 – Confusion Matrix

              Predicted
Actual      No    Yes
No          57      1
Yes          1      3
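The reported accuracy, precision, and recall follow directly from the counts in Table 1; a quick sanity check:

```python
# Cell counts from the confusion matrix: true/false negatives and positives
tn, fp, fn, tp = 57, 1, 1, 3

accuracy = (tn + tp) / (tn + fp + fn + tp)  # 60/62, roughly 97%
precision_yes = tp / (tp + fp)              # 3/4 = 75%
recall_yes = tp / (tp + fn)                 # 3/4 = 75%
precision_no = tn / (tn + fn)               # 57/58, roughly 98%
recall_no = tn / (tn + fp)                  # 57/58, roughly 98%

print(round(accuracy, 2), precision_yes, recall_yes, round(precision_no, 2))
```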

Source Code

As always, my project is available in its GitHub repository.

# Import libraries needed
import pandas as pd
import numpy as np
from sklearn import metrics, tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image
from io import StringIO
import pydotplus as pdp

# Make results reproducible
np.random.seed(0)

# Bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df = pd.read_csv('gapminder.csv')

# Convert to numeric format
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')
df['urbanrate'] = pd.to_numeric(df['urbanrate'], errors='coerce')

# Listwise deletion of missing values
subset = df[['incomeperperson', 'polityscore', 'urbanrate']].dropna()

# Summarize the data
print(subset[['incomeperperson', 'urbanrate']].describe())

# Identify countries with a high level of income using the MAD (mean absolute deviation) method
subset['absolute_deviations'] = np.absolute(subset['incomeperperson'] - np.median(subset['incomeperperson']))
MAD = np.mean(subset['absolute_deviations'])

# This function converts the income per person absolute deviations to a high income flag
def high_income_flag(absolute_deviations):
    threshold = 3
    if (absolute_deviations / MAD) > threshold:
        return "Yes"
    else:
        return "No"

subset['High Income'] = subset['absolute_deviations'].apply(high_income_flag)
subset['High Income'] = subset['High Income'].astype('category')

# This function converts the polity score to a binary full-democracy flag
def convert_polityscore_to_category(polityscore):
    if polityscore == 10:
        return 1
    else:
        return 0

# Now we can use the function to create the new variable
subset['Is Full Democracy'] = subset['polityscore'].apply(convert_polityscore_to_category)
subset['Is Full Democracy'] = subset['Is Full Democracy'].astype('category')

# Bin urban rate into quartiles
subset['Urban Rate Quartile'] = pd.qcut(subset['urbanrate'], 4, labels=False)

# Split into training and testing sets
predictors = subset[['Is Full Democracy', 'Urban Rate Quartile']]
targets = subset['High Income']
training_data, test_data, training_target, test_target = train_test_split(predictors, targets, test_size=.4)

# Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(training_data, training_target)

# Check how well the classifier worked
predictions = classifier.predict(test_data)
print(metrics.confusion_matrix(test_target, predictions))
print(metrics.accuracy_score(test_target, predictions))
print(metrics.classification_report(test_target, predictions))

# Display the decision tree (export_graphviz writes text, so use StringIO)
out = StringIO()
tree.export_graphviz(classifier, out_file=out, feature_names=predictors.columns)
graph = pdp.graph_from_dot_data(out.getvalue())
Image(graph.create_png())