Previous research using the GapMinder data set has suggested that high income per person is a function of a country's level of democratization and urbanization. I will perform a decision tree analysis to test the nonlinear relationships between the binary categorical response variable (high income) and these two explanatory variables.
About the Data
The data is from the GapMinder project, which collects country-level time series data on health, wealth, and development. The data set for this class contains only one year of data for 213 countries; 155 of those countries have complete data for the following variables of interest.
High Income (Response Variable)
The 2010 Gross Domestic Product per capita was classified as high income for cases where the absolute deviation from the median, divided by the mean absolute deviation, is greater than 3. GDP per capita is measured in constant 2000 U.S. dollars and originally comes from the World Bank's World Development Indicators.
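The flagging rule described above can be sketched as follows. This is a minimal illustration using made-up income values, not the actual GapMinder data:

```python
import numpy as np

# Toy GDP-per-capita values (illustrative only, not the GapMinder data)
income = np.array([500.0, 1200.0, 2500.0, 4000.0, 60000.0])

# Absolute deviation from the median, scaled by the mean absolute deviation
abs_dev = np.abs(income - np.median(income))
mad = np.mean(abs_dev)

# Flag observations whose scaled deviation exceeds 3 as "high income"
high_income = (abs_dev / mad) > 3
```

Only the extreme outlier (60000.0) clears the threshold here, which is the intent of the rule: "high income" marks countries far above the typical spread.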
Is Full Democracy (Explanatory Variable)
The level of democracy is measured by the 2009 polity score developed by the Polity IV Project. This value ranges from -10 (autocracy) to 10 (full democracy). The following plot shows the relationship between the response variable and the Polity IV Score.
Figure 2 – Income per Person by Level of Democracy
I collapsed these 21 categories into two categories:
- Full Democracy (polity score = 10)
- Not a Full Democracy (polity score = -10 to 9)
Thirty-two of the countries are full democracies and the remaining 123 are not.
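Collapsing the polity score into this binary flag is a one-line comparison. The scores below are illustrative, not the actual GapMinder values:

```python
import pandas as pd

# Illustrative polity scores in the -10 to 10 range
scores = pd.Series([10, 9, -10, 10, 0])

# 1 = full democracy (polity score of exactly 10), 0 = everything else
is_full_democracy = (scores == 10).astype(int)
```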
Urbanization Rate Quartile (Explanatory Variable)
The urbanization rate is measured by the share of the 2008 population living in an urban setting. The urban population is defined by national statistical offices. This data was originally produced by the World Bank. This variable was binned into quartiles. The following plot shows the relationship between the response variable and the level of urbanization:
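Quartile binning of this kind can be done with pandas' `qcut`, which is what the code appendix uses. A minimal sketch with made-up urbanization rates:

```python
import pandas as pd

# Illustrative urbanization rates (percent of population; made-up values)
urban = pd.Series([10.0, 25.0, 40.0, 55.0, 70.0, 85.0, 92.0, 99.0])

# qcut assigns each value to an equal-frequency bin; labels=False yields codes 0-3
quartile = pd.qcut(urban, 4, labels=False)
```

With eight sorted values, each quartile receives exactly two observations.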
Figure 3 – Income per Person by Urbanization Rate
Decision Tree Model
The sample of 155 countries was divided into training and test sets using a 60/40 split. The following image is the decision tree that my model generated on the training set (apologies for the blurriness; click on the image for a clearer version).
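The 60/40 split can be sketched with scikit-learn's `train_test_split`. The tiny data frame below stands in for the 155-country subset; its values are illustrative, and `random_state=0` is an assumption added here to pin the shuffle:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the 155-country subset
data = pd.DataFrame({
    'Is Full Democracy':   [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    'Urban Rate Quartile': [3, 1, 0, 2, 3, 3, 1, 2, 0, 2],
    'High Income':         ['Yes', 'No', 'No', 'No', 'No',
                            'Yes', 'No', 'No', 'No', 'No'],
})

predictors = data[['Is Full Democracy', 'Urban Rate Quartile']]
target = data['High Income']

# 60/40 split: test_size=0.4 reserves 40% of rows for evaluation
train_X, test_X, train_y, test_y = train_test_split(
    predictors, target, test_size=0.4, random_state=0)
```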
Figure 4 – High Income per Person Decision Tree Model
Please keep in mind that the high income response variable is coded "No" or "Yes." Consequently, the value in the topmost box ([77, 16]) represents 77 countries with a "No" and 16 with a "Yes."
The binary "is full democracy" variable (1 if a country is a full democracy, 0 if it is not) was the first variable to split the sample into two subgroups. From there, the urbanization rate quartile was used to further subdivide the sample. Roughly 7% of the countries that are not full democracies (Is Full Democracy <= 0.5 is true) are classified as high income countries, and all of them fall in the third and fourth urbanization rate quartiles.
Fifty-two percent of the "full democracy" countries are classified as high income countries. Of these, 73% are in the third and fourth urbanization quartiles and the remaining 27% are in the second quartile.
This model was tested on a hold-out set of 62 observations. It classified 97% of the test set correctly, with a precision and recall of 75% for the high income class and 98% for the not-high-income class.
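The accuracy, precision, and recall figures above all fall out of the confusion matrix by simple arithmetic. The counts below are illustrative placeholders in the same row/column layout scikit-learn uses (rows = actual, columns = predicted, classes sorted "No" then "Yes"), not the actual test-set results:

```python
import numpy as np

# Illustrative confusion matrix: rows = actual (No, Yes), cols = predicted (No, Yes)
cm = np.array([[53, 1],
               [1, 7]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / total
precision_yes = cm[1, 1] / cm[:, 1].sum()   # of predicted "Yes", fraction correct
recall_yes = cm[1, 1] / cm[1, :].sum()      # of actual "Yes", fraction found
```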
Table 1 – Confusion Matrix
As always, my project is available in its GitHub repository.
# Import libraries needed
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics, tree
from IPython.display import Image
from io import BytesIO
import pydotplus as pdp

# Make results reproducible
np.random.seed(0)

# Bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df = pd.read_csv('gapminder.csv')

# Convert to numeric format
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')
df['urbanrate'] = pd.to_numeric(df['urbanrate'], errors='coerce')

# Listwise deletion of missing values
subset = df[['incomeperperson', 'polityscore', 'urbanrate']].dropna()

# Summarize the data
print(subset[['incomeperperson', 'urbanrate']].describe())

# Identify countries with a high level of income using the MAD
# (mean absolute deviation) method
subset['absolute_deviations'] = np.absolute(
    subset['incomeperperson'] - np.median(subset['incomeperperson']))
MAD = np.mean(subset['absolute_deviations'])

# This function converts the income per person absolute deviations
# to a high income flag
def high_income_flag(absolute_deviations):
    threshold = 3
    if (absolute_deviations / MAD) > threshold:
        return "Yes"
    else:
        return "No"

subset['High Income'] = subset['absolute_deviations'].apply(high_income_flag)
subset['High Income'] = subset['High Income'].astype('category')

# This function converts the polity score to a binary category
def convert_polityscore_to_category(polityscore):
    if polityscore == 10:
        return 1
    else:
        return 0

# Now we can use the function to create the new variable
subset['Is Full Democracy'] = subset['polityscore'].apply(convert_polityscore_to_category)
subset['Is Full Democracy'] = subset['Is Full Democracy'].astype('category')

# Bin urban rate into quartiles
subset['Urban Rate Quartile'] = pd.qcut(subset['urbanrate'], 4, labels=False)

# Split into training and testing sets
predictors = subset[['Is Full Democracy', 'Urban Rate Quartile']]
targets = subset['High Income']
training_data, test_data, training_target, test_target = train_test_split(
    predictors, targets, test_size=.4)

# Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(training_data, training_target)

# Check how well the classifier worked
predictions = classifier.predict(test_data)
print(metrics.confusion_matrix(test_target, predictions))
print(metrics.accuracy_score(test_target, predictions))
print(metrics.classification_report(test_target, predictions))

# Display the decision tree
out = BytesIO()
tree.export_graphviz(classifier, out_file=out, feature_names=predictors.columns)
graph = pdp.graph_from_dot_data(out.getvalue())
Image(graph.create_png())