K-Means Clustering (Week 4 Assignment)

Introduction

A k-means clustering analysis was conducted to divide countries into subgroups based on variables determined to have an impact on economic well-being. The data for the analysis is an extract from the GapMinder project, which collects country-level time series data on health, wealth and development. The data set for this analysis has only one year of data for 213 countries.

Potential Clustering Variables

A random forest analysis was performed on a training set (N=111) to evaluate a series of explanatory variables in predicting GDP per person binned into thirds (a three-level categorical variable). The following explanatory variables were evaluated:

  • Alcohol Consumption – 2008 recorded and estimated average alcohol consumption, adult (15+) per capita as collected by the World Health Organization
  • CO2 Emissions – Total amount of CO2 emissions in metric tons from 1751 to 2006 as collected by CDIAC
  • Female Employment Rate – Percentage of female population, age above 15, that has been
    employed during 2007 as collected by the International Labour Organization
  • Internet Use Rate – 2010 Internet users per 100 people as collected by the World Bank
  • Life Expectancy – 2011 life expectancy at birth (in years) as collected by various sources
  • Polity Score – 2009 Democracy score as collected by the Polity IV Project
  • Employment Rate – Percentage of total population, age above 15, that has been employed during 2009 as collected by the International Labour Organization
  • Urbanization Rate – 2008 Urban population (% total population) as collected by the World Bank

The following table summarizes each variable’s importance in classifying countries into high, middle, and low GDP-per-person groups:

Table 1. Variable Importance

Variable                  Importance
Life Expectancy               35%
Internet Use Rate             21%
Urbanization Rate             11%
Female Employment Rate         8%
CO2 Emissions                  8%
Employment Rate                6%
Alcohol Consumption            6%
Polity Score                   5%

The explanatory variables with relative importance scores greater than or equal to 10% (life expectancy, internet use rate, and urbanization rate) were selected as clustering variables.

K-Means Clustering

All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. Data were randomly split into a training set that included 70% of the observations (N=121) and a test set that included 30% of the observations (N=52). A series of k-means cluster analyses were conducted on the training data specifying k=1 to 9 clusters, using Euclidean distance. For each solution, the mean Euclidean distance from each observation to its nearest cluster centroid was plotted in an elbow curve to provide guidance for selecting the number of clusters. The following figure illustrates the elbow curve:

Figure 1. Selecting K with the Elbow Curve Method


Possible values of k include 2, 3, 4, and 7. Each value was evaluated to see if there was a significant difference in economic well-being (GDP per person) between the resulting clusters. Breaking the data into 2 clusters proved to be a good choice, although 7 was also reasonable. A preference for the simplest adequate solution led to selecting k=2.

Findings

Cluster zero has a much lower level of economic well-being (average GDP per person) than cluster one.

Table 2. GDP Per Person by Cluster

Cluster    Average GDP per Person
Zero                       $1,373
One                       $15,684

The following figure shows the clusters (zero in blue, one in green) and the clustering variables:

Figure 2. Clustering Variables Scatterplots


Discussion

This analysis was done as an exercise in k-means clustering. The GapMinder data do not lend themselves well to clustering; the scatter plots above illustrate the lack of distinct clusters. Caution should be exercised in using or interpreting these results.

Source Code

As always, this project can be found in this GitHub repository.

# Import libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
from sklearn import preprocessing
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
sns.set_style('whitegrid')
sns.set_context('talk')

# Eliminate false positive SettingWithCopyWarning
pd.options.mode.chained_assignment = None

# Make results reproducible
np.random.seed(1234567890)

df = pd.read_csv('gapminder.csv')

variables = ['incomeperperson', 'alcconsumption', 'co2emissions', 'femaleemployrate', 
                'internetuserate', 'lifeexpectancy','polityscore','employrate','urbanrate'] 

# convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')

# listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
print(subset.shape)
print("\n")
"""
"===============  Random Forest to Select Clustering Variables  ===============
"""
n_estimators=25

# Bin income per person into three equal-width groups (pd.cut bins by value range; pd.qcut would bin by percentile)
subset['incomequartiles'] = pd.cut(subset['incomeperperson'], 3, labels=['0%-33%','34%-66%','67%-100%'])
subset['incomequartiles'] = subset['incomequartiles'].astype('category')

# Remove incomeperperson from the predictor list since the target is derived from it
variables.pop(0)

predictors = subset[variables]
targets = subset['incomequartiles']

# Split into training and testing sets
training_data, test_data, training_target, test_target = train_test_split(predictors, targets, test_size=.25)

# Build the random forest classifier
classifier=RandomForestClassifier(n_estimators=n_estimators)
classifier=classifier.fit(training_data,training_target)

predictions=classifier.predict(test_data)

# Fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(training_data,training_target)

# Display the relative importance of each attribute
feature_name = list(predictors.columns.values)
feature_importance = list(model.feature_importances_)
features = pd.DataFrame({'name':feature_name, 'importance':feature_importance}).sort_values(by='importance', ascending=False)
print(features.head(len(feature_name)))

"""
" =============================  Data Management  =============================
"""
variables = ['incomeperperson', 'lifeexpectancy', 'internetuserate', 'urbanrate'] 

# convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')

# listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
print(subset.shape)

subset['incomequartiles'] = pd.cut(subset['incomeperperson'], 3, labels=['0%-33%','34%-66%','67%-100%'])
subset['incomequartiles'] = subset['incomequartiles'].astype('category')

# Remove the first variable from the list since the target is derived from it
variables.pop(0)

# Center and scale data
for variable in variables:
    subset[variable]=preprocessing.scale(subset[variable].astype('float64'))
    
features = subset[variables]
targets = subset[['incomeperperson']]

"""
" ==================  Split Data into Training and Test Sets  ==================
"""
# Split into training and testing sets
training_data, test_data, training_target, test_target  = train_test_split(features, targets, test_size=.3)
print('Size of training data')
print(training_data.shape)

"""
" =====================  Determine the Number of Clusters  ====================
"""
# Identify number of clusters using the elbow method
clusters=range(1,10)
meandist=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(training_data)
    clusassign=model.predict(training_data)
    meandist.append(sum(np.min(cdist(training_data, model.cluster_centers_, 'euclidean'), axis=1)) / training_data.shape[0])

# Visualize the elbow and mark the chosen solution
k = 2

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(clusters, meandist)
ax.plot(clusters[(k-1)], meandist[(k-1)], marker='o', markersize=12, 
    markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of Clusters')
plt.ylabel('Average Distance')
plt.title('Selecting K with the Elbow Method')
plt.show()

"""
" ==========================  Visualize the Clusters  =========================
"""
model=KMeans(n_clusters=k)
model.fit(training_data)
training_data['cluster'] = model.labels_

my_cmap = plt.get_cmap('brg').copy()  # copy so the registered colormap is not modified
my_cmap.set_under('w')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(training_data.iloc[:,0], training_data.iloc[:,1], training_data.iloc[:,2], c=training_data['cluster'], cmap=my_cmap)
ax.set_xlabel(training_data.columns.values[0])
ax.set_ylabel(training_data.columns.values[1])
ax.set_zlabel(training_data.columns.values[2])
plt.show()
 
sns.pairplot(training_data, hue='cluster')

"""
" ====================  Examine Differences Between Clusters  =================
"""

training_target['cluster'] = model.labels_
income_model = smf.ols(formula='incomeperperson ~ C(cluster)', data=training_target).fit()
print (income_model.summary())

print ('means for features by cluster')
m1= training_target.groupby('cluster').mean()
print (m1)

print ('standard deviations for features by cluster')
m2= training_target.groupby('cluster').std()
print (m2)

mc1 = multi.MultiComparison(training_target['incomeperperson'], training_target['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())

Lasso Regression (Week 3 Assignment)

Introduction

A lasso regression analysis was conducted to identify a subset of variables from a pool of 8 quantitative predictor variables that best predicted a binary response variable measuring the presence of high per capita income. The data for the analysis is an extract from the GapMinder project, which collects country-level time series data on health, wealth and development. The data set for this analysis has only one year of data for 213 countries.

High per Capita Income (Response Variable)

The 2010 Gross Domestic Product per capita was classified as high income for cases where the absolute deviation from the median income, divided by the mean absolute deviation, is greater than 3.  GDP per capita is measured in constant 2000 U.S. dollars and originally came from the World Bank’s World Development Indicators.

Explanatory Variables

All explanatory variables were standardized to have a mean of zero and a standard deviation of one. The following explanatory variables were included in the data set:

  • Alcohol Consumption – 2008 recorded and estimated average alcohol consumption, adult (15+) per capita as collected by the World Health Organization
  • CO2 Emissions – Total amount of CO2 emissions in metric tons from 1751 to 2006 as collected by CDIAC
  • Female Employment Rate – Percentage of female population, age above 15, that has been
    employed during 2007 as collected by the International Labour Organization
  • Internet Use Rate – 2010 Internet users per 100 people as collected by the World Bank
  • Life Expectancy – 2011 life expectancy at birth (in years) as collected by various sources
  • Polity Score – 2009 Democracy score as collected by the Polity IV Project
  • Employment Rate – Percentage of total population, age above 15, that has been employed during 2009 as collected by the International Labour Organization
  • Urbanization Rate – 2008 Urban population (% total population) as collected by the World Bank

Methodology

Data were randomly split into a training set that included 70% of the observations (N=148) and a test set that included 30% of the observations (N=45). The least angle regression (LARS) algorithm with 10-fold cross-validation was used to estimate the lasso regression model on the training set, and the model was validated using the test set. The change in the cross-validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Figure 1. Mean squared error on each fold


Findings

Of the 8 predictor variables, 3 were retained in the selected model. Internet use rate was most strongly associated with high income; CO2 emissions and employment rate were also positively associated with it. These 3 variables accounted for 44.2% of the variance in the high per capita income response variable in the test data set.

Table 1. Regression Coefficients

Variable                  Regression Coefficient
Internet Use Rate                       0.157313
CO2 Emissions                           0.019681
Employment Rate                         0.001853
Alcohol Consumption                            0
Female Employment Rate                         0
Life Expectancy                                0
Polity Score                                   0
Urbanization Rate                              0

Discussion

The previous week I examined the same variables using a random forest. The explanatory variables with the highest relative importance scores were life expectancy, internet use rate, and urbanization rate; of those, the lasso regression retained only the internet use rate. As noted in last week’s post, these findings suggest that my previous work looking at the relationship between the level of democratization and economic well-being may have been confounded by other variables.

Source Code

# Import libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing

# Make results reproducible
np.random.seed(1234567890)

df = pd.read_csv('gapminder.csv')

variables = ['incomeperperson', 'alcconsumption', 'co2emissions', 'femaleemployrate',
                'internetuserate', 'lifeexpectancy','polityscore','employrate','urbanrate']

# convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')

# listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
print(subset.shape)
print("\n")
"""
" =============================  Data Management  =============================
"""
# Remove the first variable from the list since the target is derived from it
variables.pop(0)

# Center and scale data
for variable in variables:
    subset[variable]=preprocessing.scale(subset[variable].astype('float64'))
    
# Identify countries with a high level of income using the MAD (mean absolute deviation) method
subset['absolute_deviations'] = np.absolute(subset['incomeperperson'] - np.median(subset['incomeperperson']))
MAD = np.mean(subset['absolute_deviations'])

# This function converts the income per person absolute deviations to a high income flag
def high_income_flag(absolute_deviations):
    threshold = 3
    if (absolute_deviations/MAD) > threshold:
        return 1
    else:
        return 0

subset['High Income'] = subset['absolute_deviations'].apply(high_income_flag)

"""
" ==========================  Build LASSO Regression  ==========================
"""
predictors = subset[variables]
targets = subset['High Income']

#Split into training and testing sets
training_data, test_data, training_target, test_target  = train_test_split(predictors, targets, test_size=.3)

# Build the LASSO regression model
model=LassoLarsCV(cv=10, precompute=False).fit(training_data, training_target)

"""
" ==========================  Evaluate LASSO Model  ============================
"""
# print variable names and regression coefficients
feature_name = list(predictors.columns.values)
feature_coefficient = list(model.coef_)
features = pd.DataFrame({'Variable':feature_name, 'Regression Coefficients':feature_coefficient}).sort_values(by='Regression Coefficients', ascending=False)
print(features.head(len(feature_name)))

#print(dict(zip(predictors.columns, model.coef_)))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
# mse_path_ was named cv_mse_path_ in older scikit-learn releases
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')


# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(training_target, model.predict(training_data))
test_error = mean_squared_error(test_target, model.predict(test_data))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train=model.score(training_data, training_target)
rsquared_test=model.score(test_data, test_target)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)

Running a Random Forest (Week 2 Assignment)

Introduction

A random forest analysis was performed to evaluate a series of explanatory variables in predicting a binary categorical variable. The data for the analysis is an extract from the GapMinder project, which collects country-level time series data on health, wealth and development. The data set for this analysis has only one year of data for 213 countries.

High Income (Response Variable)

The 2010 Gross Domestic Product per capita was classified as high income for cases where the absolute deviation from the median income, divided by the mean absolute deviation, is greater than 3.  GDP per capita is measured in constant 2000 U.S. dollars and originally came from the World Bank’s World Development Indicators.

Explanatory Variables

The following explanatory variables were evaluated:

  • Alcohol Consumption – 2008 recorded and estimated average alcohol consumption, adult (15+) per capita as collected by the World Health Organization
  • CO2 Emissions – Total amount of CO2 emissions in metric tons from 1751 to 2006 as collected by CDIAC
  • Female Employment Rate – Percentage of female population, age above 15, that has been
    employed during 2007 as collected by the International Labour Organization
  • Internet Use Rate – 2010 Internet users per 100 people as collected by the World Bank
  • Life Expectancy – 2011 life expectancy at birth (in years) as collected by various sources
  • Polity Score – 2009 Democracy score as collected by the Polity IV Project
  • Employment Rate – Percentage of total population, age above 15, that has been employed during 2009 as collected by the International Labour Organization
  • Urbanization Rate – 2008 Urban population (% total population) as collected by the World Bank

Findings

The explanatory variables with the highest relative importance scores were life expectancy, internet use rate, and urbanization rate.

Table 1 – Variable Importance

Variable                  Importance
Life Expectancy               40%
Internet Use Rate             21%
Urbanization Rate             10%
CO2 Emissions                  7%
Female Employment Rate         7%
Alcohol Consumption            7%
Employment Rate                5%
Polity Score                   4%

The accuracy of the random forest was 97%, with the growing of trees beyond 3 adding little to the overall accuracy of the model.

Discussion

These findings suggest that my previous work looking at the relationship between the level of democratization and economic well-being may have been confounded by other variables. The level of democratization was assumed to be a cause, while some of these explanatory variables (e.g. life expectancy and internet use rate) are more plausibly outcomes correlated with the level of income.

Source Code

As always, my project is available in its GitHub repository.

# Import libraries needed
import pandas as pd
import numpy as np
import sklearn as sk
import sklearn.metrics  # make sk.metrics available; importing sklearn alone does not load submodules
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

# Make results reproducible
np.random.seed(1234567890)

df = pd.read_csv('gapminder.csv')

variables = ['incomeperperson', 'alcconsumption', 'co2emissions', 'femaleemployrate', 
                'internetuserate', 'lifeexpectancy','polityscore','employrate','urbanrate']
                
# convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')
    
# listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
print(subset.shape)

"""
" =============================  Data Management  =============================
"""
# Identify countries with a high level of income using the MAD (mean absolute deviation) method
subset['absolute_deviations'] = np.absolute(subset['incomeperperson'] - np.median(subset['incomeperperson']))
MAD = np.mean(subset['absolute_deviations'])

# This function converts the income per person absolute deviations to a high income flag
def high_income_flag(absolute_deviations):
    threshold = 3
    if (absolute_deviations/MAD) > threshold:
        return "Yes"
    else:
        return "No"

subset['High Income'] = subset['absolute_deviations'].apply(high_income_flag)
subset['High Income'] = subset['High Income'].astype('category')

"""
" ===========================  Build Random Forest  ===========================
"""
# Remove the first variable from the list since the target is derived from it
variables.pop(0)

predictors = subset[variables]
targets = subset['High Income']

# Split into training and testing sets
training_data, test_data, training_target, test_target  = train_test_split(predictors, targets, test_size=.4)

# Build the random forest classifier
n_estimators = 25  # defined here so the tree-count evaluation below can reuse it
classifier=RandomForestClassifier(n_estimators=n_estimators)
classifier=classifier.fit(training_data,training_target)

"""
" =========================  Evaluate Random Forest  ==========================
"""
predictions=classifier.predict(test_data)

print('Classification Report')
print(sk.metrics.classification_report(test_target, predictions))

print('Confusion Matrix')
print(sk.metrics.confusion_matrix(test_target, predictions))

print('Accuracy Score')
print(sk.metrics.accuracy_score(test_target, predictions))

# Fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(training_data,training_target)

# Display the relative importance of each attribute
feature_name = list(predictors.columns.values)
feature_importance = list(model.feature_importances_)
features = pd.DataFrame({'name':feature_name, 'importance':feature_importance}).sort_values(by='importance', ascending=False)
print(features.head(len(feature_name)))

"""
" ========================  Evaluate Number of Trees  =========================
"""
trees = range(n_estimators)
accuracy = np.zeros(n_estimators)

for idx in range(len(trees)):
    classifier=RandomForestClassifier(n_estimators=idx+1)
    classifier=classifier.fit(training_data,training_target)
    predictions=classifier.predict(test_data)
    accuracy[idx] = sk.metrics.accuracy_score(test_target, predictions)
    
plt.cla()
plt.plot(trees, accuracy)

Machine Learning for Data Analysis Week 1

Introduction

Previous research using the GapMinder data set has suggested that a high income per person is a function of the level of democratization and urbanization in a country.  I will perform a decision tree analysis to test for nonlinear relationships between the binary categorical response variable (high income) and these two explanatory variables.

Figure 1 – High Income by Urbanization and Level of Democracy

About the Data

The data is from the GapMinder project. The GapMinder project collects country-level time series data on health, wealth and development. The data set for this class only has one year of data for 213 countries. There are 155 countries with complete data for the following variables of interest.

Measures

High Income (Response Variable)

The 2010 Gross Domestic Product per capita was classified as high income for cases where the absolute deviation from the median income, divided by the mean absolute deviation, is greater than 3.  GDP per capita is measured in constant 2000 U.S. dollars and originally came from the World Bank’s World Development Indicators.

Is Full Democracy (Explanatory Variable)

The level of democracy is measured by the 2009 polity score developed by the Polity IV Project. This value ranges from -10 (autocracy) to 10 (full democracy).  The following plot shows the relationship between the response variable and the Polity IV Score.

Figure 2 – Income per Person by Level of Democracy


I collapsed these 21 categories into two categories:

  • Full Democracy (polity score = 10)
  • Not a Full Democracy (polity score = -10 to 9)

Thirty-two of the countries are full democracies and the remaining 123 are not.

Urbanization Rate Quartile (Explanatory Variable)

The urbanization rate is measured by the share of the 2008 population living in an urban setting. The urban population is defined by national statistical offices.  This data was originally produced by the World Bank.  This variable was binned into quartiles.  The following plot shows the relationship between the response variable and the level of urbanization:

Figure 3 – Income per Person by Urbanization Rate


Decision Tree Model

The sample of 155 countries was divided into training and test sets using a 60/40 split.  The following image is the decision tree generated from the training set (apologies for the blurriness; click on the image for a clearer version).

Figure 4 – High Income per Person Decision Tree Model


Please keep in mind that the high income response variable is coded “No” or “Yes.”  Consequently, the value in the topmost box ([77, 16]) represents 77 countries with a “No” and 16 with a “Yes” among the 93 training observations (60% of 155).

The binary “is full democracy” variable (1 if a country is a full democracy, 0 if it is not) was the first to split the sample into two subgroups.  From there, the level of urbanization further broke down the sample.  Roughly 7% of the countries that are not full democracies (Is Full Democracy <= 0.5 is true) are classified as high income countries; these are all in the third and fourth urbanization rate quartiles.

Fifty-two percent of the “full democracy” countries are classified as high income.  Of those, 73% are in the third and fourth urbanization quartiles and the remaining 27% are in the second quartile.

Model Accuracy

This model was tested on a set of 62 observations.  It classified 97% of the test set correctly, with a precision and recall of 75% for the high income class and 98% for the not high income class.

Table 1 – Confusion Matrix

                Predicted
Actual         No    Yes
No             57      1
Yes             1      3

Source Code

As always, my project is available in its GitHub repository.

# Import libraries needed
import pandas as pd
import numpy as np
import sklearn as sk
import sklearn.metrics  # make sk.metrics available; importing sklearn alone does not load submodules
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image
from io import BytesIO
import pydotplus as pdp

# Make results reproducible
np.random.seed(0)

# bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x:'%.2f'%x)

df = pd.read_csv('gapminder.csv')

# convert to numeric format
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')
df['urbanrate'] = pd.to_numeric(df['urbanrate'], errors='coerce')

# listwise deletion of missing values
subset = df[['incomeperperson', 'polityscore', 'urbanrate']].dropna()

# Summarize the data
print(subset[['incomeperperson', 'urbanrate']].describe())

# Identify countries with a high level of income using the MAD (mean absolute deviation) method
subset['absolute_deviations'] = np.absolute(subset['incomeperperson'] - np.median(subset['incomeperperson']))
MAD = np.mean(subset['absolute_deviations'])

# This function converts the income per person absolute deviations to a high income flag
def high_income_flag(absolute_deviations):
    threshold = 3
    if (absolute_deviations/MAD) > threshold:
        return "Yes"
    else:
        return "No"

subset['High Income'] = subset['absolute_deviations'].apply(high_income_flag)
subset['High Income'] = subset['High Income'].astype('category')

# This function converts the polity score to a category
def convert_polityscore_to_category(polityscore):
    if polityscore == 10:
        return 1
    else:
        return 0

# Now we can use the function to create the new variable
subset['Is Full Democracy'] = subset['polityscore'].apply(convert_polityscore_to_category)
subset['Is Full Democracy'] = subset['Is Full Democracy'].astype('category')

# Bin urban rate into quartiles
subset['Urban Rate Quartile'] = pd.qcut(subset['urbanrate'], 4, labels=False)

#Split into training and testing sets
predictors = subset[['Is Full Democracy','Urban Rate Quartile']]
targets = subset[['High Income']]
training_data, test_data, training_target, test_target  = train_test_split(predictors, targets, test_size=.4)

#Build model on training data
classifier=DecisionTreeClassifier()
classifier=classifier.fit(training_data, training_target)

# Check how well the classifier worked
predictions=classifier.predict(test_data)
print(sk.metrics.confusion_matrix(test_target,predictions))

print(sk.metrics.accuracy_score(test_target, predictions))

print(sk.metrics.classification_report(test_target, predictions))

#Displaying the decision tree
out = BytesIO()
sk.tree.export_graphviz(classifier, out_file=out, feature_names=predictors.columns)
graph=pdp.graph_from_dot_data(out.getvalue())
Image(graph.create_png())

Regression Modeling in Practice Week 4 Assignment

Introduction

In this week’s class we are using logistic regression. I have been examining the relationship between economic well-being and the level of democracy or openness of a society.  My hypothesis is that the most open countries will be more likely to have a high level of economic well-being. My previous work has established a statistically significant positive relationship between the level of urbanization and economic well-being, so we will also examine the influence of this confounding variable.

About the Data

The data is from the GapMinder project. The GapMinder project collects country-level time series data on health, wealth and development. The data set for this class only has one year of data for 213 countries. There are 155 countries with complete data for the following variables of interest.

Measures

Economic Well-Being (Response Variable)

Income per person (economic well-being) is the 2010 Gross Domestic Product per capita, measured in constant 2000 U.S. dollars. This allows for comparison across countries with different costs of living. This data originally came from the World Bank’s World Development Indicators.

Since we are doing a logistic regression, I have binned this variable into countries with higher than average income (n=39) and those with lower than average incomes (n=116). The average per capita income of the study group is $6,605.

Level of Democracy (Explanatory Variable)

The level of democracy is measured by the 2009 polity score developed by the Polity IV Project. This value ranges from -10 (autocracy) to 10 (full democracy). I binned these 21 categories into two:

  • Full Democracy (polity score = 10)
  • Not a Full Democracy (polity score = -10 to 9)

Thirty-two of the countries are full democracies and the remaining 123 are not.

Urbanization Rate (Possible Confounder)

The urbanization rate is measured by the share of the 2008 population living in an urban setting. This data was originally produced by the World Bank; the urban population is defined by national statistical offices. This variable was binned into higher than average (55%+) and lower than average urbanization rates: 84 observations had a higher than average urbanization rate and 71 did not.

Results

The odds of having a higher than average income are almost 28 times higher for countries classified as full democracies (OR=27.81, 95% CI=10.17 to 76.04, p<0.001).

However, after adjusting for the urbanization rate, the odds ratio for full democracies dropped to about 17 (OR=17.09, 95% CI=5.88 to 49.67, p<0.001). Urbanization was also significantly associated with high income: countries with a higher than average urbanization rate were significantly more likely to have higher than average per capita incomes (OR=9.31, 95% CI=2.47 to 35.16, p=0.001).

Source Code

UPDATE: Jupyter notebook available on GitHub.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x:'%.2f'%x)

df = pd.read_csv('gapminder.csv')

# convert to numeric format
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')
df['urbanrate'] = pd.to_numeric(df['urbanrate'], errors='coerce')

# listwise deletion of missing values
subset = df[['incomeperperson', 'polityscore', 'urbanrate']].dropna()

# This function converts the polity score to a category
def convert_polityscore_to_category(polityscore):
    if polityscore == 10:
        return 1
    else:
        return 0

# Now we can use the function to create the new variable
subset['full_democracy'] = subset['polityscore'].apply(convert_polityscore_to_category)

counts = subset.groupby('full_democracy').size()
print(counts)

# Create a threshold
income_threshold = np.mean(subset['incomeperperson'])
print(income_threshold)

# Set binary flag that income per person is greater than the threshold
def income_higher_than_threshold(income):
    if income > income_threshold:
        return 1
    else:
        return 0

subset['high_income'] = subset['incomeperperson'].apply(income_higher_than_threshold)

counts = subset.groupby('high_income').size()
print(counts)

# Create a threshold
urbanization_threshold = np.mean(subset['urbanrate'])
print(urbanization_threshold)

# Set binary flag that urbanization rate is greater than the threshold
def urbanrate_higher_than_threshold(urbanrate):
    if urbanrate > urbanization_threshold:
        return 1
    else:
        return 0

subset['high_urbanrate'] = subset['urbanrate'].apply(urbanrate_higher_than_threshold)

counts = subset.groupby('high_urbanrate').size()
print(counts)

# logistic regression with society type
lreg1 = smf.logit(formula = 'high_income ~ full_democracy', data = subset).fit()
print (lreg1.summary())

# odds ratios with 95% confidence intervals
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (np.exp(conf))

# logistic regression with society type and urbanization rate
lreg2 = smf.logit(formula = 'high_income ~ full_democracy + high_urbanrate', data = subset).fit()
print (lreg2.summary())

# odds ratios with 95% confidence intervals
params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (np.exp(conf))

Model Output

Logistic Regression Model 1 – Full Democracy

Optimization terminated successfully.
         Current function value: 0.389711
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:            high_income   No. Observations:                  155
Model:                          Logit   Df Residuals:                      153
Method:                           MLE   Df Model:                            1
Date:                Sat, 19 Dec 2015   Pseudo R-squ.:                  0.3091
Time:                        12:23:56   Log-Likelihood:                -60.405
converged:                       True   LL-Null:                       -87.436
                                        LLR p-value:                 1.944e-13
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept         -2.0523      0.284     -7.229      0.000        -2.609    -1.496
full_democracy     3.3253      0.513      6.478      0.000         2.319     4.331
==================================================================================
                Lower CI  Upper CI    OR
Intercept           0.07      0.22  0.13
full_democracy     10.17     76.04 27.81

Logistic Regression Model 2 – Full Democracy + Urbanization Rate

Optimization terminated successfully.
         Current function value: 0.342689
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:            high_income   No. Observations:                  155
Model:                          Logit   Df Residuals:                      152
Method:                           MLE   Df Model:                            2
Date:                Sat, 19 Dec 2015   Pseudo R-squ.:                  0.3925
Time:                        12:23:56   Log-Likelihood:                -53.117
converged:                       True   LL-Null:                       -87.436
                                        LLR p-value:                 1.246e-15
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept         -3.5056      0.636     -5.509      0.000        -4.753    -2.258
full_democracy     2.8388      0.544      5.216      0.000         1.772     3.905
high_urbanrate     2.2312      0.678      3.291      0.001         0.903     3.560
==================================================================================
                Lower CI  Upper CI    OR
Intercept           0.01      0.10  0.03
full_democracy      5.88     49.67 17.09
high_urbanrate      2.47     35.16  9.31

Regression Modeling in Practice Week 3 Assignment

Introduction

For week three of Regression Modeling in Practice I have continued to examine the relationship between democratic openness and economic well-being.  My hypothesis is that as a country becomes more democratic, its economic well-being increases.

From previous work I have observed a statistically significant positive relationship between the level of urbanization and economic well-being.  This week I will examine whether the relationship between democratic openness and economic well-being still holds after including the level of urbanization.

About The Data

The data is from the GapMinder project. The GapMinder project collects country-level time series data on health, wealth and development.  The data set for this class only has one year of data for 213 countries.  There are 155 countries with complete data for the following variables of interest:

Measures

Income per person (economic well-being) is the 2010 Gross Domestic Product per capita, measured in constant 2000 U.S. dollars.  This allows for comparison across countries with different costs of living.  This data originally came from the World Bank’s World Development Indicators.

The level of democracy is measured by the 2009 polity score developed by the Polity IV Project.  This value ranges from -10 (autocracy) to 10 (full democracy).  The Polity IV Project authors group these measures into five categories:

  1. Full Democracy (polity score = 10)
  2. Democracy (6 to 9)
  3. Open Anocracy (1 to 5)
  4. Closed Anocracy (-5 to 0)
  5. Autocracy (-10 to -6)

The urbanization rate is measured by the share of the 2008 population living in an urban setting.  This data was originally produced by the World Bank.  The urban population is defined by national statistical offices.

Exploratory Analysis

The following scatter plot shows the relationship between economic well-being variable and the level of democratization:

One observes that the relationship is nonlinear and there is an outlier in the “Closed Anocracy” category (in purple).  Next we examine the relationship with the confounding variable, the urbanization rate:

gdp-by-urbanization-rate

We observe a positive relationship, but it is nonlinear; in fact, it looks exponential in nature.  One also observes that the full democracy countries are clustered at the higher end of the urbanization rate spectrum.

Data Management

Since the Polity IV score is categorical, I created a quantitative variable measuring the degree to which a country is a full democracy (the polity score scaled by a factor of 10).  This variable ranges from -100 to 100.

I also decided to take the natural log of GDP per capita to straighten out the relationship between GDP per capita and the urbanization rate as this scatter plot shows:


Model Specification

The economic well-being of a country is modeled as dependent on the levels of democratization and urbanization.  The relationship between economic well-being and democratization is quadratic, and the relationship between economic well-being and urbanization is linear.

Results

The results of the linear regression model indicated that the urbanization rate was significantly and positively associated with economic well-being on a natural log scale (Beta=0.0409, p<0.001), as was the squared full democracy degree (Beta=0.0002, p<0.001). The adjusted R2 for this model is 0.718.

To interpret these results, consider a country that is a full democracy (degree = 100) with 100% of the population urbanized.  The log of income per person increases by about 2 (0.0002 × 100²) for the level of democratization and by about 4 (0.0409 × 100) for the urbanization.  This suggests the level of urbanization has a somewhat larger effect than the level of democratization.

Residuals Analysis

The Q-Q plot of the residuals is generally linear but deviates at the lower and upper quantiles; the residuals are slightly skewed.

The normalized residuals indicate that most observations fall within 2 standard deviations; however, seven of the 155 observations (4.5%) fall outside that range.  This suggests the model is not an especially good fit for the observed data.


The influence plot identifies the seven outliers (those greater than 2 or less than -2).  Two observations (194 and 173) have high leverage and are outliers.


Source Code

# Import libraries needed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
sns.set_style('whitegrid')
sns.set_context('talk')

# bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x:'%.2f'%x)

df = pd.read_csv('gapminder.csv')

# Convert to numeric format
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')
df['urbanrate'] = pd.to_numeric(df['urbanrate'], errors='coerce')

# Listwise deletion of missing values
subset = df[['incomeperperson', 'polityscore', 'urbanrate']].dropna()

# This function converts the polity score (-10 to 10) to a full democracy degree (-100 to 100)
def full_democracy_degree(score):
    full_democracy = score * 10  # scale by 10 so the degree ranges from -100 to 100, as described above
    return full_democracy

# Now we can use the function to create the new variable
subset['full_democracy_degree'] = subset['polityscore'].apply(full_democracy_degree)

# Transform the response variable
subset['log_incomeperperson'] = np.log(subset.incomeperperson)

# Quadratic regression analysis
model = smf.ols('log_incomeperperson ~ urbanrate + I(full_democracy_degree**2)', data=subset).fit()
print (model.summary())

# Q-Q plot for normality
fig = sm.qqplot(model.resid, line='r')

# Simple plot of residuals
stdres=pd.DataFrame(model.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l=plt.axhline(y=0, color='r')
l=plt.axhline(y=2.5, color='r', ls='dashed')
l=plt.axhline(y=-2.5, color='r', ls='dashed')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')

Model Output

                             OLS Regression Results                            
===============================================================================
Dep. Variable:     log_incomeperperson   R-squared:                       0.721
Model:                             OLS   Adj. R-squared:                  0.718
Method:                  Least Squares   F-statistic:                     196.8
Date:                 Sat, 12 Dec 2015   Prob (F-statistic):           6.58e-43
Time:                         20:52:36   Log-Likelihood:                -190.65
No. Observations:                  155   AIC:                             387.3
Df Residuals:                      152   BIC:                             396.4
Df Model:                            2                                         
Covariance Type:             nonrobust                                         
=================================================================================================
                                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------------
Intercept                         4.4301      0.183     24.267      0.000         4.069     4.791
urbanrate                         0.0409      0.003     12.273      0.000         0.034     0.048
I(full_democracy_degree ** 2)     0.0002   2.18e-05      8.719      0.000         0.000     0.000
==============================================================================
Omnibus:                        1.980   Durbin-Watson:                   2.148
Prob(Omnibus):                  0.372   Jarque-Bera (JB):                1.661
Skew:                           0.082   Prob(JB):                        0.436
Kurtosis:                       3.480   Cond. No.                     1.73e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.73e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Regression Modeling in Practice Week 2 Assignment

Introduction

For week two of Regression Modeling in Practice I performed a simple regression. My study question in this course is the relationship between democratic openness and economic well-being. The GapMinder data set was used; economic well-being was measured using GDP per capita, and democratic openness was measured using the score from the Polity IV Project.

Analysis Source Code

155 countries were categorized into two groups: full democracy (n=32) and not a full democracy (n=123). The relationship between economic well-being and the presence of full democratization was then measured using an OLS regression. The analysis was performed using the following Python code:

# Import libraries needed
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

# Read in the Data
df = pd.read_csv('gapminder.csv', low_memory=False)

# Print some basic statistics
n = str(len(df))
cols = str(len(df.columns))
print('Number of observations: '+ n +' (rows)')
print('Number of variables: '+ cols +' (columns)')
print('\n')

# Change the data type for variables of interest
# Response Variable
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')
# Explanatory Variable
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')

# Print the number of records with data
print ('Countries with a GDP Per Capita: ' + str(df['incomeperperson'].count()) + ' out of ' + str(len(df)) + ' (' + str(len(df) - df['incomeperperson'].count()) + ' missing)')
print ('Countries with a Democracy Score: ' + str(df['polityscore'].count()) + ' out of ' + str(len(df)) + ' (' + str(len(df) - df['polityscore'].count()) + ' missing)')
print('\n')

# Get the rows not missing a value
subset = df[np.isfinite(df['polityscore'])]
subset = subset[np.isfinite(subset['incomeperperson'])]
print('Number of observations: '+ str(len(subset)) +' (rows)')
print('\n')

# This function converts the polity score to a binary category flag
def is_full_democracy(score):
    if score == 10:
        return 1
    else:
        return 0

# Now we can use the function to create the new variable
subset['is_full_democracy'] = subset['polityscore'].apply(is_full_democracy)

# Create frequency table
full_democracy_counts = subset.groupby('is_full_democracy').size()
print(full_democracy_counts)
print('\n')

# Create simple regression model
model = smf.ols('incomeperperson ~ is_full_democracy', data=subset).fit()
print(model.summary())

This code is also available in my GitHub repository for this class.

Regression Model Results

The results of the linear regression model indicated that the presence of a full democracy (Beta=15,990, p<0.001) was significantly and positively associated with economic well-being. The adjusted R2 for this model is 0.439.