Regression Modeling in Practice Week 4 Assignment

Introduction

This week’s assignment uses logistic regression. I have been examining the relationship between economic well-being and the level of democracy or openness of a society. My hypothesis is that the most open countries will be more likely to have a high level of economic well-being. My previous work established a statistically significant positive relationship between the level of urbanization and economic well-being, so I will also examine urbanization as a potential confounding variable.

About the Data

The data is from the GapMinder project. The GapMinder project collects country-level time series data on health, wealth and development. The data set for this class only has one year of data for 213 countries. There are 155 countries with complete data for the following variables of interest.

Measures

Economic Well-Being (Response Variable)

Income per person (economic well-being) is the 2010 Gross Domestic Product per capita, measured in constant 2000 U.S. dollars. Using constant dollars adjusts for inflation, though not for differences in the cost of living between countries. This data originally came from the World Bank’s World Development Indicators.

Because logistic regression requires a binary response, I have binned this variable into countries with a higher than average income (n=39) and countries with an average or lower income (n=116). The average per capita income of the study group is $6,605.

Level of Democracy (Explanatory Variable)

The level of democracy is measured by the 2009 polity score developed by the Polity IV Project. This value ranges from -10 (autocracy) to 10 (full democracy). I will bin these 21 categories into two categories:

  • Full Democracy (polity score = 10)
  • Not a Full Democracy (polity score = -10 to 9)

Thirty-two of the countries are full democracies and the remaining 123 are not.

Urbanization Rate (Possible Confounder)

The urbanization rate is measured by the share of the 2008 population living in an urban setting. This data was originally produced by the World Bank. The urban population is defined by national statistical offices. This variable was binned into countries with a higher than average urbanization rate (55%+) and countries at or below the average. Of the 155 observations, 84 had a higher than average urbanization rate and 71 did not.
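As a minimal sketch of the binning described in this section, the snippet below builds the three binary indicators with vectorized comparisons. The four-row data frame is made up purely for illustration and is not GapMinder data; the column names are the ones used in the source code for this assignment.

```python
import pandas as pd

# Hypothetical toy values for illustration only (not GapMinder data)
toy = pd.DataFrame({
    'incomeperperson': [500.0, 1500.0, 4000.0, 30000.0],
    'polityscore': [-7, 3, 10, 10],
    'urbanrate': [20.0, 45.0, 60.0, 85.0],
})

# 1 if the polity score is exactly 10 (full democracy), else 0
toy['full_democracy'] = (toy['polityscore'] == 10).astype(int)

# 1 if the value is strictly above the sample mean, else 0
toy['high_income'] = (toy['incomeperperson'] > toy['incomeperperson'].mean()).astype(int)
toy['high_urbanrate'] = (toy['urbanrate'] > toy['urbanrate'].mean()).astype(int)

print(toy[['full_democracy', 'high_income', 'high_urbanrate']])
```

Note that a mean-based cutoff depends on the sample: adding or dropping countries shifts the threshold, which is one reason the group sizes above are so uneven for income.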

Results

The odds of having a higher than average income are almost 28 times as high for countries classified as full democracies as for other countries (OR=27.81, 95% CI=10.17 to 76.04, p<0.001).

However, after adjusting for urbanization rate, the odds of having a higher than average income were about 17 times as high for full democracies as for other countries (OR=17.09, 95% CI=5.88 to 49.67, p<0.001), down from almost 28 times in the unadjusted model. Urbanization was also significantly associated with high income: countries with a higher than average urbanization rate had about 9 times the odds of a higher than average per capita income (OR=9.31, 95% CI=2.47 to 35.16, p=0.001).
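The odds ratios quoted here are the exponentiated logit coefficients. The sketch below reproduces them from the rounded coefficients and confidence limits printed in the model output, so the last digit can differ slightly from the reported values. The helper name to_odds_ratio and the relative-change figure are my own illustration and were not part of the original analysis.

```python
import math

# Coefficients and 95% confidence limits for full_democracy,
# copied from the model output (log-odds scale)
model1_full_democracy = {'coef': 3.3253, 'lower': 2.319, 'upper': 4.331}
model2_full_democracy = {'coef': 2.8388, 'lower': 1.772, 'upper': 3.905}

def to_odds_ratio(estimate):
    """Exponentiate a log-odds estimate and its confidence limits."""
    return {name: math.exp(value) for name, value in estimate.items()}

or1 = to_odds_ratio(model1_full_democracy)
or2 = to_odds_ratio(model2_full_democracy)

print('Unadjusted OR: %.2f (%.2f to %.2f)' % (or1['coef'], or1['lower'], or1['upper']))
print('Adjusted OR:   %.2f (%.2f to %.2f)' % (or2['coef'], or2['lower'], or2['upper']))

# Rough descriptive measure of how much the odds ratio shrinks
# once urbanization is added to the model
change = (or1['coef'] - or2['coef']) / or1['coef']
print('Relative change in OR: %.0f%%' % (100 * change))
```

The odds ratio for full democracies shrinks noticeably after adjustment but stays far above 1, which is consistent with urbanization explaining part, not all, of the association.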

Source Code

UPDATE: Jupyter notebook available on GitHub.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# display floats with two decimal places (also avoids run time errors when printing)
pd.set_option('display.float_format', lambda x:'%.2f'%x)

df = pd.read_csv('gapminder.csv')

# convert to numeric format
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')
df['urbanrate'] = pd.to_numeric(df['urbanrate'], errors='coerce')

# listwise deletion of missing values
subset = df[['incomeperperson', 'polityscore', 'urbanrate']].dropna()

# This function converts the polity score to a category
def convert_polityscore_to_category(polityscore):
    if polityscore == 10:
        return 1
    else:
        return 0

# Now we can use the function to create the new variable
subset['full_democracy'] = subset['polityscore'].apply(convert_polityscore_to_category)

counts = subset.groupby('full_democracy').size()
print(counts)

# Create a threshold
income_threshold = np.mean(subset['incomeperperson'])
print(income_threshold)

# Set binary flag that income per person is greater than the threshold
def income_higher_than_threshold(income):
    if income > income_threshold:
        return 1
    else:
        return 0

subset['high_income'] = subset['incomeperperson'].apply(income_higher_than_threshold)

counts = subset.groupby('high_income').size()
print(counts)

# Create a threshold
urbanization_threshold = np.mean(subset['urbanrate'])
print(urbanization_threshold)

# Set binary flag that urbanization rate is greater than the threshold
def urbanrate_higher_than_threshold(urbanrate):
    if urbanrate > urbanization_threshold:
        return 1
    else:
        return 0

subset['high_urbanrate'] = subset['urbanrate'].apply(urbanrate_higher_than_threshold)

counts = subset.groupby('high_urbanrate').size()
print(counts)

# logistic regression with society type
lreg1 = smf.logit(formula = 'high_income ~ full_democracy', data = subset).fit()
print (lreg1.summary())

# odds ratios with 95% confidence intervals
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (np.exp(conf))

# logistic regression with society type and urbanization rate
lreg2 = smf.logit(formula = 'high_income ~ full_democracy + high_urbanrate', data = subset).fit()
print (lreg2.summary())

# odds ratios with 95% confidence intervals
params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (np.exp(conf))

Model Output

Logistic Regression Model 1 – Full Democracy

Optimization terminated successfully.
         Current function value: 0.389711
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:            high_income   No. Observations:                  155
Model:                          Logit   Df Residuals:                      153
Method:                           MLE   Df Model:                            1
Date:                Sat, 19 Dec 2015   Pseudo R-squ.:                  0.3091
Time:                        12:23:56   Log-Likelihood:                -60.405
converged:                       True   LL-Null:                       -87.436
                                        LLR p-value:                 1.944e-13
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept         -2.0523      0.284     -7.229      0.000        -2.609    -1.496
full_democracy     3.3253      0.513      6.478      0.000         2.319     4.331
==================================================================================
                Lower CI  Upper CI    OR
Intercept           0.07      0.22  0.13
full_democracy     10.17     76.04 27.81

Logistic Regression Model 2 – Full Democracy + Urbanization Rate

Optimization terminated successfully.
         Current function value: 0.342689
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:            high_income   No. Observations:                  155
Model:                          Logit   Df Residuals:                      152
Method:                           MLE   Df Model:                            2
Date:                Sat, 19 Dec 2015   Pseudo R-squ.:                  0.3925
Time:                        12:23:56   Log-Likelihood:                -53.117
converged:                       True   LL-Null:                       -87.436
                                        LLR p-value:                 1.246e-15
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept         -3.5056      0.636     -5.509      0.000        -4.753    -2.258
full_democracy     2.8388      0.544      5.216      0.000         1.772     3.905
high_urbanrate     2.2312      0.678      3.291      0.001         0.903     3.560
==================================================================================
                Lower CI  Upper CI    OR
Intercept           0.01      0.10  0.03
full_democracy      5.88     49.67 17.09
high_urbanrate      2.47     35.16  9.31
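
As a quick check on how to read the summary tables, the sketch below recomputes the "Pseudo R-squ." values and the Model 1 "LLR p-value" from the log-likelihoods printed above. It assumes the standard definitions: statsmodels reports McFadden's pseudo R-squared, and the likelihood ratio statistic is compared against a chi-square distribution with one degree of freedom for Model 1. The inputs are rounded, so small differences in the last digit are expected.

```python
import math

# Log-likelihoods copied from the model output above
ll_null = -87.436
ll_model1 = -60.405
ll_model2 = -53.117

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1 - ll_model / ll_null

print('Model 1 pseudo R-squared: %.4f' % mcfadden_r2(ll_model1, ll_null))
print('Model 2 pseudo R-squared: %.4f' % mcfadden_r2(ll_model2, ll_null))

# Likelihood ratio statistic for Model 1 against the intercept-only model
lr_stat = 2 * (ll_model1 - ll_null)

# For a chi-square distribution with 1 degree of freedom,
# the upper-tail probability equals erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2))
print('Model 1 LLR p-value: %.3e' % p_value)
```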