## Introduction

This week’s class assignment uses logistic regression. I have been examining the relationship between economic well-being and the level of democracy or openness of a society. My hypothesis is that the most open countries are more likely to have a high level of economic well-being. My previous work established a statistically significant positive relationship between the level of urbanization and economic well-being, so I will also examine the influence of urbanization as a potential confounding variable.

## About the Data

The data is from the GapMinder project. The GapMinder project collects country-level time series data on health, wealth and development. The data set for this class only has one year of data for 213 countries. There are 155 countries with complete data for the following variables of interest.

### Measures

#### Economic Well-Being (Response Variable)

Income per person (economic well-being) is the 2010 Gross Domestic Product per capita, measured in constant 2000 U.S. dollars. Using constant dollars adjusts for inflation, though not for differences in the cost of living between countries. This data originally came from the World Bank’s World Development Indicators.

Because logistic regression needs a binary response, I binned this variable into countries with higher than average income (n=39) and countries with average or lower income (n=116). The average per capita income of the study group is $6,605.
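The mean split described above can be sketched in a few lines. The income values below are hypothetical and are not GapMinder data; they only illustrate how a right-skewed variable tends to leave fewer cases above the mean than below it, which matches the 39 versus 116 split reported here.

```python
from statistics import mean

# Hypothetical per capita incomes in constant 2000 US$ (NOT GapMinder values)
incomes = [300.0, 550.0, 1200.0, 2500.0, 4000.0, 9000.0, 28000.0]

# The threshold is the sample mean, as in the write-up
threshold = mean(incomes)

# 1 = above-average income, 0 = average or below
high_income = [1 if income > threshold else 0 for income in incomes]

print(f"threshold = {threshold:.2f}")
print(high_income)
```

A median split would give balanced groups, but it answers a slightly different question, so the choice of threshold is worth stating explicitly.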

#### Level of Democracy (Explanatory Variable)

The level of democracy is measured by the 2009 polity score developed by the Polity IV Project. This value ranges from -10 (autocracy) to 10 (full democracy). I will bin these 21 categories into two categories:

- Full Democracy (polity score = 10)
- Not a Full Democracy (polity score = -10 to 9)

Thirty-two of the countries are full democracies and the remaining 123 are not.

#### Urbanization Rate (Possible Confounder)

The urbanization rate is measured by the share of the 2008 population living in an urban setting. This data was originally produced by the World Bank. The urban population is defined by national statistical offices. This variable was binned into countries with a higher than average urbanization rate (55% or above) and countries without. Eighty-four countries had a higher than average urbanization rate and 71 did not.

## Results

The odds of having a higher than average income are almost 28 times as high for countries classified as full democracies as for other countries (OR=27.81, 95% CI=10.17 to 76.04, p<0.001).
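With a single binary predictor, the unadjusted odds ratio can be checked by hand from a 2x2 table. The cell counts in this sketch are not printed anywhere in the post; they are inferred from the reported group sizes (32 and 123, 39 and 116) and the Model 1 intercept, so treat them as an assumption. The interval uses the standard Wald formula on the log odds ratio.

```python
import math

# Cell counts inferred from the reported margins and Model 1 coefficients
# (an assumption: the post itself does not print this table)
full_high, full_low = 25, 7        # full democracies, n = 32
other_high, other_low = 14, 109    # all other countries, n = 123

# Odds ratio = odds of high income among full democracies / odds among others
odds_ratio = (full_high / full_low) / (other_high / other_low)

# Wald standard error of log(OR): sqrt of the summed reciprocal cell counts
se_log_or = math.sqrt(
    1 / full_high + 1 / full_low + 1 / other_high + 1 / other_low
)

z = 1.96  # normal quantile for a 95% interval
lower = math.exp(math.log(odds_ratio) - z * se_log_or)
upper = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI = {lower:.2f} to {upper:.2f}")
```

If the inferred counts are right, the result should agree with the odds ratio and interval reported by the logistic regression, since the model with one binary predictor reproduces the table exactly.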

However, after adjusting for urbanization rate, the odds ratio for full democracies dropped to about 17 (OR=17.09, 95% CI=5.88 to 49.67, p<0.001). Urbanization was also significantly associated with high income: countries with a higher than average urbanization rate were significantly more likely to have higher than average per capita incomes (OR=9.31, 95% CI=2.47 to 35.16, p=0.001).
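To gauge how much of the democracy effect urbanization accounts for, it can help to compare the full_democracy coefficient before and after adjustment. This is a rough sketch that uses only the coefficients reported in the model output; the variable names are mine.

```python
import math

# full_democracy coefficients (log odds) taken from the two model summaries
coef_unadjusted = 3.3253   # Model 1: full_democracy only
coef_adjusted = 2.8388     # Model 2: full_democracy + high_urbanrate

# An odds ratio is the exponentiated logit coefficient
or_unadjusted = math.exp(coef_unadjusted)
or_adjusted = math.exp(coef_adjusted)

# Relative change on the odds ratio scale and on the log odds scale
pct_change_or = 100 * (or_adjusted - or_unadjusted) / or_unadjusted
pct_change_coef = 100 * (coef_adjusted - coef_unadjusted) / coef_unadjusted

print(f"OR unadjusted = {or_unadjusted:.2f}, adjusted = {or_adjusted:.2f}")
print(f"change in OR = {pct_change_or:.1f}%")
print(f"change in coefficient = {pct_change_coef:.1f}%")
```

The association weakens once urbanization is in the model but stays large and significant, which suggests urbanization explains part, not all, of the relationship.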

## Source Code

UPDATE: Jupyter notebook available on GitHub.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df = pd.read_csv('gapminder.csv')

# convert to numeric format
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')
df['urbanrate'] = pd.to_numeric(df['urbanrate'], errors='coerce')

# listwise deletion of missing values
subset = df[['incomeperperson', 'polityscore', 'urbanrate']].dropna()


# This function converts the polity score to a category
def convert_polityscore_to_category(polityscore):
    if polityscore == 10:
        return 1
    else:
        return 0


# Now we can use the function to create the new variable
subset['full_democracy'] = subset['polityscore'].apply(convert_polityscore_to_category)
counts = subset.groupby('full_democracy').size()
print(counts)

# Create a threshold
income_threshold = np.mean(subset['incomeperperson'])
print(income_threshold)


# Set binary flag that income per person is greater than the threshold
def income_higher_than_threshold(income):
    if income > income_threshold:
        return 1
    else:
        return 0


subset['high_income'] = subset['incomeperperson'].apply(income_higher_than_threshold)
counts = subset.groupby('high_income').size()
print(counts)

# Create a threshold
urbanization_threshold = np.mean(subset['urbanrate'])
print(urbanization_threshold)


# Set binary flag that urbanization rate is greater than the threshold
def urbanrate_higher_than_threshold(urbanrate):
    if urbanrate > urbanization_threshold:
        return 1
    else:
        return 0


subset['high_urbanrate'] = subset['urbanrate'].apply(urbanrate_higher_than_threshold)
counts = subset.groupby('high_urbanrate').size()
print(counts)

# logistic regression with society type
lreg1 = smf.logit(formula='high_income ~ full_democracy', data=subset).fit()
print(lreg1.summary())

# odds ratios with 95% confidence intervals
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(np.exp(conf))

# logistic regression with society type and urbanization rate
lreg2 = smf.logit(formula='high_income ~ full_democracy + high_urbanrate', data=subset).fit()
print(lreg2.summary())

# odds ratios with 95% confidence intervals
params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print(np.exp(conf))
```

## Model Output

### Logistic Regression Model 1 – Full Democracy

```
Optimization terminated successfully.
         Current function value: 0.389711
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:            high_income   No. Observations:                  155
Model:                          Logit   Df Residuals:                      153
Method:                           MLE   Df Model:                            1
Date:                Sat, 19 Dec 2015   Pseudo R-squ.:                  0.3091
Time:                        12:23:56   Log-Likelihood:                -60.405
converged:                       True   LL-Null:                       -87.436
                                        LLR p-value:                 1.944e-13
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept         -2.0523      0.284     -7.229      0.000        -2.609    -1.496
full_democracy     3.3253      0.513      6.478      0.000         2.319     4.331
==================================================================================
```

```
                Lower CI  Upper CI     OR
Intercept           0.07      0.22   0.13
full_democracy     10.17     76.04  27.81
```

### Logistic Regression Model 2 – Full Democracy + Urbanization Rate

```
Optimization terminated successfully.
         Current function value: 0.342689
         Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:            high_income   No. Observations:                  155
Model:                          Logit   Df Residuals:                      152
Method:                           MLE   Df Model:                            2
Date:                Sat, 19 Dec 2015   Pseudo R-squ.:                  0.3925
Time:                        12:23:56   Log-Likelihood:                -53.117
converged:                       True   LL-Null:                       -87.436
                                        LLR p-value:                 1.246e-15
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept         -3.5056      0.636     -5.509      0.000        -4.753    -2.258
full_democracy     2.8388      0.544      5.216      0.000         1.772     3.905
high_urbanrate     2.2312      0.678      3.291      0.001         0.903     3.560
==================================================================================
```
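The fit statistics in the two summaries can be reproduced from the log-likelihoods alone. The sketch below recomputes the pseudo R-squared as one minus the ratio of model to null log-likelihood, and adds a likelihood ratio test of Model 2 against Model 1. The test is my addition and is not part of the original output; the helper name `mcfadden_r2` is hypothetical.

```python
import math

# Log-likelihoods copied from the two model summaries
ll_null = -87.436     # intercept-only model (LL-Null)
ll_model1 = -60.405   # full_democracy
ll_model2 = -53.117   # full_democracy + high_urbanrate


def mcfadden_r2(ll_model, ll_null):
    """Pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1 - ll_model / ll_null


r2_model1 = mcfadden_r2(ll_model1, ll_null)
r2_model2 = mcfadden_r2(ll_model2, ll_null)

# Likelihood ratio test: does adding high_urbanrate improve the fit?
# The models differ by one parameter, so the statistic has 1 degree of freedom.
lr_stat = 2 * (ll_model2 - ll_model1)

# Chi-square survival function for 1 df, written with the stdlib erfc
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(f"pseudo R2: model 1 = {r2_model1:.4f}, model 2 = {r2_model2:.4f}")
print(f"LR statistic = {lr_stat:.3f}, p = {p_value:.2g}")
```

The `erfc` shortcut only holds for one degree of freedom; for other cases `scipy.stats.chi2.sf` would be the usual choice.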

```
                Lower CI  Upper CI     OR
Intercept           0.01      0.10   0.03
full_democracy      5.88     49.67  17.09
high_urbanrate      2.47     35.16   9.31
```
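Odds ratios can be hard to picture, so one option is to turn the Model 2 coefficients into predicted probabilities for the four combinations of the two binary predictors. This is a sketch built only from the coefficients printed above; the function name `predicted_probability` is my own.

```python
import math

# Model 2 coefficients (log odds) from the summary
INTERCEPT = -3.5056
B_FULL_DEMOCRACY = 2.8388
B_HIGH_URBANRATE = 2.2312


def predicted_probability(full_democracy, high_urbanrate):
    """Probability of above-average income via the inverse logit."""
    log_odds = (
        INTERCEPT
        + B_FULL_DEMOCRACY * full_democracy
        + B_HIGH_URBANRATE * high_urbanrate
    )
    return 1 / (1 + math.exp(-log_odds))


for full_democracy in (0, 1):
    for high_urbanrate in (0, 1):
        p = predicted_probability(full_democracy, high_urbanrate)
        print(
            f"full_democracy={full_democracy}, "
            f"high_urbanrate={high_urbanrate}: p = {p:.3f}"
        )
```

Because the model has no interaction term, these probabilities assume the democracy effect is the same on the log odds scale at both levels of urbanization.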