## Introduction

For week three of Regression Modeling in Practice I have continued to examine the relationship between democratic openness and economic well-being. My hypothesis is that as a country becomes more democratic, its economic well-being increases.

In previous work I observed a statistically significant positive relationship between the level of urbanization and economic well-being. This week I examine whether the relationship between democratic openness and economic well-being still holds after including the level of urbanization in the model.

## About The Data

The data is from the GapMinder project. The GapMinder project collects country-level time series data on health, wealth and development. The data set for this class only has one year of data for 213 countries. There are 155 countries with complete data for the following variables of interest:

### Measures

Income per person (economic well-being) is the 2010 Gross Domestic Product per capita, measured in constant 2000 U.S. dollars. Using constant dollars adjusts for inflation; it does not adjust for differences in the cost of living between countries. This data originally came from the World Bank’s World Development Indicators.

The level of democracy is measured by the 2009 polity score developed by the Polity IV Project. This value ranges from -10 (autocracy) to 10 (full democracy). The Polity IV Project authors group these measures into five categories:

- Full Democracy (polity score = 10)
- Democracy (6 to 9)
- Open Anocracy (1 to 5)
- Closed Anocracy (-5 to 0)
- Autocracy (-10 to -6)
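The grouping above can be written as a small lookup function. This is only a sketch of the cut-offs listed here; the function name `polity_category` is my own and is not part of the Polity IV Project or the GapMinder data.

```python
def polity_category(score: int) -> str:
    """Map a polity score (-10 to 10) to the five groups listed above.

    The name and signature are illustrative assumptions, not an official API.
    """
    if not -10 <= score <= 10:
        raise ValueError("polity score must be between -10 and 10")
    if score == 10:
        return "Full Democracy"
    if score >= 6:
        return "Democracy"
    if score >= 1:
        return "Open Anocracy"
    if score >= -5:
        return "Closed Anocracy"
    return "Autocracy"


for s in (10, 7, 3, 0, -8):
    print(s, polity_category(s))
```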

The urbanization rate is measured by the share of the 2008 population living in an urban setting. This data was originally produced by the World Bank. The urban population is defined by national statistical offices.

## Exploratory Analysis

The following scatter plot shows the relationship between the economic well-being variable and the level of democratization:

One observes that the relationship is nonlinear and there is an outlier in the “Closed Anocracy” category (in purple). Now we will examine the relationship between economic well-being and the confounding variable, urbanization rate:

We observe that there is a positive relationship, but it is nonlinear. In fact, it looks exponential in nature. One also observes that the full democracy countries are clustered at the higher end of the urbanization rate spectrum.

## Data Management

Since the Polity IV score is categorical I created a quantitative variable measuring the degree to which the country is a full democracy. This variable ranges from -100 to 100.

I also decided to take the natural log of GDP per capita to straighten out the relationship between GDP per capita and the urbanization rate as this scatter plot shows:
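A small numeric illustration of the same idea, using made-up values rather than the GapMinder data: if income grows exponentially with the urbanization rate, then log income is a straight line in the urbanization rate, and a least-squares slope recovers the growth rate. The helper `ols_slope` and the numbers 200 and 0.04 are assumptions chosen for the example.

```python
import math


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up data: income = 200 * exp(0.04 * urban rate)
urban = [10, 30, 50, 70, 90]
income = [200 * math.exp(0.04 * u) for u in urban]

# After the log transform the relationship is exactly linear
log_income = [math.log(v) for v in income]
slope = ols_slope(urban, log_income)
print(round(slope, 4))  # 0.04
```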

## Model Specification

The economic well-being of a country is modeled as a function of its level of democratization and its level of urbanization. The relationship between economic well-being and democratization is quadratic, and the relationship between economic well-being and urbanization is linear.
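Written out, the specification corresponds to the equation below. The variable names match those in the source code; the symbols $\beta_0, \beta_1, \beta_2$ and $\varepsilon_i$ are my own notation for the coefficients and the error term of country $i$.

$$
\ln(\text{incomeperperson}_i) = \beta_0 + \beta_1\,\text{urbanrate}_i + \beta_2\,\text{full\_democracy\_degree}_i^{2} + \varepsilon_i
$$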

### Results

The results of the linear regression model indicate that the urbanization rate is significantly and positively associated with economic well-being on a natural log scale (Beta=0.0409, p<0.001), as is the squared full democracy degree (Beta=0.0002, p<0.001). The adjusted R^{2} for this model is 0.718.

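To interpret these results, consider a country that is a full democracy (a full democracy degree of 100) with 100% of its population urbanized. Using the coefficients in the model output, the predicted log income per person increases by about 2 for democratization (0.0002 × 100²) and by about 4 for urbanization (0.0409 × 100). At these extremes urbanization contributes roughly twice as much as democratization, although the comparison depends on where each variable is evaluated because the democracy term is quadratic.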

### Residuals Analysis

The Q-Q plot shows residuals that are generally linear but deviate at the lower and higher quantiles. The residuals are skewed a little bit.

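The standardized residuals show that most observations fall within 2 standard deviations of zero; seven of the 155 observations (4.5%) fall outside. This is close to the roughly 5% that would be expected if the residuals were normally distributed, so on this measure alone the fit looks acceptable, although the seven observations are worth a closer look.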
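The count can be reproduced by tallying standardized residuals whose absolute value exceeds the cut-off. The sketch below uses a short made-up list of residuals to show the tally, then checks the reported share of 7 out of 155; the helper name `share_outside` is an assumption for illustration.

```python
def share_outside(residuals, threshold=2.0):
    """Return (count, share) of residuals with |value| above the threshold."""
    count = sum(1 for r in residuals if abs(r) > threshold)
    return count, count / len(residuals)


# Made-up standardized residuals, not the model's actual values
example = [0.1, -2.3, 1.9, 2.6]
print(share_outside(example))  # (2, 0.5)

# Share reported in the text: seven of 155 observations
print(f"{7 / 155:.1%}")  # 4.5%
```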

The influence plot identifies the seven outliers (those with residuals greater than 2 or less than -2). Two observations (194 and 173) have high leverage and are also outliers.
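As a rough illustration of what leverage measures, consider the special case of a regression with an intercept and a single predictor, where the leverage of observation i is 1/n + (x_i - mean)² / Σ(x_j - mean)². The model in this post has two predictors, so its leverages come from the full hat matrix; the sketch below uses made-up urbanization rates and a hypothetical helper `leverage` only to show that points far from the mean of the predictor get high leverage.

```python
def leverage(xs):
    """Leverage of each observation for a one-predictor model with intercept."""
    n = len(xs)
    mean = sum(xs) / n
    sxx = sum((x - mean) ** 2 for x in xs)
    return [1 / n + (x - mean) ** 2 / sxx for x in xs]


# Made-up urbanization rates; the last value sits far from the rest
urban = [20, 40, 45, 50, 55, 60, 100]
h = leverage(urban)

for x, value in zip(urban, h):
    print(x, round(value, 3))

# The leverages sum to the number of fitted parameters (intercept + slope)
print(round(sum(h), 6))
```

An observation is usually a concern only when high leverage coincides with a large residual, which is the combination the influence plot is designed to show.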

## Source Code

```python
# Import libraries needed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns

sns.set_style('whitegrid')
sns.set_context('talk')

# Bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df = pd.read_csv('gapminder.csv')

# Convert to numeric format
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')
df['urbanrate'] = pd.to_numeric(df['urbanrate'], errors='coerce')

# Listwise deletion of missing values
subset = df[['incomeperperson', 'polityscore', 'urbanrate']].dropna()


# Rescale the polity score (-10 to 10) to a degree of full democracy.
# Multiplying by 10 gives the -100 to 100 range described under
# Data Management and matches the scale of the coefficient in the output.
def full_democracy_degree(score):
    return score * 10


# Now we can use the function to create the new variable
subset['full_democracy_degree'] = subset['polityscore'].apply(full_democracy_degree)

# Transform the response variable
subset['log_incomeperperson'] = np.log(subset.incomeperperson)

# Quadratic regression analysis
model = smf.ols('log_incomeperperson ~ urbanrate + I(full_democracy_degree**2)',
                data=subset).fit()
print(model.summary())

# Q-Q plot for normality
fig = sm.qqplot(model.resid, line='r')

# Simple plot of residuals (on a new figure so it does not overwrite the Q-Q plot)
plt.figure()
stdres = pd.DataFrame(model.resid_pearson)
plt.plot(stdres, 'o', ls='None')
plt.axhline(y=0, color='r')
plt.axhline(y=2.5, color='r', ls='dashed')
plt.axhline(y=-2.5, color='r', ls='dashed')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
plt.show()
```

### Model Output

```
                             OLS Regression Results
===============================================================================
Dep. Variable:     log_incomeperperson   R-squared:                       0.721
Model:                             OLS   Adj. R-squared:                  0.718
Method:                  Least Squares   F-statistic:                     196.8
Date:                 Sat, 12 Dec 2015   Prob (F-statistic):           6.58e-43
Time:                         20:52:36   Log-Likelihood:                -190.65
No. Observations:                  155   AIC:                             387.3
Df Residuals:                      152   BIC:                             396.4
Df Model:                            2
Covariance Type:             nonrobust
=================================================================================================
                                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------------
Intercept                         4.4301      0.183     24.267      0.000         4.069     4.791
urbanrate                         0.0409      0.003     12.273      0.000         0.034     0.048
I(full_democracy_degree ** 2)     0.0002   2.18e-05      8.719      0.000         0.000     0.000
==============================================================================
Omnibus:                        1.980   Durbin-Watson:                   2.148
Prob(Omnibus):                  0.372   Jarque-Bera (JB):                1.661
Skew:                           0.082   Prob(JB):                        0.436
Kurtosis:                       3.480   Cond. No.                     1.73e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.73e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
```