Regression Modeling in Practice Week 3 Assignment


For week three of Regression Modeling in Practice, I have continued to examine the relationship between democratic openness and economic well-being.  My hypothesis is that as a country becomes more democratic, its economic well-being increases.

In previous work I observed a statistically significant positive relationship between the level of urbanization and economic well-being.  This week I examine whether the relationship between democratic openness and economic well-being still holds after including the level of urbanization.

About The Data

The data is from the GapMinder project. The GapMinder project collects country-level time series data on health, wealth and development.  The data set for this class only has one year of data for 213 countries.  There are 155 countries with complete data for the following variables of interest:


Income per person (economic well-being) is the 2010 Gross Domestic Product per capita, measured in constant 2000 U.S. dollars.  This adjusts for inflation but not for differences in the cost of living between countries.  This data originally came from the World Bank’s World Development Indicators.

The level of democracy is measured by the 2009 polity score developed by the Polity IV Project.  This value ranges from -10 (autocracy) to 10 (full democracy).  The Polity IV Project authors group these measures into five categories:

  1. Full Democracy (polity score = 10)
  2. Democracy (6 to 9)
  3. Open Anocracy (1 to 5)
  4. Closed Anocracy (-5 to 0)
  5. Autocracy (-10 to -6)
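
The five bands above can be written as a small lookup function.  This is an illustrative sketch only; the function name `polity_category` is hypothetical and does not appear in the source code for this assignment.

```python
def polity_category(score):
    """Map a Polity IV score (-10 to 10) to the five bands listed above."""
    if not -10 <= score <= 10:
        raise ValueError("polity score must be between -10 and 10")
    if score == 10:
        return "Full Democracy"
    if score >= 6:
        return "Democracy"
    if score >= 1:
        return "Open Anocracy"
    if score >= -5:
        return "Closed Anocracy"
    return "Autocracy"


# Example calls
print(polity_category(7))
print(polity_category(0))
```

A lookup like this is convenient for coloring scatter plot points by category.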

The urbanization rate is measured by the share of the 2008 population living in an urban setting.  This data was originally produced by the World Bank.  The urban population is defined by national statistical offices.

Exploratory Analysis

The following scatter plot shows the relationship between economic well-being and the level of democratization:

One observes that the relationship is nonlinear and that there is an outlier in the “Closed Anocracy” category (in purple).  Next we examine the relationship between economic well-being and the confounding variable, urbanization rate:


We observe that there is a positive relationship, but it is nonlinear; in fact, it looks exponential.  One also observes that the full democracy countries are clustered at the higher end of the urbanization rate spectrum.

Data Management

Since the Polity IV score is categorical I created a quantitative variable measuring the degree to which the country is a full democracy.  This variable ranges from -100 to 100.

I also decided to take the natural log of GDP per capita to straighten out the relationship between GDP per capita and the urbanization rate as this scatter plot shows:


Model Specification

The economic well-being of a country is modeled as dependent on its level of democratization and urbanization.  The relationship between economic well-being and democratization is quadratic, and the relationship between economic well-being and urbanization is linear.
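
Written out, this specification takes the following form.  The symbols are notation introduced here for illustration: for country i, y_i is income per person, u_i is the urbanization rate, and d_i is the full democracy degree.

```latex
\ln(y_i) = \beta_0 + \beta_1 u_i + \beta_2 d_i^{2} + \varepsilon_i
```

This matches the formula passed to the regression in the Source Code section, where the response is the natural log of income per person.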


The results of the linear regression model indicated that the urbanization rate was significantly and positively associated with economic well-being on a natural log scale (Beta=0.0409, p<0.001), as was the squared full democracy degree (Beta=0.0002, p<0.001).  The adjusted R2 for this model is 0.718.

To interpret these results, consider a country that is a full democracy with 100% of the population urbanized.  The log income per person increases by about 2 for the level of democratization (0.0002 x 100 squared) and by about 4 for the urbanization (0.0409 x 100).  This suggests that, at these extremes, the level of urbanization has a larger effect than the level of democratization.
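
The contribution of each term can be checked directly from the coefficients printed in the Model Output section.  The sketch below hard-codes those coefficients as rounded in the summary table, so the results are approximate; the helper name `predicted_log_income` is made up for illustration.

```python
# Coefficients copied from the regression summary (rounded as printed there)
INTERCEPT = 4.4301
B_URBANRATE = 0.0409
B_DEMOCRACY_SQ = 0.0002


def predicted_log_income(urbanrate, full_democracy_degree):
    """Fitted log income per person for a given urban rate (0 to 100)
    and full democracy degree (-100 to 100)."""
    return (INTERCEPT
            + B_URBANRATE * urbanrate
            + B_DEMOCRACY_SQ * full_democracy_degree ** 2)


urban_term = B_URBANRATE * 100               # contribution at 100% urbanization
democracy_term = B_DEMOCRACY_SQ * 100 ** 2   # contribution at full democracy
print(round(urban_term, 2), round(democracy_term, 2))
print(round(predicted_log_income(100, 100), 2))
```

With the coefficients as printed, the urbanization term comes to about 4.1 and the democracy term to about 2.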

Residuals Analysis

The Q-Q plot shows residuals that are generally linear but deviate at the lower and higher quantiles.  The residuals are slightly skewed.

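The standardized residuals indicate that most observations fall within 2 standard deviations of the mean; seven of the 155 observations (4.5%) fall outside.  Since about 5% of observations would be expected outside this range if the residuals were normally distributed, this share by itself does not point to a poor fit, although the individual outliers deserve a closer look.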
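
The share of observations outside the 2 standard deviation band is a simple count.  A minimal sketch, assuming the standardized residuals are available as a plain sequence of numbers; `share_outside` is a hypothetical helper, and the short list in the usage lines is made-up input rather than the residuals from this model.

```python
def share_outside(std_residuals, threshold=2.0):
    """Return the fraction of standardized residuals whose absolute
    value exceeds the threshold."""
    values = list(std_residuals)
    if not values:
        raise ValueError("no residuals supplied")
    outside = sum(1 for r in values if abs(r) > threshold)
    return outside / len(values)


# Made-up residuals, purely to show the call
example = [0.3, -1.1, 2.4, -2.6, 0.8, 1.9, -0.2, 0.5]
print(share_outside(example))

# The proportion reported above: seven of 155 observations, as a percentage
print(round(7 / 155 * 100, 1))
```

In the actual analysis the input would be the standardized residuals of the fitted model rather than a hand-typed list.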



The influence plot identifies the seven outliers (those with residuals greater than 2 or less than -2).  Two observations (194 and 173) have high leverage and are outliers.


Source Code

# Import libraries needed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns

# bug fix for display formats to avoid run time errors
pd.set_option('display.float_format', lambda x:'%.2f'%x)

df = pd.read_csv('gapminder.csv')

# Convert to numeric format
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')
df['urbanrate'] = pd.to_numeric(df['urbanrate'], errors='coerce')

# Listwise deletion of missing values
subset = df[['incomeperperson', 'polityscore', 'urbanrate']].dropna()

# This function rescales the polity score (-10 to 10) to the
# full democracy degree (-100 to 100) described under Data Management
def full_democracy_degree(score):
    return score * 10

# Now we can use the function to create the new variable
subset['full_democracy_degree'] = subset['polityscore'].apply(full_democracy_degree)

# Transform the response variable
subset['log_incomeperperson'] = np.log(subset.incomeperperson)

# Quadratic regression analysis
model = smf.ols('log_incomeperperson ~ urbanrate + I(full_democracy_degree**2)', data=subset).fit()
print(model.summary())

# Q-Q plot for normality
fig = sm.qqplot(model.resid, line='r')

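# Simple plot of standardized residuals
stdres = model.resid_pearson  # residuals scaled to unit variance
plt.figure()  # new figure so this does not draw over the Q-Q plot
plt.plot(stdres, 'o', ls='None')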
l=plt.axhline(y=0, color='r')
l=plt.axhline(y=2.5, color='r', ls='dashed')
l=plt.axhline(y=-2.5, color='r', ls='dashed')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')

# Influence plot of leverage against studentized residuals
fig = sm.graphics.influence_plot(model, size=8)
plt.show()

Model Output

                             OLS Regression Results                            
Dep. Variable:     log_incomeperperson   R-squared:                       0.721
Model:                             OLS   Adj. R-squared:                  0.718
Method:                  Least Squares   F-statistic:                     196.8
Date:                 Sat, 12 Dec 2015   Prob (F-statistic):           6.58e-43
Time:                         20:52:36   Log-Likelihood:                -190.65
No. Observations:                  155   AIC:                             387.3
Df Residuals:                      152   BIC:                             396.4
Df Model:                            2                                         
Covariance Type:             nonrobust                                         
                                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
Intercept                         4.4301      0.183     24.267      0.000         4.069     4.791
urbanrate                         0.0409      0.003     12.273      0.000         0.034     0.048
I(full_democracy_degree ** 2)     0.0002   2.18e-05      8.719      0.000         0.000     0.000
Omnibus:                        1.980   Durbin-Watson:                   2.148
Prob(Omnibus):                  0.372   Jarque-Bera (JB):                1.661
Skew:                           0.082   Prob(JB):                        0.436
Kurtosis:                       3.480   Cond. No.                     1.73e+04

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.73e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
