Lasso Regression (Week 3 Assignment)


A lasso regression analysis was conducted to identify a subset of variables from a pool of 8 quantitative predictor variables that best predicted a binary response variable measuring the presence of high per capita income. The data for the analysis is and extract from the GapMinder project. The GapMinder project collects country-level time series data on health, wealth and development. The data set for this analysis only has one year of data for 213 countries.

High per Capita Income (Response Variable)

The 2010 Gross Domestic Product per capita is was classified into high income for cases where the absolute deviation divided by the mean absolute deviation is greater than 3.  The GDP per capita is measured in constant 2000 U.S. dollars and was originally came from the World Bank’s Work Development Indicators.

Explanatory Variables

All explanatory variables were standardized to have a mean of zero and a standard deviation of one. The following explanatory variables were included in the data set:

  • Alcohol Consumption – 2008 recorded and estimated average alcohol consumption, adult (15+) per capita as collected by the World Heath Organization
  • CO2 Emissions – Total amount of CO2 emission in metric tons from 1751 to 2006 as collected by CDIAC
  • Female Employment Rate – Percentage of female population, age above 15, that has been
    employed during 2007 as collected by the International Labour Organization
  • Internet Use Rate – 2010 Internet users per 100 people as collected by the World Bank
  • Life Expectancy – 2011 life expectancy at birth (in years) as collected by various sources
  • Polity Score – 2009 Democracy score as collected by the Polity IV Project
  • Employment Rate – Percentage of total population, age above 15, that has been employed during 2009 as collected by the International Labour Organization
  • Urbanization Rate – 2008 Urban population (% total population) as collected by the World Bank


Data were randomly split into a training set that included 70% of the observations (N=148) and a test set that included 30% of the observations (N=45). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Figure 1. Mean squared error on each fold



Of the 8 predictor variables, 3 were retained in the selected model. Internet user rate was the most strongly associated with high income. CO2 emissions and employment rate were also positively associated with high income. These 3 variables accounted for 44.2% of the variance in the high per capita income response variable in the test data set.

Table 1. Regression Coefficients

Variable Regression Coefficients
Internet Use Rate 0.157313
CO2 Emissions 0.019681
Employ Rate 0.001853
Alcohol Consumption 0
Female Employ 0
Life Expectancy 0
Polity Score 0
Urbanization Rate 0


The previous week I examined the same variables using a Random Forest. The explanatory variables with the highest relative importance scores were life expectancy, internet use rate, urbanization rate. The LASSO regression only selected the internet use rate as a significant variable. As noted in last week’s post, these findings suggest that my previous work looking at the relationship between the level of democratization and the economic well-being may have been confounded by other variables.

Source Code

# Import libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing

# Make results reproducible

df = pd.read_csv('gapminder.csv')

variables = ['incomeperperson', 'alcconsumption', 'co2emissions', 'femaleemployrate',
                'internetuserate', 'lifeexpectancy','polityscore','employrate','urbanrate']

# convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')

# listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
" =============================  Data Management  =============================
# Remove the first variable from the list since the target is derived from it

# Center and scale data
for variable in variables:
# Identify contries with a high level of income using the MAD (mean absolute deviation) method
subset['absolute_deviations'] = np.absolute(subset['incomeperperson'] - np.median(subset['incomeperperson']))
MAD = np.mean(subset['absolute_deviations'])

# This function converts the income per person absolute deviations to a high income flag
def high_income_flag(absolute_deviations):
    threshold = 3
    if (absolute_deviations/MAD) > threshold:
        return 1
        return 0

subset['High Income'] = subset['absolute_deviations'].apply(high_income_flag)

" ==========================  Build LASSO Regression  ==========================
predictors = subset[variables]
targets = subset['High Income']

#Split into training and testing sets
training_data, test_data, training_target, test_target  = train_test_split(predictors, targets, test_size=.3)

# Build the LASSO regression model
model=LassoLarsCV(cv=10, precompute=False).fit(training_data, training_target)

" ==========================  Evaluate LASSO Model  ============================
# print variable names and regression coefficients
feature_name = list(predictors.columns.values)
feature_coefficient = list(model.coef_)
features = pd.DataFrame({'Variable':feature_name, 'Regression Coefficients':feature_coefficient}).sort_values(by='Regression Coefficients', ascending=False)

#print(dict(zip(predictors.columns, model.coef_)))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(training_target, model.predict(training_data))
test_error = mean_squared_error(test_target, model.predict(test_data))
print ('training data MSE')
print ('test data MSE')

# R-square from training and test data
rsquared_train=model.score(training_data, training_target)
rsquared_test=model.score(test_data, test_target)
print ('training data R-square')
print ('test data R-square')

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s