K Means Clustering (Week 4 Assignment)

Introduction

A k-means clustering analysis was conducted to divide countries into subgroups based on variables determined to have an impact on economic well-being. The data for the analysis are an extract from the GapMinder project, which collects country-level time series data on health, wealth, and development. The data set for this analysis contains a single year of data for 213 countries.

Potential Clustering Variables

A random forest analysis was performed on a training set (N=111) to evaluate a series of explanatory variables in predicting GDP per person binned into thirds (low, middle, high). The following explanatory variables were evaluated:

  • Alcohol Consumption – 2008 recorded and estimated average alcohol consumption, adult (15+) per capita as collected by the World Health Organization
  • CO2 Emissions – Total amount of CO2 emission in metric tons from 1751 to 2006 as collected by CDIAC
  • Female Employment Rate – Percentage of female population, age above 15, that has been employed during 2007 as collected by the International Labour Organization
  • Internet Use Rate – 2010 Internet users per 100 people as collected by the World Bank
  • Life Expectancy – 2011 life expectancy at birth (in years) as collected by various sources
  • Polity Score – 2009 Democracy score as collected by the Polity IV Project
  • Employment Rate – Percentage of total population, age above 15, that has been employed during 2009 as collected by the International Labour Organization
  • Urbanization Rate – 2008 Urban population (% total population) as collected by the World Bank

The following table summarizes each variable's importance in classifying countries into high, middle, and low GDP-per-person groups:

Table 1. Variable Importance

Variable Importance
Life Expectancy 35%
Internet Use Rate 21%
Urbanization Rate 11%
Female Employment Rate 8%
CO2 Emissions 8%
Employment Rate 6%
Alcohol Consumption 6%
Polity Score 5%

The explanatory variables with relative importance scores greater than or equal to 10% (life expectancy, internet use rate, and urbanization rate) were selected as clustering variables.
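The selection rule itself is a one-liner in pandas. The sketch below simply applies the 10% cutoff to the Table 1 values; the column names match the GapMinder variables used in the source code at the end of this post.

import pandas as pd

# Illustration of the >= 10% selection rule using the Table 1 importances;
# in the full source below the same table comes from feature_importances_.
features = pd.DataFrame({
    'name': ['lifeexpectancy', 'internetuserate', 'urbanrate', 'femaleemployrate',
             'co2emissions', 'employrate', 'alcconsumption', 'polityscore'],
    'importance': [0.35, 0.21, 0.11, 0.08, 0.08, 0.06, 0.06, 0.05],
})
clustering_variables = features.loc[features['importance'] >= 0.10, 'name'].tolist()
print(clustering_variables)  # ['lifeexpectancy', 'internetuserate', 'urbanrate']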

K-Means Clustering

All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. Data were randomly split into a training set that included 70% of the observations (N=121) and a test set that included 30% of the observations (N=52). A series of k-means cluster analyses was conducted on the training data specifying k=1 to 9 clusters, using Euclidean distance. The mean Euclidean distance was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for selecting the number of clusters. The following figure illustrates the elbow curve.

Figure 1. Selecting K with the Elbow Curve Method

[Elbow curve: average distance vs. number of clusters]

Possible values of k include 2, 3, 4, and 7. Each candidate was evaluated to see whether there was a significant difference in economic well-being (GDP per person) between clusters. Breaking the data into two clusters proved to be a good choice, though seven was also reasonable. A preference for the simpler solution led to selecting k=2.
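The source code at the end of this post only fits the final k=2 model; the loop below is a rough sketch of how each candidate k could be compared with the same OLS/ANOVA approach, assuming the `training_data` and `training_target` frames built later in the script.

# Sketch of the candidate-k comparison described above, assuming the
# standardized `training_data` features and the `training_target` frame
# (with the raw incomeperperson column) from the source code below.
import statsmodels.formula.api as smf
from sklearn.cluster import KMeans

for candidate_k in (2, 3, 4, 7):
    candidate_model = KMeans(n_clusters=candidate_k).fit(training_data)
    trial = training_target.copy()
    trial['cluster'] = candidate_model.labels_
    anova = smf.ols('incomeperperson ~ C(cluster)', data=trial).fit()
    print(candidate_k, 'F-test p-value:', anova.f_pvalue)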

Findings

Cluster zero has much lower economic well-being (GDP per person) than cluster one.

Table 2. GDP Per Person by Cluster

Cluster Average GDP per Person
Zero $1,373
One $15,684

The following figure shows the clusters (zero in blue, one in green) and the clustering variables:

Figure 2. Clustering Variable Scatterplots

[Pair plot of the standardized clustering variables, colored by cluster]

Discussion

This examination was done as an exercise in k-means clustering. The GapMinder data do not lend themselves to clustering, and the scatter plots above illustrate the lack of distinct clusters. Caution should be exercised in using and interpreting these results.
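Although it was not part of the original analysis, one way to put a number on that lack of separation is the mean silhouette coefficient of the standardized clustering variables (values near zero indicate overlapping clusters). The sketch below assumes the `training_data` frame and the fitted two-cluster `model` from the source code.

# Sketch (not in the original analysis): quantify cluster separation with the
# mean silhouette coefficient, using the standardized clustering variables.
from sklearn.metrics import silhouette_score

features_only = training_data[['lifeexpectancy', 'internetuserate', 'urbanrate']]
print('Silhouette score (k=2):', silhouette_score(features_only, model.labels_))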

Source Code

As always, this project can be found in this GitHub repository.

# Import libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
sns.set_style('whitegrid')
sns.set_context('talk')

# Eliminate false positive SettingWithCopyWarning
pd.options.mode.chained_assignment = None

# Make results reproducible
np.random.seed(1234567890)

df = pd.read_csv('gapminder.csv')

variables = ['incomeperperson', 'alcconsumption', 'co2emissions', 'femaleemployrate', 
                'internetuserate', 'lifeexpectancy','polityscore','employrate','urbanrate'] 

# convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')

# listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
print(subset.shape)
print("\n")
"""
"===============  Random Forest to Select Clustering Variables  ===============
"""
n_estimators=25

# Bin GDP per person into thirds to create the categorical target
subset['incomequartiles'] = pd.cut(subset['incomeperperson'], 3, labels=['0%-33%','34%-66%','67%-100%'])
subset['incomequartiles'] = subset['incomequartiles'].astype('category')

# Remove incomeperperson from the predictors since the target is derived from it
variables.pop(0)

predictors = subset[variables]
targets = subset['incomequartiles']

# Split into training and testing sets
training_data, test_data, training_target, test_target  = train_test_split(predictors, targets, test_size=.25)

# Build the random forest classifier
classifier=RandomForestClassifier(n_estimators=n_estimators)
classifier=classifier.fit(training_data,training_target)

predictions=classifier.predict(test_data)

# Fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(training_data,training_target)

# Display the relative importance of each attribute
feature_name = list(predictors.columns.values)
feature_importance = list(model.feature_importances_)
features = pd.DataFrame({'name':feature_name, 'importance':feature_importance}).sort_values(by='importance', ascending=False)
print(features.head(len(feature_name)))

"""
" =============================  Data Management  =============================
"""
variables = ['incomeperperson', 'lifeexpectancy', 'internetuserate', 'urbanrate'] 

# convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')

# listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
print(subset.shape)

subset['incomequartiles'] = pd.cut(subset['incomeperperson'], 3, labels=['0%-33%','34%-66%','67%-100%'])
subset['incomequartiles'] = subset['incomequartiles'].astype('category')

# Remove the first variable from the list since the target is derived from it
variables.pop(0)

# Center and scale data
for variable in variables:
    subset[variable]=preprocessing.scale(subset[variable].astype('float64'))
    
features = subset[variables]
targets = subset[['incomeperperson']]

"""
" ==================  Split Data into Training and Test Sets  ==================
"""
# Split into training and testing sets
training_data, test_data, training_target, test_target  = train_test_split(features, targets, test_size=.3)
print('Size of training data')
print(training_data.shape)

"""
" =====================  Determine the Number of Clusters  ====================
"""
# Identify number of clusters using the elbow method
clusters=range(1,10)
meandist=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(training_data)
    clusassign=model.predict(training_data)
    meandist.append(sum(np.min(cdist(training_data, model.cluster_centers_, 'euclidean'), axis=1)) / training_data.shape[0])

# Visualize the elbow and highlight the chosen number of clusters
k = 2

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(clusters, meandist)
ax.plot(clusters[(k-1)], meandist[(k-1)], marker='o', markersize=12, 
    markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of Clusters')
plt.ylabel('Average Distance')
plt.title('Selecting K with the Elbow Method')
plt.show()

"""
" ==========================  Visualize the Clusters  =========================
"""
model=KMeans(n_clusters=k)
model.fit(training_data)
training_data['cluster'] = model.labels_

my_cmap = plt.cm.get_cmap('brg')
my_cmap.set_under('w')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(training_data.iloc[:,0], training_data.iloc[:,1], training_data.iloc[:,2], c=training_data['cluster'], cmap=my_cmap)
ax.set_xlabel(training_data.columns.values[0])
ax.set_ylabel(training_data.columns.values[1])
ax.set_zlabel(training_data.columns.values[2])
plt.show()
 
sns.pairplot(training_data, hue='cluster')
plt.show()

"""
" ====================  Examine Differences Between Clusters  =================
"""

training_target['cluster'] = model.labels_
income_model = smf.ols(formula='incomeperperson ~ C(cluster)', data=training_target).fit()
print (income_model.summary())

print ('means for features by cluster')
m1= training_target.groupby('cluster').mean()
print (m1)

print ('standard deviations for features by cluster')
m2= training_target.groupby('cluster').std()
print (m2)

mc1 = multi.MultiComparison(training_target['incomeperperson'], training_target['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())