## Introduction

A k-means clustering analysis was conducted to divide countries into subgroups based on variables identified as having an impact on economic well-being. The data for the analysis are an extract from the GapMinder project, which collects country-level time series data on health, wealth, and development. The data set for this analysis contains a single year of data for 213 countries.

### Potential Clustering Variables

A random forest analysis was performed on a training set (N=111) to evaluate a series of explanatory variables in predicting GDP per person binned into thirds. The following explanatory variables were evaluated:

- **Alcohol Consumption** – 2008 recorded and estimated average alcohol consumption per capita, adults (15+), as collected by the World Health Organization
- **CO₂ Emissions** – Total amount of CO₂ emissions in metric tons from 1751 to 2006, as collected by CDIAC
- **Female Employment Rate** – Percentage of the female population, age 15 and above, employed during 2007, as collected by the International Labour Organization
- **Internet Use Rate** – 2010 Internet users per 100 people, as collected by the World Bank
- **Life Expectancy** – 2011 life expectancy at birth (in years), as collected from various sources
- **Polity Score** – 2009 democracy score, as collected by the Polity IV Project
- **Employment Rate** – Percentage of the total population, age 15 and above, employed during 2009, as collected by the International Labour Organization
- **Urbanization Rate** – 2008 urban population (% of total population), as collected by the World Bank

The following table summarizes each variable's importance in classifying countries into high, middle, and low GDP-per-person groups:

**Table 1. Variables Importance**

| Variable | Importance |
|---|---|
| Life Expectancy | 35% |
| Internet Use Rate | 21% |
| Urbanization Rate | 11% |
| Female Employment Rate | 8% |
| CO₂ Emissions | 8% |
| Employment Rate | 6% |
| Alcohol Consumption | 6% |
| Polity Score | 5% |

The explanatory variables with relative importance scores greater than or equal to 10% (life expectancy, internet use rate, and urbanization rate) were selected as clustering variables.
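The selection rule above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the GapMinder extract; the variable names and the 10% cutoff mirror the text, and `ExtraTreesClassifier` supplies the importance scores.

```python
# Sketch of the variable-selection step: rank explanatory variables by their
# importance in predicting a binned target, then keep those with importance
# >= 10%. Data and variable names are synthetic/illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
n = 200
X = pd.DataFrame({
    'lifeexpectancy': rng.normal(70, 10, n),
    'internetuserate': rng.normal(40, 20, n),
    'urbanrate': rng.normal(55, 15, n),
    'noise': rng.normal(0, 1, n),
})
# Synthetic target driven mostly by life expectancy, binned into thirds
y = pd.qcut(X['lifeexpectancy'] + 0.1 * X['internetuserate'], 3, labels=False)

model = ExtraTreesClassifier(n_estimators=25, random_state=0)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)

# Keep variables whose relative importance is at least 10%
selected = importances[importances >= 0.10].index.tolist()
print(importances.sort_values(ascending=False))
print(selected)
```

The importances sum to 1, so the 10% threshold is applied directly to the raw scores.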

## K-Means Clustering

All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. The data were randomly split into a training set containing 70% of the observations (N=121) and a test set containing 30% (N=52). A series of k-means cluster analyses was conducted on the training data specifying k=1 through 9 clusters, using Euclidean distance. The mean Euclidean distance was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for selecting the number of clusters. The following figure illustrates the elbow curve:

**Figure 1. Selecting K with the Elbow Curve Method**
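The procedure behind the elbow curve can be sketched as follows. This is a minimal version on synthetic data (two artificial groups standing in for the countries); the plotting calls are omitted, and only the mean-distance computation is shown.

```python
# Elbow-method sketch: standardize the clustering variables, fit k-means for
# k = 1..9, and record the mean Euclidean distance from each observation to
# its nearest cluster center. The data here are synthetic stand-ins.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn import preprocessing
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# Two synthetic groups in three dimensions
data = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(4, 1, (40, 3))])
data = preprocessing.scale(data)  # mean 0, standard deviation 1 per column

clusters = range(1, 10)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    # Mean distance of each point to its closest center
    meandist.append(
        np.min(cdist(data, model.cluster_centers_, 'euclidean'), axis=1).mean())
print(meandist)
```

Plotting `meandist` against `clusters` reproduces the elbow curve; the "elbow" is where additional clusters stop buying a meaningful drop in average distance.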

Possible values of k include 2, 3, 4, and 7. Each value was evaluated to see whether there was a significant difference in economic well-being (GDP per person) between the resulting clusters. Breaking the data into 2 clusters proved to be a good choice; however, I should note that 7 was not a bad choice either. My preference for no more complexity than needed led to selecting k=2.
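The per-k evaluation described above can be sketched like this: refit k-means at a candidate k, then test whether mean GDP per person differs significantly across the resulting clusters. The writeup's source code uses an OLS model and Tukey HSD; the sketch below uses `scipy.stats.f_oneway`, the equivalent one-way ANOVA F-test, on synthetic stand-in data.

```python
# Evaluate a candidate k: cluster the (standardized) features, then test for
# a significant difference in GDP per person across clusters. Synthetic data.
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
# Two well-separated synthetic groups of 3 standardized clustering variables,
# paired with GDP-per-person values that differ sharply between the groups
features = np.vstack([rng.normal(-1, 0.5, (50, 3)), rng.normal(1, 0.5, (50, 3))])
gdp = np.concatenate([rng.normal(1500, 300, 50), rng.normal(15000, 3000, 50)])

k = 2  # candidate number of clusters under evaluation
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
frame = pd.DataFrame({'incomeperperson': gdp, 'cluster': labels})

# One-way ANOVA: does mean GDP per person differ between the clusters?
groups = [g['incomeperperson'] for _, g in frame.groupby('cluster')]
stat, p = f_oneway(*groups)
print('F = %.1f, p = %.3g' % (stat, p))
```

A small p-value at a given k indicates the clusters separate countries by economic well-being; the same check repeated at k = 2, 3, 4, and 7 supports the comparison described above.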

### Findings

Cluster zero has a much lower economic well-being (GDP per person) than cluster one.

**Table 2. GDP Per Person by Cluster**

| Cluster | Average GDP per Person |
|---|---|
| Zero | $1,373 |
| One | $15,684 |
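Table 2 is a group-mean comparison, which reduces to a one-line `groupby` once cluster labels are attached to the unstandardized GDP values. The numbers below are made up for illustration; they are not the GapMinder figures.

```python
import pandas as pd

# Hypothetical GDP-per-person values with their assigned cluster labels
df = pd.DataFrame({
    'incomeperperson': [800, 1200, 1900, 14000, 16500, 17000],
    'cluster': [0, 0, 0, 1, 1, 1],
})

# Average GDP per person by cluster (the Table 2 computation)
means = df.groupby('cluster')['incomeperperson'].mean()
print(means)
```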

The following figure shows the clusters (zero in blue, one in green) and the clustering variables:

**Figure 2. Clustering Variables Scatterplots**

## Discussion

This examination was done as an exercise in k-means clustering. The data in the GapMinder dataset do not lend themselves to clustering; the scatter plots above illustrate the lack of distinct clusters. Caution should be exercised in using or interpreting these results.

## Source Code

As always, this project can be found in this GitHub repository.

```python
# Import libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import train_test_split  # cross_validation is deprecated
from sklearn import preprocessing
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

sns.set_style('whitegrid')
sns.set_context('talk')

# Eliminate false positive SettingWithCopyWarning
pd.options.mode.chained_assignment = None

# Make results reproducible
np.random.seed(1234567890)

df = pd.read_csv('gapminder.csv')

variables = ['incomeperperson', 'alcconsumption', 'co2emissions',
             'femaleemployrate', 'internetuserate', 'lifeexpectancy',
             'polityscore', 'employrate', 'urbanrate']

# Convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')

# Listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
print(subset.shape)
print("\n")

"""
=============== Random Forest to Select Clustering Variables ===============
"""
n_estimators = 25

subset['incomequartiles'] = pd.cut(subset['incomeperperson'], 3,
                                   labels=['0%-33%', '34%-66%', '67%-100%'])
subset['incomequartiles'] = subset['incomequartiles'].astype('category')

variables.pop(0)
predictors = subset[variables]
targets = subset['incomequartiles']

# Split into training and testing sets
training_data, test_data, training_target, test_target = train_test_split(
    predictors, targets, test_size=.25)

# Build the random forest classifier
classifier = RandomForestClassifier(n_estimators=n_estimators)
classifier = classifier.fit(training_data, training_target)
predictions = classifier.predict(test_data)

# Fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(training_data, training_target)

# Display the relative importance of each attribute
feature_name = list(predictors.columns.values)
feature_importance = list(model.feature_importances_)
features = pd.DataFrame({'name': feature_name,
                         'importance': feature_importance}) \
    .sort_values(by='importance', ascending=False)
print(features.head(len(feature_name)))

"""
============================= Data Management =============================
"""
variables = ['incomeperperson', 'lifeexpectancy', 'internetuserate', 'urbanrate']

# Convert to numeric format
for variable in variables:
    df[variable] = pd.to_numeric(df[variable], errors='coerce')

# Listwise deletion of missing values
subset = df[variables].dropna()

# Print the rows and columns of the data frame
print('Size of study data')
print(subset.shape)

subset['incomequartiles'] = pd.cut(subset['incomeperperson'], 3,
                                   labels=['0%-33%', '34%-66%', '67%-100%'])
subset['incomequartiles'] = subset['incomequartiles'].astype('category')

# Remove the first variable from the list since the target is derived from it
variables.pop(0)

# Center and scale data
for variable in variables:
    subset[variable] = preprocessing.scale(subset[variable].astype('float64'))

features = subset[variables]
targets = subset[['incomeperperson']]

"""
================== Split Data into Training and Test Sets ==================
"""
# Split into training and testing sets
training_data, test_data, training_target, test_target = train_test_split(
    features, targets, test_size=.3)
print('Size of training data')
print(training_data.shape)

"""
===================== Determine the Number of Clusters ====================
"""
# Identify number of clusters using the elbow method
clusters = range(1, 10)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(training_data)
    clusassign = model.predict(training_data)
    meandist.append(sum(np.min(cdist(training_data, model.cluster_centers_,
                                     'euclidean'), axis=1))
                    / training_data.shape[0])

# Visualize the elbow
k = 2
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(clusters, meandist)
ax.plot(clusters[k - 1], meandist[k - 1], marker='o', markersize=12,
        markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of Clusters')
plt.ylabel('Average Distance')
plt.title('Selecting K with the Elbow Method')
plt.show()

"""
========================== Visualize the Clusters =========================
"""
model = KMeans(n_clusters=k)
model.fit(training_data)
training_data['cluster'] = model.labels_

my_cmap = plt.cm.get_cmap('brg')
my_cmap.set_under('w')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(training_data.iloc[:, 0], training_data.iloc[:, 1],
           training_data.iloc[:, 2],
           c=training_data['cluster'], cmap=my_cmap)
ax.set_xlabel(training_data.columns.values[0])
ax.set_ylabel(training_data.columns.values[1])
ax.set_zlabel(training_data.columns.values[2])
plt.show()

sns.pairplot(training_data, hue='cluster')

"""
==================== Examine Differences Between Clusters =================
"""
training_target['cluster'] = model.labels_

income_model = smf.ols(formula='incomeperperson ~ C(cluster)',
                       data=training_target).fit()
print(income_model.summary())

print('means for features by cluster')
m1 = training_target.groupby('cluster').mean()
print(m1)

print('standard deviations for features by cluster')
m2 = training_target.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(training_target['incomeperperson'],
                            training_target['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
```