As part of Week 3 of the Coursera Data Analysis and Tools course, I examined the relationship between a country’s urbanization rate and the economic well-being of its citizens. I have saved my Python script and Jupyter notebook to the GitHub repository for the course.
For this analysis the following Python code was used:
```python
# Import libraries needed
import pandas as pd
import scipy.stats  # import the stats submodule explicitly so scipy.stats is available
import seaborn as sns
import matplotlib.pyplot as plt

# Read in the GapMinder Data
df = pd.read_csv('gapminder.csv', low_memory=False)

# Change the data type for variables of interest
df['urbanrate'] = pd.to_numeric(df['urbanrate'], errors='coerce')
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')

# Get the subset of complete data cases
subset = df[['urbanrate', 'incomeperperson']].dropna()

# Pearson's Correlation Coefficient
print('Association Between Urbanization Rate and Economic Well-Being')
# pearsonr returns the coefficient and the p-value, so unpack both
r, p_value = scipy.stats.pearsonr(subset['urbanrate'], subset['incomeperperson'])
print(r, p_value)
r_squared = r * r
print('R Squared = ' + str(r_squared))

# Visualize the data
sns.set_context('poster')
plt.figure(figsize=(14, 7))
sns.regplot(x="urbanrate", y="incomeperperson", data=subset)
plt.ylabel('Economic Well-Being (GDP Per Person)')
plt.xlabel('Urbanization Rate')
plt.show()
```
The above Python code resulted in a moderate, positive Pearson’s correlation coefficient of 0.49. This relationship is statistically significant (p-value of about 8.2 × 10^-14). The R squared is 0.24, so roughly a quarter of the variation in income per person can be explained by a linear relationship with urbanization rate.
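To make the "share of variation explained" reading of R squared more concrete, here is a minimal sketch in plain Python. It does not use the GapMinder data: the numbers are made-up synthetic values, and the helper names `pearson_r` and `explained_variation` are hypothetical, introduced only for illustration. It computes Pearson's r by hand, fits a least-squares line, and compares r squared with 1 minus the ratio of residual to total variation.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def explained_variation(xs, ys):
    """Fit y = slope * x + intercept by least squares and return
    1 - SS_res / SS_tot, the share of variation in y the line explains."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins for urbanization rate and income per person
# (assumed values, NOT the GapMinder data).
random.seed(0)
urban = [random.uniform(10, 100) for _ in range(200)]
income = [80.0 * u + random.gauss(0, 4000) for u in urban]

r = pearson_r(urban, income)
explained = explained_variation(urban, income)

print('r =', r)
print('r squared =', r ** 2)
print('share of variation explained by the fitted line =', explained)
```

For a simple linear fit with one predictor, the last two printed values agree up to floating-point error, which is why R squared is read as the share of variation in the response accounted for by the explanatory variable.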