I put together a neat little visualization that lets the user do a back-of-the-envelope calculation for an EMS service delivery option. You can try it out at https://msilva-cgr.shinyapps.io/essex-county-ems-options/. It is a Shiny app. Because of the time crunch, it is slow: it does a lot of calculations on the fly. If I were to do it again, I would preprocess the data and save the results, which would cut out the costly computations. But in any event it is a really neat tool.
I recently analyzed the patterns of U.S. commuters and created a visualization that summarizes these patterns at the state level.
About the Data
This data is from the Census Transportation Planning Products (CTPP). For those who don’t know, the CTPP is derived from the 2006–2010 5-year American Community Survey (ACS) data. You can learn more about this data set by visiting the home page. In an effort to make this easily reproducible, you can download the CSV used in this analysis. I chose this data source because I have used it in the past and am familiar with it.
I created a directed network graph from this data. I used Python’s networkx and pandas packages, and the complete source code is provided below. I excluded Puerto Rico from the data set because I wanted to analyze state-level patterns. I did leave the District of Columbia in, so there are 51 “states” in the analysis.
I wanted to use Gephi to generate the visualization because I have done so in the past. After creating the directed network graph using Python, I imported the data into Gephi and used the community detection algorithm using the default settings (Randomized checked, Use weights checked, Resolution = 1.0). This resulted in 7 communities being detected.
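To give a feel for what a modularity score rewards, here is a minimal Python sketch that scores a given partition of a weighted directed graph. It uses one common directed form of modularity; it is not Gephi's implementation, the function name `directed_modularity` is hypothetical, and the edge weights are made up for illustration.

```python
from collections import defaultdict


def directed_modularity(edges, community):
    """Score a partition of a weighted directed graph.

    edges: list of (source, target, weight) tuples
    community: dict mapping each node to a community label
    """
    total = sum(w for _, _, w in edges)
    within = 0.0
    out_strength = defaultdict(float)  # weight leaving each community
    in_strength = defaultdict(float)   # weight arriving in each community
    for src, dst, w in edges:
        out_strength[community[src]] += w
        in_strength[community[dst]] += w
        if community[src] == community[dst]:
            within += w
    # Weight expected inside communities if edges were placed at random
    expected = sum(out_strength[c] * in_strength[c] for c in out_strength) / total
    return (within - expected) / total


# Toy flows (invented): two tight pairs joined by one weak edge
toy_edges = [
    ("A", "B", 10), ("B", "A", 10),
    ("C", "D", 10), ("D", "C", 10),
    ("B", "C", 1),
]
two_groups = {"A": 0, "B": 0, "C": 1, "D": 1}
one_group = {"A": 0, "B": 0, "C": 0, "D": 0}

print(round(directed_modularity(toy_edges, two_groups), 3))
print(round(directed_modularity(toy_edges, one_group), 3))
```

Splitting the toy graph into its two tight pairs scores well above lumping everything into a single community, which is the kind of contrast a community detection algorithm searches for.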
I grouped the states by these communities and colored them and placed them in a circular layout with straight edges between the nodes. I varied the width based on the edge weight.
There are a couple of things that are of interest. The first thing to acknowledge is the proliferation of edges in this network. Almost every state is connected to the other states. This results in the “spirograph”-like effect in the visualization. However, I don’t find that to be the most interesting aspect.
The sub-networks highlighted in this visualization are particularly interesting. For example, one readily sees that a lot of people in New Jersey work in New York. As a former New Yorker, I see no surprises there. You also see the Capital Beltway connections in the visualization: residents of Maryland and Virginia find work in D.C. The community detection algorithm highlights this finding. Since the states are arranged by “community”, the cross-community connections are interesting to see. For example, Connecticut and New York have a connection.
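One way to put a number on how crowded a network like this is would be its density: the share of possible directed edges that are actually present. A rough Python sketch follows; the function name is hypothetical and the edge count in the example is a placeholder, not a figure from this data set.

```python
def directed_density(node_count, edge_count):
    """Share of possible directed edges (no self-loops) that are present."""
    possible = node_count * (node_count - 1)
    return edge_count / possible


# With 51 nodes there are 51 * 50 possible directed edges
max_edges = 51 * 50

# Placeholder edge count, chosen only to show the calculation
example_density = directed_density(51, 2000)
print(max_edges, round(example_density, 2))
```

A density close to 1 would mean nearly every state sends at least some commuters to nearly every other state.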
The following is how I generated the directed network graph file for Gephi from the CSV.
    import pandas as pd
    import networkx as nx

    df = pd.read_csv(
        'Table 1 Commuting Flows Available at State to POW State and County to POW County only.csv',
        skiprows=[0, 1],
        thousands=',',
    )
    # Only pull the Estimates
    df = df[df['Output'] == 'Estimate']
    # Only pull the first 4 columns
    df = df[df.columns[:4]]
    # Drop the N/A's
    df = df.dropna()
    # Drop the Output column (keyword form; newer pandas releases
    # no longer accept the positional axis argument)
    df = df.drop(columns='Output')
    # Rename the columns
    df.columns = ['from', 'to', 'weight']
    # Drop the Puerto Rico records
    df = df[df['from'] != 'Puerto Rico']
    df = df[df['to'] != 'Puerto Rico']
    # Remove the people who work where they live
    commuters = df[df['from'] != df['to']]
    # Build the network graph (from_pandas_edgelist is the name used by
    # newer networkx releases; it was formerly from_pandas_dataframe)
    G = nx.from_pandas_edgelist(commuters, 'from', 'to', ['weight'], create_using=nx.DiGraph())
    # Write the graph so I can use Gephi
    nx.write_gexf(G, 'Commuters.gexf')

I have recently been exploring data on the public libraries of New York State for a side project (more on that in a later post, hopefully). I have also started a Data Visualization course on Coursera and have decided to feature a visualization of this data set.
About the Data
The data used in this analysis comes from the Annual Report for Public and Association Libraries produced for New York State Education Department (NYSED). You can access the data at http://collectconnect.baker-taylor.com/ using “new york” as the username and “pals” as the password. Load the saved list named “All Libraries as of 15 March 2016” and select the “Total Circulation” data element.
For this visualization I decided to use all data from 2000 to 2014 (latest data available). I aggregated the library level circulation data to generate the aggregate circulation for New York State Public Libraries. I used colorblind safe colors from the Color Brewer palette. I adjusted the scale on the Y-axis to be in millions. I used R to generate the following visualization:
What It Tells Us
Book circulation generally increased until 2010, when one observes a reversal of the decade-long trend. There is an exceptionally precipitous drop from 2013 to 2014.
This raises the question: why is this changing? Is it because of a change in the population? Is it due to a change in the number of libraries reporting (which might explain the 2013-2014 drop)? Is it due to a rise in digital media sources as a substitute for books? Is it due to a lack of public support for or investment in libraries? I plan to look at that last question in a future post.
    library(dplyr)
    library(tidyr)
    library(ggplot2)
    library(ggthemes)

    book_circulation <- read.csv('https://goo.gl/fyybwi', na.strings = 'N/A', stringsAsFactors = FALSE) %>%
      gather(., Year, measurement, X1991:X2014) %>%
      mutate(Year = as.numeric(substr(Year, 2, 5))) %>%
      mutate(measurement = as.numeric(gsub(',', '', measurement))) %>%
      filter(Year > 1999) %>%
      filter(!is.na(measurement)) %>%
      group_by(Year) %>%
      summarise(Circulation = sum(measurement)) %>%
      mutate(Circulation = Circulation / 1000000)

    ggplot(book_circulation, aes(Year, Circulation)) +
      geom_bar(stat = 'identity', fill = "#9ecae1", colour = "#3182bd") +
      ylab('Book Circulation (in millions)') +
      ggtitle('Book Circulation in NYS Public Libraries, 2000-2014') +
      theme_hc()

I was talking with a friend yesterday when he mentioned he came across an article that looked at the number of accidents across the United States. He said they had a map showing the per capita rates. He described it as being darker in the South and in the Rocky Mountain states (with the exception of Utah), and lighter on the coasts. I didn’t find the article he was referencing, so I decided to pull the latest summary of the U.S. Department of Transportation’s Fatality Analysis Reporting System (FARS) data from the IIHS (Insurance Institute for Highway Safety) and make the map myself.
This didn’t exactly match what he described (or what I thought he described) so I broke the states out into quartiles.
Now that seemed closer. He objected to the measurement, though. He said a per capita rate makes New York’s rate look good because New York City makes up a large part of the population and a lot of the people who live there don’t drive. He thought it would be better to use highway miles traveled as the denominator, and that rural states would not be as dark as they are under a per capita measure. I wanted to explore the data and see if there was something to his objection. The data existed in the IIHS data set, so here’s what I produced, side by side with the above maps for easy comparison:
By changing the rate from a person-based to a mileage-based measure we see some impact. For example, we see changes in the quartiles (e.g. Virginia is considered a safer state and Texas a riskier one). We don’t see much of a change for New York (the shade of orange is slightly darker). Wyoming seems to be a much safer state when viewed through a per-100-million-miles lens. For those interested, you can see my code in the project’s GitHub repository.
Recently I wanted to organize the 67 counties in Florida into regions that made sense. I wanted these regions to be based on data.
I decided I could use commuter data from the Census Transportation Planning Package (CTPP) to create a network graph. It would show the flow of workers 16 years and older from the county they reside in to the county they work in. The thought was that counties that share commuters are economically linked together.
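The effect of swapping denominators can be sketched with a few lines of Python. The four "states" and all of their figures below are invented purely to show the arithmetic; they are not IIHS or FARS numbers, and the rank-based quartile rule is just one simple way to bin the rates.

```python
# Hypothetical figures, invented for illustration only.
# name: (deaths, population, vehicle miles traveled in millions)
toy_states = {
    "Urban": (1000, 20_000_000, 125_000),
    "Suburban": (900, 9_000_000, 90_000),
    "Rural": (150, 600_000, 10_000),
    "Mixed": (700, 5_000_000, 40_000),
}


def per_capita_rate(deaths, population):
    """Deaths per 100,000 residents."""
    return deaths / population * 100_000


def per_mile_rate(deaths, vmt_millions):
    """Deaths per 100 million vehicle miles traveled."""
    return deaths / vmt_millions * 100


def quartiles(rates):
    """Assign quartile 1 (lowest rates) to 4 (highest) by rank."""
    ordered = sorted(rates, key=rates.get)
    n = len(ordered)
    return {name: i * 4 // n + 1 for i, name in enumerate(ordered)}


capita = {s: per_capita_rate(d, p) for s, (d, p, _) in toy_states.items()}
mile = {s: per_mile_rate(d, v) for s, (d, _, v) in toy_states.items()}

capita_q = quartiles(capita)
mile_q = quartiles(mile)
for state in toy_states:
    print(state, capita_q[state], mile_q[state])
```

In this toy data the rural state sits in the worst quartile per capita but moves down a quartile once miles traveled is the denominator, which is the kind of shift my friend expected.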
I used Gephi to create the network graph. It was a weighted directional graph. I had Gephi calculate the modularity using the default settings which broke the state up into 7 regions.
I then exported the data out of Gephi and pulled it into R and created a quick map to visualize the regions. At first blush these regions seem to make sense. This may be a good approach to use in the future.
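The step of turning Gephi's output into regions amounts to grouping counties by their modularity class. Here is a small Python sketch of that grouping (the post itself used R for this step). It assumes the exported node table has `Id`, `Label`, and `modularity_class` columns; the sample rows and their class assignments are made up and are not the results of this analysis.

```python
import csv
import io
from collections import defaultdict

# Made-up rows standing in for a Gephi node-table export (column names assumed)
sample_export = """Id,Label,modularity_class
12001,Alachua,0
12086,Miami-Dade,1
12011,Broward,1
12075,Levy,0
"""


def regions_from_export(text):
    """Group county labels by their modularity class."""
    regions = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        regions[row["modularity_class"]].append(row["Label"])
    return dict(regions)


regions = regions_from_export(sample_export)
for region, counties in sorted(regions.items()):
    print(region, counties)
```

Each resulting group can then be joined to county boundaries by the `Id` column to draw the map.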
This is the third in a series of posts chronicling my project for the Data Management and Visualization course. This week we learned about visualizations to summarize the data.
As usual I put together a Jupyter notebook which is hosted on my GitHub project repository. It is an example of literate programming so it mixes narrative content with machine readable code. If you want to view the Python script sans narration it is available too.
In this analysis I would like to examine the relationship between the economic well-being of a society and the level of openness. My hypothesis is that countries with a more open society will have a higher level of economic well-being.
The data for this analysis comes from a subset of the Gapminder project data. I use the level of democratization as a measure of the openness of a country. To measure economic well-being I will be using GDP per capita data. The data management decisions I made are chronicled in my previous post and will not be covered here.
The income per person is unimodal and right skewed. Values range from about $100 to $40,000. The mean is $6,600 and the median is $2,200. There is a natural floor as it is not possible to have a negative GDP per person.
This data is categorical in nature so here’s the count of countries on the open to closed spectrum (most open on the left and closed on the right). Most of the countries are generally open. 32 are Full Democracies and 20 are Autocracies. These groups are especially important in this analysis.
Average Economic Well-Being by Country’s Openness
To look at the relationship I will compare the average economic well-being by the level of a country’s openness. We see that the full democracies have a higher average than the autocracies. It is also noteworthy to point out the U shape to this distribution.
Median Economic Well-Being by Country’s Openness
We did observe a considerable range in the univariate analysis so I made the comparison again using the median as the measure.
For those not accustomed to box plots, the median is the line inside the box. The median of the full democracies is higher than that of the autocracies. One can still observe a U shape in the distribution.
I like the box plots because you can see the outliers that influence the means.
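To see how a single outlier pulls the mean away from the median within a group, here is a short Python sketch using only the standard library. The income values are invented for illustration and are not Gapminder figures; `summarize` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean, median

# Invented (level of openness, income per person) pairs
toy_rows = [
    ("Autocracy", 500), ("Autocracy", 800),
    ("Autocracy", 1_100), ("Autocracy", 30_000),  # one wealthy outlier
    ("Full Democracy", 15_000), ("Full Democracy", 18_000),
    ("Full Democracy", 21_000),
]


def summarize(rows):
    """Return {level: (mean, median)} of income for each level of openness."""
    groups = defaultdict(list)
    for level, income in rows:
        groups[level].append(income)
    return {level: (mean(vals), median(vals)) for level, vals in groups.items()}


summary = summarize(toy_rows)
for level, (avg, mid) in summary.items():
    print(level, avg, mid)
```

In the toy autocracy group, one high-income country lifts the mean far above the median, while the group without an outlier has a mean and median that agree.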
The data support the hypothesis that countries that are more open and democratic have a higher standard of living or economic well-being than those that are closed. I would hasten to note that there is a U shape in the distribution which suggests that as a country moves from an autocracy towards a more open democracy, it might lower the economic well-being of the citizens. As a summary I would like to present the average (mean and median) economic well-being by the level of openness.
|Level of Openness|Countries|Mean GDP per Capita|Median GDP per Capita|
|1 – Full Democracy|32|$19,290|$17,222|
|2 – Democracy|57|$3,425|$1,621|
|3 – Open Anocracy|19|$1,167|$669|
|4 – Closed Anocracy|27|$2,473|$591|
|5 – Autocracy|20|$6,114|$2,385|
I have created an interesting visualization. It shows the density of jobs in a given sector. Here’s an example of the visualization near where I live:
It is created with R and is fully replicable. Interested people can check out https://github.com/mikeasilva/dot-density-map.