How to Calculate Cosine Similarity in Excel

I often use cosine similarity at my job to find peers.  Cosine similarity is a measure of distance between two vectors.  While there are libraries in Python and R that will calculate it sometimes I’m doing a small scale project and so I use Excel.  Here’s how to do it.

First the Theory

I will not go into depth on what cosine similarity is as the web abounds in that kind of content.  Suffice it to say the formula for cosine similarity (for those of you who are mathematically inclined) is:

Cosine Similarity

So for the numerator we need to find the dot product of the two vectors.  That is done using the SUMPRODUCT function in Excel.  For each term in the denominator you need to find the square root of the sum of squares.  This is done by using the SUMSQ function nested in a SQRT function.

Now the Practice

I have put together a little template which shows how to calculate the cosine similarity.  I first organize the data in the spreadsheet so the attributes (or features or variables) go across the columns and the geographies go across each row.  I usually put who we are trying to find as a peer at the top of the spreadsheet.  I scale the data so that it ranges between zero and one. This way differences in one attribute doesn’t overpower the others simply due to a difference in scale.  With all that done I am ready to compute the cosine similarity.

Let’s suppose you have four attributes (A, B, C and D) for the baseline and five peers.  It would look like this in Excel:

Cosine Similarity Data

In order to calculate the cosine similarity of Peer 1 and the Baseline, I would divide the dot product (=SUMPRODUCT(B$2:E$2,B3:E3)) by the square root of the sum of squares multiplied together (=SQRT(SUMSQ(B3:E3))*SQRT(SUMSQ($B$2:$E$2))).  So if you want it all in one hairy formula in cell F3 it would be:



Did I Just Slip Into the World of Big Data?

I recently was trying to create a random forest classifier on a data set using R.  As you can see I ran into some problems:


The data set was half a gig with nearly one hundred rows.  I reduced the columns down to three columns (the class and two features) but couldn’t do it in R.  I was able to generate a random forest classifier in Python, however I wanted the R output to develop an API that would use the random forest model.

Does Investment in Public Libraries Increase Usage?

I recently mentioned that I had been exploring some data on public libraries.  Here’s the reason why.  A recent local new paper article chronicled the role libraries are playing today. They highlight the fact that some local libraries that have undergone major renovations recently. In the article they claim:

The surge in popularity mirrors what other communities have seen. When they invest in libraries, the number of people using them goes up.

The claim seemed to rely on anecdotal evidence, so I determined to examine this using data.


I want preface this by admitting that I am a big fan of libraries.  I have fond memories of summer reading programs in my childhood.  My very first exposure to the Internet happened in a public library.  I used to roller blade to the local public library as a teenager to do my homework (even though I had my own desk at home). When my parents moved and I visited them, one of the local attractions I wanted to see was their public libraries.  I love them.  However I love claims being backed with data more than anecdote, especially when it touches something close to me.


I used data from the annual report for public and association libraries to evaluate the claims.  I looked at the data from 1991-2014.  As always, for those who care to replicate my analysis, you can check out the GitHub repository.

I examined the change in library “usage” in terms of circulation and visits.  I wanted to see if the investment in libraries spurred on increase usage that died out over time so I looked at the difference from a one year before and after investment window up to ten years.

There are just under 500 libraries that had a renovation over the time period.  There were also about 200 libraries in New York State didn’t have major renovations.  I was able to use these libraries as a control group.  If there was a statistically significant difference between these two groups there would be data to back up the news paper article’s claim.


After looking at circulation and visitation over the various time frames there was no difference between the libraries that were renovated and those that were not.  Not over the short term, or long term.  So the bottom-line is that the claim that investment increases library usage is not supported by the data.


Visualizing U.S. Commuters

I recently analyzed the patterns of U.S. Commuters and created a visualization that summarizes these patterns at the state level.

About the Data

This data is from the Census Transporation Planning Products (CTPP). For those who don’t know, the CTPP is derived from the 2006–2010 5-year American Community Survey (ACS) data. You can learn more about this data set by visiting the home page. In an effort to make this easily reproducible, you can download the csv used in this analysis. I chose this data source because I have used it in the past and am familiar with it.

Steps Taken

I created a directed network graph from this data. I used python’s networkx and pandas packages and the complete source code is provided below. I excluded Puerto Rico from the data set because I wanted to analyze state level patterns. I did leave the District of Columbia in thus there are 51 States in the analysis.

I wanted to use Gephi to generate the visualization because I have done so in the past. After creating the directed network graph using Python, I imported the data into Gephi and used the community detection algorithm using the default settings (Randomized checked, Use weights checked, Resolution = 1.0). This resulted in 7 communities being detected.

I grouped the states by these communities and colored them and placed them in a circular layout with straight edges between the nodes. I varied the width based on the edge weight.


Commuters Network


There are a couple of things that are of interest. The first thing to acknowledge is the proliferation of edges in this network.  Almost all of the states are connected with the other states.  This results in the “spirograph” like effect in the visualization.  However I don’t find that to be the most interesting aspect.

The sub-networks highlighted in this visualization are particularly interesting.  For example one readily sees that a lot of people in New Jersey work in New York.  As a former New Yorker there is no surprises there.  You also see the Capital Beltway connections in the visualization.  Residents of Maryland and Virginia find work in D.C.  The community detection algorithm highlights this finding.  Since the states are arranged by “community”, seeing the cross-community connections are interesting.  For example the Connecticut to New York have a connection.

Source Code

The following is how I generated the directed network graph file for Gephi from the CSV.

import pandas as pd
import networkx as nx

df = pd.read_csv('Table 1 Commuting Flows Available at State to POW State and County to POW County only.csv', skiprows=[0,1], thousands=',')
# Only pull the Estimates
df = df[df['Output']=='Estimate']
# Only pull the first 4 columns
df = df[df.columns[:4]]
# Drop the N/A's
df = df.dropna()
# Drop the Output column
df = df.drop('Output', 1)
# Rename the columns
df.columns = ['from','to','weight']
# Drop the Puerto Rico records
df = df[df['from'] != 'Puerto Rico']
df = df[df['to'] != 'Puerto Rico']
# Remove the people who work where they live
commuters = df[df['from'] != df['to']]
# Build the network graph
G = nx.from_pandas_dataframe(commuters, 'from','to', ['weight'], create_using=nx.DiGraph())
# Write the graph so I can use Gephi

New York State Public Libraries Circulation Visualization

I have recently been exploring data on the public libraries of New York State for a side project (more on that in a latter post hopefully).  I have also stated a Data Visualization course on Coursera and have decided to feature some visualization of this data set.

About the Data

The data used in this analysis comes from the Annual Report for Public and Association Libraries produced for New York State Education Department (NYSED). You can access the data at using “new york” as the username and “pals” as the password.  Load the saved list named “All Libraries as of 15 March 2016” and select the “Total Circulation” data element.

Visualization Decisions

For this visualization I decided to use all data from 2000 to 2014 (latest data available).  I aggregated the library level circulation data to generate the aggregate circulation for New York State Public Libraries.  I used colorblind safe colors from the Color Brewer palette.  I adjusted the scale on the Y-axis to be in millions.  I used R to generate the following visualization:


What It Tells Us

Book circulation generally increased until 2010 where one observes a reversal of the decade long trend.  There is an exceptionally precipitous drop from 2013 to 2014.

This begs the question why is this changing?  Is it because of a change in the population?  Is it due to a change in the number of libraries reporting (might explain the 2013-2014 drop)?  Is it due to a rise in digital media sources as a substitute for books?  Is it due to a lack of public support/investment in libraries? I plan at looking at that last question in a future post.

Source Code


book_circulation <- read.csv('', na.strings = 'N/A', stringsAsFactors = FALSE) %>%
  gather(., Year, measurement, X1991:X2014) %>%
  mutate(Year = as.numeric(substr(Year,2,5))) %>%
  mutate(measurement = as.numeric(gsub(',', '', measurement))) %>%
  filter(Year > 1999)%>%
  filter(ifelse(,0,1)==1) %>%
  group_by(Year) %>%
  summarise(Circulation = sum(measurement)) %>%
  mutate(Circulation = Circulation/1000000)

ggplot(book_circulation, aes(Year, Circulation)) + geom_bar(stat='identity', fill="#9ecae1", colour="#3182bd") + ylab('Book Circulation (in millions)') + ggtitle('Book Circulation in NYS Public Libraries, 2000-2014') + theme_hc()