I often use cosine similarity at my job to find peers. Cosine similarity is a measure of distance between two vectors. While there are libraries in Python and R that will calculate it sometimes I’m doing a small scale project and so I use Excel. Here’s how to do it.
First the Theory
I will not go into depth on what cosine similarity is as the web abounds in that kind of content. Suffice it to say the formula for cosine similarity (for those of you who are mathematically inclined) is:
So for the numerator we need to find the dot product of the two vectors. That is done using the SUMPRODUCT function in Excel. For each term in the denominator you need to find the square root of the sum of squares. This is done by using the SUMSQ function nested in a SQRT function.
Now the Practice
I have put together a little template which shows how to calculate the cosine similarity. I first organize the data in the spreadsheet so the attributes (or features or variables) go across the columns and the geographies go across each row. I usually put who we are trying to find as a peer at the top of the spreadsheet. I scale the data so that it ranges between zero and one. This way differences in one attribute doesn’t overpower the others simply due to a difference in scale. With all that done I am ready to compute the cosine similarity.
Let’s suppose you have four attributes (A, B, C and D) for the baseline and five peers. It would look like this in Excel:
In order to calculate the cosine similarity of Peer 1 and the Baseline, I would divide the dot product (=SUMPRODUCT(B$2:E$2,B3:E3)) by the square root of the sum of squares multiplied together (=SQRT(SUMSQ(B3:E3))*SQRT(SUMSQ($B$2:$E$2))). So if you want it all in one hairy formula in cell F3 it would be: