Searching for a Needle in a Haystack with Python

I recently was working with New York State libraries data (hopefully more to come on this) and was trying to match up two data sets on the library name.  But due to minor variations in the names (i.e. an abbreviation or punctuation marks) this is kinda a messy process.  I found fuzzywuzzy which is a Python library that can help out.  Here’s how I constructed my search:

from fuzzywuzzy import fuzz
import pandas as pd

haystack = pd.read_csv('web_list.csv')
haystack = haystack['Libraries'].values.tolist()

needles = pd.read_csv('excel_list.csv')
needles = needles['Library'].values.tolist()

best_matches = []

# Look for best match
for needle in needles:
    print('Searching for: '+needle)
    best_match_needle = ''
    best_match_hay = ''
    best_match_ratio = 0
    search_for_match = True
    while search_for_match:
        for hay in haystack:
            match_score = fuzz.partial_ratio(needle, hay)
            if match_score > best_match_ratio:
                best_match_ratio = match_score
                best_match_needle = needle
                best_match_hay = hay
            if match_score == 100:
                search_for_match = False
        # Looped through haystack so stop searching
        search_for_match = False
    # Append best match search results
    row = {'Searched':best_match_needle, 'Found':best_match_hay, 'Ratio':best_match_ratio}
    best_matches.append(row)

df = pd.DataFrame(best_matches)
writer = pd.ExcelWriter('Best Matches.xlsx', engine='xlsxwriter')
df.to_excel(writer,'Sheet1', index=False)
writer.save()

The results are pretty good.  I tried the fuzzywuzzy processes but I was getting a lot of goofy results.  My only caution is that the ratio in my implementation should not be strongly trusted.  I had a 100 for “Babylon Public Library” being a match with “North Babylon Public Library.”

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s