
How I Scraped a Quarter of a Million URLs in Three Quarters of an Hour (Roughly)

I do a lot of web scraping at my job. I recently heard about Python’s asyncio (yeah, I know it’s been out for a while) and thought it could really make a difference. I dove into the documentation and the examples published by others, but most of the examples were not very helpful. I mean, really, what’s the point of asynchronously printing? After much trial and error I was finally able to code up an asynchronous web scraper.
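The core of the script boils down to one pattern: hand each blocking requests.get() call off to asyncio’s default thread pool with run_in_executor() and gather the results. Here is a minimal sketch of just that pattern (the example.com URLs are placeholders):

import asyncio
import requests

# Placeholder URLs purely to illustrate the pattern
urls = ['https://example.com/page1', 'https://example.com/page2']

def fetch(url):
    # requests.get() blocks, so it gets run in a worker thread below
    return requests.get(url)

async def fetch_all(urls):
    loop = asyncio.get_event_loop()
    # Send each blocking call to the default ThreadPoolExecutor
    futures = [loop.run_in_executor(None, fetch, url) for url in urls]
    # Let all the requests run concurrently and collect the responses
    return await asyncio.gather(*futures)

loop = asyncio.get_event_loop()
responses = loop.run_until_complete(fetch_all(urls))
loop.close()
print([r.status_code for r in responses])

The full script below wraps this same pattern with the database setup, the download of the parcel file, and the building of the URL list.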

Our county has a property portal where you can look up the records for any property in the county. I knew you could search by address or by property tax map identifier. NYS GIS maintains a set of tax parcel centroids that includes these identifiers, so I used them as my universe of locations (although the centroids are current as of 2016 while the property portal is current as of a couple of months ago).

I ran my script overnight, and when I came in I had a database of 227,062 web pages that it scraped in about 49 minutes. I want to publish my code so that others trying to build a web scraper with asyncio can have a working model. The script downloads the tax parcel centroids and extracts them, builds a list of URLs to scrape, then scrapes them and saves the HTML to a SQLite database. I will write another script to comb through these pages and pull out the data I’m interested in.

# -*- coding: utf-8 -*-
"""
Created on Mon May 14 08:32:20 2018

@author: Michael Silva
"""
import urllib.request
import zipfile
import shutil
from dbfread import DBF
import sqlite3 as db
import requests 
import asyncio
import time

start_time = time.time()
    
print('Setting up db')
db_name = "Monroe Real Property Data.db"
con = db.connect(db_name)
con.row_factory = lambda cursor, row: row[0]
c = con.cursor()
c.execute('DROP TABLE IF EXISTS `scraped_data`')
c.execute('CREATE TABLE `scraped_data` (`url`, `html`)')
con.commit()

print("--- %s seconds ---" % (time.time() - start_time))

    
print('Getting the Monroe County Tax Parcels Centroids')
url = 'http://gis.ny.gov/gisdata/fileserver/?DSID=1300&file=Monroe-Tax-Parcels-Centroid-Points-SHP.zip'
zip_file_name = url.split('file=')[1]
dbf_file_name = 'Monroe_2016_Tax_Parcel_Centroid_Points_SHP.dbf'

with urllib.request.urlopen(url) as response, open(zip_file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
with zipfile.ZipFile(zip_file_name) as zf:
    zf.extractall()
print("--- %s seconds ---" % (time.time() - start_time))

print('Building list of urls to scrape')
urls_to_scrape = list()

for record in DBF(dbf_file_name):
    try:
        # Keep only parcels whose property class code starts with '2'
        if record['PROP_CLASS'][0] == '2':
            # Build the report URL from the SWIS code and the SBL (tax map identifier)
            scrape_me = 'https://www.monroecounty.gov/etc/rp/report.php?a=' + record['SWIS'] + '-' + record['SBL']
            urls_to_scrape.append(scrape_me)
    except IndexError:
        continue
print("--- %s seconds ---" % (time.time() - start_time))

def scrape(url):
    # requests.get() blocks, so scrape() is run in the executor's thread pool below
    print('Scraping ' + url)
    return requests.get(url)
                
async def scrape_all(urls_to_scrape):
    loop = asyncio.get_event_loop()
    # Hand each blocking scrape() call off to the default ThreadPoolExecutor
    futures = [
        loop.run_in_executor(None, scrape, url)
        for url in urls_to_scrape
    ]
    # Wait for all of the requests to finish, then save each page's HTML
    for response in await asyncio.gather(*futures):
        print('Saving ' + response.url)
        c.execute('INSERT INTO `scraped_data` VALUES (?, ?)', (response.url, response.content))


loop = asyncio.get_event_loop()
loop.run_until_complete(scrape_all(urls_to_scrape))
loop.close()

print("--- %s seconds ---" % (time.time() - start_time))

print('Finalizing database')
con.commit()    
con.close()
print("--- %s seconds ---" % (time.time() - start_time))

My First Python Scraping with Beautiful Soup

I recently needed to scrape a cost of living calculator for data. To save time I wrote a Python program that would pull the data for all the cities. It was my first time scraping a website in Python. I used Beautiful Soup, since I had heard other data scientists mention it on a podcast. The documentation provided by the developers is well written and easy to follow. My code is hosted on GitHub in the cgr-work repository.
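For anyone curious, the basic Beautiful Soup pattern looks something like this (the URL and the tag/class names here are placeholders, not the calculator's actual markup; the real script is in the repository):

import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real script loops over a page per city
url = 'https://example.com/cost-of-living/some-city'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# Placeholder selector; the tag and class names depend on the site's markup
for cell in soup.find_all('td', class_='cost'):
    print(cell.get_text(strip=True))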