I do a lot of web scraping at my job. I recently heard about Python's asyncio (yeah, I know it's been out for a while) and thought it could really make a difference. I dove into the documentation and the examples published by others, but most of the examples were not very helpful. I mean, really, what's the point of asynchronously printing? After much trial and error I was finally able to code up an asynchronous web scraper.
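Before getting into the full script, here is the minimal pattern that finally clicked for me, distilled down to a sketch: wrap the blocking `requests.get` calls in `loop.run_in_executor` so the event loop can run many of them concurrently in the default thread pool. The URLs below are placeholders.

```python
import asyncio
import requests

def fetch(url):
    # Blocking I/O; it runs in a worker thread via the executor
    return requests.get(url)

async def fetch_all(urls):
    loop = asyncio.get_event_loop()
    # Wrap each blocking call in a future the event loop can await
    futures = [loop.run_in_executor(None, fetch, url) for url in urls]
    return await asyncio.gather(*futures)

urls = ['https://example.com/1', 'https://example.com/2']  # placeholder URLs
loop = asyncio.get_event_loop()
responses = loop.run_until_complete(fetch_all(urls))
```

That's the whole trick: the scraping itself isn't async, but the event loop keeps hundreds of requests in flight at once instead of waiting on them one at a time.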
Our county has a property portal where you can look up the records for any property in the county. I knew that you could search by address or by property tax map identifier. NYS GIS maintains a set of tax parcel centroids that includes these identifiers, so I used them as my universe of locations (although the centroids are current as of 2016, while the property portal is current as of a couple of months ago).
I ran my script overnight, and when I came in I had a database of 227,062 webpages scraped in about 49 minutes. I want to publish my code so that others trying to build a web scraper that uses asyncio can have a working model. It downloads the tax parcel centroids and extracts them, builds a list of URLs to scrape, then scrapes them and saves the HTML to a SQLite database. I will write another script to comb through these pages and pull out the data I'm interested in.
```python
# -*- coding: utf-8 -*-
"""
Created on Mon May 14 08:32:20 2018

@author: Michael Silva
"""
import urllib.request
import zipfile
import shutil
from dbfread import DBF
import sqlite3 as db
import requests
import asyncio
import time

start_time = time.time()

print('Setting up db')
db_name = "Monroe Real Property Data.db"
con = db.connect(db_name)
con.row_factory = lambda cursor, row: row
c = con.cursor()
c.execute('DROP TABLE IF EXISTS `scraped_data`')
c.execute('CREATE TABLE `scraped_data` (`url`,`html`)')
con.commit()
print("--- %s seconds ---" % (time.time() - start_time))

print('Getting the Monroe County Tax Parcels Centroids')
url = 'http://gis.ny.gov/gisdata/fileserver/?DSID=1300&file=Monroe-Tax-Parcels-Centroid-Points-SHP.zip'
zip_file_name = url.split('file=')[-1]  # keep the file name, not the whole split list
dbf_file_name = 'Monroe_2016_Tax_Parcel_Centroid_Points_SHP.dbf'
with urllib.request.urlopen(url) as response, open(zip_file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
with zipfile.ZipFile(zip_file_name) as zf:
    zf.extractall()
print("--- %s seconds ---" % (time.time() - start_time))

print('Building list of urls to scrape')
urls_to_scrape = list()
for record in DBF(dbf_file_name):
    try:
        # Property class codes in the 200s are residential
        if record['PROP_CLASS'][0] == '2':
            scrape_me = 'https://www.monroecounty.gov/etc/rp/report.php?a=' + record['SWIS'] + '-' + record['SBL']
            urls_to_scrape.append(scrape_me)
    except IndexError:
        # Skip records with an empty property class
        continue
print("--- %s seconds ---" % (time.time() - start_time))

def scrape(url):
    print('Scraping ' + url)
    return requests.get(url)

async def scrape_all(urls_to_scrape):
    loop = asyncio.get_event_loop()
    # Run the blocking requests in the default thread pool executor
    futures = [
        loop.run_in_executor(None, scrape, url)
        for url in urls_to_scrape
    ]
    for response in await asyncio.gather(*futures):
        print('Saving ' + response.url)
        c.execute('INSERT INTO `scraped_data` VALUES (?, ?)', (response.url, response.content))

loop = asyncio.get_event_loop()
loop.run_until_complete(scrape_all(urls_to_scrape))
loop.close()
print("--- %s seconds ---" % (time.time() - start_time))

print('Finalizing database')
con.commit()
con.close()
print("--- %s seconds ---" % (time.time() - start_time))
```
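As for the follow-up script that will pull the data out of the saved pages, I haven't written it yet, but the rough idea is below. This is only a sketch: it assumes BeautifulSoup (`beautifulsoup4`) is installed, and the `tr`/`td` selectors are hypothetical since the real ones depend on how the county's report pages are laid out.

```python
import sqlite3 as db
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

con = db.connect("Monroe Real Property Data.db")
c = con.cursor()

for url, html in c.execute('SELECT `url`, `html` FROM `scraped_data`'):
    soup = BeautifulSoup(html, 'html.parser')
    # Hypothetical selectors; the real ones depend on the report page structure
    for row in soup.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        print(url, cells)

con.close()
```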