Sunday 2:00 PM–2:45 PM in Room #1025 (1st Floor)

Sustainable scrapers

David Eads

Audience level:
Intermediate

Description

Scraping data from the web is an essential skill, whether you like it or not. Learn the code and systems tricks behind testable, fast, low-maintenance scrapers, gleaned from years of real-world practice.

Abstract

A recent NPR project that collects structured data about gun sale listings from Armslist.com demonstrates several of the speaker's favorite tricks for writing simple, fast scrapers with Python.

In this session, we'll discuss:

  • Scraping legality and ethics
  • Using model classes to encapsulate and test the scraper
  • Using simple controller scripts to scrape
  • Optimizing efficiency using GNU Parallel
  • Using Amazon Elastic Compute Cloud (EC2) to really fly
  • How frameworks like Scrapy can help (and why knowing what's going on under the hood is always helpful)
  • Advanced issues and techniques: Continual scraping, inferred data, caching, and more
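
To give a flavor of the model-class idea in the outline, here is a minimal sketch, not the code from the NPR project. It assumes a made-up listing page whose fields are marked with class="title" and class="price"; the names Listing, _TextByClass, and SAMPLE are hypothetical and exist only for illustration. The point is that parsing lives in one class that takes an HTML string, so it can be tested against saved pages without touching the network.

```python
from html.parser import HTMLParser


class _TextByClass(HTMLParser):
    """Collect element text keyed by the element's class attribute.

    Assumes simple, well-formed markup with no unclosed void tags.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._stack = []  # class attribute (or None) of each open element

    def handle_starttag(self, tag, attrs):
        self._stack.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        text = data.strip()
        if current and text:
            self.fields[current] = text


class Listing:
    """Model for one listing page (hypothetical markup, see SAMPLE)."""

    def __init__(self, html):
        parser = _TextByClass()
        parser.feed(html)
        self._fields = parser.fields

    @property
    def title(self):
        return self._fields.get("title")

    @property
    def price(self):
        # Turn a string such as "$1,250.00" into a float; None if absent.
        raw = self._fields.get("price")
        if raw is None:
            return None
        return float(raw.replace("$", "").replace(",", ""))


# A saved page stands in for a live request, which keeps tests offline.
SAMPLE = """
<div class="listing">
  <h1 class="title">Example item</h1>
  <span class="price">$1,250.00</span>
</div>
"""

listing = Listing(SAMPLE)
print(listing.title, listing.price)
```

A controller script would then only need to fetch each page and hand the HTML to Listing, which keeps the fetching and the parsing independently testable.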