Saturday 3:15 PM–4:00 PM in Speakeasy

How Soon is Now: automatically extracting publication dates of news articles with machine learning

Julie Lavoie

Audience level:
Novice

Description

Scraping New York Times articles for publication dates is easy; scraping 10 000 different sites is hard. Beyond page-specific scraping, how do you build a parser that can extract the publication date of (almost) any news article online, no matter what the site is? We implemented a machine learning research paper to solve this problem, and we'll talk about the challenges we faced.

Abstract

Scraping New York Times articles for publication dates is easy; scraping 10 000 different sites is hard. Beyond page-specific scraping, how do you build a parser that can extract the publication date of (almost) any news article online, no matter what the site is? We implemented a machine learning research paper to solve this problem, and we'll talk about the challenges we faced.

We’ll cover when to use machine learning rather than humans or heuristics for data extraction; how to frame the problem in machine learning terms, including feature selection on HTML documents; and the issues that arise when turning research into production code. Data scientists and developers will leave knowing how to extract information from the web with techniques more sophisticated than a hand-written scraper.
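
To give a flavour of what feature selection on HTML documents can look like, here is a minimal sketch in Python. It is not the method from the paper or the talk, only one plausible framing under stated assumptions: every date-like string in the page becomes a candidate, and each candidate is described by a few simple features that a classifier could later score. The date pattern, the feature names, the `CandidateExtractor` class, the `extract_candidates` function and the sample page are all hypothetical and chosen for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: only matches "2014-04-12" or "April 12, 2014" style dates.
DATE_RE = re.compile(
    r"\b(\d{4}-\d{2}-\d{2}"
    r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4})\b"
)

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"meta", "br", "img", "link", "input", "hr"}


class CandidateExtractor(HTMLParser):
    """Collects date-like strings and a few illustrative features for each."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags as (tag, attrs) pairs
        self.candidates = []  # one feature dict per date-like string

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in VOID_TAGS:
            # Dates often sit in <meta content="..."> rather than visible text.
            match = DATE_RE.search(attrs.get("content") or "")
            if match:
                self._add(match.group(0), tag, attrs)
            return
        self.stack.append((tag, attrs))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerant of sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.stack:
            return
        tag, attrs = self.stack[-1]
        for match in DATE_RE.finditer(data):
            self._add(match.group(0), tag, attrs)

    def _add(self, text, tag, attrs):
        # Attribute names and values, lowercased, as one searchable string.
        hint = " ".join(list(attrs) + [v for v in attrs.values() if v]).lower()
        self.candidates.append({
            "text": text,
            "tag": tag,
            "depth": len(self.stack),  # number of open ancestor tags
            "attr_mentions_date": any(k in hint for k in ("date", "time", "publish")),
            "position": len(self.candidates),  # order of appearance in the page
        })


def extract_candidates(html):
    parser = CandidateExtractor()
    parser.feed(html)
    return parser.candidates


SAMPLE = """<html><head>
<meta property="article:published_time" content="2014-04-12">
</head><body>
<div class="byline"><span class="pub-date">April 12, 2014</span></div>
<p>The festival, first held on March 3, 2009, returns this year.</p>
</body></html>"""

candidates = extract_candidates(SAMPLE)
for c in candidates:
    print(c)
```

On the sample page, the two candidates that carry the publication date have date-related attribute names or values, while the date mentioned in the body text does not. A hand-written heuristic could stop there; the machine learning framing instead labels candidates from many sites as "publication date" or "not" and lets a classifier weigh such features.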