In this talk I will present an information extraction system that we have built to extract mentions of riots, protests and other political risk events from streaming unstructured text (blogs, news, social media).
In the first part of the talk I will describe various information extraction (IE) algorithms suitable for extracting event information from unstructured text. Such algorithms can be used to identify real-world events, such as a protest, an accident or a natural disaster, from free-form text. In particular, I will compare two approaches to identifying event mentions: the first based on pattern matching, the second on discriminative machine learning. I will show how these approaches offer different tradeoffs between achievable accuracy and extensibility.
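To make the pattern-matching approach concrete, here is a minimal sketch in Python. It is not the system presented in the talk: the trigger lexicon `EVENT_PATTERNS` and the function `extract_event_mentions` are hypothetical names chosen for illustration, and a production system would likely use far richer patterns.

```python
import re

# Hypothetical trigger lexicon (an assumption for illustration only);
# a real system would use a much larger, curated set of patterns.
EVENT_PATTERNS = {
    "protest": re.compile(
        r"\b(?:protest(?:s|ed|ing|ers)?|demonstrat(?:ion|ions|ors))\b",
        re.IGNORECASE,
    ),
    "riot": re.compile(r"\briot(?:s|ed|ing|ers)?\b", re.IGNORECASE),
}


def extract_event_mentions(text):
    """Return (event_type, matched_text, start_offset) tuples, sorted by offset."""
    mentions = []
    for event_type, pattern in EVENT_PATTERNS.items():
        for match in pattern.finditer(text):
            mentions.append((event_type, match.group(0), match.start()))
    return sorted(mentions, key=lambda mention: mention[2])


if __name__ == "__main__":
    sample = "Thousands protested in the capital; riots followed."
    for mention in extract_event_mentions(sample):
        print(mention)
```

Adding a new event type here means adding one more entry to the lexicon, which illustrates the extensibility side of the tradeoff. A discriminative classifier would instead need labelled training examples for each new event type.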
The second part of the talk will focus on lessons from running an information extraction pipeline at scale. I will compare different frameworks for distributed low-latency stream processing, and explain how we chose the Python streamparse project to map our Python microservices onto an Apache Storm topology. Next, I will show how ideas from the "Lambda Architecture" can be used to operate a fully versioned information extraction pipeline, in which old data is routinely reprocessed with new and improved algorithms while the results of earlier versions remain accessible. I will show how this stream processing paradigm, thanks to the parallelism offered by Apache Storm, extends naturally to the batch processing of multi-terabyte historical archives.
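The versioning idea can be sketched in a few lines of plain Python. This is an in-memory illustration under stated assumptions, not the pipeline described in the talk, and it deliberately uses no Storm or streamparse APIs. The class `VersionedPipeline` and its methods are hypothetical names: raw documents go into an append-only log, each extractor version keeps its own view, and registering a new version replays the log without touching older views.

```python
class VersionedPipeline:
    """Toy versioned extraction pipeline (illustrative assumption, in-memory only)."""

    def __init__(self):
        self.raw_log = []     # append-only master dataset of (doc_id, text)
        self.extractors = {}  # version -> extraction function
        self.views = {}       # version -> {doc_id: extraction result}

    def register(self, version, extractor):
        """Add an extractor version and batch-reprocess all historical documents."""
        self.extractors[version] = extractor
        self.views[version] = {
            doc_id: extractor(text) for doc_id, text in self.raw_log
        }

    def ingest(self, doc_id, text):
        """Stream path: store the raw document, then update every version's view."""
        self.raw_log.append((doc_id, text))
        for version, extractor in self.extractors.items():
            self.views[version][doc_id] = extractor(text)

    def query(self, version, doc_id):
        """Read the result produced by one specific extractor version."""
        return self.views[version][doc_id]


if __name__ == "__main__":
    pipeline = VersionedPipeline()
    pipeline.register("v1", lambda text: "riot" in text.lower())
    pipeline.ingest("d1", "Riot police deployed")
    # A newer extractor is registered later; old data is replayed through it.
    pipeline.register(
        "v2",
        lambda text: [w for w in text.lower().split() if w in {"riot", "protest"}],
    )
    print(pipeline.query("v1", "d1"), pipeline.query("v2", "d1"))
```

Because the same extractor function serves both the replay in `register` and the per-document path in `ingest`, the sketch mirrors the point that one processing paradigm can cover both streaming and batch reprocessing.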