Saturday 14:30–15:15 in Hall 1

Designing spaCy: A high-performance natural language processing (NLP) library written in Cython

Matthew Honnibal

Audience level:
Experienced

Description

The spaCy natural language processing (NLP) library features state-of-the-art performance, and a high-level Python API. Efficiency is crucial for NLP, because job sizes are constantly increasing. This talk describes how we’ve met these challenges in spaCy, by implementing the library in Cython.

Abstract

The spaCy natural language processing (NLP) library features state-of-the-art performance, and a high-level Python API. Efficiency is crucial for NLP, because job sizes are constantly increasing. The key algorithms are also relatively complicated, and frequently subject to change, as new research is published. This talk describes how we’ve met these challenges in spaCy, by implementing the library in Cython. Unlike many Cython users, we did not write the library in Python first, and then optimize it. Instead, we designed the library as a C extension from the start, and added the Python API on top. This allows us to build the library on top of efficient, memory-managed data structures, without having to maintain a separate C or C++ codebase. The result is the fastest NLP library in the world, support for GIL-free multithreading, in a concise readable codebase, and with no compromise on user friendliness.