Parsing natural language time expressions into structured data is challenging. Whilst quite a few tools exist, many are either too simplistic or awkward to use in a Python setup. ctparse is a pure Python library (MIT license) built on straightforward concepts; it parses complex expressions efficiently and can easily be adjusted for domain-specific use cases.
When trying to parse natural language time and date expressions such as "next Monday afternoon before 4pm" into something structured, the choice in the Python ecosystem is limited. There are libraries such as dateparser, but they can only really handle reasonably formal expressions (the example above is already not formal enough). One of the best alternatives is probably wit.ai/Facebook's duckling (https://github.com/facebook/duckling). However, it also has some major shortcomings, among them that it is not native Python and offers little control over what is going on, unless you are a Haskell pro. And in our experience, the details of the use case are very important.
To this end we developed ctparse (https://github.com/comtravo/ctparse). In many ways it is similar to duckling, albeit with a significantly smaller scope right now. ctparse implements a regular-expression and rule-based system to parse time and date expressions, equipped with a statistical model to favour reasonable resolutions over others. Whilst still at a fairly early stage, we can outperform duckling in our application, which is parsing date/time expressions from e-mail booking requests, in terms of both speed and accuracy.
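
To make the idea of "regular expressions, then rules, then a score" concrete, here is a minimal toy sketch. It is not ctparse's implementation or API: the names `candidates` and `parse`, the three patterns, the part-of-day hours, the reading of "next Monday" as the next Monday strictly after the reference date, and the character-coverage score (a crude stand-in for a statistical model) are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

# Hypothetical vocabulary for this sketch only.
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
PARTS_OF_DAY = {"morning": (6, 12), "afternoon": (12, 18), "evening": (18, 22)}

# Stage 1: regular expressions pick out recognisable fragments.
RE_WEEKDAY = re.compile(r"\b(?:next\s+)?(" + "|".join(WEEKDAYS) + r")\b", re.I)
RE_POD = re.compile(r"\b(" + "|".join(PARTS_OF_DAY) + r")\b", re.I)
RE_BEFORE = re.compile(r"\bbefore\s+(\d{1,2})\s*(am|pm)\b", re.I)


def candidates(text, ref):
    """Yield (score, start, end) tuples, from coarse to refined.

    The score is the share of input characters explained by the
    matched fragments (an assumed placeholder for a learned model).
    """
    day_m = RE_WEEKDAY.search(text)
    if day_m is None:
        return
    # Rule 1: weekday name -> next such day strictly after the reference.
    target = WEEKDAYS.index(day_m.group(1).lower())
    days_ahead = (target - ref.weekday() - 1) % 7 + 1
    day = (ref + timedelta(days=days_ahead)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    start, end = day, day + timedelta(days=1)
    covered = len(day_m.group(0))
    yield covered / len(text), start, end

    pod_m = RE_POD.search(text)
    if pod_m:
        # Rule 2: date + part of day -> interval within that day.
        lo, hi = PARTS_OF_DAY[pod_m.group(1).lower()]
        start, end = day + timedelta(hours=lo), day + timedelta(hours=hi)
        covered += len(pod_m.group(0))
        yield covered / len(text), start, end

    before_m = RE_BEFORE.search(text)
    if before_m:
        # Rule 3: "before <hour>" clips the end of the current interval.
        hour = int(before_m.group(1)) % 12
        if before_m.group(2).lower() == "pm":
            hour += 12
        end = min(end, day + timedelta(hours=hour))
        covered += len(before_m.group(0))
        yield covered / len(text), start, end


def parse(text, ref):
    """Stage 3: return the highest-scoring candidate, or None."""
    return max(candidates(text, ref), key=lambda c: c[0], default=None)


if __name__ == "__main__":
    print(parse("next Monday afternoon before 4pm", datetime(2018, 3, 7, 12, 0)))
```

With the reference date Wednesday, 7 March 2018, this sketch resolves the example to the interval from 12:00 to 16:00 on Monday, 12 March 2018. A real system would generate many competing rule applications and rank them with a trained model rather than a fixed coverage heuristic.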
I will lay out the basic concepts and ideas behind building this PCFG-inspired (probabilistic context-free grammar) parser, discuss some of the more challenging algorithmic building blocks in detail, and demonstrate that Python is actually a very good choice for implementing such a system. Whilst the currently implemented parsing rules might be too specific for other use cases, adapting them should be easy given the background insights from this presentation.