Friday 10:15 AM–10:50 AM in Main Room

Exposing dark data in the enterprise with custom NLP

Justin J. Nguyen

Companies today are inundated with largely inaccessible unstructured data, virtually lost to the enterprise. We present NLP frameworks and tradeoffs for consuming and parsing unstructured data into a highly accessible knowledge platform in order to (a) allow your employees to make better decisions faster, (b) quickly raise the expertise of your employees, and (c) preserve organizational knowledge.


In almost every organization, valuable knowledge is hidden in documentation such as emails, forums, and technical reports. Without access to this information, key stakeholders can make under-informed decisions that lead to critical failures, costing millions in redundant work and lost productivity. Every organization is different, and so is its documentation. The diction, syntax, semantics, and topics can vary tremendously, as can the scope and breadth of the knowledge domains. In this talk, we discuss building flexible NLP pipelines that can handle a wide range of corpora. We present an open-source framework and process that uses unsupervised and semi-supervised learning approaches to produce reliable question-answer pairs with minimal training and labelled data. Key aspects of our approach are considerations around balancing multiple pre- and post-processing techniques, feature extraction, and domain dictionaries. We will also share lessons learned from deploying custom NLP pipelines and services for large corporations, including insights around training and stacked modeling ensembles.
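
To make the pipeline stages named above more concrete, here is a minimal sketch of how pre-processing, a domain dictionary, and feature extraction could fit together to match a question to a document. It is not the framework presented in the talk: the `DOMAIN_DICT` entries, the function names, and the choice of TF-IDF with cosine similarity are all assumptions made for illustration, and it uses only the Python standard library.

```python
import math
import re
from collections import Counter

# Hypothetical domain dictionary: maps in-house jargon to canonical terms.
DOMAIN_DICT = {"rca": "root cause analysis", "po": "purchase order"}


def preprocess(text):
    """Pre-processing: lowercase, tokenize, and expand domain abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    expanded = []
    for tok in tokens:
        expanded.extend(DOMAIN_DICT.get(tok, tok).split())
    return expanded


def tfidf_vectors(docs_tokens):
    """Feature extraction: sparse TF-IDF vectors with a smoothed IDF."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for toks in docs_tokens for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = [
        {t: count * idf[t] for t, count in Counter(toks).items()}
        for toks in docs_tokens
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(question, docs):
    """Return the index of the document most similar to the question."""
    vectors, idf = tfidf_vectors([preprocess(d) for d in docs])
    # Terms never seen in the corpus carry no signal, so they are dropped.
    q_counts = Counter(t for t in preprocess(question) if t in idf)
    q_vec = {t: count * idf[t] for t, count in q_counts.items()}
    scores = [cosine(q_vec, v) for v in vectors]
    return max(range(len(docs)), key=scores.__getitem__)


docs = [
    "The RCA for the pump failure pointed to a worn seal.",
    "Submit the PO before the end of the quarter.",
    "Lunch menu for Friday.",
]
match_index = best_match("What did the root cause analysis find?", docs)
print(docs[match_index])
```

In this sketch the domain dictionary is what lets a question phrased in plain language reach a document written in internal shorthand; without the expansion step, "RCA" and "root cause analysis" would share no tokens.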
