Natural Language Processing (NLP) is an inherently messy, iterative process. A well-designed data flow can make all the difference between a scalable NLP project and a project that makes everyone involved weep. During this talk, we'll investigate one way of integrating pre-made tools into a self-contained pipeline, using entry-level tools like scikit-learn and NLTK.
We'll begin by introducing NLP and the inherently iterative nature of NLP projects, focusing on the messy combinations of steps those iterations produce. After concluding that NLP is a nightmare for anyone who prefers organization to chaos, we'll discuss the benefits of pipelines. Then we'll survey the modern NLP toolset, including spaCy, Gensim, and other "commercial-grade" open source software that, in many ways, is less beginner-friendly than building your own toolset. After deciding we need to build our own, we'll talk through how to design a reproducible, efficient, and saveable pipeline. Finally, we'll walk through code designed to make all of this happen, discuss the choices it makes, and show how it can serve as a base pipeline for more complex tasks like document classification, topic modeling, and article recommendation engines; a minimal sketch of the idea appears below.
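To make the idea concrete, here is a minimal sketch of the kind of pipeline the talk describes: an NLTK tokenizer plugged into scikit-learn's Pipeline, which chains vectorization and a model and can be saved as a single artifact. The TF-IDF + logistic regression combination, the toy data, and the file name nlp_pipeline.joblib are illustrative assumptions, not the talk's exact code.

```python
import joblib
from nltk.tokenize import TreebankWordTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Plug NLTK's tokenizer into scikit-learn's vectorizer, then chain a
# classifier behind it so the whole flow behaves as one estimator.
tokenizer = TreebankWordTokenizer()
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=tokenizer.tokenize)),
    ("clf", LogisticRegression(max_iter=1000)),  # illustrative model choice
])

# Toy data; a real project would load a labeled corpus here.
docs = ["the cat sat on the mat", "stocks fell sharply in early trading"]
labels = ["pets", "finance"]
pipeline.fit(docs, labels)

# The fitted pipeline is saveable: one artifact captures both the learned
# vocabulary and the trained model, which keeps results reproducible.
joblib.dump(pipeline, "nlp_pipeline.joblib")  # hypothetical file name
restored = joblib.load("nlp_pipeline.joblib")
print(restored.predict(["the dog chased the cat"]))
```

Swapping the final step for a topic model or a nearest-neighbors recommender reuses the same skeleton, which is the point of building the base pipeline first.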