To productionize data science work (and have it taken seriously by software engineers, CTOs, clients, or the open source community), you need to write tests! Except… how can you test code that performs nondeterministic tasks like natural language parsing and modeling? This talk presents an approach to testing probabilistic functions in code, illustrated with concrete examples written for pytest.
Data science has become key to most organizations' efforts to deliver value to their customers. Since it involves significant research and experimentation, data science work is often done in parallel with (and in isolation from) the standard tech team and software development processes like agile and scrum. The result is innovative but siloed work that is difficult to integrate into existing systems. Now organizations are increasingly looking for ways to deploy data science products, and the pressure is on for us as data scientists to formalize our processes so that our work can be accepted into those systems.
One of the processes most critical to robust software development, and often missing from the data science pipeline, is the creation of automated tests for production code. For software engineers, tests aren't optional; they reduce bugs, improve feedback loops, and increase user confidence in the system. Tests for data science features have the same upsides, along with the potential for improved reproducibility. However, most guidance on writing tests (e.g. the Test-Driven Development philosophy) is oriented toward the testing of deterministic algorithms, which can be relied on to exhibit the same behavior over and over, given the same input. Unfortunately, machine learning algorithms often don't behave this way: they rely on statistical models that are stochastic by design.
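To make that contrast concrete, here is a minimal sketch (the `tokenize` and `noisy_classifier` functions are hypothetical stand-ins, not code from the talk): a deterministic function supports exact assertions, while a stochastic one is better tested against its aggregate behavior over many runs.

```python
import random


def tokenize(text):
    # Deterministic: the same input always yields the same output.
    return text.lower().split()


def noisy_classifier(text):
    # Hypothetical stand-in for a stochastic model: correct ~90% of the time.
    return "positive" if random.random() < 0.9 else "negative"


def test_tokenize_exactly():
    # Exact assertions are appropriate for deterministic code.
    assert tokenize("Testing NLP code") == ["testing", "nlp", "code"]


def test_classifier_meets_accuracy_floor():
    # For stochastic output, assert on aggregate behavior instead:
    # collect many predictions and check accuracy against a threshold.
    predictions = [noisy_classifier("great product") for _ in range(1000)]
    accuracy = predictions.count("positive") / len(predictions)
    assert accuracy >= 0.85
```

Seeding the random number generator (e.g. `random.seed(42)`) is another common way to make such tests reproducible, at the cost of exercising only one sample path.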
This talk is for data scientists with little or no experience writing tests, and for veteran testers who are new to building data science products. We'll begin by discussing the value of tests and how software engineers typically think about testing. Then we'll cover how we came to write tests for our Natural Language Processing (NLP) microservices at ByteCubed, and how you can test a system that isn't necessarily deterministic. The examples in this talk will focus on Entity Extraction and Semantic Role Labeling (using open source libraries like spaCy, NLTK, and NetworkX), but the broader concepts generalize to any Python project. We'll discuss how to write useful tests when the output of your system can vary, and how to prepare for a system that won't pass everything you test for, as in the sketch below. All examples will be written in Python using pytest.
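As a taste of that threshold-based approach, here is a sketch for entity extraction. The `extract_entities` function and the labeled fixture are illustrative stand-ins, not the ByteCubed implementation (a real version might wrap a spaCy pipeline): instead of requiring every entity to be found, the test asserts that recall over a small labeled corpus stays above an agreed floor.

```python
import re

# Illustrative labeled fixture: each sentence is paired with the
# entities we expect an extractor to find.
LABELED_EXAMPLES = [
    ("Ada Lovelace wrote the first program.", {"Ada Lovelace"}),
    ("Guido van Rossum created Python.", {"Guido van Rossum", "Python"}),
    ("The meeting is in Arlington on Tuesday.", {"Arlington", "Tuesday"}),
]


def extract_entities(text):
    # Naive stand-in for a real NER model: treat runs of
    # capitalized words as candidate entities.
    return set(re.findall(r"[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*)*", text))


def test_entity_recall_meets_floor():
    found, expected_total = 0, 0
    for text, expected in LABELED_EXAMPLES:
        predicted = extract_entities(text)
        found += len(expected & predicted)
        expected_total += len(expected)
    recall = found / expected_total
    # The extractor is allowed to miss some entities ("Guido van
    # Rossum" defeats the naive rule above); the test fails only if
    # quality regresses below the agreed floor.
    assert recall >= 0.75
```

The same pattern extends to precision, F1, or per-label thresholds, and the floor itself becomes a documented, versioned statement of what "good enough" means for the product.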