Why do we need to “test” our data?

Before we can really start working with data, we need to make sure that the data are what we think they are: that all of the data are there, that they contain the signals we expect at the fidelity we need, etc. And because data are always changing, this isn’t a one-time cost: we need to continually assert that our expectations about data are being met. This is challenging with just one dataset, but imagine your organization is trying to work with tens or hundreds or even thousands of datasets. It quickly becomes impossible to review every new delivery, and context-switching between different datasets and team members is expensive.

We face an analogous problem in developing software: we need to continually assert that our software works as expected as the codebase changes, for example as bugs are fixed and features are added or removed. To address this, most software is equipped with a suite of unit tests that evaluate whether each part (i.e., unit) of the code is functioning as expected.

In this talk, we’ll apply unit testing to data to see where and how this framework can be used to greatest effect, and explore the unique challenges of unit testing data. We’ll see how Two Sigma’s open source library Marbles extends Python’s built-in unittest library to address these challenges and enable unit testing of data in several real-world scenarios.

Challenges

We find that it’s possible to express our expectations about data as unit tests, but we also find that unit testing data is distinct from unit testing code in three meaningful ways:

our expectations about data aren't always concrete
it's not always clear how to mitigate different data "failures"
data "failures" are often introduced externally

What these three things have in common is that, when a data unit test fails, context is expensive to recover for the test consumer, even if they wrote the test. This is why we built Marbles.

Marbles

Marbles extends Python's unittest framework to provide more information-rich failure messages and to let the test author embed any relevant context about the test right into the test itself. The idea is to make sure that the test author's intent, context, and background are available to the test consumer when they need them most: when the test fails. Marbles achieves this with the following:

Rich, human-readable failure messages
    Full assertion statement that failed
    Local variable context
    Ability to toggle the traceback
Semantic assertions
Annotations

We'll see how these features help address the challenges of unit testing data that were introduced earlier. We'll also see how focusing on the test consumer, whether that’s the test author in a few months or someone completely new to the test suite, helps everyone get in the habit of writing better, clearer tests, regardless of whether they're writing unit tests for data or for code. Finally, writing Marbles tests is almost identical to writing vanilla unittest tests, so if you're already familiar with unittest you'll be able to write Marbles tests with little or no learning curve.

Conclusion

In summary, in this talk we'll discuss why everyone should be testing their data (an idea that will likely be new to many audience members), and we'll see several concrete examples of how to do this. We'll discuss the challenges we are likely to encounter with data unit tests, and see how Two Sigma addresses some of these challenges with Marbles. Upon leaving this talk, the audience will have some ideas of what unit tests they could write for their own data, and a familiar tool they can start using today to write them.

Thursday 5:00 PM–5:40 PM in Central Park East (#6501a)

Unit Testing Data with Marbles

Jane Stewart Adams, Leif Walsh

