Saturday 2:15 PM–3:00 PM in Room #1023/1022/1020 (1st Floor)

Eat Your Vegetables - Data Security for Data Scientists

Will Voorhees

Audience level:
Intermediate

Description

You've got data: lots of it. Other people want to get their hands on that data, and you'd rather they didn't, so let's go over a few things you can do to dissuade attackers from getting their grubby mitts on your painstakingly processed datastore. We'll cover the obvious things (spoiler alert: encryption) and then some advanced techniques for keeping data secure while still keeping it usable (that is to say, analyzable).

Abstract

The ubiquity of data in the modern age has created an environment where data scientists can thrive, but it's also leading to a nasty situation: the very data that makes our lives so interesting is also making us a target for some people who don't have our interests at heart. Data scientists are quickly becoming the caretakers of their organization's data if for no other reason than that we use it the most! That means data scientists must become the guardians of that data.

"But that's someone else's job!" Even if you are part of an organization that's large enough to have a dedicated security team, you should still care about your data. It's your data. Your security team isn't working with it every day. They aren't relying on it for their next big project. They haven't spent hours upon hours cleaning and tagging it. When it comes down to it, you are the person with the most invested in the data. Can you really trust some far-off team to give it the same attention you do?

If I haven't already convinced you that this is an important problem, then this talk will really drive the point home. We'll discuss recent data breaches from Ashley Madison, the IRS and OPM, LinkedIn, Sony, and even Wendy's. Small companies aren't immune either! Hacking activity against small businesses is on the rise and even a single breach can cost a company several hundred thousand dollars in lost revenue.

The world is not all dark and scary. There are several relatively easy things you can do to add a great deal of protection. Oftentimes, default security is so lax that just applying a few simple tactics can vastly reduce your attack surface and make you a far less tempting target. We'll discuss access controls on accounts and data, how to do credential management the right way, and dispel the myths around SSL that put people off. Security isn't free, but it sure can be cheap.
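As a taste of the "simple tactics" end of the spectrum, here is a minimal sketch of one common credential-management habit: reading secrets from the environment instead of hardcoding them in a script. This is an illustration, not necessarily the approach the talk recommends. The variable names `DATASTORE_USER` and `DATASTORE_PASSWORD` and the helpers `load_credentials` and `redact` are hypothetical.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a required credential is absent from the environment."""


def load_credentials(environ=None, names=("DATASTORE_USER", "DATASTORE_PASSWORD")):
    """Return the named credentials from the environment, or fail loudly.

    The variable names are hypothetical placeholders; `environ` defaults to
    os.environ and can be replaced with a plain dict for testing.
    """
    env = os.environ if environ is None else environ
    # Treat empty strings the same as unset variables.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise MissingCredentialError(
            "missing environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in names}


def redact(secret, visible=2):
    """Return a log-safe form of a secret: a short prefix, the rest masked."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)
```

A script would call `load_credentials()` once at startup and pass the values on to its database or cloud client, so the secrets never land in version control. If a credential must appear in a log line, `redact` keeps only a short prefix.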

After you get your accounts locked down, you've still got huge datasets that need protection. Encryption can go a long way toward making sure that your painstakingly processed data doesn't get into the hands of some ne'er-do-well. We'll also briefly discuss backup options, since all the encryption in the world can't protect against deletes…
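Encrypting a dataset starts with a key. The sketch below assumes the key is derived from a passphrase with PBKDF2, using only Python's standard library. The function name `derive_key`, the iteration count, and the salt size are illustrative choices, not values taken from the talk. The derived key would then be handed to an encryption library of your choice.

```python
import hashlib
import os

SALT_BYTES = 16       # illustrative salt size
KEY_BYTES = 32        # 256 bits, e.g. suitable as an AES-256 key
ITERATIONS = 200_000  # illustrative work factor; tune for your hardware


def derive_key(passphrase, salt=None, iterations=ITERATIONS):
    """Derive a fixed-length key from a passphrase with PBKDF2-HMAC-SHA256.

    Returns (key, salt). Store the salt next to the encrypted data: it is
    not secret, but it is required to re-derive the same key later.
    """
    if salt is None:
        salt = os.urandom(SALT_BYTES)  # fresh random salt for a new key
    key = hashlib.pbkdf2_hmac(
        "sha256",
        passphrase.encode("utf-8"),
        salt,
        iterations,
        dklen=KEY_BYTES,
    )
    return key, salt
```

Calling `derive_key("my passphrase")` produces a new key and salt. Calling it again with the same passphrase and the stored salt reproduces the key, which is what makes decryption possible later.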

Finally, we'll discuss PyCrypto and OpenSSL for low-level encryption and explore how these libraries can be combined with provider frameworks like Boto to create some very powerful tooling. We'll conclude with some of the hosted solutions for data storage that have begun providing out-of-the-box security. Audience members can expect to walk away with a better understanding of the hazards we face as data scientists as well as a sense of the full spectrum of security options, from simple tricks you can integrate into your regular scripts to complete, turn-key security solutions.