PyData London 2017 - Presentation: Scale out from the very beginning

Most companies are very well aware of the potential behind Big Data solutions today and happily start collecting every piece of information, building huge pools of Dark Data. How can Data Science teams create an initial overview of what is available? A simple search strategy, optimised and refined to scale, could be a promising way to start.

This talk reflects on the author's journey of making the pool of Dark Data available to teams with quite different goals. The emphasis is on creating a simple and robust set of tools that fit together and address the various needs of those teams, based mainly on dask distributed, dask-based dataframes, bokeh and flask.

The key to success was to avoid imposing too much structure at the very beginning and to postpone this task to the individual projects of the users consuming the results of these services, giving them the freedom to create and use their own models.

The talk shows how we implemented a distributed filesystem scanning utility that crawls our 1.5 PB storage system for data every night and produces a simple, yet useful table of contents, and how this result set is processed further to fulfil all the project teams' requirements.
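As a rough illustration of the scanning idea, the following minimal sketch builds a table of contents on a single machine using only the Python standard library. It is not the implementation from the talk: the function names `scan_directory` and `build_table_of_contents`, the record fields and the thread pool are all assumptions made for this example. In the setting described here, the per-directory scan would presumably be handed to dask distributed workers instead of local threads.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def scan_directory(path):
    """Scan one directory (no recursion); return (file records, subdirectories)."""
    records, subdirs = [], []
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    st = entry.stat(follow_symlinks=False)
                    # Hypothetical record layout: one row per file.
                    records.append({
                        "path": entry.path,
                        "size": st.st_size,
                        "mtime": st.st_mtime,
                        "atime": st.st_atime,
                    })
    except OSError:
        # Unreadable or vanished directory: skip it and carry on.
        pass
    return records, subdirs


def build_table_of_contents(root, max_workers=8):
    """Crawl the tree below root level by level, scanning directories in parallel."""
    toc = []
    pending = [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            results = list(pool.map(scan_directory, pending))
            pending = []
            for records, subdirs in results:
                toc.extend(records)
                pending.extend(subdirs)
    return toc
```

Each directory is an independent unit of work, which is what makes the crawl easy to spread over many workers. The resulting list of records can be loaded into a dataframe for further processing.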

These services are, for example, used to:

- find expensive duplicates of datasets
- create customer-oriented as well as product- and service-oriented views on the available data
- find data suitable to test algorithms, software and procedures, and to derive current performance
- serve training and education material
- show the usage frequency of the datasets to support an optimised data tiering process
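To make the first use case concrete, here is a minimal sketch of how duplicate datasets could be found from a table of contents. It assumes a hypothetical input format, a list of dicts with `path` and `size` keys, and uses a two-stage approach (group by size, then compare SHA-256 digests) chosen for this example; the talk does not state which method was actually used.

```python
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(toc):
    """Return [(wasted_bytes, [paths...]), ...], most expensive group first.

    toc is assumed to be a list of dicts with 'path' and 'size' keys.
    """
    # Stage 1: only files of equal size can be identical, so group by size first.
    by_size = defaultdict(list)
    for rec in toc:
        by_size[rec["size"]].append(rec["path"])

    groups = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        # Stage 2: hash only the remaining candidates.
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        for dup in by_digest.values():
            if len(dup) > 1:
                # Every copy beyond the first counts as wasted space.
                groups.append((size * (len(dup) - 1), sorted(dup)))
    return sorted(groups, reverse=True)
```

Grouping by size first keeps the expensive hashing step limited to files that could actually be identical, which matters when the table of contents covers a very large storage system.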

Finally, the procedures involved helped to raise awareness of the value of the available data. This both built more trust in Big Data based solutions and reduced the volume of data kept online, which in turn keeps the corresponding costs at a reasonable level.

Saturday 10:00–10:45 in Dining Room

Jens Nie