Saturday 10:00–10:45 in Dining Room

Scale out from the very beginning

Jens Nie

Audience level:


Most companies are well aware of the potential behind Big Data solutions today and happily start collecting every piece of information, building huge pools of Dark Data. How can Data Science teams create an initial overview of what is available? A simple search strategy, optimised and refined to scale, could be a promising way to start.


This talk reflects on the author's journey of making the pool of Dark Data available to teams with quite different goals. The emphasis is on creating a simple and robust set of tools that fit together and address the various needs of the teams, based mainly on dask distributed, dask-based dataframes, bokeh and flask.

The key to success was to avoid imposing too much structure at the very beginning and to postpone this task to the individual projects of the users consuming the results of these services, giving them the freedom to create and use their own models.

The talk shows how we implemented a distributed filesystem scanning utility that crawls our 1.5 PB storage system for data every night and produces a simple yet useful table of contents, and how this result set is processed further to fulfil all the project teams' requirements.
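
To make the idea of a nightly crawl that yields a table of contents more concrete, here is a minimal sketch. It is not the implementation presented in the talk: it uses a thread pool from the Python standard library as a stand-in for dask distributed (whose `Client` offers a comparable `map`/`submit` interface), and plain Python aggregation in place of dask dataframes. The function names `scan_directory`, `crawl` and `summarise_by_suffix`, as well as the record fields, are hypothetical and chosen only for illustration.

```python
import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def scan_directory(path):
    """List one directory and return (file_records, subdirectories)."""
    records, subdirs = [], []
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    info = entry.stat(follow_symlinks=False)
                    records.append({
                        "path": entry.path,
                        "suffix": os.path.splitext(entry.name)[1].lower(),
                        "size_bytes": info.st_size,
                        "modified": info.st_mtime,
                    })
    except OSError:
        # Unreadable directories are skipped instead of aborting the crawl.
        pass
    return records, subdirs


def crawl(root, max_workers=8):
    """Scan a tree level by level, with one task per directory."""
    table, pending = [], [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            next_level = []
            # Each directory of the current level is scanned as its own task.
            for records, subdirs in pool.map(scan_directory, pending):
                table.extend(records)
                next_level.extend(subdirs)
            pending = next_level
    return table


def summarise_by_suffix(table):
    """Aggregate the table of contents into file count and size per suffix."""
    totals = defaultdict(lambda: {"files": 0, "size_bytes": 0})
    for row in table:
        bucket = totals[row["suffix"]]
        bucket["files"] += 1
        bucket["size_bytes"] += row["size_bytes"]
    return dict(totals)
```

Calling `crawl("/some/root")` returns a flat list of records, one per file, which is deliberately unstructured beyond path, suffix, size and modification time. Any further modelling, such as the per-suffix summary above, is left to the consumer of the table, in line with the approach of postponing structure described earlier.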

These services are for example used to

Finally, the procedures involved raised awareness of the value of the available data. This helped both to build more trust in Big Data based solutions and to reduce the volume of data kept online, which in turn keeps the corresponding costs at a reasonable level.