Keynotes
Trent McConaghy
ascribe
Rewiring the Internet for Ownership with Big Data and Blockchains
When it comes to ownership, the internet is broken. Artists, designers, and other creatives can share their work easily on the internet, but keeping it "theirs" and getting fairly compensated has proven difficult. How do you "own" something when bits can be copied freely? It turns out that visionaries of hypertext foresaw this issue in the 60s. They even proposed systems to handle this. However, those systems were too complex and hard to build. By the early 90s, the simpler WWW had won, but unfortunately in its simplicity it left out attribution to owners. We ask a new question: can we retrofit the internet for ownership? It turns out the answer is yes, with the help of Python-powered big data, machine learning, and the blockchain. First, we crawl the internet and create a large-scale crawl database, then preprocess all media into machine learning features. Then, creators can "register" their work onto the blockchain. Finally, we use machine learning to cross-reference registered works against the large-scale crawl database. We can do this for images, text, and even 3D designs; and it works even if the work has been changed meaningfully. Python-powered big data is making it possible to revive the dream of ownership on the internet.
Trent McConaghy, PhD has been doing machine learning (ML) research since the mid 90s. He co-founded ascribe GmbH, which enables copyright protection via internet-scale ML and the blockchain. Before that, he co-founded Solido, where he applied ML to circuit design; the majority of big semiconductor companies now use Solido. Before that, he co-founded ADA, also applying ML to circuits; it was acquired in 2004. Before that he did ML research at the Canadian Department of Defense. He has written two books and 50 papers and patents on ML. He co-organizes the Berlin ML meetup. He keynoted Data Science Day Berlin 2014, gave an invited talk at PyData Berlin 2014, and more.
Matthew Rocklin
Continuum Analytics
Matthew Rocklin likes numerics, mathematics, and programming paradigms. He contributes to a variety of open source projects and endeavors to demonstrate the value of abstract solutions to concrete problems. A graduate of UC Berkeley (Physics, Math) and of the University of Chicago (PhD in CS), Matthew is currently a Computational Scientist at Continuum Analytics.
Felix Wick
Blue Yonder
From the Life of a Data Scientist
T. Davenport and DJ Patil pointed out as early as 2012 that Data Scientists hold the “sexiest job of the 21st century”. Although there are plenty of ideas about what a Data Scientist actually does, core skills for this job definitely include Machine Learning, software development, and business domain expertise.
The talk will give a short tour through these key aspects and a glimpse of the working life of a Data Scientist at Blue Yonder. We build predictive applications for customers from various business fields or, to put it another way, deliver productive data science as a service. In doing so, we make extensive use of the Python ecosystem and leverage it to overcome the two-language problem, covering both rapid prototyping (instead of e.g. R) and productive operation (instead of e.g. C++) in different project phases.
Felix Wick leads the machine learning and core development team at Blue Yonder, a SaaS provider for predictive applications. He obtained his diploma (2008) and PhD (2011) in physics at the Karlsruhe Institute of Technology (KIT). The topic of his research was experimental particle physics with a focus on statistical data analysis. In 2011, he joined Blue Yonder as Data Scientist, building predictive analytics solutions for various customer projects. He now focuses on the development of machine learning and data science algorithms primarily based on Python.
Speakers
Claas Abert
Vienna University of Technology
Scientific computing with Python: Tools for the solution of continuous problems
Although not exactly a classical big data application, the numerical treatment of partial differential equations (PDEs) has very similar characteristics: by spatial discretization, the continuous problem is translated into linear systems, and the discrete solution is represented by a vector of floating point numbers. Depending on the dimensions of the domain and the granularity of the spatial discretization, the size of the arising matrices and vectors may range from a few thousand to billions of entries. Usual operations include matrix-vector multiplications, the solution of linear systems and more complicated tasks like the solution of eigenvalue problems.
Due to the usually large problem sizes, computational performance is certainly a main design goal of PDE solvers. However, when it comes to the implementation of complex PDEs or algorithms in general, it is equally desirable to use high-level programming tools that allow a concise domain-related problem definition.
Recently, Python has gained a lot of attention in the scientific computing community, which was dominated by compiled languages such as Fortran and C for a long time. The reason for this development is most likely the fulfillment of the above-mentioned criteria: performance and brief syntax. While the brief syntax is a feature of the language itself, Python owes its high performance to the existence of excellent third-party libraries such as NumPy. Many novel scientific special-purpose libraries are still written in compiled languages, but come with Python wrappers and seamlessly integrate with NumPy.
The aim of this talk is to give a brief introduction to the problem domain and present a selection of Python tools and libraries for scientific computing with a focus on continuous problems.
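As a purely illustrative sketch of this kind of workflow, using NumPy and SciPy (the problem setup is an assumption, not taken from the talk): a 1D Poisson equation is discretized by finite differences into a sparse linear system, which is then solved and probed for a few eigenvalues.

    # Illustrative sketch: -u'' = f on the unit interval, discretized by finite differences.
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 1000                          # number of interior grid points
    h = 1.0 / (n + 1)                 # grid spacing

    # sparse tridiagonal matrix arising from the discretization
    A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr") / h**2

    x = np.linspace(h, 1.0 - h, n)
    f = np.sin(np.pi * x)             # right-hand side

    u = spla.spsolve(A, f)            # solve the linear system A u = f

    # a "more complicated task": a few of the smallest eigenvalues
    eigenvalues = spla.eigsh(A, k=3, which="SM", return_eigenvectors=False)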
Claas works as a research fellow at the Vienna University of Technology, where he develops simulation software for the investigation of magnetic processes at the nanoscale. He received his PhD in physics at the University of Hamburg and discovered his interest in modern scripting languages while working as a web developer for different companies.
Alejandro Correa Bahnsen
Luxembourg University
CostCla: a cost-sensitive classification library
Classification, in the context of machine learning, deals with the problem of predicting the class of a set of examples given their features. Traditionally, classification methods aim at minimizing the misclassification of examples, in which an example is misclassified if the predicted class is different from the true class. Such a traditional framework assumes that all misclassification errors carry the same cost. This is not the case in many real-world applications such as credit card fraud detection, credit scoring, churn modeling and direct marketing. In this talk I would like to present CostCla, a cost-sensitive classification library. The library incorporates several cost-sensitive algorithms. Moreover, during the talk I will show the huge differences in profit when using traditional machine learning algorithms versus cost-sensitive algorithms, on several real-world databases.
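To make the example-dependent cost idea concrete, here is a small hedged sketch using only NumPy and scikit-learn on synthetic data (it deliberately does not assume the CostCla API): each example carries its own misclassification cost, and a model is judged by the savings it achieves over a do-nothing baseline rather than by accuracy.

    # Sketch: example-dependent costs for a toy fraud problem (synthetic data).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 5)
    y = (X[:, 0] + 0.5 * rng.randn(1000) > 0).astype(int)   # 1 = fraud

    amounts = np.abs(100 * rng.randn(1000))    # a missed fraud costs the transaction amount
    cost_fp = 5.0                              # a false alarm costs a fixed investigation fee

    pred = LogisticRegression().fit(X, y).predict(X)

    total_cost = np.sum(np.where(y == 1, (pred == 0) * amounts, (pred == 1) * cost_fp))
    baseline_cost = amounts[y == 1].sum()      # cost of always predicting "no fraud"
    print("savings over baseline:", 1 - total_cost / baseline_cost)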
Alejandro Correa Bahnsen is currently working towards a PhD in Machine Learning at Luxembourg University. His research area relates to cost-sensitive classification and its application in a variety of real-world problems such as fraud detection, credit risk, direct marketing and churn modeling. He also works part-time as a fraud data scientist at CETREL, a SIX company, applying his research to detecting fraud. Before starting his PhD, he worked for five years as a data scientist at GE Money and Scotiabank, applying data mining models in a variety of areas from advertisement to financial risk management. He has written and published many academic and industrial papers in peer-reviewed publications. Moreover, Alejandro has experience as an instructor of econometrics, financial risk management and machine learning. He is also a co-organizer of the Data Science Luxembourg meetup.
Paul Balzer
MechLab Engineering
Running, walking, sitting or biking? - Motion prediction with acceleration and rotation rates
A lot of devices can measure acceleration and rotation rates. With the right features, Machine Learning can predict whether you are sitting, running, walking or biking. This talk will show you how to calculate features with Pandas and set up a real-time classifier with scikit-learn, including a hardware demo.
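A hedged sketch of that workflow, with hypothetical column names (ax, az, gz, activity) standing in for a real sensor log: rolling-window statistics computed with Pandas become the features for a scikit-learn classifier.

    # Sketch: windowed features from an IMU log, then a random forest classifier.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("imu_log.csv")            # hypothetical columns: ax, ay, az, gx, gy, gz, activity
    window = 50                                # e.g. one second of samples at 50 Hz

    feats = pd.DataFrame({
        "ax_std": df["ax"].rolling(window).std(),
        "az_mean": df["az"].rolling(window).mean(),
        "gz_std": df["gz"].rolling(window).std(),
    }).dropna()
    labels = df["activity"].loc[feats.index]

    clf = RandomForestClassifier(n_estimators=100).fit(feats, labels)
    # for real-time use, compute the same features over the latest window and call clf.predict(...)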
Paul Balzer is Data Analyst/CEO at MechLab Engineering and Open Data activist in Dresden.
Paul Balzer
MechLab Engineering
Analysing and predicting inner-city parking space occupancy
The city of Dresden has an excellent traffic monitoring and guiding system (VAMOS), which also measures the occupancy rate of city parking spaces. The data is pushed to the city's website, from which it has been scraped by Dresden's Open Data activists for the past year. This talk shows how to analyse the data with Pandas and predict future occupancy with scikit-learn, in particular which features are important for predicting the shopping behavior of citizens and tourists.
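As a hedged illustration of the feature question (the file name and columns below are assumptions, not the project's actual data), simple calendar features can be derived with Pandas and ranked with a random forest:

    # Sketch: which calendar features matter for occupancy?
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    ts = pd.read_csv("parking_occupancy.csv", parse_dates=["timestamp"])   # assumed columns
    X = pd.DataFrame({
        "hour": ts["timestamp"].dt.hour,
        "weekday": ts["timestamp"].dt.dayofweek,
        "month": ts["timestamp"].dt.month,
    })
    y = ts["occupied_spaces"]

    model = RandomForestRegressor(n_estimators=200).fit(X, y)
    print(sorted(zip(model.feature_importances_, X.columns), reverse=True))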
Paul Balzer is Data Analyst/CEO at MechLab Engineering and Open Data activist in Dresden.
Sylvain Bellemare
Learning to use Docker for development
A very simple tutorial, ideally aimed at beginners to both Docker and scientific Python, who wish to learn the basics needed to create and manage their own development environments using Docker.
We'll write a Dockerfile to build a Docker image that will have a few basic scientific libraries (matplotlib, numpy, ipython/jupyter notebook). We'll run the notebook in the docker container, and then learn how to interact with the notebook.
As a next step, we'll use docker-machine to run our Docker container on a remote host. (This may be practical for performance reasons.)
If time allows, we may also spin up another container, running Postgres, or another data store (e.g.: Redis, Elasticsearch). We'll tie in the communication between both containers using docker-compose. We'll figure out a way of storing some data in our data store and plotting some graphs for it in the notebook.
The goal of the tutorial is to empower developers to take full ownership of their own development environment, while leveraging what Docker has to offer. Hence, the tutorial will adopt a pace that beginners will be able to follow, so that at the end they will have completed a re-usable Docker-based development environment, which they may extend and modify according to their needs in the future.
Tutorial prerequisites and instructions.
Sylvain Bellemare is a software engineer, and currently works mainly with Django and Postgres. He is himself new to Docker and scientific Python. He has experience managing development environments with Vagrant, SaltStack, and VirtualBox, at work, and for fun, such as in https://github.com/sciboxes/scipybox.
Lea Böhm
Founder of Alles Roger
What people need to be happy at work & how you can influence a diverse team environment
A lot of studies have lately investigated how happy and engaged people are at work. They found that a big influence is the team atmosphere and the relationship you have with your boss.
Being engaged describes a state where you feel energized, involved and effective. Looking at the numbers, we'll find that only about 16% of people are truly engaged at work.
The talk will give insight into things that lead to a happier work environment, give inspiration for what you can do to actively shape one, and show how this will establish the foundation for more diversity within your team.
Lea studied economics, politics and sociology and worked several years in finance and operations until she realized that people are much more interesting than numbers. Consequently, she switched to people management in 2012 and became an expert in topics such as internal communication, motivation, team building and development, and employer branding. She holds a degree in coaching and mediation, is the MD of BLS Venture Capital and founder of Alles Roger.
Miguel Fernando Cabrera
TrustYou
Processing Hotel Reviews with Python
In this talk I will present experiences from using a combination of Hadoop and Python to build pipelines that process large amounts of textual hotel reviews in more than a dozen languages. In particular, I will show the application of Word2vec (via Gensim) to extract information and cluster similar hotels based on the opinions of users.
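A minimal Gensim sketch of the Word2vec step (toy sentences; parameter and attribute names vary slightly between Gensim versions):

    # Sketch: train Word2vec on tokenized review sentences and query similar words.
    from gensim.models import Word2Vec

    sentences = [
        ["the", "room", "was", "very", "clean"],
        ["great", "breakfast", "and", "friendly", "staff"],
    ]  # in practice: millions of tokenized review sentences

    model = Word2Vec(sentences, min_count=1, workers=4)
    print(model.wv.most_similar("clean", topn=3))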
A Neuberliner, Miguel works as a Data Scientist for TrustYou, juggling (mostly textual) data. He obtained an M.Sc. degree in Informatics from TU Munich, where he also founded and ran Munich Datageeks, the largest ML/Data Science group in Bavaria. When not using a computer, Miguel enjoys playing Latin American folk music and doing martial arts.
Brian Carter
IBM Software Group
Lifecycle of Web Text Mining: Scrape to Sense
Pillreports.net is an online database of reviews of Ecstasy pills. In consumer theory, illicit drugs are experience goods, in that the contents are not known until the time of consumption. Websites like Pillreports.net may be viewed as an attempt to bridge that gap, as well as to highlight instances where a particular pill is producing undesirable effects. This talk will present the experiences and insights from a text mining project using data scraped from the Pillreports.net site.
- The setup, benefits and ease of using the BeautifulSoup package and pymongo to store the data in MongoDB will be outlined.
- A brief overview of some interesting parts of the data cleansing will be given.
- Insights and understanding of the data gained from applying classification and clustering techniques will be outlined, in particular visualizations of decision boundaries in classification using the "most important variables", and visualizations of PCA projections to illustrate cluster separation.
The talk will be presented in an IPython notebook and all relevant datasets and code will be supplied. Python packages used: bs4, matplotlib, nltk, numpy, pandas, re, seaborn, sklearn, scipy, urllib2.
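A hedged sketch of the scraping-and-storage step (the URL and page structure here are placeholders, not the project's actual code):

    # Sketch: fetch a report page, parse it with BeautifulSoup, store it in MongoDB.
    import urllib2                       # Python 2, matching the package list above
    from bs4 import BeautifulSoup
    from pymongo import MongoClient

    html = urllib2.urlopen("http://www.pillreports.net/...").read()   # placeholder URL
    soup = BeautifulSoup(html, "html.parser")

    report = {
        "title": soup.find("title").get_text(strip=True),
        # further fields would be pulled out of the report table here
    }
    MongoClient().pillreports.reports.insert_one(report)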
Peadar Coyle
Probabilistic Programming in Sports Analytics
Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications, such as in quantitative finance.
I'll discuss what probabilistic programming is, why you should care, and how to use PyMC and PyMC3 from Python to implement these methods. I'll apply these methods to the problem of 'rugby sports analytics', in particular how to model the winning team in the recent Six Nations in Rugby. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert. Slides at: http://nbviewer.ipython.org/format/slides/github/springcoil/Probabilistic_Programming_and_Rugby/blob/master/Bayesian_Rugby.ipynb#/
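A minimal PyMC3 sketch in the spirit of such a model (toy data and hypothetical variable names, not the notebook's actual code): per-team attack and defence strengths feed a Poisson likelihood for the points scored.

    # Sketch: hierarchical attack/defence model for match scores.
    import numpy as np
    import pymc3 as pm

    home_team = np.array([0, 1, 2])           # indices into a list of three teams
    away_team = np.array([1, 2, 0])
    home_points = np.array([23, 17, 30])

    with pm.Model():
        home_adv = pm.Normal("home_adv", mu=0, sd=1)
        attack = pm.Normal("attack", mu=0, sd=1, shape=3)
        defence = pm.Normal("defence", mu=0, sd=1, shape=3)

        theta = pm.math.exp(3 + home_adv + attack[home_team] - defence[away_team])
        pm.Poisson("obs", mu=theta, observed=home_points)

        trace = pm.sample(1000)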
Peadar Coyle is an Energy Analyst at Vodafone Procurement Company in Luxembourg, where he is applying his analytical skills to help his colleagues better understand their data and their market. Peadar is enthusiastic about driving innovation in new products and developing novel applications of artificial intelligence. He has full-stack development experience, including design, building, and shipping of data-, machine learning-, and visualisation-intensive products and solutions. He regularly contributes to the Data Science community in Luxembourg by speaking at the local meetup. His open source contributions include Probabilistic Programming and Bayesian Methods for Hackers and some small contributions to Pandas. Peadar is a regular speaker throughout Europe on data analysis and contributes to the Python community, including speaking at PyCon in Florence. Peadar has a Masters in Mathematics from the University of Luxembourg, where he specialized in Statistics and Theoretical Machine Learning.
Christine Doig
Continuum Analytics
Interactive Data Visualizations with Python
Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. It provides elegant, concise construction of novel graphics in the style of D3.js without having to write any JS. Attendees will learn how to get set up with Bokeh, key library and plotting concepts, how to plot basic glyphs, use high-level charts, style visualizations, configure plot tools, layer multiple plots, add interactions, deploy with the bokeh-server and embed plots in your applications. The tutorial will include notebooks and python scripts with exercises that will be solved during the session, with the aim of making the tutorial hands-on and interactive for attendees.
The talk is aimed at Web developers, Data Scientists and Python programmers interested in data visualizations for the browser. Advanced beginner or intermediate python skills would be nice to have to get the most out of the tutorial.
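For orientation, here is a minimal Bokeh example in the spirit of the tutorial (glyph methods and keyword arguments vary somewhat between Bokeh versions):

    # Sketch: a standalone HTML plot with line and circle glyphs.
    from bokeh.plotting import figure, output_file, show

    x = [1, 2, 3, 4, 5]
    y = [6, 7, 2, 4, 5]

    output_file("lines.html")
    p = figure(title="simple example", x_axis_label="x", y_axis_label="y")
    p.line(x, y, line_width=2)
    p.circle(x, y, size=8)
    show(p)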
Christine Doig is a Data Scientist at Continuum Analytics. She holds a M.S. in Industrial Engineering from UPC, Barcelona, where she also started a Masters in Innovation and Research in Informatics, Data Mining and BI, before joining Continuum Analytics.
She is interested in Data Science and Python and loves to share her knowledge with others. She has taught tutorials and presented talks on conda, Blaze, Bokeh, scikit-learn and Data Science at PyCon, PyTexas, PyCon Spain, PyData Dallas, ScipyConf and local meetup groups like PyBCN, PyladiesBCN, APUG and ACM SIGKDD. Blogposts, talks, slides and videos can be found on her site http://chdoig.github.io/.
Ignacio Elola
import.io
Python as a Framework for Analytics and Growth Hacking
Python is the perfect language for building, with little effort, a framework to control and hack the growth of a company. Having been import.io's data scientist for the last 2 years, I've come across many different problems and needs around how to wrangle data, clean data, report on it and make predictions. In this talk I will cover the main analytics and data science needs of a start-up using Python, numpy, pandas, and sklearn. We will go through examples of how to wrangle analytics and KPIs in Python, and make simple models to do basic predictions that can really make the difference in the business world. For every use case there will be snippets of code using IPython notebooks, and some will be run as live demos.
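As a hedged example of the kind of KPI wrangling meant here (the event-log schema is hypothetical), daily signups and a simple conversion rate can be computed in a few lines of Pandas:

    # Sketch: daily KPIs from a raw event log.
    import pandas as pd

    events = pd.read_csv("events.csv", parse_dates=["timestamp"])   # assumed columns: timestamp, user_id, event

    daily = (events.set_index("timestamp")
                   .groupby("event").resample("D").size()
                   .unstack(0))
    daily["conversion"] = daily["purchase"] / daily["signup"]
    print(daily.tail())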
A maths and complex systems nerd, I studied physics at Universidad Autonoma de Madrid and have done some research on systems biology in Spain and The Netherlands. Now I am interested in start-ups, new technologies and lean methodologies, still driven by curiosity and motivation. I am passionate about data democratization, data science and big data. I've been import.io's Data Scientist for the last 2 years, where I am responsible for the analytics and data modelling for business and growth. I've also been a mentor in the first S2DS (Science to Data Science) course in the UK, a course for science PhDs transitioning into real-life Data Science.
Valentin Haenel
haenel.co
Blosc
Blosc is a fast metacodec with two main features: the shuffle filter and threading. The shuffle filter, which is implemented using SSE2 instructions, allows reordering bytes to reduce the complexity of certain datasets. Threading, on the other hand, allows parallelization of existing codecs, hence the term *metacodec*. Blosc was originally conceived to mitigate the problem of starving CPUs which results from the ever-growing divide between clock speed and memory latency. Recently, it has become increasingly useful for other scenarios too, for example out-of-core approaches and compressed in-memory storage. Blosc has a small codebase and is implemented in C. Additionally, several pieces of interesting software, largely written in Python, have emerged that make use of Blosc, showcasing its potential and exploring new use-cases. Bcolz, for example, is a compressed in-memory and out-of-core container for numerical data. This talk is about Blosc and its Python friends.
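A small python-blosc sketch showing the pieces mentioned above, shuffle filter plus threads, on a NumPy array (compression ratios of course depend on the data):

    # Sketch: round-trip a NumPy array through Blosc.
    import numpy as np
    import blosc

    a = np.linspace(0, 100, int(1e7))

    blosc.set_nthreads(4)
    packed = blosc.pack_array(a, clevel=9, shuffle=blosc.SHUFFLE, cname="blosclz")
    print("compression ratio:", a.nbytes / float(len(packed)))

    b = blosc.unpack_array(packed)
    assert np.array_equal(a, b)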
Valentin is a freelance software engineer interested in compression, exploiting the memory hierarchy for accelerated computation, and out-of-core compute engines. In the past, he has worked on psychophysics data analysis, large-scale brain simulations and analytical engines for business intelligence. He also wrote a book about Git and has contributed to a selection of open source projects. He currently resides in Berlin, where he also co-organizes the monthly PyData Berlin meetups.
Peter Hoffmann
Blue Yonder
Introduction to the PySpark DataFrame API
Apache Spark is a computational engine for large-scale data processing. It is responsible for scheduling, distributing and monitoring applications that consist of many computational tasks across many worker machines on a computing cluster.
This talk will give an overview of the PySpark DataFrame API. While Spark core itself is written in Scala and runs on the JVM, PySpark exposes the Spark programming model to Python. The Spark DataFrame API was introduced in Spark 1.3. DataFrames evolve Spark's Resilient Distributed Datasets model and are inspired by Pandas and R data frames. The API provides simplified operators for filtering, aggregating, and projecting over large datasets. The DataFrame API supports different data sources such as JSON, Parquet files, Hive tables and JDBC database connections.
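A short PySpark sketch of those operators (written against the Spark 1.3-era entry points; later versions use a SparkSession instead):

    # Sketch: build a DataFrame and apply filter / projection / aggregation.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="dataframe-demo")
    sqlContext = SQLContext(sc)

    rows = sc.parallelize([
        Row(country="DE", amount=120.0),
        Row(country="DE", amount=80.0),
        Row(country="FR", amount=200.0),
    ])
    df = sqlContext.createDataFrame(rows)

    df.filter(df.amount > 100).select("country", "amount") \
      .groupBy("country").count().show()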
I'm a Python Developer from Karlsruhe, Germany. I'm working as a Senior Web Developer at Blue Yonder building Predictive Applications.
Alexander Kagoshima
Pivotal
A Data Science Operationalization Framework
In a lot of our Data Science customer engagements at Pivotal, the question comes up of how to put the developed Data Science models into production. Usually, the code produced by the Data Scientist is a bunch of scripts that go from data loading over data cleansing to feature extraction and then model training. Rarely is much thought put into how the resulting model can be used by other pieces of software, and the Data Scientist's work is generally not encapsulated in a way that others can re-use.
What we as Data Scientists want is to create models that drive automated decision-making but there is clearly a mismatch to the above way of going about Big Data projects. Considering these challenges, we created a small prototype for a Data Science operationalization framework. This allows the Data Scientist to implement a model which is exposed by the framework as a REST API for easy access by software developers.
The difference to other predictive APIs is that this framework allows for automatic periodic retraining of the implemented model on incoming streaming data and is able to free the Data Scientist of some tedious work - like keeping track of results for different modelling and feature engineering approaches, basic visualization of model performance and the creation of multiple model instances for different data streams. It is written by practicing Data Scientists for Data Scientists.
Moreover, the framework will be released this year under an Open Source license which means that unlike other predictive APIs which only host one instance for Data Scientists to push their models to, this allows Data Scientists to completely control their own model codebase. In addition, it is deployable on Cloud Foundry and Heroku and can thus use some features of PaaS, which means less work in thinking about how to deploy and scale a model in production.
The model is implemented in Python and uses Flask to expose the REST API and the current prototype uses Redis as backend storage for the trained models. Models can be either custom-written or use existing Python ML libraries like scikit-learn. The framework is currently geared towards online learning, but it is possible to hook it up to a Spark backend to realize model training in batch on large datasets.
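A hypothetical mini-version of that setup, not the Pivotal framework itself, illustrates the moving parts: a Flask endpoint serves predictions from a pickled scikit-learn model held in Redis, so that a separate retraining job can swap the model at any time.

    # Sketch: serve the current model from Redis via a REST endpoint.
    import pickle
    from flask import Flask, request, jsonify
    import redis

    r = redis.StrictRedis()
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        model = pickle.loads(r.get("current_model"))     # latest trained model
        features = request.get_json()["features"]
        return jsonify(prediction=int(model.predict([features])[0]))

    if __name__ == "__main__":
        # a periodic retraining process would do: r.set("current_model", pickle.dumps(fitted_model))
        app.run(port=5000)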
Alexander Kagoshima became the first member of Pivotal's EMEA Data Science Team after receiving an MSc (Dipl.-Ing.) in Business and Software Engineering from TU Berlin. While working towards his degree, he already started applying his focus area of Machine Learning to real-world problems, using techniques from this field to analyze and predict malfunctions in a test fleet of fuel-cell cars from Volkswagen as well as to optimize wind-turbine control systems in a research project at Siemens. As a Data Scientist at Pivotal, he mainly develops large-scale Machine Learning systems for customers of Pivotal's Big Data Suite across different industries and use-cases, with a special focus on network security, fraud detection and the Internet of Things. In his spare time, he searches for new ways to analyze soccer with statistical methods and likes to see the German soccer team winning.
Tobias Kuhn
Trademob
Real-Time Monitoring of Distributed Systems
Instrumentation has seen explosive adoption in the cloud in recent years. With the rise of micro-services we are now in an era where we measure even the most trivial events in our systems. At Trademob, a mobile DSP with upwards of 125k requests per second across 700+ instances, we generate and collect millions of time-series data points. Gaining key insights from this data has proven to be a huge challenge.
Outlier and Anomaly detection are two techniques that help us comprehend the behavior of our systems and allow us to take actionable decisions with little or no human intervention. Outlier Detection is the identification of misbehavior across multiple subsystems and/or aggregation layers on a machine level, whereas Anomaly Detection lets us identify issues by detecting deviations against normal behavior on a temporal level. The analysis of these deviations is simplified through the use of a time and memory efficient data structure called a t-digest. With t-digests we are able to store error distributions with high accuracy, especially for extreme quantile values.
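A quick sketch with the tdigest Python package (used here purely for illustration) shows the data structure in action on a simulated latency stream:

    # Sketch: streaming quantile estimates with a t-digest.
    import random
    from tdigest import TDigest

    digest = TDigest()
    for _ in range(100000):
        digest.update(random.expovariate(1.0))    # simulated latencies

    print("p50  :", digest.percentile(50))
    print("p99.9:", digest.percentile(99.9))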
At Trademob, we developed a Python-based real-time monitoring system to conquer those challenges in order to reduce false positive alerts and increase overall business performance. By correlating a multitude of metrics we can determine system interdependencies, preemptively detect issues and also gain key insights to causality. This session will provide insights into both the system’s architecture and the algorithms used to detect unwanted behaviors.
Tobias is a physicist and works at Trademob GmbH as a Data Analyst.
Mike Müller
Python Academy
Measure, don't Guess - How to find out if and where to optimize
Python is a great language. But it can be slow compared to other languages for certain types of tasks. If applied appropriately, optimization may reduce program runtime or memory consumption considerably. But this often comes at a price. Optimization can be time consuming and the optimized program may be more complicated. This, in turn, means more maintenance effort. How do you find out if it is worthwhile to optimize your program? Where should you start? This tutorial will help you to answer these questions. You will learn how to find an optimization strategy based on quantitative and objective criteria.
You will experience that one's gut feeling about what to optimize is often wrong.
The solution to this problem is: "Measure, Measure, and Measure!" You will learn how to measure program run times as well as profile CPU and memory usage.
There are great tools available. You will learn how to use some of them. Measuring is not easy because, by definition, as soon as you start to measure, you influence your system. Keeping this impact as small as possible is important. Therefore, we will cover different measuring techniques.
Furthermore, we will look at algorithmic improvements. You will see that the right data structure for the job can make a big difference. Finally, you will learn about different caching techniques.
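For a flavour of the measuring tools involved, here is a small sketch using only the standard library (timeit for micro-benchmarks, cProfile for a function-level profile); the concrete tools covered in the tutorial may go beyond these:

    # Sketch: compare two implementations, then profile one of them.
    import timeit
    import cProfile

    def concat_strings(n=10000):
        s = ""
        for i in range(n):
            s += str(i)
        return s

    def join_strings(n=10000):
        return "".join(str(i) for i in range(n))

    print(timeit.timeit(concat_strings, number=100))
    print(timeit.timeit(join_strings, number=100))

    cProfile.run("concat_strings()", sort="cumulative")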
Tutorial prerequisites and instructions.
Ronert Obst
Pivotal
Smart cars of tomorrow: real-time driving patterns
In recent years, the adoption of electric cars has resulted in a desperate need from carmakers for accurate range prediction. In addition, fuel efficiency is of increasing concern due to today’s ever-rising fuel costs. In this talk, we will outline a machine learning framework for real-time data analysis to demonstrate how live data collected from cars can be used to provide valuable information for range prediction and smart navigation.
For our solution, we use a Bluetooth dongle that connects to a standard OBD II car diagnostics data port. Together with a self-developed iOS app we can then stream OBD II data into our framework’s big data infrastructure for long-term storage, batch training processes, and subsequent real-time analysis. We will show how we used different open-source technologies (Spark, Spring XD, Python and others) to stream, store, and reason over this data in a scalable way.
In particular, we will focus on how we designed the machine learning framework to derive individual driver ‘fingerprints’ from variables such as speed, acceleration, driving times, and location, taken from historical data. These fingerprints are then used within the real-time prediction framework to determine final journey destination and driving behavior in real time during the journey. We will also look at how other public and free data sources such as traffic information, weather, and fuel station locations could be used to further improve the accuracy and scope of our models.
This talk is intended to demonstrate pioneering work in the space of big data and the connected car. We will take into consideration the insights we have gained from building this prototype, both into infrastructure and analysis, to give our view on what such real-time driving intelligence applications of tomorrow could look like.
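As a hedged illustration of the 'fingerprint' idea (the trip-log schema below is an assumption), per-driver aggregates can be computed with Pandas and grouped into behavioural clusters with scikit-learn:

    # Sketch: per-driver features from OBD II trip logs, clustered into fingerprints.
    import pandas as pd
    from sklearn.cluster import KMeans

    trips = pd.read_csv("obd_trips.csv")        # assumed columns: driver_id, speed, accel, hour

    features = trips.groupby("driver_id").agg(
        mean_speed=("speed", "mean"),
        accel_std=("accel", "std"),
        night_share=("hour", lambda h: (h >= 22).mean()),
    )
    fingerprints = KMeans(n_clusters=4, n_init=10).fit_predict(features)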
Ronert got his MSc in Statistics at Ludwig-Maximilians-Universität in Munich and now works as a Data Scientist at Pivotal (http://www.pivotal.io) in Berlin. His focus is on applying algorithms from machine learning and statistics to large data sets.
Philipp Pahl
Peachbox: Agile and Accessible Big ETL Framework
Today data is generated in greater volumes than ever before. In addition to vast amounts of legacy data, new data sources such as application logs or social media add to the data-processing challenges. The ultimate goal is to gain insights and derive prescriptions to support decisions or develop predictive apps. However, the preceding steps of data integration and warehousing, which allow data to be explored and applied, are usually hard and require expert knowledge to design and implement.
Peachbox solves this by providing an agile and accessible open source solution to the Big ETL process. Peachbox is a Python framework based on and conforming to the ‘Lambda Architecture’, which in turn is an abstracted pattern providing principles and best practices for real-time and scalable data systems. The main underlying technology is PySpark.
In the tutorial we will set up Peachbox and implement a general and extensible Big ETL system. Furthermore we will explore potential applications.
Tutorial prerequisites and instructions.
Thomas Pfaff
Blue Yonder
Advanced Data Storage
In this tutorial we will give an introduction to two advanced data storage formats. HDF5 and NetCDF were designed to efficiently store the results of supercomputing applications like climate model outputs, or the data streams received from NASA's fleet of earth observing satellites. They provide a lot of optimizations concerning transparent file compression, speed of access or working with multiple files as if it were one large data set.
A couple of Python libraries exist that allow fast and pythonic access to these formats.
We will show you how to create and access these types of files from Python, and how to use their advanced features to tune them for maximum efficiency.
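A minimal h5py example of the kind of features meant here: a chunked, transparently compressed dataset with self-describing metadata that can later be read back slice by slice.

    # Sketch: write and partially read a compressed HDF5 dataset.
    import numpy as np
    import h5py

    data = np.random.rand(1000, 1000)

    with h5py.File("example.h5", "w") as f:
        dset = f.create_dataset("measurements", data=data,
                                chunks=(100, 100), compression="gzip")
        dset.attrs["units"] = "mm/h"

    with h5py.File("example.h5", "r") as f:
        block = f["measurements"][:100, :100]   # only this slice is read and decompressed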
Tutorial prerequisites and instructions.
Thomas Pfaff received his Diploma degree in Environmental Engineering in 2004 and his PhD from Stuttgart University in 2012, where he had been working on improving precipitation estimates from weather radar data. One of his specialties is therefore the processing and analysis of large amounts of environmental sensing data. Since 2014 he has been a member of the core data science algorithms team at Blue Yonder GmbH, Karlsruhe.
Jose Luis Lopez Pino
GetYourGuide
Lessons learned from applying PyData to our marketing organization
For all e-commerce sites, marketing is a big part of the business and marketing efficiency and effectiveness are critical to their success. Companies must make many data-driven decisions in order to reach customers that their competitors don’t, maximize the revenue of each click, decide wisely what are the costs to cut, enter new markets, etc.
GetYourGuide has been working for more than two years on building a marketing intelligence system that allows us to grow our marketing efforts in the travel market without building a huge team or buying extremely expensive tools.
All the decisions are supported by a dedicated system running on the PyData stack that allows marketers to extract valuable insights from data and performs critical marketing tasks: keyword mining, campaign automation, predictive modeling, omni-channel marketing data integration, customer segmentation, pattern mining from click data, etc.
As a result of this, we were able to scale up our marketing efforts threefold, launch campaigns in 13 markets and automate 75% of our work in the last 8 months alone. But this is not the end of our journey: GetYourGuide is building a Data Science team to understand travelers' needs and wants and make our customers' trips amazing.
Jose Luis Lopez Pino is part of the Data Science team of GetYourGuide. Before that, he worked as a Business Intelligence consultant and joined the Information Technology for Business Intelligence high-quality education program. His new venture is Smergy, an initiative to revolutionize energy efficiency using big data, supported by the European Institute of Innovation and Technology.
Nakul Selvaraj
Trademob
Real-Time Monitoring of Distributed Systems
Instrumentation has seen explosive adoption in the cloud in recent years. With the rise of micro-services we are now in an era where we measure even the most trivial events in our systems. At Trademob, a mobile DSP with upwards of 125k requests per second across 700+ instances, we generate and collect millions of time-series data points. Gaining key insights from this data has proven to be a huge challenge.
Outlier and Anomaly detection are two techniques that help us comprehend the behavior of our systems and allow us to take actionable decisions with little or no human intervention. Outlier Detection is the identification of misbehavior across multiple subsystems and/or aggregation layers on a machine level, whereas Anomaly Detection lets us identify issues by detecting deviations against normal behavior on a temporal level. The analysis of these deviations is simplified through the use of a time and memory efficient data structure called a t-digest. With t-digests we are able to store error distributions with high accuracy, especially for extreme quantile values.
At Trademob, we developed a Python-based real-time monitoring system to conquer those challenges in order to reduce false positive alerts and increase overall business performance. By correlating a multitude of metrics we can determine system interdependencies, preemptively detect issues and also gain key insights to causality. This session will provide insights into both the system’s architecture and the algorithms used to detect unwanted behaviors.
Selvaraj works at Trademob GmbH as a Software Engineer in R&D. His work focuses on Data Engineering and Monitoring.