Presentation Abstracts

ABBY - A Django app to document your A/B tests

Jul 27 - 10:10 a.m.
Andy Goldschmidt
ABBY is a Django app that helps you manage your A/B tests. The main objective is to document all tests happening in your company, in order to better understand which measures work and which don't, thereby leading to a better understanding of your product and your customers. ABBY offers a front-end that makes it easy to edit, delete or create tests and to add evaluation results. Further, it provides a RESTful API to integrate directly with our platform, so A/B tests can be handled without touching the front-end. Another notable feature is the possibility to upload a CSV file and have the A/B test auto-evaluated, although this feature is considered highly experimental. At Jimdo, a do-it-yourself website builder, we have a team of about 180 people from different countries and with professional backgrounds just as diverse. It is therefore crucial to have tools that provide a common perspective on the tests, which facilitates data-informed discussions and helps us deduce effective solutions. In our opinion, tools like ABBY are cornerstones on the way to the ultimate goal of being a data-driven company: it enables all our co-workers to review past tests and plan future ones to further improve our product and raise the happiness of our customers. The proposed talk will give a detailed overview of ABBY, which eventually will be open-sourced, and its capabilities. I will further discuss the motivation behind the app and the influence it has on the way our company is becoming increasingly data-driven.

Algorithmic Trading with Zipline

Jul 26 - 3:05 p.m.
Thomas Wiecki
Python is quickly becoming the glue language which holds together data science and related fields like quantitative finance. Zipline is a BSD-licensed quantitative trading system which allows easy backtesting of investment algorithms on historical data. The system is fundamentally event-driven and a close approximation of how live-trading systems operate. Moreover, Zipline comes "batteries included" as many common statistics like moving average and linear regression can be readily accessed from within a user-written algorithm. Input of historical data and output of performance statistics is based on Pandas DataFrames to integrate nicely into the existing Python eco-system. Furthermore, statistic and machine learning libraries like matplotlib, scipy, statsmodels, and sklearn integrate nicely to support development, analysis and visualization of state-of-the-art trading systems. Zipline is currently used in production as the backtesting engine powering Quantopian.com -- a free, community-centered platform that allows development and real-time backtesting of trading algorithms in the web browser.
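
As a rough sketch of what a Zipline algorithm looks like, here is a hypothetical buy-and-hold example; the entry points initialize/handle_data and the zipline.api helpers are standard, but exact data-access calls vary between Zipline versions.

    # Minimal buy-and-hold sketch using zipline.api (assumed details;
    # data access differs between Zipline versions).
    from zipline.api import order, record, symbol

    def initialize(context):
        # Called once at the start of the backtest.
        context.asset = symbol('AAPL')
        context.invested = False

    def handle_data(context, data):
        # Called for every bar of historical data.
        if not context.invested:
            order(context.asset, 10)   # buy 10 shares once
            context.invested = True
        record(price=data.current(context.asset, 'price'))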

Blaze

Jul 26 - 4:45 p.m.

Building the PyData Community

Jul 27 - 12:30 p.m.
Travis Oliphant
Coming Soon

CUDA 6 Tutorial

Kashif Rasul
Learn how to program and utilize the parallel computing power of the Graphics Processing Unit (GPU) using NVIDIA's CUDA programming framework. We will pay particular attention to the new CUDA 6 features: Unified Memory, which simplifies memory management by automatically migrating data between the CPU and GPU; the new cuBLAS-XT library for multi-GPU BLAS; and NVBLAS as a drop-in replacement for BLAS libraries. You will get insight into the development of CUDA and how it will take advantage of current and future GPUs in Python libraries for data analysis.

Color Analysis Through k-Means Clustering

John Mangual
Our eyes naturally and instantly extract information from complex images with a minimum of thought. In this tutorial, we will get our feet wet with analysis of images using scikit-image. Images are a great example of data sets, since they give us thousands of pixels to work with at a time. In this tutorial, geared towards beginners, we will see how scikit-image stores images and how to transform them in RGB space. Afterwards, we use some matrix math and k-means clustering to extract the significant colors in the image. In fact, we will write the k-means ourselves in just a few lines with NumPy. Image examples will come from my own photographs of graffiti in San Juan, Puerto Rico. However, you may use your own images or download some from Instagram. The result of this tutorial should be some rather pleasant color schemes which can be incorporated into art or your web site.
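
As a taste of what the tutorial describes, a minimal k-means over pixel colors can be written in a few lines of NumPy (a sketch; the tutorial's own implementation and image files will differ).

    import numpy as np

    def kmeans_colors(pixels, k=5, n_iter=20, seed=0):
        """Cluster an (N, 3) array of RGB pixels into k dominant colors."""
        rng = np.random.RandomState(seed)
        centers = pixels[rng.choice(len(pixels), k, replace=False)].astype(float)
        for _ in range(n_iter):
            # Assign each pixel to its nearest center.
            dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each center to the mean of its assigned pixels.
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = pixels[labels == j].mean(axis=0)
        return centers, labels

    # Usage with scikit-image (the file name is hypothetical):
    # from skimage import io
    # img = io.imread('graffiti.jpg')
    # centers, labels = kmeans_colors(img.reshape(-1, 3).astype(float))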

Commodity Machine Learning

Jul 27 - 9 a.m.
Andreas Mueller
Coming Soon

Conda: a cross-platform package manager for any binary distribution

Jul 27 - 1:20 p.m.
Ilan Schnell

Conda is an open source package manager, which can be used to manage binary packages and virtual environments on any platform. It is the package manager of the Anaconda Python distribution, although it can be used independently of Anaconda. We will look at how conda solves many of the problems that have plagued Python packaging in the past, followed by a demonstration of its features.

We will look at the issues that have plagued packaging in the Python ecosystem in the past, and discuss how Conda solves these problems. We will show how to use conda to manage multiple environments. Finally, we will look at how to build your own conda packages.

Data Oriented Programming

Jul 26 - 1:20 p.m.
Francesc Alted
Computers have traditionally been thought of as tools for performing computations with numbers. Of course, their English name has a lot to do with this conception, but in other languages, like the French 'ordinateur' (which expresses concepts closer to sorting or classifying), one can clearly see the other side of the coin: computers can also be used to extract (usually new) information from data. Storage, reduction, classification, selection, sorting and grouping, among others, are typical operations in this 'alternate' use of computers, and although carrying out all these tasks does imply doing a lot of computations, it also requires thinking about the computer as a different entity than the one offered by the traditional von Neumann architecture (basically a CPU with memory). In fact, when it comes to programming data handling efficiently, the most interesting part of a computer is the so-called hierarchical storage, where the different levels of caches in CPUs, the RAM, the SSD layers (there are several on the market already), the mechanical disks and, finally, the network are much more important than the ALUs (arithmetic and logical units) in CPUs. In data handling, techniques like data deduplication and compression become critical when dealing with extremely large datasets. Moreover, distributed environments are useful mainly because of their increased storage capacity and I/O bandwidth, rather than for their aggregated computing throughput. During my talk I will describe several programming paradigms that should be taken into account when programming data-oriented applications and that are usually different from those required for achieving pure computational throughput. Above all, and in a surprising turnaround, I will show how the amazing amount of computational power in modern CPUs can be useful for data handling as well.

Data Science for Activists: An Introduction to Pandas

Just because data is open doesn't mean it is accessible to the wider public. This tutorial will take government data, clean it, and visualize it. If time permits, participants will also have a chance to put their visualization into a Flask application. Only basic knowledge of Python is required for this tutorial. We will go over IPython and use Pandas to process and clean our data. If you are a beginner Pythonista interested in making sense of open data, here is your chance to use your programming skills to get involved in the field of civic hacking.

Dealing with Complexity

Jul 26 - 9 a.m.
Jean-Paul Schmetz
Coming Soon

Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspective

Jul 26 - 11 a.m.
Trent McConaghy
People talk about a Moore's Law for gene sequencing, a Moore's Law for software, etc. This talk is about *the* Moore's Law, the bull that the other "Laws" ride, and how Python-powered ML helps drive it. How do we keep making ever-smaller devices? How do we harness atomic-scale physics? Large-scale machine learning is key. The computation drives new chip designs, and those new chip designs are used for new computations, ad infinitum. High-dimensional regression, classification, active learning, optimization, ranking, clustering, density estimation, scientific visualization, massively parallel processing -- it all comes into play, and Python is powering it all.

Exploratory Time Series Analysis of NYC Subway Data

Jul 25 - 3:55 p.m.
Felix Marczinowski, Philipp Mack, Sönke Niekamp
What questions arise during a quick model assessment? In this hands-on tutorial we want to cover the whole chain, from preparing data to choosing and fitting a model to properly assessing the quality of a predictive model. Our dataset in this tutorial is the number of people entering and exiting New York subway stations. Among other ways of building a predictive model, we introduce the Python package pydse ( http://pydse.readthedocs.org/ ) and apply it to the dataset in order to derive the parameters of an ARMA model (autoregressive moving average). At the end of the tutorial we evaluate the models and examine the strengths and weaknesses of various ways to measure the accuracy and quality of a predictive model.
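
The tutorial itself uses pydse for the ARMA fit; purely as a rough stand-in, the same kind of fit can be sketched with the more widely known statsmodels package (file and column names below are hypothetical, and an ARMA(2,1) is expressed as ARIMA(2,0,1)).

    # Sketch of an ARMA-style fit with statsmodels, shown instead of pydse;
    # 'entries.csv' and its columns are made up for illustration.
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    counts = pd.read_csv('entries.csv', parse_dates=['timestamp'],
                         index_col='timestamp')['entries']

    result = ARIMA(counts, order=(2, 0, 1)).fit()   # AR(2), MA(1)
    print(result.params)
    # In-sample plus a short out-of-sample forecast:
    print(result.predict(start=len(counts) - 24, end=len(counts) + 24))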

Exploring Patent Data with Python

Jul 27 - 1:20 p.m.
Franta Polach
Experiences from building a recommendation engine for patent search using pythonic NLP and topic modeling tools such as Gensim.

Extract Transform Load using mETL

Jul 25 - 5:25 p.m.
Bence Faludi
mETL is an ETL package written in Python which was developed to load elective data for Central European University. The program can also be used in a more general way: it can load practically any kind of data to any target. The code is open source and available for anyone who wants to use it. The main advantage is that it is configurable via YAML files; you can write any transformation in Python and use the tool natively from any framework as well. We are using mETL in production for many of our clients and it is really stable and reliable. The project has a few contributors all around the world right now, and I hope many more developers will join soon. I really want to show you how you can use it in your daily work. In this tutorial we will see the most common situations:
  • Installation
  • Writing simple YAML configuration files to load CSV, JSON or XML into a MySQL or PostgreSQL database, or to convert CSV to JSON, etc.
  • Adding transformations to your fields
  • Filtering records based on conditions
  • Walking through a directory to feed the tool
  • How the mapping works
  • Generating YAML configurations automatically from a data source
  • Migrating one database to another

Fast Serialization of Numpy Arrays with Bloscpack

Jul 27 - 11 a.m.
Valentin Haenel
Bloscpack [1] is a reference implementation and file-format for fast serialization of numerical data. It features lightweight, chunked and compressed storage, based on the extremely fast Blosc [2] metacodec and supports serialization of Numpy arrays out-of-the-box. Recently, Blosc -- being the metacodec that it is -- has received support for using the popular and widely used Snappy [3], LZ4 [4], and ZLib [5] codecs, and so, now Bloscpack supports serializing Numpy arrays easily with those codecs! In this talk I will present recent benchmarks of Bloscpack performance on a variety of artificial and real-world datasets with a special focus on the newly available codecs. In these benchmarks I will compare Bloscpack, both performance and usability wise, to alternatives such as Numpy's native offerings (NPZ and NPY), HDF5/PyTables [6], and if time permits, to novel bleeding edge solutions. Lastly I will argue that compressed and chunked storage format such as Bloscpack can be and somewhat already is a useful substrate on which to build more powerful applications such as online analytical processing engines and distributed computing frameworks. [1]: https://github.com/Blosc/bloscpack [2]: https://github.com/Blosc/c-blosc/ [3]: http://code.google.com/p/snappy/ [4]: http://code.google.com/p/lz4/ [5]: http://www.zlib.net/ [6]: http://www.pytables.org/moin
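
To give an idea of the interface, serializing an array is roughly a two-liner; the function names below are as I recall them from the Bloscpack README and should be treated as an assumption, since they may differ between versions.

    # Serializing a NumPy array with Bloscpack (assumed function names).
    import numpy as np
    import bloscpack as bp

    a = np.linspace(0, 100, int(2e7))      # ~160 MB of float64
    bp.pack_ndarray_file(a, 'a.blp')       # chunked, compressed on-disk format
    b = bp.unpack_ndarray_file('a.blp')
    assert (a == b).all()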

Faster than Google? Optimization lessons in Python.

Jul 27 - 11 a.m.
Radim Řehůřek
Lessons from translating Google's deep learning algorithm into Python. Can a Python port compete with Google's tightly optimized C code? Spoiler: making use of Python and its vibrant ecosystem (generators, NumPy, Cython...), the optimized Python port is cleaner, more readable and clocks in—somewhat astonishingly—4x faster than Google's C. This is 12,000x faster than a naive, pure Python implementation and 100x faster than an optimized NumPy implementation. The talk will go over what went well (data streaming to process humongous datasets, parallelization and avoiding GIL with Cython, plugging into BLAS) as well as trouble along the way (BLAS idiosyncrasies, Cython issues, dead ends). The quest is also documented on my blog.

Generators Will Free Your Mind

Jul 26 - 10:10 a.m.
James Powell
What are generators and coroutines in Python? What additional conceptualisations do they offer, and how can we use them to better model problems? This is a talk I've given at PyCon Canada, PyData Boston, and PyTexas. It's an intermediate-level talk around the core concept of generators with a lot of examples of not only neat things you can do with generators but also new ways to model and conceptualise problems.
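
For readers new to the topic, a small and purely illustrative example of the kind of modeling the talk refers to: a generator pipeline that processes a stream lazily, in constant memory.

    # A tiny generator pipeline: each stage pulls lazily from the previous one.
    def read_numbers(lines):
        for line in lines:
            yield float(line)

    def running_mean(values):
        total, count = 0.0, 0
        for v in values:
            total += v
            count += 1
            yield total / count

    data = ["1.0", "2.0", "4.0", "8.0"]
    for mean in running_mean(read_numbers(data)):
        print(mean)   # 1.0, 1.5, 2.333..., 3.75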

Get To Know Your Data

How to Spy with Python

Lynn Root
This talk will walk through what the US government has done in terms of spying on US citizens and foreigners with their PRISM program, then walk through how to do exactly that with Python.

IPython and Sympy to Develop a Kalman Filter for Multisensor Data Fusion

Jul 27 - 3:05 p.m.
Paul Balzer
The best filter algorithm to fuse information from multiple sensors is the Kalman filter. To implement it for non-linear dynamic models (e.g. a car), analytic calculations for the matrices are necessary. In this talk you will see how the IPython Notebook and SymPy help to develop an optimal filter that fuses sensor information from different sources (e.g. acceleration, speed and GPS position) to get an optimal estimate. More: http://balzer82.github.io/Kalman/
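
A small sketch of the SymPy side of that workflow: deriving the Jacobian of a non-linear motion model symbolically, as needed for the linearization step of an extended Kalman filter. The state vector and model here are illustrative, not the talk's exact example.

    # Symbolically derive the Jacobian of a simple non-linear motion model
    # (constant turn rate and velocity); the state vector is illustrative.
    import sympy as sp

    x, y, v, psi, dpsi, dt = sp.symbols('x y v psi dpsi dt')
    state = sp.Matrix([x, y, v, psi])

    # Non-linear state transition g(state)
    g = sp.Matrix([
        x + v / dpsi * (sp.sin(psi + dpsi * dt) - sp.sin(psi)),
        y + v / dpsi * (-sp.cos(psi + dpsi * dt) + sp.cos(psi)),
        v,
        psi + dpsi * dt,
    ])

    J = g.jacobian(state)   # matrix used in the EKF linearization
    sp.pprint(J)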

Improving scientific visualizations with Blender

Andreas Klostermann
Sometimes, static plots of data aren't enough. Sometimes you need animations of data for presentations or videos. There are several methods to create movie clips from single-frame images, but there is plenty more you can do with the right tools. Blender is an open source software package which can serve as a swiss army knife of computer graphics. Using examples from finance, biology and physics I will demonstrate how you can use Blender to improve your visualizations and presentations. With Blender you can turn rendered graphs into video presentations of your data. We can move around 2D animations, we can use motion blur and variable speed, pan, zoom, and much more. On top of that, Blender has a fully fledged 3D animation engine and lets you use any 3D model or 2D image to improve the look of the video or to make your point more convincingly. Finally, Blender can edit video sequences, achieve special effects with the node-based compositor, and render the final result for upload to your favourite streaming platform. And to top off this awesomeness: it can be scripted with Python!

Interactive Analysis of (Large) Financial Data Sets

Jul 26 - 12:30 p.m.
Yves Hilpisch

Interactive Plots Using Bokeh

Jul 25 - 12:45 p.m.
Bryan Van De Ven
Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients. This tutorial will walk users through the steps to create different kinds of interactive plots using Bokeh. We will cover using Bokeh for static HTML output, the IPython notebook, and plot hosting and embedding using the Bokeh server.
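
A minimal static-HTML example of the kind the tutorial covers, using the bokeh.plotting interface; the data here is made up.

    # Minimal Bokeh example: scatter plus line, written to a static HTML file.
    from bokeh.plotting import figure, output_file, show

    x = [1, 2, 3, 4, 5]
    y = [6, 7, 2, 4, 5]

    output_file('lines.html')
    p = figure(title='simple example', x_axis_label='x', y_axis_label='y')
    p.circle(x, y, size=10)
    p.line(x, y, line_width=2)
    show(p)   # opens the HTML file in a browser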

Intro to ConvNets

Jul 27 - 3:05 p.m.
Kashif Rasul
We will give an introduction to the recent development of Deep Neural Networks and focus in particular on Convolutional Networks, which are well suited to image classification problems. We will also provide you with the practical knowledge of how to get started with using ConvNets via the cuda-convnet Python library.

Introduction to Natural Language Processing with Python tools

Shankar
The talk aims to provide a practical introduction to natural language processing (NLP) for a working programmer. The talk assumes no prior exposure to NLP, and will cover sufficient detail so that you know what modern NLP has to offer and where to look for more details when you need them. We will cover the following topics:
  • Language modelling and vector space representation
  • Text classification and named entity recognition
  • Part-of-speech tagging
  • Question answering and text summarization
The overarching goal of the talk is that you walk away with a mental framework that allows you to systematically think about problems in natural language processing. Note: I gave a similar introductory talk on machine learning at EuroPython 2013, and I got excellent feedback on it. Some people told me that it was the best talk they heard at the conference: http://www.youtube.com/watch?v=n-_o5Vd9ceM
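
As a first taste of one of the listed topics, part-of-speech tagging with NLTK fits in a few lines (a sketch; the talk may use different tools for each topic).

    # Part-of-speech tagging with NLTK (the tagger and tokenizer data
    # packages must be downloaded once).
    import nltk

    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("PyData Berlin brings Pythonistas together.")
    print(nltk.pos_tag(tokens))
    # e.g. [('PyData', 'NNP'), ('Berlin', 'NNP'), ('brings', 'VBZ'), ...]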

Introduction to the Signal Processing and Classification Environment pySPACE

Jul 27 - 10:10 a.m.
Mario Michael Krell

This talk will give a basic introduction to the pySPACE framework and its current applications.

pySPACE (Signal Processing And Classification Environment) is a modular software for the processing of large data streams that has been specifically designed to enable distributed execution and empirical evaluation of signal processing chains. Various signal processing algorithms (so called nodes) are available within the software, from finite impulse response filters over data-dependent spatial filters (e.g., PCA, CSP, xDAWN) to established classifiers (e.g., SVM, LDA). pySPACE incorporates the concept of node and node chains of the Modular Toolkit for Data Processing (MDP) framework. Due to its modular architecture, the software can easily be extended with new processing nodes and more general operations. Large scale empirical investigations can be configured using simple text-configuration files in the YAML format, executed on different (distributed) computing modalities, and evaluated using an interactive graphical user interface.

pySPACE allows the user to connect nodes modularly and automatically benchmark the respective chains for different parameter settings and compare these with other node chains, e.g., by automatic evaluation of classification performances provided within the software. In addition, the pySPACElive mode of execution can be used for online processing of streamed data. The software specifically supports but is not limited to EEG data. Any kind of time series or feature vector data can be processed and analyzed.

pySPACE additionally provides interfaces to specialized signal processing libraries such as SciPy, scikit-learn, LIBSVM, the WEKA Machine Learning Framework, and the Maja Machine Learning Framework (MMLF).

Web page: http://pyspace.github.io/pyspace/

Low-rank matrix approximations in Python

Jul 26 - 2:10 p.m.
Christian Thurau
Low-rank approximations of data matrices have become an important tool in machine learning and data mining. They allow for embedding high dimensional data in lower dimensional spaces and can therefore mitigate effects due to noise, uncover latent relations, or facilitate further processing. These properties have been proven successful in many application areas such as bio-informatics, computer vision, text processing, recommender systems, social network analysis, among others. Present day technologies are characterized by exponentially growing amounts of data. Recent advances in sensor technology, internet applications, and communication networks call for methods that scale to very large and/or growing data matrices. In this talk, we will describe how to efficiently analyze data by means of matrix factorization using the Python Matrix Factorization Toolbox (PyMF) and HDF5. We will briefly cover common methods such as k-means clustering, PCA, or Archetypal Analysis which can be easily cast as a matrix decomposition, and explain their usefulness for everyday data analysis tasks.
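
The talk centers on PyMF; purely to illustrate the idea of a low-rank factorization, here is the same pattern with the more familiar scikit-learn API (explicitly not PyMF, whose interface is not shown in the abstract).

    # Low-rank approximation of a data matrix via NMF (scikit-learn shown
    # for illustration; the talk itself uses the PyMF toolbox with HDF5).
    import numpy as np
    from sklearn.decomposition import NMF

    X = np.abs(np.random.randn(100, 50))     # 100 samples, 50 non-negative features
    model = NMF(n_components=5, random_state=0)
    W = model.fit_transform(X)               # (100, 5) low-dimensional embedding
    H = model.components_                    # (5, 50) basis vectors
    print(np.linalg.norm(X - W @ H))         # reconstruction error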

Make sense of your (big) data using Elasticsearch

Jul 27 - 2:10 p.m.
Honza Král
In this talk I would like to show you a few real-life use-cases where Elasticsearch can help you make sense of your data. We will start with the most basic use case of searching your unstructured data and move on to more advanced topics such as faceting, aggregations and structured search. I would like to demonstrate that the very same tool and dataset can be used for real-time analytics as well as the basis for your more advanced data processing jobs. All in a distributed environment capable of handling terabyte-sized datasets. All examples will be shown with real data and python code demoing the new libraries we have been working on to make this process easier.
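
A small sketch with the official elasticsearch-py client; the index and field names are made up, and query syntax details vary with the Elasticsearch version.

    # Basic full-text search with elasticsearch-py (hypothetical index/fields).
    from elasticsearch import Elasticsearch

    es = Elasticsearch('http://localhost:9200')

    result = es.search(index='tweets', body={
        'query': {'match': {'text': 'pydata'}},
    })

    print(result['hits']['total'])
    for hit in result['hits']['hits']:
        print(hit['_source'])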

Mall Analytics Using Telco Data & Pandas

Jul 26 - 12:30 p.m.
Karolina Alexiou
This talk will be about my latest project in mall analytics, where we estimated visitor trends in malls around the globe using telco data as a basis, and employed map reduce technologies and data science to extrapolate from this basis to reality and correct for biases. We succeeded in extracting valuable information such as count of visitors per hour, demographics breakdown, competitor analysis and popularity of the mall among different parts of the surrounding areas, all the while preserving user privacy and working only with aggregated data. I will show an overview of our system's modules, how we got a first raw estimation of the visitors and their behaviours, and how we refined and evaluated this estimation using pandas, matplotlib, scikit-learn and other python libraries.

Map Reduce: 0-60 in 80 Minutes

Christopher Roach
In 2004, at the Sixth Symposium on Operating System Design and Implementation, Jeffrey Dean and Sanjay Ghemawat, a couple of engineers working for Google, published a paper titled “MapReduce: Simplified Data Processing on Large Clusters” that introduced the world to a simple, yet powerful heuristic for processing large amounts of data at previously unheard of scales. Though the concepts were not new---map and reduce had existed for quite some time in functional programming languages---the observation that they could be used as a general programming paradigm for solving large data processing problems changed the state of the art. If you find yourself working with data nowadays, you’re bound to find yourself at some point with a need to process “Big Data”. Big Data can be a troublesome phrase; arguably more hype than anything at this point, it has many different meanings, but for the purposes of this tutorial we’ll consider it to be any data that is too large to fit into the main memory of a single machine. With that in mind, if you’ve ever found yourself needing to process an amount of data that stretched the boundaries of your own personal laptop, and you wanted to apply the ideas expressed in Dean and Ghemawat's seminal paper but had no idea what to do, or even where to start, then this tutorial is for you. The goal of the tutorial is to give attendees a basic working knowledge of what MapReduce is and how it can be used to process massive sets of data relatively quickly. We will walk through the basics of what MapReduce is and how it works. Though there are a handful of MapReduce implementations to choose from, Hadoop is without a doubt the most well known and, as such, we will take a look at how to use it to run our MapReduce jobs. With that in mind, we will discuss what you need to know to use Hadoop and take a look at how to write our own Hadoop jobs in Python using the Hadoop Streaming utility. Finally, we’ll look at a library created at Yelp called MRJob that can make writing Hadoop jobs in Python much easier. By the end of the tutorial an attendee with little to no knowledge of MapReduce, but a working knowledge of Python, should be able to write their own basic MapReduce tasks for Hadoop and run them on a cluster of machines using Amazon’s Elastic MapReduce service.
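
To give a flavor of the MRJob part, the canonical word-count job looks roughly like this (a sketch, not the tutorial's exact material).

    # wordcount.py -- canonical MRJob example; run locally with
    #   python wordcount.py input.txt
    # or on Amazon Elastic MapReduce with -r emr.
    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Emit (word, 1) for every word in the input line.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Sum the counts for each word.
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()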

Massively Parallel Processing with Procedural Python

Jul 27 - 3:55 p.m.
Ronert Obst
The Python data ecosystem has grown beyond the confines of single machines to embrace scalability. Here we describe one of our approaches to scaling, which is already being used in production systems. The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. Using PL/Python we can run parallel queries across terabytes of data using not only pure SQL but also familiar PyData packages such as scikit-learn and nltk. This approach can also be used with PL/R to make use of a wide variety of R packages. We look at examples on Postgres compatible systems such as the Greenplum Database and on Hadoop through Pivotal HAWQ. We will also introduce MADlib, Pivotal’s open source library for scalable in-database machine learning, which uses Python to glue SQL queries to low level C++ functions and is also usable through the PyMADlib package.

Networks meet Finance in Python

Jul 27 - 2:10 p.m.
Miguel Vaz

In the course of the 2008 Lehman collapse and the subsequent European debt crisis, it became clear that both industry and regulators had underestimated the degree of interconnectedness and interdependency across financial assets and institutions. This type of information is especially well represented by network models, which had first gained popularity in other areas, such as computer science, biology and social sciences.

Although in its early stages, the study of network models in finance is gaining momentum and could be key to building the next generation of risk management tools and averting future financial crises. After a short overview of some of the most relevant work in the field, I will walk through (real data) examples using the pydata toolset.

Omnia.md: Engineering a Full Python Stack for Biophysical Computation

Packaging and Deployment

Jul 25 - 5:25 p.m.
Travis Oliphant
Coming soon.

Pandas' Thumb: unexpected evolutionary use of a Python library.

Jul 27 - 3:55 p.m.
Chris Nyland
Lawyers are not famed for their mathematical ability. On the contrary - the law almost self-selects as a career choice for the numerically challenged. So when the one UK tax that property lawyers generally felt comfortable dealing with (lease duty) was replaced with a new tax (stamp duty land tax) that was both arithmetically demanding and conceptually complex, it was inevitable that significant frustrations would arise. Suddenly, lawyers had to deal with concepts such as net present valuations, aggregation of several streams of fluctuating figures, and constant integration of a complex suite of credits and disregards. This talk describes how - against a backdrop of data-drunk tax authorities, legal pressure on businesses to have appropriate compliance systems in place, and the constant pressure on their law firms to commoditise compliance services - Pandas may be about to make a foray from its venerable financial origins into a brave new fiscal world, and can revolutionise an industry by doing so. A case study covering the author's development of a Pandas-based stamp duty land tax engine ("ORVILLE") is discussed, and the inherent usefulness of Pandas in the world of tax analysis is explored.

Parallel processing using python and gearman

Jul 26 - 1:20 p.m.
Pedro Miguel Dias Cardoso
When talking of parallel processing, some tasks require a substantial set-up time. This is the case for Natural Language Processing (NLP) tasks such as classification, where models need to be loaded into memory. In these situations we cannot start a new process for every data set to be handled; instead, the system needs to be ready to process new incoming data. This talk will look at job queue systems, with particular focus on gearman. We will see how we are using it at Synthesio for NLP tasks: how to set up workers and clients, make it redundant and robust, monitor its activity and adapt to demand.
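
A rough sketch of the worker side of this pattern, keeping an expensive model in memory between jobs; the python-gearman call signatures shown are assumptions and may differ between client versions, and load_big_nlp_model is a hypothetical helper.

    # Sketch of a gearman worker that keeps a heavy NLP model loaded
    # (python-gearman API assumed; signatures may vary).
    import gearman

    classifier = load_big_nlp_model()   # hypothetical, loaded once per worker

    def classify(gearman_worker, gearman_job):
        # gearman_job.data holds the raw payload sent by the client.
        return classifier.predict(gearman_job.data)

    worker = gearman.GearmanWorker(['localhost:4730'])
    worker.register_task('classify', classify)
    worker.work()   # block and process incoming jobs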

Python and Big Data Frameworks

Frank Kaufer
The Python community has developed a powerful and convenient data analysis software stack. But although the PyData tools and libraries are becoming more and more popular, they are often still associated with small to medium-scale data analysis. On the other side, "Big Data" is sometimes treated as a trademark feature of Hadoop & Co. Fortunately, another strength of the Python community is its openness and pragmatic culture, which not only entails many excellent tools written in Python with rich interfaces to non-Python software, but more generally makes Python a popular glue language - not only, but particularly, for data analysis. In this spirit we show how to develop data-heavy software in Python while using cluster-computing technology around the Hadoop ecosystem in the backend. In particular, we focus on recent frameworks such as Spark or Stratosphere and interactive data analysis use cases.

Python and pandas as back end to real-time data driven applications

Jul 26 - 3:55 p.m.
Giovanni Lanzani
For data, and data science, to be the fuel of the 21st century, data driven applications should not be confined to dashboards and static analyses. Instead they should be the driver of the organizations that own or generate the data. Most of these applications are web-based and require real-time access to the data. However, many Big Data analyses and tools are inherently batch-driven and not well suited for real-time and performance-critical connections with applications. Trade-offs often become inevitable, especially when mixing multiple tools and data sources. In this talk we will describe our journey to build a data driven application at a large Dutch financial institution. We will dive into the issues we faced, why we chose Python and pandas, and what that meant for real-time data analysis (and agile development). Important points in the talk will be, among others, the handling of geographical data, the access to hundreds of millions of records, as well as the real-time analysis of millions of data points.

Quantified Self: Analyzing the Big Data of our Daily Life

Jul 26 - 10:10 a.m.
Andreas Schreiber
Applications for self-tracking that collect, analyze, or publish personal and medical data are getting more popular. This includes both a broad variety of medical and healthcare apps in the fields of telemedicine, remote care, treatment, or interaction with patients, and a huge and increasing number of self-tracking apps that aim to acquire data from people's daily lives. The Quantified Self movement goes far beyond collecting or generating medical data. It aims at gathering data on all kinds of activities, habits, or relations that could help to understand and improve one's behavior, health, or well-being. Both health apps and Quantified Self apps use either just the smartphone as a data source (e.g., questionnaires, manual data input, smartphone sensors) or external devices and sensors such as 'classical' medical devices (e.g., blood pressure meters) or wearable devices (e.g., wristbands or eye glasses). The data can be used to get insights into the medical condition or one's personal life and behavior. This talk will provide an overview of the various data sources and data formats that are relevant for self-tracking, as well as strategies and examples for analyzing that data with Python. The talk will cover:
  • Accessing local and distributed sources for the heterogeneous Quantified Self data. That includes local data files generated by smartphone apps and web applications as well as data stored on cloud resources via APIs (e.g., data that is stored by vendors of self tracking hardware or data of social media channels, weather data, traffic data etc.)
  • Homogenizing the data, especially covering typical problems of heterogeneous Quantified Self data, such as missing data or different and non-standard data formats.
  • Analyzing and visualizing the data. Depending on the questions one has, the data can be analyzed with statistical methods or correlations. For example, to get insight into one's personal physical activities, step data from activity trackers can be correlated with location data and weather information. The talk covers how to conduct this and other data analysis tasks with tools such as pandas and how to visualize the results.
The examples in this talk will be shown as interactive IPython sessions.
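
A small pandas sketch of the kind of correlation mentioned in the last bullet above; the file names and columns are hypothetical.

    # Correlate daily step counts with daily mean temperature
    # (file names and column names are made up).
    import pandas as pd

    steps = pd.read_csv('steps.csv', parse_dates=['date'], index_col='date')
    weather = pd.read_csv('weather.csv', parse_dates=['date'], index_col='date')

    daily = steps['steps'].resample('D').sum().to_frame()
    daily['temperature'] = weather['temperature'].resample('D').mean()

    print(daily.corr())          # correlation between steps and temperature
    daily.plot(subplots=True)    # quick visual comparison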

Semantic Python: Mastering Linked Data with Python

Jul 26 - 11 a.m.
Valerio Maggio

Tim Berners-Lee defined the Semantic Web as a web of data that can be processed directly and indirectly by machines.

More precisely, the Semantic Web can be defined as a set of standards and best practices for sharing data and the semantics of that data over the Web to be used by applications [DuCharme, 2013].

In particular, the Semantic Web is built on top of three main pillars: the RDF (i.e., Resource Description Framework) data model, the SPARQL query language, and the OWL standard for storing vocabularies and ontologies. These standards allow the huge amount of data on the Web to be available in a unique and unified standard format, contributing to the definition of the Web of Data (WoD) [1].

The WoD makes web data reachable and easily manageable by Semantic Web tools, and also provides the relationships among these data (thus practically setting up the “Web”). This collection of interrelated datasets on the Web can also be referred to as Linked Data [1].

Two typical examples of large Linked Datasets are Freebase and DBpedia, which essentially provide the so-called common sense knowledge in RDF format.

Python offers a very powerful and easy to use library to work with Linked Data: rdflib.

RDFLib is a lightweight and functionally complete RDF library, allowing applications to access, create and manage RDF graphs in a very Pythonic fashion.

In this talk, a general overview of the main features provided by the rdflib package will be presented. To this end, several code examples will be discussed, along with a case study concerning the analysis of a (semantic) social graph. This case study will be focused on the integration between the networkx module and the rdflib library in order to crawl, access (via SPARQL), and analyze a Social Linked Data Graph represented using the FOAF (Friend of a Friend) schema.
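
A minimal rdflib example along the lines of that case study: loading a FOAF document into a graph and querying it with SPARQL. The document URL is a placeholder.

    # Load a FOAF document into an rdflib Graph and query it with SPARQL
    # (the document URL is a placeholder).
    from rdflib import Graph

    g = Graph()
    g.parse('http://example.org/people/alice.rdf')   # RDF/XML FOAF profile

    query = """
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?name ?friend
        WHERE {
            ?person foaf:name ?name .
            ?person foaf:knows ?friend .
        }
    """
    for name, friend in g.query(query):
        print(name, friend)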

This talk is intended for a novice-level audience, assuming a good knowledge of the Python language.

Speed Without Drag

Jul 26 - 3:55 p.m.
Saul Diez-Guerra
Speed without drag: making code faster when there's no time to waste. A practical walkthrough of the state of the art in low-friction solutions for speeding up numerical Python, covering: exhausting CPython, NumPy, Numba, Parakeet, Cython, Theano, Pyston, PyPy/NumPyPy and Blaze.

Street Fighting Trend Research

Jul 26 - 2:10 p.m.
Benedikt Koehler
This talk presents a very hands-on approach to identifying research and technology trends in various industries, with a little bit of Pandas here, NLTK there, and all cooked up in an IPython Notebook. Three examples featured in this talk are:
  • How to find out the most interesting research topics cutting edge companies are after right now?
  • How to pick sessions from a large conference program (think PyCon, PyData or Strata) that are presenting something really novel?
  • How to automagically identify trends in industries such as computer vision or telecommunications?
The talk will show how to tackle common tasks in applied trend research and technology foresight from identifying a data-source, getting the data and data cleaning to presenting the insights in meaningful visualizations.
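
As a rough sketch of the simplest step in such a pipeline, here is a term-frequency count over a collection of abstracts with NLTK and pandas; the 'abstracts' list is hypothetical input data.

    # Most frequent content words across a list of abstracts (a sketch).
    import nltk
    import pandas as pd

    nltk.download('punkt')
    nltk.download('stopwords')
    stop = set(nltk.corpus.stopwords.words('english'))

    abstracts = ["Deep learning for computer vision ...",
                 "Scaling pandas to large telco datasets ..."]

    words = [w.lower()
             for text in abstracts
             for w in nltk.word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]

    print(pd.Series(words).value_counts().head(10))   # candidate trend terms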

Using Cloud Foundry for Data Driven Python apps

This tutorial will introduce how to use the popular data science library Pandas on PySpark, to enable solving big data tasks with Pandas. Spark is a top-level Apache project for lightning-fast large-scale data processing. The basic unit of data in Spark is an RDD (resilient distributed dataset), which has a simple functional API. By having Pandas data frames stored inside a Spark RDD and using its basic API we can perform many parallel operations, but the resulting syntax, rdd.map(lambda x: x.map(lambda y: z)), leaves something to be desired. We will extend the basic Spark RDD to be aware of the underlying Pandas data frames. Using this extended RDD we will examine how to implement simple operations on Pandas data frames. From there we will look at how to implement some of the Pandas 2-d operations over an RDD of 1-d Pandas data frames and incorporate this in our extended RDD. We will cover how to load data effectively into Pandas using both SparkSQL and Spark's file load mechanism combined with Pandas' native parsing. This gives us the ability to load data from CSV files, Parquet files, and Apache Hive.
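
A rough sketch of the basic pattern the tutorial starts from, an RDD whose elements are Pandas DataFrames; this is plain PySpark plus Pandas, not the extended RDD the tutorial builds.

    # An RDD of Pandas DataFrames: per-chunk work is expressed with plain
    # Pandas, then combined on the driver (illustrative data).
    import pandas as pd
    from pyspark import SparkContext

    sc = SparkContext('local[*]', 'pandas-on-spark-sketch')

    chunks = [pd.DataFrame({'x': range(i, i + 5)}) for i in range(0, 20, 5)]
    rdd = sc.parallelize(chunks)

    # A 1-d operation applied chunk-wise, then aggregated:
    partial_sums = rdd.map(lambda df: df['x'].sum()).collect()
    print(sum(partial_sums))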

Visualising Data through Pandas

Jul 25 - 3:55 p.m.
Vincent Warmerdam
Python has been taking over from R and SPSS in the last couple of years. Python has many tools to offer, like NumPy and scikit-learn, that make any data scientist happy, but pandas has been the main reason for many analysts to switch. It is fast, flexible, simple to learn, well documented, has a substantial community and has been accepted in many businesses as an everyday tool. One reason why people still prefer R is the ggplot module. It offers a non-verbose yet flexible way to visualise information. In this session we will show you that a proper workflow with the IPython notebook and pandas still allows the use of the ggplot library. With this workflow you have the best of both worlds with no compromises. At the end of this session you will have been shown how to load, aggregate and analyse data with pandas and how to then visualise it with ggplot.
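
A small sketch of that workflow: aggregate with pandas, plot with the ggplot grammar. The file and column names are made up, and the ggplot call signature shown (aesthetics first, data second, as in the yhat ggplot port) is an assumption that may differ between versions.

    # Pandas for aggregation, ggplot for the plot (assumed ggplot signature).
    import pandas as pd
    from ggplot import ggplot, aes, geom_line

    df = pd.read_csv('sales.csv', parse_dates=['date'])
    monthly = (df.groupby(pd.Grouper(key='date', freq='M'))['revenue']
                 .sum().reset_index())

    print(ggplot(aes(x='date', y='revenue'), data=monthly) + geom_line())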

scikit-learn

Jul 25 - 12:45 p.m.
Andreas Mueller
This will be an interactive tutorial using IPython notebook, where explanatory slides are interleaved with time for data exploration and trying out. We will start from basics of data loading and preparation with pandas. Then we will discuss basics of machine learning, followed by algorithms for visualization and supervised learning. In the end, we will talk about model selection and the importance of the bias variance tradeoff. Depending on the audience we will finish with an end-to-end process for a text classification task. The goal of the tutorial is that the participants have a good idea how to attack a machine learning task, and what kind of tools are offered by scikit-learn to help.
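
A minimal example of the workflow the tutorial builds up to (load, split, fit, evaluate), shown here as a sketch rather than the tutorial's own notebook material.

    # The basic scikit-learn workflow: load a toy dataset, split it,
    # fit a classifier, and evaluate it on held-out data.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=200)
    clf.fit(X_train, y_train)
    print('test accuracy:', clf.score(X_test, y_test))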
