As a penniless academic I wanted to do "big data" for science. Open source, Python, and simple patterns were the way forward. Staying on top of today's growing datasets is an arms race. Data analytics machinery (clusters, NoSQL, visualization, Hadoop, machine learning, ...) can spread a team's resources thin. Focusing on simple patterns, lightweight technologies, and a good understanding of the applications gets us most of the way for a fraction of the cost.
I will present a personal perspective on ten years of scientific data processing with Python. What are the emerging patterns in data processing? How can modern data-mining ideas be used without a big engineering team? What constraints and design trade-offs govern software projects like scikit-learn, Mayavi, or joblib? How can we make the most out of distributed hardware with simple framework-less code?
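A minimal sketch of the kind of lightweight pattern the talk refers to, assuming joblib: on-disk caching plus multi-core parallelism with no framework (the function and cache path are purely illustrative, not code from the talk):

# Lightweight caching and parallelism with joblib -- illustrative only.
from joblib import Memory, Parallel, delayed
import numpy as np

memory = Memory("./cache", verbose=0)      # memoize costly steps on disk

@memory.cache
def expensive_feature(seed):
    rng = np.random.RandomState(seed)
    data = rng.rand(100000)
    return data.mean(), data.std()

if __name__ == "__main__":
    # Fan the work out over all local cores; cached results survive reruns.
    results = Parallel(n_jobs=-1)(delayed(expensive_feature)(s) for s in range(8))
    print(results[:2])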
Derivatives analytics is one of the most compute- and data-intensive areas in the financial industry. This stems mainly from the fact that Monte Carlo simulation techniques generally have to be applied to value and risk-manage single derivatives trades as well as whole books of derivatives.
DX Analytics is a derivatives analytics library that is built entirely in Python and has a rather Pythonic API. It allows the modeling and valuation of both single- and multi-risk-factor derivatives with European and American exercise. It also allows the consistent valuation of complex portfolios of such derivatives, e.g. incorporating the correlation between single risk factors.
The talk provides some theoretical and technical background, discusses the basic architectural features of DX Analytics and illustrates its use with a number of simple and more complex examples.
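To make the Monte Carlo idea concrete, here is a plain-NumPy sketch (not DX Analytics' own API) that values a European call under geometric Brownian motion; all parameter values are illustrative:

# Plain-NumPy Monte Carlo valuation of a European call -- not DX Analytics code.
import numpy as np

S0, K, T, r, sigma = 100.0, 105.0, 1.0, 0.05, 0.2   # spot, strike, maturity, rate, vol
n_paths = 100000

rng = np.random.RandomState(42)
z = rng.standard_normal(n_paths)
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)  # terminal prices
payoff = np.maximum(ST - K, 0.0)                     # call payoff at maturity
price = np.exp(-r * T) * payoff.mean()               # discounted MC estimator
print("Monte Carlo estimate of the call value: %.3f" % price)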
"Much of what we want to do with data involves optimization: whether it's to find a model that best fits the data, or to decide on the optimal action given some information.
We'll explore the embarrassment of riches Python offers to tackle custom optimization problems: the scipy.optimize package, Sympy for calculus and code generation, Cython for speedups and binding to external libraries."
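As a taste of the scipy.optimize part, a small self-contained example: minimizing the Rosenbrock function with BFGS, supplying the gradient analytically (the function choice is illustrative, not taken from the talk):

# Minimize the Rosenbrock function with an analytic gradient via BFGS.
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def rosen_grad(x):
    grad = np.zeros_like(x)
    grad[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1] ** 2) - 2.0 * (1.0 - x[:-1])
    grad[1:] += 200.0 * (x[1:] - x[:-1] ** 2)
    return grad

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
result = minimize(rosen, x0, jac=rosen_grad, method="BFGS")
print(result.x)   # converges to the minimum at all ones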
NumPy and pandas are the cornerstones of data analysis in Python. They allow for efficient data access and manipulation. Yet they are not always appropriate for more heterogeneous data, when access patterns are hard to predict, or when you need to support write parallelism. This is an area where traditional database systems still shine compared to the usual data-science toolset.
The goal of this tutorial is to give you an idea of how databases can help you deal with data that is not just numerical, with minimal effort or knowledge. We will focus on PostgreSQL, an open source database that has powerful extensions for dealing with heterogeneous (aka 'schemaless') data, while being simple to use from Python.
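A hedged sketch of the kind of 'schemaless' usage the tutorial is about: storing heterogeneous records in a PostgreSQL JSON column from Python via psycopg2 (the table, connection string and fields are made up for illustration):

# Heterogeneous records in a PostgreSQL JSON column via psycopg2 -- illustrative.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=demo user=demo")    # adjust to your setup
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS measurements (
        id  serial PRIMARY KEY,
        doc json                  -- arbitrary, per-row structure
    )
""")

# Rows need not share the same keys -- the database does not care.
cur.execute("INSERT INTO measurements (doc) VALUES (%s)",
            [Json({"sensor": "A1", "temp": 21.5})])
cur.execute("INSERT INTO measurements (doc) VALUES (%s)",
            [Json({"sensor": "B7", "humidity": 0.4, "tags": ["lab", "test"]})])

# Query back, filtering on a field inside the JSON document.
cur.execute("SELECT doc FROM measurements WHERE doc ->> 'sensor' = %s", ("A1",))
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()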
This tutorial is free, but requires separate registration.
Today's financial market environment demands ever shorter times-to-insight when it comes to financial analytics tasks. For the analysis of financial time series, or for typical tasks related to derivatives analytics and trading, Python has developed into the ideal technology platform.
Not only does Python provide powerful and efficient libraries for data analytics, such as NumPy and pandas; with IPython there is also a tool and environment available that tremendously facilitates interactive, and even real-time, financial analytics.
The tutorial introduces IPython and shows, mainly on the basis of practical examples related to the VSTOXX volatility index, how Python and IPython might redefine interactive financial analytics.
Quants, traders, financial engineers, analysts, financial researchers, model validators and the like will all benefit from the tutorial and the new technologies provided by the Python ecosystem.
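A minimal pandas sketch in the spirit of the tutorial: load a time series of index levels and derive log returns and a rolling, annualised volatility (the file name 'vstoxx.csv' and the column name 'V2TX' are placeholders, not the actual Eurex data source):

# Rolling volatility of an index time series with pandas -- illustrative data source.
import numpy as np
import pandas as pd

data = pd.read_csv("vstoxx.csv", index_col=0, parse_dates=True)
data["returns"] = np.log(data["V2TX"] / data["V2TX"].shift(1))       # daily log returns
data["vola_30d"] = data["returns"].rolling(30).std() * np.sqrt(252)  # annualised volatility

print(data.tail())
data[["V2TX", "vola_30d"]].plot(subplots=True, figsize=(9, 5))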
BACKGROUND
For background information see the Python-based "VSTOXX Advanced Services" and the related backtesting applications:
http://www.eurexchange.com/vstoxx/
http://www.eurexchange.com/vstoxx/app1/
http://www.eurexchange.com/vstoxx/app2/
TECHNICAL REQUIREMENTS
To follow the tutorial, you should have installed the Anaconda Python distribution on your notebook. Download and follow the instructions here:
http://continuum.io/downloads
After installation, start IPython from the command line interface/shell as follows:
$ ipython notebook --pylab inline
IPython should then start in your default Web browser.
This talk describes Gradient Boosted Regression Trees (GBRT), a powerful statistical learning technique with applications in a variety of areas, ranging from web page ranking to environmental niche modeling. GBRT is a key ingredient of many winning solutions in data-mining competitions such as the Netflix Prize, the GE Flight Quest, and the Heritage Health Prize.
I will give a brief introduction to the GBRT model and regression trees -- focusing on intuition rather than mathematical formulas. The majority of the talk will be dedicated to an in-depth discussion of how to apply GBRT in practice using scikit-learn. We will cover important topics such as regularization, model tuning and model interpretation that should significantly improve your score on Kaggle.
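For a first impression of what this looks like in code, a compact scikit-learn example on a synthetic regression problem, with the usual regularization knobs exposed (dataset and parameter values are illustrative, not from the talk):

# Gradient boosted regression trees with scikit-learn on synthetic data.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=1200, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingRegressor(
    n_estimators=500,     # number of boosting stages
    learning_rate=0.05,   # shrinkage: smaller values need more stages
    max_depth=3,          # depth of the individual regression trees
    subsample=0.8,        # stochastic gradient boosting
    random_state=0,
)
gbrt.fit(X_train, y_train)

print("test MSE: %.3f" % mean_squared_error(y_test, gbrt.predict(X_test)))
print("most important features:", gbrt.feature_importances_.argsort()[::-1][:3])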
Would your decisions change if you knew that the audience of your website isn't 5M users, but rather 5'042'394'953? Unlikely, so why should we always calculate the exact solution at any cost? For this and many similar problems, an approximate solution takes only a fraction of the memory and runtime needed to compute the exact one.
This tutorial is a practical survey of useful probabilistic data structures and algorithmic tricks for obtaining approximate solutions, and of when we should, and should not, trade accuracy for scalability. In particular, we start with hashing and sampling; address the problems of comparing and filtering sets, and of counting the number of unique values and their occurrences; and touch on basic hashing tricks used in machine learning algorithms. Finally, we analyse some examples of their usage that show their full power: how to organise online analytics, or how to decode a DNA sequence by squeezing a large graph into a Bloom filter.
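To illustrate the trade-off, a toy Bloom filter in pure Python: constant memory and a small false-positive rate, but no false negatives (the size and number of hash functions are illustrative, not tuned):

# A toy Bloom filter -- for illustration only, not production use.
import hashlib

class BloomFilter(object):
    def __init__(self, size=10000, n_hashes=5):
        self.size = size
        self.n_hashes = n_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive several bit positions from one cryptographic hash, salted per round.
        for i in range(self.n_hashes):
            h = hashlib.sha1(("%d:%s" % (i, item)).encode("utf-8")).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for kmer in ("ACGT", "TTGA", "CCCG"):
    bf.add(kmer)
print("ACGT" in bf)   # True
print("GGGA" in bf)   # almost certainly False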
Diamond Light Source is the UK synchrotron, a national facility containing over 20 experimental stations or beamlines, many of which are capable of generating terabytes of raw data every day. In this data-rich environment, many scientists who come to the facility can be daunted by the sheer quantity and complexity of the data on offer. The scientific software group is charged with assisting with this deluge of data, and as a small team it is imperative that it provides sustainable and rapid solutions to problems. Python has proved to be well suited to this and is now used heavily at the facility, from cutting-edge research projects, through general pipelining and data management, to simple data manipulation scripts. It is used by a range of staff and facility users, from experienced software engineers and scientists to support staff and PhD students simply wanting something to help make sense of the data or the experimental set-up.
This presentation focuses on the current state of scientific data management and analysis within Diamond, and details the workhorses that are relied on, as well as what the future holds.
The pharmaceutical industry is worth £250 billion a year, and a third of the world's R&D in pharmaceuticals occurs in the UK. Python is well used in high-throughput screening and target validation, with a notable example at AstraZeneca displayed prominently on the python.org website; further along the drug development process, Python and its scientific stack offer a compelling and comprehensive toolkit for preclinical and clinical drug development.
This talk demonstrates how Python/SciPy was used to assess the cardiac liability of a drug as part of a routine preclinical screen, how Python was used to statistically analyse a Phase II clinical dataset, and how Python was used to organise and structure documentation about a new chemical entity according to regulated standards for submission to the European Medicines Agency. Lastly, the talk concludes with the current barriers to Python being used more routinely for pharmaceutical problems and how the community might address the use of Python in a heavily regulated environment.
Thanks to Python and R, data scientists and researchers have at hand highly powerful tools to program with data, simulate, and publish reproducible computational results. Educators have access to free and open environments to teach statistics and numerical subjects efficiently. Thanks to cloud computing, anyone can today work on advanced, high-capacity technological infrastructures without having to build them or comply with rigid and limiting access protocols. By combining the power of Python, R and public clouds such as Amazon EC2, it has become possible to build a new generation of collaboration-centric platforms for virtual data science and virtual education of considerable power and flexibility.
This tutorial aims to familiarise the attendees with what public clouds can do for e-Science and e-Learning, to present the challenges and opportunities raised by the use of Python and R on such infrastructures and to introduce Elastic-R, one of the first free Python/R-centric virtual data science platforms (www.elasticr.com).
Python threads cannot utilize the power of multiple CPUs. Other solutions such as multiprocessing or MPI wrappers are based on message passing, resulting in substantial overhead for certain types of tasks.
While pure Python does not support shared memory calculations, Cython combined with OpenMP can provide full access to this type of parallel data processing.
This talk gives a whirlwind tour of Cython and introduces Cython's OpenMP abilities focusing on parallel loops over NumPy arrays. Source code examples demonstrate how to use OpenMP from Python. Results for parallel algorithms with OpenMP show what speed-ups can be achieved for different data sizes compared to other parallelizing strategies.
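A minimal sketch of the pattern, written in Cython's 'pure Python' mode so that the same file runs (serially) under plain CPython and, once compiled by Cython with OpenMP enabled (e.g. -fopenmp), executes the loop across cores with the GIL released; names and sizes are illustrative, not code from the talk:

# cython: boundscheck=False, wraparound=False
# Parallel reduction over a NumPy array with cython.parallel.prange.
import numpy as np
import cython
from cython.parallel import prange

def parallel_square_sum(x: cython.double[:]) -> cython.double:
    total: cython.double = 0.0
    i: cython.Py_ssize_t
    n: cython.Py_ssize_t = x.shape[0]
    # prange maps onto an OpenMP 'parallel for'; 'total' becomes a reduction.
    for i in prange(n, nogil=True):
        total += x[i] * x[i]
    return total

if __name__ == "__main__":
    data = np.random.rand(10 ** 7)
    print(parallel_square_sum(data))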
Data and algorithms are artistic materials just as much as paint and canvas.
A talk covering my recent work with The Tate's CC dataset, David Cameron's deleted speeches and the role of the artist in the world of Big Data.