PyData 2013 | New York, NY

Presentation Abstracts

A practical introduction to IPython Notebook & pandas

Nov 08 - 4:45 p.m.

Julia Evans

I'll walk you through Python's best tools for getting your hands dirty with a new dataset: IPython Notebook and pandas. I'll show you how to read in data, clean it up, graph it, and draw some conclusions, using some open data about the number of cyclists on Montréal's bike paths as an example.

Image Features in Python

Nov 10 - 2:10 p.m.

Matthew Trentacoste

An increasing amount of information is being conveyed via images, with mobile photographs being a particularly notable case. While image metadata, such as the text linking to an image in Google image search, is a common input to machine learning tasks, the content of the image itself is used far less frequently. Whether picking your best vacation photos, recommending similar-looking images to use in a talk or searching for hidden cats, learning tasks can do better if the actual pixels are used as input.

However, the choice of features determines the quality of the result as much as the choice of machine learning algorithm does, and using the pixels directly often yields poor results. Higher-level image features, such as face detection, histograms and color statistics like hue binning, provide significantly better performance. While advantageous, these features force the developer to choose from a vast number of options to accurately capture the details of their problem domain, a challenging task. This talk covers classes of simple image features and how to employ them in machine learning algorithms, and focuses on providing basic domain knowledge in imaging/computer vision to developers already familiar with machine learning.

The outline is as follows: We begin with an overview of common image features and discuss potential applications for each. Common features include examples from computer vision, such as blob identification, face detection and edge statistics, as well as from image statistics, such as intensity histograms, Fourier properties and color statistics like hue binning. Next, we present how to generate the features with Python imaging libraries. Finally, we discuss approaches to converting complex image features into a series of scalars for the input vector of an ML algorithm that best represent the problem domain.
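As a rough illustration of turning one such feature into scalars, the sketch below bins pixel hues into a normalized histogram using only the standard library. The function name `hue_histogram`, the four-pixel "image" and the bin count are assumptions made for this example and are not taken from the talk; a real pipeline would read pixels through an imaging library.

```python
import colorsys


def hue_histogram(pixels, bins=8):
    """Return a normalized hue histogram for an iterable of (r, g, b) pixels.

    Channels are expected in the 0-255 range. The result is a list of
    `bins` floats that sum to 1.0 (or all zeros for an empty input).
    """
    counts = [0] * bins
    total = 0
    for r, g, b in pixels:
        h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp the index so a hue of exactly 1.0 cannot overflow the last bin.
        index = min(int(h * bins), bins - 1)
        counts[index] += 1
        total += 1
    if total == 0:
        return [0.0] * bins
    return [c / total for c in counts]


# Hypothetical tiny "image": two red pixels, one green, one blue.
image = [(255, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255)]
features = hue_histogram(image, bins=4)
print(features)  # -> [0.5, 0.25, 0.25, 0.0]
```

The resulting list can be appended directly to the input vector of a learning algorithm; the number of bins trades resolution against vector length.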

A Beginner’s Guide to Machine Learning with Scikit-Learn

Nov 09 - 10:15 a.m.

Sarah Guido

Scikit-learn is one of the most well-known machine learning Python modules in existence. But how does it work, and what, for that matter, is machine learning? For those with programming experience but who are new to machine learning, this talk gives a beginner-level overview of how machine learning can be useful, important machine learning concepts, and how to implement them with scikit-learn. We’ll use real world data to look at supervised and unsupervised machine learning algorithms and why scikit-learn is useful for performing these tasks.

A practical introduction to Pandas with Citibike Data

Nov 10 - 3:10 p.m.

Paddy Mullen

Attendees will be given a practical introduction to data analysis with Pandas. A sample dataset with minute-wise station populations will be the primary dataset that participants will analyze and explore to find novel threads of insight about the Citibike bike sharing system. At the end of the workshop students will be able to derive the total number of rides in a day, a prediction of total rides based on weather and number of users, an estimated number of daily passes purchased, comparisons of weekday vs weekend statistics, and a method of determining when rebalancing occurred. The focus will be on exploratory play rather than advanced statistics. A basic understanding of Python is necessary and prior NumPy/Pandas experience is helpful. No math or statistics background is necessary to understand this tutorial.

An Intro to FOSS Licenses and Copyrights in Data and Software

Nov 10 - 3:10 p.m.

Joshua Horowitz

Coming soon.

Bayesian Data Analysis with PyMC3

Nov 09 - 12:55 p.m.

Thomas Wiecki

Probabilistic Programming allows flexible specification of statistical models to gain insight from data. Estimation of best fitting parameter values, as well as uncertainty in these estimations, can be automated by sampling algorithms like Markov chain Monte Carlo (MCMC). The high interpretability and flexibility of this approach has led to a huge paradigm shift in scientific fields ranging from Cognitive Science to Data Science and Quantitative Finance.

PyMC3 is a new Python module that features next generation sampling algorithms and an intuitive model specification syntax. The whole code base is written in pure Python and just-in-time compiled via Theano for speed.

In this talk I will provide an intuitive introduction to Bayesian statistics and how probabilistic models can be specified and estimated using PyMC3.

Beyond the dict: Python Tools to Wrangle Data From CSV Up

Nov 10 - 1:20 p.m.

Imran Haque

There's one real problem in data science: data, and the cleaning and slicing required before it can be fed into exciting machine learning algorithms. In this talk, I will present a survey of data management tools in Python, ranging from easy ways to handle CSV data at the small end (csv, sheets); using Python's standard sqlite3 package as an in-memory RDBMS to get code "beyond the dict"; using PyTables as a highly CPU- and IO-efficient store for numerical and relational data; to ORM systems to scale to data beyond the needs of a script pipeline. After this talk, attendees will understand how to make their data-handling code more efficient and expressive with these major libraries.
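To make the "beyond the dict" step concrete, here is a minimal sketch that loads CSV rows into an in-memory SQLite database with the standard `csv` and `sqlite3` modules and aggregates them with SQL. The table name `rides`, its columns and the sample rows are hypothetical and chosen only for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be an open file.
raw_csv = io.StringIO(
    "city,year,riders\n"
    "Montreal,2012,1500\n"
    "Montreal,2013,1800\n"
    "New York,2013,4200\n"
)

# ":memory:" keeps the whole database in RAM for the life of the connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (city TEXT, year INTEGER, riders INTEGER)")

# Convert each CSV row (all strings) into a typed tuple before inserting.
rows = [
    (row["city"], int(row["year"]), int(row["riders"]))
    for row in csv.DictReader(raw_csv)
]
conn.executemany("INSERT INTO rides VALUES (?, ?, ?)", rows)

# A GROUP BY replaces the hand-written dict-of-lists bookkeeping.
totals = conn.execute(
    "SELECT city, SUM(riders) FROM rides GROUP BY city ORDER BY city"
).fetchall()
conn.close()

print(totals)  # -> [('Montreal', 3300), ('New York', 4200)]
```

Swapping `":memory:"` for a file path persists the same table to disk without changing the rest of the code.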

Bokeh Workshop

Nov 10 - 2:10 p.m.

Paddy Mullen

Sponsor Workshop

Building Data Pipelines with Python and Luigi

Nov 09 - 3:35 p.m.

Erik Bernhardsson

Erik will be talking about Luigi, a recently open-sourced Python framework that helps you build complex pipelines of batch jobs, handle dependency resolution, and create visualizations to help manage multiple workflows.

Luigi provides an infrastructure that powers several Spotify features including recommendations, top lists, A/B test analysis, external reports, internal dashboards, and many more. It also comes with Hadoop support built in (and that’s really where its strength becomes clear). Spotify uses it to run 10,000+ Hadoop jobs every day, but also for other actions like training machine learning algorithms, sending out reports, loading data into databases, and much more.

Churn Prediction With Graphical Models

Nov 09 - 1:45 p.m.

Allen Grimm

PlayHaven is working to model and predict user behavior in games on mobile devices. We are integrated into 5,000 games, see around 130 million unique users monthly, and record events from around 2.5 billion game sessions. Given the right data science tools, we will be able to intelligently redirect that traffic amongst our clients' games to dramatically improve the game player's experience and, thus, the quality of our clients' business.

The project chosen as PlayHaven's first run at predicting user behavior is churn prediction. Churn in our context occurs when a user who was previously active within a game decides to leave that game. Successfully predicting churn gives us the chance to see why a user has lost interest and address it and, if that fails, to provide them with content more in line with their interests.

The method used is called Reconstructability Analysis (RA) - a graphical model framework with heavy overlap in Loglinear Models and Bayesian Networks. This is a conception-to-deliverable overview. We will start with a top level view of PlayHaven and what that means in terms of the data we collect and what we want from data science. We then discuss what data we pull and the resources needed to pull and pre-process it. With a data set built, we will discuss model construction and predictor performance in terms of predictive accuracy and computational resources needed. We will conclude with a top level discussion of a few special applications of RA.

Data Engineering 101: Building your First Data Product

Nov 10 - 10:45 a.m.

Ben Lerner

Excel is the lingua franca of data - if you share data with non-programmers, you probably use Excel to do it. This talk will show you how to use Python to work with spreadsheets programmatically, and what you can build once you start scripting Excel.

DataViz Showdown: a comparison of different data visualisation libraries

Nov 09 - 11:05 a.m.

Dan Blanchard

As the popularity of machine learning techniques spreads to new areas of industry and science, the number of potential machine learning users is growing rapidly. While the fantastic scikit-learn library is widely used in the Python community for tackling such tasks, there are two significant hurdles in place for people working on new machine learning problems:

1. Scikit-learn requires writing a fair amount of boilerplate code to run even simple experiments.

2. Obtaining good performance typically requires tuning various model parameters, which can be particularly challenging for beginners.

SciKit-Learn Laboratory (SKLL) is an open source Python package, originally developed by the NLP & Speech group at the Educational Testing Service (ETS), that addresses these issues by providing the ability to run scikit-learn experiments with tuned models without writing any code beyond what generates the features. This talk will provide an overview of performing common machine learning tasks with SKLL.

Efficient Computing with NumPy

Nov 08 - 3 p.m.

Jake Vanderplas

Coming Soon.

Embeddings of Python

Nov 10 - 10:45 a.m.

James Powell

How can Python be embedded into C/C++ applications?

This talk covers the well-known very high-level and pure embeddings and includes two novel forms of embedding: a zero-interpreter embedding using Cython, and Python running from within Python.

Excel and IPython

Nov 10 - 2:10 p.m.

Victor Jakubiuk

IPython is a powerful shell that enhances dynamic scripting in Python. Microsoft Excel is the default spreadsheet software that serves as a great user interface. Combining them together can greatly increase productivity of software developers, analysts and business users. In this talk I’m going to demonstrate some of the most valuable features of IPython in Excel. Live demos will be provided.

Functional Performance with Core Data Structures

Nov 10 - 9:30 a.m.

Brian Granger

Computing, and thus software, is one of the foundations of modern technical work across a broad range of fields. Like anything, all software has attributes: slow, fast, buggy, robust, etc. However, these attributes are not passive and neutral. In this talk I will describe how the attributes of software have a profound effect on human behavior, attitudes and thought patterns. These attributes, for better or worse, infect all of the work that is done using the software. To explore these ideas, I will provide an attribute-based tour of the IPython Notebook. This tour will elucidate the overall vision for the project and cover our recent work on interactive widgets and converting notebooks to different formats.

Generator Showcase Showdown

Nov 09 - 2:45 p.m.

James Powell

How can we model problems using generators and coroutines? What additional conceptualizations do these modelings allow and what benefits can we derive from these approaches?
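One possible way to picture the two ideas side by side is the sketch below: a generator that lazily produces values and a coroutine that consumes them through `send()`. The names `read_numbers` and `running_mean` and the sample input are invented for this illustration and are not taken from the talk.

```python
def read_numbers(lines):
    """Generator: lazily parse integers from text lines, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            yield int(line)


def running_mean():
    """Coroutine: receive values via send() and yield the mean so far."""
    total = 0.0
    count = 0
    mean = None
    while True:
        value = yield mean
        total += value
        count += 1
        mean = total / count


averager = running_mean()
next(averager)  # prime the coroutine so it is paused at its first yield

means = [averager.send(n) for n in read_numbers(["10", "", "20", "30"])]
print(means)  # -> [10.0, 15.0, 20.0]
```

The generator pulls data on demand while the coroutine keeps its state between calls, so neither side needs an explicit class or a buffer holding the full input.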

Generators Will Free Your Mind

Nov 09 - 1:45 p.m.

Avishek Panigrahi

This is an introduction to data analysis in the context of chip design.

Chip design complexity has grown exponentially in the last 5 years. The primary driver for this has been a demand for higher performance and lower power. While these requirements are not new, what is relatively new is an increased complexity in design styles and manufacturing processes: multiple process corners, multiple on-chip voltages, voltage and frequency scaling, many operating modes, etc. This has led to a huge increase in the amount of data generated in the process of designing current-generation chips, and has made that data harder to analyze with the traditional methods of the chip design space, such as reviewing waveforms, reading reports and ad-hoc scripting. All this makes this area attractive for rich data analysis.

This talk will be an introduction to pandas and matplotlib with chip design data as the background.

GeoPandas: Geospatial Data in Python Made Easy

Nov 09 - 2:45 p.m.

Kelsey Jordahl

GeoPandas extends the pandas data analysis library to work with geographic objects. File I/O, geometric operations, map projection transformations and plotting are provided in a high level interface that makes use of other libraries including Shapely and Fiona. GeoPandas is ideal for interactive use with IPython, and provides easy geospatial analysis and manipulation tools without a need for complicated desktop GIS applications or spatial databases.

House Prices and Rents: Evidence from a Matched Dataset in Central London

Nov 10 - 1:20 p.m.

Daniel Krasner

You will use pre-existing code and write some of your own in order to classify emails from the Enron corpus.

Intro to Python Data Analysis in Wakari

Nov 08 - 1:15 p.m.

Karissa McKelvey

Coming Soon

K-means Clustering with Scikit-Learn

Nov 09 - 12:55 p.m.

Chris Colbert

Enaml is an open source library for building rich user interfaces utilizing a declarative extension to the Python language grammar. Notable features of the framework include: automatic dependency analysis at the bytecode level, a constraints based layout system, support for multiple rendering engines, and the ability to interface with a variety of model notification systems. The Enaml DSL is a strict superset of Python which allows the developer to declaratively define dynamic, extensible, and reactive trees; such trees are particularly well suited for the definition of user interface hierarchies.

Enaml has been used for building production applications at multiple Fortune 500 companies and serves as the foundational UI framework at one of the world's leading investment banks. This talk by the author of Enaml will provide an introduction to the language and demonstrate some of the notable features of the framework.

Machine Learning with scikit-learn

Nov 08 - 9 a.m.

Jake Vanderplas

Scikit-learn is a popular Python machine learning library. In this tutorial, I'll give an introduction to the core concepts of machine learning, using scikit-learn to demonstrate applications of these concepts on real-world datasets. We'll cover some of the most powerful and popular supervised and unsupervised learning techniques, including classification and regression models like Support Vector Machines and Random Forests, clustering models like K Means and Gaussian Mixtures, and dimensionality reduction models like PCA and manifold learning. Throughout, I'll emphasize the key features of the scikit-learn API, so that participants will be well-poised to begin exploring their own datasets using the wide array of algorithms implemented in scikit-learn.

My First Numba

Nov 08 - 9 a.m.

Saul Diez-Guerra

What is Numba, how does it differ from the alternatives, and why would you want to use it? This basic tutorial models some simple interview-question-like problems, optimizes them using Numba, and analyzes performance from the perspective of a first-time user.

Old School - Functional Data Analysis

Nov 08 - 10:45 a.m.

Matthew Rocklin

This talk will use core functionality from the `PyToolz` projects. Students will leave both with a set of concrete tools and with an understanding of some of the more applicable lessons from the functional style.

Packaging and Deployment

Nov 08 - 1:15 p.m.

Travis Oliphant

Coming soon.

Performance Python

Nov 08 - 4:45 p.m.

Yves Hilpisch

Coming Soon

Practical Medium Data Analytics with Python

Nov 09 - 3:35 p.m.

Wes McKinney

Coming Soon

Probabilistic Data Structures for Realtime Analytics

Nov 09 - 10:15 a.m.

Martin Laprise

More and more applications are now dealing with massive data that need to be processed in realtime. While easing the development of realtime analytics applications, computing platforms like Storm increase the need for efficient algorithms that can run in a single pass over the data stream. In this talk, I'll give a brief overview of some interesting probabilistic data structures that can be used in this context: Bloom filter, Temporal Bloom filter, Count-Min Sketch and HyperLogLog.
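As a hedged illustration of the first structure on that list, the following is a minimal Bloom filter built on the standard library. The class name, the default sizes and the use of seeded SHA-256 digests as hash functions are assumptions made for this sketch, not details from the talk; production implementations usually pick faster hashes and size the bit array from a target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by prefixing the item with a seed byte.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every derived bit is set; a single zero bit proves absence.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
for user in ["alice", "bob"]:  # hypothetical stream of user ids
    seen.add(user)

print("alice" in seen)  # -> True
```

Each item is processed once and memory stays fixed at `num_bits` regardless of stream length, which is what makes the structure suitable for single-pass processing.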

PyParallel: How we Removed the GIL and Exploited all Cores (Without Needing to Remove the GIL at all)

Nov 10 - 3:10 p.m.

Trent Nelson

During the fall of 2012, a heated technical discussion regarding asynchronous programming occurred on python-ideas. One of the outcomes of this discussion was Tulip, an asynchronous programming API for Python 3.3, spearheaded by Guido van Rossum. A lesser known outcome was PyParallel: a set of modifications to the CPython interpreter that allows Python code to execute concurrently across multiple cores.

Twisted, Tulip, Gevent, Stackless/greenlets and even node.js are all variations on the same pattern for achieving "asynchronous I/O": non-blocking I/O performed on a single thread. Each framework provides extensive mechanisms for encapsulating computational work via deferreds, coroutines, generators and yield from clauses that can be executed in the future when a file descriptor is ready for reading or writing.

What I found troubling with all these solutions is that so much effort was being invested to encapsulate future computation (to be executed when a file descriptor is ready for reading or writing), without consideration of the fact that execution is still limited to a single core.

PyParallel approaches the problem in a fundamentally different way. Developers will still write code in such a way that they're encapsulating future computation via the provided APIs; however, thanks to some novel CPython interpreter modifications, such code can be run concurrently across all available cores.

This talk will cover the history behind PyParallel, the numerous novel techniques invented to achieve concurrent interpreter execution, and real-life examples of PyParallel in action (multi-threaded HTTP servers, parallel task decomposition). It will detail the approach PyParallel takes towards facilitating asynchronous I/O compared to competing libraries like Twisted and Tulip. It will also provide insight into the direction of PyParallel in the future, including things like LLVM integration via Numba/Blaze to further improve computational performance.

Link to the deck being presented: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores-1

Python @ Datadog: Building High-Volume Data Systems in the Python Ecosystem

Nov 09 - 11:05 a.m.

Olivier Pomel

Datadog processes and reports on tens of billions of events a day and is used to manage infrastructures ranging from a couple of servers to many thousands of instances.

This talk will explain why we chose Python as the core ecosystem to build it on, how we scaled the system to date and what we learned in the process.

Expect to hear about numerical processing with numpy or pandas, memory management, concurrency, profiling and performance tuning, as well as examples of optimization using C or Cython.

Python Powered Data Science at Pivotal

Nov 10 - 10:45 a.m.

Ian Huston, Srivatsan Ramanujam

The Python data ecosystem has grown beyond the confines of single machines to embrace scalability. Here we describe one of our approaches to scaling, which is already being used in production systems.

The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. The Greenplum Database is now part of the Pivotal Platform and provides super fast analytics capabilities through a shared-nothing architecture and SQL interface (based on Postgres). In addition to running parallel queries across terabytes of data using pure SQL it can also run procedural languages such as PL/Python. MADlib, Pivotal’s open source library for scalable in-database machine learning, uses Python to glue SQL queries to low level C++ functions and is also usable through the PyMADlib package. We will also show how we have used many of the standard tools in the Python data analysis toolkit in this framework, including Pandas, scikit-learn, nltk and of course NumPy and SciPy.

In particular combining these tools has allowed us to perform sentiment analysis on large datasets and we will discuss the strategies and issues we have come across along the way.

Python as Part of a Production Machine Learning Stack

Nov 10 - 11:35 a.m.

Bryan Lewis, Jake Vanderplas

SciDB-Py connects two of the most powerful open source tools whose shared mission is to change the way quants, data scientists, scientists and analysts work with big data: SciDB and Python.

The need to safely store and quickly manipulate vast datasets has made databases increasingly important to data scientists. Unfortunately, the array-based datasets central to many applications do not fit neatly into the classic relational or key-value paradigms of popular database architectures. Enter SciDB, a new database platform which is built around large multi-dimensional arrays. It provides both efficient distributed storage and fast array-oriented computations, from simple slices and joins to more involved parallel linear algebraic operations. This talk will introduce SciDB-py, a Python package which wraps SciDB with a familiar, numpy-like syntax. With SciDB-py, data scientists can effortlessly store and manipulate extremely large array-based data from the comfort of the Python interpreter.

We'll demonstrate working with timeseries data in SciDB and present basic examples that illustrate SciDB's native analytics capabilities including aggregation and data decimation, regression and generalized linear models, covariance matrices, singular value decomposition, and extending SciDB with custom operations. The examples apply to a broad range of applications including quantitative finance, econometrics, and risk and credit analysis.

The examples demonstrate how SciDB-Py’s scale out MPP architecture enables interactive exploratory analytics on large-scale data.

Python on the GPU with Parakeet

Nov 10 - 1:20 p.m.

Alex Rubinsteyn

Parakeet is a runtime compiler for numerical Python. It takes array-oriented computations, optimizes them, and compiles them to native code. Parakeet reimplements a subset of NumPy's library functions using data parallel operators, which are amenable to parallel execution. Until recently, however, this parallelism was wasted on a single-core LLVM backend. A new CUDA backend for Parakeet is under development and might prove to be the easiest way to write GPU programs in Python.

Python's Role in the Future of Data Analysis

Nov 09 - 9:10 a.m.

Peter Wang

Coming Soon

Rapid Data Visualization, from Python to Browser

Nov 08 - 3 p.m.

Andrew Montalenti

Many data sets are beautiful in themselves, but how do we make their beauty obvious? Data visualization, of course. This hands-on tutorial will explore the craft, once described by industry expert Ben Fry as a clear multi-step process oriented around data: acquire, parse, filter, mine, represent, refine, & interact. Python -- especially with new analysis tools like Pandas -- excels at the first few steps. However, other tools beyond Python must be used in order to represent, refine, and interact with data. The best toolset for this lives in the modern browser. Many PyData attendees are familiar with IPython Notebook. It provides an ideal place for us to build out a "read-evaluate-print loop" (REPL) for the visual data exploration process. Join us as we unify Python with browser technologies like JavaScript, CSS, and SVG, in a single space. The tutorial will cover how to iteratively produce visualizations from a raw data set of online news articles & web traffic metrics. You will use IPython Notebook to discover hidden patterns in the data, then you will convert your own explorations into production visualizations ready for interaction (& publication!) on the web. The tools covered will include: Pandas; D3.js; NVD3.js; Vega / Vincent; and PhantomJS.

Rapid Development and High Performance Text Processing

Nov 10 - 11:35 a.m.

Daniel Krasner

This talk covers rapid prototyping of a high performance scalable text processing pipeline in Python. We demonstrate how Python modules can be used to analyze, clean, perform feature extraction, and finally classify half a million documents. Our style is to build small and simple modules (each with command line interfaces) that use very little memory and are parallelized with the multiprocessing module.

Scaling Python Across Hardware & Organizations

Nov 09 - 4:25 p.m.

Panel Discussion

This panel is a unique take on the question of Python Scalability.

The first topic is the challenge of scaling Python across computational clusters and hardware like GPUs. How do we take a language that prides itself on readability and ease-of-use, and apply it to cutting edge challenges like cluster computing, multi-core processors, and GPUs? Are there fundamental limitations - do we have to move to a different language for these things?

The second topic is the challenge of scaling the *usage* of Python across a team or an organization. Oftentimes, Python is chosen by the early prototypers in a team, and as they successfully deploy their code into production, more and more people - who do not necessarily possess the mindset of the early evangelists - start needing to use or adopt the language, without necessarily having "organically" grown an appreciation for its idioms. Sometimes the code they produce more closely resembles Java or C++, but just spelled in Python. How can experienced Python programmers help grow the effective usage of the language across the organization?

These are two very different questions, but they are both rooted in a common theme: Python's ease of use and its power as a scripting language are actually a double-edged sword. Since people are used to things being much easier and more pleasant to code in Python, when they run into traditionally challenging problems like distributed computing or software management, there is a tendency to just assume that Python is "really bad" for that use case. But Python can be successfully scaled across a cluster and across an organization. In this panel discussion, we'll talk about lessons learned in both of these situations.

Working with Hadoop from Python

Nov 10 - 11:35 a.m.

Emily Chen

Coming soon.

ddpy: Data-Driven Music for Big Data Analysis

Nov 08 - 10:45 a.m.

Thomas Levine

Are your data too complicated to visualize? Have you considered complementing your visualizations with music? In this session, we'll analyze data from the American Community Survey by composing data-driven music with the ddpy package (https://github.com/csv/ddpy).