Sometimes the greatest challenge in working with data is getting data to work with in the first place. In this talk I'll take the audience through the process of building a toolset for launching a virtual army of data collectors that can gather large volumes of useful data quickly. (No live coding or slides full of code will be presented; we're going to deal with concepts, and I'll direct the audience to a GitHub repository with examples at the end of the talk.)
Since it's the most widely available and a common source of valuable information, we'll focus primarily on gathering data from the web, although the principles could certainly be used to churn through other data sources as well.
We'll start by examining a simple web scraper and the limitations of a single, linear process. We'll then progress through the concepts of threading and concurrency, all the way through to multiprocessing. (Again, very little code, mostly graphics to help improve understanding of the concepts.)
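To make the progression concrete, here is a minimal sketch of a linear scraper next to a threaded one. The `fetch()` function and URLs are placeholders (a real scraper would do network I/O with `urllib` or `requests`), not code from the talk:

```python
# A minimal sketch: linear scraping vs. threaded scraping.
# fetch() is a stand-in for a real HTTP request; the URLs are fake.
import time
from concurrent.futures import ThreadPoolExecutor

URLS = ["http://example.com/page/%d" % n for n in range(8)]

def fetch(url):
    """Pretend to download a page; a real version would do network I/O."""
    time.sleep(0.05)          # simulate network latency
    return "<html>%s</html>" % url

def scrape_linear(urls):
    # One request at a time: total time grows linearly with len(urls).
    return [fetch(u) for u in urls]

def scrape_threaded(urls, workers=4):
    # Threads overlap the waiting, so I/O-bound work finishes sooner.
    # pool.map preserves input order in its results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Because scraping is I/O-bound, threads help despite the GIL; multiprocessing becomes interesting once parsing or other CPU-bound work dominates.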
Once we reach this point, we'll discover together that there are limitations to this approach, even on super fast multi-core machines with tons of RAM: network bottlenecks, ISP issues, and the possibility of mounting an inadvertent denial-of-service attack, not to mention the fact that you may not be able to use the computer in question while the data harvesting is going on.
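One way to avoid the accidental denial-of-service problem is to rate-limit requests per host. This is a minimal illustration of that idea (the class name and `fetch()` body are invented for the example, not from the talk):

```python
# A minimal per-host rate limiter: enforce a minimum delay between
# requests to the same host so a scraper can't accidentally flood one
# server. The returned string is a stand-in for a real GET response.
import time
from urllib.parse import urlparse

class PoliteFetcher:
    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_hit = {}        # host -> timestamp of last request

    def fetch(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_hit.get(host, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)   # back off politely
        self.last_hit[host] = time.time()
        return "<html>%s</html>" % url             # placeholder response
```

Requests to different hosts are not delayed, so concurrency across sites stays cheap while any single site is treated gently.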
From here we can consider the idea of using an inexpensive virtual machine running somewhere else (such as AWS) to do our bidding and harvest data while we wait. I'll show how some very simple tools like Vagrant and Fabric can be combined to make running code on a remote machine simple.
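As a flavor of how little code this takes, here is a sketch (not the speaker's actual code) of driving a remote box with Fabric 2. The host address, key path, and script names are placeholder assumptions; the VM itself could be brought up with `vagrant up` or an AWS launch:

```python
# Sketch: ship a script to a remote VM and run it there with Fabric 2.
# The host (a documentation-range IP), key path, and file names are
# placeholders for illustration only.
from fabric import Connection

def deploy_and_run(host="ubuntu@203.0.113.10", key="~/.ssh/id_rsa"):
    conn = Connection(host, connect_kwargs={"key_filename": key})
    conn.put("scraper.py", remote="scraper.py")    # ship our code up
    conn.run("pip install requests")               # install dependencies
    result = conn.run("python scraper.py", hide=True)
    return result.stdout                           # harvested output
```

The appeal is that the remote machine does the waiting: you kick off the run, close your laptop, and collect results later.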
We'll still have some limitations though. Moving everything to a remote machine solves some of our original problems, but in the end it's still one machine and even the most powerful machine is going to have limits.
I'll present ways that we can spawn a network (an Army!) of virtual machines that can all work together to complete the task at hand, and have that power available to run any Python code we desire.
Often there exists a divide between data teams, engineering, and product managers in organizations, but with the dawn of data-driven companies and applications, it is more pressing now than ever to be able to automate your analyses to personalize your users' experiences. LinkedIn's People You May Know, Netflix and Pandora's recommenders, and Amazon's eerily custom shopping experience have all shown us why it is essential to leverage data if you want to stay relevant as a company.
As data analyses turn into products, it is essential that your tech/data stack be flexible enough to run models in production, integrate with web applications, and provide users with immediate and valuable feedback. I believe Python is becoming the lingua franca of data science due to its flexibility as a performant general-purpose programming language, its rich scientific ecosystem (numpy, scipy, scikit-learn, pandas, etc.), its web frameworks and community, and its utilities and libraries for handling data at scale. In this talk I will walk through a fictional company bringing its first data product to market. Along the way I will cover Python and data science best practices for such a pipeline, some of the pitfalls of putting models into production, and how to make sure your users (and engineers) are as happy as they can be.
PyAlgoViz is an HTML5 browser application that allows Python students and practitioners to prototype an algorithm, visualize it, replay the execution, and share the end result with others. A great use would be as a tool in the Data Structures and Algorithms track of the Computer Science curriculum.
PyAlgoViz is an HTML5 browser application that allows Python students and practitioners to prototype an algorithm, visualize it, and share it with others. To visualize an algorithm, it is sent to a server that runs the code, records the execution, and sends the recording back to the client. In the browser, the recording is then replayed at the speed the user wants. Graphics primitives to draw rectangles, lines, and text, in addition to generating sounds, allow algorithm visualizations that enhance the understanding of the algorithm.
Intended usage for PyAlgoViz is in the Data Structures and Algorithms track of the Computer Science curriculum, or for personal education in the area of program algorithms. Not only will students learn how to implement algorithms in Python, they will also be able to better understand the asymptotic behavior of algorithms, or even diagnose buggy ones, by inducing patterns from the visualizations they create themselves.
Some of the biggest issues at the center of analyzing large amounts of data are query flexibility, latency, and fault tolerance. Modern technologies that build upon the success of “big data” platforms, such as Apache Hadoop, have made it possible to spread the load of data analysis to commodity machines, but these analyses can still take hours to run and do not respond well to rapidly-changing data sets.
A new generation of data processing platforms -- which we call “stream architectures” -- has converted data sources into streams of data that can be processed and analyzed in real time. This has led to the development of various distributed real-time computation frameworks (e.g. Apache Storm) and multi-consumer data integration technologies (e.g. Apache Kafka). Together, they offer a way to do predictable computation on real-time data streams.
In this talk, we will give an overview of these technologies and how they fit into the Python ecosystem. This will include a discussion of current open source interoperability options with Python, and how to combine real-time computation with batch logic written for Hadoop. We will also discuss alternatives to Kafka and Storm, current industry usage, and some real-world examples of how these technologies are being used in production by Parse.ly today.
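For a taste of the stream-processing model, here is a toy, dependency-free sketch of the canonical streaming word count: state is updated per message and a fresh result is emitted immediately, rather than waiting for a batch job. It uses a Python generator as the "stream" and is not actual Kafka or Storm API usage (a real deployment would consume from a Kafka topic and run the counting inside a Storm bolt):

```python
# Toy sketch of stream-style computation: a running word count over an
# unbounded stream of messages, updated and emitted per message.
from collections import Counter

def message_stream():
    # Stand-in for a Kafka topic: yields messages one at a time.
    for line in ["to be or not to be", "that is the question"]:
        yield line

def streaming_word_count(stream):
    counts = Counter()
    for message in stream:
        counts.update(message.split())   # update state per message...
        yield dict(counts)               # ...and emit a fresh snapshot
```

The key contrast with batch Hadoop jobs is latency: results are available after every message instead of hours later, at the cost of managing long-lived state.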