PyData 2013 | Boston, MA | July 27

Presentation Abstracts

(click on the Title to view presentation details)

Bayesian data analysis using PyMC

Jul 28 - 2:20 p.m.

Imri Sofer

It is becoming increasingly popular to model and analyze data using Bayesian statistics. PyMC is a Python module that allows users to create Bayesian statistical models and fit them using several algorithms. One of the most popular algorithms is Markov chain Monte Carlo (MCMC). The lecture will introduce the algorithm, and go over the key components of the PyMC package. We will use the package to build a model, fit it to our data, diagnose the procedure, and validate the model. The lecture is aimed both for beginners who are not familiar with MCMC as well as for a more advance

An Introduction to Pydoop: The Python Interface to Hadoop

Jul 28 - 2:20 p.m.

Michael Sun

I will talk about my experience in evaluating the Python interface to Hadoop and give examples of use cases. Previously before Pydoop was available, we created our own Python interface to Hadoop using Hadoop streaming. I will demonstrate how to use Pydoop HDFS to interact with HDFS on Hadoop, fast program with Pydoop script, and write custom Map/Reduce with Pydoop API for processing data on Hadoop, with all written in Python.

Analytical Tools for Circuit Neurobiology

Jul 28 - 12:45 p.m.

Andrew Giessel

Modern neuroscience is a diverse field that encompasses many experimental approaches to understanding nervous system function. Spanning from molecules to behavior, the data generated by each approach has specific computational and analytic demands. In this talk I will highlight the analytic and data management frameworks we've established in our lab, which studies the mouse olfactory system and innate behaviors using a wide range of techniques and approaches.

Beginner NumPy Tutorial

Jul 27 - 10:30 a.m.

Andrew Giessel

Coming soon.

Bestseller Analysis: Sex Scene Detection, Topics, and Pacing

Jul 27 - 3:10 p.m.

Lynn Cherny

Some pythonic how-to's behind how I identified sex scenes in 50 Shades of Grey and visualized topics and structural details in Dan Brown's novels. After a little help from Mechanical Turkers, the texts are processed and modeled in python, and the results are visualized in d3.js.

Citation Network Analysis

Jul 28 - 11:15 a.m.

Thom Neale

Coming soon.

Comparing R and Python for PCA Analysis

Jul 28 - 3:10 p.m.

Vipin Sachdeva

Coming soon.

Cython: Readable like Python, fast like C

Jul 27 - 12:10 p.m.

Zak Fallows

Cython is a dialect of Python that allows you to optionally declare a static type for each variable. Cython is often 10 to 100 times faster than normal Python, and Cython is generally exactly as fast as hand-written C. This talk will explain how to use Cython and why Cython is awesome. Cython provides more than just speed, so we will examine some of the novel uses for Cython.

DX Analytics -- A Python-based Library for Derivatives Analytics

Jul 27 - 3:10 p.m.

Demian Wassermann

Current hypotheses in neuroscience postulate that anatomy is closely related with brain function. Subtle variations in anatomical structure could lead to differences in behavior as well as providing a substrate for psychiatric diseases such as schizophrenia or autism. However, how to represent current anatomical knowledge computationally remains an open question. In this talk I will introduce python tools to represent human anatomy in two different manners. On one side I will introduce a domain specific programming language to represent and automatically extract the major white matter structures in the human brain’s white matter. This technique enables not only a formal documentation of anatomical knowledge but the automation of population studies including specific brain changes in schizophrenia and dyscalculia. Secondly, I will introduce the field of computational anatomy as well as a python package to implement some of its algorithms. Computational anatomy aims to provide a mathematical and computational framework to find areas in the brain and other organs which have equivalent functions across and within subjects, enabling the characterization of changes specific to diseases as well as surgical planning.

DataPad: Python-powered Business Intelligence

Jul 28 - 4:20 p.m.

Craig Schmidt

Predictobot is a website that makes predictive modeling possible for people without experience in machine learning or programming. You upload a spreadsheet of data, and specify which column you would like to predict, using the data in the other columns. Predictobot automatically analyzes the data and builds a predictive model. The website allows you to see which columns had the biggest effect on the predictions, and how well your model should perform in the future. You get an Excel spreadsheet with formulas that make predictions on future data.

Predictobot is written entirely in Python, using a number of frameworks. I'll talk a little about the architecture, using Django and PiCloud to create a scalable service.

It uses a new, proprietary machine learning algorithm. The technique gives a sparse set of interpretable rules specifying how attributes or pairs of attributes affect the model. It is doing variable selection while building the model. It combines ideas from Naive Bayes and Boosting, to give something better.

This will be the first public demonstration of Predictobot. Attendees of the conference will have early access to Predictobot during our beta test period.

Discovering Hidden Structure in Data with Latent Variable Models in Python

Jul 27 - 4 p.m.

Eric Jonas

"Predictive Analytics" has traditionally focused on the task of predicting a single variable given some other data. But often prediction isn't enough — the next generation of intelligent applications require the ability to understand the many hidden causes behind data. In the Predictive Intelligence group at Salesforce, we're building scalable latent variable models using a combination of C++ and Python to help customers understand and exploit the massive datasets they've accumulated on the force.com platform. "

Embeddings of Python

Jul 27 - 5 p.m.

James Powell

Coming soon.

Financial Analysis in Python

Jul 27 - 5 p.m.

Thomas Wiecki

This tutorial will provide hands-on experience of various data analysis tools relevant for financial analysis in Python. We will first see how financial data can be imported from various sources such as Yahoo! finance. Pandas, Matplotlib and statsmodels can be used for basic and more advanced time-series analysis. While rudimentary backtesting of investment strategies on historical data can be carried out using Pandas, a more realistic simulation that considers transaction costs, slippage and avoids look-ahead bias, introduces various complexities. We will then see how Zipline, an open-source streaming-based financial simulator written in Python, can make realistic backtesting much easier. After going through some simple example algorithms we will see how statistical Python libraries like scikits.learn can easily be incorporated with Zipline to build state-of-the art trading algorithms. Finally, I will briefly show how the same algorithm code can be run with minimal code changes on Quantopian -- a free, browser-based platform for developing algorithmic trading models.

Generators the Third

James Powell

Coming soon.

Grizzly: Reverse-engineering Cancer and Brain Development with Multidimensional Dataflow Programming

Jul 27 - 11:20 a.m.

Adrian Heibut

Computational biology increasingly depends on integrating many kinds of measurments (such as mutation status, gene expression, and phenotype) from ever-growing datasets. Real world analysis is ad-hoc, experimental, iterative, and collaborative, and both data and statistical approaches constantly evolve. Ensuring that computational experiments are reproducible, understandable, and efficient remains a major challenge.

Drawing examples from current research into mechanisms of cancer and neural differentiation, I'll dissect some of these challenges, and describe grizzly, a set of abstractions and tools under development to help computational scientists to design, run, and interactively explore complex analytical and statistical workflows over structured, multidimensional data. Grizzly is implemented in python, and builds on top of pandas, NumPy, statsmodels, and IPython.

How does text become data?

Jul 27 - 2:20 p.m.

Robert Speer

People and businesses want to make decisions based on large amounts of quantifiable data. If what you actually have is text in natural language, how can you quantify it and make decisions with it? How do you compare it and put error bars on it?

Fortunately, there are now some freely available data sources and libraries that make these questions easier to answer, including Google Ngrams, WordNet, ConceptNet, and NLTK. I'll present code examples, mostly in Python, that use these resources to create language models from text. These models can be built into useful tools such as search engines, recommender systems, and classifiers.

How to Succeed in Industry without Really Trying (a Guide for Academics)

Michael Selik

Comin soon.

Importing and Analyzing SQL Data With the Django ORM

Jul 28 - 4:20 p.m.

Ian Stokes-Rees

Coming soon.

Intro to scikit-learn

Jul 27 - 10:30 a.m.

Michael Becker

Coming soon.

Introduction to Data Wrangling with pandas

Jul 28 - noon

Skipper Seabold

Pandas is a data analysis library for Python. This will be a tutorial-style introduction to data wrangling with pandas. We will first introduce pandas and its data structures. Then we will cover the following topics: core functionality, controlling the display of data, indexing and alignment, slicing and assignment, core input/output functions, how to work with time series, and basic plotting. Participants are encouraged to install the pandas package beforehand. IPython notebooks for the tutorial will be provided via github, so you are encouraged to get IPython setup as well.

Introduction to DataNitro: Python in Excel

Jul 28 - 3:10 p.m.

Ben Lerner

There's an incredible amount of data trapped inside Excel. This bottlenecks our ability to leverage the data analysis tools in the Python ecosystem. This talk will give an introduction to DataNitro, which integrates Python with Excel, and show how to easily get data into and out of Excel.

Julia

Jul 28 - 4:20 p.m.

Jeff Bezanson

Coming soon

Know Thy Neighbor: An Introduction to Scikit-learn and K-NN

Jul 27 - 5 p.m.

Cam Davidson-Pilon

We present an overview of the open-source IPython book Bayesian Methods for Hackers, including its unique real-time features and future goals, by presenting some practical Bayesian algorithms: ranking, optimal choice, and a solution to The Price is Right's Showdown. We'll also perform a real-time experiment using the PyData audience, and use the results in a later chapter of the book

Know Thy Neighbor: An Introduction to Scikit-learn and K-NN

Jul 28 - 9:30 a.m.

Cathy O'Neil

We've recently seen ample evidence, from the fields of macroeconomics, pharmaceutical research, and educational testing, of an urgent need for high-impact mathematical models to be explained, open sourced, and testable. Moreover, the technology to do this is evolving quickly in the python community via tools like the IPython Notebook and Wakari. I will discuss my vision of how to set this up and make it accessible to the public as a crucial tool to live as an informed citizen in this age of big data.

My First Numba

Jul 27 - 12:10 p.m.

James Powell, Saul Diez-Guerra

Basic tutorial, modelling some simple interview question-like problems, optimising using numba autojit, and analysing performance from the perspective of a first-time user

Negawatt mapping: city-scale energy analytics

Jul 28 - noon

Jonathan Jesneck

Essess (pronounced EE-sess) performs high-throughput, city-wide energy scanning to provide multiple levels of energy efficiency and consumption analytics. Our goal is to provide building owners with an easy-to-read thermal image that identifies potential energy leaks, a leading source of energy waste in the building sector. We make this possible using multiple systems including GIS, physical data capture, data processing, and cloud-based production. Our distributed curation platform lets us ensure that high quality data makes its way into Essess' machine learning platform for energy leak detection and modeling. Leaks, buildings, neighborhoods, and cities are automatically scored by integrating several levels of data analysis. We will present how we use python for everything from recording imaging to front-end services.

Orange Canvas

Jul 27 - 1 p.m.

Justin Sun

Orange Canvas is an open source visual environment for building data mining applications. Users connect graphical widgets to load data, visualize data, and to run algorithms such as classification, clustering, and regression. Add-ons modules are available for text mining, bioinformatics, and social network analysis. Developers can create or modify widgets to extend Orange, or use Python scripting to access Orange.

PYDATA AFTER PARTY

Jul 27 - 12:10 p.m.

David Weisman

Coming soon.

Parakeet

Jul 27 - 2:20 p.m.

Alex Rubinsteyn

Parakeet is a a runtime compiler for an array-oriented subset of Python. Parakeet is built on top of data parallel array operators such as Map/Reduce/Scan but largely hides their use from the programmer. Idiomatic NumPy code (array expressions and calls to NumPy library functions) is converted under the hood into a data parallel program, allowing for some fancy high level optimizations and parallel code generation. I'll show a few example programs and give a sketch of how they end up so darned fast.

Personalizing spatial search by blending machine learning and spatial statistics

Jul 27 - 4 p.m.

Jonathan Jesneck

Current online searches for real estate produce a dump of listings for a city, but the consumer is left with the burden of poring through them to assess for personal fit. To produce a more useful search ranking, spatially aware search engine uses personal local environment preferences to map personal fit over all the listings. By aggregating (using Selenium and BeautifulSoup) and mining (using NumPy with MongoDB) social data and applying spatial statistical models (using NumPy, StatsModels, and Pandas), it allows for more useful and intuitive searches, such as

"Find studio apartments close to my work and by good cafes and a subway station in an area with mostly younger people."
"Find an affordable house with an upgraded kitchen that's in a good school district and allows for convenient commutes for my husband and me."
"Find a relatively underpriced hotel room in a well-rated hotel that is close to my conference center, near good shopping, walkable to a good Thai restaurant, and in a safe, low-crime area."

As people search by describing their desired local environments, the machine learning algorithm builds hedonic maps, spatial models of real estate demand, that enable targeted advertising and lead qualification for real estate professionals. It also compares these demand maps with real estate pricing maps to identify price correction opportunities (using distributed queue processing in Celery).

Predictive model deployment with Yhat

Jul 28 - noon

Austin Ogilvie, Greg Lamp

Building predictive models in Python is fun. When I first started with Scientific Python, I was blown away by the examples in the scikit-learn documentation (http://scikit-learn.org/stable/auto_examples/index.html). Buliding a classifier or regressor is fairly easy with scikit-learn and pandas, but after I finish building a model I often find myself saying "Now what?".

For many people, their "Now what?" moment means deploying their model to a production setting for real-time decision making. We'd like to show how Yhat makes deploying predictive models written in Python (or R) fast and easy. We'd like to show data scientists how they can incorporate Yhat into their workflow just by adding a couple lines of code.

At the end of the presentation, you will know how to:

build a recommender system using Python
deploy their recommender to Yhat
embed their recommender in a website

PySpark: Fast and Expressive Big Data Analytics from Python

Jul 28 - 12:45 p.m.

Matei Zaharia

As big data becomes a concern for more and more organizations, there is a need for both faster tools to process it and easier-to-use APIs. Spark is a Hadoop-compatible cluster computing engine that addresses these needs through (1) in-memory computing primitives that let it run 100x faster than Hadoop and (2) concise, high-level APIs in Python, Java and Scala. In this talk, we'll cover PySpark, the Python API for Spark, which lets you process large datasets from a standard Python program by passing functions to be parallelized to special operators (e.g. map, reduce, join). PySpark can also be used interactively from the Python interpreter, allowing fast exploration for big data. PySpark is open source as part of the broader Apache Spark project, which has a growing community with over 60 developers and 17 companies contributing.

Python Tools for Reproducible Research in Brain Imaging

Jul 27 - 10:30 a.m.

Satrajit Ghosh

Reproducible research requires that information pertaining to all aspects of a research activity are captured and represented richly. However, most scientific domains, including neuroscience, only capture pieces of information that are deemed relevant. In this talk, we provide an overview of the components necessary to create this information-rich landscape and describe a prototype platform for capturing standardized provenance and reproducibility in brain imaging. While the data and analysis methods are related to brain imaging, the same principles and architecture are applicable to any scientific domain.

Python and Windows Azure

Jul 27 - 2:20 p.m.

Wen-ming Ye

Microsoft has made Windows Azure an open computing platform, running Hadoop, MongoDB, Linux, and many other open source applications finding wider application in Data processing. Python is alive, well, and doing very nicely on Windows Azure, with new Visual Studio support as well as the Windows Azure Python SDK and a growing developer community. This tutorial gets you started computing on Windows Azure with these new tools, highlighting what the Python community can gain from running Python in the Cloud.

Python and Windows Azure

Wen-ming Ye

Pythran: Static Compiler for High Performance

Jul 28 - 3:10 p.m.

Bryan Lewis, Travis Oliphant

SciDB-Py connects two of the most powerful open source tools whose shared mission is to change the way scientists, engineers and analysts work with big data: SciDB and python.

SciDB is an innovative, next-gen open source database designed for massively scalable in-database complex analytics. It natively organizes data in n-dimensional arrays which is an optimal representation for many of the new types of data being generated and mashed up today: location data, sensor data, genomics data, population data, telematics data, financial time series data, and image data. SciDB supports both embarrassingly parallel and not-embarrassingly parallel processing, distributed storage, fully ACID transactions, efficient sparse array storage, and native, scalable complex math operations like generalized linear models and principal component analysis.

SciDB-Py lets python developers work in a familiar IDE like Wakari, using Blaze arrays that seamlessly reference large scale arrays managed by SciDB. SciDB brings persistent data storage, MPP parallel processing and scale-out linear algebra to python.

We illustrate SciDB-Py with two examples. The first of these examples performs a truncated singular value decomposition of a very large, sparse matrix. This operation is widely used as the “guts” of recommendation engines used in many large web properties. The goal is to cluster customers with similar behavior, but the technique has very general utility.

The second example is from computational finance. We illustrate SciDB's fast parallel aggregation capability to quickly build a custom national best bid and offer price book from daily NYSE TAQ ARCA tick data.

The examples demonstrate that SciDB-Py’s scale out MPP architecture enables interactive exploratory analytics on large-scale data.

Realtime Predictive Analytics Using RabbitMQ & scikit-learn

Jul 27 - 11:20 a.m.

Michael Becker

In this talk I'll discuss how to use RabbitMQ and scikit-learn to create a realtime content classification system.

Scalable Analytics and Visualization: Connecting Expertise to Data With Python

Travis Oliphant

Coming soon.

Sentiment Classification Using scikit-learn

Jul 27 - 4 p.m.

Chris Colbert

Enaml is an open source library for building rich user interfaces utilizing a declarative extension to the Python language grammar. Notable features of the framework include: automatic dependency analysis at the bytecode level, a constraints based layout system, support for multiple rendering engines, and the ability to interface with a variety of model notification systems. The Enaml DSL is a strict superset of Python which allows the developer to declaratively define dynamic, extensible, and reactive trees; such trees are particularly well suited for the definition of user interface hierarchies.

Enaml has been used for building production applications at multiple Fortune 500 companies and serves as the foundational UI framework at one of the world's leading investment banks. This talk will provide a introduction to the Enaml language and demonstrate some of the notable features of the library.

SimpleAI (Artificial Intelligence Python lib)

Jul 28 - 10:30 a.m.

Juan Pedro Fisanotti

Short introduction to search problems on AI (kind of problems we will be able to solve, restrictions)
Live example of solving an AI problem using the lib:

Expose the problem
Define the problem in python
Try several algorithms to find the solution
Use the graphical debugger to view the algorithms behaviour

Conclusions
Where to find more information (book, lib docs, ..)

Speed without drag

Jul 28 - 2:20 p.m.

Chris Halpert

Data manipulation is an essential and often time-consuming task. Pandas offers high-level tools which make data manipulation and analysis easier. This talk will introduce fundamental concepts and we'll work our way to common data wrangling tasks, including: slicing, missing data, grouping, merging and joining, pivoting, statistical functions, and plotting.

Starcluster - Cluster Computing Toolkit for the Amazon EC2 Cloud

Jul 28 - 11:15 a.m.

Justin Riley

StarCluster is a cluster computing toolkit for the cloud, developed in Python by The Software Tools for Academics and Researchers (STAR) group at MIT. StarCluster makes it easy to create and manage parallel and distributed computing clusters on Amazon's EC2. Additionally, a command line interface provides utilities for working with clusters, machines, and data volumes. StarCluster also exposes a Python plugin API that allows users to customize their systems beyond the defaults. StarCluster also includes public machine images, equipped with frameworks out-of-the-box including OpenMPI, OpenMP, Hadoop, (Py)CUDA, (Py)OpenCL, and IPython (parallel).

In this talk I will give an overview of StarCluster and how to get started using it with these various parallel frameworks on real clusters in the cloud. I will also cover using StarCluster's Python plugin API to further configure a cluster and automate various workflows.