Presentation Abstracts


Bayesian algorithms are employed in machine-learning tasks including classification, collaborative filtering, and recommendation engines. This tutorial introduces the application of Bayesian methods to interesting ML problems like movie recommendation, market research, and of course spam filtering. We will use IPython, pandas, SciPy, and publicly available data. It will be a hands-on tutorial; the appropriate install packages and data will be in GitHub before the tutorial. We will also cover the fundamental aspects of Bayesian methods, Bayesian belief networks, and other related topics relevant to solving the problems.
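A minimal sketch of the core idea behind Bayesian spam filtering, the kind of problem the tutorial covers: a hand-rolled naive Bayes classifier with Laplace smoothing. The training data and helper names are invented for illustration, not taken from the tutorial materials.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_docs):
    """Count word frequencies and document totals per class."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in labeled_docs:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Pick the class maximizing log P(class) + sum(log P(word | class)),
    with add-one (Laplace) smoothing for unseen words."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best_label, best_score = None, float("-inf")
    for label in counts:
        n_words = sum(counts[label].values())
        score = math.log(totals[label] / sum(totals.values()))
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [("spam", "win free money now"),
        ("spam", "free prize claim now"),
        ("ham", "meeting at noon tomorrow"),
        ("ham", "lunch tomorrow at noon")]
counts, totals = train_naive_bayes(docs)
```

With this toy corpus, "free money" scores higher under the spam class and "meeting tomorrow" under ham, which is the whole mechanism behind Bayesian spam filters at small scale.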
When it comes to plotting with Python, many people think of matplotlib. It is widely used and provides a simple interface for creating a wide variety of plots, from very simple diagrams to sophisticated animations. This tutorial is a hands-on introduction that teaches the basics of matplotlib. Students will learn how to create publication-ready plots with just a few lines of Python. Students should have a working knowledge of Python. NumPy knowledge is helpful but not required.
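As a taste of the "few lines of Python" the tutorial promises, a minimal sketch of a labeled, legend-bearing plot; the Agg backend and output file name are assumptions so the script can run headless.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; renders without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), "--", label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Two curves in a few lines")
ax.legend()
fig.savefig("trig.png", dpi=150)  # publication-ready raster output
```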

PythonFashionForecaster is an ongoing open source project that I'd like to present to the PyData community in order to initiate discussion about applications of Python in a traditionally non-data-centric industry. It will hopefully extend the use of Python and open source to the world of fashion. A quick search of Python repositories on GitHub shows a lack of true fashion apps; most involve weather forecasts or shopping tools rather than fashion styles specifically. At the other end of the spectrum, the apps highly relevant to fashion styles are commercial. PythonFashionForecaster is different in that its objective is to display fashion style trends as an information resource in an automatic and computational manner.

This talk will be of interest to anyone who would like to see a case study on parsing JSON data with Python, or a survey of data analysis libraries that can be used to analyze social data, as well as anyone interested in fashion-related topics. I believe that this project will indirectly bring exposure to the Python open source community in non-traditional domains.
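A minimal sketch of the JSON-parsing pattern in question, using only the standard library; the payload and field names are invented for illustration.

```python
import json

# A hypothetical social-media API response (field names are illustrative).
raw = '''
{"posts": [
  {"user": {"name": "ana"}, "tags": ["spring", "floral"], "likes": 12},
  {"user": {"name": "ben"}, "tags": ["denim"], "likes": 7}
]}
'''

def flatten_posts(payload):
    """Turn nested JSON posts into flat rows ready for tabular analysis."""
    rows = []
    for post in json.loads(payload)["posts"]:
        for tag in post["tags"]:  # one row per (post, tag) pair
            rows.append({"user": post["user"]["name"],
                         "tag": tag,
                         "likes": post["likes"]})
    return rows

rows = flatten_posts(raw)
```

Flat rows like these drop straight into pandas or any of the data analysis libraries the talk surveys.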

Bitdeli is a platform for creating custom analytics in Python, conveniently in your web browser.

You can use Bitdeli to create real-time dashboards and reports, or as a quick and robust way to experiment with up to terabytes of real-time data. Bitdeli is based on vanilla Python to maximize developer-friendliness. There is no need to learn a new paradigm or stop using existing Python packages.

A typical customer of Bitdeli today is a mobile or web startup that wants to understand and leverage the behavior of their users in ways that are not supported by mainstream analytics services. To further support the long tail of custom analytics, we encourage developers to open-source and share their metrics on GitHub, which is tightly integrated with Bitdeli.

Coming Soon
Title and Panelists TBD
Analytic queries require different systems and approaches than operational transactions if they're going to be efficient. This talk will cover what tools Python gives us out of the box for building fast analytic databases: memory manipulation, compression, dynamic typing, optimized representations, multiprocessing, map-reduce.
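As a small illustration of two of those out-of-the-box tools (optimized representations and compression), here is a sketch of a column stored as a typed array and compressed as raw bytes; the column name and its low cardinality are invented for the example.

```python
import array
import zlib

# Column-oriented layout: store a field contiguously in a typed array
# instead of as a list of per-row Python objects.
campaign_ids = array.array("q", (i % 50 for i in range(100_000)))

# Typed arrays expose their raw memory, so compression works on bytes
# directly; low-cardinality columns like this one compress very well.
compressed = zlib.compress(campaign_ids.tobytes(), level=6)

# Round-trip: decompress straight back into a typed array for scanning.
restored = array.array("q")
restored.frombytes(zlib.decompress(compressed))

ratio = len(compressed) / (8 * len(campaign_ids))  # 8 bytes per int64 value
```

The same columnar layout is what makes fast scans, memory mapping, and map-reduce over chunks practical in pure Python.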

In 1967 sociologist Stanley Milgram began a series of experiments into the "small world problem" that would firmly cement the phrase "six degrees of separation" within popular culture. Because of these experiments, nearly all of us today have heard that we are simply a few handshakes away from anyone in the world. Indeed, it's a popular pastime amongst academics to figure out their Erdos number and, amongst the rest of us, to calculate a favorite actor's Bacon number. Fast forward to today and the world seems even smaller. With the internet connecting all of us to one another at the speed of light, and social networks such as Twitter and Facebook creating communities that quite literally span the globe, this new era in connectedness has given us a wealth of data about how we interact with one another. There's hardly anyone in the tech community today who hasn't heard of social network analysis, but this combination of sociology, computer science, and mathematics has significance beyond just the analysis of social networks.

Between nearly any set of entities a relationship can be found, and thus a network can be made, from which the inner workings of those relationships can be studied. The still-nascent field of network science is quickly becoming THE science of the 21st century, and this talk will introduce this budding field and demonstrate how tools such as NetworkX and matplotlib make it possible for Pythonistas to make meaningful contributions, or simply analyze their own popularity on Twitter.

The goal of this talk is to give the attendees a basic understanding of what network science is and what it can be used for, as well as demonstrate its use in a specific scenario. During the course of this talk we'll walk through a proper definition of a network and introduce some of the jargon necessary to converse with others working in the field. We'll also take a look at some of the statistical properties of networks and how to use them to analyze our own networks. Finally, we'll look at a specific example of the application of network science principles on a real-life social network. By the end of the talk, an attendee should feel comfortable enough with the field of network science to start analyzing their own networks of data.
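NetworkX provides statistics like these out of the box; as a dependency-free illustration of the kind of measure the talk covers, here is a sketch of degree centrality on a toy network (node names invented).

```python
from collections import defaultdict

def degree_centrality(edges):
    """Fraction of the other nodes each node is connected to (undirected)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    n = len(adjacency)
    return {node: len(neighbors) / (n - 1)
            for node, neighbors in adjacency.items()}

# A toy "who talks to whom" network.
edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
centrality = degree_centrality(edges)
```

Here "ann" is connected to all three other nodes (centrality 1.0) while "dan" touches only one; ranking accounts this way is exactly the "analyze your own popularity on Twitter" exercise.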

Matplotlib is the leading scientific visualization tool for Python. Though its ability to generate publication-quality plots is well-known, some of its more advanced features are less-often utilized. In this tutorial, we will explore the ability to create custom mouse- and key-bindings within matplotlib plot windows, giving participants the background and tools needed to create simple cross-platform GUI applications within matplotlib. After going through the basics, we will walk through some more intricate scripts, including a simple MineSweeper game and a 3D interactive Rubik's cube, both implemented entirely in Matplotlib.

Some of today’s greatest challenges to the scientific community are “big data”, “reproducibility/transparency” and “code sharing”. The state-of-the-art Ultra-scale Visualization Climate Data Analysis Tools (UV-CDAT) environment addresses the first two issues with new visualizations and techniques to address big data and provenance. This talk addresses code re-sharing and re-distribution by introducing the UV-CDAT Re-sharable Analyses and Diagnoses (U-ReAD). U-ReAD will offer scientists a complete set of tools (framework) based on the Python programming language along with a code repository. U-ReAD’s goal is to use structured documentation to help build the interface between UV-CDAT and a diagnostic, with few or no changes to the original code. This framework will allow scientists to quickly and seamlessly re-implement their diagnostics so that they will fit perfectly into the UV-CDAT environment. As a result U-ReAD-enhanced diagnostics will be automatically provenance-enabled, making it easy to reproduce any set of results exactly and transparently, a crucial functionality considering today’s increased scrutiny toward scientific results.

This talk aims to demonstrate how easy it can be to plug any diagnostic into UV-CDAT using U-ReAD. We will show how few changes are necessary to create these plugins and how “augmented” the diagnostics are in return.

U-ReAD’s developers also hope to create a central repository of U-ReAD-enhanced tools so that scientists can easily share their tools. This talk will show what is in store along these lines.

In this talk I'll show how a number of tools from the pandas library can be used to quickly wrangle raw data into shape for analysis. Techniques for structured and semi-structured data manipulation, cleaning and preparation, reshaping, and other common tasks will be the main focus.
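A small sketch of the kind of wrangling described: cleaning, filling missing values, and reshaping with pandas. The data and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Messy semi-structured input: inconsistent casing, a missing value,
# dates stored as strings.
raw = pd.DataFrame({
    "date": ["2013-03-18", "2013-03-18", "2013-03-19", "2013-03-19"],
    "city": ["NYC", "nyc", "Boston", "NYC"],
    "sales": [120.0, np.nan, 95.0, 80.0],
})

# Clean and prepare: parse dates, normalize categories, fill missing sales.
clean = (raw
         .assign(date=pd.to_datetime(raw["date"]),
                 city=raw["city"].str.upper())
         .fillna({"sales": 0.0}))

# Reshape: one row per date, one column per city.
pivot = clean.pivot_table(index="date", columns="city",
                          values="sales", aggfunc="sum")
```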

This talk discusses generators as a mechanism for modelling data-centric problems. The techniques suggested focus on simplifying the semantics of processing code, adding flexibility by inverting control structures, and allowing performance optimisations through caching, laziness, and targeted specialisations.

* This would be a continuation of the material I presented at PyData NYC 2012. I would incorporate feedback from that presentation to cover areas of particular interest. It would also use material developed since then, including some illustrative examples of how generators could be used to model certain problems in finance (the benchmark pricing problem, the refdata problem, &c.)
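A minimal sketch of the generator-pipeline style described above, applied to a toy pricing stream; the stage names and data are invented, not the presenter's code.

```python
def parse(lines):
    """Lazily turn raw CSV-ish lines into (symbol, price) records."""
    for line in lines:
        symbol, price = line.strip().split(",")
        yield symbol, float(price)

def only(symbol, records):
    """Filter stage: pass through one symbol's records."""
    return (r for r in records if r[0] == symbol)

def running_mean(records):
    """Stateful stage: yield the mean price seen so far."""
    total = count = 0
    for _, price in records:
        total += price
        count += 1
        yield total / count

# Stages compose by wrapping; nothing executes until the pipeline is
# consumed, which is where the laziness and caching opportunities live.
lines = ["AAPL,10.0", "MSFT,30.0", "AAPL,20.0"]
means = list(running_mean(only("AAPL", parse(lines))))
```

Because each stage only sees an iterable, the control structure is inverted: the consumer drives the computation, and any stage can be swapped for a cached or specialized variant without touching the others.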

The goal of Disco has been to be a simple and usable implementation of MapReduce. To keep things simple, this MapReduce aspect has been hard-coded into Disco, both in the Erlang job scheduler, as well as in the Python library. To fix various issues in the implementation, we decided to take a cold hard look at the dataflow in Disco's version of MapReduce. We came up with a generalization that should be more flexible and hence also more useful than plain old MapReduce. We call this the Pipeline model, and we hope to use this in the next major release of Disco. This will implement the old MapReduce model in terms of a more general programmable pipeline, and also expose the pipeline to users wishing to take advantage of the optimization opportunities it offers.

If time permits, we will also discuss other aspects of the Disco roadmap, and the future of the Disco project.

HDF5 is a hierarchical, binary database format that has become a de facto standard for scientific computing. While the specification may be used in a relatively simple way (persistence of static arrays) it also supports several high-level features that prove invaluable. These include chunking, ragged data, extensible data, parallel I/O, compression, complex selection, and in-core calculations. Moreover, HDF5 bindings exist for almost every language - including two Python libraries (PyTables and h5py).

This tutorial will discuss tools, strategies, and hacks for really squeezing every ounce of performance out of HDF5 in new or existing projects. It will also go over fundamental limitations in the specification and provide creative and subtle strategies for getting around them. Overall, this tutorial will show how HDF5 plays nicely with all parts of an application making the code and data both faster and smaller. With such powerful features at the developer's disposal, what is not to love?!

This tutorial is targeted at a more advanced audience which has a prior knowledge of Python and NumPy. Knowledge of C or C++ and basic HDF5 is recommended but not required.

This tutorial will require Python 2.7, IPython 0.12+, NumPy 1.5+, and PyTables 2.3+. ViTables and matplotlib are also recommended. These may all be found in Linux package managers and are also available through EPD or easy_install; ViTables may need to be installed independently.

Python has been an important tool for analysis and manipulation of scientific data. This has traditionally taken the form of large datasets on disk or in local databases, which are then processed by sophisticated numerical and scientific libraries (SciPy and friends). Increasingly, science is becoming a collaborative enterprise where "big data" is generated in multiple locations and analyzed by multiple research groups.

In this talk we discuss how Python data analysis can help scientists work more collaboratively by integrating Web APIs to access remote data. We will discuss the details of this approach as applied to the Materials Project, a Department of Energy project that aims to remove the guesswork from materials design using an open database of computed properties for all known materials. Using the Python Materials Genomics (pymatgen) analysis package, Materials Project data can be seamlessly analyzed alongside local computed and experimental data. We will describe how we make this data available as a web API (through Django) and how we provide access to both data and analysis under a single library. The talk will go over the technology stack and demonstrate the potential power of these tools within an IPython notebook. We will finish by describing plans to extend this work to address key challenges for distributed scientific data.

IPython is a great tool for doing interactive exploration of code and data. IPython.parallel is the part of IPython that enables interactive exploration of parallel code, and it aims to make distributing your work on local clusters or AWS simple and straightforward. The tutorial will cover the basics of getting IPython.parallel up and running in various environments, and how to do interactive and asynchronous parallel computing with IPython. Some of IPython's cooler interactive features will be demonstrated, such as automatically parallelizing code with magics in the IPython Notebook and interactive debugging of remote execution, all with the help of real-world examples.
IPython has evolved from an enhanced interactive shell into a large and fairly complex set of components that include a graphical Qt-based console, a parallel computing framework and a web-based notebook interface. All of these seemingly disparate tools actually serve a unified vision of interactive computing that covers everything from one-off exploratory codes to the production of entire books made from live computational documents. In this talk I will attempt to show how these ideas form a coherent whole and how they are represented in IPython's codebase. I will also discuss the evolution of the project, attempting to draw some lessons from the last decade as we plan for the future of scientific computing and data analysis.
Abstract coming

Luigi is Spotify's recently open sourced Python framework for batch data processing including dependency resolution and monitoring. We will demonstrate how Luigi can help you get started with data processing in Hadoop MapReduce as well as on your local workstation.

Spotify has terabytes of data being logged by backend services every day, for purposes ranging from debugging to reporting. The logs are basically huge semi-structured text files that can be parsed using a few lines of Python. From this data, aggregated reports need to be created, data needs to be pushed into SQL databases for internal dashboards, related artists need to be calculated using complex algorithms, and many other tasks need to be performed, many of which have to be run on a daily or even hourly basis.

A lot of the initial processing steps are very similar for the many data products that are produced, and instead of re-doing a lot of work, intermediate results are stored and form dependencies for later tasks. The dependency graph forms a data pipeline.

Luigi was created for managing task dependencies, monitoring the progress of the data pipeline and providing frameworks for common batch processing tasks.
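As an illustration of the dependency-resolution idea described above (not Luigi's actual API), here is a sketch that orders a toy pipeline so each task runs only after the tasks it requires; all names are invented.

```python
def run_order(deps):
    """Return tasks in an order where every task comes after its
    dependencies. `deps` maps task -> list of required tasks (a DAG)."""
    order, done = [], set()

    def visit(task):
        if task in done:
            return
        for requirement in deps.get(task, []):  # resolve upstream first
            visit(requirement)
        done.add(task)
        order.append(task)

    for task in deps:
        visit(task)
    return order

# Toy pipeline: parse logs, then aggregate, then load a dashboard table.
pipeline = {
    "dashboard": ["aggregate"],
    "aggregate": ["parse_logs"],
    "parse_logs": [],
}
order = run_order(pipeline)
```

Memoizing completed tasks (`done`) is also what lets intermediate results be stored and reused across data products instead of recomputed.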

As a professional data scientist, I'm very interested in how I (and others) can become completely free from R. It's clear that R's main advantage is CRAN and its large cohort of highly skilled and specialized contributors. CRAN presents potential contributors with a centralized and well defined model for designing, creating, and publishing packages for easy use by the R community. In my talk I'll explore the process of creating and contributing a package for the Python statistical / data science community using my experience with the MARS algorithm as a case study and compare that process with my past experience contributing packages to CRAN. I hope to stimulate discussion around how the Python community can work together to make Python viable as a complete R replacement.

I will be discussing the approaches taken by the Editor Engagement Experimentation team at the Wikimedia Foundation to discover the new site features that lead to stronger collaborative contributions from editors and readers. The focus will be on how we define, gather and analyze our metrics [2,3,4] and how these have been exposed via a RESTful API built with Flask.

I'll also discuss the experimental results of new features (article feedback, post-edit feedback) and improved ones (account creation) in the context of the analytics implementation with the "e3_analysis" [3,4] python package. Finally, I will give an overview of the work we are carrying out on ranking the quality of reader feedback comments using the pybrain [5] and mdp [6] machine learning and data processing packages.

  1. [1]
  2. [2]
  3. [3]
  4. [4]
  5. [5]
  6. [6]
Our data pipeline is growing like crazy, processing more than 30 terabytes of data every day and more than tripling in the last year alone. In 2011, we moved our data pipeline to a Hadoop stack in order to enable horizontal scalability for future growth. Our optimization tools used for data exploration, aggregations, and general data hackery are critical for updating budgets and optimization data. However, these tools are built in Python, and integrating them with our Hadoop data pipeline has been an enormous challenge. Our continued explosive growth demands increased efficiency, whether that's in simplifying our infrastructure or building more shared services. Over the past few months, we evaluated multiple solutions for integrating Python with Hadoop including using Hadoop Streaming, PIG with Jython UDFs, writing MapReduce in Jython, and of course, why not just do it in Java? In our talk, we'll explore the different Python-Hadoop integration options, share our evaluation process and best practices, and invite an interactive dialogue of lessons learned.
This talk is about PyCascading, an end-to-end framework to script Hadoop in Python. Traditional ways of implementing Hadoop jobs either involve chaining streaming Python map-reduce stages by hand, or using a language other than Python. PyCascading is a Python wrapper for Cascading, and therefore data flows are defined and manipulated intuitively and fully in Python, without requiring experience with the MapReduce paradigm. In this talk I introduce the basic concepts and show example applications using dynamic social network data.
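For contrast with PyCascading's higher-level flows, here is a minimal sketch of the hand-chained streaming style it replaces: a word-count mapper and reducer speaking Hadoop Streaming's tab-separated line protocol, with the shuffle phase simulated by a plain sort.

```python
import itertools

def mapper(lines):
    """Map stage: emit 'word<TAB>1' for every word, one pair per line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Reduce stage: sum counts per word. Input arrives grouped by key,
    as Hadoop's sort/shuffle guarantees."""
    keyed = (pair.split("\t") for pair in sorted_pairs)
    for word, group in itertools.groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the shuffle phase with a sort, then run the reduce stage.
mapped = sorted(mapper(["to be or not to be"]))
result = dict(line.split("\t") for line in reducer(mapped))
```

In a real job the mapper and reducer would each read stdin and write stdout in separate processes; PyCascading's appeal is expressing the same flow as one Python program.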

In this talk we will introduce the typical predictive modeling tasks on "not-so-big-but-not-quite-small-either" data that benefit from distributing the work over several cores or nodes in a small cluster (e.g. 20 * 8 cores).

We will talk about cross validation, grid search, ensemble learning, model averaging, numpy memory mapping, Hadoop or Disco MapReduce, MPI AllReduce and disk & memory locality.

We will also feature some quick demos using scikit-learn and IPython.parallel from the notebook on a spot-instance EC2 cluster managed by StarCluster.
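Cross validation is a good example of why these workloads parallelize so well: k folds yield k independent train/test jobs. A dependency-free sketch of the split (helper name invented):

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, test) index pairs."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

# Each sample lands in exactly one test fold, so the k model fits are
# independent and can run on separate cores or cluster nodes.
splits = k_fold_indices(10, 3)
```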

The Data Science team at Vast builds data products informed by the behavior of consumers making big purchases. Our big data is billions of user interactions with millions of pieces of inventory. Recently we have adopted a data processing, analysis, and visualization environment based on remote access to IPython Notebook hosted by a powerful compute server.

Our Data Science environment is inspired by a Development environment proposed by blogger Mark O'Connor. O'Connor advocates using an iPad as a thin client to connect to a more powerful server in the cloud. The combination of tablet plus server is better than a laptop for several reasons including:

  • The tablet is more portable and offers longer battery life than a laptop;
  • The server offers better performance (more and faster cores, more RAM, more cache) than a laptop;
  • Laptops run loud and hot; the noise and heat of the server need not be close to the tablet or the ears and lap of the user;
  • The server is always running and the tablet can wake up and reconnect instantly;

IPython Notebook is the keystone of our environment. It enables us to use the tablet browser as a thin client to work with our favorite Python libraries, including matplotlib for visualization, scikit-learn for predictive modeling, and pandas for processing and aggregation.

In this talk, I'll discuss configuring the Notebook server and the tablet client. I'll also show examples and results of actual analyses performed in this environment.

Exploratory analysis and predictive modeling of time series is an enormously important part of practical data analysis. From basic processing and cleaning to statistical modeling and analysis, Python has many powerful and high productivity tools for manipulating and exploring time series data using numpy, pandas, and statsmodels.

We will use practical code examples to illustrate important topics such as:

  • resampling
  • handling of missing data
  • intraday data filtering
  • moving window computations
  • analysis of autocorrelation
  • predictive time series models
  • time series visualizations
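A short sketch touching several of these topics with pandas; the series is synthetic and the method choices are illustrative.

```python
import numpy as np
import pandas as pd

# One week of hourly observations with a short gap of missing data.
idx = pd.date_range("2013-03-18", periods=24 * 7, freq="h")
ts = pd.Series(np.arange(len(idx), dtype=float), index=idx)
ts.iloc[5:8] = np.nan                       # simulate missing readings

filled = ts.interpolate()                   # handling of missing data
daily = filled.resample("D").mean()         # resampling to daily frequency
rolling = filled.rolling(window=24).mean()  # moving window computation
```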

We are building a machine-learning platform that makes efficient and accurate learning algorithms available in an easy-to-use service. In this presentation, I will describe how the platform works and how we're using Python to make it scalable and accessible.

Machine learning is an active field of data science, where sophisticated models are "trained" on data and used to enable human-like cognition in data analysis pipelines and data-heavy applications. Data scientists need the most efficient and most accurate machine-learning implementations, while developers need on-ramps that make it easy to incorporate machine learning into their applications.

Highlights of our platform include one-step data ingestion and model building, validation, hosting, integration and sharing. A domain intelligence "marketplace" enables domain-specific knowledge to be incorporated in a model with a click (or a "git push") and is scaled automatically to handle large datasets. We use Python and a range of cloud and data frameworks to make this possible, including Anaconda, PiCloud, Pandas and PyTables.

Simulation has become an indispensable research tool across different scientific disciplines ranging from neuroscience to econometrics and quantitative finance. These computational simulations often involve parameters which have to be optimized on data. This parameter optimization gets increasingly challenging the more complex and longer simulations take to run. Cloud services like Amazon Web Services (AWS) provide a compelling tool in scaling this optimization problem by offering computing resources that allow everyone to spawn their own personal cluster within minutes.

With a focus on algorithmic trading models, in this talk I will show how large-scale simulations can be optimized in parallel in the cloud. Specifically, I will (i) provide a tutorial on how trading strategies of varying sophistication can be developed using Zipline -- our open-source financial backtesting system written in Python; (ii) show how StarCluster provides an easy interface to launch an Amazon EC2 cluster; (iii) show how IPython Parallel can then be used to test large parameter ranges in parallel; and (iv) give a brief demo of how a completely web-based, free-of-charge solution can greatly simplify parts of this process. While a case study in quantitative finance, the general approach has direct application to other research domains.
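A local, dependency-free sketch of step (iii): evaluating a parameter grid in parallel. The stdlib executor stands in for IPython Parallel's load-balanced view, and the objective function is a toy stand-in for a real backtest; all names and the grid are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def backtest(params):
    """Stand-in for a trading-strategy simulation: score one
    (short_window, long_window) moving-average pair (toy objective
    peaking at short=10, long=50)."""
    short, long_ = params
    return params, -(short - 10) ** 2 - (long_ - 50) ** 2

# Evaluate the whole grid in parallel; on a real cluster each call
# would be a full simulation shipped to a worker node.
grid = list(product(range(5, 16), range(40, 61, 5)))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(backtest, grid))

best_params, best_score = max(results, key=lambda r: r[1])
```

Because each parameter pair is simulated independently, the speedup scales with the number of workers until the grid is exhausted.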

tutorial abstract coming