Blaze is a NumPy/Pandas interface to big data systems like SQL, HDFS, and Spark. Blaze provides Python developers access to the rich analytic processing available both within the Python ecosystem and beyond.
Internally, Blaze is a lightweight data modeling language (expressions with type information) alongside a set of interpreters (Python, SQL, Spark, MongoDB, ...). The modeling language provides an intuitive and familiar user experience; the interpreters connect that experience to a wide variety of data technologies. This combination lets developers construct connections to novel technologies, and those connections let users interact with their system of choice, whether that system is a single CSV file or a large HDFS cluster running Impala.
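As a rough illustration of the expression/interpreter split, here is a minimal sketch (assuming a recent Blaze release where `symbol`, `by`, and `compute` are available; this is not code from the talk):

```python
from blaze import symbol, by, compute
import pandas as pd

# an abstract table expression: just a name and a datashape, no data attached
t = symbol('t', 'var * {name: string, amount: int64}')
expr = by(t.name, total=t.amount.sum())

# the same expression can be handed to different interpreters;
# here it is computed against a pandas DataFrame
df = pd.DataFrame({'name': ['a', 'b', 'a'], 'amount': [1, 2, 3]})
print(compute(expr, df))
```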
This is followed by a second talk using Blaze in the wild.
Ever wonder how Google Chrome detects the language of every webpage you visit?
What is data science and how can you use Python to do it? In this talk, I'll teach you the data science process OSEMN, while creating a language prediction algorithm, utilizing nothing but Python and data from Wikipedia!
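For a flavor of what such an algorithm can look like, here is a simple character n-gram sketch (not necessarily the approach taken in the talk):

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a piece of text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def build_profile(articles, n=2):
    """Aggregate n-gram counts over a pile of Wikipedia articles in one language."""
    profile = Counter()
    for article in articles:
        profile.update(char_ngrams(article, n))
    return profile

def predict_language(text, profiles):
    """Pick the language whose n-gram profile best matches the text."""
    grams = char_ngrams(text)
    def score(profile):
        total = float(sum(profile.values())) or 1.0
        return sum(count * profile[g] / total for g, count in grams.items())
    return max(profiles, key=lambda lang: score(profiles[lang]))

# profiles = {'en': build_profile(english_articles), 'de': build_profile(german_articles)}
# predict_language('ein kleines Beispiel', profiles)
```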
As an engineer, analyst, or scientist, sharing your work with someone outside of your immediate team can be a challenge. End users fill many roles, with a wide range of technical skill and often no familiarity with Python or the command line. Findings, key results, and models are frequently boiled down to static graphs, tables, and figures presented in short reports or slideshow presentations. However, engaging research and data analysis is interactive, anticipating the users’ questions and giving them the tools to answer those questions with a simple and intuitive user interface.
Browser-based applications are an ideal vehicle for delivering these types of interactive tools, but building a web app requires setting up backend applications to serve up content and creating a UI with languages like HTML, CSS, and JavaScript. This is a non-trivial task that can be overwhelming for anyone unfamiliar with the web stack.
Spyre is a web application framework meant to help Python developers who may have little knowledge of how web applications work, much less how to build them. Spyre takes care of setting up both the front end and back end of your web application. It uses CherryPy to handle HTTP request logic and Jinja2 to auto-generate all of the client-side nuts and bolts, allowing developers to quickly move the inputs and outputs of their Python modules into a browser-based application. Inputs, controls, outputs, and the relationships between all of these components are specified in a Python dictionary. The developer need only define this dictionary and override the methods needed to generate content (text, tables, and plots).
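A minimal app, loosely following Spyre's documented quickstart pattern (attribute and method names are from memory and may differ between versions), might look like this:

```python
from spyre import server

class SimpleApp(server.App):
    title = "Simple App"

    # inputs, controls, and outputs are all declared as plain dictionaries
    inputs = [{"type": "text",
               "key": "words",
               "label": "write here",
               "value": "hello world",
               "action_id": "simple_html_output"}]

    outputs = [{"type": "html",
                "id": "simple_html_output"}]

    # override only the content-generation methods you need
    def getHTML(self, params):
        return "Here are the words you wrote: <b>%s</b>" % params["words"]

app = SimpleApp()
app.launch()
```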
While Spyre apps are launched on CherryPy’s production-ready server, Spyre’s primary goal is to provide a development path to simple, lightweight apps without the need for a designer or front-end engineer. At Next Big Sound, for example, we recently used Spyre to build an app to visualize the effects of sampling parameter values on the volume of tweets collected from one of our data providers.
Web applications like this can turn a highly technical process into a simple tool that can be used by anyone with any level of technical skill.
After you’ve finished the foundational parts of your project -- the data collection, data cleaning, exploration, modeling, and analysis -- Spyre provides a quick and simple way to package the results into an interactive web application that can be viewed by the rest of the world.
Choosing hardware for big data analysis is difficult because of the many options and variables involved. The problem is more complicated when you need a full cluster for big data analytics.
This session will cover the basic guidelines and architectural choices involved in choosing analytics hardware for Spark and Hadoop. I will cover processor core and memory ratios, disk subsystems, and network architecture. This is a practical, advice-oriented session focused on the performance and cost tradeoffs of many different options.
Using machine learning to beat your friends in an NFL confidence pool.
Betting spreads provide a consistent and robust mechanism for encapsulating the variables and predicting outcomes of NFL games. In a weekly confidence pool, spreads also perform very well compared with intuition-based guessing and the supposed knowledge that comes from years of being a fan. We present some attempts, and the accompanying analysis, to use machine learning to improve on the spread method of ranking winners on a weekly basis.
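To make the comparison concrete, here is a toy sketch of both strategies (the column names and data are invented for illustration, not from the talk):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical per-game table; in practice this would come from historical
# results joined with the published point spreads
games = pd.DataFrame({
    'spread':        [-7.0, -3.0, -1.5, -10.0, -4.5, -2.5],  # favorite's spread
    'favorite_home': [1,     0,    1,     1,     0,    1],
    'favorite_won':  [1,     0,    1,     1,     1,    0],
})

# baseline strategy: always pick the favorite, and rank picks by the size of
# the spread (bigger spread = more confidence points)
games['confidence'] = games['spread'].abs().rank(method='first')

# ML attempt: model the probability that the favorite wins, then rank by it
X, y = games[['spread', 'favorite_home']], games['favorite_won']
model = LogisticRegression().fit(X, y)
games['p_favorite_wins'] = model.predict_proba(X)[:, 1]
games['ml_confidence'] = games['p_favorite_wins'].rank(method='first')
print(games)
```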
If so, this talk is for you!
In this talk, I will cover conventional and unconventional techniques that I've used to reduce the size of my data: traditional dimensionality reduction techniques such as PCA and NMF, as well as more esoteric approaches such as Random Projection.
By the end of this talk, you will understand when and how to apply these techniques to your own data.
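For reference, all three families of techniques mentioned above are available in scikit-learn; a minimal sketch on synthetic data (not the talk's examples):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, NMF
from sklearn.random_projection import GaussianRandomProjection

# synthetic stand-in for a wide dataset
X, _ = make_classification(n_samples=1000, n_features=200, random_state=0)

X_pca = PCA(n_components=20).fit_transform(X)
X_nmf = NMF(n_components=20, random_state=0).fit_transform(np.abs(X))  # NMF needs non-negative input
X_rp = GaussianRandomProjection(n_components=20, random_state=0).fit_transform(X)

print(X_pca.shape, X_nmf.shape, X_rp.shape)
```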
Data Science is a comparatively new field, and as such it is constantly changing as new techniques, tools, and problems emerge every day. Traditionally, education has taken a top-down approach in which courses are developed on the scale of years and committees approve curricula based on what might be the most theoretically complete approach. This, however, is at odds with an evolving industry that needs data scientists faster than they can be (traditionally) trained.
If we are to sustainably push the field of Data Science forward, we must collectively figure out how best to scale this type of education. At Zipfian I have seen (and felt) firsthand what works (and what doesn't) when tools and theory are combined in a classroom environment. This talk will be a narrative about the lessons learned trying to integrate high-level theory with practical application, how leveraging the Python ecosystem (numpy, scipy, pandas, scikit-learn, etc.) has made this possible, and what happens when you treat curriculum like product (and the classroom like a team).
This talk illustrates how selective-search object recognition and the latest deep-learning object identification algorithms were applied to the problem of image cropping.
How can you identify the most important part of an image, the part that must not be cropped out when it is shown as a thumbnail? This is a problem faced by media and e-commerce sites, where the space available for an image comes in many sizes and the best portion of the original image must be preserved to maximize its effectiveness.
Selective search is a recent method proposed by Uijlings (U. Amsterdam) et al. that significantly improved the accuracy of object recognition over previous exhaustive-search methods. It allows us to use advanced methods such as convolutional neural networks (deep-learning object identification) to identify the interesting objects contained in an image. With this, the interesting parts of the image can be preserved during cropping, producing effective thumbnails.
The code for selective search is available for MATLAB only, while the deep-learning framework, Caffe, comes with a Python wrapper. We will illustrate how well suited Python is to tying together the results of cutting-edge research in order to solve complex data processing problems.
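As an illustration of the final step only (not the speakers' actual pipeline), here is a sketch of how detector output might be turned into a thumbnail crop:

```python
def best_crop(image_width, image_height, boxes, target_aspect):
    """Choose a crop window of the requested aspect ratio (width / height)
    centered on the highest-scoring detected object.

    boxes: list of (x1, y1, x2, y2, score) tuples from the object detector."""
    x1, y1, x2, y2, _ = max(boxes, key=lambda b: b[4])
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0

    # largest window of the target aspect ratio that fits inside the image
    crop_w = min(float(image_width), image_height * target_aspect)
    crop_h = crop_w / target_aspect

    # center the window on the object, then clamp it to the image bounds
    left = min(max(cx - crop_w / 2.0, 0.0), image_width - crop_w)
    top = min(max(cy - crop_h / 2.0, 0.0), image_height - crop_h)
    return int(left), int(top), int(left + crop_w), int(top + crop_h)

# e.g. crop a 640x480 image to a square thumbnail around the best detection
print(best_crop(640, 480, [(100, 120, 300, 400, 0.9), (10, 10, 50, 50, 0.4)], 1.0))
```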
Visualizations are windows into datasets: they can help generate hypotheses, aid combinatory play to discover trends, and cement insight by providing structure and context.
This talk will touch on the current state of the Python visualization ecosystem, offer some thoughts on iteratively building visualizations, then launch into a data-driven exploration of visualization techniques grounded in the NYC taxicab dataset.
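One example of such a window into the data (the file path and column names here are placeholders, not the talk's actual code):

```python
import pandas as pd
import matplotlib.pyplot as plt

# placeholder path/columns standing in for a slice of the NYC taxi trip records
trips = pd.read_csv('yellow_tripdata_sample.csv', parse_dates=['pickup_datetime'])

# a first, simple view: how trip volume varies over the day
trips['hour'] = trips['pickup_datetime'].dt.hour
trips.groupby('hour').size().plot(kind='bar')
plt.xlabel('pickup hour')
plt.ylabel('number of trips')
plt.show()
```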
(or: how to only forget some things about your ML models instead of literally everything)
You're writing a classifier. So you trained 10 decision trees in October, with several sets of training data, different maximum depths, different scalings, and different features. Some of the experiments went better than others! Now it's November, and you want to go back to the project and start using one of these models. But which one?!
At Stripe, we train models to automatically detect and block fraudulent transactions in real time. We build a lot of models, and we need a way to keep track of all kinds of information about them. I'll talk about a simple tool we built to do exactly that.
This functions as a lightweight lab notebook for ML experiments, and it has been incredibly useful for us (as mere humans). Having a consistent way to look at the results of our experiments means we can compare models on equal footing. No more notes, no more forgetting, no more hand-crafted artisanal visualizations. [1]
[1] You're still allowed to make hand-crafted artisanal visualizations if you want.
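A hypothetical, stripped-down version of the lab-notebook idea (not Stripe's actual tool) is just structured, append-only logging of each experiment:

```python
import hashlib
import json
import time

def log_experiment(params, metrics, path='experiments.jsonl'):
    """Append one record per trained model: what it was, how it did, and when."""
    record = {'timestamp': time.strftime('%Y-%m-%dT%H:%M:%S'),
              'params': params,
              'metrics': metrics}
    record['id'] = hashlib.sha1(
        json.dumps(record, sort_keys=True).encode()).hexdigest()[:8]
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')
    return record['id']

# e.g. after training one of October's decision trees (illustrative numbers only)
log_experiment({'model': 'decision_tree', 'max_depth': 6, 'scaling': 'standard'},
               {'validation_auc': 0.91})
```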
SQL is still the bread-and-butter of the data world, and data analysts/scientists/engineers need to have some familiarity with it as the world runs on relational databases.
When first learning pandas (and coming from a database background), I found myself wanting to be able to compare equivalent pandas and SQL statements side-by-side, knowing that it would allow me to pick up the library quickly, but most importantly, apply it to my workflow.
This tutorial will provide an introduction to both syntaxes, allowing those inexperienced with either SQL or pandas to learn a bit of both, while also bridging the gap between the two so that practitioners of one can learn the other from their own perspective. Additionally, I'll discuss the tradeoffs between the two and why one might be better suited to some tasks than the other.
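For a taste of the side-by-side format, here is one hypothetical pairing (a tiny made-up table, not the tutorial's dataset):

```python
import sqlite3
import pandas as pd

# a small, made-up table just for the comparison
tips = pd.DataFrame({'day': ['Thu', 'Thu', 'Fri', 'Fri'],
                     'total_bill': [10.0, 20.0, 15.0, 25.0],
                     'tip': [1.5, 3.0, 2.0, 5.0]})

# SQL version: average tip per day
con = sqlite3.connect(':memory:')
tips.to_sql('tips', con, index=False)
sql_result = pd.read_sql_query(
    "SELECT day, AVG(tip) AS avg_tip FROM tips GROUP BY day", con)

# pandas version of the same query
pandas_result = tips.groupby('day', as_index=False)['tip'].mean()

print(sql_result)
print(pandas_result)
```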
Attendees will learn how to use data to generate opportunities for their organizations and to rank certain types of risk on scales similar to the ones used by rating agencies, opening up profitable risk-transfer opportunities.
This presentation documents a simplified analytical approach developed by the author to illustrate the key elements in the design of a parametric catastrophe bond, a type of Insurance Linked Security (ILS). The model was developed in order to 1) help potential clients of an investment bank understand the advantages and disadvantages of insurance linked securities vs. traditional insurance, 2) help government decision makers draft policies to accommodate the product in their risk management efforts, 3) expand the potential market of ILS buyers by sharing the analytical work with rating agencies and CDO managers in a reproducible way, and 4) help investment banks win structuring mandates.
Although the original model was developed using a combination of C/C++, Visual Basic, ActiveX, and an Excel front end, this presentation will show a modern, Python-based approach.
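To give a sense of the kind of calculation involved, here is a purely illustrative sketch of a parametric trigger under an assumed hazard model (the distribution and numbers are invented, not the author's model):

```python
import numpy as np

# hypothetical parametric trigger: the bond's principal erodes when a modeled
# hazard index (e.g. peak wind speed at a reference site) exceeds a threshold
np.random.seed(0)
n_years = 100_000
hazard_index = np.random.gumbel(loc=40.0, scale=8.0, size=n_years)  # assumed hazard model

attachment = 70.0   # index level at which principal starts to erode
exhaustion = 90.0   # index level at which principal is fully lost
loss_fraction = np.clip((hazard_index - attachment) / (exhaustion - attachment), 0.0, 1.0)

print('annual probability of attachment:', (loss_fraction > 0).mean())
print('expected loss (fraction of principal):', loss_fraction.mean())
```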