
Presentation Abstracts



Data Science for Computational Journalism

Apr 26 - 3:20 p.m.
Chengkai Li
Politicians make claims of "facts" all the time. Often these claims are false or misleading on important topics, due to careless mistakes or even deliberate manipulation of information. Journalists and reporters spend a good amount of time checking the veracity of factual claims that matter to the public. We have been working on a computational journalism project for several years; part of the project's goal is to help journalists check such claims. In this talk, I will provide an overview of the project and show specifically how we use Python to fact-check claims made in presidential debates.

A Game Engine For Engineers

Apr 25 - 2:20 p.m.
Oliver Nagy
"Azrael is a game engine for engineers. Unlike traditional engines it emphasises accurate physics, runs in the Cloud, and offers a language agnostic API. Its main purpose is to make it easy for engineers to build, study, and control complex physical systems, for instance how to auto pilot a space ship; or a fleet thereof; in formation; through an Asteroid belt... I will show a live demo to illustrate the concept. It uses a simple control algorithm to manoeuvre an object to a pre defined position in space. Once there it will maintain that position despite random collisions with other objects."

A Python Module for Data Analytics with Interval Computing

Apr 26 - 11:40 a.m.
Chenyi Hu
Interval computing can play an important role in data analysis. In this talk, the speaker will introduce interval computation and its applications in data analysis. A Python module under development for data analysis with interval methods will also be introduced.
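Since the speaker's module is still in development, here is a self-contained sketch of the core idea (not the speaker's code): an interval represents all reals between its bounds, and arithmetic propagates the bounds so the result encloses every possible outcome.

    # Minimal interval arithmetic: [lo, hi] covers every real in between,
    # and operations keep the enclosure honest.
    class Interval:
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi

        def __add__(self, other):
            return Interval(self.lo + other.lo, self.hi + other.hi)

        def __mul__(self, other):
            products = [self.lo * other.lo, self.lo * other.hi,
                        self.hi * other.lo, self.hi * other.hi]
            return Interval(min(products), max(products))

        def __repr__(self):
            return "[%g, %g]" % (self.lo, self.hi)

    # Measurement uncertainty carried through a computation:
    a, b = Interval(1.9, 2.1), Interval(-0.5, 0.5)
    print(a + b, a * b)   # [1.4, 2.6] [-1.05, 1.05]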

A Thorough Machine Learning Pipeline via Scikit-Learn

Apr 24 - 9 a.m.
Bugra Akyildiz
Scikit-Learn is one of the most popular machine learning libraries written in Python; it has quite an active community and extensive coverage of a number of machine learning algorithms. It has feature extraction, feature and model selection algorithms, and validation methods for building a modern machine learning pipeline. It also provides more advanced structures that make the machine learning pipeline and flow even easier, such as feature unions, pipelines, grid parameter search, and randomized parameter search. This tutorial introduces common recipes for building a modern machine learning pipeline for different input domains and shows how one might construct the components using the advanced features of Scikit-Learn. Specifically, I will go over the following steps in Scikit-Learn:

- Introduce various feature extraction methods for image and text
- Explain how one might use various feature selection algorithms to capture information-rich features while ignoring irrelevant or redundant ones
- Show various approaches and methods for parameter optimization within Scikit-Learn
- Explain and compare different validation scores and metrics for evaluating model accuracy
- Introduce how one could do model selection
- Show how one could deploy the model into production

Then I will introduce more advanced features and methods (a minimal sketch follows below):

- Pipeline structures and parameter optimization within the grid search
- Randomized search, to make the parameter search more intelligent and efficient
- Feature unions, to make the features more diverse and rich
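As a taste of those advanced structures, a minimal sketch combining a feature union, a pipeline, and a grid search, using the current scikit-learn module layout and a stand-in text dataset:

    # FeatureUnion + Pipeline + GridSearchCV on placeholder data.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    data = fetch_20newsgroups(subset='train',
                              categories=['sci.space', 'rec.autos'])

    features = FeatureUnion([
        ('tfidf', TfidfVectorizer()),                   # weighted terms
        ('counts', CountVectorizer(max_features=500)),  # raw counts
    ])
    pipe = Pipeline([('features', features), ('clf', LogisticRegression())])

    grid = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=3)
    grid.fit(data.data, data.target)
    print(grid.best_params_, grid.best_score_)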

Assimilation - From C to Python

Apr 25 - 3:20 p.m.
Paul Joireman

We all learned to program in a particular way: perhaps you started out using Basic, Pascal, C, or Fortran. If you're younger, maybe Java was your first language, or maybe you came to programming through the web using JavaScript, PHP, Perl, or Ruby. Python allows you to work with multiple programming paradigms (procedural, object-oriented, functional). When you first started using Python, you were able to find features that looked familiar (loops, conditionals, classes, ...) and that worked how you expected them to work. This is great for adoption, but there is a hidden cost: it doesn't force you to change anything. You can go on writing code as you always have and it will work. But there could be a better way of doing things, one that is more efficient and easier to understand and explain. This is especially true in the realm of data processing.

In my talk I'm going to present case studies of some simple algorithms that I had to implement in Python in the course of my data analysis work. I'll show my initial naive implementations and how I slowly optimized them, reducing both the amount of code necessary and the execution time. I'll share some of the things I learned, the habits I had to break, and the assumptions I needed to discard, in the hope that these will be instructive and beneficial to others just starting out with Python and data analysis. I'll integrate a look at some of the functionality provided by the NumPy, SciPy, and pandas packages and how I use them to simplify and clarify my day-to-day work.
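In the spirit of the talk, an illustrative before-and-after of my own (not one of the speaker's case studies): a moving average written first as a C-style loop, then as a couple of NumPy calls:

    # C-style nested loops versus the NumPy idiom for a moving average.
    import numpy as np

    def moving_average_loop(x, w):
        out = []
        for i in range(len(x) - w + 1):
            total = 0.0
            for j in range(w):
                total += x[i + j]
            out.append(total / w)
        return out

    def moving_average_numpy(x, w):
        # cumulative-sum trick: O(n), no Python-level loops
        c = np.cumsum(np.insert(x, 0, 0.0))
        return (c[w:] - c[:-w]) / w

    x = np.random.rand(10000)
    assert np.allclose(moving_average_loop(x, 5), moving_average_numpy(x, 5))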

Beyond t-tests: how to conclude an online experiment using Python

Apr 25 - 3:20 p.m.
Volodymyr Kazantsev
"A/B testing and control-group testing are very well-known techniques to learn about the market and consumer preferences. In reality, however, lots of companies make incorrect conclusions about their experiments, due to the lack of statistical knowledge. Moreover, there is surprisingly little material available about types of statistical tests that are appropriate in online setting. In this talk, I will try to establish a conceptual framework that we will use to analyse different types of experiments, starting with a simple “conversion” testing using null-hypothesis tests and move to more advanced topics, finishing with actual %% uplift measurement of heavily skewed data, such as payments data and long-term customer lifetime value (CLV), using Bayesian Credible Intervals. Agenda (Draft): 1. types of tests in online setting: product A/B, exclusion groups for marketing activities 2. goals of a test: yes/no vs. %% Uplift. Why do we care about the measuring the uplift? 3. brief overview of relevant inferential statistics (and how to do all that in python): central limit theorem, confidence intervals, t-test 4. We will apply those technique to real-life like data in ipython notebook 5. I will introduce and apply two alternative techniques that are actually used in production to reject null-hypothesis when dealing with heavily-skewed data 6. How to measure the uplift of CLV"

Blaze in the Real World

Apr 25 - 11:40 a.m.
Phillip Cloud
Blaze is a library for harnessing the power of big-data technologies. We show motivating use cases illustrating why you might want to use Blaze, including a comparison of out-of-core pandas to other backends designed to scale both horizontally and vertically. Time permitting, we'll show how easy it is for users of Blaze to scratch their own itch by hooking an existing API into Blaze via a small set of multiply dispatched functions.
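A flavor of Blaze's uniform query interface, assuming a recent Blaze release in which the entry point is data(); the CSV file and its columns are hypothetical:

    # The same Blaze expression runs against a CSV or a database backend.
    from blaze import data, by, compute

    acc = data('accounts.csv')                        # a local CSV...
    # acc = data('postgresql://user@host::accounts')  # ...or a remote table
    totals = by(acc.name, total=acc.amount.sum())     # group-by, written once
    print(compute(totals))                            # executed by the backend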

Bokeh Tutorial

Apr 24 - 1 p.m.
Bryan Van De Ven
Coming Soon

Briefly: A Python DSL to Scale Complex MapReduce Pipelines

Apr 24 - 10:30 a.m.
Chou-han Yang
"Briefly, a open-source project designed to tackle the challenge of simultaneously handling the flow of Hadoop and non-Hadoop tasks. In short, Briefly is a Python-based, meta-programming job-flow control engine for big data processing pipelines. We called it Briefly because it provides us with a way to describe complex data processing flows in a very concise way. At BloomReach, we have hundreds of Hadoop clusters running with different applications at any given time. From parsing HTML pages and creating indexes to aggregating page visits, we all rely on Hadoop for our day to day work. The job sounds simple, but the challenge is to handle complex operational issues without compromising code quality, as well as the ability to control a group of Hadoop clusters to maximize efficiency."

Building Python Data Applications with Blaze and Bokeh

Apr 24 - 1 p.m.
Andy R. Terrel, Christine Doig
"We use the Blaze and Bokeh libraries to interactively query and visualize large datasets through Python. Blaze provides a consistent query experience on data ranging from a small local CSV files to a large remote Impala or Spark clusters. It automates data migration and brings the power of other database systems into the hands of the armchair analyst. Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. It provides elegant, concise construction of novel graphics in the style of D3.js, but also delivers this capability with high-performance interactivity over large or streaming datasets."

Building machine learning applications in Python

Apr 24 - 3:15 p.m.
Rajat Arya, Yucheng Low
In this hands-on tutorial, we walk you through the steps to build and deploy a sentiment classifier in Python. The task is to learn to classify reviews from the Yelp! reviews dataset as either funny or not funny. The tasks include doing some basic feature engineering, building and evaluating machine learning models, and deployment. The software is GraphLab Create. Bring your laptop! Please check back prior to the tutorial for setup instructions.
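As a rough sketch of that workflow, assuming GraphLab Create's 1.x API; the file and column names are placeholders for the Yelp data:

    # Bag-of-words features plus a logistic classifier in GraphLab Create.
    import graphlab as gl

    sf = gl.SFrame.read_csv('yelp_reviews.csv')
    sf['word_counts'] = gl.text_analytics.count_words(sf['text'])
    train, test = sf.random_split(0.8, seed=1)

    model = gl.logistic_classifier.create(train, target='funny',
                                          features=['word_counts'])
    print(model.evaluate(test))   # accuracy, confusion matrix, ...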

CSVKit

Chris Groskopf
Hands-on training in using the open-source csvkit library to clean messy formats and prepare data for analysis. (Chris is the main author and maintainer of the project.)

Calling Distributed Python from Inside SQL — A Technical Look at Implementing a Scalable SQL/Python Hybrid Platform

Apr 25 - 10 a.m.
Paul Ingram
"We set out to build a fully scalable distributed SQL platform but quickly realized that use cases at that scale were much more complex that simply joining data and easy to digest select statements. To support our data science team and our users, we implemented a distributed python layer that allows our programmers to quickly construct distributed functions that enhance SQL’s native capabilities, including implementing statistical and machine learning algorithms for use across massively parallel data sets. Our talk will cover the motivation behind implementing the python layer as well as the technical challenges we ran into while extending the language and enabling it to run in a truly parallelized fashion. Abstract : The SQL language allows many data scientists to quickly manipulate raw data into data sets useful for analysis. However, SQL has a number of restrictions that make simple tasks such as efficiently parsing rows of data difficult to do without writing custom procedures. Couple that with the increasingly varied data types and data sets, and SQL is quickly eschewed as the language of choice for complex, big data manipulation. Enter Python — a language perfectly suited to the task. Marrying python and SQL has allowed our data science teams to pick the language that best fits their specific needs and construct functions that allow us to implement repeatable processes at scale and make complex calls accessible to clients and team members who are only familiar with SQL. We will introduce 3 key paradigms for parallelizing python and building the SQL integration and discuss the technical thought process behind each : 1. Implementing a master-worker communication paradigm for Python 2. Developing a custom distributed data structure 3. Distributed package access At the conclusion of the talk, we will walk through a basic use case and demonstrate how the interplay between SQL and python is made real time. "

Can PYTHON hear PULSARS?

David Saroff
Can pulsars in the Andromeda galaxy be heard from Green Bank, West Virginia? What if the external ear is a dish 300 feet across? What if the inner ear is SiGe cooled to -260 C? What if the auditory cortex is PYTHON on 40 Xeons? What if radio waves are what is being listened for? What if a pulsar is what supernova explosions of stars leave behind, spinning?

Deploying scikit-learn Models in Production

Apr 25 - 2:20 p.m.
Rajat Arya

Machine learning should be everywhere. Applications today have the opportunity to leverage all the data being collected about users' interactions and behavior. Unfortunately, machine learning at scale is mostly absent from production systems. Training models using scikit-learn is useful, but it is difficult to take this code to production. Why is it so painful to deploy models in a scalable way? What are the options, and what challenges exist today? After exploring the current options, I will present Dato Predictive Services, which we developed to address these challenges. Dato Predictive Services enables deploying and managing scikit-learn models in an elastic, scalable, fault-tolerant, low-latency cluster of machines on AWS and YARN. With Dato Predictive Services, in one command you can take arbitrary Python and deploy it as a REST service.

This will be a hands-on talk, walking through code with multiple demonstrations. Bring your laptop to follow along!
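Dato's one-command deployment is not reproduced here; as a generic illustration of the problem it automates (a trained scikit-learn model served over REST), a hand-rolled Flask sketch with none of the scaling pieces:

    # A scikit-learn model behind a minimal REST endpoint. This is a generic
    # illustration, not Dato Predictive Services' API.
    from flask import Flask, request, jsonify
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    model = LogisticRegression().fit(iris.data, iris.target)

    app = Flask(__name__)

    @app.route('/predict', methods=['POST'])
    def predict():
        features = request.get_json()['features']  # e.g. [5.1, 3.5, 1.4, 0.2]
        return jsonify(prediction=int(model.predict([features])[0]))

    if __name__ == '__main__':
        app.run(port=5000)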

Hashtag Buddies

Apr 26 - 2:20 p.m.
Prabhu Saiprabhu "Sai"
We have been seeing increasing use of hashtags in social media. How can businesses leverage the power of hashtags? By opting to use specific hashtags, customers may be expressing alignment with certain concepts; what would knowing more about such alignment do for businesses? In this discussion, I will walk through the use of topic modeling based on clusters of conversations (primarily tweets) to identify associations between users. I will then show how such associations can be augmented with additional Natural Language Processing (NLP) capabilities toward solving the primary business goal: finding a set of potential lookalikes of a business's premium customers. This presentation will use a combination of Python and Java components on the Hadoop platform.
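A minimal sketch of the topic-modeling step using gensim's LDA; gensim is one possible Python tool for this (the talk's actual components aren't specified), and the tweets below are placeholders:

    # LDA over a toy tweet corpus: hashtags and words cluster into topics.
    from gensim import corpora, models

    tweets = [
        "new #python release is out today",
        "loving the #python data stack",
        "great #coffee downtown this morning",
        "best #coffee and pastries in dallas",
    ]
    texts = [t.lower().split() for t in tweets]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
    for topic in lda.print_topics():
        print(topic)   # top words per topic; users align with topics they tweet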

H₂O Machine Learning

Apr 25 - 10:50 a.m.
Cliff Click
H2O – Now with a Python interface! It's open-source machine learning and in-memory, big-data, clustered computing: Math at Scale. H2O has the world's fastest logistic regression (by a lot!), distributed deep learning, the world's first (and fastest) distributed Gradient Boosting Machine (GBM), plus random forest, PCA, Naive Bayes, Cox proportional hazards, KMeans++, and much more. And now we can do the same data munging in Python as we've been able to do in R for the past year, built on the same REST/JSON-based API. Build test/train splits on tera-scale datasets in seconds, impute the mean, throw out outliers, run group-bys and joins, and do all your basic data munging from your standard laptop Python/IPython/IDE, yet work on giant datasets.
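A rough sketch of that munging from Python, assuming the current h2o package's API (the file, column names, and group-by helper usage are placeholders); the heavy lifting happens in the H2O cluster, not the Python process:

    # Basic H2O data munging driven from a laptop Python session.
    import h2o
    h2o.init()                                   # connect to (or start) a cluster

    df = h2o.import_file("payments.csv")         # parsed in parallel on the cluster
    train, test = df.split_frame(ratios=[0.8])   # test/train split in seconds
    df.impute("amount", method="mean")           # fill missing values
    by_user = df.group_by("user_id").sum().get_frame()
    print(by_user.head())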

Integration With the Vernacular (the NumPy Approach)

Apr 25 - 10:50 a.m.
James Powell
The NumPy model of computation in Python has proven to be one of the most successful ways to integrate high-performance computational code into an application. This talk offers a foundational conceptualization for this approach and discusses its strengths and limitations.

Introducing the World of Commercial Real Estate Data to App Developers

Apr 25 - 4:10 p.m.
Jason Vertrees
Much of the $15.2T commercial real estate (CRE) world is closed and clandestine. This has held the industry back from adopting technological progress, creating inefficiencies across the entire ecosystem. With the power of the internet at its side, RealMassive is beginning to break open this industry. One area on which we now shed light is commercial real estate availability data, or CRE supply-side data. Application developers can now use our free, open RESTful API to augment their applications or services with CRE data and analytics. In this talk, we explore the API and provide a few examples of the power of being able to access and analyze CRE data.
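Consuming a REST API like this from Python typically takes only a few lines with requests; the endpoint URL, parameters, and response fields below are hypothetical placeholders, not RealMassive's documented API:

    # Querying a hypothetical CRE availability endpoint.
    import requests

    resp = requests.get("https://api.example.com/v1/spaces",   # placeholder URL
                        params={"market": "austin", "min_sqft": 5000})
    resp.raise_for_status()
    for space in resp.json().get("results", []):    # hypothetical fields
        print(space.get("address"), space.get("rate"))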

Keynote Title Coming Soon

Apr 25 - 1:20 p.m.
Dana Bauer
Coming Soon

Machine Learning

----------
Coming Soon

Odo: shape-shifting data—a handy tool to guide you from CSV->HDFS and beyond

Apr 26 - 10:50 a.m.
Ben Zaitlen
"Data are always messy and ill-formatted. We spend seemingly unnecessary amounts of hours writing software to convert between common formats, databases, and newer filesystems. Typically, we spend just enough mental energy to get the job done -- hopefully, giving us more time in the next stage of the data pipeline. This results in non-performant, non-reusable, non-extensible code. In this talk we present Odo, a new open-source software package which simplifies and eases common data migration tasks. Odo can seamlessly migrate between CSVs, JSON, Dataframes, and Databases, just as easily as it can migrate between NumPy Arrays, HDF5, HDFS, and S3 -- and everything in between and much more. When choosing a storage format we have to balance several features: size, performance (read/write), chunk-ability, shareability, multi-tenancy, computational target, etc. Odo lets us explore and evaluate various target data containers without much cost. Where possible, Odo takes advantage of performant and feature rich bulk loaders. With a lower cost to play and faster data conversion speeds, a once unfun and boring task can possibly engage us and lead to happier computing down the road. We will cover different real-world use cases and scenarios and compare these with the “common” answers repeated amongst us data mungers"

Open Micro-Hackathon

Apr 24 - 3:15 p.m.
----------
Topic to be announced at tutorials.

Panel - Social Media Analysis in Python

Apr 26 - 1:20 p.m.
Jeb Stone, Jigar Mistry, Rohan Patil
Coming Soon

Python Data Analytics Workshop - NumPy, pandas, matplotlib, and SciPy

Apr 24 - 9 a.m.
Vivian Zhang
Coming Soon

Python for the News

Apr 25 - 11:40 a.m.
Daniel Lathrop
Presentation on the ways The Dallas Morning News is using Python in newsgathering and presentation, including our major effort to train reporters to code in Python (20% of the staff so far).

Python in Scientific Research

Apr 25 - 10 a.m.
Andy R. Terrel
Whether you are modelling an earthquake, a hurricane, or a medical device, Python is there. The language has become so ubiquitous in scientific research that it is the go-to tool. In this presentation, I cover the different modalities in which we see Python being used and how it continues to slither its way into new business. Whether you are running on a large cluster or teaching new students how to get their next big idea done, Python delivers as a language that can be used by all, scales as needed, and doesn't get in your way while doing it.

Python is for the Curious

Apr 26 - 9 a.m.
Jon Riehl
"What is it you are curious about? If you are more curious about your data than your tools, then Python is for you. If you are more curious about your tools than your data, then Python is for you. If you are curious, then Python is for you."

Reproducible Multi-language Data Science with Conda

Apr 26 - 3:20 p.m.
Christine Doig
Reproducibility is one of the main principles of the scientific method: our analyses and results should be reproducible by anyone. As data science projects grow in variety (applications, libraries, standalone analyses...) and complexity (databases, computing engines, multiple programming languages, backwards incompatibilities...), we need solutions that handle reproducibility in every case. In this talk, we'll explore how Conda, a cross-platform package manager written in Python, can make our lives simpler and our data science projects easily shareable and reproducible.
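As a minimal sketch of what this looks like in practice, a project's full stack pinned in an environment.yml that anyone can recreate with one command; the names and versions here are illustrative, not from the talk:

    # environment.yml -- illustrative pins for a hypothetical project
    name: analysis
    dependencies:
      - python=3.4
      - numpy=1.9
      - pandas=0.16
      - r-essentials               # Conda is not limited to Python packages
      - pip:
        - some-pypi-only-package   # hypothetical PyPI dependency

    # Recreate the exact environment anywhere:
    #   conda env create -f environment.yml
    #   source activate analysis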

SFrame: A Scalable, Out-of-Core Dataframe for Machine Learning

Apr 26 - 10 a.m.
Yucheng Low
Machine learning is hard. Machine learning at scale is even harder. Scaling up machine learning requires not just advances in algorithm implementation, but also more scalable data structures. In this talk, we discuss the SFrame, one of the core data structures of a new scalable Python machine learning platform called GraphLab Create. The SFrame was designed to enable the manipulation of tables with billions of rows and thousands of columns on a single machine while maintaining a high degree of performance. It shares many of pandas' and NumPy's capabilities, making it easy for pandas users to get up to speed quickly. In this talk, I will demonstrate the capabilities of the SFrame and its companion data structure, the SGraph, and discuss some of their architecture and design considerations.
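A small taste of the SFrame, assuming GraphLab Create's API; the data and column names are placeholders. Because SFrames are disk-backed, the same code runs on tables far larger than RAM:

    # pandas-like operations on a disk-backed, out-of-core table.
    import graphlab as gl

    sf = gl.SFrame({'user': ['a', 'b', 'a', 'c'],
                    'amount': [10.0, 3.5, 2.5, 8.0]})
    totals = sf.groupby('user', {'total': gl.aggregate.SUM('amount')})
    print(totals)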

State of the Py, 2015

Apr 25 - 9 a.m.
Peter Wang
Coming Soon

TBD

----------
Coming Soon

There's No Place Like Home: Analyzing HPD Police Beat Data with Python and ArcGIS

Apr 26 - 11:40 a.m.
Paige Bailey
What are the safest times to walk or bike around your neighborhood? When Google Maps returns a pedestrian or public-transportation route, is it returning the safest path? How safe is the area you live in, really? I'll show you how I answered these questions and more for my neighborhood -- and how you can do the same for yours! -- using police beat crime data, ArcGIS, and Python's web-scraping capabilities.
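A minimal sketch of the scraping step: pulling rows out of an HTML table of police-beat crime statistics with requests and BeautifulSoup. The URL and table layout below are hypothetical, not HPD's actual site:

    # Scrape a crime-stats HTML table into a list of dicts.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/hpd/beat-stats").text  # placeholder
    soup = BeautifulSoup(html, "html.parser")

    rows = []
    for tr in soup.select("table tr")[1:]:          # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3:
            rows.append({"beat": cells[0], "offense": cells[1],
                         "count": int(cells[2])})
    print(rows[:5])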

To the Future with Jupyter

Apr 26 - 2:20 p.m.
Kyle Kelley
IPython has given us novel ways of interacting with our code, data, documentation, and reporting. It has enabled collaboration over common open-source formats and APIs. Jupyter is the next step in this journey, extending notebook support across languages and multi-user systems and exposing more of the building blocks for advanced tooling on top of these systems. We'll talk about the division of concerns between IPython and Jupyter (hint: same community and devs), which projects are available for consumption, what's at the bleeding edge, and the directions we can head.

Transitioning an idea from Academia to Commercialization: A Startup Story

Apr 25 - 4:10 p.m.
Meltem Ballan
Meltem will share her story with Terastructure, from the inception of the idea to the realization of a commercially viable product in partnership with the University of North Carolina. If you have dreamt of, thought about, or currently have an idea that you want to take to market, you'll not want to miss this presentation.

Using Python to Fight Cyber Crime

Apr 26 - 10:50 a.m.
Kyle Maxwell
Coming Soon

Waiting For Speaker Confirmation

----------
Coming Soon

Where Python Meets Real-Time Data

Apr 26 - 10 a.m.
Denis Akhiyarov
"This talk is about application of Python Scientific Stack for interpolation, plotting, modeling, symbolic computations, curve fitting, optimization and statistical analysis of real-time data from operations around the world. The author is going to describe the workflow: prototyping, development, analysis, testing, embedding into existing applications, troubleshooting and debugging of the models. Experience related to going from novice to intermediate level are going to be described with issues such as: installation and deployment, using IDE, understanding language features, selecting third-party libraries, getting help and support, etc. Some comparison to existing languages used in the company is going to be given."
