PyData NYC | Presentation: Querying 1.6 billion reddit comments with python

Wednesday 2:45 p.m.–3:25 p.m.

Querying 1.6 billion reddit comments with python

Daniel Rodriguez

Audience level:: Intermediate

Description

New tools such as ibis and blaze have given python users the ability to write python expression that get translated to natural expression in multiple backends (spark, impala and more). Attendees will learn how these and other tools allow python to target bigger datasets specially impala. Talk will go through a big data pipeline, moving, converting and querying 1.6 comments from reddit.

Abstract

Since Google started to publish papers about their infrastructure starting with the Map Reduce and Google File System paper, that in time became Hadoop, the amount of tools (from big tech companies like Facebook, Yahoo and Cloudera) to gather and query this increasing amount of data has increased. These tools are often build on languages that run on top of the JVM such as Java (MapReduce, Hive) and Scala (Spark) and in some cases C++ (Impala). There is going to be a discussion on some of this new tools and how they make queries faster, the main focus is going to be Impala and the columnar file format Parquet.

These the Big Data technologies that have more and different requirements than the small/medium data tools data scientists like such as R (dyplr) or Python (pandas). While the medium data tools run a single node the big data technologies run in a cluster of nodes and require a knowledge of DevOps data scientists usually don't have. New tools are coming up to help data scientists fill those missing requirements easier and also allowing them to target the big data technologies within python.

While deploying clusters and install computing frameworks is now a new need and there have been solutions such as STAR Cluster and Spark includes scripts to deploy a spark cluster on EC2 these tools ofter fell short to the requirements data scientists have such as installing packages and having easy access to the cluster. There a some new tools that provide some of the same solutions but also try to give as much freedom as possible to data scientists and at the same time making the deployment of these tools faster using new Configuration Management tools like Salt. We will talk about Anaconda Cluster, a proprietary tool from Continuum Analytics that offer a free 4 node version, and a small alternative called DataScienceBox.

Once a cluster is running its needed to target the big data frameworks from within Python having easy to write expressions that in some cases are transformed to queries to each framework and then are sent to these frameworks to let them do the heavy lifting of the data processing. Spark has always treated Python as a first class language so PySpark has been available for a while, what about the other tools like Impala. New projects have come out recently, they take a different approach than spark and have a write-once target multiple backends expression systems that will be familiar to regular pandas or R users. We will talk about Blaze from Continuum Analytics and Ibis from Cloudera and authored by the author of pandas.

After the presentation and overview of the tools there is going to be a small demo in a running cluster using Blaze and Ibis (to target Impala) to query around 1.65 billion comments from Reddit that were recently made available to the public.

Wednesday 2:45 p.m.–3:25 p.m.

Querying 1.6 billion reddit comments with python

Daniel Rodriguez

Description

Abstract

Sponsors

Become a sponsor.