Thursday 10:00 AM–10:45 AM in Track 1 - McKinley

Scalable Data Science in Python and R on Apache Spark

Felix Cheung

Audience level:
Intermediate

Description

In the world of Data Science, Python and R are very popular. Apache Spark is a highly scalable data platform. How can a Data Scientist integrate Spark into their existing Data Science toolset? How does Python work with Spark? How can one leverage the rich ecosystem of 10,000+ CRAN packages for R?

Abstract

In the world of Data Science, Python and R are very popular. Apache Spark is a highly scalable data platform. How can a Data Scientist integrate Spark into their existing Data Science toolset? How does Python work with Spark? How can one leverage the rich ecosystem of 10,000+ CRAN packages for R?

We will start with PySpark, beginning with a quick walkthrough of data preparation practices and an introduction to the Spark MLlib Pipeline model. We will also discuss how to integrate native Python packages with Spark.
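
To make the Pipeline idea concrete, here is a minimal PySpark sketch in the Spark 2.x style; the toy data, column names, and choice of stages are illustrative assumptions, not material from the talk.

```python
# Minimal Spark ML Pipeline sketch (assumed toy data and stage choices).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# A tiny illustrative training set with "text" and "label" columns.
train = spark.createDataFrame(
    [("spark is fast", 1.0), ("slow single machine", 0.0)],
    ["text", "label"],
)

# Each stage transforms the DataFrame; the Pipeline chains them together.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(train)
model.transform(train).select("text", "prediction").show()
```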

Compared to PySpark, SparkR is a newer language binding for Apache Spark, designed to be familiar to native R users. In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable scalable machine learning on Big Data. In addition to the R interface to the ML Pipeline model, we will explore how SparkR supports running user code on large-scale data in a distributed manner, and give examples of how that can be used to work with your favorite R packages.
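
The talk demonstrates distributed user code through SparkR's own interfaces; to keep the examples on this page in Python, the sketch below is a rough PySpark analogue (not the SparkR API) that runs a native package, here numpy, assumed to be installed on the workers, against each partition of a distributed dataset.

```python
# Rough Python analogue of running user code over distributed data:
# apply a native package (numpy, assumed installed on the executors)
# to each partition of a Spark DataFrame.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-user-code").getOrCreate()
df = spark.range(0, 1000)  # one "id" column, values 0..999

def partition_stats(rows):
    # Runs on each executor; `rows` is an iterator over one partition.
    values = np.array([r.id for r in rows])
    yield (float(values.mean()) if values.size else 0.0, int(values.size))

stats = df.rdd.mapPartitions(partition_stats).toDF(["mean", "count"])
stats.show()
```

SparkR exposes its own mechanisms for this kind of distributed user code, which the talk walks through together with examples that use R packages from CRAN.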

Presentation: https://www.slideshare.net/felixcss/scalable-data-science-in-python-and-r-on-apache-spark
