PyData Carolinas 2016 | Presentation: Scalable Data Science with Spark and R

Wednesday 10:45 AM–12:15 PM in Room 1

Scalable Data Science with Spark and R

Zeydy Ortiz, Rob Montalvo

Audience level: Novice

Description

Processing large datasets in R has been limited by the amount of memory available on the local system. To overcome this limitation of native R, several cluster computing alternatives have recently emerged, including Apache Spark. In this session, we will discuss the architecture of Spark and introduce the SparkR library. We will work through examples of the API and point to additional resources for learning more.

Abstract

In this tutorial, we will focus on SparkR. The outline of the tutorial is as follows:
- Introduction to cluster computing with Spark
- Getting started with SparkR
- Deep dive into the SparkR DataFrame API
- Additional resources

In preparation for this tutorial, please run install.packages("SparkR") on your system.