Saturday 11:00 AM–11:45 AM in Room #220/219 (2nd Floor)

Scaling up to Big Data - Devops for Data Science

Marck Vaisman

Audience level:
Intermediate

Description

Scaling up R/Python from a single machine to a cluster environment can be tricky. While many tools make launching a cluster relatively easy, they focus mostly on operations and are not optimized for the specific use case of analytics. Come and learn about devops tips and tricks to optimize your transition into the big data world as a data scientist.

Abstract

Migrating R or Python work from a single local machine to a cluster environment can be tricky. While many tools and resources make launching a cluster relatively easy, they focus mostly on operations and are not optimized for the specific use case of analytics using R and/or Python.

Imagine this scenario: you are a data scientist at a small organization. There is no devops support and you need to start setting up your environment for big data processing and analysis. You start a cluster in the cloud (on your favorite provider), you log on, and you want to run an R script that you’ve developed on your laptop. Sounds easy, right? Well, have you considered the following?

  • You need to have R and all necessary additional packages installed on every node in the cluster
  • There are several ways to run R on Hadoop
  • How do you set file permissions?
  • If you want to use RStudio as your interface/front end, you need to configure it, especially if you want it to talk to the underlying cluster
  • If multiple people are going to use the cluster simultaneously, you need to configure it for multi-user access
  • What if you want to use Anaconda Python?
  • How can you make an IPython notebook talk to Spark?
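
As a small illustration of the first item in the list above, here is a minimal Python sketch of how one might check that the required R packages are present on every node. It is not the speaker's method, only one possible approach. It assumes key-based ssh access to each node and that Rscript is on the remote PATH; the function names, host names, and package names are all hypothetical.

```python
import shlex
import subprocess
from typing import Callable, Dict, Iterable, List


def r_check_command(package: str) -> List[str]:
    """Build an Rscript command that exits non-zero if `package` is not installed."""
    expr = (
        f'if (!requireNamespace("{package}", quietly = TRUE)) '
        'quit(save = "no", status = 1)'
    )
    return ["Rscript", "-e", expr]


def ssh_runner(host: str, command: List[str]) -> int:
    """Run `command` on `host` over ssh and return its exit code.

    Assumption: passwordless (key-based) ssh works for every node.
    """
    remote = " ".join(shlex.quote(part) for part in command)
    return subprocess.run(["ssh", host, remote], capture_output=True).returncode


def missing_packages(
    hosts: Iterable[str],
    packages: Iterable[str],
    runner: Callable[[str, List[str]], int] = ssh_runner,
) -> Dict[str, List[str]]:
    """Return {host: [packages not installed there]}, listing only hosts with gaps."""
    packages = list(packages)
    report: Dict[str, List[str]] = {}
    for host in hosts:
        missing = [p for p in packages if runner(host, r_check_command(p)) != 0]
        if missing:
            report[host] = missing
    return report


# Hypothetical usage (host and package names are placeholders):
# missing_packages(["node1", "node2"], ["data.table", "ggplot2"])
```

The `runner` argument is injectable so the same check can be exercised without a live cluster, or swapped for whatever remote-execution mechanism your setup uses.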

Come and learn about devops tips and tricks to optimize your transition into the big data world as a data scientist. This is a how-to session intended to raise awareness of some of the typical technical issues that can cause headaches. This session is not intended to be a sysadmin session, but it will hopefully give you an additional understanding of concepts you need to know, including tools such as Ansible for automating your setup.