Apache YARN is the resource manager native to Hadoop. Its JVM-only nature, complicated security model, and myriad options have historically made it difficult to deploy non-Java applications on it. In this talk we present Skein, a tool and library written to ease Python deployment on YARN, as well as some of the issues (and solutions!) we encountered while developing and testing it.
Apache YARN is the resource manager native to Hadoop clusters. It is responsible for scheduling applications on the cluster (deciding where and when an application gets the resources it requested) and provisioning these resources in a secure and robust way.
Historically, deploying applications on YARN has been complicated for several reasons:
YARN is a JVM-only framework, and requires a non-trivial amount of code to get things working (for example, Spark's YARN support is ~6000 lines of Scala).
Applications that deploy on YARN need to distribute their resources with them. For Java applications this is straightforward - just bundle everything into a JAR and you're done. With Python things aren't so easy.
YARN's security model is based on Kerberos, which can be tricky to support properly. This, coupled with the myriad configuration options YARN supports, can make it difficult to test that an application works on all clusters.
In this talk we present Skein, a tool and library to simplify deployment on YARN. Building on ideas from Docker and Kubernetes, Skein uses a declarative interface for defining and deploying applications, and provides Python access to this previously difficult deployment environment.
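To give a feel for what "declarative" means here, the following is a minimal sketch of an application described as plain data and then handed to Skein. The field names (`services`, `instances`, `resources`, `files`, `script`) follow the application specification format as we understand it and may differ between Skein versions; the application name, the `worker` service, and the `environment.tar.gz` and `app.py` files are hypothetical placeholders.

```python
import json


def build_spec(name, script, instances=1, vcores=1, memory="1 GiB", files=None):
    """Return a plain dict shaped like a Skein application specification.

    The layout is an assumption based on Skein's documented spec format.
    """
    service = {
        "instances": instances,
        "resources": {"vcores": vcores, "memory": memory},
        "script": script,
    }
    if files:
        # Maps the name seen inside the container to a local source path.
        service["files"] = dict(files)
    # A single service named "worker" (hypothetical name).
    return {"name": name, "services": {"worker": service}}


def submit(spec_dict):
    """Submit the spec to YARN. Needs Skein installed and a reachable cluster."""
    import skein  # imported lazily so the rest runs without Skein

    spec = skein.ApplicationSpec.from_dict(spec_dict)
    with skein.Client() as client:
        return client.submit(spec)  # the YARN application id


# Hypothetical application: two containers running app.py from a packed
# Python environment that is shipped alongside the script.
spec = build_spec(
    name="hello-python",
    script="source environment/bin/activate\npython app.py",
    instances=2,
    files={"environment": "environment.tar.gz", "app.py": "app.py"},
)
print(json.dumps(spec, indent=2))
```

Only `build_spec` runs here; calling `submit(spec)` would start the application on a real cluster. The description says what is wanted (how many containers, how much memory, which files) and leaves the placement and provisioning to YARN.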
Attendees of this talk will learn:
The basics of Apache YARN, and some of the things that make using it so complicated.
How Skein tries to rectify these issues to provide a simpler deployment experience.
How to use Skein to package and deploy Python applications on Apache YARN (using Dask's YARN deployment as an example).
Strategies for testing applications against Hadoop using existing CI tools.