Occasionally, Python-focused data shops need to use JVM languages for performance reasons. This generally means throwing away whole repositories of Python code and starting over, or resorting to interop architectures (e.g., Apache Thrift) that increase system complexity. We present a technique (and a new library, spylon) for interoperating easily with the JVM from Python.
This talk will:
Look in detail at the Python Spark bindings (PySpark) to show how the Python-JVM interop used by Apache Spark actually works (a short sketch follows this list).
Demonstrate a package we have designed (spylon) that generalizes the operations above (Scala class reflection, type conversions, etc.); a second sketch follows the list.
Explain drawbacks and pitfalls of our approach.
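To give a flavor of the first point: PySpark drives a companion JVM through a Py4J socket gateway, and underscore-prefixed internals such as SparkContext._jvm expose it. A minimal sketch (these are PySpark internals, used here for illustration rather than as a stable API):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; PySpark launches a JVM behind the scenes
# and talks to it over a Py4J socket gateway.
spark = SparkSession.builder.master("local[*]").appName("interop-demo").getOrCreate()
sc = spark.sparkContext

# The gateway's entry point: sc._jvm is a PySpark internal (note the
# underscore), but it is how PySpark itself reaches JVM code.
jvm = sc._jvm

# Any JVM class is addressable by its fully qualified name.
print(jvm.java.lang.System.currentTimeMillis())

# Scala APIs are reachable the same way; here org.apache.spark.sql.functions
# (a Scala object with static forwarders) builds a Column, which comes back
# to Python as a Py4J proxy object.
col = jvm.org.apache.spark.sql.functions.lit(42)
print(col.toString())

# Every Python RDD wraps a JVM object: _jrdd is the backing JavaRDD.
rdd = sc.parallelize(range(4))
print(rdd._jrdd.count())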
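```

And to suggest what "generalizing these operations" looks like in practice, here is a hypothetical sketch in the spirit of spylon (the helper names are illustrative, not spylon's actual API). Scala singleton objects compile to a `Name$` class whose single instance lives in a static `MODULE$` field, and collection conversions route through `JavaConverters`, so both are worth wrapping once:

```python
def scala_object(jvm, fqn):
    """Fetch a Scala singleton object through Py4J.

    A Scala ``object a.b.Foo`` compiles to a class ``a.b.Foo$`` whose single
    instance lives in the static field ``MODULE$``. (Illustrative helper,
    not spylon's actual API.)
    """
    path, _, name = fqn.rpartition(".")
    view = jvm
    for part in path.split("."):
        view = getattr(view, part)
    # ``$`` is not legal in a Python identifier, hence getattr.
    return getattr(getattr(view, name + "$"), "MODULE$")


def to_scala_seq(jvm, iterable):
    """Convert a Python iterable to a scala.collection.Seq (illustrative)."""
    jlist = jvm.java.util.ArrayList()
    for item in iterable:
        jlist.add(item)
    converters = jvm.scala.collection.JavaConverters
    return converters.asScalaIteratorConverter(jlist.iterator()).asScala().toSeq()


# Usage, reusing `sc` from the previous sketch: grab the scala.math package
# object and a Scala Seq built from a Python list.
pi = scala_object(sc._jvm, "scala.math.package").Pi()
seq = to_scala_seq(sc._jvm, [1, 2, 3])
print(pi, seq.length())
```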
The examples will focus on PySpark and Scala, but the concepts generalize to any JVM-based system for which Python support is not yet fully available (e.g., Cassandra, Elasticsearch) and to any internal JVM projects.
The talk will not require any knowledge of Scala or Java from the audience.