Saturday October 30 2:00 PM – Saturday October 30 2:30 PM in Talks II

High Performance Python With Numba, Dask, and Rapids For the Absolute Beginner

Gus Cavanaugh

Prior knowledge:
No previous knowledge expected

Summary

Data scientists often have large datasets and powerful hardware at their disposal, but the excitement of fast computation in Python tends to fade against a steep learning curve. This talk will build your confidence and intuition around high-performance computing with Python. We step through a complete example and cover the core concepts so you can generalize to your own work.

Description

  • An example data science pipeline with NumPy and pandas
    • Common heuristics for when to accelerate your code
    • Quick survey of common approaches
    • An example data processing pipeline with NumPy
  • How to accelerate on a single machine with Numba
    • Brief introduction to Numba
    • Quick comparison to Cython
    • Accelerating our example pipeline with Numba
  • How to distribute on a cluster with Numba and Dask
    • Brief introduction to Dask
    • Quick comparison to PySpark, Ray
    • Accelerating our example pipeline with Numba and Dask
  • How to accelerate and distribute with Numba, Dask, and Rapids
    • Brief introduction to Rapids & GPUs
    • Quick comparison to other GPU computing methods
    • Accelerating our example pipeline with Numba, Dask, and Rapids
  • Conclusion
    • Review of performance gains
    • Summary of when to apply each to your project
    • Where to find hardware and example costs for various pipelines and data volumes
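
As a rough illustration of the single-machine step in the outline, here is a minimal sketch of accelerating a plain Python loop with Numba's `njit` decorator. It is not the talk's actual pipeline: the function name `rolling_mean`, its parameters, and the input data are hypothetical, chosen only to show the pattern. The `ImportError` fallback is also an assumption added so the sketch runs (without any speedup) where Numba is not installed.

```python
import numpy as np

try:
    # Real Numba decorator: compiles the function to machine code on first call.
    from numba import njit
except ImportError:
    # Assumed fallback so the sketch still runs without Numba (no speedup).
    def njit(func):
        return func


@njit
def rolling_mean(values, window):
    """Trailing moving average; the first window - 1 entries are NaN.

    Hypothetical example function, not taken from the talk.
    """
    n = values.shape[0]
    out = np.empty(n)
    out[: window - 1] = np.nan
    running = 0.0
    for i in range(n):
        running += values[i]
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        if i >= window - 1:
            out[i] = running / window
    return out


values = np.arange(10, dtype=np.float64)
result = rolling_mean(values, 3)
```

The explicit loop is the kind of code that tends to be slow in pure Python and that Numba is designed to compile. Whether it beats a vectorized NumPy or pandas equivalent depends on the data size and the operation, so it is worth timing both.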