Sunday 11:00–11:45 in Tower Suite 3

Understanding of distributed processing in Python

Chie Hayashida

Audience level:
Novice

Description

There are several ways and libraries for distributed processing in Python, each built around a different concept and offering different features. We need to understand these features to use computing resources efficiently. In this session, I will discuss how to choose and use these distributed processing tools well.

Abstract

Python is now widely used in data engineering and data science, where we often encounter situations that require complex calculations over large amounts of data. With the proper knowledge of distributed processing, you may be able to significantly reduce your script's execution time.

The audience of this session can expect practical guidance on distributed processing.

The talk will be divided into three parts:

Introduction to Distributed Processing

I will give a brief overview and the basics of distributed processing for those who have never used distributed processing.

Features of Distributed Processing Libraries

First, I will introduce multiprocessing and multithreading, which are built into Python, and explain the difference between concurrency and parallelism. After that, I will introduce joblib, celery, dask, and PySpark (and GNU parallel), which are available as third-party tools. Beyond listing each library's features, I will also explain why it behaves the way it does, from the viewpoint of its structure and underlying concepts.
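As a sketch of the concurrency-versus-parallelism distinction (an illustrative example, not code from the talk), the snippet below runs the same CPU-bound function with a thread pool and a process pool. Because of the GIL, the threads interleave but never run Python bytecode in parallel, so on a multi-core machine the process pool is typically the faster of the two:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    """Sum of squares in pure Python -- holds the GIL while it runs."""
    return sum(i * i for i in range(n))

def run(executor_cls, n_tasks=4, n=200_000):
    """Map cpu_bound over n_tasks inputs and report the wall-clock time."""
    start = time.perf_counter()
    with executor_cls(max_workers=n_tasks) as ex:
        results = list(ex.map(cpu_bound, [n] * n_tasks))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    # Threads are concurrent but not parallel for CPU-bound Python code:
    # the GIL lets only one thread execute bytecode at a time.
    # Separate processes each have their own interpreter and GIL.
    thread_results, thread_time = run(ThreadPoolExecutor)
    process_results, process_time = run(ProcessPoolExecutor)
    assert thread_results == process_results
    print(f"threads:   {thread_time:.2f}s")
    print(f"processes: {process_time:.2f}s")
```

For I/O-bound work (network calls, disk reads) the picture reverses: threads release the GIL while waiting, so a thread pool is usually cheaper than spawning processes.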

Case Study of Distributed Processing by Situation

Now that we understand the basics of Python's distributed processing packages, I will discuss which distributed processing library to select and use in various data engineering and machine learning situations.
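As one illustrative case (a hypothetical example with a made-up `featurize` step, not taken from the talk): an embarrassingly parallel, CPU-bound transformation over data that fits in memory on one machine is a natural fit for the built-in multiprocessing module or joblib, whereas dask or PySpark pay off once the data outgrows a single machine:

```python
from multiprocessing import Pool

def featurize(record):
    # Hypothetical per-record transformation. Each record is independent,
    # so the workload is "embarrassingly parallel".
    return {"value": record["value"] ** 2}

def featurize_all(records, workers=4):
    # Data fits in one machine's memory -> multiprocessing or joblib.
    # Larger than memory, or a cluster is available -> dask or PySpark.
    with Pool(processes=workers) as pool:
        return pool.map(featurize, records)

if __name__ == "__main__":
    data = [{"value": i} for i in range(8)]
    print(featurize_all(data))
```

The same map-over-records shape carries over almost unchanged to `joblib.Parallel` or `dask.bag`, which is why understanding the structure of the workload matters more than memorizing any one API.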

Key takeaways

By the end of this talk, the audience will have learned the following:

- The basics of distributed processing and the difference between concurrency and parallelism
- The features and design concepts of Python's built-in and third-party distributed processing libraries
- How to choose the right library for a given data engineering or machine learning situation
