There are several libraries and approaches for distributed processing in Python. Each is built on different concepts and offers different features, and we need to understand those features to use our computing resources efficiently. In this session, I will discuss how to choose and use these distributed processing tools well.
Python is widely used in data engineering and data science today. In these fields, we often encounter situations that require complex calculations over large amounts of data. With proper knowledge of distributed processing, you may be able to significantly reduce your script's execution time.
The audience of this session can expect practical guidance on distributed processing.
The talk will be divided into three parts:
First, I will give a brief overview of the basics of distributed processing for those who have never used it.
Second, I will introduce the built-in multiprocessing and threading modules and explain the difference between concurrency and parallelism. After that, I will introduce joblib, celery, dask, and PySpark (and GNU parallel), which are available as third-party libraries. Beyond introducing each library's features, I will also explain why a library behaves the way it does, from the viewpoint of its structure and design concepts.
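As a preview of the concurrency-versus-parallelism distinction, here is a minimal sketch of my own (not taken from the talk materials) using the standard-library concurrent.futures module. For a CPU-bound task, a thread pool runs concurrently but is serialized by the GIL, while a process pool runs in parallel on separate cores, so the process version typically finishes faster on a multi-core machine.

    # Minimal sketch: CPU-bound work with threads vs. processes.
    # Threads are concurrent but GIL-bound; processes run in parallel.
    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def cpu_bound(n):
        # Busy loop standing in for a heavy computation.
        return sum(i * i for i in range(n))

    def timed(executor_cls, label):
        start = time.perf_counter()
        with executor_cls(max_workers=4) as ex:
            list(ex.map(cpu_bound, [2_000_000] * 4))
        print(f"{label}: {time.perf_counter() - start:.2f}s")

    if __name__ == "__main__":
        timed(ThreadPoolExecutor, "threads (concurrent, GIL-bound)")
        timed(ProcessPoolExecutor, "processes (parallel)")

The workload sizes and worker counts here are arbitrary illustrations; the talk itself covers the underlying reasons for the difference.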
Finally, once we understand the basics of Python's distributed processing packages, I will discuss which library we should select, and how to use it, in various data engineering and machine learning situations.
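As one small illustration of this kind of choice (again my own example, not the talk's): for an embarrassingly parallel loop on a single machine, joblib is often the lowest-friction option.

    # Minimal joblib sketch: an embarrassingly parallel loop.
    # n_jobs=-1 uses all available CPU cores; requires joblib installed.
    from joblib import Parallel, delayed

    def square(x):
        return x * x

    results = Parallel(n_jobs=-1)(delayed(square)(i) for i in range(10))
    print(results)  # [0, 1, 4, 9, ...]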
By the end of this talk, the audience will have learned the following: