Day 2 201 Meeting Room 11:15 - 12:00

Talk: Multitasking in a Single Threaded Engine - Scaling Machine Learning Without Changing Workflow

Python is widely used in data science. However, due to Python's single-threaded nature, data scientists often run into problems that call for multithreading or asynchronous design patterns. This talk introduces a multitasking approach built on Python's asyncio, along with a structure for abstracting asynchronous tasks that can inform the design of an intuitive library.

Introduction:

Consider a linear workflow with tasks A, B, and C, where C depends on the completion of both A and B, while A and B are independent of each other. In this case, we want to execute A and B concurrently and automatically trigger C once both are done. This situation arises often in general programming, and many design patterns are available for dealing with it. However, when it arises in machine learning, for example when training a model (task C) requires first processing the training data (task A) and the forecast data (task B), we face two particular challenges:
1. Data scientists are often unfamiliar with multithreading design, concurrent programming, and related techniques.
2. Data scientists usually work in a single-threaded environment (Python, Jupyter), which makes multitasking even harder.
So we propose a general design to overcome these issues: a task abstraction that handles blocking dependencies and runs tasks asynchronously.
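
The A, B, C workflow described above can be sketched with plain asyncio. This is a minimal illustration, not code from the talk: the function names are hypothetical, and `asyncio.sleep` stands in for real data processing and model training.

```python
import asyncio


async def process_training_data():
    # Task A: placeholder for real preprocessing of the training data
    await asyncio.sleep(0.1)
    return "train_data"


async def process_forecast_data():
    # Task B: independent of task A, so it can run concurrently with it
    await asyncio.sleep(0.1)
    return "forecast_data"


async def train_model(train, forecast):
    # Task C: needs the results of both A and B
    await asyncio.sleep(0.1)
    return f"model({train}, {forecast})"


async def workflow():
    # gather() runs A and B concurrently and resumes here once both are done
    train, forecast = await asyncio.gather(
        process_training_data(),
        process_forecast_data(),
    )
    # C is triggered automatically as soon as its prerequisites have finished
    return await train_model(train, forecast)


if __name__ == "__main__":
    print(asyncio.run(workflow()))
```

`asyncio.run` starts its own event loop. In a Jupyter notebook, where an event loop is usually already running, one would typically write `await workflow()` in a cell instead.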

Decanter AI Core SDK

Decanter AI Core SDK is a tool with an intuitive interface for users who want to take advantage of Mobagel's AutoML API. While easy to use, it handles the complicated dependencies between asynchronous tasks under the hood, allowing users to make full use of their computing power: a task is blocked only if its prerequisite tasks haven't finished. Moreover, the task object is designed so that users can get results easily without knowing about the process that handles the dependencies, or even that the API exists. This design and structure can also be applied to other scenarios where efficiently handling large numbers of asynchronous tasks and their dependencies is critical.
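
To make the idea of a task object concrete, here is a rough sketch of one way such an abstraction could look. It is not the Decanter AI Core SDK API: `SketchTask`, `make_work`, and `demo` are hypothetical names invented for this illustration, and `asyncio.sleep` stands in for an API request and response.

```python
import asyncio


class SketchTask:
    """Hypothetical task wrapper; NOT the Decanter AI Core SDK API."""

    def __init__(self, name, work, prerequisites=()):
        self.name = name
        self._work = work                        # async callable taking prerequisite results
        self._prerequisites = list(prerequisites)
        self._task = None                        # asyncio.Task, created on first start()

    def start(self):
        # Schedule the task at most once; requires a running event loop.
        if self._task is None:
            self._task = asyncio.create_task(self._run())
        return self._task

    async def _run(self):
        # Wait only for this task's own prerequisites, which run concurrently
        results = await asyncio.gather(*(p.start() for p in self._prerequisites))
        return await self._work(*results)

    async def result(self):
        # Callers only ask for the result; dependency handling stays hidden
        return await self.start()


def make_work(label, delay):
    async def work(*inputs):
        await asyncio.sleep(delay)               # stands in for an API call
        return f"{label}({', '.join(inputs)})"
    return work


async def demo():
    a = SketchTask("A", make_work("A", 0.05))
    b = SketchTask("B", make_work("B", 0.05))
    c = SketchTask("C", make_work("C", 0.01), prerequisites=[a, b])
    return await c.result()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the caller only asks C for its result; starting A and B and waiting for them happens inside the task object.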

Basic Outline of the Talk

General workflow of multiple tasks with dependencies, and its shortcomings. [3-5 min]
* The traditional approach to dealing with tasks in Python's machine learning environment.
* The problems that can be improved
How task abstraction makes the workflow simple. [5-7 min]
* Brief introduction to Decanter AI Core SDK
* The input and return types of instructions
* How a task is structured and how it handles API requests and responses.
How multitasking works in Python. [10-15 min]
* The purpose of the event loop
* How Python coroutines and Tasks are related to our own task object
* Handling the dependencies of tasks
Conclusion and Q&A session. [2 min]
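
As background for the outline items on the event loop, coroutines, and Tasks, the following sketch shows the difference between a coroutine object and an `asyncio.Task`, and how several dependents can wait on the same Task. The names `fetch`, `dependent`, and `main` are hypothetical and are not taken from the talk.

```python
import asyncio


async def fetch(label, delay):
    await asyncio.sleep(delay)   # placeholder for real work
    return label


async def main():
    coro = fetch("A", 0.05)                  # a coroutine object: nothing has run yet
    task_a = asyncio.create_task(coro)       # now scheduled on the running event loop
    task_b = asyncio.create_task(fetch("B", 0.05))

    async def dependent(name):
        # Several dependents may await the same Task; the Task itself runs once
        a = await task_a
        b = await task_b
        return f"{name} <- {a}, {b}"

    # Both dependents block only until their shared prerequisites are done
    return await asyncio.gather(dependent("C"), dependent("D"))


if __name__ == "__main__":
    print(asyncio.run(main()))
```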

Speaker: Hsiao-Shan Chen

Hi, I'm Hsiao-Shan Chen. I'm a Computer Science graduate of National Tsing Hua University, where I studied deep learning and computer vision as undergraduate research. I worked at Mobagel as a Software Engineering Intern and helped build the SDK for their product, which handles the execution of multiple tasks efficiently.
