PYTHON CONCURRENCY
Harnessing Multi-Core Power with Asyncio in Python
Boost your Python application’s performance by efficiently utilizing multiple CPU cores with asyncio
Introduction
In this article, I will show you how to execute Python asyncio code on a multi-core CPU to unlock the full performance of concurrent tasks.
What is our problem?
asyncio uses only one core.
In previous articles, I covered the mechanics of using Python asyncio in detail. With that knowledge, you know that asyncio lets IO-bound tasks execute at high speed by switching between tasks cooperatively inside a single thread, avoiding the GIL contention that multi-threaded task switching incurs.
Theoretically, the execution time of an IO-bound task depends on the time from initiating the IO operation to receiving its response, not on your CPU's performance. Thus, we can initiate tens of thousands of IO tasks concurrently and complete them quickly.
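To see this in action, here is a minimal sketch. The names fake_io_task and main are hypothetical, and asyncio.sleep stands in for a real IO wait such as a network request:

```python
import asyncio
import time

async def fake_io_task(task_id: int) -> int:
    # asyncio.sleep stands in for a network request or other IO wait.
    await asyncio.sleep(1)
    return task_id

async def main() -> None:
    start = time.perf_counter()
    # Launch 10,000 IO-bound tasks concurrently; their waits all overlap,
    # so the total time is close to one task's latency, not 10,000 seconds.
    results = await asyncio.gather(*(fake_io_task(i) for i in range(10_000)))
    print(f"{len(results)} tasks in {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

On a typical machine this finishes in a little over one second, because the waits run in parallel even though everything happens on a single thread.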
But recently, I was writing a program that needed to crawl tens of thousands of web pages simultaneously and found that although my asyncio program was much more efficient than one that crawled the pages one by one, it still kept me waiting for a long time. Was I really using my computer's full performance? I opened Task Manager and checked:
I found that from the beginning, my code had been running on only one CPU core, while the other cores sat idle. Besides launching IO operations to fetch network data, a task also has to unpack and format the data after it returns. Although each of these operations consumes little CPU time, with enough tasks these CPU-bound operations add up and severely impact overall performance.
I wanted to make my asyncio concurrent tasks execute in parallel on multiple cores. Would that squeeze the performance out of my computer?
The underlying principles of asyncio
To solve this puzzle, we must start with the underlying asyncio implementation, the event loop.
As shown in the figure, asyncio's performance improvements apply to IO-intensive tasks: HTTP requests, reading and writing files, accessing databases, etc. The defining feature of these tasks is that the CPU spends most of its time waiting for external data to return rather than computing, which is very different from CPU-bound tasks that need to occupy the CPU continuously to compute a result.
When we generate a batch of asyncio tasks, the code first puts them into a queue. A thread called the event loop grabs one task at a time from the queue and executes it. When the task reaches an await statement and starts waiting (usually for a request to return), the event loop grabs another task from the queue and executes it. When the previously waiting task receives its data through a callback, the event loop returns to it and finishes executing the rest of its code.
Since the event loop thread executes on only one core, the event loop blocks whenever that "rest of the code" takes up CPU time. When the number of such tasks is large, each small blocking segment adds up and slows down the program as a whole.
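The effect is easy to demonstrate with a small sketch. The names parse_data and fetch_and_parse are hypothetical, and the numbers are purely illustrative: the IO waits overlap, but each CPU-bound tail runs one after another on the single event-loop thread:

```python
import asyncio
import time

def parse_data(n: int) -> int:
    # Stand-in for CPU-bound work such as unpacking and formatting a response.
    # While this runs, the single event-loop thread can do nothing else.
    total = 0
    for i in range(n):
        total += i * i
    return total

async def fetch_and_parse(task_id: int) -> int:
    await asyncio.sleep(0.1)       # the IO waits of all tasks overlap...
    return parse_data(200_000)     # ...but the CPU-bound tails run serially

async def main() -> None:
    start = time.perf_counter()
    await asyncio.gather(*(fetch_and_parse(i) for i in range(50)))
    # Total time is roughly 0.1s (the shared wait) plus 50 times the
    # parse time, because every parse_data call blocks the event loop.
    print(f"finished in {time.perf_counter() - start:.2f}s")

asyncio.run(main())
```

Doubling the number of tasks roughly doubles the CPU-bound portion of the runtime, which is exactly the slowdown described above.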
What is my solution?
From this, we know why asyncio programs slow down: our Python code executes the event loop on only one core, and the CPU-bound processing of IO data blocks it. Is there a way to start an event loop on each CPU core?
As we all know, starting with Python 3.7, the recommended way to execute asyncio code is the method asyncio.run, a high-level abstraction that manages the event loop for you as an alternative to the following code:
try:
    loop = asyncio.get_event_loop()
    loop.run_until_complete(task())
finally:
    loop.close()
As the code shows, each call to asyncio.run creates an event loop, runs the task to completion, and closes the loop. Could we achieve our goal of executing asyncio tasks on multiple cores simultaneously by calling asyncio.run on each core separately?
A previous article used a real-life example to explain how asyncio's loop.run_in_executor method parallelizes code in a process pool while letting the main process collect each child process's results. If you haven't read that article, you can check it out here:
Thus, our solution emerges: use the loop.run_in_executor method to distribute the many concurrent tasks across multiple sub-processes for multi-core execution, then call asyncio.run in each sub-process to start its own event loop and execute the concurrent code. The following diagram shows the entire flow:
The green part represents the sub-processes we start; the yellow part represents the concurrent tasks running inside them.
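The whole flow can be sketched as follows. This is a minimal illustration, not the article's final implementation: fake_fetch, crawl_batch, and run_batch are hypothetical names, and asyncio.sleep stands in for a real HTTP request. The key idea is that run_batch, executed in each sub-process via run_in_executor, calls asyncio.run to start that process's own event loop:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def fake_fetch(url: str) -> str:
    # Placeholder for a real HTTP request.
    await asyncio.sleep(0.1)
    return f"content of {url}"

async def crawl_batch(urls: list[str]) -> list[str]:
    # Runs inside a sub-process: concurrent tasks on that core's event loop.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

def run_batch(urls: list[str]) -> list[str]:
    # Entry point for each sub-process: start a fresh event loop there.
    return asyncio.run(crawl_batch(urls))

async def main() -> None:
    urls = [f"https://example.com/page/{i}" for i in range(40)]
    # Split the URLs into chunks, one per worker process (the green part).
    chunks = [urls[i::4] for i in range(4)]
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=4) as pool:
        # Each future wraps one sub-process; inside it, the concurrent
        # tasks (the yellow part) run on that sub-process's event loop.
        futures = [loop.run_in_executor(pool, run_batch, c) for c in chunks]
        results = await asyncio.gather(*futures)
    print(sum(len(r) for r in results), "pages fetched")

if __name__ == "__main__":
    asyncio.run(main())
```

Note that run_batch must be a module-level function so it can be pickled and sent to the worker processes, and the `if __name__ == "__main__"` guard is required for ProcessPoolExecutor to work on all platforms.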
Preparation before starting
Simulating the task implementation
Before solving the problem, we need some preparation. In this example, we can't write code that actually crawls web content, because that would be very annoying for the target website, so we will simulate our real task with code: