Advanced Python: Concurrency And Parallelism
Explaining Why, When And How To Use Threads, Async And Multiple Processes In Python
Concurrency and parallelism features have completely changed the landscape of software applications. It’s a common trend in larger organisations to design concurrent and parallel enterprise-level applications. Therefore, it’s important to understand how concurrency and parallelism work. The sole aim of this article is to provide a clear and succinct guide on how and when to use concurrency and parallelism in Python applications.
This article will explain the concepts of concurrency and parallelism using practical examples.

Although this article is focused on the Python programming language, it can be used to understand the general concepts of concurrency and parallelism.
This is an advanced-level topic for Python developers and I recommend it to everyone who uses or intends to use the Python programming language.
I will be explaining the concepts using the CPython implementation of Python.
Article Aim
This article will provide an overview of the following concepts:
- A brief introduction to CPUs, processes, and threads to help us understand how concurrency and parallelism work
- An overview of what concurrency and parallelism are
- An explanation of when to take advantage of concurrency and parallelism
- A quick introduction to how the GIL works and why it is needed
- A guide to designing and implementing applications with concurrency and parallelism in mind, explaining how to use multithreading, asyncio, and multiprocessing in Python
Designing and implementing distributed multi-process and multi-threaded applications can be a challenging task. The key is to learn the theory as much as possible and then code and review as many applications as we can. Recently, I was tasked with designing and architecting a distributed multi-process, concurrent, real-time data science application. Therefore, the moral of the story is that if I can do it then so can you.
This article should clear most of the complicated concepts.
Brief Introduction Of CPU, Process, And Thread
Let’s start by understanding the three main components of our machines. If we open the Task Manager on a Windows computer and click on the Performance tab, we will see the CPU utilisation along with the number of cores, processes, and threads that are currently running on the computer:

CPU
We can think of a CPU as the brain of a computer. A CPU has its cycles, which are essentially the units of time the CPU takes to execute its low-level operations; a single instruction can take one or more cycles to complete.
Cores And Processes
A CPU can have multiple cores. Each core can run multiple processes. A process is essentially the applications/the instructions of the programs we run on our computer. Each process has its own memory space. If we click on the Processes tab, we will see all of the processes that are currently running on our computer such as iexplore.exe, chrome.exe, PyCharm, Python.exe, and so on. Each Python process has its own Python interpreter, memory space, and GIL (which I will explain shortly).
Threads
Furthermore, each process can run multiple threads within its context. A thread is a set of coded computer instructions. Each thread has its own local memory space (such as its stack), and it also has access to the process’s shared memory and context information.
When we run a process, such as Python.exe, it executes the code within its main thread. The main thread can start up multiple threads, and a process can likewise start up multiple subprocesses.
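We can see this from Python itself. The snippet below is a quick sketch (assuming a standard CPython installation) that prints the id of the current process and the name of the thread executing the code:
import os
import threading

# Every running Python program is a process with at least one thread
print('process id:', os.getpid())
print('current thread:', threading.current_thread().name)  # MainThread
print('threads alive:', threading.active_count())  # 1 before we start more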
That’s all we need to know for now. Let’s use this knowledge to understand a real-life scenario.
A Cooking Example
This example will help us understand the concurrency and parallelism concepts better. Consider this scenario.
It’s a Sunday morning and I am expecting 10 of my friends to visit my house for lunch. I have around 3 hours to cook a meal for all of my 10 friends.
So I decide to write down the sequence of the cooking steps along with the time each of these steps would take me:

Now, some of the steps are more time consuming than others.
1. I start by washing the vegetables, which takes me around 5 minutes.
2. Then I move on to cutting the vegetables, which consumes around 13 minutes of my time.
3. The cooking process requires boiling the vegetables, which can take around 30 minutes.
4. Lastly, I serve the cooked food on a plate, which takes around 2 minutes.
In total, it would take approximately 50 minutes to cook the food for one of my friends. Now, repeat the process 10 times and it will take me 500 minutes. That’s around 8 hours of cooking.
I am sure I can speed it up!
But hold on, I only have 3 hours to do an 8-hour task. How do I fit an 8-hour task into 3 hours? Can I parallelise it?
The first task for me is to understand where the bottleneck is. Let’s draw the sequence of steps on a paper to visualise how and where the bottleneck is:

The step to cook the food takes around 30 minutes. The biggest bottleneck for me is that I only have one cooker at home on which I can cook/boil the vegetables. This is a perfect scenario to illustrate how we can utilise concurrency and parallelism to speed up performance. I am unsure where I read a similar example, but it stayed in my head, so I will base this article on it.
Back To The Software World
I will use the scenario that I have explained above to explain what concurrency and parallelism are and how they can help.
At the moment, I am the only chef who is preparing the food.
- We can consider the cooking process to be a Python application.
- Each of the cooking steps that I have outlined above is essentially the functions that we can run in our Python application.
- The chef, i.e. myself, is a Python process and all of these steps are executed within the main thread.
Ok, back to the core task of parallelism and concurrency!
Speeding Up The Cooking Process
There are multiple ways to speed the cooking process up, such as:
- I can chunk/divide the entire workload into two sets of tasks. For instance, the first set can contain the food preparation steps for 5 of my friends whilst the second set can contain the food preparation steps for the remaining 5 friends. Then I can work on the first set while another chef (process) takes the second set and cooks the food in parallel with me.
- I could also buy extra cookers and cook the vegetables simultaneously whilst I wash and cut the vegetables for the next plate in the meantime.
In short, there are multiple ways to speed up the process. Software is no different. We have to choose the right solution to the problem.
Let’s understand how concurrency and parallelism fit in there.
What Are Concurrency And Parallelism?
Let’s understand these two concepts.
Parallelism
Parallelism is the simpler of the two concepts to understand. Parallel execution means executing multiple sets of instructions at the same time in completely separate processes. Within each process, we can choose to execute the steps synchronously or asynchronously.
With regards to the lunch scenario above, if one of my friends helps to cook lunch for 5 of my friends in his home whilst I cook lunch for the other five friends at my home then it is as if there are two processes running in parallel. That’s known as parallelism in the software world.

For the sake of simplicity, we can conclude that running the cooking processes in parallel will help us prepare lunch for my 10 friends in approximately 4 hours. We’ve just cut the time in half. Not bad, right?
Therefore, parallelism simply means chunking the work in multiple sets and running multiple processes at the same time (map). Finally we can aggregate the results together (reduce). Just like I am not sharing my kitchen and cooker with my friend, the processes do not have to share the memory.
One very important point to note is the reason for having a multi-core machine. Whenever we create a new Python process, the process runs its code in its own Python interpreter on a single core. For multiple processes to truly run in parallel, we need a multi-core machine.
Parallel processing involves running a process in each of the cores of a machine.
As a result, parallelism in Python is all about creating multiple processes to run the CPU bound operations in parallel.

Creating a new process is an expensive task. Hence we have to be very careful and ensure that we are attempting to execute a long-running CPU bound (and not IO bound) operation.
Concurrency
Concurrency is slightly different. Concurrency is also used to increase the overall performance of an application but it is used to solve a different problem. I will explain the concept now.
With regards to the lunch scenario above, let’s consider that all of my friends are unavailable to help me cook the lunch. I can instead go for the second option and buy multiple cookers. Each cooker can be seen as a thread.
I can now wash and cut the vegetables serially like before. However, when it comes to the step of cooking/heating the vegetables, I can heat each plate's worth of vegetables on multiple cookers concurrently.
I would have to periodically check the state of the food in each of the cookers and serve the food as soon as it is cooked. As a result, we have managed to run the cooking step (step 3) concurrently, as shown below. This is how concurrency works in Python applications (well, sort of). We can potentially complete a 60-minute (2 × 30 minutes) cooking task within 35 minutes due to the overlap, as shown below.

Concurrency is great for IO-bound operations such as for reading a file or communication over a network.
Let’s Review It Technically
Let’s consider that there are a number of tasks to execute in a single process. These tasks could involve reading a file or calling an external web service. These are essentially network or I/O calls. We can speed up the time our application takes in completing multiple I/O calls by creating multiple threads.
For instance, we can start a task in a thread. While the first thread is running and waiting for the data from an external resource, we can start another task in another thread.
Assessing Threads
Threads are slightly complicated objects. Let’s assess them closely.

The image above shows a simplistic view of how threads run.
In the image above, we have created three threads. Initially, we started a thread. While the thread was running, we started another thread. While the second thread was running, the first thread completed and we started the third thread. Finally, all of the threads completed their execution.
We could have executed all of the threads in a sequence but running them concurrently saved some time (wherever the threads have overlapped).
The key point to take from this section is that an application can be concurrent and not parallel, parallel and not concurrent, neither concurrent nor parallel, or both concurrent and parallel.
When Should We Use Concurrency And/Or Parallelism?
This brings us to the core of the section.
We understood in the beginning that CPU bound operations require CPU cycles. CPU operations can include performing mathematical operations, text manipulation, image processing, and so on. They are heavy on the CPU. We can see the CPU utilisation increasing in the Task Manager when we run them.
At a high level, if your Python application is performing CPU bound operations such as number crunching or text manipulation then go for parallelism. Concurrency will not yield many benefits in those scenarios.
At times, we have to communicate with and get data from an external source. This source could be a database that is running on another server, a web service, a file that resides in a shared file system, or even a distributed task queue. These are examples of IO-bound operations.
The key to note is that any action that requires communicating with the hardware (such as sockets) involves communicating with the kernel of your machine. And the operations performed by the kernel are slower than the operations of a CPU.
As a result, most of the IO-bound operations remain in the I/O waiting state. For instance, if we want to query an external web service, the function will stay in the waiting state until it gets the results back from the web service. There is also an overhead of context switching. Context switching is about saving the state of the current task and switching the context to the other task.
Most of the I/O bound operations require communicating with external resources, such as hardware or the network. When we run I/O bound operations, we see a bump in the Network and Disk utilisation in the Task Manager.
If your Python application is performing IO-bound operations, such as querying a web service or reading large files, then go for the concurrency options. If we run two CPU bound operations as two threads in Python, they will effectively run sequentially and we will not gain any benefit. They might actually end up taking longer due to the lock acquisition and context switching.
We can use the image below to decide whether we need a process or a thread/asyncio:

As a matter of fact, for IO-bound operations, running multiple processes might actually turn out to be slower than running them sequentially, due to the overhead of creating the processes.
Each Python process gets its own Python interpreter and this has a large memory footprint. Concurrency is better for situations where resource efficiency is required.
Python’s locking mechanisms are heavily used during thread execution. Before I explain how to run Python code concurrently and in parallel, it is worth explaining what the global interpreter lock (GIL) is.
Python Global Interpreter Lock (GIL)
The GIL is one of the most important concepts for advanced Python users to understand.
To understand the GIL, it’s important to remember that the Python memory manager is not thread-safe. As a result, multiple threads can update the same object in memory. This can end up corrupting the state of objects in an application.
CPython is built on C code and the interpreter's internal structures along with the C code structures are not thread-safe.
The key to note is that the data must be protected by using locking mechanisms so that multiple threads cannot corrupt the memory. To resolve this issue, the Python interpreter uses a lock, known as the Global Interpreter Lock (GIL).
The GIL is a lock acquired by the Python Interpreter
Each Python process has an interpreter and, consequently, each process has one GIL. The CPU cycles tick fast, and at times it seems as if the threads are running simultaneously when internally they are running serially. Hence, the interpreter executes the code of only one thread at a time.
The interpreter coordinates the threads so that they all get the portion of CPU cycles. This is achieved by the interpreter using the GIL.

In the image above, we have a process that has two threads. Each thread has access to the shared memory space. Each thread has its own local memory space.
The GIL is used by the interpreter to acquire a lock on a thread. Consequently, only one thread is executed at a given time even if we run the application on a multi-core machine.
Therefore, GIL ensures that only one thread can run the interpreter at a given instance of time.
All other threads stay in the waiting state until the GIL is released. This is known as the time-slicing mechanism. Each thread gets around 5 ms of execution time before the interpreter releases the GIL so that another waiting thread can run.
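The roughly 5 ms figure corresponds to CPython’s switch interval, which we can inspect and tune via the sys module. A quick sketch (assuming CPython 3.2 or later):
import sys

# CPython asks the running thread to release the GIL roughly
# every switch interval; the default is 5 ms (0.005 s)
print(sys.getswitchinterval())  # 0.005

# The interval can be tuned, although this is rarely needed
sys.setswitchinterval(0.01)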
Due to the fact that each process has a GIL to control the execution of its threads, we often prefer multiple processes over multiple threads to speed up CPU bound operations.
This is the reason why the GIL is not so popular amongst the Python users.
Therefore, the GIL prevents CPU-bound threads from executing in parallel. This does not mean that we cannot use threads in Python. We should use threads when we are performing multiple I/O bound operations, including network calls, loading data from disk, UI-based operations, and communicating with external sources.
As a result, the GIL will be acquired by a thread that needs to get data from an external source. The lock is then released while that thread waits, and it is acquired by another thread that needs to execute IO-bound code. By the time the first thread re-acquires the GIL, it will have received its data.
As a result, multi-threads improve the performance of the Python application for IO-bound operations even though the interpreter is only executing the code for one thread at a time.

The image above illustrates that the GIL is only held by one thread at a time. All of the other threads remain in the waiting state until they can acquire the GIL. Acquiring and releasing the GIL adds overhead that slows the application down.
Concurrent applications are usually less expensive than parallel applications because creating a new process is more expensive than creating a new thread.
It’s important to lock the global/shared objects in your Python code if you choose to go for the concurrency approach.
Why Do We Need To Lock Objects When Only One Thread Runs At A Time?
We are probably wondering why we need to lock the objects if GIL ensures that only one thread can execute at a time.
The interpreter will take care of the internal Python objects by using GIL but we need to take care of the objects we have created ourselves. The threads share global memory space and hence they can overwrite the same memory.
Although the threads execute one at a time, and the interpreter and its internal objects are therefore taken care of by the GIL, the custom global objects that we create in the application are shared by all of the threads within a process. These global objects have to be locked by us to ensure we don’t face any unexpected results.
The image below shows that each thread has its local memory space and access to the shared memory space. The shared memory space is global to the threads and we can create our custom objects in there. These objects are required to be locked by ourselves.

For instance, we can use the threading.Lock, threading.RLock or threading.Semaphore objects to lock the objects. To understand the reasons in depth, consider the snippet of code below:
i = 0
for j in range(200):
    if i == 2:
        raise ValueError('i is 2')
    else:
        i = i + 1
In the code snippet above, the value of i is set to 0 at the start. Within each iteration, we are checking whether the value of an object i is 2. If it is 2 then we raise an exception, otherwise, we increment the value of i by 1.
The object i is in the global memory space therefore it is accessible to all of the threads.
Let’s consider that there are two threads executing the code. The GIL will ensure that only one thread can execute an instruction (line of code) at a time, yet even with the GIL, we can end up with memory corruption issues, as explained below.

The image above illustrates how the two threads can corrupt the state of the memory.
The two recommended solutions would be to:
- Create a local variable i so that each thread has its own object in its local memory space, either by passing the variable as an argument to the thread or declaring it within the function.
- Use the locking mechanisms in Python to lock the variable. Python offers a number of locking objects. If multiple threads are sharing an object then we need to use the threading.Lock, threading.RLock, threading.Semaphore, etc. objects to ensure that the objects are not corrupted, as shown below:
import threading

lock = threading.Lock()

def func():
    lock.acquire()
    if i == 2:
        ...
    lock.release()
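A with statement is the more idiomatic way to take a lock because it guarantees that the lock is released even if the protected block raises an exception. A minimal sketch of the same check using this style:
import threading

lock = threading.Lock()
i = 0

def func():
    global i
    # The with statement acquires the lock on entry and releases
    # it on exit, even if an exception is raised inside the block
    with lock:
        if i == 2:
            raise ValueError('i is 2')
        i = i + 1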
This brings us to the last sections of the article.
Let’s understand how we can create multiple threads, coroutines, and processes in Python.
How To Multithread In Python?
I am going to demonstrate how we can execute IO operations by creating multiple threads. I will start by presenting a real-life example that works in a non-concurrent/synchronous manner.
Synchronous Approach
The function below calls the Yahoo Finance library (yfinance) for 5 company symbols. It gets the information for each of the companies sequentially.
The code then prints out the time it took to get the information for the 5 companies:
import time
import yfinance as yf

started = time.time()

def print_company_info(company):
    ticker = yf.Ticker(company)
    info = ticker.get_info()
    print(info)

companies = ['ABT', 'ABBV', 'ABMD', 'ATVI', 'ADBE']
for company in companies:
    print_company_info(company)

elapsed = time.time()
print('Time taken: :', elapsed - started)
# Time taken: : 67.12
The function took 67 seconds.
This is an I/O bound operation that is performed once for each company. The function is in the waiting state most of the time as it is waiting for the result from the external yahoo web service.
Therefore, it is an ideal candidate for a concurrent approach.
Multithreaded Approach
Python offers a built-in threading library which lets us execute instructions of our code concurrently. There are multiple ways to create a multi-threaded application.
I will show one of the simplest options by using the ThreadPoolExecutor class. The ThreadPoolExecutor class is in the concurrent.futures module.
It allows us to use a pool of threads to execute method calls. We can create a ThreadPoolExecutor by passing in an optional maximum number of worker threads to run asynchronously.
From version 3.8, if a value for the maximum workers parameter is not given then the library internally sets the number of threads to min(32, CPU count + 4). This is based on the assumption that the threads are mostly used for I/O bound operations and as a result, at least 5 workers are preserved for I/O bound tasks.
The value for the number of threads should be selected carefully by assessing the resources and how long each of the operations takes. The creation of threads is an expensive operation, it consumes extra memory space and there is an overhead of context switching.
The code snippet below shows how we can use the ThreadPoolExecutor:
import time
import concurrent.futures

started = time.time()
companies = ['ABT', 'ABBV', 'ABMD', 'ATVI', 'ADBE']
size = 5

with concurrent.futures.ThreadPoolExecutor(size) as thp:
    thp.map(print_company_info, companies)

elapsed = time.time()
print('Time taken: :', elapsed - started)
# Time taken: : 6.766068696975708
We have used the concurrent.futures module and created a thread pool executor with a maximum of 5 threads in the pool.
We then executed the map() method on the executor. The map method takes in the function we want to execute along with an iterable. The function is then run once for each of the elements of the iterable.
The function completed in 6.77 seconds. This is approximately 10x faster than its synchronous counterpart.
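If we want the values back rather than printing them inside the worker, we can collect the results as the threads finish. The sketch below reuses the companies list and the yfinance import from the snippets above, assuming a variant of the function that returns the info instead of printing it:
import concurrent.futures

def get_company_info(company):
    ticker = yf.Ticker(company)
    return ticker.get_info()  # return the result instead of printing it

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as thp:
    # as_completed yields the futures in the order they finish,
    # not the order in which they were submitted
    futures = {thp.submit(get_company_info, c): c for c in companies}
    for future in concurrent.futures.as_completed(futures):
        print(futures[future], future.result())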
If you want to know more about the threading library then please read the Python official documentation here.
How To AsyncIO In Python?
Python offers the asyncio library which lets us perform actions while a task is in an I/O wait state. The library enables Python coders to write concurrent code using the async/await syntax whilst having full control over the execution of the concurrent coroutines.
The foundation of asyncio sits on the event loop object. We can think of the event loop as a scheduler of functions. The event loop is essentially the core of every asyncio operation. It maintains the functions that are going to run and ensures that the functions are not blocking each other.
We can declare a function with the async keyword:
import asyncio

async def main():
    ...
This is the preferred way of writing asyncio applications.
The function is declared with the async keyword, implying that it will produce its result in the future. Calling such a function returns a coroutine object rather than running it immediately. The await keyword is used to wait for an awaitable object, such as a coroutine, to complete.

Whenever we define a function with the async keyword, a coroutine function is created; calling it creates a coroutine object. Python coroutines are awaitable from other coroutines.
Therefore, we can create a coroutine by declaring a function with the async keyword and then await it. The await keyword hands control back to the event loop, suspending the surrounding coroutine until the result is ready.
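Here is a minimal sketch of these ideas using asyncio.sleep, a non-blocking wait (assuming Python 3.7+ for asyncio.run):
import asyncio

async def say(delay, message):
    # asyncio.sleep suspends this coroutine and hands control
    # back to the event loop while it waits
    await asyncio.sleep(delay)
    print(message)

async def main():
    # Both coroutines run concurrently: the total time is ~2s, not 3s
    await asyncio.gather(say(2, 'world'), say(1, 'hello'))

asyncio.run(main())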
The code snippet below queries the Yahoo Finance library in a concurrent manner by utilising the asyncio library. Since the yfinance calls are blocking, they are offloaded to the event loop’s default executor so that the coroutines can overlap:
import asyncio
import time
import yfinance as yf

async def print_company_info(company):
    ticker = yf.Ticker(company)
    # The blocking yfinance call is run in the default executor
    # so that it does not block the event loop
    loop = asyncio.get_event_loop()
    info = await loop.run_in_executor(None, ticker.get_info)
    print(info)

companies = ['ABT', 'ABBV', 'ABMD', 'ATVI', 'ADBE']
tasks = []
event_loop = asyncio.get_event_loop()
for company in companies:
    # Not calling the functions here; calling a coroutine function
    # returns an awaitable coroutine object that we store for later
    tasks.append(print_company_info(company))
started = time.time()
event_loop.run_until_complete(asyncio.wait(tasks))
elapsed = time.time()
print('Time taken: :', elapsed - started)
# Time taken: : 6.34
In the code snippet above:
- We started by importing the asyncio library
- Then we created our coroutine function print_company_info(), which takes the company symbol as a parameter.
- Then we got the current event loop. asyncio will create a new event loop if one is not already set.
- Then we looped over the collection of companies. Within each iteration, we created a task. Tasks are used to schedule coroutines concurrently.
- Finally, we executed the run_until_complete() method to run all of the tasks until they have been completed. We passed in asyncio.wait() as a parameter. It will return when all of the tasks have completed.
The application sent multiple requests, one for each company, and then awaited the results. The function took only 6.34 seconds to complete.
If you want the return values, then you can use asyncio.gather(*tasks) instead.
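For example, assuming the coroutine above were refactored to return the info dictionary instead of printing it, a sketch could look like this:
# gather() preserves the order of its inputs, so the results
# line up with the companies list
results = event_loop.run_until_complete(asyncio.gather(*tasks))
for company, info in zip(companies, results):
    print(company, info)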
We can also declare each coroutine with the @asyncio.coroutine decorator, although this style is deprecated since Python 3.8 in favour of async def.
The library and its functionality are documented in detail here.
Asyncio Over Threads
A few points to note:
- Asyncio is the preferred option over threads for long-running IO operations. Asyncio is a better option if we want to share resources without spending time in making our code thread-safe.
- Threads can be error prone. If your IO-bound operations are fast then choose multi-threading.
- Asyncio is cooperative multitasking and is suitable for slow IO-bound operations.
How To MultiProcess In Python?
Finally, I wanted to explain how parallelism works in Python.
Most of the data cleaning and model training tasks in data science are CPU bound in nature. We can improve the performance of CPU bound operations in a multi-core machine by utilising the multiprocessing features of Python.
Python, its CPython implementation in particular, does not use multiple CPUs by default.
In theory, we can utilise all cores of our machine and complete a long-running CPU bound operation within 1/n of the time it takes on a single-core machine, where n is the number of cores in our machine. In practice, we have to take into account the time it takes to create a subprocess instance and the memory it consumes, amongst other overheads, but generally, it is faster to run multiple long-running CPU bound operations in parallel than to run them serially.

Python has a powerful multiprocessing library. It allows us to spawn multiple processes. Each process has its own GIL, so the processes are not constrained by a single shared GIL.
Additionally, the processes do not share a memory context, and thus it is harder to corrupt the memory objects.
Using the multiprocessing module, we can also share data between processes. However, I would recommend avoiding this if we can, to prevent communication overhead.
There are two main objects: Pool and Process.
A Pool is a wrapper around a collection of worker processes (or threads). It’s named a Pool as it pools a number of workers to share a large set of work and is capable of aggregating the results. The Pool is great for splitting work amongst multiple processes and then aggregating the results back (map/reduce pattern).
Each subprocess has its own private memory, Python interpreter, GIL and a single main thread.
The pool.map() method returns a collection where each element is the return value of the function we wanted to execute in a subprocess. By default, the input items are divided into chunks of work that are distributed across the worker processes.
The Process object can also execute tasks. Therefore, if we have a relatively small number of tasks, then we can opt to create Process objects to handle them. However, if we need to execute a large number of tasks and we are worried about memory starvation, then we should use the Pool object. The Pool object ensures that only the chosen number of processes is running at a given time.
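A minimal sketch of the Process object (the work function and task names are purely illustrative):
import os
from multiprocessing import Process

def work(name):
    print(name, 'running in process', os.getpid())

if __name__ == '__main__':
    processes = [Process(target=work, args=('task-%d' % i,)) for i in range(4)]
    for p in processes:
        p.start()  # spawn a child process for each task
    for p in processes:
        p.join()  # wait for each child process to finish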
How To Share Data Across Processes
We can share the data by passing it as an argument of the function to each of the processes but it would result in doubling the memory footprint.
Processes consume more memory than threads, and creating a process is an expensive operation. Each process has its own memory context, so we cannot create an unlimited number of child processes within an application. There are multiple options for sharing data across processes:
1. If we want to send data between processes, we can utilise the multiprocessing.Queue objects. The queue pickles the objects and unpickles them when they are received in a process. We can also use multiprocessing.Manager().Queue() objects and pass them as arguments to a multiprocessing.Process(args=(queue,)) object for intercommunication between the processes, as shown in the sketch after this list. It’s important to only use this approach if we have a long-running CPU bound operation and the data can be pickled.
2. A Pipe can also be used for uni- and bi-directional communication between processes.
3. We can use a Queue for a publish/subscribe scenario.
4. A Manager can also be used to share objects between processes.
5. We can also use multiprocessing.Array and multiprocessing.Value to share blocks of memory between processes.
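A minimal sketch of passing a multiprocessing.Queue to a child process (the payload is purely illustrative):
from multiprocessing import Process, Queue

def producer(queue):
    # Objects put on the queue are pickled, sent to the other
    # process, and unpickled on the receiving side
    queue.put({'symbol': 'ABT', 'status': 'done'})

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=producer, args=(queue,))
    p.start()
    print(queue.get())  # blocks until an item is available
    p.join()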
We can also run I/O bound tasks with the same Pool API backed by threads rather than processes, using the multiprocessing.dummy module: from multiprocessing.dummy import Pool.
Let’s write the Python code to illustrate how multiprocessing works in Python. The snippet of code below shows a function named increment().
It increments a variable by one, 1 million times. We are going to execute the function 100 times in a serial manner and measure this long-running CPU bound operation.
Serial Approach
import time

def increment(input):
    for i in range(1000000):
        input = input + 1

if __name__ == "__main__":
    inputs = [1] * 100
    started = time.time()
    for i in inputs:
        increment(i)
    elapsed = time.time()
    print('Time taken Sequential:', elapsed - started)
    # Time taken: : 35.019266843795776
The function took 35.01 seconds to complete. This number-crunching function is a CPU bound operation.
We can fork it by creating multiple child processes to speed up the performance.
Multi-Processing Approach
There are multiple options to run an application in parallel mode. We can use the multiprocessing library, whereby we can use the Process() or Pool() objects, for instance, or we can use the concurrent.futures module.
In the snippet of code below, I have created a Pool object with 8 worker processes. The increment() function is offloaded to the worker processes, so only 8 processes are created to execute it. As the workers finish their current tasks, the next batch of tasks is handed to them, and so on.
import time
from multiprocessing import Pool

def increment(input):
    for i in range(1000000):
        input = input + 1

if __name__ == "__main__":
    inputs = [1] * 100
    pool = Pool(8)
    started = time.time()
    pool.map(increment, inputs)
    elapsed = time.time()
    print('Time taken MultiProcess :', elapsed - started)
    pool.close()
    pool.join()  # wait for the worker processes to exit
    # Time Taken 8 processes = 7.562885284423828
The snippet shows how we can execute the map() method. The map() method executed the increment function 100 times (the length of inputs is 100) and each time it passed the appropriate element of the inputs collection to the function.
Tip: We can view the Task Manager’s Processes tab to see how many of the processes are running at a given time. There will be 9 processes running for the snippet above (8 subprocesses and 1 main python.exe process).
The code completed within 7.56 seconds, which is nearly 5 times faster than its serial counterpart.
We can also use the ProcessPoolExecutor class, which uses a pool of processes to execute calls asynchronously. It uses the multiprocessing library under the hood. When using a Pool directly, as above, it’s important to call close() followed by join() to ensure that the worker processes have completed execution.
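A sketch of the same workload using ProcessPoolExecutor, reusing the increment() function from the snippet above:
import concurrent.futures

if __name__ == '__main__':
    inputs = [1] * 100
    # The context manager calls shutdown(wait=True) on exit,
    # which waits for the worker processes to finish
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        executor.map(increment, inputs)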
Tip: We can also make the calls concurrent (non-blocking) by executing the pool.apply_async() method.
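A sketch of apply_async() with a hypothetical square() function that returns a value:
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        # apply_async returns an AsyncResult immediately without blocking
        async_result = pool.apply_async(square, (10,))
        # get() blocks only when the value is actually needed
        print(async_result.get(timeout=10))  # 100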
To understand the package in detail, please read the official documentation.
Tips
- Debugging a parallel and concurrent system is a hard task. During debugging, I recommend setting the number of workers in a pool to 1. This gives a controlled environment by eliminating the noise caused by parallel execution in your application.
- Also, it’s important to remember not to max out CPU and RAM utilisation as the process objects are resource-heavy objects.
- Try to sort the jobs in order so that the slowest jobs are sent first.
- Always aim to keep the solution design simple. If you do not need asynchronous mechanisms then don’t unnecessarily add them to your solution design. Always aim to make your code thread-safe and avoid locks if you can.
- Ask yourself where each of the systems can fail and have a recovering strategy for each of the components.
- Attempt to minimise data sharing between processes and threads.
- If you find an application challenging to design then find similar applications on the web (on GitHub, for instance) and understand how they are designed and architected.
- There is also the option to run the operations across multiple machines however distributed systems increase the complexity by 10-fold.
- Ensure you are using the appropriate data structures.
Summary
Concurrency and parallelism features have completely changed the landscape of software applications. The sole aim of this article was to provide a clear and succinct guide on how and when to use concurrency and parallelism in Python applications.
We understood that an application can be concurrent and not parallel, parallel and not concurrent, neither concurrent nor parallel, or both concurrent and parallel.
We also understood how GIL works and why it is needed. Additionally, the article explained that for I/O bound operations, use concurrency and for CPU bound operations, use parallelism.

Finally, the article demonstrated how the multiprocessing, concurrent.futures, and asyncio libraries can be used to implement concurrency and parallelism in Python code.