Boost Your Code’s Performance with Concurrent.futures
A Guide to Speeding up Your Applications, Web Scraping, and Web Crawling in Python
Concurrency is a powerful technique for making software faster and more efficient. In Python, the concurrent.futures module provides a high-level interface for working with concurrency. It is especially useful for tasks that involve parallel processing, such as web scraping, web crawling, and building applications that need to make many requests to external resources. In this article, we’ll take a closer look at concurrent.futures and explore how it can be used to speed up your code.
What is concurrent.futures?
The concurrent.futures module is part of the Python standard library and was introduced in Python 3.2. It provides a high-level interface for working with concurrency and allows developers to write concurrent code that is both simpler and more efficient.
At its core, concurrent.futures provides two classes: ThreadPoolExecutor and ProcessPoolExecutor. These classes allow you to execute a function asynchronously in a separate thread or process. The function can be any callable object, including functions, methods, and lambda functions.
Using ThreadPoolExecutor
ThreadPoolExecutor is a class that allows you to execute functions in a separate thread. To use ThreadPoolExecutor, you first create an instance of the class and then use its submit method to submit a function for execution. The submit method returns a Future object, which represents the result of the function call.
Here is an example of how to use ThreadPoolExecutor:
import concurrent.futures
import time
def my_function():
print("Starting my_function")
time.sleep(2)
print("Finishing my_function")
with concurrent.futures.ThreadPoolExecutor() as executor:
future = executor.submit(my_function)
print(future.result())In this example, we define a simple function that sleeps for 2 seconds and then prints a message. We then create a ThreadPoolExecutor instance and use its submit method to submit the function for execution. Finally, we print the result of the function call.
Using ProcessPoolExecutor
ProcessPoolExecutor is a class that allows you to execute functions in a separate process. To use ProcessPoolExecutor, you first create an instance of the class and then use its submit method to submit a function for execution. The submit method returns a Future object, which represents the result of the function call.
Here is an example of how to use ProcessPoolExecutor:
import concurrent.futures
import time
def my_function():
print("Starting my_function")
time.sleep(2)
print("Finishing my_function")
with concurrent.futures.ProcessPoolExecutor() as executor:
future = executor.submit(my_function)
print(future.result())In this example, we define the same function as in the previous example, but this time we create a ProcessPoolExecutor instance and use its submit method to submit the function for execution. Again, we print the result of the function call.
When to use ThreadPoolExecutor vs. ProcessPoolExecutor
ThreadPoolExecutor and ProcessPoolExecutor are both useful for concurrency, but they have different use cases.
ThreadPoolExecutor is best used when the tasks you need to execute are I/O-bound, such as network requests or file operations. This is because threads are lightweight and can be created and destroyed quickly. Additionally, threads share memory, which makes it easier to share data between threads.
ProcessPoolExecutor is best used when the tasks you need to execute are CPU-bound, such as heavy computations or image processing. This is because processes are heavier than threads and require more resources to create and destroy. However, processes have their own memory space, which makes it easier to isolate data between processes.
Real-world examples of concurrent.futures
Now that we’ve explored the basics of concurrent.futures, let’s take a look at some real-world examples of how it can be used to speed up your code.
Web scraping
Web scraping involves extracting data from websites using automated processes. It can be a time-consuming task, especially if you need to extract data from multiple websites or pages. Concurrent.futures can help speed up web scraping by allowing you to execute multiple requests in parallel.
Here’s an example of how to use ThreadPoolExecutor to scrape multiple websites:
import concurrent.futures
import requests
def scrape_website(url):
response = requests.get(url)
# Do something with the response, such as extract data
return response.content
urls = ['https://example.com', 'https://google.com', 'https://amazon.com']
with concurrent.futures.ThreadPoolExecutor() as executor:
results = executor.map(scrape_website, urls)
for result in results:
# Process the resultsIn this example, we define a function called scrape_website that takes a URL and returns the content of the web page. We then create a list of URLs and use ThreadPoolExecutor to execute the scrape_website function in parallel for each URL. The map method returns an iterator that allows us to iterate over the results as they become available.
Web crawling
Web crawling involves traversing a website to extract data from multiple pages. Like web scraping, it can be a time-consuming task. Concurrent.futures can help speed up web crawling by allowing you to execute multiple requests in parallel.
Here’s an example of how to use ThreadPoolExecutor to crawl a website:
import concurrent.futures
import requests
from bs4 import BeautifulSoup
def crawl_website(url):
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
links = soup.find_all('a')
for link in links:
# Do something with the link, such as extract data or follow it
crawl_website(link['href'])
urls = ['https://example.com']
with concurrent.futures.ThreadPoolExecutor() as executor:
results = executor.map(crawl_website, urls)
for result in results:
# Process the resultsIn this example, we define a function called crawl_website that takes a URL and extracts all the links from the web page. We then use ThreadPoolExecutor to execute the crawl_website function in parallel for each link we find. This allows us to crawl the entire website in parallel.
Building applications
Building applications that need to make many requests to external resources, such as APIs, can also benefit from concurrent.futures. By executing requests in parallel, you can speed up your application and improve its performance.
Here’s an example of how to use ThreadPoolExecutor to make multiple requests in parallel:
import concurrent.futures
import requests
def make_request(url):
response = requests.get(url)
# Do something with the response, such as extract data
return response.content
urls = ['https://api.example.com/users', 'https://api.example.com/products', 'https://api.example.com/orders']
with concurrent.futures.ThreadPoolExecutor() as executor:
results = executor.map(make_request, urls)
for result in results:
# Process the resultsIn this example, we define a function called make_request that takes a URL and makes a request to the API. We then create a list of URLs and use ThreadPoolExecutor to execute the make_request function in parallel for each URL.
Conclusion
Concurrency is a powerful technique for making software faster and more efficient. The concurrent.futures module provides a high-level interface for working with concurrency in Python. It is especially useful for tasks that involve parallel processing, such as web scraping, web crawling, and building applications that need to make many requests to external resources. By using ThreadPoolExecutor and ProcessPoolExecutor, you can speed up your code and improve its performance.
More content at PlainEnglish.io. Sign up for our free weekly newsletter. Join our Discord community and follow us on Twitter, LinkedIn and YouTube.
Learn how to build awareness and adoption for your startup with Circuit.






