Summary

The website content outlines three key techniques to significantly accelerate web scraping using Beautiful Soup by reusing sessions, utilizing the lxml parser, and installing the cchardet library for faster encoding detection.

Abstract

The webpage titled "How to make webscraping with Beautiful Soup 5X faster" provides valuable insights into improving the efficiency of web scraping tasks. It emphasizes that when scraping data over thousands of iterations, optimizations can lead to substantial time savings. The author suggests using sessions to maintain connections across requests, which reduces latency caused by repeated handshakes. The article also advocates for switching from the default HTML parser to lxml for its superior speed due to its C-based implementation. Furthermore, it recommends the installation of the cchardet library to expedite the detection of document encoding, thereby enhancing overall performance. These techniques are particularly beneficial when dealing with servers in distant geographical locations, where internet latency can significantly impact the duration of scraping operations.

Opinions

The author believes that every second saved in web scraping loops can accumulate to significant time savings, especially in data science projects.
Reusing sessions is presented as a critical method to avoid unnecessary connection delays, implying that this is an often-overlooked optimization.
The preference for lxml over other parsers like html.parser or html5lib is based on its performance advantage, which stems from its C-based foundation.
The article suggests that the combination of lxml and cchardet can lead to a more efficient encoding detection process, hinting at the inefficiencies of default encoding detection methods.
The author points out that the benefits of these optimizations are more pronounced when there is high internet latency due to geographical distances from the server.

BEAUTIFUL SOUP

How to make webscraping with Beautiful Soup 5X faster

Use these 3 simple techniques to speed up your web scraping with Beautiful Soup

Image courtesy : Photo by Stanislav Remnev on Unsplash

Why is this useful and important?

Most of the time, when you scrape a site to pull public data for your data science projects, you do it in a loop (sometimes over a few thousand iterations), and every second you save per iteration adds up significantly in the overall time taken.

Note: The difference is most noticeable when the server is located in a different country, that is, when you are far away from the server and internet latency is a big factor in the time taken.

1. Reuse sessions (and keep them alive) instead of creating a new request per page.

By default, if you do not use a session, a new connection is opened to the server every time you call requests.get(), and that causes significant handshake delays. Instead, create a session once and keep it alive until you finish your loop.

Here is the difference in the code:

Traditional code

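import requests
from bs4 import BeautifulSoup

url_ = "<some_url_you_give>"  # placeholder: replace with the page you want to scrape
response_object = requests.get(url_)  # opens a new connection on every call
soup = BeautifulSoup(response_object.text, 'html.parser')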

Code using session object

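import requests
from bs4 import BeautifulSoup

url_ = "<some_url_you_give>"  # placeholder: replace with the page you want to scrape
session_object = requests.Session()  # create once, outside your loop
page_obj = session_object.get(url_)
soup = BeautifulSoup(page_obj.text, 'html.parser')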

If you have to query another URL on the same domain, you can reuse session_object directly. This will NOT set up a new connection between your script and the server, which saves time.
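To make the loop case concrete, here is a minimal sketch of reusing one session across many pages. The function name scrape_titles, the 10-second timeout, and the choice to collect page titles are assumptions made for illustration, not part of the original article.

```python
import requests
from bs4 import BeautifulSoup


def scrape_titles(urls, session=None, parser="html.parser"):
    """Fetch every URL through one shared session and return the page titles."""
    # Create the session once, unless the caller passed one in.
    own_session = session is None
    if own_session:
        session = requests.Session()

    titles = []
    try:
        for url in urls:
            # Requests to the same host can reuse the open connection.
            response = session.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, parser)
            titles.append(soup.title.get_text(strip=True) if soup.title else None)
    finally:
        # Only close a session that this function created.
        if own_session:
            session.close()
    return titles
```

Call it with a list of URLs from the same site, for example scrape_titles(list_of_urls), so that the connection is set up once rather than once per page.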

2. Use lxml as the underlying parser instead of the default HTML parser.

lxml is faster than html.parser or html5lib.

This is because the lxml parser that you invoke from Beautiful Soup is implemented in C (it uses the libxml2 C library), whereas html.parser is written in Python.

Note: lxml must be installed before you use it. Depending on your environment, you might use one of these methods to install lxml if it is not already present.

$ apt-get install python3-lxml
( Debian/Ubuntu Linux, Python 3; the older python-lxml package targets Python 2 )

$ easy_install lxml
( using easy_install, which is deprecated in favor of pip )

$ pip install lxml
( if you are using the pip installer )
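If lxml is missing, Beautiful Soup raises bs4.FeatureNotFound when you ask for it. The sketch below shows one way to prefer lxml and fall back to the built-in parser; the helper name make_soup and its parameters are hypothetical, introduced only for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml", fallback="html.parser"):
    """Parse with the preferred parser, falling back if it is not installed."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when no installed parser matches the requested name.
        return BeautifulSoup(markup, fallback)
```

This keeps a script working on a machine without lxml, at the cost of silently running at the slower speed there.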

Traditional code

soup = BeautifulSoup(response_object.text, 'html.parser')

Code using lxml

soup = BeautifulSoup(response_object.text, 'lxml')
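To check the difference on your own machine, a rough timing sketch like the following may help. The synthetic table, the function names, and the repeat count are assumptions chosen for illustration; real pages will give different numbers, and parsers that are not installed are skipped.

```python
import timeit

from bs4 import BeautifulSoup, FeatureNotFound


def build_sample_html(rows=2000):
    """Build a synthetic HTML table to use as parser input."""
    cells = "".join(f"<tr><td>{i}</td><td>item {i}</td></tr>" for i in range(rows))
    return f"<html><body><table>{cells}</table></body></html>"


def time_parsers(html, parsers=("html.parser", "lxml"), repeats=5):
    """Return the average seconds per parse for each installed parser."""
    timings = {}
    for parser in parsers:
        try:
            BeautifulSoup("<p></p>", parser)  # probe: is this parser installed?
        except FeatureNotFound:
            continue
        total = timeit.timeit(lambda: BeautifulSoup(html, parser), number=repeats)
        timings[parser] = total / repeats
    return timings


if __name__ == "__main__":
    for name, seconds in time_parsers(build_sample_html()).items():
        print(f"{name}: {seconds * 1000:.1f} ms per parse")
```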

3. Install the cchardet library

Detecting the encoding used in a web document takes a chunk of time (especially if it is a large document). Used along with lxml, the cchardet library makes detecting the document's encoding much faster.

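pip install cchardet

import cchardet  # optional check that the install worked; Beautiful Soup uses cchardet on its own once it is installed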
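One detail worth keeping in mind: Beautiful Soup only has to detect an encoding when it is given raw bytes, for example response.content from requests. A str such as response.text is already decoded. The sketch below illustrates this; the function names are hypothetical, and the default parser argument assumes lxml is installed.

```python
import importlib.util

from bs4 import BeautifulSoup


def cchardet_available():
    """Report whether the cchardet package can be imported."""
    return importlib.util.find_spec("cchardet") is not None


def parse_markup(markup, parser="lxml"):
    """Parse markup and return the soup plus the encoding Beautiful Soup settled on.

    For bytes input, original_encoding holds the encoding that was used.
    For str input, no detection is needed and original_encoding is None.
    """
    soup = BeautifulSoup(markup, parser)
    return soup, soup.original_encoding
```

With requests, that means calling parse_markup(response.content) rather than parse_markup(response.text) if you want Beautiful Soup, and therefore cchardet, to handle the encoding.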

More references:

Improving performance, from the official Beautiful Soup documentation:

https://beautiful-soup-4.readthedocs.io/en/latest/#improving-performance
