BEAUTIFUL SOUP
How to make web scraping with Beautiful Soup 5x faster
Use these 3 simple techniques to speed up your web scraping with Beautiful Soup

Why is this useful and important?
Most of the time, when you scrape a site to pull public data for your data science projects, you end up doing it in a loop (sometimes a few thousand iterations), and every second you save per iteration adds up significantly in the overall run time.
Note: The difference is most noticeable when the server is in a different country from you. In other words, you are far away from the server and network latency is a big factor in the time taken.
1. Re-use sessions (and keep them alive) instead of creating a new request per page.
By default, if you do not use a session, a new connection to the server is opened every time you call requests.get(), and that causes significant handshake delays. Instead, create a session once and keep it alive until you finish your loop.
Here is the difference in the code:
Traditional code
import requests
from bs4 import BeautifulSoup
url_ = "<some_url_you_give>"  # placeholder: put your URL here
response_object = requests.get(url_)
soup = BeautifulSoup(response_object.text, "html.parser")
Code using session object
import requests
from bs4 import BeautifulSoup
url_ = "<some_url_you_give>"  # placeholder: put your URL here
session_object = requests.Session()  # create once, reuse for every request
page_obj = session_object.get(url_)
soup = BeautifulSoup(page_obj.text, "html.parser")
If you have to query a URL on the same domain again, you can reuse session_object directly. As long as the server keeps the connection open, this will NOT start a new connection sequence between your script and the server, which saves time.
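To make the loop case concrete, here is a minimal sketch of one session shared across many pages. The function name fetch_titles, the 10-second timeout and the choice to collect page titles are assumptions made for illustration; they are not part of the original snippets.

```python
import requests
from bs4 import BeautifulSoup


def fetch_titles(session, urls):
    """Fetch every URL through the same session and return the page titles.

    `session` is expected to be a requests.Session (or anything with a
    compatible .get method), created once by the caller.
    """
    titles = []
    for url in urls:
        # Every call goes through the shared session, so connections to the
        # same host can be reused instead of being set up from scratch.
        response = session.get(url, timeout=10)  # timeout value is an arbitrary choice
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        title_tag = soup.title
        titles.append(title_tag.get_text(strip=True) if title_tag else None)
    return titles
```

You would typically call it as `with requests.Session() as session: titles = fetch_titles(session, urls)`, where urls is your own list of pages on the same site. The with block closes the session, and its pooled connections, once the loop is done.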
2. Use lxml as the underlying parser instead of the default HTML parser.
lxml is faster than both html.parser and html5lib.
This is because the lxml parser that you invoke from Beautiful Soup is written natively in C (it uses the libxml2 C library), whereas html.parser is written in Python.
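If you want to check the difference on your own machine, a rough timing sketch like the one below can help. The synthetic SAMPLE_HTML document and the helper names time_parser and compare_parsers are invented for this example, and the numbers you get will depend on your documents and hardware. A parser that is not installed is reported as None instead of raising an error.

```python
import timeit

from bs4 import BeautifulSoup, FeatureNotFound

# Synthetic document made up for this sketch; real pages will behave differently.
SAMPLE_HTML = "<html><body>" + "".join(
    f"<div class='row'><a href='/item/{i}'>Item {i}</a></div>" for i in range(500)
) + "</body></html>"


def time_parser(parser_name, html=SAMPLE_HTML, repeats=3):
    """Return the average seconds per parse, or None if the parser is missing."""
    try:
        BeautifulSoup("<p>probe</p>", parser_name)  # cheap availability check
    except FeatureNotFound:
        return None
    total = timeit.timeit(lambda: BeautifulSoup(html, parser_name), number=repeats)
    return total / repeats


def compare_parsers(parser_names=("html.parser", "lxml")):
    """Time each named parser on the same document."""
    return {name: time_parser(name) for name in parser_names}
```

Calling compare_parsers() returns a dictionary that maps each parser name to its average parse time in seconds, so you can compare the two values directly.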
Note: lxml must be installed before you can use it. Depending on your environment, you might use one of these methods to install it if it is not already present.
$ apt-get install python3-lxml
(Linux, Debian/Ubuntu; the package is named python-lxml on older Python 2 systems)
$ easy_install lxml
(using easy_install)
$ pip install lxml
(if you are using the traditional pip installer)
Traditional code
soup = BeautifulSoup(response_object.text, "html.parser")
Code using lxml
soup = BeautifulSoup(response_object.text, "lxml")
3. Install the cchardet library
Detecting the encoding of a web document takes a chunk of time (especially if it is a large document). Along with lxml, the cchardet library makes detecting the document's encoding much faster.
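One detail worth knowing: Beautiful Soup only has an encoding to detect when it receives raw bytes (for example response_object.content). If you pass already-decoded text such as response_object.text, there is nothing left for it to detect. The sketch below, which assumes cchardet has already been installed with pip, hands bytes to Beautiful Soup and reads back the encoding it settled on through soup.original_encoding. The helper name parse_with_detection and the SAMPLE_BYTES document are made up for illustration.

```python
from bs4 import BeautifulSoup


def parse_with_detection(markup, parser="html.parser"):
    """Parse markup and report which encoding Beautiful Soup settled on.

    original_encoding is None when markup is already a str, because no
    detection is needed in that case.
    """
    soup = BeautifulSoup(markup, parser)
    return soup, soup.original_encoding


# Hypothetical sample standing in for response_object.content (raw bytes).
SAMPLE_BYTES = (
    b'<html><head><meta charset="utf-8"></head>'
    b"<body><p>caf\xc3\xa9</p></body></html>"
)
```

Calling parse_with_detection(SAMPLE_BYTES) returns the parsed soup together with the encoding that was used to decode it. On pages that do not declare their encoding, Beautiful Soup falls back to a detection library, and that fallback is the step cchardet speeds up.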
pip install cchardet
import cchardet  # optional: Beautiful Soup picks up cchardet on its own once it is installed
More references:
Improving performance, from the official Beautiful Soup documentation.
https://beautiful-soup-4.readthedocs.io/en/latest/#improving-performance
