Scrape Web Articles With Python
Use common web scraping techniques in Python, with BeautifulSoup, urllib3, and pdfkit, to build a library of articles that you can read offline.

Lately, I have been interested in creating a Second Brain as a complement to my current goal of building an automated life. I intend to create a repository of the content that I consume, add my notes to it, and make it searchable. This will let me surface content when I need it and keep a store of information that I have already vetted. As anyone who has looked into building an AI knows, the first thing you need is data. I don’t have enough time to read all the articles I come across, so I need a repository of articles that I can read anywhere on my Kindle. I’ll be building this with Python and some common web scraping techniques.
The first step is to scrape articles as I need them. After some liberal use of the Inspect tool in my Chrome browser, I identified two sites with a predictable article format: CNN.com and Vox.com.
Setting up your Python Environment
First, we need to install the libraries that the code depends on. I’m using Python 3.7 and the following libraries:
- Beautiful Soup: For web scraping
- urllib3: To call web services
- certifi: To facilitate calling HTTPS sites with urllib3
- pdfkit: To convert HTML to PDFs
Here are the pip install commands for the libraries.
pip install beautifulsoup4
pip install urllib3
pip install pdfkit
pip install certifi
First, we need to create a PoolManager with urllib3. I’m also using the certifi library here so that I can parse HTTPS websites.
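Here is a minimal sketch of that setup, not necessarily the exact code used for this project. It assumes certifi's CA bundle is used to verify certificates, and the helper names make_pool_manager and fetch_html are hypothetical ones I picked for illustration.

```python
import certifi
import urllib3


def make_pool_manager():
    """Create a PoolManager that verifies HTTPS certificates.

    certifi.where() returns the path to certifi's bundled CA certificates.
    """
    return urllib3.PoolManager(
        cert_reqs="CERT_REQUIRED",  # refuse connections with invalid certificates
        ca_certs=certifi.where(),
    )


def fetch_html(http, url):
    """Hypothetical helper: GET a page and return its body as text."""
    response = http.request("GET", url)
    # Assumes the page is UTF-8 encoded; undecodable bytes are replaced.
    return response.data.decode("utf-8", errors="replace")


http = make_pool_manager()
```

A call such as fetch_html(http, "https://www.vox.com") would then return the page's HTML as a string, ready to hand to Beautiful Soup. Reusing one PoolManager for every request lets urllib3 keep connections open between calls.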







