Web Scraping In Python Day 1 — Beautiful Soup

In simple terms, web scraping means extracting data from websites automatically. It's like visiting a website and copying down the data by hand, except that we use Python (or another language) to automate the process.
What is Beautiful Soup?
Beautiful Soup is a Python library that is commonly used for web scraping. It provides an easy-to-use API for parsing HTML and XML documents and extracting the information you need. Current releases of Beautiful Soup 4 run on Python 3 only; version 4.9.3 was the last release to support Python 2.
Installing Beautiful Soup
Before we start using Beautiful Soup, we need to install it. You can install it using pip by running the following command in your command prompt or terminal:
pip install beautifulsoup4
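To confirm the install worked, a quick check along these lines may help. It is only a sketch: it imports the package and prints its version, and the variable name installed_version is made up for illustration.

```python
# Quick sanity check that the package installed correctly.
# The pip package is called "beautifulsoup4", but the import name is "bs4".
import bs4

installed_version = bs4.__version__
print(installed_version)
```

If the import fails with ModuleNotFoundError, pip most likely installed the package into a different Python environment from the one running the script.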
Basic Web Scraping
Let’s start with a basic example of web scraping using Beautiful Soup. Suppose we want to extract the title of a web page. We can do this using the following code:
from bs4 import BeautifulSoup
import requests
url = "https://www.example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.title.string
print(title)
- we first import the BeautifulSoup class from the bs4 package, along with the requests library.
- we then specify the URL of the web page we want to scrape.
- we use the requests library to send a GET request to the web page and obtain its HTML content.
- we then pass the HTML content to the BeautifulSoup constructor and specify the parser we want to use (I usually just use html.parser).
- finally, we extract the title of the web page using soup.title.string (an attribute, not a method).
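The parsing steps above can also be tried without a network connection by handing Beautiful Soup an HTML string directly. The sketch below assumes a made-up page (sample_html) and a hypothetical helper (get_title); it also guards against pages that have no title tag, in which case soup.title is None.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.content
sample_html = """
<html>
  <head><title>Example Domain</title></head>
  <body><h1>Hello</h1></body>
</html>
"""

def get_title(html):
    """Return the page title, or None if the page has no <title> tag."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when there is no <title>, so check before reading .string
    if soup.title is None:
        return None
    return soup.title.string

print(get_title(sample_html))                   # Example Domain
print(get_title("<html><body></body></html>"))  # None
```

Without the None check, soup.title.string would raise an AttributeError on a page that has no title.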
Extracting Data from HTML
Beautiful Soup provides a number of methods that can be used to extract data from HTML documents. Here are some examples:
Searching for Tags by Name
soup.find_all('a')
This method finds all the <a> tags in the HTML document and returns them in a list.
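A common follow-up is to read the href attribute of each link. Here is a minimal sketch on a made-up HTML snippet; the variable names are for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet for illustration
html = """
<p>
  <a href="/about">About</a>
  <a href="https://www.example.com">Example</a>
  <a>No link here</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
# .get("href") returns None instead of raising when the attribute is missing
hrefs = [a.get("href") for a in links]

print(len(links))  # 3
print(hrefs)       # ['/about', 'https://www.example.com', None]
```

Using a["href"] would also work for the first two links, but it raises a KeyError on the third, which has no href.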
Searching for Tags by Class
soup.find_all(class_='header')
This method finds all the tags whose class attribute includes the value 'header'. Similarly, it returns them in a list.
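To see this in action, here is a small sketch on made-up HTML. Note that a tag with several classes still matches as long as one of them is 'header'.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet for illustration
html = """
<div class="header">Site title</div>
<div class="header large">Big banner</div>
<div class="footer">Contact</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python
headers = soup.find_all(class_="header")
texts = [tag.text for tag in headers]

print(texts)  # ['Site title', 'Big banner']
```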
Searching for Tags by ID
soup.find(id='main')
This method finds the first tag that has an id attribute with the value 'main'. If there is no such tag, it returns None.
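Because find returns a single tag or None, it is worth checking the result before using it. Here is a minimal sketch; the HTML and the 'sidebar' id are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet for illustration
html = '<div id="main"><p>Content</p></div>'
soup = BeautifulSoup(html, "html.parser")

main = soup.find(id="main")
missing = soup.find(id="sidebar")  # no element has this id

print(main.name)  # div
print(missing)    # None
```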
Extracting Text from Tags
soup.find('h1').text
This extracts the text inside the first <h1> tag in the HTML document. Note that .text is an attribute, not a method.
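The raw .text often contains newlines and indentation from the HTML source. The sketch below, on a made-up snippet, compares it with get_text, which can strip each piece of text and join the pieces with a separator.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet for illustration
html = """
<h1>
  Countries of the World
  <small>250 items</small>
</h1>
"""
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

raw = h1.text                         # keeps the whitespace from the source
clean = h1.get_text(" ", strip=True)  # strips each piece and joins with a space

print(repr(raw))
print(clean)  # Countries of the World 250 items
```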
Some Real Examples
Let’s use https://www.scrapethissite.com/pages/simple/

First things first, let’s open the inspect panel in our web browser so we are able to check the HTML of the web page.
Extracting all h1 elements
# getting all h1 tags
from bs4 import BeautifulSoup
import requests
url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# we simply scrape every single h1 element
h1s = soup.find_all('h1')
print(h1s)
# [<h1>
# Countries of the World: A Simple Example
# <small>250 items</small>
# </h1>]
Extracting all elements with the class ‘country’

Next, let’s say we want to extract all elements with class="country". For those of you who are not too familiar with CSS, .col-md-4.country means that this element has 2 classes: col-md-4 and country. We want to use Python to extract all elements with the country class.
from bs4 import BeautifulSoup
import requests
url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
countries = soup.find_all(class_='country')
print(countries)
# [<div class="col-md-4 country">
# <h3 class="country-name">
# <i class="flag-icon flag-icon-ad"></i>
# Andorra
# </h3>
# <div class="country-info">
# <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
# <strong>Population:</strong> <span class="country-population">84000</span><br/>
# <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
# </div>
# </div>, <div class="col-md-4 country">
# <h3 class="country-name">
# <i class="flag-icon flag-icon-ae"></i>
# United Arab Emirates
# </h3>
# <div class="country-info">
# <strong>Capital:</strong> <span class="country-capital">Abu Dhabi</span><br/>
# <strong>Population:</strong> <span class="country-population">4975593</span><br/>
# <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">82880.0</span><br/>
# </div>
# </div>, <div class="col-md-4 country">
# <h3 class="country-name">
# <i class="flag-icon flag-icon-af"></i>
# Afghanistan
# </h3>
# <div class="country-info">
# <strong>Capital:</strong> <span class="country-capital">Kabul</span><br/>
# <strong>Population:</strong> <span class="country-population">29121286</span><br/>
# <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">647500.0</span><br/>
# </div>
# </div>
#
# ...]
Any element with class="country" will thus be extracted.
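One detail worth checking is that class_='country' matches whole class names only, so the nested country-name and country-info elements are not picked up. The sketch below uses a trimmed, made-up copy of the page markup, and also shows select, which accepts the CSS selector form .col-md-4.country directly.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of one country block
html = """
<div class="col-md-4 country">
  <h3 class="country-name">Andorra</h3>
  <div class="country-info">...</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="country")
by_selector = soup.select(".col-md-4.country")  # CSS selector: both classes required

print(len(by_class))         # 1
print(len(by_selector))      # 1
print(by_class[0]["class"])  # ['col-md-4', 'country']
```

Both calls return the same outer div here; select is simply an alternative when you already have a CSS selector from the inspect panel.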
Extracting country names
So we are able to extract the countries, but what we have currently is extremely messy! Let’s say we are only interested in the country names.
from bs4 import BeautifulSoup
import requests
url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
countries = soup.find_all(class_='country')
print(type(countries[0]))
# <class 'bs4.element.Tag'>
When we use the find_all method, we get a ResultSet, which behaves like a list, of bs4.element.Tag objects. We can do further work on these objects in order to extract the names of the countries.
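As one example of such further work, a Tag supports the same search methods as the soup itself, so we can search inside a single country. This is a sketch on a trimmed, made-up copy of the markup.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of one country block
html = """
<div class="col-md-4 country">
  <h3 class="country-name"><i class="flag-icon flag-icon-ad"></i> Andorra</h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
country = soup.find_all(class_="country")[0]

# Search only within this one Tag rather than the whole document
name_tag = country.find(class_="country-name")

print(country.name)           # div
print(name_tag.name)          # h3
print(name_tag.text.strip())  # Andorra
```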

In each country tag, we can see that the name of the country is inside an h3 tag with class="country-name". We can thus search for country-name instead of country.
from bs4 import BeautifulSoup
import requests
url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
countries = soup.find_all(class_='country-name')
print(countries)
# [<h3 class="country-name">
# <i class="flag-icon flag-icon-ad"></i>
# Andorra
# </h3>, <h3 class="country-name">
# <i class="flag-icon flag-icon-ae"></i>
# United Arab Emirates
# </h3>, <h3 class="country-name">
# <i class="flag-icon flag-icon-af"></i>
# Afghanistan
# </h3>
# ... ]
OK, we got something less messy, but don't we just want the country name instead of the whole h3 tag? Yes, so we need to get the .text attribute of each country object and strip the surrounding whitespace.
from bs4 import BeautifulSoup
import requests
url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
countries = soup.find_all(class_='country-name')
print([country.text.strip() for country in countries])
# ['Andorra', 'United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda', 'Anguilla', ... ]
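The same idea extends to the other fields. The sketch below collects name, capital, population and area into one dictionary per country. It runs on a trimmed copy of the markup shown earlier so that it works offline; the function name parse_countries and the dictionary keys are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the page markup for two countries
html = """
<div class="col-md-4 country">
  <h3 class="country-name"><i class="flag-icon flag-icon-ad"></i> Andorra</h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
    <strong>Population:</strong> <span class="country-population">84000</span><br/>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
  </div>
</div>
<div class="col-md-4 country">
  <h3 class="country-name"><i class="flag-icon flag-icon-ae"></i> United Arab Emirates</h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Abu Dhabi</span><br/>
    <strong>Population:</strong> <span class="country-population">4975593</span><br/>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">82880.0</span><br/>
  </div>
</div>
"""

def parse_countries(html):
    """Return one dict per element with class 'country'."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for country in soup.find_all(class_="country"):
        records.append({
            "name": country.find(class_="country-name").text.strip(),
            "capital": country.find(class_="country-capital").text.strip(),
            # Convert the numeric fields from text to numbers
            "population": int(country.find(class_="country-population").text),
            "area_km2": float(country.find(class_="country-area").text),
        })
    return records

records = parse_countries(html)
print(records[0]["name"], records[0]["population"])  # Andorra 84000
print(len(records))                                  # 2
```

To run this against the live page, pass response.content from the earlier requests.get call instead of the inline string. This assumes every country block contains all four fields; a missing field would make find return None and raise an AttributeError.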
Conclusion
Hopefully this was clear and helpful!