Web Scraping In Python Day 1 — Beautiful Soup

In simple terms, web scraping refers to extracting data from websites automatically. It’s like going to a website and copying down the data by hand, except that we use Python (or whatever language) to automate the process.

What is Beautiful Soup?

Beautiful Soup is a Python library that is commonly used for web scraping. It provides an easy-to-use API for parsing HTML and XML documents and extracting the information you need. Note that current releases of Beautiful Soup 4 support Python 3 only; Python 2 support ended with the 4.9.x series.

Installing Beautiful Soup

Before we start using Beautiful Soup, we need to install it with pip. The examples below also use the requests library to download pages, so we install both. Run the following command in Command Prompt or a terminal.

pip install beautifulsoup4 requests

Basic Web Scraping

Let’s start with a basic example of web scraping using Beautiful Soup. Suppose we want to extract the title of a web page. We can do this using the following code:

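from bs4 import BeautifulSoup
import requests

url = "https://www.example.com"
# a timeout stops the script from hanging forever on a slow server
response = requests.get(url, timeout=10)
# raise an exception on 4xx/5xx responses instead of parsing an error page
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.title.string

print(title)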

We first import the BeautifulSoup class and the requests library.
We then specify the URL of the web page we want to scrape.
We use the requests library to send a GET request to the web page and obtain its HTML content.
We then pass the HTML content to the BeautifulSoup constructor and specify the parser we want to use (I usually just use html.parser).
Finally, we extract the title of the web page using the soup.title.string attribute.
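
One thing the steps above gloss over: soup.title is None when a page has no <title> tag, so soup.title.string would raise an AttributeError. Here is a minimal sketch of a safer version. The get_title helper and the inline SAMPLE_HTML string are my own illustrations (not part of Beautiful Soup), and I use a hardcoded string so the snippet runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, so this sketch runs without downloading anything.
SAMPLE_HTML = "<html><head><title>Example Domain</title></head><body><p>Hi</p></body></html>"


def get_title(html):
    """Return the page title as a stripped string, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> tag, and
    # .string is None when the tag is empty.
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


print(get_title(SAMPLE_HTML))
print(get_title("<html><body><p>No title here</p></body></html>"))
```

The first call prints the title and the second prints None instead of crashing.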

Extracting Data from HTML

Beautiful Soup provides a number of methods that can be used to extract data from HTML documents. Here are some examples:

Searching for Tags by Name

soup.find_all('a')

This method finds all the <a> tags in the HTML document and returns them in a list (technically a ResultSet, which behaves like a list).
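
Usually we want something out of those <a> tags, such as the link targets. A small sketch, assuming a made-up HTML snippet (the html string below is mine, for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for this illustration.
html = """
<ul>
  <li><a href="/pages/simple/">Simple</a></li>
  <li><a href="/pages/forms/">Forms</a></li>
  <li><a>No link target</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# Tag.get returns None for a missing attribute instead of raising KeyError,
# so we can skip <a> tags that have no href.
hrefs = [a.get("href") for a in links if a.get("href") is not None]

print(len(links))
print(hrefs)
```

Using a.get("href") rather than a["href"] is a deliberate choice here, since the third <a> tag has no href attribute.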

Searching for Tags by Class

soup.find_all(class_='header')

This method finds all the tags whose class attribute includes the value 'header'. Similarly, it puts them into a list and returns the list.

Searching for Tags by ID

soup.find(id='main')

This method returns the first tag whose id attribute has the value 'main', or None if no tag matches.
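
To make the difference from find_all concrete, here is a short sketch on a made-up HTML string (my own example): find gives back a single tag or None, while find_all always gives back a list, possibly empty.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
soup = BeautifulSoup('<div id="main"><p>Hello</p></div>', "html.parser")

main = soup.find(id="main")         # a single Tag
missing = soup.find(id="sidebar")   # None, because nothing matches
missing_all = soup.find_all(id="sidebar")  # an empty list

print(main.name)
print(missing)
print(len(missing_all))
```

So with find, it is worth checking for None before touching the result.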

Extracting Text from Tags

soup.find('h1').text

The .text attribute gives the text inside the first <h1> tag in the HTML document. If the page has no <h1> tag, find returns None and accessing .text raises an AttributeError.
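
The raw .text usually carries the whitespace and line breaks from the HTML source. A sketch of two ways to tidy it, using a made-up snippet modelled on the heading we scrape later (the html string is my own illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, modelled on the <h1> of the example site used below.
html = """
<h1>
    Countries of the World: A Simple Example
    <small>250 items</small>
</h1>
"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

raw = h1.text                        # includes newlines and indentation
clean = h1.get_text(" ", strip=True)  # strips each piece, joins with a space

print(repr(raw))
print(clean)
# Countries of the World: A Simple Example 250 items
```

get_text(" ", strip=True) is handy when a tag contains nested tags, because it strips each piece of text separately before joining them.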

Some Real Examples

Let’s use https://www.scrapethissite.com/pages/simple/

First things first, let’s open the inspect panel in our web browser so we are able to check the HTML of the web page.

Extracting all h1 elements

# getting all h1 tags

from bs4 import BeautifulSoup
import requests

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# we simply scrape every single h1 element 
h1s = soup.find_all('h1')

print(h1s)

# [<h1>
#      Countries of the World: A Simple Example
#      <small>250 items</small>
# </h1>]

Extracting all elements with the class ‘country’

Next, let’s say we want to extract all elements with class="country". In the inspect panel, these elements show up as div.col-md-4.country. For those of you who are not too familiar with CSS, .col-md-4.country means that the element has 2 classes: col-md-4 and country. We want to use Python to extract all elements with the country class.

from bs4 import BeautifulSoup
import requests

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

countries = soup.find_all(class_='country')

print(countries)

# [<div class="col-md-4 country">
# <h3 class="country-name">
# <i class="flag-icon flag-icon-ad"></i>
#                             Andorra
#                         </h3>
# <div class="country-info">
# <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
# <strong>Population:</strong> <span class="country-population">84000</span><br/>
# <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
# </div>
# </div>, <div class="col-md-4 country">
# <h3 class="country-name">
# <i class="flag-icon flag-icon-ae"></i>
#                             United Arab Emirates
#                         </h3>
# <div class="country-info">
# <strong>Capital:</strong> <span class="country-capital">Abu Dhabi</span><br/>
# <strong>Population:</strong> <span class="country-population">4975593</span><br/>
# <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">82880.0</span><br/>
# </div>
# </div>, <div class="col-md-4 country">
# <h3 class="country-name">
# <i class="flag-icon flag-icon-af"></i>
#                             Afghanistan
#                         </h3>
# <div class="country-info">
# <strong>Capital:</strong> <span class="country-capital">Kabul</span><br/>
# <strong>Population:</strong> <span class="country-population">29121286</span><br/>
# <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">647500.0</span><br/>
# </div>
# </div>
# 
# ...]

Any element with class="country" will thus be extracted.
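
Here is a small sketch of how class matching behaves, using a made-up HTML string (mine, for illustration). class_='country' matches an element if country is one of its classes, but it does not match a different class that merely starts with the same letters. If we want elements that have both classes, the select method accepts a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
html = """
<div class="col-md-4 country">A</div>
<div class="country">B</div>
<div class="col-md-4">C</div>
<div class="country-info">D</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches any element that has "country" among its classes.
by_class = [div.text for div in soup.find_all(class_="country")]

# CSS selector: elements that have BOTH col-md-4 and country.
both_classes = [div.text for div in soup.select(".col-md-4.country")]

print(by_class)
# ['A', 'B']
print(both_classes)
# ['A']
```

Note that the country-info element is not picked up by either search.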

Extracting country names

So we are able to extract the countries, but what we have currently is extremely messy! Let’s say we are only interested in the country names.

from bs4 import BeautifulSoup
import requests

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

countries = soup.find_all(class_='country')

print(type(countries[0]))

# <class 'bs4.element.Tag'>

When we use the find_all method, we get a list of bs4.element.Tag objects. We can do further work on these objects in order to extract the names of the countries.
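
As a rough sketch of what "further work" can look like: a Tag has a name, its attributes can be read like a dictionary, and we can call find on it to search only inside that tag. The HTML below is a trimmed copy of the Andorra entry from the output above, hardcoded so the snippet runs offline.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one country block from the example site's output.
html = """<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
    Andorra
</h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
country = soup.find(class_="country")

print(country.name)
# div
print(country["class"])
# ['col-md-4', 'country']

# Searching from the Tag only looks at its descendants.
name_tag = country.find(class_="country-name")
print(name_tag.text.strip())
# Andorra
```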

In each country tag, we can see that the name of the country is inside an h3 tag with class="country-name". We can thus search for country-name instead of country.

from bs4 import BeautifulSoup
import requests

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

countries = soup.find_all(class_='country-name')
print(countries)

# [<h3 class="country-name">
# <i class="flag-icon flag-icon-ad"></i>
#                             Andorra
#                         </h3>, <h3 class="country-name">
# <i class="flag-icon flag-icon-ae"></i>
#                             United Arab Emirates
#                         </h3>, <h3 class="country-name">
# <i class="flag-icon flag-icon-af"></i>
#                             Afghanistan
#                         </h3>
# ... ]

Ok, we got something less messy, but don’t we just want the country name instead of the whole h3 tag? Yes, so we need to get the .text attribute of each tag and call .strip() to remove the surrounding whitespace.

from bs4 import BeautifulSoup
import requests

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

countries = soup.find_all(class_='country-name')
print([country.text.strip() for country in countries])

# ['Andorra', 'United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda', 'Anguilla', ... ]
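
The same idea extends to the other fields. Below is a sketch that turns each country block into a dictionary, assuming every block contains the country-name, country-capital, country-population and country-area classes seen in the output earlier. The parse_countries helper and the dictionary keys are my own naming, and the HTML is a hardcoded copy of two entries so the snippet runs offline.

```python
from bs4 import BeautifulSoup

# Two country blocks copied from the example site's output above.
SAMPLE_HTML = """
<div class="col-md-4 country">
<h3 class="country-name"><i class="flag-icon flag-icon-ad"></i>
    Andorra
</h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
<strong>Population:</strong> <span class="country-population">84000</span><br/>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
</div>
</div>
<div class="col-md-4 country">
<h3 class="country-name"><i class="flag-icon flag-icon-ae"></i>
    United Arab Emirates
</h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Abu Dhabi</span><br/>
<strong>Population:</strong> <span class="country-population">4975593</span><br/>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">82880.0</span><br/>
</div>
</div>
"""


def parse_countries(html):
    """Return one dict per element with class 'country' (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for country in soup.find_all(class_="country"):
        # Small local helper: stripped text of the first descendant with a class.
        def field(css_class):
            return country.find(class_=css_class).get_text(strip=True)

        records.append({
            "name": field("country-name"),
            "capital": field("country-capital"),
            "population": int(field("country-population")),
            "area_km2": float(field("country-area")),
        })
    return records


records = parse_countries(SAMPLE_HTML)
for record in records:
    print(record)
```

To run this against the live page, we could pass response.content from the earlier examples to parse_countries instead of SAMPLE_HTML. This sketch would fail with an AttributeError if a block were missing one of the fields, so real code might want to check for None first.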

Conclusion

Hopefully this was clear and helpful!
