Liu Zuo Lin

Web Scraping In Python Day 1 — Beautiful Soup

In simple terms, web scraping refers to extracting data from websites automatically. It’s like going to a website and copying down the data by hand, except that we use Python (or another language) to automate the process.

What is Beautiful Soup?

Beautiful Soup is a Python library that is commonly used for web scraping. It provides an easy-to-use API that allows you to parse HTML and XML documents and extract the information you need. Current releases of Beautiful Soup 4 target Python 3; Python 2 support ended with version 4.9.3.

Installing Beautiful Soup

Before we start using Beautiful Soup, we need to install it. You can install it with pip by running the following command in a command prompt or terminal. The examples below also use the requests library, which can be installed the same way (pip install requests).

pip install beautifulsoup4

Basic Web Scraping

Let’s start with a basic example of web scraping using Beautiful Soup. Suppose we want to extract the title of a web page. We can do this using the following code:

from bs4 import BeautifulSoup
import requests

url = "https://www.example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.title.string

print(title)
  • we first import the BeautifulSoup class from the bs4 package, along with the requests library
  • we then specify the URL of the web page we want to scrape
  • we use the requests library to send a GET request to the web page and obtain its HTML content
  • we then pass the HTML content to the BeautifulSoup constructor and specify the parser we want to use (I usually just use html.parser, which ships with Python)
  • finally, we extract the title of the web page using soup.title.string, which is an attribute lookup rather than a method call
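
The steps above can be tried without any network access by handing Beautiful Soup a literal HTML string. The sketch below is only an illustration: SAMPLE_HTML and the helper page_title are made-up names that do not appear in the original example, and the None check covers pages that have no <title> tag.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML that requests.get(url).content would return.
SAMPLE_HTML = (
    "<html><head><title>Example Domain</title></head>"
    "<body><p>Hello</p></body></html>"
)


def page_title(html):
    """Return the text of the <title> tag, or None if the page has no title."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:  # soup.title is None when no <title> tag exists
        return None
    return soup.title.string


title = page_title(SAMPLE_HTML)
missing = page_title("<html><body><p>No title here</p></body></html>")

print(title)    # Example Domain
print(missing)  # None
```

Guarding against None matters because soup.title.string raises an AttributeError when the page has no title.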

Extracting Data from HTML

Beautiful Soup provides a number of methods that can be used to extract data from HTML documents. Here are some examples:

Searching for Tags by Name

soup.find_all('a')

This method finds all the <a> tags in the HTML document, puts them in a list-like ResultSet, and returns it.
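
As a small sketch of what can be done with that result, the snippet below collects the href attribute of every link. The HTML string is invented for illustration and is not taken from any real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML containing two links.
html = '<p><a href="/a">First</a> and <a href="/b">Second</a></p>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")                   # every <a> tag, in document order
hrefs = [link.get("href") for link in links]  # .get returns None if href is absent

print(len(links))  # 2
print(hrefs)       # ['/a', '/b']
```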

Searching for Tags by Class

soup.find_all(class_='header')

This call finds all the tags whose class attribute includes the value 'header'. The keyword is spelled class_ because class is a reserved word in Python. As before, the matching tags are returned in a list.
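
One detail worth seeing in action: a tag with several classes still matches as long as one of them is the class searched for. A minimal sketch with invented HTML:

```python
from bs4 import BeautifulSoup

# Made-up HTML: one tag has two classes, one has only "header", one has neither.
html = (
    '<div class="header main">Site header</div>'
    '<p class="header">Section header</p>'
    '<p class="footer">Footer</p>'
)
soup = BeautifulSoup(html, "html.parser")

headers = soup.find_all(class_="header")
header_tag_names = [tag.name for tag in headers]

print(header_tag_names)  # ['div', 'p']
```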

Searching for Tags by ID

soup.find(id='main')

This method returns the first tag whose id attribute has the value 'main', or None if no such tag exists.
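
Because find can come back empty-handed, it is usually worth checking the result before using it. A short sketch, again with invented HTML:

```python
from bs4 import BeautifulSoup

# Made-up HTML with a single element that has an id.
html = '<div id="main"><p>Content</p></div>'
soup = BeautifulSoup(html, "html.parser")

main = soup.find(id="main")        # the <div id="main"> tag
sidebar = soup.find(id="sidebar")  # no such id in this document

print(main.name)  # div
print(sidebar)    # None
```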

Extracting Text from Tags

soup.find('h1').text

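The .text attribute gives the text inside the first <h1> tag in the HTML document. If the document has no <h1> tag, find returns None and accessing .text raises an AttributeError.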
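
Text taken from real pages often carries the indentation and line breaks of the HTML source. The sketch below, on an invented heading, compares .text with get_text, which accepts a separator and a strip flag to tidy the pieces up.

```python
from bs4 import BeautifulSoup

# Made-up heading with nested markup and source indentation.
html = """<h1>
    Countries of the World
    <small>250 items</small>
</h1>"""
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")
raw = h1.text                           # keeps the newlines and indentation
clean = h1.get_text(" ", strip=True)    # strips each piece, joins with a space

print(repr(raw))
print(clean)  # Countries of the World 250 items
```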

Some Real Examples

Let’s use https://www.scrapethissite.com/pages/simple/

First things first, let’s open the inspect panel in our web browser so we are able to check the HTML of the web page.

Extracting all h1 elements

# getting all h1 tags

from bs4 import BeautifulSoup
import requests

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# we simply scrape every single h1 element 
h1s = soup.find_all('h1')

print(h1s)

# [<h1>
#      Countries of the World: A Simple Example
#      <small>250 items</small>
# </h1>]

Extracting all elements with the class ‘country’

Next, let’s say we want to extract all elements with class="country". In the inspect panel, each country block shows up as div.col-md-4.country. For those of you who are not too familiar with CSS, .col-md-4.country means that the element has 2 classes: col-md-4 and country. We want to use Python to extract all elements with the country class.

from bs4 import BeautifulSoup
import requests

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

countries = soup.find_all(class_='country')

print(countries)

# [<div class="col-md-4 country">
# <h3 class="country-name">
# <i class="flag-icon flag-icon-ad"></i>
#                             Andorra
#                         </h3>
# <div class="country-info">
# <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
# <strong>Population:</strong> <span class="country-population">84000</span><br/>
# <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
# </div>
# </div>, <div class="col-md-4 country">
# <h3 class="country-name">
# <i class="flag-icon flag-icon-ae"></i>
#                             United Arab Emirates
#                         </h3>
# <div class="country-info">
# <strong>Capital:</strong> <span class="country-capital">Abu Dhabi</span><br/>
# <strong>Population:</strong> <span class="country-population">4975593</span><br/>
# <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">82880.0</span><br/>
# </div>
# </div>, <div class="col-md-4 country">
# <h3 class="country-name">
# <i class="flag-icon flag-icon-af"></i>
#                             Afghanistan
#                         </h3>
# <div class="country-info">
# <strong>Capital:</strong> <span class="country-capital">Kabul</span><br/>
# <strong>Population:</strong> <span class="country-population">29121286</span><br/>
# <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">647500.0</span><br/>
# </div>
# </div>
# 
# ...]

Any element with class="country" will thus be extracted.
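
For readers who prefer the CSS notation, Beautiful Soup also has a select method that takes a CSS selector directly. The sketch below uses a cut-down, invented version of the page markup to show that find_all(class_='country') and the selector div.col-md-4.country pick out the same element here, and that a class such as country-info does not count as country.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup loosely modelled on the country page.
html = (
    '<div class="col-md-4 country">A</div>'
    '<div class="col-md-4">B</div>'
    '<div class="country-info">C</div>'
)
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="country")        # matches the class token "country"
by_selector = soup.select("div.col-md-4.country")  # CSS: a div with both classes

class_texts = [tag.text for tag in by_class]
selector_texts = [tag.text for tag in by_selector]

print(class_texts)     # ['A']
print(selector_texts)  # ['A']
```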

Extracting country names

So we are able to extract the countries, but what we have right now is extremely messy! Let’s say we are only interested in the country names.

from bs4 import BeautifulSoup
import requests

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

countries = soup.find_all(class_='country')

print(type(countries[0]))

# <class 'bs4.element.Tag'>

When we use the find_all method, we get a list of bs4.element.Tag objects. We can do further work on these objects in order to extract the names of the countries.
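
To give a feel for what a Tag object offers, here is a small sketch on an invented fragment. It reads the tag name and attributes, and runs a nested find on the tag itself rather than on the whole soup.

```python
from bs4 import BeautifulSoup, Tag

# Made-up fragment shaped like one country block.
html = '<div class="col-md-4 country"><h3 class="country-name">Andorra</h3></div>'
soup = BeautifulSoup(html, "html.parser")

country = soup.find(class_="country")

tag_name = country.name            # name of the tag
classes = country["class"]         # class is multi-valued, so this is a list
missing_attr = country.get("id")   # .get returns None for absent attributes
name_tag = country.find("h3")      # searches only inside this tag

print(isinstance(country, Tag))  # True
print(tag_name)                  # div
print(classes)                   # ['col-md-4', 'country']
print(missing_attr)              # None
print(name_tag.text)             # Andorra
```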

In each country tag, we can see that the name of the country is inside an h3 tag with class="country-name". We can thus search for the class country-name instead of country.

from bs4 import BeautifulSoup
import requests

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

countries = soup.find_all(class_='country-name')
print(countries)

# [<h3 class="country-name">
# <i class="flag-icon flag-icon-ad"></i>
#                             Andorra
#                         </h3>, <h3 class="country-name">
# <i class="flag-icon flag-icon-ae"></i>
#                             United Arab Emirates
#                         </h3>, <h3 class="country-name">
# <i class="flag-icon flag-icon-af"></i>
#                             Afghanistan
#                         </h3>
# ... ]

OK, we got something less messy, but don’t we just want the country name instead of a whole h3 tag? Yes, so we need to get the .text attribute of each tag and strip the surrounding whitespace.

from bs4 import BeautifulSoup
import requests

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

countries = soup.find_all(class_='country-name')
print([country.text.strip() for country in countries])

# ['Andorra', 'United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda', 'Anguilla', ... ]
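
The same idea extends to the other fields in each country block. The sketch below is one possible way to do it, not the only one: it parses a single block copied from the output shown earlier (SAMPLE_HTML is a name made up for this example) and builds a dictionary per country, converting the population and area to numbers.

```python
from bs4 import BeautifulSoup

# One country block, copied from the printed output earlier in the article.
SAMPLE_HTML = """
<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
                            Andorra
                        </h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
<strong>Population:</strong> <span class="country-population">84000</span><br/>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

records = []
for block in soup.find_all(class_="country"):
    # Each lookup searches only inside the current country block.
    records.append({
        "name": block.find(class_="country-name").text.strip(),
        "capital": block.find(class_="country-capital").text.strip(),
        "population": int(block.find(class_="country-population").text),
        "area_km2": float(block.find(class_="country-area").text),
    })

print(records)
```

On the live page, the same loop could be run on the soup built from response.content, assuming every block contains all four fields.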

Conclusion

Hopefully this was clear and helpful!

Some Final Words

If this story provided value and you wish to show a little support, you could:

  1. Clap 50 times for this story (this really, really helps me out)
  2. Sign up for a Medium membership using my link ($5/month to read unlimited Medium stories)

My Home Office Setup: https://zlliu.co/workspace

My Free Ebooks: https://zlliu.co/books

Level Up Coding

Thanks for being a part of our community! Before you go:

🚀👉 Join the Level Up talent collective and find an amazing job
