Beautifulsoup — All that you need to know to get any data from any website using this Python library.
The goal here is to understand how you can use the Beautifulsoup library to fetch and retrieve any data you want from any website.
I will explain the concept from the beginning, how you should look at the data, and share some tips for problems you may run into while scraping and how to work around them.
If you prefer to watch, I have a video version of this tutorial; the YouTube link is at the end of this post.
The “action” we want to perform here has a name: “web scraping”. That is, load the web page you want and search it for the data you need, using a tool that can automatically fetch and parse that data for you.
These are the steps that we will present:
1 — Prepare your environment.
2 — Understanding how to use it.
3 — Using Beautifulsoup.
4 — Save the content and show it on your page.
Why use this?
There are many right answers to this question, but you can, for example, use it to fetch and parse data that you want to show on your own website.
For example, on my website (www.profession-programmer.com), I wanted that every time I publish a new post on Medium or on my YouTube channel, the site fetches the data and parses it into the format I want to present. This way, my website is always up to date without me having to add anything manually.
This is just one example; I see people fetching stock data, pulling news from different websites, and more.
And we don't always have an API that gives us the data we want, so this is definitely a tool you need.
1 — Prepare your environment.
In order to start, we need our base environment. This is what we want to have set up:
1 — A configured Flask application; it can be local, we don't need to deploy it.
2 — The Beautifulsoup library installed.
1 — Create app project.
To speed up this process, since we only need a basic setup, I will provide the initial project state; you just have to check it out to your machine, install the dependencies, and we can continue.
Link: https://github.com/felipeflorencio/BeautifulsoupSample/releases/tag/Start
Now that we have the project, let's start it and see if everything is running. First, open the README.md file and export the environment variables that you need in order to start the project.
If you are not familiar with this process, I highly recommend reading this post: https://itnext.io/beginning-with-flask-project-the-5-most-important-information-to-know-before-starting-f075e0fb0aec. There we cover all the basics, not in depth, but enough to understand everything we will talk about here regarding Flask. Or, if you prefer, I also have a video version of that post, which you can find here: https://www.youtube.com/watch?v=DaOoYRsJu5I&list=PLz3OOILu_dPk1LfYv1YEBW_fid728Pcfm
Go to your project folder, and you should have this:
Let's start our virtual environment by typing:
source bin/activate
Now, export the variables that we need before starting:
export FLASK_APP='app.py'; export APP_SETTINGS='config.DevelopmentConfig'
If you are not familiar with the 'config' file, you can learn about it in depth in this post: https://itnext.io/how-and-why-have-a-properly-configuration-handling-file-using-flask-1fd925c88f4c
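If you just need the gist, here is a minimal sketch of what a config.py with a DevelopmentConfig class typically looks like. This is only an illustration under my own assumptions; the actual file in the sample project may differ.

# config.py - hypothetical minimal example; the sample project's file may differ
import os

class Config:
    DEBUG = False
    SQLALCHEMY_TRACK_MODIFICATIONS = False
    # Read the database URL from the environment, with a local fallback
    SQLALCHEMY_DATABASE_URI = os.environ.get('DATABASE_URL', 'sqlite:///app.db')

class DevelopmentConfig(Config):
    DEBUG = True

The APP_SETTINGS variable we exported above is what tells Flask which of these classes to load.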
Let's run it and check that everything is fine:
python app.py
If everything is good, you can access http://localhost:5000/ and you should see this:
2 — Installing Beautifulsoup.
We have everything ready to start. To install our library, go to the project folder that has your virtual environment and type:
pip install beautifulsoup4
After it is successfully installed, let's start!
2 — Understanding how to use it.
This library provides an abstraction over the HTML code. Basically, what this library (and many others like it) does is load the HTML and expose helper methods that let you navigate through the source code the same way you would access functions or attributes in your own code: it builds a 'tree' that you can walk through to reach any part of the HTML.
The page for this project is here: https://www.crummy.com/software/BeautifulSoup/
Ok, for this tutorial I will use Medium, and the reason is that this website has a "complex" setup, and how you fetch the data will impact your ability to parse and use it.
Let’s create our first object that will parse the data from Medium:
1 — Create a folder named "scraping";
2 — Create a file named 'medium_scraping.py'.
*As we want this to be a Python module, also create an empty __init__.py file inside this folder.
The scraping work has two main parts: one is requesting the data from the website, and the other is taking the data that the request returned and passing it to the Beautifulsoup library to parse, so you can access it.
There are two main ways to do this. I will show the requests library first, only because many tutorials use it, but we will end up using another approach that actually combines two tools, Request and urlopen.
Let’s start:
1 — Install requests using: pip install requests
2 — Import it;
3 — Load the content from the URL using the requests library;
4 — Pass this content to Beautifulsoup;
This will be our initial setup:
import requests
from bs4 import BeautifulSoup

# I will use my medium profile page, it's open so you can try too
MEDIUM_URL = "https://medium.com/@ProfessionProgrammer"

def scraping_medium():
    page = requests.get(MEDIUM_URL)
    parsed_data = BeautifulSoup(page.content, 'html.parser')
    print(parsed_data)
    return parsed_data
Good, now we can use this. Just to try it and see if we are receiving the data, go back to our app.py file, import our module, and inside 'get_load_parser()' call the method scraping_medium() to check that it is working fine.
This will be your code:
import os
from flask import Flask, render_template, jsonify, Response
from extensions import database, commands
from scraping.medium_scraping import scraping_medium

app = Flask(__name__, static_url_path='')
app.config.from_object(os.environ['APP_SETTINGS'])

database.init_app(app)
commands.init_app(app)

@app.route("/")
def index():
    return render_template('index.html')

@app.route("/load")
def get_load_parser():
    scraping_medium()
    return jsonify({"success": "data parsed"}), 200

if __name__ == "__main__":
    app.run()
Let’s dig into what we have for now:
1 — We import the function that we want to use: from scraping.medium_scraping import scraping_medium;
2 — We update our method ‘get_load_parser()’ to call our scraping method.
For now, we don't do anything with this data, but since we are printing it, you will be able to see the whole HTML that we just requested.
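A small side note: if the raw print is hard to read, Beautifulsoup can pretty-print the parsed tree for you, which makes it much easier to scan:

# Optional: print an indented version of the parsed HTML for easier reading
print(parsed_data.prettify())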
Ok, we can load our page, but how do we know what we are looking for? Let's open the address in the Chrome browser, because it has a nice tool to inspect the code.
Access the link: https://medium.com/@ProfessionProgrammer
There are two ways to open the inspector in your browser:
1 — Right-click on the page and select the option 'Inspect'.
2 — Or, if you are using a Mac, use the shortcut 'COMMAND + SHIFT + C'.
This is what it should look like:
Let's see what we have here: parts of the page get this "blue" highlight because we are using the inspect tool; when you use the option to select an element on the page, it highlights it for you.
The right side is the inspector itself; that's where you can find the exact part of the code that you want.
When we load the page, we don't want to grab everything, right? And remember that Beautifulsoup will generate a "tree hierarchy" for us.
I tried to add a gif, but the quality was poor and the file was heavy; still, you get the idea: when you use the inspect "arrow", whatever you point at is shown in the code structure.
This is important: remember that Beautifulsoup will create a tree structure? So if you find where in the code the structure you want lives, it becomes easy to check.
Without going into much detail about HTML, we have div and class; these are the important items in the hierarchy of an HTML source code for us.
Ok, this is what the inspector shows: nice, indented, you can easily navigate through the code and select exactly the element you want.
And this is the source code that we downloaded using request and the Beautifulsoup tool.
So how do we find what we want in an easy way?
3 — Using Beautifulsoup.
As I mentioned before, Beautifulsoup works by parsing your data, the HTML, into a formatted output that you can navigate like a "tree".
According to Beautifulsoup description on their website:
“It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.”
Beautifulsoup turns all the HTML tags into attributes and searchable items, so when the structure nests items inside items, you can loop over them like a list or even read values as if you were accessing a dictionary.
For example, this simple code:
*Extracted from Beautifulsoup website.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="<http://example.com/elsie>" class="sister" id="link1">Elsie</a>,
<a href="<http://example.com/lacie>" class="sister" id="link2">Lacie</a> and
<a href="<http://example.com/tillie>" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Parsing using Beautifulsoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# u'title'
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
This is fairly self-explanatory given what we just discussed. Some "basic" items like "title", "p" or "a" can be seen in the source code above, and look at how they use find_all: we are literally asking for the tag name "a".
One last piece of information before getting back to our code: Beautifulsoup parses the data using different parsers, and this is the one we use in our example:
soup = BeautifulSoup(html_doc, 'html.parser')
We will use html.parser, but there are others that I will not go into; just know that there are more options.
Now let's extract our first piece of data from the website. We know that we will get a tree, that we can treat the code as tags, and that we can use "find" to search for a specific key.
Using my Medium page as a sample, I found that all my posts live inside a div tag in the HTML.
Let’s break our hierarchy:
1 — We have a section tag with a class attribute whose value is "r s t u v w x y";
2 — We have an empty <div>;
3 — After that we have one div with a class value of "fx y";
4 — Inside it we have many other div tags, but if you look closely, only the ones with the class value "gg gh gi gj gk y c" are my posts!
Back to our code. We downloaded the source code, right? Then let's start from the beginning: we know that everything we want is inside a div tag whose class value is "fx y".
Remember from the short sample above that we can “find” using any key or value in our code? And if that is valid Beautifulsoup will return a “tree” or “hierarchy” that allows us to loop? Let’s do it.
Back to medium_scraping.py:
def scraping_medium():
    page = requests.get(MEDIUM_URL)
    parsed_data = BeautifulSoup(page.content, 'html.parser')
    main_div = parsed_data.find('section', class_="r s t u v w x y")
    print(main_div)
Ok, easy, right? We take our parsed data and "find" the first section whose class_ value is "r s t u v w x y".
If we save and run the code, the print should show another chunk of HTML, only that section item, but instead nothing comes back.
What's wrong? We know this class is part of the page hierarchy, and I don't think the source code changed between my requests. Also, this value is CSS used to style the page; it can change when they change the style, but not between two requests!
Let's now look at some problems that will help you in the future when scraping; I will show some tips for when you don't find what you expect, so you can debug.
First, let's take the "pure" output we got before, without trying to find anything, and use any plain text editor to search for that key.
Without too much complexity, you can just print the output as I did, copy it, paste it into a plain editor, and use the search functionality.
But the search finds nothing. How can that be, when we saw it with the inspector tool?
It's related to how we are making the request. When you make a request you need to send some information, like who is requesting; this is always included in the requests your browser makes, but you are not a browser here.
This information lives in the request "headers". We need to look more like a browser, and to make that happen we will change the library we use, because we need to set the header and there's a library that handles this well.
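As a side note, the requests library can also send custom headers if you prefer to stay with it; a minimal sketch (not the approach we follow below) would be:

# Alternative: requests also accepts a headers dictionary
page = requests.get(MEDIUM_URL, headers={'User-Agent': 'Mozilla/5.0'})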
We need to make 2 basic changes in order to see this working:
1 — Drop "requests" and import two tools from urllib.request: Request and urlopen;
2 — Change how we fetch the data and add a default header.
In our medium_scraping.py:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# I will use my medium profile page, it's open so you can try too
MEDIUM_URL = "https://medium.com/@ProfessionProgrammer"
HEADER = {'User-Agent': 'Mozilla/5.0', "accept-language": "en-US"}

def scraping_medium():
    # Get the URL content
    request = Request(MEDIUM_URL, headers=HEADER)
    response = urlopen(request, timeout=10)
    # Get the content from the response, and set as default encoder 'utf-8'
    page_content = response.read().decode('utf-8')
    # Parse the data using Beautifulsoup
    parsed_data = BeautifulSoup(page_content, 'html.parser')
    main_div = parsed_data.find('section', class_="r s t u v w x y")
    print(parsed_data)
Let's call our "load" endpoint again and see if we get any results:
There we go, you can see that we already get more data!
Another important reason to make this change is that if you do not provide a "valid" header, the site will figure out that you are scraping, that you are a "robot"; they don't want that, and suddenly your requests will start returning 403 Forbidden.
So, make this change!
Getting back to our code, let's dig in again and follow that "tree" we described in order to find the elements we really want.
The next step is to find the <div> that holds all my "posts" elements. Since we want to extract this data, the first simple thing to do is the same as before: take the output and search for the next "key" we are looking for.
Searching for this "class" value "fx y":
And again, we can't find the key. Here are some tips for situations like this.
1 — Try to get a better, formatted output. For instance, I just printed the code, which is the data we want; this is what I do:
1.1 — Copy the output content.
1.2 — Paste it into a tool/website that can format HTML; I use https://htmlformatter.com/.
1.3 — Copy the formatted output, paste it into any editor you use, and first search for the key you want; if it is not there, dig into the HTML and see if you can find the content/structure you are looking for.
After doing this, this is what I see. I'm using VSCode and I collapsed the content; as you can see, we indeed don't see the hierarchy we expected, but this doesn't mean we don't have what we want. It's just that the downloaded output is different from what we see when navigating with the browser.
Ok, now just go into the HTML, try to find some keyword that you know is there, and look at the HTML keys that wrap the content we want.
In this case, I noticed that all the posts are inside the "class" value "fd y", so let's change our code to look for this:
# Replace the previous find method with this one
main_div = parsed_data.find('div', class_="fd y")
Awesome, now we can try it and see if it has any output:
And yes, nice, we have output! But now what I should look for is the div that holds each of my posts, and here I will give you another tip.
I can't copy the output now because it's huge and the terminal doesn't let me scroll through all of it; maybe it's different for you.
But you will find yourself in this situation, and a good tip is to save the output to a file; it will be much easier for you. For this reason, I will leave you with this snippet to save the output to a file:
from urllib.request import Request, urlopen

def save_to_file(url, file_name):
    header = {'User-Agent': 'Mozilla/5.0'}
    # Get the page content
    req = Request(url, headers=header)
    response = urlopen(req, timeout=10)
    # Save the content to an HTML file, as it's different from what you see when inspecting
    page_content = response.read().decode('utf-8')
    file = open(f"{file_name}.html", "w")
    file.write(page_content)
    file.close()
You can create a file named, for example, save_content_to_file.py and use it like this:
# here inside your scraping module
from save_content_to_file import save_to_file

save_to_file(MEDIUM_URL, "medium_page")
Done this way, you can always get the full output!
Now that I can look into the code, I know the structure, and I was able to identify that the tags like <div class="r s y"> are the ones that hold the content I want, such as the title of the post and the link to it.
Let’s go into the code and see how we can navigate through this:
def scraping_medium():
    # Get the URL content
    request = Request(MEDIUM_URL, headers=HEADER)
    response = urlopen(request, timeout=10)
    # Get the content from the response, and set as default encoder 'utf-8'
    page_content = response.read().decode('utf-8')
    # Parse the data using Beautifulsoup
    parsed_data = BeautifulSoup(page_content, 'html.parser')
    # This will get the main part of the code that has our posts
    main_div = parsed_data.find('div', class_="fd y")
    # This will get only the 'div' tags that have my posts
    posts_div = main_div.find_all('div', class_="r s y")
    print(posts_div)
Nice, we just introduced one more method, find_all; the main difference is that it returns a list of all matching div tags, split into "groups" that we can loop over!
This is the output for just one item from this group:
We found it! Now let's use the helpers Beautifulsoup provides and loop over the results to access what we want:
def scraping_medium():
    # Get the URL content
    request = Request(MEDIUM_URL, headers=HEADER)
    response = urlopen(request, timeout=10)
    # Get the content from the response, and set as default encoder 'utf-8'
    page_content = response.read().decode('utf-8')
    # Parse the data using Beautifulsoup
    parsed_data = BeautifulSoup(page_content, 'html.parser')
    # This will get the main part of the code that has our posts
    main_div = parsed_data.find('div', class_="fd y")
    # This will get only the 'div' tags that have my posts
    posts_div = main_div.find_all('div', class_="r s y")
    for item in posts_div:
        title = item.h1.text
        post_link = item.a['href']
        print(title)
        print(post_link)
        print("")
Let's understand what we did and how we got this data. The snippet shown here is the div I am looking into; it is what the "posts_div" variable holds as content.
1 — I get the title by taking the "item" object, accessing its first "h1" tag, as you can see in the image, and using the "text" attribute to get its content.
Important to remember: Beautifulsoup builds a tree, and the first item with the tag "h1" is the one returned; as I can see there is only one, that will be it.
The text attribute is provided by Beautifulsoup to retrieve the content of the tag you just accessed.
2 — For the link, we follow the same rule: the first "a" tag is the one I want, so I know that when I ask for it, that is the one I get. For its attributes, Beautifulsoup exposes the tag like a dictionary, so I access the "href" key to get the link.
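If you prefer more explicit calls, Beautifulsoup also exposes equivalent methods: get_text() on a tag returns its text, and get('href') returns the attribute value, or None when the attribute is missing instead of raising an error. A small sketch using the same variables:

# Equivalent, slightly more defensive accessors
title = item.h1.get_text()      # same result as item.h1.text
post_link = item.a.get('href')  # returns None if 'href' is missing, instead of raising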
If we run our scraping_medium() code, this is the output:
Amazing, we finally have the content that we want!
Tips!
There's more than one way of doing this search-and-parse; here are some tips you can use.
1 — Remember that it is a tree, so if the "a" tag you are looking for is, say, the third one, you can reach it with find_all, for example item.find_all('a')[2]['href']; it's a bit more verbose, but it does the job.
2 — We could have saved one find; actually, the "find_all" alone would be enough, as it is where we have the content, as sketched below.
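A minimal sketch of that shortcut, assuming the class value is unique enough on the page:

# Skipping the intermediate find: search the whole parsed document directly
posts_div = parsed_data.find_all('div', class_="r s y")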
Avoid crashes in your code!
Although Beautifulsoup does the hard work, it doesn't guarantee that a given key will be there, and it shouldn't, as every page is different; it's up to you to make sure it exists.
For this, I will wrap the access in a try/except block to avoid a crash if something changes.
def scraping_medium():
    # Get the URL content
    request = Request(MEDIUM_URL, headers=HEADER)
    response = urlopen(request, timeout=10)
    # Get the content from the response, and set as default encoder 'utf-8'
    page_content = response.read().decode('utf-8')
    # Parse the data using Beautifulsoup
    parsed_data = BeautifulSoup(page_content, 'html.parser')
    # This will get the main part of the code that has our posts
    main_div = parsed_data.find('div', class_="fd y")
    # This will get only the 'div' tags that have my posts
    posts_div = main_div.find_all('div', class_="r s y")
    for item in posts_div:
        try:
            title = item.h1.text
            post_link = item.a['href']
            # This is only intended for debug, not for production
            print(title)
            print(post_link)
            print("")
        except Exception:
            # If you use this in production, you should log when this happens
            print("No tag found")
Done! We have code that can scrape my Medium page and get only the content we want!
4 — Save the content and show it on your page.
Now that we can scrape the data, let's save it and show it on our page when it loads: we will update our parser to persist the data into our database, create our model, and update app.py to fetch the data.
There will be a few steps. First, let's create the model that will be persisted in our database: create a folder named "model" and create two files, __init__.py and medium_model.py.
Medium model.
from extensions.database import db
from sqlalchemy_serializer import SerializerMixin

# Model that represents a Medium post in our database
class MediumModel(db.Model, SerializerMixin):
    db = db
    id = db.Column(db.Integer, primary_key=True)
    date_created = db.Column(db.DateTime, default=db.func.current_timestamp())
    title = db.Column(db.Text, nullable=False)
    post_link = db.Column(db.Text, nullable=False)

    def __repr__(self):
        return f"Post title: {self.title}"
There is just one more thing we need here: the SerializerMixin. This library helps us transform a model into a format that we can easily access.
To install it, just run this pip command: pip install SQLAlchemy-serializer
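To give an idea of what the mixin provides, to_dict() turns a model instance into a plain dictionary of its columns; a rough sketch with our model (the example values are hypothetical):

# Hypothetical usage: SerializerMixin adds a to_dict() method to our model
post = MediumModel(title="My post", post_link="https://medium.com/example")
print(post.to_dict())  # a plain dict with the model's columns, e.g. 'title' and 'post_link'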
Nice, let's get back to our scraping module, which should now look like this:
Scraping tool
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from extensions.database import db
from model.medium_model import MediumModel

# I will use my medium profile page, it's open so you can try too
MEDIUM_URL = "https://medium.com/@ProfessionProgrammer"
HEADER = {'User-Agent': 'Mozilla/5.0', "accept-language": "en-US"}

def scraping_medium():
    # Get the URL content
    request = Request(MEDIUM_URL, headers=HEADER)
    response = urlopen(request, timeout=10)
    # Get the content from the response, and set as default encoder 'utf-8'
    page_content = response.read().decode('utf-8')
    # Parse the data using Beautifulsoup
    parsed_data = BeautifulSoup(page_content, 'html.parser')
    # This will get the main part of the code that has our posts
    main_div = parsed_data.find('div', class_="fd y")
    # This will get only the 'div' tags that have my posts
    posts_div = main_div.find_all('div', class_="r s y")
    for item in posts_div:
        try:
            title = item.h1.text
            post_link = item.a['href']
            # We create our model
            post = MediumModel(title=title, post_link=post_link)
            # We add it to our db session
            db.session.add(post)
            # This is only intended for debug, not for production
            print(title)
            print(post_link)
            print("")
        except Exception:
            # If you use this in production, you should log when this happens
            print("No tag found")
    # after the loop is done we persist everything into our database
    db.session.commit()

def fetch_medium_posts():
    post_list = MediumModel.query.order_by(MediumModel.date_created).all()
    serialized_list = list(map(lambda post: post.to_dict(), post_list))
    return serialized_list
1 — When we scrape, we create our models and add them to the database session; when the loop is done, we commit everything to the database.
2 — We also added one method to fetch the data we just saved: we query the posts ordered by creation date and serialize each one.
Time to update our app.py to fetch our data:
App
import os
from flask import Flask, render_template, jsonify, Response
from extensions import database, commands
from scraping.medium_scraping import scraping_medium, fetch_medium_posts

app = Flask(__name__)
app.config.from_object(os.environ['APP_SETTINGS'])

database.init_app(app)
commands.init_app(app)

@app.route("/")
def index():
    medium_posts = fetch_medium_posts()
    return render_template('index.html', posts=medium_posts)

@app.route("/load")
def get_load_parser():
    scraping_medium()
    return jsonify({"success": "data parsed"}), 200

if __name__ == "__main__":
    app.run()
Done! It's just a small change: we import the new function we want to use to fetch the saved data, call it, and pass the result to the template as a parameter when our page is loaded.
Let’s update now our main page to load and show this data:
Index
<!doctype html>
<html lang="en">
  <head>
    <!-- Required meta tags -->
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <!-- Bootstrap CSS -->
    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css" integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh" crossorigin="anonymous">
  </head>
  <body>
    <div class="container-fluid text-center">
      <br>
      <h3>Beautifulsoup tutorial</h3>
    </div>
    <div class="container-fluid">
      <div class="row row-cols-2">
        <!-- Here will come our content -->
        {% for post in posts %}
        <div class="col offset-md-3">
          <h3><a href="{{ post.post_link }}" target="_blank">{{ post.title }}</a> <i class="fas fa-angle-right"></i></h3>
        </div>
        {% endfor %}
      </div>
    </div>
  </body>
</html>
As you can see, we just loop over the posts and, for each item, create a link using the post's URL, with the post title as the link text.
You won't see anything yet; let's fix that.
Let’s add the content to our database!
If you downloaded the project, you will notice that we have a commands.py file; it helps us create and drop our database using Flask CLI commands. If you don't know much about this, please check this tutorial where you can learn more: https://itnext.io/use-flask-cli-to-create-commands-for-your-postgresql-on-heroku-in-6-simple-steps-e8166c024c8d
That post walks through the entire process, but you can skip straight to the part where we create and use our custom commands.
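If you just want the gist, a command that creates the tables usually looks roughly like this. This is a hypothetical sketch under the assumption that commands.py follows that tutorial's pattern; the file in the sample project may differ.

# extensions/commands.py - hypothetical sketch; the sample project's file may differ
import click
from flask.cli import with_appcontext
from extensions.database import db

@click.command(name='create_db')
@with_appcontext
def create_db():
    """Create all database tables."""
    db.create_all()

def init_app(app):
    # Registered in app.py through commands.init_app(app)
    app.cli.add_command(create_db)

With something like that registered, running flask create_db creates the tables before you call the /load endpoint.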
Now that you know how to use it, create your database and hit the endpoint again: http://localhost:5000/load
Then refresh our index page!
Here we are!!!
We have our own tool that parses data from a website (here we chose Medium), saves it into a database, and loads it on our page.
What next?!
Now that you know the basics of how to use the library, plus some tips to help you when fetching data, the next step is to keep improving and read the documentation; since you understand how the tool works, the documentation will make a lot more sense.
Please read and you will find most of the answers to your questions! Link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
This is the link to the final project if you want to download it: Link: https://github.com/felipeflorencio/BeautifulsoupSample/releases/tag/Final
Conclusion
I hope you enjoyed reading this. If you'd like to support me as a writer, consider signing up to become a Medium member; it's just $5 a month and you get unlimited access to Medium. And if you liked this, consider sharing it with others who also want to learn more about the topic.
Also, I started a new channel on YouTube where I want to teach in video form what I write about, so please subscribe if you liked this tutorial.
Youtube channel:
Linkedin:
Instagram: @thedevproject