Web Scraping Cheat Sheet (2022), Python for Web Scraping
The complete guide to web scraping: Beautiful Soup, Selenium, Scrapy, XPath, and more!

Web Scraping is the process of extracting data from a website. Although you only need the basics of Python to start learning web scraping, this might sometimes get complicated because web scraping goes beyond Python. It also involves learning the basics of HTML, XPath, and tons of new methods in Beautiful Soup, Selenium, or Scrapy.
This is why I decided to create this cheat sheet — to walk you through all the stuff you have to learn to successfully scrape a website with Python. This goes from the basic Python stuff you need to know before learning web scraping to the most complete web scraping framework, Scrapy.
Below, you’ll find the topics covered. I’m leaving the basic Python stuff at the end of the article since most of you are probably familiar with it. If you’re an absolute beginner, start with that section and then follow the order of the list below to easily learn web scraping.
Table of Contents
1. HTML for Web Scraping
- HTML Element Syntax
- HTML Code Example
- HTML Tree Structure
2. Beautiful Soup Cheat Sheet
- Installing and importing the libraries
- Creating the “soup”
- Finding elements: find() vs find_all()
- Getting the values: text vs get_text()
3. XPath (Necessary for Selenium and Scrapy)
- XPath Syntax
- XPath Functions and Operators
- XPath Special Characters
4. Selenium Cheat Sheet
- Installing and importing the libraries
- Creating the "driver"
- Finding elements: find_element_by() vs find_elements_by()
- Getting the values: text
- Waits: Implicit Waits vs Explicit Waits
- Options: Headless mode, block ads, and more
5. Scrapy Cheat Sheet
- Setting up Scrapy
- Creating a Project and Spider
- The Template
- Finding elements and getting the text value
- Return data extracted
- Run the spider and export data to CSV or JSON
6. Python Basics for Web Scraping
7. Web Scraping Cheat Sheet [PDF]
Important Note: A few months ago Selenium 4 was released. There are only a few changes between Selenium 3.x versions and Selenium 4. You can check out those changes in my web scraping cheat sheet in PDF format.
HTML for Web Scraping
A typical step in web scraping is inspecting an element within a website in order to obtain the HTML code behind it. This is why you should learn HTML basics before learning any Web Scraping library.
In this section, we will see the basic HTML stuff required for web scraping.
HTML Element Syntax
An HTML element usually uses tags and attributes for its syntax. The tags define how your web browser must format and display the content. Most HTML elements are written with an opening tag and closing tag, with content in between. The attributes define the additional properties of the element.
Let’s imagine we’re about to scrape a website that contains the transcript of movies, so you inspect a movie’s title. The HTML element behind the title will look like the picture below.

Let’s see each element. First, the tag name is set to h1. This will give the word Titanic (1997) the biggest heading size of a page. Some other common tags are a, p and div. Then we have the attribute name set to class, which is one of the most common attributes. The last element is the content Titanic (1997). This is the only element you see on the page before inspecting.
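To make the three parts concrete, here is a minimal sketch that pulls them out of such an element with Python's built-in html.parser. The tag name, attribute name, and content come from the description above; the attribute value "title" and the ElementParts class are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical element: the tag name (h1), attribute name (class) and content
# come from the text above; the attribute value "title" is assumed.
ELEMENT = '<h1 class="title">Titanic (1997)</h1>'


class ElementParts(HTMLParser):
    """Collects the tag name, attributes and text content of one element."""

    def __init__(self):
        super().__init__()
        self.tag = None
        self.attributes = {}
        self.content = ""

    def handle_starttag(self, tag, attrs):
        # Called for the opening tag, e.g. <h1 class="title">
        self.tag = tag
        self.attributes = dict(attrs)

    def handle_data(self, data):
        # Called for the text between the opening and closing tags
        self.content += data


parts = ElementParts()
parts.feed(ELEMENT)
parts.close()

print(parts.tag)         # tag name
print(parts.attributes)  # attribute name -> attribute value
print(parts.content)     # the text you see on the page
```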
HTML Code Example
So far we’ve seen just a single HTML element, but the HTML code behind a website has hundreds of elements like the one in the image above. Fortunately, when scraping a website we will only analyze the elements that contain the data we want to get.
To easily understand HTML code, let’s check the following small code I wrote as an example.

This code represents a web page that has the title, plot, and transcript of the movie Titanic. Each element has a tag name and most of them have an attribute. The HTML code is structured with “nodes”. There are element nodes, attribute nodes, and text nodes. These might be hard to identify in plain code, so we’ll see them much better in a tree structure.
HTML Tree Structure
The image below is the tree structure of the HTML code example we’ve seen before.

In the tree above, each rectangle represents a node. The gray rectangles represent element nodes, the yellow rectangles represent attribute nodes and the green rectangles represent text nodes. Also, the tree shows hierarchical relationships between nodes (parent, child, and sibling nodes). Let’s identify the relationships between nodes.
- The “root node” is the top node. In this example, <article> is the root.
- Every node has exactly one “parent”, except the root. The <h1> node’s parent is the <article> node.
- An element node can have zero, one, or several “children,” but attributes and text nodes have no children. <p> has two child nodes, but no child elements.
- “Siblings” are nodes with the same parent (e.g., h1, p and div).
- A node’s children and its children’s children are called its “descendants”. Similarly, a node’s parent and its parent’s parent are called its “ancestors”.
When doing web scraping, locating children and parent nodes is sometimes vital when you can’t find a particular element but only its parent or child. We will see this in detail in the XPath section.
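These relationships can be explored in code. The sketch below uses Python's built-in xml.etree.ElementTree on a hypothetical reconstruction of the example page: the tag names and class values follow the article, while the plot and transcript texts are placeholders. ElementTree is an XML parser, so this only works because the snippet is well-formed; real pages usually need an HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the example page. The tag names and class
# values follow the article; the plot and transcript texts are placeholders.
HTML = """
<article class="main-article">
  <h1>Titanic (1997)</h1>
  <p class="plot">Placeholder plot text.</p>
  <div class="full-script">Placeholder transcript text.</div>
</article>
"""

root = ET.fromstring(HTML.strip())

# ElementTree does not store parent links, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

h1 = root.find("h1")
children = [child.tag for child in root]                     # children of the root
siblings = [el.tag for el in parent_of[h1] if el is not h1]  # siblings of <h1>

print(root.tag)              # the root node
print(parent_of[h1].tag)     # the parent of <h1>
print(children)
print(siblings)
print(len(root.find("p")))   # number of child elements of <p>
```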
Beautiful Soup
Beautiful Soup is the easiest web scraping tool in Python. Although it has some limitations (e.g., it only parses the HTML it is given, so it can’t scrape content rendered by JavaScript), it should be the starting point for beginners.
Installing the libraries
To start working with Beautiful Soup we need 3 libraries. We use “beautifulsoup4” to scrape the website, “requests” to send requests to the website and “lxml” for parsing XML and HTML. Open up a terminal and run the following commands.
pip install beautifulsoup4
pip install requests
pip install lxml
Importing the libraries
After installing the necessary libraries, import BeautifulSoup and requests before scraping the website.
from bs4 import BeautifulSoup
import requests
Creating the “soup”
In Beautiful Soup we use the “soup” object to find elements in a website. To create this object do the following.
# 1. Fetch the page (write the full URL of the website you wish to scrape, including https://)
result = requests.get("https://www.google.com")
# 2. Get the page content
content = result.text
# 3. Create the soup
soup = BeautifulSoup(content, "lxml")
What we’ve done so far is what we always do, regardless of the website we wish to scrape.
Finding elements: find() vs find_all()
There are two ways to get elements with Beautiful Soup: find() and find_all(). We use find() to get the first element that matches a given tag name, class name, or id, while find_all() gets all the matching elements and returns them in a list.
Both find() and find_all() have a similar syntax. Let’s have a look.

find() and find_all() usually take 2 arguments, but we can omit either of them if necessary. Also, we should use “class_” whenever we want to locate an element by its class name. The “_” is only there to distinguish this argument from Python’s class keyword. Another common attribute used to locate elements is id, because an id is meant to be unique within a page.
Let’s look at some examples of how to locate elements with Beautiful Soup. We’ll be using the HTML code we’ve seen before.

Let’s locate the article element and title.
# Get the article element
element1 = soup.find('article', class_='main-article')
# Get the title element
element2 = soup.find('h1')
Let’s imagine there are multiple h2 elements. We can get all of them with find_all().
# Get all h2 elements
elements = soup.find_all('h2')
We could also use find_all() in the examples we used for find(), but we would get a list with a single element.
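To try find() and find_all() without fetching a real website, we can build the soup from an inline string. This is only a sketch: the HTML is a hypothetical version of the example page (the h2 headings and placeholder texts are assumptions added so that find_all() has several matches), and it uses Python's built-in "html.parser" so that it runs even without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical page: class names follow the article; the h2 headings and
# placeholder texts are assumptions added for illustration.
HTML = """
<article class="main-article">
  <h1>Titanic (1997)</h1>
  <h2>Plot</h2>
  <p class="plot">Placeholder plot text.</p>
  <h2>Transcript</h2>
  <div class="full-script">Placeholder transcript text.</div>
</article>
"""

# "html.parser" ships with Python; "lxml" can be used instead if installed.
soup = BeautifulSoup(HTML, "html.parser")

article = soup.find("article", class_="main-article")  # first match
title = soup.find("h1")                                 # first <h1>
headings = soup.find_all("h2")                          # every <h2>, as a list
missing = soup.find("table")                            # no match returns None

print(title.text)
print([h.text for h in headings])
print(missing)
```

Note that find() returns None when nothing matches, so it's worth checking the result before reading .text from it.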
Getting the values: text vs get_text()
Most of the time we want to get the text inside an element. There are 2 options to get the text in Beautiful Soup: text and get_text(). The first is a property while the second is a method. Both return the text of a tag as a string, but with get_text() we can also add various keyword arguments to change how it behaves (e.g., separator, strip, types).
Let’s look at some examples using the “element2” we got before.
data = element2.text
data = element2.get_text(strip=True, separator=' ')
In this particular example, text and get_text() will return the same text, “Titanic (1997)”. However, if we’re scraping “dirty” data, the strip and separator arguments will come in handy. The first removes leading and trailing whitespace from each piece of text, while the second sets the string used to join those pieces (here a blank space, so text that was split across lines or nested tags ends up separated by single spaces).
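Here is a small sketch of the difference on “dirty” markup. The HTML below is a hypothetical variation of the title element, split across lines and a nested span, and it is parsed with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical "dirty" markup: the title is split across lines and a nested tag.
HTML = """
<h1>
    Titanic
    <span>(1997)</span>
</h1>
"""

soup = BeautifulSoup(HTML, "html.parser")
element = soup.find("h1")

raw = element.text                                   # keeps newlines and indentation
clean = element.get_text(strip=True, separator=" ")  # strips each piece, joins with " "

print(repr(raw))
print(repr(clean))
```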
We can also get a specific attribute of an element, like the href attribute of an a tag (the href gives us the link the element points to).
# Get the "a" tag
element = soup.find('a')
# Get the attribute value
data = element.get('href')
Below you can find a guide that will help you scrape your first website with Beautiful Soup.
XPath (Necessary for Selenium and Scrapy)
Before learning Selenium or Scrapy, we have to learn how to build an XPath. XPath is a query language for selecting nodes from an XML document. This will help us locate an element when the HTML code isn’t simple.
XPath Syntax
An XPath usually contains a tag name, attribute name, and attribute value. Let’s have a look at the XPath syntax.

The // and @ are special characters that we’ll see later. Now let’s check some examples to locate some elements of the HTML code we’ve been using so far.

Let’s build the XPath of the article element, title, and transcript.
# Article element XPath
//article[@class="main-article"]
# Title element XPath
//h1
# Transcript element XPath
//div[@class="full-script"]
XPath Functions and Operators
Sometimes the HTML elements are complicated to locate with a simple XPath. This is when we need to use XPath functions. One of the most useful functions is contains. The contains function has the following syntax.

Let’s look at some examples with the HTML code we’ve used above.
# Article element XPath
//article[contains(@class, "main")]
# Transcript element XPath
//div[contains(@class, "script")]
As you can see, we don’t need to write the whole value, but only a part of it. This is extremely useful when working with long values or attributes that have multiple value names.
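XPaths like these can be tested outside Selenium or Scrapy. The sketch below evaluates the exact-match and contains() expressions with lxml's xpath() method on a hypothetical reconstruction of the example page (tag names and class values follow the article; the texts are placeholders).

```python
from lxml import etree

# Hypothetical reconstruction of the example page; texts are placeholders.
HTML = """
<article class="main-article">
  <h1>Titanic (1997)</h1>
  <p class="plot">Placeholder plot text.</p>
  <div class="full-script">Placeholder transcript text.</div>
</article>
"""

root = etree.fromstring(HTML.strip())

# Exact match on the attribute value (xpath() always returns a list)
article = root.xpath('//article[@class="main-article"]')
title = root.xpath('//h1')
transcript = root.xpath('//div[@class="full-script"]')

# Partial match with contains()
article_partial = root.xpath('//article[contains(@class, "main")]')
transcript_partial = root.xpath('//div[contains(@class, "script")]')

print(title[0].text)
print(article_partial[0].get("class"))
print(transcript_partial[0].get("class"))
```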
On the other hand, XPath can also use the “and” and “or” logical operators. Both have the same syntax.

To see an example, let’s imagine we have an extra p element with the attribute class="plot2" in our HTML code.
# Locate elements whose class is either "plot" or "plot2"
//p[(@class="plot") or (@class="plot2")]
XPath Special Characters
Building an XPath can be tricky at first because it uses many special characters whose meaning isn’t obvious. This is why I made the table below, which contains the most common ones.
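As a companion, the sketch below exercises a handful of characters that are standard XPath syntax (/, //, ., .., *, @ and [ ]) with lxml. It is not a reproduction of the table, only a way to see each character at work; the HTML is a hypothetical version of the example page with the extra plot2 paragraph from the previous section.

```python
from lxml import etree

# Hypothetical version of the example page, with the extra "plot2" paragraph.
HTML = """
<article class="main-article">
  <h1>Titanic (1997)</h1>
  <p class="plot">Placeholder plot text.</p>
  <p class="plot2">Second placeholder plot text.</p>
  <div class="full-script">Placeholder transcript text.</div>
</article>
"""

root = etree.fromstring(HTML.strip())
h1 = root.xpath('//h1')[0]

direct_child = root.xpath('/article/h1')  # /  : step to a direct child, starting at the root
anywhere = root.xpath('//p')              # // : match at any depth in the document
children = root.xpath('./*')              # .  : current node, * : any element
parent = h1.xpath('..')                   # .. : parent of the current node
classes = root.xpath('//p/@class')        # @  : select an attribute (returns its values)
second_p = root.xpath('//p[2]')           # [] : predicate; positions start at 1

print([el.tag for el in children])
print(parent[0].tag)
print(list(classes))
print(second_p[0].get("class"))
```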

Selenium
Selenium is more powerful than Beautiful Soup because it drives a real browser, which allows you to scrape JavaScript-driven pages.
I recommend you solve a project to memorize all the Selenium methods listed in this guide. Below there’s my step-by-step tutorial on how to solve a Selenium project from scratch.
Tutorial: Python Selenium for Beginners — A Complete Web Scraping Project
Installing the libraries and Chromedriver
To start working with Selenium we need to install the selenium library and download chromedriver.
To download Chromedriver, go to this link. In the “Current Releases” section click on the Chromedriver version that corresponds to your Chrome browser (to check the version, click the 3 dot button on the upper right corner, click on “Help”, then click on “About Google Chrome”).
After you download the file, unzip it and remember where it’s located.
To install Selenium, open up a terminal and run the following command.
pip install selenium
Importing the libraries
After installing the necessary libraries, import webdriver before scraping the website.
from selenium import webdriver
Creating the “driver”
In Selenium we use the “driver” object to find elements in a website. To create this object, do the following.
# 1. Define the website you wish to scrape (full URL, including https://) and the path where Chromedriver was downloaded
web = "https://www.google.com"
path = "introduce chromedriver path"
# 2. Create the driver (Selenium 3 syntax; Selenium 4 expects the path wrapped in a Service object)
driver = webdriver.Chrome(path)
Once the driver is created, we can open the website with .get(). Always remember to close the browser after you scrape the content.
# 1. Open the website
driver.get(web)
# 2. Close the website
driver.quit()
What we’ve done so far is what we always do regardless of the website you wish to scrape.
Finding elements: find_element_by() vs find_elements_by()
There are two ways to get elements with Selenium: find_element_by() and find_elements_by(). We use the first to get the first element that matches a given tag name, class name, id, or XPath, while the latter gets all the matching elements and returns them in a list.
Let’s have a look at the syntax of finding elements with Selenium.

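You can replace the attribute_name with any attribute. Below you can find the most common attributes used. These are the Selenium 3 method names; Selenium 4 replaces them with find_element(By.ID, 'id_value') and find_elements(By.CLASS_NAME, 'value'), with By imported from selenium.webdriver.common.by.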
# Finding a single element
driver.find_element_by_id('id_value')
# Finding multiple elements (returns a list)
driver.find_elements_by_class_name('value')
driver.find_elements_by_css_selector('value')
driver.find_elements_by_tag_name('value')
driver.find_elements_by_name('value')
driver.find_elements_by_xpath('value')
In the case of XPaths, there’s a special syntax.

Let’s look at some examples. We’ll be using the HTML code we’ve seen before.

Let’s locate the article element, title, and transcript.
# Get the article element
element1 = driver.find_element_by_class_name('main-article')
# Get the title element
element2 = driver.find_element_by_tag_name('h1')
# Get the transcript
element3 = driver.find_element_by_xpath('//div[@class="full-script"]')
Getting the values: text
Most of the time we want to get the text inside an element. In Selenium we can use .text to get the text we want.
Let’s look at some examples using the “element2” we got before.
data = element2.text
In this particular example, text will return the text “Titanic (1997).” The .text property does a good job of formatting the text we scrape, but if necessary, clean the resulting string with Python string methods such as strip(), split() and join() after getting the data.
Waits: Implicit Waits vs Explicit Waits (Handling ElementNotVisibleException)
One of the problems of scraping JavaScript-driven websites is that the data is loaded dynamically, so it can take a few seconds for all of it to appear. As a result, an element might not yet be in the DOM (Document Object Model), or might be present but not yet visible, when the script looks for it, and we’ll get an exception such as “NoSuchElementException” or “ElementNotVisibleException.” This is why we have to make the driver wait until the data we wish to scrape has loaded completely.
There are 2 types of waits: implicit and explicit waits. An implicit wait tells the web driver to keep retrying for up to a certain amount of time when trying to locate an element; in Selenium it is set once with driver.implicitly_wait(). A simpler, blunter alternative is a fixed pause: import Python’s time library and call time.sleep() with the number of seconds to wait within the parentheses. Unlike an implicit wait, time.sleep() always blocks for the full duration, even if the element is already there. For example, if you want to make the script stop for 2 seconds, write this.
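A 2-second pause with time.sleep() can be sketched with the standard library alone. The timing lines are only there to show how long the script was blocked; in a real scraper the pause would sit between driver.get() and the first element lookup.

```python
import time

start = time.monotonic()
time.sleep(2)  # pause the script for 2 seconds, whatever the page is doing
elapsed = time.monotonic() - start

print(f"Waited {elapsed:.1f} seconds")
```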

On the other hand, an explicit wait makes the web driver wait for a specific condition (Expected Conditions) to occur before proceeding further with the execution. First, you need to import a couple of libraries besides webdriver
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.google.com")
<explicit wait syntax>
driver.quit()
Explicit waits have the following syntax.

Let’s look at an example. We’ll locate the transcript element we’ve seen before, but now with expected conditions.








