

Master The Art of Writing Xpath For Web Scraping

A Gentle Introduction To The RegEX of Web Scraping

Photo by NeONBRAND on Unsplash Created Using Canva

The key part of web scraping is describing to the computer how it should find an element on a page. XPath is a way to write a pattern that is matched against a document's structure to pull out data. It addresses parts of a document as a tree, so in a pattern the parent node is written before its child node.

XPath stands for XML Path Language, a query language for locating nodes in XML documents. Because a parsed HTML page has the same kind of tree structure, XPath can be used to locate elements in HTML documents too.

Features

Targets elements precisely.
Supported by the browser's built-in developer tools.
A good choice when an element has no suitable id or class.
One expression can be reused across every page that shares the same layout.
More expressive than CSS selectors: it can match on text and move up to parents or across siblings.

Downsides

Hard to understand at first.
Not beginner-friendly.

Two major Python libraries rely heavily on XPath for web scraping: Selenium and Scrapy.

Selenium is a browser automation and testing library that can be used for web scraping as well. One of its biggest advantages is that it drives a real browser, so it can scrape dynamically generated content easily.

Scrapy is a complete Python framework for web scraping. It contains multiple tools for large-scale scraping, and XPath is one of its main selector types.

We will consider both while learning about XPath expressions.

> XPath Browser Essentials
 - Finding XPath
 - Testing XPath

> XPath
 - Types of XPath
 - XPath Basic Functions
 - XPath Advanced Functions

> Python Web Scraping Project Using XPath

Finding XPath

Most modern browsers (such as Chrome and Firefox) let you copy the XPath of an element with a few mouse clicks.

To get the XPath, right-click the element you want and click Inspect. Once the page source appears with the element highlighted, right-click the highlighted node and choose Copy > Copy XPath.

In Chrome you will see two options, Copy XPath and Copy full XPath, which correspond to the two types of XPath.

Types of XPath

1. Absolute XPath (full XPath): It uses the complete path from the root to the element. It starts with a single /. For example:

/html/body/div/div[2]/div[1]/div[1]/span[2]/small

2. Relative XPath: It refers to the element directly, from anywhere in the document. It starts with //. For example:

//*[@class='author']

Relative XPath is usually preferred over absolute XPath because it does not depend on the complete path from the root element. If an element is later added or removed anywhere along that path, the absolute XPath becomes invalid and stops matching, while a relative XPath anchored on an attribute keeps working.
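To make the difference concrete, here is a minimal sketch using Python's standard-library ElementTree, which understands only a small subset of XPath and evaluates paths relative to the root element (so /html/body/... is written ./body/... and //... is written .//...). The markup, the wrapper div, and the author_names helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one quote with its author.
ORIGINAL = (
    "<html><body><div><div class='quote'>"
    "<span>A quote</span><span>by <small class='author'>Jane Austen</small></span>"
    "</div></div></body></html>"
)

# The same page after a redesign wraps everything in one extra div.
REDESIGNED = ORIGINAL.replace("<body>", "<body><div id='wrapper'>").replace(
    "</body>", "</div></body>"
)

ABSOLUTE = "./body/div/div/span[2]/small"  # stands in for /html/body/div/div/span[2]/small
RELATIVE = ".//*[@class='author']"         # stands in for //*[@class='author']


def author_names(markup, path):
    """Return the text of every node that the path matches."""
    root = ET.fromstring(markup)
    return [node.text for node in root.findall(path)]


print(author_names(ORIGINAL, ABSOLUTE))    # ['Jane Austen']
print(author_names(REDESIGNED, ABSOLUTE))  # [] - the extra div broke the full path
print(author_names(REDESIGNED, RELATIVE))  # ['Jane Austen']
```

The relative path survives the redesign because it depends only on the class attribute, not on how deeply the element is nested.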

Every XPath expression is built from the same syntax.

XPath Syntax

Before looking at the syntax, you should understand the node structure of HTML and the terminology XPath uses:

An HTML document is made up of four main kinds of node: the root node, element nodes, attribute nodes, and text nodes that hold the values.
The root node is the top node of the node tree. Every node has a parent except the root node, and a node can have any number of children.
Every element in the document is an element node. Each element node has exactly one parent.
An attribute node holds one attribute of an element node, such as its id or class.
Text nodes contain the text of an element node. These values are visible to the user.
Ancestor nodes are the parent, the parent's parent, and so on of the current node.
Descendant nodes are the children, the children's children, and so on of the current node.
Siblings are nodes that share the same parent node.

The basic syntax of an XPath expression generally takes the form //tagname[@attribute='value']: a path to a node, optionally followed by a condition in square brackets.

Different functions and operators can be combined to write expressions that select elements on a page. Let’s look at them one by one.

// : Selects matching nodes anywhere below the current point, at any depth.

/ : Selects from the root node when it starts an expression, which is useful for writing absolute paths. Between two steps it selects direct children.

nodename : Selects all nodes with the given name. For example, div selects all the div elements.

. : Selects the current node.

.. : Selects the parent of the current node.

@ : Selects an attribute of an element, for example @href.

* : Matches any element node.

@* : Matches any attribute node.
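The building blocks above can be tried out without installing anything. The sketch below again uses the standard-library ElementTree, so every path starts from the root element with a leading dot and attribute values are read with .get() rather than a trailing @href. The markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a menu with two links and a main area with a paragraph.
root = ET.fromstring(
    "<body>"
    "<div id='menu'><a href='/home'>Home</a><a href='/about'>About</a></div>"
    "<div id='main'><p>Hello</p></div>"
    "</body>"
)

# // : every div below the root, at any depth
all_divs = root.findall(".//div")

# nodename plus a condition in square brackets on an attribute
menu = root.find("./div[@id='menu']")

# * : any child element of the current node
menu_children = menu.findall("*")

# .. : the parent of each p element
parents_of_p = root.findall(".//p/..")

# @ : keep only the a elements that carry an href, then read its value
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]

print(len(all_divs))                        # 2
print([c.tag for c in menu_children])       # ['a', 'a']
print([p.get("id") for p in parents_of_p])  # ['main']
print(hrefs)                                # ['/home', '/about']
```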

Advanced Expressions

contains(A, B) : True when string A contains the substring B. It is useful when you want to select a tag by part of an attribute value such as type, name, etc.

not : Negates a condition. Use it when you want to select tags from a set while excluding those with a particular attribute or value.

starts-with(A, B) : True when string A starts with the string B.

ends-with(A, B) : True when string A ends with the string B. It belongs to XPath 2.0, so it is not available in browsers, Selenium, or Scrapy, which implement XPath 1.0.

or : Selects an element that satisfies either condition 1 or condition 2.

and : Selects an element that satisfies both conditions.

text() : Selects the text nodes of an element, which lets you locate an element by its visible text. It is part of XPath itself, so it works in Selenium and Scrapy alike.
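As a rough illustration of these functions and operators, the sketch below runs each one against a small login form. It assumes the third-party lxml package is installed (the library Scrapy's selectors build on); the form markup, the EXAMPLES dictionary, and the matched_names helper are hypothetical. The ends-with entry uses a common XPath 1.0 workaround built from substring and string-length.

```python
from lxml import html

# Hypothetical login form used only for this illustration.
form = html.fromstring(
    "<form>"
    "<input type='text' name='username' id='user-field'>"
    "<input type='password' name='password' id='pass-field'>"
    "<input type='submit' name='submit' value='Log in'>"
    "<a href='/reset'>Forgot password?</a>"
    "</form>"
)

EXAMPLES = {
    "contains": "//input[contains(@id, 'field')]",
    "starts-with": "//input[starts-with(@name, 'pass')]",
    "not": "//input[not(@type='submit')]",
    "or": "//input[@type='text' or @type='password']",
    "and": "//input[@type='submit' and @name='submit']",
    # XPath 1.0 has no ends-with(); compare the last 5 characters with 'field'.
    "ends-with": "//input[substring(@id, string-length(@id) - 4) = 'field']",
}


def matched_names(expression):
    """Return the name attribute of every element the expression selects."""
    return [element.get("name") for element in form.xpath(expression)]


for label, expression in EXAMPLES.items():
    print(label, matched_names(expression))

# text() : find the link by its visible text, then read its href attribute.
reset_link = form.xpath("//a[text()='Forgot password?']/@href")
print(reset_link)  # ['/reset']
```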

following : Selects every node that comes after the current node in the document, excluding its own descendants.

For example, applied to the username input of a login form, it matches the two input tags that follow it (password, submit).

child : Selects all the child elements of the current node.

preceding : Selects all the nodes that come before the current node in the document, excluding its ancestors.

//*[@name='submit']/preceding::input

The XPath above selects all the input tags that come before the element whose name attribute has the value submit.

following-sibling : Selects the siblings that come after the current node at the same level. You can use it to select cards, buttons, etc.

parent : Selects the parent of the current node. A node has only one parent, but when the expression before the axis matches several nodes you get one parent per match; wrap the whole expression in parentheses and add an index in square brackets to pick one.

//*[@id='data']/parent::div

The XPath above selects the parent div of the element that has an id of data.

descendant : Similar to child, but it selects every element nested inside the current node: children, grandchildren, great-grandchildren, and so on. The child axis selects only the direct children of the current node.

ancestor : Selects all the ancestors of the current node: parent, grandparent, great-grandparent, and so on.

//*[@id='info']/ancestor::div
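The axes are easier to tell apart when they are run side by side. This is a small sketch, again assuming lxml is installed; the markup, including the extra remember input that sits outside the form, and the select helper are invented so that following and following-sibling return different results.

```python
from lxml import html

# Hypothetical page: a login form plus one input that lives outside the form.
page = html.fromstring(
    "<div id='login'>"
    "<form id='data'>"
    "<input name='username'><input name='password'><input name='submit'>"
    "</form>"
    "<p><input name='remember'></p>"
    "</div>"
)


def select(expression, attribute="name"):
    """Return the chosen attribute of every element the expression selects."""
    return [element.get(attribute) for element in page.xpath(expression)]


# following: everything after username in the document, at any level
print(select("//input[@name='username']/following::input"))
# ['password', 'submit', 'remember']

# following-sibling: only later nodes that share the same parent
print(select("//input[@name='username']/following-sibling::input"))
# ['password', 'submit']

# preceding: the inputs that come before submit
print(select("//*[@name='submit']/preceding::input"))
# ['username', 'password']

# child: direct children only; descendant: every nested level
print(select("//form[@id='data']/child::input"))
print(select("//div[@id='login']/descendant::input"))

# parent and ancestor walk upwards instead
print(select("//*[@id='data']/parent::div", "id"))                # ['login']
print(select("//input[@name='remember']/ancestor::div", "id"))    # ['login']
```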

These are only some of the functions and axes XPath offers. You can find many more in the MDN Web Docs: Functions In XPath.

Testing XPath

XPath expressions can become complicated and hard to write, so it is a good idea to test each one in the browser before using it in a scraping script.

Most browsers in use today provide a way to test XPath expressions. Open a web page, right-click, and select Inspect.

Once the source code is open, press Ctrl+F to open the search field, which also accepts XPath expressions. Type any expression and press Enter to jump to the matching elements.

XPath With Python

Let’s combine what we have learned with Python and Scrapy to scrape all the book records from this free scraping site: https://books.toscrape.com/

If you don’t know anything about Scrapy, I suggest going through my previous blog on web scraping using Scrapy. Create a project structure and a spider for web scraping.

Now for the main part of the scraping code: selecting elements using XPath.

The title of each book is inside an anchor tag that is a child of an h3 tag. The XPath for this is

//h3/a/text()

Next, the price.

The price is inside a paragraph tag that has the class price_color. There are several ways to reach this element: directly through the class name with //*[@class='price_color'], with the tag and class name //p[@class='price_color'], with the parent div //div[@class='product_price']/child::p[1], and so on. You can use whichever you prefer. For simplicity, I will access it by class name.

Last, the links.

To extract the link, we add @href at the end of the expression.

XPath: //h3/a/@href

Now that we have all the expressions, let’s combine them and scrape every page of the website using XPath.
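The combined extraction step might look roughly like the sketch below. It is not the original spider: it applies the three expressions with lxml (assumed to be installed) to a small, made-up page that only mimics the structure described above, and the extract_books helper, book titles, and prices are all hypothetical. It also assumes every book has a title, a price, and a link, since the three result lists are paired up by position.

```python
from lxml import html

# Hypothetical sample shaped like the listing described above.
SAMPLE_PAGE = (
    "<html><body>"
    "<article><h3><a href='catalogue/book-one/index.html'>Book One</a></h3>"
    "<div class='product_price'><p class='price_color'>£10.00</p></div></article>"
    "<article><h3><a href='catalogue/book-two/index.html'>Book Two</a></h3>"
    "<div class='product_price'><p class='price_color'>£12.50</p></div></article>"
    "</body></html>"
)


def extract_books(markup):
    """Apply the three XPath expressions and pair up the results by position."""
    page = html.fromstring(markup)
    titles = page.xpath("//h3/a/text()")
    prices = page.xpath("//*[@class='price_color']/text()")
    links = page.xpath("//h3/a/@href")
    return [
        {"title": str(title), "price": str(price), "link": str(link)}
        for title, price, link in zip(titles, prices, links)
    ]


for book in extract_books(SAMPLE_PAGE):
    print(book)
```

Inside a Scrapy spider, the same three expressions can be passed to response.xpath(...).getall() in the parse callback. Following the pagination link to cover every page is left out of this sketch.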

References

[1] https://www.guru99.com/xpath-selenium.html
[2] https://developer.mozilla.org/en-US/docs/Web/XPath

Conclusion

In this blog, we have learned about:
- What XPath is
- Features of XPath
- How to find XPath in browsers
- Types of XPath
- Different XPath functions
- Testing XPath on the web

I tried my best to include most of the concepts and functions you will need for web scraping with XPath. Next, try to get familiar with the libraries that use XPath for web scraping, like Selenium and Scrapy.

As Oscar De La Hoya says, “There is always space for improvement.” If there is anything you want to add to the article, I am always open to your suggestions and responses.

*All images used in the article are by the author or referenced otherwise.


Recommended Readings

Master Web Scraping Completely From Zero To Hero 🕸

Using Beautiful Soup and Requests Library with One Project

Web Scraping 2.0

Over The Top Web Scraping Using Scrapy

Web Scraping Using Selenium Python

Detailed Tutorial With One Project