Javier Tovar

Summary

This article is a guide to scraping data from a website with pagination using JavaScript and Playwright.

Abstract

The article begins by explaining the concept of pagination and its importance in web development for structuring content. It then covers the different types of pagination, focusing on traditional pagination, which divides content into arbitrary groups with links to move forward and backward. Using the YTravel website as an example, it demonstrates how to retrieve blog titles and links from the site with Playwright. The process involves manually browsing the website, locating the pagination element, and inspecting it with the browser. The guide then walks step by step through coding the scraper, including locating the elements to extract and executing the code. The final output is a JSON object containing the scraped data. The guide concludes by emphasizing the importance of reviewing the HTML structure and content loading of the target website before scraping.

Bullet points

  • Pagination is a technique used in web development to structure content and make navigation more user-friendly.
  • Traditional pagination divides content into arbitrary groups with links to move forward and backward.
  • The guide uses the YTravel website as an example to demonstrate how to scrape data using Playwright.
  • The process involves manually browsing the website, locating the pagination element, and inspecting it with the browser.
  • The guide provides a step-by-step process for coding the scraper, including locating the elements to extract and executing the code.
  • The final output is a JSON object containing the scraped data.
  • It is important to review the HTML structure and content loading of the target website before scraping.

Scrape Data From a Website With Pagination Using JavaScript & Playwright

Explanation with traditional pagination

Photo by Amari James on Unsplash

Paginated content is everywhere. For example, if you go to an e-commerce site, not all products are on the same page; they are most likely spread over several pages.

Pagination is a technique widely used in web development to structure content by grouping it into pages of a fixed size or a fixed number of elements. The goal is to make the site easier to navigate.

There are many ways to paginate the content of a website that work well for the user. The main ones are traditional pagination with next and previous buttons, infinite scroll, and a "load more" button.

However, web scraping is more difficult on some websites than others, depending on how they are structured.

Let’s see an example of scraping a website with pagination. We are going to develop our script with Playwright, a browser automation library designed for testing modern web applications.

Traditional pagination

Traditional pagination divides the contents into arbitrary groups of 10, 25, 100, or any other number of results. At the end of the listing, it includes links to move forward and backward page by page. The user can either use these links or use the forward and back buttons on the web browser itself.
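To make the idea concrete, here is a minimal sketch of how a site might slice a list of results into pages and decide whether to show the previous and next links. The function name `paginate` and the shape of the object it returns are assumptions made for illustration; they are not taken from any particular website.

```javascript
// Illustrative helper (hypothetical name): returns one page of results
// plus the information a "previous / next" navigation would need.
function paginate(items, pageSize, pageNumber) {
  const totalPages = Math.max(1, Math.ceil(items.length / pageSize));
  // Clamp the requested page into the valid range 1..totalPages.
  const currentPage = Math.min(Math.max(1, pageNumber), totalPages);
  const start = (currentPage - 1) * pageSize;

  return {
    items: items.slice(start, start + pageSize),
    currentPage,
    totalPages,
    hasPrevious: currentPage > 1,
    hasNext: currentPage < totalPages,
  };
}

// Example: 95 results shown 10 per page.
const results = Array.from({ length: 95 }, (_, i) => `Result ${i + 1}`);
const lastPage = paginate(results, 10, 10);
console.log(lastPage.totalPages, lastPage.items.length, lastPage.hasNext);
```

A scraper faces the same structure from the outside: it has to discover `totalPages` or keep following the next link until `hasNext` would be false.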

Most websites, such as newspapers, online stores, search engines, and forums, use the traditional pagination system.

Scraping YTravel

Let’s assume that for this exercise, we need to retrieve the blog posts published on the website (title and link). To get all of them, we will have to go through the pagination.

We are going to work with this category of posts: Travel Tips (y Travel Blog).

The first steps are:

1 — Manually browse the website and identify what type of pagination is being used to get an idea of how we are going to approach the exercise.

2 — Locate the pagination element and inspect it with the browser. To interact with the pagination, we need to locate the Next button element and the total pages.

Step 2

3 — We locate the elements we want to extract from each page. In this case, we need the blog title and link.

Step 3

4 — It is time to code
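As a starting point, steps 2 and 3 could be sketched like this with Playwright's locator API. The selectors in `SELECTORS` are hypothetical placeholders, not the real YTravel markup; replace them with whatever you find when inspecting the page. The names `parseTotalPages`, `readTotalPages`, and `extractPosts` are also just illustrative.

```javascript
// Hypothetical selectors: inspect the target site and replace them.
const SELECTORS = {
  postLink: 'article h2 a',       // the <a> holding each post title and link
  pageNumbers: '.pagination a',   // the numbered links in the pagination
  nextButton: '.pagination a.next',
};

// Step 2 (pure part): given the visible labels of the pagination links,
// return the highest page number found, or 1 if there is none.
function parseTotalPages(labels) {
  const numbers = labels.flatMap((label) =>
    (label.match(/\d+/g) || []).map(Number)
  );
  return numbers.length > 0 ? Math.max(...numbers) : 1;
}

// Step 2: read the pagination labels from the page.
async function readTotalPages(page, selectors = SELECTORS) {
  const labels = await page.locator(selectors.pageNumbers).allInnerTexts();
  return parseTotalPages(labels);
}

// Step 3: collect the title and link of every post on the current page.
async function extractPosts(page, selectors = SELECTORS) {
  return page.locator(selectors.postLink).evaluateAll((anchors) =>
    anchors.map((a) => ({ title: a.textContent.trim(), link: a.href }))
  );
}
```

With a Playwright `page` already opened on the category URL, `await readTotalPages(page)` gives the page count and `await extractPosts(page)` gives the title and link pairs for the current page.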

If you still have doubts about web scraping or how it works, I recommend you take a look at my article where I explain more about it.

In the following script, we get the title and link of each blog post found on each page and return them as JSON. We handle the pagination in two simple ways.

In the first approach, we need to know the total number of pages so that we can click through each of them or, if necessary, scrape only a certain number of pages out of the total.

Below is the code of the exercise. I have briefly commented on what each code block does. To execute the code, you only need to install Playwright (for example with npm install playwright) and run node fileName.js.
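A minimal sketch of this first approach might look like the following. It is not a definitive implementation: the start URL and the selectors in `CONFIG` are placeholders, the function names are illustrative, and it assumes that each page of results has its own URL, so that waiting for the URL to change is a reliable sign that the next page has loaded.

```javascript
// Hypothetical values: replace them after inspecting the target site.
const CONFIG = {
  startUrl: 'https://example.com/category/travel-tips/', // placeholder URL
  postLink: 'article h2 a',
  pageNumbers: '.pagination a',
  nextButton: '.pagination a.next',
  maxPages: 3, // scrape at most this many pages; omit to scrape them all
};

async function scrapeByPageCount(page, config) {
  // Work out the total number of pages from the pagination labels.
  const labels = await page.locator(config.pageNumbers).allInnerTexts();
  const pageNumbers = labels.flatMap((l) => (l.match(/\d+/g) || []).map(Number));
  const totalPages = pageNumbers.length > 0 ? Math.max(...pageNumbers) : 1;
  const pagesToVisit = Math.min(totalPages, config.maxPages ?? totalPages);

  const posts = [];
  for (let current = 1; current <= pagesToVisit; current++) {
    // Extract the title and link of every post on the current page.
    const pagePosts = await page.locator(config.postLink).evaluateAll((anchors) =>
      anchors.map((a) => ({ title: a.textContent.trim(), link: a.href }))
    );
    posts.push(...pagePosts);

    // Click "Next" unless this was the last page we wanted.
    if (current < pagesToVisit) {
      const previousUrl = page.url();
      await page.locator(config.nextButton).first().click();
      await page.waitForURL((url) => url.href !== previousUrl);
    }
  }
  return posts;
}

async function runCountScraper() {
  const { chromium } = require('playwright'); // loaded only when needed
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(CONFIG.startUrl);
    const posts = await scrapeByPageCount(page, CONFIG);
    console.log(JSON.stringify(posts, null, 2));
  } finally {
    await browser.close();
  }
}

// The --run flag is a convention of this sketch, so the functions can be
// imported or tested without launching a browser: node fileName.js --run
if (process.argv.includes('--run')) {
  runCountScraper().catch((error) => {
    console.error(error);
    process.exitCode = 1;
  });
}
```

Keeping the scraping logic in a function that receives `page` makes it easy to reuse with any browser Playwright supports.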

In the second approach, we simply scrape every page we find, so we do not need to know the total number of pages. We loop until the “Next” button is no longer visible on the page, which means there are no more pages.
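This second approach could be sketched as follows. Again, treat it as an illustration under assumptions: the URL and selectors in `OPTIONS` are placeholders, the function names are hypothetical, and the optional `maxPages` cap is an addition of this sketch that acts as a safety limit.

```javascript
// Hypothetical values: replace them after inspecting the target site.
const OPTIONS = {
  startUrl: 'https://example.com/category/travel-tips/', // placeholder URL
  postLink: 'article h2 a',
  nextButton: '.pagination a.next',
  maxPages: Infinity, // optional safety cap added by this sketch
};

async function scrapeUntilNoNext(page, options) {
  const maxPages = options.maxPages ?? Infinity;
  const posts = [];
  let pagesScraped = 0;

  while (true) {
    // Extract the title and link of every post on the current page.
    const pagePosts = await page.locator(options.postLink).evaluateAll((anchors) =>
      anchors.map((a) => ({ title: a.textContent.trim(), link: a.href }))
    );
    posts.push(...pagePosts);
    pagesScraped += 1;
    if (pagesScraped >= maxPages) break;

    // No visible "Next" button means this was the last page.
    const next = page.locator(options.nextButton).first();
    if (!(await next.isVisible())) break;

    // Assumes every page has its own URL, so a URL change signals navigation.
    const previousUrl = page.url();
    await next.click();
    await page.waitForURL((url) => url.href !== previousUrl);
  }
  return posts;
}

async function runNextScraper() {
  const { chromium } = require('playwright'); // loaded only when needed
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(OPTIONS.startUrl);
    const posts = await scrapeUntilNoNext(page, OPTIONS);
    console.log(JSON.stringify(posts, null, 2));
  } finally {
    await browser.close();
  }
}

// The --run flag is a convention of this sketch: node fileName.js --run
if (process.argv.includes('--run')) {
  runNextScraper().catch((error) => {
    console.error(error);
    process.exitCode = 1;
  });
}
```

Because `isVisible()` answers immediately instead of waiting for the element, a missing "Next" button ends the loop rather than causing a timeout.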

Final Output

Output JSON object
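The result is an array of objects, each with a `title` and a `link`. Here is a minimal sketch of how such an array could be serialized; the two records below are placeholders chosen for illustration, not real scraped data.

```javascript
// Placeholder records: titles and links are illustrative only.
const posts = [
  { title: 'Example post title 1', link: 'https://example.com/post-1/' },
  { title: 'Example post title 2', link: 'https://example.com/post-2/' },
];

// Pretty-print with two-space indentation so the output is easy to read.
const json = JSON.stringify(posts, null, 2);
console.log(json);
```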

Conclusion

We have seen how to scrape the best-known way websites paginate their content.

In this case, it was not so difficult to navigate through the pagination, but we may come across other types of pagination that take more work to handle. It is important to review the HTML structure and how the content is loaded.

I hope this exercise has helped you get a better idea of how to scrape your target website.

Happy scraping!


Remember, you must take into account the Terms of Service and the Privacy Policies of the websites before scraping, so please scrape responsibly.
