Scrape Data From a Website With Pagination Using JavaScript & Playwright
Explanation with traditional pagination
Paginated content is everywhere. For example, on an e-commerce site the products are rarely all on one page; they are usually spread over several pages.
Pagination is a technique widely used in web development to structure content by grouping it into pages of a fixed size or a fixed number of elements. The goal is to make the site easier to navigate.
There are many ways to paginate the content of a website that work well for the user. The main ones are traditional pagination with next and previous buttons, infinite scroll, and a "load more" button.
However, web scraping is more difficult on some websites than others, depending on how they are structured.
Let’s see an example of scraping a website with pagination. We are going to write our script with Playwright, a browser automation library built for end-to-end testing of modern web applications.
Traditional pagination
Traditional pagination divides the contents into arbitrary groups of 10, 25, 100, or any other number of results. At the end of the listing, it includes links to move forward and backward page by page. The user can either use these links or use the forward and back buttons on the web browser itself.
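To make the idea concrete, here is a minimal sketch of what a site does when it paginates a listing. The function names `paginate` and `pageLinks` are made up for this illustration and do not come from any library; a real site does this on the server, but the logic is the same.

```javascript
// Split a list into fixed-size pages, the way a paginated listing does.
// `paginate` is a hypothetical helper written for this example.
function paginate(items, pageSize) {
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError('pageSize must be a positive integer');
  }
  const pages = [];
  for (let start = 0; start < items.length; start += pageSize) {
    pages.push(items.slice(start, start + pageSize));
  }
  return pages;
}

// Describe the navigation links a given page exposes (pages are numbered from 1).
// A missing link is represented as null.
function pageLinks(pageNumber, totalPages) {
  return {
    previous: pageNumber > 1 ? pageNumber - 1 : null,
    next: pageNumber < totalPages ? pageNumber + 1 : null,
  };
}
```

From a scraper's point of view, the useful part is the `next` link: we keep following it until it is missing, and at that point we know we have reached the last page.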
Most websites, such as newspapers, online stores, search engines, and forums, use the traditional pagination system.
Scraping YTravel
Let’s assume that for this exercise we need to retrieve the blog posts published on the website (title and link). To get all of them, we will have to go through the pagination.
We are going to work with the Travel Tips category of y Travel Blog.
The first steps are:
1. Manually browse the website and identify what type of pagination it uses, to get an idea of how to approach the exercise.
2. Locate the pagination element and inspect it with the browser's developer tools. To interact with the pagination, we need to locate the Next button element and the total number of pages.
3. Locate the elements we want to extract from each page. In this case, we need the title and link of each blog post.
4. Write the code.
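As a rough sketch of steps 2 and 3, the snippet below reads the title and link of every post on the current page and checks whether a Next button exists. The CSS selectors in `SELECTORS` are placeholders I made up, not the real ones from the site, so replace them with what you find when inspecting the page. `toPosts` and `scrapePage` are likewise hypothetical names for this example; `page` is assumed to be a Playwright `Page` object.

```javascript
// Assumed selectors: replace them with the ones found while inspecting the site.
const SELECTORS = {
  postLink: 'article h2 a', // hypothetical: the title link of each post
  nextButton: 'a.next',     // hypothetical: the "Next" pagination link
};

// Clean up raw { title, href } pairs: drop incomplete entries, trim titles,
// and resolve relative links against the URL of the page they came from.
function toPosts(rawLinks, baseUrl) {
  return rawLinks
    .filter((l) => l.href && l.title && l.title.trim())
    .map((l) => ({ title: l.title.trim(), link: new URL(l.href, baseUrl).href }));
}

// Extract the posts on the current page and report whether a Next button exists.
async function scrapePage(page) {
  const raw = await page.$$eval(SELECTORS.postLink, (els) =>
    els.map((el) => ({ title: el.textContent, href: el.getAttribute('href') })));
  const posts = toPosts(raw, page.url());
  const hasNext = (await page.locator(SELECTORS.nextButton).count()) > 0;
  return { posts, hasNext };
}
```

Keeping the cleanup in a plain function such as `toPosts` means it can be checked without opening a browser, while `scrapePage` contains only the calls that touch the page.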
If you still have doubts about web scraping or how it works, I recommend you take a look at my article where I explain more about it.
In the following script, we get the title and link of each blog post found on each page and return them as JSON. To handle the pagination, we have developed two simple approaches.