Summary

The article discusses a Python-based method using Selenium to scrape content from web pages with infinite scrolling, demonstrated through scraping Reddit for posts related to COVID-19.

Abstract

The article provides a detailed guide on using Python and Selenium to scrape data from websites that implement infinite scrolling, a technique where content is dynamically loaded as the user scrolls down. The author illustrates the challenges faced when trying to extract complete data sets from such pages, using Reddit's search results for "COVID-19" as a case study. The traditional approach of using APIs is deemed inefficient for this task due to the need to filter posts from various subreddits. The article explains the concept of infinite scrolling and how Selenium initially only accesses the visible part of the web page, missing out on content that is loaded dynamically. To overcome this, the author presents a Python script that simulates scrolling to load all content before scraping. The script uses JavaScript execution within Selenium to scroll in increments and wait for new content to load, ensuring that the entire dataset is captured. The effectiveness of this method is demonstrated by comparing the number of URLs extracted with and without auto-scrolling, showing a significant increase from 22 to 246 URLs. The article concludes by emphasizing the versatility of the auto-scrolling code across different websites and encourages readers to engage with the author's work on GitHub.

Opinions

The author favors manual web scraping over using APIs for complex data extraction tasks involving keyword searches across multiple subreddits.
Infinite scrolling is acknowledged as a user-friendly web design feature that complicates web scraping.
The author emphasizes the importance of choosing an appropriate pause time in the auto-scrolling script to allow for content loading, suggesting that a too-short pause can lead to incomplete data collection.
The article suggests that the presented auto-scrolling method is robust and has been successfully applied to various websites beyond Reddit, including LinkedIn and Sephora.
The author seeks validation and support from readers, requesting them to clap for the story if they find it useful, which indicates a desire for community engagement and recognition.

Using Python and Selenium to Scrape Infinite Scroll Web Pages

Web scraping can be an important tool for data collection. While large social media platforms such as Twitter and Reddit provide APIs that let you extract data quickly with existing Python packages, you may sometimes encounter tasks that are difficult to solve using APIs. For instance, the Reddit API allows you to extract posts and comments from subreddits (online communities on Reddit), but it is hard to get posts and comments by keyword search (you will see more clearly what I mean in the next section). Moreover, not every web page has an API for web scraping. In these cases, manual web scraping becomes the best option. However, many web pages nowadays implement a web-design technique called infinite scrolling. Instead of using traditional pagination, infinite scroll web pages automatically load more content when users scroll down to the bottom of the page. While this is very convenient for users, it makes web scraping more difficult. In this story, I will show the Python code I developed to auto-scroll web pages, and demonstrate how to use it to scrape URLs from Reddit as an example.

Selenium for Infinite Scroll Web Pages: What Is the Problem?

Let’s say that I want to extract the posts and comments about COVID-19 on Reddit for sentiment analysis. I go to Reddit.com and search for “COVID-19”. The resulting page is as follows:

Search Results for COVID-19 on Reddit.com (Before Scrolling)

The text highlighted in blue boxes shows the subreddits. Notice that they are all different. Therefore, if I wanted to get all these posts through the Reddit API, I would first have to get the posts from each subreddit, and then write extra code to keep only the posts that are related to COVID-19. This is a very complicated process, so in this case manual scraping is the better choice.

The icon and numbers highlighted in red boxes are the scroll bar, the screen height, and the scroll height. The screen height is the full height of the screen, and the scroll height is the full height of the web page. The position of the scroll bar shows where my current screen is located within the web page, and the length of the scroll bar indicates how large the screen height is relative to the scroll height. In this case, the screen height is 864 and the scroll height is 3615, so the scroll bar is relatively long.

However, after I scroll down to the very bottom of the web page, the scroll bar shrinks, because the screen height is unchanged, but the scroll height now becomes 27452:

Search Results for COVID-19 on Reddit.com (After Scrolling)
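To make the two numbers concrete, here is a minimal sketch of how they could be read through Selenium and compared. It assumes that the screen height corresponds to the JavaScript property window.screen.height and the scroll height to document.body.scrollHeight; the helper names read_heights and visible_fraction are hypothetical, and the ratio is only a rough stand-in for the length of the scroll bar.

```python
def read_heights(driver):
    """Return (screen_height, scroll_height) in pixels for the current page.

    `driver` is assumed to be a Selenium WebDriver (or any object that
    offers the same `execute_script` method).
    """
    screen_height = driver.execute_script("return window.screen.height;")
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    return screen_height, scroll_height


def visible_fraction(screen_height, scroll_height):
    """Rough share of the page that fits on one screen, capped at 1.0."""
    return min(1.0, screen_height / scroll_height)


# Heights reported above, before and after scrolling to the bottom.
before = visible_fraction(864, 3615)
after = visible_fraction(864, 27452)
print(f"before scrolling: {before:.1%}, after scrolling: {after:.1%}")
```

With the heights above, one screen covers roughly 24% of the page before scrolling and only about 3% afterwards, which is why the scroll bar shrinks.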

This is infinite scrolling: initially, only a small number of posts are on the page, and new posts show up after I scroll down. Unfortunately, Selenium always opens web pages in their initial state. Therefore, the HTML we extract from this web page is incomplete, and we are unable to get the posts that show up only after scrolling down.

Solution? Simulate Scrolling!

So how can we extract the complete HTML and get all the posts from this Reddit web page? Well, we ask Selenium to scroll it! The following code shows how to implement the auto-scrolling feature in Selenium:

Selenium can execute JavaScript in the console of the web page. Therefore, I first obtain the screen height (the total height of my screen, which is less than the scroll height). Then, I ask Selenium to scroll down one screen height at a time. After a few iterations of the loop, Selenium reaches a position low enough to trigger the expansion of the page, and the new posts start to load. The web page usually takes some time to load new content and update the scroll height, so I add a pause after each scroll to give the page enough time to complete the update. If the pause time is too short, the scroll height will not have been updated when it is read, and the loop will break too early. After each pause, the updated scroll height is recorded, and the loop breaks when the next position Selenium would scroll to is larger than the scroll height (which means that we have reached the end).
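The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the function name scroll_to_bottom, the default pause of one second, and the max_steps safety cap are all hypothetical choices, and the JavaScript properties window.screen.height and document.body.scrollHeight are assumed to be the ones that match the screen height and scroll height discussed earlier.

```python
import time


def scroll_to_bottom(driver, pause_seconds=1.0, max_steps=1000):
    """Scroll one screen height at a time until the target passes the page end.

    `driver` is assumed to be a Selenium WebDriver (any object with the same
    `execute_script` method works). Returns the final scroll height in pixels.
    `max_steps` is a safety cap for pages that never stop growing.
    """
    screen_height = driver.execute_script("return window.screen.height;")
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    step = 1
    while step <= max_steps:
        # Scroll to step * screen_height pixels from the top of the page.
        driver.execute_script(
            "window.scrollTo(0, arguments[0]);", screen_height * step
        )
        step += 1
        # Give the page time to load new content and update its height.
        time.sleep(pause_seconds)
        scroll_height = driver.execute_script("return document.body.scrollHeight;")
        # Stop once the next scroll target lies beyond the end of the page.
        if screen_height * step > scroll_height:
            break
    return scroll_height
```

After the function returns, the driver's page_source should contain whatever content was loaded during scrolling, provided the pause was long enough for each batch to arrive.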

Now, Let’s Demonstrate!

Firstly, we need to inspect the page and check where to find the URLs:

The URLs we want are in the first a-tag under each div with the class “y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE”. Looks simple! Let’s try scraping the URLs without the auto-scrolling code:
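A minimal sketch of this extraction step is shown here, assuming the page HTML is already available as a string (for example from Selenium's driver.page_source). It uses only html.parser from the Python standard library, which is a substitution made to keep the example self-contained and not necessarily the parser used for the numbers reported in this story. The names FirstLinkCollector and extract_post_urls are hypothetical, and Reddit's generated class names change often, so the class string should be treated as an example rather than a stable selector.

```python
from html.parser import HTMLParser

# Class string quoted above; an example value, not a stable selector.
POST_DIV_CLASS = "y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE"


class FirstLinkCollector(HTMLParser):
    """Collect the href of the first <a> inside each <div> with given classes."""

    def __init__(self, div_class):
        super().__init__()
        self.wanted = set(div_class.split())
        self.depth = 0      # div nesting depth inside a matching div (0 = outside)
        self.found = False  # True once the current matching div has seen an <a>
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth:
                # Nested div inside a matching div: just track the depth.
                self.depth += 1
            elif self.wanted <= set((attrs.get("class") or "").split()):
                # A div carrying all wanted classes starts a new match.
                self.depth = 1
                self.found = False
        elif tag == "a" and self.depth and not self.found:
            # Only the first <a> of each matching div is considered.
            self.found = True
            if attrs.get("href"):
                self.urls.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_post_urls(html, div_class=POST_DIV_CLASS):
    """Return the first link of every matching div, in document order."""
    parser = FirstLinkCollector(div_class)
    parser.feed(html)
    return parser.urls
```

Calling len(extract_post_urls(driver.page_source)) then gives the number of URLs found in whatever HTML the browser has loaded at that moment.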

The resulting number of URLs is 22.

Now, let’s implement the auto-scrolling code and see how many URLs we can get:

The resulting number of URLs increases to 246!

That’s it!

I have tested my auto-scrolling code on many different websites, including Linkedin.com and Sephora.com (not for myself though), among others. It performs quite well unless I choose a pause time that is too short (as I mentioned above). If you are interested in seeing what I used my code for on Sephora.com, it’s right here:

https://github.com/KuanWeiBeCool

If you find this story useful, please give it a clap! It will help me a lot!