Using Python and Selenium to Scrape Infinite Scroll Web Pages
Web scraping can be an important tool for data collection. While big social media, such as Twitter and Reddit, supports APIs to quickly extract data using existing python packages, you may sometimes encounter tasks that are difficult to solve using APIs. For instance, the Reddit API allows you to extract posts and comments from subreddits (online communities in Reddit), but it is hard to get posts and comments by keyword search (you will see more clearly what I mean in the next section). Moreover, not every web page has API for web scraping. In these cases, manual web scraping becomes the optimum choice. However, nowadays many web pages implement a web-design technique: infinite scrolling. Infinite scroll web pages automatically expand the content when users scroll down to the bottom of the page, to replace the traditional pagination. While it is very convenient for the users, it adds difficulty to the web scrapping. In this story, I will show the python code I developed to auto-scrolling web pages, and demonstrate how to use it to scrape URLs in Reddit as an example.
Selenium for infinite scroll web pages: What Is The Problem?
Let’s say that I want to extract the posts and comments about COVID-19 on Reddit for sentiment analysis. I then go to Reddit.com and search “COVID-19”, the resulting page is as follow:

The texts highlighted in blue boxes are the subreddits. Notice that they are all different. Therefore, if I want to get all these posts through Reddit API, I would have to first get the posts from each subreddit, and write extra code to filter the posts that are related to COVID-19. This is a very complicated process, and thus in this case, manual scraping is favored.
The icon and numbers highlighted in red boxes are the scroll bar and the screen height and scroll height. The screen height represents the entire height of the screen, and the scroll height represents the entire height of the web page. The scroll bar tells where my current screen is located with respect to the entire web page, and the length of the scroll bar indicates how large the screen height is with respect to the scroll height. In this case, the screen height is 864 and the scroll height is 3615. So, the scroll bar is relatively long.
However, after I scroll down to the very bottom of the web page, the scroll bar shrinks, because the screen height is unchanged, but the scroll height now becomes 27452:

This is infinite scrolling: at the initial stage, only a small number of posts are on the page, and new posts will show up after I scroll down. Unfortunately, Selenium always opens the web pages in their initial forms. Therefore, the HTML we extract from this web page is incomplete, and we are unable to get the posts that show up after scrolling down.
Solution? Simulate Scrolling!
So how can we extract the complete HTML and get all the posts from this Reddit web page? Well, we ask Selenium to scroll it! The following code shows how to implement the auto-scrolling feature in Selenium:



