The web content provides a tutorial on using Selenium with Python for web scraping, focusing on dynamic content extraction and interaction with web elements.
Abstract
The article is part of a series on web scraping with Python, specifically addressing the extraction of dynamic content from websites that require user interactions, such as clicking buttons or entering data. It introduces Selenium as a tool for automating browser actions, demonstrates how to set up and use Selenium with Python, and explains how to locate and interact with web elements using various methods like XPath and CSS selectors. The tutorial also covers the importance of waiting for elements to load and suggests combining Selenium with BeautifulSoup for more powerful scraping capabilities. The author, Esteban Thilliez, provides code examples and links to further resources, encouraging readers to follow his Medium account for more Python-related content.
Opinions
The author believes that web scraping static content is limited and that handling dynamic content is essential for comprehensive data extraction.
Selenium is presented as a versatile tool for web automation, capable of simulating user actions in a web browser.
The author emphasizes the ease of integrating Selenium with Python and the convenience of using Selenium with common web browsers without specifying their paths.
The article suggests that waiting mechanisms are crucial in web scraping to ensure elements are fully loaded before interacting with them.
Combining Selenium with BeautifulSoup is recommended for efficient web scraping, leveraging the strengths of both tools.
The author acknowledges that there are more complex aspects of Selenium not covered in the article, implying they may be less commonly needed.
The author invites engagement by asking readers to clap, comment, and follow for more content, indicating a desire to build a community around their work.
There is also a GitHub repo associated with this series if you want to find code examples: Web Scraping Series
In the last story, we saw how to scrape static content easily. That approach is limited, though: sometimes you will have to scrape websites that require interactions, such as clicking buttons or typing on the keyboard.
The content you get from these actions is called dynamic content. It is usually generated by JavaScript running in the browser, often with data fetched from a server-side script (written in PHP, for example).
Selenium
Selenium is a suite of tools for web automation projects, and it has Python bindings. Let’s install it now:
pip install selenium
Because Selenium simulates user actions, it works directly through a real browser, so it needs a web driver that matches your browser. For example, if you’re on Chrome, you can download it here: https://chromedriver.chromium.org/downloads (choose the version corresponding to your Chrome version). Recent Selenium releases (4.6 and later) include Selenium Manager, which can fetch a matching driver automatically, so the manual download is often unnecessary.
If you’re not on Chrome, just download the web driver corresponding to your web browser.
Launching the Driver
Now, you can open a Python project, and we’ll start to configure the driver. Let’s start with importing the web driver, and configuring it:
As I use Brave, I need to specify its location. If you use Chrome, Firefox, or any other common web browser, you don’t need to do this. Also, as my web driver is in the same folder as my script, I don’t need to specify its location. So, with Chrome, I could have just done this:
Now, I can open a page in the browser by calling driver.get(url).
Find Elements
Selenium provides two methods on the driver to find elements: driver.find_element returns the first matching element, and driver.find_elements returns a list of all matching elements.
These methods take two parameters, by and value:
by specifies the strategy used to find the element. It can be By.ID, By.CLASS_NAME, By.XPATH, By.CSS_SELECTOR, etc…
value is the string to look for with that strategy, such as an id, a class name, an XPath expression, or a CSS selector.
For example:
I won’t explain the XPath or CSS selector syntaxes in this story; both are well documented online.
Interacting with the Elements
There are several ways to interact with elements: you may want to retrieve their content, read their attributes, or perform actions on them.
To retrieve their content, as seen above, you can use element.text. To retrieve an attribute, use element.get_attribute(attribute) instead.
Then, you can also perform actions: element.click() clicks on a button or a link, element.send_keys(text) types into a field such as a search bar, etc…
Waiting
Sometimes, you will need to wait before scraping content or executing actions. For example, if you click on a button, perhaps you have to wait 3 seconds before anything happens.
You have two main ways to wait using Selenium:
You can wait a predefined time.
You can wait until an element is present on the page.
Selenium with BeautifulSoup
A powerful way to scrape the web is to combine Selenium with BeautifulSoup. This is easy because the driver exposes the source code of the current page through its page_source attribute.
Then, you just have to initialize a soup from that source and parse it as we did in the previous story:
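A minimal sketch, assuming the beautifulsoup4 package is installed and using Python's built-in html.parser. The function names and the choice of h2 headings are only for illustration:

```python
from bs4 import BeautifulSoup


def soup_from_driver(driver) -> BeautifulSoup:
    """Parse the HTML currently rendered in the browser."""
    html = driver.page_source  # source code of the page as the driver sees it
    return BeautifulSoup(html, "html.parser")


def heading_texts(driver) -> list:
    """Return the text of every <h2> on the current page."""
    soup = soup_from_driver(driver)
    return [h.get_text(strip=True) for h in soup.find_all("h2")]
```

Selenium handles the clicking and waiting, and once the content is on the page, BeautifulSoup takes over for the parsing.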
Final Note
Now, you know most of the things you can do with Selenium. It has other features too, but they’re a bit more complex and less often needed, so I won’t talk about them.