Lynn Kwong

Summary

The provided web content outlines how to effectively use Selenium with Scrapy and Zyte Smart Proxy for web scraping, including JavaScript-heavy pages and geo-sensitive data.

Abstract

The article details a comprehensive approach to web scraping by integrating Scrapy, Selenium, and Zyte Smart Proxy. It begins by explaining the individual strengths of Scrapy for managing complex scraping tasks and Selenium for handling JavaScript-rendered content. The author then guides readers through setting up a virtual environment, installing necessary libraries, and creating a simple Scrapy project. The tutorial demonstrates the limitations of using Scrapy alone for JavaScript-rendered pages and then shows how Selenium can be used to overcome these limitations by rendering JavaScript before scraping. The article further explains how to combine Scrapy and Selenium to optimize the scraping process, including the creation of a custom downloader middleware for Selenium to improve efficiency. Additionally, the author discusses the integration of Zyte Smart Proxy to handle proxy management for scraping geo-sensitive data and avoiding IP bans. The article concludes with a discussion on how to use these tools together effectively and provides links to related resources for further learning.

Opinions

  • The author positions Scrapy and Selenium as complementary tools, emphasizing that Scrapy excels in complex scraping logic while Selenium is essential for JavaScript-rendered content.
  • The use of webdriver-manager is recommended for simplifying the management of Selenium's webdriver binaries.
  • The author suggests that using Selenium's driver.find_element() function directly can be cumbersome for web scraping and instead recommends integrating Selenium with Scrapy.
  • Creating a custom downloader middleware for Selenium is presented as a solution to avoid inefficiencies caused by duplicate requests to the target URL.
  • The author advocates for the use of Zyte Smart Proxy for production scraping, highlighting its ease of integration with both Scrapy and Selenium.
  • The article promotes the use of environment variables or secret management services to securely handle API keys, such as those for Zyte Smart Proxy.
  • The author encourages further exploration of related topics through provided links to additional resources and related posts.

How to Use Selenium with Scrapy and Zyte Smart Proxy

Learn a handy toolset for all types of scraping scenarios

Image by Mohamed_hassan on Pixabay

Scrapy and Selenium are two important tools in web scraping. Scrapy is a popular web scraping framework that is good at managing complex scraping logic for large projects with many spiders, while Selenium can render webpages created with JavaScript, which makes it a great complement to Scrapy. By combining Scrapy and Selenium in our scraping projects, we can scrape all kinds of web pages. And with the help of Zyte Smart Proxy, we can scrape geo-sensitive data and avoid being blocked.

In this post, we will introduce how to use Scrapy and Selenium separately and together, with and without proxies, using simple-to-follow examples.

Preparation

Let’s create a virtual environment so we can try out the latest versions of Python and the libraries:

conda create -n selenium python=3.12
conda activate selenium

pip install Scrapy selenium webdriver-manager scrapy-zyte-smartproxy zyte-smartproxy-selenium
  • Scrapy — A popular web scraping framework in Python.
  • selenium — A library used to automate web browser interaction from Python using the Selenium WebDriver. It is used to render JavaScript webpages in this post.
  • webdriver-manager — Selenium webdriver manager to simplify the management of binary drivers for different browsers. With this library, we don’t need to install and manage the webdriver binaries by ourselves.
  • scrapy-zyte-smartproxy — Provides easy use of Zyte Smart Proxy Manager (formerly Crawlera) with Scrapy.
  • zyte-smartproxy-selenium — A wrapper over Selenium Wire to provide Zyte Smart Proxy Manager-specific functionalities.

Use Scrapy directly

Let’s first create a simple Scrapy project, based on which more features will be added later:

scrapy startproject demo
cd demo

scrapy genspider quotes http://quotes.toscrape.com/js

The Scrapy framework will create the project and spider for you and add some boilerplate code to it.

Let’s add the code of our custom spider to the file quotes.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/js"]

    def parse(self, response):
        title = response.xpath("//h1/a/text()").get()
        quote = response.xpath(
            '//div[@class="quote"]/span[@class="text"]/text()'
        ).get()

        print({"title": title, "quote": quote})

If you are new to Scrapy or want to refresh your memory, you can check this post for a quick introduction to Scrapy, which covers it in much more detail.

Now we can run the spider with the following command:

scrapy crawl quotes -L WARNING

{'title': 'Quotes to Scrape', 'quote': None}

The output shows that the title can be scraped but the quote cannot. This is because the quotes are rendered by JavaScript code in the browser, and Scrapy only sees the raw HTML returned by the server. To scrape the quotes, we need to use Selenium to execute the JavaScript and render the page as plain HTML first.
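To make the reason more concrete, here is a minimal sketch that uses only the standard library, so it runs without a network connection or a browser. The `RAW_HTML` string is a hypothetical, simplified stand-in for a JavaScript-rendered page (not the actual markup of the demo site), and `QuoteDivCounter` is a helper name made up for this illustration. It shows that the title is present in the raw markup while no quote element exists until a browser runs the script.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for a JavaScript-rendered page: the quote
# text only lives inside a <script> block until a browser executes it.
RAW_HTML = """
<html><body>
  <h1><a href="/">Quotes to Scrape</a></h1>
  <script>
    var data = [{"text": "An example quote."}];
    // Browser-side code would build the quote elements from data.
  </script>
</body></html>
"""


class QuoteDivCounter(HTMLParser):
    """Count <div class="quote"> elements present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.quote_divs = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the current tag.
        if tag == "div" and ("class", "quote") in attrs:
            self.quote_divs += 1


counter = QuoteDivCounter()
counter.feed(RAW_HTML)

# No quote elements exist before the JavaScript has run.
print(counter.quote_divs)
```

An XPath query for the quote elements against this kind of markup has nothing to match, which is why Scrapy returns None for the quote.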

Use Selenium directly

Previously, we needed to manually install the webdriver binaries before we could use Selenium for web scraping, which is pretty cumbersome. Now, with the webdriver manager, this is all done automatically.

Selenium can be used for web scraping directly using the driver.find_element() function. You can create a new script file or simply run the following code in Python directly:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(
    options=options, service=ChromeService(ChromeDriverManager().install())
)
driver.get("http://quotes.toscrape.com/js")

title = driver.find_element(by=By.XPATH, value="//h1/a").text
quote = driver.find_element(
    by=By.XPATH, value='//div[@class="quote"]/span[@class="text"]'
).text

driver.quit()  # Close the browser once the data has been read.

print({"title": title, "quote": quote})

When the above code is run, the quote can then be scraped successfully because it’s rendered by Selenium:

{'title': 'Quotes to Scrape', 'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}

Use Scrapy and Selenium together

Since Scrapy is much more flexible and advanced than Selenium for web scraping, it is useful to combine the two in practice: Selenium returns the plain HTML of JavaScript-created webpages, which can then be parsed by Scrapy:

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/js"]

    def parse(self, response):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")

        driver = webdriver.Chrome(
            options=options,
            service=ChromeService(ChromeDriverManager().install()),
        )
        driver.get(response.url)

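        # Grab the rendered HTML, then close the browser so it doesn't leak.
        page_source = driver.page_source
        driver.quit()

        # Use Scrapy to parse the data.
        selector = Selector(text=page_source)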
        title = selector.xpath("//h1/a/text()").get()
        quote = selector.xpath(
            '//div[@class="quote"]/span[@class="text"]/text()'
        ).get()

        print({"title": title, "quote": quote})

This approach is simple to use. However, it has the shortcoming that two requests are made to the target URL, one with Scrapy and the other with Selenium, which is inefficient.

A better way is to create a downloader middleware for Selenium which can then avoid the request to the target URL with Scrapy. Instead, we will only request with Selenium and return an HtmlResponse which can then be parsed by Scrapy.
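The reason this works is that when a downloader middleware's process_request() returns a response, Scrapy skips its own download for that request. The following is a toy model of that short-circuit in plain Python, not Scrapy's actual implementation. All the names here (`PassThroughMiddleware`, `BrowserMiddleware`, `fetch`, `default_downloader`) are made up for illustration, and requests and responses are just strings.

```python
class PassThroughMiddleware:
    """Returns None, so the request continues down the chain."""

    def process_request(self, request):
        return None


class BrowserMiddleware:
    """Returns a response itself, standing in for a Selenium-based middleware."""

    def process_request(self, request):
        return f"rendered:{request}"


# Records every request that reaches the default downloader.
downloader_calls = []


def default_downloader(request):
    downloader_calls.append(request)
    return f"downloaded:{request}"


def fetch(request, middlewares):
    """Ask each middleware in order; fall back to the default downloader."""
    for middleware in middlewares:
        response = middleware.process_request(request)
        if response is not None:
            # A middleware produced the response: skip the default download.
            return response
    return default_downloader(request)


print(fetch("http://example.com", [PassThroughMiddleware()]))
print(fetch("http://example.com", [PassThroughMiddleware(), BrowserMiddleware()]))
print(downloader_calls)
```

In the second call the default downloader is never reached, which is the behavior we rely on to avoid the duplicate request.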

Remove the boilerplate code in middlewares.py and add the following code to it:

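from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


class SeleniumMiddleware:

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")

        self.driver = webdriver.Chrome(
            options=options,
            service=ChromeService(ChromeDriverManager().install()),
        )

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Quit the browser once, when the spider closes.
        crawler.signals.connect(
            middleware.spider_closed, signal=signals.spider_closed
        )
        return middleware

    def process_request(self, request, spider):
        # The driver is shared by all requests, so it must not be quit here.
        self.driver.get(request.url)
        body = self.driver.page_source

        return HtmlResponse(
            request.url, body=body, encoding="utf-8", request=request
        )

    def spider_closed(self, spider):
        self.driver.quit()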

Also, we need to add the middleware in settings.py:

ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
    'demo.middlewares.SeleniumMiddleware': 800,
}

Now all the requests would be made with Selenium. Hmm, this may also not be what we want because, in our scraping projects, not all websites need to be scraped with Selenium. For regular webpages that can be scraped without Selenium, using Selenium will slow down the scraping process.

To make it more controllable, let’s introduce a custom Spider property called use_selenium which decides if Selenium should be used for scraping or not:

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


class SeleniumMiddleware:

    def process_request(self, request, spider):
        if not getattr(spider, "use_selenium", None):
            return
    
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")

        self.driver = webdriver.Chrome(
            options=options,
            service=ChromeService(ChromeDriverManager().install()),
        )

        self.driver.get(request.url)
        body = self.driver.page_source
        self.driver.quit()

        return HtmlResponse(
            request.url, body=body, encoding="utf-8", request=request
        )

Note that we moved the initialization code from __init__() to process_request() because we cannot access spider properties in __init__().
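One trade-off of creating the driver inside process_request() is that a new browser is started for every request. If that becomes too slow, one option is to create the driver lazily on first use and reuse it afterwards. Below is a minimal sketch of that pattern, not the code used in this post. `LazyDriverMiddleware`, `driver_factory`, and `FakeDriver` are hypothetical names, and the fake driver is there only so the example runs without a browser.

```python
class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, so the sketch runs anywhere."""

    instances = 0

    def __init__(self):
        FakeDriver.instances += 1
        self.closed = False

    def quit(self):
        self.closed = True


class LazyDriverMiddleware:
    """Toy middleware that creates its driver on first use and then reuses it."""

    def __init__(self, driver_factory):
        # driver_factory is a hypothetical callable that returns a driver.
        self._driver_factory = driver_factory
        self._driver = None

    def get_driver(self):
        # Only start the (expensive) browser when it is actually needed.
        if self._driver is None:
            self._driver = self._driver_factory()
        return self._driver

    def close(self):
        # Quit the driver if one was ever created.
        if self._driver is not None:
            self._driver.quit()
            self._driver = None


middleware = LazyDriverMiddleware(FakeDriver)
first = middleware.get_driver()
second = middleware.get_driver()
print(first is second, FakeDriver.instances)
middleware.close()
```

In a real middleware, get_driver() would be called from process_request() after the use_selenium check, and close() when the spider finishes.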

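The use_selenium property can either be hardcoded in the Spider or passed on the command line with -a. Note that values passed with -a arrive as strings, so any non-empty value (even "False") is treated as true by the getattr() check; simply omit the argument to disable Selenium: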

# By default, the use_selenium property is not set.
scrapy crawl quotes -L WARNING
{'title': 'Quotes to Scrape', 'quote': None}

# Specify use_selenium to be True:
scrapy crawl quotes -L WARNING -a use_selenium=True
{'title': 'Quotes to Scrape', 'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
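Because arguments passed with -a reach the spider as strings, a plain truthiness check treats `-a use_selenium=False` as enabled. If you want explicit true/false values to work, a small helper along these lines could be used in the middleware. `as_bool` and `FakeSpider` are hypothetical names for this sketch, and the set of accepted strings is an assumption you can adjust.

```python
def as_bool(value):
    """Interpret a spider argument that may arrive as a string from `-a`."""
    if isinstance(value, str):
        return value.strip().lower() in {"1", "true", "yes", "on"}
    return bool(value)


class FakeSpider:
    """Hypothetical stand-in for a spider started with `-a use_selenium=False`."""

    use_selenium = "False"


spider = FakeSpider()

# A plain truthiness check sees a non-empty string and enables Selenium.
print(bool(getattr(spider, "use_selenium", None)))

# The string-aware check reads the value as intended.
print(as_bool(getattr(spider, "use_selenium", None)))
```

The check in the middleware would then read `if not as_bool(getattr(spider, "use_selenium", None)): return`.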

Use Scrapy with Zyte Smart Proxy

In production, we normally need to use proxies for web scraping, either to scrape geo-sensitive data or simply to avoid being blocked. In this section, we will demonstrate how to use Zyte Smart Proxy, which is very easy to integrate with Scrapy and Selenium. We just need to add the following lines to settings.py to enable Zyte Smart Proxy for our spiders.

# Enable Zyte smart proxy in settings.py.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware": 610,
    "demo.middlewares.SeleniumMiddleware": 800,
}

ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = "YOUR-ZYTE-API-KEY"

For more details regarding using proxies with Scrapy, please check this post.

If you don’t want to expose the API key in the code or the log, you can read it from an environment variable or a secret management service instead of hardcoding it.
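As a minimal sketch of the environment variable approach, the helper below reads the key at startup and fails early if it is missing. The function name `get_zyte_api_key` and the choice of `ZYTE_SMARTPROXY_APIKEY` as the environment variable name are assumptions made for this example, not something required by the library.

```python
import os


def get_zyte_api_key(env_var="ZYTE_SMARTPROXY_APIKEY"):
    """Return the API key from the environment, or fail with a clear error."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return api_key


# Possible usage in settings.py (left commented out so that this sketch
# can be imported even when the variable is not set):
# ZYTE_SMARTPROXY_APIKEY = get_zyte_api_key()
```

With this in place, the key can be exported in the shell or injected by your deployment tool, and it never appears in the repository.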

Use Scrapy, Selenium, and Zyte Smart Proxy together

To use Zyte Smart Proxy for Selenium in Scrapy, we need to update the SeleniumMiddleware created above so that the Zyte API key is passed to the driver before the request is made. Otherwise the proxy will not be used for scraping, even though ZYTE_SMARTPROXY_ENABLED is set to True in settings.py: that setting only takes effect for requests made without Selenium.

Please check the comments for the code below regarding the changes.

from scrapy.http import HtmlResponse

# `webdriver` is imported from `zyte_smartproxy_selenium` rather than `selenium`.
from zyte_smartproxy_selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


class SeleniumMiddleware:

    def process_request(self, request, spider):
        if not getattr(spider, "use_selenium", None):
            return        

        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
    
        # Add proxy for the Selenium driver.
        spm_options = {
            "spm_apikey": "YOUR-ZYTE-API-KEY",
            "headers": {
                "X-Crawlera-No-Bancheck": "1",
                "X-Crawlera-Profile": "desktop",
                "X-Crawlera-Cookies": "disable",
            },
        }

        self.driver = webdriver.Chrome(
            options=options,
            spm_options=spm_options,
            service=ChromeService(ChromeDriverManager().install()),
        )

        self.driver.get(request.url)
        body = self.driver.page_source
        self.driver.quit()

        return HtmlResponse(
            request.url, body=body, encoding="utf-8", request=request
        )

Now when we run the spider, the Zyte proxy will be used for scraping.

Scrapy and Selenium are both important tools in web scraping. Scrapy is good at managing complex scraping logic for large scraping projects with many spiders, while Selenium can render webpages created with JavaScript, which Scrapy cannot do on its own. By combining Scrapy and Selenium in our scraping projects, we can scrape all kinds of web pages. And with the help of Zyte Smart Proxy, we can scrape geo-sensitive data and avoid being blocked.
