How to Use Selenium with Scrapy and Zyte Smart Proxy

Learn a handy toolset for all types of scraping scenarios

Scrapy and Selenium are two important tools in web scraping. Scrapy is a popular web scraping framework that is good at managing complex scraping logic for large projects with many spiders, while Selenium can render webpages created with JavaScript, which makes it a great complement to Scrapy. By combining Scrapy and Selenium in our scraping projects, we can scrape all kinds of web pages. And with the help of Zyte Smart Proxy, we can scrape geo-sensitive data and avoid being blocked.

In this post, we will introduce how to use Scrapy and Selenium separately and together, with and without proxies, using simple-to-follow examples.

Preparation

Let’s create a virtual environment so we can try out the latest versions of Python and the libraries:

conda create -n selenium python=3.12
conda activate selenium

pip install Scrapy selenium webdriver-manager scrapy-zyte-smartproxy zyte-smartproxy-selenium

Scrapy — A popular web scraping framework in Python.
selenium — A library used to automate web browser interaction from Python using the Selenium WebDriver. It is used to render JavaScript webpages in this post.
webdriver-manager — Selenium webdriver manager to simplify the management of binary drivers for different browsers. With this library, we don’t need to install and manage the webdriver binaries by ourselves.
scrapy-zyte-smartproxy — Provides easy use of Zyte Smart Proxy Manager (formerly Crawlera) with Scrapy.
zyte-smartproxy-selenium — A wrapper over Selenium Wire to provide Zyte Smart Proxy Manager-specific functionalities.

Use Scrapy directly

Let’s first create a simple Scrapy project, based on which more features will be added later:

scrapy startproject demo
cd demo

scrapy genspider quotes http://quotes.toscrape.com/js

The Scrapy framework will create the project and spider for you and add some boilerplate code to it.

Let’s add the code of our custom spider to the generated file demo/spiders/quotes.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/js"]

    def parse(self, response):
        title = response.xpath("//h1/a/text()").get()
        quote = response.xpath(
            '//div[@class="quote"]/span[@class="text"]/text()'
        ).get()

        print({"title": title, "quote": quote})

If you are new to Scrapy or want to refresh your memory about it, you can check this post for a quick introduction to Scrapy with much more detail.

Now we can run the spider with the following command:

scrapy crawl quotes -L WARNING

{'title': 'Quotes to Scrape', 'quote': None}

The output shows that the title can be scraped but the quote cannot. This is because the quotes are rendered by JavaScript code in the browser, so they are not present in the raw HTML that Scrapy downloads. To scrape the quotes, we need to use Selenium to render the JavaScript code as plain HTML first.

Use Selenium directly

Previously, we needed to manually install the webdriver binaries before we could use Selenium for web scraping, which is pretty cumbersome. Now, with the webdriver manager, this is all done automatically.

Selenium can be used for web scraping directly using the driver.find_element() function. You can create a new script file or simply run the following code in Python directly:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(
    options=options, service=ChromeService(ChromeDriverManager().install())
)
driver.get("http://quotes.toscrape.com/js")

title = driver.find_element(by=By.XPATH, value="//h1/a").text
quote = driver.find_element(
    by=By.XPATH, value='//div[@class="quote"]/span[@class="text"]'
).text

print({"title": title, "quote": quote})

When the above code is run, the quote can then be scraped successfully because it’s rendered by Selenium:

{'title': 'Quotes to Scrape', 'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}

Use Scrapy and Selenium together

Since Scrapy is much more flexible and advanced than Selenium for web scraping, it is useful to combine the two in practice: Selenium returns the rendered HTML of JavaScript-created webpages, which can then be parsed by Scrapy:

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/js"]

    def parse(self, response):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")

        driver = webdriver.Chrome(
            options=options,
            service=ChromeService(ChromeDriverManager().install()),
        )
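        driver.get(response.url)
        page_source = driver.page_source
        # Close the browser so it does not linger after the page is fetched.
        driver.quit()

        # Use Scrapy to parse the data.
        selector = Selector(text=page_source)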
        title = selector.xpath("//h1/a/text()").get()
        quote = selector.xpath(
            '//div[@class="quote"]/span[@class="text"]/text()'
        ).get()

        print({"title": title, "quote": quote})

This approach is simple to use. However, it has the shortcoming that two requests are made to the target URL, one with Scrapy and the other with Selenium, which is inefficient.

A better way is to create a downloader middleware for Selenium which can then avoid the request to the target URL with Scrapy. Instead, we will only request with Selenium and return an HtmlResponse which can then be parsed by Scrapy.

Remove the boilerplate code in middlewares.py and add the following code to it:

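from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


class SeleniumMiddleware:

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")

        self.driver = webdriver.Chrome(
            options=options,
            service=ChromeService(ChromeDriverManager().install()),
        )

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Quit the browser once the spider closes, not after each request,
        # so the same driver can serve every request.
        crawler.signals.connect(
            middleware.spider_closed, signal=signals.spider_closed
        )
        return middleware

    def process_request(self, request, spider):
        self.driver.get(request.url)
        body = self.driver.page_source

        # Returning a response here skips Scrapy's own download.
        return HtmlResponse(
            request.url, body=body, encoding="utf-8", request=request
        )

    def spider_closed(self, spider):
        self.driver.quit()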

Also, we need to add the middleware in settings.py:

ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
    'demo.middlewares.SeleniumMiddleware': 800,
}

Now all requests are made with Selenium. However, this may not be what we want either because, in our scraping projects, not all websites need to be scraped with Selenium. For regular webpages that can be scraped without it, using Selenium only slows down the scraping process.

To make it more controllable, let’s introduce a custom Spider property called use_selenium which decides if Selenium should be used for scraping or not:

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


class SeleniumMiddleware:

    def process_request(self, request, spider):
        if not getattr(spider, "use_selenium", None):
            # Returning None lets Scrapy download the request as usual.
            return

        options = webdriver.ChromeOptions()
        options.add_argument("--headless")

        self.driver = webdriver.Chrome(
            options=options,
            service=ChromeService(ChromeDriverManager().install()),
        )

        self.driver.get(request.url)
        body = self.driver.page_source
        self.driver.quit()

        return HtmlResponse(
            request.url, body=body, encoding="utf-8", request=request
        )

Note that we moved the initialization code from __init__() to process_request() because we cannot access spider properties in __init__().

The use_selenium property can either be hardcoded in the Spider or passed on the command line:

# By default, the use_selenium property is not set:
scrapy crawl quotes -L WARNING
{'title': 'Quotes to Scrape', 'quote': None}

# Specify use_selenium to be True:
scrapy crawl quotes -L WARNING -a use_selenium=True
{'title': 'Quotes to Scrape', 'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
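One thing to watch out for: arguments passed with -a reach the spider as strings, so -a use_selenium=False sets the property to the non-empty string "False", which the getattr() check above treats as true. The sketch below shows one possible way to normalize the value; the helper name as_bool and the list of accepted spellings are assumptions for illustration.

```python
def as_bool(value, default=False):
    """Interpret a spider argument that may be a bool, a string, or missing."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


class FakeSpider:
    """Stand-in for a spider started with `-a use_selenium=False`."""

    use_selenium = "False"


spider = FakeSpider()

# The plain truthiness check sees a non-empty string and enables Selenium.
naive = bool(getattr(spider, "use_selenium", None))

# Parsing the string gives the result the user most likely intended.
parsed = as_bool(getattr(spider, "use_selenium", None))

print(naive, parsed)
```

In the middleware, the condition could then read `if not as_bool(getattr(spider, "use_selenium", None)): return`.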

Use Scrapy with Zyte Smart Proxy

In production, proxies are normally required for web scraping, either to scrape geo-sensitive data or simply to avoid being blocked. In this section, we will demonstrate how to use Zyte Smart Proxy, which is very easy to integrate with Scrapy and Selenium. We just need to add the following lines to settings.py to enable Zyte Smart Proxy for our spiders:

# Enable Zyte Smart Proxy in settings.py.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware": 610,
    "demo.middlewares.SeleniumMiddleware": 800,
}

ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = "YOUR-ZYTE-API-KEY"

For more details regarding using proxies with Scrapy, please check this post.

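If you don’t want to expose the API key in the code or the log, avoid hardcoding it in settings.py, for example by reading it from an environment variable, and check what your log output contains before sharing it.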
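As a minimal sketch of keeping the key out of the source code, the value can be read from the environment at startup. The variable name ZYTE_SMARTPROXY_APIKEY and the helper load_api_key are assumptions for illustration; any name works as long as it matches what is exported in your shell.

```python
import os

# Hypothetical environment variable name, chosen to mirror the setting name.
API_KEY_ENV_VAR = "ZYTE_SMARTPROXY_APIKEY"


def load_api_key(environ=os.environ):
    """Return the API key from the environment, failing early if it is unset."""
    key = environ.get(API_KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {API_KEY_ENV_VAR} environment variable before crawling"
        )
    return key


# In settings.py, the hardcoded string would then become:
# ZYTE_SMARTPROXY_APIKEY = load_api_key()
```

Failing early with a clear error is a design choice here: a missing key is easier to diagnose at startup than as a series of rejected proxy requests later.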

Use Scrapy, Selenium, and Zyte Smart Proxy together

To use Zyte Smart Proxy for Selenium in Scrapy, we need to update the SeleniumMiddleware created above. The ZYTE_SMARTPROXY_ENABLED setting in settings.py only affects requests made without Selenium, so even when it is set to True, the Selenium driver will not use the proxy unless we pass the Zyte API key to the driver before making the request.

The comments in the code below mark the changes.

from scrapy.http import HtmlResponse

# `webdriver` is imported from `zyte_smartproxy_selenium` rather than `selenium`.
from zyte_smartproxy_selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager


class SeleniumMiddleware:

    def process_request(self, request, spider):
        if not getattr(spider, "use_selenium", None):
            # Returning None lets Scrapy download the request as usual.
            return

        options = webdriver.ChromeOptions()
        options.add_argument("--headless")

        # Add proxy for the Selenium driver.
        spm_options = {
            "spm_apikey": "YOUR-ZYTE-API-KEY",
            "headers": {
                "X-Crawlera-No-Bancheck": "1",
                "X-Crawlera-Profile": "desktop",
                "X-Crawlera-Cookies": "disable",
            },
        }

        self.driver = webdriver.Chrome(
            options=options,
            spm_options=spm_options,
            service=ChromeService(ChromeDriverManager().install()),
        )

        self.driver.get(request.url)
        body = self.driver.page_source
        self.driver.quit()

        return HtmlResponse(
            request.url, body=body, encoding="utf-8", request=request
        )

Now when we run the spider, the Zyte proxy will be used for scraping.
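The API key is hardcoded in the middleware above. One possible refinement, sketched here under assumptions, is to build the proxy options in a small helper so the key can come from wherever it is stored. The helper name build_spm_options and the extra_headers parameter are hypothetical; the header values are the ones used above.

```python
def build_spm_options(api_key, extra_headers=None):
    """Build the spm_options dict passed to the Zyte-wrapped webdriver.Chrome."""
    if not api_key:
        raise ValueError("A Zyte Smart Proxy API key is required")

    # Same headers as in the middleware above.
    headers = {
        "X-Crawlera-No-Bancheck": "1",
        "X-Crawlera-Profile": "desktop",
        "X-Crawlera-Cookies": "disable",
    }
    headers.update(extra_headers or {})

    return {"spm_apikey": api_key, "headers": headers}


# Inside process_request(), assuming the key is defined in settings.py,
# the options could be created like this:
# spm_options = build_spm_options(spider.settings.get("ZYTE_SMARTPROXY_APIKEY"))
```

This keeps the key in a single place, so the Scrapy proxy middleware and the Selenium driver use the same value.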

Scrapy and Selenium are both important tools in web scraping. Scrapy is good at managing complex scraping logic for large scraping projects with many spiders, while Selenium can render webpages created with JavaScript, which Scrapy cannot do on its own. By combining Scrapy and Selenium in our scraping projects, we can scrape all kinds of web pages. And with the help of Zyte Smart Proxy, we can scrape geo-sensitive data and avoid being blocked.
