$2500 Per Month Web Scraping Gig
Harnessing the Power of Undetected ChromeDriver in Python for Web Automation
I earn a steady $2500 per month from two tasks: pulling bankruptcy data and extracting UCC data for Florida and North Carolina. The process is streamlined and takes me only 10 minutes each day to compile and send the data the customer requested. For the UCC data, I have an existing script that pulls the entire month's worth at once, so each day I only need to extract the required 20 data points and save them to that day's file. The bankruptcy data, by contrast, has to be pulled every day to keep the information up to date.
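As a rough illustration of the daily split described above, here is a minimal sketch. It assumes the month's records are already loaded into a list. The function name `split_into_daily_batches`, the `filing_id` field, and the sample data are hypothetical and are not taken from the actual UCC script.

```python
from typing import Iterator, List, Sequence


def split_into_daily_batches(records: Sequence[dict], per_day: int = 20) -> Iterator[List[dict]]:
    """Yield consecutive slices of `records`, `per_day` items at a time."""
    if per_day <= 0:
        raise ValueError("per_day must be positive")
    for start in range(0, len(records), per_day):
        yield list(records[start:start + per_day])


# Hypothetical month of 45 records: two full days and one partial day.
month = [{"filing_id": n} for n in range(45)]
batches = list(split_into_daily_batches(month))
print([len(batch) for batch in batches])  # prints [20, 20, 5]
```

Each batch could then be written to its own daily file, for example with pandas.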
Web automation has become increasingly popular in various industries, ranging from data scraping and testing to bot-driven tasks. Python, with its vast array of libraries, offers a robust ecosystem for web automation. When it comes to automating web browsers, one of the most widely used tools is ChromeDriver. In this article, we will explore the concept of an undetected ChromeDriver in Python and how it can enhance your web automation capabilities.
Understanding ChromeDriver: ChromeDriver is an essential component in the Selenium ecosystem, which allows developers to control and interact with the Chrome browser programmatically. It acts as a bridge between the browser and the Python code, enabling automated tasks like form filling, navigation, and data extraction. By utilizing ChromeDriver, developers can harness the power of Chrome’s capabilities within their automation scripts.
The Need for Undetected ChromeDriver: Websites employ various measures to detect and prevent automated browser activities, commonly referred to as bot detection mechanisms. These mechanisms are designed to protect against malicious activities and maintain the integrity of their platforms. However, they can also hinder legitimate automation tasks.
To overcome these challenges, developers have created modified versions of ChromeDriver known as “undetected ChromeDrivers.” These drivers patch ChromeDriver so that it exposes fewer of the telltale signs of automation that bot detection systems look for. By using an undetected ChromeDriver, developers can improve the success rate of their automation tasks and avoid many common roadblocks.
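One signal that detection scripts commonly read is the `navigator.webdriver` property, which the WebDriver specification sets to true while a browser is under automation. The sketch below shows how that value could be read through Selenium's `execute_script`. The names `reports_webdriver_flag` and `FakeDriver` are hypothetical, and the stub exists only so the example runs without a browser.

```python
def reports_webdriver_flag(driver) -> bool:
    """Return True if the page's JavaScript sees navigator.webdriver as truthy."""
    return bool(driver.execute_script("return navigator.webdriver"))


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self, flag):
        self.flag = flag

    def execute_script(self, script):
        # A real driver would evaluate `script` in the page; this stub
        # simply returns a canned value.
        return self.flag


print(reports_webdriver_flag(FakeDriver(True)))  # prints True
print(reports_webdriver_flag(FakeDriver(None)))  # prints False
```

With a real driver, you would pass the object returned by `uc.Chrome()` after loading a page.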
Key Features of Undetected ChromeDriver
- Emulating Human Behavior: Scripts built on an undetected ChromeDriver often add human-like activity such as mouse movements, scrolling, and random delays between actions. The undetected-chromedriver package focuses on patching the driver itself, so this behavior is something you add in your own script. It helps against detection systems that look for the regular timing patterns of automated activity.
- User-Agent Rotation: An important aspect of avoiding detection is rotating the user agent string, which identifies the browser being used. Because the driver accepts standard Chrome options, you can set a different user agent for each browsing session, making it harder for websites to recognize automated activity.
- Proxy Support: Many websites implement IP-based blocking or rate limiting to restrict automated access. The driver accepts Chrome's proxy settings, allowing developers to route their requests through a pool of IP addresses. The automation script then appears to come from different sources, which reduces the chances of being detected and blocked.
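The features above can be sketched with ordinary Chrome command-line switches. This is a minimal illustration, not the library's built-in behavior: `--user-agent` and `--proxy-server` are standard Chrome switches, while the helper names `build_chrome_args`, `human_pause_seconds`, and `make_driver`, along with the agent strings and proxy address, are hypothetical placeholders.

```python
import random
from typing import List, Optional, Sequence


def build_chrome_args(user_agents: Sequence[str],
                      proxy: Optional[str] = None,
                      rng: Optional[random.Random] = None) -> List[str]:
    """Pick one user agent at random and return Chrome command-line switches."""
    rng = rng or random.Random()
    args = [f"--user-agent={rng.choice(list(user_agents))}"]
    if proxy:
        args.append(f"--proxy-server={proxy}")
    return args


def human_pause_seconds(low: float = 1.0, high: float = 4.0,
                        rng: Optional[random.Random] = None) -> float:
    """Return a random delay, in seconds, to sleep between actions."""
    rng = rng or random.Random()
    return rng.uniform(low, high)


def make_driver(args: Sequence[str]):
    """Start an undetected-chromedriver session with the given switches."""
    # Imported lazily so the helpers above work without Chrome installed.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    for arg in args:
        options.add_argument(arg)
    return uc.Chrome(options=options)


# Placeholder values, not real user agents or proxies.
example_agents = ["ExampleAgent/1.0", "ExampleAgent/2.0"]
args = build_chrome_args(example_agents, proxy="http://proxy.example:8080")
print(args)
```

In a scraper you would call `make_driver(args)` once per session and `time.sleep(human_pause_seconds())` between page actions.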
Implementing Undetected ChromeDriver in Python: To utilize an undetected ChromeDriver in Python, you can leverage third-party libraries that provide the necessary functionality. One such library is “undetected-chromedriver,” which is built on top of Selenium and offers a straightforward way to integrate an undetected ChromeDriver into your automation workflow.
The following steps outline a basic implementation:
- Install the required libraries: Use pip to install the “selenium” and “undetected-chromedriver” packages.
- Import the necessary modules: Import the required modules from the installed libraries in your Python script.
- Configure the undetected ChromeDriver: Set up the undetected ChromeDriver instance with desired options, such as user agent rotation and proxy settings.
- Automate web tasks: Utilize the ChromeDriver instance to interact with web elements, navigate through pages, and extract data as needed.
- Handle exceptions and errors: Implement appropriate error handling mechanisms to deal with common issues like element not found or page load failures.
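The steps above can be sketched roughly as follows. This is a minimal outline under stated assumptions rather than a definitive implementation: the helper names `with_retries` and `fetch_heading` are hypothetical, and the choice to read the page's first `h1` element is only for illustration.

```python
# Step 1 (in a shell): pip install selenium undetected-chromedriver
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], attempts: int = 3, delay: float = 2.0) -> T:
    """Step 5: call `action`, retrying on any exception up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            time.sleep(delay)


def fetch_heading(url: str) -> str:
    """Open `url` in headless Chrome and return the text of its first h1."""
    # Step 2: import the modules (lazily, so with_retries is usable alone).
    import undetected_chromedriver as uc
    from selenium.webdriver.common.by import By

    # Step 3: configure the driver.
    options = uc.ChromeOptions()
    options.add_argument("--headless")
    driver = uc.Chrome(options=options)
    try:
        # Step 4: automate the task.
        driver.get(url)
        return with_retries(lambda: driver.find_element(By.TAG_NAME, "h1").text)
    finally:
        driver.quit()  # always release the browser


# Demonstrate the retry helper with a function that fails twice, then succeeds.
attempts_seen = []


def flaky():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise RuntimeError("not ready yet")
    return "ok"


result = with_retries(flaky, attempts=3, delay=0)
print(result, len(attempts_seen))  # prints: ok 3
```

Wrapping the driver in `try`/`finally` means the browser process is closed even when a step fails.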
Here is the actual code that I use for my scraper:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import pandas as pd
import openpyxl  # not called directly; pandas uses it to write .xlsx files
import undetected_chromedriver as uc
from time import sleep

# I removed some place and state names to save space
places = ['Alaska', 'Alabama Middle', 'Alabama Northern']
states_list = ['Alabama', 'Alaska', 'Arizona', 'Arkansas']
# Map each place to the first state name it contains, e.g. 'Alabama Middle' -> 'Alabama'
states = {item: next((state for state in states_list if state in item), '') for item in places}
records = []

options = uc.ChromeOptions()
options.add_argument('--headless')
driver = uc.Chrome(options=options)

# Every cell sits under the same table body, so build the XPath prefix once
rows_xpath = '//*[@id="__next"]/div/div/div[2]/div[2]/div/table/tbody'

for place in places:
    try:
        place = place.lower().replace(' ', '-')
        # I change the name of the bankruptcy site to protect my intellectual property
        url = f'https://www.thisbankruptcy.com/browse-filings/{place}-bankruptcy-cases-filed-in-2023?page=1'
        print(url)
        driver.get(url)
        sleep(5)  # wait for the page to finish loading
        for i in range(1, 51):  # check up to 50 table rows
            try:
                title = driver.find_element(By.XPATH, f'{rows_xpath}/tr[{i}]/td[1]').text
                location = driver.find_element(By.XPATH, f'{rows_xpath}/tr[{i}]/td[2]').text
                case = driver.find_element(By.XPATH, f'{rows_xpath}/tr[{i}]/td[3]/a').text
                chapter = driver.find_element(By.XPATH, f'{rows_xpath}/tr[{i}]/td[4]').text
                date = driver.find_element(By.XPATH, f'{rows_xpath}/tr[{i}]/td[5]').text
                records.append([title, location, case, chapter, date])
            except NoSuchElementException:
                break  # no more rows on this page
    except Exception as e:
        print(f"An error occurred while processing '{place}': {str(e)}")

driver.quit()

df = pd.DataFrame(records, columns=['Title', 'Location', 'Case', 'Chapter', 'Date'])
df['State'] = df['Location'].apply(lambda x: states.get(x))
df.to_excel('bankruptcies_061423.xlsx', index=False)
print(df.tail())

Conclusion
Undetected ChromeDriver in Python opens up new possibilities for web automation by making automated sessions look more like ordinary browsing, which helps them avoid bot detection systems. By combining it with techniques such as randomized delays, user agent rotation, and proxies, developers can improve the success rate of their automation tasks. Keep in mind that while undetected ChromeDrivers can increase the chances of successful automation, they should be used responsibly and in compliance with the terms of service of the websites being accessed.
