Web Scraping with Selenium and Python - updated guide for 2025
Selenium is one of the most widely used browser automation frameworks for scraping and testing, and it gives scrapers a powerful toolkit for controlling real browsers programmatically.
Unlike static scrapers that can only retrieve pre-rendered HTML, Selenium WebDriver fully engages with the rendered Document Object Model (DOM), enabling data extraction from JavaScript-heavy websites.
Browser automation through Selenium is also crucial for handling complex scraping scenarios where JavaScript renders essential content, like e-commerce product listings or social media feeds.
This article will act as your handbook for scraping the web with Selenium in Python.
Without further ado, let’s dive right in!
How Selenium Works with WebDriver
Selenium facilitates browser automation by integrating with WebDriver, a standardized API and protocol for controlling web browsers.
Each major browser has its own driver (like ChromeDriver for Chrome or GeckoDriver for Firefox) that translates Selenium commands into actions within the browser.
This architecture allows Selenium to send instructions to WebDriver, which communicates directly with the browser to perform tasks such as:
- clicking buttons,
- scrolling,
- or waiting for JavaScript content to load.
By using browser-specific drivers, Selenium enables cross-browser and cross-platform automation, ensuring scripts behave consistently across different environments.
Browser vendors typically maintain their drivers, ensuring they are optimized for each browser’s unique features and updates. For cases where vendor-supported drivers are not available, the Selenium project provides its own driver support to maintain functionality.
This modular design not only supports transparent interaction across multiple browsers but also provides developers with flexibility for testing and scraping applications that demand high compatibility and responsiveness across platforms.
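As a quick illustration, switching browsers is mostly a matter of swapping the driver class. Here's a minimal sketch, assuming ChromeDriver and GeckoDriver are available locally (or can be fetched automatically by Selenium Manager):
from selenium import webdriver

# The same automation code drives different browsers through their respective drivers
for browser in (webdriver.Chrome(), webdriver.Firefox()):
    browser.get('https://example.com')
    print(browser.name, browser.title)
    browser.quit()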
Environment Setup for Selenium Web Scraping
Selenium has official bindings for several languages, including Python, Java, C#, JavaScript (Node.js), and Ruby. For this article, we'll work exclusively with Python.
To get started with Selenium in Python, the first step is installing the Selenium package. Use pip to install it:
pip install selenium
While Python is particularly popular for web scraping, the installation process is similar across languages: you add Selenium's client library to your project with that language's package manager.
Selenium requires a browser driver to communicate with your browser. For Chrome, use ChromeDriver, and for Firefox, use GeckoDriver. You can download the appropriate driver version for your browser from the official ChromeDriver or GeckoDriver download pages. (Since Selenium 4.6, Selenium Manager can also download a matching driver automatically, but the examples below point to a local driver explicitly.) After downloading, ensure the driver's path is correctly referenced in your script.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Provide the path to your ChromeDriver
service = Service('/path/to/chromedriver')
driver = webdriver.Chrome(service=service)
An important part of using Selenium is headless browsing. Headless browsing enables Selenium to run the browser in the background without displaying the graphical interface, making it faster and more resource-efficient. This is especially useful for large-scale scraping tasks where performance is key. Here’s how to set it up:
from selenium.webdriver.chrome.options import Options
# Configure headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'), options=chrome_options)
Headless mode speeds up the scraping process by reducing the overhead associated with rendering the browser UI, which is particularly beneficial when scraping at scale.
To optimize scraping performance, it's crucial to configure timeouts so you don't wait unnecessarily for page elements. Selenium offers implicit waits (a global delay) and explicit waits (targeted delays). Implicit waits tell Selenium to wait up to a set time before throwing an exception if an element is not found, while explicit waits give more precise control by waiting for specific conditions to be met (e.g., element presence).
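Here's a minimal sketch of both approaches; the element locator is just a placeholder:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')

# Implicit wait: applies globally to every find_element call
driver.implicitly_wait(10)  # seconds

# Explicit wait: blocks until a specific condition is met (or times out)
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'placeholder-id'))
)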
Handling Dynamic Content with Selenium
Today, JavaScript is the backbone of the internet, and most websites built with JavaScript frameworks like React, Angular, or Vue dynamically render content after the initial page load. Traditional static scraping methods struggle here because the HTML is often empty or incomplete until the JavaScript has fully executed.
Selenium, however, can render the entire DOM by interacting with the browser, making it ideal for scraping JavaScript-heavy websites. Selenium’s WebDriverWait is essential when scraping JavaScript-rendered pages, as it allows you to wait for specific elements to appear before attempting to extract data.
By combining this with conditions like presence_of_all_elements_located, Selenium ensures you scrape only once the data is fully loaded. Here’s an example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service

# Set up Selenium
driver_path = '/path/to/chromedriver'
driver = webdriver.Chrome(service=Service(driver_path))
driver.get('https://www.scrapingcourse.com/ecommerce/')
try:
# Wait for multiple elements to be present
WebDriverWait(driver, 20).until(
EC.presence_of_all_elements_located((By.CLASS_NAME, 'product-name'))
)
# Extract text from all elements with the specified class
elements = driver.find_elements(By.CLASS_NAME, 'product-name')
# Loop through each element and print the text
for element in elements:
print("Product name:", element.text)
except Exception as e:
print("Error occurred:", str(e))
finally:
# Ensure the browser is closed
driver.quit()
In this example, the script waits until the JavaScript has rendered a product name before extracting the text, ensuring you scrape the right data.
Handling Lazy Loading and Infinite Scroll
Many modern websites, particularly those with long product lists or social media feeds, use lazy loading or infinite scroll to improve performance. Instead of loading all the content at once, new content is loaded dynamically as the user scrolls.
Selenium can handle these types of interactions by simulating scrolling. To ensure all data is loaded, you can repeatedly scroll to the bottom of the page, triggering additional content loads, until no new content appears. Here’s how to do that:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service

# Set up Selenium
driver_path = '/path/to/chromedriver'  # Replace with your actual chromedriver path
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Run in headless mode to speed up the scraping
options.add_argument('--disable-gpu')  # Disable GPU acceleration
driver = webdriver.Chrome(service=Service(driver_path), options=options)
# Define function to scroll to the bottom of the page
def scroll_to_bottom(driver):
""" Scroll to the bottom of the page to trigger lazy loading. """
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2) # Adjust sleep to wait for new content to load
try:
# Open the e-commerce page
driver.get('https://www.scrapingcourse.com/infinite-scrolling')
# Initialize product list
product_names = set()
# Scroll multiple times to load more content dynamically
scroll_pause_time = 2 # Seconds to wait for new products to load after scrolling
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
# Scroll down to the bottom of the page
scroll_to_bottom(driver)
# Wait for new elements to be loaded
try:
WebDriverWait(driver, 20).until(
EC.presence_of_all_elements_located((By.CLASS_NAME, 'product-name'))
)
except TimeoutException:
print("Timeout waiting for products to load.")
break
# Extract product names after each scroll
products = driver.find_elements(By.CLASS_NAME, 'product-name')
for product in products:
product_names.add(product.text) # Use a set to avoid duplicates
# Calculate new scroll height and compare with last scroll height
new_height = driver.execute_script("return document.body.scrollHeight")
if new_height == last_height:
# If the scroll height hasn't changed, exit the loop (end of page)
break
last_height = new_height
time.sleep(scroll_pause_time)
# Print all collected product names
for name in product_names:
print("Product name:", name)
except Exception as e:
print("Error occurred:", str(e))
finally:
# Ensure the browser is closed properly
driver.quit()
This script scrolls to the bottom of the page, waits for more content to load, and repeats the process until no new content is detected, which is essential for pages that rely on infinite scroll or lazy loading. However, you might want to add a maximum scroll count or an overall time limit so the loop can't run forever on pages that keep generating content.
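For example, here's a minimal sketch of such a guard, reusing the scroll_to_bottom helper from the script above; the limits are arbitrary placeholders:
import time

MAX_SCROLLS = 50           # Arbitrary cap on scroll iterations
MAX_RUNTIME_SECONDS = 120  # Arbitrary cap on total runtime

start_time = time.time()
for _ in range(MAX_SCROLLS):
    scroll_to_bottom(driver)
    if time.time() - start_time >= MAX_RUNTIME_SECONDS:
        print("Stopping: reached the time limit.")
        break
    # ... extract newly loaded items and compare scroll heights as in the script above ...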
Scrape.do makes handling dynamic content, lazy loading, and infinite scrolling significantly easier by offering a fully managed scraping API that automatically deals with JavaScript rendering and dynamic data loading.
Instead of manually using Selenium to wait for elements, simulate scrolling, and manage JavaScript-heavy websites, Scrape.do renders the full DOM, handling all dynamic content in the background.
This eliminates the need for complex scrolling scripts or WebDriver waits, allowing you to scrape data from JavaScript frameworks like React or Angular effortlessly. Additionally, Scrape.do optimizes the scraping process by managing timeouts, avoiding infinite loops, and ensuring efficient data extraction from any dynamic site.
Dealing with CAPTCHA, Anti-bot Measures, and Throttling
Web scraping can sometimes feel like a furious dance: you try to extract data from a website while the site tries to stop you with anti-bot mechanisms like CAPTCHAs, rate limiting, and IP blocking. These are formidable defenses, but there are several techniques you can use to get past them.
Handling CAPTCHA
CAPTCHAs are one of the most common anti-bot mechanisms, used to ensure the interaction is coming from a human. CAPTCHA types can range from simple image selection to more complex versions like reCAPTCHA and Cloudflare’s JavaScript challenges.
Common Anti-Bot Mechanisms include:
- Simple CAPTCHAs: These might ask the user to enter text from an image.
- reCAPTCHA v2/v3: Google’s reCAPTCHA that often includes clicking images or is invisible to the user but detects bot-like behavior.
- Cloudflare Challenges: These include JavaScript-based challenges that test whether requests are coming from a bot.
CAPTCHAs can be solved manually or automatically by integrating with third-party CAPTCHA-solving services such as 2Captcha, AntiCaptcha, or DeathByCaptcha. These services allow you to send the CAPTCHA to their API, where human solvers or AI resolve it and return the solution.
Here’s an example of using 2Captcha to bypass CAPTCHAs:
import base64
import time

import requests

API_KEY = 'your-2captcha-api-key'
captcha_image_url = 'https://example.com/captcha-image-url'

# Step 1: Download the CAPTCHA image, base64-encode it, and send it to 2Captcha
image_base64 = base64.b64encode(requests.get(captcha_image_url).content).decode('utf-8')
captcha_data = {
    'key': API_KEY,
    'method': 'base64',
    'body': image_base64,
    'json': 1
}
response = requests.post('http://2captcha.com/in.php', data=captcha_data)
captcha_id = response.json().get('request')
# Step 2: Poll for the solution
solution_url = f'http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}&json=1'
while True:
result = requests.get(solution_url).json()
if result.get('status') == 1:
print(f"Captcha Solved: {result['request']}")
break
else:
time.sleep(5) # Wait and retry until solved
This approach is often used to bypass simple image CAPTCHAs. reCAPTCHA and Cloudflare challenges require more advanced techniques: for reCAPTCHA v2, you submit the page’s site key to 2Captcha and pass the returned g-recaptcha-response token back to the page. Cloudflare, on the other hand, uses JavaScript challenges or CAPTCHA forms to detect bots, and these challenges may require a dedicated tool like CloudScraper to bypass. To do that, you first have to install cloudscraper:
pip install cloudscraper
Then, you can run the following script:
import cloudscraper
# Using CloudScraper to bypass Cloudflare challenges
scraper = cloudscraper.create_scraper()
response = scraper.get('https://example.com')
print(response.text)
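Going back to reCAPTCHA v2, the token flow mentioned above looks roughly like the sketch below. It assumes 2Captcha's userrecaptcha method and a driver that is already on the protected page; treat the exact parameter names and the hidden g-recaptcha-response field as assumptions to verify against 2Captcha's documentation and the target page.
import time
import requests

API_KEY = 'your-2captcha-api-key'
page_url = 'https://example.com/page-with-recaptcha'
site_key = 'data-sitekey-value-from-the-page'  # found in the g-recaptcha element's HTML

# Submit the reCAPTCHA job to 2Captcha
resp = requests.post('http://2captcha.com/in.php', data={
    'key': API_KEY,
    'method': 'userrecaptcha',
    'googlekey': site_key,
    'pageurl': page_url,
    'json': 1
})
captcha_id = resp.json().get('request')

# Poll until the token is ready
while True:
    result = requests.get(
        f'http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}&json=1'
    ).json()
    if result.get('status') == 1:
        token = result['request']
        break
    time.sleep(5)

# Inject the token into the hidden g-recaptcha-response field, then submit the form as the page expects
driver.execute_script(
    "document.getElementById('g-recaptcha-response').innerHTML = arguments[0];", token
)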
Randomization and Throttling to Avoid IP Blocking
To avoid IP blocking and throttling, it’s essential to simulate human behavior by rotating user agents, adding random delays, and interacting with the page in a non-uniform way (such as random mouse movements or keystrokes). Here’s how to do that:
- Rotating User Agents: Changing the user-agent string for each request can help prevent detection by mimicking different browsers. Selenium’s webdriver.ChromeOptions() allows you to set a custom user-agent.
from selenium import webdriver
import random
# List of user-agents
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15',
'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0'
]
# Set up Selenium with a random user-agent
options = webdriver.ChromeOptions()
user_agent = random.choice(user_agents)
options.add_argument(f'user-agent={user_agent}')
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')
- Adding Random Delays: Adding random delays between actions mimics human behavior and reduces the chance of being flagged as a bot.
import time
import random
# Random delay between actions
time.sleep(random.uniform(2, 5)) # Wait between 2 and 5 seconds
- Simulating Human Interaction (Mouse Movements, Keystrokes): Interacting with the page using mouse movements and keystrokes can further reduce bot detection. You can use Selenium’s actions to simulate user interaction.
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
# Move the mouse to a random position
actions = ActionChains(driver)
element = driver.find_element(By.ID, 'some-element')
actions.move_to_element(element).perform()
# Simulate typing in a text box
text_box = driver.find_element(By.NAME, 'search')
text_box.send_keys('Selenium Web Scraping')
- IP Rotation and Proxies: Using proxies or rotating IPs is another method to avoid detection. Services like ScraperAPI or Bright Data allow you to rotate IPs for each request.
from selenium import webdriver
# Set up proxy in Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://myproxy:1234')
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')
Scrape.do simplifies overcoming these challenges by offering a fully managed web scraping API that handles anti-bot measures, CAPTCHAs, and throttling without needing manual intervention. With Scrape.do, you don’t have to worry about integrating CAPTCHA-solving services, rotating user agents, or managing proxies. The platform automatically takes care of these, ensuring uninterrupted data extraction while simulating human-like behavior to avoid detection.
Our built-in support for IP rotation and CAPTCHA bypass means you can scrape even the most secure websites without getting blocked, significantly reducing the complexity and overhead of managing your scraping infrastructure.
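As a rough illustration, a call through the API can look like the sketch below. Treat the endpoint and parameter names as assumptions drawn from Scrape.do's query-parameter interface and verify them against the official documentation:
import requests

TOKEN = 'your-scrape-do-token'   # issued from your Scrape.do dashboard
target_url = 'https://example.com'

# Assumed query-parameter interface; check the Scrape.do docs for the exact parameters
response = requests.get(
    'https://api.scrape.do',
    params={'token': TOKEN, 'url': target_url},
)
print(response.text)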
Advanced DOM Manipulation and Interaction
When it comes to web scraping, sometimes you need to interact with forms, submit search queries, or click buttons to retrieve specific data. Selenium’s ability to interact with a webpage like a real user makes it perfect for scraping such dynamic content.
To extract data after interacting with forms, you can simulate filling out form fields, selecting dropdowns, and clicking buttons. Let’s break this down with some examples.
Submitting a Search Query
Suppose you want to scrape search results from a website where you need to submit a search term first. Here’s how you’d do that:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.chrome.service import Service

# Set up Selenium WebDriver
driver_path = '/path/to/chromedriver'
driver = webdriver.Chrome(service=Service(driver_path))
# Open the search page
driver.get('https://example.com/search')
# Find the search box and input a search term
search_box = driver.find_element(By.NAME, 'q') # Assuming 'q' is the name of the search box
search_box.send_keys('Selenium Web Scraping')
# Submit the search form by simulating an Enter key press
search_box.send_keys(Keys.RETURN)
# Wait for results to load
time.sleep(2)
# Scrape search results
results = driver.find_elements(By.CLASS_NAME, 'result')
for result in results:
print(result.text)
# Close the browser
driver.quit()
In this example, you first locate the search box using its HTML attributes (e.g., name="q"). The send_keys() method sends input to the form, simulating typing, and you can then submit the form by simulating an Enter key press using Keys.RETURN.
In some cases, you need to explicitly click the “Submit” button after filling out the form. Here’s how to handle that:
from selenium.webdriver.common.by import By
# Locate the submit button by its ID and click it
submit_button = driver.find_element(By.ID, 'submit-button-id')
submit_button.click()
Interacting with Dropdowns and Radio Buttons
When forms contain dropdown menus or radio buttons, Selenium can also select these elements.
from selenium.webdriver.support.ui import Select
# Locate the dropdown menu and select an option
dropdown = Select(driver.find_element(By.ID, 'dropdown-id'))
dropdown.select_by_value('option_value') # You can select by value, index, or visible text
The Select class in Selenium is used to interact with dropdown menus. You can select an option by its value, visible text, or index.
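For reference, the other selection methods look like this (the option label and index are placeholders):
# Select by visible text or by position instead of by value
dropdown.select_by_visible_text('Option Label')
dropdown.select_by_index(2)

# Inspect which option is currently selected
print(dropdown.first_selected_option.text)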
Advanced Interaction: Dealing with Dynamic Forms
Some forms dynamically load additional content (e.g., dependent dropdowns or CAPTCHA after submission). In such cases, it’s essential to use explicit waits to ensure the new elements are loaded before interacting with them.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait for the dynamically loaded element (e.g., search results)
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-results'))
)
Data Extraction and Cleaning
Data extraction is one of the most critical tasks in web scraping, and Selenium provides various ways to locate and extract elements from a webpage. You can extract text, links, images, and attributes using different Selenium selectors like By.ID, By.CLASS_NAME, By.XPATH, and others.
Text content is often the most important data when scraping web pages. You can extract the text of an HTML element using the text attribute in Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
# Set up Selenium WebDriver with Service
driver_path = '/path/to/chromedriver'
service = Service(driver_path)
driver = webdriver.Chrome(service=service)
# Open the target page
driver.get('https://books.toscrape.com/')
# Extract text using different selectors
element_by_id = driver.find_element(By.ID, 'element-id').text
element_by_class = driver.find_element(By.CLASS_NAME, 'element-class').text
element_by_xpath = driver.find_element(By.XPATH, '//div[@class="element-class"]').text
# Print the extracted text
print("Text by ID:", element_by_id)
print("Text by Class:", element_by_class)
print("Text by XPath:", element_by_xpath)
# Close the browser
driver.quit()
You can also extract links from anchor (<a>) elements by retrieving their href attributes.
# Extract all links on the page
links = driver.find_elements(By.TAG_NAME, 'a')
for link in links:
href = link.get_attribute('href')
print("Link:", href)
Similar to extracting links, you can extract image URLs from <img> tags by retrieving their src attributes.
# Extract all images on the page
images = driver.find_elements(By.TAG_NAME, 'img')
for image in images:
src = image.get_attribute('src')
print("Image URL:", src)
Cleaning Extracted Data
Once data has been extracted, it often needs cleaning to make it useful. You may encounter whitespace, non-printable characters, or incomplete data. Python offers several methods to clean the data after extraction.
- Removing Whitespace and Line Breaks: HTML often includes unnecessary whitespace or new lines that you don’t need in your data.
raw_text = driver.find_element(By.ID, 'element-id').text
clean_text = raw_text.strip() # Removes leading and trailing whitespace
clean_text = ' '.join(clean_text.split()) # Removes extra spaces and line breaks
print("Cleaned text:", clean_text)
- Removing Non-Printable Characters: Sometimes text extracted from web pages may include hidden characters that need to be removed.
import re
from selenium.webdriver.common.by import By
# Extract text from the element
try:
raw_text = driver.find_element(By.CLASS_NAME, 'element-class').text
if raw_text:
# Clean the text by removing non-printable characters
clean_text = re.sub(r'[^\x20-\x7E]', '', raw_text)
print("Cleaned text:", clean_text)
else:
print("No text found in the element.")
except Exception as e:
print(f"An error occurred: {e}")
- Extracting and Cleaning Multiple Elements: When extracting data from a list of elements (e.g., product listings, blog posts), it’s important to structure the data efficiently. Here’s an example of extracting and cleaning a list of items from a page:
# Extract and clean product names and prices from a product listing page
products = driver.find_elements(By.CLASS_NAME, 'product')
for product in products:
name = product.find_element(By.CLASS_NAME, 'product-name').text.strip()
price = product.find_element(By.CLASS_NAME, 'product-price').text.strip()
print(f"Product: {name}, Price: {price}")
Using XPath for Advanced Selection
XPath is a powerful language for selecting elements based on complex conditions, such as finding elements based on their text content, attributes, or hierarchical relationships.
# Extract elements that contain specific text using XPath
elements = driver.find_elements(By.XPATH, "//div[contains(text(), 'Special Offer')]")
for element in elements:
print("Offer:", element.text)
Extracting Data from HTML Tables
HTML tables are common for displaying structured data, such as product listings, financial data, or rankings. Selenium allows you to navigate the DOM to extract table rows and cells easily. Let’s assume you have an HTML table with products and prices, and you want to extract this information.
<table id="product-table">
<tr>
<th>Product</th>
<th>Price</th>
</tr>
<tr>
<td>Product A</td>
<td>$20</td>
</tr>
<tr>
<td>Product B</td>
<td>$30</td>
</tr>
</table>
To extract this information, here’s what you’ll need to do:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Set up Selenium WebDriver
driver_path = '/path/to/chromedriver'
service = Service(executable_path=driver_path)
driver = webdriver.Chrome(service=service)
# Open the page with the table
driver.get('https://example.com/products')
try:
# Wait for the table to be present
table = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.ID, 'product-table'))
)
# Extract table rows (excluding the header)
rows = table.find_elements(By.TAG_NAME, 'tr')[1:] # Skipping the header row
# Iterate over rows and extract cells
for row in rows:
cells = row.find_elements(By.TAG_NAME, 'td')
if len(cells) >= 2: # Ensure there are at least two cells (product name and price)
product_name = cells[0].text.strip()
product_price = cells[1].text.strip()
print(f"Product: {product_name}, Price: {product_price}")
else:
print("Row does not have the expected number of cells.")
finally:
# Close the browser
driver.quit()
Optimizing Scraping Performance and Resource Management
When scraping at scale, optimizing performance and managing resources effectively are crucial for successful and efficient data extraction. Different techniques can achieve this, such as parallel execution using Selenium Grid and using resource management strategies to avoid common pitfalls like memory leaks or excessive resource consumption.
Parallel Execution with Selenium Grid
Selenium Grid allows you to run Selenium tests or scraping tasks on multiple machines or browsers in parallel, significantly improving performance by distributing the workload. Selenium Grid operates with two key components:
- Hub: The central point that receives requests and distributes them to nodes.
- Nodes: Machines (or browsers) where the actual browser sessions are executed.
By using Selenium Grid, you can run tests or scraping tasks on multiple machines or different browsers simultaneously, increasing efficiency and scalability.
To get started with Selenium Grid, first download the Selenium Server jar from the official Selenium downloads page. Next, run the following command to start the hub:
java -jar selenium-server-4.24.0.jar hub
This will start the Selenium Grid Hub at http://localhost:4444. On each machine that will act as a node, start a Selenium node and register it with the hub:
java -jar selenium-server-4.24.0.jar node --hub http://localhost:4444
This will register the node to the hub, and it will be ready to receive tasks.
Once the Grid is set up, you can configure Selenium to send browser sessions to the hub, which will distribute them to available nodes.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set up the WebDriver to connect to the Selenium Grid hub
grid_url = "http://localhost:4444/wd/hub"

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')  # Optional: run headless if you don't need a GUI
chrome_options.add_argument('--disable-gpu')  # Disable GPU acceleration for headless mode

# Initialize the WebDriver to connect to the hub, which distributes the session to an available node
driver = webdriver.Remote(command_executor=grid_url, options=chrome_options)

try:
# Perform scraping tasks
driver.get('https://example.com')
print(driver.title)
finally:
# Close the browser session
driver.quit()
Resource Management
When scraping at scale, resource management becomes crucial to avoid excessive memory usage, prevent browser crashes, and ensure efficient handling of large-scale operations. One of the primary causes of resource leaks is failing to properly close browser sessions. Each browser session consumes memory and CPU resources, so it’s essential to manage them efficiently.
To avoid this, use quit() instead of close(): driver.close() closes only the current window, while driver.quit() closes all windows and ends the WebDriver session, releasing all resources.
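Here's a minimal sketch of the difference, assuming a script that opens a second tab:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')

# Open a second tab and load another page
driver.switch_to.new_window('tab')
driver.get('https://example.org')

driver.close()  # Closes only the current tab; the session and the first tab stay alive
driver.quit()   # Closes every window and ends the WebDriver session, releasing all resources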
For large-scale scraping tasks, handling and storing the data efficiently is critical to avoid performance bottlenecks. Writing data to disk or a database in real-time instead of storing it in memory helps manage memory usage better.
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Path to chromedriver
driver_path = '/path/to/chromedriver'
# Open the CSV file in write mode
with open('data.csv', mode='w', newline='', encoding='utf-8') as file:
writer = csv.writer(file)
writer.writerow(['Product Name', 'Price']) # Write header
# Set up Selenium WebDriver
service = Service(executable_path=driver_path)
driver = webdriver.Chrome(service=service)
try:
# Loop through 100 pages
for i in range(100):
driver.get(f'https://example.com/products/page/{i}')
# Wait for product elements to be loaded (Optional but recommended)
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.CLASS_NAME, 'product-name'))
)
# Extract data
products = driver.find_elements(By.CLASS_NAME, 'product-name')
prices = driver.find_elements(By.CLASS_NAME, 'product-price')
# Zip products and prices and write to CSV
for product, price in zip(products, prices):
writer.writerow([product.text.strip(), price.text.strip()])
except Exception as e:
print(f"An error occurred: {e}")
finally:
# Close the browser
driver.quit()
Best Practices for Selenium Web Scraping
Handling Edge Cases
Websites frequently update their structures, change class names, or introduce new anti-bot mechanisms, which can cause your scraper to break. Handling such edge cases involves building resilient and adaptive scraping workflows. Here are a few tips to help you achieve this:
- Rather than relying on hardcoded IDs or class names, which can lead to brittle scrapers, use more flexible and dynamic strategies, such as XPath and CSS selectors that target elements based on their content or relationships to other elements.
# Extract a product by its label text, instead of relying on specific classes
product_name = driver.find_element(By.XPATH, "//div[contains(@class, 'product')]//h2").text
- Before extracting data, validate the presence of key elements. If an element isn’t found, handle it gracefully rather than allowing the script to crash.
if not driver.find_elements(By.CLASS_NAME, 'product-name'):
print("Error: Product elements not found on page.")
else:
# Proceed with scraping if elements are present
products = driver.find_elements(By.CLASS_NAME, 'product-name')
- If a page structure changes (e.g., due to A/B testing or minor updates), implement fallbacks. For instance, try different selectors if the primary one fails.
from selenium.common.exceptions import NoSuchElementException

try:
product = driver.find_element(By.CLASS_NAME, 'product-name')
except NoSuchElementException:
print("Primary selector failed, trying fallback")
product = driver.find_element(By.XPATH, "//h2[contains(text(), 'Product')]")
Legal Considerations
Scraping can raise legal issues, especially if the website’s terms of service prohibit data extraction or if sensitive information is involved. It’s important to comply with the following legal guidelines to avoid liability.
- Before scraping a website, always review its terms of service. Many websites explicitly prohibit scraping or impose restrictions on how data can be used. Respecting these guidelines is essential to avoid legal issues.
- Ensure you are not scraping personally identifiable information (PII) or sensitive data that could violate privacy regulations, such as the GDPR in Europe or CCPA in California.
- Excessive scraping can put significant load on a server, potentially resembling a denial-of-service (DoS) attack. Respect the website’s traffic limits by implementing polite scraping techniques (e.g., setting request limits, adding delays); a brief sketch follows this list.
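Here's a brief sketch of polite scraping: it checks robots.txt with Python's standard library and spaces out requests. The crawl delay and URLs are placeholders:
import time
import urllib.robotparser

BASE_URL = 'https://example.com'
CRAWL_DELAY_SECONDS = 3  # Placeholder delay between requests

# Check robots.txt before crawling
robots = urllib.robotparser.RobotFileParser()
robots.set_url(f'{BASE_URL}/robots.txt')
robots.read()

for i in range(1, 6):
    url = f'{BASE_URL}/products/page/{i}'
    if not robots.can_fetch('*', url):
        print(f"Skipping disallowed URL: {url}")
        continue
    driver.get(url)
    # ... extract data here ...
    time.sleep(CRAWL_DELAY_SECONDS)  # Keep the request rate gentle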
Error Handling
A well-designed scraper must be capable of handling errors gracefully. Pages may fail to load, elements may not be found, or you may face network interruptions, so implementing robust error handling and retry mechanisms will improve scraper reliability.
If an element is missing, catching NoSuchElementException prevents your script from crashing. Provide fallback mechanisms or retries when an element is not found.
from selenium.common.exceptions import NoSuchElementException
try:
element = driver.find_element(By.ID, 'non-existent-element')
except NoSuchElementException:
print("Element not found. Skipping this step.")
Pages may also fail to load due to temporary network issues, server errors, or JavaScript not rendering correctly. Implement a retry mechanism to attempt loading the page multiple times before giving up.
import time
from selenium.common.exceptions import TimeoutException
def load_page_with_retries(url, retries=3):
    driver.set_page_load_timeout(10)  # Make driver.get raise TimeoutException after 10 seconds
    for i in range(retries):
        try:
            driver.get(url)
            return True  # Exit if successful
except TimeoutException:
print(f"Page load failed. Retry {i + 1} of {retries}")
time.sleep(2) # Wait before retrying
return False # Return False if all retries fail
# Use the function to load a page
if not load_page_with_retries('https://example.com'):
print("Page failed to load after multiple retries")
Finally, if scraping fails due to unforeseen issues, ensure your script shuts down properly and cleans up resources like open browser instances.
try:
# Perform scraping tasks
driver.get('https://example.com')
# ... additional scraping logic ...
except Exception as e:
print(f"Error occurred: {str(e)}")
finally:
driver.quit() # Ensure the browser is closed
Advanced Selenium Usage
When dealing with modern web applications, advanced Selenium techniques are essential for handling complex scenarios like pagination and dynamic content loading.
Multi-Page Scraping
The method below is a sketch meant to live inside a scraper class that provides self.driver, self.logger, and helpers such as wait_for_content and extract_page_data (plus the usual Selenium and typing imports); it pages through results until the “next” button disappears or a page limit is reached.
def scrape_paginated_content(self, base_url: str, max_pages: Optional[int] = None) -> List[Dict]:
"""
Scrape content across multiple pages with error handling
Args:
base_url (str): Starting URL for scraping
max_pages (int, optional): Maximum number of pages to scrape
"""
all_data = []
current_page = 1
try:
while True:
# Check page limit
if max_pages and current_page > max_pages:
self.logger.info(f"Reached maximum page limit: {max_pages}")
break
# Wait for content to load
self.wait_for_content()
# Extract current page data
page_data = self.extract_page_data()
all_data.extend(page_data)
# Find next button with better error handling
try:
next_button = WebDriverWait(self.driver, 10).until(
EC.presence_of_element_located((By.CSS_SELECTOR, ".pagination .next:not(.disabled)"))
)
# Check if next button is actually clickable
if not next_button.is_displayed() or not next_button.is_enabled():
self.logger.info("Reached last page - next button not clickable")
break
# Scroll to button and click
self.driver.execute_script("arguments[0].scrollIntoView(true);", next_button)
next_button.click()
# Wait for new content
time.sleep(2) # Allow for any animations
current_page += 1
self.logger.info(f"Navigated to page {current_page}")
except TimeoutException:
self.logger.info("No next button found - reached last page")
break
except Exception as e:
self.logger.error(f"Error during pagination: {str(e)}")
return all_data
Resource Management
Implement proper resource management to prevent memory leaks and ensure clean cleanup:
import logging
import os

from selenium import webdriver

class ResourceManager:
def __init__(self):
self.temp_files = []
self.active_drivers = []
def register_driver(self, driver: webdriver.Chrome):
"""Register a WebDriver instance for cleanup"""
self.active_drivers.append(driver)
def cleanup(self):
"""Clean up all resources"""
# Close all active WebDriver instances
for driver in self.active_drivers:
try:
driver.quit()
except Exception as e:
logging.error(f"Error closing driver: {e}")
# Clear temporary files
for temp_file in self.temp_files:
try:
os.remove(temp_file)
except Exception as e:
logging.error(f"Error removing temp file: {e}")
Code Optimization and Scaling
For large-scale scraping operations, implement distributed processing and proper resource management:
import json
import logging

from concurrent.futures import ThreadPoolExecutor, as_completed
from queue import Queue
from threading import Lock
from typing import Dict, List
class DistributedScraper:
    def __init__(self, max_workers: int = 5):
        self.max_workers = max_workers
        self.data_queue = Queue()
        self.lock = Lock()
        self.resource_manager = ResourceManager()
        self.logger = logging.getLogger(__name__)  # Used by the scraping methods below
def process_url(self, url: str) -> Dict:
"""Process a single URL with proper resource management"""
driver = None
try:
driver = self.create_driver()
self.resource_manager.register_driver(driver)
# Scrape data
data = self.scrape_url(driver, url)
# Store in queue
self.data_queue.put(data)
return data
finally:
if driver:
driver.quit()
def scrape_multiple_urls(self, urls: List[str]) -> List[Dict]:
"""Scrape multiple URLs concurrently"""
results = []
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
future_to_url = {
executor.submit(self.process_url, url): url
for url in urls
}
for future in as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
results.append(data)
except Exception as e:
self.logger.error(f"Error processing {url}: {e}")
return results
def save_results(self, results: List[Dict], output_file: str):
"""Save results with proper error handling"""
try:
with open(output_file, 'w') as f:
json.dump(results, f)
except Exception as e:
self.logger.error(f"Error saving results: {e}")
# Implement backup saving mechanism
Integration with Scrapy
For complex scraping projects, you can combine Selenium with Scrapy. The spider below is a sketch: it assumes helper methods such as create_driver and extract_items that you implement for your target site, and it uses Selenium to render each page before extraction:
from typing import Optional

from scrapy import Spider
from scrapy.http import Request
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
class SeleniumSpider(Spider):
name = 'selenium_spider'
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.driver: Optional[WebDriver] = None
def start_requests(self):
"""Initialize Selenium and start requests"""
self.driver = self.create_driver()
for url in self.start_urls:
yield Request(url, self.parse)
def parse(self, response):
"""Parse with Selenium for dynamic content"""
self.driver.get(response.url)
# Wait for dynamic content
WebDriverWait(self.driver, 10).until(
EC.presence_of_element_located((By.CSS_SELECTOR, ".content"))
)
# Extract data
for item in self.extract_items():
yield item
def closed(self, reason):
"""Clean up resources when spider closes"""
if self.driver:
self.driver.quit()
Conclusion
Selenium is a powerful tool for scraping dynamic, JavaScript-heavy websites, as we’ve seen in this guide. However, as the complexity of web scraping increases, so do the challenges, from anti-bot defenses to handling dynamic content efficiently.
This is where Scrape.do makes the difference. By using Scrape.do’s fully managed scraping API, you can:
- Bypass JavaScript-heavy obstacles with ease.
- Automate IP rotation and CAPTCHA handling effortlessly.
- Optimize your scraping operations with enhanced performance and scalability.
No need to spend hours fine-tuning Selenium scripts. Scrape.do’s robust toolkit ensures you focus on insights and growth, not on overcoming technical scraping barriers.
Start your web scraping journey today with 1,000 free credits.