Categories: Web scraping, Tutorials

How to Scrape JavaScript-Rendered Web Pages with Python

Created Date: August 21, 2024   Updated Date: August 21, 2024
Have you ever been captivated by a website’s stunning visuals and interactions? The technology behind these engaging online experiences is often JavaScript. While JavaScript enables developers to create dynamic, interactive websites that delight users, it also presents significant challenges for data extraction. Many modern websites rely heavily on JavaScript to load dynamic content, making it difficult for developers to access the underlying data efficiently.

To effectively harvest data from such dynamic websites, you need specialized tools and techniques. In this article, we’ll focus on advanced techniques for scraping JavaScript-rendered web pages using Python. We’ll explore a combination of powerful tools and libraries, including Selenium, BeautifulSoup, and Requests, to effectively extract data from dynamic web pages.

But before we dive into the technical details, it’s worth noting that while the methods described in this article are powerful, they can be complex to implement and maintain. For those seeking a more streamlined solution, Scrape.do offers a brilliant alternative. With Scrape.do’s rendering infrastructure and playWithBrowser parameters, you can obtain scrape-ready HTML in seconds without relying on a browser, simplifying the process significantly.

Without wasting any more time, let’s dive right into it.

Setting Up the Development Environment (Prerequisites)

To get started with scraping JavaScript-rendered web pages, we need to set up our development environment with the necessary tools and libraries. This article assumes you have at least an intermediate level of Python experience. Here’s what you’ll need:

  • Python (latest stable version)
  • Selenium
  • BeautifulSoup
  • Requests
  • ChromeDriver or GeckoDriver

If you have all these, it is time for the next step, which is creating a virtual environment for Python. You can do that with the following:

python -m venv scraping_env

Next, activate the environment. The command depends on your operating system:

# On macOS/Linux:
source scraping_env/bin/activate

# On Windows:
scraping_env\Scripts\activate

Finally, install the necessary libraries:

pip install selenium beautifulsoup4 requests

Using Selenium for JavaScript Rendering

Selenium is a powerful tool for automating web browsers, and it’s ideal for dealing with the complexities of dynamic web page scraping. When you use Python Selenium web scraping techniques, you can execute JavaScript and capture the fully rendered HTML page, allowing you to extract data that would otherwise be inaccessible.

Now that Selenium is installed, you’ll need to download the appropriate WebDriver for your browser.

For Chrome, download ChromeDriver from the Chrome For Testing site and ensure it’s in your system PATH.

Here’s a quick tutorial to help you do that.
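
If you want to confirm that the driver is reachable from your PATH, you can run it from a terminal (shown here for ChromeDriver):

chromedriver --version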

Once you’ve done that, you’re ready to start. If, for instance, you want to launch a browser, load Scrape.do’s web page, and print out the title, here’s the basic Selenium route for that.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Specify the path to the ChromeDriver
driver_path = '/path/to/chromedriver'  # Adjust the path accordingly
driver = webdriver.Chrome(service=Service(driver_path))

# Open a web page
driver.get('https://scrape.do/')

# Print the page title
print(driver.title)

# Close the browser
driver.quit()

Handling Dynamic Content

One of the biggest challenges when scraping JavaScript-rendered web pages is ensuring that the dynamic content has fully loaded before attempting to extract it.

Selenium’s WebDriverWait provides a robust solution for synchronizing your script with the page’s loading process. It allows you to specify a maximum amount of time to wait for a condition to be met. If the condition is met before the timeout, the script proceeds; otherwise, a TimeoutException is raised.

If you want to extract multiple dynamic elements from a website, here’s how you can do that:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Set up Selenium
driver_path = '/path/to/chromedriver'
driver = webdriver.Chrome(service=Service(driver_path))
driver.get('https://scrape.do/')

try:
    # Wait for multiple elements to be present
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, 'dynamic_class'))
    )
    # Extract text from all elements with the specified class
    elements = driver.find_elements(By.CLASS_NAME, 'dynamic_class')
    for element in elements:
        print("Element text:", element.text)
except TimeoutException:
    print("Elements not found within the time frame")
finally:
    driver.quit()

Parsing Extracted Content with BeautifulSoup

Once you’ve successfully extracted the HTML content using Selenium, the next step is to parse it into a structured format for data extraction, and that’s where BeautifulSoup shines. BeautifulSoup is a Python library designed to parse HTML and XML documents. It works by creating a parse tree that allows you to navigate and search through the document’s structure.

Here’s a basic setup for passing the HTML content to BeautifulSoup for parsing after rendering the page with Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Set up Selenium
driver_path = '/path/to/chromedriver'
driver = webdriver.Chrome(service=Service(driver_path))
driver.get('https://example.com')

# Get the page source after JavaScript has rendered
page_source = driver.page_source

# Parse the page source with BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')

# Close the Selenium browser
driver.quit()

That’s just a basic setup, but you can easily tweak it to target specific elements and data that you want. Let’s take a couple of examples:

  • Extracting a Single Element by Tag

# Extract the title of the page
title = soup.title.text
print("Page Title:", title)

# Extract the first <h1> tag
h1 = soup.find('h1').text
print("First H1 Tag:", h1)

  • Extracting Multiple Elements by Class Name

# Extract all elements with a specific class name
elements_by_class = soup.find_all(class_='element_class')
for element in elements_by_class:
    print("Element with class 'element_class':", element.text)

  • Handling Complex Queries with CSS Selectors

# Use CSS selectors to find elements
selected_elements = soup.select('div.outer_class > span.inner_class')
for element in selected_elements:
    print("Selected Element Text:", element.text)

  • Combining Multiple Search Criteria

# Find elements that match multiple criteria
results = soup.find_all('div', class_='result', limit=5)
for result in results:
    title = result.find('h2', class_='title')
    description = result.find('p', class_='description')
    if title and description:
        print(f"Title: {title.text.strip()}")
        print(f"Description: {description.text.strip()}")
        print("---")

As we’ve just seen, Selenium BeautifulSoup scraping lets you easily parse the HTML content rendered by Selenium and extract the necessary data. This combination allows you to handle both static and dynamic content effectively. The examples provided demonstrate various ways to locate and extract specific elements and data, making your web scraping tasks more efficient and powerful.

However, instead of going through this arduous process, you can simply use Scrape.do now for free and get the same data in a matter of seconds.
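
For example, here’s a minimal sketch of what that looks like with the requests library, assuming a Scrape.do API token and the token, url, and render query parameters described in Scrape.do’s documentation (check the docs for the exact options available on your plan):

import requests

# Placeholder token; replace with your own Scrape.do API token
API_TOKEN = 'YOUR_SCRAPE_DO_TOKEN'
target_url = 'https://example.com'

# Assumption: the render parameter asks Scrape.do to execute JavaScript
# before returning the HTML, so no local browser is needed
response = requests.get(
    'https://api.scrape.do/',
    params={'token': API_TOKEN, 'url': target_url, 'render': 'true'},
)
print(response.text[:500])  # scrape-ready HTML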

Advanced Techniques with Selenium

Selenium is not just for navigating web pages and extracting static content; it can also interact with web elements to handle more complex scenarios. This includes clicking buttons, scrolling, filling out forms, and more. Let’s look at some advanced techniques and example codes for these interactions.

Clicking is one of the most common actions in web automation. To simulate a click on a button or link, you can use the click() method:

# Locate the elements first, then click them
link = driver.find_element(By.LINK_TEXT, 'Sign Up')
button = driver.find_element(By.ID, "submit-button")

link.click()
button.click()

Handling Alerts and Popups

For handling alerts and popups, use the switch_to.alert interface:

# Switch to the alert
alert = driver.switch_to.alert
alert_text = alert.text
alert.accept()  # Or alert.dismiss() to cancel

Scrolling

Selenium doesn’t provide a dedicated scrolling method on the driver object itself, so a common approach is to execute JavaScript for scrolling:

# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Scroll to a specific element
element = driver.find_element(By.ID, "target-element")
driver.execute_script("arguments[0].scrollIntoView();", element)

Handling Drag and Drop

Drag and drop is a common interaction in modern web interfaces. Selenium provides the ActionChains class to perform such complex user interactions. Here’s how to perform a drag and drop action:

from selenium.webdriver import ActionChains

source = driver.find_element(By.ID, "draggable")
target = driver.find_element(By.ID, "droppable")

actions = ActionChains(driver)
actions.drag_and_drop(source, target).perform()

If the drag and drop doesn’t work as expected, you can try a more manual approach:

# Alternative manual drag and drop
actions.click_and_hold(source).move_to_element(target).release().perform()

This approach breaks down the drag and drop into its component actions: click and hold the source, move to the target, and release.

These advanced techniques demonstrate the power and flexibility of Selenium for web automation and scraping. Individually they are simple, but by combining them you can create sophisticated scripts that navigate and interact with complex web applications, handle dynamic content, and extract data from challenging sources.

Using Headless Browsers with Selenium

A headless browser is a web browser without a graphical user interface (GUI), and it’s a favorite among web scrapers. Python headless browser scraping with Selenium offers several advantages, including faster execution, lower resource consumption, and the ability to run scripts in environments without a display.

Headless mode can be enabled by adding specific options to the browser configuration. To run Selenium in headless mode with Chrome, you just need to follow these steps:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Enable headless mode
chrome_options.add_argument("--disable-gpu")  # Disable GPU acceleration (Windows specific)

# Set up Selenium with Chrome in headless mode
driver_path = '/path/to/chromedriver'  # Adjust the path accordingly
driver = webdriver.Chrome(service=Service(driver_path), options=chrome_options)
driver.get('https://example.com')

For Firefox, the steps are similar:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options

# Set up Firefox options
firefox_options = Options()
firefox_options.add_argument("-headless")  # Enable headless mode

# Set up Selenium with Firefox in headless mode
driver_path = '/path/to/geckodriver'  # Adjust the path accordingly
driver = webdriver.Firefox(service=Service(driver_path), options=firefox_options)
driver.get('https://example.com')

Error Handling and Debugging

Error handling and debugging are indispensable parts of any successful web scraping project. These practices safeguard the script’s reliability, optimize its performance, and help ensure the accuracy of extracted data.

To pinpoint and resolve problems within the scraping script, you need comprehensive debugging techniques: logging, print statements, screenshots, and saved page sources all help you understand what the script was doing when something went wrong.

Let’s explore common issues and effective strategies to handle them.

Element Not Found Error

One of the most common challenges encountered in web scraping is the ‘Element Not Found’ error. This occurs when the script attempts to locate an element on a webpage that either doesn’t exist, hasn’t loaded yet, or has been dynamically changed.

To mitigate this issue, it’s essential to implement robust error handling mechanisms, such as using try-except blocks to catch NoSuchElementException, and employing explicit waits to ensure elements are present before attempting to interact with them.

from selenium.common.exceptions import NoSuchElementException

try:
    element = driver.find_element(By.ID, "my-element")
except NoSuchElementException:
    print("Element not found. The page structure might have changed.")
    # You might want to log this, take a screenshot, or implement a fallback strategy

Network Errors

Network errors, including connection timeouts, DNS resolution failures, or server-side issues, can frequently disrupt the execution of web scraping processes. These errors can manifest as unexpected interruptions, incomplete data extraction, or script failures.

To fix this, it’s crucial to implement robust error handling mechanisms, such as retry logic with exponential backoff, to gracefully handle temporary network disruptions and avoid overwhelming the target website. Additionally, incorporating proper logging practices can provide valuable insights into the root causes of network-related issues, aiding in troubleshooting and improving the script’s resilience.
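
As an illustration, here’s a minimal sketch of retry logic with exponential backoff around a plain requests call; the fetch_with_retry helper name, attempt count, and delays are example values rather than part of any library:

import time
import requests

def fetch_with_retry(url, max_attempts=4, base_delay=1):
    """Fetch a URL, retrying with exponential backoff on network errors."""
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            if attempt == max_attempts - 1:
                raise  # Give up after the final attempt
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"Request failed ({e}), retrying in {delay}s...")
            time.sleep(delay)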

StaleElementReferenceException

StaleElementReferenceException is a common challenge in web scraping when using Selenium. This error occurs when the web page’s Document Object Model (DOM) undergoes changes, rendering previously located elements inaccessible. Such modifications can happen due to various reasons, including dynamic content updates, AJAX calls, or page refreshes.

Here’s one way to handle it: re-locate the element on each attempt instead of reusing the stale reference:

from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By

def click_with_retry(driver, locator, max_attempts=3):
    # Re-find the element on every attempt so a stale reference is refreshed
    for attempt in range(max_attempts):
        try:
            driver.find_element(*locator).click()
            return
        except StaleElementReferenceException:
            if attempt == max_attempts - 1:
                raise
            print(f"Stale element, retrying... (Attempt {attempt + 1})")

# Example usage: click_with_retry(driver, (By.ID, "submit-button"))

Techniques for Debugging and Logging

Taking Screenshots to Document Errors

Capturing screenshots of the webpage at the moment an error occurs can be invaluable for debugging and understanding the root cause of the issue. Selenium provides a straightforward method to achieve this.

from selenium import webdriver
from selenium.webdriver.common.by import By
from PIL import Image

def take_screenshot(driver, filename):
    driver.save_screenshot(filename)
    img = Image.open(filename)
    # Resize or modify image if needed
    img.save(filename, 'PNG')

# ... rest of your code

try:
    # Your scraping logic
    element = driver.find_element(By.ID, "product-title")
    # ...
except Exception as e:
    # Handle the exception
    take_screenshot(driver, "error_screenshot.png")
    raise e

Saving Page Source for Analysis

Preserving the HTML source of a webpage when an error occurs is invaluable for post-mortem analysis. By capturing the DOM structure at the exact moment of failure, developers can meticulously examine the HTML content, identify missing elements, unexpected changes, or other anomalies that contributed to the error.

try:
    # Your selenium code
    pass
except Exception as e:
    with open('error_page_source.html', 'w', encoding='utf-8') as f:
        f.write(driver.page_source)
    print(f"Error occurred: {e}")

Logging For Comprehensive Monitoring

Implementing detailed logging is essential when scraping JavaScript-rendered pages: it helps you understand the behavior of your scraping script and diagnose issues effectively. Python’s built-in logging module provides a flexible and powerful way to record information about your scraping process.

import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    filename='scraper.log')

def scrape_page(url):
    logging.info(f"Starting to scrape {url}")
    try:
        driver.get(url)
        # More scraping code
        logging.info("Successfully scraped the page")
    except Exception as e:
        logging.error(f"Error while scraping {url}: {str(e)}")

Combining Selenium and Requests for Hybrid Scraping

Many websites offer a mix of static and dynamic content. In such cases, leveraging both Selenium and Requests can optimize the scraping process. Selenium handles dynamic content and browser interactions, while Requests efficiently fetches static content.

Let’s create a hybrid scraper that uses both Requests and Selenium. We’ll demonstrate this with a hypothetical e-commerce site where the product list is static, but product details are loaded dynamically.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import logging
import time

# Set up logging
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    filename='hybrid_scraper.log')

# Function to scrape static product list using requests
def scrape_product_list(url):
    logging.info(f"Scraping product list from {url}")
    try:
        response = requests.get(url)
        response.raise_for_status()
        logging.info("Successfully fetched product list")
        return response.text
    except requests.RequestException as e:
        logging.error(f"Error fetching product list from {url}: {str(e)}")
        return None

# Function to scrape dynamic product details using Selenium
def scrape_product_details(url):
    logging.info(f"Scraping product details from {url}")
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)

    try:
        driver.get(url)
        # Wait for the dynamic element to be present
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-detail'))
        )
        page_source = driver.page_source
        logging.info("Successfully fetched product details")
        return page_source
    except Exception as e:
        logging.error(f"Error fetching product details from {url}: {str(e)}")
        return None
    finally:
        driver.quit()
        logging.info("Browser closed")

# Function to combine static and dynamic scraping
def hybrid_scrape(static_url, dynamic_base_url):
    static_content = scrape_product_list(static_url)
    if not static_content:
        return

    soup_static = BeautifulSoup(static_content, 'html.parser')
    product_links = soup_static.find_all('a', {'class': 'product-link'})

    for link in product_links:
        product_url = dynamic_base_url + link['href']
        dynamic_content = scrape_product_details(product_url)

        if dynamic_content:
            soup_dynamic = BeautifulSoup(dynamic_content, 'html.parser')
            product_name = soup_dynamic.find('h1', {'class': 'product-name'}).text
            product_price = soup_dynamic.find('span', {'class': 'product-price'}).text
            product_description = soup_dynamic.find('div', {'class': 'product-description'}).text

            print(f"Product Name: {product_name}")
            print(f"Product Price: {product_price}")
            print(f"Product Description: {product_description}")
            print("-" * 50)

        time.sleep(1)  # Be polite and avoid hammering the server with requests

# Example usage of the hybrid_scrape function
static_url = 'https://example.com/products'  # URL of the product list page
dynamic_base_url = 'https://example.com'     # Base URL to construct full product URLs
hybrid_scrape(static_url, dynamic_base_url)

This hybrid scraper efficiently combines the strengths of requests for static content and Selenium for dynamic content. It demonstrates how to handle both types of content on an e-commerce site, providing a solution for web scraping scenarios where content is loaded dynamically via JavaScript.

Real-World Example

So far, we’ve covered both basic and advanced techniques for scraping data from JavaScript-rendered pages. Now, let’s look at an example that ties together many of the concepts we’ve discussed, including handling dynamic content, error management, and combining Selenium with Requests.

Our scenario: We’ll scrape an e-commerce site that sells electronics. The main page loads statically, but product details and user reviews are loaded dynamically. We’ll gather product information and user reviews, handling pagination and potential errors.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time
import csv
import logging
import random

class ElectronicsStoreScraper:
    def __init__(self):
        # Initialize a requests session for handling HTTP requests
        self.session = requests.Session()
        # Set up the Selenium WebDriver
        self.driver = self.setup_selenium()
        # Configure logging
        self.setup_logging()

    def setup_selenium(self):
        """
        Set up the Selenium WebDriver with Chrome in headless mode.
        """
        chrome_options = Options()
        chrome_options.add_argument("--headless")  # Run Chrome in headless mode (no GUI)
        chrome_options.add_argument("--window-size=1920,1080")  # Set a standard window size
        service = Service('path/to/chromedriver')  # Path to ChromeDriver
        return webdriver.Chrome(service=service, options=chrome_options)

    def setup_logging(self):
        """
        Configure logging to save scraper activities and errors to a file.
        """
        logging.basicConfig(filename='scraper.log', level=logging.INFO,
                            format='%(asctime)s - %(levelname)s - %(message)s')

    def get_product_links(self, url):
        """
        Fetch product links from the main product listing page using requests.

        Args:
            url (str): The URL of the product listing page.

        Returns:
            list: A list of product URLs.
        """
        try:
            # Set a user agent to mimic a browser
            headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
            }
            # Send a GET request to the URL
            response = self.session.get(url, headers=headers)
            response.raise_for_status()  # Raise an exception for bad status codes

            # Parse the HTML content
            soup = BeautifulSoup(response.content, 'html.parser')
            # Find all product link elements
            product_links = soup.select('a.product-item-link')
            # Extract and return the href attributes
            return [link['href'] for link in product_links]
        except requests.RequestException as e:
            # Log any request-related errors
            logging.error(f"Error fetching product links: {str(e)}")
            return []

    def get_product_details(self, url):
        """
        Fetch detailed information for a single product using Selenium.

        Args:
            url (str): The URL of the product page.

        Returns:
            dict: A dictionary containing product details, or None if an error occurs.
        """
        try:
            # Navigate to the product page
            self.driver.get(url)
            # Wait for the main product information to load
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "product-info-main"))
            )

            # Extract product details
            name = self.driver.find_element(By.CLASS_NAME, "page-title").text
            price = self.driver.find_element(By.CLASS_NAME, "price").text
            description = self.driver.find_element(By.CLASS_NAME, "product-info-description").text

            # Get product reviews
            reviews = self.get_product_reviews()

            return {
                "name": name,
                "price": price,
                "description": description,
                "reviews": reviews
            }
        except (TimeoutException, NoSuchElementException) as e:
            # Log any errors in finding elements or timeouts
            logging.error(f"Error getting product details for {url}: {str(e)}")
            return None

    def get_product_reviews(self):
        """
        Fetch and extract user reviews for the current product page.

        Returns:
            list: A list of dictionaries containing review text and ratings.
        """
        reviews = []
        try:
            # Find and click the reviews tab
            review_button = WebDriverWait(self.driver, 10).until(
                EC.element_to_be_clickable((By.ID, "tab-label-reviews"))
            )
            review_button.click()

            while True:
                # Wait for the review section to load
                WebDriverWait(self.driver, 10).until(
                    EC.presence_of_element_located((By.CLASS_NAME, "review-items"))
                )
                # Find all review elements
                review_elements = self.driver.find_elements(By.CLASS_NAME, "review-item")

                # Extract information from each review
                for review in review_elements:
                    review_text = review.find_element(By.CLASS_NAME, "review-content").text
                    rating = review.find_element(By.CLASS_NAME, "rating-result").get_attribute("title")
                    reviews.append({"text": review_text, "rating": rating})

                try:
                    # Check for and click the 'Next' button if available
                    next_button = self.driver.find_element(By.CSS_SELECTOR, ".pages-item-next")
                    if "disabled" in next_button.get_attribute("class"):
                        break  # No more pages
                    next_button.click()
                    time.sleep(2)  # Wait for the next page to load
                except NoSuchElementException:
                    break  # No 'Next' button found, assuming it's the last page

        except (TimeoutException, NoSuchElementException) as e:
            # Log any errors in finding elements or timeouts
            logging.warning(f"Error getting reviews: {str(e)}")

        return reviews

    def scrape_products(self, base_url, max_products=50):
        """
        Main scraping function to extract information for multiple products.

        Args:
            base_url (str): The base URL of the product listing.
            max_products (int): Maximum number of products to scrape.

        Returns:
            list: A list of dictionaries containing product information.
        """
        all_products = []
        page = 1
        while len(all_products) < max_products:
            # Construct the URL for the current page
            url = f"{base_url}?p={page}"
            # Get product links from the current page
            product_links = self.get_product_links(url)
            if not product_links:
                break  # No more products found

            for link in product_links:
                if len(all_products) >= max_products:
                    break
                # Get details for each product
                product = self.get_product_details(link)
                if product:
                    all_products.append(product)
                    logging.info(f"Scraped product: {product['name']}")
                # Random delay between requests to avoid overloading the server
                time.sleep(random.uniform(1, 3))

            page += 1  # Move to the next page

        return all_products

    def save_to_csv(self, products, filename='products.csv'):
        """
        Save the scraped product information to a CSV file.

        Args:
            products (list): List of product dictionaries.
            filename (str): Name of the output CSV file.
        """
        with open(filename, 'w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=["name", "price", "description", "reviews"])
            writer.writeheader()
            for product in products:
                # Convert the list of review dictionaries to a string
                reviews_str = '; '.join([f"{r['rating']} - {r['text']}" for r in product['reviews']])
                writer.writerow({
                    "name": product['name'],
                    "price": product['price'],
                    "description": product['description'],
                    "reviews": reviews_str
                })

    def close(self):
        """
        Close the Selenium WebDriver to free up resources.
        """
        self.driver.quit()

if __name__ == "__main__":
    scraper = ElectronicsStoreScraper()
    try:
        base_url = "https://example.com/electronics"
        # Start the scraping process
        products = scraper.scrape_products(base_url, max_products=50)
        # Save the scraped data to a CSV file
        scraper.save_to_csv(products)
        print(f"Scraped {len(products)} products successfully.")
    except Exception as e:
        # Log any unexpected errors
        logging.error(f"An unexpected error occurred: {str(e)}")
    finally:
        # Ensure the WebDriver is closed even if an error occurs
        scraper.close()

This scraper is designed to be robust and handle various scenarios you might encounter in a real-world scraping task. It’s important to note that you should always respect the website’s robots.txt file and terms of service when scraping. Additionally, you may need to adjust selectors and wait times based on the specific website you’re scraping.
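
For example, a quick pre-flight check with Python’s built-in urllib.robotparser might look like this (the URLs are placeholders):

import urllib.robotparser

# Load and parse the site's robots.txt (placeholder URL)
robot_parser = urllib.robotparser.RobotFileParser()
robot_parser.set_url("https://example.com/robots.txt")
robot_parser.read()

# Check whether a generic user agent may fetch the listing page before scraping it
if robot_parser.can_fetch("*", "https://example.com/electronics"):
    print("robots.txt allows scraping this path")
else:
    print("robots.txt disallows this path; skip it")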

To use this scraper:

  1. Replace ‘path/to/chromedriver’ with the actual path to your ChromeDriver.
  2. Replace "https://example.com/electronics" with the actual URL of the e-commerce site you’re scraping.
  3. Adjust the CSS selectors and class names to match the structure of the website you’re scraping.

Conclusion

Scraping JavaScript-rendered web pages requires a combination of tools and techniques. While Selenium and BeautifulSoup provide powerful capabilities, they also come with complexity in setup and maintenance. In the long run, a less complex and more comprehensive tool like Scrape.do is often the better choice.

Scrape.do’s rendering infrastructure and playWithBrowser parameters allow you to generate scrape-ready HTML in seconds, without running a browser yourself. This not only saves time and resources but also lets you focus on the data rather than the scraping infrastructure. The best part is that you can get started now for free!