A Quick Introduction to Selenium in Ruby
Selenium with Ruby is great for web scraping and testing because it can simulate real user interactions to load dynamic content.
Ruby’s concise and expressive syntax simplifies writing scripts, while its rich ecosystem of libraries adds flexibility to the automation process. In this article, we’ll tell you all you need to know about Selenium in Ruby, covering both setup and advanced use cases.
First, let’s look at some prerequisites:
Prerequisites and Setup
Before diving into browser automation with Selenium in Ruby, it’s important to ensure that your system meets the necessary requirements and is set up correctly to handle web scraping or testing tasks.
System Requirements
To begin, you’ll need a compatible version of Ruby installed on your system. Selenium WebDriver is compatible with the latest stable versions of Ruby, such as Ruby 3.3.5, so it’s recommended to have this or a similar version installed to avoid compatibility issues.
Once Ruby is set up, the next step is installing the necessary dependencies. For a complete browser automation setup in Ruby, you’ll need a few key libraries. The first is the `selenium-webdriver` gem, which provides the core functionality for automating browser interactions. If you’re scraping data from web pages, you might also find the `nokogiri` gem helpful for parsing and extracting HTML content.
For those looking to run browser automation without launching a visible browser window (i.e., headless mode), the `headless` gem is an optional but useful tool.
You can install these gems by running the following commands:
gem install selenium-webdriver
gem install nokogiri
gem install webdrivers # Handles driver management automatically
gem install headless
Browser Driver Setup
To successfully run Selenium automation in Ruby, you’ll need to install the appropriate browser drivers. These drivers act as a bridge between Selenium and the browser, allowing Selenium to interact with the browser’s functionality. It’s essential to ensure that the driver versions are compatible with your browser’s version to avoid conflicts.
Here’s how to install the appropriate WebDriver for your browser:
ChromeDriver (for Chrome):
To automate Chrome, you’ll need to download the ChromeDriver release that matches the version of Chrome you have installed. You can find the compatible version of ChromeDriver on the official ChromeDriver site.
To check your Chrome version, open Chrome and go to “chrome://settings/help”, or click the three-dot menu and select “Help” > “About Google Chrome.” Once you’ve downloaded the matching version of ChromeDriver, place it in a folder and add that folder to your system’s PATH environment variable. You can verify the installation by running `chromedriver --version` in your terminal.
GeckoDriver (for Firefox):
For Firefox, Selenium requires GeckoDriver, which can be downloaded from the official GeckoDriver releases page.
Make sure the version of GeckoDriver you download is compatible with your version of Firefox. You can check your Firefox version by navigating to “Help” > “About Firefox” in the browser. Like ChromeDriver, you must ensure that GeckoDriver is added to your PATH environment variable. After downloading, you can test the installation with `geckodriver --version`.
EdgeDriver (for Edge):
To automate tasks in Microsoft Edge, you’ll need EdgeDriver, which you can download from the Microsoft Edge Driver site.
The EdgeDriver version should correspond to the version of Edge installed on your machine, which you can find by navigating to “Help and Feedback” > “About Microsoft Edge.” After downloading, make sure to add the EdgeDriver executable to your system PATH, and verify it with `msedgedriver --version` in the terminal.
Safari (SafariDriver):
For macOS users, Safari comes with built-in support for SafariDriver, which is included by default in Safari 10 or later.
There’s no need to download a separate driver, but you must enable ‘Allow Remote Automation’ in Safari’s developer options. To do this, go to “Safari” > “Preferences” > “Advanced” and check the box next to “Show Develop menu in menu bar.” Then, in the “Develop” menu, select “Allow Remote Automation.”
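Once remote automation is enabled, no further driver setup is needed; a quick sanity check might look like this (macOS only):
require 'selenium-webdriver'
# Start Safari (requires "Allow Remote Automation" to be enabled)
driver = Selenium::WebDriver.for :safari
driver.get 'https://www.example.com'
puts "Page Title: #{driver.title}"
driver.quit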
Once you’re done with that, it’s time to set up your project.
Basic Project Structure
First, create a new directory for your scraping project:
mkdir ruby_scraping_project
cd ruby_scraping_project
bundle init
Next, you need to add the following to your Gemfile:
source 'https://rubygems.org'
gem 'selenium-webdriver'
gem 'webdrivers'
gem 'nokogiri'
gem 'base64'
gem 'logger'
gem 'ostruct'
Finally, run `bundle install` to install dependencies.
Setting Up Selenium for Ruby
Once you’ve installed the necessary dependencies and drivers, getting started with Selenium in Ruby is a breeze.
Below, we’ll walk through the basic setup for using Selenium WebDriver in Ruby. We’ll begin with a basic example, then explore configuring various browser drivers, handling WebDriver paths, and using custom browser options.
Basic Setup
Here’s a basic example demonstrating how to set up Selenium in Ruby and automate a simple browser task like navigating to a webpage and printing its title:
require 'selenium-webdriver'
# Initialize the Chrome browser
driver = Selenium::WebDriver.for :chrome
# Navigate to a webpage
driver.get 'https://www.example.com'
# Output the title of the page
puts "Page Title: #{driver.title}"
# Close the browser
driver.quit
This script opens Chrome, navigates to a webpage, retrieves the page title, and then closes the browser. To execute the script, save it in your project directory (for example, as `selenium_example.rb`), and then run the file in your terminal:
ruby selenium_example.rb
Driver Configuration
When using Selenium to automate browser tasks, it’s essential to configure the correct browser driver that acts as an interface between Selenium and the browser. This driver enables Selenium to control the browser and perform actions like navigating, clicking, and extracting data.
Selenium supports multiple browsers (Chrome, Firefox, Edge, etc.), each requiring a corresponding WebDriver that we’ve shown how to download. Now, let’s look at setting up different drivers.
Chrome Driver Setup
require 'selenium-webdriver'
# Configure Chrome with specific options
options = Selenium::WebDriver::Chrome::Options.new
# Example: Open Chrome in incognito mode and headless mode
options.add_argument('--incognito')
options.add_argument('--headless')
# Initialize the browser with custom options
driver = Selenium::WebDriver.for :chrome, options: options
driver.get 'https://www.scrapingcourse.com/ecommerce/'
puts "Page Title: #{driver.title}"
driver.quit
Firefox Driver Setup
require 'selenium-webdriver'
# Configure Firefox options
options = Selenium::WebDriver::Firefox::Options.new
# Example: Run Firefox in headless mode
options.add_argument('--headless') # Run Firefox without a visible browser window
# Initialize Firefox
driver = Selenium::WebDriver.for :firefox, options: options
driver.get 'https://www.example.com'
puts "Page Title: #{driver.title}"
driver.quit
Edge Driver Setup
If you’re working with Microsoft Edge, the setup is similar:
require 'selenium-webdriver'
# Initialize Edge browser
driver = Selenium::WebDriver.for :edge
driver.get 'https://www.example.com'
puts "Page Title: #{driver.title}"
driver.quit
Handling WebDriver Paths
If the WebDriver executable is not in your system’s PATH, you must manually specify its location so that Selenium can access it. For instance, to set the path for ChromeDriver in Ruby, you would use the following line of code:
Selenium::WebDriver::Chrome.driver_path = '/path/to/your/chromedriver'
This ensures that Selenium can locate the ChromeDriver executable if it’s not globally accessible via the system’s PATH. This is especially useful when you have multiple versions of a WebDriver installed or when running scripts on machines where the WebDriver isn’t set up in the default environment.
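If you are on a recent selenium-webdriver 4.x release, you can also point Selenium at a specific driver binary through a Service object rather than the class-level setter; a minimal sketch, assuming the same placeholder path:
require 'selenium-webdriver'
# Point Selenium at a specific ChromeDriver binary via a Service object
service = Selenium::WebDriver::Service.chrome(path: '/path/to/your/chromedriver')
driver = Selenium::WebDriver.for :chrome, service: service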
Custom Browser Configurations
To optimize your scraping tasks, you can set various custom browser options:
Headless Mode
Running the browser without a graphical interface is useful for faster, less resource-intensive scraping.
options.add_argument('--headless')
Incognito Mode
Avoids storing browsing history and cookies.
options.add_argument('--incognito')
User-Agent String
Customizing the browser’s user-agent string helps reduce detection when scraping:
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
Window Size
Sometimes adjusting the window size can affect how content loads:
options.add_argument('--window-size=1200,600')
These configurations allow flexibility when interacting with different sites and help in avoiding detection while scraping.
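Put together, a typical configuration combining these options might look like the sketch below (the user-agent string and window size are illustrative values only):
require 'selenium-webdriver'
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')             # No visible browser window
options.add_argument('--incognito')            # No persistent history or cookies
options.add_argument('--window-size=1200,600') # Fixed viewport size
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
driver = Selenium::WebDriver.for :chrome, options: options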
Navigating and Interacting with Web Pages
Selenium enables seamless navigation and interaction with web pages, allowing you to retrieve dynamic content, interact with forms, and handle complex website behaviors. Below, we’ll walk through basic commands, advanced interactions, and techniques for handling dynamic content.
Navigating to a URL
To navigate to a webpage, you can use the `get` method. This command instructs the browser to load the specified URL.
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :chrome
driver.get 'https://www.scrapingcourse.com'
puts "Current URL: #{driver.current_url}"
driver.quit
Using Selectors to Find Elements
To interact with elements on the webpage, you need to locate them using selectors. Selenium provides two primary methods for finding elements:
- `find_element`: Finds the first matching element based on the provided selector.
- `find_elements`: Returns an array of all matching elements based on the provided selector.
Selenium supports several types of selectors, such as by ID, class name, tag name, name attribute, XPath, and CSS selectors. Let’s look at these selectors in action:
Find a single element (e.g., by CSS selector):
element = driver.find_element(css: 'h1')
puts element.text
Find multiple elements:
elements = driver.find_elements(css: 'p')
elements.each { |el| puts el.text }
You can also use XPath for more complex selections:
element = driver.find_element(xpath: '//div[@id="main"]')
puts element.text
Practical Example: Retrieving Dynamic Data
Web scraping often involves dealing with dynamic content that changes based on user interactions, JavaScript rendering, or AJAX requests. Here’s an example of how to retrieve dynamic data from a website.
require 'selenium-webdriver'
# Start a new Selenium session
options = Selenium::WebDriver::Chrome::Options.new
driver = Selenium::WebDriver.for :chrome, options: options
# Navigate to the e-commerce products page
driver.get 'https://www.scrapingcourse.com/ecommerce/'
# Wait for the product names to load
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_elements(css: 'h2.product-name').any? }
# Retrieve product names
product_names = driver.find_elements(css: 'h2.product-name')
# Print each product name
product_names.each { |product| puts product.text }
# Quit the driver
driver.quit
This example:
- Automates the process of extracting product names from an e-commerce website.
- Uses the Selenium library to control a Chrome browser.
- Navigates to the desired page and waits for the content to load.
- Finds the specific elements representing product names.
- Stores the names in a list and finally prints each extracted name.
Advanced Element Interactions
While basic actions like navigating pages or finding elements are essential, advanced interactions—such as filling out forms, clicking buttons, hovering over elements, scrolling, and executing JavaScript—expand Selenium’s capabilities significantly.
Filling Out Forms
Filling out forms is a common interaction for automated testing or data input tasks. With Selenium, you can locate form fields, input values, and submit the form with ease.
# Find and fill a form input
search_box = driver.find_element(name: 'q')
search_box.send_keys('Selenium Ruby')
# Submit the form
search_box.submit
Clicking Buttons
Clicking buttons is a straightforward task in Selenium. Buttons can be located using various selectors (e.g., ID, class, name), and you can click any clickable element like buttons or links using the `click` method.
# Find and click a button
submit_button = driver.find_element(id: 'submit')
submit_button.click
Hovering Over Elements
Some websites use hover effects to display additional content, such as dropdowns or tooltips. Selenium’s `ActionBuilder` allows you to simulate mouse movements like hovering.
element = driver.find_element(css: '.menu-item')
driver.action.move_to(element).perform
Scrolling
Sometimes, elements may be off-screen or hidden in scrollable sections. Selenium allows you to scroll to elements or specific parts of the page using JavaScript or actions.
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
Executing JavaScript
Sometimes, interacting with a page via Selenium’s built-in methods isn’t enough. For more complex scenarios, such as triggering custom JavaScript functions or manipulating elements directly, Selenium allows you to execute JavaScript.
driver.execute_script('document.querySelector(".load-more").click();')
Handling Dynamic Content
When scraping dynamic websites, especially those that rely heavily on JavaScript for rendering content (such as single-page applications and AJAX-driven pages), you may encounter situations where the content you’re trying to extract hasn’t fully loaded by the time Selenium attempts to interact with it.
To handle this, Selenium provides two types of waits: implicit waits and explicit waits. These waits help ensure that your scripts interact with elements only when they are ready, which is crucial for avoiding errors and improving the reliability of your automation.
Explicit Waits:
Explicit waits provide more fine-grained control by allowing you to wait for a specific condition to be true before interacting with an element. You can wait for conditions like the visibility of an element, the presence of an element, or other custom conditions.
wait = Selenium::WebDriver::Wait.new(timeout: 10) # 10 seconds timeout
element = wait.until { driver.find_element(css: '.dynamic-element') }
puts element.text
This approach is useful for elements that load asynchronously, allowing Selenium to pause until the content appears.
Implicit Waits:
An implicit wait is a global setting that tells Selenium to wait for a specific amount of time whenever it tries to find an element. If the element is not immediately available, Selenium will keep trying until the timeout is reached. This approach is useful for general use cases where you want to handle minor delays in element loading.
driver.manage.timeouts.implicit_wait = 10 # seconds
element = driver.find_element(css: '.dynamic-element')
puts element.text
Implicit waits are set once and apply to all interactions.
Handling AJAX Content and JavaScript-heavy Websites
Many modern websites load content asynchronously using AJAX (Asynchronous JavaScript and XML), meaning content is loaded dynamically without a full page refresh. This can make scraping challenging because it causes the following issues:
- Elements not yet present: The elements you need may not be available when Selenium first tries to find them.
- Elements partially loaded: Some parts of the webpage may render while others are still being fetched via AJAX.
- Elements that change over time: Content might be updated dynamically based on user interactions or periodic AJAX requests.
When scraping websites that load content via AJAX, there are a few ways to ensure you scrape the right data:
Wait for Elements:
Use explicit waits to ensure the AJAX-loaded elements are present before interacting with them.
wait = Selenium::WebDriver::Wait.new(timeout: 15)
element = wait.until { driver.find_element(css: '.ajax-loaded') }
Monitor Network Activity:
For some complex single-page applications (SPAs) or AJAX-heavy websites, waiting for a specific element may not be enough. In such cases, you can wait for network activity to idle or use JavaScript to ensure the page is done loading.
# Wait until jQuery reports no pending AJAX requests (works only on jQuery-based sites)
wait.until { driver.execute_script('return jQuery.active == 0') }
Scroll or Click to Load More Content:
For infinite scrolling, you can repeatedly scroll to the bottom or click a “load more” button using JavaScript execution or scrolling methods.
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
Common Issues When Scraping JavaScript-Heavy Websites
Modal Windows and Popups
Modal windows and pop-ups, such as cookie consent banners or subscription dialogs, often interfere with scraping workflows.
These interruptions prevent the scraper from interacting with the underlying content. The best approach is to programmatically detect and close these pop-ups using Selenium’s element selectors or handle JavaScript alerts with Selenium’s built-in methods to ensure that these modals don’t block the scraping process.
# Use find_elements so nothing is raised when no popup is present
popup_close_buttons = driver.find_elements(css: '.popup-close-button')
popup_close_buttons.first.click if popup_close_buttons.any? && popup_close_buttons.first.displayed?
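For native JavaScript alerts and confirm dialogs (as opposed to HTML modals), Selenium exposes a dedicated alert API; a minimal sketch:
begin
  alert = driver.switch_to.alert    # Grab the currently open JavaScript alert
  puts "Alert says: #{alert.text}"
  alert.accept                      # Or alert.dismiss to cancel it
rescue Selenium::WebDriver::Error::NoSuchAlertError
  # No alert is open; nothing to close
end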
Hidden or Lazy Loading Elements
Websites often employ lazy-loading techniques where elements, images, or content are loaded only when they are about to appear in the viewport.
This poses a problem for scrapers that attempt to capture the content before it becomes visible.
The solution is to use JavaScript execution to scroll to the relevant element and trigger its loading. By simulating scrolling or navigating to specific sections of the page, you can ensure that all necessary content is loaded before scraping.
element = driver.find_element(:css, '.lazy-loaded-element')
driver.execute_script("arguments[0].scrollIntoView();", element)
Slow Page Loading Times
Slow page loading times on JavaScript-heavy websites can also cause issues, especially if the scraper tries to interact with elements before the page has fully loaded. This can lead to timeout errors or incomplete data retrieval.
To avoid this, increase the timeout values for Selenium’s waits, or use explicit waits to ensure that key elements are fully loaded before attempting to interact with them.
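A minimal sketch of both approaches, assuming a 60-second page load budget and a hypothetical `.main-content` element that signals the page is usable:
# Give slow pages more time to finish loading before Selenium raises an error
driver.manage.timeouts.page_load = 60 # seconds
driver.get 'https://www.scrapingcourse.com/ecommerce/'
# Then wait explicitly for a key element before interacting with the page
wait = Selenium::WebDriver::Wait.new(timeout: 20)
wait.until { driver.find_element(css: '.main-content').displayed? }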
Handling Edge Cases
Web scrapers face unique challenges when scraping complex websites, particularly those with dynamic behaviors like pagination, infinite scrolling, or CAPTCHA protections.
To combat these challenges, we’ll explore how to effectively scrape paginated or infinite scrolling content and discuss strategies for handling CAPTCHA-protected websites using Selenium and JavaScript execution.
Pagination and Infinite Scrolling
Websites that use pagination or infinite scrolling to display content incrementally require special handling to ensure that all the desired data is captured.
Both scenarios can be handled effectively with Selenium and Ruby.
Scraping Paginated Content
For paginated sites, the key is to locate the “Next” button or pagination links and repeatedly click them until the last page is reached. Here’s an example of how to scrape product names from multiple pages:
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :chrome
driver.get 'https://www.scrapingcourse.com/ecommerce/'
# Function to scrape product names from the current page
def scrape_product_names(driver)
product_names = driver.find_elements(css: 'h2.product-name')
product_names.each { |product| puts product.text }
end
# Initial scrape
scrape_product_names(driver)
# Pagination handling
while true
begin
# Locate the "Next" button and click it
next_button = driver.find_element(css: '.next')
next_button.click
# Wait for the new page to load
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_elements(css: 'h2.product-name').any? }
# Scrape product names from the new page
scrape_product_names(driver)
rescue Selenium::WebDriver::Error::NoSuchElementError
# Break the loop if no "Next" button is found
puts "No more pages to scrape."
break
end
end
driver.quit
The pagination code begins by defining a method called `scrape_product_names`, which encapsulates the logic for scraping product names from the current page. An initial scrape is performed before entering the pagination loop.
Within the loop, the code locates the “Next” button using a CSS selector and attempts to click it to navigate to the next page. After clicking, the script waits until the product names are fully loaded on the new page before scraping again.
To handle errors, if the “Next” button is not found—indicating that there are no more pages to scrape—the code catches the `NoSuchElementError` and breaks the loop, thus ensuring a smooth and efficient scraping process.
Scraping with Infinite Scrolling
Infinite scrolling presents a different challenge, as more content is loaded dynamically when the user scrolls down the page.
Selenium can simulate this scrolling behavior using JavaScript execution to trigger the loading of new data. Here’s how:
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :chrome
driver.get 'https://www.scrapingcourse.com/ecommerce/'
# Wait for products to load
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_elements(css: 'h2.product-name').any? }
# Infinite scrolling
previous_height = driver.execute_script("return document.body.scrollHeight")
loop do
# Scroll to the bottom
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait for new products to load
sleep 2
# Get new height and compare with previous height
current_height = driver.execute_script("return document.body.scrollHeight")
break if current_height == previous_height # Exit loop if no new content
previous_height = current_height
end
# Scrape all product names
product_names = driver.find_elements(css: 'h2.product-name')
product_names.each { |product| puts product.text }
driver.quit
This infinite scrolling example begins by retrieving the initial height of the page to check for new content after scrolling. It enters a loop that scrolls to the bottom and waits briefly for new content to load. After each scroll, it compares the current height to the previous height; if they are the same, the loop exits, indicating no new content. Finally, the code scrapes the product names as before, capturing all available data effectively.
Handling CAPTCHAs
CAPTCHA protections are designed to prevent automated tools from interacting with a website, and they can disrupt scraping workflows by blocking access or requiring manual input, making them a major challenge for scrapers.
Selenium can interact with basic elements on a webpage, but it cannot solve CAPTCHAs directly because these challenges are specifically designed to be difficult for bots. CAPTCHA types can vary from simple image-based verifications to more complex interactive challenges. When Selenium encounters a CAPTCHA, it usually results in a stalled process unless the CAPTCHA is solved manually or bypassed.
There are several ways to handle CAPTCHA-protected sites when scraping, though none are perfect. The most common approach is to use external CAPTCHA-solving services, such as Anti-Captcha, 2Captcha, or DeathByCaptcha, which solve CAPTCHAs programmatically. These services work by submitting the CAPTCHA to human solvers or using machine learning models, returning the solution that can be used in the scraping script.
Example of using an external CAPTCHA-solving service (pseudo-code):
# Pseudo-code for solving a CAPTCHA
if driver.find_elements(css: '.captcha').any?
captcha_solution = solve_captcha_using_service(api_key, driver.page_source)
driver.find_element(css: 'input[name="captcha"]').send_keys(captcha_solution)
driver.find_element(css: 'button[type="submit"]').click
end
In this pseudo-code, we check for the presence of a CAPTCHA element and call an external service to retrieve the solution, which we then input into the appropriate field.
While these methods work, they can be clunky and unreliable at scale, so the best way to handle these challenges is to use a web scraping API such as Scrape.do, which avoids or automatically solves CAPTCHAs for you so your bots are never blocked.
Error Handling
Selenium provides mechanisms for dealing with common exceptions that can arise when interacting with dynamic websites. Handling these errors gracefully ensures that your script can recover from issues like missing elements or timeouts without crashing.
The most common exceptions are:
- `NoSuchElementError`: Occurs when Selenium is unable to locate an element on the page.
- `TimeoutError`: Raised when an element is not found or a condition is not met within the specified wait time.
- `StaleElementReferenceError`: Occurs when the reference to an element becomes invalid due to page reloading or dynamic updates.
Here’s how to handle these exceptions in Selenium using Ruby:
require 'selenium-webdriver'
begin
driver = Selenium::WebDriver.for :chrome
driver.get 'https://www.scrapingcourse.com/ecommerce/'
# Attempt to find a non-existing element to demonstrate error handling
driver.find_element(css: '.non-existing-class')
rescue Selenium::WebDriver::Error::NoSuchElementError => e
puts "Element not found: #{e.message}"
rescue Selenium::WebDriver::Error::TimeoutError => e
puts "Request timed out: #{e.message}"
rescue Selenium::WebDriver::Error::StaleElementReferenceError
puts "The element reference is stale, try re-locating the element."
rescue StandardError => e
puts "An error occurred: #{e.message}"
ensure
driver.quit if driver
end
The code includes several important exception handling mechanisms.
- The `NoSuchElementError` is caught when an element cannot be found, allowing you to log a meaningful message to understand the issue better.
- The `TimeoutError` is triggered when a request exceeds the expected loading time, indicating that the page did not load as anticipated.
- The `StaleElementReferenceError` rescue covers the case where a previously located element becomes invalid (for example, after a dynamic page update) and needs to be re-located.
- Additionally, `StandardError` serves as a catch-all for any other exceptions that may arise during execution.
- Lastly, the `ensure` block guarantees that the driver is closed, regardless of whether an exception occurs, ensuring proper resource management and preventing memory leaks.
Best Practices for Scraping with Selenium
To ensure a successful scraping experience, follow these best practices:
1. Rotate User-Agent Strings
Many websites track user-agent strings to detect bots, so randomly changing your user-agent string can help disguise your scraper as a regular browser. You can set a custom User-Agent in your Selenium configuration:
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('user-agent=Your Custom User-Agent')
driver = Selenium::WebDriver.for :chrome, options: options
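To actually rotate rather than pin a single string, you can sample from a small pool on each run; a minimal sketch (the strings below are illustrative and should be kept up to date):
# A small pool of user-agent strings to sample from (illustrative values)
USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
]
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument("user-agent=#{USER_AGENTS.sample}") # Pick one at random per session
driver = Selenium::WebDriver.for :chrome, options: options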
2. Use Headless Mode Selectively
Running Selenium in headless mode (without a visible browser window) can be faster and less resource-intensive. However, many websites can detect headless browsers, so use this mode only when necessary.
3. Respect robots.txt
Always check the site’s `robots.txt` file to understand its scraping policies. Respecting these rules helps avoid legal issues and maintain a good relationship with the website.
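A quick way to inspect those rules programmatically, using only Ruby’s standard library (the URL is just the example site used throughout this article):
require 'net/http'
require 'uri'
# Fetch and print the site's robots.txt before scraping
robots = Net::HTTP.get(URI('https://www.scrapingcourse.com/robots.txt'))
puts robots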
4. Rate Limiting
Avoid overwhelming the server by implementing delays between requests. A simple `sleep` function can help:
sleep(rand(2..5)) # Waits for a random time between 2 and 5 seconds
5. Rotate Proxies
When scraping at scale, consider using rotating proxies to avoid IP blocking. This allows you to distribute requests across multiple IP addresses, reducing the risk of being detected and blocked.
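A minimal sketch of routing Chrome through a proxy; the address is a placeholder, and a rotating-proxy provider would give you the endpoint (and credentials) to plug in here:
options = Selenium::WebDriver::Chrome::Options.new
# Placeholder proxy endpoint; replace with your provider's address
options.add_argument('--proxy-server=http://proxy.example.com:8080')
driver = Selenium::WebDriver.for :chrome, options: options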
Parsing and Storing Data
When using Selenium to scrape websites, especially those that load dynamic content with JavaScript, parsing and storing the extracted data is crucial.
Nokogiri, a powerful Ruby gem for parsing HTML and XML, can be used alongside Selenium to efficiently parse page content after the browser has rendered it. Once the data is extracted, it can be saved in various formats, such as CSV files or databases, for further analysis.
Let’s look at a practical example of extracting product names and prices from an e-commerce page, and then parsing it with Nokogiri:
require 'selenium-webdriver'
require 'nokogiri'
driver = Selenium::WebDriver.for :chrome
driver.get 'https://www.scrapingcourse.com/ecommerce/'
# Use Nokogiri to parse the page source
page_source = driver.page_source
document = Nokogiri::HTML(page_source)
# Extract product names and prices
products = document.css('.woocommerce-LoopProduct-link')
products.each do |product|
name = product.css('.product-name').text
price = product.css('.product-price').text
puts "Product: #{name}, Price: #{price}"
end
driver.quit
In this example, after retrieving the page source using Selenium, we parse it with Nokogiri. We then use CSS selectors to extract product names and prices, printing them to the console.
Saving Data
Once you’ve extracted the desired data, you’ll need to save it for further analysis or use. Common methods include saving data to a CSV file or a database. Here’s how to save the scraped product data to a CSV file:
require 'csv'
# Create or open a CSV file
CSV.open("products.csv", "wb") do |csv|
# Add headers to the CSV file
csv << ["Product Name", "Price"]
# Loop through products and save their names and prices
products.each do |product|
name = product.css('.product-name').text
price = product.css('.product-price').text
csv << [name, price] # Add each product's data as a new row
end
end
In this code snippet, we use the CSV library to create a new CSV file named `products.csv`. We write headers to the file, followed by looping through the extracted products and appending their names and prices as rows in the CSV.
For more complex or larger-scale scraping projects, storing data in a relational database such as SQLite, PostgreSQL, or MySQL can be a better solution. The ActiveRecord or Sequel gem in Ruby can be used to interact with a database, allowing you to store and retrieve the data efficiently.
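As a rough illustration, here is a minimal Sequel-based sketch that writes the same product data to a local SQLite database (it assumes the `sequel` and `sqlite3` gems are installed and reuses the `products` node set from the Nokogiri example above):
require 'sequel'
# Connect to (or create) a local SQLite database file
DB = Sequel.sqlite('products.db')
# Create the products table if it doesn't exist yet
DB.create_table? :products do
  primary_key :id
  String :name
  String :price
end
# Insert each scraped product as a row
products.each do |product|
  DB[:products].insert(
    name: product.css('.product-name').text,
    price: product.css('.product-price').text
  )
end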
Conclusion
Selenium in Ruby provides a powerful and flexible way to automate browser interactions, handle dynamic web content, and scrape data from modern JavaScript-heavy websites.
However, when scraping at scale with Ruby, you’ll need a bit more than Selenium to get the job done and avoid being blocked.
A web scraping API such as Scrape.do can help you scrape dynamic websites without the need for hours of creating bots that might be obsolete in the next update of your target website or its firewall.
While you focus on your core business, Scrape.do handles:
- Automated proxy rotation with a pool of 100M+ IPs,
- Avoiding CAPTCHAs or integrating with solving services,
- Handling TLS fingerprinting,
- Handling header rotation and managing user agents,
- Monitoring and validating responses.
And you’re only charged for successful requests.
Start scraping today with 1000 free credits.