
Web Scraping in NodeJS: Advanced Techniques and Best Practices


Web scraping has moved from being a niche skill to an essential tool in the modern developer’s arsenal. Node.js, with its asynchronous nature and rich ecosystem, provides an ideal platform for building scraping solutions that extract and analyze data from websites efficiently.

This NodeJS web scraping tutorial examines the intricacies of web scraping with Node.js, focusing on advanced techniques and best practices for overcoming common challenges. We’ll explore powerful tools like Axios, Cheerio, and Puppeteer to handle various scraping scenarios, from simple HTML parsing to complex JavaScript-rendered content.

We’ll also shine a spotlight on Scrape.do, a powerful ally in the world of web scraping. Throughout this article, we’ll demonstrate how Scrape.do’s features can streamline your scraping workflows, help you bypass common obstacles, and scale your projects with ease. By the end of this article, you’ll be equipped to build efficient and scalable web scrapers using Node.js. Without further ado, let’s dive right in!

Setting Up the Environment

Before we dive into advanced techniques, let’s ensure our environment is properly configured. First, ensure you have NodeJS and NPM installed. If not, download and install them from the official NodeJS website.

Next, you have to install the following packages and libraries:

  • Axios: Enables efficient HTTP requests for fetching web content.
  • Cheerio: A powerful library for parsing and manipulating HTML content.
  • Puppeteer: A headless browser library for scraping JavaScript-heavy websites.
npm install axios cheerio puppeteer

Making HTTP Requests

Most Node.js web scraping projects involve fetching data from websites using HTTP requests, extracting the needed information, and storing it for later use. We’ll be using Axios for the initial step of fetching data because it’s a promise-based HTTP client for NodeJS that makes it easy to send HTTP requests to REST endpoints and handle their responses.

The two most popular HTTP requests people use with Axios are the GET and POST requests. A GET request is used to request data from a specified resource, while a POST request is used to send data to a server to create or update a resource. Here’s how to make these requests:

const axios = require('axios');

(async () => {
  // Example of a GET request
  try {
    const getResponse = await axios.get('https://jsonplaceholder.typicode.com/posts/1');
    console.log('GET response data:', getResponse.data);
  } catch (error) {
    console.error('Error making GET request:', error.message);
  }

  // Example of a POST request
  try {
    const postData = {
      title: 'foo',
      body: 'bar',
      userId: 1
    };
    const postResponse = await axios.post('https://jsonplaceholder.typicode.com/posts', postData);
    console.log('POST response data:', postResponse.data);
  } catch (error) {
    console.error('Error making POST request:', error.message);
  }
})();

While using a basic GET or POST request with Axios is great, you can take it to the next level with Scrape.do. Scrape.do helps you bypass anti-scraping measures by enhancing your requests to resemble legitimate user traffic, reducing the risk of getting blocked by websites. Here’s how to use Scrape.do to make a simple POST request.

// Import the axios library for making HTTP requests
var axios = require('axios');

// Your Scrape.do API token
var token = "YOUR_TOKEN";

// The target URL you want to scrape, URL-encoded for safety
var targetUrl = encodeURIComponent("https://httpbin.co/anything");

// Configuration object for the axios request
var config = {
    // The HTTP method for the request, in this case, POST
    'method': 'POST',

    // The Scrape.do API endpoint, including your token and the target URL
    'url': `https://api.scrape.do?token=${token}&url=${targetUrl}`,

    // Headers for the request, indicating that the request body is in JSON format
    'headers': {
        'Content-Type': 'application/json'
    },

    // The body of the POST request, converted to a JSON string
    data: JSON.stringify({
        "test-key": "test-value" // Example data to send in the POST request
    }),

    // Setting a timeout for the request (in milliseconds)
    timeout: 10000
};

// Function to handle the request with retries
async function makeRequestWithRetry(config, retries = 3) {
    try {
        // Make the HTTP request using axios with the specified configuration
        const response = await axios(config);
        // Handle the response data from the Scrape.do API
        console.log(response.data);
    } catch (error) {
        // Handle different types of errors
        if (error.response) {
            // Server responded with a status code outside the range of 2xx
            console.error('Response error:', error.response.status);
        } else if (error.request) {
            // No response received from server
            console.error('No response received:', error.request);
        } else {
            // Other errors
            console.error('Error:', error.message);
        }

        // Implement retry logic
        if (retries > 0) {
            console.log(`Retrying... (${retries} attempts left)`);
            await makeRequestWithRetry(config, retries - 1);
        } else {
            console.error('All retry attempts failed.');
        }
    }
}

// Call the function to make the request with retry logic
makeRequestWithRetry(config);

This example covers the basics of using Axios for web scraping, but Axios offers more advanced web scraping techniques, including setting custom headers, sending data in different formats, and handling various response types to accommodate complex scraping scenarios.

As you scrape more data, you’ll encounter various response types such as HTML, JSON, XML, or even binary data. Axios primarily handles JSON responses out of the box, but for other content types, you will need to implement specific parsers based on the content’s structure. For HTML response types, you can use a parser like Cheerio, and we’ll be talking more about that later. For XML, you’ll have to use libraries like xml2js to parse the responses.
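For instance, if an endpoint returns XML, a minimal sketch using xml2js might look like this (the feed URL is a placeholder):

const axios = require('axios');
const xml2js = require('xml2js');

async function fetchAndParseXml() {
  try {
    // Fetch an XML response (placeholder URL)
    const response = await axios.get('https://example.com/feed.xml');

    // Parse the XML string into a plain JavaScript object
    const parsed = await xml2js.parseStringPromise(response.data);
    console.log(parsed);
  } catch (error) {
    console.error('Error fetching or parsing XML:', error.message);
  }
}

fetchAndParseXml();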

There are a couple of errors you’d run into when making HTTP requests:

  • 4xx Client Errors: Indicates issues on the client side (e.g., 404 Not Found, 401 Unauthorized).
  • 5xx Server Errors: Indicates issues on the server side (e.g., 500 Internal Server Error, 503 Service Unavailable).

Because efficient error handling is crucial for reliable web scraping, Axios provides mechanisms to handle these HTTP errors:

  • Try-catch blocks: Enclosing Axios requests within try-catch blocks is essential for gracefully handling potential errors that may occur during the request.
const axios = require('axios');

async function fetchData() {
  try {
    const response = await axios.get('https://api.example.com/data');
    console.log(response.data);
  } catch (error) {
    console.error('Error fetching data:', error);
  }
}

fetchData();
  • Checking Error Codes and Applying Retry Logic with Axios: When handling errors in Axios, the error object often includes a response property, which contains details about the HTTP response, including the status code. Checking error codes allows you to create more sophisticated and resilient web scraping scripts. You can then respond appropriately to different error scenarios, whether it’s retrying on temporary errors, handling authentication issues, or gracefully degrading functionality when resources are unavailable.
// Import the axios library for making HTTP requests
const axios = require('axios');

/**
 * Function to fetch data from a specified URL with retry logic
 * @param {string} url - The URL to fetch data from
 * @param {number} maxRetries - The maximum number of retry attempts (default: 3)
 * @param {number} retryDelay - The delay between retries in milliseconds (default: 1000)
 * @returns {Promise<any>} - The data fetched from the URL
 */
async function fetchDataWithRetry(url, maxRetries = 3, retryDelay = 1000) {
  let retries = 0; // Initialize retry counter

  // Loop until the maximum number of retries is reached
  while (retries < maxRetries) {
    try {
      // Attempt to fetch data from the URL
      const response = await axios.get(url);
      // Return the response data if the request is successful
      return response.data;
    } catch (error) {
      // Check if the error is due to a 503 Service Unavailable response
      if (error.response && error.response.status === 503) {
        console.error(`Retry attempt ${retries + 1} for ${url}`);
        retries++; // Increment the retry counter
        // Wait for the specified delay before retrying (exponential backoff)
        await new Promise(resolve => setTimeout(resolve, retryDelay * retries));
      } else {
        // Rethrow other errors
        throw error;
      }
    }
  }

  // Throw an error if the maximum number of retries is exceeded
  throw new Error('Maximum retries exceeded');
}

// Example usage of the fetchDataWithRetry function
fetchDataWithRetry('https://api.example.com/data')
  .then(data => console.log(data)) // Log the fetched data on success
  .catch(error => console.error('Error:', error)); // Log errors

Parsing HTML content

To parse HTML content efficiently, we can use the library Cheerio, which provides a simple and consistent API for manipulating HTML documents similar to jQuery. Cheerio web scraping is a popular technique, and it works by parsing HTML strings into a format that it can manipulate. You can then use CSS selectors to query and manipulate the elements in the document.

Let’s look at a basic HTML parsing operation using Cheerio and Axios in Node.js.

const cheerio = require('cheerio');
const axios = require('axios');

// Asynchronous function to fetch HTML data and parse it using Cheerio
async function fetchDataAndParse() {
  try {
    // Fetch HTML content from the specified URL
    const response = await axios.get('https://example.com');

    // Load the fetched HTML content into Cheerio
    const $ = cheerio.load(response.data);

    // Extract and log the title of the page
    const title = $('title').text();
    console.log('Page Title:', title);

    // Extract and log the main heading (h1)
    const heading = $('h1').text();
    console.log('Main Heading:', heading);

    // Extract and log the content of the first paragraph with the class 'intro'
    const introText = $('.intro').text();
    console.log('Introduction:', introText);

    // Extract and log all items in a list with the class 'item'
    const items = $('.items .item').map((i, el) => $(el).text()).get();
    console.log('List Items:', items);

    // Extract and log the href attribute of all links (anchor tags)
    const links = $('a').map((i, el) => $(el).attr('href')).get();
    console.log('Links:', links);

  } catch (error) {
    // Handle errors in fetching or parsing HTML
    console.error('Error fetching or parsing HTML:', error);
  }
}

// Invoke the function to fetch and parse data
fetchDataAndParse();

In this example, the fetchDataAndParse function asynchronously requests HTML from a specified URL (https://example.com) using Axios. Once the HTML is retrieved, it is loaded into Cheerio, which provides a jQuery-like interface to navigate and manipulate the document. The code then extracts specific elements such as the page title, the main heading (h1), an introductory paragraph, a list of items, and all links on the page, logging these elements to the console.

Handling Nested Elements

When working with more complex HTML structures, it’s important to know how to handle nested elements, dynamic content, and edge cases effectively. Nested elements are common in HTML documents, especially in structured content like tables, lists, or nested divs. With Cheerio, you can navigate these nested structures easily using CSS selectors.

Consider the following HTML structure:

<div class="container">
  <div class="section">
    <h2>Section Title</h2>
    <ul class="items">
      <li class="item">
        <span class="name">Item 1</span>
        <span class="value">Value 1</span>
      </li>
      <li class="item">
        <span class="name">Item 2</span>
        <span class="value">Value 2</span>
      </li>
    </ul>
  </div>
</div>

To extract the name and value from each li.item:

const cheerio = require('cheerio');

// Assuming `html` contains the HTML string above
const $ = cheerio.load(html);

// Loop through each .item and extract name and value
$('.items .item').each((i, el) => {
  const name = $(el).find('.name').text();
  const value = $(el).find('.value').text();
  console.log(`Item: ${name}, Value: ${value}`);
});

CSS selectors are the go-to method for most web scraping tasks, but they may not always be efficient for navigating deeply nested content. XPath offers a more powerful and flexible alternative, especially for navigating complex page structures. It excels at locating elements buried deep within the HTML, dealing with dynamic content, or targeting elements based on specific conditions.

Although Cheerio does not natively support XPath, you can use the Xpath.js library in conjunction with Cheerio to enable XPath queries.

Handling JavaScript-Rendered Content

Modern websites heavily rely on JavaScript to generate content after the initial page load, and this content is often invisible to traditional scraping methods that only analyze the static HTML. To scrape this kind of content, you’ll need a headless browser like Puppeteer.

Headless browsers are web browsers without a graphical user interface (GUI). They operate in the background, allowing developers to interact with web pages programmatically. While they might seem less intuitive than traditional browsers, they are indispensable for scraping JavaScript-rendered content.

While Cheerio is excellent for static HTML, it’s very limited when dealing with dynamic content generated by JavaScript. For such cases, you have to render the page using Puppeteer and then use Cheerio to parse the resulting HTML.

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function fetchDynamicContent(url) {
  // Launch a headless browser
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the page and wait for content to load
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Get the page content after JavaScript has rendered
  const content = await page.content();

  // Load the rendered HTML into Cheerio
  const $ = cheerio.load(content);

  // Extract data as needed, e.g., a dynamically loaded list
  const dynamicItems = [];
  $('.dynamic-item').each((i, el) => {
    dynamicItems.push($(el).text());
  });

  console.log('Dynamic Items:', dynamicItems);

  // Close the browser
  await browser.close();
}

fetchDynamicContent('https://example.com');

Many modern websites render content in response to user interactions. Puppeteer handles this by allowing you to simulate user behavior, such as clicking buttons, filling forms, and waiting for page elements to load. Here is an example:

const puppeteer = require('puppeteer');

async function interactWithWebPage() {
  try {
    // Launch the browser with specified options
    const browser = await puppeteer.launch({
      headless: false, // Runs the browser in full mode (set to true for headless mode in production)
      defaultViewport: null, // Disables the default viewport size, allowing the browser to start maximized
      args: ['--start-maximized'] // Starts the browser maximized
    });

    // Create a new page (tab) in the browser
    const page = await browser.newPage();

    // Navigate to the login page and wait until there are no network connections for at least 500ms
    await page.goto('https://example.com/login', { waitUntil: 'networkidle0' });

    // Wait for the login form to be visible on the page
    await page.waitForSelector('#login-form');

    // Fill in the login form with the provided username and password
    await page.type('#username', 'your_username');
    await page.type('#password', 'your_password');

    // Click the login button and wait for the navigation to complete (ensures the page is fully loaded)
    await Promise.all([
      page.click('#login-button'),
      page.waitForNavigation({ waitUntil: 'networkidle0' })
    ]);

    console.log('Logged in successfully');

    // Wait for a specific dashboard element to be visible, indicating successful login
    await page.waitForSelector('.dashboard');

    // Click a button that opens a modal dialog
    await page.click('#open-modal-button');

    // Wait for the modal dialog to be visible before interacting with it
    await page.waitForSelector('.modal', { visible: true });

    // Fill in the form fields inside the modal dialog
    await page.type('.modal #name', 'John Doe');
    await page.type('.modal #email', '[email protected]');

    // Select an option ('US') from a dropdown menu in the modal
    await page.select('.modal #country', 'US');

    // Check a checkbox to agree to terms and conditions in the modal
    await page.click('.modal #terms-checkbox');

    // Submit the form in the modal and wait for a specific API response indicating the form was processed
    await Promise.all([
      page.click('.modal #submit-button'),
      page.waitForResponse(response => response.url().includes('/api/submit'))
    ]);

    console.log('Form submitted successfully');

    // Wait for a success message to appear and extract its text content
    const successMessage = await page.waitForSelector('.success-message');
    const messageText = await successMessage.evaluate(el => el.textContent);
    console.log('Success message:', messageText);

    // Perform an infinite scroll until the end of the page is reached
    let previousHeight;
    while (true) {
      previousHeight = await page.evaluate('document.body.scrollHeight'); // Get the current page height
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)'); // Scroll to the bottom of the page
      await new Promise(resolve => setTimeout(resolve, 1000)); // Wait 1 second to allow new content to load (page.waitForTimeout was removed in newer Puppeteer versions)
      const newHeight = await page.evaluate('document.body.scrollHeight'); // Get the new page height
      if (newHeight === previousHeight) {
        break; // If the height didn't change, we reached the end of the scrollable content
      }
    }

    console.log('Reached end of infinite scroll');

    // Extract titles from elements with the class '.item-title' and store them in an array
    const titles = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.item-title')).map(el => el.textContent)
    );

    console.log('Extracted titles:', titles);

    // Take a screenshot of the entire page and save it as 'screenshot.png'
    await page.screenshot({ path: 'screenshot.png', fullPage: true });

    // Close the browser to end the session
    await browser.close();

  } catch (error) {
    // Log any errors that occur during the execution of the script
    console.error('An error occurred:', error);
  }
}

// Call the function to start the interaction with the web page
interactWithWebPage();

The above example automates a sequence of typical user interactions using Puppeteer: it simulates logging in, filling forms, interacting with modal dialogs, and performing actions like infinite scrolling and taking screenshots.

This approach is well suited to scraping JavaScript-rendered content from single-page applications (SPAs). SPAs present unique challenges for web scraping due to their dynamic nature, but because Puppeteer can simulate a real browser with human-like interactions, you can use it to load the SPA, interact with it, and wait for dynamic content to load before scraping.

Most SPAs use client-side routing, where the page doesn’t reload when navigating between different “pages” of the app. Instead, the app dynamically changes content based on the route. With Puppeteer, you can handle this by using page.waitForNavigation() to detect route changes within the SPA.

const puppeteer = require('puppeteer');

async function scrapeRoute(page, route) {
  try {
    await Promise.all([
      page.click(`a[href="${route}"]`),
      page.waitForNavigation({ waitUntil: 'networkidle0', timeout: 5000 }),
    ]);
    await page.waitForSelector('.content-for-route', { timeout: 5000 });
    return await page.evaluate(() => document.querySelector('.content-for-route').textContent);
  } catch (error) {
    console.error(`Error scraping route ${route}:`, error);
    return null;
  }
}

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  try {
    await page.goto('https://example-spa.com', { waitUntil: 'networkidle0', timeout: 10000 });

    const routes = ['/some-route', '/another-route', '/third-route'];
    for (const route of routes) {
      const content = await scrapeRoute(page, route);
      if (content) {
        console.log(`Content from ${route}:`, content);
      }
    }
  } catch (error) {
    console.error('An error occurred:', error);
  } finally {
    await browser.close();
  }
})();

Using Scrape.do is a simpler way of scraping data from SPAs and websites with dynamic content. We simplify this process by managing the complexities of headless browser setup and maintenance, allowing you to focus on data extraction without worrying about the underlying infrastructure.

For example, if there are web pages requiring full browser rendering and dynamic content loading, you can simply add the render=true parameter to your request. Our infrastructure will automatically handle the process, opening a headless browser to fetch the fully loaded page and provide the rendered content.

var axios = require('axios');
var token = "YOUR_TOKEN";
var targetUrl = encodeURIComponent("https://httpbin.co/anything");
var render = "true";
var config = {
    'method': 'GET',
    'url': `https://api.scrape.do?token=${token}&url=${targetUrl}&render=${render}`,
    'headers': {}
};
axios(config)
    .then(function (response) {
        console.log(response.data);
    })
    .catch(function (error) {
        console.log(error);
    });

Managing Data Extraction

Extracting data across multiple pages (multi-page extraction) is a common requirement in web scraping with NodeJS, especially when dealing with paginated content, search results, or listings. Here are a few strategies you can use to tackle this:

  • Before starting with the extraction, you need to identify how the target website handles pagination. Common pagination elements are “Next” and “Previous” buttons, page numbers (e.g., 1, 2, 3, …), and infinite scroll or “Load More” buttons.

  • When pages are numbered, you can directly iterate over the page numbers by modifying the URL or interacting with the pagination controls (see the sketch after this list).

  • If a website uses the “Next” and “Previous” buttons, you can simulate clicking the “Next” button to move through the pages until you reach the last one.

  • For websites that use infinite scroll to load more content as the user scrolls down, you can simulate scrolling to the bottom of the page to load more data as we showed earlier in this article.

  • Some websites might combine different pagination techniques or use more complex systems. In such cases, you might need to combine the strategies mentioned above or employ custom logic to navigate through the pages.
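For example, when a site exposes numbered pages through a query parameter, a simple loop over page numbers is often enough. Here’s a minimal sketch (the URL pattern and selectors are placeholders):

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeAllPages(baseUrl, maxPages) {
  const allItems = [];

  for (let pageNum = 1; pageNum <= maxPages; pageNum++) {
    // Build the URL for the current page (the query parameter name is an assumption)
    const { data } = await axios.get(`${baseUrl}?page=${pageNum}`);
    const $ = cheerio.load(data);

    // Collect items from the current page (placeholder selector)
    const items = $('.listing-item').map((i, el) => $(el).text().trim()).get();

    // Stop early if a page returns no items; we've likely passed the last page
    if (items.length === 0) break;

    allItems.push(...items);
  }

  return allItems;
}

scrapeAllPages('https://example.com/products', 10)
  .then(items => console.log(`Collected ${items.length} items`))
  .catch(error => console.error('Pagination scraping failed:', error.message));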

     

After collecting data from multiple pages, you can combine it into a single dataset (e.g., an array of objects). Depending on your use case, you may want to save this data to a file or database.

const fs = require('fs');

// Assume allData is the array containing all extracted data
fs.writeFileSync('data.json', JSON.stringify(allData, null, 2), 'utf-8');
console.log('Data saved to data.json');

Remember, we’re dealing with dynamic websites that change frequently, leading to potential issues such as data inconsistencies, missing elements, and network errors. Below are strategies for handling each of these challenges:

  • Handling Data Inconsistencies: Data inconsistencies occur when the structure or format of the data changes between pages or over time. To handle this, use CSS selectors or XPath expressions that tolerate slight variations in structure, validate the extracted data or provide default values where necessary, and log instances where inconsistencies are detected.
  • Handling Missing Elements: Sometimes, elements you expect to find on a page may be missing, either because they haven’t loaded yet or because the content is structured differently than anticipated. Before attempting to extract data, use conditional logic to check whether the element exists, using methods like page.$() or page.$eval() wrapped in a try-catch block. You can also use waitForSelector() with a timeout to wait for elements to appear, or design your script to continue running even if certain elements are missing, as shown in the snippets below.
try {
  await page.waitForSelector('.item-title', { timeout: 5000 });
  const title = await page.$eval('.item-title', el => el.textContent);
  console.log('Title:', title);
} catch (error) {
  console.warn('Title not found, skipping...');
}
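If you only need to know whether an element is present at all, a quick check with page.$() (a minimal sketch; the selector is a placeholder) avoids waiting out the timeout:

const titleHandle = await page.$('.item-title'); // Returns null if the element is not present
if (titleHandle) {
  const title = await page.evaluate(el => el.textContent, titleHandle);
  console.log('Title:', title);
} else {
  console.warn('Title not found, skipping...');
}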
  • Handling Network Errors: Since network errors are common when scraping large amounts of data or when dealing with unreliable internet connections or servers, you should implement retry logic with exponential backoff to handle temporary network issues. This involves retrying the request after a delay, increasing the delay with each subsequent retry.

Dealing with Anti-Scraping Measures

Websites often implement anti-scraping measures to protect their content and server resources from automated bots. These measures can range from simple to sophisticated, making it challenging for scrapers to collect data. Some of the most common anti-scraping techniques are CAPTCHA, rate limiting, and bot detection.

  • CAPTCHA: CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a mechanism designed to differentiate between human users and bots. It usually presents challenges that are easy for humans to solve but difficult for automated systems. CAPTCHAs can block automated scripts from proceeding without solving the challenge, making it difficult for bots to access content.
  • Rate Limiting: Rate limiting controls the number of requests a user can make to a server within a specific time frame. It’s used to prevent abuse and ensure fair resource usage. When a website sets a rate limit, the server caps the number of requests from a single IP address or user over a certain period; exceeding the limit results in denied access or delayed responses, often with a “429 Too Many Requests” status code. Rate limiting can slow down or halt scraping activities if the bot makes too many requests too quickly (see the sketch after this list for handling 429 responses).
  • Bot Detection: As the name suggests, bot detection involves identifying and blocking automated bots based on their behavior, request patterns, or browser characteristics. Websites can detect and block requests from known bot User-Agent strings or suspicious patterns, and they often maintain lists of IP addresses associated with bots or data centers and block requests from these IPs. If a bot is detected, the website might block access, serve CAPTCHA challenges, or serve misleading content.
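When you do hit a rate limit, the response often includes a Retry-After header you can honor before retrying. Below is a minimal sketch with Axios for the 429 case described above (the URL is a placeholder, and the header may be absent on some servers):

const axios = require('axios');

async function fetchWithRateLimitHandling(url) {
  try {
    const response = await axios.get(url);
    return response.data;
  } catch (error) {
    if (error.response && error.response.status === 429) {
      // Honor the Retry-After header if present; otherwise fall back to a fixed delay
      const retryAfter = parseInt(error.response.headers['retry-after'], 10) || 5;
      console.warn(`Rate limited. Waiting ${retryAfter} seconds before retrying...`);
      await new Promise(resolve => setTimeout(resolve, retryAfter * 1000));
      return fetchWithRateLimitHandling(url); // Retry once the wait is over
    }
    throw error; // Rethrow anything that isn't a rate-limit error
  }
}

fetchWithRateLimitHandling('https://example.com/data')
  .then(data => console.log(data))
  .catch(error => console.error('Request failed:', error.message));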

While these challenges may seem insurmountable, they can be overcome with careful planning and the use of advanced tools like headless browsers, IP rotation, and behavior emulation. The Puppeteer examples we showed above are very useful for bypassing anti-scraping measures, so now let’s talk about IP rotation and proxies.

IP Rotation and Proxies

IP rotation and the use of proxies are essential techniques in web scraping, especially when you want to avoid detection, rate limiting, or IP blocking. A proxy server acts as an intermediary between your device and the target website. It masks your real IP address with its own, making it appear as if the request originates from a different location.
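As a point of reference, routing a single Axios request through a proxy looks like this (the host, port, and credentials are placeholders):

const axios = require('axios');

axios.get('https://httpbin.org/ip', {
  // Proxy details are placeholders; replace them with a real proxy server
  proxy: {
    protocol: 'http',
    host: 'proxy1.com',
    port: 8080,
    auth: { username: 'user1', password: 'pass1' }
  },
  timeout: 10000
})
  .then(response => console.log('Response via proxy:', response.data))
  .catch(error => console.error('Proxy request failed:', error.message));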

Rotating proxies involves cycling through multiple proxy servers to constantly change your IP address, making it difficult for websites to identify and block your scraping activities. Here’s how to rotate IPs with Puppeteer:

const puppeteer = require('puppeteer');

// List of proxy servers with authentication details
const proxies = [
  { host: 'proxy1.com', port: 8080, username: 'user1', password: 'pass1' },
  { host: 'proxy2.com', port: 8080, username: 'user2', password: 'pass2' },
  { host: 'proxy3.com', port: 8080, username: 'user3', password: 'pass3' }
];

// Function to randomly select a proxy from the list
async function getRandomProxy() {
  // Select a random proxy from the proxies array
  const proxy = proxies[Math.floor(Math.random() * proxies.length)];
  // Return the proxy in the format required by Puppeteer's --proxy-server argument
  return `--proxy-server=${proxy.host}:${proxy.port}`;
}

// Function to verify if the selected proxy is working
async function verifyProxy(page) {
  try {
    // Navigate to a simple test website to verify the proxy's functionality
    await page.goto('https://httpbin.org/ip', { waitUntil: 'networkidle0', timeout: 10000 });

    // Extract the IP address displayed on the test website to confirm connection via proxy
    const ip = await page.evaluate(() => document.body.textContent);
    console.log('Connected IP:', ip); // Log the IP address to confirm proxy is working
    return true; // Return true if the proxy works
  } catch (error) {
    // If there's an error (e.g., network error or connection timeout), log the error message
    console.error('Proxy verification failed:', error.message);
    return false; // Return false if the proxy fails to connect
  }
}

(async () => {
  // Select a random proxy for this browser session
  const proxyArg = await getRandomProxy();

  // Launch a new Puppeteer browser instance with the selected proxy
  const browser = await puppeteer.launch({
    headless: true, // Run the browser in headless mode (no GUI)
    args: [proxyArg] // Pass the proxy server argument to the browser
  });

  // Open a new page (tab) in the browser
  const page = await browser.newPage();

  // Find the selected proxy details to provide authentication
  const proxy = proxies.find(p => `--proxy-server=${p.host}:${p.port}` === proxyArg);

  // Authenticate with the proxy server using the provided username and password
  await page.authenticate({ username: proxy.username, password: proxy.password });

  // Verify if the selected proxy is working by connecting to a test website
  const isProxyWorking = await verifyProxy(page);
  if (!isProxyWorking) {
    console.log('Retrying with a different proxy...'); // Log a message if the proxy failed
    await browser.close(); // Close the browser session
    return; // Exit the script (or implement retry logic here)
  }

  // Navigate to the target page using the verified proxy
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });

  // Extract specific data from the page using page.evaluate
  const extractedData = await page.evaluate(() => {
    // Select and return the text content of specific elements
    const title = document.querySelector('h1').textContent; // Get the content of the first <h1> element
    const description = document.querySelector('.description').textContent; // Get the content of the first element with class 'description'
    return {
      title, // Return the extracted title
      description // Return the extracted description
    };
  });

  // Log the extracted data to the console
  console.log('Extracted Data:', extractedData);

  // Close the browser session
  await browser.close();
})();

An easier way of rotating proxies is using Scrape.do. Scrape.do offers two types of proxies to help you bypass anti-bot measures that analyze your IP address to identify and block automated scraping attempts.

  • Datacenter Proxies: These are affordable options, but their IP addresses might be easily identified and blocked by advanced anti-bot solutions. We offer a pool of over 90,000 rotating datacenter proxies for basic scraping needs.
  • Residential & Mobile Proxies: Residential and mobile proxies, drawn from a pool of over 95,000,000 IP addresses, can enhance scraping success, especially against sophisticated anti-bot measures. These proxies mimic real user traffic, making them less detectable by websites. To activate them, simply include the super=true parameter in your request (a sketch follows below).

Using our geo-targeting feature can also enhance your scraping success, because targeting the same country as the website you’re scraping increases your chances of bypassing restrictions and receiving accurate data.
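Building on the earlier render example, the request below is a minimal sketch that adds the super=true parameter mentioned above to route the call through the residential and mobile proxy pool (the target URL is a placeholder):

var axios = require('axios');
var token = "YOUR_TOKEN";
var targetUrl = encodeURIComponent("https://httpbin.co/anything");
var config = {
    'method': 'GET',
    // super=true activates the residential & mobile proxies described above
    'url': `https://api.scrape.do?token=${token}&url=${targetUrl}&super=true`,
    'headers': {}
};
axios(config)
    .then(function (response) {
        console.log(response.data);
    })
    .catch(function (error) {
        console.log(error);
    });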

Using puppeteer-extra

Puppeteer-Extra is a powerful extension for Puppeteer that allows you to add custom plugins to enhance its functionality. It’s particularly useful for overcoming anti-scraping measures because these plugins provide functionalities such as stealth mode, CAPTCHA solving, and more, making it easier to scrape websites that have implemented anti-bot protections.

To use Puppeteer-extra, you need to install it along with any plugins you want to use. Here’s how to install the stealth and reCAPTCHA plugins:

npm install puppeteer-extra puppeteer-extra-plugin-stealth puppeteer-extra-plugin-recaptcha

Now, let’s use Puppeteer-Extra and the plugins we installed to create a scraping solution capable of handling CAPTCHAs and potentially overcoming other anti-bot measures.

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha');

// Use the stealth plugin to make the browser more human-like
puppeteer.use(StealthPlugin());

// Use the reCAPTCHA plugin with your API key for solving captchas
puppeteer.use(
  RecaptchaPlugin({
    provider: {
      id: '2captcha', // You can also use 'anti-captcha' or other supported services
      token: 'YOUR_2CAPTCHA_API_KEY' // Replace with your API key
    },
    visualFeedback: true // Displays visual feedback when solving captchas
  })
);

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // Navigate to a website that may trigger anti-bot protections
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });

  // Automatically solve reCAPTCHAs if they appear
  const { captchas, solutions, solved, error } = await page.solveRecaptchas();

  if (solved && solved.length > 0) {
    console.log('CAPTCHA(s) solved:', solved);
  } else if (error) {
    console.error('Failed to solve CAPTCHA(s):', error);
  }

  // Perform further actions on the page, like data extraction
  const content = await page.evaluate(() => document.body.innerText);
  console.log(content);

  // Close the browser
  await browser.close();
})();

Optimizing Performance

It’s essential to balance speed and respect for the target server when scraping data. In NodeJS, scraping performance can be optimized by managing concurrency (making multiple requests in parallel), implementing rate limiting (controlling the frequency of requests), and optimizing resource usage. This approach ensures that your scraper is efficient without overwhelming the server, which could lead to your IP being blocked. Let’s look at these in more detail:

Managing Concurrency

Concurrency refers to the number of tasks your scraper performs simultaneously. In web scraping, this usually means making multiple HTTP requests or browser interactions at the same time. With Puppeteer, you can handle concurrency using multiple browser pages or instances.

Implementing Rate Limiting

Rate limiting ensures that your scraper doesn’t overload the target server with too many requests in a short time. It can help you avoid triggering the anti-scraping defenses we mentioned earlier and getting your IP address blocked.

Managing concurrency and implementing rate limiting are critical strategies for optimizing the performance of your web scraping tasks while avoiding detection and bans, so it’s advised to combine concurrency and rate limiting to achieve a balance between efficiency and server load.

const puppeteer = require('puppeteer');

// Utility function to create a delay
// This function returns a promise that resolves after a specified number of milliseconds.
// It is used to pause the execution of the script for a given time.
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

(async () => {
  // Launch a new instance of the Puppeteer browser in headless mode
  const browser = await puppeteer.launch({ headless: true });

  // Function to scrape data from a single page
  // This function takes a URL as input, navigates to the page, extracts data, and then closes the page.
  async function scrapePage(url) {
    const page = await browser.newPage(); // Open a new tab in the browser
    await page.goto(url, { waitUntil: 'networkidle2' }); // Navigate to the URL and wait for the network to be idle
    // Use page.evaluate() to execute code in the browser context and extract the page title
    const data = await page.evaluate(() => document.title);
    await page.close(); // Close the tab to free up resources
    return data; // Return the extracted data
  }

  // List of URLs to scrape
  const urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    'https://example.com/page4',
    'https://example.com/page5'
  ];

  // Concurrency and rate limiting parameters
  const maxConcurrency = 2; // Maximum number of pages to scrape concurrently
  const delayBetweenRequests = 2000; // 2 seconds delay between batches of concurrent requests

  // Array to store the results of the scraping
  const results = [];

  // Loop through the list of URLs in chunks, processing up to 'maxConcurrency' URLs at a time
  for (let i = 0; i < urls.length; i += maxConcurrency) {
    const chunk = urls.slice(i, i + maxConcurrency); // Select a chunk of URLs to process concurrently
    // Use Promise.all to scrape the chunk of URLs in parallel
    const dataChunk = await Promise.all(chunk.map(async (url) => {
      const data = await scrapePage(url); // Scrape the data from the current URL
      console.log(`Scraped: ${url}`); // Log the URL that was scraped
      await delay(delayBetweenRequests); // Apply a delay before proceeding to the next batch
      return data; // Return the scraped data
    }));
    // Combine the data from the current chunk with the overall results
    results.push(...dataChunk);
  }

  // Log the final array of scraped data
  console.log('Scraped Data:', results);

  // Close the Puppeteer browser instance
  await browser.close();
})();

Optimizing Resource Usage

When scraping websites, it’s important to manage memory and CPU usage effectively, especially if you’re running multiple scraping tasks concurrently. Let’s look at some web scraping best practices for reducing resource consumption:

  • Limiting Headless Browser Instances: Puppeteer browser instances are resource-intensive, and running too many simultaneously can overload your system. To prevent this, limit the number of concurrent browser instances. You can achieve this by batching tasks, controlling concurrency levels, or using tools like promise-pool to manage asynchronous operations efficiently.
  • Disable Unnecessary Features: To optimize performance and reduce resource consumption, consider disabling browser features you don’t need, like images, fonts, and some CSS. Puppeteer’s request interception capabilities allow you to block specific content types by intercepting network requests and selectively allowing or blocking them, as shown in the sketch after this list.
  • Optimize Scraping Logic: Suboptimal scraping logic can significantly increase CPU usage due to redundant DOM operations or excessive JavaScript execution. To optimize performance, refine your data extraction code and use precise selectors to minimize DOM queries.
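As referenced above, here is a minimal sketch of blocking images, fonts, and stylesheets with Puppeteer’s request interception (the target URL is a placeholder):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Enable request interception so we can decide what gets loaded
  await page.setRequestInterception(true);

  page.on('request', request => {
    // Skip heavy resources that aren't needed for data extraction
    if (['image', 'font', 'stylesheet'].includes(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });

  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  const title = await page.evaluate(() => document.title);
  console.log('Page title:', title);

  await browser.close();
})();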

Data Storage and Processing

Once you’ve successfully extracted data from your target websites, the next crucial step is to store it for further analysis and utilization. Two of the most popular database options for this task are MongoDB and PostgreSQL.

MongoDB: A Flexible Choice

MongoDB, a NoSQL database, excels at handling unstructured or semi-structured data, making it a natural fit for scraped information, often characterized by its irregular and dynamic nature. Its document-oriented structure allows for easy storage of nested data and rapid insertion of records. Here’s how you can scrape data using Puppeteer and store it in MongoDB:

const puppeteer = require('puppeteer');
const { MongoClient } = require('mongodb');

(async () => {
  // MongoDB connection details
  const url = 'mongodb://localhost:27017';
  const dbName = 'scrapedDataDB';
  const client = new MongoClient(url); // Options like useNewUrlParser and useUnifiedTopology are no longer needed in modern MongoDB drivers

  try {
    // Connect to MongoDB
    await client.connect();
    console.log('Connected to MongoDB');
    const db = client.db(dbName);
    const collection = db.collection('scrapedData');

    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    // Extract data from the page
    const data = await page.evaluate(() => {
      return {
        title: document.querySelector('h1').innerText,
        description: document.querySelector('.description').innerText
      };
    });

    // Insert the data into MongoDB
    const result = await collection.insertOne(data);
    console.log('Data inserted into MongoDB:', result.insertedId);

    // Close Puppeteer and MongoDB connection
    await browser.close();
  } catch (error) {
    console.error('Error:', error);
  } finally {
    await client.close();
  }
})();

PostgreSQL: For Structure and Scalability

PostgreSQL, a relational database, provides a structured approach to data storage. It’s ideal for data that exhibits clear relationships and requires complex queries. While it might involve more upfront schema design, it offers robust data integrity and scalability in return.

const puppeteer = require('puppeteer');
const { Client } = require('pg');

(async () => {
  // PostgreSQL connection details
  const client = new Client({
    user: 'your_username',
    host: 'localhost',
    database: 'scrapedDataDB',
    password: 'your_password',
    port: 5432
  });

  try {
    // Connect to PostgreSQL
    await client.connect();
    console.log('Connected to PostgreSQL');

    // Launch Puppeteer
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    // Extract data from the page
    const data = await page.evaluate(() => {
      return {
        title: document.querySelector('h1').innerText,
        description: document.querySelector('.description').innerText
      };
    });

    // Insert the data into PostgreSQL
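    // Note: this assumes a scraped_data table with title and description columns already exists in the database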
    const insertQuery = 'INSERT INTO scraped_data (title, description) VALUES ($1, $2) RETURNING id';
    const values = [data.title, data.description];
    const res = await client.query(insertQuery, values);
    console.log('Data inserted into PostgreSQL with ID:', res.rows[0].id);

    // Close Puppeteer and PostgreSQL connection
    await browser.close();
  } catch (error) {
    console.error('Error:', error);
  } finally {
    await client.end();
  }
})();

Data Cleaning and Exporting

Another crucial step in data processing for analysis is data cleaning. Data cleaning involves identifying and rectifying inconsistencies, missing values, and other errors within the raw data to ensure its accuracy and usability for analysis. Some common data cleaning techniques include:

  • Handling Missing Values: Missing data can compromise analysis accuracy. To address this, replace missing values with appropriate placeholders like ‘NULL’, ‘0’, or ‘Unknown’, or eliminate records with extensive missing information.
  • Standardizing Data Types: Data inconsistency, such as varying date formats or capitalization, hinders data analysis and aggregation. Standardizing data formats, like converting dates to a uniform style or applying consistent capitalization, is essential for accurate insights into scraped data.
  • Removing Noise: Extraneous data, such as HTML tags, special characters, or irrelevant text, can create unnecessary noise and obscure valuable insights. To streamline analysis, eliminate these elements using regular expressions or specialized text processing tools, as sketched below.
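As referenced above, here is a minimal cleaning pass over scraped records (the field names and the rawRecords array are illustrative):

// Strip leftover HTML tags, collapse whitespace, and standardize fields
function cleanRecord(record) {
  const stripNoise = value =>
    typeof value === 'string'
      ? value.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim()
      : value;

  return {
    title: stripNoise(record.title) || 'Unknown',                  // placeholder for missing values
    price: record.price ? stripNoise(record.price) : null,         // keep missing prices explicit
    date: record.date ? new Date(record.date).toISOString() : null // standardize date format
  };
}

const cleaned = rawRecords.map(cleanRecord); // rawRecords is your scraped dataset
console.log(cleaned);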

After scraping and cleaning data, you often need to export it into various formats for further analysis, sharing, or integration with other systems. You can use NodeJS to export data to common formats such as CSV, JSON, and Excel.

  • CSV: CSV is a simple and widely supported format suitable for storing tabular data. It represents data as a series of comma-separated values, one row per record. To export data to CSV, you first need to install the fast-csv library.
npm install fast-csv

Next, you can use the following code to export the scraped and cleaned data.

const fs = require('fs');
const { writeToStream } = require('fast-csv');

// Sample data to export
const data = [
  { id: 1, name: 'John Doe', email: '[email protected]' },
  { id: 2, name: 'Jane Smith', email: '[email protected]' },
  { id: 3, name: 'Jake Black', email: '[email protected]' }
];

// Create a writable stream to write data to CSV
const ws = fs.createWriteStream('output.csv');

// Use fast-csv to format and write data to the CSV file
writeToStream(ws, data, { headers: true })
  .on('finish', () => {
    console.log('Data successfully exported to output.csv');
  });
  • JSON: JSON (JavaScript Object Notation) is a lightweight data-interchange format that’s easy to read and write, making it popular for APIs and configuration files. To export data to JSON, you just have to do this:
const fs = require('fs');

// Sample data to export
const data = [
  { id: 1, name: 'John Doe', email: '[email protected]' },
  { id: 2, name: 'Jane Smith', email: '[email protected]' },
  { id: 3, name: 'Jake Black', email: '[email protected]' }
];

// Convert the data to a JSON string
const jsonData = JSON.stringify(data, null, 2);

// Write the JSON data to a file
fs.writeFile('output.json', jsonData, (err) => {
  if (err) throw err;
  console.log('Data successfully exported to output.json');
});
  • Excel: Excel files are commonly used in business environments. You can export data to Excel using the xlsx library, which creates Excel files in various formats, such as .xlsx.
const XLSX = require('xlsx');

// Sample data to export
const data = [
  { id: 1, name: 'John Doe', email: '[email protected]' },
  { id: 2, name: 'Jane Smith', email: '[email protected]' },
  { id: 3, name: 'Jake Black', email: '[email protected]' }
];

// Convert the data to a worksheet
const worksheet = XLSX.utils.json_to_sheet(data);

// Create a new workbook and add the worksheet to it
const workbook = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(workbook, worksheet, 'Sheet1');

// Write the workbook to an Excel file
XLSX.writeFile(workbook, 'output.xlsx');

console.log('Data successfully exported to output.xlsx');

Legal and Ethical Considerations

Web scraping, while a powerful tool, operates within a legal framework, so you need to understand the boundaries to avoid legal repercussions. This means having a solid knowledge of things like copyright laws, terms of service, and privacy regulations.

Copyright Laws

Copyright law safeguards original works, including digital content. Most website content, from text and images to databases, is typically copyrighted. Even though it is publicly accessible, copying and distributing this content without permission can expose you to infringement claims. Fair use might apply in certain cases, such as research or education, but it’s not a guaranteed defense.

Terms of Service

Websites typically impose terms of service (ToS) that users must agree to before accessing their content. These agreements often explicitly prohibit automated access, such as web scraping. Violating a website’s ToS can lead to legal repercussions, including breach of contract lawsuits. Courts have upheld these claims, especially when scraping negatively impacts the website’s operations.

Privacy Laws

Scraping personal data, such as names, emails, or other identifying information, carries significant legal risks. Regulations like the GDPR and CCPA protect individual privacy and impose strict rules on data handling. To avoid legal issues, organizations must comply with these laws, obtain necessary consent, and respect individuals’ rights to data privacy.

Ethical Best Practices for Web Scraping

  • Respect Privacy: Avoid scraping personal information unless you have explicit consent or a compelling legal justification. Always adhere to data protection laws and consider anonymizing data when feasible.
  • Identify Yourself: Include a user agent string that identifies your bot, along with your contact information, so site administrators can reach out to you if they have concerns (a short sketch follows this list).
  • Scrape Only Publicly Available Data: Focus on scraping publicly accessible data that does not require you to circumvent access controls or terms of service.
  • Avoid Unnecessary Server Load: To avoid overwhelming target websites and improve efficiency, implement rate limiting in your scraping scripts. Space out requests over time instead of sending them in bursts. Additionally, cache scraped data to reduce redundant requests and enhance performance.
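For the identification point above, a descriptive User-Agent can be set on each request. A minimal sketch with Axios (the bot name, URL, and contact address are placeholders):

const axios = require('axios');

axios.get('https://example.com', {
  headers: {
    // Identify the bot and give site administrators a way to reach you
    'User-Agent': 'ExampleScraperBot/1.0 (+https://example.com/bot; contact@example.com)'
  }
})
  .then(response => console.log('Fetched with status:', response.status))
  .catch(error => console.error('Request failed:', error.message));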

If you find yourself facing legal challenges, such as a cease-and-desist letter, it’s crucial to handle the situation carefully. Ignoring a cease-and-desist letter can escalate the situation, potentially leading to a lawsuit or other legal action, so acknowledge receipt of the letter promptly, even if you do not immediately agree to their demands.

Next, consult a lawyer specializing in intellectual property, internet law, or data privacy. Not all cease-and-desist letters are legally enforceable, and some may overstate the legal grounds for the demand. Work with your lawyer to assess the validity of the claims, and consider whether you have violated any terms of service, copyright laws, or other legal obligations.

Depending on the situation, there may be various ways to resolve the issue, from negotiating with the website owner to ceasing the scraping activities. If the claims are valid and the risk is significant, consider ceasing the scraping activity and removing any collected data. You may be able to negotiate terms, such as obtaining a license to access the data legally. However, if the letter lacks a strong legal basis, your lawyer might advise responding with a counterargument or seeking a declaratory judgment.

Real-World Examples

Let’s assume we want to scrape data from an e-commerce site, targeting product names, prices, and ratings. The ideal approach is to use Cheerio for static content and Puppeteer for dynamic content, so we need a script that first attempts to scrape static content with Cheerio and switches to Puppeteer if the request fails or the content turns out to be dynamically loaded.

Based on what we’ve learned so far, here’s how that would look.

const axios = require('axios');
const cheerio = require('cheerio');
const puppeteer = require('puppeteer');

// URL of the e-commerce site to scrape
const url = 'https://example-ecommerce.com/products';

// Function to scrape using Cheerio (for static content)
async function scrapeWithCheerio(url) {
  try {
    const { data } = await axios.get(url);
    const $ = cheerio.load(data);
    const products = [];

    $('.product-item').each((index, element) => {
      const name = $(element).find('.product-name').text().trim();
      const price = $(element).find('.product-price').text().trim();
      const rating = $(element).find('.product-rating').text().trim();

      products.push({ name, price, rating });
    });

    // If nothing was extracted, the products are probably rendered client-side
    if (products.length === 0) {
      console.log('No static content found. Switching to Puppeteer for dynamic content...');
      await scrapeWithPuppeteer(url);
      return;
    }

    console.log('Products scraped using Cheerio:', products);
  } catch (error) {
    console.error('Error scraping with Cheerio:', error);
    console.log('Switching to Puppeteer for dynamic content...');
    await scrapeWithPuppeteer(url);
  }
}

// Function to scrape using Puppeteer (for dynamic content)
async function scrapeWithPuppeteer(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  let previousHeight;

  // Scroll until no new content loads (infinite scroll handling)
  while (true) {
    previousHeight = await page.evaluate('document.body.scrollHeight');
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    await new Promise(resolve => setTimeout(resolve, 2000)); // Wait for new content to load
    const newHeight = await page.evaluate('document.body.scrollHeight');
    if (newHeight === previousHeight) break; // Exit if no new content is loaded
  }

  // Extract all products once the full list has been rendered (avoids collecting duplicates on each scroll pass)
  const products = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.product-item')).map(product => ({
      name: product.querySelector('.product-name').textContent.trim(),
      price: product.querySelector('.product-price').textContent.trim(),
      rating: product.querySelector('.product-rating').textContent.trim(),
    }));
  });

  console.log('Products scraped using Puppeteer:', products);

  await browser.close();
}

// Main function to start the scraping process
async function startScraping() {
  console.log('Starting scraping process...');
  await scrapeWithCheerio(url);
}

startScraping();

Challenges and Considerations

  • Switching Between Cheerio and Puppeteer: The script intelligently switches between static and dynamic scraping methods, ensuring robustness.
  • Performance: Since Puppeteer is more resource-intensive than Cheerio, the script optimizes performance by starting with Cheerio and only using Puppeteer if necessary.
  • Dynamic Content Handling: The Puppeteer-based scraper includes logic for handling infinite scroll, ensuring that all content is scraped.

This example is not ready for immediate use because you have to edit it to suit your specific project, but it sets a solid foundation for extracting product information from e-commerce platforms. Remember to always prioritize ethical scraping by adhering to website terms of service and avoiding excessive server load.

Conclusion

So far, we’ve covered web scraping in NodeJS using tools like Cheerio and Puppeteer. We delved into scraping both static and dynamic content, handling challenges such as pagination, infinite scroll, and anti-scraping measures like CAPTCHAs. Cheerio and Puppeteer provide elegant solutions for gathering data from the web, but these solutions can be complex and resource-intensive to implement effectively.

This is where Scrape.do offers significant advantages. Scrape.do simplifies the web scraping process by providing an API that handles many of the common challenges automatically, letting you focus on the data, not the hassle. Here’s how Scrape.do simplifies your scraping experience:

  • Built-in IP rotation: No need to worry about getting blocked! Scrape.do automatically rotates IP addresses, reducing the risk of detection.
  • Automatic CAPTCHA solving: Say goodbye to manual CAPTCHA deciphering. Scrape.do handles CAPTCHAs for you, saving you valuable time and effort.
  • Effortless JavaScript rendering: Access data from complex, JavaScript-heavy websites without the need to manage headless browsers directly.

Moreover, Scrape.do’s ability to handle complex websites and large-scale scraping tasks with minimal setup makes it an invaluable tool for developers. It eliminates the need for extensive custom scripting and infrastructure, allowing for quicker, more efficient data extraction.

So if you want to streamline your scraping operations, reduce the overhead of managing proxies and anti-bot measures, and gain reliable access to the data you need for analysis, all with just a few API calls, you should get started with Scrape.do for free now.