
Advanced Web Scraping Techniques with Puppeteer and NodeJS


Web scraping allows you to gather data from websites efficiently, whether for extracting product information, monitoring page changes, or visualizing time-based data. In this guide, we leverage Puppeteer, a powerful headless browser automation tool, together with Node.js to manage asynchronous tasks and enhance web scraping capabilities. We’ll also integrate Scrape.do to optimize our scraping process with advanced features like proxy management, session handling, and bot detection avoidance.

NodeJS and Puppeteer

Puppeteer is an open-source Node.js library that provides a high-level API for controlling Chrome or Chromium over the DevTools Protocol. It enables developers to automate browsing tasks, such as clicking buttons, filling out forms, and capturing screenshots, while navigating websites as a real user would. By simulating real-user interactions, Puppeteer is particularly effective at scraping dynamic content generated by JavaScript, which traditional scraping tools may fail to capture.

Node.js, also called NodeJS, is a popular JavaScript runtime whose event-driven, non-blocking architecture makes it well suited to managing multiple web scraping tasks simultaneously. Together, Puppeteer and Node.js provide a robust platform for building scalable and efficient web scrapers capable of handling challenges such as dynamic content loading, CAPTCHA challenges, and bot detection.
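
As a quick illustration of that workflow, here is a minimal Puppeteer script that opens a page and captures a screenshot; the target URL is only a placeholder, and the sections below build toward more involved versions of the same pattern:


const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser instance
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Navigate like a real user and wait for network activity to settle
  await page.goto('https://example.com/', { waitUntil: 'networkidle2' });

  // Take a screenshot as a quick check that the page rendered as expected
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();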

Environment Setup

Dependencies

To get started, you’ll need Node.js and Puppeteer installed; we also use Axios later to call the Scrape.do API. Make sure you are using stable, recent versions for the web scraping tasks that follow.


npm install puppeteer axios

Configuration

Set up a basic Puppeteer project with Node.js. First, create a package.json file by running the following command:


npm init -y

Then, configure your package.json to include the necessary dependencies:


{

  "name": "web-scraper",

  "version": "1.0.0",

  "description": "Advanced web scraping with Puppeteer, NodeJS, and Scrape.do",

  "main": "index.js",

  "dependencies": {

    "puppeteer": "^19.9.2",

    "axios": "^1.5.0",

  }

}

Efficient management of cookies and sessions is essential in web scraping to maintain continuity and avoid repetitive logins or verification checks. When handling form data that appears in a modal triggered by a button, it is crucial that the session state is preserved throughout the scraping process. To achieve this, we will save cookies to a file, check for existing cookies on startup, and reuse them as needed. This maintains the login state across multiple requests and reduces the likelihood of being detected and blocked by the target website.

Cookies store important information, such as login credentials, user preferences, and session IDs, which are vital for maintaining the login state and session data during scraping. Without proper session management, a scraper may trigger repeated login attempts or CAPTCHAs, increasing the risk of detection and blocking.

Session Management

To handle cookies and sessions effectively, we will demonstrate how to save cookies to a file and reload them in subsequent scraping sessions. This approach maintains the login state and minimizes the risk of being flagged as a bot. Additionally, we will use Scrape.do to manage proxies, solve blocking issues, and implement retry mechanisms that enable the scraper to handle CAPTCHAs, detect and bypass rate limits, and manage retries efficiently.

Here is an example of how to manage sessions and cookies to maintain login states across multiple scraping runs:


const puppeteer = require('puppeteer');

const axios = require('axios');

const fs = require('fs');

const path = require('path');

const os = require('os');

const SCRAPE_DO_TOKEN = '--your-token--'; // Replace with your Scrape.do API token

const targetUrl = encodeURIComponent('https://example.com/');

const blockedResourceTypes = [

    'beacon',

    'csp_report',

    'font',

    'image',

    'imageset',

    'media',

    'object',

    'texttrack',

    'stylesheet',

];

// Custom delay function

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// JSON-like configuration for form fields

const formConfig = [

    { selector: 'input[name="name"]', value: 'John Doe', type: 'text' },

    { selector: 'input[name="email"]', value: '[email protected]', type: 'email' },

    { selector: 'textarea[name="message"]', value: 'Hello, this is a test!', type: 'textarea' },

];

(async () => {

  try {

    // Step 1: Use Scrape.do API to fetch page content

    const response = await axios.get(`https://api.scrape.do?token=${SCRAPE_DO_TOKEN}&url=${targetUrl}`, {

      timeout: 30000 // Set timeout for the request

    });

    console.log('Scrape.do API response received.');

    // Step 2: Launch Puppeteer to process the fetched content

    const browser = await puppeteer.launch({

      headless: false,

      ignoreHTTPSErrors: true, // Ignore SSL errors for any certificates

    });

    const page = await browser.newPage();

    // Block unnecessary resources

    await page.setRequestInterception(true);

    page.on('request', request => {

      if (blockedResourceTypes.includes(request.resourceType())) {

        console.log(`Blocked type: ${request.resourceType()} url: ${request.url()}`);

        request.abort();

      } else {

        request.continue();

      }

    });

    // Load cookies from a file if available

    const cookiePath = 'cookies.json';

    let cookies = [];

    if (fs.existsSync(cookiePath)) {

      cookies = JSON.parse(fs.readFileSync(cookiePath, 'utf-8'));

      console.log('Loaded cookies:', cookies);

      await page.setCookie(...cookies);

    }

    // Step 3: Set the content obtained from Scrape.do in Puppeteer

    await page.setContent(response.data);

    console.log('Page content loaded into Puppeteer.');

    // Step 4: Click the button to open the modal

    await page.waitForSelector('#open-modal'); // Wait for the button to appear

    await page.click('#open-modal'); // Click the button to open the modal

    // Wait for the modal to become visible

    await page.waitForSelector('.modal-content', { visible: true });

    console.log('Modal appeared, checking if the modal has a form...');

    // Check if the modal contains a form

    const isFormPresent = await page.evaluate(() => {

        const modal = document.querySelector('.modal-content');

        return modal && modal.querySelector('form#contact-form') !== null; // Check if the form exists in the modal

    });

    if (!isFormPresent) {

        console.error('No form found in the modal.');

        await browser.close();

        return; // Exit the script if the form is not present

    }

    console.log('Form is present in the modal.');

    // Fill in the form fields dynamically using the configuration

    for (const field of formConfig) {

        console.log(`Filling field: ${field.selector} with value: ${field.value}`);

        await delay(1000); // Wait for 1 second before typing

        await page.type(field.selector, field.value);

    }

    // Wait for the file chooser, then trigger the file input click and upload a file

    const [fileChooser] = await Promise.all([

        page.waitForFileChooser(),

        page.click('input[type="file"]'), // This will trigger the file input

    ]);

    // Select the file to upload from ~/Desktop/test.jpg

    const filePath = path.join(os.homedir(), 'Desktop', 'test.jpg'); // Resolve the path to ~/Desktop/test.jpg

    await fileChooser.accept([filePath]);

    // Click the submit button

    await delay(1000);

    await page.click('button#submit-form');

    // Wait for the confirmation message to appear

    await page.waitForSelector('.form-submission-confirmation', { visible: true });

    // Extract the form data

    const formName = await page.$eval('form#contact-form', form => form.getAttribute('name'));

    const submissionUrl = page.url();

    // Extract the values of the fields dynamically from the configuration

    const extractedFormData = {};

    for (const field of formConfig) {

        extractedFormData[field.selector] = await page.$eval(field.selector, el => el.value);

    }

    // Log the data to the console

    console.log({

        formName,

        submissionUrl,

        extractedFormData,

    });

    // Step 5: Save cookies after the form submission

    cookies = await page.cookies();

    fs.writeFileSync(cookiePath, JSON.stringify(cookies));

    console.log('Cookies saved to file.');

    await browser.close();

  } catch (error) {

    console.error('Error during scraping:', error);

    // Retry logic using Scrape.do API retry feature

    try {

      const retryResponse = await axios.post('https://api.scrape.do/retry', { token: SCRAPE_DO_TOKEN });

      console.log('Retry triggered with Scrape.do', retryResponse.data);

    } catch (retryError) {

      console.error('Retry error:', retryError);

    }

  }

})();

Expected Output

When the script runs successfully, you should see output similar to the following in your terminal:


Scrape.do API response received.

Page content loaded into Puppeteer.

Modal appeared, checking if the modal has a form...

Form is present in the modal.

Filling field: input[name="name"] with value: John Doe

Filling field: input[name="email"] with value: john.doe@example.com

Filling field: textarea[name="message"] with value: Hello, this is a test!

{

  formName: null,

  submissionUrl: 'about:blank',

  extractedFormData: {

    'input[name="name"]': 'John Doe',

    'input[name="email"]': '[email protected]',

    'textarea[name="message"]': 'Hello, this is a test!'

  }

}

Cookies saved to file.

Potential Challenges During the Process

When using Puppeteer and Node.js for complex web scraping tasks, some challenges may arise. However, with Scrape.do in API Mode or Proxy Mode, and by leveraging its additional features, it is possible to manage the entire process efficiently.

Increase the Timeout Value

Web scraping often requires handling dynamic content or slow network responses. Increase the timeout value in your Puppeteer and Scrape.do configuration to give more time for requests to complete.


const scrapeDoUrl = `https://api.scrape.do?token=${SCRAPE_DO_TOKEN}&url=https://66d766b074bd9da7360c1769--incandescent-gnome-f2685a.netlify.app&keep_headers=true&timeout=30000`; // 30 seconds timeout

(async () => {

  const browser = await puppeteer.launch({ headless: false, ignoreHTTPSErrors: true });

  const page = await browser.newPage();

  try {

    // Increase Puppeteer's timeout

    await page.goto(scrapeDoUrl, { waitUntil: 'networkidle2', timeout: 60000 }); // 60 seconds timeout

    console.log('Page loaded successfully.');

  } catch (error) {

    console.error('Error during scraping:', error);

    // Retry logic using Scrape.do API retry feature

    try {

      const retryResponse = await axios.post('https://api.scrape.do/retry', { token: SCRAPE_DO_TOKEN });

      console.log('Retry response:', retryResponse.data);

    } catch (retryError) {

      console.error('Retry error:', retryError);

    }

  } finally {

    await browser.close();

  }

})();

Add Exponential Backoff for Retry Logic

Incorporate an exponential backoff strategy to handle retries more efficiently, allowing for progressively increased delays between attempts.


async function retryWithBackoff(fn, maxRetries = 5, delay = 1000) {

  for (let i = 0; i < maxRetries; i++) {

    try {

      return await fn();

    } catch (error) {

      console.error(`Attempt ${i + 1} failed:`, error.message);

      if (i === maxRetries - 1) throw error;

      await new Promise(res => setTimeout(res, delay * 2 ** i)); // Exponential backoff: the delay doubles on each retry

    }

  }

}

retryWithBackoff(() => page.goto(scrapeDoUrl, { waitUntil: 'networkidle2', timeout: 60000 }));

Use Scrape.do’s solve_blocking Parameter

Enable the solve_blocking feature in Scrape.do to solve potential blocking issues that could be causing timeouts.


const scrapeDoUrl = `https://api.scrape.do?token=${SCRAPE_DO_TOKEN}&url=https://66d766b074bd9da7360c1769--incandescent-gnome-f2685a.netlify.app&keep_headers=true&timeout=30000&solve_blocking=true`;

Optimizing Performance Through Concurrency and Parallelism

Running multiple Puppeteer instances concurrently can significantly speed up the scraping process. However, this requires careful management of system resources and network connections. For example, you can launch multiple browser instances and control them using a queue system to prevent overwhelming the target website and reduce the risk of IP blocking.
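
You can also keep things lighter by reusing a single browser and opening several pages at once instead of launching separate browser instances. The sketch below assumes a hypothetical list of target URLs and an arbitrary batch size, and processes the list in small concurrent batches:


const puppeteer = require('puppeteer');

// Hypothetical list of pages to scrape and an arbitrary concurrency limit
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3',
];
const CONCURRENCY = 2;

(async () => {
  const browser = await puppeteer.launch({ headless: true });

  // Work through the URLs in batches so the target site never receives
  // more than CONCURRENCY simultaneous requests
  for (let i = 0; i < urls.length; i += CONCURRENCY) {
    const batch = urls.slice(i, i + CONCURRENCY);
    await Promise.all(batch.map(async url => {
      const page = await browser.newPage();
      try {
        await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });
        console.log(`Scraped ${url}: ${await page.title()}`);
      } catch (error) {
        console.error(`Failed to scrape ${url}:`, error.message);
      } finally {
        await page.close();
      }
    }));
  }

  await browser.close();
})();


For larger workloads, a dedicated queue library such as p-queue or puppeteer-cluster offers finer control over retries and rate limits, but the batching pattern above is often enough.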

Handle Network Errors in Puppeteer

Add specific error handling for network-related errors to handle potential failures during the scraping process.


page.on('requestfailed', request => {

  console.error(`Request failed: ${request.url()} - ${request.failure().errorText}`);

});

Verify Proxy and API Key

Ensure your Scrape.do API key is correctly set up and that the proxy server configuration is accurate:

  • Double-check that SCRAPE_DO_TOKEN is correctly initialized.
  • Verify that your Scrape.do API key has the correct permissions for the actions you are trying to perform.

const axios = require('axios');

const puppeteer = require('puppeteer');

const fs = require('fs');

const SCRAPE_DO_TOKEN = 'YOUR_SCRAPE_DO_TOKEN'; // Replace with your actual Scrape.do token

const targetUrl = "https://66d766b074bd9da7360c1769--incandescent-gnome-f2685a.netlify.app"; // Your target URL

// Set up the request configuration for both API and Proxy modes

const isApiMode = true; // Set this to false if you want to use Proxy mode instead

// Configuration changes based on the mode

let config;

if (isApiMode) {

    // API Mode: Use this configuration

    const encodedUrl = encodeURIComponent(targetUrl); // URL must be encoded in API mode

    config = {

        method: 'GET',

        url: `https://api.scrape.do?token=${SCRAPE_DO_TOKEN}&url=${encodedUrl}&super=true&geoCode=us`, // super=true routes through Residential & Mobile proxies; geoCode sets geo targeting if needed

        headers: {}

    };

} else {

    // Proxy Mode: Use this configuration

    config = {

        method: 'GET',

        url: targetUrl,

        proxy: {

            protocol: 'http', // Use either 'http' or 'https'

            host: 'proxy.scrape.do',

            port: 8080,

            auth: {

                username: SCRAPE_DO_TOKEN,

                password: ''

            }

        }

    };

}

// Fetch the page content using the configured mode

axios(config)

    .then(async function (response) {

        console.log('Page content fetched successfully using Scrape.do.');

        //...

    })

    .catch(function (error) {

        console.error('Error fetching page content with Scrape.do:', error.message);

    });

CAPTCHA and Bot Detection

Many websites use CAPTCHA and bot detection mechanisms to prevent scraping. By integrating the Puppeteer-extra and puppeteer-extra-plugin-stealth libraries, you can disguise your headless browser as a regular user, reducing the risk of detection. Additionally, Scrape.do offers options like solve_blocking to handle more sophisticated blocking mechanisms automatically.

Dynamic Content Loading

Websites often load content dynamically through JavaScript, which can make it difficult for scrapers to capture the required data immediately. By leveraging Puppeteer’s waitForSelector and waitForFunction methods (or waitForTimeout on older versions, though it has since been deprecated), you can ensure that the necessary elements are fully loaded before attempting to extract data. For example, when scraping single-page applications (SPAs) that rely heavily on JavaScript, using page.waitForSelector('.content-loaded') ensures that content is available before proceeding.
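
As a rough sketch, assuming hypothetical .content-loaded and .item selectors for such an SPA, the waiting logic could look like this:


// Wait until the SPA marks its main container as rendered (assumed selector)
await page.waitForSelector('.content-loaded', { visible: true, timeout: 30000 });

// Or wait on an arbitrary condition, e.g. at least one list item in the DOM
await page.waitForFunction(
  () => document.querySelectorAll('.item').length > 0,
  { timeout: 30000 }
);

// Only then extract the data
const items = await page.$$eval('.item', nodes => nodes.map(n => n.textContent.trim()));
console.log(items);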

Form Submission Timing

For web pages with dynamically loaded or asynchronously interacting elements, improper timing can cause submission errors. Use appropriate delays and wait functions, such as:


await page.waitForSelector('#open-modal'); // Wait for the button to appear

await page.click('#open-modal'); // Click the button to open the modal

await page.waitForSelector('.modal-content', { visible: true });

Handling File Uploads

To handle file uploads, use Puppeteer’s page.waitForFileChooser() and fileChooser.accept():


const [fileChooser] = await Promise.all([

    page.waitForFileChooser(),

    page.click('input[type="file"]'), // This will trigger the file input

]);

const filePath = path.join(os.homedir(), 'Desktop', 'test.jpg'); // Resolve the path to ~/Desktop/test.jpg

await fileChooser.accept([filePath]);

Integrating Puppeteer-extra and Puppeteer-extra-plugin-stealth

Integrating Puppeteer-extra and puppeteer-extra-plugin-stealth is crucial for bypassing common detection techniques employed by websites to identify and block scraping activities. These plugins adjust Puppeteer’s default browser fingerprint, for example by spoofing the WebGL vendor and renderer strings, normalizing the User-Agent string, and faking the presence of certain browser APIs that anti-bot scripts commonly check.

To use these plugins effectively, you need to configure them properly for your specific use case. For instance, some websites check for unusual browser behaviors, such as missing fonts or disabled WebRTC. Puppeteer-extra-plugin-stealth automatically fixes these issues, making your scraping activities appear more human-like. Combining these plugins with other evasion techniques, like randomizing request intervals and using high-quality residential proxies, can significantly improve your success rate.


const puppeteer = require('puppeteer-extra');

const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {

    const browser = await puppeteer.launch({ headless: false });

    const page = await browser.newPage();

    // Your scraping logic here...

    await browser.close();

})();

By integrating these plugins, your scraping scripts are less likely to be detected and blocked by anti-bot systems.
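
Randomizing request intervals, mentioned above, can be handled with a small helper; the delay bounds and URLs below are arbitrary placeholders:


// Sleep for a random duration between min and max milliseconds
const randomDelay = (min = 1000, max = 4000) =>
  new Promise(resolve => setTimeout(resolve, min + Math.random() * (max - min)));

// Pause for a random interval between navigations so request timing looks less robotic
await page.goto('https://example.com/page1', { waitUntil: 'networkidle2' });
await randomDelay();
await page.goto('https://example.com/page2', { waitUntil: 'networkidle2' });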

Conclusion

By using the Scrape.do API mode effectively, configuring timeouts, managing retries, and handling dynamic content, you can enhance your web scraping strategy and handle challenges such as network errors, CAPTCHAs, and bot detection. Proper configuration and strategic use of tools like Puppeteer-extra and Scrape.do features can significantly improve your scraping success rate.

References

Puppeteer Documentation

NodeJS Documentation

Puppeteer-extra

Puppeteer-extra-plugin-stealth