Advanced Web Scraping Techniques with Puppeteer and NodeJS
Web scraping allows you to gather data from websites efficiently, whether for extracting product information, monitoring page changes, or visualizing time-based data. In this guide, we leverage Puppeteer, a powerful headless browser automation tool, together with Node.js to manage asynchronous tasks and enhance web scraping capabilities. We’ll also integrate Scrape.do to optimize our scraping process with advanced features like proxy management, session handling, and bot detection avoidance.
NodeJS and Puppeteer
Puppeteer is an open-source Node.js library that provides a high-level API for controlling Chrome or Chromium over the DevTools Protocol. It enables developers to automate browsing tasks, such as clicking buttons, filling out forms, and capturing screenshots, while navigating websites as a real user would. By simulating real-user interactions, Puppeteer is particularly effective at scraping dynamic content generated by JavaScript, which traditional scraping tools may fail to capture.
Node.js (often written NodeJS) is a popular JavaScript runtime whose event-driven, non-blocking architecture makes it well suited to running many scraping tasks concurrently. Together, Puppeteer and Node.js provide a robust platform for building scalable, efficient web scrapers that can cope with dynamic content loading, CAPTCHAs, and bot detection.
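To ground these ideas, here is a minimal Puppeteer script that launches a browser, navigates to a page, reads its title, and captures a screenshot. The target URL and output filename are placeholders you would replace with your own.

const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser and open a new tab
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Navigate and wait until network activity settles
  await page.goto('https://example.com/', { waitUntil: 'networkidle2' });

  // Extract the page title and take a screenshot
  const title = await page.title();
  await page.screenshot({ path: 'example.png' });
  console.log('Page title:', title);

  await browser.close();
})();

Every Puppeteer call returns a promise, which is why the flow is wrapped in an async IIFE; the same pattern is used throughout the examples below.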
Environment Setup
Dependencies
To get started, you’ll need Node.js installed, plus the Puppeteer and Axios packages used in the examples that follow. Make sure you are using stable versions for the web scraping tasks ahead.
npm install puppeteer axios
Configuration
Set up a basic Puppeteer project with Node.js. First, create a package.json file by running the following command:
npm init -y
Then, configure your package.json to include the necessary dependencies:
{
  "name": "web-scraper",
  "version": "1.0.0",
  "description": "Advanced web scraping with Puppeteer, NodeJS, and Scrape.do",
  "main": "index.js",
  "dependencies": {
    "puppeteer": "^19.9.2",
    "axios": "^1.5.0"
  }
}
Advanced Cookie and Session Management
Efficient management of cookies and sessions is essential in web scraping to maintain continuity and avoid repetitive logins or verification checks. When handling form data that appears in a modal triggered by a button, it is crucial to ensure the session state is preserved throughout the scraping process. To achieve this, we will create cookies with timestamps, check for existing cookies, and reuse them as needed. This helps maintain the login state across multiple requests and reduces the likelihood of being detected and blocked by the target website.
Cookies store important information, such as login credentials, user preferences, and session IDs, which are vital for maintaining the login state and session data during scraping. Without proper session management, a scraper may trigger repeated login attempts or CAPTCHAs, increasing the risk of detection and blocking.
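The script later in this section loads and saves cookies.json but does not show the timestamp check described above. Here is a minimal sketch of that idea, assuming cookies are stored alongside a savedAt field and treated as stale after 24 hours; both the field name and the threshold are arbitrary choices, not part of the main script below.

const fs = require('fs');

const COOKIE_PATH = 'cookies.json';
const MAX_AGE_MS = 24 * 60 * 60 * 1000; // treat cookies older than 24h as stale (assumed threshold)

// Save cookies together with the time they were captured
async function saveCookies(page) {
  const cookies = await page.cookies();
  fs.writeFileSync(COOKIE_PATH, JSON.stringify({ savedAt: Date.now(), cookies }));
}

// Reuse cookies only if the file exists and is still fresh
async function loadCookiesIfFresh(page) {
  if (!fs.existsSync(COOKIE_PATH)) return false;
  const { savedAt, cookies } = JSON.parse(fs.readFileSync(COOKIE_PATH, 'utf-8'));
  if (Date.now() - savedAt > MAX_AGE_MS) return false; // stale: force a fresh login
  await page.setCookie(...cookies);
  return true;
}

Wired into the main example, saveCookies would replace the final page.cookies() call and loadCookiesIfFresh would replace the fs.existsSync check before navigation.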
Session Management
To handle cookies and sessions effectively, we will demonstrate how to save cookies to a file and reload them in subsequent scraping sessions. This approach maintains the login state and minimizes the risk of being flagged as a bot. Additionally, we will use Scrape.do to manage proxies, solve blocking issues, and implement retry mechanisms that enable the scraper to handle CAPTCHAs, detect and bypass rate limits, and manage retries efficiently.
Here is an example of how to manage sessions and cookies to maintain login states across multiple scraping runs:
const puppeteer = require('puppeteer');
const axios = require('axios');
const fs = require('fs');
const path = require('path');
const os = require('os');

const SCRAPE_DO_TOKEN = '--your-token--'; // Replace with your Scrape.do API token
const targetUrl = encodeURIComponent('https://example.com/'); // Replace with a page that contains the modal and contact form used below

const blockedResourceTypes = [
  'beacon',
  'csp_report',
  'font',
  'image',
  'imageset',
  'media',
  'object',
  'texttrack',
  'stylesheet',
];

// Custom delay function
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// JSON-like configuration for form fields
const formConfig = [
  { selector: 'input[name="name"]', value: 'John Doe', type: 'text' },
  { selector: 'input[name="email"]', value: 'john.doe@example.com', type: 'email' },
  { selector: 'textarea[name="message"]', value: 'Hello, this is a test!', type: 'textarea' },
];

(async () => {
  try {
    // Step 1: Use Scrape.do API to fetch page content
    const response = await axios.get(`https://api.scrape.do?token=${SCRAPE_DO_TOKEN}&url=${targetUrl}`, {
      timeout: 30000 // Set timeout for the request
    });
    console.log('Scrape.do API response received.');

    // Step 2: Launch Puppeteer to process the fetched content
    const browser = await puppeteer.launch({
      headless: false,
      ignoreHTTPSErrors: true, // Ignore SSL certificate errors
    });
    const page = await browser.newPage();

    // Block unnecessary resources
    await page.setRequestInterception(true);
    page.on('request', request => {
      if (blockedResourceTypes.includes(request.resourceType())) {
        console.log(`Blocked type: ${request.resourceType()} url: ${request.url()}`);
        request.abort();
      } else {
        request.continue();
      }
    });

    // Load cookies from a file if available
    const cookiePath = 'cookies.json';
    let cookies = [];
    if (fs.existsSync(cookiePath)) {
      cookies = JSON.parse(fs.readFileSync(cookiePath, 'utf-8'));
      console.log('Loaded cookies:', cookies);
      await page.setCookie(...cookies);
    }

    // Step 3: Set the content obtained from Scrape.do in Puppeteer
    await page.setContent(response.data);
    console.log('Page content loaded into Puppeteer.');

    // Step 4: Click the button to open the modal
    await page.waitForSelector('#open-modal'); // Wait for the button to appear
    await page.click('#open-modal'); // Click the button to open the modal

    // Wait for the modal to become visible
    await page.waitForSelector('.modal-content', { visible: true });
    console.log('Modal appeared, checking if the modal has a form...');

    // Check if the modal contains a form
    const isFormPresent = await page.evaluate(() => {
      const modal = document.querySelector('.modal-content');
      return modal && modal.querySelector('form#contact-form') !== null; // Check if the form exists in the modal
    });

    if (!isFormPresent) {
      console.error('No form found in the modal.');
      await browser.close();
      return; // Exit the script if the form is not present
    }
    console.log('Form is present in the modal.');

    // Fill in the form fields dynamically using the configuration
    for (const field of formConfig) {
      console.log(`Filling field: ${field.selector} with value: ${field.value}`);
      await delay(1000); // Wait for 1 second before typing
      await page.type(field.selector, field.value);
    }

    // Wait for the file chooser, then trigger the file input click and upload a file
    const [fileChooser] = await Promise.all([
      page.waitForFileChooser(),
      page.click('input[type="file"]'), // This will trigger the file input
    ]);

    // Select the file to upload from ~/Desktop/test.jpg
    const filePath = path.join(os.homedir(), 'Desktop', 'test.jpg'); // Resolve the path to ~/Desktop/test.jpg
    await fileChooser.accept([filePath]);

    // Click the submit button
    await delay(1000);
    await page.click('button#submit-form');

    // Wait for the confirmation message to appear
    await page.waitForSelector('.form-submission-confirmation', { visible: true });

    // Extract the form data
    const formName = await page.$eval('form#contact-form', form => form.getAttribute('name'));
    const submissionUrl = page.url();

    // Extract the values of the fields dynamically from the configuration
    const extractedFormData = {};
    for (const field of formConfig) {
      extractedFormData[field.selector] = await page.$eval(field.selector, el => el.value);
    }

    // Log the data to the console
    console.log({
      formName,
      submissionUrl,
      extractedFormData,
    });

    // Step 5: Save cookies after the form submission
    cookies = await page.cookies();
    fs.writeFileSync(cookiePath, JSON.stringify(cookies));
    console.log('Cookies saved to file.');

    await browser.close();
  } catch (error) {
    console.error('Error during scraping:', error);
    // Retry logic using Scrape.do API retry feature
    try {
      const retryResponse = await axios.post('https://api.scrape.do/retry', { token: SCRAPE_DO_TOKEN });
      console.log('Retry triggered with Scrape.do', retryResponse.data);
    } catch (retryError) {
      console.error('Retry error:', retryError);
    }
  }
})();
Expected Output
When the script runs successfully, you should see output similar to the following in your terminal (because the HTML is injected with page.setContent rather than loaded by navigation, the reported URL is about:blank, and formName is null when the form element has no name attribute):
Scrape.do API response received.
Page content loaded into Puppeteer.
Modal appeared, checking if the modal has a form...
Form is present in the modal.
Filling field: input[name="name"] with value: John Doe
Filling field: input[name="email"] with value: john.doe@example.com
Filling field: textarea[name="message"] with value: Hello, this is a test!
{
  formName: null,
  submissionUrl: 'about:blank',
  extractedFormData: {
    'input[name="name"]': 'John Doe',
    'input[name="email"]': 'john.doe@example.com',
    'textarea[name="message"]': 'Hello, this is a test!'
  }
}
Cookies saved to file.
Potential Challenges During the Process
When using Puppeteer and Node.js for complex web scraping tasks, some challenges may arise. However, with Scrape.do in API Mode or Proxy Mode, and by leveraging its additional features, it is possible to manage the entire process efficiently.
Increase the Timeout Value
Web scraping often requires handling dynamic content or slow network responses. Increase the timeout value in your Puppeteer and Scrape.do configuration to give more time for requests to complete.
const scrapeDoUrl = `https://api.scrape.do?token=${SCRAPE_DO_TOKEN}&url=https://66d766b074bd9da7360c1769--incandescent-gnome-f2685a.netlify.app&keep_headers=true&timeout=30000`; // 30 seconds timeout
(async () => {
  const browser = await puppeteer.launch({ headless: false, ignoreHTTPSErrors: true });
  const page = await browser.newPage();
  try {
    // Increase Puppeteer's timeout
    await page.goto(scrapeDoUrl, { waitUntil: 'networkidle2', timeout: 60000 }); // 60 seconds timeout
    console.log('Page loaded successfully.');
  } catch (error) {
    console.error('Error during scraping:', error);
    // Retry logic using Scrape.do API retry feature
    try {
      const retryResponse = await axios.post('https://api.scrape.do/retry', { token: SCRAPE_DO_TOKEN });
      console.log('Retry response:', retryResponse.data);
    } catch (retryError) {
      console.error('Retry error:', retryError);
    }
  } finally {
    await browser.close();
  }
})();
Add Exponential Backoff for Retry Logic
Incorporate an exponential backoff strategy to handle retries more efficiently, allowing for progressively increased delays between attempts.
async function retryWithBackoff(fn, maxRetries = 5, delay = 1000) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      console.error(`Attempt ${i + 1} failed:`, error.message);
      if (i === maxRetries - 1) throw error;
      await new Promise(res => setTimeout(res, delay * Math.pow(2, i))); // Exponential backoff: 1s, 2s, 4s, ...
    }
  }
}

retryWithBackoff(() => page.goto(scrapeDoUrl, { waitUntil: 'networkidle2', timeout: 60000 }));
Use Scrape.do’s solve_blocking Parameter
Enable the solve_blocking feature in Scrape.do to solve potential blocking issues that could be causing timeouts.
const scrapeDoUrl = `https://api.scrape.do?token=${SCRAPE_DO_TOKEN}&url=https://66d766b074bd9da7360c1769--incandescent-gnome-f2685a.netlify.app&keep_headers=true&timeout=30000&solve_blocking=true`;
Optimizing Performance Through Concurrency and Parallelism
Running multiple Puppeteer instances concurrently can significantly speed up the scraping process. However, this requires careful management of system resources and network connections. For example, you can launch multiple browser instances and control them using a queue system to prevent overwhelming the target website and reduce the risk of IP blocking.
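As a concrete illustration, the sketch below shares one browser instance across a small pool of workers that pull URLs from an in-memory queue. The URL list and the concurrency limit of 2 are placeholder values, and for larger jobs a dedicated library such as puppeteer-cluster may be a better fit.

const puppeteer = require('puppeteer');

// URLs to scrape and the maximum number of pages kept open at once (both assumed values)
const urls = ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3'];
const CONCURRENCY = 2;

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const queue = [...urls];
  const results = [];

  // Each worker pulls the next URL from the shared queue until it is empty
  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift();
      const page = await browser.newPage();
      try {
        await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });
        results.push({ url, title: await page.title() });
      } catch (error) {
        console.error(`Failed to scrape ${url}:`, error.message);
      } finally {
        await page.close();
      }
    }
  }

  // Start a fixed number of workers and wait for all of them to drain the queue
  await Promise.all(Array.from({ length: CONCURRENCY }, () => worker()));
  console.log(results);

  await browser.close();
})();

Capping the number of open pages keeps memory usage predictable and spaces out requests, which also lowers the risk of IP blocking.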
Handle Network Errors in Puppeteer
Add specific error handling for network-related errors to handle potential failures during the scraping process.
page.on('requestfailed', request => {
  console.error(`Request failed: ${request.url()} - ${request.failure().errorText}`);
});
Verify Proxy and API Key
Ensure your Scrape.do API key is correctly set up and that the proxy server configuration is accurate:
- Double-check that SCRAPE_DO_TOKEN is correctly initialized.
- Verify that your Scrape.do API key has the correct permissions for the actions you are trying to perform.
const axios = require('axios');
const puppeteer = require('puppeteer');
const fs = require('fs');

const SCRAPE_DO_TOKEN = 'YOUR_SCRAPE_DO_TOKEN'; // Replace with your actual Scrape.do token
const targetUrl = "https://66d766b074bd9da7360c1769--incandescent-gnome-f2685a.netlify.app"; // Your target URL

// Set up the request configuration for both API and Proxy modes
const isApiMode = true; // Set this to false if you want to use Proxy mode instead

// Configuration changes based on the mode
let config;
if (isApiMode) {
  // API Mode: Use this configuration
  const encodedUrl = encodeURIComponent(targetUrl); // URL must be encoded in API mode
  config = {
    method: 'GET',
    url: `https://api.scrape.do?token=${SCRAPE_DO_TOKEN}&url=${encodedUrl}&super=true&geoCode=us`, // Use super=true for Residential & Mobile proxies, set geo targeting if needed
    headers: {}
  };
} else {
  // Proxy Mode: Use this configuration
  config = {
    method: 'GET',
    url: targetUrl,
    proxy: {
      protocol: 'http', // Use either 'http' or 'https'
      host: 'proxy.scrape.do',
      port: 8080,
      auth: {
        username: SCRAPE_DO_TOKEN,
        password: ''
      }
    }
  };
}

// Fetch the page content using the configured mode
axios(config)
  .then(async function (response) {
    console.log('Page content fetched successfully using Scrape.do.');
    //...
  })
  .catch(function (error) {
    console.error('Error fetching page content with Scrape.do:', error.message);
  });
CAPTCHA and Bot Detection
Many websites use CAPTCHA and bot detection mechanisms to prevent scraping. By integrating the Puppeteer-extra and puppeteer-extra-plugin-stealth libraries, you can disguise your headless browser as a regular user, reducing the risk of detection. Additionally, Scrape.do offers options like solve_blocking to handle more sophisticated blocking mechanisms automatically.
Dynamic Content Loading
Websites often load content dynamically through JavaScript, which can make it difficult for scrapers to capture the required data immediately. By leveraging Puppeteer’s waitForSelector, waitForFunction, or waitForTimeout methods, you can ensure that the necessary elements are fully loaded before attempting to extract data. For example, when scraping single-page applications (SPAs) that rely heavily on JavaScript, using page.waitForSelector('.content-loaded') ensures that content is available before proceeding.
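Here is a sketch of both waiting approaches; the SPA URL, the .content-loaded selector, and the window.appReady flag are hypothetical stand-ins for whatever your target page exposes.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com/spa', { waitUntil: 'networkidle2' }); // hypothetical SPA URL

  // Wait until the element that signals "content is ready" exists in the DOM
  await page.waitForSelector('.content-loaded', { timeout: 30000 });

  // Or wait for an arbitrary condition evaluated inside the page,
  // e.g. a flag the application sets once its data has been fetched
  await page.waitForFunction(() => window.appReady === true, { timeout: 30000 });

  // Now it is safe to extract the dynamically rendered data
  const items = await page.$$eval('.content-loaded .item', nodes => nodes.map(n => n.textContent.trim()));
  console.log(items);

  await browser.close();
})();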
Form Submission Timing
For web pages with dynamically loaded or asynchronously interacting elements, improper timing can cause submission errors. Use appropriate delays and wait functions, such as:
await page.waitForSelector('#open-modal'); // Wait for the button to appear
await page.click('#open-modal'); // Click the button to open the modal
await page.waitForSelector('.modal-content', { visible: true });
Handling File Uploads
To handle file uploads, use Puppeteer’s page.waitForFileChooser() and fileChooser.accept():
const [fileChooser] = await Promise.all([
  page.waitForFileChooser(),
  page.click('input[type="file"]'), // This will trigger the file input
]);
const filePath = path.join(os.homedir(), 'Desktop', 'test.jpg'); // Resolve the path to ~/Desktop/test.jpg
await fileChooser.accept([filePath]);
Integrating Puppeteer-extra and Puppeteer-extra-plugin-stealth
Integrating puppeteer-extra and puppeteer-extra-plugin-stealth is crucial for bypassing common detection techniques that websites use to identify and block scraping activity. These plugins patch Puppeteer’s default browser fingerprint, for example by masking the navigator.webdriver flag, spoofing WebGL vendor information, adjusting the User-Agent string, and faking the presence of browser APIs that anti-bot scripts often check.
To use these plugins effectively, you need to configure them properly for your specific use case. For instance, some websites check for unusual browser behaviors, such as missing fonts or disabled WebRTC. Puppeteer-extra-plugin-stealth automatically fixes these issues, making your scraping activities appear more human-like. Combining these plugins with other evasion techniques, like randomizing request intervals and using high-quality residential proxies, can significantly improve your success rate.
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  // Your scraping logic here...
})();
By integrating these plugins, your scraping scripts are less likely to be detected and blocked by anti-bot systems.
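The randomized request intervals mentioned above are not part of the stealth plugin itself; a minimal helper along these lines can be used between navigations, with the 1–4 second range being an arbitrary choice rather than a recommended value.

// Wait a random amount of time between actions to avoid a mechanical request rhythm
// (the 1-4 second bounds are arbitrary, assumed values)
const randomDelay = (minMs = 1000, maxMs = 4000) =>
  new Promise(resolve => setTimeout(resolve, minMs + Math.random() * (maxMs - minMs)));

// Example usage between page visits:
// await page.goto('https://example.com/page1');
// await randomDelay();
// await page.goto('https://example.com/page2');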
Conclusion
By using the Scrape.do API mode effectively, configuring timeouts, managing retries, and handling dynamic content, you can enhance your web scraping strategy and handle challenges such as network errors, CAPTCHAs, and bot detection. Proper configuration and strategic use of tools like Puppeteer-extra and Scrape.do features can significantly improve your scraping success rate.