How to Scrape Web Pages With Cheerio in Node.js
Cheerio is a fast and lightweight DOM (Document Object Model) manipulation library for Node.js, designed for server-side operations. It provides a jQuery-like API for parsing and manipulating HTML, making it an excellent choice for web scraping tasks. Built on htmlparser2, Cheerio offers robust APIs for extracting data and parsing HTML efficiently.
Key Features and Use Cases
- Traversing and parsing HTML documents
- Extracting specific elements from content
- Manipulating DOM elements on the server side
Important Distinctions
- Cheerio does not interpret the result as a web browser does.
- It does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript.
- These limitations are exactly what make Cheerio much faster than browser-based solutions.
Unlike browser-based tools such as Puppeteer, Cheerio doesn't execute JavaScript; it focuses on parsing static HTML, making it ideal for scenarios where speed and efficiency are critical.
Note: If your use case requires functionality like JavaScript execution, CSS rendering, or browser automation, consider alternatives such as Puppeteer, Playwright, or JSDom (DOM emulation project).
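For instance, here's a minimal jsdom sketch (assuming jsdom is installed with npm install jsdom) that parses markup into a browser-like DOM on the server:
const { JSDOM } = require('jsdom');

// jsdom builds a full window/document, unlike Cheerio's lightweight parse
const dom = new JSDOM('<!DOCTYPE html><p id="greet">Hello world</p>');
console.log(dom.window.document.querySelector('#greet').textContent); // "Hello world"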
Setting Up the Development Environment
Dependencies
Before you begin web scraping with Cheerio, ensure you have Node.js installed on your system. Once Node.js is set up, you'll use npm (Node Package Manager) to install the required packages for your project.
npm init -y # Initialize a new Node.js project
npm install cheerio axios # Install Cheerio and Axios for making HTTP requests
In this guide, we’ll be using the following Node.js libraries:
- Axios: A promise-based HTTP client for making requests in both browser and Node.js environments.
- Cheerio: A fast, flexible, and lean implementation of core jQuery, designed specifically for server-side use.
Understanding Cheerio Basics
Loading HTML and Making Requests
To begin scraping with Cheerio, you first need to provide it with HTML markup to parse. This is done using the load function. After loading the markup and initializing Cheerio, you can start manipulating and traversing the resulting data structure using Cheerio's API.
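Here's the load function at its simplest, parsing an inline markup string:
const cheerio = require('cheerio');

// Load a markup string, then query it with a CSS selector
const $ = cheerio.load('<h2 class="title">Hello world</h2>');
console.log($('h2.title').text()); // "Hello world"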
Here’s a comprehensive example that fetches HTML from a website, extracts country data, processes it, and outputs the results:
const axios = require('axios');
const cheerio = require('cheerio');
const url = 'https://www.scrapethissite.com/pages/simple/';
axios.get(url)
.then(response => {
const html = response.data;
const $ = cheerio.load(html);
const countries = $('.country');
const countryData = [];
countries.each((index, element) => {
const country = $(element);
const name = country.find('.country-name').text().trim();
const capital = country.find('.country-capital').text().trim();
const populationStr = country.find('.country-population').text().trim();
const areaStr = country.find('.country-area').text().trim();
const population = parseInt(populationStr.replace(/,/g, ''), 10);
const area = parseFloat(areaStr.replace(/,/g, ''));
// Calculate population density (people per sq km)
const density = area > 0 ? (population / area).toFixed(2) : 'N/A';
countryData.push({ name, capital, population, area, density });
});
countryData.sort((a, b) => b.population - a.population);
console.log(`Total countries found: ${countryData.length}`);
console.log('Countries ordered by population (descending):');
countryData.forEach((country, index) => {
console.log(`${index + 1}. ${country.name} - Capital: ${country.capital}, Population: ${country.population.toLocaleString()}, Area: ${country.area} km², Density: ${country.density} people/km²`);
});
})
.catch(error => console.log('Error:', error.message));
Important Notes on Cheerio’s Behavior
- Automatic HTML Structure: Cheerio automatically wraps the parsed markup in <html>, <head>, and <body> elements, similar to how browsers handle HTML (this only occurs if these elements aren't already present in the parsed HTML).
- Disabling Automatic HTML Structure: You can prevent Cheerio from adding these elements by passing false as the third argument to the load function:
const $ = cheerio.load(html, null, false);
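A quick demonstration of the difference, using $.html() to serialize the result:
const cheerio = require('cheerio');

// Default behavior: the fragment is wrapped in a full document
const $doc = cheerio.load('<p>Hi</p>');
console.log($doc.html()); // <html><head></head><body><p>Hi</p></body></html>

// With false as the third argument, the fragment is kept as-is
const $fragment = cheerio.load('<p>Hi</p>', null, false);
console.log($fragment.html()); // <p>Hi</p>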
Understanding Cheerio’s Selector Function
Cheerio’s main function for selecting elements has the following structure:
$(selector, [context], [root])
Let’s break down each parameter:
- selector: Targets specific elements in the markup and serves as the starting point for traversing and manipulating the document. The selector can be:
  - A string (e.g., div.classname)
  - A DOM element
  - An array of elements
  - A Cheerio object
- context: (Optional) This defines the scope or where to begin looking for the target elements. It can take the same forms as the selector.
- root: (Optional) This is the markup string you want to traverse or manipulate.
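For example, the optional context parameter lets you scope the same selector to different parts of the document:
const cheerio = require('cheerio');

const $ = cheerio.load(
  '<ul id="fruits"><li>Apple</li></ul><ul id="veggies"><li>Carrot</li></ul>'
);

// The same selector, scoped to two different contexts
console.log($('li', '#fruits').text());  // "Apple"
console.log($('li', '#veggies').text()); // "Carrot"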
Examples of Cheerio Selectors
Here are some common ways to use Cheerio’s selector function:
// Select all <a> elements
$('a')
// Select <div> elements with class 'content'
$('div.content')
// Select elements with id 'main-content'
$('#main-content')
// Select <li> elements that are direct children of <ul>
$('ul > li')
// Select the first <p> element
$('p:first')
// Select all <img> elements with a 'src' attribute
$('img[src]')
// Filter a selection using a function
$('div').filter((i, el) => $(el).attr('id') === 'main')
These selectors allow you to precisely target the elements you want to extract or manipulate in your web scraping projects.
Advanced HTML Parsing with Cheerio
Navigating Through DOM Elements
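Once you have a selection, Cheerio offers jQuery-style traversal methods such as find, children, parent, next, prev, and siblings for moving through the DOM tree. A small sketch:
const cheerio = require('cheerio');

const $ = cheerio.load(`
  <ul id="menu">
    <li>Home</li>
    <li class="active">Blog</li>
    <li>About</li>
  </ul>
`);

const active = $('li.active');
console.log(active.text());                // "Blog"
console.log(active.prev().text());         // "Home"
console.log(active.next().text());         // "About"
console.log(active.parent().attr('id'));   // "menu"
console.log($('#menu').children().length); // 3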
Scraping a Table
const axios = require('axios');
const cheerio = require('cheerio');
const qs = require('querystring');
const baseUrl = 'https://www.scrapethissite.com/pages/forms/';
async function scrapeHockeyTeams(searchTerm = 'north', perPage = 25) {
try {
// First, submit the search form
const searchParams = qs.stringify({ q: searchTerm, per_page: perPage });
const response = await axios.get(`${baseUrl}?${searchParams}`);
let $ = cheerio.load(response.data);
// Check if we need to adjust the per_page value
const totalResults = $('.team').length;
if (totalResults >= perPage && perPage !== 100) {
console.log(`More than ${perPage} results found. Adjusting to 100 per page.`);
return scrapeHockeyTeams(searchTerm, 100);
}
// Extract table data
const tableData = [];
$('.team').each((index, element) => {
const team = $(element);
const teamData = {
name: team.find('.name').text().trim(),
year: team.find('.year').text().trim(),
wins: team.find('.wins').text().trim(),
losses: team.find('.losses').text().trim(),
otLosses: team.find('.ot-losses').text().trim(),
winPercentage: team.find('.pct').text().trim(),
goalsFor: team.find('.gf').text().trim(),
goalsAgainst: team.find('.ga').text().trim(),
plusMinus: team.find('.diff').text().trim()
};
tableData.push(teamData);
});
console.log(`Total teams found: ${tableData.length}`);
console.log('First 5 teams:');
console.log(tableData.slice(0, 5));
return tableData;
} catch (error) {
console.error('Error:', error.message);
}
}
// Run the scraper
scrapeHockeyTeams();
Extracting Attributes (e.g., Image URLs)
const imageSrc = $('img').attr('src');
console.log('Image Source:', imageSrc);
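Note that attr returns the attribute of the first matched element only. To collect every image URL, iterate over the selection; in this sketch, url is assumed to hold the page's absolute address so relative paths can be resolved:
// Collect the src of every image on the page
const imageUrls = [];
$('img').each((i, el) => {
  const src = $(el).attr('src');
  if (src) {
    imageUrls.push(new URL(src, url).href); // resolve relative paths against the page URL
  }
});
console.log('Image URLs:', imageUrls);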
Handling Pagination
const axios = require('axios');
const cheerio = require('cheerio');
const qs = require('querystring');
const baseUrl = 'https://www.scrapethissite.com/pages/forms/';
async function scrapeHockeyTeams(searchTerm = '', maxPages = 3, perPage = 100) {
let currentPage = 1;
let allTeams = [];
while (currentPage <= maxPages) {
try {
const params = qs.stringify({
q: searchTerm,
per_page: perPage,
page_num: currentPage
});
const url = `${baseUrl}?${params}`;
console.log(`Scraping page ${currentPage}: ${url}`);
const response = await axios.get(url);
const $ = cheerio.load(response.data);
$('.team').each((index, element) => {
const team = $(element);
const teamData = {
name: team.find('.name').text().trim(),
year: parseInt(team.find('.year').text().trim(), 10),
wins: parseInt(team.find('.wins').text().trim(), 10),
losses: parseInt(team.find('.losses').text().trim(), 10),
otLosses: team.find('.ot-losses').text().trim(),
winPercentage: parseFloat(team.find('.pct').text().trim()),
goalsFor: parseInt(team.find('.gf').text().trim(), 10),
goalsAgainst: parseInt(team.find('.ga').text().trim(), 10),
plusMinus: parseInt(team.find('.diff').text().trim(), 10)
};
allTeams.push(teamData);
});
console.log(`Page ${currentPage}: Found ${$('.team').length} teams`);
// Check if there's a next page
const nextPageLink = $('.pagination li:last-child a').attr('href');
if (!nextPageLink || currentPage >= maxPages) {
break;
}
currentPage++;
} catch (error) {
console.error(`Error on page ${currentPage}:`, error.message);
break;
}
}
// Sort teams by wins in descending order
allTeams.sort((a, b) => b.wins - a.wins);
console.log(`Total teams scraped: ${allTeams.length}`);
console.log('Top 10 teams by wins:');
console.log(JSON.stringify(allTeams.slice(0, 10), null, 2));
return allTeams;
}
// Run the scraper
scrapeHockeyTeams();
Error Handling
// Note: url and currentPage are assumed to be defined in the enclosing scope
const scrapePage = () => {
  axios.get(`${url}?page=${currentPage}`).then(response => {
    // ... (scraping logic)
  }).catch(error => {
    console.log('Request failed:', error.message);
    // A 404 is often how paginated sites signal that there are no more pages
    if (error.response && error.response.status === 404) {
      console.log('Reached the last page');
    }
  });
};
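Transient failures such as timeouts and rate limits are also common when scraping, so it often pays to wrap requests in a retry helper. Here's a minimal sketch with exponential backoff:
const axios = require('axios');

// Retry a GET request, doubling the wait after each failed attempt
async function fetchWithRetry(url, retries = 3, delayMs = 1000) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await axios.get(url);
    } catch (error) {
      if (attempt === retries) throw error;
      console.log(`Attempt ${attempt} failed (${error.message}), retrying in ${delayMs}ms...`);
      await new Promise(resolve => setTimeout(resolve, delayMs));
      delayMs *= 2;
    }
  }
}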
Handling Dynamic Content
For content loaded dynamically via JavaScript, combine Cheerio with a headless browser like Puppeteer:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
async function scrapeOscarWinners() {
console.log('Launching browser...');
const browser = await puppeteer.launch({ headless: false }); // Set to false for debugging
const page = await browser.newPage();
console.log('Navigating to the page...');
await page.goto('https://www.scrapethissite.com/pages/ajax-javascript/', { waitUntil: 'networkidle2' });
console.log('Waiting for #oscars container...');
await page.waitForSelector('#oscars');
console.log('Getting years...');
const years = await page.evaluate(() => {
const container = document.querySelector('#oscars');
const yearLinks = Array.from(container.querySelectorAll('.year-link'));
console.log('Year links found:', yearLinks.length);
return yearLinks.map(el => el.textContent.trim()).sort((a, b) => b - a);
});
console.log('Years found:', years);
const allFilms = {};
for (const year of years) {
console.log(`\nScraping data for year ${year}`);
try {
console.log(`Clicking on year ${year} link...`);
await page.evaluate((yearToClick) => {
const yearLink = Array.from(document.querySelectorAll('.year-link'))
.find(el => el.textContent.trim() === yearToClick);
if (yearLink) {
yearLink.click();
} else {
throw new Error(`Year link for ${yearToClick} not found`);
}
}, year);
console.log('Waiting for table body to be populated...');
await page.waitForFunction(() => {
const tableBody = document.querySelector('#table-body');
return tableBody && tableBody.children.length > 0;
}, { timeout: 5000 });
console.log('Getting page content...');
const content = await page.content();
const $ = cheerio.load(content);
const films = [];
$('tr.film').each((index, element) => {
const film = {
title: $(element).find('.film-title').text().trim(),
nominations: parseInt($(element).find('.film-nominations').text().trim(), 10),
awards: parseInt($(element).find('.film-awards').text().trim(), 10),
bestPicture: $(element).find('.film-best-picture i').length > 0
};
films.push(film);
});
console.log(`Found ${films.length} films for year ${year}`);
allFilms[year] = films;
} catch (error) {
console.error(`Error scraping data for year ${year}:`, error);
}
}
console.log('Closing browser...');
await browser.close();
// Print a summary of the data
for (const [year, films] of Object.entries(allFilms)) {
console.log(`\nYear ${year}:`);
console.log(`Total films: ${films.length}`);
console.log(`Films with most awards:`);
const maxAwards = Math.max(...films.map(f => f.awards));
films.filter(f => f.awards === maxAwards).forEach(f => {
console.log(`- ${f.title} (${f.awards} awards, ${f.nominations} nominations)`);
});
}
return allFilms;
}
scrapeOscarWinners().then(data => {
console.log('Scraping completed.');
}).catch(error => {
console.error('An error occurred:', error);
});
Performance Optimization
Techniques for optimization include:
- Rate Limiting (see the sketch after this list)
- Caching
- Asynchronous Processing
- Efficient Data Storage
- Using proxy services like Scrape.do for IP rotation and handling dynamic content
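As an illustration of rate limiting, here is a minimal sketch that fetches a list of URLs sequentially with a polite delay between requests:
const axios = require('axios');
const cheerio = require('cheerio');

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Fetch pages one at a time, pausing between requests to avoid hammering the server
async function scrapeSequentially(urls, delayMs = 1000) {
  const titles = [];
  for (const url of urls) {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);
    titles.push($('title').text());
    await delay(delayMs);
  }
  return titles;
}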
Example using Scrape.do's API endpoint (a sketch based on Scrape.do's documented request format; replace YOUR_TOKEN with your own API token):
const targetUrl = encodeURIComponent('https://example.com');
const scrapeDoUrl = `https://api.scrape.do/?token=YOUR_TOKEN&url=${targetUrl}&render=true`; // render=true enables JavaScript rendering
axios.get(scrapeDoUrl)
  .then(response => console.log(response.data))
  .catch(err => console.error(err));
Saving Scraped Data to CSV Files
After successfully scraping data from a website, the next crucial step is to save this data in a format that’s easy to analyze and manipulate. CSV (Comma-Separated Values) is a popular choice due to its simplicity and compatibility with various data analysis tools and spreadsheet applications.
npm install json2csv
We'll also use the built-in fs (File System) module to write our CSV file to disk.
Importing Required Modules
At the top of your script, import the necessary modules:
const { Parser } = require('json2csv');
const fs = require('fs').promises;
Note that we’re using the promise-based version of the fs module for better asynchronous handling.
Creating a Function to Save Data as CSV
Let’s create an asynchronous function that accepts the scraped data, converts it to CSV, and saves it to a file:
async function saveToCSV(data, filename) {
try {
// Define the fields for the CSV file
const fields = ['year', 'title', 'nominations', 'awards', 'bestPicture'];
// Create a new parser instance with the defined fields
const parser = new Parser({ fields });
// Convert the data to CSV format
const csv = parser.parse(data);
// Write the CSV to a file
await fs.writeFile(filename, csv);
console.log(`Data successfully saved to ${filename}`);
} catch (error) {
console.error('Error saving data to CSV:', error);
}
}
Using the Save Function
After you’ve scraped your data, you can use this function to save it to a CSV file. Here’s an example of how you might use it in your scraping script:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const { Parser } = require('json2csv');
const fs = require('fs').promises;
async function saveToCSV(data, filename) {
try {
// Define the fields for the CSV file
const fields = ['year', 'title', 'nominations', 'awards', 'bestPicture'];
// Create a new parser instance with the defined fields
const parser = new Parser({ fields });
// Convert the data to CSV format
const csv = parser.parse(data);
// Write the CSV to a file
await fs.writeFile(filename, csv);
console.log(`Data successfully saved to ${filename}`);
} catch (error) {
console.error('Error saving data to CSV:', error);
}
}
async function scrapeOscarWinners() {
console.log('Launching browser...');
const browser = await puppeteer.launch({ headless: false }); // Set to false for debugging
const page = await browser.newPage();
console.log('Navigating to the page...');
await page.goto('https://www.scrapethissite.com/pages/ajax-javascript/', { waitUntil: 'networkidle2' });
console.log('Waiting for #oscars container...');
await page.waitForSelector('#oscars');
console.log('Getting years...');
const years = await page.evaluate(() => {
const container = document.querySelector('#oscars');
const yearLinks = Array.from(container.querySelectorAll('.year-link'));
console.log('Year links found:', yearLinks.length);
return yearLinks.map(el => el.textContent.trim()).sort((a, b) => b - a);
});
console.log('Years found:', years);
const allFilms = [];
for (const year of years) {
console.log(`\nScraping data for year ${year}`);
try {
console.log(`Clicking on year ${year} link...`);
await page.evaluate((yearToClick) => {
const yearLink = Array.from(document.querySelectorAll('.year-link'))
.find(el => el.textContent.trim() === yearToClick);
if (yearLink) {
yearLink.click();
} else {
throw new Error(`Year link for ${yearToClick} not found`);
}
}, year);
console.log('Waiting for table body to be populated...');
await page.waitForFunction(() => {
const tableBody = document.querySelector('#table-body');
return tableBody && tableBody.children.length > 0;
}, { timeout: 5000 });
console.log('Getting page content...');
const content = await page.content();
const $ = cheerio.load(content);
$('tr.film').each((index, element) => {
const film = {
year: year,
title: $(element).find('.film-title').text().trim(),
nominations: parseInt($(element).find('.film-nominations').text().trim(), 10),
awards: parseInt($(element).find('.film-awards').text().trim(), 10),
bestPicture: $(element).find('.film-best-picture i').length > 0
};
allFilms.push(film);
});
console.log(`Found ${$('tr.film').length} films for year ${year}`);
} catch (error) {
console.error(`Error scraping data for year ${year}:`, error);
}
}
console.log('Closing browser...');
await browser.close();
// Print a summary of the data
const filmsByYear = allFilms.reduce((acc, film) => {
if (!acc[film.year]) acc[film.year] = [];
acc[film.year].push(film);
return acc;
}, {});
for (const [year, films] of Object.entries(filmsByYear)) {
console.log(`\nYear ${year}:`);
console.log(`Total films: ${films.length}`);
console.log(`Films with most awards:`);
const maxAwards = Math.max(...films.map(f => f.awards));
films.filter(f => f.awards === maxAwards).forEach(f => {
console.log(`- ${f.title} (${f.awards} awards, ${f.nominations} nominations)`);
});
}
// Save data to CSV
await saveToCSV(allFilms, 'oscar_winners.csv');
return allFilms;
}
scrapeOscarWinners().then(data => {
console.log('Scraping completed.');
}).catch(error => {
console.error('An error occurred:', error);
});
In this example, we scrape Oscar-winning film data year by year, push each film's details into an array, and then use our saveToCSV function to save the results to a file named oscar_winners.csv.
A Bonus: Google Apps Script Usage
You can use Cheerio in Google Apps Script (via the CheerioGS library) and benefit from other Google services such as Spreadsheets, Forms, Slides, and more. Integrating CheerioGS gives your scripts robust HTML parsing capabilities. Here's how to get started and make the most of Cheerio in your Google Apps Script projects:
As an example, you can follow the steps below:
- Add the library to your project (Project Key: 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0). To verify the key, go to the Google Apps Script Directory and search for "CheerioGS".
- Use it in your code:
function scrapeWebsite() {
// Fetch HTML content
const url = 'https://www.scrapethissite.com/pages/forms/';
const content = UrlFetchApp.fetch(url).getContentText();
// Load content into Cheerio
const $ = Cheerio.load(content);
// Extract information
const title = $('title').text();
const firstParagraph = $('p').first().text();
// Log the results
Logger.log('Page Title: ' + title);
Logger.log('First Paragraph: ' + firstParagraph);
}
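A second example writes scraped table rows straight into the active spreadsheet: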
function updateSheetWithScrapedData() {
const sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
const url = 'https://example.com/data';
const content = UrlFetchApp.fetch(url).getContentText();
const $ = Cheerio.load(content);
$('table tr').each((index, element) => {
const tds = $(element).find('td');
if (tds.length === 0) return; // skip header rows, which use <th> instead of <td>
const rowData = [
$(tds[0]).text(),
$(tds[1]).text(),
$(tds[2]).text()
];
sheet.appendRow(rowData);
});
}
By leveraging Cheerio in Google Apps Script, you can create powerful web scraping and HTML manipulation tools that integrate seamlessly with Google’s ecosystem of services. Remember to always scrape responsibly and in compliance with websites’ terms of service and robots.txt files.
Conclusion
This guide covers web scraping with Cheerio in Node.js, focusing on handling static HTML efficiently. For dynamic content, consider using services like Scrape.do or combining Cheerio with headless browsers.
Additional Resources:
To support your continued learning and development in web scraping, here are some valuable resources:
- Cheerio Documentation The official guide for Cheerio, covering all its features and APIs.
- Scrape.do Documentation Official documentation for Scrape.do, a powerful web scraping service.
- Node.js Official Documentation Comprehensive documentation for Node.js, essential for any server-side JavaScript development.
- Playwright Documentation Learn about this powerful tool for cross-browser automation.
- Google Apps Script Documentation Official documentation for Google Apps Script, useful for web scraping in Google Sheets and Google Docs.
- Puppeteer Documentation Official guide for Puppeteer, useful for scraping dynamic content.
- JSDom Documentation Official documentation for JSDom, a DOM implementation for server-side JavaScript.