How to Scrape Web Pages With Cheerio in Node.js

12 mins read Created Date: October 10, 2024   Updated Date: October 10, 2024

Cheerio is a fast and lightweight DOM (Document Object Model) manipulation library for Node.js, designed for server-side operations. It provides a jQuery-like API for parsing and manipulating HTML, making it an excellent choice for web scraping tasks. Cheerio parses HTML with parse5 by default and can optionally use the more forgiving htmlparser2, so it extracts data efficiently even from imperfect markup.

Key Features and Use Cases

  • Traversing and parsing HTML documents
  • Extracting specific elements from content
  • Manipulating DOM elements on the server side

Important Distinctions

  • Cheerio does not interpret the result as a web browser does.
  • It does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript.
  • Skipping these steps is what makes Cheerio much faster than browser-based solutions.

Unlike browser-based tools like Puppeteer, Cheerio doesn’t render JavaScript but focuses on parsing static HTML, making it ideal for scenarios where speed and efficiency are critical.

Note: If your use case requires JavaScript execution, CSS rendering, or browser automation, consider alternatives such as Puppeteer, Playwright, or jsdom (a DOM emulation project).

Setting Up the Development Environment

Dependencies

Before you begin web scraping with Cheerio, ensure you have Node.js installed on your system. Once Node.js is set up, you’ll use npm (Node Package Manager) to install the required packages for your project.

npm init -y  # Initialize a new Node.js project
npm install cheerio axios  # Install Cheerio and Axios for making HTTP requests

In this guide, we’ll be using the following Node.js libraries:

  • Axios: A promise-based HTTP client for making requests in both browser and Node.js environments.
  • Cheerio: A fast, flexible, and lean implementation of core jQuery, designed specifically for server-side use.

Understanding Cheerio Basics

Loading HTML and Making Requests

To begin scraping with Cheerio, you first need to provide it with HTML markup to parse. This is done using the load function. After loading the markup and initializing Cheerio, you can start manipulating and traversing the resulting data structure using Cheerio’s API.

Here’s a comprehensive example that fetches HTML from a website, extracts country data, processes it, and outputs the results:

const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://www.scrapethissite.com/pages/simple/';

axios.get(url)
  .then(response => {
    const html = response.data;
    const $ = cheerio.load(html);

    const countries = $('.country');
    const countryData = [];

    countries.each((index, element) => {
      const country = $(element);
      const name = country.find('.country-name').text().trim();
      const capital = country.find('.country-capital').text().trim();
      const populationStr = country.find('.country-population').text().trim();
      const areaStr = country.find('.country-area').text().trim();

      const population = parseInt(populationStr.replace(/,/g, ''), 10);
      const area = parseFloat(areaStr.replace(/,/g, ''));

      // Calculate population density (people per sq km)
      const density = area > 0 ? (population / area).toFixed(2) : 'N/A';

      countryData.push({ name, capital, population, area, density });
    });

    countryData.sort((a, b) => b.population - a.population);

    console.log(`Total countries found: ${countryData.length}`);
    console.log('Countries ordered by population (descending):');

    countryData.forEach((country, index) => {
      console.log(`${index + 1}. ${country.name} - Capital: ${country.capital}, Population: ${country.population.toLocaleString()}, Area: ${country.area} km², Density: ${country.density} people/km²`);
    });
  })
  .catch(error => console.log('Error:', error.message));

Important Notes on Cheerio’s Behavior

  • Automatic HTML Structure: Cheerio will automatically include <html>, <head>, and <body> elements in the rendered markup, similar to how browsers handle HTML (this only occurs if these elements aren’t already present in the parsed HTML).
  • Disabling Automatic HTML Structure: You can prevent Cheerio from adding these elements by passing false as a third argument to the load function:
const $ = cheerio.load(html, null, false);

Understanding Cheerio’s Selector Function

Cheerio’s main function for selecting elements has the following structure:

$(selector, [context], [root])

Let’s break down each parameter:

  • selector: This is used for targeting specific elements in the markup. It’s the starting point for traversing and manipulating the information. The selector can be:
    • A string (e.g., div.classname)
    • A DOM element
    • An array of elements
    • A Cheerio object
  • context: (Optional) This defines the scope or where to begin looking for the target elements. It can take the same forms as the selector.
  • root: (Optional) The document or markup string to search within. It defaults to the document that was passed to load.

Examples of Cheerio Selectors

Here are some common ways to use Cheerio’s selector function:

// Select all <a> elements
$('a')

// Select <div> elements with class 'content'
$('div.content')

// Select elements with id 'main-content'
$('#main-content')

// Select <li> elements that are direct children of <ul>
$('ul > li')

// Select the first <p> element
$('p:first')

// Select all <img> elements with a 'src' attribute
$('img[src]')

// Filter elements with a function ($ itself does not accept a function as a selector)
$('*').filter((i, elem) => elem.attribs.id === 'main')

These selectors allow you to precisely target the elements you want to extract or manipulate in your web scraping projects.

Advanced HTML Parsing with Cheerio

Scraping a Table

const axios = require('axios');
const cheerio = require('cheerio');
const qs = require('querystring');

const baseUrl = 'https://www.scrapethissite.com/pages/forms/';

async function scrapeHockeyTeams(searchTerm = 'north', perPage = 25) {
  try {
    // First, submit the search form
    const searchParams = qs.stringify({ q: searchTerm, per_page: perPage });
    const response = await axios.get(`${baseUrl}?${searchParams}`);
    let $ = cheerio.load(response.data);

    // Check if we need to adjust the per_page value
    const totalResults = $('.team').length;
    if (totalResults >= perPage && perPage !== 100) {
      console.log(`At least ${perPage} results found. Adjusting to 100 per page.`);
      return scrapeHockeyTeams(searchTerm, 100);
    }

    // Extract table data
    const tableData = [];
    $('.team').each((index, element) => {
      const team = $(element);
      const teamData = {
        name: team.find('.name').text().trim(),
        year: team.find('.year').text().trim(),
        wins: team.find('.wins').text().trim(),
        losses: team.find('.losses').text().trim(),
        otLosses: team.find('.ot-losses').text().trim(),
        winPercentage: team.find('.pct').text().trim(),
        goalsFor: team.find('.gf').text().trim(),
        goalsAgainst: team.find('.ga').text().trim(),
        plusMinus: team.find('.diff').text().trim()
      };
      tableData.push(teamData);
    });

    console.log(`Total teams found: ${tableData.length}`);
    console.log(`First ${Math.min(5, tableData.length)} teams:`);
    console.log(tableData.slice(0, 5));

    return tableData;
  } catch (error) {
    console.error('Error:', error.message);
  }
}

// Run the scraper
scrapeHockeyTeams();

Extracting Attributes (e.g., Image URLs)

// attr() returns the attribute value of the first matched element only
const imageSrc = $('img').attr('src');
console.log('Image Source:', imageSrc);

Handling Pagination

const axios = require('axios');
const cheerio = require('cheerio');
const qs = require('querystring');

const baseUrl = 'https://www.scrapethissite.com/pages/forms/';

async function scrapeHockeyTeams(searchTerm = '', maxPages = 3, perPage = 100) {
  let currentPage = 1;
  let allTeams = [];

  while (currentPage <= maxPages) {
    try {
      const params = qs.stringify({
        q: searchTerm,
        per_page: perPage,
        page_num: currentPage
      });
      const url = `${baseUrl}?${params}`;
      console.log(`Scraping page ${currentPage}: ${url}`);

      const response = await axios.get(url);
      const $ = cheerio.load(response.data);

      $('.team').each((index, element) => {
        const team = $(element);
        const teamData = {
          name: team.find('.name').text().trim(),
          year: parseInt(team.find('.year').text().trim(), 10),
          wins: parseInt(team.find('.wins').text().trim(), 10),
          losses: parseInt(team.find('.losses').text().trim(), 10),
          otLosses: team.find('.ot-losses').text().trim(),
          winPercentage: parseFloat(team.find('.pct').text().trim()),
          goalsFor: parseInt(team.find('.gf').text().trim(), 10),
          goalsAgainst: parseInt(team.find('.ga').text().trim(), 10),
          plusMinus: parseInt(team.find('.diff').text().trim(), 10)
        };
        allTeams.push(teamData);
      });

      console.log(`Page ${currentPage}: Found ${$('.team').length} teams`);

      // Check if there's a next page
      const nextPageLink = $('.pagination li:last-child a').attr('href');
      if (!nextPageLink || currentPage >= maxPages) {
        break;
      }

      currentPage++;
    } catch (error) {
      console.error(`Error on page ${currentPage}:`, error.message);
      break;
    }
  }

  // Sort teams by wins in descending order
  allTeams.sort((a, b) => b.wins - a.wins);

  console.log(`Total teams scraped: ${allTeams.length}`);
  console.log('Top 10 teams by wins:');
  console.log(JSON.stringify(allTeams.slice(0, 10), null, 2));

  return allTeams;
}

// Run the scraper
scrapeHockeyTeams();

Error Handling

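// url and currentPage are supplied by the calling pagination logic
const scrapePage = (url, currentPage) => {
  return axios.get(`${url}?page=${currentPage}`).then(response => {
    // ... (scraping logic)
  }).catch(error => {
    if (error.response && error.response.status === 404) {
      // A 404 here means a page past the end was requested
      console.log('Reached the last page');
    } else {
      console.error('Request failed:', error.message);
    }
  });
};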
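
Transient failures such as timeouts or 5xx responses are often worth retrying. The helper below is a minimal sketch of retrying with exponential backoff. The name fetchWithRetry and its parameters are illustrative, and the status check assumes axios-style errors that expose error.response.status.

```javascript
// Retry an async operation with exponential backoff.
// fetchWithRetry, retries, and baseDelayMs are illustrative names.
async function fetchWithRetry(operation, retries = 3, baseDelayMs = 500) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Assumes axios-style errors: client errors such as 404 will not
      // succeed on retry, so rethrow them immediately (429 is retried)
      const status = error.response && error.response.status;
      if (status && status >= 400 && status < 500 && status !== 429) {
        throw error;
      }
      if (attempt < retries) {
        // Wait baseDelayMs, then 2x, 4x, ... before the next attempt
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

// Usage with axios (assumed to be required elsewhere):
// const response = await fetchWithRetry(() => axios.get(url));
```

Wrapping the request in a function keeps the helper independent of the HTTP client, so the same code can retry a Puppeteer navigation as well.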

Handling Dynamic Content

For content loaded dynamically via JavaScript, combine Cheerio with a headless browser like Puppeteer:

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrapeOscarWinners() {
  console.log('Launching browser...');
  const browser = await puppeteer.launch({ headless: false }); // headless: false shows the browser window, which helps when debugging
  const page = await browser.newPage();
  console.log('Navigating to the page...');
  await page.goto('https://www.scrapethissite.com/pages/ajax-javascript/', { waitUntil: 'networkidle2' });

  console.log('Waiting for #oscars container...');
  await page.waitForSelector('#oscars');

  console.log('Getting years...');
  const years = await page.evaluate(() => {
    const container = document.querySelector('#oscars');
    const yearLinks = Array.from(container.querySelectorAll('.year-link'));
    console.log('Year links found:', yearLinks.length);
    return yearLinks.map(el => el.textContent.trim()).sort((a, b) => b - a);
  });
  console.log('Years found:', years);

  const allFilms = {};

  for (const year of years) {
    console.log(`\nScraping data for year ${year}`);

    try {
      console.log(`Clicking on year ${year} link...`);
      await page.evaluate((yearToClick) => {
        const yearLink = Array.from(document.querySelectorAll('.year-link'))
          .find(el => el.textContent.trim() === yearToClick);
        if (yearLink) {
          yearLink.click();
        } else {
          throw new Error(`Year link for ${yearToClick} not found`);
        }
      }, year);

      console.log('Waiting for table body to be populated...');
      await page.waitForFunction(() => {
        const tableBody = document.querySelector('#table-body');
        return tableBody && tableBody.children.length > 0;
      }, { timeout: 5000 });

      console.log('Getting page content...');
      const content = await page.content();
      const $ = cheerio.load(content);

      const films = [];
      $('tr.film').each((index, element) => {
        const film = {
          title: $(element).find('.film-title').text().trim(),
          nominations: parseInt($(element).find('.film-nominations').text().trim(), 10),
          awards: parseInt($(element).find('.film-awards').text().trim(), 10),
          bestPicture: $(element).find('.film-best-picture i').length > 0
        };
        films.push(film);
      });

      console.log(`Found ${films.length} films for year ${year}`);
      allFilms[year] = films;
    } catch (error) {
      console.error(`Error scraping data for year ${year}:`, error);
    }
  }

  console.log('Closing browser...');
  await browser.close();

  // Print a summary of the data
  for (const [year, films] of Object.entries(allFilms)) {
    console.log(`\nYear ${year}:`);
    console.log(`Total films: ${films.length}`);
    console.log(`Films with most awards:`);
    const maxAwards = Math.max(...films.map(f => f.awards));
    films.filter(f => f.awards === maxAwards).forEach(f => {
      console.log(`- ${f.title} (${f.awards} awards, ${f.nominations} nominations)`);
    });
  }

  return allFilms;
}

scrapeOscarWinners().then(data => {
  console.log('Scraping completed.');
}).catch(error => {
  console.error('An error occurred:', error);
});

Performance Optimization

Techniques for optimization include:

  1. Rate Limiting
  2. Caching
  3. Asynchronous Processing
  4. Efficient Data Storage
  5. Using proxy services like Scrape.do for IP rotation and handling dynamic content
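
The first two techniques can be sketched together as a small wrapper that spaces requests out and remembers pages it has already fetched. The names createPoliteFetcher and minDelayMs are made up for this example, and the fetch function is passed in so the wrapper does not depend on a specific HTTP client.

```javascript
// Wrap a fetch function with a minimum delay between requests
// and an in-memory cache keyed by URL.
// createPoliteFetcher and minDelayMs are illustrative names.
function createPoliteFetcher(fetchFn, minDelayMs = 1000) {
  const cache = new Map();
  let lastRequestAt = 0;
  let queue = Promise.resolve();

  return function fetchPage(url) {
    // Caching: reuse the stored promise for URLs already requested
    if (cache.has(url)) return cache.get(url);

    // Rate limiting: chain onto the queue so requests run one at a time
    const result = queue.then(async () => {
      const wait = lastRequestAt + minDelayMs - Date.now();
      if (wait > 0) await new Promise(resolve => setTimeout(resolve, wait));
      lastRequestAt = Date.now();
      return fetchFn(url);
    });

    // Keep the queue usable even if this request fails
    queue = result.catch(() => {});
    cache.set(url, result);
    // Do not keep failed requests in the cache
    result.catch(() => cache.delete(url));
    return result;
  };
}

// Usage with axios (assumed to be required elsewhere):
// const fetchPage = createPoliteFetcher(url => axios.get(url).then(r => r.data), 1000);
// const html = await fetchPage('https://www.scrapethissite.com/pages/simple/');
```

An in-memory Map is enough for a single run. For repeated runs, the same idea applies with a cache on disk or in a database.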

Example using Scrape.do:

const scrapeDoUrl = 'http://[email protected]:8080?render=true';
axios.get(scrapeDoUrl)
  .then(response => console.log(response.data))
  .catch(err => console.error(err));

Saving Scraped Data to CSV Files

After successfully scraping data from a website, the next step is to save it in a format that’s easy to analyze and manipulate. CSV (Comma-Separated Values) is a popular choice due to its simplicity and compatibility with data analysis tools and spreadsheet applications.

We’ll use the json2csv package to convert the scraped objects to CSV. Install it with npm:

npm install json2csv

We’ll also use the built-in fs (File System) module to write our CSV file to disk.

Importing Required Modules

At the top of your script, import the necessary modules:

const { Parser } = require('json2csv');
const fs = require('fs').promises;

Note that we’re using the promise-based version of the fs module for better asynchronous handling.

Creating a Function to Save Data as CSV

Let’s create an asynchronous function that accepts the scraped data, converts it to CSV, and saves it to a file:

async function saveToCSV(data, filename) {
  try {
    // Define the fields for the CSV file
    const fields = ['year', 'title', 'nominations', 'awards', 'bestPicture'];

    // Create a new parser instance with the defined fields
    const parser = new Parser({ fields });

    // Convert the data to CSV format
    const csv = parser.parse(data);

    // Write the CSV to a file
    await fs.writeFile(filename, csv);

    console.log(`Data successfully saved to ${filename}`);
  } catch (error) {
    console.error('Error saving data to CSV:', error);
  }
}
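
The earlier Puppeteer example stores films in an object keyed by year, while saveToCSV expects a flat array of rows that each carry a year field. If you start from the keyed object, a small helper can bridge the two. The name flattenFilmsByYear and the sample data are hypothetical.

```javascript
// Convert { year: [film, ...] } into a flat array of rows with a year field.
// flattenFilmsByYear is an illustrative helper name.
function flattenFilmsByYear(filmsByYear) {
  return Object.entries(filmsByYear).flatMap(([year, films]) =>
    films.map(film => ({ year, ...film }))
  );
}

// Hypothetical sample data in the shape produced by the earlier example
const sample = {
  2015: [{ title: 'Film A', nominations: 5, awards: 2, bestPicture: true }],
  2014: [
    { title: 'Film B', nominations: 3, awards: 1, bestPicture: false },
    { title: 'Film C', nominations: 2, awards: 0, bestPicture: false }
  ]
};

const rows = flattenFilmsByYear(sample);
console.log(rows.length, rows[0]);

// The rows can then be passed to the function above:
// await saveToCSV(rows, 'oscar_winners.csv');
```

Note that Object.entries returns keys as strings, so year ends up as a string in each row. Convert it with Number(year) if you need a numeric column.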

Using the Save Function

After you’ve scraped your data, you can use this function to save it to a CSV file. Here’s an example of how you might use it in your scraping script:

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const { Parser } = require('json2csv');
const fs = require('fs').promises;

async function saveToCSV(data, filename) {
  try {
    // Define the fields for the CSV file
    const fields = ['year', 'title', 'nominations', 'awards', 'bestPicture'];

    // Create a new parser instance with the defined fields
    const parser = new Parser({ fields });

    // Convert the data to CSV format
    const csv = parser.parse(data);

    // Write the CSV to a file
    await fs.writeFile(filename, csv);

    console.log(`Data successfully saved to ${filename}`);
  } catch (error) {
    console.error('Error saving data to CSV:', error);
  }
}

async function scrapeOscarWinners() {
  console.log('Launching browser...');
  const browser = await puppeteer.launch({ headless: false }); // headless: false shows the browser window, which helps when debugging
  const page = await browser.newPage();
  console.log('Navigating to the page...');
  await page.goto('https://www.scrapethissite.com/pages/ajax-javascript/', { waitUntil: 'networkidle2' });

  console.log('Waiting for #oscars container...');
  await page.waitForSelector('#oscars');

  console.log('Getting years...');
  const years = await page.evaluate(() => {
    const container = document.querySelector('#oscars');
    const yearLinks = Array.from(container.querySelectorAll('.year-link'));
    console.log('Year links found:', yearLinks.length);
    return yearLinks.map(el => el.textContent.trim()).sort((a, b) => b - a);
  });
  console.log('Years found:', years);

  const allFilms = [];

  for (const year of years) {
    console.log(`\nScraping data for year ${year}`);

    try {
      console.log(`Clicking on year ${year} link...`);
      await page.evaluate((yearToClick) => {
        const yearLink = Array.from(document.querySelectorAll('.year-link'))
          .find(el => el.textContent.trim() === yearToClick);
        if (yearLink) {
          yearLink.click();
        } else {
          throw new Error(`Year link for ${yearToClick} not found`);
        }
      }, year);

      console.log('Waiting for table body to be populated...');
      await page.waitForFunction(() => {
        const tableBody = document.querySelector('#table-body');
        return tableBody && tableBody.children.length > 0;
      }, { timeout: 5000 });

      console.log('Getting page content...');
      const content = await page.content();
      const $ = cheerio.load(content);

      $('tr.film').each((index, element) => {
        const film = {
          year: year,
          title: $(element).find('.film-title').text().trim(),
          nominations: parseInt($(element).find('.film-nominations').text().trim(), 10),
          awards: parseInt($(element).find('.film-awards').text().trim(), 10),
          bestPicture: $(element).find('.film-best-picture i').length > 0
        };
        allFilms.push(film);
      });

      console.log(`Found ${$('tr.film').length} films for year ${year}`);
    } catch (error) {
      console.error(`Error scraping data for year ${year}:`, error);
    }
  }

  console.log('Closing browser...');
  await browser.close();

  // Print a summary of the data
  const filmsByYear = allFilms.reduce((acc, film) => {
    if (!acc[film.year]) acc[film.year] = [];
    acc[film.year].push(film);
    return acc;
  }, {});

  for (const [year, films] of Object.entries(filmsByYear)) {
    console.log(`\nYear ${year}:`);
    console.log(`Total films: ${films.length}`);
    console.log(`Films with most awards:`);
    const maxAwards = Math.max(...films.map(f => f.awards));
    films.filter(f => f.awards === maxAwards).forEach(f => {
      console.log(`- ${f.title} (${f.awards} awards, ${f.nominations} nominations)`);
    });
  }

  // Save data to CSV
  await saveToCSV(allFilms, 'oscar_winners.csv');

  return allFilms;
}

scrapeOscarWinners().then(data => {
  console.log('Scraping completed.');
}).catch(error => {
  console.error('An error occurred:', error);
});

In this example, we’re scraping the Oscar-winning films table for each year, pushing each row’s data into the allFilms array, and then using our saveToCSV function to save this data to a file named oscar_winners.csv.

A Bonus: Google Apps Script Usage

You can also use Cheerio in Google Apps Script through the CheerioGS library, which adds robust HTML parsing to your scripts and lets you combine scraped data with other Google services such as Sheets, Forms, and Slides. Here’s how to get started:

As an example, you can follow the steps below:

  1. Add the library to your project (Project Key: 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0). To check the library key, go to the Google Apps Script Directory and search for “CheerioGS”.
  2. Use it in your code:
function scrapeWebsite() {
  // Fetch HTML content
  const url = 'https://www.scrapethissite.com/pages/forms/';
  const content = UrlFetchApp.fetch(url).getContentText();

  // Load content into Cheerio
  const $ = Cheerio.load(content);

  // Extract information
  const title = $('title').text();
  const firstParagraph = $('p').first().text();

  // Log the results
  Logger.log('Page Title: ' + title);
  Logger.log('First Paragraph: ' + firstParagraph);
}

function updateSheetWithScrapedData() {
  const sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  const url = 'https://example.com/data';
  const content = UrlFetchApp.fetch(url).getContentText();
  const $ = Cheerio.load(content);

  $('table tr').each((index, element) => {
    const tds = $(element).find('td');
    // Skip header rows, which contain <th> cells instead of <td>
    if (tds.length === 0) return;
    const rowData = [
      $(tds[0]).text(),
      $(tds[1]).text(),
      $(tds[2]).text()
    ];
    sheet.appendRow(rowData);
  });
}

By leveraging Cheerio in Google Apps Script, you can create powerful web scraping and HTML manipulation tools that integrate seamlessly with Google’s ecosystem of services. Remember to always scrape responsibly and in compliance with websites’ terms of service and robots.txt files.

Conclusion

This guide covers web scraping with Cheerio in Node.js, focusing on handling static HTML efficiently. For dynamic content, consider using services like Scrape.do or combining Cheerio with headless browsers.

Additional Resources:

To support your continued learning and development in web scraping, here are some valuable resources: