Category: Scraping basics

Quick Guide to Building Your First NodeJS Web Scraper

7 min read · Created: May 15, 2025 · Updated: May 16, 2025

Collecting data by hand feels like watching paint dry; at scale it becomes a non-starter.

With Node.js you can turn that slog into an automated pipeline that fetches pages, extracts the nuggets you need, and saves them in neat, machine-readable files.

Market research, competitor monitoring, custom dashboards; they all start with reliable scraping.

Here’s what you’ll learn in the next few minutes:

  • Setting up Node.js and npm (or nvm if you want an easy version manager)
  • Installing essential libraries with npm
  • Making your first HTTP request and checking the response code
  • Parsing the returned HTML using Cheerio’s jQuery-style selectors
  • Writing scraped data to JSON and CSV for easy analysis
  • Previewing headless browsers and anti-bot tactics so JavaScript-heavy sites and CAPTCHAs do not trip you up

By the end you will have a working scraper in a single JavaScript file: the foundation for bigger automation projects.

Preparing Your Node.js Environment

Before you write a single line of scraping code, make sure your toolkit is in place; once these short steps are done, the rest of the guide will run smoothly.

Install Node.js (and npm)

  • Head to the official Node.js site and download the current LTS installer (Windows and macOS) or use your package manager on Linux.
  • The installer bundles npm (Node Package Manager), so you get both with one click.
  • nvm on macOS/Linux or nvm-windows on Windows lets you switch Node versions effortlessly; this prevents “works-on-my-machine” problems when different projects need different runtimes.

  • After installing nvm, two quick commands install and activate the latest LTS release:

    nvm install --lts
    nvm use --lts
    

Confirm everything is ready

Open a terminal and run:

node -v
npm -v

You should see two version numbers (for Node.js and npm). If either command fails, double-check your installation or PATH.

Create a workspace and initialize the project

mkdir node-web-scraper
cd node-web-scraper
npm init -y      # generates a minimal package.json

A dedicated folder keeps dependencies and output files organized; the -y flag accepts sensible defaults.

Pick an editor and open the folder

Most Node developers start with VS Code (lightweight and feature-rich); others prefer WebStorm (full IDE) or Sublime Text (fast and minimal). Open the node-web-scraper folder in your editor so IntelliSense and linting work immediately.

Run a quick sanity check

  1. Inside the project, create scraper.js with:

    console.log('Scraping environment is ready!');
    
  2. In the terminal, execute:

    node scraper.js
    

    You should see:

    Scraping environment is ready!
    

If that message appears, Node.js is installed, npm is configured, your editor is set, and the project is ready for the next step: sending your first HTTP request.

Sending Your First Request

Before we parse or export anything, we need proof that our program can reach a web page and receive a valid response. In Node.js the easiest way to do that is with axios, a promise-based HTTP client that works both in Node and in the browser.

Install axios

npm install axios

When the command finishes you can confirm it worked:

npm list axios        # shows the version in the dependency tree

A quick library sanity check

Create a file named check-axios.js.

// check-axios.js
const axiosPkg = require('axios/package.json');
console.log('Axios version:', axiosPkg.version);

Run it with:

node check-axios.js

If a version string appears, axios is ready to use; you can delete the file once you are satisfied.

Fetch the HTTP status code

Now let us write a short script that requests a practice page and prints its status code. Place the following in status-test.js.

const axios = require('axios');

async function testStatus() {
  try {
    const url = 'https://www.scrapethissite.com/pages/simple/';
    const response = await axios.get(url);
    console.log('Status Code:', response.status);
  } catch (err) {
    console.error('Request failed:', err.message);
  }
}

testStatus();

Run:

node status-test.js

A healthy request prints:

Status Code: 200

200 means the page loaded successfully.
Note that axios treats any status outside the 2xx range as an error by default, so a 403 (the site blocked your request, often an anti-bot measure) or a 404 (the page does not exist) will surface in the catch block rather than being printed by the status line.
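
Because those error statuses land on the error object, it helps to read the code from err.response when the server did answer. The sketch below is an optional variation on status-test.js (the User-Agent string is only a placeholder; adjust or drop it as you like) that does exactly that and adds a browser-like header, which sometimes helps with 403 responses:

const axios = require('axios');

async function testStatusVerbose() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  try {
    const response = await axios.get(url, {
      // Some sites reject requests without a browser-like User-Agent (placeholder value).
      headers: { 'User-Agent': 'Mozilla/5.0 (compatible; my-first-scraper)' }
    });
    console.log('Status Code:', response.status);
  } catch (err) {
    if (err.response) {
      // The server answered, but with an error status such as 403 or 404.
      console.error('Server responded with:', err.response.status);
    } else {
      // No response at all: DNS failure, timeout, connection refused, and so on.
      console.error('Request failed:', err.message);
    }
  }
}

testStatusVerbose();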

Preview a slice of the HTML

Seeing the raw markup helps when you debug parsing issues.

Add one extra line inside the try block of the previous script, right after the status log:

console.log('HTML preview:', response.data.slice(0, 500));

Run it again; you will see the opening of the document including the <title> and the first heading. This confirms that axios delivered the full HTML and that your program can move on to parsing with Cheerio in the next section.
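
If you would rather inspect the markup at leisure, one option is to dump the whole response to a local file and open it in your editor. This is only a small sketch; the snapshot.html filename is arbitrary:

const fs = require('fs');
const axios = require('axios');

async function saveSnapshot() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);

  // Write the raw HTML next to the script so you can test selectors offline.
  fs.writeFileSync('snapshot.html', html);
  console.log('Saved', html.length, 'characters to snapshot.html');
}

saveSnapshot();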

Parsing and Extracting Data with Cheerio

Fetching the page confirms that our request works; the next goal is to navigate the HTML and pull out the exact elements we care about.

In Node.js the go-to library for this job is Cheerio; it gives you a jQuery-style API for querying a DOM that lives entirely in memory (no browser needed).

Install Cheerio

npm install cheerio

Don’t forget to verify the install:

npm list cheerio
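
Before pointing Cheerio at a live page, a tiny sketch makes the “DOM in memory” idea concrete: you can load markup from a plain string, with no HTTP request involved. The HTML below is made up purely for illustration:

const cheerio = require('cheerio');

// A hard-coded snippet stands in for a downloaded page.
const html = '<ul><li class="fruit">Apple</li><li class="fruit">Pear</li></ul>';

const $ = cheerio.load(html);

// The familiar jQuery-style calls work on the in-memory document.
$('li.fruit').each((_, el) => {
  console.log($(el).text());   // prints "Apple", then "Pear"
});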

Load HTML and read the <h1> heading

Create parse-h1.js.

const axios = require('axios');
const cheerio = require('cheerio');

async function grabHeading() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);

  const $ = cheerio.load(html);              // turn raw HTML into a queryable object
  const h1Text = $('h1').first().text().trim();

  console.log('Page title:', h1Text);
}

grabHeading();

Run it:

node parse-h1.js

Expected output:

Page title: Countries of the World: A Simple Example

Extract every country name and capital

Each country block on the demo page looks like this (simplified):

<div class="country">
  <h3 class="country-name">Andorra</h3>
  <p>Capital: <span class="country-capital">Andorra la Vella</span></p>
</div>

Knowing that structure we can loop through all div.country nodes and grab the nested elements.

Create scrape-countries.js.

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeCountries() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  const results = [];

  $('div.country').each((_, el) => {
    const name = $(el).find('h3.country-name').text().trim();
    const capital = $(el).find('span.country-capital').text().trim();
    results.push({ country: name, capital });
  });

  console.log(`Total countries found: ${results.length}`);
  console.log(results.slice(0, 5));   // preview the first five objects
}

scrapeCountries();

Sample output:

Total countries found: 250
[
  { country: 'Andorra', capital: 'Andorra la Vella' },
  { country: 'United Arab Emirates', capital: 'Abu Dhabi' },
  { country: 'Afghanistan', capital: 'Kabul' },
  { country: 'Antigua and Barbuda', capital: "St. John's" },
  { country: 'Anguilla', capital: 'The Valley' }
]

If nothing shows up

Sometimes selectors fail because a website changes its HTML. Right-click the page, choose Inspect (or View Source) and confirm the tags and classes.

💡 For example, if the country-name heading changed to <h2 class="name">, you would adjust:

const name = $(el).find('h2.name').text().trim();
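
A cheap early-warning check is to log how many nodes the main selector matched before looping; zero matches means the selector, not your extraction code, is the problem. The sketch below simply reuses the selectors from scrape-countries.js:

const axios = require('axios');
const cheerio = require('cheerio');

async function checkSelectors() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  // If the main selector matches nothing, the page structure has probably changed.
  const count = $('div.country').length;
  if (count === 0) {
    console.warn('No country blocks found; inspect the page and update the selectors.');
  } else {
    console.log(`Selector matched ${count} country blocks.`);
  }
}

checkSelectors();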

The skill of reading page structure is fundamental; with a reliable set of selectors you can scrape almost any static site.

Exporting Your Scraped Data

Printing objects to the console is fine for a sanity check; real projects need the data saved in a format other tools can read. The two staples are JSON (great for APIs and JavaScript apps) and CSV (ideal for spreadsheets and databases).

Save as JSON

Node already knows how to turn objects into JSON strings; you only need the built-in fs module.

Create save-json.js.

const fs   = require('fs');
const axios = require('axios');
const cheerio = require('cheerio');

async function run() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  const countries = [];

  $('div.country').each((_, el) => {
    const name    = $(el).find('h3.country-name').text().trim();
    const capital = $(el).find('span.country-capital').text().trim();
    countries.push({ country: name, capital });
  });

  fs.writeFileSync('countries.json', JSON.stringify(countries, null, 2));
  console.log('✔ Data written to countries.json');
}

run();

Run the script:

node save-json.js

Open countries.json and you’ll see pretty-printed data:

[
  {
    "country": "Andorra",
    "capital": "Andorra la Vella"
  },
  {
    "country": "United Arab Emirates",
    "capital": "Abu Dhabi"
  }
  // …
]
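
To confirm the export is valid JSON, you can read the file straight back with the same fs module. A quick sketch, assuming countries.json sits next to the script:

const fs = require('fs');

// Parse the exported file and preview a record.
const countries = JSON.parse(fs.readFileSync('countries.json', 'utf8'));
console.log(`Loaded ${countries.length} records`);
console.log(countries[0]);   // { country: 'Andorra', capital: 'Andorra la Vella' }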

Save as CSV

CSV is just plain text with commas; you could build the string yourself, but a helper library avoids edge cases (commas inside names, Unicode, line endings). A tiny, no-dependency package called csv-writer does the job.
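
The capitals on this demo page happen to be comma-free, but a quick sketch with made-up values shows the kind of breakage hand-rolled CSV invites, and why quoting and escaping are better left to a library:

// A made-up row whose value contains a comma.
const row = { Country: 'Fictionland', Capital: 'Port, Royal' };

// Naive string building splits the capital into two columns.
const naive = `${row.Country},${row.Capital}`;
console.log(naive);    // Fictionland,Port, Royal

// Quoting fixes this case, but then embedded quotes, Unicode and
// line endings all need handling too - exactly what csv-writer covers.
const quoted = `"${row.Country}","${row.Capital.replace(/"/g, '""')}"`;
console.log(quoted);   // "Fictionland","Port, Royal"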

npm install csv-writer

Create save-csv.js.

const axios      = require('axios');
const cheerio    = require('cheerio');
const { createObjectCsvWriter } = require('csv-writer');

async function run() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  const rows = [];

  $('div.country').each((_, el) => {
    rows.push({
      Country:  $(el).find('h3.country-name').text().trim(),
      Capital:  $(el).find('span.country-capital').text().trim()
    });
  });

  const csvWriter = createObjectCsvWriter({
    path      : 'countries.csv',
    header    : [
      { id: 'Country', title: 'Country' },
      { id: 'Capital', title: 'Capital' }
    ],
    fieldDelimiter: ',',              // default
    alwaysQuote  : true               // keeps commas in values safe
  });

  await csvWriter.writeRecords(rows);
  console.log('✔ Data written to countries.csv');
}

run();

Run:

node save-csv.js

Open countries.csv in Excel, Google Sheets, or a text editor:

"Country","Capital"
"Andorra","Andorra la Vella"
"United Arab Emirates","Abu Dhabi"
"Afghanistan","Kabul"

So your two choices are:

  • JSON: perfect for programmatic consumption or RESTful APIs.
  • CSV: quick to inspect in spreadsheets, easy to import into SQL.

With reliable export in place, the next challenge is scraping sites that hide content behind JavaScript. We’ll tackle that with a headless browser in the upcoming section.