Category: Scraping basics

Quick Guide to Building Your First NodeJS Web Scraper

7 min read · Created: May 15, 2025 · Updated: May 16, 2025

Collecting data by hand feels like watching paint dry; at scale it becomes a non-starter.

With Node.js you can turn that slog into an automated pipeline that fetches pages, extracts the nuggets you need, and saves them in neat, machine-readable files.

Market research, competitor monitoring, custom dashboards; they all start with reliable scraping.

Here’s what you’ll learn in the next few minutes:

  • Setting up Node.js and npm (or nvm if you want an easy version manager)
  • Installing essential libraries with npm
  • Making your first HTTP request and checking the response code
  • Parsing the returned HTML using Cheerio’s jQuery-style selectors
  • Writing scraped data to JSON and CSV for easy analysis
  • Previewing headless browsers and anti-bot tactics so JavaScript-heavy sites and CAPTCHAs do not trip you up

By the end you will have a working scraper in a single JavaScript file: the foundation for bigger automation projects.

Preparing Your Node.js Environment

Before you write a single line of scraping code, make sure your toolkit is in place; once these short steps are done, the rest of the guide will run smoothly.

Install Node.js (and npm)

  • Head to the official Node.js site and download the current LTS installer (Windows and macOS) or use your package manager on Linux.
  • The installer bundles npm (Node Package Manager), so you get both with one click.
  • nvm on macOS/Linux or nvm-windows on Windows lets you switch Node versions effortlessly; this prevents “works-on-my-machine” problems when different projects need different runtimes.

  • After installing nvm, two quick commands install and activate the latest LTS release:

    nvm install --lts
    nvm use --lts
    

Confirm everything is ready

Open a terminal and run:

node -v
npm -v

You should see two version numbers (for Node.js and npm). If either command fails, double-check your installation or PATH.

Create a workspace and initialize the project

mkdir node-web-scraper
cd node-web-scraper
npm init -y      # generates a minimal package.json

A dedicated folder keeps dependencies and output files organized; the -y flag accepts sensible defaults.

Pick an editor and open the folder

Most Node developers start with VS Code (lightweight and feature-rich); others prefer WebStorm (full IDE) or Sublime Text (fast and minimal). Open the node-web-scraper folder in your editor so IntelliSense and linting work immediately.

Run a quick sanity check

  1. Inside the project, create scraper.js with:

    console.log('Scraping environment is ready!');
    
  2. In the terminal, execute:

    node scraper.js
    

    You should see:

    Scraping environment is ready!
    

If that message appears, Node.js is installed, npm is configured, your editor is set, and the project is ready for the next step: sending your first HTTP request.

Sending Your First Request

Before we parse or export anything, we need proof that our program can reach a web page and receive a valid response. In Node.js the easiest way to do that is with axios, a promise-based HTTP client that works both in Node and in the browser.

Install axios

npm install axios

When the command finishes you can confirm it worked:

npm list axios        # shows the version in the dependency tree

A quick library sanity check

Create a file named check-axios.js.

// check-axios.js
const axiosPkg = require('axios/package.json');
console.log('Axios version:', axiosPkg.version);

Run it with:

node check-axios.js

If a version string appears, axios is ready to use; you can delete the file once you are satisfied.

Fetch the HTTP status code

Now let us write a short script that requests a practice page and prints its status code. Place the following in status-test.js.

const axios = require('axios');

async function testStatus() {
  try {
    const url = 'https://www.scrapethissite.com/pages/simple/';
    const response = await axios.get(url);
    console.log('Status Code:', response.status);
  } catch (err) {
    console.error('Request failed:', err.message);
  }
}

testStatus();

Run:

node status-test.js

A healthy request prints:

Status Code: 200

200 means the page loaded successfully.
Note that axios treats any status outside the 2xx range as an error by default, so a 403 (the site blocked your request, often an anti-bot measure) or a 404 (the page does not exist) will surface in the catch block rather than being printed by the status line.
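
Because those error statuses land on the error object, it helps to read the code from err.response when the server did answer. The sketch below is an optional variation on status-test.js (the User-Agent string is only a placeholder; adjust or drop it as you like) that does exactly that and adds a browser-like header, which sometimes helps with 403 responses:

const axios = require('axios');

async function testStatusVerbose() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  try {
    const response = await axios.get(url, {
      // Some sites reject requests without a browser-like User-Agent (placeholder value).
      headers: { 'User-Agent': 'Mozilla/5.0 (compatible; my-first-scraper)' }
    });
    console.log('Status Code:', response.status);
  } catch (err) {
    if (err.response) {
      // The server answered, but with an error status such as 403 or 404.
      console.error('Server responded with:', err.response.status);
    } else {
      // No response at all: DNS failure, timeout, connection refused, and so on.
      console.error('Request failed:', err.message);
    }
  }
}

testStatusVerbose();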

Preview a slice of the HTML

Seeing the raw markup helps when you debug parsing issues.

Add one extra line inside the try block of the previous script, right after the status log:

console.log('HTML preview:', response.data.slice(0, 500));

Run it again; you will see the opening of the document including the <title> and the first heading. This confirms that axios delivered the full HTML and that your program can move on to parsing with Cheerio in the next section.
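
If you would rather inspect the markup at leisure, one option is to dump the whole response to a local file and open it in your editor. This is only a small sketch; the snapshot.html filename is arbitrary:

const fs = require('fs');
const axios = require('axios');

async function saveSnapshot() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);

  // Write the raw HTML next to the script so you can test selectors offline.
  fs.writeFileSync('snapshot.html', html);
  console.log('Saved', html.length, 'characters to snapshot.html');
}

saveSnapshot();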

Parsing and Extracting Data with Cheerio

Fetching the page confirms that our request works; the next goal is to navigate the HTML and pull out the exact elements we care about.

In Node.js the go-to library for this job is Cheerio; it gives you a jQuery-style API for querying a DOM that lives entirely in memory (no browser needed).

Install Cheerio

npm install cheerio

Don’t forget to verify the install:

npm list cheerio
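
Before pointing Cheerio at a live page, a tiny sketch makes the “DOM in memory” idea concrete: you can load markup from a plain string, with no HTTP request involved. The HTML below is made up purely for illustration:

const cheerio = require('cheerio');

// A hard-coded snippet stands in for a downloaded page.
const html = '<ul><li class="fruit">Apple</li><li class="fruit">Pear</li></ul>';

const $ = cheerio.load(html);

// The familiar jQuery-style calls work on the in-memory document.
$('li.fruit').each((_, el) => {
  console.log($(el).text());   // prints "Apple", then "Pear"
});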

Load HTML and read the <h1> heading

Create parse-h1.js.

const axios = require('axios');
const cheerio = require('cheerio');

async function grabHeading() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);

  const $ = cheerio.load(html);              // turn raw HTML into a queryable object
  const h1Text = $('h1').first().text().trim();

  console.log('Page title:', h1Text);
}

grabHeading();

Run it:

node parse-h1.js

Expected output:

Page title: Countries of the World: A Simple Example

Extract every country name and capital

Each country block on the demo page looks like this (simplified):

<div class="country">
  <h3 class="country-name">Andorra</h3>
  <p>Capital: <span class="country-capital">Andorra la Vella</span></p>
</div>

Knowing that structure we can loop through all div.country nodes and grab the nested elements.

Create scrape-countries.js.

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeCountries() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  const results = [];

  $('div.country').each((_, el) => {
    const name = $(el).find('h3.country-name').text().trim();
    const capital = $(el).find('span.country-capital').text().trim();
    results.push({ country: name, capital });
  });

  console.log(`Total countries found: ${results.length}`);
  console.log(results.slice(0, 5));   // preview the first five objects
}

scrapeCountries();

Sample output:

Total countries found: 250
[
  { country: 'Andorra', capital: 'Andorra la Vella' },
  { country: 'United Arab Emirates', capital: 'Abu Dhabi' },
  { country: 'Afghanistan', capital: 'Kabul' },
  { country: 'Antigua and Barbuda', capital: "St. John's" },
  { country: 'Anguilla', capital: 'The Valley' }
]

If nothing shows up

Sometimes selectors fail because a website changes its HTML. Right-click the page, choose Inspect (or View Source) and confirm the tags and classes.

💡 For example, if the country-name heading changed to <h2 class="name">, you would adjust:

const name = $(el).find('h2.name').text().trim();
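
A cheap early-warning check is to log how many nodes the main selector matched before looping; zero matches means the selector, not your extraction code, is the problem. The sketch below simply reuses the selectors from scrape-countries.js:

const axios = require('axios');
const cheerio = require('cheerio');

async function checkSelectors() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  // If the main selector matches nothing, the page structure has probably changed.
  const count = $('div.country').length;
  if (count === 0) {
    console.warn('No country blocks found; inspect the page and update the selectors.');
  } else {
    console.log(`Selector matched ${count} country blocks.`);
  }
}

checkSelectors();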

The skill of reading page structure is fundamental; with a reliable set of selectors you can scrape almost any static site.

Exporting Your Scraped Data

Printing objects to the console is fine for a sanity check; real projects need the data saved in a format other tools can read. The two staples are JSON (great for APIs and JavaScript apps) and CSV (ideal for spreadsheets and databases).

Save as JSON

Node already knows how to turn objects into JSON strings; you only need the built-in fs module.

Create save-json.js.

const fs   = require('fs');
const axios = require('axios');
const cheerio = require('cheerio');

async function run() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  const countries = [];

  $('div.country').each((_, el) => {
    const name    = $(el).find('h3.country-name').text().trim();
    const capital = $(el).find('span.country-capital').text().trim();
    countries.push({ country: name, capital });
  });

  fs.writeFileSync('countries.json', JSON.stringify(countries, null, 2));
  console.log('✔ Data written to countries.json');
}

run();

Run the script:

node save-json.js

Open countries.json and you’ll see pretty-printed data:

[
  {
    "country": "Andorra",
    "capital": "Andorra la Vella"
  },
  {
    "country": "United Arab Emirates",
    "capital": "Abu Dhabi"
  }
  // …
]
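
To confirm the export is valid JSON, you can read the file straight back with the same fs module. A quick sketch, assuming countries.json sits next to the script:

const fs = require('fs');

// Parse the exported file and preview a record.
const countries = JSON.parse(fs.readFileSync('countries.json', 'utf8'));
console.log(`Loaded ${countries.length} records`);
console.log(countries[0]);   // { country: 'Andorra', capital: 'Andorra la Vella' }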

Save as CSV

CSV is just plain text with commas; you could build the string yourself, but a helper library avoids edge cases (commas inside names, Unicode, line endings). A tiny, no-dependency package called csv-writer does the job.
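
The capitals on this demo page happen to be comma-free, but a quick sketch with made-up values shows the kind of breakage hand-rolled CSV invites, and why quoting and escaping are better left to a library:

// A made-up row whose value contains a comma.
const row = { Country: 'Fictionland', Capital: 'Port, Royal' };

// Naive string building splits the capital into two columns.
const naive = `${row.Country},${row.Capital}`;
console.log(naive);    // Fictionland,Port, Royal

// Quoting fixes this case, but then embedded quotes, Unicode and
// line endings all need handling too - exactly what csv-writer covers.
const quoted = `"${row.Country}","${row.Capital.replace(/"/g, '""')}"`;
console.log(quoted);   // "Fictionland","Port, Royal"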

npm install csv-writer

Create save-csv.js.

const axios      = require('axios');
const cheerio    = require('cheerio');
const { createObjectCsvWriter } = require('csv-writer');

async function run() {
  const url = 'https://www.scrapethissite.com/pages/simple/';
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  const rows = [];

  $('div.country').each((_, el) => {
    rows.push({
      Country:  $(el).find('h3.country-name').text().trim(),
      Capital:  $(el).find('span.country-capital').text().trim()
    });
  });

  const csvWriter = createObjectCsvWriter({
    path      : 'countries.csv',
    header    : [
      { id: 'Country', title: 'Country' },
      { id: 'Capital', title: 'Capital' }
    ],
    fieldDelimiter: ',',              // default
    alwaysQuote  : true               // keeps commas in values safe
  });

  await csvWriter.writeRecords(rows);
  console.log('✔ Data written to countries.csv');
}

run();

Run:

node save-csv.js

Open countries.csv in Excel, Google Sheets, or a text editor:

"Country","Capital"
"Andorra","Andorra la Vella"
"United Arab Emirates","Abu Dhabi"
"Afghanistan","Kabul"

So your two choices are:

  • JSON: perfect for programmatic consumption or RESTful APIs.
  • CSV: quick to inspect in spreadsheets, easy to import into SQL.

With reliable export in place, the next challenge is scraping sites that hide content behind JavaScript. We’ll tackle that with a headless browser in the upcoming section.