Category: Scraping basics

Advanced TypeScript Web Scraping

26 mins read Created Date: October 14, 2024   Updated Date: October 14, 2024

Everyone who scrapes data off the web will tell you the same thing: JavaScript-heavy websites are tough to scrape! The challenges range from handling dynamic content to dealing with anti-bot mechanisms and ensuring scalability.

For developers already comfortable with TypeScript, leveraging its type safety and strictness can offer significant advantages when building scraping solutions. In this article, we’ll dive into implementing efficient, maintainable web scrapers in TypeScript, covering integration with popular libraries, optimizing performance, and solving complex problems like CAPTCHA handling, data extraction, and concurrency.

Whether you’re scraping e-commerce product listings or large-scale data sets, this guide offers actionable insights and best practices to help you build reliable, production-ready scrapers. Without further ado, let’s dive right in!

Setting Up a TypeScript Environment for Web Scraping

Before diving into web scraping with TypeScript, setting up a working development environment is essential. This article assumes the reader already has a working knowledge of JavaScript and TypeScript, and while we’ll try to explain every step, it may be a bit complex for beginners.

We’ll be working with Node.js, so to get started, install it by going to the official Node.js website and downloading the recommended version for your operating system. Alternatively, you can install Node.js via the command line:

For macOS:

brew install node

For Linux (Debian-based distributions):

sudo apt update
sudo apt install nodejs npm

For Windows, the downloaded installer is the best way to install Node.js.

Creating a TypeScript Project

To initialize a new TypeScript project, follow these steps:

mkdir typescript-scraper
cd typescript-scraper
npm init -y
npm install typescript ts-node @types/node --save-dev
npx tsc --init

This creates a new directory, initializes a Node.js project, installs TypeScript and necessary type definitions, and generates a default tsconfig.json file.

Finally, modify your `tsconfig.json` to optimize it for web scraping:

{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "outDir": "./dist",
    "rootDir": "./src",
    "lib": ["es2020", "dom"],
    "moduleResolution": "node"
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules"]
}

The key configurations here are:

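  • “target”: “ES2020”: Emits ES2020 JavaScript, so modern features such as optional chaining and nullish coalescing are kept as-is instead of being down-leveled.
  • “strict”: true: Ensures strict type checking for better code safety.
  • “lib”: [“es2020”, “dom”]: Includes DOM types, which are needed to type-check code that runs in the browser context (for example, inside Puppeteer’s page.evaluate).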
  • “moduleResolution”: “node”: Helps TypeScript resolve module imports correctly.

With that, we’re all set up and good to go.

Choosing a Web Scraping Library for TypeScript

When building a web scraper in TypeScript, choosing the right libraries can streamline your development process and ensure you handle common challenges like parsing HTML, handling HTTP requests, or managing dynamic content. To help make it easier, here’s an overview of popular libraries, focusing on their TypeScript integration, advantages, and usage patterns.

Axios/Cheerio

For websites where content is static (i.e., no JavaScript rendering required), the combination of Axios and Cheerio provides a simple and effective solution. Axios is used to make HTTP requests, while Cheerio helps parse and manipulate HTML.

  • Axios: Axios is a promise-based HTTP client, ideal for handling requests in web scraping. Its bundled type definitions help you catch issues like incorrect request configurations or improper response handling at compile time.
  • Cheerio: Cheerio is a fast and flexible HTML parser that mimics jQuery’s API. It works well for scraping static websites where the content doesn’t rely on JavaScript execution.

Basically, Axios handles HTTP requests, while Cheerio parses HTML. Here’s how to integrate them with TypeScript:

import axios from "axios";
import * as cheerio from "cheerio"; // Namespace import; load() is called as cheerio.load()

interface ScrapedData {
title: string;
price: number;
description: string;
}

async function scrapeProductPage(url: string): Promise<ScrapedData> {
try {
const { data } = await axios.get(url);
const $ = cheerio.load(data);

// Adjust these selectors to match the target page's HTML structure
const title = $("h2.product-name").text().trim();
const priceText = $(".product-price bdi").text().trim();
const price = parseFloat(priceText.replace(/[$,]/g, "")); // Strip the currency symbol and thousands separators
const description = $(".product-description").text().trim();

if (!title || isNaN(price)) {
throw new Error("Failed to extract required product information");
}

return { title, price, description };
} catch (error) {
console.error(`Error scraping ${url}:`, error);
throw error;
}
}

// Usage
scrapeProductPage(
"https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
)
.then((product) => console.log(product))
.catch((error) => console.error("Scraping failed:", error));

This demonstrates how to fetch a webpage, parse its HTML content, and extract specific data using CSS selectors. The ScrapedData interface ensures type safety for the extracted data.
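
Price strings are a common source of silent failures, because parseFloat stops at the first character it cannot read (so "1,299.50" becomes 1). The sketch below shows one way to normalize a displayed price before validating it. The parsePrice helper is hypothetical, not part of Axios or Cheerio, and it assumes "." is the decimal separator and "," only groups thousands.

```typescript
// Hypothetical helper: turn a displayed price such as "$1,299.50" into a number.
// Assumes "." is the decimal separator and "," only groups thousands.
function parsePrice(text: string): number | null {
  // Drop everything except digits, dots, and minus signs.
  const cleaned = text.replace(/[^0-9.-]/g, "");
  if (cleaned === "") return null;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null;
}

console.log(parsePrice("$69.00"));    // 69
console.log(parsePrice("$1,299.50")); // 1299.5
console.log(parsePrice("Sold out"));  // null
```

Returning null instead of NaN forces callers to handle the missing-price case explicitly, which fits well with strict null checks.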

Puppeteer

For websites that rely on dynamic content rendered by JavaScript, Puppeteer is an excellent choice. Puppeteer controls a headless version of Chrome or Chromium, allowing you to automate interaction with web pages, including waiting for content to load before extracting data.

There are many reasons to consider Puppeteer, but the core ones are:

  • It handles JavaScript-heavy pages
  • It can emulate user actions like clicking buttons, filling forms, or scrolling
  • It supports taking screenshots and generating PDFs, making it versatile for testing as well as scraping

Here’s a TypeScript example:

import puppeteer from "puppeteer";

interface ProductData {
title: string;
price: number;
rating: number;
}

async function scrapeDynamicProduct(url: string): Promise<ProductData> {
const browser = await puppeteer.launch();
const page = await browser.newPage();

try {
await page.goto(url, { waitUntil: "networkidle0" });

const productData = await page.evaluate(() => {
const title =
document.querySelector<HTMLHeadingElement>("h2.product-name")
?.textContent || "";
const priceText =
document.querySelector<HTMLElement>(".product-price bdi")
?.textContent || "";
const price = parseFloat(priceText.replace("$", "").trim());
const ratingText =
document.querySelector<HTMLElement>(".rating")?.textContent || "";
const rating = parseFloat(ratingText);

return { title, price, rating };
});

if (
!productData.title ||
isNaN(productData.price) ||
isNaN(productData.rating)
) {
throw new Error("Failed to extract product data");
}

return productData;
} finally {
await browser.close();
}
}

// Usage
scrapeDynamicProduct(
"https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
)
.then(console.log)
.catch(console.error);

This shows how to use Puppeteer to navigate to a page, wait for the content to load, and then extract data from the rendered DOM.

Playwright

Playwright is similar to Puppeteer; in fact, they can be considered direct competitors. Playwright, however, provides additional capabilities, such as support for multiple browser engines (Chromium, Firefox, and WebKit). It’s well-suited for more complex scraping tasks, such as handling multiple browser contexts or simulating different user environments.

Here’s why you would use Playwright:

  • Supports multiple browsers (Chromium, Firefox, WebKit)
  • Can manage multiple browser contexts for concurrent scraping
  • Provides more advanced automation options for testing and scraping

Let’s see an example of using Playwright for multi-browser scraping:

import { chromium, firefox, webkit } from 'playwright';

interface BrowserData {
browser: string;
title: string;
url: string;
}

async function scrapeBrowserTestPage(): Promise<BrowserData[]> {
const browsers = [
{ name: 'Chromium', instance: await chromium.launch() },
{ name: 'Firefox', instance: await firefox.launch() },
{ name: 'WebKit', instance: await webkit.launch() }
];

const results: BrowserData[] = [];

for (const { name, instance } of browsers) {
const page = await instance.newPage();
try {
await page.goto('https://www.whatismybrowser.com/');
const title = await page.title();
const url = page.url();
results.push({ browser: name, title, url });
} finally {
await instance.close();
}
}

return results;
}

// Usage
scrapeBrowserTestPage()
.then(results => {
console.log('Scraping results:');
results.forEach(result => {
console.log(`${result.browser}: ${result.title} (${result.url})`);
});
})
.catch(console.error);

This demonstrates how to use Playwright to scrape the same page across different browser engines, which can be useful for testing or for bypassing browser-specific anti-scraping measures.

Handling Asynchronous Operations in TypeScript

In TypeScript web scraping, dealing with asynchronous operations is crucial, especially when fetching data from external websites or APIs. TypeScript’s type-checking capabilities and async/await syntax make handling asynchronous operations both easier and safer.

TypeScript fully supports the async/await syntax, which allows you to write asynchronous code that looks synchronous, improving readability and reducing callback complexity. When you’re using Axios or the native fetch API, TypeScript enhances the experience by providing type safety, making it easier to detect and handle errors.

Here’s an example using Axios:

import axios, { AxiosResponse } from 'axios';

interface WebPage {
url: string;
content: string;
}

async function fetchPage(url: string): Promise<WebPage> {
try {
const response: AxiosResponse = await axios.get(url);
return { url, content: response.data };
} catch (error) {
if (axios.isAxiosError(error)) {
throw new Error(`Failed to fetch ${url}: ${error.message}`);
}
throw error;
}
}

This demonstrates how TypeScript enhances error handling by leveraging type information.

Because async/await is built on Promises, an error thrown inside an async function surfaces as a rejected promise, which you can handle with an ordinary try/catch block.

Here’s an example of using async/await with error handling:

import axios from 'axios';

interface WebPage {
url: string;
content: string;
statusCode: number;
}

async function fetchPage(url: string): Promise<WebPage> {
try {
const response = await axios.get(url);
return {
url,
content: response.data,
statusCode: response.status
};
} catch (error) {
if (axios.isAxiosError(error)) {
// TypeScript knows this is an AxiosError
console.error(`HTTP Error: ${error.response?.status}`);
throw new Error(`Failed to fetch ${url}: ${error.message}`);
} else {
console.error('An unexpected error occurred:', error);
throw error;
}
}
}

// Usage
async function main() {
try {
const page = await fetchPage('https://scrapingcourse.com/ecommerce');
console.log(`Fetched page: ${page.url} (Status: ${page.statusCode})`);
console.log(`Content length: ${page.content.length} characters`);
} catch (error) {
console.error('Error in main:', error);
}
}

main();

This shows how to use async/await to handle asynchronous operations with proper typing and error handling.
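
Under "strict": true, the variable in a catch clause is typed as unknown, so it has to be narrowed before you read properties from it. The sketch below shows one way to do that for arbitrary thrown values. The describeError helper is a hypothetical name introduced here for illustration.

```typescript
// Hypothetical helper: produce a readable message from any thrown value.
function describeError(error: unknown): string {
  if (error instanceof Error) return `${error.name}: ${error.message}`;
  if (typeof error === "string") return error;
  try {
    // JSON.stringify returns undefined for some inputs (e.g. undefined itself).
    return JSON.stringify(error) ?? String(error);
  } catch {
    return String(error); // e.g. objects with circular references
  }
}

console.log(describeError(new TypeError("bad selector"))); // TypeError: bad selector
console.log(describeError("timeout"));                     // timeout
console.log(describeError({ code: 429 }));                 // {"code":429}
```

A helper like this keeps log output consistent, whether the failure came from Axios, from a parser, or from a plain throw of a string.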

Scraping Static Pages with TypeScript

Static pages, where content is rendered directly in the HTML, are often simpler to scrape because there’s no need to execute JavaScript. By combining Axios and Cheerio, you can fetch these pages and parse the HTML efficiently.

We’ve already seen a basic example of integrating Axios and Cheerio. Now let’s look at a more complex example that scrapes static content and highlights how TypeScript’s strong typing helps ensure data integrity.

import axios from "axios";
import * as cheerio from "cheerio"; // Namespace import; load() is called as cheerio.load()

interface Product {
name: string;
price: number;
description: string;
specifications: Record<string, string>;
}

async function scrapeProductPage(url: string): Promise<Product> {
try {
const { data } = await axios.get(url);
const $ = cheerio.load(data);

const name = $("h2.product-name").text().trim();
const priceText = $(".product-price bdi").text().trim();
const price = parseFloat(priceText.replace(/[$,]/g, "")); // Strip the currency symbol and all thousands separators
const description = $(".product-description").text().trim(); // Adjust this selector to the actual HTML structure

const specifications: Record<string, string> = {};
$(".product-specs tr").each((_, element) => {
const $row = $(element);
const key = $row.find("th").text().trim();
const value = $row.find("td").text().trim();
if (key && value) {
specifications[key] = value;
}
});

if (!name || isNaN(price)) {
throw new Error("Failed to extract essential product information");
}

return { name, price, description, specifications };
} catch (error) {
console.error("Error scraping the product page:", error);
throw error;
}
}

// Usage
scrapeProductPage(
"https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
)
.then((product) => {
console.log("Product Name:", product.name);
console.log("Price:", product.price);
console.log("Description:", product.description);
console.log("Specifications:");
Object.entries(product.specifications).forEach(([key, value]) => {
console.log(` ${key}: ${value}`);
});
})
.catch((error) => console.error("Scraping failed:", error));

This example shows how to extract more complex structured data, including a dynamic set of product specifications.

Strategies for Paginated Scraping

When scraping data across multiple pages (e.g., paginated product listings), you need to handle fetching and parsing multiple URLs. After fetching data from the first page, you detect the link to the next page and repeat the process until no more pages are found.

import axios, { AxiosResponse } from "axios";
import { load } from "cheerio";

interface Product {
  name: string;
  price: string; // Kept as a string to preserve the displayed formatting
}

const scrapePaginatedProducts = async (url: string): Promise<Product[]> => {
  let nextPageUrl: string | null = url;
  const allProducts: Product[] = [];

  while (nextPageUrl) {
    try {
      const response: AxiosResponse<string> = await axios.get(nextPageUrl);
      const $ = load(response.data);

      // Scrape the products on the current page
      $("li[data-products='item']").each((_, element) => {
        const name = $(element).find(".product-name").text().trim();
        const price = $(element).find(".product-price bdi").text().trim();

        // Keep only entries where both fields were found
        if (name && price) {
          allProducts.push({ name, price });
        }
      });

      // Find the link to the next page and resolve it against the current URL,
      // so that both relative and absolute hrefs work
      const nextPage = $(".pagination .next").attr("href");
      nextPageUrl = nextPage ? new URL(nextPage, nextPageUrl).href : null;
    } catch (error) {
      console.error("Error fetching or parsing the page:", error);
      break;
    }
  }

  return allProducts;
};

// Example usage
scrapePaginatedProducts("https://www.scrapingcourse.com/ecommerce/")
.then((products) => {
console.log(products);
})
.catch((error) => {
console.error("Scraping failed:", error);
});

Here, the scraper loops through paginated pages by extracting the “next page” link until no further pages are available. The allProducts array accumulates products from each page, ensuring all pages are scraped.
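
Two details are easy to get wrong when following “next page” links: the href may be relative or absolute, and a misbehaving site can link back to a page you have already fetched, which would loop forever. The following is a minimal sketch of handling both. The resolveNextPage helper and the example.com URLs are assumptions made for illustration.

```typescript
// Hypothetical helper: resolve a "next" link against the page it was found on.
// Returns null when there is no link, the link is malformed, or it was already visited.
function resolveNextPage(
  href: string | undefined,
  currentUrl: string,
  visited: Set<string>
): string | null {
  if (!href) return null;
  let next: string;
  try {
    next = new URL(href, currentUrl).href; // Handles relative and absolute hrefs
  } catch {
    return null;
  }
  if (visited.has(next)) return null; // Guards against pagination loops
  visited.add(next);
  return next;
}

const visitedPages = new Set<string>(["https://example.com/shop/"]);
console.log(resolveNextPage("page/2/", "https://example.com/shop/", visitedPages));
// https://example.com/shop/page/2/
console.log(resolveNextPage("/shop/page/2/", "https://example.com/shop/page/3/", visitedPages));
// null (already visited)
```

In a pagination loop, the return value can be assigned directly to the variable that holds the next URL, so the loop ends as soon as null comes back.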

If the total number of pages is known or can be inferred, you can scrape each page in sequence by constructing URLs dynamically based on the page number.

import axios from "axios";
import { load } from "cheerio";

interface Product {
name: string;
price: string; // Keeping it as string to handle formatting
availability: string; // Add availability as per your requirement
}

const scrapePagesByNumber = async (
baseUrl: string,
totalPages: number
): Promise<Product[]> => {
const allProducts: Product[] = [];

for (let i = 1; i <= totalPages; i++) {
const currentPageUrl = `${baseUrl}?page=${i}`;
try {
const { data } = await axios.get(currentPageUrl);
const $ = load(data);

// Scrape products from each page; adjust the selectors to match the actual HTML
$("li[data-products='item']").each((_, element) => {
const name = $(element).find(".product-name").text().trim();
const price = $(element).find(".product-price bdi").text().trim();
const availability = $(element)
.find(".product-availability")
.text()
.trim(); // Empty string if the page has no availability element

if (name && price) {
allProducts.push({ name, price, availability });
}
});
} catch (error) {
console.error(`Error fetching page ${i}:`, error);
}
}

return allProducts;
};

// Example usage
scrapePagesByNumber("https://www.scrapingcourse.com/ecommerce/", 5)
.then((products) => {
console.log(products);
})
.catch((error) => {
console.error("Scraping failed:", error);
});

In this case, you specify the total number of pages upfront (totalPages). The script constructs a new URL for each page (?page=${i}) and scrapes them in sequence. This strategy works well if you know the exact number of pages or can derive it from the website (e.g., by scraping the number of results and calculating the total pages).
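
Deriving the page count from a result total is a single ceiling division. Here is a minimal sketch; buildPageUrls is a hypothetical helper, the totals are made-up numbers, and it assumes the site paginates with a "page" query parameter like the example above.

```typescript
// Hypothetical helper: derive the page count from a result total, then build page URLs.
// Assumes the site paginates with a "page" query parameter.
function buildPageUrls(baseUrl: string, totalResults: number, perPage: number): string[] {
  if (perPage <= 0) throw new RangeError("perPage must be positive");
  const totalPages = Math.ceil(totalResults / perPage);
  const urls: string[] = [];
  for (let page = 1; page <= totalPages; page++) {
    const url = new URL(baseUrl);
    url.searchParams.set("page", String(page)); // Preserves any existing query parameters
    urls.push(url.href);
  }
  return urls;
}

// 37 results at 16 per page need 3 pages.
console.log(buildPageUrls("https://example.com/shop/", 37, 16));
```

Using URL and searchParams instead of string concatenation avoids malformed URLs when the base already contains a query string.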

Some websites use infinite scrolling instead of pagination, and in such cases, scrolling events or an API request to load more content triggers the loading of more data.

You can simulate the infinite scroll behavior or intercept API requests that load more data.

import puppeteer from 'puppeteer';

interface Product {
  name: string;
  price: string;
  availability: string;
}

const scrapeInfiniteScroll = async (url: string): Promise<Product[]> => {
  let allProducts: Product[] = [];
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  try {
    await page.goto(url);

    let hasMoreContent = true;

    // Scroll down until the page height stops growing
    while (hasMoreContent) {
      const previousHeight = await page.evaluate(() => document.body.scrollHeight);
      await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
      // Wait for new content to load (page.waitForTimeout is not available in recent Puppeteer versions)
      await new Promise((resolve) => setTimeout(resolve, 2000));

      const newHeight = await page.evaluate(() => document.body.scrollHeight);
      hasMoreContent = newHeight > previousHeight;
    }

    // Scrape once, after everything has loaded, to avoid collecting duplicates
    allProducts = await page.$$eval('.product-list-item', (elements) =>
      elements.map((el) => ({
        name: el.querySelector('.product-name')?.textContent?.trim() || '',
        price: el.querySelector('.product-price')?.textContent?.trim() || '',
        availability: el.querySelector('.product-availability')?.textContent?.trim() || '',
      }))
    );
  } catch (error) {
    console.error('Error with infinite scrolling:', error);
  } finally {
    await browser.close();
  }

  return allProducts;
};

// Example usage
scrapeInfiniteScroll('https://scrapingcourse.com/ecommerce').then(products => {
  console.log(products);
});

In this example, Puppeteer simulates the scrolling behavior by comparing the page’s scroll height before and after each scroll and stopping once no new content is loaded. It then scrapes the products from the fully loaded page in a single pass. This strategy works well for websites that use infinite scrolling, which is common in modern web applications.
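
If you collect items after every scroll (for example, because the page removes older items from the DOM as you go), the same product can be captured more than once. A minimal sketch of de-duplicating the accumulated list follows. The dedupeBy helper and the choice of the product name as the key are assumptions; a product URL or ID is usually a safer key when one is available.

```typescript
interface ListedProduct {
  name: string;
  price: string;
}

// Keep the first occurrence of each item, identified by a caller-supplied key.
function dedupeBy<T>(items: T[], keyOf: (item: T) => string): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const item of items) {
    const key = keyOf(item);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(item);
    }
  }
  return unique;
}

const collected: ListedProduct[] = [
  { name: "Abominable Hoodie", price: "$69.00" },
  { name: "Adrienne Trek Jacket", price: "$57.00" },
  { name: "Abominable Hoodie", price: "$69.00" }, // Captured again after a later scroll
];
console.log(dedupeBy(collected, (p) => p.name).length); // 2
```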

Scraping Dynamic Content Using Puppeteer or Playwright in TypeScript

Modern websites rely heavily on JavaScript to render content dynamically, making traditional HTTP request-based scraping ineffective. To scrape these JavaScript-rendered pages, you need to use headless browsers like Puppeteer or Playwright. These tools simulate real user interactions, ensuring all content, including dynamic elements, is fully loaded before extraction.

Why Dynamic Scraping?

Static scraping methods can’t handle JavaScript-generated dynamic content, such as single-page applications (SPAs), infinite scrolling, or websites that load data asynchronously. Puppeteer and Playwright, however, allow you to control browsers programmatically to interact with these pages, execute JavaScript, and wait for elements to render.

TypeScript further enhances this process by enforcing strict typing across your scraping logic. From browser automation and page navigation to data extraction, TypeScript ensures that all operations are type-safe, reducing the chances of runtime errors and improving code maintainability.

Here’s how you can scrape dynamic content using Puppeteer:

import puppeteer from 'puppeteer';

interface DynamicContent {
title: string;
price: number;
inStock: boolean;
}

async function scrapeDynamicContent(url: string): Promise<DynamicContent> {
  const browser = await puppeteer.launch();

  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });

    // Runs inside the browser; only serializable values can be returned
    return await page.evaluate(() => {
      const title = document.querySelector<HTMLHeadingElement>('h1.product-title')?.textContent || '';
      const priceText = document.querySelector<HTMLElement>('.price')?.textContent || '';
      const price = parseFloat(priceText.replace(/[$,]/g, ''));
      const inStock = document.querySelector('.stock-status')?.textContent?.includes('In Stock') || false;

      return { title, price, inStock };
    });
  } finally {
    // Always release the browser, even if navigation or extraction fails
    await browser.close();
  }
}

// Usage
scrapeDynamicContent('https://example.com/dynamic-product')
.then(console.log)
.catch(console.error);

Key Considerations

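  • TypeScript Typing: The TypeScript interface documents the expected shape of the scraped data and lets the compiler check every place that consumes it. Keep in mind that types are erased at compile time, so they cannot catch a selector that silently yields an empty string or NaN at runtime; explicit checks are still needed for that.

  • Puppeteer’s page.evaluate: This method runs JavaScript code in the browser context to extract data. TypeScript infers the return type of the callback, so a mismatch between what it returns and your declared interface is reported at compile time. The returned value must be serializable, since it is passed from the browser back to Node.js.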
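
Since interfaces disappear at compile time, values that come back from the page are worth checking at runtime before they enter the rest of your pipeline. One common way to do this is a user-defined type guard, sketched below. The isDynamicContent function is a hypothetical helper, and the interface is repeated only so the snippet is self-contained.

```typescript
interface DynamicContent {
  title: string;
  price: number;
  inStock: boolean;
}

// Runtime check: returns true only if the value has the expected shape
// and the fields hold usable data (non-empty title, finite price).
function isDynamicContent(value: unknown): value is DynamicContent {
  if (typeof value !== "object" || value === null) return false;
  const { title, price, inStock } = value as Record<string, unknown>;
  return (
    typeof title === "string" &&
    title.trim().length > 0 &&
    typeof price === "number" &&
    Number.isFinite(price) &&
    typeof inStock === "boolean"
  );
}

const scraped: unknown = { title: "Abominable Hoodie", price: NaN, inStock: true };
console.log(isDynamicContent(scraped)); // false, because the price failed to parse
```

Inside an if (isDynamicContent(x)) branch, TypeScript treats x as DynamicContent, so the runtime check and the static type stay in sync.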

     

Managing Browser Contexts

In Puppeteer, a browser context allows you to isolate cookies, cache, and storage. This is particularly useful when scraping websites that use cookies for session management or when you need to avoid leaking state between different scraping tasks.

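import puppeteer from 'puppeteer';

// DynamicContent is the interface defined in the previous example
async function scrapeMultiplePages(urls: string[]): Promise<DynamicContent[]> {
  const browser = await puppeteer.launch();
  const results: DynamicContent[] = [];

  try {
    for (const url of urls) {
      // Puppeteer v22+ API; older versions call this createIncognitoBrowserContext()
      const context = await browser.createBrowserContext();
      const page = await context.newPage();

      try {
        await page.goto(url, { waitUntil: 'networkidle0' });
        const content = await page.evaluate(() => {
          // Same extraction logic as in the previous example
          const title = document.querySelector<HTMLHeadingElement>('h1.product-title')?.textContent || '';
          const priceText = document.querySelector<HTMLElement>('.price')?.textContent || '';
          const price = parseFloat(priceText.replace(/[$,]/g, ''));
          const inStock = document.querySelector('.stock-status')?.textContent?.includes('In Stock') || false;
          return { title, price, inStock };
        });
        results.push(content);
      } finally {
        await context.close(); // Discards this context's cookies, cache, and storage
      }
    }
  } finally {
    await browser.close();
  }

  return results;
}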

This example demonstrates how to scrape several pages one after another, each in its own isolated browser context, while maintaining type safety.

Best Practices for Web Scraping with TypeScript

Web scraping, especially at scale, presents challenges such as avoiding detection and dealing with anti-scraping techniques. To successfully scrape data while minimizing the chances of getting blocked, you must employ best practices, such as bypassing anti-scraping mechanisms, handling errors gracefully, and controlling request rates.

Handling CAPTCHAs

While automated solutions can bypass CAPTCHAs, it’s usually better to avoid sites that employ them heavily. When that isn’t an option, external solving services such as 2Captcha or Anti-Captcha expose APIs that you can call from your TypeScript scraping script.

import axios from 'axios';

// Function to convert the CAPTCHA image URL to a base64 string
const getBase64FromUrl = async (imageUrl: string): Promise<string> => {
const response = await axios.get(imageUrl, { responseType: 'arraybuffer' });
const buffer = Buffer.from(response.data, 'binary');
return buffer.toString('base64');
};

// Function to submit the CAPTCHA to the 2Captcha API and return the service's reply
const solveCaptcha = async (captchaImageUrl: string): Promise<string> => {
const apiKey = 'your-api-key'; // Replace with your actual 2Captcha API key
const base64Captcha = await getBase64FromUrl(captchaImageUrl); // Convert image URL to base64

// Prepare form data as required by 2Captcha
const formData = new URLSearchParams();
formData.append('key', apiKey);
formData.append('method', 'base64');
formData.append('body', base64Captcha);

// Submit the CAPTCHA image to 2Captcha
const response = await axios.post('https://2captcha.com/in.php', formData);
return response.data; // Raw reply containing the request ID (not the solved text); the solution must be fetched separately
};

// Example usage of the solveCaptcha function
const captchaTest = async () => {
const captchaImageUrl = 'https://example.com/captcha.jpg'; // Replace with an actual CAPTCHA image URL

try {
const captchaId = await solveCaptcha(captchaImageUrl);
console.log('Captcha ID:', captchaId);
} catch (error) {
console.error('Error solving CAPTCHA:', error);
}
};

captchaTest();
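
The submit call above only hands the image to the service; the reply then has to be interpreted, and the answer fetched later using the returned ID. The sketch below parses such replies into a typed result. It assumes the plain-text reply format of 2Captcha’s legacy API ("OK|<value>" on success, "CAPCHA_NOT_READY" while the solution is pending, and an error code otherwise), so check the service’s current documentation before relying on it. The parseCaptchaReply helper is hypothetical.

```typescript
type CaptchaReply =
  | { status: "ok"; value: string }
  | { status: "pending" }
  | { status: "error"; code: string };

// Hypothetical helper: classify a plain-text reply from the solving service.
// The reply formats handled here are assumptions about the service's legacy API.
function parseCaptchaReply(raw: string): CaptchaReply {
  const text = raw.trim();
  if (text.startsWith("OK|")) return { status: "ok", value: text.slice(3) };
  if (text === "CAPCHA_NOT_READY") return { status: "pending" };
  return { status: "error", code: text };
}

console.log(parseCaptchaReply("OK|2122988149"));      // { status: 'ok', value: '2122988149' }
console.log(parseCaptchaReply("CAPCHA_NOT_READY"));   // { status: 'pending' }
console.log(parseCaptchaReply("ERROR_WRONG_USER_KEY")); // { status: 'error', code: 'ERROR_WRONG_USER_KEY' }
```

Modeling the reply as a discriminated union means the compiler forces callers to handle the pending and error cases before they can read value.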

User-Agent Rotation

Using a single user-agent for all requests increases the likelihood of detection. You can rotate user-agents with each request to mimic different browsers or devices. Here’s how to do this in TypeScript:

import axios, { AxiosRequestConfig } from 'axios';

const userAgents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
// Add more user agents...
];

function getRandomUserAgent(): string {
return userAgents[Math.floor(Math.random() * userAgents.length)];
}

async function scrapeWithRotation(url: string): Promise<string> {
const config: AxiosRequestConfig = {
headers: { 'User-Agent': getRandomUserAgent() }
};

const { data } = await axios.get(url, config);
return data;
}

Proxy Support

Rotating proxies helps distribute requests across multiple IP addresses, reducing the chance of being blocked. Libraries like https-proxy-agent can be used to manage proxy connections in TypeScript.

import { HttpsProxyAgent } from 'https-proxy-agent';
import axios from 'axios';

const proxyList = [
'http://proxy1.example.com:8080',
'http://proxy2.example.com:8080',
// Add more proxies...
];

function getRandomProxy(): string {
return proxyList[Math.floor(Math.random() * proxyList.length)];
}

async function scrapeWithProxy(url: string): Promise<string> {
const proxy = getRandomProxy();
const httpsAgent = new HttpsProxyAgent(proxy);

const { data } = await axios.get(url, { httpsAgent });
return data;
}

This rotates proxies using HttpsProxyAgent from the https-proxy-agent package: it selects a random proxy from a list, applies it to an Axios request, and returns the scraped data. Routing requests through different proxies reduces the risk of blocks or throttling from target sites, assuming the proxies are functional. Note that Axios uses httpsAgent only for HTTPS URLs; plain HTTP requests need a separate httpAgent.
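
Random selection can keep picking a proxy that has already stopped working. As an alternative, here is a minimal sketch of a round-robin pool that skips proxies after repeated failures. The ProxyPool class, its method names, and the failure threshold are all assumptions introduced for illustration; they are not part of https-proxy-agent or Axios.

```typescript
// Hypothetical round-robin proxy pool that skips proxies after too many failures.
class ProxyPool {
  private index = 0;
  private readonly failures = new Map<string, number>();
  private readonly proxies: string[];
  private readonly maxFailures: number;

  constructor(proxies: string[], maxFailures: number = 3) {
    this.proxies = proxies;
    this.maxFailures = maxFailures;
  }

  // Return the next proxy below the failure limit, or null if none remain.
  next(): string | null {
    for (let i = 0; i < this.proxies.length; i++) {
      const candidate = this.proxies[this.index];
      this.index = (this.index + 1) % this.proxies.length;
      if ((this.failures.get(candidate) ?? 0) < this.maxFailures) return candidate;
    }
    return null;
  }

  // Record a failed request so the proxy is eventually taken out of rotation.
  reportFailure(proxy: string): void {
    this.failures.set(proxy, (this.failures.get(proxy) ?? 0) + 1);
  }
}

const pool = new ProxyPool(["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"], 1);
console.log(pool.next()); // http://proxy1.example.com:8080
pool.reportFailure("http://proxy1.example.com:8080");
console.log(pool.next()); // http://proxy2.example.com:8080
console.log(pool.next()); // http://proxy2.example.com:8080 (proxy1 is skipped)
```

The value returned by next() can be passed to new HttpsProxyAgent(...) exactly as the random choice is in the example above, with reportFailure called from the catch block.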

Error Handling and Retry Logic

Web scraping often encounters issues such as timeouts, rate limiting, or temporary server errors. To handle these gracefully, you can implement retry mechanisms with a growing delay between attempts (backoff) to reduce server load and avoid being blocked.

Here’s how to use axios-retry for automatic retries:

import axios from 'axios';
import axiosRetry from 'axios-retry';

axiosRetry(axios, {
retries: 3,
retryDelay: (retryCount) => {
return retryCount * 1000; // Wait 1s, 2s, 3s between retries
},
retryCondition: (error) => {
return axiosRetry.isNetworkOrIdempotentRequestError(error) || error.response?.status === 429;
}
});

async function scrapeWithRetry(url: string): Promise<string> {
try {
const { data } = await axios.get(url);
return data;
} catch (error) {
console.error(`Failed to scrape ${url} after retries:`, error);
throw error;
}
}

This demonstrates error handling and retry logic using the axios-retry package. It configures Axios to retry failed requests up to three times, with a delay that grows linearly after each retry (1s, 2s, 3s). The retryCondition ensures retries only occur for network errors, idempotent request errors, or when receiving a 429 Too Many Requests response.
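
For longer outages, a delay that doubles on each attempt backs off faster than a linear one, and adding randomness ("jitter") keeps many scrapers from retrying at the same instant. A minimal sketch of such a delay function follows. The backoffDelay helper, its default values, and the choice of randomizing within the upper half of the ceiling are assumptions for illustration.

```typescript
// Hypothetical delay calculator: exponential backoff with jitter.
// attempt is 1-based, matching the retry count passed to a retryDelay callback.
function backoffDelay(
  attempt: number,
  baseMs: number = 500,
  maxMs: number = 30000,
  random: () => number = Math.random
): number {
  // Double the ceiling on every attempt, but never exceed maxMs.
  const ceiling = Math.min(maxMs, baseMs * 2 ** (attempt - 1));
  // Randomize within the upper half of the ceiling so clients do not retry in lockstep.
  return Math.floor(ceiling * (0.5 + random() / 2));
}

for (let attempt = 1; attempt <= 5; attempt++) {
  console.log(`attempt ${attempt}: wait ~${backoffDelay(attempt)} ms`);
}
```

Because the function takes the retry count and returns milliseconds, it can be plugged in as retryDelay: (retryCount) => backoffDelay(retryCount). The axios-retry package also ships a built-in axiosRetry.exponentialDelay that serves the same purpose.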

Rate Limiting and Request Throttling

When scraping, sending too many requests in rapid succession can trigger server defenses like IP bans or CAPTCHA challenges. Implementing rate limiting and request throttling is crucial: controlling the frequency of requests reduces the likelihood of overwhelming the server and helps you maintain access while scraping data over a longer period.

Here’s how to do that:

import { promisify } from 'util';
import axios from 'axios';

const sleep = promisify(setTimeout);

interface QueuedRequest {
  url: string;
  resolve: (content: string) => void;
  reject: (reason: unknown) => void;
}

class ThrottledScraper {
  private queue: QueuedRequest[] = [];
  private isProcessing = false;

  constructor(private delayMs: number) {}

  // Queue a URL; the returned promise settles with that page's content
  addUrl(url: string): Promise<string> {
    return new Promise((resolve, reject) => {
      this.queue.push({ url, resolve, reject });
      void this.processQueue();
    });
  }

  private async processQueue(): Promise<void> {
    if (this.isProcessing) return;
    this.isProcessing = true;

    while (this.queue.length > 0) {
      const { url, resolve, reject } = this.queue.shift()!;
      try {
        const { data } = await axios.get<string>(url);
        console.log(`Scraped ${url}`);
        resolve(data);
      } catch (error) {
        console.error(`Failed to scrape ${url}:`, error);
        reject(error);
      }
      await sleep(this.delayMs); // Pause after every request, successful or not
    }

    this.isProcessing = false;
  }
}

// Usage
const scraper = new ThrottledScraper(1000); // 1 second delay between requests
for (const url of ['https://example.com/page1', 'https://example.com/page2']) {
  scraper
    .addUrl(url)
    .then((html) => console.log(`${url}: ${html.length} characters`))
    .catch(() => { /* Already logged in processQueue */ });
}

This implements a throttled web scraper using the ThrottledScraper class, which manages a queue of URLs for scraping with a specified delay between requests. The addUrl method allows users to add URLs to the queue, while the processQueue method sequentially processes each URL, ensuring a delay (in milliseconds) using the sleep function.
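
A queue is one way to space requests out; another is to have each caller ask a shared limiter how long it should wait before sending. The sketch below shows that idea as pure scheduling arithmetic, with no network calls. The IntervalLimiter class and its reserve method are hypothetical names introduced here.

```typescript
// Hypothetical limiter: hands out send slots that are at least minIntervalMs apart.
class IntervalLimiter {
  private nextSlot = 0;
  private readonly minIntervalMs: number;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserve the next free slot and return how long the caller should wait (in ms).
  reserve(now: number = Date.now()): number {
    const start = Math.max(now, this.nextSlot);
    this.nextSlot = start + this.minIntervalMs;
    return start - now;
  }
}

const limiter = new IntervalLimiter(1000);
console.log(limiter.reserve(0));    // 0    (first request may go immediately)
console.log(limiter.reserve(0));    // 1000 (second request waits one interval)
console.log(limiter.reserve(200));  // 1800 (third slot is at t=2000)
console.log(limiter.reserve(5000)); // 0    (the limiter has been idle long enough)
```

In a scraper, each request would sleep for the returned number of milliseconds before calling axios.get, which keeps the spacing correct even when several requests are started concurrently.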

Exporting and Storing Scraped Data

Once you’ve scraped data, it’s essential to export it in a structured format or store it in a database for further use. TypeScript’s strong typing helps ensure the integrity of the data while saving it, whether you’re exporting it to a file (e.g., JSON or CSV) or inserting it into a database.

Here’s how to save data to JSON:

import fs from 'fs/promises';
import axios from 'axios';
import * as cheerio from 'cheerio';

interface ScrapedItem {
title: string;
price: number;
url: string;
}

async function scrapeProducts(url: string): Promise<ScrapedItem[]> {
const { data } = await axios.get(url);
const $ = cheerio.load(data);

const products: ScrapedItem[] = [];

$("li[data-products='item']").each((_, element) => {
const title = $(element).find(".product-name").text().trim();
const priceText = $(element).find(".product-price bdi").text().trim();
const price = parseFloat(priceText.replace('$', ''));
const productUrl = $(element).find("a.woocommerce-LoopProduct-link").attr('href') || '';

if (title && !isNaN(price) && productUrl) {
products.push({ title, price, url: productUrl });
}
});

return products;
}

async function saveToJson(data: ScrapedItem[], filename: string): Promise<void> {
try {
await fs.writeFile(filename, JSON.stringify(data, null, 2));
console.log(`Data saved to ${filename}`);
} catch (error) {
console.error('Failed to save data:', error);
}
}

// Usage
const ecommerceUrl = 'https://www.scrapingcourse.com/ecommerce/';
scrapeProducts(ecommerceUrl)
.then(scrapedData => {
return saveToJson(scrapedData, 'scraped_data.json');
})
.catch(error => console.error('Failed to execute scraping operation:', error));

Here’s how to save data to CSV:

import fs from 'fs/promises';
import axios from 'axios';
import * as cheerio from 'cheerio';
import { stringify } from 'csv-stringify/sync';

interface ScrapedItem {
title: string;
price: number;
url: string;
}

async function scrapeProducts(url: string): Promise<ScrapedItem[]> {
const { data } = await axios.get(url);
const $ = cheerio.load(data);

const products: ScrapedItem[] = [];

$("li[data-products='item']").each((_, element) => {
const title = $(element).find(".product-name").text().trim();
const priceText = $(element).find(".product-price bdi").text().trim();
const price = parseFloat(priceText.replace(/[^0-9.]/g, '')); // strip currency symbols and thousands separators
const productUrl = $(element).find("a.woocommerce-LoopProduct-link").attr('href') || '';

if (title && !isNaN(price) && productUrl) {
products.push({ title, price, url: productUrl });
}
});

return products;
}

async function saveToCsv(data: ScrapedItem[], filename: string): Promise<void> {
const csvString = stringify(data, {
header: true,
columns: ['title', 'price', 'url']
});
await fs.writeFile(filename, csvString);
console.log(`Data saved to ${filename}`);
}

// Usage
const ecommerceUrl = 'https://www.scrapingcourse.com/ecommerce/';
scrapeProducts(ecommerceUrl)
.then(scrapedData => {
return saveToCsv(scrapedData, 'scraped_data.csv');
})
.catch(error => console.error('Failed to execute scraping operation:', error));
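A CSV library earns its keep because scraped titles often contain commas, quotes, or line breaks, and those fields must be quoted for the file to stay parseable. The following dependency-free sketch illustrates the usual quoting rules; escapeCsvField and toCsv are hypothetical names used only for illustration, and the real csv-stringify package handles more edge cases than this.

```typescript
// Hypothetical helper: quote a single field when it contains special characters.
function escapeCsvField(value: unknown): string {
  const text = String(value);
  // Fields containing a comma, a double quote, or a line break must be quoted,
  // and any embedded double quote is doubled.
  if (/[",\r\n]/.test(text)) {
    return `"${text.replace(/"/g, '""')}"`;
  }
  return text;
}

// Hypothetical helper: turn an array of objects into CSV text with a header row.
function toCsv<T extends object>(rows: T[], columns: Array<keyof T & string>): string {
  const header = columns.map(col => escapeCsvField(col)).join(',');
  const lines = rows.map(row =>
    columns.map(col => escapeCsvField(row[col])).join(',')
  );
  return [header, ...lines].join('\n') + '\n';
}
```

For production use, sticking with csv-stringify as shown above is likely the safer choice; the sketch is only meant to show what the library does for you.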

Database Integration

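Integrating with a database like MongoDB or PostgreSQL is often necessary for more robust data storage and querying. MongoDB stores JSON-like documents, which makes it a natural fit for scraped data. Libraries such as Mongoose add schemas on top of it, but the example below uses the official mongodb Node.js driver directly. The driver ships with its own TypeScript definitions, and its generic Collection type lets you tie a collection to an interface.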

Here’s an example of storing scraped data in MongoDB using TypeScript:

import { MongoClient, Db, Collection } from 'mongodb';

interface ScrapedItem {
title: string;
price: number;
url: string;
}

class ScrapingDatabase {
private client: MongoClient;
private db: Db | null = null;
private collection: Collection<ScrapedItem> | null = null;

constructor(private uri: string, private dbName: string, private collectionName: string) {
this.client = new MongoClient(uri);
}

async connect(): Promise<void> {
await this.client.connect();
this.db = this.client.db(this.dbName);
this.collection = this.db.collection<ScrapedItem>(this.collectionName);
console.log('Connected to MongoDB');
}

async insertScrapedData(data: ScrapedItem[]): Promise<void> {
if (!this.collection) throw new Error('Database not connected');
const result = await this.collection.insertMany(data);
console.log(`Inserted ${result.insertedCount} items`);
}

async close(): Promise<void> {
await this.client.close();
console.log('Disconnected from MongoDB');
}
}

// Usage
async function main() {
const db = new ScrapingDatabase('mongodb://localhost:27017', 'scraping_db', 'scraped_items');
await db.connect();

const scrapedData: ScrapedItem[] = [
{ title: 'Product 1', price: 19.99, url: 'https://example.com/product1' },
{ title: 'Product 2', price: 29.99, url: 'https://example.com/product2' }
];

await db.insertScrapedData(scrapedData);
await db.close();
}

main().catch(console.error);

This example demonstrates how to use TypeScript interfaces with MongoDB to ensure type safety when storing scraped data.
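One caveat with insertMany is that re-running the scraper inserts the same products again. A common way around this is to upsert on a natural key instead. The sketch below assumes the product URL is unique per item (an assumption, not something the original example states) and builds the operation objects that the driver's bulkWrite method accepts; buildUpsertOperations and UpsertOperation are hypothetical names.

```typescript
// Same shape as the ScrapedItem interface used in this section.
interface ScrapedItem {
  title: string;
  price: number;
  url: string;
}

// Shape of one updateOne entry as accepted by the driver's bulkWrite().
interface UpsertOperation {
  updateOne: {
    filter: { url: string };
    update: { $set: ScrapedItem };
    upsert: true;
  };
}

// Hypothetical helper: one upsert per item, keyed on the URL (assumed unique).
function buildUpsertOperations(items: ScrapedItem[]): UpsertOperation[] {
  return items.map((item): UpsertOperation => ({
    updateOne: {
      filter: { url: item.url },
      update: { $set: item },
      upsert: true,
    },
  }));
}

// Possible usage inside ScrapingDatabase (not executed here):
//   await this.collection.bulkWrite(buildUpsertOperations(data));
```

With this in place, a second run updates the title and price of existing documents rather than duplicating them.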

Optimization Techniques for Large-Scale Scraping

When scraping large datasets or complex websites, optimizing your approach is crucial to ensure efficiency and prevent overloading the target site. To do this, you need to use techniques like concurrency and breaking down large scraping tasks into manageable chunks. Let’s look at some of these techniques.

Concurrency and Parallel Scraping

Making concurrent requests is essential to speed up scraping large numbers of pages or resources. JavaScript’s Promise.all, which TypeScript types end to end, provides a simple yet powerful way to run multiple requests concurrently. Keep in mind that Promise.all starts every request at once and rejects as soon as any single request fails. For more sophisticated control over task execution, you can use task queues to avoid overwhelming the server or exceeding rate limits.

The following example demonstrates how to use Promise.all to perform multiple scraping tasks concurrently.

import axios from 'axios';

async function scrapeMultipleUrls(urls: string[]): Promise<string[]> {
const promises = urls.map(url => axios.get(url).then(response => response.data));
return await Promise.all(promises);
}

// Usage
const urls = [
'https://example.com/page1',
'https://example.com/page2',
'https://example.com/page3'
];

scrapeMultipleUrls(urls)
.then(results => console.log('Scraped data:', results))
.catch(error => console.error('Parallel scraping failed:', error));
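If you would rather keep the pages that succeeded when a few requests fail, Promise.allSettled is a possible alternative. The sketch below is written against an injected fetchPage function so it stays independent of any HTTP client; Fetcher, ScrapeOutcome, and scrapeAllSettled are hypothetical names introduced for this illustration.

```typescript
// Any function that resolves to a page body; in practice this could wrap
// axios, e.g. url => axios.get(url).then(response => response.data).
type Fetcher = (url: string) => Promise<string>;

interface ScrapeOutcome {
  succeeded: { url: string; body: string }[];
  failed: { url: string; reason: string }[];
}

// Hypothetical helper: run all requests concurrently and sort the results
// into successes and failures instead of rejecting on the first error.
async function scrapeAllSettled(urls: string[], fetchPage: Fetcher): Promise<ScrapeOutcome> {
  const results = await Promise.allSettled(urls.map(async url => fetchPage(url)));
  const outcome: ScrapeOutcome = { succeeded: [], failed: [] };

  urls.forEach((url, index) => {
    const result = results[index];
    if (result === undefined) return;
    if (result.status === 'fulfilled') {
      outcome.succeeded.push({ url, body: result.value });
    } else {
      outcome.failed.push({ url, reason: String(result.reason) });
    }
  });

  return outcome;
}
```

The failed list gives you the exact URLs to retry later, which is usually more useful than a single rejected promise.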

Using a Task Queue for More Control

import Queue from 'bull';
import axios from 'axios';

interface ScrapeTask {
url: string;
}

const scrapeQueue = new Queue<ScrapeTask>('web scraping');

scrapeQueue.process(async (job) => {
const { url } = job.data;
const { data } = await axios.get(url);
return data;
});

// Add jobs to the queue
async function queueScrapingJobs(urls: string[]): Promise<void> {
for (const url of urls) {
await scrapeQueue.add({ url });
}
}

// Process results
scrapeQueue.on('completed', (job, result) => {
console.log(`Scraped ${job.data.url}:`, result);
});

// Usage
const urlsToScrape = [
'https://example.com/page1',
'https://example.com/page2',
'https://example.com/page3'
];

queueScrapingJobs(urlsToScrape).catch(console.error);

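This implementation uses Bull, which stores its jobs in Redis, so a Redis server must be reachable (by default on localhost:6379) before the code will run. The queue separates adding jobs from processing them, which gives you more control over job processing than firing every request at once.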
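When running Redis is more than a project needs, a small in-process worker pool can provide similar control over how many requests are in flight. The following is a minimal sketch of that idea, not a replacement for Bull's persistence or retries; mapWithConcurrency is a hypothetical helper name.

```typescript
// Hypothetical helper: apply `worker` to every item, with at most `limit`
// calls running at the same time. Results keep the order of `items`.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runWorker(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index] as T);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => runWorker()));
  return results;
}

// Possible usage (not executed here):
//   const pages = await mapWithConcurrency(urlsToScrape, 3, url =>
//     axios.get(url).then(response => response.data));
```

Unlike the Bull version, this pool lives in memory only: if the process crashes, pending URLs are lost, and the first failed request rejects the whole call.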

Handling Large Datasets

For large-scale scraping tasks involving hundreds or thousands of pages, breaking down the task into smaller, more manageable pieces is crucial for both performance and reliability.

Rather than scraping all URLs in one go, you can divide them into smaller chunks and process each chunk separately. This reduces the load on both your system and the target website, while allowing you to handle errors more effectively. You can also process the entire array of URLs one by one, sequentially writing the scraped data to the output file.

import * as fs from "fs"; // Namespace import for Node's 'fs' module
import { Transform } from "stream";
import axios from "axios";
import * as cheerio from "cheerio"; // Namespace import for 'cheerio'

// Interface for scraped items
interface ScrapedItem {
title: string;
price: number;
description: string;
}

// ScrapingStream class that transforms URLs into scraped data
class ScrapingStream extends Transform {
constructor() {
super({ objectMode: true });
}

// The _transform method scrapes the URL and pushes the result into the stream
async _transform(url: string, _encoding: BufferEncoding, callback: (error?: Error | null) => void) {
try {
// Axios GET request to scrape the URL
const { data } = await axios.get(url);
const $ = cheerio.load(data);

const scrapedItem: ScrapedItem = {
title: $(".product-title").text().trim(),
// Strip currency symbols and thousands separators before parsing
price: parseFloat($(".price").text().replace(/[^0-9.]/g, "")),
description: $(".product-description").text().trim(),
};

// Push the scraped data as a JSON string followed by a newline character
this.push(JSON.stringify(scrapedItem) + "\n");
callback(); // Notify that transformation is done
} catch (error) {
// Pass the failure to the stream as an Error object
callback(error instanceof Error ? error : new Error(String(error)));
}
}
}

// Function to chunk an array into smaller arrays (batches)
function chunkArray<T>(array: T[], chunkSize: number): T[][] {
const result: T[][] = [];
for (let i = 0; i < array.length; i += chunkSize) {
result.push(array.slice(i, i + chunkSize));
}
return result;
}

// Function to scrape URLs in batches and write results to a file
async function scrapeToFile(
urls: string[],
outputFile: string,
chunkSize: number
): Promise<void> {
// Chunk the URLs into smaller batches
const chunkedUrls = chunkArray(urls, chunkSize);

// Create a writable stream for the output file
const writeStream = fs.createWriteStream(outputFile);

for (const chunk of chunkedUrls) {
// Create a new scraping stream for each batch
const scrapingStream = new ScrapingStream();

// Pipe the scraping stream to the file write stream
scrapingStream.pipe(writeStream, { end: false }); // Keep the file stream open between chunks

// Write each URL in the current chunk to the scraping stream
for (const url of chunk) {
scrapingStream.write(url);
}

// End the current scraping stream after all URLs in the chunk are processed
scrapingStream.end();

// Wait until the batch has been fully read out of the scraping stream,
// which means every result has been handed to the file write stream
await new Promise<void>((resolve, reject) => {
scrapingStream.on("end", resolve);
scrapingStream.on("error", reject);
});
}

// End the file write stream once all chunks are processed
writeStream.end();

return new Promise((resolve, reject) => {
writeStream.on("finish", resolve);
writeStream.on("error", reject);
});
}

// Usage
const urls = [
"https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
"https://www.scrapingcourse.com/ecommerce/product/aero-daily-fitness-tee/",
// ... add more product URLs as needed
];

const chunkSize = 10; // Define the size of each chunk

scrapeToFile(urls, "scraped_data.jsonl", chunkSize)
.then(() => console.log("Scraping completed"))
.catch((error) => console.error("Scraping failed:", error));

In this example, the ScrapingStream class extends the Node.js Transform stream to handle scraping for each URL, and the results are pushed into a write stream that saves them to the file as one JSON object per line. The scrapeToFile function processes the URLs in smaller batches (defined by chunkSize), keeps the write stream open between batches, and waits for each batch to complete before moving to the next. Because a Transform stream handles one chunk at a time, the URLs within a batch are fetched sequentially, and a failed request rejects the whole run. This approach helps manage memory usage and keeps the request rate under control when scraping large numbers of URLs.
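Since the output file holds one JSON object per line, reading it back is a matter of splitting on newlines and parsing each line on its own. The sketch below shows one way to do that while tolerating a damaged line; parseJsonLines is a hypothetical helper, and the cast to ScrapedItem is unchecked, so treat the result as untrusted until validated.

```typescript
// Same shape as the ScrapedItem interface in the streaming example.
interface ScrapedItem {
  title: string;
  price: number;
  description: string;
}

// Hypothetical helper: parse JSON Lines text, collecting the 1-based numbers
// of lines that are not valid JSON instead of aborting on the first one.
function parseJsonLines(text: string): { items: ScrapedItem[]; badLines: number[] } {
  const items: ScrapedItem[] = [];
  const badLines: number[] = [];

  text.split('\n').forEach((line, index) => {
    if (line.trim() === '') return; // skip blank lines, including the trailing one
    try {
      // Unchecked cast: this only proves the line is valid JSON.
      items.push(JSON.parse(line) as ScrapedItem);
    } catch {
      badLines.push(index + 1);
    }
  });

  return { items, badLines };
}

// Possible usage (not executed here):
//   const text = await fs.promises.readFile('scraped_data.jsonl', 'utf8');
//   const { items, badLines } = parseJsonLines(text);
```

One detail worth knowing: JSON.stringify writes NaN as null, so a product whose price could not be parsed will come back with price set to null rather than a number.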

Conclusion

Web scraping with TypeScript, combined with powerful libraries like Axios, Cheerio, Puppeteer, and Playwright, provides developers with flexible and robust solutions for extracting data from both static and dynamic websites. By leveraging TypeScript’s strong typing and async/await syntax, you can create reliable, maintainable scrapers that are well-suited for handling complex scenarios like CAPTCHAs, dynamic content, and large-scale scraping tasks.

However, as we’ve seen, building an efficient scraper from scratch can involve a lot of moving parts—managing browser contexts, implementing error handling, rotating proxies and user-agents, dealing with rate limits, and ensuring scalability. This complexity requires significant development effort, testing, and maintenance, particularly when scraping JavaScript-heavy websites.

Here’s where Scrape.do offers a huge advantage. Rather than building custom solutions, Scrape.do provides an all-in-one, fully managed scraping service that handles the complexities for you:

  • JavaScript Rendering: Scrape dynamic content without worrying about browser automation.
  • Automatic Proxy Rotation: Avoid IP bans with seamless proxy management.
  • CAPTCHA Solving: Let Scrape.do handle CAPTCHAs, eliminating the need for third-party services.
  • Rate Limiting: Scrape.do ensures your scrapers adhere to rate limits and avoid being blocked.
  • High Scalability: Scale your scraping operations effortlessly with no need for manual optimization.

By utilizing Scrape.do, you can focus on what truly matters—extracting valuable data—without needing to implement and manage the numerous technical challenges that come with large-scale, dynamic web scraping. The service abstracts away the complexities and gives you reliable, efficient, and scalable scraping capabilities with just a few lines of code.

In short, while building custom scraping solutions is possible and sometimes necessary, using Scrape.do saves you time and resources, making it a more intelligent choice for most use cases.