
Web Scraping With C++


C++ offers a unique set of features that make it an attractive choice for developers who need maximum performance and control over system resources. Although web scraping is typically associated with higher-level languages, C++ can deliver faster execution, better resource handling, and scalability, all of which are critical when scraping large datasets or building scrapers that need to run for long periods without memory leaks or crashes.

C++ offers several advantages for web scraping tasks:

  • Low-level control: C++ provides direct access to memory and network operations, allowing for optimized resource management.
  • High performance: C++’s speed can significantly reduce processing time when dealing with large-scale scraping tasks.
  • System integration: C++ easily integrates with other systems via APIs or custom HTTP libraries, making it versatile for various scraping scenarios.

In this article, we’ll walk through building a web scraper in C++: setting up the environment, making HTTP requests, parsing HTML, managing sessions and authentication, scaling with multithreading, and storing the results.

Technical Requirements

It’s important to set up the necessary environment for building a web scraper in C++. This involves choosing the right libraries, the correct compiler setup, and managing dependencies effectively.

Windows

For Windows users, the setup process is slightly different. Here are the steps you should follow:

  • Install Visual Studio with C++ support if you haven’t already done so. This will provide you with the necessary compiler and development tools.
  • Install CMake for Windows from the official website (https://cmake.org/download/).
  • For libcurl, you have two main options: use vcpkg, a package manager for C++ libraries on Windows, or download pre-built binaries from the official curl website.

Let’s go with vcpkg, as it’s better suited for adding more dependencies later.

  • First, install vcpkg. Clone the vcpkg repository:

git clone https://github.com/Microsoft/vcpkg.git

Next, run the bootstrap script:

.\vcpkg\bootstrap-vcpkg.bat

  • Once that’s done, install libcurl using vcpkg:

.\vcpkg\vcpkg install curl:x64-windows

  • Finally, integrate vcpkg with CMake:

.\vcpkg\vcpkg integrate install

Linux

For Linux, setting up is more straightforward:

sudo apt-get install libcurl4-openssl-dev

Development Environment Setup

You’ll need a modern C++ compiler (e.g., GCC 9+ or Clang 10+) and CMake to handle the project’s build system. For large-scale projects, CMake is essential to manage dependencies and configurations across different platforms efficiently.

To manage dependencies like libcurl, CMake together with package managers such as Conan or vcpkg can simplify installation and configuration across platforms like Windows, Linux, and macOS. Now that we’re set up, let’s get into the meat of things.
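
For reference, here’s a minimal CMakeLists.txt sketch for a libcurl-based scraper. The project and target names are placeholders, and it assumes libcurl is discoverable via find_package (with vcpkg, you would also pass the vcpkg toolchain file to CMake):

cmake_minimum_required(VERSION 3.15)
project(cpp_scraper CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Locate libcurl (installed via apt, vcpkg, or Conan)
find_package(CURL REQUIRED)

add_executable(scraper main.cpp)
target_link_libraries(scraper PRIVATE CURL::libcurl)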

Making HTTP Requests in C++

Interacting with websites efficiently is a critical aspect of building a web scraper in C++. This involves performing both GET and POST requests to retrieve and submit data. C++ offers powerful libraries like libcurl that simplify these operations while ensuring flexibility and performance.

libcurl is a widely-used library for performing HTTP requests in C++. It provides a simple and effective way to handle GET requests, which are commonly used in scraping to fetch data from web pages.

Let’s start with a basic example using libcurl to make a GET request:

#include <curl/curl.h>
#include <iostream>
#include <string>

size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
    size_t total_size = size * nmemb;
    output->append(static_cast<char*>(contents), total_size);
    return total_size;
}

int main() {
    CURL* curl;
    CURLcode res;
    std::string readBuffer;

    curl = curl_easy_init();
    if (curl) {
        curl_easy_setopt(curl, CURLOPT_URL, "https://example.com");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);

        res = curl_easy_perform(curl);
        if (res != CURLE_OK) {
            std::cerr << "curl_easy_perform() failed: " << curl_easy_strerror(res) << std::endl;
        } else {
            std::cout << readBuffer << std::endl;
        }

        curl_easy_cleanup(curl);
    }

    return 0;
}

This example demonstrates a simple GET request to `https://example.com`. The `WriteCallback` function appends the received data to a string buffer.

Handling POST requests

In addition to fetching data, many web scrapers need to submit data via POST requests, for example, when dealing with forms or APIs. Here’s how to send both form data and JSON payloads using libcurl.

For POST requests, you can modify the above code as follows:

#include <curl/curl.h>
#include <iostream>
#include <string>

size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
    size_t total_size = size * nmemb;
    output->append(static_cast<char*>(contents), total_size);
    return total_size;
}

int main() {
    CURL* curl;
    CURLcode res;
    std::string readBuffer;

    curl = curl_easy_init();
    if (curl) {
        // URL-encoded form data to submit (placeholder values)
        const char* postData = "username=demo&password=secret";

        curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/login");
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, postData); // Switches the request to POST
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);

        // For a JSON payload instead, set a Content-Type header:
        // struct curl_slist* headers = curl_slist_append(nullptr, "Content-Type: application/json");
        // curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
        // curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "{\"username\":\"demo\"}");

        res = curl_easy_perform(curl);
        if (res != CURLE_OK) {
            std::cerr << "curl_easy_perform() failed: " << curl_easy_strerror(res) << std::endl;
        } else {
            std::cout << readBuffer << std::endl;
        }

        curl_easy_cleanup(curl);
    }

    return 0;
}

SSL/TLS Handling

When scraping over HTTPS, or doing anything over the internet, ensuring secure communication with SSL/TLS is vital. libcurl provides support for SSL with minimal configuration, allowing you to scrape from HTTPS sites securely.

libcurl handles SSL/TLS by default, so ensure you’re using a build of libcurl compiled with SSL support. You can configure certificate verification like this:

curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 1L);
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYHOST, 2L);
curl_easy_setopt(curl, CURLOPT_CAINFO, "path/to/ca-bundle.crt");

Handling Redirects

To handle redirects automatically:

curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L); // Follow HTTP redirects
curl_easy_setopt(curl, CURLOPT_MAXREDIRS, 5L); // Set max num of redirects to follow

For more advanced scraping, especially when dealing with multiple requests simultaneously, you can implement asynchronous requests. Using libraries like Boost.Asio or native C++ threads allows you to handle multiple HTTP requests concurrently, improving the efficiency of your scraper.

#include <boost/asio.hpp>
#include <thread>

void async_request() {
    // Implement an asynchronous HTTP request here using Boost.Asio,
    // for example an asynchronous connection to a web server
}

int main() {
    std::thread t1(async_request);
    std::thread t2(async_request);

    t1.join();
    t2.join();

    return 0;
}
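
If you prefer to stay with the standard library and libcurl, here’s a minimal sketch that fetches several pages concurrently with std::async; the WriteCallback is the same one used in the GET example, and the URLs are placeholders:

#include <curl/curl.h>
#include <future>
#include <iostream>
#include <string>
#include <vector>

// Same write callback as in the GET example above
size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
    size_t total_size = size * nmemb;
    output->append(static_cast<char*>(contents), total_size);
    return total_size;
}

// Fetch a single URL and return the response body (empty string on failure)
std::string fetch(const std::string& url) {
    std::string body;
    CURL* curl = curl_easy_init();
    if (curl) {
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
        curl_easy_perform(curl);
        curl_easy_cleanup(curl);
    }
    return body;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);

    std::vector<std::string> urls = {"https://example.com", "https://example.org"};
    std::vector<std::future<std::string>> results;

    // Launch one asynchronous task per URL
    for (const auto& url : urls) {
        results.push_back(std::async(std::launch::async, fetch, url));
    }

    // Collect the responses as they complete
    for (auto& fut : results) {
        std::cout << fut.get().size() << " bytes fetched" << std::endl;
    }

    curl_global_cleanup();
    return 0;
}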

Parsing HTML Responses

Once an HTTP request has been made and the HTML content retrieved, the next step in web scraping is parsing the HTML to extract meaningful data. C++ doesn’t have built-in HTML parsing, so we’ll use Gumbo-parser for this example; libxml2 (covered later for XPath) is another option. Boost.Beast, if you’re already using Boost, covers the networking side but does not parse HTML.

To use Gumbo-parser in your C++ project, you’ll first need to install it. For Linux, do this:

sudo apt-get install libgumbo-dev

For Windows, build Gumbo-parser from source (for example from an MSYS2 shell with autotools installed):

git clone https://github.com/google/gumbo-parser.git
cd gumbo-parser
autoreconf -vfi
./configure
make
make install

Once that’s done, you’re good to go. Here’s a basic example of Gumbo-parser in action:

#include <gumbo.h>
#include <iostream>
#include <string>
#include <vector>

struct LinkInfo {
    std::string href;
    std::string text;
};

void search_for_links(GumboNode* node, std::vector<LinkInfo>& links) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return;
    }

    if (node->v.element.tag == GUMBO_TAG_A) {
        GumboAttribute* href = gumbo_get_attribute(&node->v.element.attributes, "href");
        if (href) {
            LinkInfo link;
            link.href = href->value;
            if (node->v.element.children.length > 0) {
                GumboNode* text_node = static_cast<GumboNode*>(node->v.element.children.data[0]);
                if (text_node->type == GUMBO_NODE_TEXT) {
                    link.text = text_node->v.text.text;
                }
            }
            links.push_back(link);
        }
    }

    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        search_for_links(static_cast<GumboNode*>(children->data[i]), links);
    }
}

int main() {
    // Sample HTML content
    std::string html_content = R"(
        <html>
        <body>
            <a href="https://example.com">Example</a>
            <a href="https://google.com">Google</a>
        </body>
        </html>
    )";

    GumboOutput* output = gumbo_parse(html_content.c_str());
    std::vector<LinkInfo> links;
    search_for_links(output->root, links);

    for (const auto& link : links) {
        std::cout << "Link: " << link.text << " (" << link.href << ")" << std::endl;
    }

    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return 0;
}

This example shows how to extract all links from an HTML document using gumbo-parser. It searches for <a> tags and extracts both the href attribute and the link text.

Extracting Structured Data (XPath and CSS Selectors in C++)

XPath and CSS selectors are powerful tools for targeting specific elements within an HTML document, especially when the structure is complex. While C++ does not have built-in support for XPath or CSS selectors, there are third-party libraries, such as libxml2 for XPath and Gumbo-query for CSS selectors, that offer this functionality.

For more complex scraping tasks, extracting structured data is often the best path to take. Here’s an example that extracts product information from a hypothetical e-commerce page:

#include <gumbo.h>
#include <cstring>
#include <string>

struct ProductInfo {
    std::string name;
    std::string price;
    std::string description;
};

// Returns the text of an element's first child, or an empty string if there is none
static std::string first_child_text(GumboNode* element) {
    if (element->v.element.children.length == 0) {
        return "";
    }
    GumboNode* text_node = static_cast<GumboNode*>(element->v.element.children.data[0]);
    return text_node->type == GUMBO_NODE_TEXT ? text_node->v.text.text : "";
}

void extract_product_info(GumboNode* node, ProductInfo& product) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return;
    }

    if (node->v.element.tag == GUMBO_TAG_DIV) {
        GumboAttribute* class_attr = gumbo_get_attribute(&node->v.element.attributes, "class");
        if (class_attr && strcmp(class_attr->value, "product") == 0) {
            GumboVector* children = &node->v.element.children;
            for (unsigned int i = 0; i < children->length; ++i) {
                GumboNode* child = static_cast<GumboNode*>(children->data[i]);
                if (child->type != GUMBO_NODE_ELEMENT) {
                    continue;
                }
                if (child->v.element.tag == GUMBO_TAG_H2) {
                    product.name = first_child_text(child);
                } else if (child->v.element.tag == GUMBO_TAG_SPAN) {
                    GumboAttribute* attr = gumbo_get_attribute(&child->v.element.attributes, "class");
                    if (attr && strcmp(attr->value, "price") == 0) {
                        product.price = first_child_text(child);
                    }
                } else if (child->v.element.tag == GUMBO_TAG_P) {
                    GumboAttribute* attr = gumbo_get_attribute(&child->v.element.attributes, "class");
                    if (attr && strcmp(attr->value, "description") == 0) {
                        product.description = first_child_text(child);
                    }
                }
            }
        }
    }

    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        extract_product_info(static_cast<GumboNode*>(children->data[i]), product);
    }
}

This function recursively searches through the HTML structure to locate a div element with the class “product” and retrieves the name, price, and description from specific child elements.

Error Handling

Real-world HTML can be messy and non-compliant, leading to potential issues during parsing. Handling these cases gracefully is important to avoid crashes or incomplete scraping.

Here are a couple of best practices for handling malformed HTML data:

  • Use libraries like libxml2 and Gumbo-parser, which are robust against malformed HTML.
  • Implement retry mechanisms if parsing fails due to malformed structures.
  • Log warnings or errors for malformed elements and continue processing the rest of the document.

By incorporating error handling into your web scraping workflow, you can ensure more stable and reliable data extraction, even when faced with imperfect HTML.
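
As a small illustration of the “log and continue” approach with Gumbo-parser, the sketch below skips anchors that are missing an href instead of aborting; the log messages and sample markup are illustrative:

#include <gumbo.h>
#include <iostream>
#include <string>

// Gumbo never fails on malformed input; it repairs the tree. The job here is to
// tolerate missing pieces while walking it, logging what we skip.
static int skipped_nodes = 0;

void collect_hrefs(GumboNode* node) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return;
    }
    if (node->v.element.tag == GUMBO_TAG_A) {
        GumboAttribute* href = gumbo_get_attribute(&node->v.element.attributes, "href");
        if (href) {
            std::cout << "href: " << href->value << std::endl;
        } else {
            // Malformed or incomplete anchor: log it and keep going
            std::cerr << "warning: <a> without href skipped" << std::endl;
            ++skipped_nodes;
        }
    }
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        collect_hrefs(static_cast<GumboNode*>(children->data[i]));
    }
}

int main() {
    // Deliberately broken markup: unclosed tags, anchor without href
    std::string html = "<html><body><a>broken<a href='https://example.com'>ok</body>";
    GumboOutput* output = gumbo_parse(html.c_str());
    collect_hrefs(output->root);
    std::cerr << skipped_nodes << " node(s) skipped" << std::endl;
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return 0;
}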

Managing Sessions and Cookies

Managing sessions and cookies is essential when scraping websites that require authentication or stateful interaction across multiple requests. In C++, libraries like libcurl provide a powerful way to handle cookies, manage sessions, and perform authenticated requests.

libcurl’s built-in cookie engine lets you persist cookies across requests, which is key for sites that require login or maintain session state:

CURL* curl = curl_easy_init();
if (curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com");
    curl_easy_setopt(curl, CURLOPT_COOKIEFILE, "");           // Enable cookie engine
    curl_easy_setopt(curl, CURLOPT_COOKIEJAR, "cookies.txt"); // Save cookies to file

    // Perform request
    curl_easy_perform(curl);

    // Subsequent requests will use the saved cookies
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/authenticated");
    curl_easy_perform(curl);

    curl_easy_cleanup(curl);
}

Handling Authentication

Web scraping often requires dealing with websites that implement authentication mechanisms such as Basic Authentication or OAuth2. libcurl supports these authentication techniques, making it easy to scrape data from authenticated websites. Here’s an example of how to handle basic authentication:

curl_easy_setopt(curl, CURLOPT_USERNAME, "user");
curl_easy_setopt(curl, CURLOPT_PASSWORD, "password");

For more complex authentication flows, like OAuth, you’ll need to implement the specific protocol. Here’s a basic example of how you might handle token-based authentication:

std::string auth_token;

// Function to perform login and get token
bool login(CURL* curl, const std::string& username, const std::string& password) {
    // Set up POST data
    std::string postfields = "username=" + username + "&password=" + password;
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/login");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, postfields.c_str());

    std::string response;
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        return false;
    }

    // Parse the response to get the auth token.
    // This is a simplification; in reality you'd parse JSON or whatever format the server returns.
    size_t token_pos = response.find("token:");
    if (token_pos != std::string::npos) {
        auth_token = response.substr(token_pos + 6);
        return true;
    }

    return false;
}

// Function to make authenticated requests
CURLcode make_authenticated_request(CURL* curl, const std::string& url) {
    struct curl_slist* headers = NULL;
    headers = curl_slist_append(headers, ("Authorization: Bearer " + auth_token).c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());

    CURLcode res = curl_easy_perform(curl);

    curl_slist_free_all(headers);
    return res;
}

CAPTCHA and Anti-Scraping Measures

Many websites employ CAPTCHA systems or rate-limiting to prevent automated scraping. While bypassing these anti-scraping measures can be complex, you have several options in C++:

  • Detecting CAPTCHA:
    • Check if the HTML response contains CAPTCHA-related elements, such as <div class="g-recaptcha"> (see the detection sketch after this list).
    • If a CAPTCHA is detected, pause the scraper and alert the user for manual intervention or bypass.

  • Bypassing CAPTCHA:
    • Integrate with CAPTCHA-solving services like 2Captcha or DeathByCaptcha, which provide API-based CAPTCHA-solving solutions.
    • Send CAPTCHA images or tokens to these services and wait for the response before continuing the scraping process.
  • Handling Rate-Limiting:
    • Implement rate-limiting in your scraper by adding delays between requests, respecting the website’s Retry-After headers, and limiting the number of requests per second.
    • Use techniques like IP rotation with proxies to distribute requests across different IP addresses and avoid triggering rate limits.
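
As a minimal illustration of the detection step, here is a sketch that checks a fetched HTML body for common CAPTCHA markers; the marker list is illustrative, not exhaustive:

#include <array>
#include <string>

// Returns true if the response body looks like a CAPTCHA or bot-challenge page
bool looks_like_captcha(const std::string& html) {
    const std::array<std::string, 3> markers = {
        "g-recaptcha",   // Google reCAPTCHA widget
        "h-captcha",     // hCaptcha widget
        "cf-challenge"   // Cloudflare challenge page
    };
    for (const auto& marker : markers) {
        if (html.find(marker) != std::string::npos) {
            return true;
        }
    }
    return false;
}

// Usage: after fetching a page body into a std::string,
// if (looks_like_captcha(body)) pause, alert, or hand the page off to a solving service.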

However, Scrape.do offers a simpler way of achieving the same results as everything we’ve discussed so far. When using Scrape.do, you don’t have to worry about setting up infrastructure for making HTTP requests, managing proxies, or handling complex authentication flows. It abstracts these complexities away, allowing you to focus on extracting data without managing the low-level details.

Scrape.do also has CAPTCHA-solving mechanisms built into the service, meaning you don’t need to integrate external CAPTCHA-solving services or manually handle CAPTCHA challenges. It also bypasses other anti-scraping measures like rate-limiting by automatically delaying and retrying requests as needed.

By offloading the complex parts of scraping (session management, IP rotation, CAPTCHA solving, etc.) to Scrape.do, development time is reduced. This also makes long-term maintenance easier because you don’t have to update your scraper each time a website changes its anti-scraping tactics. Scrape.do keeps up with these changes automatically.

Multithreading and Scraping at Scale

Scaling web scraping operations requires the ability to send and handle multiple requests concurrently. In C++, multithreading allows you to perform parallel HTTP requests, making scraping faster and more efficient. By leveraging C++’s multithreading capabilities, you can scrape large datasets at scale while managing performance, rate limits, and error recovery effectively.

Let’s create a simple thread pool to manage multiple concurrent requests:

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex queue_mutex;
    std::condition_variable condition;
    bool stop;

public:
    ThreadPool(size_t threads) : stop(false) {
        for (size_t i = 0; i < threads; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(this->queue_mutex);
                        this->condition.wait(lock, [this] { return this->stop || !this->tasks.empty(); });
                        if (this->stop && this->tasks.empty()) return;
                        task = std::move(this->tasks.front());
                        this->tasks.pop();
                    }
                    task();
                }
            });
    }

    template<class F>
    void enqueue(F&& f) {
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            tasks.emplace(std::forward<F>(f));
        }
        condition.notify_one();
    }

    ~ThreadPool() {
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            stop = true;
        }
        condition.notify_all();
        for (std::thread& worker : workers) worker.join();
    }
};

// Usage
ThreadPool pool(4); // Create a thread pool with 4 threads
for (int i = 0; i < 8; ++i) {
    pool.enqueue([i] {
        // Perform scraping task
        std::cout << "Scraping task " << i << " completed\n";
    });
}

This implementation creates a pool of worker threads that continuously wait for tasks to be added to a queue. When a task is available, a worker thread dequeues and executes it. When the thread pool object is destroyed, the pool is stopped and all worker threads are joined.

It’s a great way to queue up and execute scraping tasks simultaneously.

Rate Limiting & Throttling

To avoid getting blocked by websites due to excessive requests, it’s crucial to implement rate limiting in your scraper. You can introduce delays between requests or limit the number of requests sent per minute.

#include <chrono>
#include <string>
#include <thread>
#include <vector>

// fetch_url() performs the actual HTTP request (e.g., the libcurl GET shown earlier)
void fetch_url(const std::string& url);

void fetch_url_with_delay(const std::string& url, int delay_ms) {
    fetch_url(url);
    std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms)); // Introduce delay
}

int main() {
    std::vector<std::string> urls = { "https://example.com/page1", "https://example.com/page2" };

    for (const auto& url : urls) {
        fetch_url_with_delay(url, 2000); // 2-second delay between requests
    }

    return 0;
}

In this example, a delay of 2 seconds is introduced between each request to avoid overwhelming the server. This basic rate-limiting strategy can be adjusted based on the website’s restrictions (e.g., requests per minute).
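
If you need to enforce a requests-per-minute budget rather than a fixed delay, a simple limiter that spaces requests evenly can be sketched as follows; the class name and limits are illustrative:

#include <chrono>
#include <thread>

// Allows at most max_per_minute calls to wait() per minute by spacing requests evenly
class RateLimiter {
    std::chrono::milliseconds interval;
    std::chrono::steady_clock::time_point next_allowed;

public:
    explicit RateLimiter(int max_per_minute)
        : interval(60000 / max_per_minute),
          next_allowed(std::chrono::steady_clock::now()) {}

    // Blocks until the next request is allowed to go out
    void wait() {
        auto now = std::chrono::steady_clock::now();
        if (now < next_allowed) {
            std::this_thread::sleep_for(next_allowed - now);
        }
        next_allowed = std::chrono::steady_clock::now() + interval;
    }
};

// Usage:
// RateLimiter limiter(30);        // at most ~30 requests per minute
// for (const auto& url : urls) {
//     limiter.wait();
//     fetch_url(url);
// }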

Error Recovery

When scraping at scale, you’ll likely encounter network errors such as timeouts, connection issues, or server-side blocks. It’s important to implement error handling and retry mechanisms to ensure that your scraper can recover from these issues without crashing.

#include <curl/curl.h>
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

void fetch_url_with_retry(const std::string& url, int max_retries) {
    int retries = 0;
    while (retries < max_retries) {
        CURL* curl = curl_easy_init();
        if (curl) {
            CURLcode res;
            curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
            res = curl_easy_perform(curl);

            if (res == CURLE_OK) {
                std::cout << "Successfully fetched: " << url << std::endl;
                curl_easy_cleanup(curl);
                return;
            } else {
                std::cerr << "Error: " << curl_easy_strerror(res) << " | Retry: " << retries + 1 << std::endl;
                retries++;
                std::this_thread::sleep_for(std::chrono::seconds(2)); // Wait before retrying
            }

            curl_easy_cleanup(curl);
        }
    }

    std::cerr << "Failed to fetch " << url << " after " << max_retries << " retries." << std::endl;
}

int main() {
    fetch_url_with_retry("https://example.com", 3); // Retry up to 3 times on failure
    return 0;
}

This example attempts to retry a failed request up to 3 times, with a 2-second delay between retries. By doing so, you can recover from temporary network issues or server errors, ensuring that your scraper remains robust.

Handling JavaScript-Rendered Pages

JavaScript-heavy websites, which dynamically load content using frameworks like React, Angular, or Vue, can pose significant challenges for traditional scrapers. Since much of the content is rendered client-side, simply sending HTTP requests to the server may not return the complete page content. For these scenarios, you need specialized tools to scrape JavaScript-rendered pages effectively.

Headless browsers like Puppeteer are widely used to scrape JavaScript-heavy websites. While Puppeteer is typically used in Node.js environments, you can integrate it into C++ workflows via subprocesses or C++ bindings to automate a Chromium browser, allowing you to interact with JavaScript-rendered content.

One straightforward way to use Puppeteer with C++ is by invoking it as a subprocess to scrape JavaScript-rendered pages. First, you need to install Puppeteer:

npm install puppeteer

Next, you’d need to create a `puppeteer_script.js` file that uses Puppeteer to render the page and return the rendered HTML. Here’s an example:

const puppeteer = require('puppeteer');
const url = process.argv[2];

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url, {waitUntil: 'networkidle0'});
    const html = await page.content();
    console.log(html);
    await browser.close();
})();

Finally, in your C++ code, you can invoke this Puppeteer script as a subprocess to fetch the JavaScript-rendered content. Here’s an example:

#include <cstdio>
#include <iostream>
#include <memory>
#include <stdexcept>
#include <string>
#include <array>

std::string exec(const char* cmd) {
    std::array<char, 128> buffer;
    std::string result;
    std::unique_ptr<FILE, decltype(&pclose)> pipe(popen(cmd, "r"), pclose);
    if (!pipe) {
        throw std::runtime_error("popen() failed!");
    }
    while (fgets(buffer.data(), buffer.size(), pipe.get()) != nullptr) {
        result += buffer.data();
    }
    return result;
}

int main() {
    std::string result = exec("node puppeteer_script.js https://example.com");
    std::cout << result << std::endl;
    return 0;
}

This approach allows you to handle JavaScript-rendered content by leveraging Node.js and Puppeteer from your C++ application.

Working with WebSockets

Some JavaScript-heavy websites rely on WebSockets to communicate with the server and retrieve live data (e.g., for live chat, stock price updates, or real-time notifications). In C++, you can handle WebSocket-based communications using libraries like Boost.Asio, which provides support for asynchronous network programming, including WebSockets.

If Boost is not installed on your system, you can install it via your package manager or manually:

sudo apt-get install libboost-all-dev

Let’s look at a basic example that demonstrates how to connect to a WebSocket server, send messages, and receive responses using Boost.Beast.

#include <boost/beast/core.hpp>
#include <boost/beast/websocket.hpp>
#include <boost/asio/connect.hpp>
#include <boost/asio/ip/tcp.hpp>
#include <iostream>
#include <string>

namespace beast = boost::beast;
namespace websocket = beast::websocket;
namespace net = boost::asio;
using tcp = boost::asio::ip::tcp;

int main() {
    try {
        // Set up an I/O context
        net::io_context ioc;

        // Create a resolver for DNS lookups
        tcp::resolver resolver(ioc);

        // Resolve the WebSocket server address (example: echo.websocket.org)
        auto const results = resolver.resolve("echo.websocket.org", "80");

        // Create a WebSocket stream
        websocket::stream<tcp::socket> ws(ioc);

        // Connect to the server
        net::connect(ws.next_layer(), results.begin(), results.end());

        // Perform the WebSocket handshake
        ws.handshake("echo.websocket.org", "/");

        // Send a message
        ws.write(net::buffer(std::string("Hello, WebSocket!")));

        // Buffer for incoming messages
        beast::flat_buffer buffer;

        // Read the response
        ws.read(buffer);

        // Print the received message
        std::cout << beast::make_printable(buffer.data()) << std::endl;

        // Close the WebSocket connection
        ws.close(websocket::close_code::normal);
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

The code establishes a WebSocket connection to the server “echo.websocket.org” on port 80. It sends a message “Hello, WebSocket!” to the server and receives a response. The received message is then printed to the console, and the connection is closed.

To scrape JavaScript-rendered content from a site that uses WebSockets:

  • Use Puppeteer (or a similar headless browser) to handle the initial page load and any client-side rendering.
  • Use WebSockets (via Boost.Beast) to capture real-time data from the WebSocket connection.

By combining these techniques, you can create a fully automated scraper capable of dealing with both static JavaScript-rendered content and dynamic, real-time updates.

Advanced Data Storage Options

Efficient storage and processing of scraped data are critical, especially when dealing with large datasets or scaling web scraping operations. In C++, there are several approaches to store scraped data that optimize for performance, scalability, and ease of retrieval.

Storing Data in Files

For simpler projects or when a full database isn’t necessary, storing scraped data directly in files can be an effective solution. C++ provides powerful file-handling capabilities through standard libraries, allowing you to store data in text, CSV, JSON, or binary formats.
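
For instance, writing rows out as CSV can be as simple as the sketch below; the ScrapedItem struct is illustrative, and field escaping is omitted for brevity:

#include <fstream>
#include <string>
#include <vector>

// Illustrative record type for scraped rows
struct ScrapedItem {
    int id;
    std::string title;
    double price;
};

// Write rows to a CSV file; no quoting/escaping, so titles must not contain commas
void write_csv(const std::string& filename, const std::vector<ScrapedItem>& items) {
    std::ofstream out(filename);
    out << "id,title,price\n";
    for (const auto& item : items) {
        out << item.id << ',' << item.title << ',' << item.price << '\n';
    }
}

// Usage:
// write_csv("products.csv", {{1, "Product 1", 29.99}, {2, "Product 2", 39.99}});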

Storing data in binary format is generally faster and more space-efficient than plain-text formats like CSV or JSON, and it can drastically speed up both write and read operations, especially for large datasets. Here’s an example:

#include <iostream>
#include <fstream>
#include <vector>

struct ScrapedData {
    int id;
    double price;
    char title[100];
};

void write_to_file(const std::vector<ScrapedData>& data) {
    std::ofstream outFile("scraped_data.bin", std::ios::binary);
    if (outFile.is_open()) {
        outFile.write(reinterpret_cast<const char*>(data.data()), data.size() * sizeof(ScrapedData));
        outFile.close();
    } else {
        std::cerr << "Unable to open file for writing." << std::endl;
    }
}

std::vector<ScrapedData> read_from_file() {
    std::vector<ScrapedData> data;
    std::ifstream inFile("scraped_data.bin", std::ios::binary);

    if (inFile.is_open()) {
        ScrapedData temp;
        while (inFile.read(reinterpret_cast<char*>(&temp), sizeof(ScrapedData))) {
            data.push_back(temp);
        }
        inFile.close();
    } else {
        std::cerr << "Unable to open file for reading." << std::endl;
    }

    return data;
}

int main() {
    std::vector<ScrapedData> scraped_data = {
        {1, 29.99, "Product 1"},
        {2, 39.99, "Product 2"},
    };

    write_to_file(scraped_data);

    auto loaded_data = read_from_file();
    for (const auto& item : loaded_data) {
        std::cout << "ID: " << item.id << ", Price: " << item.price << ", Title: " << item.title << std::endl;
    }

    return 0;
}

In this example, scraped data is stored in binary format, which significantly improves I/O efficiency compared to text formats like CSV or JSON.

Database Integration

For larger projects or when the data needs to be queried efficiently, integrating a database is a better option. You can use relational databases like SQLite or MySQL or NoSQL databases like MongoDB for greater flexibility and performance when dealing with large amounts of structured or unstructured data.

SQLite is often the preferred option because it’s a lightweight, file-based database that’s simple to use from C++ and well-suited for projects where running a full database server is unnecessary, while still offering far better queryability and scalability than flat files. Let’s take an example:

#include <sqlite3.h>
#include <iostream>
#include <string>
#include <vector>

class Database {
public:
    Database(const std::string& db_name) : db(nullptr) {
        int rc = sqlite3_open(db_name.c_str(), &db);
        if (rc) {
            std::cerr << "Can't open database: " << sqlite3_errmsg(db) << std::endl;
            sqlite3_close(db);
        }
    }

    ~Database() {
        if (db) {
            sqlite3_close(db);
        }
    }

    bool execute(const std::string& sql) {
        char* error_message = nullptr;
        int rc = sqlite3_exec(db, sql.c_str(), nullptr, nullptr, &error_message);
        if (rc != SQLITE_OK) {
            std::cerr << "SQL error: " << error_message << std::endl;
            sqlite3_free(error_message);
            return false;
        }
        return true;
    }

    bool insert_scraped_data(const std::string& url, const std::string& title, const std::string& content) {
        const char* sql = "INSERT INTO ScrapedData (URL, Title, Content) VALUES (?, ?, ?);";
        sqlite3_stmt* stmt;
        int rc = sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr);
        if (rc != SQLITE_OK) {
            std::cerr << "Failed to prepare statement: " << sqlite3_errmsg(db) << std::endl;
            return false;
        }

        sqlite3_bind_text(stmt, 1, url.c_str(), -1, SQLITE_STATIC);
        sqlite3_bind_text(stmt, 2, title.c_str(), -1, SQLITE_STATIC);
        sqlite3_bind_text(stmt, 3, content.c_str(), -1, SQLITE_STATIC);

        rc = sqlite3_step(stmt);
        sqlite3_finalize(stmt);

        return rc == SQLITE_DONE;
    }

private:
    sqlite3* db;
};

int main() {
    Database db("scraped_data.db");

    // Create table if not exists
    db.execute("CREATE TABLE IF NOT EXISTS ScrapedData("
               "ID INTEGER PRIMARY KEY AUTOINCREMENT,"
               "URL TEXT NOT NULL,"
               "Title TEXT NOT NULL,"
               "Content TEXT);");

    // Insert scraped data
    db.insert_scraped_data("https://example.com", "Example Title", "Example Content");

    return 0;
}

This example demonstrates a simple wrapper class for SQLite operations, which allows you to store and retrieve scraped data easily.
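
The wrapper above only covers inserts; a retrieval helper could be sketched along the same lines. The function below is a hypothetical addition (written as a free function over the same ScrapedData table) that reads rows back with a prepared statement:

// Hypothetical addition: fetch all stored titles for a given URL
// (assumes the same includes as the example above, plus <vector> and <string>)
std::vector<std::string> get_titles_for_url(sqlite3* db, const std::string& url) {
    std::vector<std::string> titles;
    const char* sql = "SELECT Title FROM ScrapedData WHERE URL = ?;";
    sqlite3_stmt* stmt;

    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) {
        std::cerr << "Failed to prepare statement: " << sqlite3_errmsg(db) << std::endl;
        return titles;
    }

    sqlite3_bind_text(stmt, 1, url.c_str(), -1, SQLITE_STATIC);

    // Step through each matching row and copy the Title column out
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        const unsigned char* text = sqlite3_column_text(stmt, 0);
        titles.push_back(text ? reinterpret_cast<const char*>(text) : "");
    }

    sqlite3_finalize(stmt);
    return titles;
}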

Data Serialization

Data serialization is an efficient way to store complex data structures in a compact format. Boost.Serialization is a widely-used library that simplifies serializing and deserializing C++ objects, allowing you to store them in files or databases for later retrieval.

Here’s an example of data serialization using Boost.Serialization:

#include <boost/archive/text_oarchive.hpp>    // For saving serialized data to text format
#include <boost/archive/text_iarchive.hpp>    // For loading serialized data from text format
#include <boost/serialization/string.hpp>     // Required to serialize std::string members
#include <fstream>                            // For file stream operations (input/output)
#include <string>                             // For using std::string in the class
#include <iostream>                           // For printing to console

// Class representing a product with id, price, and title
class Product {
    friend class boost::serialization::access; // Allow Boost.Serialization access to private members

    // Private member variables
    int id;
    double price;
    std::string title;

public:
    // Default constructor
    Product() = default;

    // Parameterized constructor
    Product(int id, double price, const std::string& title) : id(id), price(price), title(title) {}

    // Serialization function that Boost calls to serialize/deserialize the object
    template <class Archive>
    void serialize(Archive& ar, const unsigned int version) {
        ar & id;    // Serialize/deserialize id
        ar & price; // Serialize/deserialize price
        ar & title; // Serialize/deserialize title
    }

    // Print function to output the product details
    void print() const {
        std::cout << "ID: " << id << ", Price: " << price << ", Title: " << title << std::endl;
    }
};

// Function to save a Product object to a file using Boost.Serialization
void save_data(const Product& p, const std::string& filename) {
    std::ofstream ofs(filename);            // Open output file stream
    boost::archive::text_oarchive oa(ofs);  // Create a text archive object to serialize data
    oa << p;                                // Serialize the Product object 'p' to file
}

// Function to load a Product object from a file using Boost.Serialization
Product load_data(const std::string& filename) {
    Product p;                              // Create an empty Product object
    std::ifstream ifs(filename);            // Open input file stream
    boost::archive::text_iarchive ia(ifs);  // Create a text archive object to deserialize data
    ia >> p;                                // Deserialize the Product object from file into 'p'
    return p;                               // Return the loaded Product object
}

int main() {
    // Create a Product object and save it to a file
    Product p(1, 29.99, "Product 1");
    save_data(p, "product.dat");

    // Load the Product object from the file and print its contents
    Product loaded_product = load_data("product.dat");
    loaded_product.print();

    return 0;
}

In this example, the Product class is serialized and deserialized using Boost.Serialization, allowing efficient storage and retrieval of complex data structures. This method is highly flexible and supports various formats, including text, binary, and XML.

Optimizing Performance and Resource Management

Writing highly performant and memory-efficient web scrapers in C++ is crucial for handling large-scale data extraction tasks. Efficient memory management is critical in scraping applications, particularly when dealing with large volumes of data or handling multiple HTTP requests.

C++ provides a powerful idiom called RAII (Resource Acquisition Is Initialization) to manage resources like memory and file handles automatically. RAII ensures that resources such as memory, files, or network connections are released automatically when they go out of scope, reducing the risk of memory leaks or resource mismanagement.

Here’s an example of RAII using smart pointers for automatic memory management:

std::unique_ptr<char[]> buffer(new char[1024]);

Here’s how to implement connection pooling to reuse connections with RAII:

#include <iostream>
#include <vector>
#include <thread>
#include <mutex>
#include <curl/curl.h>
#include <chrono>
#include <memory>
#include <stdexcept>

// RAII class to manage CURL* connections
class CurlHandle {
    CURL* handle;
public:
    CurlHandle() {
        handle = curl_easy_init(); // Acquire resource
        if (!handle) throw std::runtime_error("Failed to initialize CURL");
    }

    ~CurlHandle() {
        if (handle) curl_easy_cleanup(handle); // Automatically release resource
    }

    CURL* get() const { return handle; } // Expose raw CURL* if needed
};

class ConnectionPool {
    std::vector<std::unique_ptr<CurlHandle>> connections; // Use RAII for connections
    std::mutex mtx;

public:
    ConnectionPool(size_t size) {
        for (size_t i = 0; i < size; ++i) {
            connections.push_back(std::make_unique<CurlHandle>()); // RAII-managed connections
        }
    }

    // Hand ownership of a handle to the caller; returns nullptr if the pool is empty
    std::unique_ptr<CurlHandle> getConnection() {
        std::lock_guard<std::mutex> lock(mtx);
        if (connections.empty()) {
            return nullptr; // No available connections
        }
        auto conn = std::move(connections.back());
        connections.pop_back();
        return conn; // Ownership transfers to the caller, so the handle stays alive
    }

    // Take ownership back and return the same handle to the pool for reuse
    void releaseConnection(std::unique_ptr<CurlHandle> conn) {
        std::lock_guard<std::mutex> lock(mtx);
        connections.push_back(std::move(conn));
    }

    ~ConnectionPool() {
        // No explicit cleanup required, RAII manages resource release
    }
};

void worker(ConnectionPool& pool, int id) {
    auto conn = pool.getConnection(); // Get a connection from the pool
    if (conn) {
        // Simulate a scraping task
        std::cout << "Worker " << id << " acquired a connection.\n";

        // Perform a simple CURL operation (e.g., set URL)
        curl_easy_setopt(conn->get(), CURLOPT_URL, "http://example.com");

        // Simulate some work with a sleep
        std::this_thread::sleep_for(std::chrono::milliseconds(100));

        // Release the connection back to the pool
        pool.releaseConnection(std::move(conn));
        std::cout << "Worker " << id << " released the connection.\n";
    } else {
        std::cerr << "Worker " << id << " failed to acquire a connection.\n";
    }
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT); // Initialize CURL globally

    const size_t poolSize = 5; // Size of the connection pool
    ConnectionPool pool(poolSize);

    const size_t numWorkers = 10; // Number of worker threads
    std::vector<std::thread> workers;

    // Create worker threads
    for (size_t i = 0; i < numWorkers; ++i) {
        workers.emplace_back(worker, std::ref(pool), i);
    }

    // Join all worker threads
    for (auto& worker : workers) {
        worker.join();
    }

    curl_global_cleanup(); // Clean up CURL globally
    return 0;
}

The CurlHandle class, adhering to RAII, manages a single CURL handle and guarantees its correct initialization and cleanup to avoid resource leaks. The ConnectionPool class oversees a collection of CurlHandle instances, providing thread-safe mechanisms for obtaining and releasing connections; handing ownership out and back via unique_ptr keeps each handle alive while a worker is using it. This design improves performance and reliability when scraping in a multi-threaded context.

Another performance optimization strategy you could implement is using asynchronous I/O for non-blocking operations:

#include <boost/asio.hpp>
#include <iostream>

boost::asio::io_context io_context;
boost::asio::steady_timer timer(io_context, boost::asio::chrono::seconds(5));

timer.async_wait([](const boost::system::error_code& error) {
    if (!error) {
        std::cout << "Timer expired" << std::endl;
    }
});

io_context.run();

Additionally, for long-running scrapers it’s good practice to handle shutdown signals gracefully, so the process can clean up (flush data, close connections) instead of being killed mid-request:

#include <csignal>
#include <iostream>

volatile sig_atomic_t shutdown_requested = 0;

void signal_handler(int signal) {
    shutdown_requested = 1;
}

int main() {
    std::signal(SIGINT, signal_handler);
    std::signal(SIGTERM, signal_handler);

    while (!shutdown_requested) {
        // Perform scraping operations

        if (shutdown_requested) {
            std::cout << "Shutdown requested. Cleaning up..." << std::endl;
            // Perform cleanup operations
            break;
        }
    }

    return 0;
}

Implementing these techniques will allow you to build a more robust and dependable web scraper that can handle errors gracefully and recover from failures.

Profiling and Benchmarking

To identify performance bottlenecks, you need to profile and benchmark your scraper. Profiling helps you track CPU usage, memory consumption, and time spent in different parts of the code.

Here are two tools you can use for profiling; example commands follow the list:

  • gprof: A powerful tool for profiling CPU usage in C++ programs. It generates detailed reports on the time spent in functions, allowing you to optimize hot paths.
  • Valgrind: A suite of tools that can detect memory leaks, improper memory access, and performance bottlenecks in your application.
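
As a rough illustration of how these tools are typically invoked (binary names and file paths are placeholders):

# Compile with profiling instrumentation, run the scraper, then inspect the report
g++ -O2 -pg main.cpp -o scraper -lcurl
./scraper
gprof ./scraper gmon.out > profile_report.txt

# Check for memory leaks and invalid accesses with Valgrind
valgrind --leak-check=full ./scraper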

Conclusion

C++ provides a powerful, high-performance platform for building robust and scalable web scrapers. While web scraping is often associated with higher-level languages, C++ offers unique advantages in terms of low-level control, memory management, and system integration that can optimize scraping operations, particularly when handling large-scale tasks or working with limited resources.

Here are the key takeaways for building efficient web scrapers in C++:

  • Performance Optimization: Leverage C++’s low-level capabilities, including RAII (Resource Acquisition Is Initialization) for managing memory and resources efficiently, and asynchronous I/O for non-blocking operations. Use multi-threading or thread pools to scale scraping tasks, but ensure you implement rate limiting and proper error handling.
  • Libraries for Network Operations: Use libcurl for HTTP requests, as it provides flexibility in handling GET, POST, and secure HTTPS requests. Manage cookies and session states effectively for websites requiring authentication, and consider using headless browsers like Puppeteer (invoked via subprocess) for JavaScript-rendered pages.
  • Parsing HTML: Libraries like Gumbo-parser and libxml2 are essential for parsing HTML content, and you can extend their functionality with XPath and CSS selectors for extracting structured data. Handle malformed HTML gracefully to ensure your scraper remains robust in real-world scenarios.
  • Data Storage: For simple data storage, C++’s file handling features (especially binary formats) offer speed and efficiency. For more complex use cases, SQLite provides a lightweight, file-based relational database solution, while Boost.Serialization enables the efficient serialization of complex data structures for future retrieval.
  • Error Handling and Resilience: A well-designed scraper should recover from network errors, implement retry logic, and handle challenges such as CAPTCHAs and rate-limiting. External CAPTCHA-solving services or integrating with tools like Scrape.do can automate the bypass of common anti-scraping mechanisms.

While building a web scraper in C++ offers powerful control and customization, Scrape.do is a far more efficient and scalable option for most scraping needs. It offloads the complexities and technical challenges involved in creating and maintaining a web scraper, allowing you to focus on extracting the data you need without worrying about the underlying infrastructure.

Here’s why Scrape.do is a better alternative:

  1. No Infrastructure Setup: With Scrape.do, you don’t have to set up HTTP request handling, proxy management, or complex authentication flows. The service manages all these, significantly reducing development time.
  2. Automatic CAPTCHA Solving: Scrape.do comes with built-in CAPTCHA-solving mechanisms, eliminating the need to manually integrate third-party CAPTCHA-solving services. This feature makes it easier to scrape sites with aggressive anti-bot measures.
  3. JavaScript Rendering and IP Rotation: Scrape.do handles JavaScript-rendered pages and manages IP rotation automatically. You don’t need to integrate headless browsers like Puppeteer or build proxy solutions to avoid being blocked by websites.
  4. Scalability and Performance: Scrape.do is built for high-performance web scraping at scale. Instead of writing custom multi-threading and error-handling code, Scrape.do manages concurrent requests, retries, and rate limiting efficiently, so you don’t have to.
  5. Maintenance-Free: Websites change their anti-scraping tactics frequently. Scrape.do takes care of these changes automatically, keeping your scrapers running without requiring constant updates to your code.

You can get started with Scrape.do now for free and enjoy all these benefits!