Categories: Web scraping, Tutorials

Web Scraping With C

32 mins read Created Date: October 14, 2024   Updated Date: October 14, 2024

Most developers instinctively turn to high-level languages like Python for web scraping, thanks to their ease of use and extensive library support. However, C offers a compelling alternative for those needing greater control over performance, memory usage, and networking.

As a low-level language, C enables developers to directly manage system resources and optimize every aspect of the scraping process, making it ideal for environments where efficiency is paramount.

In this article, we'll dive into the unique challenges and advantages of using C for web scraping. From handling raw HTTP requests to managing memory manually, C's trade-offs buy you greater flexibility and speed than higher-level languages offer. Without further ado, let's dive right in!

Setting Up the Environment

To start web scraping in C, you’ll need to set up an environment that allows you to manage HTTP requests and parse HTML data efficiently. The primary tools you’ll rely on are libcurl for handling network requests and libxml2 or gumbo-parser for parsing HTML and XML data.

Required Libraries

Ubuntu/Debian

To install libcurl and libxml2 (or gumbo-parser), you can use package managers like apt for Ubuntu or Debian-based systems. Here’s how to install these libraries:

sudo apt-get update
sudo apt-get install libcurl4-openssl-dev libxml2-dev

If you prefer to use gumbo-parser:

sudo apt-get install libgumbo-dev

Once you've installed the libraries, you'll need to add their include paths and link them when compiling your C project:

gcc -o scraper scraper.c $(xml2-config --cflags --libs) -lcurl

C code can be compiled using popular compilers like GCC or Clang. If you haven’t already installed a C compiler, you can install GCC with the following command:

sudo apt-get install build-essential

With these, you’re good to go!

Windows

For Windows, you'll want an editor like VS Code. Install the C/C++ extension from the Extensions Marketplace, then install the C/C++ package manager vcpkg and set it up by following the official guide.

To install libcurl and libxml2, run this from the root folder of your project:

vcpkg install curl libxml2

With that, you’re fully set up!

Fetching Web Content Using libcurl

When it comes to fetching web content in C, libcurl is the go-to library for handling HTTP requests. It supports multiple protocols, including HTTP and HTTPS, and provides fine-grained control over every aspect of the request and response process. Let’s look at how to fetch HTML content using libcurl, handle different HTTP methods, manage response codes, and ensure proper memory management.

Performing an HTTP GET Request

To fetch content from a web page, you need to initialize a CURL session, set the URL, and use a callback to handle the data. Here’s a simple example that demonstrates how to fetch HTML content using a GET request:

#include <stdio.h>
#include <curl/curl.h>
#include <string.h>
// Callback function to handle data
size_t write_callback(void *ptr, size_t size, size_t nmemb, char *data) {
size_t real_size = size * nmemb;
strncat(data, ptr, real_size);
return real_size;
}
int main(void) {
CURL *curl;
CURLcode res;
char buffer[100000] = {0}; // Buffer to hold the response (assumes the page fits; see the dynamic-memory example later for larger pages)
curl = curl_easy_init(); // Initialize CURL session
if (curl) {
// Set URL to fetch the HTML
curl_easy_setopt(curl, CURLOPT_URL, "https://www.scrapethissite.com/pages/ajax-javascript/#2014");
// Optional: Set a user-agent to avoid being blocked by some websites
curl_easy_setopt(curl, CURLOPT_USERAGENT, "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36");
// Set callback function to write response data
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_callback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, buffer); // Pass buffer to callback
res = curl_easy_perform(curl); // Perform the request
if (res != CURLE_OK) {
// Handle errors if the request fails
fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
} else {
// Print fetched HTML content
printf("Fetched HTML:\n%s\n", buffer);
}
curl_easy_cleanup(curl); // Clean up CURL session
}
return 0;
}

To send data to a server, you use an HTTP POST request. With libcurl, this is done by setting the CURLOPT_POSTFIELDS option to attach data to the request:

#include <stdio.h>
#include <curl/curl.h>

int main(void) {
CURL *curl;
CURLcode res;

curl = curl_easy_init(); // Initialize CURL session
if (curl) {
curl_easy_setopt(curl, CURLOPT_URL, "https://httpbin.org/post"); // Set URL
curl_easy_setopt(curl, CURLOPT_POST, 1L); // Specify POST request
curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "name=John&age=30"); // Set POST data

res = curl_easy_perform(curl); // Perform POST request

if (res != CURLE_OK) {
fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res)); // Handle errors
} else {
printf("POST request succeeded.\n");
}

curl_easy_cleanup(curl); // Clean up CURL session
}
return 0;
}

In this example, curl_easy_setopt is used to set the POST method and attach form data via CURLOPT_POSTFIELDS, which makes it easy to submit data to a web server, for example when filling in forms.
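
If you need to send JSON instead of form data, the same pattern works; you just attach a Content-Type header built with curl_slist_append. Here's a minimal sketch, again posting to httpbin.org, which simply echoes the request back:

#include <stdio.h>
#include <curl/curl.h>

int main(void) {
    CURL *curl = curl_easy_init(); // Initialize CURL session
    if (curl) {
        // Build a header list carrying the JSON content type
        struct curl_slist *headers = NULL;
        headers = curl_slist_append(headers, "Content-Type: application/json");

        curl_easy_setopt(curl, CURLOPT_URL, "https://httpbin.org/post");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);                          // Attach the custom headers
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "{\"name\":\"John\",\"age\":30}"); // Setting a body implies POST

        CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK) {
            fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
        }

        curl_slist_free_all(headers); // Free the header list
        curl_easy_cleanup(curl);      // Clean up CURL session
    }
    return 0;
}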

Handling HTTP Response Codes and Headers

When performing web scraping or interacting with web servers, understanding the HTTP response codes returned by the server is crucial. These codes indicate the status of the request, whether it was successful, or if any errors occurred. In libcurl, you can retrieve and analyze the HTTP response code using the curl_easy_getinfo function.

Once a request is performed using curl_easy_perform(), the server will send back a response that includes both the content (if any) and the HTTP status code (e.g., 200 for success, 404 for not found, 500 for server errors). To extract this status code in libcurl, use the CURLINFO_RESPONSE_CODE option with curl_easy_getinfo. Here’s how it works:

#include <stdio.h>
#include <curl/curl.h>

int main(void) {
CURL *curl;
CURLcode res;
long http_code = 0;

curl = curl_easy_init(); // Initialize CURL session
if (curl) {
curl_easy_setopt(curl, CURLOPT_URL, "https://www.example.com"); // Set URL

res = curl_easy_perform(curl); // Perform the request

if (res == CURLE_OK) {
// Retrieve the HTTP response code
curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &http_code);
printf("HTTP Response Code: %ld\n", http_code);
} else {
// Print error message if request failed
fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
}

curl_easy_cleanup(curl); // Clean up CURL session
}
return 0;
}

In this example, after performing the HTTP request, we check if the operation succeeded (i.e., res == CURLE_OK). If it did, we extract the HTTP response code using curl_easy_getinfo() with the CURLINFO_RESPONSE_CODE option. This gives us the numeric HTTP status code, such as:

  • 200: OK – The request was successful.

  • 404: Not Found – The requested resource could not be found.

  • 500: Internal Server Error – The server encountered an error.

     

You can use this information to adjust the logic in your scraper, retry requests, or handle errors accordingly.
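
For example, a scraper might branch on the status code it just retrieved. A minimal sketch, assuming http_code was filled in by curl_easy_getinfo as shown above:

if (http_code >= 200 && http_code < 300) {
    // Success: hand the body off to the parser
} else if (http_code == 404) {
    fprintf(stderr, "Page not found, skipping this URL\n");
} else if (http_code == 429 || http_code >= 500) {
    fprintf(stderr, "Temporary failure (%ld), worth retrying later\n", http_code);
} else {
    fprintf(stderr, "Unexpected status code: %ld\n", http_code);
}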

Besides the response code, headers provide additional metadata about the response, such as content type, content length, server information, caching policies, and more. libcurl allows you to retrieve and inspect these headers as part of your request.

To collect headers, you need to define a custom callback function that will capture the header information. This is similar to how you handle the response body.

Here’s how you can capture and print HTTP headers using libcurl:

#include <stdio.h>
#include <curl/curl.h>

// Callback function to handle headers
size_t header_callback(char *buffer, size_t size, size_t nitems, void *userdata) {
size_t header_size = nitems * size;
printf("Header: %.*s", (int)header_size, buffer); // Print the header line
return header_size;
}

int main(void) {
CURL *curl;
CURLcode res;
long http_code = 0;

curl = curl_easy_init(); // Initialize CURL session
if (curl) {
curl_easy_setopt(curl, CURLOPT_URL, "https://www.example.com"); // Set URL

// Set the header callback function
curl_easy_setopt(curl, CURLOPT_HEADERFUNCTION, header_callback);

res = curl_easy_perform(curl); // Perform the request

if (res == CURLE_OK) {
// Retrieve the HTTP response code
curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &http_code);
printf("HTTP Response Code: %ld\n", http_code);
} else {
// Print error message if request failed
fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
}

curl_easy_cleanup(curl); // Clean up CURL session
}
return 0;
}

Calling curl_easy_setopt with CURLOPT_HEADERFUNCTION instructs libcurl to use a custom function, header_callback, to handle each header line received from the server during a network request. This callback is invoked once for every header line, allowing you to process or analyze the header information as needed.

Headers typically contain valuable metadata about the resource being requested, such as the content type, size, server information, cookies, and caching policies. For example, the Content-Type header specifies the media type of the resource (e.g., HTML, JSON), while the Content-Length header indicates the size of the response body. Other headers like Server, Set-Cookie, and Cache-Control provide additional information about the server, cookies, and caching behavior.

Here’s an example of the output:

Header: HTTP/1.1 200 OK
Header: Date: Wed, 11 Sep 2024 14:23:45 GMT
Header: Content-Type: text/html; charset=UTF-8
Header: Content-Length: 1256
Header: Connection: keep-alive
Header: Server: nginx
Header: Cache-Control: max-age=3600
Header: Set-Cookie: session_id=abc123; Path=/
HTTP Response Code: 200

Memory management is crucial when working with libcurl in C, especially when dealing with dynamic memory. You must ensure that all dynamically allocated memory is freed properly to avoid memory leaks. Here’s an example of how to manage memory when fetching large amounts of content:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

struct MemoryStruct {
char *memory;
size_t size;
};

// Callback function to handle data
size_t write_callback(void *contents, size_t size, size_t nmemb, void *userp) {
size_t realsize = size * nmemb;
struct MemoryStruct *mem = (struct MemoryStruct *)userp;

char *ptr = realloc(mem->memory, mem->size + realsize + 1);
if (ptr == NULL) {
printf("Out of memory!\n");
return 0;
}

mem->memory = ptr;
memcpy(&(mem->memory[mem->size]), contents, realsize);
mem->size += realsize;
mem->memory[mem->size] = 0;

return realsize;
}

int main(void) {
CURL *curl;
CURLcode res;
struct MemoryStruct chunk;

chunk.memory = malloc(1); // Allocate initial memory
chunk.size = 0; // Initialize size

curl_global_init(CURL_GLOBAL_DEFAULT);

curl = curl_easy_init(); // Initialize CURL session
if (curl) {
curl_easy_setopt(curl, CURLOPT_URL, "https://www.example.com"); // Set URL
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_callback); // Set callback
curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *)&chunk); // Pass memory struct to callback

res = curl_easy_perform(curl); // Perform GET request

if (res != CURLE_OK) {
fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res)); // Handle errors
} else {
printf("Fetched %lu bytes:\n%s\n", (unsigned long)chunk.size, chunk.memory); // Print content
}

curl_easy_cleanup(curl); // Clean up
free(chunk.memory); // Free allocated memory
}

curl_global_cleanup();

return 0;
}

In this example, realloc is used to dynamically allocate memory for the response content. Proper cleanup is done using free() to avoid memory leaks. Additionally, curl_easy_cleanup() ensures that resources used by the CURL session are properly released.

Parsing HTML with libxml2 or gumbo-parser

Once you’ve fetched HTML content using libcurl, the next step in web scraping is parsing the HTML to extract the desired information. This can be done by navigating the DOM tree and selecting specific elements. Two popular libraries for parsing HTML in C are libxml2 and gumbo-parser.

  • libxml2 is a powerful library for XML and HTML parsing. It provides full DOM tree navigation and is widely used in C-based scraping tasks.

  • gumbo-parser is a lightweight and HTML5-compliant parser designed to handle modern web content more efficiently, especially when dealing with messy or non-standard HTML.

     

libxml2 allows you to parse HTML documents and navigate the DOM tree to extract specific elements. Here’s a detailed example of using libxml2 to parse HTML content fetched with libcurl.

#include <stdio.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#include <string.h>

// Buffer to hold the fetched content
struct MemoryStruct {
char *memory;
size_t size;
};

// Callback function to write data into buffer
size_t write_callback(void *contents, size_t size, size_t nmemb, void *userp) {
size_t real_size = size * nmemb;
struct MemoryStruct *mem = (struct MemoryStruct *)userp;

char *ptr = realloc(mem->memory, mem->size + real_size + 1);
if (ptr == NULL) {
printf("Not enough memory!\n");
return 0;
}

mem->memory = ptr;
memcpy(&(mem->memory[mem->size]), contents, real_size);
mem->size += real_size;
mem->memory[mem->size] = 0;

return real_size;
}

// Function to parse HTML using libxml2
void parse_html(const char *html_content) {
htmlDocPtr doc;
xmlNodePtr root_element;

// Parse the HTML content and create a document tree
doc = htmlReadMemory(html_content, strlen(html_content), NULL, NULL, HTML_PARSE_NOERROR | HTML_PARSE_RECOVER);
if (doc == NULL) {
printf("Failed to parse document\n");
return;
}

// Get the root element node
root_element = xmlDocGetRootElement(doc);

// Print the title of the document
xmlChar *xpath = (xmlChar *)"//title";
xmlXPathContextPtr xpathCtx = xmlXPathNewContext(doc);
xmlXPathObjectPtr xpathObj = xmlXPathEvalExpression(xpath, xpathCtx);

if (xpathObj && xpathObj->nodesetval && xpathObj->nodesetval->nodeNr > 0) {
xmlNodePtr node = xpathObj->nodesetval->nodeTab[0];
xmlChar *title = xmlNodeGetContent(node);
printf("Title: %s\n", title);
xmlFree(title); // Strings returned by libxml2 must be freed by the caller
}

// Cleanup
xmlXPathFreeObject(xpathObj);
xmlXPathFreeContext(xpathCtx);
xmlFreeDoc(doc);
xmlCleanupParser();
}

int main(void) {
CURL *curl;
CURLcode res;
struct MemoryStruct chunk;

chunk.memory = malloc(1); // Initial memory allocation
chunk.size = 0; // Initial size

curl_global_init(CURL_GLOBAL_DEFAULT);
curl = curl_easy_init();
if (curl) {
// Set the new URL
curl_easy_setopt(curl, CURLOPT_URL, "https://www.scrapethissite.com/pages/simple/");

curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_callback); // Set write callback
curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *)&chunk); // Pass buffer to callback

// Perform HTTP request
res = curl_easy_perform(curl);
if (res != CURLE_OK) {
fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
} else {
// Parse the fetched HTML content
parse_html(chunk.memory);
}

// Cleanup
curl_easy_cleanup(curl);
free(chunk.memory); // Free allocated memory
}
curl_global_cleanup();

return 0;
}

The above example will fetch the HTML content from the specified URL and extract and print the document’s <title>.
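
You can extend the same XPath approach to pull out several elements at once. Here's a hedged sketch that prints every <h3> heading on the page; it assumes the data you want sits in <h3> elements (as the country names do on the sample page), so adjust the expression to match your target. Call it from parse_html before xmlFreeDoc(doc):

// Print the content of every node matched by an XPath expression
void print_all_h3(htmlDocPtr doc) {
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr obj = xmlXPathEvalExpression((const xmlChar *)"//h3", ctx);

    if (obj && obj->nodesetval) {
        for (int i = 0; i < obj->nodesetval->nodeNr; i++) {
            xmlChar *text = xmlNodeGetContent(obj->nodesetval->nodeTab[i]);
            printf("Heading %d: %s\n", i + 1, text);
            xmlFree(text); // Strings returned by libxml2 must be freed by the caller
        }
    }

    xmlXPathFreeObject(obj);
    xmlXPathFreeContext(ctx);
}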

For simpler, HTML5-compliant parsing, you can use gumbo-parser. It’s lightweight, but still effective for web scraping, especially when dealing with non-standard or malformed HTML.

#include <stdio.h>
#include <curl/curl.h>
#include <gumbo.h>
#include <string.h>

// Buffer to hold fetched content
struct MemoryStruct {
char *memory;
size_t size;
};

// Callback function for curl to write data into buffer
size_t write_callback(void *contents, size_t size, size_t nmemb, void *userp) {
size_t real_size = size * nmemb;
struct MemoryStruct *mem = (struct MemoryStruct *)userp;

char *ptr = realloc(mem->memory, mem->size + real_size + 1);
if (ptr == NULL) {
printf("Not enough memory!\n");
free(mem->memory); // Free previously allocated memory to prevent a memory leak
return 0;
}

mem->memory = ptr;
memcpy(&(mem->memory[mem->size]), contents, real_size);
mem->size += real_size;
mem->memory[mem->size] = 0;

return real_size;
}

// Recursively search for the <title> element
void search_for_title(GumboNode *node) {
if (node->type != GUMBO_NODE_ELEMENT || node->v.element.tag != GUMBO_TAG_TITLE) {
if (node->type == GUMBO_NODE_ELEMENT) {
GumboVector *children = &node->v.element.children;
for (unsigned int i = 0; i < children->length; ++i) {
search_for_title((GumboNode *)children->data[i]);
}
}
return;
}

if (node->v.element.children.length > 0) {
GumboNode *title_text = (GumboNode *)node->v.element.children.data[0];
if (title_text->type == GUMBO_NODE_TEXT) {
printf("Title: %s\n", title_text->v.text.text);
}
}
}

void parse_html_with_gumbo(const char *html_content) {
if (html_content == NULL) {
printf("No HTML content provided.\n");
return;
}

GumboOutput *output = gumbo_parse(html_content);
if (output == NULL) {
printf("Failed to parse HTML.\n");
return;
}

search_for_title(output->root);
gumbo_destroy_output(&kGumboDefaultOptions, output);
}

int main(void) {
CURL *curl;
CURLcode res;
struct MemoryStruct chunk;

chunk.memory = malloc(1); // Initial memory allocation
if (chunk.memory == NULL) {
fprintf(stderr, "Failed to allocate memory.\n");
return 1;
}
chunk.size = 0; // Initial size

curl_global_init(CURL_GLOBAL_DEFAULT);
curl = curl_easy_init();
if (curl) {
curl_easy_setopt(curl, CURLOPT_URL, "https://www.example.com"); // Set URL
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_callback); // Set write callback
curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *)&chunk); // Pass buffer to callback

// Perform HTTP request
res = curl_easy_perform(curl);
if (res != CURLE_OK) {
fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
} else {
// Parse the fetched HTML content using gumbo-parser
parse_html_with_gumbo(chunk.memory);
}

// Cleanup
curl_easy_cleanup(curl);
free(chunk.memory); // Free allocated memory
} else {
fprintf(stderr, "Failed to initialize curl.\n");
}

curl_global_cleanup();

return 0;
}

Both libxml2 and gumbo-parser provide powerful mechanisms for parsing and navigating HTML in C. While libxml2 offers more advanced features and flexibility, gumbo-parser is a lighter, HTML5-compliant alternative for simpler scraping tasks. Depending on the complexity of the HTML and the scraping task, either library can be a great choice.

However, rather than go through all the trouble of setting this up yourself, you can simply start using Scrape.do for free to achieve the same result with much less stress. We simplify data extraction by letting you focus on the data, while we handle the entire scraping process.

Handling Dynamic Content (JavaScript Rendering)

One of the major challenges in web scraping is dealing with dynamic content that JavaScript renders. Since C, as a low-level language, does not natively support JavaScript execution, it cannot directly handle the rendering of such content. However, there are strategies to scrape dynamic content by leveraging external tools designed for rendering JavaScript.

Strategy 1: Use External Tools for JavaScript Execution

A practical approach for handling dynamic content is automating a headless browser and extracting the fully rendered HTML after executing JavaScript. This can be done by using tools like PhantomJS, Headless Chrome, or Puppeteer, and integrating them with your C code.

In C, you can invoke system-level commands to execute an external headless browser, capture the output (the fully rendered HTML), and then parse it within your C program. Let's look at an example that demonstrates how to drive Headless Chrome from a C program using system calls.

First, ensure that you have Google Chrome installed on your system. You can run Chrome in headless mode using the command-line options:

google-chrome --headless --disable-gpu --dump-dom https://example.com

This will output the fully rendered HTML content of the page. In C, you can invoke Headless Chrome with the system() function, redirect the output into a file, and then read that file to extract the fully rendered content. Here's an example:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
// URL to scrape
const char *url = "https://www.example.com";

// Make sure URL length doesn't exceed the command buffer size
size_t max_url_length = 200;
if (strlen(url) > max_url_length) {
fprintf(stderr, "Error: URL is too long.\n");
return 1;
}

// Command to execute Headless Chrome in dump DOM mode
char command[512];
snprintf(command, sizeof(command), "google-chrome --headless --disable-gpu --dump-dom %s > output.html", url);

// Execute the system command to run Headless Chrome
int result = system(command);
if (result != 0) {
fprintf(stderr, "Error: Failed to execute Headless Chrome.\n");
return 1;
}

// Open the file containing the rendered HTML
FILE *file = fopen("output.html", "r");
if (!file) {
fprintf(stderr, "Error: Could not open output file.\n");
return 1;
}

// Read and print the content of the file (or process it with a parser)
char buffer[1024];
while (fgets(buffer, sizeof(buffer), file)) {
printf("%s", buffer); // For simplicity, just print the content
}

fclose(file);

// Optionally, remove the output file after reading
if (remove("output.html") != 0) {
fprintf(stderr, "Error: Could not remove output file.\n");
}

return 0;
}
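
If you'd rather skip the temporary file, a pipe can stream the DOM dump straight into your program. Here's a minimal sketch using popen(), assuming a POSIX system with google-chrome on the PATH:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Stream Headless Chrome's DOM dump through a pipe instead of a temporary file
    FILE *pipe = popen("google-chrome --headless --disable-gpu --dump-dom https://www.example.com", "r");
    if (!pipe) {
        fprintf(stderr, "Error: Failed to start Headless Chrome.\n");
        return 1;
    }

    char buffer[1024];
    while (fgets(buffer, sizeof(buffer), pipe)) {
        printf("%s", buffer); // Or append into a growing buffer and hand it to a parser
    }

    if (pclose(pipe) != 0) {
        fprintf(stderr, "Error: Headless Chrome exited with a non-zero status.\n");
        return 1;
    }
    return 0;
}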

Strategy 2: Use a REST API to Execute JavaScript

Another option is to use a third-party service like Scrape.do, which executes the JavaScript on the server side and returns the fully rendered HTML. This approach simplifies the workflow as you can fetch the rendered content directly with libcurl in your C program without needing to execute a headless browser locally.

You can use libcurl to make an HTTP request to Scrape.do’s API and retrieve the rendered HTML. The basic idea is similar to fetching any other web content, but you’ll include your Scrape.do API key and use our endpoint to perform the request.

#include <stdio.h>
#include <curl/curl.h>

#define API_KEY "your_scrape_do_api_key" // Replace with your Scrape.do API key

// Buffer to hold fetched content
struct MemoryStruct {
char *memory;
size_t size;
};

// Callback function to write data into buffer
size_t write_callback(void *contents, size_t size, size_t nmemb, void *userp) {
size_t real_size = size * nmemb;
struct MemoryStruct *mem = (struct MemoryStruct *)userp;

char *ptr = realloc(mem->memory, mem->size + real_size + 1);
if (ptr == NULL) {
printf("Not enough memory!\n");
return 0;
}

mem->memory = ptr;
memcpy(&(mem->memory[mem->size]), contents, real_size);
mem->size += real_size;
mem->memory[mem->size] = 0;

return real_size;
}

int main(void) {
CURL *curl;
CURLcode res;
struct MemoryStruct chunk;

chunk.memory = malloc(1); // Initial memory allocation
chunk.size = 0; // Initial size

curl_global_init(CURL_GLOBAL_DEFAULT);
curl = curl_easy_init();

if (curl) {
// The URL of the site you want to scrape
const char *target_url = "https://www.example.com";

// Format the Scrape.do API URL with your target URL
char api_url[512];
snprintf(api_url, sizeof(api_url), "https://api.scrape.do?api_key=%s&url=%s&render=true", API_KEY, target_url);

// Set the API URL to fetch rendered content
curl_easy_setopt(curl, CURLOPT_URL, api_url);

// Set the write callback function to save output in memory
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_callback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *)&chunk);

// Perform the request
res = curl_easy_perform(curl);
if (res != CURLE_OK) {
fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
} else {
// Print the fetched content
printf("Fetched content:\n%s\n", chunk.memory);

// Here, you can proceed to parse the content using libxml2 or gumbo-parser
}

// Cleanup
curl_easy_cleanup(curl);
free(chunk.memory); // Free allocated memory
}

curl_global_cleanup();

return 0;
}

Once you have the fully rendered HTML content from Scrape.do, you can parse it using libxml2 or gumbo-parser (as described in the previous sections). The fetched content is stored in the chunk.memory buffer, and you can pass this to your parsing functions.

For example, after fetching the content, you can directly parse it like this:

// Parse the fetched content using libxml2
parse_html(chunk.memory);

Or, if you’re using gumbo-parser:

// Parse the fetched content using gumbo-parser
parse_html_with_gumbo(chunk.memory);

Advantages of Using Scrape.do

  • JavaScript Rendering: Scrape.do handles JavaScript-heavy websites, returning fully rendered HTML, which is crucial for modern web scraping.
  • Simple API: Scrape.do provides a simple, reliable API, so you don’t need to set up a headless browser or complex infrastructure.
  • Integration with C: Using libcurl makes it easy to integrate the Scrape.do API into a C-based scraping workflow, allowing you to handle dynamic content while maintaining C’s performance benefits.

By integrating Scrape.do with libcurl, you can easily handle dynamic content generated by JavaScript in your C-based web scraping tasks.

Efficient Memory and Error Handling

When scraping large web pages or handling numerous requests, efficient memory management and robust error handling are critical for ensuring that your program can handle large-scale data without crashing or exhausting system resources.

Handling large responses, especially when parsing extensive HTML files, requires careful memory management. Two key considerations are static memory allocation versus dynamic memory allocation:

Static Memory Allocation

Static memory allocation involves declaring fixed-size arrays for storing data. While this approach is simple and can prevent memory fragmentation, it may be impractical for web scraping due to the unpredictable size of responses.

char buffer[1024]; // Static memory allocation with a fixed buffer size.

However, if the response exceeds 1024 bytes, this approach could either result in data truncation or force you to use a very large buffer, which wastes memory.
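
If you do go with a fixed buffer, the write callback should at least cap how much it copies so an oversized page truncates instead of overflowing the buffer. A minimal sketch, assuming <string.h> is included and the buffer passed via CURLOPT_WRITEDATA holds BUFFER_CAPACITY bytes:

#define BUFFER_CAPACITY 100000

// Append incoming data to a fixed buffer, truncating anything that doesn't fit
size_t bounded_write_callback(void *ptr, size_t size, size_t nmemb, char *data) {
    size_t real_size = size * nmemb;
    size_t used = strlen(data);
    size_t space = BUFFER_CAPACITY - used - 1; // Leave room for the terminating '\0'

    size_t to_copy = real_size < space ? real_size : space;
    memcpy(data + used, ptr, to_copy);
    data[used + to_copy] = '\0';

    // Return real_size so libcurl keeps the transfer going even when we truncate
    return real_size;
}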

Dynamic Memory Allocation

Dynamic memory allocation is much more flexible and efficient for handling large or unpredictable response sizes. It allows the program to allocate memory as needed, using functions like malloc(), calloc(), and realloc().

// Initial dynamic memory allocation
char *buffer = malloc(1024);

// Reallocate when more space is needed, keeping the old pointer
// so it can still be freed if realloc fails
char *tmp = realloc(buffer, new_size);
if (tmp != NULL) {
buffer = tmp;
}

Dynamic allocation helps you adjust memory usage based on the actual size of the response, minimizing waste. However, you need to handle memory carefully to avoid leaks and ensure that allocated memory is properly freed.

Handling Out-of-Memory Scenarios

In situations where the system runs out of memory, failing to handle the condition gracefully can cause the program to crash. When using dynamic memory allocation, always check if the memory allocation was successful, like this:

char *buffer = (char *)malloc(1024);
if (buffer == NULL) {
fprintf(stderr, "Error: Out of memory!\n");
return 1; // Exit or take recovery steps
}

Error Handling and Recovery Strategies

Web scraping is often error-prone due to network issues, server errors, rate limiting, or timeouts. Designing error-handling strategies that can recover from failures is essential for reliable scraping.

You can use libcurl to handle common network issues like timeouts, connection errors, and DNS failures:

curl_easy_setopt(curl, CURLOPT_TIMEOUT, 10L); // Set a 10-second timeout
curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, 5L); // Set a connection timeout

If a timeout occurs, it's usually worth retrying the request a few times before giving up. The same goes for network errors: handle disconnections gracefully by attempting to reconnect and retrying the request.

Retries can be implemented using simple logic to re-attempt failed requests. However, to avoid overwhelming the server, it’s crucial to implement an exponential backoff strategy, which increases the delay between retries after each failure.

int retry_count = 0;
int max_retries = 5;
int base_delay = 1; // Start with 1 second delay

while (retry_count < max_retries) {
CURLcode res = curl_easy_perform(curl);
if (res == CURLE_OK) {
// Request succeeded, break out of the loop
break;
} else {
// Print error and retry
fprintf(stderr, "Request failed: %s. Retrying...\n", curl_easy_strerror(res));
retry_count++;
sleep(base_delay); // Wait for the delay
base_delay *= 2; // Exponentially increase the delay
}
}

if (retry_count == max_retries) {
fprintf(stderr, "Max retries reached. Aborting...\n");
}

In this example, the delay starts at 1 second and doubles after each failure (1 second, 2 seconds, 4 seconds, etc.), allowing the server time to recover while preventing excessive request floods.

Many websites limit the number of requests you can make within a certain time period. If you hit the rate limit, the server will often respond with an HTTP status code like 429 Too Many Requests. You can handle this by pausing requests and retrying after the server-specified cooldown period.

if (http_code == 429) {
printf("Rate limit exceeded. Retrying after delay...\n");
sleep(retry_delay); // Ideally taken from the server's Retry-After response header
}
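
The cooldown itself is often announced in a Retry-After response header. Here's a hedged sketch of picking it up with the header-callback technique shown earlier (register it with CURLOPT_HEADERFUNCTION); it assumes the header carries a plain number of seconds and that <string.h> is included:

long retry_delay = 60; // Fallback cooldown in seconds if the server sends no Retry-After header

// Header callback that picks up the Retry-After value when the server sends one
size_t rate_limit_header_callback(char *buffer, size_t size, size_t nitems, void *userdata) {
    size_t header_size = nitems * size;

    char line[256] = {0};
    size_t copy = header_size < sizeof(line) - 1 ? header_size : sizeof(line) - 1;
    memcpy(line, buffer, copy); // Header data is not null-terminated, so copy it first

    long seconds;
    if (sscanf(line, "Retry-After: %ld", &seconds) == 1) {
        retry_delay = seconds; // Use the server-specified cooldown in the sleep() above
    }
    return header_size;
}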

Optimizing for Performance

Web scraping can become slow and resource-intensive when dealing with numerous pages or large datasets. To optimize performance, you can use multi-threading and concurrent requests to scrape multiple pages simultaneously, significantly speeding up the process.

Using multi-threading allows you to scrape multiple pages in parallel. This can drastically reduce the time required to scrape large datasets, as each thread can independently handle a different URL. Here’s a simple example of how to create threads using pthread to perform web scraping tasks in parallel:

#include <stdio.h>
#include <pthread.h>
#include <curl/curl.h>

void *scrape_page(void *url) {
CURL *curl;
CURLcode res;

curl = curl_easy_init();
if (curl) {
curl_easy_setopt(curl, CURLOPT_URL, (char *)url); // Set the URL for the thread

// Perform the request
res = curl_easy_perform(curl);
if (res != CURLE_OK) {
fprintf(stderr, "Error scraping %s: %s\n", (char *)url, curl_easy_strerror(res));
} else {
printf("Successfully scraped: %s\n", (char *)url);
}

curl_easy_cleanup(curl); // Clean up
} else {
fprintf(stderr, "Failed to initialize curl for %s\n", (char *)url);
}

return NULL;
}

int main() {
const int thread_count = 5;
pthread_t threads[thread_count]; // Create an array for thread identifiers
const char *urls[thread_count] = {
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
"https://example.com/page4",
"https://example.com/page5"
};

curl_global_init(CURL_GLOBAL_DEFAULT); // Initialize libcurl globally

// Create threads for each URL
for (int i = 0; i < thread_count; i++) {
int err = pthread_create(&threads[i], NULL, scrape_page, (void *)urls[i]);
if (err != 0) {
fprintf(stderr, "Error creating thread for %s\n", urls[i]);
}
}

// Wait for all threads to complete
for (int i = 0; i < thread_count; i++) {
int err = pthread_join(threads[i], NULL);
if (err != 0) {
fprintf(stderr, "Error joining thread for %s\n", urls[i]);
}
}

curl_global_cleanup(); // Clean up global libcurl state

return 0;
}

With this approach, multiple pages are scraped concurrently, so the program spends far less time waiting on the network. The thread count can be adjusted if you need to handle more URLs or build the URL list dynamically.

For even more efficient scraping, libcurl’s multi-interface allows you to handle multiple connections in a non-blocking manner. Instead of creating separate threads for each request, you can batch multiple requests into a single event-driven loop, allowing libcurl to manage multiple requests simultaneously within the same thread.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

// Callback function to handle response data
size_t write_callback(void *ptr, size_t size, size_t nmemb, void *userdata) {
// Write the response data into the provided buffer
size_t total_size = size * nmemb;
char *response = (char *) userdata;
strncat(response, (char *) ptr, total_size);
return total_size;
}

int main(void) {
CURL *handles[5]; // Array of CURL easy handles
CURLM *multi_handle; // Multi-handle for managing concurrent requests
int still_running; // Number of active transfers
CURLMsg *msg;
int msgs_left;

const char *urls[5] = {
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
"https://example.com/page4",
"https://example.com/page5"
};

char responses[5][1024] = {0}; // Array to store responses

curl_global_init(CURL_GLOBAL_DEFAULT); // Initialize libcurl globally

// Initialize multi-handle
multi_handle = curl_multi_init();

// Initialize individual easy handles and add them to the multi-handle
for (int i = 0; i < 5; i++) {
handles[i] = curl_easy_init();
if (handles[i]) {
curl_easy_setopt(handles[i], CURLOPT_URL, urls[i]);
curl_easy_setopt(handles[i], CURLOPT_WRITEFUNCTION, write_callback); // Set callback for writing data
curl_easy_setopt(handles[i], CURLOPT_WRITEDATA, responses[i]); // Pass the buffer to store response
curl_easy_setopt(handles[i], CURLOPT_TIMEOUT, 10L); // Set timeout for the request
curl_easy_setopt(handles[i], CURLOPT_CONNECTTIMEOUT, 5L); // Set connection timeout
curl_multi_add_handle(multi_handle, handles[i]); // Add the easy handle to the multi handle
} else {
fprintf(stderr, "Failed to initialize handle for %s\n", urls[i]);
}
}

// Perform the requests in a non-blocking manner
curl_multi_perform(multi_handle, &still_running);

// Wait for all transfers to finish
do {
int numfds;
CURLMcode mc = curl_multi_wait(multi_handle, NULL, 0, 1000, &numfds); // Wait for activity on the multi handle

if (mc != CURLM_OK) {
fprintf(stderr, "curl_multi_wait() failed: %s\n", curl_multi_strerror(mc));
break; // Exit the loop if waiting fails
}

curl_multi_perform(multi_handle, &still_running); // Continue performing the requests
} while (still_running);

// Check the results of each request
while ((msg = curl_multi_info_read(multi_handle, &msgs_left))) {
if (msg->msg == CURLMSG_DONE) {
int idx = -1;
for (int i = 0; i < 5; i++) {
if (msg->easy_handle == handles[i]) {
idx = i;
break;
}
}

if (msg->data.result == CURLE_OK) {
printf("Successfully scraped: %s\nResponse: %s\n", urls[idx], responses[idx]);
} else {
printf("Error scraping %s: %s\n", urls[idx], curl_easy_strerror(msg->data.result));
}
}
}

// Cleanup
for (int i = 0; i < 5; i++) {
if (handles[i]) {
curl_multi_remove_handle(multi_handle, handles[i]); // Remove from multi handle
curl_easy_cleanup(handles[i]); // Cleanup each easy handle
}
}

curl_multi_cleanup(multi_handle); // Cleanup the multi handle
curl_global_cleanup(); // Cleanup global libcurl

return 0;
}

This approach allows you to manage multiple requests in a single thread, making it ideal for batch-processing many URLs without creating separate threads for each request. It is efficient and reduces CPU overhead.

Handling Large-Scale Data Extraction

Large-scale web scraping operations in C require careful planning for efficient data storage, rate limiting, and connection management. When dealing with vast amounts of data, the way you store that data is crucial for maintaining performance and ensuring scalability. There are several approaches, depending on the volume of data and the target format.

Direct Insertion into Databases

For high-volume scraping operations, directly inserting data into a database can be an efficient way to handle large-scale data extraction. Databases provide structured storage, indexing, and querying capabilities that make managing large datasets more practical.

Popular database options include:

  • MySQL or PostgreSQL: For relational data storage.

  • MongoDB: For flexible, schema-less data storage in JSON-like documents.

     

Here’s an example of inserting Data into a MySQL Database:

#include <mysql/mysql.h>
#include <stdio.h>
#include <string.h>

// Function to insert data using a prepared statement
void insert_into_db(MYSQL *conn, const char *data) {
if (conn == NULL) {
fprintf(stderr, "Connection is NULL\n");
return;
}

// Prepare the SQL statement to avoid SQL injection
const char *query = "INSERT INTO scraped_data (data) VALUES (?)";
MYSQL_STMT *stmt = mysql_stmt_init(conn);
if (!stmt) {
fprintf(stderr, "mysql_stmt_init() failed\n");
return;
}

if (mysql_stmt_prepare(stmt, query, strlen(query))) {
fprintf(stderr, "mysql_stmt_prepare() failed: %s\n", mysql_stmt_error(stmt));
mysql_stmt_close(stmt);
return;
}

// Bind the data to the statement
MYSQL_BIND bind[1];
memset(bind, 0, sizeof(bind));

bind[0].buffer_type = MYSQL_TYPE_STRING;
bind[0].buffer = (char *)data;
bind[0].buffer_length = strlen(data);

if (mysql_stmt_bind_param(stmt, bind)) {
fprintf(stderr, "mysql_stmt_bind_param() failed: %s\n", mysql_stmt_error(stmt));
mysql_stmt_close(stmt);
return;
}

// Execute the statement
if (mysql_stmt_execute(stmt)) {
fprintf(stderr, "mysql_stmt_execute() failed: %s\n", mysql_stmt_error(stmt));
} else {
printf("Data successfully inserted: %s\n", data);
}

// Close the statement
mysql_stmt_close(stmt);
}

// Function to establish a connection to the MySQL database
MYSQL* connect_to_db() {
MYSQL *conn = mysql_init(NULL);

if (conn == NULL) {
fprintf(stderr, "mysql_init() failed\n");
return NULL;
}

if (mysql_real_connect(conn, "host", "user", "password", "dbname", 0, NULL, 0) == NULL) {
fprintf(stderr, "mysql_real_connect() failed: %s\n", mysql_error(conn));
mysql_close(conn);
return NULL;
}

return conn;
}

int main() {
MYSQL *conn = connect_to_db();
if (conn == NULL) {
return 1; // Exit if connection failed
}

// Example data insertion
insert_into_db(conn, "Sample scraped data");

// Close the connection when done
mysql_close(conn);
return 0;
}

In this example, data is inserted into a MySQL database in real time as it is scraped. You can use connection pooling or batch inserts to improve performance.
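
One way to batch inserts is to wrap a group of rows in a single transaction, so the server commits once rather than once per row. A minimal sketch reusing the insert_into_db() helper above:

// Insert many rows inside one transaction to cut per-row commit overhead
void insert_batch(MYSQL *conn, const char **rows, size_t count) {
    if (mysql_query(conn, "START TRANSACTION")) {
        fprintf(stderr, "START TRANSACTION failed: %s\n", mysql_error(conn));
        return;
    }

    for (size_t i = 0; i < count; i++) {
        insert_into_db(conn, rows[i]); // Reuse the prepared-statement helper for each row
    }

    if (mysql_query(conn, "COMMIT")) {
        fprintf(stderr, "COMMIT failed: %s\n", mysql_error(conn));
    }
}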

File-Based Storage (CSV or JSON Formats)

If you’re not working with databases, storing data in CSV or JSON files is another efficient way to handle large-scale data extraction. These formats are simple, lightweight, and easy to parse.

Here’s how to store data in CSV:

#include <stdio.h>

void save_to_csv(const char *filename, const char *data) {
FILE *file = fopen(filename, "a");
if (file == NULL) {
fprintf(stderr, "Error opening file for writing\n");
return;
}

fprintf(file, "%s\n", data); // Write data to CSV format
fclose(file);
}

Here’s how to do the same for JSON:

#include <stdio.h>

void save_to_json(const char *filename, const char *data) {
FILE *file = fopen(filename, "a");
if (file == NULL) {
fprintf(stderr, "Error opening file for writing\n");
return;
}

fprintf(file, "{ \"data\": \"%s\" }\n", data); // Write data in JSON format
fclose(file);
}

Both CSV and JSON storage formats are easy to read and share, and they allow for flexible data storage without the overhead of managing a database.

Integrating with External APIs

For certain use cases, sending extracted data to an external API for processing or storage might be necessary. This is common in large-scale data extraction when integrating with third-party platforms like data warehouses or real-time analytics tools. Here’s how to do this with Scrape.do:

#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Callback function to capture the response
size_t write_callback(void *ptr, size_t size, size_t nmemb, void *userdata) {
size_t total_size = size * nmemb;
char *response = (char *) userdata;
strncat(response, (char *) ptr, total_size);
return total_size;
}

void send_to_scrapedo_api(const char *api_key, const char *target_url) {
CURL *curl;
CURLcode res;
char response[2048] = {0}; // Buffer to store the response

// Construct the Scrape.do API endpoint with the target URL
char scrape_do_url[1024];
snprintf(scrape_do_url, sizeof(scrape_do_url),
"https://api.scrape.do/?key=%s&url=%s", api_key, target_url);

curl = curl_easy_init();
if (curl) {
// Set the Scrape.do URL
if (curl_easy_setopt(curl, CURLOPT_URL, scrape_do_url) != CURLE_OK) {
fprintf(stderr, "Failed to set URL\n");
curl_easy_cleanup(curl);
return;
}

// Set a callback function to capture the response
if (curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_callback) != CURLE_OK) {
fprintf(stderr, "Failed to set write callback\n");
curl_easy_cleanup(curl);
return;
}

// Set the buffer for response data
if (curl_easy_setopt(curl, CURLOPT_WRITEDATA, response) != CURLE_OK) {
fprintf(stderr, "Failed to set response buffer\n");
curl_easy_cleanup(curl);
return;
}

// Set timeouts to prevent hanging
curl_easy_setopt(curl, CURLOPT_TIMEOUT, 10L); // 10 seconds timeout
curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, 5L); // 5 seconds connection timeout

// Perform the request
res = curl_easy_perform(curl);

if (res != CURLE_OK) {
fprintf(stderr, "Failed to send data: %s\n", curl_easy_strerror(res));
} else {
// Print the scraped response
printf("Response from Scrape.do: %s\n", response);
}

// Cleanup
curl_easy_cleanup(curl);
} else {
fprintf(stderr, "Failed to initialize CURL\n");
}
}

int main() {
const char *api_key = "your_scrapedo_api_key"; // Replace with your Scrape.do API key
const char *target_url = "https://example.com"; // Replace with the target URL you want to scrape

send_to_scrapedo_api(api_key, target_url);

return 0;
}

Connection Pooling and Rate Limiting

It’s critical to manage connections efficiently and limit the number of requests per second to avoid IP blocking and maintain performance in large-scale scraping operations.

Connection pooling reduces the overhead of repeatedly opening and closing network connections by maintaining a pool of reusable connections. This improves performance and reduces server load.

libcurl already caches connections inside each easy handle, so the simplest way to get connection reuse is to keep using the same handle for multiple HTTP requests instead of creating a new one for every URL. You can also enable TCP keep-alive so idle connections stay open between requests:

curl_easy_setopt(curl, CURLOPT_TCP_KEEPALIVE, 1L); // Enable TCP keep-alive
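
Here's a minimal sketch of reusing a single easy handle across several requests so libcurl's connection cache can kick in (the URLs are placeholders):

CURL *curl = curl_easy_init();
if (curl) {
    curl_easy_setopt(curl, CURLOPT_TCP_KEEPALIVE, 1L); // Keep idle connections alive

    const char *urls[] = {
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    };

    for (size_t i = 0; i < sizeof(urls) / sizeof(urls[0]); i++) {
        curl_easy_setopt(curl, CURLOPT_URL, urls[i]);
        CURLcode res = curl_easy_perform(curl); // Connections to the same host are reused between iterations
        if (res != CURLE_OK) {
            fprintf(stderr, "Request for %s failed: %s\n", urls[i], curl_easy_strerror(res));
        }
    }

    curl_easy_cleanup(curl); // Cleaning up closes the cached connections
}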

To avoid triggering server-side rate limits or IP blocks, you should control the number of requests made per second. A common strategy is to implement a request delay or use token bucket algorithms to limit the rate of requests.

#include <unistd.h>

void request_with_rate_limiting(const char *url) {
CURL *curl;
CURLcode res;

curl = curl_easy_init();
if (curl) {
curl_easy_setopt(curl, CURLOPT_URL, url);

res = curl_easy_perform(curl); // Perform the request
if (res != CURLE_OK) {
fprintf(stderr, "Request failed: %s\n", curl_easy_strerror(res));
}

curl_easy_cleanup(curl);
}

sleep(1); // Sleep for 1 second between requests
}

This method adds a 1-second delay between requests to ensure you don’t exceed the server’s rate limits.
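
A token bucket gives you finer control than a fixed sleep: requests can burst up to the bucket's capacity, and the bucket refills at a steady rate. A minimal sketch using only standard time functions:

#include <time.h>

// A simple token bucket: up to `capacity` requests may burst, refilled at `rate` tokens per second
typedef struct {
    double tokens;      // Tokens currently available
    double capacity;    // Maximum burst size
    double rate;        // Tokens added per second
    time_t last_refill; // When the bucket was last topped up
} TokenBucket;

int take_token(TokenBucket *bucket) {
    time_t now = time(NULL);
    double elapsed = difftime(now, bucket->last_refill);

    // Refill for the elapsed time, capped at the bucket capacity
    bucket->tokens += elapsed * bucket->rate;
    if (bucket->tokens > bucket->capacity) {
        bucket->tokens = bucket->capacity;
    }
    bucket->last_refill = now;

    if (bucket->tokens >= 1.0) {
        bucket->tokens -= 1.0; // Spend one token on this request
        return 1;              // OK to send the request now
    }
    return 0;                  // Out of tokens: the caller should wait and try again
}

Before each curl_easy_perform() call, loop on take_token() and sleep briefly whenever it returns 0; with rate = 1 and capacity = 5, for example, the scraper averages one request per second but can burst five at once.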

Integrating C Web Scraping into Larger Systems

Once you’ve implemented a C-based web scraper, the next step is often integrating its output with larger systems for further processing, storage, or automation. While C excels at performance and resource control, it may not always be the best choice for tasks like data analysis, visualization, or machine learning. By passing the data to higher-level services, such as Python or Node.js, you can leverage more specialized tools for these tasks.

Python is a popular choice for data analysis due to its rich ecosystem of libraries like pandas, NumPy, and Matplotlib. You can pass the scraped data from your C program to Python for deeper analysis or visualization.

Example: Exporting Data to CSV for Python Processing (your C scraper can save data in a CSV file, which Python can then process):

void save_to_csv(const char *filename, const char *data) {
FILE *file = fopen(filename, "a");
if (file == NULL) {
fprintf(stderr, "Error opening file\n");
return;
}
fprintf(file, "%s\n", data);
fclose(file);
}

In Python, you can read the CSV file and analyze the data:

import pandas as pd

# Load the data scraped by the C program
df = pd.read_csv('scraped_data.csv')

# Perform analysis or visualization
print(df.describe())

We talked more about how to use Python for web scraping in this article.

If your system requires real-time data processing or integration with web services, you can pass scraped data from your C program to Node.js, which is optimized for non-blocking, event-driven architectures.

void send_to_api(const char *url, const char *json_data) {
    CURL *curl;
    CURLcode res;

    curl = curl_easy_init();
    if (curl) {
        // Build a header list so the server knows the body is JSON
        struct curl_slist *headers = NULL;
        headers = curl_slist_append(headers, "Content-Type: application/json");

        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, json_data);
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers); // CURLOPT_HTTPHEADER expects a curl_slist, not a plain string

        res = curl_easy_perform(curl);
        if (res != CURLE_OK) {
            fprintf(stderr, "Failed to send data: %s\n", curl_easy_strerror(res));
        }

        curl_slist_free_all(headers); // Free the header list
        curl_easy_cleanup(curl);
    }
}

On the Node.js side, you could set up an endpoint to receive and process the data:

const express = require('express');
const app = express();

app.use(express.json());

app.post('/scraped-data', (req, res) => {
console.log('Received data:', req.body);
// Process the scraped data here
res.sendStatus(200);
});

app.listen(3000, () => console.log('Server running on port 3000'));

While C provides great control over performance and resource management, setting up a complete web scraping system can become complex, as you've just seen, especially when dealing with dynamic content, IP blocking, or rate limits. The easiest way to scrape data right now is to use Scrape.do.

Scrape.do simplifies the web scraping process by handling much of this complexity for you. With Scrape.do, you can focus on extracting and processing the data, rather than worrying about rendering JavaScript-heavy pages, rotating proxies, or managing captchas.

One of the key advantages of Scrape.do is our ability to handle JavaScript rendering automatically, making it much easier to scrape modern web applications. Additionally, we provide built-in proxy rotation and IP management, helping you avoid rate-limiting or IP-blocking issues without manually setting up proxy pools.

We also allow you to scrape at scale by handling multiple requests concurrently, freeing you from managing the networking layer and allowing you to focus on data extraction. By integrating Scrape.do into your C-based system, you can leverage our advanced features while maintaining the performance and efficiency that C offers for processing large-scale data.

Additional Resources

To dive deeper into web scraping using C and explore more advanced techniques, here are some helpful links and best practices to help you maintain and scale your C-based scrapers:

  • libcurl Documentation: The official documentation for libcurl, the library used for making HTTP requests in C. It includes detailed guides on how to perform different types of requests, handle errors, and manage connections.
  • libxml2 Documentation: The official site for libxml2, a powerful library for parsing XML and HTML documents. The documentation covers its features and provides tutorials on DOM traversal and XPath queries.
  • Gumbo-Parser Documentation: Gumbo-parser is an open-source library for parsing HTML5 documents. The GitHub repository includes usage instructions and examples for integrating it into your web scraping tasks.

Best Practices for Maintaining and Scaling C-Based Scrapers

  • Modularize Your Code: Break down your scraping logic into reusable modules (e.g., HTTP requests, parsing, and data storage). This makes the code easier to maintain and update as the target websites change.
  • Implement Robust Error Handling: Ensure that your scraper can gracefully handle network errors, timeouts, and unexpected data formats. This includes using retry logic and exponential backoff for failed requests.
  • Monitor and Log Scraper Activity: Always log your requests, responses, and errors. This will help you troubleshoot issues as they arise and monitor the health of your scraper as it scales.
  • Respect Website Policies: Always follow a website’s terms of service and respect their robots.txt file. Implement rate limiting to avoid overloading servers and prevent your IP from being blocked.
  • Use Proxies and Rotate IPs: For large-scale scraping, using proxies and rotating IPs can help avoid detection and blocking (see the sketch after this list). Services like Scrape.do provide built-in proxy management, making it easier to scale your operations.
  • Automate and Schedule Scraping Tasks: Use automation tools like cron jobs to schedule scraping tasks and keep your data updated regularly without manual intervention.
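
As a starting point for the proxy recommendation above, here's a minimal sketch of routing a request through a proxy with libcurl. The proxy address and credentials are placeholders; substitute the values from your own proxy provider:

CURL *curl = curl_easy_init();
if (curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://www.example.com");
    curl_easy_setopt(curl, CURLOPT_PROXY, "http://proxy.example.com:8080"); // Placeholder proxy endpoint
    curl_easy_setopt(curl, CURLOPT_PROXYUSERPWD, "username:password");      // Placeholder credentials, if the proxy needs them

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        fprintf(stderr, "Proxied request failed: %s\n", curl_easy_strerror(res));
    }
    curl_easy_cleanup(curl);
}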