Category: Scraping basics

How to Use Curl_cffi for Web Scraping

14 mins read · Created Date: September 12, 2024 · Updated Date: September 18, 2024

A handy and efficient way of extracting information from websites is data scraping. Data scraping is widely used for data collection, analysis, and automation. Although many libraries exist for scraping the web, curl_cffi has gained popularity due to its performance benefits.

curl_cffi is a Python binding for libcurl, a feature-rich URL transfer library. When used for web scraping, curl_cffi offers advantages over traditional libraries because it interacts with libcurl at a low level, enabling faster network I/O. It also supports asynchronous requests, resilient error handling, detailed control over headers and cookies, and connection reuse, all of which improve speed, reliability, and resource management. Furthermore, curl_cffi integrates smoothly with the Python ecosystem, providing flexibility for advanced customization. This makes it a strong choice for complicated, high-traffic scraping projects.

In this article, we will see how to use curl_cffi for web scraping, focusing on a practical implementation and exploring advanced usage techniques.

Setting Up Your Environment

Setting up the development environment is the first step in starting a web scraping project. This section will guide you through the system requirements, installation process, and verification steps.

System Requirements

To use curl_cffi, make sure your system meets the following requirements:

  1. Python: You need Python 3.8 or above to carry out this project successfully.
  2. libcurl: You must install the libcurl library. However, most Linux systems come with libcurl already installed. For macOS users, it can be installed via Homebrew. For Windows users, installation is via a package manager like vcpkg.

Install libcurl: more detailed instructions are available on the curl website.

  • Linux: sudo apt-get install libcurl4-openssl-dev
  • macOS: brew install curl
  • Windows: Install via vcpkg or download from curl.se.

Install curl_cffi:

The simplest way to install curl_cffi is from pip. Open your terminal (on macOS or Linux) or command prompt (on Windows) and run the following command:

pip install curl_cffi

Verifying the Installation

After installing curl_cffi, it’s important to verify that the installation was successful. You can do this by running a simple script to import the library and make a basic request.

  1. Create a verification script: Create a new Python file named test_curl_cffi.py and add the following code:
from curl_cffi import requests


try:
    response = requests.get('https://httpbin.org/get')
    if response.status_code == 200:
        print('Yay! curl_cffi installation is successful!')
    else:
        print('Something went wrong with the request. Try again!')
except Exception as e:
    print(f'Oops! An error occurred: {e}')
  2. Run the verification script: Execute the script using Python:
python test_curl_cffi.py

If the installation was successful, you should see the message:

Yay! curl_cffi installation is successful!

With your environment set up, tested and verified, you are now ready to start using curl_cffi for web scraping.

Basic Usage of Curl_cffi

With your environment set up, we can move on to the basics of using curl_cffi for web scraping: importing the library, initializing a session, making a simple GET request, and handling the response.

Importing the Library and Initializing a Session

First, you need to import the necessary components from the curl_cffi library. Initializing a session is straightforward and allows you to maintain settings and cookies across multiple requests.

from curl_cffi import requests


# Initialize session
session = requests.Session()
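
Because the session persists cookies and other settings across requests, you can verify this behavior with a quick check. The following is a minimal sketch, assuming httpbin.org's /cookies/set and /cookies endpoints and that redirects are followed:

# Ask httpbin to set a cookie; the redirect to /cookies is followed
session.get('https://httpbin.org/cookies/set?demo=1', allow_redirects=True)

# The same session automatically sends the stored cookie on later requests
response = session.get('https://httpbin.org/cookies')
print(response.json())  # expected output includes {'cookies': {'demo': '1'}}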

Making a Simple GET Request and Handling the Response

Making a GET request with curl_cffi is similar to using other popular HTTP libraries. Here’s a basic example, similar to our earlier verification script, of how to perform a GET request and handle the response:

# Define URL
url = 'https://www.google.com/'


# Make GET request
response = session.get(url)


# Check the status code
if response.status_code == 200:
    print('Request was successful!')
    print('Response content:')
    print(response.text)
else:
    print(f'Failed to retrieve the URL. Status code: {response.status_code}')

Expected Output:

If the request to Google’s homepage is successful, you will see:

Request was successful!
Response content:
[HTML content of Google's homepage]

If the request fails, you will see a message indicating the failure and the status code:

Failed to retrieve the URL. Status code: [status code]

The script demonstrates how to use curl_cffi to make a simple GET request, check the response status, and print the response content or an error message based on the result.
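
Beyond response.text, the response object exposes the attributes you would expect from a requests-style API, which is useful when handling the response. A minimal sketch reusing the response from the previous request:

# Inspect other parts of the response
print(response.status_code)                   # numeric HTTP status, e.g. 200
print(response.headers.get('Content-Type'))   # a response header
print(len(response.content))                  # raw body size in bytes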

Error Handling and Troubleshooting Common Issues

Effective error handling is essential for robust web scraping scripts. Here are some common issues you might encounter and how to handle them:

  1. Invalid URL: Always ensure the URL is correctly formatted. If it’s malformed, curl_cffi will raise an exception.
try:
    response = session.get('htp://invalid-url')
except requests.errors.RequestsError as e:
    print(f'An error occurred: {e}')

The code attempts to make a GET request to an invalid URL (‘htp://invalid-url’). If an exception occurs during the request, it is caught by the except block. The specific exception type caught is requests.errors.RequestsError, which is the correct way to catch request-related exceptions in the curl_cffi library.

Since the URL ‘htp://invalid-url’ is invalid, the script will catch the exception and print an error message similar to the following:

An error occurred: Failed to perform, curl: (1) Protocol "htp" not supported. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.
  2. HTTP Errors: Handle HTTP errors gracefully by checking the status code and taking appropriate action.
response = session.get('https://httpbin.co/status/404')
if response.status_code == 404:
    print('Resource not found.')
elif response.status_code == 500:
    print('Server error.')
else:
    print(f'HTTP Status Code: {response.status_code}')

Expected Output:

When you run the script, you should see the following output:

Resource not found.

This output is because the URL https://httpbin.co/status/404 returns a 404 status code, indicating that the resource was not found.

  3. Connection Issues: Handle connection-related errors such as timeouts and DNS failures.
try:
    response = session.get('https://httpbin.co/delay/10', timeout=5)
except requests.exceptions.Timeout:
    print('The request timed out.')
except requests.exceptions.RequestException as e:
    print(f'An error occurred: {e}')

By implementing these error-handling techniques, you can make your web scraping scripts more resilient and reliable.

Now that you are familiar with the basic usage of curl_cffi, including making simple GET requests and handling common issues, we can move on to more complex requests. Next, we will cover how to handle requests with custom headers, cookies, and data payloads.

Handling Complex Requests

In web scraping, it’s usually necessary to make more complex requests that include custom headers, cookies, and data payloads. Let’s look at how to handle these advanced scenarios using curl_cffi.

Setting Custom Headers

Custom headers can be essential for mimicking a browser request, avoiding detection, or providing necessary information to the server.

# Define URL
url = 'https://httpbin.co/headers'


# Define custom headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.3',
    'Accept-Language': 'en-US,en;q=0.9'
}


# Make the GET request with custom headers
response = session.get(url, headers=headers)


# Print the response content
print(response.json())

Managing Cookies

Cookies are often used for maintaining sessions, tracking user activity, or bypassing login pages. curl_cffi makes it easy to manage cookies within a session.

# Define URL
url = 'https://httpbin.co/cookies'


# Define cookies
cookies = {
    'session_id': '0123456789',
    'user': 'Dumebi'
}


# Make the GET request with cookies
response = session.get(url, cookies=cookies)


# Print the response content
print(response.json())

Sending POST Requests with Data

Sending data to the server via POST requests is common when interacting with forms, APIs, or submitting search queries.

# Define URL
url = 'https://httpbin.co/post'


# Define the data payload
data = {
    'username': 'user123',
    'password': 'pass123'
}


# Make the POST request with data
response = session.post(url, data=data)


# Print the response content
print(response.json())
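
If the endpoint expects a JSON body rather than form data, curl_cffi also accepts a json parameter, much like the requests library. A minimal sketch against the same httpbin endpoint (the payload fields here are just illustrative):

# Send the payload as JSON instead of form-encoded data
response = session.post(url, json={'query': 'books', 'page': 1})

# The echo endpoint returns the parsed JSON body in its response
print(response.json())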

Combining Headers, Cookies, and Data in a Single Request

Here’s an example that combines custom headers, cookies, and a data payload in a single POST request:

# Define the URL
url = 'https://httpbin.co/post'

# Define custom headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.3',
    'Accept-Language': 'en-US,en;q=0.9'
}

# Define cookies
cookies = {
    'session_id': '0123456789',
    'user': 'Dumebi'
}

# Define the data payload
data = {
    'username': 'user123',
    'password': 'pass123'
}

# Make the POST request with headers, cookies, and data
response = session.post(url, headers=headers, cookies=cookies, data=data)

# Print the response content
print(response.json())

By mastering these techniques, you can handle a wide variety of web scraping scenarios, making your scripts more versatile and effective.

Advanced Features and Optimization

Advanced features and optimization techniques are crucial for efficient and effective web scraping, especially when dealing with large datasets or dynamic content. Here we will cover handling asynchronous requests, configuring timeouts and retries, and using proxies with curl_cffi.

Handling Asynchronous Requests for Improved Performance

Asynchronous requests can significantly improve the performance of your web scraping tasks by allowing multiple requests to be made concurrently. curl_cffi supports asynchronous operations, which can be utilized for parallel processing.

import asyncio
from curl_cffi import requests


# Define an asynchronous function to perform a GET request using a synchronous wrapper
async def fetch(url):
    response = await asyncio.to_thread(requests.get, url)
    return response


# List of URLs to scrape
urls = [
    'https://httpbin.co/get?arg=1',
    'https://httpbin.co/get?arg=2',
    'https://httpbin.co/get?arg=3'
]


# Define an asynchronous function to fetch multiple URLs
async def fetch_all(urls):
    tasks = [fetch(url) for url in urls]
    responses = await asyncio.gather(*tasks)
    return responses


# Run the asynchronous function
responses = asyncio.run(fetch_all(urls))
for response in responses:
    print(response.json())

This script performs asynchronous GET requests to multiple URLs concurrently, improving performance by running requests in parallel.
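
curl_cffi also provides its own AsyncSession, so you can await requests natively instead of pushing synchronous calls onto threads. Here is a minimal sketch of the same fetch-all pattern using AsyncSession (reusing the urls list defined above):

import asyncio
from curl_cffi.requests import AsyncSession


# Fetch all URLs concurrently with curl_cffi's native async API
async def fetch_all_async(urls):
    async with AsyncSession() as async_session:
        tasks = [async_session.get(url) for url in urls]
        return await asyncio.gather(*tasks)


responses = asyncio.run(fetch_all_async(urls))
for response in responses:
    print(response.json())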

Configuring Timeouts and Retries

Setting appropriate timeouts and retries can help handle slow or unreliable connections, ensuring that your scraping tasks are more resilient.

import curl_cffi
import time


url = "https://example.com"
max_retries = 3
retry_delay = 2  # seconds


curl = curl_cffi.Curl()
curl.setopt(curl_cffi.CurlOpt.URL, url)
curl.setopt(curl_cffi.CurlOpt.TIMEOUT, 10)  # Timeout after 10 seconds
curl.setopt(curl_cffi.CurlOpt.CONNECTTIMEOUT, 5)  # Connection timeout after 5 seconds


response_data = []
curl.setopt(curl_cffi.CurlOpt.WRITEFUNCTION, response_data.append)


for attempt in range(max_retries):
    try:
        curl.perform()
        break  # Exit the loop if the request is successful
    except curl_cffi.CurlError as e:
        print(f"Attempt {attempt + 1} failed: {e}")
        time.sleep(retry_delay)
else:
    print("Max retries reached. Failed to retrieve the data.")


response_text = ''.join([data.decode('utf-8') for data in response_data])
print(response_text)

Here we demonstrate a GET request with explicit timeouts and a retry loop, using curl_cffi's low-level Curl API, to handle slow responses or connection issues.

Using Proxies for Scraping

Using proxies can help you avoid IP bans and scrape websites that restrict access based on geographical location. curl_cffi sessions accept a proxies dictionary, similar to the requests library; replace the placeholder address in the example below with a real proxy.

# Define the URL
url = 'https://httpbin.co/get'


# Define the proxies (replace the placeholder with your actual proxy address)
proxies = {
    'http': 'http://your-proxy-address:8080',
    'https': 'http://your-proxy-address:8080'
}


# Set timeout and retries
timeout = 10  # seconds
max_retries = 3


# Make the GET request through the proxy, with timeout and retries
for attempt in range(max_retries):
    try:
        response = session.get(url, proxies=proxies, timeout=timeout)
        response.raise_for_status()
        print('Request was successful!')
        break
    except requests.exceptions.Timeout:
        print('The request timed out. Retrying...')
    except requests.exceptions.RequestException as e:
        print(f'An error occurred: {e}')
        break

The code above shows how to route a GET request through a proxy, allowing you to bypass IP restrictions or access content from different geographical locations.

Error Handling and Debugging

Effective error handling and debugging are essential for creating reliable web scraping scripts. This section will cover common HTTP errors, using logging for debugging, and handling connection errors and timeouts gracefully.

Common HTTP Errors and How to Handle Them

HTTP errors can occur for various reasons, such as invalid URLs, server issues, or access restrictions. Here’s how to handle some common HTTP errors:

# Define the URL
url = 'https://httpbin.co/status/404'

# Make the GET request
response = session.get(url)

# Handle common HTTP errors
if response.status_code == 404:
    print('Resource not found.')
elif response.status_code == 500:
    print('Server error.')
else:
    print(f'HTTP Status Code: {response.status_code}')

This code checks the HTTP status code of the response and provides specific handling for common errors like 404 (Not Found) and 500 (Server Error).
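
Alternatively, you can let the library raise on error statuses instead of checking each code by hand: response.raise_for_status() throws an exception for 4xx and 5xx responses. A minimal sketch, reusing the same URL and the exception class used elsewhere in this guide:

try:
    response = session.get(url)
    response.raise_for_status()  # raises for 4xx/5xx status codes
    print('Request was successful!')
except requests.exceptions.RequestException as e:
    print(f'HTTP error: {e}')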

Using Logging for Debugging

Logging is a powerful tool for diagnosing issues and tracking the behavior of your script. Set up logging to capture detailed information about requests and responses:

import logging
from curl_cffi import requests

# Configure logging
logging.basicConfig(level=logging.DEBUG)

# Initialize a session
session = requests.Session()

# Define the URL
url = 'https://httpbin.org/get'

# Make the GET request with logging
try:
    response = session.get(url)
    response.raise_for_status()
    logging.debug(f'Response: {response.text}')
except requests.exceptions.RequestException as e:
    logging.error(f'An error occurred: {e}')

Here the script configures logging to capture detailed debug information and errors, helping you troubleshoot issues with requests and responses.

Handling Connection Errors and Timeouts Efficiently

Managing connection errors and timeouts ensures your script can recover from temporary issues and continue running smoothly:

# Define the URL
url = 'https://httpbin.co/delay/5'

# Set timeout
timeout = 2

# Make the GET request with timeout handling
try:
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    print('Request was successful!')
except requests.exceptions.Timeout:
    print('The request timed out.')
except requests.exceptions.RequestException as e:
    print(f'An error occurred: {e}')

This code demonstrates how to handle timeouts and connection errors by catching exceptions and providing appropriate error messages.

By implementing robust error handling and logging, you can create more reliable web scraping scripts that are easier to debug and maintain.

Practical Example: Scraping a Real Website

To consolidate everything we’ve covered so far, let’s walk through a practical example of web scraping using curl_cffi. We will scrape a real website, handle pagination, extract and process data, and store the results in a suitable format.

For this example, we will scrape a public website that lists books and their details. Our goal is to extract the book titles and prices from multiple pages of listings. We will be using BeautifulSoup to parse the HTML. If you don’t already have it installed, you can add it with pip, as shown below.
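
Install BeautifulSoup:

pip install beautifulsoup4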

Let’s tackle it step by step:

  1. Sending Requests and Parsing Responses:
from curl_cffi import requests
from bs4 import BeautifulSoup


# Define the base URL
base_url = 'https://www.scrapingcourse.com/-{}.html'


# Function to get the HTML content of a page
def get_page_content(page_number):
    url = base_url.format(page_number)
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        return None


# Parse the HTML content and extract data
def parse_books(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    books = []
    for book in soup.select('.product_pod'):
        title = book.select_one('h3 a')['title']
        price = book.select_one('.price_color').text
        books.append({'title': title, 'price': price})
    return books


# Test the functions
page_content = get_page_content(1)
if page_content:
    books = parse_books(page_content)
    print(books)

Here we define functions to send a GET request to a specified page and parse the HTML content to extract book titles and prices.

  2. Handling Pagination and Dynamic Content:
def scrape_books(total_pages):
    all_books = []
    for page in range(1, total_pages + 1):
        page_content = get_page_content(page)
        if page_content:
            books = parse_books(page_content)
            all_books.extend(books)
        else:
            break
    return all_books


# Scrape books from the first 5 pages
total_pages = 5
all_books = scrape_books(total_pages)
print(f'Total books scraped: {len(all_books)}')

The code above handles pagination by looping through multiple pages and scraping book data from each page, aggregating the results into a single list.

  3. Storing the scraped data in structured formats like JSON or CSV:
import json
import csv


# Save data to a JSON file
def save_to_json(data, filename):
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=4)


# Save data to a CSV file
def save_to_csv(data, filename):
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['title', 'price'])
        writer.writeheader()
        for row in data:
            writer.writerow(row)


# Save the scraped books to JSON and CSV
save_to_json(all_books, 'books.json')
save_to_csv(all_books, 'books.csv')

This code defines functions to save the scraped book data to CSV and JSON files, providing structured storage options.

Here is the complete code that integrates all the steps:

from curl_cffi import requests
from bs4 import BeautifulSoup
import csv
import json


# Define the base URL
base_url = 'https://www.scrapingcourse.com/-{}.html'


# Function to get the HTML content of a page
def get_page_content(page_number):
    url = base_url.format(page_number)
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        return None


# Parse the HTML content and extract data
def parse_books(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    books = []
    for book in soup.select('.product_pod'):
        title = book.select_one('h3 a')['title']
        price = book.select_one('.price_color').text
        books.append({'title': title, 'price': price})
    return books


# Scrape books from multiple pages
def scrape_books(total_pages):
    all_books = []
    for page in range(1, total_pages + 1):
        page_content = get_page_content(page)
        if page_content:
            books = parse_books(page_content)
            all_books.extend(books)
        else:
            break
    return all_books


# Save data to a CSV file
def save_to_csv(data, filename):
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['title', 'price'])
        writer.writeheader()
        for row in data:
            writer.writerow(row)


# Save data to a JSON file
def save_to_json(data, filename):
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=4)


# Scrape books and save the data
total_pages = 5
all_books = scrape_books(total_pages)
save_to_csv(all_books, 'books.csv')
save_to_json(all_books, 'books.json')


print(f'Total books scraped: {len(all_books)}')

This complete script brings together all the steps in one code block: scraping book data from a website, handling multiple pages, and saving the results to CSV and JSON files, demonstrating a full web scraping workflow using curl_cffi.

By following this example, you can see how to apply curl_cffi for web scraping in a real-world scenario, handling requests, parsing responses, managing pagination, and storing the scraped data in a structured format.

Conclusion

In conclusion, web scraping is a widely used technique for extracting useful information from websites, and curl_cffi is very advantageous in terms of efficiency and performance. Throughout this guide, we have explored various aspects of web scraping with curl_cffi, from setting up your environment to handling complex requests and optimizing your scraping tasks.

We encourage you to experiment with the advanced features of curl_cffi and explore custom implementations to suit your specific needs. Additionally, leveraging tools like Scrape.do can further enhance your scraping capabilities, offering a scalable solution for extracting data from websites.