Categories: Web scraping, Tutorials

Advanced Web Scraping with Scrapy: A Python Tutorial for Efficient Data Extraction

26 mins read Created Date: October 03, 2024   Updated Date: October 03, 2024

Master advanced Scrapy techniques for efficient web scraping. Learn concurrency, data handling, and ethical practices.

Scrapy is a high-level web scraping and web crawling framework in Python that simplifies the process of building scalable, efficient scrapers. It allows you to manage requests, handle concurrency, parse HTML, and extract structured data, all within a framework that scales well across multiple domains.

In this article, we’ll discuss Scrapy and advanced techniques for scraping data with it in Python. This article is designed for those with prior experience in Python and web scraping who want to elevate their skills by tackling more complex and large-scale scraping challenges using Scrapy. If you’re familiar with basic scraping libraries like BeautifulSoup or requests, you’ll find Scrapy to be a powerful alternative for more advanced use cases.

Without further ado, let’s dive right in!

Installation and Setup

To get started with Scrapy, let’s quickly set it up on your local environment or server. Scrapy can be installed using pip or conda. Here’s how to install it using both methods:

pip install scrapy

conda install -c conda-forge scrapy

While Scrapy itself is lightweight, you'll want Python 3.8 or newer, at least 1GB of RAM (more depending on the scale of your scraping project), and reliable network access.

Next, we'll use a virtual environment to keep our project dependencies isolated.

# Create and activate a virtual environment
python -m venv scrapy_env
source scrapy_env/bin/activate  # On Windows, use: scrapy_env\Scripts\activate

# Create a new Scrapy project
scrapy startproject advanced_scraper
cd advanced_scraper

This creates a new Scrapy project with the following structure:

advanced_scraper/
│
├── scrapy.cfg
└── advanced_scraper/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py

At this point, your project is set up, and you’re ready to start writing spiders to crawl websites.

Creating Your First Scrapy Spider

Now that you have your Scrapy project set up, let’s walk through creating a basic spider to extract data from a sample website. A spider is a class that defines how Scrapy will navigate the web, request pages, and extract data.

Let's create a spider to scrape product information from a sample e-commerce website.

Basic Structure of a Scrapy Spider

Spiders in Scrapy are Python classes that subclass scrapy.Spider and define several key attributes and methods. A spider defines how to scrape a particular site or group of sites, starting with the initial requests to make.

Here’s the basic structure of a Scrapy spider:

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"  # Unique identifier for the spider
    allowed_domains = ["example.com"]  # Optional: restricts the spider to these domains
    start_urls = ["https://example.com/start_page"]  # Starting point(s) for the crawl

    def parse(self, response):
        # This method is called for each request
        # It handles the response and returns either dicts with extracted data,
        # or Request objects for following links
        pass

To create a new spider, use the following command:

scrapy genspider example example.com

This creates a file named example.py in the spiders/ directory. Now, let’s edit the advanced_scraper/spiders/example.py file:

import scrapy

class EcommerceSpider(scrapy.Spider):
    name = "ecommerce_spider"
    allowed_domains = ["scrapingcourse.com"]
    start_urls = ["https://www.scrapingcourse.com/ecommerce/"]

    def parse(self, response):
        products = response.css('div.card')
        for product in products:
            yield {
                'title': product.css('h5.card-title::text').get(),
                'price': product.css('p.card-text::text').get(),
                'availability': product.css('p.availability::text').get(default='').strip(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Anatomy of a Scrapy Spider

Here’s a breakdown of the core components of a Scrapy spider:

  • name: This is the unique name for your spider. It must be unique within your project as Scrapy uses this to run the spider.
  • allowed_domains: A list of domains the spider is allowed to crawl. It prevents your spider from accidentally going to external sites.
  • start_urls: This defines the list of URLs where Scrapy will start crawling from.
  • parse() method: The main method of the spider. It’s responsible for processing the HTTP response and defining how to extract the relevant data.

Scrape Data from a Sample Site

Now, let's create a simple spider to scrape a sample website. To do that, we'll update the parse() method to extract the product names and prices from the scraping course homepage. Here's an example of scraping data from an HTML structure:

import scrapy

class EcommerceSpider(scrapy.Spider):
    name = "ecommerce"
    allowed_domains = ["scrapingcourse.com"]
    start_urls = ["https://www.scrapingcourse.com/ecommerce/"]

    def parse(self, response):
        # Extract the product name and price using XPath selectors
        for product in response.xpath("//div[@class='card']"):
            name = product.xpath(".//h5[@class='card-title']/text()").get()
            price = product.xpath(".//p[@class='card-text']/text()").get()

            # Yield the scraped data
            yield {
                "name": name.strip() if name else None,
                "price": price.strip() if price else None
            }

In the EcommerceSpider class, we use XPath selectors to extract product information from the e-commerce site. The selector //div[@class='card'] targets each product card, while .//h5[@class='card-title']/text() retrieves the product name and .//p[@class='card-text']/text() extracts the product price. The scraped data is then yielded as a Python dictionary, which Scrapy processes for export.

A better way to do this is to use Scrapy-Scrapedo, a library from Scrape.do, one of the best scraping services in the business. Scrapydo's rotating proxies, headless browsers, and CAPTCHA-handling technology let you retrieve the raw data before the target website detects that you are sending bot traffic, avoiding the blocking issues you would otherwise run into.

Here’s a quick example usage:

from scrapydo import scrapy, scrapedo

class ScrapedoSampleCrawler(scrapy.Spider):
    name = "scrapedo_sample"

    def __init__(self):
        super().__init__(scrapedo.RequestParameters(
            token="YOUR_TOKEN",  # Replace with your Scrape.do token from: dashboard.scrape.do
            params={
                "geoCode": "us",
                "super": False,
                "render": True,  # Using render feature for JavaScript-heavy pages
                "playWithBrowser": [
                    {"Action": "Click", "Selector": "#manpage > div.mp > ul > li:nth-child(3) > a"},
                    {"Action": "Wait", "Timeout": 2000},
                    {"Action": "Execute", "Execute": "document.URL"}
                ]
            }
        ))

    def start_requests(self):
        urls = [
            'https://www.scrapingcourse.com/ecommerce/'
        ]
        for url in urls:
            yield self.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.body)  # Output page contents
        print("Target URL:", self.target_url(response))

You can also use Scrapydo’s proxy mode for anonymous and geo-targeted scraping:

from scrapydo import scrapy, scrapedo

class ScrapedoProxyCrawler(scrapy.Spider):
    name = "scrapedo_proxy"

    def __init__(self):
        super().__init__(scrapedo.RequestParameters(
            token="YOUR_TOKEN",  # Replace with your Scrape.do token from: dashboard.scrape.do
            params={
                "geoCode": "uk",  # Example: Geo-targeting for the UK
                "super": True
            },
            proxy_mode=True  # Enable proxy mode
        ))

    def start_requests(self):
        urls = [
            'https://www.scrapingcourse.com/ecommerce/'
        ]
        for url in urls:
            yield self.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.body)  # Output page contents
        print("Target URL:", self.target_url(response))

Save Data to a File

To save the scraped data to a JSON file, run the following command:

scrapy crawl ecommerce -o products.json

This will create a products.json file containing the names and prices of the products listed on the homepage.

Testing and Troubleshooting Spiders

To test your spider, you can use the built-in scrapy shell command. It allows you to interactively inspect the response from a URL, which helps troubleshoot extraction issues.

scrapy shell "https://www.scrapingcourse.com/ecommerce/"

Once inside the shell, you can test XPath or CSS selectors interactively. For example, to check that the product cards are being matched:

response.css('div.card').get()

For more detailed output, use the -s LOG_LEVEL=DEBUG option:

scrapy crawl ecommerce -s LOG_LEVEL=DEBUG

Common errors and troubleshooting tips:

  • ImportError: Ensure all required libraries are installed.
  • No items extracted: Double-check your XPath or CSS selectors using scrapy shell.
  • HTTP 403 Forbidden errors: This often occurs if the website blocks non-browser user agents. Add a custom user agent in settings.py to mimic a browser (see the example after this list).
  • Slow response times: Consider adjusting Scrapy’s concurrency settings to optimize crawling speed in the settings.py file.
  • Blocked requests: Some websites may block scraping. Consider using delays or rotating user agents.
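
For example, a browser-like user agent can be set project-wide with the USER_AGENT setting in settings.py. The string below is just an illustration; any current browser user agent works:

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'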

Selectors and XPath/CSS Expressions

Data extraction is at the core of web scraping, and Scrapy’s Selector class offers powerful tools to target and extract data from web pages precisely. Let’s look at using the Selector class, XPath, and CSS selectors to navigate the HTML DOM and extract structured data efficiently.

Using Scrapy’s Selector Class

The scrapy.Selector class simplifies working with web page content by providing an easy interface for parsing and querying the HTML tree structure. When Scrapy fetches a web page, it creates a Selector object that allows you to access elements via XPath or CSS expressions.

selector = scrapy.Selector(response)

This selector can be used to apply XPath or CSS queries to extract elements from the HTML.
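
As a quick, self-contained illustration (the HTML fragment below is made up for the example), you can also build a Selector directly from raw markup and query it with either syntax:

from scrapy import Selector

# A stand-in for response.text
html = '<div class="card"><h5 class="card-title">Sample Product</h5><p class="card-text">$9.99</p></div>'
selector = Selector(text=html)

# The same element queried with CSS and with XPath
print(selector.css('h5.card-title::text').get())                  # Sample Product
print(selector.xpath('//h5[@class="card-title"]/text()').get())   # Sample Product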

Using XPath and CSS Selectors

Scrapy supports XPath and CSS selectors, but each has advantages depending on the use case.

  • XPath: More powerful and flexible. It allows you to navigate the entire DOM, query elements by their attributes, position, or text content. XPath supports complex queries, which are particularly useful when HTML is irregular.
  • CSS: Simpler to use and often faster for basic queries. CSS selectors are more limited compared to XPath, but they work well for straightforward element selections, especially for performance-critical tasks.

Examples of Extracting Specific Elements

Let’s enhance our spider with more complex selections:

import scrapy

class EcommerceSpider(scrapy.Spider):
    name = "ecommerce_spider"
    allowed_domains = ["scrapingcourse.com"]
    start_urls = ["https://www.scrapingcourse.com/ecommerce"]

    def parse(self, response):
        self.log(f"Visiting {response.url}")

        # Locate products
        products = response.xpath('//div[@class="card"]')
        self.log(f"Found {len(products)} products.")

        for product in products:
            # Safely handle availability
            availability = product.xpath('.//p[@class="card-text"]/text()').get()
            availability = availability.strip() if availability else 'N/A'  # Handle missing availability text

            yield {
                'title': product.xpath('.//h5[@class="card-title"]/text()').get(),
                'price': product.xpath('.//p[@class="card-text"]/text()').get(),
                'availability': availability,
                'category': response.xpath('//ul[@class="breadcrumb"]/li[3]/a/text()').get(),
                'image_url': response.urljoin(product.xpath('.//img/@src').get()),
            }

        # Follow the next page link if it exists
        next_page = response.xpath('//a[contains(@class, "page-link") and contains(text(), "Next")]/@href').get()
        if next_page:
            self.log(f"Following next page: {next_page}")
            yield response.follow(next_page, self.parse)

This uses XPath selectors to extract product titles, prices, and availability from each product card, along with the category from the breadcrumb navigation. The image URL is constructed from the product’s relative source. The spider also handles pagination by following the “Next” page link to gather additional product data, ensuring a thorough extraction from the site.

Extracting Text Content (Product Title, Price, Description)

For an e-commerce product page, let’s say you want to extract the product title, price, and description.

Using XPath:

title = response.xpath("//h1[@class='product-title']/text()").get()
price = response.xpath("//span[@class='product-price']/text()").get()
description = response.xpath("//div[@class='product-description']/p/text()").get()

Using CSS:

title = response.css("h1.product-title::text").get()
price = response.css("span.product-price::text").get()
description = response.css("div.product-description p::text").get()

If you want to scrape product links or images:

product_link = response.xpath("//a[@class='product-link']/@href").get()
product_image = response.xpath("//img[@class='product-image']/@src").get()

Using CSS:

product_link = response.css("a.product-link::attr(href)").get()
product_image = response.css("img.product-image::attr(src)").get()

XPath vs CSS: Performance and Precision

  • Performance: Generally, CSS selectors are faster for simple selections, while XPath can be more efficient for complex queries.
  • Precision: XPath allows for more precise selection, especially when dealing with complex relationships or selecting parent/sibling elements.

Choose based on your specific needs and the complexity of the HTML structure you’re scraping.
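
For instance, XPath axes let you move around the tree relative to a matched node, such as stepping up to its parent, which CSS selectors cannot do. The class names below are hypothetical:

# The container <div> that wraps a price element
price_container = response.xpath("//span[@class='price']/parent::div").get()

# The stock label that immediately follows the price
stock = response.xpath("//span[@class='price']/following-sibling::span[@class='stock']/text()").get()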

Handling Requests and Pagination

One key challenge when scraping multi-page websites is handling pagination to ensure data is collected across all available pages. Scrapy provides powerful mechanisms to manage requests, follow links, and handle interactions like sending custom headers, cookies, or form data.

Our spider already handles basic pagination, but let’s explore more advanced request-handling techniques.

The scrapy.Request() method allows you to create new requests to follow links. To manage pagination, you'll often need to extract the "Next Page" URL from the HTML and use scrapy.Request() to follow it. Here's an example that scrapes detailed information from each product's page:

import scrapy

class EcommerceSpider(scrapy.Spider):
    name = "ecommerce_spider"
    allowed_domains = ["scrapingcourse.com"]
    start_urls = ["https://www.scrapingcourse.com/ecommerce/"]

    def parse(self, response):
        products = response.xpath('//div[@class="card"]')
        for product in products:
            product_url = response.urljoin(product.xpath('.//a/@href').get())
            yield scrapy.Request(product_url, callback=self.parse_product_page,
                                 meta={'category': response.xpath('//a[@class="breadcrumb-item active"]/text()').get()})

        next_page = response.xpath('//a[contains(@class, "page-link")]/@href').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product_page(self, response):
        yield {
            'title': response.xpath('//h5[@class="card-title"]/text()').get(),
            'price': response.xpath('//p[@class="card-text"]/text()').get(),
            'description': response.xpath('//div[@id="product_description"]/following-sibling::p/text()').get(),
            'upc': response.xpath('//th[contains(text(), "UPC")]/following-sibling::td/text()').get(),
            'category': response.meta['category'],
        }

Handling dynamic pagination

For dynamically generated pagination links (e.g., using JavaScript), Scrapy won’t natively handle JavaScript execution. However, if the links are embedded in the HTML as attributes (e.g., in a hidden element like a load more button), you can extract them using XPath or CSS selectors. Here’s how you can handle this:

import json

import scrapy

class DynamicPaginationSpider(scrapy.Spider):
    name = "dynamic_pagination"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        products = response.css('.product')
        for product in products:
            yield self.parse_product(product)

        # Check if there's a "Load More" button pointing at an AJAX endpoint
        load_more = response.css('button#load-more::attr(data-url)').get()
        if load_more:
            yield scrapy.Request(response.urljoin(load_more), self.parse_ajax_products)

    def parse_ajax_products(self, response):
        # Parse the JSON response returned by the AJAX endpoint
        data = json.loads(response.text)
        for product in data['products']:
            # Each product here is a plain dict, not an HTML element
            yield {
                'name': product.get('name'),
                'price': product.get('price'),
            }

        # Check if there are more products to load
        if data.get('has_more'):
            yield scrapy.Request(data['next_page_url'], self.parse_ajax_products)

    def parse_product(self, product):
        # Extract product information from an HTML product card
        return {
            'name': product.css('.name::text').get(),
            'price': product.css('.price::text').get(),
        }

Handling GET and POST Requests with Custom Headers and Cookies

Some websites require interaction, like submitting a form or logging in, before granting access to certain data. Scrapy allows you to make both GET and POST requests with custom headers, cookies, and form data. Here’s how you can customize your requests:

class AuthenticatedSpider(scrapy.Spider):
    name = "authenticated_spider"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'myuser', 'password': 'mypassword'},
            callback=self.after_login
        )

    def after_login(self, response):
        # Check if login succeeded
        if "Welcome" in response.text:
            # Now let's scrape the authenticated pages
            yield scrapy.Request("https://example.com/protected-page",
                                 callback=self.parse_protected_page,
                                 headers={'Custom-Header': 'Value'},
                                 cookies={'session_id': 'some_value'})

    def parse_protected_page(self, response):
        # Scrape the protected content
        yield {
            'protected_data': response.css('.protected-content::text').get()
        }

When scraping data from multiple pages, handling pagination effectively is crucial. Here’s an example of how we can implement pagination using scrapy.Request() in the e-commerce scraping scenario:

import scrapy

class EcommerceSpider(scrapy.Spider):
    name = "ecommerce_spider"
    allowed_domains = ["scrapingcourse.com"]
    start_urls = ["https://www.scrapingcourse.com/ecommerce/"]

    def parse(self, response):
        products = response.xpath('//div[@class="card"]')
        for product in products:
            yield {
                'title': product.xpath('.//h5[@class="card-title"]/text()').get(),
                'price': product.xpath('.//p[@class="card-text"]/text()').get(),
            }

        next_page = response.xpath('//a[contains(@class, "page-link")]/@href').get()
        if next_page:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)

Pipelines and Data Storage

Once you’ve successfully scraped the data, the next step is to process and store it efficiently. Scrapy’s Item Pipeline allows you to define custom processing steps for your data, such as cleaning, validation, and saving to various formats or storage backends like databases, cloud storage, or local files.

To implement a pipeline, create a class in the pipelines.py file:

class DataCleansingPipeline:
    def process_item(self, item, spider):
        item['title'] = item['title'].strip().lower()
        item['price'] = float(item['price'].replace('£', ''))
        return item

Items are containers used to collect scraped data. You can define them in the items.py file:

import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    upc = scrapy.Field()
    category = scrapy.Field()

Finally, you have to update your spider to use this Item:

from ..items import BookItem

class BooksSpider(scrapy.Spider):
    # ... (previous code)

    def parse_book_page(self, response):
        book_item = BookItem()
        book_item['title'] = response.xpath('//h1/text()').get()
        book_item['price'] = response.xpath('//p[@class="price_color"]/text()').get()
        book_item['description'] = response.xpath('//div[@id="product_description"]/following-sibling::p/text()').get()
        book_item['upc'] = response.xpath('//th[contains(text(), "UPC")]/following-sibling::td/text()').get()
        book_item['category'] = response.meta['category']
        return book_item

Saving Data in Multiple Formats

Scrapy makes it easy to save data to JSON or CSV format through its feed exports; you don't need to implement custom pipelines for these formats. From the command line:

scrapy crawl books_spider -o books.json

scrapy crawl books_spider -o books.csv
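
Equivalently, exports can be configured once through the FEEDS setting in settings.py (available in recent Scrapy versions); the file paths below are arbitrary examples:

# settings.py
FEEDS = {
    'output/products.json': {'format': 'json', 'encoding': 'utf8', 'overwrite': True},
    'output/products.csv': {'format': 'csv'},
}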

For saving data to a relational database like SQLite, you’ll need to implement a custom pipeline that inserts each item into the database.

import sqlite3

class SQLitePipeline:
    def __init__(self):
        self.conn = sqlite3.connect('products.db')
        self.cur = self.conn.cursor()
        self.cur.execute("""
        CREATE TABLE IF NOT EXISTS products
        (title TEXT, price REAL, description TEXT, upc TEXT PRIMARY KEY, category TEXT)
        """)

    def process_item(self, item, spider):
        self.cur.execute("""
        INSERT OR REPLACE INTO products (title, price, description, upc, category)
        VALUES (?, ?, ?, ?, ?)
        """, (item['title'], item['price'], item['description'], item['upc'], item['category']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

Real-World Example: Saving Data to Amazon S3

To save data to an Amazon S3 bucket, you can use Scrapy’s built-in support for S3 using the scrapy-s3pipeline extension:

pip install scrapy-s3pipeline

Then simply update your settings.py:

ITEM_PIPELINES = {
    'scrapy_s3pipeline.S3Pipeline': 100,
}

S3PIPELINE_URL = 's3://my-bucket/books-%(time)s.json'
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'

Data Cleaning and Validation

It’s important to clean and validate data before saving it. This ensures that you avoid storing corrupt or incomplete records, which could cause issues downstream in your data analysis pipeline.

Data Cleaning:

  • Remove leading/trailing whitespace from strings.
  • Convert data types (e.g., ensure prices are stored as floats).

Data Validation:

  • Check that required fields (e.g., name, price) are not empty.
  • Ensure data meets specific conditions (e.g., prices are greater than 0).

You can implement data cleaning and validation in your pipeline like this:

from scrapy.exceptions import DropItem

class DataCleaningPipeline:
    def process_item(self, item, spider):
        if not item['title']:
            raise DropItem("Missing title")
        if not item['price']:
            raise DropItem("Missing price")

        # Clean the data
        item['title'] = item['title'].strip().title()
        item['price'] = float(item['price'].replace('£', ''))

        # Validate the data
        if item['price'] <= 0:
            raise DropItem("Invalid price")

        return item

Remember to enable your pipelines in settings.py:

ITEM_PIPELINES = {
    'advanced_scraper.pipelines.DataCleaningPipeline': 300,
    'advanced_scraper.pipelines.SQLitePipeline': 400,
}

These enhancements provide a more comprehensive coverage of Scrapy’s features for handling requests, pagination, and data processing.

Concurrency and Performance Optimization

Scrapy is designed for performance and scale, but to achieve optimal results when scraping complex or large-scale websites, you need to fine-tune various settings and strategies. Let's explore advanced techniques for managing concurrency, preventing IP bans, and handling anti-scraping measures like CAPTCHAs and rate limiting.

Configuring Concurrency Settings

The CONCURRENT_REQUESTS setting controls how many requests Scrapy processes simultaneously. Increasing concurrency speeds up scraping, but too many requests can overwhelm the server or lead to IP blocks. The default value is 16, but you can tune it based on your needs. For optimal results, update your settings.py to fine-tune concurrency:

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True

Using DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY

While increasing concurrency can boost speed, it also raises the risk of overwhelming the target server, which might result in being blocked. To prevent this, use the DOWNLOAD_DELAY setting to introduce a delay between requests.

  • DOWNLOAD_DELAY: Defines a fixed delay (in seconds) between requests to the same domain.
  • RANDOMIZE_DOWNLOAD_DELAY: Randomizes the delay between requests, adding some variability to the crawling speed.
DOWNLOAD_DELAY = 1  # Add a 1-second delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # Randomize the download delay

Using DOWNLOAD_DELAY is particularly useful when scraping small websites with limited resources, ensuring you don’t overwhelm their servers.

Setting appropriate user agents and managing IP rotation

Many websites use IP blocking or user-agent detection as anti-scraping mechanisms. Scrapy provides middlewares to rotate user-agents and proxies, helping you avoid detection and blocking. Here’s how to implement this using middlewares:

  • Create a middleware for rotating user agents:
import random
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
        # Add more user agents...
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        obj = cls(crawler.settings.get('USER_AGENT'))
        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        return obj

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agents)
        request.headers['User-Agent'] = user_agent
  • Scrape.do offers several advantages over plain Scrapy when it comes to handling anti-bot measures. While Scrapy is a powerful framework for building custom web scrapers, it requires significant manual effort to configure and manage anti-bot evasion strategies. Scrape.do abstracts away many of these complexities and provides built-in solutions for the common anti-bot measures employed by websites. Here's a simple middleware that uses it to handle IP rotation:
class ScrapedoProxyMiddleware:
    def __init__(self, scrape_do_api_key):
        self.scrape_do_api_key = scrape_do_api_key

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('SCRAPE_DO_API_KEY'))

    def process_request(self, request, spider):
        request.meta['proxy'] = f'http://{self.scrape_do_api_key}:@proxy.scrape.do:8080'
  • You can enable these middlewares in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
    'myproject.middlewares.ScrapedoProxyMiddleware': 410,
}

SCRAPE_DO_API_KEY = 'your_api_key_here'

Handling Retries and Timeouts

Sometimes, websites may be temporarily unreachable or take too long to respond. Scrapy allows you to configure retries and timeouts to handle such cases without interrupting the scraping process. Configure them in settings.py:

RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
DOWNLOAD_TIMEOUT = 15

For more control, you can create a custom retry middleware:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class CustomRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) and not request.meta.get('dont_retry', False):
            return self._retry(request, exception, spider)

Handling CAPTCHAs

One common method to prevent scraping is to serve CAPTCHAs. While Scrapy doesn’t handle CAPTCHAs out of the box, you can use services like 2Captcha or Anti-Captcha to solve them programmatically. You can integrate these services using custom middlewares.
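
As a rough sketch of what such an integration could look like, the downloader middleware below detects a likely CAPTCHA page and hands it to a solver. The solve_captcha() helper is a hypothetical placeholder for whichever solving service you wire in:

from scrapy.exceptions import IgnoreRequest

class CaptchaDetectionMiddleware:
    def process_response(self, request, response, spider):
        # Crude heuristic: many challenge pages mention "captcha" in the body
        if b'captcha' in response.body.lower():
            spider.logger.warning(f"CAPTCHA detected on {response.url}")
            token = self.solve_captcha(response)  # hypothetical call to 2Captcha/Anti-Captcha
            if token:
                # Retry the original request with the solved token attached
                return request.replace(
                    meta={**request.meta, 'captcha_token': token},
                    dont_filter=True,
                )
            raise IgnoreRequest(f"Unsolved CAPTCHA on {response.url}")
        return response

    def solve_captcha(self, response):
        # Placeholder: integrate a CAPTCHA-solving service here and return its token
        return None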

Alternatively, you can use Scrape.do. Our solution automatically detects and bypasses CAPTCHAs by using CAPTCHA-solving algorithms or third-party services to handle challenges seamlessly, so you won't need to worry about CAPTCHA integration or failure handling.

Handling JavaScript-Rendered Content

Scrapy doesn’t execute JavaScript, which means it can’t scrape content that’s rendered client-side by JavaScript. However, you can use tools like Splash or ScrapyJS to render and scrape JavaScript-heavy websites.

First, install the scrapy-splash plugin:

pip install scrapy-splash

Next, configure settings.py (this assumes a Splash instance is running locally on port 8050, typically started with Docker):

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Finally, in your spider, use Splash to handle JavaScript-rendered pages:

from scrapy_splash import SplashRequest

class JavascriptSpider(scrapy.Spider):
    name = "javascript_spider"
    start_urls = ["https://example.com/javascript"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # Extract data from rendered page
        product_name = response.css('h1::text').get()
        yield {'name': product_name}

This allows you to scrape content that is loaded via JavaScript after the initial page load.

Real-World Example: Handling Blocked Requests

Let’s say you are scraping a site that blocks repetitive requests from the same IP or user-agent.

Here’s how you would optimize the spider to deal with these issues:

import scrapy
from scrapy.exceptions import CloseSpider

class AntiBlockSpider(scrapy.Spider):
    name = 'anti_block_spider'
    start_urls = ['https://example.com/products']
    handle_httpstatus_list = [403]  # Allow 403 responses to reach parse() instead of being filtered out
    request_count = 0
    max_requests = 1000

    def parse(self, response):
        if response.status == 403:
            self.logger.error("Received 403 Forbidden. Possibly blocked.")
            raise CloseSpider('Blocked by target website')

        self.request_count += 1
        if self.request_count > self.max_requests:
            self.logger.info("Reached maximum number of requests. Closing spider.")
            raise CloseSpider('Reached request limit')

        # Extract data here
        products = response.css('.product')
        for product in products:
            yield {
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get(),
            }

        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), self.parse)

This spider counts requests, stops after a certain number, and handles 403 errors that might indicate blocking.

Advanced Features: Middleware, Extensions, and Signals

Scrapy is designed to be highly customizable and extensible, allowing you to modify its behavior at almost every stage of the scraping process. Middlewares, extensions, and signals are powerful tools that let you add custom behavior, monitor performance, and interact with the core scraping process. Let’s explore some of these.

Middleware

Middlewares in Scrapy allow you to process requests and responses as they flow through the framework. You can create custom middlewares to alter requests, handle responses, or modify the crawling process dynamically. We’ve already seen examples with user agent and proxy rotation. Here’s another example that adds custom headers:

class CustomHeadersMiddleware:
    def process_request(self, request, spider):
        request.headers['X-Custom-Header'] = 'Some Value'
        return None

Extensions

Scrapy extensions allow you to add functionality beyond request and response processing, such as logging, metrics collection, or monitoring the performance of your spiders. Extensions run alongside the main scraping process, providing insights and additional features without disrupting the core behavior.

Here’s an example that logs stats at the end of a crawl:

from scrapy import signals
from scrapy.exceptions import NotConfigured

class SpiderStats:
    def __init__(self):
        self.stats = {}

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        spider.logger.info(f"Spider closed: {reason}")
        spider.logger.info(f"Items scraped: {spider.crawler.stats.get_value('item_scraped_count')}")

To use it, you have to enable this extension in settings.py:

EXTENSIONS = {
    'myproject.extensions.SpiderStats': 500,
}

Signals

Scrapy’s signal API allows you to hook into various events in the scraping process, giving you full control over how and when certain actions are triggered. Signals can be used to handle events like spider startup, request errors, item scraping, or spider shutdown.

Here’s an example using signals to pause and resume a spider:

from scrapy import signals
import time

class PauseResumeExtension:
    def __init__(self, crawler):
        self.crawler = crawler
        self.interval = 60  # Pause every 60 seconds
        self.pause_duration = 10  # Pause for 10 seconds

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        self.last_pause = time.time()

    def should_pause(self):
        return time.time() - self.last_pause > self.interval

    def pause_spider(self, spider):
        self.crawler.engine.pause()
        spider.logger.info("Spider paused for %s seconds", self.pause_duration)
        time.sleep(self.pause_duration)
        self.crawler.engine.unpause()
        self.last_pause = time.time()
        spider.logger.info("Spider resumed")

Here are a few real-world examples where extending Scrapy’s default behavior using middlewares, extensions, and signals is necessary:

  • Handling Dynamic Content: Creating a middleware to handle dynamic content loading with tools like Selenium or Splash for JavaScript-heavy websites (a minimal Selenium sketch follows this list).
  • Retrying on Specific Errors: Using signals to retry requests when encountering specific HTTP errors, like rate limits (HTTP 429).
  • Scaling Crawls: Building custom extensions to distribute scraping tasks across multiple servers or manage resource limits like memory and CPU usage.
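
For the first of these, a minimal Selenium-based downloader middleware might look like the sketch below. It assumes Chrome and a matching chromedriver are available and is only meant to show the shape of the approach; for production, scrapy-splash (shown earlier) or dedicated plugins are more robust:

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class SeleniumMiddleware:
    def __init__(self):
        options = Options()
        options.add_argument('--headless')  # Run the browser without a visible window
        self.driver = webdriver.Chrome(options=options)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Let the real browser fetch and render the page, then hand the
        # rendered HTML back to Scrapy as a normal response
        self.driver.get(request.url)
        return HtmlResponse(request.url, body=self.driver.page_source,
                            encoding='utf-8', request=request)

    def spider_closed(self, spider):
        self.driver.quit()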

Common Issues and Debugging

When developing web scrapers with Scrapy, various challenges can arise, from blocked requests to problems with selectors or unexpected server responses.

Handling Blocked Requests

One of the most common issues in web scraping is receiving blocked requests, resulting in HTTP status codes like 403 Forbidden or 503 Service Unavailable. This usually happens when websites detect that you’re using a bot and implement anti-scraping mechanisms.

When dealing with 403 or 503 errors:

  • Implement IP rotation and user agent rotation as shown earlier.
  • Add delays between requests:
DOWNLOAD_DELAY = 2  # 2 seconds between requests
  • Respect robots.txt:
ROBOTSTXT_OBEY = True

Debugging Selectors

Another common issue is when XPath or CSS selectors fail to return the data you expect. This usually happens due to changes in the website’s structure, incorrect selectors, or dynamic content that requires JavaScript to render.

If your selectors aren’t returning expected data:

  • Use Scrapy shell to test selectors. This helps you identify issues with selectors and refine them until they correctly target the desired elements:
scrapy shell "https://example.com"
>>> response.css('div.product::text').getall()
  • Check for dynamic content loaded by JavaScript. You might need to use Splash or Selenium for JS-rendered content.

Handling unexpected responses

Sometimes, the server responds with unexpected data, such as broken HTML, rate-limiting pages, or redirect loops. Handling these scenarios is crucial for building robust scrapers.

  • Use try-except blocks in your parsing methods:
def parse(self, response):
    try:
        title = response.css('h1::text').get()
        price = response.css('.price::text').get()
        yield {'title': title, 'price': price}
    except Exception as e:
        self.logger.error(f"Error on {response.url}: {e}")
  • Implement custom error handling middleware:
from scrapy.spidermiddlewares.httperror import HttpError, HttpErrorMiddleware

class CustomHttpErrorMiddleware(HttpErrorMiddleware):
    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, HttpError):
            spider.logger.error(f"HttpError on {response.url}")
            return []
  • Gracefully Handle Broken HTML: Broken or poorly formatted HTML can cause selectors to fail. Use lxml’s robust parsing to handle such cases, and catch exceptions when selectors fail.
try:
    title = response.xpath('//h1/text()').get()
except Exception as e:
    self.logger.warning(f"Failed to extract title: {e}")

Useful Debugging Tools

Scrapy provides several tools and techniques that make it easier to debug your spiders and track down issues:

  • Scrapy’s built-in logging:
self.logger.debug("Debug message")
self.logger.info("Info message")
self.logger.warning("Warning message")
self.logger.error("Error message")
  • Use the scrapy parse command to test your spider:
scrapy parse --spider=myspider -c parse_item -d 2 <url>
  • Enable debug logging in settings.py:
LOG_LEVEL = 'DEBUG'

Deployment: Scraping at Scale

Deploying Scrapy spiders in production environments is a critical step to ensure your scraping tasks run efficiently and at scale. Whether you’re managing multiple spiders or need to scrape large amounts of data from various websites, understanding how to deploy Scrapy effectively is key.

Let’s quickly cover the essentials of deploying Scrapy spiders using Scrapyd, Docker, and cloud platforms, as well as best practices for scaling your scraping operations.

Deploying with Scrapyd

Scrapyd is a service for running Scrapy spiders. It allows you to deploy and manage your spiders remotely through an HTTP API, making it easier to scale your operations and manage multiple spiders in production.

First, let's install Scrapyd:

pip install scrapyd

You can start Scrapyd locally by running:

scrapyd

To deploy your Scrapy project to Scrapyd, add a deploy target to the scrapy.cfg file in your project's root directory (Scrapy created this file when you ran startproject):

[deploy]
url = http://localhost:6800/
project = myproject

Then, install scrapyd-client and deploy your project with its scrapyd-deploy command:

pip install scrapyd-client
scrapyd-deploy

Once deployed, you can schedule your spiders using Scrapyd’s HTTP API.

curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider

Scrapyd simplifies remote spider management and allows you to deploy new versions of your spider without manual intervention on the server.

Using Docker

Docker is an excellent tool for containerizing and deploying Scrapy spiders. Containerization makes your spiders portable, easy to deploy, and ensures that they run in a consistent environment. First, create a Dockerfile:

FROM python:3.9

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["scrapy", "crawl", "myspider"]

Build and run the Docker container:

docker build -t my-scrapy-project .
docker run my-scrapy-project

Cloud Deployment

Once containerized, you can deploy your Scrapy spiders to any cloud provider that supports Docker, such as AWS EC2, Google Cloud, DigitalOcean, or Heroku. For example, you can use AWS Elastic Container Service (ECS) or Google Kubernetes Engine (GKE) to manage your containers at scale.

For AWS EC2

  • Launch an EC2 instance
  • SSH into your instance
  • Install Docker
  • Pull your Docker image or clone your Git repository.
  • Run your Scrapy spider

Scheduling and large-scale scraping

On Linux-based servers (like EC2), you can use cron to schedule your spiders to run at specific intervals.

0 0 * * * /usr/local/bin/docker run my-scrapy-project

For distributed scraping, consider using Scrapy Cluster:

git clone https://github.com/istresearch/scrapy-cluster.git
cd scrapy-cluster
docker-compose up -d

This sets up a distributed crawling system using Redis and Kafka.

Best Practices for Large-Scale Data Extraction:

  • Use a Proxy Pool: For large-scale scraping operations, integrate a proxy service to avoid IP bans.
  • Enable AutoThrottle: Use Scrapy’s AUTOTHROTTLE feature to dynamically adjust request rates based on server response times.
  • Store data in cloud storage: For persistent storage, integrate with cloud storage solutions like Amazon S3 or Google Cloud Storage to store your scraped data at scale.

Best Practices for Ethical Scraping

Web scraping can provide valuable insights and data, but it’s essential to ensure that your scraping activities are conducted ethically and within legal boundaries. Here are some best practices to ensure your scraping efforts respect website owners, users, and the data being collected.

Adhering to robots.txt

The robots.txt file on websites specifies the pages or sections that are allowed or disallowed for automated access (such as scraping and crawling). In Scrapy, you can configure the spider to automatically respect robots.txt rules by setting the following in settings.py:

ROBOTSTXT_OBEY = True

You can also implement a simplified custom robots.txt middleware for finer control. The version below is a standalone downloader middleware that uses Python's built-in urllib.robotparser and fetches each domain's robots.txt synchronously, which is fine for modest crawls:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

from scrapy.exceptions import IgnoreRequest

class CustomRobotsTxtMiddleware:
    def __init__(self):
        self._parsers = {}  # One cached parser per domain

    def process_request(self, request, spider):
        parsed_url = urlparse(request.url)
        if parsed_url.netloc not in self._parsers:
            parser = RobotFileParser(f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt")
            parser.read()  # Blocking fetch of robots.txt; result is cached afterwards
            self._parsers[parsed_url.netloc] = parser
        if not self._parsers[parsed_url.netloc].can_fetch("*", request.url):
            raise IgnoreRequest(f"URL {request.url} is forbidden by robots.txt")
        return None

Respecting Rate Limits

Many websites have rate limits that dictate how often a user (or bot) can make requests within a certain timeframe. Failing to observe rate limits can lead to website overload or blocking. Ethical scraping involves respecting these limits and adjusting your scraping frequency accordingly. To do that, you can:

  • Implement delays:
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True
  • Use AutoThrottle extension:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

Preventing Data Abuse

Scraping data for legitimate purposes is acceptable, but it’s important to ensure that scraped data isn’t abused or misused.

  • Implement data sanitization in your pipelines:
class DataSanitizationPipeline:
    def process_item(self, item, spider):
        # Remove sensitive information
        if 'email' in item:
            del item['email']
        # Mask partial information
        if 'phone' in item:
            item['phone'] = self.mask_phone(item['phone'])
        return item

    def mask_phone(self, phone):
        return phone[:3] + 'xxx' + phone[-4:]
  • Implement access controls and encryption for stored data.
  • Document the intended use of scraped data and adhere to it.
  • Regularly review and update your scraping practices to ensure ongoing compliance with ethical standards and changing website policies.

By following these best practices, you ensure that your web scraping activities are responsible, respectful, and less likely to cause issues for the websites you’re scraping or for the individuals whose data you might encounter.

Conclusion

Scraping the web effectively requires not just technical expertise but also ethical awareness and performance optimization. From setting up Scrapy spiders to managing large-scale data extraction with tools like Docker, Scrapyd, and cloud platforms, each step plays a vital role in building a successful scraping project. However, as you can see, it can be quite challenging to implement.

Scrape.do enhances this entire process, offering a seamless solution for many of the advanced techniques we've discussed. With built-in features like automatic IP rotation, CAPTCHA handling, and anti-bot evasion, Scrape.do simplifies the toughest challenges, allowing you to focus on extracting data without worrying about the complexities of scraping.

Whether you’re scaling your operations or tackling heavily protected websites, Scrape.do ensures that your scraping is faster, more reliable, and compliant with best practices.

So, if you’re looking to elevate your web scraping, Scrape.do is your go-to tool for efficiency and scalability.


Author: Fimber (Kasarachi) Elemuwa


As a certified content writer and technical writer, I transform complex information into clear, concise, and user-friendly content. I have a Bachelor’s degree in Medical Microbiology and Bacteriology from the University of Port Harcourt, which gives me a solid foundation in scientific and technical writing.
