Web Scraper for Airflow: The Ultimate Scheduled Pipeline Template (Powered by Scrape.do)

Created Date: May 06, 2026

Updated Date: May 06, 2026

Growth

In data engineering and growth automation, getting your scrapers to run on a schedule — without babysitting them — is half the battle. You need data flowing into your pipelines every morning, but managing cron jobs, retry logic, and proxy bans is a massive headache.

That's why we built the Scrape.do Universal DAG Template for Apache Airflow.

We've done the heavy lifting. We packaged the Scrape.do scraping API into a battle-tested Airflow DAG file (.py) with built-in error handling, automatic retries, and daily scheduling. All you have to do is drop it in your dags folder, plug in your token, and let Airflow do the rest.

Here is your step-by-step guide on how to use this template to pull data from any website on the internet — on autopilot.

Step 1: Boot the Engine on macOS

We're going to use macOS's Unix backbone and install everything directly through Terminal. Hit Cmd + Space, type "Terminal", and hit Enter. Then paste these commands one by one.

1. Carve out a dedicated workspace folder and jump inside:

mkdir scrape_do_airflow && cd scrape_do_airflow

2. Spin up an isolated virtual environment so you don't pollute your system Python:

python3 -m venv venv && source venv/bin/activate

3. Install the engine (Airflow + requests):

pip install apache-airflow requests

4. Boot the whole stack and serve the panel:

airflow standalone

Crucial first step: capture the admin password

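When you run the last command, logs will start streaming. Watch carefully: Airflow prints the generated admin password to the console. Copy it right away, because you'll need it in the next step. Airflow also saves the password to a file in its home folder, so you can recover it if it scrolls past.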

Step 2: Get Inside the Airflow Dashboard

Now let's check that the engine is running.

1. Open your browser and go to http://localhost:8080.
2. Log in with username admin and the password you just copied.
3. The Airflow dashboard loads.

You'll see the example DAGs Airflow ships with by default. Ignore them. We're about to add our own — the only one that matters.

Step 3: Drop in the Scrape.do Universal DAG Template

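In Airflow, scheduled task definitions are called DAGs (Directed Acyclic Graphs). The airflow standalone command auto-created Airflow's home folder for you. By default that folder is ~/airflow in your home directory, not inside your project, unless you set the AIRFLOW_HOME environment variable before starting Airflow. Inside it, create a dags/ subfolder and drop in a new file named scrape_do_template.py.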
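If you are unsure where your dags/ folder lives, the lookup can be sketched in a few lines of Python. This is a minimal sketch that assumes the default layout: the DAGs folder is dags/ under the Airflow home directory, and the home directory is ~/airflow unless AIRFLOW_HOME is set. The function name resolve_dags_folder is made up for illustration, and a custom dags_folder setting in airflow.cfg would override this result.

```python
import os
from pathlib import Path


def resolve_dags_folder(env=None):
    """Return the default DAGs folder: <AIRFLOW_HOME>/dags, else ~/airflow/dags.

    Hypothetical helper for illustration; it does not read airflow.cfg.
    """
    env = os.environ if env is None else env
    airflow_home = Path(env.get("AIRFLOW_HOME", "~/airflow")).expanduser()
    return airflow_home / "dags"


if __name__ == "__main__":
    # Show where the template file is expected to go
    print(resolve_dags_folder() / "scrape_do_template.py")
```

Run it inside the same terminal session (and virtual environment) you use for Airflow, so that it sees the same AIRFLOW_HOME value.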

Here's the template — production-ready, with auto-retry on failure and daily scheduling baked in:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
import requests

# 1. The Scrape.do core engine
def scrape_do_task(**kwargs):
    token = "<your_token>"  # Plug in your own Scrape.do token here
    target_url = "https://books.toscrape.com/"

    # Pass the query as a dict so requests URL-encodes target_url.
    # A raw target URL containing ? or & would otherwise corrupt the query string.
    params = {"token": token, "super": "true", "render": "true", "url": target_url}

    print(f"Initiating request to target: {target_url}")

    try:
        response = requests.get("https://api.scrape.do/", params=params, timeout=45)
    except requests.RequestException as e:
        # Network errors and timeouts fail the task so Airflow can retry it
        raise RuntimeError(f"Failed: {e}") from e

    if response.status_code != 200:
        # Raising marks the task failed; Airflow auto-retries 2 minutes later
        raise ValueError(f"Error Code: {response.status_code}")

    print("Success! Data extracted.")
    # Print the first 200 chars of the response to the logs
    print(response.text[:200])
    return "Scraping Completed"

# 2. Rules and scheduling settings
default_args = {
    'owner': 'scrape_do',
    'depends_on_past': False,
    'retries': 2,                              # If it fails, try 2 more times
    'retry_delay': timedelta(minutes=2),       # Wait 2 minutes between retries
}

# 3. Define the DAG (the workflow)
with DAG(
    dag_id='scrape_do_universal_pipeline',
    default_args=default_args,
    description='Scheduled scraping template powered by Scrape.do',
    start_date=datetime(2026, 5, 1),           # Pipeline start date
    schedule='@daily',                         # Run once every day (schedule_interval is deprecated since Airflow 2.4)
    catchup=False,
    tags=['scrape.do', 'growth', 'automation'],
) as dag:

    # 4. Bind the task into the system
    run_scraper = PythonOperator(
        task_id='execute_scrape_do_request',
        python_callable=scrape_do_task,
    )

    # If you had multiple steps, you'd chain them with arrows
    # (e.g. run_scraper >> save_to_db)

Once you save this file in your dags/ folder, it will appear in the localhost:8080 panel as scrape_do_universal_pipeline, with no restart and no registration. Airflow rescans the folder for new files on an interval (every five minutes by default), so allow a few minutes for it to show up.
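The commented-out chain at the end of the template (run_scraper >> save_to_db) hints at a second step. As a rough sketch of what such a step's callable might do, the function below pulls book titles out of the fetched HTML using only the standard library. It assumes that the demo site wraps each book link in an h3 tag and stores the full title in the link's title attribute; TitleParser and extract_titles are hypothetical names, not part of the template.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the title attribute of every link found inside an <h3> tag."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            title = dict(attrs).get("title")
            if title:
                self.titles.append(title)

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False


def extract_titles(html):
    """Return the list of titles found in the given HTML string."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


if __name__ == "__main__":
    sample = '<h3><a href="x.html" title="A Light in the Attic">A Light...</a></h3>'
    print(extract_titles(sample))
```

To use something like this as a downstream task, the scraping task would have to hand over the HTML (for example by writing it to a file) instead of returning only a status string.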

Why two flags matter

Notice the API URL uses super=true&render=true. The first flag activates premium residential proxies (see how rotating proxies work) and the full anti-bot bypass stack (TLS fingerprinting, header rotation, CAPTCHA handling). The second one runs the page through a real headless browser, executing JavaScript exactly like a human visitor. With these two flags, you can scrape virtually any modern website — including ones protected by Cloudflare, DataDome, or PerimeterX.
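Because the target address travels inside the API URL as a query parameter, it needs to be URL-encoded; otherwise a target that carries its own ? or & would be misread as extra API parameters. The sketch below builds the request URL with the two flags as optional switches. The endpoint and parameter names come from the template above, while the helper name build_scrape_do_url and its arguments are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://api.scrape.do/"


def build_scrape_do_url(token, target_url, use_super=True, render=True):
    """Build the API URL with a percent-encoded target (hypothetical helper)."""
    query = {"token": token, "url": target_url}
    if use_super:
        query["super"] = "true"   # premium proxies and anti-bot bypass
    if render:
        query["render"] = "true"  # headless browser rendering
    return f"{API_ENDPOINT}?{urlencode(query)}"


if __name__ == "__main__":
    url = build_scrape_do_url("<your_token>", "https://example.com/search?q=books&page=2")
    print(url)
    # The target survives a round trip through the query string intact
    print(parse_qs(urlsplit(url).query)["url"][0])
```

Switching a flag off for a simple static page is then a one-argument change, for example build_scrape_do_url(token, target_url, render=False).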

Step 4: Watch the Pipeline Run (Live Logs)

Jump back to the Airflow UI. Search for "scrape" in the DAG list and you'll see your new pipeline.

1. Toggle it ON using the switch on the right side of the DAG row.
2. Click the Play button (▶) to trigger a manual run right now (instead of waiting for midnight).
3. Click on the running DAG → click the task instance → switch to the Logs tab.

You'll watch your scraper execute in real time:

[INFO] Initiating request to target: https://books.toscrape.com/
[INFO] Success! Data extracted.
[INFO] <!DOCTYPE html><html lang="en-us">...
[INFO] Done. Returned value was: Scraping Completed
[INFO] Task instance in success state

That's it. Your scheduled scraping pipeline is now live. Every day at midnight (UTC, unless you configure a different timezone), Airflow triggers this task automatically, routes the request through Scrape.do's anti-bot infrastructure, and logs the result.

If a run fails (target site down, rate limit hit, network blip), Airflow's retry logic kicks in: it waits 2 minutes, tries again, and only marks the task failed after the third attempt. No babysitting required.
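The arithmetic behind "third attempt" is easy to check: retries=2 means one initial attempt plus two retries. The sketch below lists when each attempt would start if every one of them failed. It is a simplification that ignores scheduler latency and assumes each attempt fails instantly unless you pass a run_time; attempt_schedule is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timedelta


def attempt_schedule(first_try, retries, retry_delay, run_time=timedelta(0)):
    """Return the start time of every attempt, assuming each one fails."""
    starts = []
    start = first_try
    for _ in range(retries + 1):  # one initial attempt plus `retries` retries
        starts.append(start)
        # The next attempt begins retry_delay after this one has failed
        start = start + run_time + retry_delay
    return starts


if __name__ == "__main__":
    attempts = attempt_schedule(datetime(2026, 5, 6, 0, 0), 2, timedelta(minutes=2))
    for number, started in enumerate(attempts, start=1):
        print(f"attempt {number}: {started:%H:%M}")
```

With the template's settings, the three attempts start at 00:00, 00:02, and 00:04; only if the last one also fails is the task marked failed.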

Ready to Automate?

With this template, you no longer need to write custom cron jobs, set up monitoring scripts, or pay for expensive scheduling SaaS tools. You have a universal scheduled scraping skeleton that integrates natively into your existing Airflow data pipelines — and scales from one URL to ten thousand without changing a line of orchestration code.

The same DAG works for daily price monitoring, weekly competitor crawls, hourly inventory checks, or high-frequency SERP tracking. Just swap the target_url, tweak the schedule, and let it run.
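As a rough guide to what "tweaking the schedule" looks like, the lookup below pairs the use cases above with Airflow's schedule presets and their cron equivalents. The presets and cron strings are standard; the 15-minute cron expression for SERP tracking and the names SCHEDULES and schedule_for are assumptions chosen for illustration.

```python
# Use case -> (value for the DAG's schedule argument, equivalent cron expression)
SCHEDULES = {
    "hourly inventory checks": ("@hourly", "0 * * * *"),
    "daily price monitoring": ("@daily", "0 0 * * *"),
    "weekly competitor crawls": ("@weekly", "0 0 * * 0"),
    # No preset exists for this cadence, so a plain cron expression is used
    "serp tracking every 15 minutes": ("*/15 * * * *", "*/15 * * * *"),
}


def schedule_for(use_case):
    """Return the schedule value for a use case, defaulting to '@daily'."""
    value, _cron = SCHEDULES.get(use_case, ("@daily", "0 0 * * *"))
    return value


if __name__ == "__main__":
    for use_case, (value, cron) in SCHEDULES.items():
        print(f"{use_case}: schedule={value!r} (cron: {cron})")
```

Whichever value you pick goes into the DAG definition in place of '@daily'.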

Grab your token, fire up Airflow, and let the data flow on autopilot.

Get 1000 free credits and start scraping with Scrape.do

Bugrahan Saka

Growth