
How to Download Files & Images From a URL in Python


Created Date: March 06, 2026

Updated Date: March 06, 2026

Raif Tekin

R&D Engineer

Downloading files from URLs in Python sounds basic until the URL returns a 403, the file hides behind JavaScript rendering, or you need to scrape the page first to find the download link.

Product images for a price tracker, PDFs from a research archive, video clips from a media host. Sooner or later every automation project needs to pull files from the web. A naive requests.get() works on cooperative servers, but real-world sites fight back with bot detection, hotlink protection, and dynamic content loading.

This article walks through five file types on five real websites using the three patterns that cover all of them: binary downloads, streaming, and anti-bot bypass.

Download Images From a Web Page

Images are the most common file you'll download. Product photos, logos, and thumbnails all follow the same pattern: fetch the binary content, write it to disk.

[This product page on scrapingtest.com](https://scrapingtest.com/ecommerce/product/7553957) is a good test target. It's an e-commerce listing for a Motorola phone with a product image, a site logo, and thumbnails.

![scrapingtest product page](/uploads/blog/scrapingtest-product-page.png)

Inspect the Page Structure

Open DevTools first. Check what images exist on the page before writing any code.

![product page with images in devtools](/uploads/blog/product-page-devtools-img.png)

Three <img> tags on this page. The site logo sits at /phone-guy-logo.svg. The product image lives at /ecommerce/images/7553957.jpg, but it's wrapped inside a Next.js image optimizer URL (/_next/image?url=...). You'll need to extract the real path from that.

Download a Single Image

If you already know the direct URL, downloading an image takes only a few lines of code:

import requests

image_url = "https://scrapingtest.com/ecommerce/images/7553957.jpg"
response = requests.get(image_url)

if response.status_code == 200:
    with open("product-image.jpg", "wb") as file:
        file.write(response.content)
    print(f"Image saved: product-image.jpg ({len(response.content):,} bytes)")
else:
    print(f"Failed to download: {response.status_code}")

Image saved: product-image.jpg (7,810 bytes)

response.content returns raw bytes, not text. Images, PDFs, and videos are binary data, so always write them in "wb" (write binary) mode. Check the status code before writing, too. Without that check, a 403 or 404 still creates a file, but it'll contain an error page instead of your image.

Extract and Download All Images

When you don't know the image URLs upfront, parse the page first with BeautifulSoup and extract every <img> tag:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, parse_qs, unquote

page_url = "https://scrapingtest.com/ecommerce/product/7553957"
response = requests.get(page_url)
soup = BeautifulSoup(response.text, "html.parser")

images = soup.find_all("img")
print(f"Found {len(images)} images on the page\n")

for img in images:
    src = img.get("src", "")
    alt = img.get("alt", "no alt text")

    # Next.js wraps real paths in /_next/image?url=...
    if "/_next/image" in src:
        parsed = urlparse(src)
        real_path = parse_qs(parsed.query).get("url", [src])[0]
        real_path = unquote(real_path)
        full_url = urljoin(page_url, real_path)
    else:
        full_url = urljoin(page_url, src)

    filename = full_url.split("/")[-1]
    img_response = requests.get(full_url)

    if img_response.status_code == 200:
        with open(filename, "wb") as file:
            file.write(img_response.content)
        print(f"Saved: {filename} ({len(img_response.content):,} bytes) - {alt}")
    else:
        print(f"Failed: {filename} - {img_response.status_code}")

Found 3 images on the page

Saved: phone-guy-logo.svg (1,570,715 bytes) - Phone Guy Logo
Saved: 7553957.jpg (7,810 bytes) - moto g - 2025 128GB (Unlocked) - Forest Gray
Saved: 7553957.jpg (7,810 bytes) - Main

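urljoin turns relative paths like /phone-guy-logo.svg into absolute URLs. The Next.js optimizer URLs need extra handling because the real image path is URL-encoded inside the ?url= parameter, so we decode it before downloading. Note that two of the three <img> tags resolve to the same 7553957.jpg, so the second download overwrites the first.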
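The URL handling above can be pulled into two small helpers. This is a minimal sketch, not the site's or Next.js's official approach: `resolve_image_url` and `unique_filename` are hypothetical names, and the sketch assumes the optimizer path always ends in `/_next/image`. It relies on `parse_qs` percent-decoding the `url` value on its own, and it takes the filename from the URL path only, so query strings and repeated names don't collide on disk.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, unquote, urljoin, urlparse


def resolve_image_url(page_url: str, src: str) -> str:
    """Return an absolute image URL, unwrapping /_next/image?url=... if present."""
    parsed = urlparse(src)
    if parsed.path.endswith("/_next/image"):
        # parse_qs already percent-decodes the value of the url parameter
        inner = parse_qs(parsed.query).get("url", [src])[0]
        return urljoin(page_url, inner)
    return urljoin(page_url, src)


def unique_filename(url: str, seen: set) -> str:
    """Build a filename from the URL path (query ignored), avoiding repeats."""
    name = PurePosixPath(unquote(urlparse(url).path)).name or "download"
    stem, suffix = PurePosixPath(name).stem, PurePosixPath(name).suffix
    candidate, counter = name, 1
    while candidate in seen:
        # Same name seen before: append -1, -2, ... before the extension
        candidate = f"{stem}-{counter}{suffix}"
        counter += 1
    seen.add(candidate)
    return candidate
```

In the loop above, you would create `seen = set()` once before iterating and call `unique_filename(full_url, seen)` in place of `full_url.split("/")[-1]`.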

Images are binary files. So are PDFs, but they can run into hundreds of megabytes, and loading all of that into memory before writing is a problem.

Download a PDF File

Same binary pattern as images, but for larger files you need to stream the response and write it in chunks.

Stream a PDF Download

Adobe's sample PDF is a 4-page test document that's been sitting on their servers for over 21 years. Good enough.


import requests

pdf_url = "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf"
response = requests.get(pdf_url, stream=True)

if response.status_code == 200:
    content_type = response.headers.get("Content-Type", "")
    print(f"Content-Type: {content_type}")

    with open("sample.pdf", "wb") as file:
        total = 0
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
            total += len(chunk)

    print(f"PDF saved: sample.pdf ({total:,} bytes)")
else:
    print(f"Failed to download: {response.status_code}")

Content-Type: application/pdf
PDF saved: sample.pdf (88,226 bytes)

stream=True tells requests not to load the entire response body into memory. Instead, iter_content(chunk_size=8192) reads 8KB at a time and writes each chunk to disk as it arrives.

For an 88KB PDF, this doesn't matter much. For a 500MB research paper, it's the difference between your script working and your process getting killed.
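One thing the loop above doesn't cover is a connection that drops halfway, which leaves a truncated file on disk. A possible refinement, sketched here with a hypothetical `write_chunks` helper, is to write to a temporary `.part` file and rename it only once every chunk has arrived. The helper takes any iterable of bytes, so it isn't tied to requests.

```python
import os
from pathlib import Path
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], destination: str) -> int:
    """Write chunks to '<destination>.part', then rename it into place."""
    final_path = Path(destination)
    temp_path = final_path.with_name(final_path.name + ".part")
    total = 0
    try:
        with open(temp_path, "wb") as file:
            for chunk in chunks:
                if chunk:  # skip empty chunks
                    file.write(chunk)
                    total += len(chunk)
        # Only reached when every chunk was written without an error
        os.replace(temp_path, final_path)
    except BaseException:
        # Remove the partial file so a failed run leaves nothing behind
        temp_path.unlink(missing_ok=True)
        raise
    return total
```

With the streaming request from above, the call would be `write_chunks(response.iter_content(chunk_size=8192), "sample.pdf")`.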

Verify the Download

The Content-Type header confirms the server is returning a PDF and not an HTML error page. If you get text/html instead of application/pdf, the server is probably redirecting you to a login page or returning a bot-detection challenge.
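Headers can be wrong or generic (some servers label every download application/octet-stream), so a second check on the file's first bytes can help. A rough sketch, assuming you have the first chunk of the body in hand: PDF files start with the ASCII marker `%PDF-`, while an error or login page usually starts with an HTML doctype or `<html>` tag. The function names are illustrative only.

```python
def looks_like_pdf(first_bytes: bytes) -> bool:
    """True when the payload starts with the PDF header marker."""
    return first_bytes.startswith(b"%PDF-")


def looks_like_html(first_bytes: bytes) -> bool:
    """True when the payload starts like an HTML page (ignoring leading whitespace)."""
    head = first_bytes.lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

For a non-streamed response, `looks_like_pdf(response.content[:1024])` is enough. In a streaming loop, you could run the check on the first chunk before writing anything.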

Streaming works for any large binary file. Video is where it matters most, and where you'll hit your first real-world blocker.

Download Video Files

Videos are the largest files you'll deal with. Even a 5-second clip runs 3MB, and a full recording can hit gigabytes. Streaming isn't optional here.

Stream a Video From a Direct URL

A sample MP4 from samplelib.com, 5 seconds, about 2.8MB:


import requests

video_url = "https://download.samplelib.com/mp4/sample-5s.mp4"
response = requests.get(video_url, stream=True)

if response.status_code == 200:
    total_size = int(response.headers.get("Content-Length", 0))
    downloaded = 0

    with open("sample-5s.mp4", "wb") as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
            downloaded += len(chunk)
            if total_size:
                percent = (downloaded / total_size) * 100
                print(f"\rDownloading: {percent:.1f}%", end="")

    print(f"\nVideo saved: sample-5s.mp4 ({downloaded:,} bytes)")
else:
    print(f"Failed to download: {response.status_code}")

Downloading: 100.0%
Video saved: sample-5s.mp4 (2,848,208 bytes)

Content-Length gives the total file size upfront, so you can calculate a percentage. Same streaming pattern as the PDF section, just with a progress readout on top.
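Content-Length isn't guaranteed. Servers using chunked transfer encoding omit it, and when a response is compressed, the decoded bytes you count can exceed the header value. A small sketch of a progress formatter that copes with both cases follows; `progress_line` is a hypothetical helper, not part of requests.

```python
def progress_line(downloaded: int, total_size: int) -> str:
    """Format progress as a percentage, or as megabytes when the size is unknown."""
    if total_size > 0:
        # Clamp at 100% in case the byte count overshoots the header value
        percent = min(downloaded / total_size * 100, 100.0)
        return f"Downloading: {percent:.1f}%"
    return f"Downloading: {downloaded / 1_048_576:.1f} MB"
```

Inside the loop above, `print("\r" + progress_line(downloaded, total_size), end="")` would replace the `if total_size:` branch.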

When the Server Blocks You

Not every server hands you files willingly. Try file-examples.com:

import requests

response = requests.get("https://file-examples.com/index.php/sample-video-files/")
print(f"Status: {response.status_code}")

Status: 403

403 Forbidden. The server detected a script, not a browser. A few reasons this happens:

- The default User-Agent header in requests identifies itself as python-requests/x.x.x, which is an obvious bot signature.
- Some servers check TLS fingerprints, IP reputation, or request patterns.
- Hotlink protection blocks direct file downloads from unknown referrers.

The quick fix is adding a browser-like User-Agent header:

url = "https://file-examples.com/index.php/sample-video-files/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
response = requests.get(url, headers=headers)

This works on some sites. On others, you're up against IP bans, CAPTCHA walls, and TLS fingerprinting. That's where a web scraping API comes in.

Scrape.do routes requests through rotating residential proxies and handles browser fingerprinting automatically. Instead of managing headers, cookies, and proxy rotation yourself, one API call:

import requests
import urllib.parse

TOKEN = "<your-token>"
target = "https://file-examples.com/index.php/sample-video-files/"
encoded_url = urllib.parse.quote_plus(target)
api_url = f"https://api.scrape.do/?token={TOKEN}&url={encoded_url}&super=true"

response = requests.get(api_url)
print(f"Status: {response.status_code}")

The super=true parameter activates premium proxy rotation for sites with aggressive bot detection. The page that returned 403 Forbidden now returns clean HTML, ready for parsing.

Binary files (images, PDFs, videos) use response.content and write in "wb" mode. HTML is text. Different approach.

Save a Page as HTML

HTML is text, not binary. That changes how you write to disk: response.text instead of response.content, "w" mode instead of "wb". Same product page as in the image section, different output.

Download Raw HTML

import requests

page_url = "https://scrapingtest.com/ecommerce/product/7553957"
response = requests.get(page_url)

if response.status_code == 200:
    with open("product-page.html", "w", encoding="utf-8") as file:
        file.write(response.text)
    print(f"HTML saved: product-page.html ({len(response.text):,} characters)")
else:
    print(f"Failed to download: {response.status_code}")

HTML saved: product-page.html (34,064 characters)

What changed from the binary downloads:

- response.text instead of response.content. requests decodes the bytes into a string using the response's charset.
- "w" mode instead of "wb". Text writing, not binary.
- encoding="utf-8" so special characters (accents, emojis, non-Latin scripts) survive the write.
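To see why the charset matters, here is a small standard-library round trip. The sample string is made up for illustration. It shows that decoding UTF-8 bytes with the wrong charset raises no error and silently produces garbled text.

```python
sample = "Café 手机"           # made-up text with an accent and non-Latin script
raw = sample.encode("utf-8")   # the bytes a UTF-8 server would send

decoded_correctly = raw.decode("utf-8")
decoded_wrongly = raw.decode("iso-8859-1")  # wrong charset: no exception, just mojibake

print(decoded_correctly == sample)  # the right charset round-trips cleanly
print(decoded_wrongly == sample)    # the wrong one does not
```

If a saved page shows this kind of garbling, one option is to set `response.encoding = "utf-8"` before reading `response.text`, assuming you know the page is actually UTF-8.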

Open the saved file in a browser and it renders the full product page, styles and all.

When HTML Needs JavaScript Rendering

The scrapingtest.com page works because its content is server-rendered. Many modern sites load content dynamically with JavaScript. Your downloaded HTML would be an empty shell with a <script> tag and no actual data.

For those pages, you need a headless browser to execute JavaScript first. Scrape.do handles this with the render=true parameter:

api_url = f"https://api.scrape.do/?token={TOKEN}&url={encoded_url}&render=true"

This renders the page in a real browser before returning the HTML. For a deeper walkthrough, check the guide on scraping JavaScript-rendered pages with Python.

Raw HTML works for archiving or re-processing. But if what you actually need is clean, readable content for an LLM pipeline or documentation archive, Markdown is the better output format.

Convert a Web Page to Markdown

The idea is to strip out HTML tags while keeping the actual content structure. Headings, lists, code blocks, links all survive.

Install markdownify

markdownify handles the conversion. It's lightweight and produces cleaner output than html2text, especially for pages with code blocks and tables.

pip install requests beautifulsoup4 markdownify

Fetch and Convert

Scrape.do's browser interactions documentation makes a good test page. Structured content with headings, code blocks, and navigation elements to filter out:


import re
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify

page_url = "https://scrape.do/documentation/headless-browser/browser-interactions/"
response = requests.get(page_url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract only the main content, skip navigation/footer/sidebar
main_content = soup.find("main") or soup.find("article") or soup.find("div", class_="content")
html = str(main_content) if main_content else response.text

markdown_text = markdownify(html, heading_style="ATX", strip=["img", "script", "style"])

# Clean up excessive blank lines
markdown_text = re.sub(r"\n{3,}", "\n\n", markdown_text).strip()

with open("browser-interactions.md", "w", encoding="utf-8") as file:
    file.write(markdown_text)

print(f"Markdown saved: browser-interactions.md ({len(markdown_text):,} characters)")

Markdown saved: browser-interactions.md (8,169 characters)

The script isolates the main content area with BeautifulSoup before converting. Without this step, you'd get navigation menus, footer links, and sidebar widgets in the output. The strip parameter removes images, scripts, and styles that don't belong in a Markdown document.


heading_style="ATX" produces # Heading syntax instead of underline-style headings. The regex cleanup collapses triple-or-more blank lines into double line breaks so the output isn't full of gaps.

Every example so far targeted cooperative servers. Most real targets won't be this friendly.

Handle Protected Downloads With Scrape.do

Production servers actively block automated downloads. The usual suspects:

Common Blockers

- 403 Forbidden: the server detects your script's User-Agent or IP and rejects the request
- CAPTCHA challenges: Cloudflare or similar WAFs serve a verification page instead of the file
- Geo-restrictions: content locked to specific countries or regions
- Rate limiting: too many requests from the same IP triggers 429 Too Many Requests
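Of these, rate limiting is the one you can sometimes handle yourself by slowing down. A minimal sketch of a wait-time calculation, assuming the server may send a Retry-After header in seconds: `retry_delay` and its defaults are hypothetical, and the HTTP-date form of Retry-After is deliberately not parsed here, so it falls back to exponential backoff.

```python
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None and retry_after.strip().isdigit():
        # Server told us how long to wait; respect it up to the cap
        return min(float(retry_after.strip()), cap)
    # Otherwise double the wait on each attempt: 1s, 2s, 4s, ...
    return min(base * (2 ** attempt), cap)
```

On a 429, you would call `time.sleep(retry_delay(attempt, response.headers.get("Retry-After")))` and then repeat the request.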

Route Any Download Through Scrape.do

Your download logic stays the same. You just route it through Scrape.do's API to handle the protection:

import requests
import urllib.parse

SCRAPE_DO_TOKEN = "<your-token>"
target_url = "https://scrapingtest.com/ecommerce/images/7553957.jpg"

encoded_url = urllib.parse.quote_plus(target_url)
api_url = f"https://api.scrape.do/?token={SCRAPE_DO_TOKEN}&url={encoded_url}"

response = requests.get(api_url)

if response.status_code == 200:
    with open("product-image-via-scrape-do.jpg", "wb") as file:
        file.write(response.content)
    print(f"Image saved: product-image-via-scrape-do.jpg ({len(response.content):,} bytes)")
else:
    print(f"Failed: {response.status_code}")

Same pattern for any file type. Swap the target_url for a PDF, video, or HTML page. The API returns the raw content, and you write it to disk the same way.

Key parameters:

- super=true: premium proxy rotation for sites with aggressive bot detection (bypass Cloudflare and similar WAFs)
- render=true: renders JavaScript before returning content
- geoCode=us: routes the request through a specific country's IP
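If you combine these parameters often, building the query string by hand gets error-prone. Here is one way to wrap it, as a sketch: `build_api_url` is a hypothetical helper, and the parameter names are just the ones listed in this article, passed through unchanged. `urlencode` applies the same `quote_plus` escaping used in the earlier examples.

```python
from urllib.parse import urlencode

API_BASE = "https://api.scrape.do/"


def build_api_url(token: str, target_url: str, **options: str) -> str:
    """Assemble the API URL; extra keyword arguments become query parameters."""
    params = {"token": token, "url": target_url, **options}
    return f"{API_BASE}?{urlencode(params)}"
```

For example, `build_api_url(SCRAPE_DO_TOKEN, target_url, super="true", geoCode="us")` yields a URL you can pass straight to `requests.get`.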

Before scraping any site, it's worth checking if the website allows scraping by reviewing its robots.txt and terms of service.

Conclusion

Three patterns handle everything. Use response.content with "wb" mode for binary files like images, PDFs, and videos. Use response.text with "w" mode for HTML and Markdown. Add stream=True with iter_content when the file is large enough to matter.

When the server fights back with 403s, CAPTCHAs, or geo-restrictions, route through Scrape.do and keep your download logic clean.
