
8 Types of Data Extracted with Web Scraping


Web scraping extracts not just plain text, but all kinds of content from websites.

Scrapers may deal with HTML pages one day and PDF files or JSON structures the next.

Knowing which type of data you’re dealing with is crucial, because each format requires a different approach.

In this guide, we’ll explore the different data types web scrapers encounter and how to handle each one effectively:

1. HTML (Basic Text)

HTML is the standard format of all web pages, containing the text content and structure of sites.

It also makes up the vast majority of data scraped from the web, which is then parsed using libraries like BeautifulSoup or Cheerio to turn it into structured data.

There are a few different approaches to scraping and parsing text from HTML:

Static Pages

Static pages are the simple, unchanging pages on a site like blog posts, news articles, and documentation pages.

Their content doesn’t update frequently once published. You typically scrape them once and maybe check back occasionally for edits.

These pages have a fixed HTML structure (headers, paragraphs, etc.), making them straightforward to parse.

💡 For example, using Python and requests:

import requests

url = "https://example.com/blog-post"
response = requests.get(url)

with open("page.html", "w", encoding="utf-8") as file:
    file.write(response.text)

In the code above, we request the page, grab its full HTML content, and write it to a file.

Static pages are usually free of complex scripts, so a simple request is enough.

Product & Listing Pages

Ever wondered how price trackers like CamelCamelCamel or travel fare aggregators like Skyscanner work?

They’re powered by scraping product, property, or car listing pages. These pages are dynamic in content (prices, stock levels, etc.) but often structured in a repeatable way.

E-commerce product pages, real estate listings, job postings – they all present itemized data (name, price, description, etc.) that a scraper can regularly collect.

The key here is scheduling: you might scrape daily or hourly to catch changes.

By regularly scraping competitor or market sites, businesses can monitor prices and inventory in real time.

Such pages may also require handling pagination (multiple pages of listings). A crawler might start at a category page, grab all item links, then scrape each item page for details, as sketched below.

You should also be prepared for anti-scraping measures on these sites (CAPTCHAs or IP blocks), since price monitoring is common. Tools like Scrapy (with auto-throttle and proxies) or specialized scraping APIs can be very useful for these cases.

The data itself is in HTML, so you’d parse it similar to static pages – just run the scraper on a loop or schedule to keep the data fresh.
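💡 To make the category-to-item flow concrete, here’s a minimal sketch. The category URL, the ?page= parameter, and the product-link CSS class are hypothetical – adapt them to the real site:

import requests
from bs4 import BeautifulSoup

base = "https://example.com/category/laptops"     # hypothetical category URL
item_links = []

for page in range(1, 4):                          # crawl the first three listing pages
    html = requests.get(f"{base}?page={page}").text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select("a.product-link"):       # hypothetical CSS class for item links
        item_links.append(a["href"])

print(f"Found {len(item_links)} item pages to scrape")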

User-Generated Content

User-generated content includes social media posts, forum threads, product reviews – basically content that users continuously create.

This data type is fast-moving and often requires continuous or frequent scraping.

Why scrape it?

Companies track trends and sentiment on social platforms to make decisions. Scraping social media can help monitor competitor activity, follow market trends, or analyze customer sentiment in real time.

For example, you might scrape tweets for a hashtag to see how people feel about a new product.

Because this content updates by the minute, scrapers need to run continuously or on short intervals.

APIs are your friend here: many platforms (Twitter/X, Reddit, etc.) offer APIs that return JSON data for posts, comments, etc. Using the official API when possible is ideal (it gives structured data and respects the platform’s rules). If no API is available or sufficient, you may resort to HTML scraping or even automation (logging in, scrolling). Keep in mind that rate limits and terms of service are important considerations for user-generated data.

One trick for speed: scrape only recent items (e.g., the latest page of a forum or the newest tweets). In the race to catch the next viral trend, speed is essential – being first to spot a trending post can be a game changer.
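Here’s a rough sketch of that pattern – polling an API for only the newest posts and skipping items already seen. The endpoint, parameters, and field names are placeholders, not any real platform’s API:

import time
import requests

API_URL = "https://api.example.com/posts"         # placeholder endpoint

seen_ids = set()
while True:
    resp = requests.get(API_URL, params={"sort": "new", "limit": 50})
    for post in resp.json().get("posts", []):     # placeholder field names
        if post["id"] not in seen_ids:
            seen_ids.add(post["id"])
            print(post["author"], post["text"][:80])
    time.sleep(60)                                # respect rate limits between polls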

2. Markdown

Markdown is a lightweight text format that you might not scrape directly from websites, but it often comes into play.

Many documentation sites or developer content (like README files on GitHub) are written in Markdown. When scraping, you usually retrieve HTML, but you might convert that HTML to Markdown for easier downstream use. Why? Because Markdown is clean and LLM-friendly – it preserves structure (headings, lists, bold/italic) without the clutter of HTML tags.

Structured content in Markdown offers significant advantages for machine learning models, improving their accuracy and understanding.

Essentially, Markdown is simpler for both humans and AI to read compared to raw HTML or JSON.

If you plan to use scraped text for training a language model or simply want a cleaner look, you can use libraries to convert HTML to Markdown.

For example, Python’s markdownify library can do this in one call. Markdown keeps things like headings, bullet points, and italics, which is great for maintaining readability.
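A minimal conversion with markdownify might look like this (assuming the library is installed via pip install markdownify):

from markdownify import markdownify as md

html = "<h1>Scraping Guide</h1><p>Markdown is <em>clean</em> and easy to read.</p>"
markdown_text = md(html, heading_style="ATX")     # "#"-style headings
print(markdown_text)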

It’s commonly used to prepare datasets for AI: entire websites are scraped, then converted into Markdown files for training GPT models, because the structured simplicity of Markdown helps the model parse the content more effectively.

So while Markdown itself isn’t fetched from websites in most cases (unless the site explicitly serves it), it’s often the output format you’ll transform your scraped data into for certain applications.

3. Links

Web scraping isn’t only about grabbing content from single pages – often, you need to discover all the relevant pages to scrape.

That’s where crawling comes in. Crawling means systematically following links on web pages to find other pages. A web crawler (or spider) starts with a set of seed URLs, fetches them, then extracts the hyperlinks (<a href="...">) in each page to continue the process.

In short, web crawling discovers URLs or links on websites, whereas web scraping is what you do when you extract data from those pages. Most scrapers have a crawling component that finds what to scrape next.

Handling links involves parsing HTML and collecting the href attributes. For example, to get all links from a page with BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "html.parser")  # html_content holds the fetched page HTML
links = []
for a in soup.find_all('a', href=True):
    links.append(a['href'])

This snippet gathers every hyperlink on the page. Of course, not all links are relevant and you’ll usually filter for certain domains or URL patterns (e.g., only crawl links that start with /products/).

Also beware of infinite loops or cyclic links. A naive crawler could bounce between pages endlessly. Using a framework like Scrapy can automate a lot of this, as it handles queueing URLs and avoiding repeats.

Some strategies for crawling include starting from a site’s sitemap (if available) or starting at a category page and working down to item pages.
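Putting those ideas together, here’s a minimal single-host crawler sketch with a visited set to avoid repeats. A production crawler would add politeness delays, robots.txt checks, and error handling:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    allowed_host = urlparse(start_url).netloc
    queue, visited = deque([start_url]), set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])        # resolve relative links
            if urlparse(link).netloc == allowed_host:
                queue.append(link)
    return visited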

4. Media Files (Images, Videos, PDFs)

Not all data comes as text. Web scrapers often need to deal with media files like images, videos, and PDFs. These require a different approach: instead of parsing HTML text, you’re typically downloading files and possibly processing them with specialized libraries. Let’s break down each type of media and how to scrape it.

Images

Images are everywhere online – product photos, profile pictures, infographics, you name it. To scrape images, you usually find their URLs in the HTML (commonly in <img> tags). Websites host images as static files accessible at specific URLs. The page’s HTML contains an <img src="..."> tag pointing to the image file, often with an alt attribute describing it. As a scraper, you gather those src links and then download the images via HTTP.
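For example, a short sketch that collects absolute image URLs (and their alt text) from a page – the page URL here is a placeholder:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = "https://example.com/products"         # placeholder page
soup = BeautifulSoup(requests.get(page_url).text, "html.parser")

images = [
    (urljoin(page_url, img["src"]), img.get("alt", ""))   # absolute URL plus alt text
    for img in soup.find_all("img", src=True)
]
print(images[:5])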

Downloading an image is straightforward with a GET request for the image URL, which returns binary data. For example:

img_url = "https://example.com/image.jpg"
img_data = requests.get(img_url).content
with open("image.jpg", "wb") as f:
    f.write(img_data)  # save image to disk

This will save the image locally. You might do this for multiple images (e.g., scraping all product images from a page). Keep in mind the file size – images can be large, so consider throttling or checking the Content-Length header before downloading huge files. Also, some sites serve different image versions based on device (using the srcset attribute for responsive images). In those cases, you may need logic to pick the desired resolution.

Once you have the image, what next? Sometimes the goal is just to download and store it (for example, creating a dataset of pictures). Other times, you might analyze the image – e.g., using OCR to extract text from images or an image processing library to classify it. If needed, you can extract metadata like EXIF (camera info) from photos using libraries like Pillow. But that veers into image processing; from a scraping perspective, the main task is finding and retrieving the image files. Make sure to also scrape the context if needed – for instance, the alt="Image description" text is useful as it describes the image content, which can be important for labeling.
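If you do want that EXIF metadata, a quick Pillow sketch for reading it from a downloaded photo might look like this (not all images carry EXIF data):

from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open("image.jpg")                     # the file downloaded above
for tag_id, value in img.getexif().items():
    print(TAGS.get(tag_id, tag_id), value)        # e.g. Model, DateTime, ...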

Videos

Scraping videos is a bit more complex than images. Videos on web pages are often embedded players (YouTube, Vimeo, etc.) or HTML5 video tags. The video content might be served in chunks or streams (like .m3u8 playlists for HLS streaming). As a scraper, you have a few options:

  • Direct download: Some sites provide direct video file links (.mp4, .webm). If you find a direct URL to the video file, you can download it similar to an image (GET request and save). The challenge is finding that URL – you might need to inspect network calls or page scripts to locate it.
  • Use a tool: For platforms like YouTube, it’s easier to use an existing tool or API. Tools like youtube-dl (or its successor yt-dlp) can fetch videos from many sites by handling all the behind-the-scenes details. You can call these from your script or use their Python libraries to programmatically download videos (see the sketch after this list).
  • Headless browser approach: If videos load via JS and you can’t easily find the source, you might use a headless browser to run the page, then grab the video file once it starts playing. This is heavy and usually a last resort.
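To illustrate the tool-based option above, here’s a minimal yt-dlp sketch that reads metadata first and only downloads if you uncomment the last line. The URL and options are placeholders:

from yt_dlp import YoutubeDL

url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder URL

with YoutubeDL({"outtmpl": "%(title)s.%(ext)s"}) as ydl:
    info = ydl.extract_info(url, download=False)  # metadata only, no download
    print(info.get("title"), info.get("duration"))
    # ydl.download([url])                         # uncomment to fetch the video file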

Scraping videos often means large file downloads, so plan for bandwidth and storage. It may also be time-consuming, so consider if you really need the whole video or just metadata. Speaking of metadata, you might not need the video content itself – maybe you just want the title, duration, view count, etc. Many video platforms provide this info via their pages or APIs. For example, YouTube’s pages embed JSON with video stats, or you can use YouTube’s Data API to get video details without downloading the video.

If you do download videos, you might subsequently run them through tools (like ffprobe or mediainfo) to extract metadata or snapshots. But at scraping time, it’s usually about getting the file. Here’s a simple conceptual example for a direct video URL:

video_url = "https://example.com/video.mp4"
resp = requests.get(video_url, stream=True)       # stream to avoid loading the whole file into memory
with open("video.mp4", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)

In reality, many sites won’t give you a plain MP4 link without some negotiation (maybe cookies, headers, or a token), so be prepared for that. Also, always check if downloading videos is allowed for the site – videos are often copyrighted content.

PDFs

PDF files are a common format for documents – think research papers, reports, product specs, etc. Scraping PDFs involves two steps: download the PDF, then parse its content. The downloading part is straightforward (again a GET request to the file URL, like with images/videos). The real challenge is parsing, because PDFs are essentially binary documents that encode text and graphics in a complex layout. They are not as easily machine-readable as HTML or JSON.

Common issues with PDFs include:

  • Unstructured text: Text in PDFs might not have a clear order (columns, footnotes, etc., can confuse the sequence).
  • Varied formatting: PDFs can contain tables, images, and text with varying fonts/sizes. There’s no DOM like HTML; it’s more like coordinates on a page.
  • Scanned pages: Some PDFs are just scanned images of text – requiring OCR to get actual text.
  • No standard structure: Unlike HTML which has tags, PDFs don’t have semantic tags to indicate what is a heading or a table, making it hard to consistently extract data.

Because of these challenges, specialized tools are used. Libraries in Python such as PyMuPDF (fitz), pdfplumber, or PDFMiner can extract text. For tables, libraries like Camelot or Tabula can detect table structures in PDFs. If the PDF is scanned (images), you’d use an OCR solution like pytesseract after converting PDF pages to images. There’s rarely a one-size-fits-all; often a combination is needed. In practice, scraping PDFs might mean saving them and then running a parsing pipeline.

Notably, there are recommended tools for different tasks: PyMuPDF or pdfplumber for general text extraction, Camelot for table extraction, and pdf2image + OCR for image-based PDFs. For example, one approach is:

import fitz  # PyMuPDF
doc = fitz.open("report.pdf")
text = ""
for page in doc:
    text += page.get_text()
print(text[:200])  # print first 200 chars of combined text

This would give you the raw text of a PDF (if it’s text-based). But if you need structured data (like a specific table or field), you might have to analyze that text or use a more table-aware library. Be prepared for cleanup – PDF text often includes line breaks in weird places or headers/footers mixed in.
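If it’s a table you’re after, a pdfplumber sketch like the one below can work when the table has detectable lines or whitespace; the results still need manual checking:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():       # each table is a list of row lists
            for row in table:
                print(row)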

It’s worth noting that PDFs are often scraped for data in research and finance. But it’s one of the tougher formats to handle due to the lack of structure. Always verify the output – the scraper might think it got the data, but if columns got jumbled, you need to adjust your parsing logic. In summary: grab the PDF file, then leverage PDF parsers (or even AI in some cases) to extract the information you need. Patience is key with PDFs, as they can throw many curveballs during parsing.

5. Tables

Tables are a special case of HTML content worth calling out. Data presented in rows and columns (like financial data, schedules, etc.) is often enclosed in <table> tags on a webpage. Scraping tables can be rewarding (you get nicely structured rows), but also frustrating when the HTML is complex or when tables are rendered via JavaScript.

HTML Tables (Static)

HTML tables in static content are part of the page’s HTML source. You can parse them with an HTML parser. However, tables can be deeply nested or use spanning cells which break the neat row-column assumption. For instance, a table might have <td rowspan="2"> which causes misalignment if not accounted for. Despite these challenges, libraries make table scraping easier. The pandas library, for example, has a read_html function that can automatically grab tables from HTML into a DataFrame. This works well if the table is well-formed. Under the hood it uses libraries like lxml, but for a scraper it means less manual parsing. BeautifulSoup is also commonly used to extract table rows and cells if you need custom handling.

A quick way to get tables:

import pandas as pd
tables = pd.read_html(html_content)
print(f"Found {len(tables)} tables")
df = tables[0]  # first table as DataFrame
print(df.head())

The above will give you structured data if the table isn’t too tricky. If the table has headers, multi-level columns, or merged cells, you might need to post-process the DataFrame. In other cases, you may choose to parse manually: find the <table>, iterate over each <tr> (table row), and then each <td> or <th> cell. This lets you build a list of rows where each row is a list of cell values. Manual parsing is more work but gives you control, especially if you need to skip header rows or handle nested tables.
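A manual-parsing sketch along those lines, assuming html_content holds the page HTML and the first row is the header:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "html.parser") # html_content fetched earlier
table = soup.find("table")

rows = []
for tr in table.find_all("tr"):
    cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

header, data = rows[0], rows[1:]                  # assumes the first row holds headers
print(header)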

A big challenge is that HTML tables often contain formatting (like empty cells for spacing, or non-data rows). You’ll have to clean those out. Also, not all tables use proper <th> for headers – sometimes the first row is all <td>. You, as the scraper, must infer what’s header vs data. Despite these issues, scraping static tables is typically easier than scraping data scattered through paragraphs, because tables are already in a structured format. It’s a matter of extracting that structure intact.

JS-Rendered Tables

Modern web applications sometimes load table data on the fly with JavaScript. This means when you fetch the page HTML via a normal request, you might get an empty <table> or just headers, and the rows are populated by a script after page load (for example, via an AJAX call). To scrape these tables, you have two main strategies:

1. Headless browser / automation:

Use a tool like Selenium, Puppeteer, or Playwright to actually render the page in a browser environment. The automation script can wait for the table to load, then either take the HTML of the filled table or directly read the data. Selenium is invaluable for scraping tables that are loaded via JavaScript.

It can simulate user actions (like clicking “load more” or navigating through a table UI). Once the table is visible, you can use Selenium’s methods to iterate over the rows and cells. This approach guarantees you see what a user sees, but it’s heavier on resources. Example (conceptual):

from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://example.com/dynamic-table")
# Wait for table to load (could use WebDriverWait here)
rows = driver.find_elements(By.TAG_NAME, "tr")
for row in rows:
    cells = row.find_elements(By.TAG_NAME, "td")
    # ... extract text from cells
driver.quit()

This will load the page in Chrome and allow JavaScript to execute, so if the table data is fetched asynchronously, it should appear in rows. Selenium even allows waiting for specific conditions (like an element with certain text to appear) to ensure data is ready.

The downside is speed – a headless browser is much slower than direct HTML requests, especially if you have to scrape many pages.

2. API call inspection:

Often, the JavaScript that populates a table does so by calling a backend API (for example, a REST endpoint returning JSON, or a GraphQL query). If you can figure out that request (using developer tools to watch network XHR calls), you can skip the browser and call that API directly.

By finding the XHR that returns the data you need, you can retrieve the entire table in structured form without rendering the page.

This is a huge win when possible. For example, suppose a stock prices table on a site is fetched via an API call like GET /api/prices?page=1&pageSize=100 which returns JSON data for 100 rows. If you call that URL with the correct headers (and authentication if needed), you get JSON that you can parse easily – no need to parse HTML at all. So, always check if dynamic content can be accessed via a hidden API. Many web apps separate data and presentation this way.
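A sketch of paging through such a hidden endpoint might look like this – the URL, parameters, and response fields are assumptions based on the hypothetical example above:

import requests

all_rows = []
page = 1
while True:
    resp = requests.get(
        "https://example.com/api/prices",                      # hypothetical endpoint
        params={"page": page, "pageSize": 100},
        headers={"X-Requested-With": "XMLHttpRequest"},         # mimic the site's own call
    )
    rows = resp.json().get("rows", [])                          # hypothetical field name
    if not rows:
        break
    all_rows.extend(rows)
    page += 1
print(f"Fetched {len(all_rows)} rows")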

In summary, for JS-rendered tables: try API-first, and if that fails, use a headless browser. Either way, the goal is to end up with the table data. Once you have it (HTML or JSON), you treat it as you would any table: rows and columns of data to iterate through or convert to CSV/Excel, etc.

⚠ Keep in mind that dynamically loaded tables might also have pagination or “infinite scroll” which means your scraper needs to handle multiple requests or scroll events to get all data (or find an API parameter for page number).

6. JSON (API Calls & Embedded Data)

JSON (JavaScript Object Notation) has become a ubiquitous data format on the web. It’s not something you view on a webpage (usually), but rather something transmitted behind the scenes or embedded in pages for scripts. Scrapers love JSON because it’s structured data – no messy HTML parsing, just key-value pairs and arrays that can be directly read by code. There are two main scenarios where you’ll deal with JSON while scraping:

API Calls (External or Internal)

A lot of websites provide RESTful or GraphQL APIs. These APIs return data in JSON format (sometimes XML, but JSON is more common nowadays). Instead of scraping the website’s HTML, you can sometimes get the same or better data by calling these APIs. For example, an e-commerce site might have an API endpoint like /api/products/12345 that returns a JSON with all the product details (name, price, stock, reviews, etc.). If you call that directly, you skip parsing the HTML product page altogether – you get structured data ready to use.

Even when sites don’t have “public” APIs, many web apps use JSON APIs internally. Single Page Applications (built with React, Angular, etc.) often load data via fetch/XHR calls. As a scraper, sniff those calls out (using your browser’s dev tools or a proxy) and see if you can replicate them. Usually it involves sending the right headers (often the X-Requested-With or auth tokens), but the result is a nicely formatted JSON. This approach is great for staying under the radar too, since you’re mimicking the site’s own requests. As a bonus, hitting a JSON endpoint is lightweight compared to loading a full page.

Handling JSON in code is straightforward. In Python, for instance:

import requests, json
url = "https://api.example.com/data?item=123"
resp = requests.get(url, headers={"Authorization": "Bearer TOKEN"})
data = resp.json()  # directly parse JSON response
print(data["itemName"], data["price"])

Here we send a request (with an auth header, as an example) and use resp.json() to get a Python dict. If the site expects certain headers like a custom User-Agent or API key, you’ll need to include those. The key point is that APIs return structured data, which is a scraper’s dream – no regex or HTML tree navigation needed.

One thing to watch: APIs might have rate limits or require API keys. Always check if you’re allowed to use them for scraping purposes. Some public APIs might be unofficial (undocumented but open) – use those carefully to avoid causing issues for the site.

Embedded JSON in HTML

Sometimes, the data you need is neither in plain HTML text nor available from a separate API endpoint, but it might be sitting right in the page source as JSON. Web developers often embed JSON data in pages to pass initial state or configuration to front-end scripts. This can appear in two ways:

  • JSON-LD (Linked Data): This is usually in <script type="application/ld+json"> tags. It’s meant for search engines and describes structured data like product info, events, recipes, etc., in JSON format. It’s part of SEO schema, but scrapers can use it too. For example, a product page might have a JSON-LD script containing the product name, price, and availability. Because it’s already in JSON, you can extract that string and parse it to get those values directly. No need to scrape the surrounding HTML if the JSON-LD is comprehensive.
  • Inline data objects: Some pages just include a <script> that sets a JavaScript variable to a JSON object. For example: <script> window.__INITIAL_STATE__ = { ...big JSON object... } </script>. This is common in modern web apps – they deliver a blob of JSON to the client which the JS code then uses to render the page. You can grep the HTML for something like __INITIAL_STATE__ or even { characters that look like JSON. If you find a JSON structure, you can pull it out with string operations or an HTML parser and then load it with a JSON library.

In both cases, the approach is: find the script tag that contains JSON, get its content, and run it through a JSON parser. For instance:

script_tag = soup.find("script", type="application/ld+json")
data = json.loads(script_tag.string)
print(data.get("name"), data.get("price"))

This would extract a product name and price from a JSON-LD script if those fields exist. One thing to note: sometimes the JSON embedded in HTML isn’t perfectly formatted (it might use single quotes or other non-standard syntax). In those cases, you may need to do a bit of string cleanup (for Python-style objects, something like ast.literal_eval can help) or use a regex to tweak it. But usually, JSON-LD is well-formed and parseable.
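For the inline-data case, here’s a rough sketch of pulling a window.__INITIAL_STATE__ blob out of the raw page source (held in an html string) with a regex. The variable name and JSON shape vary from site to site, so treat this as illustrative:

import json
import re

match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
if match:
    state = json.loads(match.group(1))            # may need cleanup if not strict JSON
    print(list(state.keys()))                     # inspect what the app shipped to the client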

Using embedded JSON can save you a lot of parsing effort. It’s like the website handing you a structured data package on a silver platter. For example, scraping a recipe site might be as easy as grabbing the JSON-LD which contains the ingredient list and instructions, rather than trying to parse the HTML paragraphs. Always scan the HTML for “{” or “application/ld+json” as a quick check – you might get lucky!

7. Screenshots

Sometimes the goal of a scraping project isn’t just data extraction, but also visual confirmation. Screenshots are images of how a web page looks in a browser. While not data in the structured sense, they are a form of captured information and can be part of a scraping routine. You might take screenshots to verify that your headless browser rendered the page correctly, to monitor visual changes on a site (like layout or banner changes), or even as the end-product (e.g., an archive of site appearances over time). Let’s distinguish between two types of screenshots a scraper might deal with:

Regular Screenshots

A “regular” screenshot usually means the visible portion of a webpage (what you’d see without scrolling) at a certain browser window size. These are quick to capture and smaller in file size. They’re great for a snapshot of a particular view. For instance, if you only care about the header section of a page or a specific element, you might set your browser viewport and screenshot just that. Many headless browser tools (Selenium, Playwright, Puppeteer) provide a simple command to capture a screenshot. With Selenium it looks like:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://example.com")
driver.save_screenshot("page_view.png")
driver.quit()

This will save an image of the current view of the page. Screenshots are immensely useful for debugging scraping runs – you can see if a pop-up was blocking the content or if the page didn’t load correctly. In fact, capturing the page state in an image is a great way to verify what your scraper is seeing. If something looks off in the screenshot, you know the scrape data might be off as well. Screenshots can also be part of the data pipeline: e.g., taking an image of a chart that’s hard to scrape as numbers, so you keep the image for a human or an OCR tool to analyze later.

Keep in mind that by default, a screenshot will usually capture only the current viewport. If the page scrolls, the rest is cut off – that’s where full-page screenshots come in.

Full-Page Screenshots

A full-page screenshot captures the entire scrollable area of a webpage from top to bottom. This is larger (potentially very tall image) and takes a bit more work. Some browser automation tools have a built-in full-page mode. For example, in Puppeteer you can do page.screenshot({ fullPage: true }) to get it all. In Selenium, there’s no one-liner for full page in older versions, but you can script a scroll or use browser-specific commands.
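With Playwright’s Python API, for instance, a full-page capture looks roughly like this:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    page.screenshot(path="full_page.png", full_page=True)   # capture the entire scroll height
    browser.close()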

Full-page screenshots are useful when you need a complete visual archive of a page. For instance, when archiving a news article for legal or compliance reasons, a full screenshot shows the content, sidebar, ads, timestamps – everything. It’s essentially like printing the webpage to an image. This can be part of scraping when the presentation matters, not just the raw text. However, full-page shots result in large images (could be several megabytes if the page is long). They also can be tricky if the page has lazy-loaded images (you might need to scroll to trigger loading before capturing).

From a scraping workflow perspective, screenshots (regular or full) can be taken with headless browsers or even special screenshot APIs. Headless Chrome (and others) have made it straightforward – capturing web page screenshots is a common use case for browser automation. There are also services where you send a URL and they return an image of the page. But if you’re already scraping the page’s data, you can use that same session to snap a screenshot. Just be aware of dynamic elements – sometimes you want to hide cookie banners or tooltips in your screenshot by injecting a little CSS before capturing.

In summary, screenshots are a supplementary data type in web scraping. They are visual data. You won’t parse them with BeautifulSoup, but you might run them through OCR or just store them as evidence of what was on the page at that time. They’re especially handy for debugging and monitoring. Pro tip: if you do a lot of screenshots, clean them up or store efficiently; hundreds of PNGs can consume disk space quickly. Also consider image formats – PNG is lossless (clear text, big size), JPEG is smaller but may blur small text. Choose based on your needs.

8. Metadata & Headers

Finally, let’s talk about the extra bits of information around web content – metadata and HTTP headers. These often get overlooked, but they can be very useful to scrapers.

Metadata

Metadata in the context of a webpage usually means the meta tags in HTML or other on-page descriptors.

These include things like: <title> of the page, <meta name="description" content="...">, <meta name="keywords" content="...">, and Open Graph tags (<meta property="og:title" ...> etc.).

They provide information about the page, rather than visible content. For example, the meta description is a summary of the page, and Open Graph tags define how the page should display when shared on social media (title, description, image). Meta tags are easy to scrape – they’re in the <head> of the HTML.

Using BeautifulSoup, you can find them by name or property:

desc = soup.find("meta", attrs={"name": "description"})
if desc:
    print("Page description:", desc["content"])

Why scrape metadata?

It can be useful for SEO analysis (collecting descriptions, keywords), for creating previews (like building your own link preview with title/desc/image), or for supplementing your data.

For instance, a scraper saving articles might also grab the description for a quick summary. As another example, if you’re scraping a list of URLs to curate content, you might fetch each page’s Open Graph image (og:image) to have a thumbnail for that content. One practical use case: instead of manually writing a title and blurb for a bookmarked URL, you can programmatically retrieve the page’s meta title and description, which gives you instant bookmark info.
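A small sketch of that bookmark-preview idea, pulling the title, meta description, and og:image for a given URL:

import requests
from bs4 import BeautifulSoup

def page_preview(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    def meta(**attrs):
        tag = soup.find("meta", attrs=attrs)
        return tag["content"] if tag and tag.has_attr("content") else None
    return {
        "title": soup.title.string if soup.title else None,
        "description": meta(name="description"),
        "image": meta(property="og:image"),       # Open Graph thumbnail
    }

print(page_preview("https://example.com"))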

Headers

Now, HTTP headers are not part of the page content; they are part of the HTTP request/response.

When your scraper makes a request, it sends headers (like User-Agent, Accept, Authorization, etc.), and the server responds with headers (like Content-Type, Content-Length, Date, Server, etc.). Headers contain metadata about the request or response (e.g., what format the content is in, or how long to cache it).

For scrapers, headers play two roles:

1. Request Headers (outgoing):

You often need to set these to mimic a real browser or to satisfy the server’s requirements. The most important is the User-Agent string. Many websites block requests with the default Python user-agent because it signals a bot.

By setting a User-Agent to, say, a common browser signature, you make your request look more legitimate. Other headers like Accept-Language can influence content (some sites respond with localized text if you send Accept-Language: fr for French, for example). If an API requires an API key or token, that often goes in an Authorization or custom header.

Using the correct headers helps avoid blocks and ensures you get the format you want (e.g., some APIs give XML by default unless you send Accept: application/json).

For example:

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
  "Accept": "application/json"
}
resp = requests.get(url, headers=headers)

Here we set a desktop browser User-Agent and indicate that we want a JSON response. The site sees a normal-looking request and might return JSON if it supports that. In fact, the Accept header is another important one – it tells the server what content types you can handle (HTML, JSON, image, etc.).

2. Response Headers (incoming):

These can tell your scraper useful info about what was returned.

For example, the Content-Type header shows the MIME type of the response (HTML, JSON, PDF, image, etc.).

Your code can check this to decide how to process the response. If you expected JSON but got Content-Type: text/html, maybe you got an error page. Content-Length tells you the size of the response in bytes – good for gauging whether an image download is huge. Last-Modified and ETag headers are used for caching; you could store those and send an If-Modified-Since header on the next request to avoid re-downloading unchanged data.

There are also headers indicating rate limit info in some APIs (like X-RateLimit-Remaining).

Another example is cookies (Set-Cookie header) which your scraper might need to capture and send back in subsequent requests if the site uses login sessions or CSRF tokens.

In practice, you typically set a few key request headers (UA, maybe Accept and Referer) and then inspect response headers if needed.

Headers and metadata might not be the main data you’re after, but they enrich and facilitate the scraping process. They help your scraper speak the right language to the server and can provide context for the data you get.

💡 For instance, a Content-Type header of application/pdf immediately tells your code to treat the response as a PDF file download instead of trying to parse it as HTML.
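In code, that check might look like this sketch, branching on Content-Type before touching the body (the URL is a placeholder):

import requests

resp = requests.get("https://example.com/some-resource")      # placeholder URL
content_type = resp.headers.get("Content-Type", "")

if "application/pdf" in content_type:
    with open("document.pdf", "wb") as f:                      # treat as a file download
        f.write(resp.content)
elif "application/json" in content_type:
    data = resp.json()
else:
    html = resp.text                                           # fall back to HTML parsing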

Conclusion

We’re living in the golden age of the internet, and all types of data, from articles to product listings, have countless use cases across different industries.

Being able to access any format of public data you need can take you from a beginner to a master web scraper.

Practice with different formats, tasks, and challenges to get familiar with everything that might come up in an important future project.