12 Ways Big Websites Prevent Web Scraping

Web scraping feels effortless at first when you’re sending requests to httpbin.org or demo websites like scrapingtest.com.

Then you move on to scraping an e-commerce site like Amazon, and the page throws a CAPTCHA your way instead of data.

Big platforms protect their content using an evolving arsenal of defenses.

Some are subtle, like browser fingerprints; others are aggressive, like AI systems that flag your request before it even reaches the origin server. If you rely on web data for pricing models, LLM training, or market research, understanding these defenses is essential.

The rest is adapting before the next update breaks your workflow.

In this post, we'll walk through twelve of the most common anti-scraping techniques. But first, let's understand why companies fight so hard against bots:

Why Do Websites Fight Scrapers?

Most websites don’t hate scraping.

They hate abuse.

The real battle is against bots that cause harm. Bots that overload servers, scrape sensitive or gated content, undercut competitive pricing, or copy user-generated content without permission.

These actions bring risk without return and can degrade the experience for actual users.

That is why many sites deploy anti-bot systems with the purpose of filtering out harmful automation while letting legitimate visitors through.

But not all platforms draw the same line.

Some go further and block scraping of public data entirely, even when it causes no disruption. Whether to protect business strategy, prevent aggregation, or enforce licensing restrictions, they treat all scraping attempts as threats and build targeted barriers to stop them.

This creates a moving battlefield.

Scrapers evolve. So do defenses.

The goal is rarely to stop scraping completely, which is nearly impossible.

The goal is to make scraping so expensive, unreliable, or frustrating that most scrapers give up long before they succeed.

Frontend (Client Side) Defenses

Some defenses do not live on the server; they are built into the page itself.

Frontend protections are embedded in the HTML, CSS, and JavaScript that load in the browser.

These techniques rely on one simple assumption: A real user will load the entire page, run every script, and interact in natural ways.

Most scrapers will not.

That difference is what these defenses look for.

1. Browser Fingerprinting and Device Identification

Every browser leaves a trail.

Browser fingerprinting collects dozens of data points from your device and browser to generate a unique signature.

This includes:

  • your browser version,
  • operating system,
  • screen size,
  • time zone,
  • language settings,
  • GPU details,
  • supported graphics APIs like WebGL,
  • and even the number of CPU cores.

That combination is often enough to distinguish one user from another. And more importantly, it helps detect scrapers that are trying to hide.

Here is how:

  • Inconsistent setups: If a client claims to be Chrome on Windows but reports a Linux-style WebGL output, something does not add up. Bot detection tools look for these mismatches.
  • Automation clues: Properties like navigator.webdriver being set to true are dead giveaways that the browser is under automation. In a normal browsing session this flag is false or undefined. Many bots forget to hide it.
  • Unusual fingerprints: Headless tools and outdated libraries often produce device profiles that no real user would have. The order of HTTP headers, the shape of the canvas output, the way fonts are rendered—all of it adds up.

Even if a scraper rotates IP addresses or changes user-agent strings, fingerprinting lets websites link sessions together using deeper clues.

If a suspicious fingerprint makes hundreds of requests, it gets flagged.

Browser fingerprinting is hard to fake and harder to reset, which makes it one of the most effective client side defenses against bots.
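
To see what a fingerprinting script has to work with, here is a minimal sketch using Playwright (one possible tool, assumed installed) that prints a few of the signals listed above, including navigator.webdriver:

```python
# A quick look at the signals a fingerprinting script can read.
# Assumes `pip install playwright` and `playwright install chromium`.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    signals = page.evaluate("""() => ({
        userAgent: navigator.userAgent,
        platform: navigator.platform,
        webdriver: navigator.webdriver,          // true under automation
        languages: navigator.languages,
        hardwareConcurrency: navigator.hardwareConcurrency,
        screen: [screen.width, screen.height],
        timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
    })""")
    print(signals)
    browser.close()
```

Run under stock headless automation, webdriver typically comes back true, which is exactly the kind of giveaway detection scripts look for.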

2. JavaScript Challenges and Obfuscated Content

Some pages look empty at first glance. Then the JavaScript runs, and the real content appears.

Modern websites use this behavior as a weapon. Instead of serving plain HTML, they embed key content inside JavaScript or hide it behind scripts that only real browsers can execute. A simple scraper that just fetches the raw HTML sees nothing useful.

Here are the most common techniques:

  • Obfuscated responses: The server might return scrambled or encoded data, along with a script that knows how to decode it. Prices, listings, or user details could be tucked inside a JavaScript variable or rendered through DOM manipulation. Unless the scraper runs the page like a browser would, the data stays hidden.
  • Delay scripts and puzzles: Some sites serve a challenge that takes a few seconds to solve. It could be a math problem, a sorting task, or just a delay timer. When the task finishes, a token or cookie is set. If that token is missing on future requests, access is denied. Human users barely notice. Scrapers that skip JavaScript hit a wall.
  • Automation traps: Challenge scripts often include tests designed to catch headless tools like Puppeteer or Playwright. They might trigger known side effects of browser automation, check for missing browser features, or probe for inconsistencies in how events behave. If the script detects something unusual, it stops the page or blocks the session.

From a scraper’s perspective, JavaScript challenges are a serious upgrade in difficulty. You need to run a real browser engine or at least a headless version that can process JavaScript. That means more memory, more CPU, and much slower requests. And even if it works today, there is no guarantee it will work tomorrow. Sites often change the script, just enough to break the automation.
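
To make that cost concrete, here is a rough sketch of the difference between a plain HTTP fetch and a full browser render (the URL is hypothetical, and Playwright is just one option):

```python
import requests
from playwright.sync_api import sync_playwright

url = "https://example.com/products"  # hypothetical JavaScript-rendered page

# Plain fetch: only the initial HTML, before any challenge or decode script runs.
raw_html = requests.get(url, timeout=10).text
print("raw HTML length:", len(raw_html))

# Full browser: executes the JavaScript that decodes and injects the real content.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")
    print("rendered HTML length:", len(page.content()))
    browser.close()
```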

That is the real strength of this method. It does not just block basic bots. It forces every scraper to pay the cost of complexity.

3. CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart)

CAPTCHAs are the most visible roadblock a scraper can hit.

You request a page. Instead of content, you get a test. It asks you to find all images with a traffic light or type out some distorted text. Humans solve it in a few seconds. Bots do not.

Websites deploy CAPTCHAs when they suspect automation. Maybe your client made too many requests too quickly. Maybe your browser fingerprint looks off. Whatever the trigger, the site pauses everything and demands proof that you are human.

For scrapers, that is a full stop.

Without human help or a CAPTCHA-solving service, the bot cannot continue. The data is there, but now it is locked behind a puzzle. Some scrapers try to bypass this using third-party services that route the challenge to human solvers or use machine vision to guess the answer. But even then, sites like Google and Cloudflare are a step ahead. They make the tests harder, more varied, and sometimes multi-layered if your traffic looks suspicious.
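
From the scraper's side, the first practical problem is simply recognizing that a challenge was served instead of content. A crude sketch; the status codes and keywords below are illustrative assumptions, since real challenge pages vary by site and vendor:

```python
import requests

# Illustrative markers only; real challenge pages differ per site and vendor.
CAPTCHA_MARKERS = ("captcha", "verify you are human", "unusual traffic")

def looks_like_challenge(response: requests.Response) -> bool:
    """Heuristic: did we get a CAPTCHA/challenge page instead of data?"""
    body = response.text.lower()
    return response.status_code in (403, 429, 503) or any(m in body for m in CAPTCHA_MARKERS)

resp = requests.get("https://example.com/some-page", timeout=10)
if looks_like_challenge(resp):
    print("Challenge detected: back off, rotate, or hand off to a solver.")
```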

This is what makes CAPTCHAs so effective. They do not just stop bots. They drain their speed, add cost, and create friction. And while they can frustrate real users, most major sites use them carefully—usually as a last resort, only when other signals strongly suggest bot activity.

From the site’s perspective, that is the tradeoff. One quick puzzle for a real visitor is acceptable if it shuts down a thousand bots trying to scrape the same data.

4. Honeypot Traps (Hidden Bot Traps)

Some traps are invisible to humans but deadly for bots.

A honeypot is a hidden element on a page designed to catch automated clients by baiting them into doing something a real user never would. These traps are simple, quiet, and extremely effective against careless scrapers.

Here is how they work:

  • Hidden form fields: A signup form might include a field named “email” that is hidden with CSS. A human never sees it. A naive scraper that blindly fills in every field might populate it without realizing. That single action is enough to flag the request as a bot.
  • Invisible links or buttons: Some sites embed links that are not visible on screen or are placed far outside the viewport. No real user would click them, but a bot that crawls everything will follow the link and expose itself.
  • Decoy APIs and fake sitemaps: A site might list bogus URLs in a trap robots.txt or dummy sitemap. Legitimate crawlers ignore them. Scrapers that follow every link or fetch every endpoint will fall right in.

Once a honeypot is triggered, the site can react instantly. It might block the IP, inject a CAPTCHA, or mark the session for closer inspection.
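
On the server side, the hidden-field version of this trap takes only a few lines. A minimal sketch, assuming a Flask app and a decoy field called "website" that CSS keeps invisible:

```python
from flask import Flask, request, abort

app = Flask(__name__)

@app.route("/signup", methods=["POST"])
def signup():
    # "website" is a decoy input hidden with CSS; real users never see or fill it.
    if request.form.get("website"):
        # Only an automated form-filler would populate it, so flag and reject.
        abort(403)
    # ... normal signup handling for real visitors would go here ...
    return "ok"
```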

What makes this method so powerful is its invisibility. Real users never encounter the trap. Basic scrapers walk straight into it. LinkedIn, for example, has used hidden form fields and other tricks for years to trip up unsophisticated bots.

Honeypots are a low-cost defense that filters out low-effort automation before it even gets started.

5. Behavioral Analysis and Interaction Tracking

Real users move like people. Bots do not.

Behavioral analysis is a powerful way to spot automation by studying how a visitor interacts with the page. A human scrolls in bursts, clicks at uneven intervals, pauses to read, and moves the mouse in unpredictable ways. Bots often do none of that.

Modern anti-bot systems inject scripts into the browser that quietly track these signals. The data is then sent back to the server to assess whether the client behaves like a real person.

Some of the most common clues include:

  • Mouse movement: Human mouse paths are jittery and imprecise. A bot might have no mouse activity at all, or produce a perfectly straight, instant movement that no hand could replicate.
  • Scrolling behavior: Real users scroll in chunks. They stop and read. A bot might scroll to the bottom in one motion or not scroll at all. That difference is easy to catch.
  • Typing rhythm: Humans make mistakes. They hesitate, backspace, and type with irregular timing. Bots fill fields instantly or type each character with robotic consistency.
  • Time on page and click flow: No human reads twenty product pages in five seconds. If the navigation is too fast or too uniform, it starts to look artificial.

Sites like Amazon and Facebook constantly collect this data. They compare it against huge datasets of real user behavior. If something feels off, the system increases the client’s bot score. Once the score is high enough, the client might get blocked, challenged, or slowed down.
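
To make one of these signals concrete, here is a toy sketch that measures how regular the gaps between click events are. Real systems combine hundreds of features; this single metric is purely illustrative:

```python
import statistics

def interval_regularity(event_times_ms: list[float]) -> float:
    """Coefficient of variation of the gaps between events.
    Humans produce noisy, uneven gaps; bots tend toward near-constant timing."""
    gaps = [b - a for a, b in zip(event_times_ms, event_times_ms[1:])]
    if len(gaps) < 2 or statistics.mean(gaps) == 0:
        return 0.0
    return statistics.stdev(gaps) / statistics.mean(gaps)

human_clicks = [0, 830, 2210, 2950, 5120]   # bursty and uneven
bot_clicks = [0, 500, 1000, 1500, 2000]     # perfectly spaced

print(interval_regularity(human_clicks))    # noticeably above zero
print(interval_regularity(bot_clicks))      # 0.0: suspiciously uniform
```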

For scrapers, this adds a new layer of difficulty. It is not enough to just load the page anymore. You have to behave. Some headless browsers can simulate mouse movement or random delays, but mimicking human interaction convincingly is complex and expensive.

And the models keep improving. Some sites even monitor mobile sensors like device tilt or acceleration to confirm that a real person is holding the phone. Bots cannot fake that easily.

Behavioral tracking does not rely on one signal. It combines many. And when paired with browser fingerprinting or traffic timing, it becomes a powerful way to spot and stop automation.

Backend (Server Side) Defenses

Not all defenses are visible on the page. Some live behind the scenes.

Beyond what runs in the browser, big websites use a range of techniques on the server itself. These defenses operate at the web server, inside the application logic, or deeper in the network infrastructure.

Server side measures have a major advantage. They are invisible to the client. A scraper has no clear way to know what triggered a block or why the request was flagged. And since the server sees the full picture (patterns across many sessions, IPs, and requests) it can make smarter decisions.

6. IP Address Rate Limiting and Blocklisting

Every request has an origin. That origin is an IP address.

Rate limiting is one of the oldest and most effective ways to slow down or stop scrapers. Websites track how many requests come from each IP, and when the number crosses a certain threshold, they respond with throttling, CAPTCHAs, or outright blocks.

Here is how it typically works:

  • Rate limiting: The server allows a set number of requests per minute from a single IP. If that limit is passed, further requests might return a 429 status code or trigger a delay. Some sites insert a CAPTCHA or challenge to break the scraper’s flow.
  • Blocklisting: If an IP repeatedly hits limits or behaves like a bot (fast bursts of activity, no scroll behavior, or interaction with honeypots) it might get blocked. Some sites even preemptively block entire IP ranges from cloud hosting providers, VPNs, or proxy networks commonly used by scrapers.

Google is a clear example. Send too many queries from the same IP, and you will hit a warning page asking you to verify you are human. Amazon does the same. Fetch too many product pages too fast, and you start getting 503 errors or security blocks.
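
The core mechanism on the server side can be as simple as a sliding-window counter per IP. A minimal sketch; the thresholds are made up, and real deployments usually live in a CDN or reverse proxy rather than application code:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 100          # illustrative limit; real values vary by site and endpoint

_recent_hits: dict[str, deque] = defaultdict(deque)

def allow_request(ip: str) -> bool:
    """Sliding window: allow at most MAX_REQUESTS per IP per WINDOW_SECONDS."""
    now = time.time()
    hits = _recent_hits[ip]
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()          # drop hits that fell out of the window
    if len(hits) >= MAX_REQUESTS:
        return False            # caller responds with 429, a CAPTCHA, or a block
    hits.append(now)
    return True
```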

This defense is not foolproof, but it raises the bar. Scrapers must rotate through large pools of proxies, each with their own limits. Even then, smarter systems track behavior across IPs. If hundreds of addresses all visit the same URLs in the same order, they are flagged and blocked together.

IP rate limiting is not fancy. It does not need to be. Every bot has to connect from somewhere, and this is the first gate it has to pass. Most do not.

7. HTTP Header and User Agent Validation

Every request to a website carries a signature. It is hidden in the headers.

Web browsers include dozens of HTTP headers with each request. These headers reveal details like the browser type, language preferences, accepted content types, the referring page, and more. Scrapers that do not get this right stand out instantly.

Here is how websites use headers to spot bots:

  • User Agent filtering: The User Agent string tells the server what kind of client is making the request. A real browser might say Chrome or Safari. A lazy scraper might say Python, curl, or Java. That is a red flag. Some websites only allow requests from a known list of popular browser signatures. Anything else is rejected or challenged.
  • Header consistency checks: Legitimate browsers send a very specific set of headers. They come in a certain order, include gzip support, language preferences, and other quirks. A simple HTTP client might skip some of these or get the order wrong. Even if the User Agent is valid, missing or mismatched headers will raise suspicion. Some systems even compare header casing and spacing to known browser behavior.
  • Referer and navigation checks: If you land on a deep product page with no referer header and no session cookie, it might look suspicious. Many sites expect users to move through pages in a sequence. Scrapers that jump straight to data endpoints without going through the proper flow risk being blocked for looking unnatural.

To bypass these checks, a scraper must do more than set a fake User Agent. It needs to carefully mimic all the headers a real browser would send, with matching language codes, encoding types, and navigation behavior. Some advanced detection systems still catch inconsistencies, but even basic header validation blocks a large number of unsophisticated bots.
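
In practice that means sending a coherent header set, not just a User Agent. A sketch with requests; the values below are hand-copied approximations of a Chrome session, and even then header order and the TLS handshake can still give the client away:

```python
import requests

headers = {
    # Approximation of a Chrome header set; real browsers send more, in a fixed order.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://example.com/",     # arrive from a plausible previous page
    "Connection": "keep-alive",
}

resp = requests.get("https://example.com/products", headers=headers, timeout=10)
print(resp.status_code)
```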

For many major sites, sloppy headers are an immediate fail. There is no second chance.

8. TLS and SSL Fingerprinting (JA3 Fingerprints)

Before any data is exchanged, there is a handshake.

When a client connects to a site over HTTPS, it begins with a TLS handshake. During this handshake, the client shares its supported encryption methods, TLS version, extensions, and other connection details. These values form a fingerprint that can reveal what kind of software is making the request.

In 2017, researchers introduced a method called JA3. It turns the handshake parameters into a unique hash that characterizes the client’s TLS stack. Chrome, Firefox, Python’s requests library, and curl all produce different JA3 fingerprints.
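
The JA3 recipe itself is public: concatenate five ClientHello fields and MD5 the result. A sketch, assuming the fields have already been parsed out of the raw handshake (the parsing itself is the hard part and is skipped here):

```python
import hashlib

def ja3_hash(tls_version: int, ciphers: list[int], extensions: list[int],
             curves: list[int], point_formats: list[int]) -> str:
    """JA3 string = version,ciphers,extensions,curves,point_formats
    (each list joined with '-'), then MD5-hashed."""
    ja3_string = ",".join([
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ])
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Illustrative numbers only, not a real browser's handshake.
print(ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))
```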

This is where detection begins.

If a client claims to be Chrome through the User Agent, but its JA3 fingerprint matches Python requests, something is off. Websites and CDNs log these fingerprints to verify that the client’s identity matches its behavior. When it does not, the connection is flagged.

Cloudflare, for example, allows administrators to challenge or block requests that match specific JA3 hashes. They even expose these values through their bot management platform so sites can track which fingerprints are trusted and which are suspicious.

The challenge for scrapers is that JA3 data is hard to fake. It comes from deep inside the network stack. Changing it requires modifying how your HTTP library or headless browser negotiates encryption—something most tools do not expose or let you tweak easily.

Some headless browsers try to stay aligned with real Chrome TLS behavior, but fall behind over time. Others miss subtle extensions or present outdated cipher preferences that no modern browser would use.

This makes TLS fingerprinting a powerful silent filter. It runs before any HTTP headers are seen. No JavaScript. No page load. Just the initial handshake, quietly evaluated. If your scraper shows the wrong signature, it might be blocked before it ever sends a request.

9. Cookie and Session Validation

Cookies do more than keep you logged in. They are also a test.

Websites rely on cookies to manage user sessions, store preferences, and track navigation. But they also use them as a barrier to block bots that do not act like real browsers.

There are two main ways cookies come into play:

  • Required cookies for access: Some sites issue a cookie after the first visit or after solving a challenge. That cookie becomes a key. Without it, future requests are rejected. For example, Cloudflare might challenge you with a script, and once passed, your browser gets a cookie like cf_clearance. Any scraper that skips that process and jumps directly to the content will be blocked. Amazon does something similar. If the expected session cookies are missing or malformed, the site responds with an error or a redirect.
  • Session tracking and behavior analysis: Cookies link requests together. They tell the server that all this activity is coming from the same client. If one session starts making strange requests—too fast, too frequent, or to unexpected URLs—the server can flag it, throttle it, or end it. This means even with multiple IPs, if the scraper reuses one session or account, it can still get blocked.

Cookies and login flows often go together. Many high-value pages are only accessible after authentication. That login process sets session cookies or tokens that the server later checks. A bot that fails to carry those forward will not get through.

Advanced defenses go even further. They drop multiple tokens, some as cookies and some in other storage layers like localStorage or IndexedDB. They then expect all of them to be returned correctly in later requests. A real browser does this without fail. A basic scraper might skip one or misformat it, and that is all the site needs to block the request.

This forces scrapers to become stateful. They must simulate full browser behavior, persist cookies across sessions, and carry context from one page to the next. If they cannot, they get stopped. For large websites, cookie checks are a quiet but powerful filter. Most bots never make it past them.
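
Being stateful starts with something as simple as reusing one cookie jar across requests. A minimal sketch with requests.Session; note that this alone will not solve a JavaScript challenge that sets a clearance cookie, it only carries cookies forward once you have them:

```python
import requests

session = requests.Session()

# First request: the site may set session or clearance cookies here.
session.get("https://example.com/", timeout=10)
print(session.cookies.get_dict())

# Later requests reuse the same cookie jar automatically, so the server sees
# one continuous session instead of a series of cookieless, stateless hits.
resp = session.get("https://example.com/products?page=2", timeout=10)
print(resp.status_code)
```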

10. Credential Gating and Login Walls

Sometimes, the most effective way to stop scraping is to lock the door.

Many major platforms restrict access to valuable content behind a login wall. If you are not signed in, you either see nothing or just a limited preview. LinkedIn profiles, Facebook pages, and certain search results fall into this category.

This tactic forces scrapers to do more than just fetch URLs. They now need to manage user accounts, handle login flows, and maintain valid sessions. That adds serious friction.

Here is why login walls work:

  • Account level throttling: Once logged in, every request is tied to an account. If one account starts viewing hundreds of profiles in rapid succession, the site can flag it and shut it down. Unlike IP blocks, which are easy to rotate, accounts are harder to replace. Especially if phone verification, email validation, or payment details are involved.
  • Friction at scale: To scrape at scale, a bot now needs a pool of working accounts. Each one may require signup, verification, and possibly payment. Platforms track unusual signup activity and can block multiple accounts tied to the same behavior or infrastructure.
  • Dynamic and limited content: After login, content might be personalized, rate-limited, or delivered in small chunks. Scrapers must handle paginated requests, token management, or user-specific APIs. This slows things down and adds complexity.

Sites like LinkedIn are aggressive in enforcing login walls. They monitor how each account behaves, trigger CAPTCHAs on suspicious actions, and block accounts that show bot-like behavior. This method does not block scraping outright, but it makes it expensive and risky. Every account used for scraping can be banned, tracked, or even legally challenged.

For scrapers, this is a major step up in effort. They must simulate full user flows, manage cookies, handle logins, and rotate accounts constantly. That cost is often enough to stop low-effort scraping before it even begins.
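
A sketch of what simulating the full user flow looks like at its simplest, with hypothetical endpoints and field names; real sites add CSRF tokens, redirects, and JavaScript steps on top:

```python
import requests

LOGIN_URL = "https://example.com/login"          # hypothetical endpoint

session = requests.Session()
resp = session.post(LOGIN_URL, timeout=10, data={
    "username": "me@example.com",                # hypothetical field names
    "password": "not-a-real-password",
})
resp.raise_for_status()

# The session now carries the authentication cookies the server issued,
# and every request below is tied to that single account.
profile = session.get("https://example.com/account/profile", timeout=10)
print(profile.status_code)
```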

11. Advanced Bot Detection Services and WAFs

Some websites do not build defenses themselves. They outsource the job to specialists.

Modern bot detection services are often bundled into Web Application Firewalls or Content Delivery Networks. These systems offer full-stack protection against scraping by combining many techniques into a single, constantly evolving defense layer.

Here is how they work.

When a user visits the site, a script runs in the browser. It collects data—browser fingerprints, mouse movements, sensor info, and more. On the server side, the system analyzes everything else: IP reputation, header structure, TLS fingerprint, cookie patterns, and request timing. All of this feeds into a scoring system that evaluates how likely the visitor is a bot.

If the score crosses a threshold, the system reacts. It might trigger a CAPTCHA, run a challenge script, or block the request altogether. The website owner decides the level of aggression.
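
Conceptually, the scoring step looks something like the toy sketch below. The signals, weights, and thresholds are invented for illustration; real services use far richer models and shared threat intelligence:

```python
def bot_score(signals: dict) -> int:
    """Toy scoring: add weighted penalties for suspicious signals."""
    score = 0
    if signals.get("webdriver_flag"):
        score += 40      # navigator.webdriver reported true
    if signals.get("ja3_mismatch"):
        score += 25      # TLS fingerprint does not match the claimed browser
    if signals.get("missing_headers"):
        score += 15      # header set incomplete or in the wrong shape
    if signals.get("datacenter_ip"):
        score += 10      # request came from a hosting provider range
    if signals.get("requests_per_minute", 0) > 120:
        score += 20      # faster than any human browses
    return score

visitor = {"ja3_mismatch": True, "missing_headers": True,
           "datacenter_ip": True, "requests_per_minute": 200}

score = bot_score(visitor)
if score >= 60:
    print("block or hard challenge")   # the site owner picks the thresholds
elif score >= 30:
    print("serve a CAPTCHA")
else:
    print("allow")
```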

Why are these services so effective?

Because they see the big picture.

A bot that hits one site and gets flagged can be blocked instantly on another site using the same provider. These platforms share threat intelligence and update their models constantly. They detect patterns that a single site team might miss—like new headless tools, proxy networks, or behavioral anomalies.

They also use deception. Some deploy honeypots silently across thousands of client sites. Others monitor obscure browser features, plugin data, or minute timing differences that bots struggle to mimic.

For scrapers, this is a nightmare scenario. You are not dealing with one defense; you are facing a layered system that adapts, shares data, and improves over time. Even advanced scrapers with full browser automation need constant updates to avoid detection. And once flagged, it gets harder with every request.

That is why many bots simply avoid these targets. It is often easier to find a weaker source or rely on an official API than to challenge a defense system that never stops learning.

12. AI and Machine Learning Based Traffic Pattern Analysis

Sometimes, the request looks perfect. Headers are right. Cookies are valid. TLS fingerprint matches Chrome.

And yet, it gets blocked.

That is the power of machine learning.

Large websites are no longer relying only on hard-coded rules to catch bots. They now use AI models trained on massive datasets to understand how real users behave. These models look at everything (click paths, page view timings, scroll patterns, even the pace of navigation) and they learn what normal looks like.

Then they look for outliers.

A client that visits every product page in alphabetical order. A group of accounts that all load the same pages at the same speed. A scraper that adds delays but still clicks just a little too predictably. Individually, none of these are smoking guns. Together, they form a pattern.
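
That is essentially anomaly detection, and a toy version fits in a few lines. The features and numbers below are invented, and with this little data the output is only illustrative; real systems train on millions of sessions (this sketch assumes scikit-learn and NumPy are installed):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Per-session features: [pages per minute, mean seconds between clicks,
# std-dev of those gaps]. All values are made up for illustration.
human_sessions = np.array([
    [3.1, 18.0, 9.5],
    [2.4, 25.0, 14.2],
    [4.0, 12.0, 6.8],
    [1.8, 33.0, 20.1],
    [2.9, 21.0, 11.7],
    [3.6, 15.0, 8.3],
])
new_sessions = np.array([
    [2.7, 22.0, 12.0],    # blends in with the human traffic
    [45.0, 1.3, 0.05],    # fast and metronome-regular
])

model = IsolationForest(contamination=0.1, random_state=0).fit(human_sessions)
print(model.predict(new_sessions))   # 1 = looks normal, -1 = flagged as an outlier
```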

Here is how these systems work:

  • Every request gets a score: It might be called a bot score, a risk score, or something else. If the score is too high, the site might challenge the visitor or block access entirely.
  • The system learns over time: If a new scraping method starts to spread, and multiple sites see it, the detection model gets updated. This is especially true for services that share intelligence across platforms.
  • It works even when everything else looks right: Some scrapers pass all the traditional checks. But they still move too fast. Or too consistently. Or too well. Humans are messy. Bots are efficient. And AI picks up on that.

For scrapers, this is the hardest wall to climb. You can fake headers. You can mimic browser behavior. But beating a real-time anomaly detector trained on millions of sessions is another level entirely. It forces you to constantly adapt, inject noise, slow down, and behave less like a script and more like a person who is casually browsing.

And even then, you still might get caught.

Machine learning tilts the balance. It gives defenders the advantage of scale, pattern recognition, and adaptability. The rules are no longer fixed. And that makes scraping much harder to win.

Conclusion

Most websites do not care about scraping; they care about abuse.

But the biggest ones, like Amazon, LinkedIn, and Meta, have decided that scraping is abuse. And they fight it with everything from TLS fingerprinting to AI-driven behavior models.

If you are scraping at scale, you need to be ready for that level of resistance.

That is where Scrape.do comes in:

  • Real browser automation with JavaScript execution
  • Automatic CAPTCHA handling and fingerprint rotation
  • Built-in support for login flows and session management
  • Infrastructure that adapts when the target changes

Use Scrape.do to spend less time dodging defenses and more time getting clean, usable data.