
How to Find and Read robots.txt for Crawling and Scraping

Created Date: March 24, 2025   Updated Date: March 24, 2025

Before you scrape a website, you need to check its rules of engagement, and that starts with robots.txt.

Think of it like a “Do Not Disturb” sign for web crawlers. It doesn’t stop you from entering, but it tells you where you’re not welcome.

This simple text file, quietly sitting at the root of almost every website, tells bots what they can and can’t do.

Ignore it, and you risk getting blocked, throttled, or even violating terms of service.

Respect it, and you’re on the path to ethical, efficient scraping.

In this guide, we’ll break down exactly what robots.txt is, how to find it, how to read it like a pro, and how to use it responsibly in your scraping workflow:

What Is robots.txt?

robots.txt is a plain text file that website owners place at the root of their site to communicate with web crawlers. It’s part of the Robots Exclusion Protocol (REP), a widely adopted standard used by websites to indicate which parts of the site crawling bots are allowed or disallowed from accessing​.

In practice, before a crawler fetches any page, it will first check for a file at the standard location (e.g. https://www.example.com/robots.txt) to read the site’s crawling instructions​.

These instructions rely on voluntary compliance: well-behaved bots honor the rules, while malicious bots might ignore them. (In fact, attackers could even use robots.txt as a convenient list of sensitive paths since it’s publicly visible, which is why it’s a guideline and not a security feature.)

Brief History

The robots.txt standard was proposed in 1994 by Martijn Koster, a software engineer who was grappling with early web “spiders” that hammered sites with traffic​.

The story goes that a poorly behaved crawler (written by developer Charles Stross) inadvertently caused a denial-of-service on Koster’s server, prompting the need for a protocol to rein in web robots​.

Koster and other web pioneers discussed a solution on the WWW-Talk mailing list, eventually agreeing on a simple text-file format named “robots.txt” (originally it was going to be RobotsNotWanted.txt before they opted for a shorter name)​.

The major search engines of the time quickly agreed to honor robots.txt (WebCrawler and Lycos at first, with AltaVista and others soon after), making it a de facto standard almost immediately.

Adoption and Evolution

Over the decades, robots.txt became nearly ubiquitous on the web.

As of 2019, Google estimated that over 500 million websites have a robots.txt file in place​, an astonishing adoption of this simple protocol.

Until recently it was an informal standard, but in 2019 Google, along with Koster and others, pushed to formalize it. The REP was finally published as an official internet standard (RFC 9309) in 2022​, after 25+ years of real-world use.

The core rules remain simple, though modern conventions (like wildcard support and sitemap pointers) have since been documented to eliminate ambiguities.

Is It Still Relevant Today?

Absolutely.

Robots.txt is still the first line of communication between websites and crawlers, including search engines and data scrapers.

Nearly every major site uses it in some form, and search engine bots (Googlebot, Bingbot, etc.) check it on each crawl. In the 2020s it has even taken on new importance with the rise of AI content crawlers: many sites have updated their robots.txt to explicitly block bots like OpenAI’s GPTBot from scraping their content.

For example, in 2023 Amazon, Facebook, Pinterest, WikiHow, WebMD, and many others added rules to disallow GPTBot in their robots.txt files​. This shows that even in the era of AI and big data, the humble robots.txt remains a crucial tool for webmasters to declare scraping boundaries.

In short, robots.txt is very much alive and relevant as an ethical web scraping guideline and server resource safeguard.

Why Do You Need robots.txt?

From a developer or scraper’s perspective, understanding and respecting robots.txt is important for ethical and efficient crawling.

Here’s what happens when your web scraper respects the robots.txt rules versus when it ignores them:

If you respect robots.txt

Your scraper will avoid disallowed areas of the site, which means you won’t crawl content the site owner has deemed off-limits. This is good practice for ethical scraping​.

By staying within allowed sections, you reduce the risk of overwhelming the website with requests (the site owner often uses robots.txt to prevent crawlers from hitting large or sensitive sections that could be resource-intensive).

Respecting these rules also helps you stay in the site owner’s good graces; you’re less likely to get your IP blocked or face legal terms-of-service issues if you honor the site’s published policies.

In short, you act as a “polite” bot, which is important for ethical data collection and sustainability.

If you ignore robots.txt

Your crawler might venture into sections the website explicitly asked bots not to touch (such as private archives, search result pages, or admin areas).

While nothing technical stops you from doing so, it’s akin to entering through a “No Entry” door; it can be seen as acting in bad faith.

Scraping disallowed pages can put undue load on the site (since those sections might not be optimized for crawlers), potentially impacting the site’s performance. It may also get you flagged and blocked quickly; many web admins monitor server logs and will ban user agents that ignore robots.txt. In some cases, webmasters set up honeypot URLs in robots.txt (disallowed links that legitimate bots will avoid) to detect crawlers that don’t comply – if your bot requests those, it’s a dead giveaway you’re ignoring the rules.

Overall, ignoring robots.txt is unethical scraping and could lead to consequences ranging from being blacklisted to legal challenges.

How to Find robots.txt of a Website

Finding a website’s robots.txt file is straightforward because its location is standardized. You don’t need any special tools – a web browser or basic HTTP request will do. Below are three step-by-step methods to locate and fetch a site’s robots.txt:

Step 1: Append “/robots.txt” to the Domain URL

By convention, the robots.txt file is always placed in the root directory of the website. This means you can typically find it by taking the base URL of the site and adding /robots.txt. For example, for example.com, navigate to:

https://www.example.com/robots.txt

Simply enter that URL in your browser’s address bar. If the site has a robots.txt, it should display as plain text. (You can do the same for any site: e.g. instagram.com/robots.txt, nytimes.com/robots.txt, github.com/robots.txt, etc.) If the site uses a www subdomain, include that (e.g. http://www.example.com/robots.txt). Likewise, use the proper scheme (http:// or https://) that the site is on. In summary, the standard URL path for a robots.txt is “/robots.txt” right off the main domain.

Note: robots.txt is a public file. You don’t need any authentication to access it. If the site is reachable, its robots.txt (if it exists) should be as well. If your browser shows an error or blank page, that likely means the site has no robots.txt (or you typed the URL incorrectly).

Step 2: Retrieve it with an HTTP request (using Python)

While a browser is convenient for a quick check, as a developer you might want to fetch robots.txt programmatically. This is easily done with an HTTP library. In Python, for instance, you can use the requests library to GET the robots.txt URL:

import requests

url = "https://www.example.com/robots.txt"

response = requests.get(url)

print(response.status_code)
print(response.text[:500]) # print first 500 characters of the content

In this snippet, we request the robots.txt file and then check the status code and a portion of the text. A successful fetch should return a 200 OK status and the body of the file in response.text. You can then examine or parse this text as needed. If using another language or tool, the process is analogous: make a GET request to the site’s base URL + /robots.txt.

Tip: You can also use command-line tools like curl or wget to fetch robots.txt. For example:

curl -L -o robots.txt https://www.example.com/robots.txt

will download the file to your local machine. But be mindful of polite usage – fetching robots.txt is lightweight compared to crawling pages, but you should still avoid spamming the site with repeated requests.

Step 3: Check the HTTP Response

After requesting robots.txt, you need to interpret the HTTP response:

  • If you get a 200 OK: Great, the robots.txt was found. The response body will contain the rules in plain text. You can proceed to read or parse those rules (we’ll cover how to read the contents in the next section). For example, response.status_code == 200 in the Python snippet indicates success, and response.text now holds the file’s content as a string.
  • If you get a 404 Not Found: This means the site does not have a robots.txt file at the expected location. A missing robots.txt is essentially an “all clear” – by convention, the absence of a robots.txt means crawlers are free to fetch anything on the site (no restrictions)​. In other words, crawlers assume the site owner didn’t impose any crawling limitations if the file is missing. Your scraper can interpret a 404 on robots.txt as “no rules to worry about.” (However, it’s still good to crawl responsibly even if there’s no robots.txt.)
  • If you get a different status (301, 403, etc.): Occasionally, a site might redirect the robots.txt URL (301/302 redirect). If so, follow the redirect (most HTTP clients do this automatically) to get the content. A 403 Forbidden could imply the site intentionally blocks access to the robots.txt (which is unusual and counterproductive, but if encountered, you must decide how to proceed – likely treat it as “no permission,” meaning you should not crawl, or as “no info,” meaning proceed carefully). In most cases, you will see either 200 or 404.
  • If the request times out or the site is down: That’s similar to not having the file reachable. Many crawlers in this case assume the safest route, which is to treat it like an empty file (no disallows). But if the site is down, you can’t scrape it anyway. You might try again later.

In our Python example, you can programmatically check response.status_code. For instance:

if response.status_code == 200:
    robots_rules = response.text
elif response.status_code == 404:
    robots_rules = ""  # no robots.txt – treat as no rules
else:
    # handle other statuses or errors conservatively
    robots_rules = ""

After this step, you should have the robots.txt content (or know that none exists). The next task is to read and interpret the rules inside that file.

How to Read robots.txt

Once you have the robots.txt file content, understanding its directives is key. A robots.txt file is structured as a set of rules that tell crawlers what they can or cannot crawl. In this section, we’ll break down all the directives and syntax you might encounter in a robots.txt, and how to interpret them. By the end, you’ll know how to parse rules like User-agent, Disallow, Allow, Crawl-delay, Sitemap, and more, including wildcard patterns.

Format Overview: A robots.txt file is a plain text file with one rule per line. The general format is a series of “records” or blocks, each targeting a specific crawler (or set of crawlers). A record looks like this:

User-agent: [crawler-name]
Disallow: [path]
Disallow: [another path]
Allow: [path]
# (etc.)

Each record starts with a User-agent directive, which specifies which bot(s) the following rules apply to. It’s followed by one or more directives like Disallow or Allow that apply to that bot. A single robots.txt can contain multiple such records (to give different rules for different crawlers). Typically, records are separated by blank lines for readability. Lines starting with # are comments and should be ignored by parsers. Directive names are case-insensitive (“User-agent” is usually capitalized as shown, but “user-agent” or “User-Agent” are equivalent). The file is usually UTF-8 text.

Let’s go through each directive and parameter you might see:

User-agent Directive

User-agent is the first directive of a rule block. It declares which crawler(s) the subsequent rules apply to. The syntax is:

User-agent: <name>

Here <name> is the identifier of a web robot, as sent in its HTTP User-Agent header. For example, Google’s crawler identifies as “Googlebot”, Bing’s as “Bingbot”, and so on. You can also target all crawlers with a wildcard *.

How it works: When a crawler visits a site, it looks for the group of rules that best matches its own user-agent string. For example, if Googlebot visits, it will check if there’s a User-agent: Googlebot section. If yes, it will obey those rules. If not, it will fall back to a generic User-agent: * (asterisk) section if present. The * (asterisk) is a wildcard meaning “any crawler”. Most sites have a User-agent: * record that serves as default rules for all bots not specifically called out.

A robots.txt can list multiple user-agent lines if the same rules apply to multiple bots. For instance:

User-agent: Googlebot
User-agent: Bingbot
Disallow: /private-data/

This means Googlebot and Bingbot should not crawl /private-data/ (other bots are not mentioned here, so they are not bound by this record). You can also have separate records for different bots if you want different rules for each. For example, a site might allow Googlebot to do something that others cannot (or vice versa). In that case, one block might start with User-agent: Googlebot (with its rules), and another block with User-agent: * (with the general rules for everyone else). Each user-agent section’s rules are independent: a crawler obeys only the single block that best matches its name, falling back to the * block if none does. If a bot’s name could match multiple sections, the convention is that the longest (most specific) user-agent string match is used. For example, a bot named “Googlebot-Image” would follow a “User-agent: Googlebot-Image” block over a generic “User-agent: Googlebot” block, because the former is a more specific match.
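To make that matching rule concrete, here is a minimal Python sketch of choosing the rule group that applies to a given bot. The select_group helper and the groups dictionary are hypothetical, assuming you have already split the file into per-user-agent groups:

# Minimal sketch (hypothetical helper, not a library function):
# pick the most specific matching user-agent group, falling back to "*".
def select_group(groups, my_agent):
    agent = my_agent.lower()
    # Candidate tokens are those that appear at the start of our agent name.
    matches = [token for token in groups if token != "*" and agent.startswith(token)]
    if matches:
        return groups[max(matches, key=len)]  # longest token = most specific
    return groups.get("*", [])

groups = {
    "googlebot": ["Disallow: /private-data/"],
    "googlebot-image": ["Disallow: /"],
    "*": [],
}
print(select_group(groups, "Googlebot-Image"))  # ['Disallow: /'] – most specific wins
print(select_group(groups, "SomeOtherBot"))     # [] – falls back to the * group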

Example: In a simple case, if a site’s robots.txt contains:

User-agent: *
Disallow: /admin/

This means all bots should avoid crawling the /admin/ section. If a specific bot (say “BadBot”) is being troublesome, the site could add a specific rule:

User-agent: BadBot
Disallow: /

This would tell that particular bot not to crawl anything on the site (Disallow all). Meanwhile, other bots would still follow the * rules. (We’ll explain the Disallow: / syntax next.)

Disallow and Allow Directives

The Disallow and Allow directives are the core of robots.txt, specifying which paths bots cannot or can crawl. They are usually listed right after a User-agent line (or lines). Think of them as filters applied to URLs on the site.

Disallow

A Disallow directive tells the crawler not to visit any URL that starts with the specified path. The syntax is:

Disallow: <path>
  • <path> can be a full path (e.g. /secret-folder/) or a partial prefix. It is always relative to the site’s root. For example, Disallow: /secret would block https://example.com/secret, as well as /secret/page1 and /secret-plans (any URL beginning with “/secret”). Essentially, if the URL path starts with the Disallow string, it’s off-limits.

  • An important edge case: if <path> is empty (i.e. Disallow: with nothing after it), it means “disallow nothing.” An empty Disallow value indicates that no pages are disallowed for that user-agent. Some sites explicitly use:

     

    User-agent: *
    Disallow:
    

    which is effectively the same as saying “you’re allowed to crawl everything.” (It’s equivalent to having no robots.txt at all, or you might also see Allow: / used to similar effect.)​

     

  • If you want to block everything on the site, you use a forward slash as the path: Disallow: /. The single slash matches all URLs (since all URLs start with “/”). This is the bluntest directive, meaning “no access anywhere.”​

     

  • You can list multiple Disallow lines under one user-agent. Each one adds a path to the block list. For example:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /temp/
    Disallow: /junk/
    

    This tells bots to avoid the /cgi-bin, /temp, and /junk directories entirely​. You might do this to keep crawlers out of irrelevant or sensitive areas.

     

  • Default allow: If a user-agent section has no Disallow lines at all, it implicitly means “allow everything.” Some sites use an explicit Allow: / for clarity (covered below), but it isn’t necessary. Conversely, if a section has no Allow but has some Disallow rules, anything not disallowed is allowed.

In summary, Disallow = “do not crawl these paths.” It’s the most common directive you’ll see. For instance, many sites disallow /admin/, /login, /search, or other non-public pages to all bots.

Allow

An Allow directive does the opposite – it permits crawling of a path, even if an overarching Disallow might apply. Its syntax is:

Allow: <path>

Allow is typically used as an exception to a broader disallow. For example, suppose you disallowed a whole directory but want to let bots crawl a specific file or subdirectory within it. You could do:

User-agent: *
Disallow: /public/
Allow: /public/index.html

This would block everything under /public/ except the /public/index.html page (which is explicitly allowed). When crawlers interpret this, they usually give precedence to the more specific rule. In this case, the allow rule for the specific file overrides the blanket disallow for that directory.

It’s important to note that the Allow directive was not part of the original 1994 spec but was introduced later and is now supported by all major search engine bots. It gives webmasters finer control. For scrapers, it means you can’t just check disallows; you also need to see if an allow explicitly opens something that a disallow would otherwise cover.

Multiple Allow/Disallow Rules: When there are potentially conflicting rules, the usual resolution (per Google and the standard) is:

  • The most specific rule (the one with longest path that matches the URL) wins.
  • For example, given Disallow: /public/ and Allow: /public/index.html, the URL /public/index.html is allowed (because the allow rule is a more specific match to that URL). Another URL /public/contact.html would be disallowed (no specific allow for it, so the /public/ disallow takes effect).
  • If two rules have equal specificity, a crawler might choose the deny over allow as a precaution. But in practice, avoid ambiguous cases.

Allow all: You might see Allow: / which effectively means “you are allowed to fetch the root and everything under it” (it’s a blanket allow). However, by itself Allow: / is redundant unless used in combination with other disallows. It’s sometimes used in records to explicitly state that everything is open (or to override a disallow from an earlier rule). For example, Google’s documentation shows using both directives together to let one bot in and keep others out.

Example – Combining Allow and Disallow: A common pattern is to disallow an entire directory but allow certain subdirectory or file. For instance:

User-agent: *
Disallow: /gallery/
Allow: /gallery/public/

This means “Don’t crawl the /gallery section, except you can crawl the /gallery/public subfolder.” Bots would refrain from /gallery/private or any other subfolder except the explicitly allowed /gallery/public path.

Another example:

User-agent: *
Disallow: /
Allow: /public/

This setup would block everything on the site except the /public directory (so only that directory is crawlable)​. This kind of rule might be used during a site’s development – e.g., keep the whole site unindexed except a specific section meant for public view.

Wildcards and Pattern Matching in Paths

Robots.txt rules can include wildcard characters to match patterns of URLs. The two special tokens you’ll see are * (asterisk) and $ (dollar sign):

  • * (asterisk) – matches any sequence of characters. This can be used in the middle or end of a path to indicate “anything goes here.” For example, Disallow: /blog/*/comments would disallow URLs like /blog/2023/post1/comments and /blog/draft/comments – essentially any URL of the form “/blog/ something /comments”. Similarly, Disallow: /*.php would disallow all URLs containing “.php” (like /index.php, /folder/page.php?id=5, etc.). Wildcards are very powerful and commonly used. Important: some older implementations did not treat the wildcard as matching /, but the modern standard treats it as “match any sequence of characters,” including slashes. (The 2022 spec clarifies wildcard usage.)

     

  • $ (dollar sign) – matches the end of the URL. It’s used to indicate “ends with…”. For example, Disallow: /*.pdf$ means disallow any URL that ends in “.pdf” (likely to block PDF files from being crawled)​. The $ is typically used in conjunction with a preceding pattern to specify file extensions or exact endings. Another example: Disallow: /temp$ would disallow the URL that exactly ends with “/temp” (but not /temp/anything because that has additional characters after “temp”). In practice, $ is mostly used for file-type blocking as shown.

     

These wildcards were not in the original robots.txt spec but became common through Google’s and others’ interpretation. Now, according to Google’s documentation, all rules (except the Sitemap directive) support the * wildcard, and the $ is recognized for end-of-string matches​.

Examples of wildcard usage:

  • Disallow: /*? – disallow any URL containing a ? (query string). This is a way to block all query parameters (perhaps to avoid crawling dynamic URLs or duplicates). For instance, Disallow: /*?session= would block any URL that has “?session=” in it.
  • Disallow: /*.jpg$ – block all JPEG image URLs. A site might do this if they want to prevent image search engines from indexing their photos, for example.
  • Allow: /*.css$ – allow all .css files to be crawled (maybe amidst a disallow of everything else). This could be used if the site wants search engines to fetch CSS (for rendering), even if much of the site is disallowed.
  • Disallow: /backup/* – disallow any URL under /backup/ (and the * means anything after /backup/ is covered, effectively the same as Disallow: /backup/ with a trailing slash in this case).

Real-world case: Instagram’s robots.txt uses wildcards in its rules. One notable rule in Instagram’s file is:

Disallow: /*__a=1*

This pattern *__a=1* appears for multiple user-agents. It blocks any URL that contains the query parameter __a=1 anywhere in it​. (Instagram uses __a=1 in certain internal API calls to fetch JSON data; they clearly don’t want crawlers hitting those endpoints.) The wildcard before and after __a=1 means it will match if that sequence appears anywhere in the URL query or path. So URLs like https://www.instagram.com/p/XYZ/?__a=1 would be disallowed.

To give another concrete example of wildcards and allow/disallow together: Google’s own guidelines show the following rule to block all GIF images on a site from Google’s crawler:

User-agent: Googlebot
Disallow: /*.gif$

This tells Googlebot: “do not crawl any URL that ends in .gif”​. As a result, none of the .gif images will be fetched by Googlebot (so they won’t appear in Google’s index or image search).

In summary, wildcards greatly enhance the flexibility of robots.txt. As a scraper, you need to be aware that a disallow or allow may not be a simple prefix – it could be a pattern. When interpreting rules, your code should handle * and $. If writing your own parser, a * can be translated to “.*” in a regex (with other characters escaped) and $ to an end-of-string anchor. You don’t necessarily need full regex support; simple pattern matching with those two wildcards is usually sufficient. The standard defines that * matches any sequence of characters (including the empty sequence) and $ matches the end of the URL, making these patterns fairly straightforward to apply.
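As a rough illustration of that approach, here is a simplified sketch (not a full RFC 9309 implementation) that converts a robots.txt path pattern into a regular expression and applies the longest-match rule described earlier:

import re

def pattern_to_regex(path_pattern):
    """Translate a robots.txt path pattern into a compiled regex:
    '*' matches any sequence, a trailing '$' anchors the end,
    everything else is matched literally."""
    anchored = path_pattern.endswith("$")
    core = path_pattern[:-1] if anchored else path_pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(url_path, disallows, allows):
    """Longest matching rule wins; an Allow beats a Disallow of equal length."""
    best_len, allowed = -1, True  # no matching rule at all means allowed
    for rules, verdict in ((disallows, False), (allows, True)):
        for pattern in rules:
            if pattern and pattern_to_regex(pattern).match(url_path):
                if len(pattern) > best_len or (len(pattern) == best_len and verdict):
                    best_len, allowed = len(pattern), verdict
    return allowed

print(is_allowed("/public/index.html", ["/public/"], ["/public/index.html"]))    # True
print(is_allowed("/public/contact.html", ["/public/"], ["/public/index.html"]))  # False
print(is_allowed("/photos/cat.gif", ["/*.gif$"], []))                            # False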

Crawl-delay Directive

Crawl-delay is a directive used to throttle the crawl rate of bots. It is not universally supported by all crawlers, but many honor it. The syntax is:

Crawl-delay: <number>

Where <number> is typically an integer (or a decimal in some cases) indicating the delay in seconds between successive requests to the website. For example:

User-agent: Bingbot
Crawl-delay: 10

This tells Bing’s crawler to wait 10 seconds between each request. The idea is to reduce server load by spacing out the hits. Smaller numbers mean faster crawling; larger means slower. A Crawl-delay of 1 means one request per second at most.

Important: Google’s crawlers do not support Crawl-delay​. Google instead manages crawl rate via its own systems (like Search Console settings). Bing, Yahoo, Yandex, and several other crawlers do respect it. As a result, you’ll often see Crawl-delay directives aimed at Bingbot or Yandex’s bot (often in those specific user-agent sections). If you’re writing a custom scraper, you can choose to respect crawl-delay as an act of courtesy.

Some points about Crawl-delay:

  • It applies per bot user-agent section. So a site could say Crawl-delay: 5 for one bot and a different number for another.
  • The unit (seconds) is common, though technically it could be fractional (e.g. “0.5” for half-second) – not all parsers handle decimals consistently, so usually integers are used.
  • If multiple bots share the same record (multiple User-agent lines in one block), a crawl-delay in that block applies to all of them.

If you’re implementing this in a scraper, it means after each request, you should sleep for that many seconds before the next request to that same site. This is especially relevant if you plan to bombard a site with many requests – a crawl-delay is the site’s way of saying “please go slow.”
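As a minimal sketch of that behavior (the URLs and the 10-second value here are illustrative, standing in for whatever the site’s Crawl-delay actually specifies):

import time
import requests

crawl_delay = 10  # seconds, e.g. from a "Crawl-delay: 10" line for our user-agent
urls_to_fetch = [
    "https://www.example.com/page1",
    "https://www.example.com/page2",
]

for url in urls_to_fetch:
    response = requests.get(url, headers={"User-Agent": "MyScraperBot/1.0"})
    print(url, response.status_code)
    time.sleep(crawl_delay)  # pause before the next request to the same site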

Related: Request-rate – In some cases, instead of (or in addition to) crawl-delay, you might see a Request-rate directive, typically used by Yandex and a few others. For example, Request-rate: 1/5 would mean 1 request per 5 seconds, while Request-rate: 5/1 (seen in some robots.txt files) means 5 requests per second (effectively the same as a crawl-delay of 0.2 seconds per request). Request-rate is less common, but if present, it provides a more granular way to indicate crawl speed. As a scraper, you can interpret it similarly (calculate the interval from it). Most modern sites stick to Crawl-delay since it’s simpler.
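If you do want to honor a Request-rate value, the pause between requests is just the seconds divided by the number of requests; a tiny sketch of that conversion:

def request_rate_to_interval(value):
    """Convert a Request-rate value like '1/5' (1 request per 5 seconds)
    into the number of seconds to wait between requests."""
    count, seconds = value.split("/")
    return float(seconds) / float(count)

print(request_rate_to_interval("1/5"))  # 5.0 seconds between requests
print(request_rate_to_interval("5/1"))  # 0.2 seconds between requests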

Note: Because Google ignores Crawl-delay, some webmasters choose not to use it (since Google is a major crawler). But others (like large news sites or e-commerce sites) include it to modulate bots like Bing, which might otherwise come too fast. If you respect crawl-delay in your scraper, you’re aligning with the more conservative approach, which is good for site friendliness.

Sitemap Directive

The Sitemap directive in robots.txt is used to specify the location of the site’s XML sitemap(s). An XML sitemap is a file (usually sitemap.xml) that lists URLs on the site along with metadata like last modified dates. By listing it in robots.txt, you make it easier for crawlers to find the sitemap. The syntax is simply:

Sitemap: <absolute URL to the sitemap>

Key points:

  • The URL should be absolute (include http:// or https:// and the full path). For example:
    Sitemap: https://www.example.com/sitemap.xml

  • You can include multiple Sitemap lines if your site has several sitemaps (common for very large sites that split sitemaps). E.g.:

    Sitemap: https://example.com/sitemap-main.xml
    Sitemap: https://example.com/sitemap-blog.xml
    
  • The Sitemap directive is not tied to user-agents. It’s a global directive, meaning you usually put it at the very top or bottom of the robots.txt, not inside a specific User-agent block (though parsers will typically find it anywhere). It applies to all crawlers that support sitemaps.

  • This directive was added later (mid-2000s) when XML sitemaps became popular. Search engines definitely pay attention to it. It does not restrict crawling; it’s just informational (it tells crawlers “hey, here’s where you can find a list of all my pages”).

For scrapers, the Sitemap directive can be a goldmine. If you find a sitemap URL in robots.txt, you might fetch that sitemap file (which is XML) to get a full list of the site’s URLs to scrape. This is often easier than crawling through links on the site to discover content. Many web scraping workflows use sitemaps as a starting point.
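For example, here is a rough sketch of pulling page URLs out of a sitemap discovered via robots.txt, assuming a plain (uncompressed) XML sitemap at an illustrative URL:

import requests
import xml.etree.ElementTree as ET

sitemap_url = "https://www.example.com/sitemap.xml"  # as listed on a Sitemap: line
response = requests.get(sitemap_url, headers={"User-Agent": "MyScraperBot/1.0"})

# Standard sitemaps use this XML namespace; <loc> elements hold the URLs.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(response.content)
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]

print(f"Found {len(urls)} URLs, for example:", urls[:5])
# Note: if this is a sitemap index, the <loc> entries point to further sitemap files.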

Instagram’s robots.txt, for example, lists a Sitemap line pointing to a compressed sitemap file. If you were scraping Instagram (hypothetically, since they disallow most scraping), that line tells you where to get a structured list of some of their URLs.

Not all sites list sitemaps in robots.txt (some rely on search engines finding sitemaps via other means or submission), but many do. It’s a best practice because any crawler that knows about robots.txt will then know about the sitemap too.

Other (Non-Standard) Directives

The above are the most common and important directives: User-agent, Disallow, Allow, Crawl-delay, Sitemap. These cover the vast majority of use cases. However, there are a few additional directives or parameters you might encounter in some robots.txt files. These are usually specific to certain search engines or have fallen out of favor. Here’s a quick overview for completeness:

Host

The Host directive is recognized primarily by Yandex. Its purpose is to specify the preferred domain if your site is accessible from multiple hostnames. For example, if example.com and www.example.com serve the same content, Yandex recommends indicating the main one via Host. Syntax:

Host: example.com

Only one Host directive should be present (Yandex will use the first if multiple). Other search engines typically ignore this directive. It’s essentially a hint for canonical host for Yandex’s benefit​. You’ll rarely see this unless the site specifically caters to Yandex or has mirror domains. If you do see it, you as a scraper generally don’t need to do anything with it – it’s more relevant to indexing logic than crawling.

Noindex

Noindex is an unofficial directive that some sites have used to try to prevent indexing of certain pages. Example:

User-agent: *
Noindex: /private-page.html

The intent here is to tell crawlers, “You can crawl this page but do not index it in search results.” However, Noindex in robots.txt is not part of the standard. Google does not support Noindex in robots.txt at all​.

Historically, Yandex did support it (and maybe still does for their engine). In 2019, Google explicitly clarified that it will ignore any “noindex” directives in robots.txt and that webmasters should use meta tags or HTTP headers for noindexing instead​.

So, if you come across Noindex lines, know that they’re meant for specific search engines. As a scraper, you might not need to worry about them, since they’re about search indexing rather than access. It could be an ethical consideration though – if a site says Noindex: /somepage, they likely consider that page sensitive enough to not want it listed publicly. But they didn’t outright disallow it (so crawling it is still permitted). Use your judgment in such cases.

Clean-param

This is a directive used by Yandex (and formerly by Ask.com) to manage URL query parameters. It looks like:

Clean-param: <param1>&<param2> /path/

It’s a bit complex in syntax, but basically it tells the crawler that certain query parameters can be ignored because they don’t affect the page’s content. For example, Clean-param: sessionid&ref /docs/ might indicate that for any URL under /docs/, the parameters sessionid and ref can be dropped (they might be tracking or session params) to avoid crawling multiple variants of the same page. This helps search engine crawlers avoid duplicate content.

If you’re scraping and you see a Clean-param, it’s a hint that you could ignore those parameters too, to avoid duplicating requests. But unless you specifically target Yandex or have duplicate URL concerns, you can usually ignore this. It’s not widely used these days.
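If you want to apply the same idea in your own crawler, here is a small sketch that strips the named parameters before queueing a URL (the parameter names are just examples):

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_params(url, ignored):
    """Drop query parameters that don't affect page content, so duplicate
    variants of the same page collapse to a single URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in ignored]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_params("https://example.com/docs/page?id=7&sessionid=abc&ref=x",
                   {"sessionid", "ref"}))
# https://example.com/docs/page?id=7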

Others

There are a few other historical or niche directives (like Request-rate which we mentioned, and deprecated ones like Visit-time to specify preferred crawling times, used by some old bots). These are rare. Modern bots mostly stick to the core ones.

To summarize the complete rule set you may see in a robots.txt:

  • User-agent – which bot the rules apply to.
  • Disallow – paths not to crawl.
  • Allow – paths that are ok to crawl (usually exceptions to disallow).
  • Crawl-delay – request interval guidance (seconds).
  • Sitemap – link to XML sitemap(s).
  • (Less common: Host, Noindex, Clean-param, Request-rate, etc., which are either search-engine-specific or deprecated.)

Now, let’s tie this together with a real example to see how these rules manifest in an actual robots.txt file.

Example: Instagram’s robots.txt

Instagram provides a good case study of a complex robots.txt because it has multiple sections for different bots. Below is an excerpt from Instagram’s robots.txt (simplified for brevity):

User-agent: Googlebot
Disallow: /api/
Disallow: /publicapi/
Disallow: /query/
Disallow: /logging/
Disallow: /qp/
Disallow: /client_error/
Disallow: /*__a=1*
Allow: /api/v1/guides/guide/

User-agent: Bingbot
Disallow: /api/
Disallow: /publicapi/
... (same disallows as above) ...
Disallow: /*__a=1*

User-agent: DuckDuckBot
Disallow: /api/
... (same disallows) ...
Disallow: /*__a=1*

User-agent: Yeti
Disallow: /api/
... (same disallows) ...
Disallow: /*__a=1*

User-agent: *
Disallow: /

Let’s break down what’s happening here:

  • There are specific records for Googlebot, Bingbot, DuckDuckBot (DuckDuckGo’s crawler), Yeti (Naver’s crawler), and finally a wildcard *. Each of the named bots has identical Disallow rules: they are blocking access to various internal API endpoints like /api/, /publicapi/, etc., as well as any URL containing __a=1​. In the Googlebot section, there is an Allow for /api/v1/guides/guide/ – presumably Instagram has some guides feature that they want Google to crawl even though it’s under “/api/”. This is that exception use of Allow we discussed (they open one specific path while the rest of /api/ is closed off to Googlebot). No other bot got that Allow line, so only Googlebot may access that path​.

  • The final User-agent: * section applies to all other bots not listed explicitly. It simply has Disallow: / for them​. Disallow: / under User-agent * means total disallow for every other crawler. In plain English, Instagram is saying: “All other bots, do not crawl our site at all.” This effectively blocks any scraper or lesser-known bot that isn’t Google, Bing, DuckDuckGo, etc. So if you write a new scraper with a unique User-Agent string, Instagram’s rules tell you that you’re not welcome to crawl (they gave no allowance in the * section).

     

  • Not shown in this snippet, but Instagram’s file also includes lines for other bots like Apple’s bots, Facebook’s external bot, etc., all with similar disallows. And it lists a Sitemap URL as well (which a scraper could use to find public profiles/posts, though if you’re disallowed by * you ethically shouldn’t fetch those pages).

For a developer reading this robots.txt:

  • If you are coding a scraper that pretends to be Googlebot (which is not ethical, by the way), you’d see you’re allowed everything except those API endpoints – but don’t do that. If you are any normal scraper with your own User-Agent, the User-agent: * rules apply and you’re disallowed from the whole site. So the correct action is to not scrape Instagram’s web pages (and indeed Instagram aggressively blocks scraping in practice).
  • A search engine like Google will follow its section: it will crawl the site but avoid any URLs that match those disallow patterns (so it won’t hit Instagram’s internal API calls, etc.).

This example illustrates the structure: multiple user-agent specific blocks, lots of disallows (some with wildcards), and a blanket disallow for all other bots. Many large sites have similar robots.txt – they allow Google and maybe a few others to crawl certain parts, and keep everyone else out.

Now that you know how to read the rules, the final step is learning how to use this information when building a crawler.

Using robots.txt in Web Scraping

Knowing how to parse a robots.txt is only useful if you actually use it to inform your web scraping. In this section, we’ll cover how to integrate robots.txt compliance into your scraping workflow.

At a high level, a robots-aware scraper should do the following:

  1. Fetch the robots.txt of the target site before crawling (as discussed earlier).
  2. Parse the robots.txt rules to find which directives apply to your scraping bot.
  3. Determine allow/disallow for each URL you intend to scrape, based on those rules.
  4. Crawl accordingly, skipping any disallowed URLs (and optionally respecting crawl delays).

Let’s break that down with a bit more detail:

  • Identify your scraper’s User-Agent: When you write a scraper, you typically set a User-Agent string for your HTTP requests (to identify your bot). For example, you might use something like "MyScraperBot/1.0" or "Mozilla/5.0 (compatible; MyScraperBot/1.0; +http://mywebsite.com/bot)". The content of that string is what robots.txt rules will be matched against. If your bot’s name isn’t specifically mentioned in the file, the User-agent: * rules apply. It’s good practice to use a unique User-Agent that includes contact info or website, and of course to honor robots.txt.
  • Parse the robots.txt content: You can write a simple parser or use existing libraries. The logic is: find the record (section) that matches your bot. That means, e.g., if your User-Agent is “MyScraperBot”, look for a “User-agent: MyScraperBot” section. If not found, find “User-agent: *”. Once you find the appropriate section, collect all the Disallow and Allow rules under it (until the next User-agent or end of file). Ignore comments (#) and unrelated sections. You then have the ruleset relevant to you.
  • Check URLs against rules: For each URL you plan to scrape, compare it to your disallow/allow patterns. If the URL path starts with any of the disallowed strings (after wildcard expansion logic) and is not overridden by a more specific allow, then do not scrape that URL. Skip it and maybe log why (for your own info). If it’s allowed, proceed to fetch it.
  • Handle crawl-delay: If the ruleset included a Crawl-delay: N, you should build a pause of N seconds between requests to the site. This can often be done by sleeping in your crawl loop for that domain.
  • Iterate politely: Continue to fetch pages, always respecting the rules. If you encounter a different domain during crawling, you need to fetch that domain’s robots.txt as well and repeat the process (each site has its own rules).

Using Python’s robotparser (urllib.robotparser) module

You don’t have to code the parsing from scratch. Python’s standard library includes a module urllib.robotparser which can handle robots.txt parsing for you. Here’s a basic example of how to use it:

import requests
from urllib.robotparser import RobotFileParser

# Step 1: Fetch robots.txt
site = "https://www.instagram.com"
robots_url = site + "/robots.txt"
rp = RobotFileParser()
rp.set_url(robots_url)
rp.read()

# Step 2: Choose a user-agent for our scraper
my_user_agent = "MyScraperBot"
# this should match your actual scraping User-Agent string

# Step 3: Check if a given URL is allowed
test_url = "https://www.instagram.com/explore/"
if rp.can_fetch(my_user_agent, test_url):
    # Allowed by robots.txt – proceed to scrape
    response = requests.get(test_url, headers={"User-Agent": my_user_agent})
    html = response.text
    # ... use BeautifulSoup or other parsing on html ...
    print("Fetched:", test_url)
else:
    print("Skipping disallowed URL:", test_url)

In this snippet:

  • We initialize a RobotFileParser and feed it Instagram’s robots.txt URL. The read() method downloads and parses the file for us.
  • We set my_user_agent to our bot’s name (make sure it’s the same string you use in your actual requests).
  • Then rp.can_fetch(user_agent, url) is used to ask: can this user_agent fetch this URL according to the robots.txt rules? It returns True (allowed) or False (disallowed). In our example, testing "https://www.instagram.com/explore/" with "MyScraperBot" will likely return False – because as we saw, Instagram disallows * (which includes our bot) from everything. Testing the same URL with the agent "Googlebot" might return True or False depending on the path.
  • Based on can_fetch, we either proceed with requests.get to fetch the page, or skip it.

The robotparser handles the details of finding the right record and matching the URL against the allow/disallow rules (note that the standard library’s support for wildcard patterns like /*.gif$ is limited, so its verdicts can differ from Google’s interpretation on such rules). It also provides methods to get the crawl-delay (rp.crawl_delay(user_agent)) if present, and site_maps() to get Sitemap URLs if listed (in Python 3.9+).
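For instance, continuing the earlier snippet, you could read those values right after rp.read() (what you get back depends entirely on the site’s actual robots.txt):

delay = rp.crawl_delay(my_user_agent)  # None if no Crawl-delay applies to us
sitemaps = rp.site_maps()              # None if no Sitemap lines are present

if delay:
    print(f"Pausing {delay} seconds between requests")
if sitemaps:
    print("Sitemaps listed:", sitemaps)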

Using such a library simplifies compliance. If you’re not using Python, there are similar libraries in other languages (for instance, Node has npm packages for robots parsing, Java has crawlercommons, etc.). Many scraping frameworks have this built-in too – e.g., Scrapy (a Python scraping framework) has a RobotsTxtMiddleware that you can enable, which will automatically fetch and respect robots.txt for you.

Respecting robots.txt in practice: In a scraping workflow, you might structure your code as follows (a compact sketch appears after the list):

  1. On startup or when encountering a new domain, fetch and parse robots.txt (cache it so you don’t fetch it repeatedly).
  2. For each URL to crawl, call a function allowed(url) that checks it against the parsed rules.
  3. Only fetch if allowed. If not, skip and maybe log “skipped due to robots.txt rule”.
  4. If Crawl-delay is set, incorporate time.sleep(delay) between requests to that domain.
  5. Continue until done.
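Putting those steps together, here is a compact sketch under the same assumptions as before (MyScraperBot is a placeholder user-agent, and the URLs are illustrative):

import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "MyScraperBot/1.0"
_robots_cache = {}  # one parsed robots.txt per domain

def robots_for(url):
    """Fetch and cache the robots.txt for a URL's scheme + domain."""
    base = "{0.scheme}://{0.netloc}".format(urlsplit(url))
    if base not in _robots_cache:
        rp = RobotFileParser(base + "/robots.txt")
        rp.read()
        _robots_cache[base] = rp
    return _robots_cache[base]

def polite_fetch(url):
    rp = robots_for(url)
    if not rp.can_fetch(USER_AGENT, url):
        print("Skipped due to robots.txt rule:", url)
        return None
    delay = rp.crawl_delay(USER_AGENT)
    if delay:
        time.sleep(delay)  # honor Crawl-delay before each request
    return requests.get(url, headers={"User-Agent": USER_AGENT})

for url in ["https://www.example.com/", "https://www.example.com/admin/"]:
    response = polite_fetch(url)
    if response is not None:
        print(url, response.status_code)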

Also, always identify your bot properly via the User-Agent header when scraping. Don’t pretend to be a browser or Googlebot just to evade rules – that defeats the purpose of robots.txt and can get you into trouble (some sites have defenses that specifically watch for such behavior). Honoring robots.txt is both the ethical choice and often the safest choice to prevent IP bans or legal notices.

Finally, note that robots.txt compliance is voluntary but can sometimes be tied to legal terms. Some websites’ Terms of Service require that bots follow their robots.txt. Ignoring it could then be not just a technical faux pas but a legal violation of the ToS. For example, the LinkedIn vs. hiQ Labs case highlighted the ambiguity in scraping legalities, but generally if you want to stay on the safe side, obey the rules the site has laid out.

By incorporating robots.txt checks, you ensure your crawler behaves politely. Ethically, it’s the right thing, and practically, it can save you headaches by avoiding crawling content that will just get you blocked. In short, it’s like the site’s way of saying “Here are the ground rules for crawling me.”

Conclusion

Always start your crawling project by finding and reading the target site’s robots.txt. Understand what parts of the site are open to you and any limitations on rate. Then design your scraper to play by those rules. This will make your web scraping more respectful, and reduce the chance of interfering with the website’s operations or getting into trouble. Happy (ethical) crawling!