RFC 9309! Robots.txt Quietly Became an Official Internet Standard

Pop quiz hotshot:

    Is robots.txt a formally accepted internet standard?

  • a) No
  • b) Yes

Most educated SEOs will say "no, it was never ratified by any internet standards body." That "fact" was a running joke in search: we all agreed it was the Not-Really-a-Standard Standard, more of a suggestion than a spec.

As I was digging through docs and LLMs on robots.txt for the big story on Differentiated Delivery (Cloaking Your Robots.txt), I discovered we were all wrong. It was indeed ratified, as RFC 9309. Here is the kicker: that was THREE YEARS AGO, in September of 2022! Say what!? A foundational piece of SEO lore changed and we all missed it!

Robots.txt was first proposed in 1994 as an internet standard. We all kinda gave up waiting for it to be ratified by anyone, anywhere. So I was a bit taken aback when I asked an LLM who wrote the original proposal for robots.txt and it informed me the protocol was standardized in 2022. I thought Google was throwing AI slop again. I was shocked - what, when, where, why did this happen in our space with no one reporting on it!?

Internet Engineering Task Force (IETF)
Robots Exclusion Protocol : RFC 9309

Yes, I remember reading about Google formalizing a robots.txt proposal to make it an official standard, but after that - like the previous 3 decades - nothing but dead air. Barry Schwartz covered it for SEL.

The great thing about an internet standard is that it clears up confusion between various implementations. Years ago, SEOs had issues with search engines that attempted to extend robots.txt syntax into new areas:


Three Decades? Sources of Controversy

Here is a brief history of the arguments, battles, and long-running confusion around "extended" robots.txt syntax. This has directly affected SEOs, bot developers, and site operators for almost thirty years.

1. Wildcards and pattern matching

The biggest street fight lasted for decades.

  • Google supported * and $
  • Bing interpreted these wildcard patterns differently
  • Other crawlers either ignored them or treated them in wildly unpredictable ways
  • Some site operators assumed regex-style patterns, which never existed in the protocol
  • Some parsing libraries bailed on nonstandard syntax that had worked in the wild for decades
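
The limited matching RFC 9309 eventually blessed is small enough to sketch: `*` matches any sequence of characters (including none) and `$` anchors a pattern to the end of the URL path, with no other regex features at all. A minimal Python translation (the function name is mine, not from any spec):

```python
import re

def rep_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate an RFC 9309 path pattern into a compiled regex.

    Only two special characters exist in the standard:
      *  matches any sequence of characters (including none)
      $  anchors the pattern to the end of the URL path
    Everything else is literal -- there is no full regex support.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

rule = rep_pattern_to_regex("/private/*.pdf$")
print(bool(rule.match("/private/reports/q3.pdf")))  # True
print(bool(rule.match("/private/q3.pdf.html")))     # False
```

This is why regex-style patterns like `/page[0-9]+/` silently failed: the brackets were treated as literal characters, not character classes.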

2. The Allow directive

  • Allow was never in the original 1994 proposal.
    It spread because Google needed finer control for crawling scripts, widgets, and dynamic directories. Other engines adopted it informally, but behavior was not consistent. Site operators often assumed Allow always had higher priority, but precedence logic depended entirely on the bot parsing the file.
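
RFC 9309 eventually settled the precedence fight: the most specific (longest) matching rule wins, and Allow beats Disallow on ties. A toy sketch of that logic, using literal prefix matching only (real parsers also handle the `*` and `$` wildcards):

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide crawlability per RFC 9309 precedence:
    the longest matching rule wins, and on a tie in length,
    Allow wins over Disallow. No matching rule means allowed.
    """
    best_len = -1
    allowed = True  # simplification: literal prefix match only
    for directive, pattern in rules:
        if path.startswith(pattern):
            if len(pattern) > best_len or (
                len(pattern) == best_len and directive == "allow"
            ):
                best_len = len(pattern)
                allowed = (directive == "allow")
    return allowed

rules = [("disallow", "/scripts/"), ("allow", "/scripts/public/")]
print(is_allowed("/scripts/public/widget.js", rules))  # True
print(is_allowed("/scripts/private/x.js", rules))      # False
```

Before the RFC, some parsers instead applied rules in file order, which is exactly why the same file could block a URL in one engine and allow it in another.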

3. Crawl-delay

This one produced heated arguments and created a long-standing myth that crawl-delay was universal.

Most SEOs today still think it works everywhere, even though it never did.
RFC 9309 intentionally excluded crawl-delay because there was no consistent real-world behavior to standardize.

  • Yahoo introduced it first, then Bing accepted it, while Google refused to support it
  • Then Cloudflare bot-management tools encouraged it even though half the crawlers ignored it

4. NoIndex in robots.txt

Google briefly honored Noindex: in robots.txt. This was never part of any formal syntax, and it caused confusion and broke portability.

  • Many sites used it thinking it worked across engines
  • Microsoft never supported it
  • Smaller bots ignored it
  • Google eventually deprecated and killed support for it, leaving a legacy of broken configs to clean up

5. Sitemaps directive

RFC 9309 still leaves Sitemaps out of the core protocol.

  • Sitemap: in robots.txt was also an extension.
  • Unlike other additions, this one saw broad adoption fast because it was simple and didn't conflict with existing rule interpretation.
  • Even so, it was not part of the 1994 spec and not formally defined until later documentation efforts.
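
Because Sitemap: sits outside any User-agent group and outside RFC 9309's formal grammar, parsers typically pull it out in a separate pass. A minimal sketch of that extraction (my own helper name, not from any library):

```python
def extract_sitemaps(robots_txt: str) -> list[str]:
    """Pull Sitemap: lines out of a robots.txt body.

    The directive lives outside any User-agent group and outside
    RFC 9309's formal grammar, so it is matched independently of
    the Allow/Disallow rule groups.
    """
    sitemaps = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps

example = """User-agent: *
Disallow: /tmp/
Sitemap: https://example.com/sitemap.xml
"""
print(extract_sitemaps(example))  # ['https://example.com/sitemap.xml']
```

The fact that it can be parsed this independently is part of why adoption was painless: it never interfered with rule-group interpretation.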

6. Bot-specific vendor directives

Search engines occasionally added instructions for their own crawlers. None of these were interoperable; they cluttered robots.txt files and misled operators into thinking they were standard.

  • Googlebot media directives
  • MSNbot parameters
  • Platform-specific crawl tuning fields
  • Amazon bot flags used for scraping and indexing of products

7. Nonstandard uses: Comments as instructions

Webmasters sometimes put rate limits, legal notices, bot threats, and terms of service directly in robots.txt, assuming crawlers would follow them. Crawlers treat these as comments at best, yet the practice still shows up on large sites.

8. Vendors encouraging nonstandard syntax

Over the years, CMS templates, SEO plugins, hosting panels, and managed-site providers shipped robots.txt files with all kinds of questionable foo. This created real damage, since malformed robots rules can unintentionally block entire sections of a site:

  • nonexistent directives and wildcard patterns other crawlers didn't understand
  • default Disallow rules that hurt indexing
  • assumptions based on Google behavior only

So ya, most people missed it. In September 2022, the Robots Exclusion Protocol behind robots.txt was published as an IETF Proposed Standard, RFC 9309. It formalizes syntax, parsing, caching rules, and clarifies that robots.txt is advisory guidance, not an access-control mechanism.

Key points clarified in the accepted standard:

  • Formal definition of syntax for robots.txt (plain text, UTF-8, line breaks CR/CRLF/LF) and exact location (root directory, same protocol/host/port) for the file.
  • Formal grammar for User-agent, Allow, and Disallow, including how groups are defined and terminated.
  • Rules for handling errors (HTTP status codes, caching the file) so crawlers know how to behave when robots.txt is unreachable.
  • Definition of which directives are standardized and what rules crawlers must support for interoperability.
  • Error handling: 4xx makes the file "unavailable," which permits crawling; 5xx makes it "unreachable," which requires crawlers to assume a full disallow until conditions clear.
  • Specification notes that the protocol is advisory only (not an access authorization mechanism).
  • Redirect handling: crawlers should follow up to five chained redirects for robots.txt.
  • Caching guidance: cache robots.txt, but do not rely on a cached copy longer than 24 hours unless the file is unreachable.
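
The error-handling and redirect rules reduce to a small decision table. A hedged sketch of how a crawler might map a robots.txt fetch result to behavior (naming is mine, not from the RFC):

```python
def crawl_policy(status: int, redirects_followed: int = 0) -> str:
    """Map a robots.txt fetch result to crawler behavior per RFC 9309.

    Returns one of: "parse", "follow-redirect", "allow-all", "disallow-all".
    """
    if 300 <= status < 400:
        # Crawlers should follow up to five chained redirects; beyond
        # that they may treat the file as unavailable (crawling allowed).
        return "follow-redirect" if redirects_followed < 5 else "allow-all"
    if 200 <= status < 300:
        return "parse"        # use the rules in the body
    if 400 <= status < 500:
        return "allow-all"    # "unavailable": crawling is permitted
    return "disallow-all"     # 5xx "unreachable": assume a full disallow

# Cached copies should not be trusted beyond 24 hours,
# unless the file is currently unreachable.
print(crawl_policy(404))  # allow-all
print(crawl_policy(503))  # disallow-all
```

The 4xx-vs-5xx split is the detail most worth remembering: a deleted robots.txt opens the site to crawling, while a server error temporarily closes it.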

Directives That Were Left Out?

  • Sitemap:
  • Crawl-delay:
  • Noindex:
  • Full regex patterns
  • Shell-style wildcards beyond the limited matching in RFC 9309
  • Vendor-specific directives (Google image rules, MSNbot parameters, Amazon flags, etc.)
  • Request-rate:
  • Visit-time:
  • Clean-param:
  • Host:
  • Any access-control syntax placed in robots.txt
  • Alternate discovery paths for robots.txt
  • Non-UTF-8 encoding rules
  • Directory-level HTML attempts to mimic robots.txt logic

It's not just SEO trivia

The standard arrived just as the storm of AI crawlers exploded. Publishers began using robots.txt to express rules for AI agents, then ran into two problems. First, RFC 9309 does not define AI-specific directives. Second, some AI crawlers ignore robots.txt altogether, which reduces its value as a control signal for training and synthesis (Reuters).

The IETF is discussing follow-ups. There are drafts to carry AI usage preferences in HTTP and to add training-specific directives as extensions. These are drafts, not finished standards, but they show where things are headed.

Backstory

In 2019 Google worked with Martijn Koster and others to push REP into the standards process and open-sourced its own parser. That kicked off the path that ended with RFC 9309 in 2022.

A file format or protocol only gets called an internet standard after it moves through the IETF process. The path is structured and slow on purpose, since the point is to make sure the new rule will behave the same across servers, clients, and vendors.

How something becomes a standard:

The short version of how something becomes a published standard through the IETF.

1. An individual or group submits an Internet Draft
This is the first public document. Anyone can write one. It is treated as a work in progress and expires unless renewed. Robots.txt stalled on several occasions at this step.

2. An IETF working group reviews it
If a working group adopts the draft, the group discusses technical details and edits it until there is rough consensus. There is no formal vote. Chairs judge when the draft is stable enough to move forward.

3. A Last Call invites broader IETF feedback
Engineers across the IETF community can point out security issues, conflicts with existing protocols, or unclear language or general foo. Again, robots.txt stalled at this point on two occasions.

4. The Internet Engineering Steering Group (IESG) evaluates it
Specialists check the draft for clarity, correctness, and interoperability. If they approve it, the document advances to publication.

5. The RFC Editor publishes it as an RFC
At this stage it becomes part of the official internet permanent record. Many RFCs stay at the Informational or Experimental level. Only a small number later reach the Internet Standard level.

6. Additional review for Standard status
To become a true Internet Standard, two things must be demonstrated:

  • Independent, interoperable implementations exist
  • The protocol has proven stable in practice (umm ya 1994 people)

Once these are confirmed, the RFC can be labeled an Internet Standard and receives an official STD (that's unfortunate eh?) number.

Robots.txt took a long path. It existed as a generally agreed upon de facto convention for decades before the community formalized it through this process and published it as RFC 9309 in 2022!

Robots.txt Standardization Timeline

1994 Origin

Martijn Koster proposes the Robots Exclusion Protocol after early crawlers hammer servers. The community adopts it quickly because it solves a real operational problem: server load.

1994 to mid-2010s De facto convention
Search engines, academic crawlers, and commercial bots all support robots.txt, but no single documented rule set exists.
Every major crawler interprets edge cases differently, and directives beyond Allow and Disallow emerge informally.

2019 Google publishes REP draft
Google posts a public specification for the Robots Exclusion Protocol to document how Googlebot interprets robots.txt. This helps align behaviors across engines and prepares the ground for formal work.

2020 to 2022 Draft submitted to the IETF

A formal Internet Draft called draft-koster-rep is submitted.
It goes through several revisions as engineers reconcile long-standing interpretations and real crawler behavior.

Key tasks in this stage

  • Clean up decades of wildly inconsistent behavior
  • Clarify encoding and line-handling rules for bots
  • Resolve how crawlers should react to redirects, error codes, missing files and foo
  • Establish which directives are core and which are way out of scope

Early 2022 Working group and community consensus

Because robots.txt already had global adoption, the IETF process had to work backwards from real usage. This required a consensus point that reflected what servers and crawlers already did in the field. The community focused more on documenting established practice than inventing something new.

September 2022 RFC 9309 published

The IETF formally publishes RFC 9309, titled Robots Exclusion Protocol.
This gives robots.txt an official place in the standards catalog for the first time, 28 years after its 1994 debut!

Post-2022 Adoption and alignment

Most major crawlers already complied with the specification.
The standard now provides a reference for anyone building new crawlers, bot mitigation tools, or website auditing systems.


What changed for SEOs and site operators

  • You now have an authoritative reference when diagnosing crawl problems
  • Line parsing rules are no longer meme-grade tribal knowledge
  • Redirect chains, file size quirks, error issues, and failures now have defined, required behavior
  • Crawler authors have no excuse for inconsistent or ambiguous handling
  • Tools like validators and site auditors can benchmark behavior directly against RFC 9309

Further Robots.txt Reading