Cloudflare's recent post calls out Perplexity for operating undeclared web crawlers, quietly sidestepping robots.txt while masquerading behind a stock Chrome user-agent string. Cloudflare's tone hints at alarm, but if you've been running a server for any length of time - or spent even five minutes on WebmasterWorld - you already know the score.
If this all feels like deja vu, you're on the right track: just last year, WIRED and Forbes both called out Perplexity for doing the same thing.
Undeclared Crawlers? Yawn.
This behavior isn't novel. It's not rare. It's not even especially clever. At WebmasterWorld, we've tracked rogue bots for decades. What Perplexity is doing is more the norm than the exception. In the last two years, we have blocked:
- Over 10,000 IPs
- Nearly 100 Class C IP blocks (/24 ranges)
- More than 50 different fake User-Agent headers
- Entire countries' IP ranges (sorry, Brazil, Taiwan, China - you've got some cleaning up to do)
These crawlers are often aggressive and recursive, sometimes using headless browsers or rotating proxies, and they don't stop when you ask nicely. Some even figured out that we require login after 30 page views and simply rotated IPs once they hit that limit. This has been the reality of operating a public-facing website since at least the early 2000s.
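For the curious, a view-gate like ours takes minutes to sketch. Here's a minimal, hypothetical Python/Flask version - not our actual code - keyed on client IP, which is exactly why IP rotation defeats it:

```python
# Minimal sketch of an IP-keyed page-view limit (hypothetical, not
# WebmasterWorld's real implementation). The weakness is the point:
# rotate the source IP and the counter resets - exactly what the
# bots described above did.
from collections import Counter

from flask import Flask, redirect, request

app = Flask(__name__)
views = Counter()        # in production this would live in Redis or similar
PAGE_VIEW_LIMIT = 30     # the threshold mentioned above

@app.before_request
def enforce_view_limit():
    if request.path == "/login":
        return None      # never gate the login page itself
    views[request.remote_addr] += 1
    if views[request.remote_addr] > PAGE_VIEW_LIMIT:
        return redirect("/login")

@app.route("/login")
def login():
    return "Log in to keep reading."

@app.route("/")
def index():
    return "Public page."
```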
The Illusion of Robots.txt
Hey, let's get something straight: robots.txt is a voluntary suggestion, not a legal wall. It spent nearly three decades without a formal standard behind it - the IETF only published RFC 9309 in 2022, and even that merely codified existing practice. The syntax, such as it is, was unilaterally expanded by Google over time: no consensus, no standards body, just a loose gentleman's agreement bent to fit Google's corporate needs and greed.
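And honoring it is entirely a client-side choice. Python even ships a courtesy parser in the standard library; a quick sketch (example.com and the agent name are placeholders):

```python
# robots.txt only constrains crawlers that volunteer to check it.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# A polite crawler asks first ("MyBot/1.0" is a made-up agent name)...
if rp.can_fetch("MyBot/1.0", "https://example.com/private/page"):
    print("allowed - fetch away")
else:
    print("disallowed - a polite bot moves on")

# ...an impolite one just issues the GET anyway. The file can't stop it.
```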
So when Cloudflare frames this as a surprising breach of protocol, they're handwaving a bit. If someone puts up a sign that says "Do Not Read," and then broadcasts that sign on the world's most public platform, they shouldn't be shocked when people read it. That applies to humans, scrapers, crawlers, and curious researchers alike.
And let's pause for a moment and remember that Cloudflare runs a service that blocks bots. <sarcasm mode>Nope, no conflict of interest there. Nothing to see here, move along, move along.</sarcasm>
That is not to say Cloudflare isn't right here. We are saying: so what if they are? Let's be real - in the era of unregulated bots, Cloudflare is seizing the moment (with both hands, all the way to the bank).
"Rendering" JavaScript, Allegedly
Meanwhile, let's talk about another long-standing myth: Googlebot and JavaScript rendering - a claim which, in turn, would suggest Google itself crawls with stock agent names too. (Bruh, it's anybody's guess.)
Google's techs have claimed for years that Googlebot downloads and renders JavaScript, even supporting single-page apps and complex frameworks. Google even went on a big publicity blitz about updating Googlebot to an evergreen version of Chrome. Hello? Yet here we are in 2025, and no one's ever observed a Googlebot IP triggering AJAX or dynamically loading JSON endpoints in the wild. In our server logs, there's no GET request corresponding to a secondary API call - not ever, from a Google-owned IP. No fetch(), no XMLHttpRequest, no sign of hydration. WTF - Navboost all over again?
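Don't take our word for it - the test is easy to run against your own logs. A rough sketch (the log path and combined-log format are assumptions; Googlebot verification follows Google's documented reverse-then-forward DNS check):

```python
# Sketch: scan an access log for *verified* Googlebot requests hitting
# JSON/XHR-style endpoints. Log path and combined-log format are assumed.
import re
import socket

LOG_PATH = "/var/log/nginx/access.log"  # adjust to your setup
LINE = re.compile(r'^(\S+) \S+ \S+ \[.*?\] "(?:GET|POST) (\S+)[^"]*" .*"([^"]*)"$')

def is_verified_googlebot(ip, _cache={}):
    """Reverse DNS must land in googlebot.com/google.com, and the
    forward lookup of that name must resolve back to the same IP."""
    if ip not in _cache:
        verified = False
        try:
            host = socket.gethostbyaddr(ip)[0]
            if host.endswith((".googlebot.com", ".google.com")):
                verified = ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            pass
        _cache[ip] = verified
    return _cache[ip]

with open(LOG_PATH) as f:
    for line in f:
        m = LINE.match(line)
        if not m:
            continue
        ip, path, ua = m.groups()
        looks_like_xhr = path.endswith(".json") or "/api/" in path
        if looks_like_xhr and "Googlebot" in ua and is_verified_googlebot(ip):
            print(ip, path)  # per the logs described above: crickets
```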
If you rely on JavaScript to deliver core content, you're not cloaking like we used to, but you are certainly gambling.
Lastly, let's recall that half the web opted out of Google AI training, and Google trained anyway.
You Can't Block Bots Without Hurting Users
Fully separating bots from real users is, today, impossible. Here's why:
- Headless Chrome is indistinguishable from real Chrome when done right (see the sketch at the end of this section).
- Most bot operators don't declare themselves.
- Many popular anti-bot solutions (including Cloudflare's) already block legitimate browsers. Case in point: Opera, when reporting itself as Chrome, is outright blocked by Cloudflare.
- We hear Perplexity's own Comet browser has been wired up as a click bot to burn down competitors' ad budgets.
You can play whack-a-mole with IPs, ASNs, headers, and behavior, but at some point, you'll either miss the bot or lock out a real person. There are no current tools that reliably block all crawlers while letting all browsers through. It's not a solved problem. It's not even a well-framed one.
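On that first point, see for yourself how little it takes for a headless browser to present as stock Chrome. A sketch using Playwright, with an illustrative UA string:

```python
# Sketch: headless Chromium wearing a stock Chrome user agent. Put a
# residential proxy in front of it and, at the HTTP level, it is
# essentially indistinguishable from a real user.
from playwright.sync_api import sync_playwright

STOCK_CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"  # illustrative version number
)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(user_agent=STOCK_CHROME_UA)
    page = context.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```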
What This Means for SEOs
For SEOs and site owners, here's the takeaway:
- Assume your content is being scraped. It probably is.
- Don't rely on robots.txt to protect anything sensitive. It's not even a speed bump, let alone a gate.
- If you publish content on the open web, you are not giving it away - you are broadcasting it.
- If you want privacy, don't publish.
- If you want search traffic, accept the cost. You'll sleep better at night.
The fact that AI companies are scraping sites is not new. The fact that they do it stealthily is not new. The only thing that's new is Cloudflare deciding to spotlight it.
Bottom line: This isn't a crisis. It's Monday.
And if you're not already tracking rogue bots, fake crawlers, and bad actors - start. Just don't act surprised.
Let's stop pretending the web is polite. It never was.