Bots vs. Your Website: Which Ones to Allow and Which to Block?

Artificial intelligence is advancing rapidly, forcing website owners to confront a big question: which AI crawlers should you exclude from your site via robots.txt, and which should you allow? With more and more bots crawling the web every day, the answer is more relevant than ever. Whether a bot is used for training language models, for real-time responses by AI assistants and search engines, or for autonomous task execution, each use case brings its own opportunities and challenges. Which bots you allow and which you block should depend on factors such as website size, business model, and the value of your content – what's needed is a nuanced, strategic approach. This guide will help you make informed decisions for your website based on the latest insights into the constantly evolving landscape of AI crawlers.

What different types of AI bots are there?

The landscape includes dozens of different AI bots, with new ones launching regularly. Most bots fall into just three main categories, each with distinct objectives. We are going to walk you through them and what they do.

LLM/AI data scrapers

LLM/AI data scrapers are bots that systematically search websites and collect their content for training large language models (LLMs) or other AI applications. Their main purpose is to extract as much information from the web as possible to improve the performance and knowledge of AI models. These bots don't index content for traditional search engines but rather store it as training data for AI systems to later generate text or answer questions when prompted.

The most well-known LLM/AI data scraper bots include GPTBot (OpenAI), CCBot (Common Crawl), Applebot-Extended (Apple), and ClaudeBot (Anthropic).

List of major LLM crawlers

Each entry below lists the bot, its provider, what it does, and an example robots.txt snippet (Allow to permit crawling, Disallow to block it).

  • GPTBot (OpenAI) – Collects web content for training OpenAI language models (e.g., GPT-4o, GPT-5)
    User-agent: GPTBot
    Allow: /
  • CloudVertexBot (Google) – Lets website owners request crawls for AI training purposes on the Vertex AI platform
    User-agent: CloudVertexBot
    Allow: /
  • ClaudeBot (Anthropic) – Anthropic's bot for collecting training data for Claude models
    User-agent: ClaudeBot
    Allow: /
  • CCBot (Common Crawl) – Collects web data for the Common Crawl project, which is the foundation for many AI training datasets
    User-agent: CCBot
    Allow: /
  • FacebookBot (Meta) – Facebook-specific bot for content analysis and AI training
    User-agent: FacebookBot
    Allow: /
  • Meta-ExternalAgent (Meta) – Meta bot for data collection for AI models and Meta products
    User-agent: meta-externalagent
    Allow: /
  • DeepseekBot (DeepSeek AI) – Collects web content for training DeepSeek language models and AI applications
    User-agent: DeepseekBot
    Allow: /
  • Applebot-Extended (Apple) – Extended Applebot that collects content for AI training and Siri
    User-agent: Applebot-Extended
    Allow: /
  • Bytespider (ByteDance/TikTok) – ByteDance (TikTok) crawler that collects data for AI models and search services
    User-agent: Bytespider
    Allow: /
  • cohere-training-data-crawler (Cohere) – Collects training data for Cohere LLMs
    User-agent: cohere-training-data-crawler
    Allow: /
  • PanguBot (Huawei) – Collects training data for the multimodal LLM PanGu
    User-agent: PanguBot
    Allow: /
  • Timpibot (Timpi) – Builds an index that can be used for training LLMs
    User-agent: Timpibot
    Allow: /
  • AI2Bot (Ai2) – Collects data for training open-source AI models
    User-agent: AI2Bot
    Disallow: /
  • Diffbot (Diffbot) – Specializes in structured extraction of web data for AI and analytics
    User-agent: Diffbot
    Disallow: /
  • Omgili (Webz.io) – Maintains a repository of web crawl data sold to other companies, including for AI training
    User-agent: Omgili
    Disallow: /
  • Webzio-Extended (Webz.io) – Maintains a repository of web crawl data sold to other companies, including for AI training
    User-agent: Webzio-Extended
    Disallow: /
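The single-bot snippets above can be combined into one blocking policy. Here's a minimal robots.txt sketch that refuses the major training crawlers while leaving all other bots untouched – the selection of user agents is illustrative, so adjust it to your own policy (note that robots.txt allows grouping several User-agent lines over one shared rule set):

```
# Block major LLM training crawlers; all other bots keep default access
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: meta-externalagent
Disallow: /
```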

AI search & assistants

AI search & assistants work differently. This category includes bots that perform real-time queries of web content – both for AI assistants like ChatGPT, Claude, or Meta AI, and for AI-powered search engines like Perplexity, You.com, or ChatGPT's search function. Technically, the two functions are almost identical: they're triggered by user requests, they selectively access current web content, and they usually provide direct source attribution. Whether a person asks a question in ChatGPT or enters a search query in Perplexity, the underlying process is the same – the AI retrieves specific information from the web and then credits the original source. This is a win-win situation: the server load remains relatively low, while your website gets the opportunity to be cited as a source in AI-generated responses or search results.

List of AI search & assistants

  • ChatGPT-User (OpenAI) – Bot for user-initiated web queries in ChatGPT (browsing mode)
    User-agent: ChatGPT-User
    Allow: /
  • OAI-SearchBot (OpenAI) – Bot for search queries and real-time indexing for ChatGPT and other OpenAI products
    User-agent: OAI-SearchBot
    Allow: /
  • Google-Extended (Google) – Control token honored by Googlebot (not a separate crawler) that governs whether your content may be used for Gemini training and AI grounding; it doesn't affect traditional web search
    User-agent: Google-Extended
    Allow: /
  • BingBot (Microsoft) – Standard Bing crawler that powers Bing Search and supplies Bing Chat (Copilot) with content
    User-agent: BingBot
    Allow: /
  • PerplexityBot (Perplexity AI) – AI search bot for the Perplexity search engine and real-time AI responses
    User-agent: PerplexityBot
    Allow: /
  • Perplexity-User (Perplexity) – Loads a page only when a user clicks a Perplexity citation; treated as human-like traffic and may bypass robots.txt
    User-agent: Perplexity-User
    Allow: /
  • anthropic-ai (Anthropic) – Real-time query bot for the Claude AI assistant
    User-agent: anthropic-ai
    Allow: /
  • Claude-SearchBot (Anthropic) – Search bot for real-time indexing and AI responses with Claude
    User-agent: Claude-SearchBot
    Allow: /
  • Claude-Web (Anthropic) – Bot for real-time web queries by the Claude assistant
    User-agent: claude-web
    Allow: /
  • Meta-ExternalFetcher (Meta) – Real-time bot from Meta for AI assistants and search
    User-agent: Meta-ExternalFetcher
    Allow: /
  • MistralAI-User (Mistral) – Retrieves citations for Le Chat
    User-agent: MistralAI-User
    Allow: /
  • Amazonbot (Amazon) – Search and data bot for Amazon products and AI applications (e.g., Alexa)
    User-agent: Amazonbot
    Allow: /
  • Applebot (Apple) – Traditional search and indexing bot for Siri and Apple services
    User-agent: Applebot
    Allow: /
  • DuckAssistBot (DuckDuckGo) – Collects fact-based snippets for DuckAssist, DuckDuckGo's private AI answer feature
    User-agent: DuckAssistBot
    Allow: /
  • YouBot (You.com) – Search and AI bot for the You.com search engine
    User-agent: YouBot
    Allow: /

AI agents: the next generation of autonomous bots

AI agents are a relatively new type of bot. They’re capable of executing complex, multi-step tasks on the web, often by simulating a web browser and interacting with websites like a human would. They're designed to act like personal assistants, filling out forms, booking travel, or gathering specific information on behalf of users. While crawlers primarily collect data – and assistants only respond to direct queries – these agents can take initiative and make decisions to achieve a goal. Naturally, this new generation of bots has already created opportunities and challenges for website owners.

List of AI agents

  • ChatGPT-Operator (OpenAI) – AI agent that can use its own browser to perform tasks for users, such as filling out forms or booking travel
    User-agent: ChatGPT-Operator
    Allow: /
  • GoogleAgent-Mariner (Google) – AI agent that intelligently navigates and interacts with websites to perform tasks on behalf of human users; part of Project Mariner
    User-agent: GoogleAgent-Mariner
    Allow: /
  • Google-Shopping-AI (Google) – Agentic shopping features that allow shoppers to see product images and AI-powered guidance, use price tracking, and facilitate payments
    User-agent: Google-Shopping-AI
    Allow: /
  • Deepseek Agent (DeepSeek AI) – AI agent for browser automation and autonomous task execution on the web, including form filling and automated interactions
    User-agent: Deepseek/1.0
    Allow: /
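Because robots.txt rules are path-based, agents don't have to be admitted or banned wholesale. A sketch that lets an agent browse product pages while keeping it out of transactional areas – the paths and the choice of agent are hypothetical placeholders, so substitute your own site structure:

```
# Hypothetical paths – adjust to your own site structure
User-agent: ChatGPT-Operator
Allow: /products/
Disallow: /checkout/
Disallow: /account/
```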

Which AI bots should I allow, and which should I block from crawling?

Different bot types have different goals:

  • AI Search & Assistants can directly cite your website or brand as a source, thus contributing to visibility.
  • LLM crawlers are designed to collect large amounts of web content for training language models – often without direct attribution.
  • AI Agents perform specific tasks and can present both opportunities and risks for website owners.

Understanding these goals will help you decide which AI bots should be allowed to crawl your site and which should be blocked.

Why you should exclude LLM crawlers

A clear advantage of excluding LLM crawlers is that it helps protect exclusive or paid content. For example, if you publish unique articles on your website, you probably want to prevent your content from being used without permission as training data for AI models and possibly resurfacing without proper citation. If you don't want your texts, data, or media to end up in large AI models and potentially be reused without attribution, blocking LLM crawlers can help you maintain control.

LLM crawlers can also significantly increase server load. For those who operate large online stores or extensive content sites, this additional load can quickly have a noticeable impact: loading times for real users increase, the site's overall performance suffers, and ongoing server costs rise. In some cases, these crawlers "get lost" and crawl countless irrelevant pages, exacerbating the problem of high server requests.

For these reasons, it's wise to regularly monitor AI bot behavior through log file analysis and to act when necessary – for example, by selectively blocking certain AI crawlers through your robots.txt file to protect your website's performance while avoiding unnecessary costs.
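As a starting point for that kind of log file analysis, here's a minimal Python sketch that tallies requests from known AI crawlers in a standard web server access log. The list of user-agent substrings and the sample log lines are illustrative assumptions – extend them to match the bots you care about and your actual log format:

```python
# Minimal sketch: count AI-crawler hits in an access log by user-agent substring.
# The bot names below are illustrative – extend the list to match your policy.
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot"]

def count_bot_hits(log_lines):
    """Return a Counter mapping AI-bot name -> number of requests seen."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
    return hits

# Hypothetical sample lines in Combined Log Format
sample = [
    '1.2.3.4 - - [01/Jan/2025] "GET /article HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [01/Jan/2025] "GET /shop HTTP/1.1" 200 999 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '9.9.9.9 - - [01/Jan/2025] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (regular browser)"',
]
print(count_bot_hits(sample))
```

In practice you would feed it your real access log (e.g., `open("/var/log/nginx/access.log")`) and review the counts before deciding which crawlers are worth blocking.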

Who should consider excluding LLM crawlers?

Excluding LLM crawlers is especially recommended for:

  • Large online stores with many (over 100,000) URLs, where server load can noticeably increase due to massive crawling and where content is not particularly unique
  • Publishers and content platforms whose business model is based on exclusive content and who want to prevent their content from flowing uncontrollably into AI models
  • Companies with sensitive data that publish copyrighted information and want to maintain control over its use

Arguments against excluding LLM crawlers

Allowing LLM crawlers may mean your content appears in AI models, which can indirectly contribute to your reach and brand awareness. This presence in AI systems can help establish your brand as a knowledge source, especially for very unique or high-quality content, even if a direct citation isn't always guaranteed. For smaller websites or those without sensitive or exclusive content, the additional server load is usually manageable, and the potential benefits often outweigh the costs. If brand awareness is crucial for your business model, then allowing crawling from all bots may make sense.

Recommendations for AI search & assistants

AI search & assistants selectively access website content when users ask specific questions, enter search queries, or use prompts. This generally keeps server load low while still giving websites and brands the opportunity to be mentioned as sources in AI-generated responses or search results. These bots also usually directly cite their sources, which is valuable for visibility and traffic. For most companies and brands, allowing these bots to crawl is advisable. It can secure visibility in modern search systems and AI applications, and the benefits of direct reach and brand awareness usually outweigh the disadvantages.

Handling AI agents

AI agents are a challenge all of their own. They can perform autonomous tasks, which opens up some exciting new possibilities for users. However, it also means that they can perform unwanted automated actions on websites. How you decide to deal with AI agents should be based on the type of website and the services you offer. E-commerce sites might benefit from shopping agents, while other websites might want to be more restrictive.

Why isn’t robots.txt always enough to block AI crawlers?

Using robots.txt to block AI crawlers is a standard approach. However, the case of Perplexity AI has shown that this alone isn't always sufficient. Recently, Cloudflare accused Perplexity of using stealth crawling to deliberately bypass website restrictions. According to Cloudflare, Perplexity has been ignoring robots.txt instructions and using various methods to obscure its crawling identity and access blocked websites.

The accusations against Perplexity AI

The case can be summarized as follows:

  • Circumventing robots.txt: Cloudflare, a leading provider of web infrastructure and security, accused Perplexity of not respecting robots.txt files that prohibit website crawling. Instead of honoring the robots.txt file restrictions, Perplexity allegedly changed its tactics to continue accessing content.
  • Identity concealment: It's alleged that Perplexity configures its bots to conceal their true identity when attempting to scrape websites. This is done to circumvent website restrictions based on recognizing specific user agents.
  • Using non-public IP addresses: Perplexity allegedly uses unpublished IP addresses to access restricted content, raising ethical and legal concerns.
  • Customer complaints: Cloudflare's accusations are based partly on complaints from customers who found that Perplexity crawled their websites despite explicit blocking attempts and partly on their own study.

What was Perplexity’s response to the Cloudflare allegations? They stated that Cloudflare's systems are "fundamentally inadequate" and can’t distinguish between legitimate AI assistants and actual bots. They deny the accusations and view Cloudflare's allegations as "embarrassing mistakes." True or not, the allegations certainly raise questions for website owners.

Cloudflare's response: proactive measures against AI crawlers

In response to this issue, Cloudflare has taken steps to give website owners more control over AI crawlers and address challenges from bots that disregard robots.txt rules. Their measures include several initiatives:

  • One-click AI crawler blocking: In September 2024, Cloudflare introduced an option that allows customers to block AI crawlers with a single click. Since then, over one million customers have used this feature.
  • Default AI crawler blocking: Since July 2025, Cloudflare prevents AI bots from crawling customer websites by default. Website owners using Cloudflare must actively consent if they want to allow AI crawlers access to their content.
  • "Pay per Crawl" program: Cloudflare has introduced a new program called "Pay per Crawl." This allows website owners to charge AI companies fees for accessing their content. The intention is to give content creators a way to profit from their data being used by AI models.
  • AI training control with managed robots.txt: Cloudflare offers tools that make it easier for website owners to control the use of their content for AI model training. This includes the ability to manage robots.txt files to tell AI bot operators which content they may crawl and which they may not.
  • Managed rules for AI bots: Cloudflare has implemented managed rules that block bots categorized as AI crawlers (AI Bots). These rules aim to prevent unwanted scraping without affecting traditional search engine crawlers.
  • robots.txt policy enforcement: Cloudflare is working to improve enforcement of robots.txt policies. They offer features that allow website owners to quickly identify which AI services comply with their robots.txt policies and which don’t and can then take action to enforce these policies.

What does this mean for website owners?

This controversy and Cloudflare's response highlight several important points:

  1. robots.txt alone isn't enough: As the Perplexity example shows, some bots can circumvent or ignore robots.txt rules. Website owners must rely on additional protective measures.
  2. Infrastructure providers are taking steps: Companies like Cloudflare are developing specialized tools to give website owners better control. These tools go far beyond what's possible with robots.txt alone.
  3. Transparency becomes more important: The ability to recognize which bots actually follow robots.txt rules is a crucial factor for effective crawler management.
  4. Legal and ethical questions: The debate raises important questions regarding the ethics of data scraping, copyright, and the need for transparent practices in the AI industry.
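One common "additional protective measure" beyond robots.txt is to refuse unwanted user agents at the web server itself, so the rule is enforced rather than merely requested. A sketch for nginx, assuming GPTBot, CCBot, and Bytespider are the agents you want to refuse (this only stops bots that identify themselves honestly):

```
# nginx sketch (inside a server block): return 403 to selected AI user agents
# even if they ignore robots.txt. Bot list is illustrative.
if ($http_user_agent ~* (GPTBot|CCBot|Bytespider)) {
    return 403;
}
```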

Cloudflare emphasizes that its systems aim to give website owners control over their content and promote transparency when dealing with AI crawlers. This shows that the industry is aware of the problem and is actively working on solutions.

Conclusion: your AI bot strategy must be unique

Deciding which AI bots to block and which to allow isn't a simple yes/no question. The answer depends on your individual goals, your content strategy, and the technical infrastructure of your individual website. A balanced strategy could look like this:

  • LLM crawlers: Selective blocking based on content type and business model
  • AI search & assistants: Recommended for most websites for direct attribution and visibility
  • AI agents: Individual assessment depending on website type and security requirements
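As a sketch, the balanced strategy above might translate into a robots.txt like this – the specific bots and paths are illustrative assumptions, not a recommendation for every site:

```
# Block LLM training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

# Explicitly allow AI search & assistant bots
# (a bot with no matching record may crawl everything anyway)
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /

# Restrict AI agents to public content (hypothetical path)
User-agent: ChatGPT-Operator
Disallow: /account/
```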

The good news is that you are in the driver’s seat. The AI landscape is constantly evolving, and the list of active bots is only going to get longer, but tools like log file analyses can give you real data about actual bot traffic so that you can make decisions based on facts, not guesswork.

Want to stay in control? Regularly review and adjust your robots.txt configuration so that you’re the one who shapes how AI interacts with your content.