Brandswarm
robots.txt for AI search: the 2026 cheat sheet (GPTBot, ClaudeBot, and the rest)

Every AI crawler user-agent that matters in 2026, what they do, whether to allow them, and a copy-pasteable robots.txt for a Brandswarm-style 'maximum AI visibility' policy. Plus the Content-Signal compromise for brands that want to allow retrieval but block training.

Brandswarm Team · 7 min read

Your robots.txt is the first place AI crawlers look when they arrive at your site. Get it wrong and you're invisible to ChatGPT, Claude, Perplexity, Gemini, and AI Overviews regardless of how good your content, schema, or backlinks are. Get it right and the cost is zero — it's just a text file.

This is the cheat sheet. Every AI crawler that matters in 2026, whether to allow them, and a copy-pasteable robots.txt file you can drop in today.

The user-agents that matter

| User-agent | Operator | What it does | Allow? |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Trains future models. Does NOT do real-time retrieval for ChatGPT. | Yes: long-term visibility through training data |
| OAI-SearchBot | OpenAI | Retrieval for ChatGPT search / SearchGPT. | Yes: direct ChatGPT visibility |
| ChatGPT-User | OpenAI | Used when a user invokes ChatGPT's browsing tool. Fetches a single URL. | Yes: required for browsing |
| ClaudeBot | Anthropic | Crawl for Claude (training + retrieval). | Yes: direct Claude visibility |
| Claude-Web / anthropic-ai | Anthropic | Older / alternate user-agent variants. | Yes: same reason |
| Google-Extended | Google | robots.txt control token for Gemini training use. Separate from Googlebot; it sends no requests of its own. | Optional: yes if you want training inclusion |
| Googlebot | Google | Powers regular Google search + AI Overviews. Do not block. | Always yes |
| PerplexityBot | Perplexity | Retrieval for Perplexity answers. | Yes |
| Perplexity-User | Perplexity | Fetches single URLs when users follow Perplexity links. | Yes |
| Bytespider | ByteDance | Crawls for Doubao / TikTok AI features. | Yes if you have TikTok/APAC audience |
| Amazonbot | Amazon | Powers Alexa / Q / Amazon AI features. | Optional |
| Applebot-Extended | Apple | Control token for Apple Intelligence training use. Separate from Applebot (search); it does no crawling of its own. | Optional |
| Applebot | Apple | Powers Spotlight + Siri suggestions. | Always yes |
| meta-externalagent | Meta | Crawls for Meta AI training. | Optional |
| CCBot | Common Crawl | Open-source crawl used as training data by many models. | Optional: wide influence |
| Bingbot | Microsoft | Regular Bing search + ChatGPT browsing tool retrieval. | Always yes |
| DuckAssistBot | DuckDuckGo | Powers DuckDuckGo's AI Assist. | Yes |
| Diffbot / BrandBot / etc. | Various | Niche crawlers used by enterprise AI tools. | Optional: minor traffic |

Quick decision: 3 policies that cover 95% of cases

Policy A: maximum AI visibility (recommended for SaaS, content brands, B2B)

# Maximum AI visibility. Allows training + retrieval for all major engines.
# Private/auth surfaces stay blocked for every crawler.
# One wildcard group only: some parsers honor just the first User-agent: * block.
User-agent: *
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Disallow: /accounts/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

This is the right policy if your business benefits from being discovered in AI answers. Almost every SaaS, B2B company, and brand that sells anything falls into this category. The wildcard User-agent: * applies to every crawler that has no more specific group of its own, which here includes all the AI ones.
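One way to sanity-check a policy like this before deploying it is Python's standard-library urllib.robotparser. The sketch below is illustrative only: POLICY_A and the yourdomain.com URLs are placeholders, and is_allowed is a hypothetical helper. The stdlib parser applies the first matching rule rather than the longest match, so the Disallow lines are listed before Allow: / to get the same answer under both interpretations.

```python
from urllib.robotparser import RobotFileParser

# Placeholder copy of Policy A, written as a single wildcard group.
POLICY_A = """\
User-agent: *
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Disallow: /accounts/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "ClaudeBot", "PerplexityBot"):
    public = is_allowed(POLICY_A, agent, "https://yourdomain.com/blog/post")
    private = is_allowed(POLICY_A, agent, "https://yourdomain.com/admin/users")
    print(agent, public, private)  # each line prints: <agent> True False
```

This only tells you how one parser reads the file. Each crawler runs its own implementation, so treat a passing check as necessary rather than sufficient.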

Policy B: allow AI retrieval, block AI training (the Content-Signal compromise)

# Allow real-time retrieval (so AI can cite you when users ask)
# but signal that content should not be used for model training.
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

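Use this if you want to be discoverable in ChatGPT/Perplexity/Gemini answers but you don't want your content baked into next year's model training data. Content-Signal is a robots.txt directive (not an HTTP header) that Cloudflare proposed in 2025. It states a preference, and honoring it is voluntary, so check each operator's current documentation before relying on it. Because the policy still says Allow: /, a crawler that ignores the signal can still fetch everything; if you need a harder line, also disallow the training-only agents by name. For most brands it's a reasonable middle ground.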
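General-purpose parsers such as Python's urllib.robotparser skip directives they don't recognize, so they won't tell you whether your Content-Signal line is well-formed. The hand-rolled sketch below is a minimal audit helper, not a full robots.txt parser. The function name parse_content_signals is hypothetical, and the signal names are simply the ones used in Policy B above.

```python
POLICY_B = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""


def parse_content_signals(robots_txt: str) -> dict[str, dict[str, bool]]:
    """Map each user-agent to the Content-Signal preferences in its group."""
    signals: dict[str, dict[str, bool]] = {}
    agents: list[str] = []
    group_has_rules = False  # True once the current group has a rule line
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "sitemap":  # not part of any group
            continue
        if key == "user-agent":
            if group_has_rules:  # a User-agent line after rules starts a new group
                agents, group_has_rules = [], False
            agents.append(value)
            continue
        group_has_rules = True
        if key != "content-signal":
            continue
        for agent in agents:
            prefs = signals.setdefault(agent, {})
            for pair in value.split(","):
                name, sep, setting = pair.partition("=")
                if sep:  # skip malformed entries that have no "="
                    prefs[name.strip().lower()] = setting.strip().lower() == "yes"
    return signals


print(parse_content_signals(POLICY_B))
# {'*': {'search': True, 'ai-input': True, 'ai-train': False}}
```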

Policy C: block all AI crawlers (only for sites that genuinely don't want AI visibility)

# Block all AI crawlers explicitly. Allow Googlebot/Bingbot for traditional search.
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: CCBot
Disallow: /

# Allow traditional search
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Use this only if you have a strong reason — premium paid content, news org with monetization concerns, sensitive material. Be aware: blocking AI retrieval means your brand will not be cited when users ask AI assistants about your category. For most businesses, this is a strategic mistake.
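A list this long is easy to let drift, so some teams generate it rather than hand-edit it. The sketch below is one hypothetical way to do that: build_policy_c and the two constants are illustrative names, and the crawler list is just the one from Policy C above. It emits one blank-line-separated group per crawler.

```python
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "anthropic-ai", "Google-Extended", "PerplexityBot", "Perplexity-User",
    "Bytespider", "Amazonbot", "Applebot-Extended", "meta-externalagent", "CCBot",
]
SEARCH_CRAWLERS = ["Googlebot", "Bingbot"]


def build_policy_c(blocked: list[str], allowed: list[str], sitemap_url: str) -> str:
    """Render a robots.txt that blocks `blocked` and explicitly allows `allowed`."""
    lines = ["# Block all AI crawlers explicitly."]
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append("# Allow traditional search")
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


print(build_policy_c(AI_CRAWLERS, SEARCH_CRAWLERS, "https://yourdomain.com/sitemap.xml"))
```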

The Cloudflare gotcha

If your site is behind Cloudflare, there's a feature called "AI Crawl Control → Managed robots.txt" that injects a Policy-C-style block into your robots.txt on the wire, regardless of what your origin serves. The toggle is on by default for many zones. Many brands are blocking every AI crawler without knowing.

To check: curl https://yourdomain.com/robots.txt. If you see a block titled "# BEGIN Cloudflare Managed content", you're affected. Turn the toggle off in Cloudflare → AI Crawl Control → Managed robots.txt. We wrote up the full story here.
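If you'd rather script that check, something like the sketch below could run over the text that curl returns. It assumes the marker string quoted above is what appears in the served file. SAMPLE is a made-up response for illustration, not real Cloudflare output, and audit_served_robots is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

MANAGED_MARKER = "# BEGIN Cloudflare Managed content"  # as quoted in the text above

# Made-up example of what a served robots.txt might look like.
SAMPLE = """\
# BEGIN Cloudflare Managed content
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def audit_served_robots(robots_txt: str, agents: list[str]) -> dict:
    """Report whether the managed marker is present and which agents are blocked from "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "managed_block": MANAGED_MARKER.lower() in robots_txt.lower(),
        "blocked_agents": [a for a in agents if not parser.can_fetch(a, "/")],
    }


print(audit_served_robots(SAMPLE, ["GPTBot", "ClaudeBot", "PerplexityBot"]))
# {'managed_block': True, 'blocked_agents': ['GPTBot', 'ClaudeBot']}
```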

Validating your robots.txt

Three quick checks:

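  1. Check Googlebot's view in Search Console. Google retired its standalone robots.txt tester in 2023; the robots.txt report (under Settings) shows the file Google fetched, and the URL Inspection tool tells you whether a given page is blocked. Both cover Google's crawlers only, not arbitrary user-agents.
  2. Curl robots.txt with each crawler's user-agent and inspect the response. The file should be identical for every client, so a 403, a challenge page, or a different body means a CDN or firewall is treating that crawler differently: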
    curl -A "GPTBot" https://yourdomain.com/robots.txt
    curl -A "ClaudeBot" https://yourdomain.com/robots.txt
  3. Watch Bing Webmaster Tools' Crawl Errors — Bing reports robots.txt-blocked URLs there. Other engines don't surface this as cleanly.
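The curl check in step 2 can be automated along these lines. This is a rough sketch under stated assumptions: make_fetcher and agents_served_differently are hypothetical helpers, the baseline user-agent string is arbitrary, and real crawlers send longer user-agent strings than the bare tokens used here. Spoofing the header also cannot reproduce IP-based bot rules.

```python
import urllib.error
import urllib.request
from typing import Callable


def make_fetcher(robots_url: str, timeout: float = 10.0) -> Callable[[str], str]:
    """Build a function that fetches robots_url with a given User-Agent header."""
    def fetch(user_agent: str) -> str:
        request = urllib.request.Request(robots_url, headers={"User-Agent": user_agent})
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as error:
            return f"HTTP {error.code}"  # e.g. a 403 from a firewall rule
    return fetch


def agents_served_differently(
    fetch: Callable[[str], str],
    agents: list[str],
    baseline_agent: str = "Mozilla/5.0",
) -> list[str]:
    """Return the agents whose response differs from the baseline response."""
    baseline = fetch(baseline_agent)
    return [agent for agent in agents if fetch(agent) != baseline]


# Usage (performs real HTTP requests, so it is left commented out):
# fetch = make_fetcher("https://yourdomain.com/robots.txt")
# print(agents_served_differently(fetch, ["GPTBot", "ClaudeBot", "PerplexityBot"]))
```

An empty list means every tested user-agent received the same body as the baseline. Any name in the list is worth a manual curl to see what was actually returned.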

Three rules

  1. Specific user-agents replace the wildcard. If you have User-agent: * Allow: / and below it User-agent: GPTBot Disallow: /, GPTBot is blocked. A crawler follows only the most specific group that matches it and ignores the * group entirely, so any wildcard Disallow lines you still want for that crawler must be repeated in its own group.
  2. One User-agent block per crawler. Some sites repeat User-agent: GPTBot with different rules in different blocks. RFC 9309 says matching groups should be merged, but some parsers honor only the first one. Consolidate so the result doesn't depend on the parser.
  3. Don't block Googlebot when you mean Google-Extended. These are different tokens. Googlebot powers Search + AI Overviews. Google-Extended controls Gemini training use. Blocking Googlebot tanks your traditional Google traffic.
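The first two rules are easy to see with Python's standard-library parser. The snippets below are contrived examples, and can_fetch is a hypothetical wrapper. The duplicate-group result reflects how this particular parser behaves (first matching group wins), which is exactly why consolidating is the safer habit.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Rule 1: GPTBot has its own group, so the wildcard's /admin/ rule no longer applies to it.
RULE_1 = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /private/
"""
print(can_fetch(RULE_1, "GPTBot", "/admin/"))     # True: wildcard group ignored
print(can_fetch(RULE_1, "ClaudeBot", "/admin/"))  # False: falls back to the wildcard

# Rule 2: duplicate groups. This parser reads only the first GPTBot group,
# while a parser that merges groups would apply "Disallow: /" as well.
RULE_2 = """\
User-agent: GPTBot
Disallow: /drafts/

User-agent: GPTBot
Disallow: /
"""
print(can_fetch(RULE_2, "GPTBot", "/blog/"))  # True here; blocked under merged rules
```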

FAQ

I want to be in ChatGPT but not in Claude. Can I?

Yes. Allow GPTBot, OAI-SearchBot, and ChatGPT-User; disallow ClaudeBot, Claude-Web, and anthropic-ai. Practical impact is modest because most brands want presence everywhere AI assistants exist, but the option is there.

What about noai and noimageai meta tags?

These are the page-level equivalent of robots.txt rules: robots meta directives that ask crawlers not to use the page's content (noai) or its images (noimageai) for AI training. They are non-standard and less widely honored than robots.txt-level signals such as Content-Signal; useful as defense-in-depth on pages where you really care.
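To audit which pages already carry these tags, a small check with Python's built-in html.parser is enough. This sketch assumes the common convention of putting the values in a meta tag named robots, for example content="noai, noimageai". The class and function names are hypothetical.

```python
from html.parser import HTMLParser

AI_OPT_OUT = {"noai", "noimageai"}


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        self.directives.update(
            token.strip().lower() for token in content.split(",") if token.strip()
        )


def ai_opt_out_directives(html: str) -> set[str]:
    """Return whichever of noai / noimageai the page declares."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives & AI_OPT_OUT


page = '<html><head><meta name="robots" content="index, noai, noimageai"></head></html>'
print(sorted(ai_opt_out_directives(page)))  # ['noai', 'noimageai']
```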

What about llms.txt?

A proposed standard for "here's a curated text version of my content for LLMs to ingest cleanly." Adoption is uneven; OpenAI and Anthropic both said publicly in 2025 that they prefer to crawl normally. Worth shipping if it's easy to generate, but don't rely on it as your primary AI-visibility strategy.

Do I need to also add an X-Robots-Tag HTTP header?

Only if you want per-page granularity that robots.txt can't express (e.g., "noindex this specific PDF without listing it"). For broad AI-visibility policy, robots.txt is sufficient.

Bottom line

Most brands win by shipping Policy A. Some by shipping Policy B. Very few should ship Policy C. Whichever you choose, do it deliberately — and re-check after every CDN configuration change. The most common reason brands lose AI visibility isn't a strategy decision; it's a CDN feature that flipped a switch they didn't notice.

