
robots.txt for AI search: the 2026 cheat sheet (GPTBot, ClaudeBot, and the rest)

Every AI crawler user-agent that matters in 2026, what each one does, whether to allow it, and a copy-pasteable robots.txt for a Brandswarm-style 'maximum AI visibility' policy. Plus the Content-Signal compromise for brands that want to allow retrieval but block training.

Brandswarm Team · May 31, 2026 · 7 min read

Your robots.txt is the first place AI crawlers look when they arrive at your site. Get it wrong and you're invisible to ChatGPT, Claude, Perplexity, Gemini, and AI Overviews regardless of how good your content, schema, or backlinks are. Get it right and the cost is zero — it's just a text file.

This is the cheat sheet. Every AI crawler that matters in 2026, whether to allow them, and a copy-pasteable robots.txt file you can drop in today.

The user-agents that matter

User-agent	Operator	What it does	Allow?
`GPTBot`	OpenAI	Crawls content that may be used to train future models. Does NOT do real-time retrieval for ChatGPT.	Yes, if you want your content in future training data
`OAI-SearchBot`	OpenAI	Retrieval for ChatGPT search / SearchGPT.	Yes — direct ChatGPT visibility
`ChatGPT-User`	OpenAI	Used when a user invokes ChatGPT's browsing tool. Fetches a single URL.	Yes — required for browsing
`ClaudeBot`	Anthropic	Crawl for Claude (training + retrieval).	Yes — direct Claude visibility
`Claude-Web` / `anthropic-ai`	Anthropic	Older / alternate user-agent variants.	Yes — same reason
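`Google-Extended`	Google	A robots.txt control token, not a separate crawler. It tells Google whether content fetched by its existing crawlers may be used for Gemini training. It does not affect Search.	Optional: yes if you want training inclusion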
`Googlebot`	Google	Powers regular Google search + AI Overviews. Do not block.	Always yes
`PerplexityBot`	Perplexity	Retrieval for Perplexity answers.	Yes
`Perplexity-User`	Perplexity	Fetches single URLs when users follow Perplexity links.	Yes
`Bytespider`	ByteDance	Crawls for Doubao / TikTok AI features.	Yes if you have TikTok/APAC audience
`Amazonbot`	Amazon	Powers Alexa / Q / Amazon AI features.	Optional
`Applebot-Extended`	Apple	A robots.txt control token that does not crawl on its own. It tells Apple whether content fetched by Applebot may be used for Apple Intelligence training.	Optional
`Applebot`	Apple	Powers Spotlight + Siri suggestions.	Always yes
`meta-externalagent`	Meta	Crawls for Meta AI training.	Optional
`CCBot`	Common Crawl	Open-source crawl used as training data by many models.	Optional — wide influence
`Bingbot`	Microsoft	Regular Bing search + ChatGPT browsing tool retrieval.	Always yes
`DuckAssistBot`	DuckDuckGo	Powers DuckDuckGo's AI Assist.	Yes
`Diffbot` / `BrandBot` / etc.	Various	Niche crawlers used by enterprise AI tools.	Optional — minor traffic

Quick decision: 3 policies that cover 95% of cases

Policy A: maximum AI visibility (recommended for SaaS, content brands, B2B)

# Maximum AI visibility. Allows training + retrieval for all major engines.
# Private/auth surfaces are blocked for every crawler. The Disallow rules come
# first so parsers that stop at the first matching rule still honor them.
User-agent: *
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Disallow: /accounts/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

This is the right policy if your business benefits from being discovered in AI answers. Almost every SaaS, B2B company, and brand that sells anything falls into this category. The wildcard User-agent: * applies to every crawler that has no more specific group of its own, including the AI ones.

Policy B: allow AI retrieval, block AI training (the Content-Signal compromise)

# Allow real-time retrieval (so AI can cite you when users ask)
# but signal that content should not be used for model training.
# All rules sit in one group with no blank line, because some parsers
# end a group at the first blank line.
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

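Use this if you want to be discoverable in ChatGPT/Perplexity/Gemini answers but you don't want your content baked into next year's model training data. Content-Signal is a robots.txt directive, not an HTTP header, and comes from Cloudflare's Content Signals Policy. It states a preference rather than enforcing one, and compliance is voluntary, so check each operator's current documentation before relying on it. For brands in this position it's the right middle ground.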
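To confirm which signals a deployed file actually declares, a small parser is enough. The following is a minimal sketch, assuming the syntax used in Policy B (comma-separated key=value pairs inside a user-agent group). The function name `parse_content_signals` and the sample text are illustrative and not part of any official tooling.

```python
def parse_content_signals(robots_txt: str) -> dict:
    """Map each user-agent token to the Content-Signal pairs in its group."""
    signals = {}
    agents = []        # user-agents named by the group currently being read
    in_rules = False   # True once the group has a non-User-agent line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                agents, in_rules = [], False
            agents.append(value)
            continue
        in_rules = True
        if field != "content-signal":
            continue
        for item in value.split(","):
            if "=" in item:
                key, val = item.split("=", 1)
                for agent in agents:
                    signals.setdefault(agent, {})[key.strip()] = val.strip()
    return signals


SAMPLE = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Disallow: /admin/
Allow: /
"""

print(parse_content_signals(SAMPLE))
```

For the sample above, the wildcard group maps to search=yes, ai-input=yes, ai-train=no. An empty result means the file declares no signals at all.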

Policy C: block all AI crawlers (only for sites that genuinely don't want AI visibility)

# Block all AI crawlers explicitly. Allow Googlebot/Bingbot for traditional search.
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: CCBot
Disallow: /

# Allow traditional search
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Use this only if you have a strong reason — premium paid content, news org with monetization concerns, sensitive material. Be aware: blocking AI retrieval means your brand will not be cited when users ask AI assistants about your category. For most businesses, this is a strategic mistake.
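A block list this long drifts out of date when it is edited by hand. One option is to generate it from a single list of tokens. This is a sketch only: `build_policy_c` and its parameters are hypothetical names, and the token list simply mirrors the file above.

```python
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "anthropic-ai", "Google-Extended", "PerplexityBot", "Perplexity-User",
    "Bytespider", "Amazonbot", "Applebot-Extended", "meta-externalagent",
    "CCBot",
]


def build_policy_c(blocked, sitemap_url, allowed=("Googlebot", "Bingbot")):
    """Render a robots.txt that blocks `blocked` and allows `allowed`."""
    lines = ["# Generated file. Edit the crawler list, not this output."]
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_policy_c(AI_CRAWLERS, "https://yourdomain.com/sitemap.xml"))
```

Each group is separated by a blank line, which keeps the output readable and avoids ambiguity in stricter parsers.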

The Cloudflare gotcha

If your site is behind Cloudflare, there's a feature called "AI Crawl Control → Managed robots.txt" that injects a Policy-C-style block into your robots.txt on the wire, regardless of what your origin serves. The toggle is on by default for many zones. Many brands are blocking every AI crawler without knowing.

To check: curl https://yourdomain.com/robots.txt. If you see a block titled "# BEGIN Cloudflare Managed content", you're affected. Turn the toggle off in Cloudflare → AI Crawl Control → Managed robots.txt. We wrote up the full story here.
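The same check can be scripted so it runs after every CDN change. This is a rough sketch: the marker string is the one quoted above, while `has_cloudflare_managed_block`, `fetch_robots`, and the `robots-check/0.1` user-agent are made-up names for illustration.

```python
import urllib.request

# Marker quoted in the article; matched case-insensitively to be safe.
MARKER = "# BEGIN Cloudflare Managed content"


def has_cloudflare_managed_block(robots_txt: str) -> bool:
    """Return True if the served robots.txt contains the managed-block marker."""
    return MARKER.lower() in robots_txt.lower()


def fetch_robots(domain: str, timeout: float = 10.0) -> str:
    """Fetch robots.txt as served on the wire (after any CDN rewriting)."""
    request = urllib.request.Request(
        f"https://{domain}/robots.txt",
        headers={"User-Agent": "robots-check/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    served = fetch_robots("yourdomain.com")
    print("managed block present:", has_cloudflare_managed_block(served))
```

Comparing the served file against the file at your origin is a useful second step, since the marker wording could change.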

Validating your `robots.txt`

Three quick checks:

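Use Search Console. Google retired its standalone robots.txt tester. The robots.txt report (under Settings) shows the file Google last fetched and any parse problems, and the URL Inspection tool tells you whether a specific URL is blocked for Googlebot. Neither lets you test other crawlers' user-agents.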

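Curl with each crawler's user-agent and compare the responses. A 403, a challenge page, or a different file for one user-agent means a CDN or firewall rule is treating that crawler differently: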

curl -A "GPTBot" https://yourdomain.com/robots.txt
curl -A "ClaudeBot" https://yourdomain.com/robots.txt

Watch Bing Webmaster Tools' Crawl Errors — Bing reports robots.txt-blocked URLs there. Other engines don't surface this as cleanly.
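The curl check shows what is served, not how the rules evaluate for each crawler. As a first pass, Python's standard `urllib.robotparser` can evaluate a file per user-agent. This is a sketch under two assumptions: the crawler list and the `crawl_report` name are illustrative, and this parser applies the first matching rule in a group rather than the longest match, so results can differ from Google's for files that list Allow: / before narrower Disallow rules.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Googlebot", "Bingbot",
]


def crawl_report(robots_txt: str, url: str, agents=AI_CRAWLERS) -> dict:
    """Return {user-agent: may_fetch} for `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /
"""

if __name__ == "__main__":
    for agent, ok in crawl_report(SAMPLE, "https://yourdomain.com/pricing").items():
        print(f"{agent:15} {'allowed' if ok else 'BLOCKED'}")
```

With the sample file, GPTBot is reported as blocked on /pricing and every other crawler as allowed. Paste the output of your own curl in place of `SAMPLE` to test the live file.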

Three rules

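Specific user-agents override the wildcard. If you have User-agent: * Allow: / and below it User-agent: GPTBot Disallow: /, GPTBot is blocked. A crawler follows only the most specific group that names it and ignores the wildcard group entirely, so rules are not inherited: any Disallow lines you want GPTBot to respect must be repeated in the GPTBot group.
One User-agent block per crawler. Some sites repeat User-agent: GPTBot with different rules in different blocks. RFC 9309 says matching groups should be combined, but not every parser does that, and some honor only the first block. Consolidate.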
Don't block Googlebot when you mean Google-Extended. These are different tokens. Googlebot powers Search + AI Overviews. Google-Extended only controls whether content Google has crawled may be used for Gemini training. Blocking Googlebot tanks your traditional Google traffic.
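The first two rules are easy to verify with Python's standard `urllib.robotparser`. This is a small sketch: the paths and the `allowed` helper are invented for the example, and the duplicate-group result reflects how this particular parser behaves, not every crawler.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Evaluate one user-agent/path pair against robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Rule 1: a specific group replaces the wildcard; nothing is inherited.
NO_INHERITANCE = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /drafts/
"""

# Rule 2: the same crawler named in two separate groups.
DUPLICATED = """\
User-agent: GPTBot
Disallow: /a/

User-agent: GPTBot
Disallow: /b/
"""

if __name__ == "__main__":
    # GPTBot has its own group, so the wildcard's /admin/ rule is ignored.
    print(allowed(NO_INHERITANCE, "GPTBot", "/admin/"))     # True
    print(allowed(NO_INHERITANCE, "ClaudeBot", "/admin/"))  # False
    # This parser uses only the first GPTBot group, so /b/ stays open.
    print(allowed(DUPLICATED, "GPTBot", "/b/page"))         # True
```

A parser that merges groups as RFC 9309 describes would block /b/page in the second file, which is why one consolidated group per crawler is the safer layout.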

FAQ

I want to be in ChatGPT but not in Claude. Can I?

Yes. Allow GPTBot, OAI-SearchBot, and ChatGPT-User; disallow ClaudeBot, Claude-Web, and anthropic-ai. Few brands need this, since most want presence everywhere AI assistants exist, but the option is there.

What about `noai` and `noimageai` meta tags?

These are the page-level equivalent of robots.txt rules. They tell crawlers not to use the page's content for AI training. They are non-standard and less widely honored than Content-Signal directives, but useful as defense-in-depth on pages where you really care.

What about `llms.txt`?

A proposed standard for "here's a curated text version of my content for LLMs to ingest cleanly." Adoption is uneven; OpenAI and Anthropic both said publicly in 2025 that they prefer to crawl normally. Worth shipping if it's easy to generate, but don't rely on it as your primary AI-visibility strategy.

Do I need to also add an `X-Robots-Tag` HTTP header?

Only if you want per-page granularity that robots.txt can't express (e.g., "noindex this specific PDF without listing it"). For broad AI-visibility policy, robots.txt is sufficient.

Bottom line

Most brands win by shipping Policy A. Some by shipping Policy B. Very few should ship Policy C. Whichever you choose, do it deliberately — and re-check after every CDN configuration change. The most common reason brands lose AI visibility isn't a strategy decision; it's a CDN feature that flipped a switch they didn't notice.

