robots.txt for AI search: the 2026 cheat sheet (GPTBot, ClaudeBot, and the rest)
Every AI crawler user-agent that matters in 2026, what they do, whether to allow them, and a copy-pasteable robots.txt for a Brandswarm-style 'maximum AI visibility' policy. Plus the Content-Signal compromise for brands that want to allow retrieval but block training.
Your robots.txt is the first place AI crawlers look when they
arrive at your site. Get it wrong and you're invisible to ChatGPT, Claude,
Perplexity, Gemini, and AI Overviews regardless of how good your content,
schema, or backlinks are. Get it right and the cost is zero — it's just a
text file.
This is the cheat sheet. Every AI crawler that matters in 2026, whether
to allow them, and a copy-pasteable robots.txt file you can
drop in today.
The user-agents that matter
| User-agent | Operator | What it does | Allow? |
|---|---|---|---|
| GPTBot | OpenAI | Trains future models. Does NOT do real-time retrieval for ChatGPT. | Yes: visibility, not training |
| OAI-SearchBot | OpenAI | Retrieval for ChatGPT search / SearchGPT. | Yes: direct ChatGPT visibility |
| ChatGPT-User | OpenAI | Used when a user invokes ChatGPT's browsing tool. Fetches a single URL. | Yes: required for browsing |
| ClaudeBot | Anthropic | Crawl for Claude (training + retrieval). | Yes: direct Claude visibility |
| Claude-Web / anthropic-ai | Anthropic | Older / alternate user-agent variants. | Yes: same reason |
| Google-Extended | Google | Crawls for Gemini training. Separate from Googlebot. | Optional: yes if you want training inclusion |
| Googlebot | Google | Powers regular Google search + AI Overviews. Do not block. | Always yes |
| PerplexityBot | Perplexity | Retrieval for Perplexity answers. | Yes |
| Perplexity-User | Perplexity | Fetches single URLs when users follow Perplexity links. | Yes |
| Bytespider | ByteDance | Crawls for Doubao / TikTok AI features. | Yes if you have TikTok/APAC audience |
| Amazonbot | Amazon | Powers Alexa / Q / Amazon AI features. | Optional |
| Applebot-Extended | Apple | Crawls for Apple Intelligence training. Separate from Applebot (search). | Optional |
| Applebot | Apple | Powers Spotlight + Siri suggestions. | Always yes |
| meta-externalagent | Meta | Crawls for Meta AI training. | Optional |
| CCBot | Common Crawl | Open-source crawl used as training data by many models. | Optional: wide influence |
| Bingbot | Microsoft | Regular Bing search + ChatGPT browsing tool retrieval. | Always yes |
| DuckAssistBot | DuckDuckGo | Powers DuckDuckGo's AI Assist. | Yes |
| Diffbot / BrandBot / etc. | Various | Niche crawlers used by enterprise AI tools. | Optional: minor traffic |
Quick decision: 3 policies that cover 95% of cases
Policy A: maximum AI visibility (recommended for SaaS, content brands, B2B)
# Maximum AI visibility. Allows training + retrieval for all major engines.
# Private/auth surfaces stay blocked for every crawler.
# One wildcard group only: some parsers ignore a second "User-agent: *" group.
User-agent: *
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Disallow: /accounts/
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
This is the right policy if your business benefits from being discovered
in AI answers. Almost every SaaS, B2B company, and brand that sells anything
falls into this category. The wildcard User-agent: * group applies to
every crawler that has no group of its own, including the AI ones.
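A policy like this can be sanity-checked offline with Python's standard-library `urllib.robotparser`. The following is a minimal sketch, not a definitive test: `POLICY_A` is a single-group rendering of the file above, `allowed` is a hypothetical helper name, and the `Disallow` lines are placed before `Allow: /` because this particular parser applies the first matching rule in a group (crawlers that follow RFC 9309 use the longest match instead, so the order does not matter to them).

```python
from urllib.robotparser import RobotFileParser

# Single-group rendering of Policy A (assumption: Disallow lines listed first).
POLICY_A = """\
User-agent: *
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Disallow: /accounts/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""


def allowed(policy: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(policy.splitlines())
    return parser.can_fetch(agent, "https://yourdomain.com" + path)


for agent in ("GPTBot", "ClaudeBot", "PerplexityBot"):
    # Public pages should be fetchable, private surfaces should not.
    print(agent, allowed(POLICY_A, agent, "/pricing"), allowed(POLICY_A, agent, "/admin/users"))
```

Each line should report `True` for the public page and `False` for the admin path, since no crawler has a group of its own here.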
Policy B: allow AI retrieval, block AI training (the Content-Signal compromise)
# Allow real-time retrieval (so AI can cite you when users ask)
# but signal that content should not be used for model training.
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
Disallow: /admin/
Disallow: /app/
Disallow: /billing/
Sitemap: https://yourdomain.com/sitemap.xml
Use this if you want to be discoverable in ChatGPT/Perplexity/Gemini answers
but you don't want your content baked into next year's model training data.
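Content-Signal is a robots.txt directive from Cloudflare's Content Signals Policy, not an HTTP header.
It states a preference and does not enforce one, so check each operator's crawler documentation to confirm the signal is honored before relying on it. With that caveat, it's the right middle ground.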
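To audit what a deployed file actually declares, the Content-Signal line can be read back out of the robots.txt text. This is a rough sketch under stated assumptions: it treats the value as comma-separated `key=value` pairs, as in the example above, and `parse_content_signals` is a hypothetical helper, not part of any library.

```python
POLICY_B = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
Disallow: /admin/
"""


def parse_content_signals(robots_txt: str) -> dict[str, dict[str, str]]:
    """Map each user-agent token to the Content-Signal pairs in its group."""
    signals: dict[str, dict[str, str]] = {}
    agents: list[str] = []
    seen_rule = False  # True once the current group has a non-user-agent line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a user-agent line after rules starts a new group
                agents, seen_rule = [], False
            agents.append(value)
            continue
        seen_rule = True
        if field == "content-signal":
            pairs = dict(
                item.strip().split("=", 1)
                for item in value.split(",")
                if "=" in item
            )
            for agent in agents:
                signals.setdefault(agent, {}).update(pairs)
    return signals


print(parse_content_signals(POLICY_B))
# {'*': {'search': 'yes', 'ai-input': 'yes', 'ai-train': 'no'}}
```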
Policy C: block everything (only for sites that genuinely don't want AI visibility)
# Block all AI crawlers explicitly. Allow Googlebot/Bingbot for traditional search.
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: CCBot
Disallow: /
# Allow traditional search
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
Use this only if you have a strong reason — premium paid content, news org with monetization concerns, sensitive material. Be aware: blocking AI retrieval means your brand will not be cited when users ask AI assistants about your category. For most businesses, this is a strategic mistake.
The Cloudflare gotcha
If your site is behind Cloudflare, there's a feature called "AI Crawl
Control → Managed robots.txt" that injects a Policy-C-style block
into your robots.txt on the wire, regardless of what your origin
serves. The toggle is on by default for many zones. Many brands are blocking
every AI crawler without knowing.
To check: curl https://yourdomain.com/robots.txt. If you see a
block titled "# BEGIN Cloudflare Managed content", you're
affected. Turn the toggle off in Cloudflare → AI Crawl Control → Managed
robots.txt. We wrote up the full story in a separate post.
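The same check can be scripted so it runs after every CDN change. This is a minimal sketch that assumes the marker text quoted above appears at the start of a line in the served file; `has_managed_block` and `fetch_robots` are hypothetical helper names.

```python
import urllib.request

# Marker text as quoted in the article; treated here as an assumption.
MARKER = "# BEGIN Cloudflare Managed content"


def has_managed_block(robots_txt: str) -> bool:
    """Return True if any line starts with the managed-content marker."""
    marker = MARKER.lower()
    return any(line.strip().lower().startswith(marker) for line in robots_txt.splitlines())


def fetch_robots(site: str, timeout: float = 10.0) -> str:
    """Download robots.txt as served on the wire (after any CDN rewriting)."""
    url = f"{site.rstrip('/')}/robots.txt"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

A typical call would be `has_managed_block(fetch_robots("https://yourdomain.com"))`; comparing the result with the file your origin serves shows whether anything was injected in between.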
Validating your robots.txt
Three quick checks:
- Use the URL Inspection tool in Google Search Console: paste a URL and it reports whether Google's crawler may fetch the page. The standalone robots.txt tester has been retired, and this check covers Google's crawlers only.
- Curl with each crawler's user-agent and inspect the response:
  curl -A "GPTBot" https://yourdomain.com/robots.txt
  curl -A "ClaudeBot" https://yourdomain.com/robots.txt
- Watch Bing Webmaster Tools' Crawl Errors: Bing reports robots.txt-blocked URLs there. Other engines don't surface this as cleanly.
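The curl check can be automated by requesting the same URL under several user-agents and flagging any that get a different answer, which usually points at a CDN or firewall rule rather than the file itself. The sketch below is illustrative: `fetch_as` and `differing_agents` are hypothetical helpers, and it sends the bare crawler token as the article's curl commands do, whereas real crawlers send longer user-agent strings.

```python
import urllib.error
import urllib.request


def fetch_as(url: str, user_agent: str, timeout: float = 10.0) -> tuple[int, str]:
    """Fetch `url` with the given User-Agent; return (status code, body text)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A 403/429 for one crawler only is exactly what we want to notice.
        return err.code, ""


def differing_agents(url: str, agents: list[str], fetch=fetch_as) -> list[str]:
    """Return the agents whose response differs from the first agent's response."""
    baseline = fetch(url, agents[0])
    return [agent for agent in agents[1:] if fetch(url, agent) != baseline]
```

For example, `differing_agents("https://yourdomain.com/robots.txt", ["Mozilla/5.0", "GPTBot", "ClaudeBot"])` returns an empty list when every agent is served the same file.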
Three rules
- Specific user-agents override the wildcard. If you have User-agent: * with Allow: / and, below it, User-agent: GPTBot with Disallow: /, GPTBot is blocked. A crawler that matches a named group follows only that group and ignores the wildcard group, so repeat any wildcard rules you still want inside the named group.
- One User-agent block per crawler. Some sites repeat User-agent: GPTBot with different rules in different blocks. RFC 9309 requires parsers to merge such groups, but some parsers honor only the first one. Consolidate.
- Don't block Googlebot when you mean Google-Extended. These are different crawlers. Googlebot powers Search + AI Overviews. Google-Extended powers Gemini training. Blocking Googlebot tanks your traditional Google traffic.
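The first two rules can be observed with Python's standard-library `urllib.robotparser`. This is a small sketch rather than a statement about how any particular crawler behaves: the `can_fetch` wrapper and the two sample files are illustrative, and the duplicated-group result reflects this parser's first-match behaviour, which is one reason to consolidate groups.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Evaluate `path` for `agent` against robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "https://yourdomain.com" + path)


# Rule 1: a named group overrides the wildcard group.
OVERRIDE = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

# Rule 2: the same crawler named in two separate groups.
DUPLICATED = """\
User-agent: GPTBot
Disallow: /private/

User-agent: GPTBot
Disallow: /drafts/
"""

print(can_fetch(OVERRIDE, "GPTBot", "/pricing"))       # False: the named group wins
print(can_fetch(OVERRIDE, "ClaudeBot", "/pricing"))    # True: only the wildcard group matches
print(can_fetch(DUPLICATED, "GPTBot", "/private/x"))   # False: first GPTBot group applies
print(can_fetch(DUPLICATED, "GPTBot", "/drafts/x"))    # True: this parser never reads the second group
```

A parser that merges groups would block `/drafts/x` as well, so the duplicated file means different things to different crawlers.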
FAQ
I want to be in ChatGPT but not in Claude. Can I?
Yes. Allow GPTBot, OAI-SearchBot, and
ChatGPT-User; disallow ClaudeBot,
Claude-Web, and anthropic-ai. Practical impact is
modest because most brands want presence everywhere AI assistants exist,
but the option is there.
What about noai and noimageai meta tags?
These are the page-level equivalent of robots.txt rules. They tell crawlers
not to use the page's content for AI training. They are non-standard and
crawler support is limited, so treat them as defense-in-depth on
pages where you really care, not as your main control.
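To confirm which pages actually carry these tags, the rendered HTML can be scanned for them. The sketch below assumes the commonly seen form `<meta name="robots" content="noai, noimageai">`, which the article does not spell out; `RobotsMetaParser` and `robots_directives` are hypothetical names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


def robots_directives(html: str) -> set[str]:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
print(sorted(robots_directives(page)))  # ['noai', 'noimageai']
```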
What about llms.txt?
A proposed standard for "here's a curated text version of my content for LLMs to ingest cleanly." Adoption is uneven; OpenAI and Anthropic both said publicly in 2025 that they prefer to crawl normally. Worth shipping if it's easy to generate, but don't rely on it as your primary AI-visibility strategy.
Do I need to also add an X-Robots-Tag HTTP header?
Only if you want per-page granularity that robots.txt can't express (e.g., "noindex this specific PDF without listing it"). For broad AI-visibility policy, robots.txt is sufficient.
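For the per-file case, the header has to be attached by whatever serves the response. Here is a minimal sketch as WSGI middleware, assuming a Python WSGI stack; `with_x_robots_tag`, `is_pdf`, and `demo_app` are hypothetical names, and the same idea applies to any server or CDN that can add response headers.

```python
def is_pdf(path: str) -> bool:
    """Example predicate: tag every PDF response."""
    return path.lower().endswith(".pdf")


def with_x_robots_tag(app, should_tag=is_pdf, value="noindex"):
    """Wrap a WSGI app so matching paths get an X-Robots-Tag response header."""

    def middleware(environ, start_response):
        def tagged_start_response(status, headers, exc_info=None):
            if should_tag(environ.get("PATH_INFO", "")):
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)

        return app(environ, tagged_start_response)

    return middleware


def demo_app(environ, start_response):
    """Stand-in application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"example body"]


application = with_x_robots_tag(demo_app)
```

The PDF stays out of robots.txt entirely, so its URL is not advertised, while crawlers that fetch it still see the `noindex` instruction.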
Bottom line
Most brands win by shipping Policy A. Some by shipping Policy B. Very few should ship Policy C. Whichever you choose, do it deliberately — and re-check after every CDN configuration change. The most common reason brands lose AI visibility isn't a strategy decision; it's a CDN feature that flipped a switch they didn't notice.