Robots.txt for AI Crawlers: Best Practices and Common Mistakes
Robots.txt Is Your AI Access Control Layer
The robots.txt file has been around since 1994, but it has never been more important than now. With a growing number of AI crawlers visiting websites — GPTBot, ClaudeBot, Google-Extended, Bytespider, and more — your robots.txt is the primary mechanism for controlling which AI systems can access your content.
The problem: most robots.txt files were written years ago with only Googlebot and Bingbot in mind. They either inadvertently allow full AI crawler access or, worse, use overly broad rules that block everything including legitimate search crawlers.
The AI Crawlers You Need to Know
Here are the primary AI crawlers active today and their user-agent strings:
| Crawler | User-Agent | Company | Purpose |
|---------|-----------|---------|---------|
| GPTBot | GPTBot | OpenAI | ChatGPT training & retrieval |
| ChatGPT-User | ChatGPT-User | OpenAI | Real-time browsing in ChatGPT |
| ClaudeBot | ClaudeBot | Anthropic | Claude training |
| Google-Extended | Google-Extended | Google | Gemini AI training |
| Bytespider | Bytespider | ByteDance | AI training |
| CCBot | CCBot | Common Crawl | Open dataset used by many AI labs |
| FacebookBot | FacebookBot | Meta | AI training |
| Applebot-Extended | Applebot-Extended | Apple | Apple Intelligence training |
| PerplexityBot | PerplexityBot | Perplexity | AI search retrieval |
Each of these requires its own explicit robots.txt directive. There is no wildcard "AI crawler" category.
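The protocol does, however, let you stack several User-agent lines above one shared rule set, which keeps the file compact when the policy is identical for a group of bots; for example:
# One rule set shared by several AI crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /premium/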
Best Practices for AI Crawler Configuration
1. Be Explicit About Each Crawler
Don't rely on a general User-agent: * rule to manage AI crawlers. Specify each one individually:
# AI Crawlers - Allowed
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /account/
Disallow: /admin/
User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Disallow: /account/
Disallow: /admin/
User-agent: PerplexityBot
Allow: /
Disallow: /account/
Disallow: /admin/
2. Separate Training Crawlers from Retrieval Crawlers
This is a critical distinction many site owners miss:
- Training crawlers (GPTBot, Google-Extended, ClaudeBot) collect content to train AI models
- Retrieval crawlers (ChatGPT-User, PerplexityBot) fetch content in real-time to answer user questions
If you block retrieval crawlers, AI search engines cannot cite your content in real-time responses. If you block training crawlers, you only prevent your content from being used to train models; retrieval-based citations can still occur as long as the retrieval crawlers remain allowed.
Recommended approach: Allow retrieval crawlers broadly, and make strategic decisions about training crawlers.
# Retrieval - Allow broadly (enables citations)
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
# Training - Strategic control
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /premium/
User-agent: Google-Extended
Allow: /blog/
Disallow: /premium/
3. Never Block Search Crawlers by Accident
The most dangerous mistake is writing a broad rule that accidentally affects Googlebot or Bingbot:
# DANGEROUS - Do NOT do this
User-agent: *
Disallow: /
# This blocks EVERYTHING including Google Search indexing
Always give your search crawlers their own explicit groups, conventionally at the top of the file. Crawlers follow the most specific User-agent match, so an explicit Googlebot or Bingbot group means a broad User-agent: * rule never applies to them:
# Search engines - Full access
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Then AI-specific rules below...
4. Use Allow + Disallow Together for Precision
Robots.txt supports combining Allow and Disallow for fine-grained control:
User-agent: GPTBot
Disallow: /
Allow: /blog/
Allow: /guides/
Allow: /resources/
This blocks everything by default, then explicitly opens the directories you want AI crawlers to access. It's the safest approach for sites with mixed public/private content.
Common Mistakes That Hurt GEO
Mistake 1: No AI Crawler Rules at All
If your robots.txt never mentions AI crawlers, they simply fall back to your User-agent: * rules, which on most sites means near-total access. That might be fine if you want maximum AI visibility, but it's not an intentional strategy; it's an oversight.
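If broad AI access is what you want, say so explicitly so the decision is documented rather than accidental. A minimal sketch:
# AI crawlers: intentionally allowed
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /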
Mistake 2: Blocking Retrieval Crawlers
# This prevents AI search from citing you
User-agent: ChatGPT-User
Disallow: /
User-agent: PerplexityBot
Disallow: /
If you want AI search visibility, never block retrieval crawlers. These are the bots that fetch your content to generate cited responses.
Mistake 3: Outdated Crawl-Delay Directives
User-agent: GPTBot
Crawl-delay: 60
Most AI crawlers do not respect crawl-delay. It's an informal extension to robots.txt that only some bots honor. Don't rely on it for rate limiting — use server-level rate limiting instead if crawl volume is a concern.
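A minimal sketch of server-level rate limiting for nginx, assuming the bot list, zone name, and 30-requests-per-minute rate are placeholders to adapt (the map and limit_req_zone directives belong in the http context):
# Key stays empty for normal traffic, so only matched bots are limited
map $http_user_agent $ai_bot_key {
    default "";
    ~*(GPTBot|ClaudeBot|Bytespider|CCBot) $binary_remote_addr;
}
limit_req_zone $ai_bot_key zone=aibots:10m rate=30r/m;

server {
    location / {
        limit_req zone=aibots burst=10;
        # ...existing static/proxy configuration...
    }
}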
Mistake 4: Blocking Sitemaps from AI Crawlers
Your sitemap helps AI crawlers discover content efficiently. Make sure it's accessible:
Sitemap: https://example.com/sitemap.xml
The Sitemap directive is global — it applies regardless of user-agent blocks. But ensure your sitemap URL itself isn't in a blocked path.
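As an illustration (paths are hypothetical), the first layout hides the sitemap behind a disallowed prefix, while the second keeps it reachable:
# Problematic: sitemap sits under a blocked path
User-agent: *
Disallow: /private/
Sitemap: https://example.com/private/sitemap.xml

# Better: sitemap at a crawlable URL
Sitemap: https://example.com/sitemap.xml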
Mistake 5: Conflicting Rules
User-agent: GPTBot
Disallow: /blog/
Allow: /blog/guides/
When rules conflict, the longest (most specific) matching path wins per the robots.txt standard (RFC 9309), so /blog/guides/ remains crawlable here. But not all crawlers implement conflict resolution identically. Keep your rules simple and non-contradictory when possible.
WordPress-Specific Considerations
WordPress generates a virtual robots.txt by default. To customize it for AI crawlers:
Option 1: Physical robots.txt File
Create a physical robots.txt file in your WordPress root directory. This overrides the virtual one.
Option 2: Filter the Virtual robots.txt
// Append AI-crawler rules to WordPress's virtual robots.txt output.
// Place this in your theme's functions.php or a small site plugin.
add_filter('robots_txt', function ($output, $public) {
    $output .= "\n# AI Crawlers\n";
    $output .= "User-agent: GPTBot\n";
    $output .= "Allow: /\n";
    $output .= "Disallow: /wp-admin/\n\n";
    $output .= "User-agent: ClaudeBot\n";
    $output .= "Allow: /\n";
    $output .= "Disallow: /wp-admin/\n\n";
    return $output;
}, 10, 2);
Option 3: Use a GEO Plugin
Plugins like Arvo GEO provide a UI for managing AI crawler access without manually editing robots.txt. This reduces the risk of syntax errors and makes it easy to update as new crawlers emerge.
Testing Your robots.txt
After making changes, validate:
- Search Console's robots.txt report (Google has retired the standalone robots.txt Tester), which shows how Googlebot fetched and parsed your file
- Manual verification — visit yoursite.com/robots.txt and read it carefully
- Log monitoring — after deploying changes, check whether AI crawlers are respecting your rules (see the sketch after this list)
- Syntax checkers — use online robots.txt validators to catch formatting errors
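For the log-monitoring step, a rough sketch (assuming a combined-format access log at /var/log/nginx/access.log; adjust the path and bot list for your setup) counts which paths each AI crawler is actually requesting:
grep -iE "GPTBot|ClaudeBot|ChatGPT-User|PerplexityBot|Google-Extended" /var/log/nginx/access.log \
  | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
If disallowed paths keep appearing well after your changes go live, that crawler is ignoring your rules, and server-level blocking is the fallback.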
A Complete Example
Here's a well-structured robots.txt for a site that wants AI search visibility while protecting private content:
# Search Engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# AI Retrieval (enables citations)
User-agent: ChatGPT-User
Allow: /
Disallow: /members/
Disallow: /wp-admin/
User-agent: PerplexityBot
Allow: /
Disallow: /members/
Disallow: /wp-admin/
# AI Training (strategic access)
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /
User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Disallow: /
User-agent: Google-Extended
Allow: /blog/
Allow: /docs/
Disallow: /
# Default
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Your robots.txt is the foundation of your AI access strategy. Get it right, and you control exactly how AI systems interact with your content. Get it wrong, and you're either invisible to AI search or giving away premium content without realizing it.