Robots.txt for AI Crawlers: Best Practices and Common Mistakes

Technical · GEO · WordPress

Robots.txt Is Your AI Access Control Layer

The robots.txt file has been around since 1994, but it has never been more important than now. With a growing number of AI crawlers visiting websites — GPTBot, ClaudeBot, Google-Extended, Bytespider, and more — your robots.txt is the primary mechanism for controlling which AI systems can access your content.

The problem: most robots.txt files were written years ago with only Googlebot and Bingbot in mind. They either inadvertently allow full AI crawler access or, worse, use overly broad rules that block everything including legitimate search crawlers.

The AI Crawlers You Need to Know

Here are the primary AI crawlers active today and their user-agent strings:

| Crawler | User-Agent | Company | Purpose |
|---------|------------|---------|---------|
| GPTBot | GPTBot | OpenAI | ChatGPT training & retrieval |
| ChatGPT-User | ChatGPT-User | OpenAI | Real-time browsing in ChatGPT |
| ClaudeBot | ClaudeBot | Anthropic | Claude training |
| Google-Extended | Google-Extended | Google | Gemini AI training |
| Bytespider | Bytespider | ByteDance | AI training |
| CCBot | CCBot | Common Crawl | Open dataset used by many AI labs |
| FacebookBot | FacebookBot | Meta | AI training |
| Applebot-Extended | Applebot-Extended | Apple | Apple Intelligence training |
| PerplexityBot | PerplexityBot | Perplexity | AI search retrieval |

Each of these requires its own explicit robots.txt directive; there is no wildcard "AI crawler" category. Note that Google-Extended and Applebot-Extended are not separate crawlers but robots.txt tokens: the fetching is done by Googlebot and Applebot, and these tokens control whether the fetched content may be used for AI training.

Best Practices for AI Crawler Configuration

1. Be Explicit About Each Crawler

Don't rely on a general User-agent: * rule to manage AI crawlers. Specify each one individually:

# AI Crawlers - Allowed
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /account/
Disallow: /admin/

User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Disallow: /account/
Disallow: /admin/

User-agent: PerplexityBot
Allow: /
Disallow: /account/
Disallow: /admin/
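The reason explicit groups matter: a crawler that finds its own User-agent group ignores the User-agent: * group entirely. Here is a quick sketch with Python's urllib.robotparser (only a rough approximation of how real crawlers parse, but it illustrates the group-matching behavior; the rules are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: GPTBot has its own group; everyone else falls back to *.
RULES = """\
User-agent: GPTBot
Disallow: /premium/

User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

# GPTBot obeys only its own group: /premium/ is blocked, /admin/ is NOT.
print(rp.can_fetch("GPTBot", "https://example.com/premium/report"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/admin/"))          # True
# A bot without its own group falls back to the * group.
print(rp.can_fetch("ClaudeBot", "https://example.com/admin/"))       # False
```

Notice that giving GPTBot its own group silently exempted it from the /admin/ block in the * group: every path you want closed must be repeated in every group.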

2. Separate Training Crawlers from Retrieval Crawlers

This is a critical distinction many site owners miss:

  • Training crawlers (GPTBot, Google-Extended, ClaudeBot) collect content to train AI models
  • Retrieval crawlers (ChatGPT-User, PerplexityBot) fetch content in real-time to answer user questions

If you block retrieval crawlers, AI search engines cannot fetch your pages to build cited, real-time responses. If you block training crawlers, you only prevent your content from being used to train future models; retrieval-based citations can still happen as long as the retrieval crawlers remain allowed.

Recommended approach: Allow retrieval crawlers broadly, and make strategic decisions about training crawlers.

# Retrieval - Allow broadly (enables citations)
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Training - Strategic control
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /premium/

User-agent: Google-Extended
Allow: /blog/
Disallow: /premium/

3. Never Block Search Crawlers by Accident

The most dangerous mistake is writing a broad rule that accidentally affects Googlebot or Bingbot:

# DANGEROUS - Do NOT do this
User-agent: *
Disallow: /

# This blocks EVERYTHING including Google Search indexing

Rule order between groups does not matter to compliant parsers, but always give your search crawlers explicit groups so they never fall into a restrictive User-agent: * group:

# Search engines - Full access
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Then AI-specific rules below...

4. Use Allow + Disallow Together for Precision

Robots.txt supports combining Allow and Disallow for fine-grained control:

User-agent: GPTBot
Disallow: /
Allow: /blog/
Allow: /guides/
Allow: /resources/

This blocks everything by default, then explicitly opens the directories you want AI crawlers to access. It's the safest approach for sites with mixed public/private content.
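Under the robots.txt standard (RFC 9309), the rule with the longest matching path wins, and Allow wins ties. A minimal sketch of that resolution logic, ignoring the * and $ wildcards for simplicity (function and variable names are illustrative):

```python
def is_allowed(path, rules):
    """Resolve Allow/Disallow conflicts the RFC 9309 way:
    the longest matching path wins, and Allow wins ties.
    `rules` is a list of (directive, path_prefix) pairs.
    Wildcard support (* and $) is omitted for brevity."""
    best_len, best_allow = -1, True  # no matching rule => allowed
    for directive, prefix in rules:
        if path.startswith(prefix):
            allow = directive.lower() == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and allow):
                best_len, best_allow = len(prefix), allow
    return best_allow

# The "default deny, open specific directories" pattern from above:
rules = [("disallow", "/"), ("allow", "/blog/"), ("allow", "/guides/")]
print(is_allowed("/blog/post", rules))  # True: /blog/ (6 chars) beats / (1 char)
print(is_allowed("/account/", rules))   # False: only / matches
```

Because /blog/ is longer than /, the Allow rule wins for anything under /blog/ even though Disallow: / matches too.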

Common Mistakes That Hurt GEO

Mistake 1: No AI Crawler Rules at All

If your robots.txt never mentions AI crawlers, they are governed only by your User-agent: * rules, which on most sites means near-full access. That might be fine if you want maximum AI visibility, but it should be an intentional strategy, not an oversight.

Mistake 2: Blocking Retrieval Crawlers

# This prevents AI search from citing you
User-agent: ChatGPT-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

If you want AI search visibility, never block retrieval crawlers. These are the bots that fetch your content to generate cited responses.

Mistake 3: Outdated Crawl-Delay Directives

User-agent: GPTBot
Crawl-delay: 60

Most AI crawlers do not respect crawl-delay. It's an informal extension to robots.txt that only some bots honor. Don't rely on it for rate limiting — use server-level rate limiting instead if crawl volume is a concern.
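If crawl volume is a real problem, throttle at the server instead. A sketch of one approach in nginx (the zone name, rate, and bot list are illustrative; tune them to your traffic):

```nginx
# Map known AI crawler user-agents to a rate-limit key; other traffic is exempt.
map $http_user_agent $ai_crawler {
    default                               "";
    ~*(GPTBot|ClaudeBot|Bytespider|CCBot) $binary_remote_addr;
}

# Illustrative zone: at most 30 requests per minute per crawler IP.
limit_req_zone $ai_crawler zone=ai_crawlers:10m rate=30r/m;

server {
    location / {
        limit_req zone=ai_crawlers burst=10 nodelay;
        # ... your usual configuration ...
    }
}
```

An empty rate-limit key means the request is not counted, so regular visitors are unaffected.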

Mistake 4: Blocking Sitemaps from AI Crawlers

Your sitemap helps AI crawlers discover content efficiently. Make sure it's accessible:

Sitemap: https://example.com/sitemap.xml

The Sitemap directive is global — it applies regardless of user-agent blocks. But ensure your sitemap URL itself isn't in a blocked path.
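You can confirm the Sitemap line parses as a global directive using urllib.robotparser, which exposes discovered sitemaps via site_maps() (Python 3.8+):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap.xml
""".splitlines())

# The Sitemap directive is collected regardless of which group it sits near.
print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```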

Mistake 5: Conflicting Rules

User-agent: GPTBot
Disallow: /blog/
Allow: /blog/guides/

When rules conflict, the longest (most specific) matching path wins, with Allow winning ties, per RFC 9309. But not all crawlers implement conflict resolution identically. Keep your rules simple and non-contradictory where possible.
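Python's own urllib.robotparser is a handy illustration of how implementations diverge: it resolves conflicts in file order (first matching rule wins) rather than by longest match, so it judges the example above differently than RFC 9309 prescribes:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: GPTBot
Disallow: /blog/
Allow: /blog/guides/
""".splitlines())

# RFC 9309 longest-match says /blog/guides/ is ALLOWED (more specific rule).
# urllib.robotparser applies rules in file order, so Disallow: /blog/ wins:
print(rp.can_fetch("GPTBot", "https://example.com/blog/guides/x"))  # False
```

If you must mix Allow and Disallow, listing the Allow rules first keeps order-based parsers and longest-match parsers in agreement.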

WordPress-Specific Considerations

WordPress generates a virtual robots.txt by default. To customize it for AI crawlers:

Option 1: Physical robots.txt File

Create a physical robots.txt file in your WordPress root directory. This overrides the virtual one.

Option 2: Filter the Virtual robots.txt

// Append AI-crawler rules to WordPress's virtual robots.txt output.
// $public is the blog_public option: 1 when the site is visible to crawlers.
add_filter('robots_txt', function ($output, $public) {
    $output .= "\n# AI Crawlers\n";
    $output .= "User-agent: GPTBot\n";
    $output .= "Allow: /\n";
    $output .= "Disallow: /wp-admin/\n\n";
    $output .= "User-agent: ClaudeBot\n";
    $output .= "Allow: /\n";
    $output .= "Disallow: /wp-admin/\n\n";
    return $output;
}, 10, 2);

Option 3: Use a GEO Plugin

Plugins like Arvo GEO provide a UI for managing AI crawler access without manually editing robots.txt. This reduces the risk of syntax errors and makes it easy to update as new crawlers emerge.

Testing Your robots.txt

After making changes, validate:

  1. Google's robots.txt tester in Search Console (for Googlebot rules)
  2. Manual verification — visit yoursite.com/robots.txt and read it carefully
  3. Log monitoring — after deploying changes, check if AI crawlers are respecting your rules
  4. Syntax checkers — use online robots.txt validators to catch formatting errors

A Complete Example

Here's a well-structured robots.txt for a site that wants AI search visibility while protecting private content:

# Search Engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI Retrieval (enables citations)
User-agent: ChatGPT-User
Allow: /
Disallow: /members/
Disallow: /wp-admin/

User-agent: PerplexityBot
Allow: /
Disallow: /members/
Disallow: /wp-admin/

# AI Training (strategic access)
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /

User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Disallow: /

User-agent: Google-Extended
Allow: /blog/
Allow: /docs/
Disallow: /

# Default
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap.xml
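Before deploying a file like this, check a few representative agent/path pairs. A rough sanity check with urllib.robotparser, run here against a trimmed copy of the groups above (its order-based conflict resolution is one reason the Allow lines precede the catch-all Disallow in the training groups):

```python
from urllib.robotparser import RobotFileParser

# A trimmed copy of two groups from the example above.
ROBOTS = """\
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))     # True
print(rp.can_fetch("GPTBot", "https://example.com/members/page"))  # False
print(rp.can_fetch("ChatGPT-User", "https://example.com/blog/"))   # True
```

Point the parser at your live file with RobotFileParser("https://yoursite.com/robots.txt") plus read() to run the same checks after deployment.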

Your robots.txt is the foundation of your AI access strategy. Get it right, and you control exactly how AI systems interact with your content. Get it wrong, and you're either invisible to AI search or giving away premium content without realizing it.