Robots.txt and AI: Taking Control of How Your Content is Used

If you own a website, you’ve probably heard the word “robots” thrown around in relation to AI lately. But here’s the thing: robots.txt has been around since the 1990s, and it was never designed with AI in mind. It’s a way of telling any automated crawler how to behave on your site. Right now, it’s also one of the few tools that gives you a say in whether your content feeds into AI models.

Let me explain what’s happening, why it matters, and what you can do about it.

What’s Actually Happening

Major AI companies (OpenAI, Anthropic, Google, Meta, Apple, Amazon, and others) are scraping the internet to train their language models. They’re doing this at massive scale. Without any intervention on your part, your website’s content is likely being included in those training datasets.

Here’s the important bit: they’re not asking, and they’re not paying. Your writing, your knowledge, your expertise – it’s all being copied and fed into AI models so those companies can build better products.

You might think: “Well, Google’s been doing this for years with search.” And you’d be right. But here’s the difference. When Google indexes your site, they send people back to you through search results. It drives traffic and visibility. With AI training, there’s no traffic back to you. Your content is used once, at scale, with no attribution and no benefit.

The Two Types of AI Crawlers

This is the crucial bit, because not all AI crawlers are the same.

Search and citation bots: these are tools like ChatGPT’s search, Claude’s search, and Perplexity. When you ask them a question, they fetch a page in real time, read it, and cite it in their answer. They link back to you. They drive traffic. They’re a bit like Google. They’re beneficial to you because people discover your content through them.

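Training scrapers: these are bots like GPTBot and ClaudeBot, plus opt-out names like Google-Extended. (Google-Extended isn’t a separate crawler. It’s a robots.txt name Google checks to decide whether pages fetched by its existing crawlers may be used for its AI models.) Training bots crawl your site, download what they find, and use it to train models. No traffic back. No attribution. Just access to your content and then they move on.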

The problem is: most people don’t know the difference, so they either block everything or allow everything. But there’s a smarter middle ground.

The Smart Approach: Allow Search, Block Training

This is where robots.txt comes in, and why it’s suddenly relevant again.

A robots.txt file sits in the root of your website and tells automated crawlers what they can and can’t do. It’s been used for decades to tell Google and Bing which parts of a site to crawl and which to leave alone.

The smart strategy is this:

Allow the search and citation bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot). These help you get discovered and cited. You want them to work.

Block the training scrapers (GPTBot, ClaudeBot, Google-Extended). You don’t want your content quietly feeding someone else’s AI model.

This way, you stay visible in AI-powered search and answers. People can find you through ChatGPT or Claude and actually click through to your site. But you’re not letting your content be used as training data without permission or attribution.

It’s the best of both worlds.

How to Do It

If you’re on WordPress (and a large share of websites are), you already have a robots.txt. WordPress serves a basic one by default, which blocks the admin folder so search engines don’t crawl your WordPress backend.

Adding AI rules is straightforward. Here’s what a full robots.txt file looks like with the AI rules included:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Allow AI search and citation bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Block AI training scrapers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Bytespider
Disallow: /

Sitemap: https://yoursite.com/sitemap.xml

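The first section (User-agent: *) is your default rule. It applies to any crawler that doesn’t have its own section further down, and it blocks the WordPress admin folder so search engines don’t crawl your backend. Then you add specific rules for the AI bots you want to allow or block. One thing to know: a bot with its own named section follows only that section and ignores the default one, so the allowed AI bots are not bound by the /wp-admin/ rule unless you repeat it under their names. Just replace “yoursite.com” with your actual domain.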
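If you’d like to sanity-check your rules before publishing them, here’s a rough sketch using Python’s built-in urllib.robotparser. The check_access helper and the test URLs are my own illustration, and the rules are a shortened copy of the file above. I’ve left out the admin-ajax.php line because Python’s parser applies rules in file order, which isn’t quite how Google resolves an Allow inside a Disallow, so treat the result as a quick check rather than the final word on how each bot will behave.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the robots.txt above; paste your full file here instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def check_access(robots_txt, agents, url):
    """Return {agent: True/False} saying whether each agent may fetch url.

    Hypothetical helper for illustration, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["Googlebot", "OAI-SearchBot", "Claude-SearchBot", "GPTBot", "ClaudeBot"]
    # yoursite.com is a placeholder domain, as in the file above.
    for url in ("https://yoursite.com/blog/post/", "https://yoursite.com/wp-admin/"):
        print(url, check_access(ROBOTS_TXT, agents, url))
```

Running it on the /wp-admin/ URL is a handy way to see the section rule in action: Googlebot falls back to the default section and is refused, while OAI-SearchBot has its own “Allow: /” section and is let through.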

If your site uses an SEO plugin like Yoast, Rank Math, or All in One SEO, you can add these rules right in the plugin’s robots.txt editor. If you’re uploading a file manually, it goes in the root folder of your site (often called public_html), so that it’s reachable at yoursite.com/robots.txt.

It’s not complicated, and it’s safe. It won’t break anything on your site or harm your regular search visibility.

The Honest Limitations

Here’s where I need to be straight with you: robots.txt is a request, not a lock.

Reputable bots (and that includes all the major ones: OpenAI, Anthropic, Perplexity, Google) respect robots.txt. They honour what you tell them. But some don’t. Bytespider (ByteDance’s bot) has a documented history of ignoring robots.txt, for example. If a bot misbehaves, robots.txt alone won’t stop it.
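One way to see whether a blocked bot is actually staying away is to look at your server’s access log. Here’s a minimal sketch of that idea. It assumes the common “combined” log format, where the user agent is the last quoted field on each line. The function name and the sample log lines are made up for illustration. It only looks for bots that announce themselves by name, so a scraper that disguises itself won’t show up.

```python
import re
from collections import Counter

# Bots blocked in the robots.txt above that identify themselves in the User-Agent header.
BLOCKED_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "Meta-ExternalAgent", "Amazonbot", "Bytespider"]

# Assumption: combined log format. The request is the first quoted field,
# the user agent is the last quoted field on the line.
LINE_RE = re.compile(r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*".*"(?P<agent>[^"]*)"\s*$')


def count_blocked_hits(log_lines):
    """Count page requests made by bots that robots.txt told to stay away."""
    hits = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        # Fetching robots.txt itself is expected: that is how a bot reads the rules.
        if not match or match.group("path") == "/robots.txt":
            continue
        agent = match.group("agent").lower()
        for token in BLOCKED_TOKENS:
            if token.lower() in agent:
                hits[token] += 1
    return hits


if __name__ == "__main__":
    # Invented sample lines; in practice, read them from your own access log file.
    sample = [
        '203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; Bytespider)"',
        '203.0.113.9 - - [10/Jan/2025:10:00:05 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    ]
    print(count_blocked_hits(sample))
```

If a blocked bot keeps turning up with requests for real pages, that’s your cue that robots.txt isn’t enough for that particular bot.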

If you really need a hard block on a specific bot, you need to go further: server-level blocking, CDN rules, or IP-based restrictions. Cloudflare, for example, has a one-click “Block AI bots” toggle. But that comes with a trade-off. It also blocks the helpful citation bots, so you lose the search visibility benefit.
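To make the idea of a selective hard block concrete, here’s a sketch of the logic a server or CDN rule would apply: refuse requests whose User-Agent names a training crawler, and let everything else through, including the citation bots. It’s written as a small Python WSGI wrapper purely for illustration. The names is_blocked and block_training_bots are mine, and on a WordPress site you’d express the same rule in your web server or CDN settings rather than in Python. As far as I know, Google-Extended and Applebot-Extended don’t make requests under their own names, so they can’t be blocked this way and are left off the list.

```python
# Lower-case name fragments of the training crawlers from the robots.txt above
# that send their own User-Agent header.
BLOCKED_AGENTS = ("gptbot", "claudebot", "ccbot", "meta-externalagent", "bytespider")


def is_blocked(user_agent):
    """True if the User-Agent header names one of the blocked crawlers."""
    agent = user_agent.lower()
    return any(token in agent for token in BLOCKED_AGENTS)


def block_training_bots(app):
    """Wrap a WSGI app so blocked crawlers get a 403 instead of the page."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else, including search and citation bots, is served normally.
        return app(environ, start_response)
    return middleware
```

Keep in mind that this only checks the name a bot gives for itself. A scraper that lies about its User-Agent gets past it, which is why the IP-based options exist.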

Also, the user-triggered bots (ChatGPT-User, Claude-User, Perplexity-User), which fetch a page on demand when someone asks the AI a question, don’t always honour robots.txt as strictly as the indexing bots do. They try to, but it’s not guaranteed.

So robots.txt is a good first step. It works for the companies that actually respect it, and it’s a signal of your intent. But it’s not a perfect solution.

Why This Matters

Here’s the bottom line: AI companies are going to keep training their models on internet data. That’s not going away. The question is whether you have any say in how your specific content is treated.

Right now, robots.txt is one of the few mechanisms you have to actually express that preference. And there’s a real difference between:

  • “Help yourself to my content for your training data” (allowing everything)
  • “You can search my content and cite it back to me, but don’t scrape it for training” (the smart middle ground)
  • “Don’t touch my content at all” (blocking everything, including helpful search)

Most people don’t realise they have this option. They think it’s all or nothing. It’s not.

What’s Next?

In the longer term, there will probably be legal frameworks around this: laws about content scraping, licensing, and attribution. But we’re not there yet. Right now, robots.txt is what you’ve got.

If you haven’t already, it’s worth reviewing your robots.txt file and thinking about what stance you want to take. Do you want to be discoverable in AI-powered search? Do you want to opt out of training? Or something in between?

It’s your choice. And that’s actually worth something.


Have questions about your own robots.txt? Reach out and we can talk through the right approach for your site.
