GROWWITHBA

Robots.txt: What It Does, What It Doesn't, and How Sites Get It Wrong

By Arjun Mehta · Updated June 2026 · SEO

Robots.txt is a few lines of text with site-killing power — one wrong slash has deindexed entire businesses after a staging file shipped to production. It's also widely misunderstood: blocking a URL doesn't remove it from Google, and the file is a request, not a lock.

Here's what robots.txt actually controls, the syntax that matters, and the 2026 layer: AI crawler policy.

Key takeaways

  • Robots.txt controls crawling, not indexing — blocked pages can still appear in results from external links.
  • To remove pages from the index, allow crawling and use noindex; blocking + noindex means the noindex is never seen.
  • The classic disasters: Disallow: / left from staging, blocked CSS/JS breaking rendering, and case-sensitive path mistakes.
  • AI crawlers (GPTBot and peers) read rules addressed to their own user-agent tokens, and reputable ones say they honor them, so robots.txt is now your AI-training and AI-answer policy file.

What it is and how it reads

The file lives at the root of each host (for example https://example.com/robots.txt; every subdomain needs its own) and speaks in groups: a User-agent line naming the bot, then Disallow/Allow paths. Paths are prefix matches and case-sensitive; * wildcards and $ end-anchors refine them. When several rules match a URL, the longest matching path wins, and major crawlers such as Googlebot resolve an exact tie in favor of Allow. A Sitemap line points crawlers to your sitemap. That's nearly the whole language, and its danger is concision: 'Disallow: /' bars the entire site, while 'Disallow:' (empty) bars nothing, and the difference has ruined launches.
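To make the matching rules concrete, here is a minimal sketch of a longest-match evaluator with * and $ support. It is an illustration under simplifying assumptions, not any crawler's actual implementation: the function names (parse_groups, pattern_to_regex, is_allowed) are hypothetical, user-agent tokens are matched exactly rather than by product-token rules, and the sample files are invented.

```python
import re


def parse_groups(robots_txt):
    """Parse robots.txt text into {user_agent_token: [(kind, path), ...]}."""
    groups = {}
    current_agents = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def pattern_to_regex(path):
    """Translate a robots.txt path (prefix match, * wildcard, $ anchor)."""
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    body = ".*".join(re.escape(piece) for piece in path.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(robots_txt, user_agent, url_path):
    """Longest matching rule wins; Allow wins an exact-length tie."""
    groups = parse_groups(robots_txt)
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best = None  # (pattern length, is_allow)
    for kind, path in rules:
        if path == "":
            continue  # an empty rule matches nothing
        if pattern_to_regex(path).match(url_path):
            candidate = (len(path), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


STAGING = "User-agent: *\nDisallow: /\n"   # bars the whole site
OPEN = "User-agent: *\nDisallow:\n"        # bars nothing
MIXED = """\
User-agent: *
Disallow: /private/
Allow: /private/press/
Disallow: /*.pdf$
"""

print(is_allowed(STAGING, "Googlebot", "/pricing"))             # False
print(is_allowed(OPEN, "Googlebot", "/pricing"))                # True
print(is_allowed(MIXED, "Googlebot", "/private/press/kit"))     # True
print(is_allowed(MIXED, "Googlebot", "/Private/"))              # True (case-sensitive)
print(is_allowed(MIXED, "Googlebot", "/files/report.pdf"))      # False
print(is_allowed(MIXED, "Googlebot", "/files/report.pdf?v=2"))  # True ($ anchors the end)
```

The one-character gap between STAGING and OPEN is the launch-ruining difference described above, and the /Private/ case shows why a capitalization slip silently unblocks a path.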

The indexing confusion

Robots.txt says 'don't fetch'; it never says 'don't list'. A blocked URL that earns external links can still be indexed and shown as a bare 'no information available' result, because the search engine knows the URL exists but not what is on it. Removal logic is the reverse of instinct: let the bot crawl the page so it can see a noindex directive (or return 404/410, or use removal tools for urgency). The pairing mistake, blocking a path and noindexing its pages, guarantees the noindex is never read. Use robots.txt for crawl economy (utility paths, infinite parameter spaces, internal search); use noindex, as a meta robots tag or an X-Robots-Tag response header, for index control; never confuse the two layers.
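The pairing mistake can be checked mechanically. The sketch below uses Python's standard urllib.robotparser and html.parser; the diagnose function, the sample robots.txt, and the example.com URLs are hypothetical, and it only inspects meta robots tags (not X-Robots-Tag headers). Note that urllib.robotparser does plain prefix matching without wildcards, so treat it as a rough check rather than a replica of any search engine's parser.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append((attrs.get("content") or "").lower())


def diagnose(robots_txt, url, html, user_agent="Googlebot"):
    """Report whether a page's noindex can actually be seen by the crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable = parser.can_fetch(user_agent, url)

    finder = RobotsMetaFinder()
    finder.feed(html)
    noindex = any("noindex" in directive for directive in finder.directives)

    if noindex and not crawlable:
        return "conflict: noindex is present but the crawl block hides it"
    if noindex:
        return "ok: crawlable, so the noindex can be read"
    if not crawlable:
        return "blocked: not crawled, but may still be listed via links"
    return "ok: crawlable and indexable"


# Hypothetical inputs for illustration only.
ROBOTS = "User-agent: *\nDisallow: /search/\n"
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

print(diagnose(ROBOTS, "https://example.com/search/results", PAGE))
print(diagnose(ROBOTS, "https://example.com/old-offer", PAGE))
```

The first call lands in the conflict branch: the page asks to be dropped from the index, but the Disallow rule means a compliant crawler never fetches it to find out.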

The audit and the AI layer

Quarterly, read your live file and test critical URLs in Search Console (the robots.txt report shows the file Google last fetched; URL Inspection shows whether a specific URL is blocked). Confirm three things: nothing essential is blocked (especially CSS/JS, since blocked resources break rendering and hurt how pages are evaluated), staging rules didn't ship, and sitemap lines are current. Then set deliberate AI policy: crawlers like GPTBot, ClaudeBot, and peers identify themselves by user-agent token, and their operators state that they follow rules addressed to that token. You can allow them (visibility in AI answers, inclusion in training) or block them per business stance. There's a real tradeoff: blocking AI crawlers protects content from training but can remove you from the answer surfaces where buyers increasingly ask questions. Decide it as strategy, not default.
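Parts of this audit can be scripted between manual checks. The following is a rough sketch, assuming you have already saved the live file's text; the audit function, the sample file, and the example.com URLs are hypothetical. It relies on urllib.robotparser, which uses simple prefix matching and ignores wildcards, so results for wildcard rules should still be confirmed in Search Console.

```python
from urllib.robotparser import RobotFileParser


def audit(robots_txt, critical_urls, user_agent="Googlebot"):
    """Return a list of findings worth a human look."""
    findings = []
    lines = [ln.split("#", 1)[0].strip() for ln in robots_txt.splitlines()]

    # A bare "Disallow: /" is the classic staging leftover. It may be
    # deliberate for one specific bot, so it is flagged, not failed.
    if any(ln.lower().replace(" ", "") == "disallow:/" for ln in lines):
        findings.append("blanket 'Disallow: /' present")

    if not any(ln.lower().startswith("sitemap:") for ln in lines):
        findings.append("no Sitemap line")

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    for url in critical_urls:
        if not parser.can_fetch(user_agent, url):
            findings.append(f"blocked: {url}")
    return findings


# Hypothetical live file and URL list for illustration only.
LIVE = """\
User-agent: *
Disallow: /assets/
Disallow: /cart/
"""
CRITICAL = [
    "https://example.com/",
    "https://example.com/assets/site.css",
    "https://example.com/assets/app.js",
]

for finding in audit(LIVE, CRITICAL):
    print(finding)
```

With this sample file the script reports the missing Sitemap line and both blocked assets, which is the CSS/JS rendering problem the audit is meant to catch.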

Frequently asked questions

Will robots.txt remove my page from Google?

No — it prevents crawling, not listing. For removal: allow the crawl, serve noindex (or 404/410), and use Search Console removal for urgent cases.

Should I block AI crawlers in robots.txt?

It's a business decision: blocking limits training-data use but also limits presence in AI answers that now influence buyers. Many brands allow answer-engine bots while watching policy evolve.
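Whatever stance you choose, it helps to verify that the file expresses it. The sketch below shows one possible stance, not a recommendation: it blocks the two AI crawlers named in this article and leaves other bots on the default group. The policy_report function and the ExampleAnswerBot token are hypothetical, and the check reflects only how urllib.robotparser reads the file, not how any vendor's crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# One possible policy, for illustration only.
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /cart/
"""


def policy_report(robots_txt, agents, sample_url):
    """Map each user-agent token to whether it may fetch sample_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, sample_url) for agent in agents}


report = policy_report(
    POLICY,
    ["GPTBot", "ClaudeBot", "Googlebot", "ExampleAnswerBot"],  # last token is made up
    "https://example.com/guides/robots-txt",
)
for agent, allowed in report.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

A bot with its own group ignores the * group entirely, so any rule you want applied to GPTBot or ClaudeBot has to be written inside that bot's group.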

Why does Search Console show 'indexed though blocked by robots.txt'?

External links surfaced the URL despite the crawl block. Unblock and noindex it to remove it, or unblock and let it be crawled properly if it should rank.