Robots.txt: What It Does, What It Doesn't, and How Sites Get It Wrong

Arjun Mehta
Senior Growth Strategist · Reviewed by the GrowwithBA team
SEO · 5 MIN READ · Updated June 2026
THE SHORT ANSWER

Robots.txt guide: directive syntax, what blocking actually means, the noindex confusion, AI crawler controls, and the mistakes that deindex sites.

Robots.txt is a few lines of text with site-killing power — one wrong slash has deindexed entire businesses after a staging file shipped to production. It's also widely misunderstood: blocking a URL doesn't remove it from Google, and the file is a request, not a lock.

Here's what robots.txt actually controls, the syntax that matters, and the 2026 layer: AI crawler policy.

Key takeaways

  • Robots.txt controls crawling, not indexing — blocked pages can still appear in results from external links.
  • To remove pages from the index, allow crawling and use noindex; blocking + noindex means the noindex is never seen.
  • The classic disasters: Disallow: / left from staging, blocked CSS/JS breaking rendering, and case-sensitive path mistakes.
  • AI crawlers (GPTBot and peers) honor their own user-agent rules — robots.txt is now your AI-training and AI-answer policy file.

What it is and how it reads

The file lives at the root of each host (/robots.txt) and speaks in groups: a User-agent line naming the bot, then Disallow/Allow paths. Paths are prefix matches and case-sensitive; * wildcards and $ end-anchors refine them; when rules conflict, the most specific (longest) matching path wins. A Sitemap line points crawlers to your sitemap. That's nearly the whole language, and its danger is concision: 'Disallow: /' bars the entire site, while 'Disallow:' (empty) bars nothing, and the difference has ruined launches.
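To make those matching rules concrete, here is a minimal Python sketch of how a single user-agent group might be evaluated. It is a simplified illustration, not a full parser: the function names (pattern_to_regex, is_allowed) and the sample paths are our own, and the tie-break in favor of Allow is an assumption based on how the major crawlers document their behavior.

```python
import re


def pattern_to_regex(path_pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    Without '$', the pattern is a plain prefix match.
    """
    anchored = path_pattern.endswith("$")
    body = path_pattern[:-1] if anchored else path_pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_allowed(rules, url_path: str) -> bool:
    """Evaluate (directive, pattern) pairs from ONE user-agent group.

    The longest matching pattern wins; on a tie, Allow wins (assumption).
    Matching is case-sensitive, and no matching rule means "allowed".
    """
    best_len = -1
    allowed = True
    for directive, pattern in rules:
        if pattern == "":
            continue  # an empty Disallow matches nothing
        if pattern_to_regex(pattern).match(url_path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed


rules = [
    ("Disallow", "/private/"),
    ("Allow", "/private/press/"),
    ("Disallow", "/*.pdf$"),
]
print(is_allowed(rules, "/private/report"))         # False: prefix match
print(is_allowed(rules, "/private/press/launch"))   # True: the longer Allow wins
print(is_allowed(rules, "/Private/report"))         # True: paths are case-sensitive
print(is_allowed(rules, "/files/guide.pdf"))        # False: wildcard plus end anchor
print(is_allowed([("Disallow", "/")], "/pricing"))  # False: the whole site is barred
print(is_allowed([("Disallow", "")], "/pricing"))   # True: empty Disallow bars nothing
```

The last two calls are the one-character difference that ruins launches: the same directive, with and without the slash.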

The indexing confusion

Robots.txt says 'don't fetch'; it never says 'don't list'. A blocked URL that earns external links can still rank as a bare 'no information available' result. Removal logic is the reverse of instinct: let the bot crawl the page so it can see a noindex tag (or return 404/410, or use removal tools for urgency). The pairing mistake — blocking a path and noindexing its pages — guarantees the noindex is never read. Use robots.txt for crawl economy (utility paths, infinite parameter spaces, internal search); use meta directives for index control; never confuse the two layers.
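The pairing mistake is easy to check for mechanically. The sketch below classifies a URL by combining crawl access with index directives. The helper name diagnose, its return labels, and the example URLs are hypothetical; it uses Python's standard urllib.robotparser, which handles simple prefix rules but does not implement wildcards or longest-match precedence, so treat it as a rough first pass rather than a replica of Googlebot.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsFinder(HTMLParser):
    """Records whether a robots/googlebot meta tag contains 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            if "noindex" in attr.get("content", "").lower():
                self.noindex = True


def diagnose(robots_txt, url, html, headers=None, user_agent="Googlebot"):
    """Classify a URL by crawl access and index directive (hypothetical helper).

    headers is a plain dict; only the exact key 'X-Robots-Tag' is checked.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = not parser.can_fetch(user_agent, url)

    finder = MetaRobotsFinder()
    finder.feed(html)
    header = (headers or {}).get("X-Robots-Tag", "")
    noindex = finder.noindex or "noindex" in header.lower()

    if blocked and noindex:
        return "conflict"      # noindex exists but the bot can never read it
    if blocked:
        return "blocked_only"  # not fetched, yet may still be listed via links
    if noindex:
        return "will_drop"     # crawlable and noindexed: the removal path
    return "indexable"


robots = "User-agent: *\nDisallow: /old/"
page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(diagnose(robots, "https://www.example.com/old/offer", page))  # conflict
print(diagnose(robots, "https://www.example.com/new/offer", page))  # will_drop
```

Any URL that comes back as "conflict" needs the crawl block lifted before the noindex can take effect.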

The audit and the AI layer

Quarterly, read your live file and test critical URLs in Search Console (the robots.txt report and the URL Inspection tool): confirm nothing essential is blocked (especially CSS/JS, since blocked resources break rendering and hurt evaluation), staging rules didn't ship, and sitemap lines are current. Then set deliberate AI policy: crawlers like GPTBot, ClaudeBot, and peers identify themselves by user-agent and, according to their operators, honor rules addressed to them, so you can allow them (visibility in AI answers, inclusion in training) or block them per business stance. There's a real tradeoff: blocking AI crawlers protects content from training but can remove you from the answer surfaces where buyers increasingly ask questions. Decide it as strategy, not default.
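Part of that quarterly read can be scripted. Here is a rough sketch, assuming you paste in the text of your live file: it reports which critical paths are blocked for Googlebot, whether each AI crawler may fetch the homepage, and which sitemap lines are declared. The audit function, the AI_BOTS tuple, the example.com host, and the sample paths are all illustrative assumptions. It relies on Python's urllib.robotparser (site_maps() needs Python 3.8+), which does not understand wildcards, so confirm anything surprising in Search Console.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list; extend it with whichever crawlers your policy covers.
AI_BOTS = ("GPTBot", "ClaudeBot")


def audit(robots_txt, critical_paths, site="https://www.example.com"):
    """Summarize crawl access for critical paths and AI bots (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    blocked = [
        path for path in critical_paths
        if not parser.can_fetch("Googlebot", site + path)
    ]
    ai_policy = {bot: parser.can_fetch(bot, site + "/") for bot in AI_BOTS}
    sitemaps = parser.site_maps() or []  # site_maps() returns None when absent

    return {"blocked_critical": blocked, "ai_policy": ai_policy, "sitemaps": sitemaps}


live_file = """\
User-agent: *
Disallow: /assets/
Disallow: /search

User-agent: GPTBot
Disallow: /

Sitemap: https://www.example.com/sitemap.xml
"""

report = audit(live_file, ["/", "/pricing", "/assets/site.css"])
print(report["blocked_critical"])  # ['/assets/site.css']
print(report["ai_policy"])         # {'GPTBot': False, 'ClaudeBot': True}
print(report["sitemaps"])          # ['https://www.example.com/sitemap.xml']
```

In this sample the stylesheet is blocked, which is exactly the rendering problem the audit is meant to catch, and the AI policy is explicit for one bot but only inherited from the wildcard group for the other.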

Common mistakes that quietly kill results

These come straight from audits we run every week. If any of them stings, you’re in good company — and the fix is usually faster than you think.

Publishing without a keyword owner. Two pages chasing the same query split your authority. Before anything new goes live, run a site: search for the head term; if a URL already ranks in positions 15-40, update that page instead. We've seen consolidations jump a page from #18 to #6 in three weeks with zero new content.

Building links to the homepage only. Homepage links lift the domain a little. Links to the actual page you want ranked lift that page a lot. Aim 70% of outreach at money and pillar pages.

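Wasting crawl budget on junk. Faceted URLs, tag pages, and paginated archives eat crawl budget on large sites. Disallow the true crawl traps in robots.txt, noindex the thin pages that have to stay crawlable, and never apply both to the same URL; important pages tend to get crawled faster as a result.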

Writing meta descriptions like a robot. Your meta description is ad copy. Lead with the outcome, include a number, end with a reason to click. CTR moves rankings more than most on-page tweaks.

FROM THE TRENCHES

A DTC skincare client had 340 blog posts and falling traffic. We deleted or merged 180 of them, redirected the URLs, and refreshed the top 40. Organic traffic rose 62% in four months — with less content, not more.

Quick checklist before you ship

  • Primary keyword appears in title, H1, URL, and first 100 words — once each, naturally
  • Title under 60 characters with a number or a hook
  • Images compressed under 100KB with descriptive alt text
  • Search the SERP: your format matches what's already ranking
  • One original element competitors don't have: data, example, template, or screenshot
  • Page renders correctly on mobile, and its mobile rankings are being tracked
  • At least 5 internal links pointing in, 3-8 pointing out to related pages

Frequently asked questions

Will robots.txt remove my page from Google?

No — it prevents crawling, not listing. For removal: allow the crawl, serve noindex (or 404/410), and use Search Console removal for urgent cases.

Should I block AI crawlers in robots.txt?

It's a business decision: blocking limits training-data use but also limits presence in AI answers that now influence buyers. Many brands allow answer-engine bots while watching policy evolve.

Why does Search Console show 'indexed though blocked by robots.txt'?

External links surfaced the URL despite the crawl block. Unblock and noindex it to remove it, or unblock and let it be crawled properly if it should rank.

Arjun Mehta

Senior Growth Strategist at GrowwithBA. 12 years running SEO, paid media, and retention for ecommerce and SaaS brands from $1M to $100M+. Every guide here comes from live client work — not theory.
