robots.txt Tutorial 2026: Write, Test, Deploy

1. What is robots.txt

robots.txt is a plain-text rules file at your site root (https://yourdomain.com/robots.txt) that tells crawlers which paths they may fetch and which they may not. Key boundary: it controls crawling, not indexing. A URL that is disallowed can still appear in search results if other sites link to it. Google states clearly: to keep a page out of search results, use noindex or access controls—not robots.txt alone. See Google Search Central: how robots.txt works and its limits.

2. Where the file lives and how it takes effect

The filename must be lowercase: robots.txt (case-sensitive).
It must sit at the domain root: https://yourdomain.com/robots.txt.
Each subdomain needs its own file (e.g. blog.example.com).
Search engines cache the file, so changes are not instant; Google generally refreshes its cached copy within 24 hours.
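Because the file is resolved per host, it can help to compute which robots.txt governs a given page. A minimal sketch using only the standard library; the function name robots_url and the example URLs are illustrative assumptions, not part of any crawler's API.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme, host, and port)."""
    parts = urlsplit(page_url)
    # Keep scheme and netloc (host plus optional port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/posts/hello?utm=1"))
# https://blog.example.com/robots.txt
print(robots_url("https://example.com/shop/cart"))
# https://example.com/robots.txt
```

The subdomain keeps its own file: rules published on example.com say nothing about blog.example.com.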

Minimal template (three lines to go live)

User-agent: *
Disallow:
Sitemap: https://yourdomain.com/sitemap.xml

This allows all crawlers and points to your sitemap.
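One way to confirm that reading is to feed the three lines to Python's standard-library parser. This is a sketch, not a substitute for a search engine's own parser; it assumes Python 3.8 or later (for site_maps()), and the page URL is made up.

```python
from urllib.robotparser import RobotFileParser

MINIMAL = """\
User-agent: *
Disallow:
Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(MINIMAL.splitlines())  # parse from text, no network request

# An empty Disallow value blocks nothing, so any path is fetchable.
print(parser.can_fetch("Googlebot", "https://yourdomain.com/any/page"))  # True
print(parser.site_maps())  # ['https://yourdomain.com/sitemap.xml']
```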

3. Core directives (five you should know)

Directive	Role	Example
`User-agent`	Which crawler the rules apply to	`User-agent: Googlebot`
`Disallow`	Paths crawlers must not fetch	`Disallow: /admin/`
`Allow`	Explicit allow within a broader disallow	`Allow: /admin/public/`
`Sitemap`	Declares sitemap URL(s)	`Sitemap: https://example.com/sitemap.xml`
`Crawl-delay`	Delay between requests (limited support)	`Crawl-delay: 10`

Syntax notes:

Case-sensitive: /Photo is not the same as /photo.
Paths are prefix-based: Disallow: /Photo can also block /Photography/.
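Google ignores Crawl-delay. The crawl rate limiter in Google Search Console has been retired; Google's guidance for an urgent, short-term slowdown is to return 500, 503, or 429 status codes temporarily.
Wildcards * and end-anchors $ are widely supported; they were not part of the original 1994 convention but are described in RFC 9309.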
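The syntax notes above can be sketched as a small matcher. It assumes the precedence Google documents (the longest matching pattern wins, and Allow wins a tie); other crawlers may resolve conflicts differently. The function names are illustrative, and percent-encoding normalization is left out for brevity.

```python
import re

def pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into a regex anchored at the path start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; every other character is literal.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    """Case-sensitive prefix match with '*' and trailing '$' support."""
    return pattern_to_regex(pattern).match(path) is not None

def is_allowed(rules, path: str) -> bool:
    """rules: list of ("allow" | "disallow", pattern) pairs for one user-agent group."""
    best_len, verdict = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if not pattern or not matches(pattern, path):
            continue  # an empty pattern matches nothing
        allow = directive == "allow"
        # Longer pattern wins; on equal length, Allow wins.
        if len(pattern) > best_len or (len(pattern) == best_len and allow):
            best_len, verdict = len(pattern), allow
    return verdict

print(matches("/Photo", "/Photography/"))         # True: prefix match
print(matches("/Photo", "/photo"))                # False: case-sensitive
print(matches("/*.pdf$", "/docs/a.pdf"))          # True: ends in .pdf
print(matches("/*.pdf$", "/docs/a.pdf?download")) # False: '$' anchors the end

rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(is_allowed(rules, "/admin/settings"))         # False
print(is_allowed(rules, "/admin/public/logo.png"))  # True: the longer Allow wins
```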

4. Three copy-paste templates

Template A: content site (good default)

User-agent: *
Allow: /
Disallow: /search
Disallow: /wp-admin/
Sitemap: https://yourdomain.com/sitemap.xml

Template B: ecommerce (cut parameter churn)

User-agent: *
Allow: /
Disallow: /cart/
Disallow: /checkout/
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?filter=
Disallow: /*&filter=
Sitemap: https://yourdomain.com/sitemap.xml
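A pattern such as /*?sort= matches only when sort is the first query parameter; a URL like /shoes?page=2&sort=price contains &sort= instead and is not covered by it. Each parameter therefore needs a ? form and an & form. A small sketch that generates both for a list of parameter names (the helper name and the parameter list are illustrative):

```python
def param_disallow_rules(params):
    """Build Disallow lines covering a parameter in first (?) or later (&) position."""
    lines = []
    for name in params:
        lines.append(f"Disallow: /*?{name}=")
        lines.append(f"Disallow: /*&{name}=")
    return lines

print("\n".join(param_disallow_rules(["sort", "filter"])))
# Disallow: /*?sort=
# Disallow: /*&sort=
# Disallow: /*?filter=
# Disallow: /*&filter=
```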

Template C: AI crawler policy (layer by intent)

User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Googlebot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

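How to decide: if you want visibility in AI-generated answers, allow retrieval-oriented bots (e.g. OAI-SearchBot, PerplexityBot); to keep content out of model training, block training-focused crawlers and tokens (e.g. GPTBot, CCBot, Google-Extended). Google-Extended is a control token rather than a separate crawler, and blocking it does not affect Google Search.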
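To check that Template C groups agents as intended, the same text can be run through Python's standard-library parser. This is a sketch: the test URL and the agent name SomeOtherBot are made up, and real crawlers apply their own matching logic.

```python
from urllib.robotparser import RobotFileParser

TEMPLATE_C = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Googlebot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(TEMPLATE_C.splitlines())

# A crawler uses the group that names it; only unnamed crawlers fall back to '*'.
for agent in ("GPTBot", "CCBot", "Googlebot", "SomeOtherBot"):
    print(agent, parser.can_fetch(agent, "https://yourdomain.com/guide/"))
# GPTBot False
# CCBot False
# Googlebot True
# SomeOtherBot True
```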

5. Pre-launch checks

🚀 Validate your robots.txt before you ship

Catch syntax issues, conflicting rules, and accidental blocks, and get fix suggestions: open the robots.txt validator.

Pair this with the robots.txt report in Google Search Console for a second opinion. Smoke-test checklist:

Can critical pages be crawled?
Does the sitemap return HTTP 200?
Are CSS/JS assets unblocked?
After launch, watch crawl/index trends for 7–14 days.
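The first and third checklist items can be automated before deploy. A sketch under stated assumptions: the rules and URL list below are hypothetical, and Python's urllib.robotparser, at the time of writing, does not appear to implement the * and $ extensions, so files that rely on wildcards need a different checker. The sitemap status check needs a live HTTP request and is not covered here.

```python
from urllib.robotparser import RobotFileParser

def blocked_urls(robots_txt: str, agent: str, urls):
    """Return the URLs from `urls` that `agent` may not fetch under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]

# Hypothetical rules and URLs, for illustration only.
ROBOTS = """\
User-agent: *
Disallow: /cart/
Disallow: /static/
"""
MUST_BE_CRAWLABLE = [
    "https://yourdomain.com/",
    "https://yourdomain.com/products/widget",
    "https://yourdomain.com/static/app.css",
    "https://yourdomain.com/static/app.js",
]

print(blocked_urls(ROBOTS, "Googlebot", MUST_BE_CRAWLABLE))
# ['https://yourdomain.com/static/app.css', 'https://yourdomain.com/static/app.js']
```

A non-empty result is a reason to stop the deploy: here the /static/ rule would hide CSS and JS from rendering.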

6. Real cases: when robots.txt goes wrong

Case 1: third-party rules drift—“slow leak” traffic loss

Search Engine Land covered a case where CMS or vendor changes to robots rules—plus case sensitivity—caused important URLs to slip from the index over weeks or months, not overnight. Source: Search Engine Land write-up.

Case 2: polluted robots response from CDN—strong recovery after cleanup

After server/CDN injection garbled robots.txt, impressions and indexed URLs tanked; within about three weeks of a clean file and cache purge, indexed URLs rose ~260%, impressions ~261%, and CTR moved from 0.4% to 1.2%. Source: robots.txt error fix case study.
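A cheap guard against this kind of pollution is to flag any line in the served file that does not look like a directive. This is a rough sketch: the injected line below is a made-up example (the case study does not describe the exact garbling), and the field list is limited to the five directives covered in this tutorial, so other fields some crawlers accept would also be flagged.

```python
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}

def suspicious_lines(robots_txt: str):
    """Return (line_number, text) pairs that do not look like robots.txt directives."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        field, sep, _value = line.partition(":")
        if not sep or field.strip().lower() not in KNOWN_FIELDS:
            problems.append((number, raw))
    return problems

# Hypothetical polluted response, for illustration only.
POLLUTED = """\
User-agent: *
Disallow: /admin/
<script src="https://cdn.example.net/inject.js"></script>
Sitemap: https://yourdomain.com/sitemap.xml
"""
print(suspicious_lines(POLLUTED))
# [(3, '<script src="https://cdn.example.net/inject.js"></script>')]
```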

7. Screenshots: robots.txt from well-known sites

Google's public robots.txt: complex rules and fine-grained groups.

Cloudflare: layered paths including locale directories.

Moz: including GPTBot and other AI-related user-agents.

OpenAI: a concise policy-style example.

8. Seven common mistakes and fixes

Shipping Disallow: / to production: blocks the whole site. Fix: validate in a tool before deploy.
Blocking render assets (e.g. /_next/): Google can’t render the page. Fix: allow JS/CSS paths.
Disallowing sitemap URLs: slows discovery. Fix: keep sitemap crawlable.
Using robots.txt as “noindex”: disallowed URLs can still be indexed via external links. Fix: use noindex or auth.
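Wrong case or trailing slash: /admin ≠ /admin/. Disallow: /admin also blocks /admin-panel and /administrator, while Disallow: /admin/ leaves /admin itself crawlable. Fix: match the paths you intend.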
Blocking whole directories by mistake: e.g. Disallow: /blog/ stops all posts. Fix: narrow the path.
No change monitoring: silent edits cause slow index loss. Fix: diffs or alerts.
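Mistakes 1 and 7 lend themselves to a small automated check: diff each fetched copy against the last known-good one, and flag a site-wide block. A sketch using only the standard library; the snapshots are made up, how the copies are fetched and stored is left out, and the block detector deliberately ignores Allow overrides.

```python
import difflib

def robots_diff(previous: str, current: str) -> str:
    """Unified diff between two robots.txt snapshots; empty string if identical."""
    return "".join(difflib.unified_diff(
        previous.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile="robots.txt (previous)",
        tofile="robots.txt (current)",
    ))

def blocks_all_crawlers(robots_txt: str) -> bool:
    """True if a group that includes 'User-agent: *' contains 'Disallow: /'."""
    agents, in_rules = [], False
    for raw in robots_txt.splitlines():
        field, _sep, value = raw.split("#", 1)[0].partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rule lines starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False

# Hypothetical snapshots, for illustration only.
OLD = "User-agent: *\nDisallow: /search\n"
NEW = "User-agent: *\nDisallow: /\n"

print(robots_diff(OLD, NEW))      # shows '-Disallow: /search' and '+Disallow: /'
print(blocks_all_crawlers(NEW))   # True
```

A deliberate block for a single named crawler (for example a GPTBot group with Disallow: /) is not flagged, because only groups that include * are checked.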

9. FAQ

Q1: Is robots.txt legally binding?

Usually no—it’s a voluntary convention. Major search crawlers respect it; bad actors may not.

Q2: Is robots.txt still relevant?

Yes—especially with AI crawlers. It’s the first layer of crawl governance; pair it with noindex and access control.

Q3: What does “Blocked by robots.txt” mean?

Your rules tell the crawler not to fetch this URL. It can still appear in search results if other pages link to it.

Q4: What belongs in a minimal robots.txt?

At least User-agent rules, any needed Disallow, and a correct Sitemap line.

Q5: How do I fix “blocked by robots.txt” errors?

Find the blocking rule → adjust scope → re-test → confirm in Search Console → watch trends for 7–14 days.

Q6: Can robots.txt create a security hole?

Not by itself, but it can hint at URL patterns. Do not rely on it for secrecy—use authentication.

Q7: What should I put in the file?

Only crawler directives (User-agent, Disallow, Allow, Sitemap)—not HTML, marketing copy, or sensitive paths as “security”.
