robots.txt Tutorial 2026: Write, Test, Deploy

2026-04-21·6 min·By Ethan

1. What is robots.txt

robots.txt is a plain-text rules file at your site root (https://yourdomain.com/robots.txt) that tells crawlers which paths they may fetch and which they may not. Key boundary: it controls crawling, not indexing. A URL that is disallowed can still appear in search results if other sites link to it. Google states clearly: to keep a page out of search results, use noindex or access controls—not robots.txt alone. See Google Search Central: how robots.txt works and its limits.
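
To make the crawling-versus-indexing boundary concrete, here is a minimal Python sketch that checks whether a response carries a noindex signal, either in an X-Robots-Tag header or in a robots meta tag. The function name has_noindex and its inputs are illustrative assumptions, and the check is simplified: it ignores bot-specific forms such as "googlebot: noindex".

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers, html):
    """Return True if the headers or the HTML ask engines not to index the page.

    Hypothetical helper: `headers` is a plain dict, `html` is the page source.
    """
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            if "noindex" in [d.strip().lower() for d in value.split(",")]:
                return True
    parser = _RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


print(has_noindex({"X-Robots-Tag": "noindex, nofollow"}, ""))
print(has_noindex({}, '<meta name="robots" content="noindex">'))
```

A crawler only sees a noindex signal if it is allowed to fetch the page, so a URL that relies on noindex should not also be disallowed in robots.txt.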

2. Where the file lives and how it takes effect

  1. The filename must be lowercase: robots.txt (case-sensitive).
  2. It must sit at the domain root: https://yourdomain.com/robots.txt.
  3. Each subdomain needs its own file (e.g. blog.example.com).
  4. Search engines cache the file, so changes are not instant. Google generally caches it for up to 24 hours.
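
The location rules above can be sketched in a few lines of Python: given any page URL, the governing robots.txt sits at the root of the same scheme and host. The helper name robots_url is an assumption made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs page_url (same scheme, host, port)."""
    parts = urlsplit(page_url)
    # Drop the path, query, and fragment; keep only scheme and host.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yourdomain.com/blog/post?id=1"))
print(robots_url("https://blog.example.com/2026/04/hello"))
```

The second call resolves to the subdomain's own file, which is why blog.example.com needs a robots.txt separate from example.com.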

Minimal template (three lines to go live)

User-agent: *
Disallow:
Sitemap: https://yourdomain.com/sitemap.xml
This allows all crawlers and points to your sitemap.
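
As a quick sanity check, the minimal template can be parsed with Python's standard-library urllib.robotparser. This is a sketch rather than a replica of any search engine's parser; the bot name AnyBot is made up, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

MINIMAL = """\
User-agent: *
Disallow:
Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(MINIMAL.splitlines())

# An empty Disallow value blocks nothing, so any path should be fetchable.
allowed = parser.can_fetch("AnyBot", "https://yourdomain.com/any/page")
sitemaps = parser.site_maps()

print("allowed:", allowed)
print("sitemaps:", sitemaps)
```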

3. Core directives (five you should know)

Directive | Role | Example
User-agent | Which crawler the rules apply to | User-agent: Googlebot
Disallow | Paths crawlers must not fetch | Disallow: /admin/
Allow | Explicit allow within a broader disallow | Allow: /admin/public/
Sitemap | Declares sitemap URL(s) | Sitemap: https://example.com/sitemap.xml
Crawl-delay | Delay between requests (limited support) | Crawl-delay: 10

Syntax notes:
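  • Paths are case-sensitive: /Photo is not the same as /photo. Directive names and user-agent tokens are not case-sensitive.
  • Paths are prefix-based: Disallow: /Photo also blocks /Photography/.
  • Google ignores Crawl-delay. The Search Console crawl rate limiter was retired in January 2024; to slow Googlebot in an emergency, temporarily return 500, 503, or 429 responses.
  • Wildcards * and end-anchors $ are widely supported. They were not in the original 1994 convention but are part of RFC 9309.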
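
The syntax notes above can be checked with a small matcher. This is a minimal sketch of Google-style path matching (prefix match, * as any run of characters, $ as an end anchor), written for illustration; the function names are assumptions and it skips details such as percent-encoding normalization.

```python
import re


def rule_to_regex(rule_path):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule_path, url_path):
    """True if url_path falls under rule_path (prefix match unless '$' ends it)."""
    return rule_to_regex(rule_path).match(url_path) is not None


checks = [
    ("/Photo", "/Photography/"),                  # prefix match
    ("/Photo", "/photo"),                         # case differs
    ("/*.pdf$", "/files/report.pdf"),             # '$' anchors the end
    ("/*.pdf$", "/files/report.pdf?download=1"),  # extra characters after .pdf
    ("/*?sort=", "/shoes?sort=price"),            # sort is the first parameter
    ("/*?sort=", "/shoes?color=red&sort=price"),  # sort comes later
]
for rule, path in checks:
    print(f"{rule!r} vs {path!r}: {rule_matches(rule, path)}")
```

The last pair is worth noticing: a pattern like /*?sort= only matches when sort is the first query parameter. A broader pattern such as /*?*sort= (an assumption, not taken from this article) also catches it in later positions.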

4. Three copy-paste templates

Template A: content site (good default)

User-agent: *
Allow: /
Disallow: /search
Disallow: /wp-admin/
Sitemap: https://yourdomain.com/sitemap.xml
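
Template A mixes Allow: / with narrower Disallow lines, which works because RFC 9309 crawlers apply the most specific (longest) matching rule, with Allow winning a tie. The sketch below illustrates that precedence under a simplifying assumption: plain prefixes only, no * or $ handling. The function name is_allowed is hypothetical.

```python
def is_allowed(rules, url_path):
    """Apply longest-match precedence to (directive, path) pairs from one group.

    Simplification: plain prefixes only, no '*' or '$' handling.
    """
    best_len = -1
    best_allow = True  # no matching rule means the path may be crawled
    for directive, rule_path in rules:
        if not rule_path or not url_path.startswith(rule_path):
            continue
        allow = directive.lower() == "allow"
        # The longer rule wins; on a tie, Allow (least restrictive) wins.
        if len(rule_path) > best_len or (len(rule_path) == best_len and allow):
            best_len, best_allow = len(rule_path), allow
    return best_allow


TEMPLATE_A = [
    ("Allow", "/"),
    ("Disallow", "/search"),
    ("Disallow", "/wp-admin/"),
]

for path in ("/blog/post", "/search?q=robots", "/wp-admin/options.php"):
    print(path, "->", "allowed" if is_allowed(TEMPLATE_A, path) else "blocked")
```

Not every parser follows this rule. Python's own urllib.robotparser, for example, applies rules in file order, so it would treat Allow: / as matching everything in this template.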

Template B: ecommerce (cut parameter churn)

User-agent: *
Allow: /
Disallow: /cart/
Disallow: /checkout/
Disallow: /*?sort=
Disallow: /*?filter=
Sitemap: https://yourdomain.com/sitemap.xml

Template C: AI crawler policy (layer by intent)

User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Googlebot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
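How to decide: if you want visibility in AI-generated answers, you may allow retrieval-oriented bots (e.g. OAI-SearchBot, PerplexityBot); to keep content out of model training, tighten rules for training-focused crawlers and tokens (e.g. GPTBot, CCBot, Google-Extended). Note that Google-Extended is a control token for Gemini training and grounding, not a separate crawler, and blocking it does not affect Google Search.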
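
A crawler follows the group that names it and falls back to the * group only when no named group applies. The sketch below checks Template C with Python's standard-library parser; SomeOtherBot and the page URL are made-up names for illustration.

```python
from urllib.robotparser import RobotFileParser

TEMPLATE_C = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(TEMPLATE_C.splitlines())

page = "https://yourdomain.com/articles/robots-guide"
verdicts = {}
for agent in ("GPTBot", "CCBot", "Googlebot", "SomeOtherBot"):
    # Named groups are checked first; "*" is used only as a fallback.
    verdicts[agent] = parser.can_fetch(agent, page)
    print(f"{agent}: {'allowed' if verdicts[agent] else 'blocked'}")
```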

5. Pre-launch checks

Validate your robots.txt before you ship

The robots.txt validator catches syntax issues, conflicting rules, and accidental blocks, and suggests fixes. Pair it with the robots.txt report in Google Search Console for a second opinion.

Smoke-test checklist:
  1. Can critical pages be crawled?
  2. Does the sitemap return HTTP 200?
  3. Are CSS/JS assets unblocked?
  4. After launch, watch crawl/index trends for 7–14 days.
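
Items 1 and 3 of the checklist can be roughed out offline before deploy. The sketch below uses Python's standard-library urllib.robotparser as a first pass; the draft file, paths, and the helper name blocked_paths are assumptions. Note the caveat: this parser applies rules in file order and does not understand * or $, so its verdicts can differ from Googlebot's on files that rely on those features. The sitemap status check (item 2) needs a live request and is not covered here.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, paths, agent="Googlebot", origin="https://yourdomain.com"):
    """Return the subset of paths that the given agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, origin + p)]


# Hypothetical draft that accidentally blocks a render-asset directory.
DRAFT = """\
User-agent: *
Disallow: /admin/
Disallow: /_next/
"""

critical = ["/", "/pricing", "/blog/robots-txt-tutorial"]
assets = ["/_next/static/app.js", "/assets/site.css"]

print("blocked critical pages:", blocked_paths(DRAFT, critical))
print("blocked assets:", blocked_paths(DRAFT, assets))
```

An empty list for critical pages and assets is the result to aim for; here the draft's /_next/ rule shows up as a blocked asset.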

6. Real cases: when robots.txt goes wrong

Case 1: third-party rules drift—“slow leak” traffic loss

Search Engine Land covered a case where CMS or vendor changes to robots rules—plus case sensitivity—caused important URLs to slip from the index over weeks or months, not overnight. Source: Search Engine Land write-up.

Case 2: polluted robots response from CDN—strong recovery after cleanup

After server/CDN injection garbled robots.txt, impressions and indexed URLs tanked; within about three weeks of a clean file and cache purge, indexed URLs rose ~260%, impressions ~261%, and CTR moved from 0.4% to 1.2%. Source: robots.txt error fix case study.

7. Screenshots: robots.txt from well-known sites

  • Google: a public robots.txt with complex rules and fine-grained groups.
  • Cloudflare: layered paths including locale directories.
  • Moz: includes GPTBot and other AI-related user-agents.
  • OpenAI: a concise policy-style example.

8. Seven common mistakes and fixes

  1. Shipping Disallow: / to production: blocks the whole site. Fix: validate in a tool before deploy.
  2. Blocking render assets (e.g. /_next/): Google can’t render the page. Fix: allow JS/CSS paths.
  3. Disallowing sitemap URLs: slows discovery. Fix: keep sitemap crawlable.
  4. Using robots.txt as “noindex”: disallowed URLs can still be indexed via external links. Fix: use noindex or auth.
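  5. Wrong case or trailing slash: Disallow: /admin also blocks /admin/ and /administrator, while Disallow: /admin/ leaves /admin itself crawlable. Fix: match the exact prefix you intend.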
  6. Blocking whole directories by mistake: e.g. Disallow: /blog/ stops all posts. Fix: narrow the path.
  7. No change monitoring: silent edits cause slow index loss. Fix: diffs or alerts.
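
For mistake 7, change monitoring can start as small as comparing the live file against the last known good copy. The sketch below is one possible approach, assuming you fetch and store both versions yourself; the function names and labels are illustrative.

```python
import difflib
import hashlib


def fingerprint(robots_txt):
    """Stable SHA-256 fingerprint of the file content."""
    return hashlib.sha256(robots_txt.encode("utf-8")).hexdigest()


def describe_change(previous, current):
    """Return a unified diff if the content changed, or an empty string if not."""
    if fingerprint(previous) == fingerprint(current):
        return ""
    diff = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="robots.txt (last known)",
        tofile="robots.txt (live)",
        lineterm="",
    )
    return "\n".join(diff)


last_known = "User-agent: *\nDisallow:\n"
live = "User-agent: *\nDisallow: /\n"

report = describe_change(last_known, live)
print(report or "no change")
```

Run on a schedule, a non-empty report is the trigger for an alert. In this example the diff surfaces the single-character edit that turns "allow everything" into "block everything".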

9. FAQ

Q1: Is robots.txt legally binding?

Usually no—it’s a voluntary convention. Major search crawlers respect it; bad actors may not.

Q2: Is robots.txt still relevant?

Yes—especially with AI crawlers. It’s the first layer of crawl governance; pair it with noindex and access control.

Q3: What does “Blocked by robots.txt” mean?

Your rules forbid the crawler from fetching that URL. It can still appear in search results, usually without a description, if other pages link to it.

Q4: What belongs in a minimal robots.txt?

At least User-agent rules, any needed Disallow, and a correct Sitemap line.

Q5: How do I fix “blocked by robots.txt” errors?

Find the blocking rule → adjust scope → re-test → confirm in Search Console → watch trends for 7–14 days.

Q6: Can robots.txt create a security hole?

Not by itself, but it can hint at URL patterns. Do not rely on it for secrecy—use authentication.

Q7: What should I put in the file?

Only crawler directives (User-agent, Disallow, Allow, Sitemap)—not HTML, marketing copy, or sensitive paths as “security”.
