Google’s crawlers follow the Robots Exclusion Protocol (REP) to check which parts of a website they can crawl.
📌 What robots.txt Does #
- It tells crawlers (like Googlebot) which parts of the site they can or cannot crawl.
Example:
User-agent: *
Disallow: /private/
User-agent: Googlebot
Allow: /includes/
Sitemap: https://example.com/sitemap.xml
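You can sanity-check a rule set like this with Python's built-in robotparser module. This is only a quick sketch: the standard-library parser implements the classic REP and may not reproduce Google's exact precedence logic in every edge case, and the crawler name MyBot is just a placeholder.

```python
from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Allow: /includes/

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A generic crawler falls under the "*" group, so /private/ is off limits.
print(rp.can_fetch("MyBot", "https://example.com/private/report.html"))      # expect False
print(rp.can_fetch("MyBot", "https://example.com/public/page.html"))         # expect True

# Googlebot has its own group (groups are not combined), so the "*" rules
# do not apply to it here.
print(rp.can_fetch("Googlebot", "https://example.com/private/report.html"))  # expect True
```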
📍 File Location Rules #
- Must be placed in the root directory of the site:
  - https://example.com/robots.txt ✅
  - https://example.com/pages/robots.txt ❌
- Applies only to that specific host, protocol, and port.
  - https://example.com/robots.txt → applies to https://example.com/ only.
  - https://m.example.com/robots.txt → separate rules for the mobile subdomain.
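As a quick illustration of this scoping, the sketch below derives the robots.txt URL a crawler would fetch for any given page URL; the helper name robots_url is just for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the page's exact scheme, host, and port."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/pages/article.html"))    # https://example.com/robots.txt
print(robots_url("https://m.example.com/pages/article.html"))  # https://m.example.com/robots.txt
print(robots_url("https://example.com:8181/shop/item"))        # https://example.com:8181/robots.txt
```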
📏 Google’s Key Interpretations #
- Case-Sensitive URL
  - /Private/ ≠ /private/ (URLs are treated as exact matches).
- Allow vs Disallow Priority
  - Google applies the most specific rule to a path.
  Example:
  Disallow: /includes/
  Allow: /includes/css/
  → CSS inside /includes/css/ will still be crawled.
- Comments (#) Are Ignored
  - Use # for explanations in the file (comments don’t affect crawling).
- Google Requires Resources
  - Don’t block important .css or .js files if they are needed to render your pages properly (see the sketch after this list).
- Sitemap Line Is Optional
  - Adding Sitemap: https://example.com/sitemap.xml helps Google discover pages faster.
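Here is the sketch mentioned above for the resources point: a small pre-flight check with Python's standard robotparser that warns you when render-critical files are blocked. The /assets/ path and the resource URLs are hypothetical examples, not anything from a real site.

```python
from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /assets/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Hypothetical render-critical resources to verify.
critical_resources = [
    "https://example.com/assets/site.css",
    "https://example.com/assets/app.js",
]

for url in critical_resources:
    if not rp.can_fetch("Googlebot", url):
        print(f"WARNING: {url} is blocked, so Google may not render the page correctly")
```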
⚠️ What robots.txt Does NOT Do #
- ❌ Does not hide content from search results
  - Blocked URLs can still appear in search results if other sites link to them.
  - Use a noindex meta tag or password protection to fully hide content.
- ❌ Not a security measure → the file is public, and anyone can read your robots.txt.
💡 FSIDM Quick Tip for Students & Site Owners:
Think of robots.txt as a traffic cop — it directs crawlers, but it doesn’t lock doors. If you need real privacy, use authentication or noindex.
📌 Valid robots.txt URL Rules (Google’s View) #
The robots.txt file only applies to the exact protocol, domain/subdomain, and port it’s hosted on.
| Robots.txt Location | Valid For | Not Valid For |
| --- | --- | --- |
| https://example.com/robots.txt | https://example.com/ | https://other.example.com/, http://example.com/, https://example.com:8181/ |
| https://www.example.com/robots.txt | https://www.example.com/ | https://example.com/, https://shop.www.example.com/ |
| https://example.com/folder/robots.txt | ❌ Nothing (crawlers don’t check subdirectories) | All URLs |
| https://www.exämple.com/robots.txt | https://www.exämple.com/, https://xn--exmple-cua.com/ | https://www.example.com/ |
| ftp://example.com/robots.txt | ftp://example.com/ | https://example.com/ |
| https://212.96.82.21/robots.txt | https://212.96.82.21/ | https://example.com/ |
| https://example.com:443/robots.txt | https://example.com/, https://example.com:443/ | https://example.com:444/ |
| https://example.com:8181/robots.txt | https://example.com:8181/ | https://example.com/ |
💡 FSIDM Tip: Each subdomain needs its own robots.txt if you want to control crawling separately.
⚡ Handling HTTP Status Codes for robots.txt #
Google treats robots.txt responses differently based on status code:
| HTTP Code | Google’s Behavior |
| --- | --- |
| 2xx (Success) | Reads and applies the rules normally. |
| 3xx (Redirects) | Follows up to 5 redirect hops, then treats the file as a 404. |
| 4xx (Client Errors) | Treated as if no robots.txt exists (full crawl allowed), except 429, which is handled like a server error. |
| 5xx (Server Errors) | First 12 hours: stops crawling the site. Next 30 days: uses the cached version if available. After 30 days: if the site is reachable, crawls as if there is no robots.txt. |
| DNS/Network Errors | Treated like 5xx errors. |
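A rough sketch of how a fetcher might map these status codes to a crawl decision. The fetch_robots_policy helper and its return labels are my own shorthand, not Google's implementation, and urllib is left to handle redirects, so the 5-hop limit from the table isn't enforced here.

```python
import urllib.error
import urllib.request

def fetch_robots_policy(robots_url: str) -> str:
    try:
        with urllib.request.urlopen(robots_url, timeout=10) as resp:
            rules = resp.read(500 * 1024)   # content past 500 KiB is ignored anyway
            return "apply-rules"            # 2xx: parse `rules` and apply them normally
    except urllib.error.HTTPError as err:
        if err.code == 429 or err.code >= 500:
            return "pause-crawling"         # treated like a server error: hold off crawling
        return "crawl-all"                  # other 4xx: behave as if no robots.txt exists
    except urllib.error.URLError:
        return "pause-crawling"             # DNS/network errors: treated like 5xx

print(fetch_robots_policy("https://example.com/robots.txt"))
```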
⏱ Caching Rules #
- Google caches robots.txt for up to 24 hours.
- May keep using the cached version longer if a fresh copy can’t be fetched (for example, during server errors).
- You can request a recrawl of an updated file via the robots.txt report in Search Console.
📏 Format & Size Rules #
- Must be UTF-8 encoded plain text.
- Max size: 500 KiB (content after that is ignored).
- Invalid lines (e.g., HTML, bad encoding) are ignored.
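A small pre-upload check along these lines; the check_robots_file helper is just an illustration of the two limits above.

```python
from pathlib import Path

MAX_BYTES = 500 * 1024  # 500 KiB; rules past this limit are ignored

def check_robots_file(path: str) -> None:
    data = Path(path).read_bytes()
    if len(data) > MAX_BYTES:
        print(f"WARNING: file is {len(data)} bytes; content past 500 KiB will be ignored")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        print(f"WARNING: file is not valid UTF-8: {err}")

check_robots_file("robots.txt")
```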
💡 FSIDM Takeaway for Students & SEO Managers:
👉 Correct location, format, and encoding are just as important as the rules themselves.
👉 Always test robots.txt in Search Console after uploading to avoid indexing issues.
🛠 Robots.txt Syntax Basics #
- Format:
<field>:<value> # optional comment
- Spaces are optional but recommended for readability.
- # starts a comment (ignored by crawlers).
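A simplified reader that shows the field:value format and comment handling; it is only a sketch, not a complete robots.txt parser.

```python
def parse_lines(text: str):
    """Yield (field, value) pairs, skipping comments, blanks, and invalid lines."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # everything after '#' is a comment
        if not line or ":" not in line:
            continue                          # blank or invalid line: ignored
        field, value = line.split(":", 1)
        yield field.strip().lower(), value.strip()

sample = """\
User-agent: Googlebot   # applies to Googlebot only
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

for field, value in parse_lines(sample):
    print(field, "->", value)
```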
📌 Supported Fields (Google) #
| Field | Purpose | Example |
| --- | --- | --- |
| user-agent | Specifies the crawler the rules apply to | User-agent: Googlebot |
| disallow | Path not allowed to crawl | Disallow: /private/ |
| allow | Path allowed to crawl (overrides disallow) | Allow: /private/public-page.html |
| sitemap | Location of sitemap(s) | Sitemap: https://example.com/sitemap.xml |
❌ Fields like crawl-delay are not supported by Google.
🔍 Path Rules #
- Paths are case-sensitive. (/File.asp ≠ /file.asp)
- Paths must start with / (root relative).
- Wildcards supported:
  - * matches any sequence of characters.
  - $ matches the end of the URL.
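To see how these wildcards behave, here is a rough translation of a robots.txt path pattern into a regular expression. It is a simplification for illustration, not Google's matcher.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a robots.txt path pattern (with * and $) into an anchored regex."""
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    body = re.escape(pattern).replace(r"\*", ".*")   # '*' matches any characters
    return re.compile("^" + body + ("$" if anchored_end else ""))

fish = pattern_to_regex("/fish")
print(bool(fish.match("/fish/salmon.html")))   # True: prefix match
print(bool(fish.match("/Fish.asp")))           # False: matching is case-sensitive

php_end = pattern_to_regex("/*.php$")
print(bool(php_end.match("/folder/file.php")))   # True: ends with .php
print(bool(php_end.match("/file.php?param=1")))  # False: does not end with .php
```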
🧠 User-Agent Selection Logic #
Google picks the most specific group for the crawler:
- Example:
User-agent: googlebot-news # Group 1
User-agent: * # Group 2
User-agent: googlebot # Group 3
- Googlebot-News → Group 1
- Googlebot → Group 3
- Other Google bots → Group 2
📌 Order in file doesn’t matter. Google groups all relevant rules for a user agent internally.
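A sketch of that selection logic, assuming simple prefix matching on lowercased tokens; the select_group helper is my own simplification of Google's documented behavior, not its actual code.

```python
def select_group(crawler: str, group_tokens: list[str]) -> str:
    """Pick the longest group token that the crawler name starts with, else '*'."""
    crawler = crawler.lower()
    candidates = [t for t in group_tokens
                  if t != "*" and crawler.startswith(t.lower())]
    return max(candidates, key=len) if candidates else "*"

groups = ["googlebot-news", "*", "googlebot"]
print(select_group("Googlebot-News", groups))  # googlebot-news (Group 1)
print(select_group("Googlebot", groups))       # googlebot      (Group 3)
print(select_group("AdsBot-Google", groups))   # *              (Group 2)
```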
📂 Grouping Rules #
Multiple user-agents can share rules:
User-agent: e
User-agent: f
Disallow: /g
→ Both the e and f crawlers follow the Disallow: /g rule.
📜 Example of Correct Syntax #
# Block all bots from /private/
User-agent: *
Disallow: /private/
# Allow Googlebot access to /private/reports/
User-agent: Googlebot
Allow: /private/reports/
# Add sitemap location
Sitemap: https://example.com/sitemap.xml
💡 FSIDM Practical Tip for Students
- Test in Search Console after every change.
- Keep rules minimal → complex rules increase errors.
- Always combine Disallow + Allow smartly for sections.
🚦 URL Matching Based on Path Values in robots.txt #
Google uses the path part of a URL (after domain name) to decide if a robots.txt rule applies. It compares this path to the allow and disallow rules.
🎯 Key Wildcards Supported: #
| Wildcard | Meaning | Example Match |
| --- | --- | --- |
| * | Matches 0 or more characters | /fish* matches /fish.html, /fishheads, etc. |
| $ | Matches the end of the URL | /*.php$ matches /index.php but not /index.php?x=1 |
📌 Examples of Matching Rules #
| Rule | Matches | Doesn’t Match |
| --- | --- | --- |
| / | The root and everything below it (the whole site) | — |
| /fish | /fish, /fish.html, /fish/salmon.html, /fish.php?id=anything | /Fish.asp (case-sensitive), /catfish, /desert/fish |
| /fish/ | Anything inside the /fish/ folder, e.g. /fish/salmon.htm, /fish/?id=anything | /fish (without slash), /fish.html |
| /*.php | Any URL containing .php, e.g. /index.php, /folder/filename.php?params | /windows.PHP (case-sensitive) |
| /*.php$ | URLs ending exactly with .php, e.g. /file.php, /folder/file.php | /file.php5, /file.php?param |
| /fish*.php | URLs containing /fish followed by .php somewhere, e.g. /fish.php, /fishheads/catfish.php | /Fish.PHP (case-sensitive) |
⚖️ Order of Precedence — Which Rule Wins? #
- Google uses the most specific (longest matching) rule for a URL.
- If there are conflicting rules, Google applies the least restrictive rule (i.e., allows crawling if possible).
🔥 Real-World Examples #
| URL | Rules | Which Rule Applies? | Why? |
| --- | --- | --- | --- |
| https://example.com/page | allow: /p, disallow: / | allow: /p | /p is the longer, more specific match |
| https://example.com/folder/page | allow: /folder, disallow: /folder | allow: /folder | In a conflict between equally specific rules, Google picks the least restrictive rule |
| https://example.com/page.htm | allow: /page, disallow: /*.htm | disallow: /*.htm | The longer, more specific disallow rule applies |
| https://example.com/page.php5 | allow: /page, disallow: /*.ph | allow: /page | The least restrictive rule wins |
| https://example.com/ | allow: /$, disallow: / | allow: /$ | $ anchors the match to the exact root URL, which is more specific |
| https://example.com/page.htm | allow: /$, disallow: / | disallow: / | allow: /$ only matches the root URL, not /page.htm |
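Putting this together, here is a simplified resolver that reproduces the table above: the longest matching rule wins, and on a tie the least restrictive (allow) rule wins. It is an illustration of the precedence logic, not Google's implementation; the is_allowed helper and its rule format are my own.

```python
import re

def _matches(pattern: str, path: str) -> bool:
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "^" + re.escape(pattern).replace(r"\*", ".*") + ("$" if anchored else "")
    return re.match(regex, path) is not None

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules is a list of ("allow" | "disallow", pattern) pairs."""
    matching = [(len(p), kind) for kind, p in rules if _matches(p, path)]
    if not matching:
        return True                    # no rule matches: crawling is allowed
    longest = max(length for length, _ in matching)
    winners = {kind for length, kind in matching if length == longest}
    return "allow" in winners          # tie between allow and disallow: allow wins

print(is_allowed("/page", [("allow", "/p"), ("disallow", "/")]))                    # True
print(is_allowed("/folder/page", [("allow", "/folder"), ("disallow", "/folder")]))  # True
print(is_allowed("/page.htm", [("allow", "/page"), ("disallow", "/*.htm")]))        # False
print(is_allowed("/page.htm", [("allow", "/$"), ("disallow", "/")]))                # False
```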
💡 FSIDM Pro Tip #
When writing rules:
- Use /folder/ to block a directory’s content only.
- Use /file.ext to block specific files.
- Use wildcards * and $ wisely to cover patterns.
- Always test your robots.txt with Google Search Console to verify rule behavior!