Googlebot, Read Aloud, APIs-Google & Feedfetcher

📌 Googlebot – Core Crawler for Google Search

🛠 Types of Googlebot

Googlebot Smartphone
- Simulates a mobile user
- Primary crawler for most sites (Google mainly indexes the mobile version of content)
Googlebot Desktop
- Simulates a desktop user
- Still used, but less frequently than the mobile crawler

👉 Note:
Both share the same User-agent: Googlebot in robots.txt, so you cannot block one without blocking the other.

🌐 How Googlebot Crawls Your Site

Should not access most sites more than once every few seconds (on average)
Crawls only the first 15 MB of an HTML or supported text-based file (uncompressed size)
- CSS, JS, and images are fetched separately
Crawls primarily from US-based IP addresses, and its time zone is Pacific Time
Discovers URLs via:
- Internal/external links
- Sitemaps
- Indexed pages
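
To make the 15 MB limit concrete, here is a minimal Python sketch that reports how much of an uncompressed file would fall past it. The function name is hypothetical, and reading "15 MB" as 15 * 1024 * 1024 bytes is an assumption, since the text above does not define the unit precisely.

```python
# Assumption: "15 MB" is interpreted as 15 * 1024 * 1024 bytes.
CRAWL_LIMIT_BYTES = 15 * 1024 * 1024


def bytes_beyond_limit(html: bytes, limit: int = CRAWL_LIMIT_BYTES) -> int:
    """Return how many bytes of an uncompressed file fall past the limit."""
    return max(0, len(html) - limit)


if __name__ == "__main__":
    page = b"<html>" + b"x" * 1000 + b"</html>"
    # A result of 0 means the whole file fits within the limit.
    print(bytes_beyond_limit(page))
```

A non-zero result suggests moving large inline scripts or styles into separate files, since those are fetched separately.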

🚫 Blocking Googlebot (Crawl vs Index)

Stop crawling: Use robots.txt
Stop indexing: Use noindex
Block access completely: Password protect or restrict server access
⚠️ Blocking Googlebot affects Google Search, Images, Video, News, and Discover
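
The difference between stopping crawling and stopping indexing can be sketched in Python. This is an illustration only: the robots.txt content and function names are hypothetical, and the `X-Robots-Tag` response header is shown as one way to send `noindex` (a meta robots tag is the other common option).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out of /private/ only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
"""


def googlebot_may_crawl(url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Apply the robots.txt rules to the shared Googlebot token."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


def noindex_headers() -> dict:
    """Response header asking search engines not to index the page."""
    return {"X-Robots-Tag": "noindex"}


if __name__ == "__main__":
    print(googlebot_may_crawl("https://example.com/private/report.html"))
    print(googlebot_may_crawl("https://example.com/blog/post.html"))
```

Keep in mind that a `noindex` rule is only seen if the page can be crawled, so a URL should not be both disallowed in robots.txt and marked `noindex`.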

✅ Verifying Googlebot (Avoid Fake Crawlers)

Why? Some bots spoof Google’s user agent
How to verify:
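1. Reverse DNS lookup → Confirm the hostname ends with .googlebot.com or .google.com, then run a forward DNS lookup on that hostname and confirm it returns the original IP
2. Alternatively, match the IP against Google's published Googlebot IP ranges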
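
The DNS check can be sketched with Python's standard `socket` module. This is a rough sketch and not an official tool: the function names are hypothetical, the suffix list is taken from the text above, and the forward-confirmation step follows common practice for this kind of check. The resolver functions are passed in as parameters so the logic can be exercised without network access.

```python
import socket

# Hostname suffixes named in the text above.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_hostname(hostname: str) -> bool:
    """True if the reverse-DNS hostname ends with an expected suffix."""
    return hostname.rstrip(".").lower().endswith(GOOGLE_SUFFIXES)


def verify_crawler_ip(ip: str,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex) -> bool:
    """Reverse-resolve the IP, check the suffix, then forward-confirm it."""
    try:
        hostname = reverse(ip)[0]
        if not is_google_hostname(hostname):
            return False
        # Forward lookup: the hostname must map back to the same IP.
        # Note: gethostbyname_ex only returns IPv4 addresses.
        return ip in forward(hostname)[2]
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

Checking the suffix with a leading dot matters: a hostname such as `googlebot.com.evil.example` contains the right words but does not end with the right domain.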

💡 Quick FSIDM Tip for Students:
If your server is overloaded by requests that claim to be Googlebot, the cause is often not Googlebot being aggressive. It may be fake bots spoofing the Googlebot user agent. Always verify before blocking.

📖 What is Google Read Aloud?

User Agent Name: Google-Read-Aloud
Purpose: Reads web pages aloud using Text-to-Speech (TTS)
When it Activates:
- Only when a user with TTS enabled visits a page
Where it’s Used:
- Google Go
- Google Read It
- Read Aloud in Google app
- Other Google TTS-enabled services

🔍 Crawl Frequency & Behavior

Not a web crawler (no link following)
Triggered by user action (not automated crawling)
Caching:
- Google caches content to save bandwidth
- But multiple requests for the same page may still occur

🚫 How to Block or Control It

Robots.txt → ❌ Does NOT work (because it’s user-triggered)

Block completely:

<meta name="google" content="nopagereadaloud">

Paywalled or subscription content:
Use structured data to mark restricted content:

"isAccessibleForFree": false
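
As an illustration, the two controls above could be generated in Python like this. It is a sketch under assumptions: the `NewsArticle` type, the `hasPart` / `WebPageElement` layout, and the `.paywall` selector are illustrative choices based on schema.org paywalled-content markup, and should be adapted to your own pages.

```python
import json

# Meta tag from the text above, written with plain ASCII quotes.
NO_READ_ALOUD_META = '<meta name="google" content="nopagereadaloud">'


def paywall_json_ld(css_selector: str, article_type: str = "NewsArticle") -> str:
    """Build schema.org JSON-LD marking a page section as not free to access."""
    data = {
        "@context": "https://schema.org",
        "@type": article_type,
        "isAccessibleForFree": False,
        "hasPart": {
            "@type": "WebPageElement",
            "isAccessibleForFree": False,
            "cssSelector": css_selector,  # hypothetical selector for the gated part
        },
    }
    return json.dumps(data, indent=2)


if __name__ == "__main__":
    print(NO_READ_ALOUD_META)
    print(paywall_json_ld(".paywall"))
```

The JSON output would normally be placed inside a `<script type="application/ld+json">` element in the page head.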

📜 Old vs New User Agent

Current: Google-Read-Aloud
Old (deprecated): google-speakr

💡 FSIDM Pro Tip for Students:
If you run a membership site, premium articles, or gated content, always use nopagereadaloud or set "isAccessibleForFree": false in structured data. Otherwise, Google’s TTS may read it aloud to users for free.

🌐 What is APIs-Google?

User Agent Name: APIs-Google
Purpose:
- Delivers push notifications for Google APIs
- Lets apps avoid constant polling of Google servers
Why Ownership Verification?
- Google ensures the developer owns the domain before allowing push notifications
- Prevents abuse

📡 How APIs-Google Accesses Your Site

Uses HTTPS POST requests to send push notifications
Retries failed requests:
- Exponential backoff retry schedule (up to several days)
Traffic pattern:
- Can be steady (if updates are regular)
- Or spiky (if resources change rapidly or retries happen often)
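
To show why retries can produce spiky traffic, here is a small Python sketch of an exponential backoff schedule. The starting delay, the growth factor, and the three-day budget are assumptions chosen for illustration. The actual schedule Google uses is not stated in the text above.

```python
def backoff_delays(base_seconds=1.0, factor=2.0, max_total_seconds=3 * 24 * 3600):
    """Return retry delays that grow by `factor` until the time budget is spent."""
    if base_seconds <= 0 or factor < 1:
        raise ValueError("base_seconds must be > 0 and factor must be >= 1")
    delays, total, delay = [], 0.0, base_seconds
    while total + delay <= max_total_seconds:
        delays.append(delay)
        total += delay
        delay *= factor
    return delays


if __name__ == "__main__":
    schedule = backoff_delays()
    print(len(schedule), "retries, last wait", schedule[-1], "seconds")
```

The early retries are packed closely together, which is why an endpoint that fails briefly can see a short burst of repeated requests.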

⚙️ How to Prepare Your Site

SSL Certificate Required (must be valid)
- ❌ No self-signed certs
- ❌ No certs from untrusted authorities
- ❌ No revoked certificates
Respond quickly to notifications (within seconds) to avoid repeated retries
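
One way to respond within seconds is to acknowledge the notification first and do the slow work afterwards. The following Python sketch shows that pattern with the standard library. The handler, the queue, and the choice of status 200 are assumptions for illustration; check the specific Google API's documentation for the success codes it accepts.

```python
import queue
from http.server import BaseHTTPRequestHandler

# Payloads wait here for a separate worker (not shown) to process them.
pending = queue.Queue()


def accept_notification(body: bytes) -> int:
    """Queue the payload for later processing and return the HTTP status."""
    pending.put(body)
    return 200


class PushHandler(BaseHTTPRequestHandler):
    """Hypothetical push endpoint: acknowledge first, process afterwards."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status = accept_notification(self.rfile.read(length))
        self.send_response(status)  # reply before any slow work happens
        self.end_headers()
```

In production this handler would sit behind HTTPS with a valid certificate, as required above. It could be served locally for testing with `http.server.HTTPServer(("127.0.0.1", 8000), PushHandler).serve_forever()`, where the address and port are placeholders.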

🚫 How to Block APIs-Google

Unregister notifications:
- Contact whoever registered the push endpoint and disable it

Robots.txt:

User-agent: APIs-Google
Disallow: /

(Note: APIs-Google ignores rules written for Googlebot and responds only to the APIs-Google token)

✅ Verifying APIs-Google Requests

Spoof check:
- Look up source IP of request claiming APIs-Google
- Reverse DNS lookup → The hostname must be in the googlebot.com or google.com domain

💡 FSIDM Tip for Developers:
If APIs-Google traffic is hitting your site too often, it’s usually due to:

A misconfigured app not acknowledging notifications
A push endpoint responding too slowly
👉 Fixing those saves bandwidth and keeps Google from hammering your server.

📌 What is Feedfetcher?

User Agent Name: Feedfetcher-Google
Purpose:
- Retrieves RSS & Atom feeds for Google News and PubSubHubbub
- Only podcast feeds may appear in Google Search results
Trigger: Activated only when a user adds a feed to a service/app (not an automated crawl)

⚡ How Feedfetcher Works

Requests feeds on behalf of the user
Ignores robots.txt (because it acts like a direct user request, not a crawler)
Stores & periodically refreshes feeds for efficiency
Often shares the same retrieved feed with multiple users to save bandwidth

📈 Frequency of Retrieval

Usually ≤ 1 request per hour per feed
Popular/frequently updated feeds may refresh more often
Network delays can cause short bursts of requests

🚫 Blocking Feedfetcher

Since robots.txt doesn’t work:

Return an error HTTP status code for Feedfetcher-Google requests, such as:
- 404 Not Found
- 410 Gone
If hosted on a platform (like Blogger, WordPress, etc.), configure restrictions through the platform settings
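
If you control the server, the status-code approach could look like the following WSGI middleware in Python. This is a minimal sketch: the function name is hypothetical, and matching on the User-Agent header alone is an assumption that will also catch any client spoofing that name.

```python
BLOCKED_TOKEN = "Feedfetcher-Google"


def block_feedfetcher(app, status="410 Gone"):
    """WSGI middleware that answers Feedfetcher-Google with an error status."""
    def wrapper(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        # Case-insensitive match on the user agent name.
        if BLOCKED_TOKEN.lower() in user_agent.lower():
            start_response(status, [("Content-Type", "text/plain")])
            return [status.encode("ascii") + b"\n"]
        # Everyone else reaches the wrapped application unchanged.
        return app(environ, start_response)
    return wrapper
```

Wrap your existing WSGI application with `block_feedfetcher(app)`, or pass `status="404 Not Found"` to return a 404 instead.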

🔍 Why It Might Fetch “Odd” or “Secret” URLs

A user manually typed or bookmarked the feed URL
A user shared the link
Feedfetcher does not guess URLs; it only fetches URLs that were explicitly provided

📊 Technical Details

Distributed across multiple machines → requests may come from different Google IPs
IP ranges: Listed in user-triggered-fetchers-google.json
Fetches exact URL only (does not follow links like Googlebot)
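
Because requests can come from many Google IPs, a range check is more practical than a fixed allowlist. The Python sketch below parses a range file and tests an address against it. The JSON shape (a `prefixes` list with `ipv4Prefix` / `ipv6Prefix` keys) is an assumption about the published file format, and the sample prefixes are documentation-only ranges, not real Google addresses.

```python
import ipaddress
import json

# Placeholder data; these are documentation ranges, not Google's.
SAMPLE_RANGES = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
"""


def load_networks(ranges_json: str) -> list:
    """Parse the 'prefixes' list into ip_network objects."""
    networks = []
    for entry in json.loads(ranges_json)["prefixes"]:
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def ip_in_ranges(ip: str, networks: list) -> bool:
    """True if the address falls inside any listed network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    nets = load_networks(SAMPLE_RANGES)
    print(ip_in_ranges("192.0.2.55", nets))
```

In real use, `SAMPLE_RANGES` would be replaced by the contents of the downloaded range file, refreshed periodically since the ranges can change.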

💡 FSIDM Tip for Students/Marketers:
If you run a podcast, news site, or blog:

Ensure your RSS/Atom feeds are valid & clean (follow spec)
Feedfetcher helps deliver your content faster to apps & news surfaces
Don’t block it unless you have a business reason (e.g., premium content)

