Stop guessing which pages are missing from Google. Build a diagnostic pipeline in Python that flags soft 404s, blocked-by-robots pages, and duplicate-without-canonical signals, then resubmits corrected URLs via the Indexing API. No dashboard watching required.
If you have ever exported Google Search Console data for a 10,000-page site and tried to spot soft 404s by hand, you already know the pain. The core bottleneck is not the error itself but the signal-to-noise ratio. Google may flag 'Crawled - currently not indexed' on 40% of a site's URLs, but only a fraction of those are actionable. You need to separate real indexability issues from timing delays.
A common situation we see: an agency inherits a site with 1,200 pages returning HTTP 200 but zero content. The client wants indexing fixed. Running a Python script that checks the status with requests.head(), then parses the robots meta tags, then sends the clean URLs to the Indexing API is the only repeatable way to scale. Without automation, you burn days on spreadsheets.
1. Soft 404s. Pages that return a 200 status but serve a 'not found' message or an empty body. Google trusts the HTTP header, not the content. These pages sit in the index as dead weight. Detection: compare body length against a threshold (e.g., under 150 characters) or check for known 404 keywords like 'page not found'.
2. Blocked by robots / noindex. A page exists, but a robots.txt directive or a noindex meta tag blocks the crawler. GSC will show 'Blocked by robots.txt' or 'Excluded by noindex tag'. The fix is not a resubmit; it is removing the directive. Python can scan all sitemap URLs and flag those with disallow or noindex.
3. Duplicate without canonical. Two URLs serve identical content and neither has a self-referencing canonical. Google picks one and drops the other. GSC reports this as 'Duplicate without user-selected canonical'. Python can compute a content hash (using hashlib on the body) and group duplicates.
| Error type | Detection method | Recommended fix | Failure mode |
|---|---|---|---|
| Soft 404 (HTTP 200, zero content) | Check body length < 150 chars or match regex 'not found' | Return a 410 or 404 status, or add a canonical to a live page | False positives: landing pages with short copy (fix: lower the threshold, e.g. to 100) |
| Blocked by robots.txt (URL disallowed) | Parse robots.txt with urllib.robotparser and test each sitemap URL | Remove or update the disallow rule, then resubmit via the Indexing API | Wildcard disallows block entire directories. Always test with a single URL first. |
| Noindex meta tag (page has a noindex directive) | Extract meta name='robots' with BeautifulSoup | Remove the tag or change it to 'index, follow' | Developers often leave noindex on staging pages that go live. Watch for CMS template inheritance. |
| Duplicate without canonical (identical body, different URL) | Hash the page body with SHA256 and group by hash value | Add a rel=canonical link tag on each duplicate pointing to the preferred URL | Session IDs in URLs produce false duplicates. Strip query params before hashing. |
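The soft 404 and noindex checks in the table can be sketched as one function that works on an already-downloaded HTML string. This is a minimal sketch, not the exact script: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages, it measures visible text rather than raw body length, and the names `classify_page`, `SOFT_404_MIN_CHARS`, and `NOT_FOUND_RE` are illustrative assumptions.

```python
import re
from html.parser import HTMLParser

SOFT_404_MIN_CHARS = 150  # threshold from the table; tune per site
NOT_FOUND_RE = re.compile(r"page not found|404 not found", re.IGNORECASE)


class _PageScanner(HTMLParser):
    """Collects visible text and the content of <meta name="robots">."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.robots = ""
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = (attrs.get("content") or "").lower()

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def classify_page(html: str) -> str:
    """Return 'noindex', 'soft_404', or 'ok' for a page that answered HTTP 200."""
    scanner = _PageScanner()
    scanner.feed(html)
    # Collapse whitespace so indentation does not inflate the length.
    text = " ".join(" ".join(scanner.text_parts).split())
    if "noindex" in scanner.robots:
        return "noindex"
    if len(text) < SOFT_404_MIN_CHARS or NOT_FOUND_RE.search(text):
        return "soft_404"
    return "ok"
```

The keyword match will also fire on a long, valid page that merely mentions "page not found", so treat a keyword-only hit as a candidate for manual review rather than a confirmed soft 404.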
1. Use <code>requests.get</code> on the sitemap index, parse the XML, and extract all URLs into a list. Each sitemap file holds at most 50,000 URLs.
2. Run <code>requests.head</code> on each URL. Filter out non-200 statuses immediately. Log 301s and 404s separately.
3. Download the full body with <code>requests.get</code>. Measure its length, parse the robots meta tag, and compute a SHA256 hash for duplicate detection.
4. Compare against thresholds: soft 404 if length < 150, blocked if the robots meta contains noindex, duplicate if the hash matches another URL.
5. Output a CSV with columns: URL, error type, suggested action. Only URLs with a clear fix proceed to step 6.
6. Authenticate with a service account. Send a POST to <code>https://indexing.googleapis.com/v3/urlNotifications:publish</code>. The limit is 200 URLs per day per project.
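The first step, pulling every URL out of a sitemap index, might look like the sketch below. It assumes standard sitemaps.org XML and takes the download function as a parameter so the parsing can be tried offline; `parse_sitemap` and `collect_urls` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_bytes: bytes):
    """Return (kind, locs) where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_bytes)
    is_index = root.tag.endswith("sitemapindex")
    path = "sm:sitemap/sm:loc" if is_index else "sm:url/sm:loc"
    locs = [el.text.strip() for el in root.findall(path, SITEMAP_NS) if el.text]
    return ("index" if is_index else "urlset"), locs


def collect_urls(sitemap_url: str, fetch) -> list:
    """Walk a sitemap (or sitemap index) and return all page URLs.

    `fetch` is any callable that maps a URL to the response body as bytes.
    """
    kind, locs = parse_sitemap(fetch(sitemap_url))
    if kind == "urlset":
        return locs
    urls = []
    for child in locs:  # each child of an index is itself a sitemap
        urls.extend(collect_urls(child, fetch))
    return urls
```

With requests installed, a call could look like `collect_urls("https://example.com/sitemap.xml", lambda u: requests.get(u, timeout=10).content)`, where the example.com address is a placeholder.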
We ran the pipeline on a client site with 5,200 URLs. First pass: 4,980 returned HTTP 200. Content analysis flagged 712 pages with body length under 150 characters. Manual spot-check confirmed 688 were actual soft 404s (empty product pages, deleted articles with no redirect).
We set the threshold to 100 characters to avoid false-flagging short landing pages. That dropped the count to 641. For each, we implemented a 410 response code (via .htaccess for the static ones, CMS plugin for the rest). After 48 hours, we resubmitted the corrected URLs using the Indexing API. 572 of 641 (89%) appeared as 'Submitted and indexed' within two days. The remaining 69 had secondary issues (blocked by robots, slow load) that required a second pass.
Blocked URLs that are not in robots.txt. We saw a site where the server returned a 403 for specific user-agent headers. Python's requests default header passed, but Googlebot got blocked. Solution: simulate the Googlebot user-agent in your script (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) and compare results.
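A rough way to run that comparison is to request the same URL twice with different User-Agent headers and compare status codes. The sketch below uses only the standard library; `fetch_status` and `ua_mismatch` are hypothetical names, and the `DEFAULT_UA` string is just an example of what a script might send. Spoofing the header does not reproduce real Googlebot traffic (servers can also filter by IP), so a match here does not prove Googlebot gets through.

```python
import urllib.error
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
DEFAULT_UA = "python-requests/2.31.0"  # example of a script's default header


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code seen when sending the given User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions


def ua_mismatch(url, fetch=fetch_status):
    """Compare the status for a script-like UA against the Googlebot UA."""
    default_status = fetch(url, DEFAULT_UA)
    bot_status = fetch(url, GOOGLEBOT_UA)
    return {
        "url": url,
        "default": default_status,
        "googlebot": bot_status,
        "mismatch": default_status != bot_status,
    }
```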
Duplicate lists with session IDs. One ecommerce site had 7,000 URLs for 1,200 products because of tracking parameters. Our hash grouping flagged all as duplicates. We added a URL normalization step: strip ?session= and ?utm_* before hashing. Result: 1,200 real pages, 5,800 false duplicates ignored.
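The normalization step could be sketched as follows. It assumes the only noise parameters are `session` and anything starting with `utm_`, as in the example above; `normalize_url` is an illustrative name, and a real site will likely need its own parameter list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"session"}      # exact parameter names to drop
TRACKING_PREFIXES = ("utm_",)    # parameter name prefixes to drop


def normalize_url(url: str) -> str:
    """Strip tracking parameters and the fragment so variants collapse."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )
```

Running every crawled URL through `normalize_url` and collecting the results in a set gives the count of real pages before any hashing happens.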
Indexing API daily quota. Google limits to 200 URLs per day per project. For sites with 10k+ corrections, you need a queue system. We batch 200 per day, log the remaining, and retry the next day. Once the quota is exhausted, further requests fail (typically with a 429 error) until the quota resets, so do not waste retries on them.
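A file-backed queue is one simple way to implement the batching. In this sketch the queue is a JSON list on disk, `submit` is any callable that returns True on success, and the names `take_daily_batch` and `run_daily` are hypothetical; a database or task queue would work equally well.

```python
import json
from pathlib import Path

DAILY_QUOTA = 200  # per-project daily limit quoted above


def take_daily_batch(pending, quota=DAILY_QUOTA):
    """Split the pending list into today's batch and the remainder."""
    return pending[:quota], pending[quota:]


def run_daily(queue_path, submit, quota=DAILY_QUOTA):
    """Submit up to `quota` URLs and write the rest back to the queue file.

    Returns the number of URLs submitted successfully today.
    """
    path = Path(queue_path)
    pending = json.loads(path.read_text()) if path.exists() else []
    batch, remaining = take_daily_batch(pending, quota)
    failed = [url for url in batch if not submit(url)]
    # Failed URLs go back to the front so they are retried first tomorrow.
    path.write_text(json.dumps(failed + remaining, indent=2))
    return len(batch) - len(failed)
```

Scheduling `run_daily` once per day keeps the script under the limit without any in-process timers.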
Use requests.get on each URL, check status code, then parse the body. If status=200 but body length is under 150 characters or contains keywords like 'page not found', classify as soft 404. For 50k+ URLs, use asyncio or a thread pool to avoid timeout.
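For large URL lists, a thread pool is usually the smallest change to an existing synchronous script. The sketch below assumes `check_one` is whatever per-URL function you already have (fetch plus classification); `check_all` is a hypothetical wrapper name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def check_all(urls, check_one, max_workers=16):
    """Run check_one(url) concurrently and return {url: result}.

    A failure on one URL is recorded as an error string instead of
    aborting the whole run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(check_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # network errors, timeouts, parse errors
                results[url] = f"error: {exc}"
    return results
```

Keep `max_workers` modest so the crawl does not look like a load test to the target server.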
Yes, but only partially. Python can detect which URLs are blocked by parsing robots.txt with urllib.robotparser. The fix requires actual access to the server or CMS to remove the disallow or noindex directive. After the directive is removed, the script can resubmit the URL via the Indexing API.
Use google-api-python-client with the Search Console API (service name searchconsole, version v1). The API does not offer the index coverage report as a bulk export; its URL Inspection method returns the index status of one URL at a time, including page fetch states such as SOFT_404 and BLOCKED_ROBOTS_TXT. Combine the results with pandas for analysis. The 25,000-row limit per request and the startRow pagination parameter apply to Search Analytics queries, not to URL inspection.
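A per-URL status lookup might be sketched like this, assuming the Search Console URL Inspection method and a `service` object built elsewhere with `googleapiclient.discovery.build("searchconsole", "v1", credentials=creds)`. The helper names `inspect_url` and `summarize` are illustrative, and the field list is a subset chosen for this pipeline.

```python
STATUS_FIELDS = (
    "verdict",
    "coverageState",
    "pageFetchState",
    "robotsTxtState",
    "indexingState",
)


def summarize(response: dict) -> dict:
    """Pick the index-status fields this pipeline cares about."""
    status = response.get("inspectionResult", {}).get("indexStatusResult", {})
    return {field: status.get(field) for field in STATUS_FIELDS}


def inspect_url(service, site_url: str, page_url: str) -> dict:
    """Ask Search Console how it currently sees one URL.

    `site_url` must be a property the credentials can access,
    e.g. "https://example.com/" or "sc-domain:example.com".
    """
    body = {"inspectionUrl": page_url, "siteUrl": site_url}
    response = service.urlInspection().index().inspect(body=body).execute()
    return summarize(response)
```

Each result is a flat dict, so a list of them loads directly into `pandas.DataFrame` for filtering.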
Download the full HTML of each URL, compute a SHA256 hash of the body (after stripping scripts and CSS), group URLs by hash. For groups with more than one URL, none with a self-referencing canonical, flag as 'duplicate without canonical'. The fix is to add a canonical tag pointing to the preferred URL.
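The grouping logic can be sketched on a dict of already-downloaded pages. This version uses regular expressions to strip scripts, styles, and the canonical tag before hashing, which is crude but dependency-free; a real HTML parser is more robust. Removing the canonical tag before hashing is an assumption of this sketch (otherwise two copies that differ only in that tag would never match), and all function names are illustrative.

```python
import hashlib
import re
from collections import defaultdict

SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
CANONICAL_RE = re.compile(r"<link\b[^>]*rel=[\"']canonical[\"'][^>]*>", re.IGNORECASE)
HREF_RE = re.compile(r"href=[\"']([^\"']+)[\"']", re.IGNORECASE)


def body_hash(html: str) -> str:
    """SHA256 of the HTML with scripts, styles, and canonical tag removed."""
    stripped = SCRIPT_STYLE_RE.sub("", html)
    stripped = CANONICAL_RE.sub("", stripped)
    normalized = " ".join(stripped.split())  # collapse whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def canonical_of(html: str):
    """Return the canonical href declared in the page, or None."""
    tag = CANONICAL_RE.search(html)
    if not tag:
        return None
    href = HREF_RE.search(tag.group(0))
    return href.group(1) if href else None


def duplicates_without_canonical(pages: dict) -> list:
    """pages maps URL -> HTML. Return groups of duplicate URLs in which
    no member declares itself as canonical."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[body_hash(html)].append(url)
    flagged = []
    for urls in groups.values():
        if len(urls) > 1 and not any(canonical_of(pages[u]) == u for u in urls):
            flagged.append(sorted(urls))
    return flagged
```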
200 publish requests per Google Cloud project per day by default. If you need to submit more, request a quota increase from Google or stagger submissions over several days. Do not try to push past the limit; excess requests are rejected. Track usage on the quotas page of the Google Cloud console or by counting requests in your own log.
Use the urllib.robotparser module from the standard library. Fetch the robots.txt content with requests.get, feed it to a RobotFileParser, then call can_fetch('Googlebot', url). If it returns False, the URL is blocked. Note: some sites serve different robots.txt to different user-agents, so always specify Googlebot.
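A minimal sketch of that check, working on robots.txt text that has already been downloaded (the function name `blocked_urls` is an assumption):

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls, user_agent="Googlebot"):
    """Return the subset of urls that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]
```

One caveat: the standard library parser does plain prefix matching and does not implement Google's `*` and `$` wildcard extensions, so rules that rely on wildcards need a separate check or a third-party parser.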
Short landing pages or pages with mostly JavaScript rendering. A page that loads 50 characters of HTML but 200KB of JS will appear as a soft 404 to a pure HTTP check. Mitigate by rendering pages in a headless browser (Playwright or Selenium) or by whitelisting known short but valid paths like /404-test.
Authenticate with a service account JSON key, create a google.auth credential object, then POST to https://indexing.googleapis.com/v3/urlNotifications:publish with body {'url': '...', 'type': 'URL_UPDATED'}. Handle 429 errors with exponential backoff. Log all responses for audit.
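The submission step, including backoff on 429, might look like the sketch below. It assumes the google-auth package for credentials; the Google imports are deferred into `make_session` so the retry logic can be exercised with a stand-in session. `make_session` and `publish` are hypothetical helper names, and the retry count and delays are arbitrary starting points.

```python
import time

ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"
SCOPES = ["https://www.googleapis.com/auth/indexing"]


def make_session(key_path: str):
    """Build an authorized HTTP session from a service account JSON key."""
    from google.auth.transport.requests import AuthorizedSession
    from google.oauth2 import service_account

    creds = service_account.Credentials.from_service_account_file(
        key_path, scopes=SCOPES
    )
    return AuthorizedSession(creds)


def publish(session, url, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """POST a URL_UPDATED notification; back off exponentially on 429.

    Returns the final HTTP status code so the caller can log it.
    """
    payload = {"url": url, "type": "URL_UPDATED"}
    for attempt in range(max_retries):
        response = session.post(ENDPOINT, json=payload, timeout=30)
        if response.status_code != 429:
            return response.status_code
        if attempt < max_retries - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return 429
```

A call would then be `publish(make_session("service-account.json"), "https://example.com/page")`, where both the key path and the URL are placeholders.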
Yes. Run the same pipeline: fetch all URLs, check status and content, flag soft 404s and noindex. For guest posts, ensure the target site allows indexing. Use a <a href="https://medium.com/@alexa.sam2026/how-to-index-pbn-links-safely-the-2026-sandbox-escape-protocol-ee763a3171e9">sandbox escape protocol</a> for safe submission, and always verify with GSC. Be careful with automated submissions to avoid spam penalties.
Google crawled the URL but chose not to index it due to quality, duplication, or technical issues. Python helps by checking the page content length, canonical tags, and load speed. If the page is thin or a duplicate, you can flag it. However, some 'not indexed' URLs are low priority, not errors.
You cannot fix what you do not measure. The pipeline described here runs weekly on a cron job. It outputs a CSV with error classification and suggested action. The team reviews the 'fix list' column, implements changes, and the script resubmits. No manual GSC export, no guesswork.
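The report itself needs nothing beyond the standard `csv` module. In this sketch the error-type keys and the wording of each suggested action are illustrative assumptions; adapt them to whatever labels your classifier emits.

```python
import csv

SUGGESTED_ACTION = {
    "soft_404": "Return 404/410 or restore content",
    "noindex": "Remove the noindex directive",
    "blocked_by_robots": "Update the robots.txt disallow rule",
    "duplicate_no_canonical": "Add rel=canonical to the preferred URL",
}


def write_report(rows, out):
    """Write (url, error_type) pairs as a CSV with a suggested action.

    `out` is any writable text file object; open real files with newline="".
    """
    writer = csv.writer(out)
    writer.writerow(["URL", "error type", "suggested action"])
    for url, error_type in rows:
        action = SUGGESTED_ACTION.get(error_type, "Review manually")
        writer.writerow([url, error_type, action])
```

For example, `with open("report.csv", "w", newline="") as f: write_report(rows, f)` produces the file the team reviews each week.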
For those new to the Indexing API, refer to the quick sitemap indexing guide for a complementary approach: submitting a sitemap after corrections ensures Google re-discovers the fixes faster. Combine both methods for best results.