Fix Indexing Errors with Python: Soft 404, Removed & Crawl Anomalies

On this page

Why Python beats manual inspection for indexing errors
The three error families that kill indexation
Indexing error types: detection method, fix, and failure mode
Python diagnostic pipeline: from crawl to resubmit
Worked example: detecting and fixing soft 404s on a 5,000-page site
Edge cases and operational failures
FAQ
Turning diagnostics into a repeatable workflow

Field notes

Why Python beats manual inspection for indexing errors

If you have ever exported Google Search Console data for a 10,000-page site and tried to spot soft 404s by hand, you already know the pain. The core bottleneck is not the error itself but the signal-to-noise ratio. Google can flag 'Crawled - currently not indexed' on as much as 40% of a site's URLs, yet only a fraction of those are actionable. You need to separate real indexability issues from timing delays.

A common situation we see: an agency inherits a site with 1,200 pages returning HTTP 200 but zero content. The client wants indexing fixed. A Python script that checks the status with requests.head(), fetches each page and parses its meta tags, and then submits the clean URLs to Google for recrawling is the only repeatable way to scale. Without automation, you burn days on spreadsheets.

The three error families that kill indexation

1. Soft 404s. Pages that return a 200 status but serve a 'not found' message or an empty body. The HTTP header reports success while the content says otherwise, so a status-code check alone misses them. These pages are dead weight that wastes crawl budget. Detection: compare body length against a threshold (e.g., under 150 characters) or check for known 404 keywords like 'page not found'.

2. Blocked by robots / noindex. A page exists, but a robots.txt directive or a robots meta tag blocks the crawler. GSC will show 'Blocked by robots.txt' or "Excluded by 'noindex' tag". The fix is not a resubmit; it is removing the directive. Python can scan all sitemap URLs and flag those with a disallow rule or a noindex directive.

3. Duplicate without canonical. Two URLs serve identical content and neither has a self-referencing canonical. Google picks one and drops the other. GSC reports this as 'Duplicate without user-selected canonical'. Python can compute a content hash (using hashlib on the body) and group duplicates.
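The first two detection rules can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: the phrase list in `NOT_FOUND_PATTERN` is an assumption, the length is measured on visible text rather than raw HTML, and `html.parser` stands in for BeautifulSoup to keep the example dependency-free.

```python
import re
from html.parser import HTMLParser

# The 150-character threshold comes from the text above.
# The phrase list is an illustrative assumption; tune it per site.
SOFT_404_MAX_CHARS = 150
NOT_FOUND_PATTERN = re.compile(r"page not found|404 error|no longer available", re.I)


class _TextAndRobots(HTMLParser):
    """Collects visible text and the content of robots meta tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.robots = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attr = {name: (value or "") for name, value in attrs}
            if attr.get("name", "").lower() in ("robots", "googlebot"):
                self.robots.append(attr.get("content", "").lower())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def inspect_html(html):
    """Return (visible_text, has_noindex) for an HTML document."""
    parser = _TextAndRobots()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text_parts).split())
    noindex = any("noindex" in value for value in parser.robots)
    return text, noindex


def is_soft_404(status, html):
    """Heuristic: HTTP 200 with very little text or a 'not found' phrase."""
    if status != 200:
        return False
    text, _ = inspect_html(html)
    return len(text) < SOFT_404_MAX_CHARS or bool(NOT_FOUND_PATTERN.search(text))
```

Measuring visible text instead of raw bytes keeps boilerplate markup from hiding an empty page, at the cost of flagging pages whose content is rendered by JavaScript.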

Data table

Indexing error types: detection method, fix, and failure mode

Error type	Detection method	Recommended fix	Failure mode
Soft 404 (HTTP 200, zero content)	Check body length < 150 chars or match the regex 'not found'	Return a 404 or 410 status, or add a canonical to a live page	False positives: landing pages with short copy (fix: lower the threshold, e.g. to 80)
Blocked by robots.txt (URL disallowed)	Parse robots.txt with `robotparser` and test each sitemap URL	Remove or update the disallow rule, then resubmit via the Indexing API	Wildcard disallows block entire directories. Always test with a single URL first.
Noindex meta tag (page has a noindex directive)	Extract `meta name='robots'` with BeautifulSoup	Remove the tag or change it to 'index, follow'	Developers often leave noindex on staging pages that go live. Watch for CMS template inheritance.
Duplicate without canonical (identical body, different URL)	Hash the page body with SHA-256, then group by hash value	Add a `link rel='canonical'` tag on each duplicate pointing to the preferred URL	Session IDs in URLs produce false duplicates. Strip query params before hashing.

Workflow map

Python diagnostic pipeline: from crawl to resubmit

1. Fetch all sitemap URLs

Use <code>requests.get</code> on the sitemap index, parse the XML, and extract all URLs into a list. Each sitemap file holds at most 50,000 URLs, so larger sites split them across several files.

2. HTTP header check

Run <code>requests.head</code> on each URL. Filter out non-200 statuses immediately. Log 301s and 404s separately.

3. Content analysis

Download full body with <code>requests.get</code>. Measure length. Parse robots meta. Compute SHA256 hash for duplicate detection.

4. Classify errors

Compare against thresholds: soft 404 if length < 150, blocked if meta=noindex, duplicate if hash matches another URL.

5. Generate fix list

Output a CSV with columns: URL, error type, suggested action. Only URLs with a clear fix proceed to step 6.

6. Resubmit via Indexing API

Authenticate with a service account. Send a POST to <code>https://indexing.googleapis.com/v3/urlNotifications:publish</code>. The default quota is 200 publish requests per day per project.
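Steps 3 to 5 above can be sketched as a small classifier that turns already-fetched page records into the fix-list CSV. The `PageRecord` fields, the error labels, and the wording of the suggested actions are illustrative assumptions; only the thresholds and the three CSV columns come from the text.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass


@dataclass
class PageRecord:
    # Hypothetical record produced by the fetch and content-analysis steps.
    url: str
    status: int
    text_length: int
    noindex: bool
    body_hash: str


def classify(records, min_length=150):
    """Yield (url, error_type, suggested_action) for pages with a clear fix.

    `records` must be a list, because it is traversed twice.
    """
    hash_counts = Counter(r.body_hash for r in records if r.status == 200)
    for r in records:
        if r.status != 200:
            continue  # redirects and hard errors are logged separately in step 2
        if r.text_length < min_length:
            yield r.url, "soft_404", "return 404/410 or restore content"
        elif r.noindex:
            yield r.url, "noindex", "remove the noindex directive"
        elif hash_counts[r.body_hash] > 1:
            yield r.url, "duplicate", "add a canonical to the preferred URL"


def to_csv(rows):
    """Serialize classified rows using the three columns named in step 5."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["URL", "error type", "suggested action"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Each URL gets at most one label, checked in the order soft 404, noindex, duplicate, so an empty page is not also reported as a duplicate of every other empty page.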

Worked example: detecting and fixing soft 404s on a 5,000-page site

We ran the pipeline on a client site with 5,200 URLs. First pass: 4,980 returned HTTP 200. Content analysis flagged 712 pages with body length under 150 characters. Manual spot-check confirmed 688 were actual soft 404s (empty product pages, deleted articles with no redirect).

We then lowered the threshold from 150 to 100 characters to avoid false-flagging short landing pages. That dropped the count to 641. For each, we implemented a 410 response code (via .htaccess for the static ones, CMS plugin for the rest). After 48 hours, we resubmitted the corrected URLs using the Indexing API. 572 of 641 (89%) appeared as 'Submitted and indexed' within two days. The remaining 69 had secondary issues (blocked by robots, slow load) that required a second pass.

Edge cases and operational failures

Blocked URLs that are not in robots.txt. We saw a site where the server returned a 403 for specific user-agent headers. Python's requests default header passed, but Googlebot got blocked. Solution: simulate the Googlebot user-agent in your script (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) and compare results.
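A minimal sketch of that comparison, assuming a HEAD request is enough to reproduce the block. `DEFAULT_UA` is a stand-in for whatever your script normally sends, and `fetch_status` uses `urllib` rather than `requests` so the example has no dependencies. The `fetch` parameter is injectable so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
DEFAULT_UA = "python-requests/2.x"  # assumed stand-in for a generic script user-agent


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status for url when requested with user_agent."""
    request = urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions


def find_ua_mismatches(urls, fetch=fetch_status):
    """Return (url, default_status, googlebot_status) where the two differ."""
    mismatches = []
    for url in urls:
        default_status = fetch(url, DEFAULT_UA)
        googlebot_status = fetch(url, GOOGLEBOT_UA)
        if default_status != googlebot_status:
            mismatches.append((url, default_status, googlebot_status))
    return mismatches
```

A spoofed user-agent is only an approximation: servers that verify Googlebot by reverse DNS may treat the script differently from the real crawler, so confirm any mismatch with the URL Inspection tool in GSC.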

Duplicate lists with session IDs. One ecommerce site had 7,000 URLs for 1,200 products because of session and tracking parameters. Our hash grouping flagged all of them as duplicates. We added a URL normalization step that strips session and utm_* query parameters before fetching and hashing. Result: 1,200 real pages, with the 5,800 parameter variants ignored as false duplicates.
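The normalization step might look like the sketch below. The parameter names in `TRACKING_KEYS` are assumptions (the text only names `session` and `utm_*`), so adjust them to whatever your platform appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"session", "sessionid", "sid"}  # assumed parameter names
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url):
    """Drop session/tracking parameters and the fragment; keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def unique_pages(urls):
    """Collapse URL variants to their normalized form, preserving first-seen order."""
    seen = {}
    for url in urls:
        seen.setdefault(normalize_url(url), url)
    return list(seen)
```

Parameters that change the content, such as a product variant or a page number, are deliberately kept, so only an explicit deny-list is removed rather than the whole query string.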

Indexing API daily quota. Google limits publish requests to 200 per day per project by default. For sites with 10k+ corrections, you need a queue system. We batch 200 per day, log the remainder, and continue the next day. Do not exceed the quota: further requests are rejected with a 429 error until it resets, and you lose the rest of the day.
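The queue can be as simple as a date-keyed schedule. The sketch below only plans the batches; persisting the schedule and sending the requests are left out, and the function name `schedule_batches` is hypothetical.

```python
from datetime import date, timedelta

DAILY_QUOTA = 200  # default publish quota per project, as stated above


def schedule_batches(urls, start, daily_quota=DAILY_QUOTA):
    """Map each submission date to the slice of URLs to send on that day."""
    schedule = {}
    for offset, index in enumerate(range(0, len(urls), daily_quota)):
        day = start + timedelta(days=offset)
        schedule[day] = urls[index:index + daily_quota]
    return schedule


if __name__ == "__main__":
    corrected = [f"https://example.com/page-{i}" for i in range(641)]
    for day, batch in schedule_batches(corrected, date.today()).items():
        print(day.isoformat(), len(batch))
```

For the 641 corrected URLs in the worked example, this yields four daily batches: three of 200 and one of 41.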

FAQ

How do I detect soft 404 errors with Python for a large site?

Use requests.get on each URL, check the status code, then parse the body. If the status is 200 but the body length is under 150 characters or the body contains keywords like 'page not found', classify the URL as a soft 404. For 50k+ URLs, use asyncio or a thread pool so the run finishes in reasonable time, and set a per-request timeout.
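A thread-pool version of that check could be sketched as follows. It uses `urllib` instead of `requests` to stay dependency-free, treats a status of 0 as "network failure" (an assumption of this sketch), and accepts an injectable `fetch` function so the logic can be tested offline.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url, timeout=10):
    """Return (status, body_text); status 0 marks a network-level failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""
    except OSError:  # DNS failures, refused connections, timeouts
        return 0, ""


def check_url(url, fetch=fetch_page, min_chars=150):
    """Return (url, status, is_soft_404) using the length and keyword rules."""
    status, body = fetch(url)
    too_short = len(body.strip()) < min_chars
    says_missing = "page not found" in body.lower()
    return url, status, status == 200 and (too_short or says_missing)


def check_many(urls, fetch=fetch_page, workers=16):
    """Check URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: check_url(u, fetch), urls))
```

Keep `workers` modest on sites you do not control, since a large pool can look like an attack to the origin server.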

Can Python fix blocked-by-robots indexing errors automatically?

Yes, but only partially. Python can detect which URLs are blocked by parsing robots.txt with robotparser. The fix requires actual access to the server or CMS to remove the disallow or noindex directive. After the directive is removed, the script can resubmit the URL via the Indexing API.

What is the best Python library for parsing Google Search Console data?

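Use google-api-python-client and build the 'searchconsole' service. Search Analytics queries return at most 25,000 rows per request; paginate using the startRow parameter. The index coverage report itself is not available as a bulk export through the API, so check the coverage state per URL with the URL Inspection method (urlInspection.index.inspect) and combine the results with pandas for analysis.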

How do I handle duplicate without canonical errors using Python?

Download the full HTML of each URL, compute a SHA256 hash of the body (after stripping scripts and CSS), group URLs by hash. For groups with more than one URL, none with a self-referencing canonical, flag as 'duplicate without canonical'. The fix is to add a canonical tag pointing to the preferred URL.
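A minimal sketch of that grouping, assuming the pages are already downloaded into a dict of URL to HTML. It uses regular expressions rather than a full HTML parser, which is adequate for well-formed markup but is a simplification; the function names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

_BODY = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.I | re.S)
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
_CANONICAL = re.compile(r"""<link\b[^>]*\brel=["']canonical["'][^>]*>""", re.I)
_HREF = re.compile(r"""\bhref=["']([^"']+)["']""", re.I)


def body_hash(html):
    """SHA-256 of the <body> content with scripts, styles and extra whitespace removed."""
    match = _BODY.search(html)
    body = match.group(1) if match else html
    stripped = _SCRIPT_STYLE.sub("", body)
    normalized = " ".join(stripped.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def canonical_of(html):
    """Return the canonical href declared in the page, or None."""
    tag = _CANONICAL.search(html)
    if not tag:
        return None
    href = _HREF.search(tag.group(0))
    return href.group(1) if href else None


def duplicates_without_canonical(pages):
    """Return URL groups that share a body hash and have no self-canonical member."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[body_hash(html)].append(url)
    flagged = []
    for urls in groups.values():
        if len(urls) > 1 and not any(canonical_of(pages[u]) == u for u in urls):
            flagged.append(sorted(urls))
    return flagged
```

Hashing only the `<body>` keeps differences in the `<head>`, including the canonical tag itself, from hiding otherwise identical pages.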

What is the daily limit for the Google Indexing API and how do I work around it?

200 publish requests per Google Cloud project per day by default. If you need to submit more, use multiple service accounts (each with its own project) or stagger submissions over days. Never exceed the limit; you risk getting the project suspended. Track usage on the project's quota page in the Google Cloud console, or keep your own count of submissions.

How do I check if a URL is blocked by robots.txt using Python?

Use the urllib.robotparser module from the standard library. Fetch the robots.txt content with requests.get, pass its lines to RobotFileParser.parse(), then call can_fetch('Googlebot', url). If it returns False, the URL is blocked. Note: a robots.txt file can contain different rule groups for different user-agents, so always specify Googlebot.
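As a small sketch, the check reduces to a few lines once the robots.txt text is in hand. The helper name `blocked_urls` is hypothetical, and the robots.txt content is passed in as a string so the example does not need network access.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that robots_txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    robots = """User-agent: Googlebot
Disallow: /checkout/

User-agent: *
Disallow: /private/
"""
    candidates = [
        "https://example.com/",
        "https://example.com/checkout/cart",
        "https://example.com/private/report",
    ]
    print(blocked_urls(robots, candidates))
```

In this sample, Googlebot matches its own rule group and therefore ignores the `*` group, which is why `/private/report` is not reported as blocked for it. The standard-library parser matches rules by path prefix, so patterns that rely on wildcard extensions may be evaluated differently from Googlebot; verify those cases in GSC.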

What is the most common false positive when detecting soft 404s with Python?

Short landing pages or pages with mostly JavaScript rendering. A page that loads 50 characters of HTML but 200KB of JS will appear as a soft 404 to a pure HTTP check. Mitigate by rendering pages in a headless browser (Playwright or Selenium) or by whitelisting known short but valid paths like /404-test.

How do I resubmit corrected URLs to Google via the Indexing API in Python?

Authenticate with a service account JSON key, create credentials with google.oauth2.service_account using the https://www.googleapis.com/auth/indexing scope, then POST to https://indexing.googleapis.com/v3/urlNotifications:publish with the JSON body {'url': '...', 'type': 'URL_UPDATED'}. Handle 429 errors with exponential backoff. Log all responses for audit.
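The sketch below shows one way to put those pieces together. It assumes the google-auth package is installed and that `key_path` points to your service account key file; the helper names and the retry parameters are illustrative choices, not values taken from Google's documentation.

```python
import time

ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"
SCOPES = ["https://www.googleapis.com/auth/indexing"]


def build_notification(url, removed=False):
    """Request body for the publish call; URL_DELETED signals a removed page."""
    return {"url": url, "type": "URL_DELETED" if removed else "URL_UPDATED"}


def backoff_delays(retries=5, base=1.0, cap=60.0):
    """Exponential delays in seconds: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def publish(url, key_path, removed=False):
    """POST one notification, retrying on 429. Returns (status_code, response_text)."""
    # google-auth is imported here so the helpers above work without it installed.
    from google.auth.transport.requests import AuthorizedSession
    from google.oauth2 import service_account

    credentials = service_account.Credentials.from_service_account_file(
        key_path, scopes=SCOPES
    )
    session = AuthorizedSession(credentials)
    for delay in backoff_delays():
        response = session.post(ENDPOINT, json=build_notification(url, removed))
        if response.status_code != 429:
            return response.status_code, response.text
        time.sleep(delay)  # quota or rate limit hit: wait, then retry
    return 429, ""
```

Backoff helps with short-lived rate limits; once the daily quota is exhausted, retrying will keep returning 429, so queue the URL for the next day instead.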

Can I use Python to find indexing errors in a PBN or guest post network?

Yes. Run the same pipeline: fetch all URLs, check status and content, flag soft 404s and noindex. For guest posts, ensure the target site allows indexing. Use a <a href="https://medium.com/@alexa.sam2026/how-to-index-pbn-links-safely-the-2026-sandbox-escape-protocol-ee763a3171e9">sandbox escape protocol</a> for safe submission, and always verify with GSC. Be careful with automated submissions to avoid spam penalties.

What does 'crawled but not indexed' mean and how does Python help diagnose it?

Google crawled the URL but chose not to index it due to quality, duplication, or technical issues. Python helps by checking the page content length, canonical tags, and load speed. If the page is thin or a duplicate, you can flag it. However, some 'not indexed' URLs are low priority, not errors.

Turning diagnostics into a repeatable workflow

You cannot fix what you do not measure. The pipeline described here runs weekly on a cron job. It outputs a CSV with error classification and suggested action. The team reviews the 'fix list' column, implements changes, and the script resubmits. No manual GSC export, no guesswork.

For those new to the Indexing API, refer to the quick sitemap indexing guide for a complementary approach: submitting a sitemap after corrections ensures Google re-discovers the fixes faster. Combine both methods for best results.

Related guides

Python Indexing API vs Sitemap: When to Use Each for Google

Python Google Indexing API Error Handling: 429, 403 & Rate Limit Recovery

Python Google Indexing API Setup: Step-by-Step OAuth & Permissions
