Guide

Crawl budget fundamentals for JavaScript sites

What crawl budget is, why JavaScript sites waste it, and the three engineering controls that move the needle in the first sprint.

10 min read · Procedure: 20 min audit · Beginner · Updated April 21, 2026


Crawl budget is never unlimited. On JavaScript sites it is more constrained than on static sites because every client-rendered URL must pass through Google's Web Rendering Service before its content can be indexed. That queue is the bottleneck; if you are new to the architecture behind it, start with how pre-rendering works.

This guide covers three engineering controls that recover crawl budget on JavaScript sites: pre-rendering the canonical URL set, tightening the canonical strategy, and tuning sitemap lastmod hints. If your site is already beyond 100k URLs, keep the large sites (100k+ pages) guide open alongside this one.

The catch is that teams often call every indexing problem a crawl-budget problem. Sometimes they are right. Often they are looking at canonicals, soft 404s, thin pages, or stale snapshots and calling the whole mess 'crawl budget.' This guide is for separating those cases before you spend engineering time on the wrong fix.

**Read next when:** the diagnosis on this page points to a specific failure mode. If you suspect 70-85% of your crawl is stuck in the Web Rendering Service (WRS) queue, branch to the 80% crawl-budget loss pattern. If the fundamentals are clear and you need expansion levers, go to expand your crawl budget: 7 levers. For the financial framing of the same problem, see crawl-budget ROI of prerendering and crawl-budget optimization for JS sites.

Step-by-step

How to: Crawl budget

1
Audit the current render-queue waste
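Open Google Search Console → Settings → Crawl stats. The report does not label requests as rendered or unrendered, so use two proxies: the "Page resource load" share under "By Googlebot type" and the JavaScript share under "By file type." If more than 30% of successful crawl requests are tied to rendering, the render queue is the likely bottleneck.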
2
Identify the canonical URL set
List every URL pattern that should be indexed. Exclude filter permutations, paginated variants past page 3, and session-specific URLs. The canonical set is usually 10-20% of the total crawlable surface.
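
The exclusion rules in this step can be expressed as a small classifier. The sketch below is a minimal illustration, not a reference implementation: the `FILTER_PARAMS` and `SESSION_PARAMS` lists, the `?page=N` pagination scheme, and the `classifyUrl` name are all assumptions to replace with your own URL conventions.

```typescript
// Hypothetical parameter names; replace them with the ones your site uses.
const FILTER_PARAMS = ["color", "size", "sort"];
const SESSION_PARAMS = ["sessionid", "sid"];
const MAX_INDEXED_PAGE = 3; // pagination past page 3 is excluded

type Verdict = { canonical: boolean; reason: string };

function classifyUrl(rawUrl: string): Verdict {
  const params = new URL(rawUrl).searchParams;

  if (SESSION_PARAMS.some((p) => params.has(p))) {
    return { canonical: false, reason: "session-specific URL" };
  }
  if (FILTER_PARAMS.some((p) => params.has(p))) {
    return { canonical: false, reason: "filter permutation" };
  }

  // Assumes pagination is expressed as ?page=N; a missing value means page 1.
  const page = Number(params.get("page") ?? "1");
  if (!Number.isInteger(page) || page < 1 || page > MAX_INDEXED_PAGE) {
    return { canonical: false, reason: "deep or invalid pagination" };
  }
  return { canonical: true, reason: "indexable pattern" };
}
```

Running this over a crawl export gives a quick sanity check: if far more than 10-20% of the crawlable surface comes back as canonical, the exclusion lists are probably too short.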

3
Pre-render the canonical set only

Route crawler traffic on canonical URLs to the pre-rendered cache. Everything outside the canonical set continues to hit origin until it proves indexation value (via Search Console impressions / clicks). On high-fanout sites this is also the cheapest way to avoid over-investing in low-value URL segments.

middleware.ts

```typescript
import { NextResponse } from "next/server";
import type { NextRequest } from "next/server";

// Broad crawler match; extend it as new user agents show up in your logs.
const BOT_REGEX = /bot|crawler|spider|googlebot|bingbot/i;

// Only these path patterns belong to the canonical set.
const CANONICAL_PATTERNS = [
  /^\/$/,
  /^\/p\/[^/]+$/,
  /^\/c\/[^/]+$/,
  /^\/(pricing|blog|docs)(\/.+)?$/,
];

export function middleware(req: NextRequest) {
  const ua = req.headers.get("user-agent") ?? "";

  // Human traffic and non-canonical URLs keep hitting origin.
  if (!BOT_REGEX.test(ua)) return NextResponse.next();
  if (!CANONICAL_PATTERNS.some((r) => r.test(req.nextUrl.pathname)))
    return NextResponse.next();

  // Crawlers on canonical URLs are served from the pre-rendered cache.
  return NextResponse.rewrite(
    new URL(
      `https://render.ostr.io/render?url=${encodeURIComponent(req.nextUrl.toString())}`,
    ),
  );
}
```

4
Split the sitemap by priority

Create a sitemap index and shard it by priority: canonicals with a recent lastmod (daily recrawl target), canonicals with a stable lastmod (weekly), and the long tail (monthly). Each shard is a separate file, so Google can fetch the shards independently of one another.

public/sitemap-index.xml

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://yourdomain.com/sitemaps/canonicals-fresh.xml</loc></sitemap>
  <sitemap><loc>https://yourdomain.com/sitemaps/canonicals-stable.xml</loc></sitemap>
  <sitemap><loc>https://yourdomain.com/sitemaps/long-tail.xml</loc></sitemap>
</sitemapindex>
```
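
The three shards referenced by the index can be generated from a single URL inventory. The following is a rough sketch under stated assumptions: the `SitemapEntry` shape, the 7-day cutoff for "recent," and the rule that non-canonical entries go to the long tail are illustrative choices, not something the guide prescribes.

```typescript
type SitemapEntry = { loc: string; lastmod: Date; canonical: boolean };
type Shard = "canonicals-fresh" | "canonicals-stable" | "long-tail";

const FRESH_WINDOW_DAYS = 7; // assumption: "recent" means changed in the past week
const DAY_MS = 24 * 60 * 60 * 1000;

// Decide which shard an entry belongs to.
function pickShard(entry: SitemapEntry, now: Date): Shard {
  if (!entry.canonical) return "long-tail";
  const ageDays = (now.getTime() - entry.lastmod.getTime()) / DAY_MS;
  return ageDays <= FRESH_WINDOW_DAYS ? "canonicals-fresh" : "canonicals-stable";
}

// Escape the five characters that are reserved in XML text.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Render one shard as a <urlset> document with date-only lastmod values.
function renderShard(entries: SitemapEntry[]): string {
  const urls = entries.map((e) => {
    const day = e.lastmod.toISOString().slice(0, 10);
    return `  <url><loc>${escapeXml(e.loc)}</loc><lastmod>${day}</lastmod></url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}

function buildShards(entries: SitemapEntry[], now: Date): Record<Shard, string> {
  const groups: Record<Shard, SitemapEntry[]> = {
    "canonicals-fresh": [],
    "canonicals-stable": [],
    "long-tail": [],
  };
  for (const entry of entries) groups[pickShard(entry, now)].push(entry);
  return {
    "canonicals-fresh": renderShard(groups["canonicals-fresh"]),
    "canonicals-stable": renderShard(groups["canonicals-stable"]),
    "long-tail": renderShard(groups["long-tail"]),
  };
}
```

The sitemap protocol caps a single file at 50,000 URLs, so a large shard would need to be split further before publishing.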

5
Eliminate false positives before blaming crawl budget
Check the top affected URLs for four simpler causes first: wrong canonical target, noindex or robots blocks, soft 404 behavior, and stale snapshots. If one of those is present, you do not have a pure crawl-budget problem yet. You have an indexability hygiene problem that happens to look like slow discovery.
6
Measure crawl-rate change after 14 days
Re-open Crawl stats. The render-queue percentage should drop to under 10%. If it does not, re-check the bot-detection middleware: most "no change" outcomes trace back to a UA regex that missed common bots.
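
One way to re-check the bot-detection regex is to run it against a list of crawler user-agent strings and see which ones slip through. This is a minimal sketch: the strings in `SAMPLE_AGENTS` are illustrative samples that should be verified against each vendor's current documentation, and `findMissedAgents` is a hypothetical helper, not part of any library.

```typescript
// Same pattern as the middleware above.
const BOT_REGEX = /bot|crawler|spider|googlebot|bingbot/i;

// Illustrative user-agent samples; verify current strings in each vendor's docs.
const SAMPLE_AGENTS: Record<string, string> = {
  googlebot: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  bingbot: "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
  baiduspider: "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
  inspectionTool: "Mozilla/5.0 (compatible; Google-InspectionTool/1.0;)",
  facebook: "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
};

// Returns the names of agents the regex fails to match.
// The regex must not carry the `g` flag, or `test` becomes stateful.
function findMissedAgents(regex: RegExp, agents: Record<string, string>): string[] {
  return Object.keys(agents).filter((name) => !regex.test(agents[name]));
}

console.log(findMissedAgents(BOT_REGEX, SAMPLE_AGENTS));
// → ["inspectionTool", "facebook"]
```

Neither missed string contains "bot," "crawler," or "spider," which is how a regex that looks broad can still leave gaps. Whether each missed agent should receive pre-rendered HTML is a policy decision for your site.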

Symptoms that the render queue is your actual bottleneck

The clearest symptoms are slow indexation on new URLs, unstable discovery on pages that clearly exist in the sitemap, and a gap between what a browser sees and what Google indexes first. On large JS sites these symptoms usually appear before teams have a precise crawl-budget vocabulary.

If those symptoms appear mostly on faceted, paginated, or inventory-heavy sections, compare this guide with the 80% crawl-budget loss pattern and the marketplaces use case.

One practical sign is asymmetry: pages are technically reachable, but only a fraction of them get indexed after each sitemap or content push. Another is lag that clusters on JavaScript-heavy templates rather than on the whole site. When the pattern follows templates, not random URLs, the render queue is usually involved.

What crawl budget is not

Crawl budget is not a universal excuse for every indexation problem. If the page is blocked by robots.txt, noindexed, canonicalized elsewhere, or returning thin content, pre-rendering alone will not fix it.

Use this guide when the page should be indexable and the problem is delayed rendering or wasted fetches. Use pre-render cache headers when the issue is stale content rather than missed discovery.

False positives that look like crawl budget but are not

False positive one: canonicals point away from the URL you expect to rank. False positive two: the page returns `200 OK` but behaves like a soft 404 because the useful content never loads or the template is effectively empty. False positive three: the page is indexable, but the snapshot is stale enough that Google keeps seeing old state. False positive four: internal links barely surface the page, so Googlebot has no strong signal that the URL matters.

In practice, these are the cases that waste the most time. Teams try to solve them with rendering infrastructure alone, then conclude that crawl budget is mysterious. The problem was never mysterious. The wrong layer was being debugged.
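
The false positives above can be checked in a fixed order before crawl budget is considered at all. The sketch below assumes you have already collected the facts for each URL; the `PageFacts` shape, the `triage` function, and all three thresholds are hypothetical and should be tuned to your templates.

```typescript
type PageFacts = {
  url: string;
  canonicalHref: string | null; // value of <link rel="canonical">, if any
  blockedOrNoindex: boolean;    // robots.txt disallow or a noindex directive
  status: number;               // HTTP status code
  visibleTextLength: number;    // characters of rendered body text
  snapshotAgeHours: number;     // age of the pre-rendered snapshot
  inboundInternalLinks: number; // internal links pointing at this URL
};

type Diagnosis =
  | "blocked-or-noindex"
  | "canonical-mismatch"
  | "soft-404"
  | "stale-snapshot"
  | "weak-internal-links"
  | "possible-crawl-budget";

// All three thresholds are assumptions for illustration.
const MIN_TEXT_LENGTH = 200;
const MAX_SNAPSHOT_AGE_HOURS = 24;
const MIN_INBOUND_LINKS = 3;

function triage(page: PageFacts): Diagnosis {
  if (page.blockedOrNoindex) return "blocked-or-noindex";

  // Resolve relative canonicals against the page URL before comparing.
  if (page.canonicalHref !== null) {
    const target = new URL(page.canonicalHref, page.url).href;
    if (target !== new URL(page.url).href) return "canonical-mismatch";
  }
  if (page.status === 200 && page.visibleTextLength < MIN_TEXT_LENGTH) return "soft-404";
  if (page.snapshotAgeHours > MAX_SNAPSHOT_AGE_HOURS) return "stale-snapshot";
  if (page.inboundInternalLinks < MIN_INBOUND_LINKS) return "weak-internal-links";

  // Only URLs that pass every hygiene check are crawl-budget candidates.
  return "possible-crawl-budget";
}
```

The ordering is a design choice: cheaper, more decisive checks come first, so a URL is only labeled a crawl-budget candidate after every hygiene check has passed.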

What should change, and how fast

The first useful checkpoint is 7-14 days after shipping the canonical-set routing and sitemap cleanup. That window is long enough for Googlebot to react to the lower render burden on the URLs you touched. The second checkpoint is 30-45 days, when the site starts showing whether the gain is durable or whether waste simply moved elsewhere in the URL graph.

The metric to watch is not only raw crawl count. Watch the share of requests that require rendering, the number of newly indexed canonical URLs, the lag between publish and first indexation, and whether long-tail templates are still starved. If crawl count rises but useful discovery does not, you fixed throughput but not prioritization.
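
Two of these metrics reduce to simple arithmetic once the data is assembled. The sketch below assumes hypothetical `CrawlSample` and `IndexEvent` records built from your own logs and Search Console exports; how you decide that a request "required rendering" is left to your data source.

```typescript
type CrawlSample = { url: string; requiredRendering: boolean };
type IndexEvent = { url: string; publishedAt: Date; firstIndexedAt: Date | null };

const DAY_MS = 24 * 60 * 60 * 1000;

// Fraction (0..1) of sampled crawl requests that required rendering.
function renderShare(samples: CrawlSample[]): number {
  if (samples.length === 0) return 0;
  const rendered = samples.filter((s) => s.requiredRendering).length;
  return rendered / samples.length;
}

// Median days between publish and first indexation; URLs not yet indexed are skipped.
function medianLagDays(events: IndexEvent[]): number | null {
  const lags: number[] = [];
  for (const e of events) {
    if (e.firstIndexedAt !== null) {
      lags.push((e.firstIndexedAt.getTime() - e.publishedAt.getTime()) / DAY_MS);
    }
  }
  if (lags.length === 0) return null;
  lags.sort((a, b) => a - b);
  const mid = Math.floor(lags.length / 2);
  return lags.length % 2 === 1 ? lags[mid] : (lags[mid - 1] + lags[mid]) / 2;
}
```

A median is used instead of a mean so that a handful of never-prioritized long-tail URLs does not mask an improvement on the canonical set. Track the count of unindexed URLs separately, since they are excluded from the lag.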

Operational mistakes that keep crawl budget stuck

Mistake one is pre-rendering too much, too early. Teams often route the entire crawlable surface into a render pool instead of defining a canonical set first. Mistake two is leaving sitemap shards undifferentiated, which tells Google that every URL matters equally. Mistake three is treating bot detection as finished after one regex pass, even though missed user agents can erase the whole gain.

Mistake four is not linking the cleaned URL set strongly enough. Crawl budget and internal linking are not separate topics at scale. If the URLs that deserve recrawl are four clicks deep and receive weak in-body links, Google still learns that they are lower priority than you think they are.
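
Click depth can be measured with a breadth-first traversal of the internal link graph. This is a minimal sketch, assuming you already have the graph as a map from each page to the pages it links to; `LinkGraph`, `clickDepths`, `tooDeep`, and the `MAX_DEPTH` of 3 are illustrative names and values.

```typescript
// Maps each page to the internal pages it links to.
type LinkGraph = Map<string, string[]>;

// Breadth-first search: the first time a page is reached is its minimum click depth.
function clickDepths(graph: LinkGraph, start: string): Map<string, number> {
  const depths = new Map<string, number>([[start, 0]]);
  const queue: string[] = [start];
  for (let i = 0; i < queue.length; i++) {
    const page = queue[i];
    const depth = depths.get(page) ?? 0;
    for (const target of graph.get(page) ?? []) {
      if (!depths.has(target)) {
        depths.set(target, depth + 1);
        queue.push(target);
      }
    }
  }
  return depths;
}

const MAX_DEPTH = 3; // assumption: anything deeper than three clicks gets flagged

// Canonical URLs that are deeper than MAX_DEPTH or not reachable at all.
function tooDeep(graph: LinkGraph, start: string, canonicalUrls: string[]): string[] {
  const depths = clickDepths(graph, start);
  return canonicalUrls.filter((url) => {
    const depth = depths.get(url);
    return depth === undefined || depth > MAX_DEPTH;
  });
}
```

URLs returned by `tooDeep` are candidates for stronger in-body links from shallower pages. Unreachable canonicals are flagged too, since a URL that exists only in the sitemap carries no internal-link signal.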

FAQ

Questions engineers ask about this guide

Crawl budget rarely matters for sites under 1,000 URLs. Below that scale Googlebot crawls nearly everything on every cycle. It becomes a real constraint above 10,000 URLs or when content changes frequently.

Pre-rendering improves crawl budget only when it removes render-queue time from the crawl path. If your site was already server-rendered, pre-rendering does not help with crawl budget; it helps with origin cost instead.

Canonical and soft-404 fixes alone recover part of the loss. Tightening canonical strategy and fixing soft 404s recovers 30-50% of lost budget on JavaScript sites. The remaining 50-70% requires either SSR or pre-rendering.

Check whether the affected URLs are supposed to index, are canonically self-referencing, are not blocked, and return useful HTML. If those basics are broken, fix them first. Crawl budget becomes the right diagnosis when the URLs are already indexable but still get delayed or starved because rendering and low-value crawl paths consume too much of the crawler's capacity.

Define the canonical URL set, route only that set through pre-rendering for bots, and split the sitemap by priority. That combination usually creates the first visible gain faster than broad architectural changes.

Crawl budget adjusts continuously based on site health, response latency, and host load. Significant structural changes (sitemap expansion, render-queue reduction) take 2-4 weeks to show a sustained change.

Usually the first KPI is not rankings. It is faster discovery of new or updated URLs, a lower share of render-heavy crawl requests, and a shorter lag between publish time and indexation. Rankings often move later, after discovery and recrawl behavior stabilize.

Editorial trust

Written by the ostr.io engineering team. We build and run pre-rendering infrastructure for more than 200 engineering teams, which is where the numbers and code samples on this page come from.

Last updated April 21, 2026. Editorial scope and review policy: About prerender.info.
