Your hosting bill keeps growing. Your servers feel busier than they should. Real users sometimes wait too long for a page. Then you check the access logs and realize the traffic isn’t who you thought it was. Most of it is bots. Specifically, AI scrapers and aggressive crawlers, walking your application end-to-end, generating database queries, blowing through your cache, and quietly inflating your infrastructure cost. You’re not alone. Customers run into this regularly, and the problem is rarely visible until the bill arrives.
Bots will follow every link, no matter how deep
One pattern keeps showing up: pages with no natural endpoint. A calendar that links back month by month is the textbook example. A bot finds the “previous month” link and follows it. The next page has another “previous month” link, and the bot follows that one too. And on, and on. One customer’s calendar generated those links all the way back to the 1700s, and the bots dutifully followed. The shape is what matters here, not the calendar. Any “previous/next” navigation without a lower bound creates the same trap: archive paginations, year/month/day drill-downs, infinite tag pages. This isn’t malicious. The bot is well-behaved. The site never told it to stop.
Faceted listings turn into millions of URLs
The other pattern is just as expensive. Picture a product listing with facets: brand, screen size, memory, color, price. Every facet adds a query parameter. A crawler that tries every link on the page eventually tries every permutation of every facet combination. A dozen facets with a few values each means millions of URLs: twelve facets that each take three values or stay unset is 4^12, roughly 16 million distinct query strings. All of them hit your application server. All of them render full product listings against your database. None of them get cached, because each URL looks unique. Each individual request is fine. The aggregate is what hurts your infrastructure budget.
Where the cost lands
Bot traffic uses the same application servers, the same database, and the same cache as your real users. When that capacity runs hot, you either scale up or your real users start waiting. Either way, you pay for it. The unpleasant part is that you rarely see the bot tax broken out. Your hosting bill says “compute time” or “database hours”. It doesn’t say “rendered the same product page 80,000 times for crawlers in a single afternoon”. You see the impact as a slowly growing infrastructure spend, the kind of trend that looks like natural growth. Auto-scaling makes this worse, not better. A platform that quietly adds capacity to absorb crawler traffic will keep you online and keep growing the bill. The site doesn’t go down. The graphs look fine. The invoice arrives. The bots don’t notice. Your users do.
The robots.txt fix
Well-behaved scrapers, which is to say almost all of the ones causing this kind of pressure, do read your robots.txt. You can tell them which URL patterns to skip. For the faceted-listing case, a single line goes a long way.
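The exact rule depends on how your facet parameters show up in the URL. As an illustrative sketch, assuming the facets arrive as query parameters on a /products listing:

```
User-agent: *
# Skip any listing URL that carries facet query parameters;
# the plain /products page and canonical product URLs stay crawlable.
Disallow: /products?*
Disallow: /products/*?*
```

The * wildcard is documented by Google and Bing and honored by most well-behaved crawlers, so one pattern covers every facet permutation at once.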
The catch: the same Disallow line that stops AI scrapers from drilling into every facet combination also stops them from indexing those URLs at all. That used to matter mostly for Google search results. Now it also matters for what ChatGPT, Claude, and Perplexity recommend when someone asks. Everyone wants to be what the LLMs recommend.
The fix isn’t to give up on robots.txt, it’s to make sure the products stay reachable through paths the rules don’t cover. Link to every product from the main listing page. Add a sitemap.xml that points search engines and AI crawlers at the canonical product URLs. Bots that should index your catalog follow those. When the same bots see a Disallow line, they politely oblige and skip the URLs you asked them to skip. A robots.txt is a polite request, not a wall, and that’s exactly enough.
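A sitemap is just an XML list of the URLs you do want crawled. A minimal sketch, with example.com and the product paths standing in for your own:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Canonical product URLs only; no faceted query-string variants -->
  <url>
    <loc>https://www.example.com/products/acme-laptop-15</loc>
  </url>
  <url>
    <loc>https://www.example.com/products/acme-laptop-17</loc>
  </url>
</urlset>
```

A Sitemap: https://www.example.com/sitemap.xml line in robots.txt tells crawlers where to find it.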
Robots.txt changes don’t take effect immediately. Most crawlers re-fetch it every day or two. Expect bot traffic to drop within a few days, not within minutes. Sites that were struggling under crawler load tend to return to normal once the rules propagate.
Caching, with sorted keys
Even after robots.txt is in place, some traffic still comes through, and some of it hits pages with query parameters. Two URLs that differ only in parameter order are functionally identical, but a naive cache treats them as different keys.
The fix is to normalize URLs before they reach the cache, wherever the cache lives: CDN, reverse proxy, or application layer. Sort query parameters alphabetically, and strip the ones you don’t care about (tracking IDs, session tokens). Hundreds of variants collapse into a single cache entry.
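Where exactly this runs depends on where your cache lives. As a minimal sketch in Python at the application layer, with the tracking-parameter names as illustrative assumptions:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters that never change what the page renders; the names are illustrative.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid", "sessionid"}

def normalize_url(url: str) -> str:
    """Return a canonical cache key: tracking params stripped, the rest sorted."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    params.sort()  # ?color=red&brand=acme and ?brand=acme&color=red collapse together
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(params), ""))

# Both variants map to the same cache entry:
# normalize_url("https://shop.example.com/products?color=red&brand=acme&utm_source=x")
# normalize_url("https://shop.example.com/products?brand=acme&color=red")
# -> "https://shop.example.com/products?brand=acme&color=red"
```

Most CDNs and reverse proxies offer an equivalent sort-and-strip option in their own configuration; the important part is that it happens before the cache key is computed.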
If you happen to be using Varnish, Varnish 103: Cache Optimization with URL Normalization on Upsun walks through the VCL-level details.
Limits in the application itself
The deepest fix isn’t a configuration file. It’s the application asking itself: should this link exist? The 1700s calendar problem isn’t really a robots.txt problem. It’s a UX problem that happens to also be a crawler problem. Nobody wants to see events from the 18th century. The “previous month” link only needs to keep working as long as you have meaningful events to show. Past that point, return a 404, stop rendering the link, or redirect to the earliest month with content. The same logic applies to faceted product pages. If a facet combination produces zero results, you don’t need a unique URL for it. Render a “no products match” page that doesn’t link out to other empty combinations. These are small changes. They remove the infinite hallway that bots will otherwise wander down.
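A sketch of where that bound might live, assuming a Flask-style route, a hypothetical load_events() stand-in for real data access, and 2015 as a placeholder for wherever your content actually starts:

```python
from datetime import date
from flask import Flask, abort, redirect, url_for

app = Flask(__name__)

# Hypothetical lower bound: the first month that actually has events.
EARLIEST_MONTH = date(2015, 1, 1)

def load_events(month_start: date) -> list[str]:
    """Stand-in for your real data access."""
    return []

@app.route("/calendar/<int:year>/<int:month>")
def calendar(year: int, month: int):
    if year < 1 or not 1 <= month <= 12:
        abort(404)
    requested = date(year, month, 1)
    if requested < EARLIEST_MONTH:
        # No endless chain of empty months: send bots (and people)
        # to the earliest month that has content instead.
        return redirect(url_for("calendar",
                                year=EARLIEST_MONTH.year,
                                month=EARLIEST_MONTH.month))
    events = load_events(requested)
    # Only emit a "previous month" link while there is still content behind it.
    show_previous_link = requested > EARLIEST_MONTH
    return {"events": events, "show_previous_link": show_previous_link}
```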
A new normal, not a villain story
The bots are doing what their owners asked them to do, on inputs their owners didn’t anticipate. The websites were built before AI scrapers existed at scale. Their authors assumed the only mechanical traffic they’d see was the occasional Googlebot pass. Both sides are catching up to a new normal. The good news: the fixes are mostly cheap and mostly under your control. A robots.txt that matches the shape of your site. HTML meta directives like <meta name="robots" content="noindex"> and rel="nofollow" on the links you don’t want followed. Application limits where infinite navigation doesn’t make sense. A cache that knows the difference between meaningful URL variation and noise. All of these work because the same well-behaved bots that overcrawl your site also read your instructions. Apply the ones that fit your application, and most of the bot pressure goes away on its own.
Your hosting bill should follow.