Replicate your production outage on staging

Your production site went down. Too much traffic, something in your code couldn’t handle it, and your users got errors instead of pages. It happens. Maybe the traffic subsided on its own. Maybe you threw more resources at it. Maybe you did both and called it a night. Either way, the immediate fire is out. Now comes the harder question: how do you make sure it doesn’t happen again?

The fix is the easy part

You dig into your application and find the problem. Maybe you used Blackfire to find the bottleneck and fix it. Maybe you weren’t using your cache layer and every request was hitting the database. Maybe you had a loop running individual queries when a single SQL statement would have done the job. Maybe a specific page was doing something expensive on every render that should have been precomputed. You fix it. You deploy to staging. You feel good about it. But feeling good isn’t the same as knowing. You need to prove that if the same traffic hits your application again, it won’t fall over the same way.

Load testing isn’t the same as traffic replication

The instinct here is to run a load test. Pick a tool, throw a bunch of concurrent requests at your staging environment, see if it holds up. The problem is that a standard load test doesn’t reproduce what actually happened.

Your outage wasn’t caused by “a lot of requests.” It was caused by a specific pattern of requests hitting specific pages with specific data behind them. Maybe 1 page out of 200 was responsible for 80% of your CPU usage because of a particularly expensive query. Maybe the combination of 3 pages being hit simultaneously caused lock contention in your database. A load test that hammers only your homepage won’t surface any of that.

What you need isn’t a generic load test. You need to replay the traffic that caused the outage, against an environment that matches production, and see what happens.

The ingredients for a real replay

To replicate a production outage with confidence, you need 3 things:

The traffic pattern. The distribution of your most-hit URLs and their relative request rates from the incident. You won’t replay every single request (the observability API gives you the top 10 URLs, not a full access log), but those top URLs are where the pressure was, and that’s what matters.
The same data. Your staging database needs to contain the same records as production. A page that’s slow because it renders 50,000 products won’t be slow if your staging database has 12.
The same resources. If production runs on 2 CPUs and 4 GB of RAM, your staging environment needs to match. Testing your fix on a beefier machine proves nothing about production.

Getting all 3 in the same place is harder than it sounds. Most setups give you maybe 1 of these. Upsun gives you all 3.

How to do it on Upsun

Clone your environment

Start by creating a staging environment that’s an exact copy of production. On Upsun, this is a single operation. When you clone an environment, you get a byte-for-byte copy of your production data in an isolated environment. Same database records, same files, same everything. Then give that environment the same resources as production. If production has dedicated CPU and memory allocations, match them on staging. You want the hardware to be identical so that the only variable is your code change.

Get your traffic data

Upsun’s observability API exposes the top 10 URLs by traffic volume and bandwidth for any given time window. You can pull this data through the Upsun CLI for the time range that covers your outage:

upsun p:curl 'environments/main/observability/http-metrics/overview?from=<timestamp>&to=<timestamp>'

Replace the two <timestamp> placeholders with the start and end of your outage window, and main with the name of your production environment if it differs. If you don’t want to figure out the timestamps yourself, open the Upsun Console’s metrics page for your environment, select the time range you care about, and grab the timestamps from the URL in your browser. What you get back is a breakdown of the most-requested URLs and their relative traffic volume.
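Before handing the data to anything else, it helps to see what "relative traffic volume" turns into as request rates. The sketch below is a minimal illustration, not the API's documented format: the input shape (an array of objects with url and requests fields), the function name toRequestRates, and the sample numbers are all assumptions. Adapt the field names to whatever the command actually returns.

```javascript
// Hypothetical input shape. The real API response may be structured
// differently, so map its fields onto { url, requests } first.

/**
 * Convert per-URL request counts for an outage window into
 * per-minute rates and each URL's share of the listed traffic.
 */
function toRequestRates(topUrls, windowSeconds) {
  if (windowSeconds <= 0) {
    throw new Error('windowSeconds must be positive');
  }
  const total = topUrls.reduce((sum, entry) => sum + entry.requests, 0);
  return topUrls.map((entry) => ({
    url: entry.url,
    // Requests per minute, rounded up so low-volume URLs do not fall to 0.
    perMinute: Math.ceil((entry.requests * 60) / windowSeconds),
    // Share of the total across the listed URLs, between 0 and 1.
    share: total === 0 ? 0 : entry.requests / total,
  }));
}

// Made-up numbers for a 30-minute outage window.
const sample = [
  { url: '/', requests: 9000 },
  { url: '/products', requests: 54000 },
  { url: '/search', requests: 27000 },
];
console.log(toRequestRates(sample, 30 * 60));
```

Per-minute rates are used here instead of per-second ones so that low-traffic URLs still end up with a whole number of requests to replay.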

Generate a k6 load test with Claude

Here’s where it gets interesting. Take that observability data and hand it to Claude. Ask it to generate a k6 load test script that replicates the traffic distribution from your outage window. k6 is a JavaScript-based load testing framework that’s well-suited to this kind of thing. You define scenarios with different virtual users, request patterns, and timing. Claude can translate the observability data into a k6 script that matches the real traffic proportions, hitting the same URLs at the same relative rates. A prompt along these lines works:

Run this command and use the output to generate a k6 load test
script that replicates the traffic distribution:

upsun p:curl 'environments/main/observability/http-metrics/overview?from=<timestamp>&to=<timestamp>'

The target URL is https://staging-abcdefgh1234567.us.platformsh.site.
Scale the virtual users so that the total request rate matches
what we saw in production.

Claude generates the k6 script. You run it against your staging environment. That environment has the same data and the same resources as production, and the traffic pattern reflects what your application was dealing with during the outage.
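If you want a reference point when reviewing what Claude produces, here is a minimal sketch of one way such a script can be structured: one k6 scenario per URL, each using the constant-arrival-rate executor so that every URL is hit at its own rate. This is not necessarily what Claude will generate. The function name buildScenarios, the exec name hit, the TARGET_PATH variable, and the virtual user counts are all hypothetical choices made for illustration.

```javascript
/**
 * Build a k6 `scenarios` object with one constant-arrival-rate
 * scenario per URL. `rates` is an array of { url, perMinute }.
 */
function buildScenarios(rates, duration) {
  const scenarios = {};
  rates.forEach((entry, index) => {
    // k6 expects an integer rate; skip URLs with no traffic to replay.
    const rate = Math.ceil(entry.perMinute);
    if (rate < 1) {
      return;
    }
    scenarios[`url_${index}`] = {
      executor: 'constant-arrival-rate',
      rate,
      timeUnit: '1m', // `rate` iterations are started per minute
      duration,
      preAllocatedVUs: 20, // starting guess, tune for your response times
      maxVUs: 200, // upper bound if responses slow down under load
      exec: 'hit', // name of the exported k6 function to run
      env: { TARGET_PATH: entry.url }, // read inside k6 as __ENV.TARGET_PATH
    };
  });
  return scenarios;
}

// Made-up rates for illustration.
const scenarios = buildScenarios(
  [
    { url: '/', perMinute: 300 },
    { url: '/products', perMinute: 1800 },
  ],
  '10m'
);
console.log(JSON.stringify(scenarios, null, 2));
```

In an actual k6 script, this object would be exported as the scenarios key of options, and an exported function named hit would call http.get on your staging base URL plus __ENV.TARGET_PATH. An arrival-rate executor keeps the request rate fixed even when responses slow down, which is closer to how real visitors behave during an incident than a fixed pool of looping virtual users.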

Interpret the results

If your staging environment handles the replicated traffic without issues, you have high confidence that your fix works. Not certainty, because real production always has surprises, but a level of confidence that a generic load test can’t provide. If it still falls over, you know your fix wasn’t sufficient, and you know it before deploying to production. That’s worth a lot.
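"Without issues" is easier to judge when you write it down before the run. k6 supports thresholds, which are pass/fail criteria evaluated on its built-in metrics. The sketch below uses two of them; the limits themselves (1% errors, 800 ms at the 95th percentile) are placeholder assumptions, not recommendations, so replace them with numbers that reflect what your users tolerate.

```javascript
// Hypothetical pass/fail limits, chosen only for illustration.
const thresholds = {
  // Fail the run if 1% or more of requests return errors.
  http_req_failed: ['rate<0.01'],
  // Fail the run if the 95th percentile response time reaches 800 ms.
  http_req_duration: ['p(95)<800'],
};

console.log(thresholds);
```

In a k6 script, this object goes under the thresholds key of options. When a threshold fails, k6 exits with a non-zero status, so the result of the replay is a clear pass or fail instead of a judgment call on a wall of numbers.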

Why this matters

The value here isn’t in any single piece. Load testing tools exist. Staging environments exist. Observability dashboards exist. The value is in combining all 3 into a workflow that answers a specific question: “If the same thing happens again, will my fix hold?”

Most teams can’t answer that question with confidence because their staging environment doesn’t match production closely enough, or because their load test doesn’t match the real traffic pattern. When your staging is an exact clone of production, running on the same resources, and your load test replays the traffic from the incident, the gap between “I think it’s fixed” and “I’ve verified it’s fixed” gets very small.

That confidence changes how you operate. You deploy fixes faster because you’ve already validated them under realistic conditions. You sleep better because you’ve seen your application handle the exact scenario that woke you up last time. And when someone asks “are we sure this won’t happen again,” you have a better answer than “we think so.”
