Your production site went down. Too much traffic, something in your code couldn’t handle it, and your users got errors instead of pages. It happens. Maybe the traffic subsided on its own. Maybe you threw more resources at it. Maybe you did both and called it a night. Either way, the immediate fire is out. Now comes the harder question: how do you make sure it doesn’t happen again?
The fix is the easy part
You dig into your application and find the problem. Maybe you used Blackfire to find the bottleneck and fix it. Maybe you weren’t using your cache layer and every request was hitting the database. Maybe you had a loop running individual queries when a single SQL statement would have done the job. Maybe a specific page was doing something expensive on every render that should have been precomputed.
You fix it. You deploy to staging. You feel good about it. But feeling good isn’t the same as knowing. You need to prove that if the same traffic hits your application again, it won’t fall over the same way.
Load testing isn’t the same as traffic replication
The instinct here is to run a load test. Pick a tool, throw a bunch of concurrent requests at your staging environment, see if it holds up. The problem is that a standard load test doesn’t reproduce what actually happened.
Your outage wasn’t caused by “a lot of requests.” It was caused by a specific pattern of requests hitting specific pages with specific data behind them. Maybe 1 page out of 200 was responsible for 80% of your CPU usage because of a particularly expensive query. Maybe the combination of 3 pages being hit simultaneously caused lock contention in your database. A uniform load test across your homepage won’t surface any of that.
What you need isn’t a generic load test. You need to replay the traffic that caused the outage, against an environment that matches production, and see what happens.
The ingredients for a real replay
To replicate a production outage with confidence, you need 3 things:
- The traffic pattern. The distribution of your most-hit URLs and their relative request rates from the incident. You won’t replay every single request (the observability API gives you the top 10 URLs, not a full access log), but those top URLs are where the pressure was, and that’s what matters.
- The same data. Your staging database needs to contain the same records as production. A page that’s slow because it renders 50,000 products won’t be slow if your staging database has 12.
- The same resources. If production runs on 2 CPUs and 4 GB of RAM, your staging environment needs to match. Testing your fix on a beefier machine proves nothing about production.
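The second and third ingredients can be checked mechanically before any traffic is replayed. The following is a minimal sketch, assuming you can already collect CPU, memory, and per-table row counts for each environment with your own tooling. `EnvSnapshot` and `replay_mismatches` are hypothetical names invented for this illustration, not part of any Upsun API, and the numbers simply echo the examples above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EnvSnapshot:
    """Hypothetical summary of one environment; populate it from your own tooling."""
    cpus: float
    memory_gb: float
    row_counts: dict = field(default_factory=dict)  # table name -> number of rows


def replay_mismatches(prod: EnvSnapshot, staging: EnvSnapshot) -> list[str]:
    """Return the reasons why staging is not a fair stand-in for production."""
    problems = []
    if staging.cpus != prod.cpus:
        problems.append(f"CPU: production has {prod.cpus}, staging has {staging.cpus}")
    if staging.memory_gb != prod.memory_gb:
        problems.append(
            f"memory: production has {prod.memory_gb} GB, staging has {staging.memory_gb} GB"
        )
    # Compare every table production knows about; a missing table counts as 0 rows.
    for table, expected in sorted(prod.row_counts.items()):
        actual = staging.row_counts.get(table, 0)
        if actual != expected:
            problems.append(
                f"table {table}: production has {expected} rows, staging has {actual}"
            )
    return problems


# Illustrative numbers only, echoing the examples in the text.
prod = EnvSnapshot(cpus=2, memory_gb=4, row_counts={"products": 50_000})
small_staging = EnvSnapshot(cpus=4, memory_gb=4, row_counts={"products": 12})

for problem in replay_mismatches(prod, small_staging):
    print(problem)
```

An empty list means the only remaining variable is your code change. Anything else is a reason to distrust the replay result, whether staging turns out faster or slower than production.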
How to do it on Upsun
Clone your environment
Start by creating a staging environment that’s an exact copy of production. On Upsun, this is a single operation. When you clone an environment, you get a byte-for-byte copy of your production data in an isolated environment. Same database records, same files, same everything.
Then give that environment the same resources as production. If production has dedicated CPU and memory allocations, match them on staging. You want the hardware to be identical so that the only variable is your code change.
Get your traffic data
Upsun’s observability API exposes the top 10 URLs by traffic volume and bandwidth for any given time window. You can pull this data through the Upsun CLI for the time range that covers your outage, supplying the start and end of your outage window as timestamps. If you don’t want to figure out the timestamps yourself, open the Upsun Console’s metrics page for your environment, select the time range you care about, and grab the timestamps from the URL in your browser.
What you get back is a breakdown of the most-requested URLs and their relative traffic volume.
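Once you have that breakdown, it helps to turn it into per-URL targets that a replay tool can consume. The following is a minimal sketch, assuming you have copied the request counts into a plain dictionary by hand. The function name `build_replay_plan`, the URLs, and the counts are all hypothetical, and the real output format of the observability API may differ.

```python
def build_replay_plan(top_urls: dict, window_seconds: float) -> list[dict]:
    """Convert per-URL request counts into traffic shares and requests per second.

    top_urls maps a URL path to the number of requests seen during the outage
    window; window_seconds is the length of that window.
    """
    total = sum(top_urls.values())
    if total <= 0 or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window length")

    plan = []
    # Busiest URL first, so the heaviest pressure points are obvious at a glance.
    for url, hits in sorted(top_urls.items(), key=lambda item: item[1], reverse=True):
        plan.append(
            {
                "url": url,
                "share": hits / total,         # fraction of the listed traffic
                "rps": hits / window_seconds,  # average request rate to reproduce
            }
        )
    return plan


# Hypothetical counts for a 10-minute outage window.
observed = {"/": 3000, "/products": 6000, "/search": 1000}

for entry in build_replay_plan(observed, window_seconds=600):
    print(f'{entry["url"]:<12} {entry["share"]:.0%} {entry["rps"]:.2f} req/s')
```

Note that each share is relative to the URLs in the list, not to all traffic, because only the top 10 URLs are available. The rates are also averages over the window, so a sharp spike inside the window will be smoothed out unless you pick a narrower time range.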