Your site is down. Fastly says 503. Your logs say 502. Your boss says “fix it.” But what’s actually happening? The error message in your nginx error log looks something like this:
How requests actually flow through your stack
When a request hits your Upsun project, it takes this path:
- Upsun gateway (you don’t need to worry about this part)
- nginx (your web server)
- php-fpm (your PHP runtime using the FastCGI Process Manager)
The highway analogy
Think of php-fpm workers like highway lanes. If you have four workers, you have a four-lane highway. Each lane can handle one car (request) at a time, moving in the same direction.

Unlike some other runtimes (Python with ASGI, Node.js), php-fpm workers handle one request at a time. When a worker is busy executing your PHP code, it can’t do anything else until that request finishes.

The math is straightforward: if you have 4 workers and each request takes 0.5 seconds to process, you can handle 8 requests per second (4 workers ÷ 0.5 seconds per request = 8 requests/second).

If you don’t have any workers available, new requests have to wait. nginx won’t immediately return a 502 error - it’ll wait for a worker to become available, up to its configured timeout. But if workers stay busy too long, nginx gives up and returns that 502 error you’re seeing.

Why workers get exhausted
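Exhaustion is easier to reason about with the throughput arithmetic written down. The Python sketch below is a back-of-the-envelope model, not a measurement tool. The function name `max_throughput` is made up for illustration, and it assumes every request takes the same time and each worker handles exactly one request at a time.

```python
def max_throughput(workers: int, seconds_per_request: float) -> float:
    """Upper bound on requests/second for one-request-at-a-time workers."""
    if workers < 0 or seconds_per_request <= 0:
        raise ValueError("workers must be >= 0 and seconds_per_request must be > 0")
    return workers / seconds_per_request


# The highway example: 4 workers, 0.5 s per request.
print(max_throughput(4, 0.5))  # 8.0
# The same 4 workers when each request slows down to 5 s.
print(max_throughput(4, 5.0))  # 0.8
```

Real traffic is burstier than this, so treat the result as a ceiling rather than a target.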
There are two main reasons workers get exhausted, and they love working together to take your site down.

Slow requests
Here’s the death spiral: you get a lot of requests, so your database gets busy. When your database gets busy, each PHP request has to wait longer for query results. When requests take longer, fewer requests can be processed per second. When fewer requests can be processed, the queue builds up. The queue building up makes everything slower.

If your average request time goes from 0.5 seconds to 5 seconds, your throughput drops by 10x. Those 4 workers that could handle 8 requests/second? They’re now handling less than 1 request/second.

This is why you don’t want to run your workers at 100% utilization all the time. If you have 10 workers and they’re all busy constantly, any small increase in traffic or any small slowdown in request processing will tip the entire system over.

External API calls without timeouts
Another common culprit: your code calls an external API using libcurl, and that API is slow or down.
Here’s the problem: libcurl in PHP doesn’t have a timeout by default. If you’re calling an external API and it never responds, your worker waits forever. That worker is now stuck, doing nothing, unable to process any other requests.
You can spot this in your php.access.log by looking at the CPU percentage. PHP requests are supposed to spend CPU time processing things. If the CPU % usage is low in the logs, PHP is waiting most of the time - probably on an external service.
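One way to automate that check is a small script. The Python sketch below assumes each php.access.log line has the shape `timestamp method status duration ms memory kB cpu% path`. That layout, the sample lines, the thresholds, and the helper name `find_waiting_requests` are all illustrative assumptions, so compare them with a real line from your own log before relying on the result.

```python
import re

# Assumed line layout (verify against your own php.access.log):
#   2024-01-01T12:00:00Z GET 200 5012.345 ms 2048 kB 1.20% /api/report
LINE_RE = re.compile(
    r"^(?P<time>\S+) (?P<method>\S+) (?P<status>\d{3}) "
    r"(?P<ms>[\d.]+) ms (?P<kb>\d+) kB (?P<cpu>[\d.]+)% (?P<path>.*)$"
)


def find_waiting_requests(lines, min_ms=1000.0, max_cpu=10.0):
    """Return (path, ms, cpu) for slow requests that used little CPU."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed layout
        ms = float(match["ms"])
        cpu = float(match["cpu"])
        # Slow and mostly idle: the worker was probably waiting on something.
        if ms >= min_ms and cpu <= max_cpu:
            hits.append((match["path"], ms, cpu))
    return hits


# Hypothetical sample lines, made up for this example.
sample = [
    "2024-01-01T12:00:00Z GET 200 5012.345 ms 2048 kB 1.20% /api/report",
    "2024-01-01T12:00:01Z GET 200 480.100 ms 4096 kB 92.50% /home",
]
print(find_waiting_requests(sample))
```

The 1000 ms and 10% cut-offs are arbitrary starting points. Tune them to what normal requests look like in your application.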
Here’s what that looks like in practice:
The deadlock scenario: when your app calls itself
An even worse version of the missing timeout issue happens when your application calls its own API over HTTP. Here’s how the deadlock unfolds with two workers:
- User A’s request comes in
- Worker 1 spins up to process request A (worker count: 1)
- User B’s request comes in
- Worker 2 spins up to process request B (worker count: 2)
- Worker 1 makes a libcurl call to your own website during processing
- Request C comes in (originally from Worker 1), but can’t be processed because both workers are already busy
- Worker 1 waits for a worker to become available
- Worker 2 makes a libcurl call to your own website during processing
- Request D comes in (originally from Worker 2), but can’t be processed because both workers are already busy
- Worker 2 waits for a worker to become available
Both workers are now waiting for a free worker that can only appear once one of them finishes, so neither request ever completes.
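The steps above can be sketched as a tiny model. This Python snippet is a simplification, not a simulation of php-fpm itself. The function name `self_calls_deadlock` is hypothetical, and it assumes that all outer requests arrive before any self-call is answered and that each outer request makes exactly one call back to the same worker pool.

```python
def self_calls_deadlock(worker_count: int, concurrent_requests: int) -> bool:
    """Return True if workers end up stuck waiting on their own self-calls."""
    # Outer requests occupy workers first; extra requests just sit in the queue.
    blocked = min(worker_count, concurrent_requests)
    free = worker_count - blocked

    # While a free worker exists, it answers one self-call. The outer request
    # that made the call then finishes and hands its own worker back.
    while blocked > 0 and free > 0:
        blocked -= 1
        free += 1

    # Anything still blocked has no free worker left to answer its self-call.
    return blocked > 0


print(self_calls_deadlock(2, 2))  # True: both workers wait on each other
print(self_calls_deadlock(3, 2))  # False: the spare worker breaks the cycle
```

In this model a single spare worker is enough to unwind the pile-up, but only until concurrent traffic grows to match the worker count again.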
Diagnosing self-requests
You can check if your workers are calling themselves with this snippet:
Most of the connections it lists are expected:
- Connections to your database
- Connections to Redis
- Connections to external APIs (Stripe, AWS, etc.)
What you’re looking for is a connection to 18.200.179.139, which is one of Upsun’s gateway IPs:
The fix
Don’t call your own API over HTTP when you’re already inside the same container. If you need data, query the database directly. If you absolutely must call your own API, you need to either:
- Refactor to call the data layer directly instead of going through HTTP
- Increase your worker count significantly to reduce the chance of deadlock
- Set aggressive timeouts so the deadlock eventually breaks (not recommended)
Quick fixes when your site is down
If your site is currently down with 502 errors, here’s what to do:
Look for self-requests:
Run the diagnostic snippet above and see if workers are connecting back to your own gateways. This is often the smoking gun.
Check for external API timeouts:
Search your codebase for libcurl calls and make sure every one has a timeout set. Look at your php.access.log for requests with low CPU % - that indicates waiting on external services.
Monitor your request times:
Check your PHP access logs to see if request times are spiking. If you’re consistently seeing requests that take 5+ seconds, you need to optimize those endpoints or offload work to background jobs.
Restart your application (last resort):
If you’ve run out of time to debug and need to get back online immediately: