When php-fpm runs out of workers: a 502 error field guide

Your site is down. Fastly says 503. Your logs say 502. Your boss says “fix it.” But what’s actually happening? The error message in your nginx error log looks something like this:

2025/10/28 22:56:52 [error] 207#0: *8050 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 81.122.114.94, server: , request: "GET /foo/bar HTTP/1.1", upstream: "fastcgi://unix:/run/app.sock:", host: "foo.bar", referrer: "https://foo.bar/baz-com"

This isn’t a PHP error. It’s nginx telling you it tried to talk to php-fpm and something went wrong. Let’s figure out what.

How requests actually flow through your stack

When a request hits your Upsun project, it takes this path:

Upsun gateway (you don’t need to worry about this part)
nginx (your web server)
php-fpm (your PHP runtime using the FastCGI Process Manager)

nginx handles static files on its own. But when the request is for a PHP script, nginx can’t execute PHP code itself, so it hands the request to php-fpm. A php-fpm worker executes your code and sends the result back to nginx, which sends it to the client.

The key detail here: when you see a 502 error, it’s usually coming from nginx. It means nginx tried to send a request to php-fpm and couldn’t get a response. If you have a CDN in front of your site, the CDN might show this as a 503 instead, because from its perspective, the entire origin failed to respond.

The highway analogy

Think of php-fpm workers like highway lanes. If you have four workers, you have a four-lane highway. Each lane can handle one car (request) at a time, moving in the same direction. Unlike some other runtimes (Python with ASGI, Node.js), php-fpm workers handle one request at a time. When a worker is busy executing your PHP code, it can’t do anything else until that request finishes.

The math is straightforward: if you have 4 workers and each request takes 0.5 seconds to process, you can handle 8 requests per second (4 workers ÷ 0.5 seconds per request = 8 requests/second).

If you don’t have any workers available, new requests have to wait. nginx won’t immediately return a 502 error - it’ll wait for a worker to become available, up to its configured timeout. But if workers stay busy too long, nginx gives up and returns that 502 error you’re seeing.
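The lane math above is easy to play with in a shell. This is a minimal sketch, not a capacity planner: the function name `max_throughput` is made up for illustration, and it assumes every request takes the same amount of time.

```shell
# Rough upper bound on requests/second for a pool of one-request-at-a-time workers.
# Usage: max_throughput WORKERS SECONDS_PER_REQUEST
max_throughput() {
  awk -v workers="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", workers / secs }'
}

max_throughput 4 0.5   # prints 8.00
max_throughput 4 5     # prints 0.80
```

The second call shows what happens to the same four lanes once each request takes 5 seconds instead of 0.5.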

Why workers get exhausted

There are two main reasons workers get exhausted, and they love working together to take your site down.

Slow requests

Here’s the death spiral: you get a lot of requests, so your database gets busy. When your database gets busy, each PHP request has to wait longer for query results. When requests take longer, fewer requests can be processed per second. When fewer requests can be processed, the queue builds up. The queue building up makes everything slower.

If your average request time goes from 0.5 seconds to 5 seconds, your throughput drops by 10x. Those 4 workers that could handle 8 requests/second? They’re now handling less than 1 request/second.

This is why you don’t want to run your workers at 100% utilization all the time. If you have 10 workers and they’re all busy constantly, any small increase in traffic or any small slowdown in request processing will tip the entire system over.
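To put a number on “how close to the edge are we,” you can estimate utilization as arrival rate × request time ÷ workers. The sketch below is a rough model only: `pool_utilization` is a hypothetical helper, and the 6 requests/second figure is an assumed traffic level, not something measured.

```shell
# Share of worker capacity in use, as a percentage.
# Usage: pool_utilization REQUESTS_PER_SECOND SECONDS_PER_REQUEST WORKERS
pool_utilization() {
  awk -v rate="$1" -v secs="$2" -v workers="$3" \
    'BEGIN { printf "%.0f\n", rate * secs / workers * 100 }'
}

pool_utilization 6 0.5 4   # prints 75: some headroom left
pool_utilization 6 1 4     # prints 150: more work arrives than the pool can finish
```

Anything above 100 means the queue grows for as long as the slowdown lasts, which is the tipping point described above.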

External API calls without timeouts

Another common culprit: your code calls an external API using libcurl, and that API is slow or down. Here’s the problem: libcurl in PHP doesn’t set a timeout on the overall request by default. If you’re calling an external API and it never responds, your worker waits forever. That worker is now stuck, doing nothing, unable to process any other requests.

You can spot this in your php.access.log by looking at the CPU percentage. PHP requests are supposed to spend CPU time processing things. If the CPU % usage is low in the logs, PHP is waiting most of the time - probably on an external service. Here’s what that looks like in practice:

web@app.0:~$ cat /var/log/php.access.log | grep -Fa "$(date +%Y-%m-%d)" | sort -n -k 4 | tail -n 25
2024-04-25T01:43:53Z GET 404 9069.887 ms 14336 kB 1.43% /foo/bar/baz
2024-04-25T00:01:46Z GET 200 9603.032 ms 36864 kB 3.02% /baz/bar/qux

These requests took 9+ seconds but used only 1.43% and 3.02% CPU. That’s not normal - sub-50% CPU usage generally indicates the request spent most of its time waiting, not processing.

If you have 4 workers and an external service goes down, you might see your request times go from 0.5 seconds to 30 seconds or more. Suddenly your site that could handle 8 requests/second can barely handle 1 request every 7.5 seconds (4 workers ÷ 30 seconds per request), and you’re getting 502 errors even though your traffic hasn’t increased at all.

The fix: set sensible timeouts on every single API call. Think about what a reasonable response time is from the perspective of a user waiting on a page to load. If a third-party API takes more than 3 seconds to respond, that’s probably too long for a synchronous page load.

Bonus tip: write code with the assumption that calls can time out and react accordingly - retry the request, show an error message, or silently ignore the failure, depending on how critical that data is.
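Here’s the “timeout plus fallback” idea sketched with the curl command line, since it maps closely onto what you’d set in PHP: `--connect-timeout` and `--max-time` correspond to `CURLOPT_CONNECTTIMEOUT` and `CURLOPT_TIMEOUT`. Treat it as an illustration under assumptions: the function name, the 2 and 3 second limits, and the URL are all made up for the example.

```shell
# Fetch a URL with hard time limits; print a fallback value if the call fails.
# Usage: fetch_with_timeout URL FALLBACK
fetch_with_timeout() (
  url=$1
  fallback=$2
  # --connect-timeout: max seconds to establish the connection
  # --max-time:        max seconds for the whole transfer
  # --fail:            treat HTTP 4xx/5xx responses as failures too
  if body=$(curl --silent --fail --connect-timeout 2 --max-time 3 "$url" 2>/dev/null); then
    printf '%s\n' "$body"
  else
    printf '%s\n' "$fallback"
  fi
)

# Assumes nothing listens on local port 1, so the call fails fast and the fallback is used.
fetch_with_timeout "http://127.0.0.1:1/rates" "rates unavailable"
```

The important part is the `else` branch: the caller always gets an answer within a bounded time, and decides what to do when the real one isn’t available.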

The deadlock scenario: when your app calls itself

An even worse version of the missing timeout issue happens when your application calls its own API over HTTP. Here’s how the deadlock unfolds:

User A’s request comes in
Worker 1 spins up to process request A (worker count: 1)
User B’s request comes in
Worker 2 spins up to process request B (worker count: 2)
Worker 1 makes a libcurl call to your own website during processing
Request C comes in (originally from Worker 1), but can’t be processed because both workers are already busy
Worker 1 waits for a worker to become available
Worker 2 makes a libcurl call to your own website during processing
Request D comes in (originally from Worker 2), but can’t be processed because both workers are already busy
Worker 2 waits for a worker to become available

Now both workers are waiting for workers to become available. Nobody can proceed. You have a deadlock. Every new request that comes in now just adds to the pile. nginx starts timing out and returning 502 errors. Your site is effectively down until you restart php-fpm. The telltale sign: you have a customer who says “I have to redeploy every few hours to bring the site back up.”
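The walkthrough above boils down to one condition: a self-call can only be served if at least one worker is not itself blocked on a self-call. A toy model of that check follows; `self_call_state` and its messages are hypothetical, and it only covers the simple case where each blocked request is waiting on exactly one self-call.

```shell
# Report whether self-calls can still be served.
# Usage: self_call_state WORKERS REQUESTS_BLOCKED_ON_SELF_CALLS
self_call_state() {
  free=$(( $1 - $2 ))
  if [ "$free" -gt 0 ]; then
    echo "ok: $free worker(s) free to serve the self-calls"
  else
    echo "deadlock: no free worker to serve the self-calls"
  fi
}

self_call_state 2 2   # the scenario above: both workers blocked
self_call_state 4 2   # more workers: the same traffic still makes progress
```

It also shows why a bigger pool only lowers the odds: with enough concurrent requests, the blocked count catches up with any worker count.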

Diagnosing self-requests

You can check if your workers are calling themselves with this snippet:

# Snapshot lsof once, then for each php-fpm pool worker print its PID,
# its open file descriptors, and the lsof entry (type, device, protocol,
# endpoint) for every socket it holds.
ofs=$(lsof)
for pid in $(ps faux|grep 'php-fpm: pool'|grep -v grep|awk '{print $2}'); do
  echo "===== $pid =====" &&
  ps faux|grep $pid|grep -v grep|awk '{print $2 "\t" $13 "\t" $14}' &&
  fds=$(ls /proc/$pid/fd -l|sed '1d'|awk '{print $9 " " $10 " " $11}'|sort -n) &&
  echo "$fds" &&
  for fd in $(echo "$fds"|grep socket|cut -d'[' -f2| sed 's/]$//'); do
    echo "$ofs"|grep $pid|grep $fd|awk '{print $5 " " $6 " " $8 " " $9}';
  done;
done | less

This shows you what connections each php-fpm worker has open. You’ll see normal stuff like:

Connections to your database
Connections to Redis
Connections to external APIs (Stripe, AWS, etc.)

But sometimes you’ll see a TCP connection to an IP address you recognize - the IP of Upsun’s gateways. That means your worker is making an HTTP request back to your own application. If you see this pattern across multiple workers, you’ve found your problem. For example, this output shows a connection to 18.200.179.139, which is one of Upsun’s gateway IPs:

===== 38288 =====
38288   pool    web
...
IPv4 2516326193 TCP app.0:59654->ec2-18-200-179-139.eu-west-1.compute.amazonaws.com:https

That worker is calling itself.
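If the output is long, you can let awk do the scanning. The sketch below assumes the output format shown above (a `===== PID =====` header per worker, then lines whose third field is `TCP`) and prints the PID of every worker with a connection to a host you pass in. The function name is made up, and the second worker in the sample input is hypothetical.

```shell
# List worker PIDs that hold a TCP connection to the given host.
# Usage: workers_calling_host HOST_SUBSTRING < diagnostic_output
workers_calling_host() {
  awk -v host="$1" '
    /^===== [0-9]+ =====$/ { pid = $2; next }              # header: start of a worker
    $3 == "TCP" && index($4, "->" host) { seen[pid] = 1 }  # outbound connection to host
    END { for (p in seen) print p }
  ' | sort -n
}

workers_calling_host "ec2-18-200-179-139" <<'EOF'
===== 38288 =====
38288   pool    web
IPv4 2516326193 TCP app.0:59654->ec2-18-200-179-139.eu-west-1.compute.amazonaws.com:https
===== 38290 =====
38290   pool    web
IPv4 2516326200 TCP app.0:59700->db.internal:mysql
EOF
```

For the sample input this prints 38288 only. Several PIDs in the output means several workers are calling back into your own site at once.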

The fix

Don’t call your own API over HTTP when you’re already inside the same container. If you need data, query the database directly. Your options, from best to worst:

1. Refactor to call the data layer directly instead of going through HTTP
2. Increase your worker count significantly to reduce the chance of deadlock
3. Set aggressive timeouts so the deadlock eventually breaks (not recommended)

Option 1 is always the right answer.

Quick fixes when your site is down

If your site is currently down with 502 errors, here’s what to do:

Look for self-requests: Run the diagnostic snippet above and see if workers are connecting back to your own gateways. This is often the smoking gun.
Check for external API timeouts: Search your codebase for libcurl calls and make sure every one has a timeout set. Look at your php.access.log for requests with low CPU % - that indicates waiting on external services.
Monitor your request times: Check your PHP access logs to see if request times are spiking. If you’re consistently seeing requests that take 5+ seconds, you need to optimize those endpoints or offload work to background jobs.
Restart your application (last resort): If you’ve run out of time to debug and need to get back online immediately:

systemctl --user restart app

This kills all php-fpm workers and starts fresh. It doesn’t fix the underlying problem - the issue will come back - but it buys you time to investigate properly.
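For the log checks in that list, a small awk filter can pull out the requests that were both slow and mostly idle. This is a sketch under assumptions: it expects the php.access.log layout shown earlier (duration in field 4, CPU % in field 8, path in field 9), the function name is made up, and the defaults of 5000 ms and 50% CPU come from the rules of thumb in this article. The third sample line is hypothetical.

```shell
# Print duration, CPU % and path for requests that were slow but barely used CPU.
# Usage: slow_idle_requests LOGFILE [MIN_MS] [MAX_CPU_PERCENT]
slow_idle_requests() (
  log=$1
  min_ms=${2:-5000}
  max_cpu=${3:-50}
  awk -v min_ms="$min_ms" -v max_cpu="$max_cpu" '
    {
      cpu = $8; sub(/%$/, "", cpu)   # "1.43%" -> "1.43"
      if ($4 + 0 >= min_ms + 0 && cpu + 0 < max_cpu + 0) print $4, $8, $9
    }
  ' "$log"
)

log=$(mktemp)
cat > "$log" <<'EOF'
2024-04-25T01:43:53Z GET 404 9069.887 ms 14336 kB 1.43% /foo/bar/baz
2024-04-25T00:01:46Z GET 200 9603.032 ms 36864 kB 3.02% /baz/bar/qux
2024-04-25T00:05:10Z GET 200 180.250 ms 20480 kB 88.10% /fast/page
EOF
slow_idle_requests "$log"   # prints the two 9-second requests, skips the fast one
rm -f "$log"
```

On a real project you’d point it at /var/log/php.access.log. The paths that show up repeatedly are the first places to look for missing timeouts or self-requests.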

Understanding the problem is half the battle

502 errors from php-fpm worker exhaustion happen for predictable reasons: slow requests, external APIs without timeouts, and self-requests creating deadlocks. Once you understand what’s happening, you can fix the immediate problem. But there’s another half to this: configuring your workers properly so you don’t run out in the first place. That involves understanding memory limits, sizing hints, and the relationship between workers and container resources. Stay tuned for an upcoming article diving into these details!
