When a deploy succeeds but your app is down

The deploy pipeline is green. Customers see errors. The gap is almost always in what the health check is actually checking.

By Marten

Last Thursday at 21:47 I pushed a small refactor to a client project. The deploy went green. The pipeline logged deployment.completed with duration_seconds: 42.1. I closed the laptop.

The first email came in at 22:14. “Login is broken.” I reopened the laptop and went to check. The site loaded. I logged in as myself. It worked.

Over the next ten minutes I got three more of those emails. I could not reproduce on any of my accounts.

This post is the full trace of what I found and what the actual bug was. It is not a case I have seen before, and I want to write it down so I remember the shape of it.

The symptom

Users reported that after entering their email and password, the app would redirect to the dashboard and then show a blank page. Reload fixed it. So this was not “login is broken” exactly, but “the first page after login fails to render.”

Their browsers were hitting a 500 on an API call right after login, but only on the first call. Subsequent calls worked.

On my machine, in my session, it worked first-try. No 500. No blank page.

The misleading clue

I checked the app logs. No 500s. The times the users reported did not line up with any error in the structured log.

I checked the nginx access log on production. I saw the 500s. The request path was /api/user/me. The response time was 0.004 seconds, which is unusually fast for a 500. That does not look like an application error. An application error would take longer because the request has to go through auth, through the router, through whatever code errored, and back.

0.004 seconds meant nginx itself was returning 500 without ever hitting the app.
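The "too fast to be the app" observation can be turned into a filter over the access log. This is a minimal sketch, not the command I ran that night. It assumes a log_format that is the default "combined" format with $request_time appended as the last field (the default format does not log request time at all), and the function name fast_500s and the 0.01 second threshold are made up for illustration.

```shell
# Print path and request time for every 500 that came back faster than a threshold.
# Assumed log layout: nginx "combined" format with $request_time appended, so that
#   field 7  = request path
#   field 9  = status code
#   last field = request time in seconds
fast_500s() {
  awk -v max="${1:-0.01}" '$9 == 500 && $NF + 0 <= max + 0 { print $7, $NF }'
}

# Usage (the log path is an assumption, adjust to your setup):
#   fast_500s 0.01 < /var/log/nginx/access.log
```

A 500 that shows up here is a candidate for "nginx answered on its own"; a 500 that took hundreds of milliseconds more likely went through the application.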

Where nginx can 500 without touching PHP

Two places, in my experience: a config error (although a broken config would have made the nginx reload fail, so it would never have gone live), or a missing upstream socket file. I checked both. First the config, with nginx -t:

nginx: configuration file /etc/nginx/nginx.conf test is successful

Socket file:

ls -l /var/run/php/php8.3-fpm.sock
srw-rw---- 1 www-data www-data 0 Apr 15 20:14 /var/run/php/php8.3-fpm.sock

That is interesting. The socket was modified at 20:14. I deployed at 21:47, so the socket was older than my deploy. That should not matter — the socket gets recreated when PHP-FPM restarts, and I had not restarted PHP-FPM as part of my deploy.

Except. The deploy script runs systemctl reload nginx and systemctl reload php8.3-fpm. The nginx reload happened. The PHP-FPM reload, I noticed, returned exit code 0, but journalctl -u php8.3-fpm told a different story:

Apr 15 21:47:22 server systemd[1]: Reloading The PHP FastCGI Process Manager...
Apr 15 21:47:22 server php8.3-fpm[2031]: ERROR: [pool www] Another FPM instance seems to already listen on /var/run/php/php8.3-fpm.sock
Apr 15 21:47:22 server systemd[1]: Reload failed for The PHP FastCGI Process Manager.

The reload had failed because the socket file was still in use by the old pool, which was supposed to have released it but had not. systemd had logged the failure in the journal, but my deploy script only checked the exit code of systemctl reload, which had come back zero. That part turned out to be a distribution-specific quirk of how systemctl reload reports non-fatal errors.

The old PHP-FPM pool was running the old code. Most requests went through it just fine. But something inside the PHP-FPM master was intermittently marking the old pool’s workers as unresponsive, and when a worker got marked unresponsive between its last request and its next, nginx got a “no upstream available” response and returned 500.

The reason users saw it and I did not: the 500 hit a small fraction of requests (maybe 5%), and the users who complained happened to land on that fraction for their first post-login call. When I logged in during the debugging, I happened not to hit it, because the pool had stabilised for a while.

The fix, and the deeper fix

Immediate fix, from the server:

systemctl restart php8.3-fpm

This killed all the old pool processes, released the socket, and spawned fresh ones bound to the new code. The 500s stopped within thirty seconds.

Deeper fix, in the deploy script:

# Before
systemctl reload php8.3-fpm
# After
# Try a graceful reload first; fall back to a full restart if systemctl reports failure.
if ! systemctl reload php8.3-fpm; then
  echo "reload failed, falling back to restart" >&2
  systemctl restart php8.3-fpm
fi
# And check it really is alive: something must be listening on the pool's unix socket.
if ! ss -lx | grep -qF /var/run/php/php8.3-fpm.sock; then
  echo "PHP-FPM socket not listening after reload/restart" >&2
  exit 1
fi

Two separate improvements. The first tries a reload and falls back to a restart if that fails. The second actually checks that the socket is listening. Without the second, I am trusting the exit code of systemctl to tell me the truth, and as I just learned, it does not always.
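One caveat: on Thursday the reload exited zero and the old master was still listening on the socket, so both of those checks would probably have passed. A possible third check is to read the unit's journal for the failure that systemd did log. The sketch below is an assumption-heavy illustration, not part of my actual deploy script: the function names are hypothetical, the patterns are copied from the journal excerpt above and may be worded differently on other systemd or PHP-FPM versions, and journal entries can arrive slightly after systemctl returns.

```shell
unit=php8.3-fpm

# Succeeds (exit 0) if the journal text on stdin records a failed reload.
# Patterns come from the journal lines quoted earlier in this post.
journal_shows_failed_reload() {
  grep -E 'Reload failed for|Another FPM instance seems to already listen' >/dev/null
}

# Hypothetical wrapper: reload, then distrust the exit code and read the journal.
reload_and_verify() {
  local since
  since=$(date '+%Y-%m-%d %H:%M:%S')
  systemctl reload "$unit" || return 1
  if journalctl -u "$unit" --since "$since" --no-pager | journal_shows_failed_reload; then
    echo "journal reports a failed reload for $unit" >&2
    return 1
  fi
}

# Usage in a deploy script:
#   reload_and_verify || systemctl restart "$unit"
```

Matching on log text is brittle, so this is a complement to checking real requests, not a replacement for it.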

What this changed for me

I added a new item to my post-deploy check (the one that verifies the commit SHA, that the database is reachable, and so on): the check now also fires ten concurrent requests to the app, from the deploy machine, and requires all ten to succeed. If even one returns 500, the deploy is considered failed.
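A minimal sketch of what such a ten-request check could look like, not the exact script from the project. The names probe and smoke_test are hypothetical, as is the URL in the usage comment. It assumes curl is installed on the deploy machine and that the chosen endpoint returns 200 when healthy.

```shell
# Probe one URL and print only the HTTP status code (curl prints 000 if it cannot connect).
probe() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1"
}

# Fire n probes concurrently; succeed only if every one of them returned 200.
smoke_test() {
  local url="$1"
  local n="${2:-10}"
  local tmp i failures
  tmp=$(mktemp -d) || return 1

  # Start all probes in the background, one result file per probe.
  i=1
  while [ "$i" -le "$n" ]; do
    probe "$url" > "$tmp/$i" 2>/dev/null &
    i=$((i + 1))
  done
  wait

  # Count every probe whose status was anything other than 200.
  failures=0
  i=1
  while [ "$i" -le "$n" ]; do
    [ "$(cat "$tmp/$i" 2>/dev/null)" = "200" ] || failures=$((failures + 1))
    i=$((i + 1))
  done
  rm -rf "$tmp"

  echo "$failures of $n requests failed"
  [ "$failures" -eq 0 ]
}

# Usage in a deploy script (hypothetical URL):
#   smoke_test "https://app.example.com/health" 10 || exit 1
```

Keeping probe as its own function makes it easy to swap in an authenticated request, or a stub when testing the script itself.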

Ten requests is not a load test. It is a smoke test aimed at the case where roughly 95% of requests succeed but a worker pool is flapping. A single-request smoke test would most likely have missed my Thursday bug, because most single requests hit a working worker. Ten requests do not guarantee a catch either, but they make one far more likely.
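To put a rough number on that: if each request fails independently with probability p, the chance that at least one of n requests fails is 1 - (1 - p)^n. The sketch below works that out for the "maybe 5%" estimate from earlier. Independence is an assumption; failures from a flapping worker pool are probably clustered in time, so treat the figures as a ballpark. The function names are made up for this example.

```shell
# Probability that at least one of n independent requests fails,
# given a per-request failure rate p.
detect_prob() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", 1 - (1 - p) ^ n }'
}

# Smallest n for which the detection probability reaches the target.
requests_needed() {
  awk -v p="$1" -v target="$2" 'BEGIN {
    n = log(1 - target) / log(1 - p)
    r = int(n)
    if (r < n) r++          # round up to the next whole request
    print r
  }'
}

detect_prob 0.05 1          # prints 0.050
detect_prob 0.05 10         # prints 0.401
requests_needed 0.05 0.95   # prints 59
```

So at a 5% failure rate, one request catches the problem about 5% of the time, ten requests about 40% of the time, and it takes around 59 requests to get to a 95% chance. Ten is a reasonable floor for a smoke test, not a guarantee.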

The lesson is not “check more things.” The lesson is that “it looks fine when I check it” is a different claim from “it is fine for everyone.” The deploy process should be making the second claim, not the first. Which means the check has to do enough to actually discover the difference.

I still think about the 27 minutes or so between “deploy complete” and the first user email. Nothing I did in that window would have changed the outcome. But I did close the laptop with a wrong belief, and I do not like that I had no tool to catch the wrong belief before it became someone else’s problem.