Why does my deploy take 4 minutes?
A slow deploy is almost always three fast steps stuck waiting on one quiet one. Here's how I found the quiet one in my own pipeline.
By Marten
Four minutes. That was the median deploy time on a tiny PHP app I’ve been running for a year. The app itself is 11MB. The server is 120ms away. A full rsync -av of the whole directory takes 8 seconds. So where do the other 232 seconds go?
I assumed it was the SSH handshake piling up across multiple commands, or the post-deploy health check being cautious. Neither turned out to be right.
Measuring
The first useful thing was to stop guessing. I printed a timestamp after every step of the deploy, using GNU date with the +%H:%M:%S.%3N format (the %3N part gives milliseconds):
t() { printf "[%s] %s\n" "$(date +%H:%M:%S.%3N)" "$1" >&2; }
t "deploy start"
git_download
t "code downloaded"
build
t "build complete"
ssh_upload
t "files uploaded"
post_deploy_checks
t "checks complete"
The output, on a slow day:
[22:04:13.119] deploy start
[22:04:15.802] code downloaded
[22:04:19.411] build complete
[22:04:27.033] files uploaded
[22:08:41.204] checks complete
The first three steps took 14 seconds combined. The last step took four minutes and fourteen seconds.
So the problem was in the post-deploy check. I had been staring at rsync output this whole time.
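Reading deltas off wall-clock timestamps by eye is easy to get wrong. A minimal sketch of a wrapper that prints the elapsed time per step instead, assuming GNU date (BSD/macOS date has no %N). The name step is hypothetical and not part of the script above.

```shell
# Hypothetical helper: run one deploy step and report its duration on stderr.
# Assumes GNU date, because %N (nanoseconds) is a GNU extension.
step() {
  local name=$1
  shift
  local start end status
  start=$(date +%s%N)          # nanoseconds since the epoch
  "$@"                         # run the step itself
  status=$?
  end=$(date +%s%N)
  # Integer milliseconds, tab-separated so the line is easy to parse later.
  printf '%s\t%d ms\n' "$name" "$(( (end - start) / 1000000 ))" >&2
  return "$status"             # preserve the step's own exit code
}
```

With this, a line such as step "files uploaded" ssh_upload replaces the ssh_upload / t pair, and the slow step shows up as a single large number rather than a subtraction.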
What the check was doing
The post-deploy check hits a health endpoint and expects a JSON response with the commit SHA and a status: ok field. If the SHA on the server matches what I just deployed, the deploy is considered successful.
Simple enough. But the check was also — and this is the part I had forgotten — validating that the app could reach its own database, by asking the health endpoint to do a SELECT 1 query.
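For reference, the client side of a check like this fits in a few lines. This is a sketch under assumptions: the JSON field names "commit" and "status" and the URL in the usage comment are placeholders, since the real ones are not shown here. The SELECT 1 happens on the server, inside the endpoint, so the client only sees the result.

```shell
# Hypothetical sketch: succeed only if the health response carries the
# expected commit SHA and status "ok". Field names are assumptions.
health_ok() {
  local body=$1 expected_sha=$2
  printf '%s' "$body" |
    grep -Eq "\"commit\"[[:space:]]*:[[:space:]]*\"$expected_sha\"" &&
  printf '%s' "$body" |
    grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage (placeholder URL):
#   health_ok "$(curl -fsS --max-time 10 https://example.com/health)" \
#             "$(git rev-parse HEAD)"
```

The grep patterns are deliberately crude. A real JSON parser is the safer choice if the response ever grows beyond two flat fields.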
The database was on the same VPS. The app connects to it via 127.0.0.1. I know this cold.
The app’s database config, however, did not say 127.0.0.1. It said localhost. And on this particular Ubuntu install, /etc/nsswitch.conf had been configured to prefer mDNS for hostname resolution, with DNS as a fallback. localhost resolved via mDNS, got a “no answer”, waited the timeout, then fell back to DNS, which answered instantly.
The timeout was 30 seconds. Per connection attempt. The app opened a fresh connection inside the health check instead of reusing the pool. The check made three failed attempts before succeeding on the fourth.
30 seconds × 4 attempts = 2 minutes of waiting for a nameservice call that had no business being made at all.
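A quick way to see which lookup order a machine uses is to read the hosts: line of its nsswitch.conf. A minimal sketch, with hosts_sources as a hypothetical helper name:

```shell
# Print the lookup order from the "hosts:" line of an nsswitch.conf file.
# Defaults to the system file; pass another path to inspect a copy.
hosts_sources() {
  awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print; exit }' \
    "${1:-/etc/nsswitch.conf}"
}
```

Running time getent hosts localhost alongside it shows how long a lookup through that same order actually takes, since getent resolves names via the configured name service switch rather than querying DNS directly.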
Fixing the symptom vs fixing the cause
The obvious fix was to change localhost to 127.0.0.1 in the database config. That removed 2 minutes from the deploy immediately.
But that is a symptom fix. The deeper issue is that the health check was treating a database connection failure as a retryable transient error, when it was in fact a config problem that no amount of retrying would solve. I changed the retry logic to fail fast on “host not found” and similar non-transient errors.
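The shape of that change can be sketched as a retry loop that inspects the error before trying again. Everything here is an assumption for illustration: the function names, the RETRY_DELAY variable, and the list of error strings treated as permanent.

```shell
# Hypothetical classifier: error text that retrying cannot fix.
is_permanent() {
  case $1 in
    *"host not found"*|*"Unknown host"*|*"Access denied"*) return 0 ;;
    *) return 1 ;;
  esac
}

# retry MAX CMD [ARGS...]
# Returns 0 on success, 1 when attempts run out, 2 on a permanent error.
retry() {
  local max=$1
  shift
  local attempt=1 err
  while :; do
    # Capture stderr only; the command's stdout is discarded.
    if err=$("$@" 2>&1 >/dev/null); then
      return 0
    fi
    if is_permanent "$err"; then
      printf 'permanent failure, not retrying: %s\n' "$err" >&2
      return 2
    fi
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-1}"
  done
}
```

Matching on error strings is brittle. Where the client library exposes error codes, classifying on those is the sturdier option.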
And the even deeper issue is that the deploy script had no visibility into step durations. If I had been logging per-step timings from day one, I would have noticed this the first time it happened, not on the thirtieth deploy.
Now every deploy writes per-step timings to a log file. Once a week I look at the distribution. If any step’s 95th percentile drifts upward by more than 20%, I get an email.
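A minimal sketch of the weekly check, assuming a log with one "step<TAB>milliseconds" line per step per deploy. The log format and the names p95 and drifted are hypothetical, and the percentile uses the nearest-rank method.

```shell
# p95 STEP LOGFILE: print the 95th percentile (nearest rank) of a step's
# durations. Exits non-zero if the step has no entries.
p95() {
  local step=$1 log=$2
  awk -F'\t' -v s="$step" '$1 == s { print $2 }' "$log" |
    sort -n |
    awk '{ v[NR] = $1 }
         END {
           if (NR == 0) exit 1
           rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR) in integers
           print v[rank]
         }'
}

# drifted OLD NEW: true when NEW exceeds OLD by more than 20%.
drifted() {
  [ "$(( $2 * 100 ))" -gt "$(( $1 * 120 ))" ]
}
```

Comparing this week's p95 against last week's with drifted, and sending mail when it returns true, is enough to turn the log into an alert.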
The part nobody wants to hear
Between noticing the four-minute deploy and fixing it, I had been deploying three times a week. That’s twelve minutes a week of waiting, just because I never measured.
A year of that is ten hours of me staring at a terminal.
The lesson is not “deploys should be fast.” The lesson is that you do not notice slowness below the threshold where it actively blocks you. Four minutes does not block you. You go make coffee. You scroll Slack. You come back. The deploy is done. You do not stop to ask why it took four minutes, because four minutes is not painful enough to stop you.
The cheapest way to catch this kind of slow-creep is to measure each step, write the numbers somewhere durable, and look at them occasionally.
I now measure every step of every deploy. Not because I expect to find another 30-second hostname timeout. Because I know I will find something I was not expecting, and I do not want to wait a year to see it.