Post-deploy checks that actually catch things

A health check that only returns 200 is a tautology. The check has to verify the new code path AND the read path, against a known value.

By Marten

The simplest post-deploy check in the world is curl -f https://example.com/health. It returns HTTP 200. You declare victory. You go home.

This check has caught exactly one kind of problem in my entire career: the server is completely down. It has missed every other kind, including the ones you deployed this version to fix.

A health check that returns 200 whenever the web server is alive is measuring the web server, not your app. The app can be broken in a hundred ways that are invisible to a 200 response.

What the check should verify

Three things, in order.

One: the new code is actually running. Not “there is a new file on disk” — actually executing. The easy way to test this is to embed the commit SHA in the response:

{
  "status": "ok",
  "commit": "a1b2c3d4e5",
  "deployed_at": "2026-04-15T09:33:01Z"
}

The deploy script already knows the SHA it deployed. It calls the health endpoint and asserts the returned commit matches. If the SHA is wrong, the old container is still serving, or the process didn’t reload, or the symlink swap didn’t take.

Two: the app can reach its dependencies. Not “can it open a TCP socket” — can it actually do its job. If your app needs a database, run a trivial read query inside the check. If it calls Simezu, ping the authenticated endpoint. The check does the same handshake the app does.
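
In the structured response this becomes one section per dependency: an ok flag, plus an error field carrying the failure when it is not. Here is the dependency half of the response with a bad database (these are the exact paths the deploy script below pulls out with jq):

{
  "database": { "ok": false, "error": "connection refused" },
  "external": {
    "simezu": { "ok": true, "error": null }
  }
}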

Three: a specific known-good path returns a known-good value. Pick a URL whose response is stable: a status page, a public item fetched by ID, an empty search. Verify the response body matches a regex. If the DB is up but returning a different shape than the app expects, this will catch it.

A concrete example

Here is the post-deploy check I run for a PHP app I maintain. It lives in ./scripts/health-check on my local machine:

#!/usr/bin/env bash
set -euo pipefail

URL=$1
EXPECTED_SHA=$2

# Fetch the structured health response; fail loudly if the endpoint is down.
response=$(curl -fsSL --max-time 10 "$URL/_health") || { echo "health endpoint unreachable" >&2; exit 1; }

commit=$(echo "$response" | jq -r '.commit')
db_ok=$(echo "$response" | jq -r '.database.ok')
external_ok=$(echo "$response" | jq -r '.external.simezu.ok')

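# Check one: the new code is actually running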
if [ "$commit" != "$EXPECTED_SHA" ]; then
  echo "SHA mismatch: expected $EXPECTED_SHA, got $commit"
  exit 1
fi
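
# Check two: the app can reach its dependencies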
if [ "$db_ok" != "true" ]; then
  echo "database not ok: $(echo "$response" | jq -r '.database.error')"
  exit 1
fi
if [ "$external_ok" != "true" ]; then
  echo "simezu not ok: $(echo "$response" | jq -r '.external.simezu.error')"
  exit 1
fi

# Check three: a known-good path returns a known-good value (the public status page)
body=$(curl -fsSL --max-time 5 "$URL/status")
if ! echo "$body" | grep -q 'All systems operational'; then
  echo "status page body didn't match expected"
  exit 1
fi

echo "ok"

The last step of my deploy is ./scripts/health-check https://example.com $(git rev-parse --short HEAD). If it fails, I swap the symlink back.
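
The swap-back is nothing clever. A sketch of the deploy tail, assuming a releases layout and a PHP-FPM unit name that will both differ on your machine:

if ! ./scripts/health-check https://example.com "$(git rev-parse --short HEAD)"; then
  # Point "current" back at the previous release. These paths are
  # placeholders for whatever your deploy actually swaps.
  ln -sfn /srv/app/releases/previous /srv/app/current
  # Reload the pool so it serves the previous release's code instead of
  # cached bytecode from the bad build.
  sudo systemctl reload php-fpm
  exit 1
fi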

Each of the three checks has caught a real problem in the last year.

The SHA check caught a deploy where the PHP-FPM pool didn’t reload after the symlink swap. OPcache was still holding the old bytecode, so the app was running old code against a new schema. The health endpoint was responding and the deploy script thought it was done, but production was actually on the old build.

The database check caught a deploy where the migration ran fine but the app’s connection pool was still pointing at a stale database host (from a prior env change). The endpoint’s database.ok was false with the error “connection refused”. The 200 response was misleading.

The status page body check caught the most embarrassing one: a content rewrite on the status page had accidentally shipped with a templating placeholder intact (“All {{status}} operational”). HTTP 200, valid HTML, completely broken content. A simple grep for the expected string caught it.

What the check should not do

Three things I used to put in health checks and no longer do.

I no longer check “everything the app can do.” If the check tries to exercise every feature, it will take 30 seconds and flake randomly because one feature or another is slow. The check has to be fast enough to run in the deploy critical path. Mine is under 2 seconds.

I no longer test with state-modifying requests. The check does reads only. Creating a test user during every deploy pollutes the database and creates interesting bugs when the test user happens to have a username that collides with a real one.

I no longer report separate checks as separate endpoints. Everything the deploy cares about is on one endpoint, _health, with a structured response. Having three different endpoints to call is three things that can become out of sync.
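
For completeness, the healthy version of the whole document the deploy reads is just the earlier response plus the dependency sections:

{
  "status": "ok",
  "commit": "a1b2c3d4e5",
  "deployed_at": "2026-04-15T09:33:01Z",
  "database": { "ok": true, "error": null },
  "external": {
    "simezu": { "ok": true, "error": null }
  }
}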

The bit that made the biggest difference

The single most useful line in my check is the SHA assertion. Before I had it, I was getting the class of incident where the site “looked fine” and the team genuinely believed the deploy had worked, but production was still on the previous release for some process-restart reason. Debugging these took hours, because the obvious first step (confirm the new code is actually live) was the step everyone assumed had already been done.

An explicit SHA check turns “did the deploy work” from a thing you have to investigate into a thing the script has already verified.

One-line change to the app: embed $COMMIT_SHA in the /_health response at build time. One-line change to the deploy script: compare it to what I pushed. This catches a silent failure mode that I used to hit once every couple of months.
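
Spelled out, the pair is something like this sketch; writing the SHA to a plain file is one way to do the build half, with the app reading it into the /_health payload at boot:

# Build step: bake the short SHA into the artifact for /_health to report.
# (The COMMIT_SHA file name is arbitrary; an env var baked in at build works too.)
git rev-parse --short HEAD > COMMIT_SHA

# Deploy step: the assertion, shown in full in the script earlier.
[ "$commit" = "$EXPECTED_SHA" ] || { echo "SHA mismatch" >&2; exit 1; }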

The rule

If the only thing your post-deploy check asserts is that the web server is alive, you are not checking the deploy. You are checking the web server. Those are different things, and only one of them is what you care about the moment you press the button.