The staging environment that lied to me
Staging looked green for three weeks. Production did not. A single environment variable had diverged, and nobody noticed until the deploy.
By Marten
So the staging server was fine. Production was not.
Staging had been green for three weeks of feature work. Every PR ran against it. Every deploy smoke test came back clean. We merged to main, deployed to production, and within four minutes the customer support channel lit up with “the app is blank after login.”
I rolled back. Production recovered. I logged in to staging to reproduce. Staging still worked.
The branch had not changed between the merge and the production deploy. Same commit, same build output, same everything. And yet staging said yes and production said no.
The first wrong answer
My first theory was caching. Staging had been serving this same build for a while, maybe some CDN or service worker was holding onto something that production wasn’t. I disabled caching at every layer I could find, hard-refreshed staging, did the whole shift-ctrl-r routine. Staging still worked.
Second theory: database state. Maybe some data on staging existed that production didn’t, and the app quietly depended on it being there. I ran the app’s diagnostic endpoint on both. Both returned sensible shapes.
Third theory, which I reached after about forty minutes: the server itself had diverged in some way I had not accounted for. Staging and production were built from the same base image six months ago. What had happened since then?
This was the right direction.
What env told me
Two SSH sessions, side by side. On staging:
$ ssh staging "env | sort | grep -iE 'api|key|url'"
APP_URL=https://staging.example.com
DATABASE_URL=postgres://...
STRIPE_KEY=sk_test_...
SIMEZU_API_URL=https://simuze.nl
On production:
$ ssh production "env | sort | grep -iE 'api|key|url'"
APP_URL=https://example.com
DATABASE_URL=postgres://...
STRIPE_KEY=sk_live_...
SIMEZU_API_URL=https://app.simezu.co
Different SIMEZU_API_URL. Staging was pointing at the current URL. Production was pointing at a URL that had been decommissioned six weeks earlier.
The API at app.simezu.co had been renamed to simuze.nl. Staging’s .env had been updated at some point by someone; production’s had not. Requests from production to app.simezu.co were hitting an nginx that answered with a 301 to the new domain, and the app’s HTTP client did not follow redirects on POST requests. So the authentication handshake failed silently: what looked like a successful login, followed by a blank page when the first API call failed.
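You could see the redirect straight from the production box. A sketch, with a hypothetical /auth path standing in for the real handshake endpoint:

$ ssh production 'curl -s -o /dev/null -w "%{http_code}\n" -X POST "$SIMEZU_API_URL/auth"'
301

Even curl only follows that with -L, and on a 301 it rewrites the POST into a GET unless you pass --post301, so a client that “follows redirects” would not necessarily have saved the handshake either.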
The handshake failure was swallowed. The app rendered the login screen, saw a cookie, decided you were logged in, and then crashed on the first data fetch. Staging did not hit this because staging had the right URL.
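In hindsight, the fastest route to this answer is a straight diff of the two environments, which bash’s process substitution makes a one-liner. Something like this, assuming the same host aliases:

$ diff <(ssh staging "env | sort") <(ssh production "env | sort")

Expected differences such as APP_URL and STRIPE_KEY show up too, but a stale value stands out quickly once the noise is that small.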
How the divergence happened
I went through the git history. Six weeks ago, in a PR titled “update API endpoints”, someone had updated the .env.example file and staging’s .env. They had commented on the PR that production would need to be updated separately because the old URL was still active for compatibility.
They never did the separate production update.
The commit had been reviewed, merged, and the “update production later” line was buried in a comment on line 14 of a long PR discussion. The comment was correct. The follow-up never happened. Six weeks later, here we were.
The things I changed afterward
The easy fix was to update production’s .env to match. That took thirty seconds. The hard fix was deciding what to change so this did not happen again.
I now pin .env.example to the same format as the real environments, and my deploy script compares the keys (not values) in .env.example to the keys in the target environment before deploying. If a key is missing, the deploy refuses to run:
# keys in .env.example that are absent from the live .env
# (comm needs sorted input; the grep keeps only KEY=value lines)
missing=$(comm -23 \
  <(grep -E '^[A-Za-z_]' .env.example | cut -d= -f1 | sort) \
  <(grep -E '^[A-Za-z_]' /srv/www/example.com/shared/.env | cut -d= -f1 | sort))

if [ -n "$missing" ]; then
  echo "production .env is missing keys present in .env.example:"
  echo "$missing"
  exit 1
fi
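comm -23 suppresses the lines unique to the second file and the lines common to both, so what is left is exactly the keys the live .env lacks. Swap it to -13 and you get the reverse check for free: keys that exist on the server but are missing from .env.example, which usually means the example file has gone stale.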
This catches missing keys. It does not catch stale values, which is the bug that hit me. For stale values there is no generic check; the only defence is a health-check endpoint that actually exercises the paths you care about, against the values in the environment, and fails loudly if they are wrong.
So my health check now makes one actual call to each external dependency, verifying the configured URL, headers, and API key all work end-to-end. Before, it only checked “the app responds with 200”. Now it checks “the app can reach Simezu, Stripe, and the database using the configured credentials, and here are the response codes for each.”
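The endpoint itself lives in the app, but the shape of the check is easy to sketch in shell. The /ping path here is a made-up stand-in for whatever cheap authenticated call your API offers; the Stripe and Postgres calls are real, read-only requests:

# one real, cheap call per dependency, using the configured values
curl -fsS --max-time 2 "$SIMEZU_API_URL/ping" > /dev/null || exit 1
curl -fsS --max-time 2 -u "$STRIPE_KEY:" https://api.stripe.com/v1/balance > /dev/null || exit 1
psql "$DATABASE_URL" -c 'SELECT 1' > /dev/null || exit 1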
The check takes 400ms longer. The deploy refuses to swap the symlink if any dependency fails the check. Production cannot go live against a stale API URL again, because the post-deploy step will catch it before the old release is replaced.
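The gate itself is small. A sketch, assuming a hypothetical /healthz route on the new release and the shared/releases layout implied by the paths above:

# swap the symlink only if the new release passes its own health check
# ($new_release and the port are stand-ins; adjust to your layout)
status=$(curl -s -o /dev/null -w '%{http_code}' "http://127.0.0.1:8080/healthz")
if [ "$status" != "200" ]; then
  echo "health check returned $status; keeping the old release live"
  exit 1
fi
ln -sfn "$new_release" /srv/www/example.com/current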
The lesson I keep relearning
Staging is only useful if the thing that breaks in production would also break in staging. This sounds obvious. It is not a property of staging being “like” production. It is a property of staging being exercised against the same external dependencies, with the same configuration, on code paths that match production usage.
Most staging environments are lies in small ways. Mine was a lie in a specific way: its environment variables had diverged. Your staging might be lying in a different way. But it is probably lying about something.
The only way to know is to find something that is true in production and false in staging, and ask why.