Rolling back is not rolling forward
If your rollback plan is 'deploy the previous commit', you don't have a rollback plan. You have a second deployment that can also fail.
By Marten
Most rollback plans I read assume the deploy pipeline is healthy. That is the entire problem.
A rollback is usually needed because something went wrong. Sometimes that something is in the code you just shipped, and sometimes it is in the pipeline itself. If your rollback plan is “deploy the previous commit,” and the pipeline is the thing that broke, you have two problems and one solution that cannot fix either of them.
A proper rollback does not run the same pipeline. It uses a different, simpler mechanism, which was validated separately and does not share failure modes with the thing that just broke.
The test
There is one test for whether you have a rollback plan. Can you go back to the previous version in under 60 seconds, without running your CI pipeline, without building anything, and without anyone having to remember a specific command?
If the answer involves the word “just” — “just re-run the pipeline”, “just push the old commit”, “just revert and redeploy” — you are describing a second deployment, not a rollback.
I have been on calls where “just rolling back” took 35 minutes: the previous artifact had to be rebuilt from scratch, the build broke on the way, and we ended up debugging the build at 22:00 while the site was still serving the broken version.
What a real rollback looks like
On a server with atomic symlink deploys, rollback is two short commands:
ln -sfn /srv/www/example.com/releases/$(ls -t /srv/www/example.com/releases | sed -n '2p') /srv/www/example.com/current
nginx -s reload
That is two lines. No build. No SSH key rotation. No GitHub webhook. No CI. It does not matter if GitHub is down. It does not matter if the build server is on fire. The previous release’s files are on disk, the symlink moves, nginx reloads, the site is back.
This works because I kept the old files. Most pipelines throw them away after deploy.
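If you would rather not depend on modification times, the same idea can be wrapped in a small function. This is a minimal sketch, not a definitive implementation. It assumes release directories have names that sort chronologically (timestamps, for example), that `mv -T` from GNU coreutils is available, and that the site root follows the `releases/` plus `current` layout above. The function name `rollback_release` is made up for illustration.

```shell
# Sketch: repoint `current` at the release just before the one it uses now.
# Assumption: release directory names sort chronologically (e.g. 20240131T1200).
rollback_release() {
  local site_root="$1"                 # e.g. /srv/www/example.com
  local releases="$site_root/releases"
  local link_target active previous found dir name

  # Where does `current` point right now?
  link_target="$(readlink "$site_root/current")" || {
    echo "no current symlink in $site_root" >&2
    return 1
  }
  active="$(basename "$link_target")"

  # Globs expand in sorted order, so walk upwards until we reach the
  # active release and remember the one just before it.
  previous=""
  found=""
  for dir in "$releases"/*/; do
    [ -d "$dir" ] || continue
    name="$(basename "$dir")"
    if [ "$name" = "$active" ]; then
      found=1
      break
    fi
    previous="$name"
  done

  if [ -z "$found" ] || [ -z "$previous" ]; then
    echo "no release older than $active in $releases" >&2
    return 1
  fi

  # Create the new link under a temporary name, then rename it into place.
  # The rename replaces `current` in one step (mv -T is GNU-specific).
  ln -sfn "$releases/$previous" "$site_root/current.tmp"
  mv -T "$site_root/current.tmp" "$site_root/current"
  echo "$previous"
}
```

Calling `rollback_release /srv/www/example.com` and then `nginx -s reload` covers the same ground as the two lines above. Running it again steps back one more release, and it refuses to do anything once there is nothing older left on disk.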
For a containerized setup the equivalent is one docker run against the previous image tag, if you kept the image around. Which you should.
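A rough sketch of what that could look like, under some loud assumptions: a single container per host, no port mappings or volumes (a real service would add its own `-p` and `-v` flags), and a made-up function name `rollback_container`. The one design choice worth copying is the check that the previous image is already on the host, so the rollback never needs the registry.

```shell
# Sketch: restart a service from an image tag that is already on this host.
# Assumption: container name, image name and tag are supplied by the caller;
# ports, volumes and environment are omitted and would need adding.
rollback_container() {
  local name="$1" image="$2" previous_tag="$3"

  # Refuse to touch the running container unless the old image is local.
  # Pulling it now would make the rollback depend on the registry.
  if ! docker image inspect "$image:$previous_tag" >/dev/null 2>&1; then
    echo "image $image:$previous_tag is not on this host" >&2
    return 1
  fi

  # Remove the broken container (ignore the error if it is already gone),
  # then start the previous version under the same name.
  docker rm -f "$name" >/dev/null 2>&1 || true
  docker run -d --name "$name" "$image:$previous_tag"
}
```

Usage would be something like `rollback_container web example/web v41`, where the names and the tag are placeholders.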
For a PaaS-style deploy the equivalent is whatever button the platform gives you, but — and this matters — the platform must have kept the previous artifact. Heroku does. Vercel does. Self-built “just rebuild from the old commit” does not.
What breaks in real incidents
Rollback plans fail in predictable ways. I have seen all of these in the wild.
The old artifact was garbage collected. Whoever built the pipeline set a seven-day retention on build artifacts. The issue you are rolling back from was introduced three weeks ago, nobody noticed, and now you need version N-5. It does not exist. You can rebuild it from git, which takes eleven minutes on a good day.
The rollback depends on the forward path. Your rollback command goes through the same SSH key, the same CI runner, the same npm registry proxy that is currently 500ing because of a separate incident. You cannot roll back because the tools required to roll back are also broken.
The database went forward. You shipped a migration that dropped a column. Rolling back the code does not put the column back. Now the old code is hitting a schema it does not know.
The deploy was actually three deploys. You ship to web, CDN, and a background worker separately. You “rolled back the web tier” but forgot the worker is still on new code. The worker is the one causing the incident.
The practical version
Here is what I actually do.
One, keep the last five releases on disk. Not one. Five. Disk is cheap, incidents are not.
Two, never gate rollback on CI. Rollback is a separate path that runs without touching the pipeline. When it runs, the pipeline could be on fire and you would not know or care.
Three, for schema changes, two deploys. The first deploy only adds the new thing. The second deploy removes the old thing. Between them, the rollback is safe — you can always step back one deploy without the database disagreeing with the code.
Four, write the rollback command on the wall. In a runbook. In the README. Pinned in the team channel. Not a fifty-word paragraph that begins “here is how to roll back our deployment process”. The literal command, copy-pasteable, three lines maximum.
The point of a rollback is that it runs when you are tired, scared, and not thinking clearly. It has to be short enough that a sleepy person can execute it without typos.
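Point one, keeping the last five releases, is the easiest to automate as a post-deploy step. The following is a sketch under assumptions: release directory names sort chronologically and contain no newlines, the layout is `releases/` plus a `current` symlink, and `prune_releases` is a hypothetical name. It never deletes the release `current` points to, even if that release is one of the oldest because you have already rolled back.

```shell
# Sketch: delete the oldest releases beyond a retention count (default five).
# Assumption: directory names under releases/ sort oldest-first.
prune_releases() {
  local site_root="$1" keep="${2:-5}"
  local releases="$site_root/releases"
  local active total excess name

  active="$(basename "$(readlink "$site_root/current")")"
  total="$(ls -1 "$releases" | wc -l)"
  excess=$((total - keep))

  # Nothing to do while we are at or under the retention count.
  [ "$excess" -gt 0 ] || return 0

  # The first $excess names in sorted order are the oldest releases.
  ls -1 "$releases" | sort | head -n "$excess" | while IFS= read -r name; do
    # Never remove the release that is serving traffic.
    if [ "$name" = "$active" ]; then
      continue
    fi
    rm -rf -- "${releases:?}/$name"
  done
}
```

Run it after a successful deploy, for example `prune_releases /srv/www/example.com 5`. It belongs in the forward path only; the rollback path should never delete anything.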
The one thing people do not want to hear
Rollback is not free. The simpler you make it, the more constraints you have to accept elsewhere. Atomic symlinks mean you cannot reshape your directory structure on a whim. Keeping old images means paying for storage. Two-deploy schema changes mean slower feature delivery.
Teams I have worked with pay these costs or they don’t, and when they don’t, they pay a much larger cost once a year in the form of a bad incident. The math comes out the same.
The only choice you really have is whether you pay in advance or all at once.