Your deploy script is a liability
The bespoke deploy.sh sitting in your repo is the largest undocumented dependency the project has. Rewriting it once a year is cheaper than debugging it once.
By Marten
Open deploy.sh in any project older than a year. Read it.
Now tell me: what does line 47 do, and when was the last time the person who wrote that line worked on this project?
Most deploy scripts I inherit have three kinds of lines in them. The lines that are still doing the thing they were meant to do. The lines that used to matter and now just run out of habit. And the lines that are actively wrong but haven’t caused an outage yet, so nobody has touched them.
The script runs. Nobody reads it.
What makes it a liability
A production dependency that nobody understands is a liability. We accept that for database schemas, and go to great lengths to version, migrate, and review them. We do not extend the same care to deploy logic. The deploy script is treated like a shell utility that happened to end up in the repo, not like production code.
Which is strange, because it is often the only piece of code in the project that runs with full write access to your servers.
The failure mode is specific. Someone upgrades the server’s bash version, or changes the Node install path, or migrates from CentOS to Ubuntu. The script still “works” in the sense that it exits zero. But one of the silent steps inside it now does nothing, or does the wrong thing. You find out a week later when a customer notices that the uploaded file they saw for five minutes is gone.
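To make that failure mode concrete, here is a minimal sketch in Python. The function name and every path in it are made up for illustration. It shows a step that is meant to carry uploads into the new release: once the uploads have moved, the step copies nothing and still reports no error.

```python
import glob
import shutil
import tempfile
from pathlib import Path


def copy_uploads(src_dir: Path, dest_dir: Path) -> int:
    """Copy every file in src_dir into dest_dir; return how many were copied."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    # A glob over a directory that no longer exists matches nothing
    # and raises nothing, so the loop body simply never runs.
    for name in glob.glob(str(src_dir / "*")):
        if Path(name).is_file():
            shutil.copy(name, dest_dir)
            copied += 1
    return copied


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: after a migration the uploads live here...
    new_home = root / "srv" / "uploads"
    new_home.mkdir(parents=True)
    (new_home / "report.pdf").write_bytes(b"customer data")
    # ...but the deploy step still points at the old location.
    stale = root / "var" / "www" / "uploads"
    copied = copy_uploads(stale, root / "release" / "uploads")
    print(f"copied {copied} file(s), no error raised")
```

Nothing in the step's own result distinguishes "there were no uploads" from "I looked in the wrong place". Only someone reading the step against the current server layout would notice.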
I have seen this exact sequence three times.
The rewrite rule
I have a rule that applies to my own projects: if I have not read the deploy script top to bottom in the last twelve months, I rewrite it. Not refactor. Rewrite.
The rewrite is not about making it better. It is about forcing myself to understand what it actually does right now, on the current version of the OS, against the current state of the application. Most of the time the rewrite is shorter than the original, because half of what the original did is no longer needed.
This sounds wasteful. It is the opposite of wasteful. The time cost is three hours once a year. The time cost of debugging a deploy script that nobody understands, at 22:00 on a Thursday, while the product is down, is much larger than three hours and it happens at the worst possible moment.
The parts worth keeping
When I rewrite, I keep three things from the old script: the environment variable names (because changing them is a breaking change for anyone with a runbook), the exit codes (for the same reason), and any step that is guarded by a “this exists because of incident #NNN” comment.
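One way to make that contract hard to break by accident is to write it down in one place before rewriting anything else. The sketch below is an assumption about how that could look in Python. The variable names and exit code values are hypothetical placeholders for whatever the old script actually used.

```python
import os
from enum import IntEnum


class ExitCode(IntEnum):
    """Hypothetical exit codes, carried over unchanged from the old script."""
    OK = 0
    BAD_CONFIG = 2
    TRANSFER_FAILED = 3


# Hypothetical names. The point is that the rewrite reuses the old ones,
# so existing runbooks keep working.
REQUIRED_ENV = ("DEPLOY_HOST", "DEPLOY_PATH")


def load_config(env=os.environ):
    """Return the deploy settings, or None if a required variable is missing or empty."""
    missing = [name for name in REQUIRED_ENV if not env.get(name)]
    if missing:
        return None
    return {name: env[name] for name in REQUIRED_ENV}
```

A caller would return `ExitCode.BAD_CONFIG` when `load_config()` comes back as `None`, which keeps the old script's meaning for that exit status.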
Everything else is negotiable. The order of steps. The tool used for file transfer. Whether to call it from cron or from a webhook. The logging format. All of those were decisions made in the past, and “it works” is not a strong enough reason to carry them into the next year.
The scariest deploy scripts I’ve seen are the ones that grew organically over five years. Someone added a line to fix an issue. Someone else added a retry loop around that line because it flaked once. Someone else caught an edge case and wrapped the retry loop in an if that nobody remembers the meaning of. Each change was defensible in isolation. The sum is unreadable.
What I do instead
For the last two years, in my own projects, I have stopped treating deploy logic as a script at all. It lives in a named function, in the same language as the app, run by the same test runner. Not because testing shell is impossible (it is not), but because a shell script has no natural place to put a test. It lives in the corner of the repo where nothing else lives.
A function in a file, called from a CLI entry point, imported by a test — that is code. And code gets read.
The shell scripts I still have are small. They call the real deploy logic and do nothing else. When those break, I can read them in fifteen seconds and see why.
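A minimal sketch of that shape, assuming a Python app. The steps themselves (an rsync upload followed by a service restart), the service name `app`, and the `build/` directory are placeholders, not a recommendation. What matters is that the command runner is passed in, so a test can call `deploy` without touching a server.

```python
import argparse
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def deploy(host: str, release_dir: str, run: Runner = run_command) -> int:
    """Upload the build, then restart the service. Stops at the first failing step."""
    steps = [
        ["rsync", "-a", "--delete", "build/", f"{host}:{release_dir}/"],
        ["ssh", host, "systemctl", "restart", "app"],  # hypothetical service name
    ]
    for cmd in steps:
        status = run(cmd)
        if status != 0:
            return status
    return 0


def main(argv: Sequence[str], run: Runner = run_command) -> int:
    """CLI entry point: parse arguments and return the deploy's exit status."""
    parser = argparse.ArgumentParser(description="Deploy the current build.")
    parser.add_argument("host")
    parser.add_argument("release_dir")
    args = parser.parse_args(argv)
    return deploy(args.host, args.release_dir, run=run)


# In a real file the entry point would be:
#   if __name__ == "__main__":
#       raise SystemExit(main(sys.argv[1:]))


def test_deploy_stops_after_failed_upload():
    calls = []

    def fake_run(cmd):
        calls.append(list(cmd))
        return 23  # pretend the upload step failed

    assert deploy("example.test", "/srv/app", run=fake_run) == 23
    assert len(calls) == 1  # the restart step never ran
```

The remaining shell wrapper then shrinks to a single line that invokes this entry point with the host and release directory.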
This is not a universal prescription. If your deploy is three rsync lines, a shell script is correct. But I have seen three-rsync-line deploy scripts grow to four hundred lines without anyone noticing, because nobody was ever forced to look.
The forcing function is the point.