Three years of pull-based deploys

I have been deploying pull-based for three years. Before that I did what everyone else did: CI built the artifact, CI had SSH credentials to production, CI pushed the artifact to the server, CI called a reload script. The “credentials in CI” part bothered me quietly for a long time and I eventually flipped the model.

Pull-based means the server has a credential to fetch from the artifact source (git repo, container registry, object storage). It fetches when told to. The CI system never touches the production server. The trust direction is reversed: production trusts a signed request from a webhook or a manual command, and production authenticates itself to the artifact source, not the other way around.

This sounded small when I read about it. In practice it has been one of the better operational decisions I’ve made.

What I actually changed

In the old setup, GitHub Actions had a secret called PROD_SSH_KEY: an SSH private key whose public half was listed in authorized_keys on the production server. When CI finished a build, it rsync’d the artifact to production using that key. The key was scoped in theory (command="/usr/local/bin/receive-deploy"), but a key with write access to production, sitting in a third-party system’s secret store, always felt like a design I was apologising for.

In the new setup, the production server has a deploy key (SSH, read-only) attached to the repo. No credential that grants access to production is stored anywhere else. A webhook fires on push to main. The server’s webhook handler verifies the webhook signature, fetches, builds, and deploys. CI is not involved.
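As a rough sketch of the signature check, assuming GitHub's X-Hub-Signature-256 header (an HMAC-SHA256 of the raw request body, prefixed with "sha256="): the function name and the way the secret is passed in are illustrative assumptions, not a description of any particular handler.

```python
import hashlib
import hmac


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Check a GitHub-style X-Hub-Signature-256 header against the raw body.

    `secret` is the shared webhook secret; how it is loaded (file, env var)
    is left out here and is an assumption of this sketch.
    """
    if not header.startswith("sha256="):
        return False
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    expected = ("sha256=" + digest).encode("ascii")
    # compare_digest takes time independent of where the inputs differ,
    # so the check does not leak how much of the signature matched.
    return hmac.compare_digest(expected, header.encode("utf-8"))
```

The body has to be the exact bytes received, before any JSON parsing, or the digest will not match.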

The artifact is built on production instead of in CI. For small apps this is fine. For anything that needs cross-compilation or a big toolchain, I build in CI, upload to a registry, and have production pull from the registry (with a read-only token).

Either way: production pulls, production does not get pushed to.

What I thought would be hard and wasn’t

I thought losing CI-as-the-deploy-actor would mean losing the audit trail of “who deployed what when.” It did not. The webhook handler writes its own audit log: the SHA, the actor who pushed, the timestamp, the outcome. It is cleaner than CI logs were, because it lives next to the deployment logs and uses the same correlation IDs.
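One possible shape for such an audit entry is a single JSON line per deploy. The field names, the correlation_id parameter, and the append-only file are assumptions made for this sketch; only the four pieces of information (SHA, actor, timestamp, outcome) come from the description above.

```python
import json
from datetime import datetime, timezone


def audit_record(sha, actor, outcome, correlation_id, now=None):
    """Build one JSON line describing a deploy attempt (field names are hypothetical)."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "sha": sha,
            "actor": actor,
            "outcome": outcome,
            "correlation_id": correlation_id,
        },
        sort_keys=True,
    )


def append_audit(path, line):
    """Append the record to a log file, one entry per line."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
```

Writing one self-contained line per deploy keeps the log greppable by SHA or correlation ID alongside the deployment logs.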

I thought I would miss CI’s orchestration features. I did not. What I had been using CI for was essentially “run some commands in order and fail if any of them fail,” which set -euo pipefail in a shell script gives me for free. CI was doing a lot of lifting that I did not need.
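The same "run in order, stop at the first failure" behaviour is easy to get outside of a shell too. The sketch below is an illustration in Python rather than the shell script mentioned above, and it only covers the stop-on-failure part; set -euo pipefail also catches unset variables and failures inside pipelines. The function name is hypothetical.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order.

    Returns None if every step exits with status 0, otherwise the index of
    the first failing step. Later steps are not run after a failure.
    """
    for index, argv in enumerate(steps):
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(f"step {index} failed: {argv}", file=sys.stderr)
            return index
    return None
```

A caller would pass a list of argument lists, for example a fetch, a build, and a reload command, and treat any non-None return value as a failed deploy.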

I thought the build-on-production model would be slow. On my small apps, the build takes 15 seconds and I do not notice. On apps that need a bigger build, I still do that in CI and pull the result — but the critical credential (write access to production) does not live in CI anymore.

What was actually hard

Rate limits. During a debug cycle where I was deploying every few minutes, GitHub once flagged the git pulls from my production IPs as abusive. I now pull from a cached CDN mirror when I am iterating quickly.

Webhook reliability. GitHub retries webhooks, but they can be delayed during incidents. Once, during a GitHub incident, a deploy I triggered took 40 minutes to actually arrive at production because the webhook was queued. I added a manual fallback: any developer can SSH in and run deploy-fetch to trigger the same flow manually, skipping the webhook.

Worker state. The webhook handler is a small HTTP service running on the server. If it crashes or hangs, deploys just silently don’t happen. I have it behind systemd with Restart=always, and a watchdog script that sends me a notification if no successful deploy has happened in 24 hours. The latter is the one that caught two outages I would have otherwise noticed much later.
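The core of a watchdog like that is a single comparison. This is a minimal sketch: how the time of the last successful deploy is recorded and how the notification is sent are both left out, and the names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the text: alert after 24 hours without a successful deploy.
MAX_SILENCE = timedelta(hours=24)


def deploy_is_stale(last_success, now, max_silence=MAX_SILENCE):
    """Return True when no successful deploy falls within the allowed window.

    `last_success` is the time of the most recent successful deploy, or None
    if there has never been one (treated as stale).
    """
    if last_success is None:
        return True
    return now - last_success > max_silence
```

Run from a timer, with `now` set to the current UTC time, a True result is the trigger for the notification. Note that this fires on quiet days as well as on a dead handler, which is the trade-off of watching for successes rather than for failures.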

The incident I had

One hour, two years ago. The webhook handler had a bug where, on receiving a push to a branch it did not recognise, it would start an infinite retry loop against the repo. This did not cause an outage of the app, but it caused GitHub to rate-limit my deploy key, which blocked the next legitimate deploy.

I found it by SSHing in and looking at the process list. Fixed the handler in fifteen minutes. The rate limit cleared in an hour. Total impact: one delayed deploy.
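The text does not say what the fix looked like. One common guard against this failure mode, sketched here under that caveat, is to ignore pushes to refs outside an allowlist and to cap retries with backoff. The constants, function names, and the `fetch` callable are all assumptions of the example; the "refs/heads/main" form matches the `ref` field of a GitHub push event.

```python
import time

# Hypothetical allowlist: only pushes to main trigger a deploy.
DEPLOY_REFS = {"refs/heads/main"}
MAX_ATTEMPTS = 3


def should_deploy(ref):
    """Ignore pushes to any ref that is not explicitly allowlisted."""
    return ref in DEPLOY_REFS


def fetch_with_retries(fetch, max_attempts=MAX_ATTEMPTS, base_delay=1.0, sleep=time.sleep):
    """Call `fetch` until it returns True, at most `max_attempts` times.

    Waits base_delay, then 2x, then 4x... between attempts, and gives up
    instead of looping forever. `sleep` is injectable so it can be tested.
    """
    for attempt in range(1, max_attempts + 1):
        if fetch():
            return True
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return False
```

With a hard cap on attempts, an unexpected branch or a broken remote costs a handful of requests rather than enough traffic to trip a rate limit.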

Compared to the kinds of incidents I used to have with push-based deploys (the leaked key incident in 2021, the CI secret rotation that took down deploys for half a day because nobody remembered where the key was stored, the rebuild that failed because the CI runner’s OS had been upgraded and broke the toolchain), this is a small problem in a manageable category of problems.

The zero I am proud of

No credential leaks. Zero. Three years.

This is partly luck, but mostly it is structural: no credential with write access to production lives in a system I do not operate. Those credentials stay on the production server itself, in a location I control, behind my own auth. The attack surface is smaller. There is no third-party breach scenario where “our CI provider was compromised” means “our production was compromised.”

I am not claiming pull-based is more secure in some abstract sense. I am saying it removes a specific class of “credential lives somewhere surprising” problem, and for a small team that class is actually the likely one.

What I would not do this way

If I had fifty developers deploying thirty times a day to dozens of services, I would probably need something more structured. A real deploy control plane, artifact provenance, permissioned deploy actions per-service, the works.

I do not have that problem. Most people reading this do not have that problem. For small teams and personal projects, the pull-based pattern is a cleaner default, and I would reach for it again.

The shift was a weekend of work. The benefit has been “this just does not come up as something to worry about.” That is a rare state for infrastructure decisions to reach.