2 Ways to Catch Software Deployment Problems Early

Have you ever introduced a subtle bug that lurked undetected until your next deployment? I have! Here, I’ll describe a couple of different types of these bugs and early-warning systems that can catch them.

Feedback Loops

First, though, let’s talk about feedback loops. Here are a handful of development processes a software team performs, with some example frequencies:

  1. Edit text (continuously).
  2. Compile/test code (every few minutes).
  3. Commit and push to source control (a few times a day).
  4. Create a pull request (a few times a week).
  5. Release code to users (every two weeks).

Various feedback loops apply at each level, providing ways to catch different sets of problems:

  • When you make a typo or violate a lint rule, that’s probably caught by your editor (#1).
  • If a logic error slips into the new feature you’re building, that may be caught by the unit tests you run locally (#2).
  • When your change inadvertently breaks something seemingly unrelated and way over on the other end of the codebase, that’s caught by the full test suite in CI, hopefully (#3).

What if your change breaks part of the deployment process? When will that first get caught?

The answer can vary widely across different project circumstances, but on my current project, it’s “after the pull request lands” (call it step 4.5).

It’s not the end of the world, but it’s a very long feedback loop. When a deployment failure occurs, it may be due to code written many days or commits ago. And the delta to analyze — the diff between the last good deployment and the current broken one — may be large.

Read on for two different types of issues we’ve seen pop up at deployment time and measures we took to catch them earlier.

Problem: Production-mode Build Failure

Production builds often differ from development builds. Differences can include:

  • Compiler options, optimization levels, security configuration, etc.
  • Different dependencies installed/included (for a Node.js example, npm install --production excludes devDependencies)
  • Conditional presence of test/development features (test endpoints, mock integrations, etc.)
  • Producing a special artifact for deployment that’s different from what’s used in local development.

Each of these represents a path that might not get exercised very often.

For a practical example, my current project uses a Docker-based deployment. The deployment pipeline performs a docker build, then pushes the resulting Docker image to a remote registry, where the cloud host can access it.

Though we deploy with Docker, we don’t use it day in and day out for development and testing (those run in native Node.js processes). So if you inadvertently break the Docker build (say, by forgetting to add an environment variable to the Dockerfile), you might not notice until the next time the application tries to deploy.

Mitigation: Frequent Production Builds

This may seem obvious, but it’s easy to miss: perform the production build earlier and more frequently than the production deployment!

We added a CI job to perform Docker builds on every commit (not just those destined to be deployed). Since then, we’ve caught several problems of this type.
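For reference, here’s a minimal sketch of what a job like that could look like, assuming GitHub Actions (the workflow name and image tag are made up; adapt it to whatever CI system you use):

    # Hypothetical CI workflow: build the production Docker image on every push,
    # without pushing it to any registry.
    name: docker-build-check
    on: push

    jobs:
      docker-build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Build production image
            run: docker build --tag app:ci-check .

The point isn’t the image itself (it gets thrown away); it’s exercising the production build path on every commit.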

Problem: Infrastructure-as-code Provisioning Failure

Infrastructure-as-code is fantastic. It takes what can be a manual, error-prone, immediately forgotten process and moves it into the deterministic, versioned, repeatable domain of code.

But that code — the code that deploys your servers and configures your databases — might not run (or build) all that often. My team found at least a couple of ways to inadvertently break ours.

As one example, our TypeScript CDK project shares some code with the rest of our app. But the CDK project was only getting built at deployment time, so errors introduced by changes to the shared code could sneak in and break the provisioning job.

A second example relates to a cool benefit of using a general-purpose language to define resources: you can add extra validation logic. For certain types of errors that manifest at runtime in the deployed app (e.g., a missing environment variable), you can detect them at deployment time with a bit of logic (or maybe even at build-time, in a typed language!).

We have some validation like this, which prevented an error from making it to the deployed app (yay!). But the failure happened at deploy time after the changes had already landed on the shared dev branch (boo!).
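For illustration, here’s a minimal sketch of what such a check might look like in TypeScript (the names containerEnv and requiredVars are hypothetical, not our actual code):

    // Hypothetical check that runs while the infrastructure code executes.
    // The environment object and variable names are illustrative.
    const containerEnv: Record<string, string | undefined> = {
      DATABASE_URL: process.env.DATABASE_URL,
      // API_BASE_URL accidentally left out...
    };

    const requiredVars = ['DATABASE_URL', 'API_BASE_URL'];
    const missing = requiredVars.filter((name) => !containerEnv[name]);

    if (missing.length > 0) {
      // Failing here surfaces the problem at deploy (or build) time,
      // rather than as a runtime error in the deployed app.
      throw new Error(`Missing environment variables: ${missing.join(', ')}`);
    }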

Mitigation: Provisioning Dry Run

Though you might not want to provision resources on every single commit, you may be able to execute part of the provisioning process frequently.

Going back to my project’s example: with the TypeScript CDK, the provisioning process goes something like this:

  1. Compile TypeScript sources into JavaScript.
  2. Execute compiled JavaScript, producing a CloudFormation template.
  3. Provision CloudFormation stack from template.

Invoking cdk synth covers steps 1 and 2, functioning as a dry run without actually sending anything to AWS. We added this to a CI job that executes on every commit, and it has caught several problems!
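Here’s a minimal sketch of such a CI job, again assuming GitHub Actions (the workflow and job names are made up, and your install step may differ):

    # Hypothetical CI workflow: synthesize the CloudFormation template on every
    # push, without deploying anything to AWS.
    name: cdk-synth-check
    on: push

    jobs:
      cdk-synth:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: npm ci
          - name: CDK dry run
            run: npx cdk synth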

You might be able to do something similar with terraform plan, or presumably with features provided by other infrastructure-as-code tools.

Limits and Tradeoffs

These new checks aren’t perfect. Our deployment process still runs infrequently, and there are still ways deployment can break that these mitigating techniques can’t catch.

Are they worth doing? There are a bunch of factors to weigh:

  • Frequency of catchable deployment issues (have you been bitten? more than once?)
  • Cost of a failed deploy
  • Development cost to implement mitigation
  • Added time to CI pipeline

But perhaps the biggest factor is deployment frequency. Does your team regularly deploy feature branches while they’re in development (say, to Review Apps or to temporary sandbox environments)? Then you likely already have a good early warning system in place. If not, and deployment happens after the developer has moved on to another task, then some new checks may make sense.

How frequently does your team deploy? Do you have other infrequently-needed processes that you’ve worked to exercise more often?