Automated infrastructure configuration is an important ingredient when building a robust application. We’ve used Terraform for several recent projects, and my current project relies on it to maintain a significant and growing AWS deployment.
We use Terraform to manage all of our deployed environments, including around twenty feature environments that developers use to build, test, and coordinate business stakeholder review before merging back to our mainline development branch (our version of Heroku’s review apps for feature branches).
While it works great most of the time, there are times that Terraform gets twisted into digital knots — especially on those feature environments where we tend to play a bit rough with the infrastructure. Here are some situations I’ve encountered and some tools and tips to get through them.
We’ll start on the simpler end of the spectrum. While attempting to run a terraform apply for one of our environments, we saw an error message stating that a resource already exists, something like this:
```
Creating CloudWatch Log Group failed: ResourceAlreadyExistsException:
The specified log group already exists, status code: 400, request id:
89b9f8a6-820e-836a2f65df1c: The CloudWatch Log Group
'/aws/eks/lambda/tf-api-lambda' already exists.
```
From the name, we could tell it was a CloudWatch log group for one of our Lambda functions. We don’t explicitly create those in our Terraform code. AWS does it automatically when you create a Lambda, so we didn’t see an obvious way to destroy it using Terraform. We usually try not to manually change the infrastructure that’s managed by Terraform, but it seemed worth a shot in this situation.
Deleting the log groups from the AWS web console was easy and resolved our problem. I’ve worked through a similar situation with API Gateway Stages that wasn’t as easy. There were other dependencies that had to be traced and deleted before I could delete the Stage itself, but in the end, deleting the conflicting resources let us run the Terraform commands we needed.
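If you prefer the command line to the console, the AWS CLI can do the same cleanup. A sketch using the log group name from our error message (adjust the name for your own environment):

```sh
# Delete the conflicting log group so Terraform (via the Lambda service)
# can recreate it on the next apply.
aws logs delete-log-group --log-group-name /aws/eks/lambda/tf-api-lambda
```

If you do declare log groups explicitly in your configuration, terraform import is the gentler alternative: it adopts the existing resource into state instead of deleting it.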
Error: Provider Configuration Not Present
We removed some infrastructure after changing directions with how we were managing CloudWatch notifications to our team. A few weeks later, we went to update a feature environment and ran into a problem. Terraform complained that the provider configuration was missing for some resources that needed to be destroyed — the same resources that weren’t in the configuration anymore.
It turns out that this can happen after removing a module that had a module-local provider configuration, which we had in our configuration. Per the Terraform docs on Providers Within Modules:
For backward compatibility with configurations targeting Terraform v0.10 and earlier, Terraform does not produce an error for a provider block in a shared module if the module block only uses features available in Terraform v0.10, but that is a legacy usage pattern that is no longer recommended.
To get past the problem, we checked out a version of our configuration that still had the module included and ran a targeted destroy against the offending resources. Running terraform destroy normally tears down the entire configuration, but you can use the -target option to select specific resources when you need to resolve situations like this one. Read the docs on resource targeting for more information.
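The sequence looked roughly like this (the commit reference and module address are placeholders, not our actual names):

```sh
# Go back to a revision that still contains the removed module,
# so its provider configuration is available again.
git checkout <commit-that-still-has-the-module>

# Destroy only the orphaned resources, leaving everything else alone.
terraform destroy -target=module.cloudwatch_notifications

# Return to the current revision and proceed as usual.
git checkout main
terraform apply
```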
After destroying those resources, we returned our Git repository to the current revision and applied our changes.
Timeout Waiting for a Resource to Be Created
Sometimes when creating or destroying resources in AWS using Terraform, timeouts happen. We’ve seen it happen most with resources like CloudFront distributions, Aurora clusters, and sometimes DNS records and certificates. They can take quite a while to create no matter how you do it.
Most of the time, just waiting a few minutes and then re-running the command is enough to resolve the problem. Or, if you want to watch the paint dry, pull up the resource in the AWS web console and watch until it completes. Then re-run your Terraform command.
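For resources that are routinely slow, many AWS provider resources also accept a timeouts block, which tells Terraform how long to wait before giving up. A sketch for an Aurora cluster (the values are illustrative, and the supported arguments vary by resource type):

```hcl
resource "aws_rds_cluster" "main" {
  # ... the rest of the cluster configuration ...

  # Wait longer than the defaults before reporting a timeout.
  timeouts {
    create = "90m"
    delete = "90m"
  }
}
```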
Random Dependency Failure
Hmmm. This worked before… Why is it failing now with a dependency problem?
We’ve noticed a few times that our terraform apply would fail with strange dependency messages, despite having succeeded with the same configuration before. It seems like sometimes there’s a propagation delay in the AWS infrastructure. Terraform knows it finished creating, say, a DNS entry, but when it tries to create a certificate that references it, AWS fails the command saying it doesn’t know about the DNS entry, even though we can see it in the AWS web console.
Most of the time, just re-running the command is enough to get past the problem.
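When the flakiness is predictable, a small retry wrapper can save some babysitting. This is a sketch of the idea, not a Terraform feature; the retry function and RETRY_DELAY variable are our own naming:

```shell
# Retry a command up to N times, pausing between attempts.
# Usage: retry <attempts> <command...>
retry() {
  attempts=$1
  shift
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "retry: giving up after $attempts attempts" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "${RETRY_DELAY:-30}"
  done
}

# Example: retry 3 terraform apply -auto-approve
```

Keep the attempt count low; if an apply fails three times in a row, something other than propagation delay is probably wrong.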
Upgrades, Upgrades, Upgrades — Oops!
Terraform is updated on a fairly regular basis. That’s great (I appreciate progress), but it deserves care, because a mismanaged upgrade can make it difficult to catch back up and apply infrastructure changes. Terraform’s documented recommendation for upgrading to v0.13 mentioned this workflow:
When upgrading between major releases, we always recommend ensuring that you can run terraform plan and see no proposed changes on the previous version first, because otherwise pending changes can add additional unknowns into the upgrade process.
For this upgrade in particular, completing the upgrade will require running terraform apply with Terraform 0.13 after upgrading in order to apply some upgrades to the Terraform state, and we recommend doing that with no other changes pending.
Because we have a large collection of feature review environments (around twenty), it’s easy for one to get left behind after an upgrade. We had at least a few instances where that happened. Here are some recommendations to make it easier on yourself:
- Make it easy to identify what version of code and infrastructure is deployed to each environment so that when you need to, it’s easy to go back and follow the proper upgrade steps without infrastructure changes getting in the way.
- Use tfenv to manage installations of Terraform so that your whole team can stay on the same version and can more easily jump back to older versions when needed.
- When you upgrade Terraform, coordinate applying that upgrade to as many environments as you can.
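With tfenv, pinning and switching versions looks like this (the version number is just an example; use whatever your environments are on):

```sh
# Install and activate a specific Terraform version.
tfenv install 0.13.7
tfenv use 0.13.7

# Better yet, commit a .terraform-version file so tfenv selects
# the right version automatically in each working directory.
echo "0.13.7" > .terraform-version
```

A committed .terraform-version file also makes it obvious, per environment, which Terraform release last touched the state.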
Use Terraform; Avoid the Traps
Terraform is a handy tool, and it helps to know what the rough edges look like — every tool has them. I hope this helps you work through them.