Handling an AWS NAT Gateway Implicit Dependency with Terraform

Terraform does a great job of automatically optimizing the creation and modification of resources with obvious dependencies (correctly ordering and chaining operations). However, it cannot always know about other dependencies that are implicit in the infrastructure.

One particularly common example occurs when AWS EC2 instances depend on internet access for provisioning, but that access only becomes available after a corresponding NAT Gateway has been created. Normally, there is no direct dependency between a NAT Gateway and an EC2 instance.

Fortunately, it is possible to explicitly define an EC2 instance’s dependency on a NAT Gateway with the depends_on meta-parameter. This allows Terraform to properly order and chain the operations.

Background

If you’re not familiar with it, Terraform is a tool for managing infrastructure resources as code. Terraform configuration files define the intended configuration, or state, of a series of infrastructure resources.

Running Terraform against such an infrastructure will produce a change set. This defines the changes that need to occur to bring the existing infrastructure resources into harmony with the defined state.

Terraform providers exist for many public cloud infrastructures (such as AWS and Azure), cloud services (such as DNSimple and DNSMadeEasy), and other software products (such as VMware vSphere and OpenStack). In particular, Atomic Object frequently uses Terraform to manage AWS deployments.

The Problem

In a standard AWS VPC configuration, a “public” subnet has direct internet access, while a “private” subnet does not. Devices in the public subnet access the internet directly via an Internet Gateway (IGW), while devices in the private subnet must access the internet using network address translation (NAT) via a NAT Gateway device that resides in the public subnet. The NAT Gateway has a publicly routable IP address, but it also has a private IP address which can be reached by the private subnet.

In Terraform, all of the above can be readily created and modified.


resource "aws_vpc" "default" {
  cidr_block = "10.1.0.0/16"
  [...]
}

resource "aws_internet_gateway" "default" {
  vpc_id = "${aws_vpc.default.id}"
  [...]
}

resource "aws_nat_gateway" "nat" {
  subnet_id     = "${aws_subnet.public.id}"
  [...]
}

resource "aws_subnet" "public" {
  vpc_id            = "${aws_vpc.default.id}"
  cidr_block        = "${var.public_subnet_cidr}"
  [...]
}

resource "aws_subnet" "private" {
  vpc_id            = "${aws_vpc.default.id}"
  cidr_block        = "${var.private_subnet_cidr}"
  [...]
}

resource "aws_instance" "public" {
  ami               = "${var.ami}"
  instance_type     = "t2.micro"
  user_data         = "ping 8.8.8.8"
  subnet_id         = "${aws_subnet.public.id}"
  [...]
}

resource "aws_instance" "private" {
  ami               = "${var.ami}"
  instance_type     = "t2.micro"
  user_data         = "ping 8.8.8.8"
  subnet_id         = "${aws_subnet.private.id}"
  [...]
}

[Internet Gateway and routing tables omitted for brevity.]
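To make the path from the private subnet to the NAT Gateway concrete, the routing might look roughly like the sketch below. This is not the omitted original configuration: the resource names aws_route_table.private and aws_route_table_association.private are hypothetical, chosen only for illustration.

```hcl
# Hypothetical sketch; these resource names are illustrative and do not
# come from the original configuration.

# Route table for the private subnet: send all outbound traffic
# (0.0.0.0/0) through the NAT Gateway.
resource "aws_route_table" "private" {
  vpc_id = "${aws_vpc.default.id}"

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = "${aws_nat_gateway.nat.id}"
  }
}

# Attach the route table to the private subnet.
resource "aws_route_table_association" "private" {
  subnet_id      = "${aws_subnet.private.id}"
  route_table_id = "${aws_route_table.private.id}"
}
```

Note that even with routing like this in place, the private EC2 instance references only its subnet, so nothing in Terraform's dependency graph ties it to the NAT Gateway.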

In the Terraform configuration code above, the NAT Gateway depends explicitly on a subnet (which in turn depends on the VPC). However, the EC2 instances also depend only on a subnet (and therefore the VPC), not on the NAT Gateway.

For EC2 instances in the private subnet, the problem becomes:

  • NAT Gateways provision rather slowly (~ 2 to 3 minutes).
  • EC2 instances can provision more quickly.
  • The EC2 instance may depend on internet access to boot properly.
  • The internet will not be fully available to the EC2 instance until after the NAT Gateway has finished provisioning.

In short, the EC2 instance in the private subnet has an implicit dependency on the NAT Gateway, even though there is no explicit relation between the two resources in the Terraform code. If any part of the EC2 instance’s provisioning (such as initial package installation from a repository) relies on the internet, it may hang because internet access is not yet available.
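As an illustration of the kind of boot-time work that runs into this, consider a hypothetical user_data script that installs a package on first boot. The package name and the assumption of a yum-based AMI are illustrative only and are not part of the original configuration.

```hcl
resource "aws_instance" "private" {
  ami           = "${var.ami}"
  instance_type = "t2.micro"
  subnet_id     = "${aws_subnet.private.id}"

  # Hypothetical bootstrap script (assumes a yum-based AMI). The install
  # step needs outbound internet access, which in the private subnet only
  # exists once the NAT Gateway has finished provisioning.
  user_data = <<EOF
#!/bin/bash
yum install -y nginx
EOF
}
```

If the instance boots before the NAT Gateway is ready, a script like this would be expected to stall or fail at the install step.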

The Solution

The solution is to ensure that the NAT Gateways have been fully provisioned before dependent EC2 instances are provisioned.

Fortunately, Terraform provides a means for us to define additional dependencies via the depends_on parameter to resources. The depends_on parameter guarantees that the given resource will be created after what it depends on.

For example, if the depends_on parameter is used with the EC2 instance in the private subnet to “depend on” the NAT Gateway, the EC2 instance will be created after the NAT Gateway:


resource "aws_instance" "private" {
  ami               = "${var.ami}"
  instance_type     = "t2.micro"
  user_data         = "ping 8.8.8.8"
  subnet_id         = "${aws_subnet.private.id}"

  depends_on        = ["aws_nat_gateway.nat"]
  [...]
}

Rather unfortunately though, depends_on can only take an exact resource name as its argument. It cannot take an interpolated resource name. This causes problems when the EC2 instance that needs to depend on a NAT Gateway lives in a different Terraform module from the gateway:


# Main module

module "network" {
  source = "modules/network"
  [...]
}
module "ec2_instances" {
  source = "modules/ec2_instances"

  aws_nat_gateway_id = "${module.network.aws_nat_gateway_id}"
  [...]
}

# Network module
[...]
resource "aws_nat_gateway" "nat" {
  subnet_id     = "${aws_subnet.public.id}"
  [...]
}

output "aws_nat_gateway_id" {
  value = "${aws_nat_gateway.nat.id}"
}

# EC2 module
[...]
resource "aws_instance" "private" {
  ami               = "${var.ami}"
  instance_type     = "t2.micro"
  user_data         = "ping 8.8.8.8"
  subnet_id         = "${aws_subnet.private.id}"

  ### THIS WILL THROW AN ERROR ###
  depends_on        = ["${var.aws_nat_gateway_id}"]
  [...]
}

variable "aws_nat_gateway_id" {
  description = "NAT Gateway to depend on"
}

A Better Solution

This problem with depends_on not supporting interpolated resources can be solved by using a null_resource to add a layer of indirection.

Rather than depending on the NAT Gateway itself, the EC2 instance incorporates a reference to the null_resource, which itself depends on the NAT Gateway:


# Main module

module "network" {
  source = "modules/network"
  [...]
}
module "ec2_instances" {
  source = "modules/ec2_instances"

  depends_id = "${module.network.depends_id}"
  [...]
}

# Network module
[...]
resource "aws_nat_gateway" "nat" {
  subnet_id     = "${aws_subnet.public.id}"
  [...]
}

resource "null_resource" "dummy_dependency" {
  depends_on = ["aws_nat_gateway.nat"]
}

output "depends_id" {
  value = "${null_resource.dummy_dependency.id}"
}

# EC2 module
[...]
resource "aws_instance" "private" {
  ami               = "${var.ami}"
  instance_type     = "t2.micro"
  user_data         = "ping 8.8.8.8"
  subnet_id         = "${aws_subnet.private.id}"

  tags {
    Name          = "Private Instance"
    depends_id    = "${var.depends_id}"
  }
  [...]
}

variable "depends_id" {
  description = "Resource to depend on"
}

In this code, the null_resource named “dummy_dependency” explicitly depends on the NAT Gateway. This guarantees that the dummy_dependency won’t be created (and hence its output won’t be generated) until after the NAT Gateway has finished provisioning.

Furthermore, the EC2 instance depends on the dummy_dependency by referencing its id via the depends_id input variable in its tags section. This reference causes Terraform to automatically order the EC2 instance’s provisioning after the “creation” of the dummy_dependency.

Once the dummy_dependency is created, its id becomes available through the depends_id output, and the EC2 instance can be created!

One may ask, “Why couldn’t the NAT Gateway id simply be interpolated into the EC2 instance tags?” It could! But this would not provide the guarantee that the NAT Gateway would be fully provisioned first.

Using depends_on guarantees that the dummy_dependency will not be created (and hence no id provided) until after the NAT Gateway has been fully provisioned.
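For readers on newer Terraform releases, a different and simpler technique may be available. As a sketch, assuming Terraform 0.13 or later (where depends_on is accepted on module blocks and takes references rather than quoted strings), the dependency could be declared directly in the main module. This swaps in module-level depends_on for the null_resource indirection described above; the module names are reused from the earlier examples, and the "./" prefix on source reflects what newer releases expect for local paths.

```hcl
# Sketch only; assumes Terraform 0.13 or later.

module "network" {
  source = "./modules/network"
  # other arguments omitted
}

module "ec2_instances" {
  source = "./modules/ec2_instances"
  # other arguments omitted

  # Every resource in this module waits for everything in the network
  # module, including the NAT Gateway.
  depends_on = [module.network]
}
```

This is coarser than the null_resource approach, since the whole module waits on the network module rather than a single instance waiting on a single gateway.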