Mike Ensor


Trouble with eventual consistency, Terraform and Google Cloud

Over the last few days I’ve been building out a multi-tiered mock enterprise reference solution based on my speaking engagements and several projects I have built over the last 3–4 years. In my pursuit of this ambitious project, I ran into a problem when building a Terraform module. All of the resources I have built are synchronous and blocking when using Terraform to provision them, with the exception of google_project_service and google_project_services.

Google requires users to enable specific APIs to manage resource lifecycle in addition to standard IAM. These APIs need to be enabled to create resources like google_container_cluster instances over the Google Cloud API. Several of the APIs require extra steps, including agreeing to terms and conditions or creating credential sets. These extra steps are often not covered by the Terraform API, and therefore require creative workarounds or adding manual steps to what should be a 100% code solution.

I strive to make all of my (and my clients’) projects 100% code-based solutions. With the introduction of API-driven cloud-based or modern virtualized infrastructure, coupled with Infrastructure-as-Code abstraction tools like Terraform, CloudFormation, Build Manager and Pulumi, architectures should strive to be 100% code based. The benefits are numerous and I do not intend to go into them here, so here are a few blogs if you need to be convinced: https://readmedium.com/infrastructure-as-code-but-why-ab13951fb8d4 , http://techtowntraining.com/resources/blog/infrastructure-as-code-benefits (sorry, this has a big full-page ad) and https://youtu.be/eiHgx-pyg1U (a shameless plug of me talking about the ‘no-ops culture’ and designing architectures using Infrastructure-as-Code).

The problem I ran into comes down to a code side effect, asynchronous actions, or eventual consistency (likely the latter). Google API enablement needs time to propagate across their platform. My plan is to dynamically build a google_project and then create a GKE (Kubernetes) instance inside the newly created project using a Terraform module I had previously created.

## main.tf
resource "google_project" "project" {
  name            = "testing-project"
  project_id      = "project-${local.project_suffix}"
  billing_account = "${data.google_billing_account.client_billing_acct.id}"
  org_id          = "${var.org_id}"
  labels          = { "example" = "true" }
}
module "gke" {
  source                = "../modules/gke"
  project_id            = "${google_project.project.project_id}"
  region                = "${var.region}"
  location              = "${var.location}"
  region_zone           = "${var.region_zone}"
  credentials_file_path = "${var.credentials_file_path}"
  cluster_prefix        = "cluster-test"
  max-node-count        = 5
}

The above code fails because container.googleapis.com is not enabled on the new project. The fix should be simple: just add the code required to enable the API.

...
... # still in main.tf
resource "google_project_service" "gke-api" {
  project = "${google_project.project.project_id}"
  service = "container.googleapis.com"
  disable_dependent_services = true
  disable_on_destroy         = false
}

All good, right? No…FAIL

The Terraform configuration validates and runs, but when the module attempts to create a GKE cluster using google_container_cluster, the build fails with a 403 Forbidden error, despite the API enablement code succeeding. If I re-run the build only seconds later, it succeeds. This violates one of my strong beliefs in the principles of Continuous Delivery: every build should be predictable.

The problem is that the API enablement needs to propagate across Google’s platform. The underlying Google API used by Terraform returns “success”, but the enablement is not yet consistent across the platform. With normal resources, the problem would be solved using the depends_on keyword, which gives Terraform information about how resources are connected when the syntax is not clear or the associations are not explicit. However, this is not a feature in v0.12 of Terraform for module resources (see https://github.com/hashicorp/terraform/issues/10462 to track the bug, er “feature”).
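For reference, an explicit dependency between two ordinary resources looks roughly like the sketch below. It is written in Terraform 0.12 syntax and assumes the cluster sits in the same configuration as google_project.project and google_project_service.gke-api from above; the cluster name and node count are placeholders for illustration only.

```hcl
# Sketch only, Terraform 0.12 syntax. Assumes this resource sits in the same
# configuration as google_project.project and google_project_service.gke-api.
resource "google_container_cluster" "example" {
  name               = "example-cluster" # placeholder name
  location           = var.location
  project            = google_project.project.project_id
  initial_node_count = 1 # placeholder size

  # Explicit dependency: do not start creating the cluster until the
  # API enablement resource has been applied.
  depends_on = [google_project_service.gke-api]
}
```

Keep in mind that depends_on only controls ordering. It does not make Terraform wait for the enablement to finish propagating.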

I tried several different solutions, including external data providers and null_resources, in order to create some sort of waiting time that allows the propagation to take place. I ended up using a four-step plan that can be used for just about any Terraform script that ends up with eventually consistent resources despite the API result being “complete”.

My solution

  1. Set up the API (or eventually consistent resource) inside of the module (this is key)
  2. Create a null_resource instance that contains a simple script to sleep (NOTE: this solution only works on Linux-based systems).
# module/gke/main.tf
...
resource "null_resource" "resource-to-wait-on" {
  provisioner "local-exec" {
    command = "sleep ${local.wait-time}"
  }
}
...

3. Have the null_resource depend on the google_project_service resource previously created. Note that I have created a local value, local.wait-time, and set it to “60”, representing 60 seconds.

# module/gke/main.tf
...
resource "null_resource" "resource-to-wait-on" {
  provisioner "local-exec" {
    command = "sleep ${local.wait-time}"
  }
  depends_on = [google_project_service.gke-api]
}

4. Make the infrastructure that needs the eventually consistent resource depend on the null_resource (in my case, the google_container_cluster resource). Add a depends_on clause to the resource, forcing Terraform to wait until null_resource.resource-to-wait-on has completed.

# module/gke/main.tf
...
resource "google_container_cluster" "primary" {
  name     = "${var.cluster_prefix}-${local.cluster_suffix}"
  location = "${var.location}"
  project  = "${var.project_id}"
  ..
  .. details removed for brevity
  ..
  depends_on = [null_resource.resource-to-wait-on]
}

The above pseudo-code represents what I have used inside of my module. I pass the project_id and a few other required fields into my module. Now Terraform waits for the configured amount of time before trying to create the google_container_cluster instance.
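Put together, the four steps could look something like the following sketch of the module's main.tf. It is written in Terraform 0.12 syntax and is not a copy of my actual module: the variable declarations, the cluster_suffix value and the initial_node_count are assumptions added so the fragment is complete.

```hcl
# modules/gke/main.tf (sketch, Terraform 0.12 syntax)

variable "project_id" {
  description = "ID of the project that will host the cluster"
}

variable "location" {}
variable "cluster_prefix" {}

locals {
  wait-time      = 60        # seconds to wait for the API enablement to propagate
  cluster_suffix = "example" # placeholder, not the real suffix logic
}

# Step 1: enable the API inside the module.
resource "google_project_service" "gke-api" {
  project            = var.project_id
  service            = "container.googleapis.com"
  disable_on_destroy = false
}

# Steps 2 and 3: sleep, but only after the API has been enabled.
resource "null_resource" "resource-to-wait-on" {
  provisioner "local-exec" {
    command = "sleep ${local.wait-time}"
  }

  depends_on = [google_project_service.gke-api]
}

# Step 4: the cluster is not created until the sleep has finished.
resource "google_container_cluster" "primary" {
  name               = "${var.cluster_prefix}-${local.cluster_suffix}"
  location           = var.location
  project            = var.project_id
  initial_node_count = 1 # placeholder size

  depends_on = [null_resource.resource-to-wait-on]
}
```

The chain is what matters: google_project_service, then null_resource, then google_container_cluster. Because all three live inside the module, depends_on can be used between them even though it cannot be set on the module block itself.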

Conclusion

First, I really wish I did not have to create a #hack, but given the rate at which cloud providers are adding functionality, and the rate at which HashiCorp (and the community) can keep up with those new features, it is reasonable to expect that we will need little hacks from time to time. One final note: please keep these “sleep” hacks to a minimum, and only use them IF there is no alternative. Terraform has a declarative format that allows the underlying algorithms to achieve time and resource efficiency. Adding 60 seconds here or there can drastically change the build time for new stacks.
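If the Linux-only sleep command is a concern, one possible variation (not what I used above) is the time_sleep resource from the hashicorp/time provider, which performs the wait inside Terraform instead of shelling out. The sketch below assumes that provider is available in your setup and that the fragment sits in the same module as google_project_service.gke-api; the resource name wait-for-gke-api, the cluster name and the node count are made up for illustration.

```hcl
# Sketch only: replaces the null_resource + local-exec sleep.
# Assumes the hashicorp/time provider is installed and that
# var.project_id and var.location are declared in the module.
resource "time_sleep" "wait-for-gke-api" {
  create_duration = "60s" # same 60 second wait as local.wait-time

  depends_on = [google_project_service.gke-api]
}

resource "google_container_cluster" "primary" {
  name               = "example-cluster" # placeholder name
  location           = var.location
  project            = var.project_id
  initial_node_count = 1 # placeholder size

  # Wait for the timer instead of the null_resource.
  depends_on = [time_sleep.wait-for-gke-api]
}
```

It is still a sleep hack, so the same warning applies: the wait adds to every fresh build.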
