Pains with Terraform (again)

June 23, 2020

Terraform - the popular infrastructure as code tool and language - looks very appealing at start and many people swear by it. However, I have had a great deal of frustration with it, which I want to share here as an input for future discussions of Terraform vs. AWS Cloud Development Kit (CDK) vs. other solutions. (I have written about this already 2 years ago and there is partial overlap between this and the older Pains with Terraform (perhaps use Sceptre next time?))

The promise of Terraform is attractive: You describe your infrastructure in a nice, declarative way and it detects and performs any changes. And it works pretty well, most of the time. And, having a commercial backing and large community, new AWS resources become reportedly quickly available, sometimes even faster than in Amazon’s own CloudFormation. (Terraform uses the AWS API, via Go.)

However I have many grievances with it, the main categories described below.

The Terraform "language"

My main issue with Terraform is that it has its own, half-baked, weird, frustratingly limiting programming language. The authors obviously realized that a static description is not enough and keep piling on more and more dynamism via variables, control structures, etc. - creating a patchwork of an unfamiliar, unfinished, rougly-edged, holey programming language. A few examples:

No simple way to include a resource conditionally, as in if cond? resource. What you can do is set its count to 0 or 1 - but then, whenever you use it, you get an array instead of a value and need something very non-intuitive like "${join("", aws_iam_role.this.*.arn)}" to extract the value, e.g. to set an output variable. In the upcoming JavaScript we would simply and naturally do aws_iam_role.this?.arn.
No way to include a whole module conditionally, as in if cond? my_module. So you have to manually define an input variable such as enabled and do the count = 0 or 1 trick for every single resource.
- This is fixed in Terraform 0.13
Proper programming languages make it easy to pass around structured data such as maps. Terraform also supports complex variables and outputs but there are (were?) various limitations (such as same type of map values; see e.g. #8153) and you cannot simply output a whole resource, you need to copy all of its outputs of interest.
Blocks aren’t first-class citizens: Some resources such as aws_codepipeline use many inline blocks. But there is no way to conditionally include a block or to define a block once and reuse or to define a "template," parametrized block and use it at multiple places.
If you want a "singleton" module - e.g. a shared infrastructure (a bucket, role, …) for all CodeBuild jobs - you must create it yourself at the top level and pass the relevant variables down to wherever it is needed, which is annoying and breaks encapsulation (which should anybody beyond my codebuild-job module know about what shared infrastructure it needs?!).
You can use interpolation - i.e. variables - at some places but not everywhere where you might expect/want it. F.ex. the default value of a variable cannot use another variable etc. Sometimes you just have to provide a literal value because Terraform does not handle that much "dynamism" (unfortunately, I do not have a concrete example here; it is a while since I suffered this).

I think that if Terraform was designed from start to support high dynamism to provide you with the flexibility and reuse you typically need, and had conditionals, variables, loops from start and modules, blocks, etc. as first-class citizens, it would be much less frustrating. That is why I am intrigued by AWS CDK, which uses a full-featured, familiar language - e.g. TypeScript - to define your infrastructure (which is then turned to the declarative CloudFormation description of it). It looks more imperative then declarative, which is a pity, but the flexibility and autocompletion this provides seem worth it. (I haven’t tried CDK yet, I am sure it has a number of limitations of its own.)

(Note: I hear there have been a number of improvements in Terraform 0.12 but I suspect that it couldn’t overcome this fundamental problem.)

Dependency hell

Terraform 0.12 introduces incompatible changes to the language. But I could not upgrade until all the modules we use were updated to that version (and now, while still on 0.11, I cannot use any module that has been updated). Contrast it to Rust 2018 that managed to introduce new keywords without breaking existing packages, allowing easily to mix old and new. Not only have you to juggle the versions of Terraform and modules but there is also the AWS provider version - that occasionally also introduces backwards-incompatible changes, requiring that you upgrade all your modules … if their authors have updated them. Many people have managed to upgrade to 0.12 so perhaps it isn’t as complicated as it was and I should stop fearing the pain and try it again…

Timing, resource dependencies, and other critters

Sometimes terraform apply fails - and the solution is to run it repeatedly until it doesn’t - because some resources take time to create or destroy and Terraform obviously does not understand these dependencies fully. And you run into annoying problems where you have to manually (and possibly repeatedly) delete resources - empty S3 buckets, delete autoscaling groups, remove policies from a role (that Terraform has attached to it) - even though you would expect Terraform to handle that itself. One that has bitten me recently is CodeBuild - Error: cache location is required when cache type is "S3" where Terraform fails to understand a number of common ways of passing in an S3 bucket. (I ended up being forced to run terraform twice, adding the cache only in the second run.) (See #4149 Partial/Progressive Configuration Changes (2015) that tries to address some of these.)

So there is a number of cases where Terraform does not properly understand and/or resolve dependencies between resources and variables.

Conclusion

There is a lot of good in Terraform but also many frustration. I wish to try something that has been designed from scratch to by dynamic and embraces conditional logic. I am sure that other solutions such as CDK have many issues and limitations of their own so I might be forced to circle back to Terraform as the least evil in the end…

Tags: aws DevOps tool