IaC framework selection guideline: Defining the problem
August 7, 2024
Prior to August 2023, cloud resource configuration management automation was dominated by three frameworks: AWS CloudFormation for Amazon Web Services (AWS), Kubernetes API-based tools for Kubernetes, and HashiCorp Terraform. Terraform, in particular, was the go-to solution for cloud configuration management automation. However, in August 2023, HashiCorp moved Terraform from the open source Mozilla Public License (MPL 2.0) to the Business Source License (BSL), effectively turning it into a proprietary product controlled by a single software company. This change has leveled the playing field in the cloud configuration management market, making the choice of configuration management tooling less obvious.
In this two-part blog series, we share a Grid Dynamics internal guideline for selecting an infrastructure as code (IaC) framework for greenfield projects. This first post explores the recommended approach to IaC framework selection. The second post delves into specific use cases and provides tailored recommendations for IaC framework selection. Our goal is to assist engineers in making informed decisions when selecting an IaC framework, ultimately facilitating effective cloud infrastructure management.
This guideline aims to assist in selecting the optimal frameworks for automating cloud infrastructure management. It is specifically targeted towards:
- Architects and tech leads responsible for making decisions regarding tooling for cloud-based solutions.
- Engineering managers concerned about skills and expertise development.
Scope
This guideline focuses on infrastructure configuration management for cloud-based systems, encompassing both application development and infrastructure operations.
Here, “cloud” refers to services delivering IT resources via APIs and pay-per-use models. This includes Infrastructure as a Service (IaaS) and Software as a Service (SaaS) offerings from major cloud providers, along with dedicated SaaS solutions and self-hosted infrastructure platforms.
Infrastructure as Code (IaC) is a software system development approach utilizing software-like concepts (code) to control the resources of the system environment. This approach is applicable regardless of the underlying infrastructure, whether it’s a major public cloud or an application platform managed by another team within the organization. As long as the infrastructure is external to the application system and controlled through a CRUD-like resource-based API, it’s in the scope of this guideline. “Code” in this context doesn’t mean a general-purpose programming language. Instead, it emphasizes applying Software Development Life Cycle (SDLC) practices for development and change management.
This guideline doesn’t cover configuration management for operating systems or hardware because these components aren’t managed using CRUD-like resource-based APIs.
This guideline is not for organizations providing infrastructure or platform services to external parties.
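To make the "CRUD-like resource-based API" criterion more concrete, here is a minimal Python sketch of what such an API and a declarative "apply" step might look like. The `ResourceAPI` protocol, the `InMemoryPlatform` stand-in, and the `apply` function are hypothetical names introduced for illustration only; they do not correspond to any real cloud SDK, and deletion of undeclared resources is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


class ResourceAPI(Protocol):
    """Hypothetical CRUD-like API exposed by an infrastructure platform."""

    def create(self, name: str, spec: dict) -> None: ...
    def read(self, name: str) -> Optional[dict]: ...
    def update(self, name: str, spec: dict) -> None: ...
    def delete(self, name: str) -> None: ...


@dataclass
class InMemoryPlatform:
    """Stand-in for a real platform, used only to make the sketch runnable."""

    resources: dict = field(default_factory=dict)

    def create(self, name: str, spec: dict) -> None:
        self.resources[name] = dict(spec)

    def read(self, name: str) -> Optional[dict]:
        return self.resources.get(name)

    def update(self, name: str, spec: dict) -> None:
        self.resources[name] = dict(spec)

    def delete(self, name: str) -> None:
        self.resources.pop(name, None)


def apply(api: ResourceAPI, desired: dict) -> list:
    """Move the platform toward the state declared in code; return actions taken."""
    actions = []
    for name, spec in desired.items():
        current = api.read(name)
        if current is None:
            api.create(name, spec)
            actions.append(f"create {name}")
        elif current != spec:
            api.update(name, spec)
            actions.append(f"update {name}")
    return actions


platform = InMemoryPlatform()
first_run = apply(platform, {"bucket": {"versioning": True}})
second_run = apply(platform, {"bucket": {"versioning": True}})
```

The second run finds nothing to change, which is the property that makes a resource-based API suitable for code-driven change management. Operating system or hardware configuration rarely fits this shape.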
Use cases
To start, let’s review some common scenarios for cloud infrastructure configuration management:
Existing system in the cloud: An established system already has tools for automating infrastructure configuration management. While newer IaC tools may offer advantages like efficiency or cost savings, retooling the entire system for the owning organization can be expensive and time-consuming.
Multiple technology domains: As a system evolves, new technology domains may emerge. For example, an AWS-based organization with primarily Kubernetes workloads might benefit from adopting a separate toolchain for Kubernetes application development and operations. However, pure infrastructure management could remain with existing tools like AWS CloudFormation or HashiCorp Terraform Cloud. The split between application and system infrastructure should align with the organization’s structure.
Single-cloud system with limited SaaS: A single-cloud system with minimal reliance on external SaaS services might leverage the provider’s native configuration management tools. AWS, for example, offers a comprehensive CloudFormation-based suite, including a graphical web interface, an IaC framework (AWS Cloud Development Kit, CDK), and an application framework (AWS Serverless Application Model, SAM). Azure takes a similar approach, offering Azure Resource Manager (ARM) and the higher-level Bicep framework on top of ARM. Google Cloud stands out by relying on the Google Cloud command line interface and HashiCorp Terraform (as of early 2024) instead of a proprietary framework.
Cross-platform system: A single-cloud system might rely heavily on third-party SaaS services with extensive functionality, essentially acting as independent application platforms. Examples include Databricks, Snowflake, and Cloudflare. Kubernetes can also be considered a third-party platform as its development is independent of the cloud provider hosting the cluster. Extending existing cloud-specific IaC frameworks (like CDK) to these platforms might not be feasible. When application workloads reside solely on a third-party platform, it’s an example of “Multiple technology domains” as described above. However, it’s common to integrate components from various platforms to create unique value (as described in our blog: An application of microservice architecture to data pipelines).
Kubernetes-only system: When all workloads run on Kubernetes, developers might not require additional tooling for managing other infrastructure aspects. This design could be driven by cloud-agnostic goals or a focus on skills unification. The Kubernetes ecosystem itself might be treated in isolation, independent of the underlying cloud infrastructure. In these scenarios, most infrastructure configuration happens within the Kubernetes Resource Model (KRM). Managing the Kubernetes infrastructure itself falls to a smaller, specialized group that may utilize different tooling with distinct selection criteria.
Kubernetes-based system: Achieving a truly pure Kubernetes application system can be challenging. However, the KRM is extensible. Both cloud providers and third-party projects map cloud resources to Kubernetes resources with tools like GCP Config Connector for Kubernetes (and similar tools from the other clouds) or Crossplane. This enables building a system entirely on Kubernetes as the foundational abstraction layer. However, the KRM has limitations compared to specialized IaC frameworks or to CDK and general-purpose programming languages: it lacks features like resource dependency management and provisioning orchestration. The trend of workloads shifting towards higher-level serverless platforms might reduce reliance on Kubernetes over time. But promising new use cases, such as AI/ML and large language models (see our LLMOps blueprint for open source large language models), are built on the Kubernetes foundation.
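To illustrate what "resource dependency management" means in practice, the following Python sketch orders a set of resources so that each one is provisioned only after the resources it depends on. The resource names and the dependency map are hypothetical, and the sketch uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) rather than the algorithm of any particular IaC framework.

```python
from graphlib import TopologicalSorter

# Hypothetical resources, each mapped to the resources it depends on.
dependencies = {
    "network": set(),
    "database": {"network"},
    "cluster": {"network"},
    "application": {"database", "cluster"},
}


def provisioning_order(deps: dict) -> list:
    """Return resource names so that every dependency precedes its dependents."""
    return list(TopologicalSorter(deps).static_order())


order = provisioning_order(dependencies)
print(order)
```

A dependency-aware framework derives an ordering like this on its own. Without that capability, ordering has to be handled by convention, retries, or an external orchestrator.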
Isolated basic infrastructure management: Some organizations have dedicated cloud infrastructure teams provisioning cloud accounts and managing foundational services like networking and access permissions. These teams are small, but their work is highly privileged, so they prioritize security and require out-of-the-box configuration management tools for typical cloud infrastructure tasks. Scalability for diverse use cases and interoperability with other frameworks are less important than robust functionality and ease of use.
IaC as application definition language: Infrastructure configuration management tools are often used for application deployment automation. However, in the cloud, applications themselves are usually composed of cloud-provided building blocks. IaC frameworks are routinely used to define these application compositions. Unlike the relatively static infrastructure of information systems, applications are typically numerous, experience rapid development initially, and then transition to low-maintenance mode.
Selection criteria
Tooling selection typically arises during new implementation projects. These projects usually involve two key parties: developers (building the solution) and system owners (receiving the solution). The selection process should consider the concerns of both groups.
Developer concerns:
- Implementation timeline: Lengthy procurement processes for new tools can be a roadblock, especially if the tool doesn’t offer significant time savings (e.g., reducing implementation time severalfold).
- Skills availability: A lack of experienced engineers or extensive training requirements can delay implementation if sufficient expertise isn’t readily available.
- Client acceptance: Ultimately, the solution must be delivered and accepted by the client. If the client has reservations about a particular tool for any reason, the tool isn’t a viable choice.
- Legal risks: Framework licensing, terms of use, and other legal considerations must not expose developers to the risk of violations. Additionally, the client may have legal requirements restricting the use of certain tools.
Owner concerns:
- Direct operating costs: This includes readily apparent expenses like service and license fees. These costs are subject to close scrutiny by financial stakeholders. Indirect operating expenses, such as human resources for tool maintenance, are also important but less visible.
- Skill availability: After implementation, the owner needs the in-house expertise to operate and maintain the solution. Unfamiliar or exotic toolchains can create challenges in finding skilled personnel.
- Third-party support availability: Organizations don’t usually plan to maintain comprehensive expertise about each tool in use. Access to qualified third-party support in these cases is crucial to maintain solution reliability and cost. This might not be a major concern for non-essential toolchains or if the organization has the resources to support the tool internally.
- Tool sprawl minimization: A growing collection of tools increases both the cost and complexity of maintenance. Clients generally prefer to control the number of tools in their environment.
- Risk of unanticipated operations cost increase: System development, evolving tool terms of use, or pricing changes can lead to unexpected cost hikes. If a tooling company controls the ecosystem and can secure vendor lock-in, it may resort to aggressive tactics to squeeze more money from its customers.
- Risk of unplanned migration: If a business is forced to switch infrastructure management tools due to factors like cost increases, legal restrictions, or vendor disappearance (especially with SaaS), the impact can be significant and disruptive.
- Cost of unplanned migration: The cost of unplanned migration, particularly for widely used tools supporting multiple systems and teams, can be prohibitive. Even delaying migration can be expensive, as seen with the 2021 Docker Desktop license change.
- Legal risks (ongoing): Unlike developer legal concerns, owner risks extend beyond initial implementation. These may include license or terms-of-use violations due to changes on either side, potential vendor litigation over costs, or data breaches through third-party managed services.
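One possible way to keep track of the developer and owner concerns listed above during an evaluation is a simple per-tool checklist. The Python sketch below is an illustration only: the `Assessment` class, the tool name, and in particular the choice to treat client acceptance and developer legal risks as blocking are assumptions made for this example, not part of the guideline.

```python
from dataclasses import dataclass, field

DEVELOPER_CONCERNS = (
    "implementation timeline",
    "skills availability",
    "client acceptance",
    "legal risks",
)
OWNER_CONCERNS = (
    "direct operating costs",
    "skill availability",
    "third-party support availability",
    "tool sprawl",
    "unanticipated cost increase",
    "unplanned migration risk",
    "unplanned migration cost",
    "ongoing legal risks",
)
# Assumption for this sketch: any issue under these concerns rules a tool out.
BLOCKING = {"client acceptance", "legal risks"}


@dataclass
class Assessment:
    """Issues found for one candidate tool, keyed by concern name."""

    tool: str
    issues: dict = field(default_factory=dict)

    def record(self, concern: str, note: str) -> None:
        if concern not in DEVELOPER_CONCERNS + OWNER_CONCERNS:
            raise ValueError(f"unknown concern: {concern}")
        self.issues[concern] = note

    def is_viable(self) -> bool:
        # Viable only if no blocking concern has a recorded issue.
        return BLOCKING.isdisjoint(self.issues)


candidate = Assessment("hypothetical-tool")
candidate.record("direct operating costs", "per-seat license fee")
viable_before = candidate.is_viable()
candidate.record("client acceptance", "client has reservations")
viable_after = candidate.is_viable()
```

Non-blocking issues stay on record for comparison between candidates, while a blocking issue removes the tool from consideration regardless of its other merits.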
The list of concerns influencing framework selection should be focused on those directly impacting a developer’s ability to implement and the owner’s ability to manage the solution. Here are some commonly mentioned criteria that typically have minimal influence and can be safely excluded:
- Open source vs. proprietary: The licensing model itself is generally not a deciding factor. However, it can indirectly affect cost and risk considerations, which should be evaluated independently.
- Cloud-specific toolset fragmentation: The inherent complexity and peculiarity of different cloud platforms often necessitate distinct IaC toolchains. Adding another IaC framework likely won’t significantly worsen skill fragmentation.
- IaC framework syntax/language: Differences between languages like YAML, HCL, Python, or TypeScript are minor compared to more critical selection criteria.
Problem definition
To choose an IaC framework, the decision maker needs to answer two questions:
- What is the application domain for IaC?
- What are the target infrastructure platforms?
Decomposition
The first step is to determine if the information system in question belongs to a single, homogeneous domain of processes, use cases, and technologies. An example of a homogeneous domain would be an end-user-facing application system with tightly coupled cloud infrastructure. These components share a similar SDLC, are created and managed by similar teams (often referred to as “application” or “product” teams), and engineers can easily move between these teams and products.
However, “everything on AWS” isn’t a homogeneous domain. This encompasses a broader scope, including relatively static foundational infrastructure like networking, IAM, and security. This application-agnostic infrastructure is likely owned by an isolated team and has a different lifecycle.
If the system spans multiple domains:
- Divide the system into its constituent homogeneous domains.
- Then, for each relevant domain, repeat the evaluation process from the beginning.
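The decomposition step can be sketched as a grouping exercise. In the Python example below, each component is tagged with the kind of team that owns it and with its lifecycle, and components sharing both are treated as one homogeneous domain. The `Component` fields, the sample system, and the grouping key are illustrative assumptions; a real assessment would weigh processes, use cases, and technologies more carefully.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    owner: str      # e.g. "product team" or "cloud infrastructure team"
    lifecycle: str  # e.g. "frequent releases" or "mostly static"


def decompose(components: list) -> dict:
    """Group components that share an owner type and lifecycle into one domain."""
    domains = defaultdict(list)
    for component in components:
        domains[(component.owner, component.lifecycle)].append(component.name)
    return dict(domains)


# Hypothetical "everything on AWS" system.
system = [
    Component("web-frontend", "product team", "frequent releases"),
    Component("orders-api", "product team", "frequent releases"),
    Component("vpc-networking", "cloud infrastructure team", "mostly static"),
    Component("iam-baseline", "cloud infrastructure team", "mostly static"),
]

domains = decompose(system)
for key, names in domains.items():
    # Each domain is then evaluated separately, starting from the beginning.
    print(key, names)
```

Here the single "AWS" system splits into two domains, each of which goes through the framework evaluation on its own.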
IaC application domain
In the cloud, everything is software. As a result, the entire spectrum of activities, from pure IT operations to user-facing software development, can be generalized to SDLC/Application Lifecycle Management (ALM). However, for IaC framework selection, some key distinctions are crucial. This guideline defines three IaC application domains:
- Application composition: Infrastructure configuration management tools have been used for application deployment automation for a while. However, in the cloud, applications are often composites of cloud-provided building blocks, potentially combined with custom business logic code. IaC frameworks are frequently used to define these application compositions. In this scenario, the IaC application manifest becomes the primary deployment artifact, subject to the application’s SDLC and an integral part of the application codebase.
An alternative approach involves leveraging comprehensive cloud application frameworks, such as the Serverless framework for AWS Lambda functions. Here, applications are numerous and experience rapid development initially, followed by a period of low-maintenance operation. This necessitates an application manifest framework that is dependable and risk-free at scale. Generally, it should be either a stable platform-specific technology, like AWS CloudFormation, or managed by a respected, independent governing body, like Kubernetes resource model YAML, backed by Cloud Native Computing Foundation (CNCF).
- Deployment environment management: Application components are seldom deployed directly to bare cloud environments. They typically require some prior configuration of the cloud environment or may rely on sophisticated system services, sometimes referred to as application platforms. This portion of the information system might be seen as a regular service component used by other parts. However, there’s a crucial distinction. These system components, forming the deployment environment for application components, are less numerous. They often retain state and are frequently updated in place, whereas application components favor immutability. Operational concerns for deployment environments outweigh SDLC considerations.
- Foundational infrastructure: Does the problem domain encompass foundational infrastructure management within major clouds or other SaaS platforms? In this scenario, the focus might be on the organization’s cloud infrastructure team. This team controls the allocation of fundamental cloud resource units (e.g., Google Cloud projects and AWS accounts) and manages essential elements of the cloud infrastructure, including networking, IAM, and security. Due to the sensitive nature of their responsibilities, this team is typically small, isolated, and has limited interaction with large-scale application development. Consequently, the toolchain they employ has a minimal impact on the organization. In this case, a toolchain that offers out-of-the-box support for most basic infrastructure management use cases is ideal. Additionally, a cloud-agnostic toolchain may be preferred to align with an organizational multi-cloud policy.
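As an illustration of the application composition domain described above, where the IaC manifest is the primary deployment artifact kept in the application codebase, the following Python sketch builds a minimal AWS CloudFormation template as a dictionary and serializes it to JSON. The application name, the logical resource names, and the choice of an S3 bucket and an SQS queue are assumptions made for the example.

```python
import json


def build_manifest(app_name: str) -> dict:
    """Return a minimal CloudFormation template describing an application's building blocks."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Application composition for {app_name}",
        "Resources": {
            # Logical names below are hypothetical.
            "UploadsBucket": {"Type": "AWS::S3::Bucket"},
            "JobsQueue": {
                "Type": "AWS::SQS::Queue",
                "Properties": {"VisibilityTimeout": 60},
            },
        },
    }


manifest = build_manifest("orders")
print(json.dumps(manifest, indent=2))
```

Because the manifest is plain data produced from the application repository, it can be versioned, reviewed, and released through the same SDLC as the business logic it accompanies.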
Target platforms
This list is not exhaustive, nor is it universal. It’s purely for IaC tooling selection.
- AWS: Is the system AWS-only or primarily AWS-based, with minor, simple use of cloud and SaaS resources outside AWS?
- Azure: Is the system Azure-only?
- Google Cloud: Is the system primarily hosted on Google Cloud Platform (GCP)? Google Cloud does not have its own IaC tooling, so being GCP-only is less critical than in AWS or Azure cases.
- Kubernetes: Is the system in question based solely on Kubernetes or is Kubernetes the primary focus with the addition of cloud and SaaS resources? Or are container images natural artifacts of the system SDLC, and, while most of the system resides on proprietary clouds, the cloud usage is straightforward, involving only popular services from major public clouds? This question aims to identify situations where Kubernetes with API extensions for cloud resources can serve as the system’s control plane, even if the system is not solely based on Kubernetes. It implies that Kubernetes infrastructure management is beyond the scope of the system in question and is readily available.
- Multi-platform: The system directly utilizes several major SaaS platforms. While it can be a multi-cloud system, more frequently these systems reside within a specific cloud while heavily relying on another SaaS platform, such as Databricks, Snowflake, or Cloudflare. Occasionally, this scenario can be simplified to Kubernetes. However, this is often not feasible, necessitating the use of IaC frameworks compatible with all the infrastructure platforms involved.
Next steps
To summarize the decision-making process outlined in this guideline: first, identify the IaC application domain (application composition, deployment environment management, or foundational infrastructure management), and then consider the target platforms (AWS, Azure, Google Cloud, Kubernetes, or multi-platform). These two factors will help you narrow down your choices.
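The outcome of this process can be captured as a pair of values: one application domain and one target platform. The Python sketch below models that pair with enumerations; the class and member names are hypothetical, chosen to mirror the categories in this guideline.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    APPLICATION_COMPOSITION = "application composition"
    DEPLOYMENT_ENVIRONMENT = "deployment environment management"
    FOUNDATIONAL_INFRASTRUCTURE = "foundational infrastructure"


class Platform(Enum):
    AWS = "AWS"
    AZURE = "Azure"
    GOOGLE_CLOUD = "Google Cloud"
    KUBERNETES = "Kubernetes"
    MULTI_PLATFORM = "multi-platform"


@dataclass(frozen=True)
class ProblemDefinition:
    """The two answers that narrow down the choice of IaC framework."""

    domain: Domain
    platform: Platform


def all_problem_definitions() -> list:
    """Enumerate every domain/platform combination covered by the guideline."""
    return [ProblemDefinition(d, p) for d in Domain for p in Platform]


example = ProblemDefinition(Domain.APPLICATION_COMPOSITION, Platform.KUBERNETES)
combinations = all_problem_definitions()
```

A system that spans several homogeneous domains would yield one such pair per domain rather than a single pair for the whole system.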
In the second part of this guideline, we will go deeper into specific IaC frameworks and provide recommendations based on the criteria discussed here.