Adopting IaC tools like terraform with terragrunt and splitting infrastructure into reusable, configurable modules allowed us at Mambu to automate an enormous amount of infrastructure. However, the actual process of performing infrastructure releases (terraform runs) was still taking more time and effort than was acceptable.
This was due to several reasons:
- The process of performing an infrastructure release was highly manual: terraform runs had to be executed from site reliability engineers' (SREs) laptops.
- Many SREs had to be involved in the release process in order to:
- parallelise manual labor of performing terraform executions on the environments
- troubleshoot issues along the way
- Having multiple people involved required lots of coordination between them.
- Infrastructure drifts / inconsistencies in the environments were not unusual and were particularly painful and time-consuming to resolve.
- Slightly different local environment setups / scripts sometimes created butterfly effects down the line.
- Terraform state manipulation/migration was required when module refactoring introduced breaking changes. Automating state manipulation usually resulted in ad-hoc scripts that were executed sequentially and often consumed a significant amount of time.
For these reasons, an infrastructure release could sometimes take up to three weeks. This situation was hardly scalable and less than ideal for a company growing at a fast pace.
Obviously, automation (probably in some sort of pipeline format) executed in a centralised manner was required. While looking for the available options, apart from addressing the problems mentioned in the previous section, we focused on two main requirements:
- A platform should scale efficiently to handle a rollout of an infrastructure release on 200+ environments.
- A platform should be highly configurable and extendable to integrate with the surrounding ecosystems.
For terraform run orchestration there are multiple options available in the wild. Some of the most popular platforms are:
- Terraform Cloud / Enterprise
- Generic CI/CD platforms
While evaluating the options above and the technologies currently used at Mambu, we realised that there was under-utilised potential in the Gitlab CI/CD offering. At the time, Gitlab was already the strategic version control system (VCS) at Mambu and we had a dedicated team owning and evolving the Gitlab offering for the rest of the company. Together with the above, the following reasons made this option even more attractive:
- Highly configurable Gitlab CI/CD pipelines.
- SSO & RBAC were already standardised and in place.
- Self-hosted Gitlab Runners with a dedicated team managing them.
- Most of the SREs already had some experience with Gitlab pipelines.
Another important point I’d like to emphasise is that Mambu (and probably any other modern company) has a relatively complex structure and plain terraform plan & apply might not be enough. Therefore, the ability of generic CI/CD platforms to easily extend the workflow with custom functionality was another key factor in the decision making process.
Based on the evaluations above, the decision was not to introduce another tool but to explore the potential of the existing Gitlab CI/CD offering.
Context for infrastructure setup
While using terragrunt to orchestrate terraform execution, we have two types of repositories: infrastructure-modules and infrastructure-live.
- infrastructure-modules - in this repository the infrastructure templates are defined in a terraform format.
- infrastructure-live - this repository has terragrunt configuration that references terraform modules defined in the infrastructure-modules and provides input parameters for those modules. The main branch of this repository represents the source of truth for Mambu’s infrastructure and is used for infrastructure releases.
Examples of file/folder structure with more in-depth explanation of the concept can be found in this example infrastructure-live repository.
Let’s get a bit more familiar with Mambu’s setup of the infrastructure-live repository.
Each Mambu environment, shared or dedicated, consists of at least three stages (e.g. sandbox, dr, production). Each stage is a completely separate infrastructure deployment and consists of multiple terraform modules that may or may not depend on each other. This means that a particular order of module execution is required.
An example structure of infrastructure-live type of repository is demonstrated below:
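A minimal sketch of such a layout, assuming hypothetical environment and module names:

```
infrastructure-live/
├── environment-a/
│   ├── sandbox/
│   │   ├── vpc/terragrunt.hcl
│   │   ├── eks/terragrunt.hcl       # depends on vpc
│   │   └── database/terragrunt.hcl  # depends on vpc
│   ├── dr/
│   └── production/
└── environment-b/
    ├── sandbox/
    ├── dr/
    └── production/
```

Each leaf folder holds the terragrunt configuration for one module deployment, so the environment / stage / module hierarchy is encoded directly in the folder structure.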
Dynamically-generated parallel multi-level Gitlab pipelines
Statically defined pipelines have been a common feature of CI/CD platforms for quite a while. However, in its 12.9 release Gitlab introduced a new feature called “Dynamic child pipelines”, which was a complete game changer. With this functionality you no longer need to statically define each job configuration. Instead, a “pipeline generator” job dynamically generates YAML file(s) which are then used to trigger downstream child pipeline(s).
By leveraging Gitlab’s multi-level dynamic child pipelines feature, this approach allows us to compose multiple pipelines that:
- Leverage the scalability of Gitlab runners to provide a high number of parallelised terraform runs. As long as the runners have sufficient capacity, the top-level pipeline can schedule terraform runs on all environments residing in the infrastructure-live repository in parallel.
- Shorten the feedback loop by having multi-level child pipeline jobs run at the terraform module level, while providing a high-level release status (environment / stage) in the top-level pipeline.
This pipeline architecture provides sufficient granularity to drill down when needed. For example, if a terraform run for a particular environment / stage / module fails, it is relatively easy to drill down into the specific failing component for more details.
In Mambu, the so-called “pipeline generator” has been named infraflow. Let’s go through the high level example of an infrastructure release executed via infraflow.
1. A release is started by raising a merge request (MR) to the infrastructure-live repository or by triggering a pipeline via the UI/API.
2. Downstream pipelines are generated via separate workflow paths depending on the pipeline trigger:
- MR trigger: infraflow parses the git diff of the MR and, depending on the environment / stage / module changeset, generates the appropriate downstream child pipelines (example of dynamic pipeline generation with Jsonnet).
- UI/API trigger: downstream child pipelines are generated based on the input parameters specified in the API payload.
The important bit is that the downstream child pipeline YAML files are generated dynamically: the release to all stages, derived from the git changeset or API payload, is decomposed and enforced to execute sequentially for each stage, e.g. sandbox → dr → prod.
3. Each environment in a particular stage will get a dynamically generated child pipeline. Since there are no dependencies between environments, these pipelines are executed in parallel.
4. Each environment pipeline in a stage consists of multiple terraform module level child pipelines (step 5). Since there are dependencies between modules, these are executed sequentially.
5. Module level pipelines execute specified action(s) on the target terraform module. Usually this consists of terraform plan, plan validation and terraform apply.
6. Once all sandbox environments are deployed, the release progresses to the subsequent stage and the appropriate pipeline execution (steps 3, 4 and 5) starts accordingly.
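To make the MR-triggered path in step 2 a bit more concrete, a generator job could derive the changeset roughly as follows (the `render-pipelines` helper is hypothetical; `CI_MERGE_REQUEST_TARGET_BRANCH_NAME` is a predefined Gitlab CI variable):

```yaml
generate-child-pipelines:
  stage: generate
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    # Collect the list of files changed by the MR.
    - git fetch origin "$CI_MERGE_REQUEST_TARGET_BRANCH_NAME"
    - CHANGED=$(git diff --name-only "origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME"...HEAD)
    # Decompose the changed paths into environment / stage / module and
    # render one child pipeline per affected environment (hypothetical helper).
    - ./render-pipelines --changeset "$CHANGED" --out pipelines/
  artifacts:
    paths:
      - pipelines/
```

Because the folder structure of infrastructure-live encodes environment / stage / module, the changed file paths alone are enough to decide which child pipelines to render.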
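A module-level child pipeline (step 5) could be sketched as below. The `validate-plan` helper is a hypothetical placeholder for whatever plan validation fits, and in practice passing the plan file between jobs may need extra handling because terragrunt executes terraform inside its cache directory:

```yaml
stages:
  - plan
  - validate
  - apply

terraform-plan:
  stage: plan
  script:
    # Produce a plan file for the target module.
    - terragrunt plan -out=tfplan
  artifacts:
    paths:
      - tfplan

plan-validation:
  stage: validate
  needs: [terraform-plan]
  script:
    # Render the plan as JSON and run policy/sanity checks on it (hypothetical helper).
    - terragrunt show -json tfplan > tfplan.json
    - ./validate-plan tfplan.json

terraform-apply:
  stage: apply
  needs: [terraform-plan, plan-validation]
  script:
    # Apply exactly the plan that was validated.
    - terragrunt apply tfplan
```

Applying the saved plan file, rather than re-planning, guarantees that what was validated is what gets applied.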
Although infraflow has been designed and built with a primary focus on orchestrating terraform runs, it must be emphasised that it is a generic framework that empowers us to execute any automation (e.g. terraform plan validation, terraform state migration, EKS worker rollout, etc.) onto our infrastructure in a dynamic and parallelised manner by leveraging the Gitlab CI/CD offering.
Scaling-out to the Multi-Cloud
Mambu operates in multiple clouds. Consequently, in order to have better control and configuration options for each cloud, we adopted multiple separate infrastructure-live repositories, each dedicated to a single cloud service provider (CSP).
As an example, a setup of multiple infrastructure-live repositories with infraflow deployed for the AWS, GCP and Azure CSPs is depicted in the picture below:
It goes without saying that consistency is key in many situations, and this is especially true in a multi-cloud scenario. This setup allows a consistent workflow powered by infraflow across multiple clouds while maintaining high configurability for infraflow via separate config files in each repository.
Results & Conclusions
Although there were certain compromises we had to make while designing and building the framework based on dynamically-generated parallel pipelines, the introduction of infraflow and infraspin CLI has provided a number of benefits:
- There is no longer a need for an individual SRE to execute an infrastructure release from their laptop: infrastructure releases are executed from a centralised pipeline generated by infraflow, and workflow consistency is easily maintained across multi-cloud environments.
- Infrastructure release time improved drastically by leveraging parallelised terraform deployments backed by Gitlab pipelines. Not only was the infrastructure release cycle reduced by more than 50%, but the number of SREs involved in a release also dropped from around five to one or two.
- Dynamically-generated multi-level child pipelines provide a granular way to narrow the scope by targeting a specific environment / stage / module depending on the requirements of the situation.
- Consistent workflows powered by infraflow reduce the friction of adopting new cloud providers, which consequently enables Mambu to scale out to multiple cloud providers more easily.
This generic framework provides a solid foundation that enables Mambu to execute any automation onto managed infrastructure in a dynamic and parallelised manner. The processes and integrations around infraflow are evolving, and a vast amount of its potential is still to be explored for further expansion. I’d like to thank everyone for their support on this journey of introducing the framework, and especially my closest colleagues, who gave their best to turn this design concept into reality.