The Hidden Link in Your Infrastructure

Ah, software infrastructure and service management… an inexhaustible source of spiritual wisdom and enrichment. It can teach you so much about life, about love… about humanity. One of the many parallels with infrastructure work is that it all seems simple at first; the monsters and challenges lurking beneath the surface only reveal themselves once you’ve grown old and weary (which, fortunately, happens rather quickly in the software world: if you work on the same thing for more than a year, you’re practically a fossil already).

Take a moment to consider a common career path in this field. You join a startup as a “DevOps specialist,” start tinkering with a home lab, or set up a website for the local chess club. At first, everything seems wonderfully simple: you plug an old laptop into a forgotten Ethernet port or rent a VPS for a few cents a day, and download the most promising Google result for “good free server operating system please help no Chinese scams no Microsoft.” You install a few packages, explain to your coworkers how to access the system, and wait to see what happens.

Eventually, something goes wrong: the server runs out of memory, a rodent chews through the motherboard, or your server has the misfortune of encountering a “regular user” (see Figure 1). At that moment, you realize that your server and its users might need monitoring after all. You spend a whole night browsing the internet, and the next day you’ve pieced together a few components into a beautiful observability stack. Maybe it’s a systemd drop-in, maybe Elasticsearch, Grafana, and Metricbeat. Maybe you’ve installed Prometheus and a few exporters. Either way, everything seems fine… until one day…

Figure 1: A typical user who completely ignores your carefully configured VPS and brings it down.

A new department at work. Your daughter joins a soccer club whose website hasn’t been accessible for five years. The hardware for your home server needs to be upgraded. If you’re lucky, you’ll realize that even if you’ve taken the most logical steps and executed them correctly, you’ve actually already made a serious mistake. You have to set up a new server, and given the way you’ve handled it, that means doing all the work all over again.

You cry, you beg, you bargain with your god or the void, but there’s no escaping it. Either you redo your work from scratch a second time, or you undo everything and find a way to make the process repeatable.

But from a fundamental perspective: what exactly is the problem here? Is it the failure to document the steps taken? Is it the failure to use an automation framework like Ansible or SaltStack? These are, in a sense, practical solutions to the problem. But if you take a tool’s marketing at face value and simply become a “user” of it, you run the risk of missing the real, deeper problem: during the setup of this infrastructure, the functional configuration and the deployment process have become inextricably intertwined (Figure 2).

The server configuration is the runtime environment, and the runtime environment is the configuration. A framework like Ansible makes it possible to partially break that link, which not only solves the problem of repeatability but also suddenly gives you the ability to deploy your services on various target platforms, such as Debian, Arch Linux, and so on.

Figure 2: Separation of configuration and runtime environments through the use of tools such as Ansible.

One of the key advantages of this type of setup—especially on a larger scale or when dealing with more complex requirements—is precisely the ability to operate independently of specific targets. This enables an organization to select the most cost-effective hardware options without incurring significant overhead when scaling across multiple data centers, or to benefit from additional compliance and reliability advantages by distributing the failure surface across multiple target environments.

For many, the move from a runtime environment that also serves as a configuration environment to separate configuration and runtime environments seems sufficient.

At SUE and other leading companies, however, we encounter challenges and infrastructure deployments on a truly massive scale. Consider, for example, a government or company building its own (semi-)public cloud. In such situations, the requirements of deployed services can change at any moment, while the broader landscape is also complex and constantly evolving. This calls for thinking outside the box and moving away from a paradigm based solely on standard tools and basic configuration management.

Instead, we let our choices be guided by sound theoretical principles and a higher level of abstract thinking. In this context, the need for an additional layer of stratification is rarely immediately apparent, but it can prove to be particularly valuable in practice.

Figure 3: Separate configuration, deployment, and runtime environments

Breaking free from a specific underlying configuration management system means you can suddenly introduce a whole new level of automation. Why spend time carefully writing Salt states, Terraform code, or Ansible code when these can be automatically generated for your platform based on a simpler representation of the input?
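As a rough illustration of that idea, the sketch below turns a deliberately small service description into an Ansible-style play. The spec format, the `render_playbook` helper, and the example values are assumptions made up for this example; only the `ansible.builtin.package` and `ansible.builtin.service` module names come from Ansible itself. The output is printed as JSON, which is also valid YAML.

```python
import json

# Hypothetical platform-level input: a small, tool-agnostic service description.
SERVICE_SPEC = {
    "name": "chess-club-site",
    "hosts": "webservers",
    "packages": ["nginx"],
    "services": ["nginx"],
}


def render_playbook(spec: dict) -> list[dict]:
    """Translate the simple spec into a playbook containing a single play."""
    tasks = []
    for pkg in spec.get("packages", []):
        tasks.append({
            "name": f"Install {pkg}",
            "ansible.builtin.package": {"name": pkg, "state": "present"},
        })
    for svc in spec.get("services", []):
        tasks.append({
            "name": f"Enable and start {svc}",
            "ansible.builtin.service": {
                "name": svc,
                "state": "started",
                "enabled": True,
            },
        })
    return [{
        "name": f"Deploy {spec['name']}",
        "hosts": spec["hosts"],
        "become": True,
        "tasks": tasks,
    }]


playbook = render_playbook(SERVICE_SPEC)
# JSON is a subset of YAML, so this text can be written to a playbook file.
print(json.dumps(playbook, indent=2))
```

The engineers maintain `render_playbook`; whoever requests a service only ever touches the small spec.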

Version Control

When it comes to version control, looking at your infrastructure through this lens opens up a number of new options. In combination with Ansible, it’s quite common to version your configuration using Git and to use GitOps (for example, with GitLab runners) to deploy your systems based on that configuration.

Because you've separated the configuration from the underlying engine, you can version only the underlying system code—which actually needs to be managed by engineers—in Git, while the system state itself is stored elsewhere.

To some, this may seem pointless or unnecessarily quirky. And yet, even Kubernetes keeps the configuration of the services it is responsible for in an etcd database rather than in Git. Because the objects and their state are stored in a database, the system can rely on replication and scale far beyond what a set of GitLab runners can provide, while the actual Kubernetes source code is still modified and versioned separately.

If your goal is to set up a few services, the “usual way” is perfectly fine. But if you’re working on enabling your company or customers to set up systems independently—similar to Kubernetes or Azure—then the extra flexibility of storing configuration “your own way” can be well worth the extra effort.

Pre-flight Configuration Validation

Looking at infrastructure from this perspective also brings another important design decision into focus: when and where the configuration gets validated.

In traditional Ansible or Terraform workflows, validation is often performed implicitly during execution itself: you run a plan, execute a playbook, and discover during the deployment phase whether something is correct or not. This works fine on a smaller scale, but becomes fragile as soon as the environment becomes more complex and changes are implemented more quickly and by more teams.

By explicitly separating configuration, deployment, and runtime, you can move validation earlier in the chain. Instead of waiting until a system is actually deployed, you can perform a “pre-flight check” on the projected configuration: a controlled evaluation to determine whether the desired state is consistent, complete, and executable in the first place.

This can range from simple schema validation to semantic checks that verify that resources do not conflict, dependencies are correct, and policies are followed before even a single change is deployed to production. On a larger scale, this becomes essential: errors that only become apparent during deployment are not only costly but can also propagate horizontally through interconnected systems.
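A pre-flight check along these lines could look roughly like the sketch below. The field names, the `preflight` function, and the "no privileged ports" policy are all hypothetical; the point is only that schema, semantic, and policy checks run against the projected configuration before anything is deployed.

```python
# Hypothetical schema: every service needs these fields with these types.
REQUIRED_FIELDS = {"name": str, "host": str, "port": int}


def preflight(services: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the config may proceed."""
    errors = []
    known_names = {svc.get("name") for svc in services}
    ports_in_use = {}  # (host, port) -> name of the service that claimed it

    for svc in services:
        name = svc.get("name", "<unnamed>")

        # Schema check: required fields exist and have the expected type.
        schema_ok = True
        for field, expected in REQUIRED_FIELDS.items():
            if not isinstance(svc.get(field), expected):
                errors.append(f"{name}: '{field}' is missing or not {expected.__name__}")
                schema_ok = False

        if schema_ok:
            # Semantic check: two services must not claim the same host and port.
            key = (svc["host"], svc["port"])
            if key in ports_in_use:
                errors.append(
                    f"{name}: port {key[1]} on {key[0]} is already used by {ports_in_use[key]}"
                )
            else:
                ports_in_use[key] = name

            # Policy check (made up for this example): no privileged ports.
            if svc["port"] < 1024:
                errors.append(f"{name}: policy forbids privileged port {svc['port']}")

        # Semantic check: every dependency must refer to a declared service.
        for dep in svc.get("depends_on", []):
            if dep not in known_names:
                errors.append(f"{name}: unknown dependency '{dep}'")

    return errors


good = [
    {"name": "db", "host": "a", "port": 5432},
    {"name": "api", "host": "a", "port": 8080, "depends_on": ["db"]},
]
bad = [
    {"name": "api", "host": "a", "port": 8080, "depends_on": ["cache"]},
    {"name": "web", "host": "a", "port": 8080},
]
good_errors = preflight(good)
bad_errors = preflight(bad)
```

Here `good_errors` comes back empty, while `bad_errors` reports the unknown dependency and the port conflict without a single machine having been touched.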

In modern platforms, this principle is reflected in various forms. Kubernetes, for example, performs declarative reconciliation: the desired state is validated and stored in etcd, and controllers then continuously compare it with the actual state of the cluster. Other systems build additional layers on top of this, such as policy engines or admission controllers, which block or modify configurations before they are even applied.
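The core of such a reconciliation step can be sketched in a few lines. This is not Kubernetes controller code; `plan` and `reconcile` are hypothetical helpers, and plain dictionaries stand in for the stored desired state and the observed actual state.

```python
def plan(desired: dict, actual: dict) -> list[tuple[str, str]]:
    """Compare desired and actual state and list the actions needed to converge."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(desired: dict, actual: dict) -> dict:
    """Apply the planned actions to the (simulated) actual state in place."""
    for action, name in plan(desired, actual):
        if action == "delete":
            del actual[name]
        else:  # "create" and "update" both end with the desired spec in place
            actual[name] = dict(desired[name])
    return actual


desired = {"web": {"replicas": 3}, "db": {"replicas": 1}}
actual = {"web": {"replicas": 1}, "old-cron": {"replicas": 1}}

first_plan = plan(desired, actual)
reconcile(desired, actual)
second_plan = plan(desired, actual)  # nothing left to do after converging
```

A real controller would run this comparison continuously and talk to actual machines or APIs instead of mutating a dictionary.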

The result of such a pre-flight layer is not only reliability, but also a shift in responsibility: from “we hope the deployment works” to “we know in advance that this change is consistent with the system model.”

When you connect your infrastructure to a public API, an internal developer portal, or a team responsible for service delivery, the likelihood of errors finding their way into your backend increases significantly. Furthermore, once you’re locked into a single specific way of storing, managing, and representing your configuration, it becomes difficult to validate that configuration for compliance, security, and correctness.

However, if you allow yourself to decide how configuration works for your platform, you can use a wide range of tools—such as OpenAPI specifications—or even your own custom code to validate, modify, or reject configurations based on what you actually want to support within your platform.
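As a small sketch of the "validate, modify, or reject" idea, the chain below runs each submitted configuration through a series of steps before it is accepted. The step names, the defaults, and the list of supported operating systems are assumptions for illustration and would be whatever your platform actually supports.

```python
class Rejected(Exception):
    """Raised when a submitted configuration is not supported by the platform."""


SUPPORTED_OS = {"debian", "arch"}  # hypothetical platform support matrix


def apply_defaults(config: dict) -> dict:
    """Modify: fill in platform defaults for anything the requester left out."""
    return {"os": "debian", "backups": True, **config}


def require_supported_os(config: dict) -> dict:
    """Reject: refuse targets the platform does not support."""
    if config["os"] not in SUPPORTED_OS:
        raise Rejected(f"unsupported os: {config['os']}")
    return config


def require_backups(config: dict) -> dict:
    """Reject: a made-up compliance rule that backups may not be disabled."""
    if not config["backups"]:
        raise Rejected("backups may not be disabled")
    return config


ADMISSION_CHAIN = [apply_defaults, require_supported_os, require_backups]


def admit(config: dict) -> dict:
    """Run a submitted config through every step; return the accepted version."""
    for step in ADMISSION_CHAIN:
        config = step(config)
    return config


accepted = admit({"name": "web"})
```

Calling `admit({"name": "legacy", "os": "windows"})` raises `Rejected`, so an unsupported request never reaches the backend.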

Organizational Agility

One of the great things about thinking in terms of these abstract building blocks is that if the technology you’ve chosen for deployment ever goes out of style (and if history teaches us anything, that “if” is really a “when”), you can essentially replace it without losing any information needed to rebuild or modify your machines.

For example, if your infrastructure is written in Terraform’s HCL and you want to switch to a tool that cannot read HCL… that’s going to be tricky. But when your infrastructure is defined in a more vendor-neutral format, and you’re the one who determines how that basic input is transformed into a fully functional system, you have much more flexibility and freedom of choice.
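A minimal sketch of what that freedom looks like in practice: one neutral spec and a table of interchangeable renderers. The spec format and the `render` function are hypothetical; the Ansible module names and the Salt `pkg.installed` and `service.running` state functions are real, though a production renderer would need far more detail.

```python
# Hypothetical vendor-neutral description of what a machine should run.
SPEC = {"packages": ["nginx"], "services": ["nginx"]}


def to_ansible_tasks(spec: dict) -> list[dict]:
    """Render the spec as a list of Ansible-style tasks."""
    tasks = [
        {"ansible.builtin.package": {"name": pkg, "state": "present"}}
        for pkg in spec["packages"]
    ]
    tasks += [
        {"ansible.builtin.service": {"name": svc, "state": "started", "enabled": True}}
        for svc in spec["services"]
    ]
    return tasks


def to_salt_states(spec: dict) -> dict:
    """Render the same spec as a Salt-style state tree."""
    states = {}
    for pkg in spec["packages"]:
        states.setdefault(pkg, {})["pkg.installed"] = []
    for svc in spec["services"]:
        states.setdefault(svc, {})["service.running"] = [{"enable": True}]
    return states


# Swapping the deployment technology means adding or replacing an entry here.
BACKENDS = {"ansible": to_ansible_tasks, "salt": to_salt_states}


def render(spec: dict, backend: str):
    """Turn the neutral spec into the chosen backend's representation."""
    return BACKENDS[backend](spec)


ansible_output = render(SPEC, "ansible")
salt_output = render(SPEC, "salt")
```

The spec stays untouched when a backend is retired; only its renderer has to be rewritten.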

As with any problem, the same principle applies here: the tools you choose should be suited to the problem you’re trying to solve. GitOps and “plain” Ansible are great and work well for certain setups, but when scale or accessibility become a bottleneck, a different approach may make much more sense.

A broader, systematic perspective on internal processes and configuration management requires some initial effort. But in the right context, and guided by experience and critical thinking, difficult design choices can actually yield significant long-term benefits.
