Estimation of causal structure for eliminating rogue resources in Kubernetes clusters

Software engineers working on large software systems often split them into microservices, an approach known as microservice architecture (MSA). As these systems grow larger or more complex, or as their requirements become more dynamic, orchestrating the microservices becomes increasingly complicated. For this reason, frameworks for deploying, managing, and monitoring microservice or cloud infrastructure systems in so-called "clusters" have gained popularity. The most prominent of these is Kubernetes (K8s), typically used in combination with a configuration tool such as Helm or Terraform, an approach often referred to as "Infrastructure as Code" (IaC).

The problem of rogue resources

In practice, we see that the intended state of the cluster, as tracked by developers, documentation, and these configuration frameworks, may differ from the actual situation. One problem that can arise in such cases is that infrastructure (such as containers, microservices, network namespaces, storage units, etc.) remains active in the live cluster environment, even though it is no longer critical to the system. These resources are referred to as "rogue resources."
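As a rough illustration of the gap between intended and actual state, the sketch below compares two inventories of resources. It assumes both inventories have already been exported as (kind, namespace, name) triples, for example from the configuration framework and from the live cluster; the `ResourceRef` type, the `find_undeclared` helper, and the sample resources are hypothetical and not part of the project. A resource that is live but undeclared is only a candidate for closer inspection, not proof that it is rogue.

```python
from typing import Iterable, List, NamedTuple


class ResourceRef(NamedTuple):
    """Hypothetical identifier for a cluster resource."""
    kind: str
    namespace: str
    name: str


def find_undeclared(declared: Iterable[ResourceRef],
                    live: Iterable[ResourceRef]) -> List[ResourceRef]:
    """Return resources present in the live cluster but absent from the
    declared (intended) state, sorted for stable output."""
    return sorted(set(live) - set(declared))


# Illustrative data only; these names are assumptions.
declared = [
    ResourceRef("Deployment", "shop", "api"),
    ResourceRef("Service", "shop", "api"),
]
live = [
    ResourceRef("Deployment", "shop", "api"),
    ResourceRef("Service", "shop", "api"),
    ResourceRef("Service", "shop", "legacy-export"),
]

undeclared = find_undeclared(declared, live)
for ref in undeclared:
    print(f"{ref.kind} {ref.namespace}/{ref.name} is live but not declared")
```

A set difference like this only catches resources that were never declared or whose declaration was removed. Resources that are still declared but no longer needed require the kind of structural analysis discussed under detection.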

Risks and challenges of rogue resources

Rogue resources cause various problems, including an increased attack surface, wasted energy and money, and a greater chance of unexpected behavior. When the cluster and its documentation are clear and small enough for one or a few people to keep in mind, rogue resources can often be removed manually. In poorly documented or highly complex systems, however, it is considerably more difficult to identify and eliminate them.

Detection of rogue resources

Essentially, detecting rogue resources boils down to performing a thorough root cause analysis (RCA) of the expected functioning of the system and identifying deployed infrastructure that is not part of the causal structure. Automated tools that can help prioritize the search for these resources have the potential to offer significant value in this regard.
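One simple way to picture "infrastructure that is not part of the causal structure" is as a reachability question on a dependency graph. The sketch below is a simplified stand-in, not the project's method: it assumes the dependency edges and the set of entry points are already known, whereas estimating that structure is the actual research problem. The function names and resource labels are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable_from(entry_points: Iterable[str],
                   depends_on: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first search over 'depends on' edges, starting at the
    resources that are known to be required (the entry points)."""
    seen = set(entry_points)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for dep in depends_on.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def rogue_candidates(deployed: Iterable[str],
                     entry_points: Iterable[str],
                     depends_on: Dict[str, List[str]]) -> List[str]:
    """Deployed resources that no entry point depends on, directly or
    transitively. These are candidates to review, not confirmed rogues."""
    needed = reachable_from(entry_points, depends_on)
    return sorted(set(deployed) - needed)


# Illustrative cluster; all names and edges are assumptions.
deployed = ["ingress/web", "deploy/api", "svc/db", "pvc/db-data",
            "deploy/old-worker", "pvc/old-cache"]
depends_on = {
    "ingress/web": ["deploy/api"],
    "deploy/api": ["svc/db"],
    "svc/db": ["pvc/db-data"],
    "deploy/old-worker": ["pvc/old-cache"],
}

candidates = rogue_candidates(deployed, ["ingress/web"], depends_on)
print(candidates)
```

In this toy example the old worker and its cache are flagged because nothing reachable from the ingress depends on them. In a real cluster the edges are uncertain, so a tool would more plausibly rank candidates to prioritize the search than return a definitive list.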

Research design and objectives

This project takes the first steps toward developing such tools. To keep the scope of the research manageable, we focus exclusively on Kubernetes and limit ourselves to investigating methods for identifying rogue resources. We therefore formulate the following research questions:

  1. How do we define rogue resources?
  2. How can we detect rogue resources in a Kubernetes deployment?
  3. What factors indicate whether resources are rogue?
  4. How does the effectiveness of rogue resource detection vary depending on the availability of these factors?