Causal structure estimation for rogue resource elimination in Kubernetes clusters

Software engineers working on large software systems often split the system into microservices, an approach known as microservice architecture (MSA). When these systems grow sufficiently large or complex, or the demands placed on them are sufficiently dynamic, orchestrating the microservices becomes difficult. For this reason, several frameworks for deploying, managing and monitoring microservices or cloud infrastructure in “clusters” have gained traction. Foremost among these is Kubernetes (K8s), typically used in conjunction with a configuration tool (e.g. Helm, Terraform); this approach is often termed “infrastructure as code” (IaC). In practice, the intended state of the cluster, as tracked by developers, documentation and these configuration tools, can get out of sync with the live state. One issue that can emerge in such cases is infrastructure (e.g. containers, microservices, network namespaces, storage units) that lingers in the live state of the cluster despite no longer being critical for the system. We term these “rogue resources”.

Rogue resources present several problems, such as an enlarged attack surface, wasted energy and/or financial resources, and an increased possibility of anomalous behavior. When the cluster and its documentation are simple enough for one person or a few people to keep in working memory, rogue resources can often be eliminated by hand. In particularly complex or badly documented systems, however, finding and eliminating them can be difficult. The search for rogue resources essentially comes down to running a thorough root cause analysis (RCA) on the expected functioning of the system and identifying deployed infrastructure that is not part of the causal structure graph. Automated tools that help prioritize this search have the potential to be of great help.
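As a rough illustration of the last step, the sketch below flags deployed resources that cannot be reached from the system's entry points in a dependency graph. It assumes the causal structure is already available as a simple adjacency mapping; the function name `find_rogue_candidates`, the resource names and the graph itself are hypothetical and are not taken from the project.

```python
from collections import deque


def find_rogue_candidates(deployed, depends_on, entry_points):
    """Return deployed resources that no entry point depends on.

    deployed:     set of resource names found in the live cluster
    depends_on:   dict mapping a resource to the resources it relies on
                  (a hypothetical stand-in for the causal structure graph)
    entry_points: resources known to be required (e.g. public endpoints)
    """
    reachable = set()
    queue = deque(e for e in entry_points if e in deployed)
    # Breadth-first traversal over the dependency edges.
    while queue:
        node = queue.popleft()
        if node in reachable:
            continue
        reachable.add(node)
        for dep in depends_on.get(node, ()):
            if dep in deployed and dep not in reachable:
                queue.append(dep)
    # Anything deployed but never reached is only a candidate, not a verdict.
    return deployed - reachable


# Hypothetical example cluster.
deployed = {"ingress", "web", "api", "db", "old-worker", "old-queue"}
depends_on = {
    "ingress": ["web"],
    "web": ["api"],
    "api": ["db"],
    "old-worker": ["old-queue"],
}
print(sorted(find_rogue_candidates(deployed, depends_on, {"ingress"})))
# prints ['old-queue', 'old-worker']
```

In this toy graph, `old-worker` and `old-queue` depend only on each other and are not required by any entry point, so both are reported as candidates for manual review.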

This project is intended to take the first steps towards developing such tools. To keep the scope of the project realistic, we focus solely on Kubernetes and restrict ourselves to ways of identifying rogue resources. Hence, we pose the following research questions:

How do we define rogue resources?
How can we detect rogue resources in a K8s deployment?
What factors can tell us whether resources are rogue?
How does the efficacy of rogue resource detection vary with access to these factors?
