Panoptes: Remediating Rogue Resources in an Infrastructure as Code Multi-Cloud Environment
Administering system architectures has historically been a time-consuming and error-prone process. DevOps, a set of practices designed to improve the speed and quality of system deployment, addresses these challenges by promoting automation and consistency. A central practice in DevOps is Infrastructure as Code (IaC), which involves describing infrastructure through code and automatically configuring systems based on these definitions. This approach contrasts with traditional deployment methods, in which administrators configure systems manually and interactively.
IaC is particularly well-suited for cloud computing, as cloud services enable on-demand allocation of system resources that can be provisioned automatically using IaC. When adopting cloud computing, organizations can choose to rely on a single provider or multiple vendors simultaneously, the latter being referred to as a multi-cloud environment. Multi-cloud adoption helps avoid vendor lock-in and reduces dependency on a single provider. However, managing multiple vendors is challenging due to differences in products and APIs, which complicate interoperability.
Despite its advantages, IaC introduces certain challenges, one of which is configuration drift, often described as “undocumented configuration changes made to a running system”. This research focuses on detecting and remediating rogue resources, which are resources outside the IaC-managed “state” that affect resources within it. Here, “state” refers to the collection of resources managed by IaC. Rogue resources can interfere with IaC functionality, for example by hindering deployments when they depend on IaC-managed infrastructure. They can also compromise security, because resources that exist outside the IaC documentation are difficult to monitor and control. Addressing rogue resources is therefore essential to restore the system to a known state and maintain its integrity.
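The definition above can be illustrated with a minimal Python sketch. It is not the tool developed in this research: the `Resource` type, the `find_rogue_resources` function, the `depends_on` field, and all identifiers in the example data are hypothetical and chosen only for illustration. The sketch assumes that a live inventory of each platform and the set of IaC-managed resources have already been collected by some other means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A live cloud resource (hypothetical model, for illustration only)."""
    provider: str                        # e.g. "aws" or "hcloud"
    resource_id: str                     # provider-specific identifier
    depends_on: frozenset = frozenset()  # (provider, resource_id) pairs it relies on


def find_rogue_resources(live, managed):
    """Return live resources that are outside the managed state
    but depend on at least one resource inside it."""
    rogue = []
    for res in live:
        if (res.provider, res.resource_id) in managed:
            continue  # part of the IaC-managed state, so not rogue
        if any(dep in managed for dep in res.depends_on):
            rogue.append(res)  # unmanaged, yet tied to managed infrastructure
    return rogue


# Hypothetical example data: the IaC-managed state as (provider, id) pairs.
managed = {("aws", "vpc-1"), ("aws", "i-1"), ("hcloud", "net-1")}

live = [
    Resource("aws", "i-1", frozenset({("aws", "vpc-1")})),         # managed
    Resource("aws", "i-2", frozenset({("aws", "vpc-1")})),         # rogue
    Resource("hcloud", "srv-9"),                                   # unmanaged, unrelated
    Resource("hcloud", "srv-7", frozenset({("hcloud", "net-1")})), # rogue
]

rogue = find_rogue_resources(live, managed)
print([r.resource_id for r in rogue])  # prints ['i-2', 'srv-7']
```

Note that `srv-9` is not reported: under the definition used here, an unmanaged resource counts as rogue only if it affects resources within the state, which this sketch approximates with a direct dependency check.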
This research investigated methods for detecting and remediating rogue resources in multi-cloud environments. To implement the findings, a tool was designed and developed to address these challenges. The scope of the research was limited to ensure feasibility within the given time frame. AWS and Hetzner Cloud were chosen as the focus platforms because both are readily accessible and because the pair combines a major provider with a smaller platform. Terraform was selected as the IaC tool for this study due to its popularity and extensive adoption. Additionally, the research focused specifically on compute resources, allowing for a more detailed examination of rogue resource management.
By exploring these challenges and implementing solutions, this research aimed to enhance the management of multi-cloud environments and improve the reliability of systems using IaC. The findings contributed to a deeper understanding of how rogue resources can be addressed effectively, ensuring systems remain secure and functional.