Adaptive Vertical Scaling with Granular Degradation Prediction & Contextualized Multi-Armed Bandits

Research question and methodology

The central research question is: "How can we minimize over-allocation of compute resources in cloud-native orchestration platforms without degrading performance?" The question is particularly relevant given that more than 65% of containers deployed on Kubernetes use less than half of their allocated CPU and memory. The research follows a quantitative methodology, analyzing metrics such as CPU utilization, memory utilization, disk I/O, network I/O, CPU throttling, out-of-memory (OOM) errors, and end-to-end latency. It explicitly accounts for multiple dimensions of resource contention beyond CPU and memory alone.
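To make the metric catalog concrete, the sketch below groups one observation window of these signals into a single record with a flat feature vector for downstream models. The field names and units are illustrative assumptions, not the study's actual telemetry schema.

```python
from dataclasses import dataclass, astuple

@dataclass
class TelemetrySnapshot:
    """One observation window of the metrics the study analyzes.

    Field names and units are illustrative, not the study's real schema.
    """
    cpu_util: float          # fraction of the CPU request in use
    mem_util: float          # fraction of the memory request in use
    disk_io_bytes: float     # block I/O volume during the window
    net_io_bytes: float      # network I/O volume during the window
    throttled_periods: int   # CPU throttling events (cgroup CFS stats)
    oom_events: int          # out-of-memory kills observed
    p99_latency_ms: float    # end-to-end tail latency of requests

    def to_features(self) -> list[float]:
        """Flatten into a numeric feature vector for a downstream model."""
        return [float(v) for v in astuple(self)]

snap = TelemetrySnapshot(0.42, 0.65, 1.2e6, 3.4e5, 2, 0, 180.0)
features = snap.to_features()
```

Keeping contention signals such as throttling and I/O volume alongside plain utilization is what lets a model see pressure that CPU and memory percentages alone would miss.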

Research design and techniques

The study introduces a two-phase predictive vertical scaling mechanism that combines kernel-level telemetry with online learning. In the first phase, holistic metrics are collected, ranging from kernel-level run-queue latency and block I/O stalls to container-level CPU and memory usage, and fed into a calibrated Random Forest classifier that outputs a performance degradation likelihood score. In the second phase, a contextual multi-armed bandit algorithm uses this degradation score, together with current utilization metrics, to learn over successive iterations how CPU and memory allocations should be adjusted, striking a balance between resource savings and performance risk.

The framework was implemented on Kubernetes and evaluated against state-of-the-art baselines, including the Kubernetes Vertical Pod Autoscaler (VPA) and SHOWAR.
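On Kubernetes, a bandit-chosen action ultimately becomes a patch to a container's resource requests. The sketch below builds such a patch body; the pod and container names are hypothetical, and whether the change applies in place (rather than via a restart) depends on the cluster supporting in-place pod resource resize.

```python
def build_resize_patch(container: str, cpu_millicores: int, mem_mib: int) -> dict:
    """Build a merge patch setting new CPU/memory requests for one container."""
    return {
        "spec": {
            "containers": [{
                "name": container,
                "resources": {
                    "requests": {
                        "cpu": f"{cpu_millicores}m",
                        "memory": f"{mem_mib}Mi",
                    },
                },
            }],
        },
    }

patch = build_resize_patch("frontend", cpu_millicores=250, mem_mib=512)
# With the official Python client, this could be applied roughly as:
#   kubernetes.client.CoreV1Api().patch_namespaced_pod(
#       name="frontend-abc123", namespace="default", body=patch)
# (hypothetical pod name; in-place application requires cluster support).
```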

Results: resource savings versus performance and stability considerations

The results paint a nuanced picture of the optimization trade-offs. The proposed mechanism eliminated all out-of-memory (OOM) errors in both benchmark applications tested (Google Cloud Online Boutique and Train Ticket), whereas the state-of-the-art baselines exhibited multiple OOM failures. In addition, the mechanism reduced CPU throttling by up to 3x compared to those baselines while allocating a similar amount of CPU.

These gains come with trade-offs, however. The approach exhibits higher end-to-end latency than simpler existing solutions, likely due to the overhead of frequent resource adjustments and kernel-level instrumentation.

Implications and future research

The research highlights how tightly resource efficiency and application stability are coupled in containerized environments. Although the proposed mechanism excels at preventing performance degradation thanks to its predictive capabilities, its monitoring and decision-making components carry non-trivial CPU and memory overhead of their own. The findings show that no single autoscaling configuration is universally optimal: operators must consciously choose where their systems sit on the trade-off curve between aggressive resource optimization and conservative performance protection.

Future research could focus on reducing the overhead of kernel-level instrumentation, optimizing the granularity of scaling actions, and integrating this approach with horizontal autoscaling to achieve more holistic resource management strategies.
