Optimizing HPC performance: a NUMA-aware scheduling solution for AMD EPYC systems

Small inefficiencies — major delays

In high-performance computing (HPC), even minor inefficiencies can cause significant delays. I recently encountered such a challenge while supporting a customer in the semiconductor industry that had integrated AMD EPYC-based systems into their global compute clusters. These systems, known for their high core counts and energy efficiency, were expected to accelerate simulation workloads. Instead, unexpected performance issues arose, particularly with multithreaded jobs.

The bottleneck

The customer noticed that certain simulation jobs, particularly those using 4 to 8 cores, had inconsistent runtimes. Some tasks were completed quickly, while others lagged significantly behind, causing delays in the overall workflow. This inconsistency was striking, especially since all clusters had the same hardware and software configuration.

After investigation, I discovered that the cause lay in the interaction between the job scheduler and the Non-Uniform Memory Access (NUMA) architecture of the AMD EPYC processors. Specifically, the scheduler assigned threads without taking into account the L3 cache boundaries inherent in the EPYC design. As a result, threads that shared data were placed on cores with separate L3 caches, leading to increased latency due to communication between caches.
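On Linux, one way to verify this kind of placement problem is to compare the `shared_cpu_list` files the kernel exposes under sysfs: two cores share an L3 cache exactly when they report the same sibling list. A minimal sketch (the helper names are mine; the code probes for the cache level rather than assuming a fixed `index3` directory, and returns `None` where the information is not exposed, e.g. in some VMs):

```python
from pathlib import Path

def l3_siblings(core: int):
    """Return the raw shared_cpu_list string for the core's L3 cache,
    or None if the kernel does not expose it."""
    cache_dir = Path(f"/sys/devices/system/cpu/cpu{core}/cache")
    if not cache_dir.exists():
        return None
    for idx in sorted(cache_dir.glob("index*")):
        try:
            if (idx / "level").read_text().strip() == "3":
                return (idx / "shared_cpu_list").read_text().strip()
        except OSError:
            continue
    return None

def share_l3(a: int, b: int):
    """True/False if both cores report an L3 domain; None when unknown."""
    sa, sb = l3_siblings(a), l3_siblings(b)
    if sa is None or sb is None:
        return None
    return sa == sb
```

On an EPYC part with eight cores per CCD, a check like `share_l3(0, 9)` would typically come back `False`, which is exactly the situation the scheduler was creating for communicating threads.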

Understanding the architecture

AMD EPYC processors use a chiplet-based architecture in which each chiplet, known as a Core Complex Die (CCD), contains multiple cores that share an L3 cache. This design scales well in core count, but it also complicates memory access patterns: when threads that communicate intensively are assigned to cores on different CCDs, their shared data must move between separate L3 caches over the on-package interconnect, which increases latency and reduces performance.
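The grouping of cores into L3 domains can be sketched by parsing the kernel's `shared_cpu_list` strings: cores with identical lists belong to the same domain. The sample topology below (two CCDs of eight cores each) is hypothetical and stands in for what a real system would report via sysfs:

```python
def parse_cpu_list(s: str) -> frozenset:
    """Parse a kernel cpu-list string such as '0-7' or '0-3,32-35'."""
    cores = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return frozenset(cores)

# Hypothetical 16-core layout: two CCDs of 8 cores, each sharing one L3.
sample = {core: "0-7" if core < 8 else "8-15" for core in range(16)}

# Group cores into L3 domains by their shared_cpu_list value.
domains = {}
for core, cpu_list in sample.items():
    domains.setdefault(parse_cpu_list(cpu_list), set()).add(core)

for dom in sorted(domains, key=min):
    print(sorted(dom))
```

A scheduler that sees these domains as first-class scheduling units can keep communicating threads inside one of them instead of scattering them across CCDs.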

Implementing a solution

To address this, I developed a strategy to make the scheduler NUMA-aware with respect to L3 cache boundaries. By modifying the topology information that the system presents to the scheduler, I ensured that threads that share data are assigned to cores within the same L3 cache domain. This approach consisted of the following steps:

  • Topology adjustment: adjusting the system's hardware topology information so that the L3 cache boundaries are exposed correctly, causing the scheduler to group threads within the same cache domain.
  • Scheduler configuration: updating the scheduler configuration to prioritize assigning threads to cores within the same L3 cache group, minimizing communication between caches.
  • Validation and testing: performing benchmark tests to compare performance before and after the adjustments. The results showed a clear decrease in job runtime variation and improved overall efficiency.
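The placement step above can be sketched as follows: once the L3 domains are known, a job launcher pins each worker's threads to cores within a single domain. This is a minimal Linux-only illustration using `os.sched_setaffinity`; the hard-coded domain is hypothetical and would in practice come from the topology-discovery step rather than a constant:

```python
import os

def pin_to_domain(domain):
    """Restrict the calling process to the cores in `domain` that are
    actually available to it (intersection with current affinity)."""
    allowed = os.sched_getaffinity(0) & set(domain)
    if not allowed:
        raise ValueError("no cores from the requested domain are available")
    os.sched_setaffinity(0, allowed)
    return allowed

if __name__ == "__main__":
    # Hypothetical first-CCD domain; read this from sysfs/hwloc in practice.
    domain = range(0, 8)
    try:
        print("pinned to cores", sorted(pin_to_domain(domain)))
    except ValueError:
        print("requested domain not available on this machine")
```

Real batch schedulers expose the same idea through configuration (e.g. per-L3 scheduling domains) rather than per-process pinning, but the effect is the same: communicating threads stay behind one shared L3.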

Results and impact

  • Consistent job runtimes: the variation in job completion times decreased significantly, resulting in more predictable workflows.
  • Improved resource utilization: reducing cross-cache communication made CPU and memory usage more efficient.
  • Increased throughput: the total throughput of the compute clusters increased, allowing more simulations to be processed in less time.

This experience underscores the importance of aligning job scheduling strategies with the underlying hardware architecture in HPC environments. By making the scheduler aware of the NUMA topology, and in particular the L3 cache domains, we can optimize performance and leverage the full potential of modern processors such as AMD EPYC.

Conclusion

In high-performance computing, it is crucial to understand the complexity of processor architectures and adapt to them. This case study shows how thoughtful adjustments in job scheduling, based on hardware topology, can lead to significant performance improvements. As HPC systems continue to evolve, these types of optimizations will be essential to achieving efficient and reliable computational workflows.

Stefan Behlen