Minor Inefficiencies, Major Slowdowns
In high-performance computing (HPC), even minor inefficiencies can cause major slowdowns. I recently encountered such a challenge while assisting a semiconductor client who had integrated AMD EPYC-based systems into their global compute clusters. These systems, known for their high core counts and energy efficiency, were expected to accelerate simulation workloads. Instead, unexpected performance issues appeared, particularly in multi-threaded jobs.
Identifying the Bottleneck
The client observed that certain simulation jobs, especially those utilizing 4 to 8 cores, were experiencing inconsistent runtimes. Some tasks completed swiftly, while others lagged, leading to delays in the overall workflow. This inconsistency was puzzling, especially since all clusters had the same hardware and software setup.
Upon investigation, I found that the root cause lay in the interaction between the job scheduler and the Non-Uniform Memory Access (NUMA) and cache topology of the AMD EPYC processors. Specifically, the scheduler was placing threads without regard for the L3 cache boundaries inherent in the EPYC design. Threads that shared data ended up on cores backed by separate L3 caches, so their shared data had to travel between cache domains over the processor interconnect, adding latency.
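To see this in practice, it helps to check where a job’s threads actually land relative to the L3 domains. Below is a minimal diagnostic sketch, assuming a Linux node where the sysfs cache level index3 corresponds to L3 (typical on x86); the target PID is passed on the command line, and the reporting format is purely illustrative.

```python
#!/usr/bin/env python3
"""Diagnostic sketch: report which L3 cache domain each thread of a process
last ran on. Assumes Linux and that cache/index3 is the L3 level."""
import os
import sys

def l3_id_of_cpu(cpu: int) -> str:
    # Each logical CPU exposes the ID of the L3 cache it belongs to.
    with open(f"/sys/devices/system/cpu/cpu{cpu}/cache/index3/id") as f:
        return f.read().strip()

def last_cpu_of_thread(pid: int, tid: int) -> int:
    # Field 39 of /proc/<pid>/task/<tid>/stat is the CPU the thread last ran on.
    with open(f"/proc/{pid}/task/{tid}/stat") as f:
        stat = f.read()
    # The comm field (in parentheses) may contain spaces, so split after ')'.
    fields = stat.rsplit(")", 1)[1].split()
    return int(fields[36])  # fields[0] is field 3 (state), so field 39 is index 36

def main(pid: int) -> None:
    for tid in sorted(int(t) for t in os.listdir(f"/proc/{pid}/task")):
        cpu = last_cpu_of_thread(pid, tid)
        print(f"tid {tid}: last ran on cpu{cpu}, L3 domain {l3_id_of_cpu(cpu)}")

if __name__ == "__main__":
    main(int(sys.argv[1]))
```

When the threads of a single communicating job show up spread across several L3 domains, cross-cache traffic is a likely suspect.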
Understanding the Architecture
AMD EPYC processors use a chiplet-based architecture: each Core Complex Die (CCD) holds a group of cores (a Core Complex, or CCX) that share an L3 cache. This design scales core counts efficiently but complicates memory access patterns. When threads that frequently communicate are assigned to cores on different CCDs, their shared data must move between separate L3 caches over the Infinity Fabric, increasing latency and reducing performance.
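To make the grouping concrete, the short sketch below enumerates the L3 cache domains the Linux kernel reports, showing which logical CPUs sit behind the same cache. It assumes the usual sysfs layout where index3 is the L3 level; tools such as hwloc’s lstopo present the same topology graphically.

```python
#!/usr/bin/env python3
"""Sketch: group logical CPUs by the L3 cache they share, as reported by sysfs.
Assumes a Linux system where cache/index3 is the L3 level."""
import glob

def main() -> None:
    domains = set()
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"):
        with open(path) as f:
            # Every CPU in a given L3 domain reports the same list, e.g. "0-7,64-71".
            domains.add(f.read().strip())
    for i, members in enumerate(sorted(domains)):
        print(f"L3 domain {i}: CPUs {members}")

if __name__ == "__main__":
    main()
```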
Implementing a Solution
To address this, I developed a strategy to make the scheduler aware not only of NUMA nodes but also of the L3 cache boundaries within them. By adjusting the topology information presented to the scheduler, I ensured that threads sharing data were assigned to cores within the same L3 cache domain. The approach involved:
- Topology Adjustment: Modifying the system’s hardware topology information to reflect the L3 cache boundaries accurately. This change guided the scheduler to group threads within the same cache domain.
- Scheduler Configuration: Updating the scheduler’s configuration to prioritize placing a job’s threads on cores within the same L3 cache group, reducing cross-cache communication (a minimal affinity sketch follows this list).
- Validation and Testing: Running benchmark tests to compare performance before and after the changes. The results showed a significant reduction in job runtime variability and improved overall efficiency.
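The exact configuration change depends on the batch scheduler in use, so here is a scheduler-agnostic sketch of the underlying idea: discover the L3 domains from sysfs, pick one, and confine a group of cooperating workers to it with os.sched_setaffinity so that threads sharing data stay behind the same cache. The worker function and the choice of the first domain are illustrative placeholders, not the client’s actual setup.

```python
#!/usr/bin/env python3
"""Sketch: confine a group of cooperating workers to a single L3 cache domain.
Scheduler-agnostic illustration; worker() and the chosen domain are placeholders."""
import glob
import multiprocessing as mp
import os

def parse_cpu_list(text: str) -> set[int]:
    """Expand a sysfs CPU list like '0-7,64-71' into a set of CPU numbers."""
    cpus = set()
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def l3_domains() -> list[set[int]]:
    """Return the logical-CPU sets that share an L3 cache (Linux sysfs)."""
    seen = {}
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"):
        with open(path) as f:
            key = f.read().strip()
        seen[key] = parse_cpu_list(key)
    return list(seen.values())

def worker(domain: set[int], rank: int) -> None:
    # Restrict this process (and any threads it spawns) to one L3 domain.
    os.sched_setaffinity(0, domain)
    print(f"worker {rank} pinned to CPUs {sorted(domain)}")
    # ... real simulation work would run here ...

if __name__ == "__main__":
    domain = l3_domains()[0]  # illustrative: just take the first L3 domain
    procs = [mp.Process(target=worker, args=(domain, r)) for r in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

In production this logic lives in the scheduler’s task-affinity layer rather than in application code, but the effect is the same: each communicating group of threads is kept within one L3 cache domain.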
Results and Impact
- Consistent Job Runtimes: The variability in job completion times decreased markedly, leading to more predictable workflows.
- Improved Resource Utilization: By reducing cross-cache communication, the system achieved better CPU and memory utilization.
- Enhanced Throughput: The overall throughput of the compute clusters increased, allowing more simulations to be processed in less time.
This experience underscores the importance of aligning job scheduling strategies with the underlying hardware architecture in HPC environments. By making the scheduler aware of the NUMA topology, particularly the L3 cache domains, we can optimize performance and fully leverage the capabilities of modern processors like AMD EPYC.
Conclusion
In high-performance computing, understanding and adapting to the intricacies of processor architectures is crucial. This case highlights how thoughtful adjustments to job scheduling, informed by hardware topology, can lead to substantial performance gains. As HPC systems continue to evolve, such optimizations will be vital in achieving efficient and reliable computational workflows.