Timing Method and Evaluation Metrics for CPU Performance Variation Detection
Performance variation manifests as inconsistent run times on the same hardware or as periods of mismatched performance across identical hardware. CPU performance variation is one of the most harmful and insidious causes of performance degradation: even a tiny variation can degrade the overall performance of a supercomputer. CPU performance variation detection currently faces two challenges. First, tiny processor performance variations are difficult to identify with existing profiling tools such as PAPI. Such variations can be as small as the nanosecond level, so detecting them accurately requires timing methods with high precision and sensitivity. However, researchers have found that tools such as PAPI and LIKWID, when used for timing measurements in real applications, incur large overheads and fluctuations that can reach tens of thousands of cycles, which makes nanosecond-level run-time changes difficult to capture. Second, existing methods struggle to objectively evaluate the performance variation detection capabilities of different tools. A single performance variation detection consists of thousands of timing results, whose distribution reflects variations in the run time of the tested code, fluctuations in the overhead of the timing method itself, and the impact of the timing operation on the tested code. Current methods cannot determine whether the timing results truly reflect the distribution of the code's run time. To address the first problem, this study first focused on PAPI, currently the most commonly used performance measurement tool. By simulating the cache environment of real applications, we measured and analyzed PAPI's timing fluctuations under different cache states for the first time. The experimental results showed that, even when measuring the run time of an unchanging computation, PAPI's measurements exhibited significant long-tail deviations. Performance counter analysis attributed PAPI's timing fluctuations mainly to timing overhead, operating system noise, out-of-order execution, and cache misses. This study then designed a serialized barrier timing method, based on the memory barrier and serializing instructions of the x86 and Armv8 instruction sets, that suppresses timing fluctuations. In comparative experiments, the amplitude of its timing fluctuations was significantly lower than that of PAPI. To address the second problem, this study combined experiments and modeling to analyze, qualitatively and quantitatively, the sources of instability in timing fluctuations and their impact on measured values. For the first time, this paper proposes cross-platform precision and sensitivity indicators for timing methods, along with evaluation methods aimed at detecting processor performance variation. This paper suggests that, in performance variation detection, the shorter the time that can be accurately measured, the higher the precision, and the smaller the amplitude of performance variation that can be accurately distinguished, the higher the sensitivity. The precision and sensitivity indicators quantify a timing method's ability to measure minute time fluctuations, thereby providing a basis for detecting and confirming performance variation. According to our evaluations on the Intel Xeon 6248 and Huawei Kunpeng 920-6426 processors, the serialized barrier timing method was 2.2 to 30.2 times more precise and 1.9 to 44.8 times more sensitive than PAPI, and was able to detect nanosecond-level performance variation.
high performance computing; microarchitecture; performance variation; performance analysis; performance evaluation
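
The abstract does not spell out the instruction sequence behind the serialized barrier timing method, so the following is only a minimal sketch of the general idea of fencing a counter read, not the authors' implementation. It assumes GCC or Clang on x86-64 or AArch64 Linux; the helper names (`tick_begin`, `tick_end`, `fixed_work`, `measure`) and the particular choice of fences are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>

/* Assumed x86-64 sequence: fence, then read the time-stamp counter.
   The TSC ticks at a fixed rate, which need not equal the core clock. */
static inline uint64_t tick_begin(void) {
    _mm_mfence();              /* order earlier loads and stores */
    _mm_lfence();              /* earlier instructions complete before the read */
    uint64_t t = __rdtsc();
    _mm_lfence();              /* the read completes before the timed code starts */
    return t;
}

static inline uint64_t tick_end(void) {
    unsigned int aux;
    uint64_t t = __rdtscp(&aux); /* waits until earlier instructions have executed */
    _mm_lfence();                /* later instructions do not start before the read */
    return t;
}

#elif defined(__aarch64__)

/* Assumed Armv8 sequence: barriers around a read of the generic timer
   virtual counter, which ticks at the system counter frequency. */
static inline uint64_t tick_begin(void) {
    uint64_t t;
    __asm__ volatile("dsb sy\n\tisb\n\tmrs %0, cntvct_el0\n\tisb"
                     : "=r"(t) : : "memory");
    return t;
}

static inline uint64_t tick_end(void) {
    uint64_t t;
    __asm__ volatile("dsb sy\n\tisb\n\tmrs %0, cntvct_el0\n\tisb"
                     : "=r"(t) : : "memory");
    return t;
}

#else
#error "This sketch covers x86-64 and AArch64 only."
#endif

/* Hypothetical fixed workload: the same computation on every call. */
static uint64_t fixed_work(uint64_t iters) {
    volatile uint64_t acc = 0;
    for (uint64_t i = 0; i < iters; i++) {
        acc += i;
    }
    return acc;
}

typedef struct {
    uint64_t min; /* shortest observed duration, in counter ticks */
    uint64_t max; /* longest observed duration, in counter ticks */
} tick_range;

/* Time the fixed workload `reps` times and record the observed range. */
static tick_range measure(uint64_t iters, size_t reps) {
    assert(reps > 0);
    tick_range r = { UINT64_MAX, 0 };
    for (size_t k = 0; k < reps; k++) {
        uint64_t t0 = tick_begin();
        fixed_work(iters);
        uint64_t t1 = tick_end();
        uint64_t d = t1 - t0;
        if (d < r.min) r.min = d;
        if (d > r.max) r.max = d;
    }
    return r;
}
```

Calling `measure(100000, 1000)` and comparing `max - min` against `min` gives a rough view of how far repeated timings of an unchanging computation spread. The sketch does not pin the thread to a core or control cache state, both of which a careful measurement would need to address.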