Review on the Development and Application of Checkpointing Technology in High-performance Computing
As high-performance computers grow in size and complexity, application fault tolerance has become one of the key challenges facing exascale computing. Checkpointing is one of the main means of achieving application fault tolerance: it enables fault recovery by periodically saving the execution state of an application. This paper reviews the development and application of checkpointing techniques in high-performance computing. First, the development of checkpointing technology in the field of high-performance computing is summarized. Then, system-level and application-level checkpointing work is described according to the level at which it operates, including the mainstream software tools, the available checkpointing techniques, and the application scenarios. Next, the application of checkpointing technology is discussed in four areas: fault tolerance and resilience in parallel computing, scheduling and migration in HPC, FPGA debugging, and fault tolerance and faithful replay in deep learning. Finally, further research directions for checkpointing technology in high-performance computing are proposed.