容错一直是高性能计算领域的热点和难点问题.检查点是解决容错问题的一种常用技术手段,它能够将运行进程的状态转储成文件并恢复.容器具有较强的资源隔离能力,可以为检查点技术提供更理想的运行环境与载体,避免迁移后任务在节点变更的情况下由于环境与资源变化而出现异常.因此,容器和检查点相结合能够更好地支撑任务迁移的研究与实现.文中围绕基于CRIU(Checkpoint/Restore In Userspace)的Singularity容器检查点方案的设计和优化展开,根据检查点技术在高性能计算容器应用中的特点,在CRIU安全使用、迁移性能优化、保持网络状态方面给出了有效的解决方案,基于这些方案拓展了 Singularity容器检查点功能,并且实现了原型工具Migrator来验证容器迁移性能.期望本工作能为后续实现高性能计算任务迁移提供有效的支撑.
Study on High Performance Computing Container Checkpoint Technology Based on CRIU
Fault tolerance has long been an active and difficult problem in high performance computing. Checkpointing is a common fault-tolerance technique that dumps the state of a running process to files and later restores the process from them. Containers offer strong resource isolation, so they provide a more suitable runtime environment and carrier for checkpointing and help prevent a migrated task from failing because the environment and resources differ on the new node. Combining containers with checkpointing can therefore better support the research and implementation of task migration. This paper focuses on the design and optimization of a checkpointing scheme for Singularity containers based on CRIU (Checkpoint/Restore In Userspace). Based on the characteristics of checkpointing in HPC container applications, it presents effective solutions for the safe use of CRIU, migration performance optimization, and preservation of network state. Building on these solutions, the paper extends Singularity with checkpointing functionality and implements a prototype tool, Migrator, to verify container migration performance. This work is expected to support the subsequent implementation of HPC task migration.