Reliability scheme design of a K8s-based space-based cloud platform
He Yu¹, Wu Qi², An Junshe¹
Author affiliations
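- 1. Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049
- 2. Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190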
Abstract
When a ground-based cloud platform is moved into space, frequent single event effects (SEEs) severely degrade its reliability. Targeting a specific spacecraft mission, this paper studies a task fault-tolerance scheme based on triple modular redundancy (TMR) for a Kubernetes (K8s) cloud platform. Because the on-board computer has a limited power budget, two redundancy schemes, one based on traditional TMR and one based on time TMR, were designed and implemented with both power consumption and the real-time constraints of tasks taken into account. To meet core-level fault-recovery requirements, and because a single satellite carries only a few nodes, a core allocation function was implemented by modifying the K8s source code. Experimental results show that the mechanism effectively tolerates errors caused by single event upsets (SEUs), supports core-level error recovery, and uses core-level redundancy to provide task fault tolerance, all with a small performance overhead.
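The abstract contrasts two redundancy schemes without detailing them, so the following is a minimal sketch of the general idea under stated assumptions; it is not the authors' implementation and uses no Kubernetes API. Traditional (spatial) TMR runs three replicas at the same time, typically on three cores, and votes on their outputs, so latency stays close to one run while peak load is roughly tripled. Time TMR reruns the task three times on one core, keeping peak load low at the cost of roughly three times the completion time. The names `majority_vote`, `run_spatial_tmr`, `run_time_tmr` and `allocate_cores` are hypothetical, Python threads merely stand in for per-core replicas, and the core allocator is a simplified illustration of giving each replica a distinct healthy core.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Hashable, Optional, Sequence, TypeVar

T = TypeVar("T", bound=Hashable)


def majority_vote(results: Sequence[T]) -> Optional[T]:
    """Return the value produced by at least 2 of 3 replicas, or None if all disagree."""
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None


def run_spatial_tmr(task: Callable[[], T]) -> Optional[T]:
    """Traditional (spatial) TMR: three replicas run concurrently, then vote.

    Threads only stand in for per-core replicas here; real core pinning
    would be done by the platform (assumption, not shown).
    """
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(task) for _ in range(3)]
        return majority_vote([f.result() for f in futures])


def run_time_tmr(task: Callable[[], T]) -> Optional[T]:
    """Time TMR: the same task runs three times in sequence on one core, then votes."""
    return majority_vote([task() for _ in range(3)])


def allocate_cores(healthy: Sequence[int], replicas: int) -> list[int]:
    """Hypothetical core allocator: assign each replica a distinct healthy core."""
    if len(healthy) < replicas:
        raise RuntimeError("not enough healthy cores for the requested redundancy")
    return list(healthy[:replicas])


# Usage: simulate a single event upset that flips bit 2 in the second run.
runs = iter([42, 42 ^ 0x4, 42])  # 42 ^ 0x4 == 46
print(run_time_tmr(lambda: next(runs)))  # → 42
# After core 1 is marked faulty, the replicas are placed on the remaining cores.
print(allocate_cores([0, 2, 3, 5], replicas=3))  # → [0, 2, 3]
```

The voter masks any single corrupted result, which is why both schemes tolerate one upset per task; they differ only in whether the three runs are spread across cores (time-efficient) or across time on one core (power-efficient).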
Keywords
task redundancy/triple modular redundancy/fault tolerance/cloud computing/Docker/Kubernetes/space-based cloud
Funding
National Key Research and Development Program of China (2022YFF0503900)
Publication year
2024