LRCRaft: Consensus Protocol with Rapid Node Data Recovery Support

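Abstract: In distributed storage systems that support erasure coding, the most commonly used code is the Reed-Solomon (RS) code. For an RS(k, m) coded stripe, the usual configuration has each node store only one fragment of the stripe. When a node fails, recovering its fragments therefore requires reading fragments from multiple other nodes and re-encoding them to produce the recovered fragment, which can easily congest the system network. When a large amount of data has to be recovered, the system stays in a fragile state for a long time during recovery, and reduced fault tolerance, lower throughput, and higher read/write latency are common. LRCRaft is an improved Raft consensus protocol based on the local reconstruction code (LRC). By introducing LRC, dynamic log replenishment, state machine purge, and fragment version consistency into Raft, it reduces Raft's read/write latency and shortens node failure recovery time. Experimental results show that, compared with Raft, LRCRaft reduces the time needed to recover the data of a single failed node by 49.25% to 74.97% across different recovery modes.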

English title: LRCRaft: Consensus Protocol with Rapid Node Data Recovery Support

English abstract: Reed-Solomon (RS) codes are the most widely adopted codes in distributed storage systems that support erasure coding. For an RS(k, m) coded stripe, a common approach is to store one fragment per node. This approach can cause network congestion when a node fails, since the system must read fragments from multiple nodes before it can decode and rebuild the lost data. When a large amount of data is being recovered, the system remains in a fragile state for a long time. During this period, it frequently suffers from lower fault tolerance, lower throughput, and higher read/write latency. LRCRaft is an optimized version of Raft based on the local reconstruction code (LRC). By introducing LRC, dynamic log replenishment, state machine purge, and fragment version consistency into Raft, LRCRaft reduces read/write latency and the time needed for node failure recovery. Our experimental results indicate that, compared with Raft, LRCRaft reduces the time to recover a single failed node by 49.25% to 74.97% across different recovery modes.
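The repair-cost argument in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the stripe layout (k = 6 data fragments in two local groups of 3), the use of plain XOR as the local parity, and all function names are assumptions chosen for illustration, and the global parities that a full LRC also carries are omitted. The point it shows is that a single lost fragment is rebuilt from its local group only, whereas an RS(k, m) stripe would need k fragments read from other nodes.

```python
from functools import reduce

def xor_bytes(blocks):
    """XOR a list of equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def encode_local_parities(data, group_size):
    """Split data fragments into local groups; one XOR parity per group."""
    groups = [data[i:i + group_size] for i in range(0, len(data), group_size)]
    return [xor_bytes(g) for g in groups]

def repair_fragment(data, parities, lost, group_size):
    """Rebuild data[lost] using only its local group.

    Returns the rebuilt fragment and the number of fragments read.
    """
    g = lost // group_size
    start = g * group_size
    end = min(start + group_size, len(data))
    survivors = [data[i] for i in range(start, end) if i != lost]
    rebuilt = xor_bytes(survivors + [parities[g]])
    return rebuilt, len(survivors) + 1

# Hypothetical stripe: k = 6 data fragments, local groups of 3.
k, group_size = 6, 3
data = [bytes([i + 1] * 4) for i in range(k)]
parities = encode_local_parities(data, group_size)

rebuilt, reads = repair_fragment(data, parities, lost=4, group_size=group_size)
print(rebuilt == data[4], reads, "fragments read vs", k, "for an RS repair")
```

In this layout the repair touches 3 fragments (two surviving group members plus the local parity) instead of 6, which is the kind of reduction in cross-node reads that motivates using LRC for recovery.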

English keywords:

distributed storage; Raft consensus protocol; erasure coding; local reconstruction code (LRC); node data recovery

Authors:

袁佳正 (Yuan Jiazheng), 胡晓鹏 (Hu Xiaopeng)

Affiliation:

School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China

Keywords:

distributed storage; Raft consensus protocol; erasure coding; local reconstruction code (LRC); node data recovery

Funding:

Natural Science Foundation of Hebei Province

Grant number:

F2022105033

Publication year:

2024

DOI:

10.15888/j.cnki.csa.009581

Computer Systems & Applications (计算机系统应用)

Institute of Software, Chinese Academy of Sciences

CSTPCD

Impact factor: 0.449

ISSN: 1003-3254

Year, volume (issue): 2024, 33(7)