LRCRaft: Consensus Protocol with Rapid Node Data Recovery Support

Reed-Solomon (RS) codes are the most widely used erasure codes in distributed storage systems that support erasure coding. For an RS(k, m) coded stripe, a common configuration stores exactly one fragment of the stripe on each node. As a result, when a node fails, recovering its fragment requires reading fragments from multiple other nodes and re-encoding them to regenerate the lost fragment, which can easily congest the network. When a large amount of data must be recovered, the system remains in a vulnerable state for a long time, during which reduced fault tolerance, lower throughput, and higher read/write latency often occur. LRCRaft is an improved Raft consensus protocol based on local reconstruction codes (LRC). By introducing LRC, dynamic log replenishment, state machine purging, and fragment version consistency into Raft, LRCRaft reduces Raft's read/write latency and shortens node failure recovery time. Experimental results show that, compared with Raft, LRCRaft reduces the time to recover the data of a single failed node by 49.25% to 74.97% across different recovery modes.
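The recovery-cost argument in the abstract can be illustrated with a minimal sketch. It is not LRCRaft's encoding: the XOR local parity, the group layout (k = 6 data fragments in two local groups of three), and the names `local_parity` and `repair_fragment` are all assumptions made for illustration, and global parities are omitted. The point it shows is only that a local repair reads the survivors of one group plus its local parity, whereas an RS(k, m) repair would read k fragments.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))


def local_parity(group: list[bytes]) -> bytes:
    """XOR parity over one local group (illustrative choice, not from the paper)."""
    return reduce(xor_bytes, group)


def repair_fragment(group: list[Optional[bytes]], parity: bytes) -> tuple[bytes, int]:
    """Rebuild the single missing fragment (marked None) of a local group.

    Returns the rebuilt fragment and the number of fragments that were read.
    """
    survivors = [f for f in group if f is not None]
    if len(survivors) != len(group) - 1:
        raise ValueError("local repair handles exactly one missing fragment")
    # XOR of the local parity with all surviving fragments yields the lost one.
    rebuilt = reduce(xor_bytes, survivors, parity)
    return rebuilt, len(survivors) + 1  # surviving data fragments + local parity


# Hypothetical layout: k = 6 data fragments split into 2 local groups of 3.
data = [bytes([i] * 4) for i in range(1, 7)]
groups = [data[0:3], data[3:6]]
parities = [local_parity(g) for g in groups]

# Simulate losing the second fragment of group 0 and repair it locally.
damaged = [groups[0][0], None, groups[0][2]]
rebuilt, reads = repair_fragment(damaged, parities[0])
```

Under these assumptions the repair touches 3 fragments instead of the 6 an RS(6, m) stripe would need, which is the kind of reduction in cross-node reads that motivates using LRC for faster node recovery.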

distributed storage; Raft consensus protocol; erasure coding; local reconstruction code (LRC); node data recovery

Yuan Jiazheng (袁佳正), Hu Xiaopeng (胡晓鹏)

School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China

Natural Science Foundation of Hebei Province

F2022105033

2024

Computer Systems & Applications
Institute of Software, Chinese Academy of Sciences

CSTPCD
Impact factor: 0.449
ISSN:1003-3254
Year, volume (issue): 2024, 33(7)