
Research Progress on In-Network Gray Failure Detection Techniques Based on Programmable Switches

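A gray failure is a switch fault that has only a subtle impact on a production network. However, when such minor faults compound one another or coincide with newly introduced faults, they can bring down the entire production network. Detecting gray failures is therefore critical to the stability of production networks. Traditional solutions collect flow records from data plane switches in the control plane and process them there to detect gray failures. Such solutions have two shortcomings: (1) caching and processing large volumes of flow records incurs substantial resource overhead; (2) high detection latency means that gray failures cannot be detected in a timely manner. In recent years, the emergence of programmable switches has created new opportunities for gray failure detection: network administrators can deploy gray failure detection algorithms on the line-rate ASIC pipeline of a programmable switch, enabling in-network gray failure detection with low overhead, low latency, and high accuracy. This paper surveys in-network gray failure detection techniques based on programmable switches. After describing the concept of gray failures, their prevalence, and the harm they cause to production networks, it analyzes and discusses the state of research and progress of existing techniques, details the working principle and workflow of each technique, builds a real experimental platform to evaluate the detection metrics of each technique, and closes by identifying the problems and challenges that existing techniques face.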
Empowering In-Network Gray Failure Detection with Programmable Switches
Gray failures are minor switch malfunctions that have a subtle impact on production networks. However, when these minor malfunctions are superimposed on each other or on a new malfunction, they can paralyze production networks. Thus, detecting gray failures is essential to the stability of production networks. Prior methods focus on using the control plane to collect flow records from data plane switches and process them to detect packet loss. However, they fall short due to (1) the high resource overhead of handling massive flow records and (2) non-trivial delays that result in out-of-date failure detection. Recently, the emergence of programmable switches has provided a promising alternative: the detection of gray failures can be offloaded to line-rate switch ASIC pipelines, enabling low-cost, low-latency, and high-accuracy in-network gray failure detection. This paper presents an illustrative survey of programmable switch-assisted techniques for in-network gray failure detection. First, we describe the concept of gray failures, their prevalence, and their impact on production networks. Second, we analyze and discuss the characteristics of state-of-the-art gray failure detection techniques built on programmable switches. Third, we illustrate the principle and workflow of each detection technique. Fourth, we build a real-world testbed to evaluate the detection metrics of each technique. Finally, we highlight the problems and challenges faced by existing techniques.

gray failure detection; programmable switches; in-network computing; network measurement; packet loss; datacenter networks

Liu Hongyan (刘宏岩), Zhang Dong (张栋), Wu Chunming (吴春明)


College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang 310063

College of Computer and Data Science, Fuzhou University, Fuzhou, Fujian 350108

gray failure detection; programmable switch; in-network computing; network measurement; datagram loss; data center network

"Pioneer" and "Leading Goose" R&D Program of Zhejiang Province

2024C01066

2024

Acta Electronica Sinica (电子学报)
Chinese Institute of Electronics


Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 1.237
ISSN: 0372-2112
Year, volume (issue): 2024, 52(10)