Empowering In-Network Gray Failure Detection with Programmable Switches
Gray failures are minor switch malfunctions that have only a subtle impact on production networks. However, when these minor malfunctions are superimposed on each other or on a new malfunction, they can paralyze production networks. Thus, detecting gray failures is essential to the stability of production networks. Prior methods focus on using the control plane to collect flow records from data plane switches and process them to detect packet loss. However, they fall short due to (1) the high resource overhead of handling massive flow records and (2) non-trivial delays that result in out-of-date failure detection. Recently, the emergence of programmable switches has provided a promising alternative: the detection of gray failures can be offloaded to line-rate switch ASIC pipelines, enabling low-cost, low-latency, and high-accuracy in-network gray failure detection. This paper presents an illustrative survey of programmable switch-assisted techniques for in-network gray failure detection. First, we describe the concept of gray failures, their prevalence, and their impact on production networks. Second, we analyze and discuss the characteristics of state-of-the-art gray failure detection techniques built on programmable switches. Third, we illustrate the principle and workflow of each detection technique. Fourth, we build a real-world testbed to evaluate the metrics of each detection technique. Finally, we highlight the problems and challenges faced by existing techniques.