Evaluating Spectrum-Based Fault Localization on Deep Learning Libraries
IEEE
Deep learning (DL) libraries have become increasingly popular, and their quality assurance is gaining significant attention. Although many fault detection techniques have been proposed, effective fault localization techniques tailored to DL libraries are scarce. Because of the unique characteristics of DL libraries (e.g., a complicated code architecture that supports DL model training and inference with extensive multidimensional tensor calculations), the effectiveness of existing fault localization techniques for traditional software on DL library faults remains unknown. To bridge this gap, we conducted the first empirical study to investigate the effectiveness of fault localization on DL libraries. Specifically, we evaluated spectrum-based fault localization (SBFL) because of its high generalizability and affordable overhead on such complicated libraries. Based on the key aspects of SBFL, our study investigated the effectiveness of SBFL with different sources of passing test cases (human-written, fuzzer-generated, and mutation-based test cases) and various suspiciousness calculation methods. In particular, the mutation-based test cases are produced by a rule-based mutation technique and an LLM-based mutation technique that we designed for DL library faults. To enable our extensive study, we built the first benchmark (Defects4DLL), which contains 120 real-world faults in PyTorch and TensorFlow with easy-to-use experimental environments. Our study delivered a series of useful findings. For example, the rule-based approach is effective in localizing crash faults in DL libraries, successfully localizing 44.44% of crash faults within Top-10 functions and 74.07% of crash faults within Top-10 files, while the passing test cases from DL library fuzzers perform poorly on this task.
Furthermore, based on our findings on the complementarity of different sources, we designed a hybrid technique that integrates human-written, LLM-mutated, and rule-based mutated test cases, which further achieves improvements of 31.48% to 61.36% over each single source in terms of the number of detected faults within Top-5 files.
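
For readers unfamiliar with SBFL, the following is a minimal sketch of how a suspiciousness score and a Top-N check can be computed from test coverage. It is an illustration under stated assumptions, not the study's implementation: the abstract does not name the formulas it evaluated, so the widely used Ochiai formula is assumed here, and the test names, the function names in the coverage data (`conv2d`, `pad`, `dispatch`, `relu`), and the helpers `rank_elements` and `in_top_n` are all hypothetical.

```python
import math
from collections import defaultdict


def ochiai(ef, ep, total_failed):
    """Ochiai score: ef / sqrt(total_failed * (ef + ep)).

    ef / ep = number of failing / passing tests that cover the element.
    """
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0


def rank_elements(coverage, outcomes):
    """Rank code elements from most to least suspicious.

    coverage: test name -> set of covered elements (e.g. function names)
    outcomes: test name -> True if the test passed, False if it failed
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    ef = defaultdict(int)
    ep = defaultdict(int)
    for test, covered in coverage.items():
        for element in covered:
            if outcomes[test]:
                ep[element] += 1
            else:
                ef[element] += 1
    elements = set(ef) | set(ep)
    scores = {e: ochiai(ef[e], ep[e], total_failed) for e in elements}
    # Highest score first; ties are broken by name to keep the order stable.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def in_top_n(ranking, faulty_element, n):
    """True if the known faulty element appears among the first n entries."""
    return faulty_element in [element for element, _ in ranking[:n]]


# Hypothetical spectrum: one failing (crash-triggering) test and two passing
# tests, such as mutated variants of the failing test that no longer crash.
coverage = {
    "failing": {"conv2d", "pad", "dispatch"},
    "pass_a": {"dispatch", "pad"},
    "pass_b": {"dispatch", "relu"},
}
outcomes = {"failing": False, "pass_a": True, "pass_b": True}

ranking = rank_elements(coverage, outcomes)
```

In this toy spectrum, `conv2d` is covered only by the failing test, so it ranks first. Elements that are also covered by passing tests score lower, which is why the source and quality of the passing test cases matter for localization.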