Journal Information
IEEE Transactions on Software Engineering
Institute of Electrical and Electronics Engineers
ISSN: 0098-5589
Indexed in: SCI, ISTP
Publication status: officially published

    An Automated Approach to Discovering Software Refactorings by Comparing Successive Versions

    Bo Liu, Hui Liu, Nan Niu, Yuxia Zhang, et al.
    pp. 1358-1380
    Abstract: Software developers and maintainers frequently conduct software refactorings to improve software quality. Identifying the conducted refactorings may significantly aid the comprehension of software evolution, and thus facilitate software maintenance and evolution. Besides that, the identified refactorings are also valuable for data-driven approaches in software refactoring. To this end, researchers have proposed a few approaches to identifying software refactorings automatically. However, the performance (especially precision) of such approaches leaves substantial room for improvement. In this paper, we therefore propose a novel refactoring detection approach, called ReExtractor+. At the heart of ReExtractor+ are a reference-based entity matching algorithm that matches coarse-grained code entities (e.g., classes and methods) between two successive versions, and a context-aware statement matching algorithm that matches statements within a pair of matched methods. We evaluated ReExtractor+ on a benchmark consisting of 400 commits from 20 real-world projects. The evaluation results suggested that ReExtractor+ significantly outperformed the state of the art in refactoring detection, reducing the number of false positives by 57.4% and improving recall by 18.4%. We also evaluated the performance of the proposed matching algorithms that serve as the cornerstone of refactoring detection. The evaluation results suggested that the proposed algorithms excel in matching code entities, substantially reducing the number of mistakes (false positives plus false negatives) by 67% compared to the state-of-the-art approaches.
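The cross-version entity matching the abstract describes can be illustrated with a toy sketch: greedily pairing each old-version method signature with its most similar new-version counterpart. Everything here (the `similarity` measure, the threshold, the example signatures) is invented for illustration; ReExtractor+'s actual reference-based algorithm is more sophisticated.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Textual similarity between two code-entity signatures."""
    return SequenceMatcher(None, a, b).ratio()

def match_entities(old: list[str], new: list[str], threshold: float = 0.6):
    """Greedily pair each old-version entity with its most similar
    unused new-version entity; a toy stand-in for cross-version matching."""
    matches, used = [], set()
    for o in old:
        best, best_score = None, threshold
        for n in new:
            if n in used:
                continue
            s = similarity(o, n)
            if s > best_score:
                best, best_score = n, s
        if best is not None:
            used.add(best)
            matches.append((o, best))
    return matches

old = ["int sum(int[] xs)", "void print(String s)"]
new = ["long sum(int[] xs)", "void log(String msg)"]
print(match_entities(old, new))
```

A real detector would then classify each matched or unmatched pair (renamed, moved, extracted, etc.); this sketch only shows the matching step that precedes classification.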

    When Crypto Fails: Demystifying Cryptographic Defects in Ethereum Smart Contracts

    Jiashuo Zhang, Jiachi Chen, Yiming Shen, Tao Zhang, et al.
    pp. 1381-1398
    Abstract: Ethereum has officially provided a set of system-level cryptographic APIs to enhance smart contracts with cryptographic capabilities. These APIs have been utilized in over 13.8% of Ethereum transactions, motivating developers to implement various on-chain cryptographic tasks, such as digital signatures. However, since developers may not always be cryptographic experts, their ad-hoc and potentially defective implementations could compromise the theoretical guarantees of cryptography, leading to real-world security issues. To mitigate this threat, we conducted a comprehensive study aimed at demystifying and detecting cryptographic defects in smart contracts. Through the analysis of 3,762 real-world security reports, we defined 12 types of cryptographic defects in smart contracts with detailed descriptions and practical detection patterns. Based on this categorization, we proposed CryptoScan, the first static analyzer to automate the pre-deployment detection of cryptographic defects in smart contracts. CryptoScan utilizes cross-contract and inter-procedure static analysis to identify crypto-related execution paths and employs taint analysis to extract fine-grained crypto-specific semantics for defect detection. Furthermore, we collected a large-scale dataset containing 79,598 real-world crypto-related smart contracts and evaluated CryptoScan's effectiveness on it. The results demonstrated that CryptoScan achieves an overall precision of 96.1% and a recall of 93.3%. Notably, CryptoScan revealed that 19,707 (24.8%) out of 79,598 smart contracts contain at least one cryptographic defect. Although not all defects directly cause financial losses, they indicate prevalent non-standard cryptographic implementations that should be addressed in real-world practices.

    Evaluating Spectrum-Based Fault Localization on Deep Learning Libraries

    Ming Yan, Junjie Chen, Tianjie Jiang, Jiajun Jiang, et al.
    pp. 1399-1414
    Abstract: Deep learning (DL) libraries have become increasingly popular, and their quality assurance is gaining significant attention. Although many fault detection techniques have been proposed, effective fault localization techniques tailored to DL libraries are scarce. Due to the unique characteristics of DL libraries (e.g., a complicated code architecture supporting DL model training and inference with extensive multidimensional tensor calculations), it is also unknown whether existing fault localization techniques for traditional software remain effective on DL library faults. To bridge this gap, we conducted the first empirical study to investigate the effectiveness of fault localization on DL libraries. Specifically, we evaluated spectrum-based fault localization (SBFL) due to its high generalizability and affordable overhead on such complicated libraries. Based on the key aspects of SBFL, our study investigated the effectiveness of SBFL with different sources of passing test cases (including human-written, fuzzer-generated, and mutation-based test cases) and various suspicious value calculation methods. In particular, mutation-based test cases are produced by our rule-based mutation technique and our LLM-based mutation technique, both tailored to DL library faults. To enable our extensive study, we built the first benchmark (Defects4DLL), which contains 120 real-world faults in PyTorch and TensorFlow with easy-to-use experimental environments. Our study delivered a series of useful findings. For example, the rule-based approach is effective in localizing crash faults in DL libraries, successfully localizing 44.44% of crash faults within Top-10 functions and 74.07% within Top-10 files, while the passing test cases from DL library fuzzers perform poorly on this task.
    Furthermore, based on our findings on the complementarity of different sources, we designed a hybrid technique that effectively integrates human-written, LLM-based mutated, and rule-based mutated test cases, which further achieves 31.48%-61.36% improvements over each single source in terms of the number of detected faults within Top-5 files.
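The "suspicious value calculation methods" compared in such studies are SBFL ranking formulas that score each code element from its coverage spectrum. Below is a minimal sketch of one standard formula, Ochiai; the element names and spectrum counts are made up for illustration and are not from the paper's benchmark.

```python
import math

def ochiai(ef: int, ep: int, nf: int, np_: int) -> float:
    """Ochiai suspiciousness. ef/ep = failing/passing tests that cover
    the element; nf/np_ = failing/passing tests that do not cover it."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Coverage spectra per code element: (ef, ep, nf, np_); values invented.
spectra = {
    "tensor_add": (3, 1, 0, 6),   # covered by all failing tests
    "reshape":    (1, 4, 2, 3),
    "matmul":     (0, 5, 3, 2),   # never covered by a failing test
}
ranked = sorted(spectra, key=lambda e: ochiai(*spectra[e]), reverse=True)
print(ranked)  # most suspicious element first
```

Elements covered by many failing and few passing tests rank highest, which is why the quality of the passing-test pool (human-written, fuzzer-generated, or mutation-based) directly affects the ranking.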

    Understanding and Identifying Technical Debt in the Co-Evolution of Production and Test Code

    Yimeng Guo, Zhifei Chen, Lu Xiao, Lin Chen, et al.
    pp. 1415-1436
    Abstract: The co-evolution of production and test code (PT co-evolution) has received increasing attention in recent years. However, we found that existing work did not comprehensively study various PT co-evolution scenarios, such as the quantification and persistence of their effects on software. Inspired by technical debt (TD), we refer to TD generated during the co-evolution between production and test code as PT co-evolution technical debt (PTCoTD). To better understand PT co-evolution, we first conducted an exploratory study of its characteristics on 15 open-source projects, finding that unbalanced PT co-evolution is prevalent and summarizing five potential PT flaws. Then we proposed an approach to identify and quantify PTCoTDs of these flaw patterns, considering evolutionary and structural relationships. We also built prediction models to describe cost trajectories and rank all PTCoTDs to prioritize expensive ones. The evaluation on the 15 projects shows that our approach can identify PTCoTDs that deserve attention. The identified PTCoTDs account for about half of a project's total maintenance costs, and the cost proportion of the Top-5 most expensive is 1.8x more than the file proportion they contain. Almost all covered maintenance costs persist as PTCoTD in the future, with an average increase of 6.8% between the last two releases. Our approach also accurately predicts the costs of PTCoTD, with an average prediction deviation of only 8.3%. Our study provides valuable insights into PT co-evolution scenarios and their effects, which can guide practices and inspire future work on software testing and maintenance.

    An Empirical Study on Meta Virtual Reality Applications: Security and Privacy Perspectives

    Hanyang Guo, Hong-Ning Dai, Xiapu Luo, Gengyang Xu, et al.
    pp. 1437-1454
    Abstract: Virtual Reality (VR) has seen accelerating adoption in emerging metaverse applications, but it is not a fundamentally new technology. On the one hand, most VR operating systems (OS) are based on off-the-shelf mobile OS (e.g., Android OS). As a result, VR apps inevitably inherit privacy and security deficiencies from conventional mobile apps. On the other hand, in contrast to traditional mobile apps, VR apps can achieve an immersive experience via diverse VR devices, such as head-mounted displays, body sensors, and controllers. However, achieving this requires the extensive collection of privacy-sensitive human biometrics (e.g., hand-tracking and face-tracking data). Moreover, VR apps are typically implemented with 3D gaming engines (e.g., Unity), which also contain intrinsic security vulnerabilities. Inappropriate use of these technologies may incur privacy leaks and security vulnerabilities, although these issues have not received significant attention compared to the proliferation of diverse VR apps. In this paper, we develop a security and privacy assessment tool, namely the VR-SP detector, for VR apps. The VR-SP detector integrates program static analysis tools and privacy-policy analysis methods. Using the VR-SP detector, we conduct a comprehensive empirical study on 900 popular VR apps. We obtain the original apps from the popular SideQuest app store and extract Android PacKage (APK) files via the Meta Quest 2 device. We evaluate the security vulnerabilities and privacy data leaks of these VR apps through VR app analysis, taint analysis, privacy policy analysis, and user review analysis. We find that a number of security vulnerabilities and privacy leaks widely exist in VR apps. Moreover, our results also reveal conflicting representations in the privacy policies of these apps and inconsistencies between the actual data collection and the privacy-policy statements of the apps.
Further, user reviews also indicate their privacy concerns about relevant biometric data. Based on these findings, we make suggestions for the future development of VR apps.

    FlexFL: Flexible and Effective Fault Localization With Open-Source Large Language Models

    Chuyang Xu, Zhongxin Liu, Xiaoxue Ren, Gehao Zhang, et al.
    pp. 1455-1471
    Abstract: Fault localization (FL) aims to identify bug locations within a software system, which can enhance debugging efficiency and improve software quality. Due to the impressive code comprehension ability of Large Language Models (LLMs), a few studies have proposed to leverage LLMs to locate bugs, i.e., LLM-based FL, and demonstrated promising performance. However, these methods have two limitations. First, they are limited in flexibility: they rely on bug-triggering test cases to perform FL and cannot make use of other available bug-related information, e.g., bug reports. Second, they are built upon proprietary LLMs, which, although powerful, pose data-privacy risks. To address these limitations, we propose a novel LLM-based FL framework named FlexFL, which can flexibly leverage different types of bug-related information and effectively work with open-source LLMs. FlexFL is composed of two stages. In the first stage, FlexFL reduces the search space of buggy code using state-of-the-art FL techniques of different families and provides a candidate list of bug-related methods. In the second stage, FlexFL leverages LLMs to delve deeper, double-checking the code snippets of the methods suggested by the first stage and refining the fault localization results. In each stage, FlexFL constructs agents based on open-source LLMs; the agents share the same pipeline, which does not postulate any particular type of bug-related information and can interact via function calls even without out-of-the-box function-calling support. Extensive experimental results on Defects4J demonstrate that FlexFL outperforms the baselines and can work with different open-source LLMs. Specifically, FlexFL with the lightweight open-source LLM Llama3-8B can locate 42 and 63 more bugs than two state-of-the-art LLM-based FL approaches, AutoFL and AgentFL, both of which use GPT-3.5. In addition, FlexFL can localize 93 bugs at Top-1 that cannot be localized by non-LLM-based FL techniques.
Furthermore, to mitigate potential data contamination, we conduct experiments on a dataset which Llama3-8B has not seen before, and the evaluation results show that FlexFL can also achieve good performance.
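The first-stage search-space reduction can be sketched as merging the top-ranked methods from several FL techniques into one candidate list for the LLM agents to double-check. This is an illustrative interleaving merge under invented method names; it is not FlexFL's actual combination logic.

```python
def candidate_methods(rankings: list[list[str]], k: int = 5) -> list[str]:
    """Union the Top-k methods from several FL rankings, preserving
    rank order and removing duplicates (toy Stage-1 sketch)."""
    seen, candidates = set(), []
    # Walk the rankings rank-by-rank so highly-ranked methods from
    # every technique appear early in the merged list.
    for rank_tuple in zip(*(r[:k] for r in rankings)):
        for method in rank_tuple:
            if method not in seen:
                seen.add(method)
                candidates.append(method)
    return candidates

# Hypothetical rankings from two FL families (e.g., SBFL and IR-based FL).
sbfl = ["Parser.parse", "Lexer.next", "AST.visit"]
irfl = ["Lexer.next", "Parser.parse", "Token.eq"]
print(candidate_methods([sbfl, irfl]))
```

A second stage would then fetch the code of each candidate and ask an LLM agent to confirm or reject it, refining the final ranking.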

    Do Experts Agree About Smelly Infrastructure?

    Sogol Masoumzadeh, Nuno Saavedra, Rungroj Maipradit, Lili Wei, et al.
    pp. 1472-1486
    Abstract: Code smells are anti-patterns that harm code understandability, reusability, changeability, and maintainability. It is important to identify code smells and locate them in the code. For this purpose, automated detection of code smells is a sought-after feature for development tools; however, the design and evaluation of such tools depend on the quality of oracle datasets. The typical approach for creating an oracle dataset involves multiple developers independently inspecting and annotating code examples for their existing code smells. Since multiple inspectors cast votes about each code example, it is possible for the inspectors to disagree about the presence of smells. Such disagreements introduce ambiguity into how smells should be interpreted. Prior work has studied developer perceptions of code smells in traditional source code; however, smells in Infrastructure-as-Code (IaC) have not been investigated. To understand the real-world impact of disagreements among developers and their perceptions of IaC code smells, we conduct an empirical study on the oracle dataset of GLITCH, a state-of-the-art detection tool for security code smells in IaC. We analyze GLITCH's oracle dataset for code smell issues, their types, and the individual annotations of the inspectors. Furthermore, we investigate possible confounding factors associated with the incidences of developers' misaligned perceptions of IaC code smells. Finally, we triangulate developer perceptions of code smells in traditional source code with our results on IaC. Our study reveals that, unlike developer perceptions of smells in traditional source code, perceptions of smells in IaC are more substantially impacted by subjective interpretation of smell types and their co-occurrence relationships. For instance, the interpretation of admins by default, empty passwords, and hard-coded secrets varies considerably among raters, and these smells are more susceptible to misidentification than other IaC code smells.
    Consequently, the manual identification of IaC code smells involves annotation disagreements among developers: 46.3% of the studied IaC code smell incidences have at least one dissenting vote among three inspectors. Meanwhile, only 1.6% of code smell incidences in traditional source code are affected by inspector bias stemming from such disagreements. Hence, relying solely on majority voting would not fully represent the breadth of interpretation of the IaC under scrutiny.
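The headline statistic here, the fraction of incidences with at least one dissenting vote, is straightforward to compute from raw annotations. A minimal sketch with invented votes (1 = smell present, 0 = absent, one tuple per code example):

```python
def dissent_rate(votes: list[tuple[int, ...]]) -> float:
    """Fraction of items whose inspector votes are not unanimous."""
    dissent = sum(1 for v in votes if len(set(v)) > 1)
    return dissent / len(votes)

# Hypothetical votes from three inspectors over four code examples.
votes = [(1, 1, 1), (1, 0, 1), (0, 0, 0), (1, 1, 0)]
print(dissent_rate(votes))  # → 0.5 (two of four items are contested)
```

Note that majority voting would still assign a label to every contested item, which is exactly why the paper argues majority vote alone understates interpretive disagreement.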

    BabelRTS: Polyglot Regression Test Selection

    Gabriele Maurina, Walter Cazzola, Sudipto Ghosh
    pp. 1487-1499
    Abstract: Regression test selection (RTS) approaches reduce the number of regression tests to run. Current RTS approaches are typically monoglot, i.e., their implementations target a specific language. However, many subjects under test (SUT) are polyglot, i.e., they use multiple languages. Running multiple monoglot RTS approaches separately on a polyglot SUT is unsafe because tests that involve inter-language dependencies can be missed. Moreover, a new language may require completely reimplementing an RTS approach, especially if the original implementation relies on language and runtime features that are not available in the new language. We propose a new static approach called BabelRTS, which is multilingual (supports multiple languages out of the box), polyglot (analyzes SUTs written in multiple languages), and extensible (allows adding support for new languages). A key contribution is the idea of encapsulating the language-specific aspects of RTS using patterns and actions. A pattern specifies the programming language constructs in a file that indicate dependencies on other files written in the same or a different language. An action specifies how to identify these files in the codebase. Patterns and actions can be customized to support new languages without modifying the test selection algorithm. BabelRTS is not tied to a specific language run-time system or paradigm. BabelRTS currently supports 12 languages and 5 language combinations. We evaluated BabelRTS on 142 open-source monoglot and polyglot SUTs, analyzing a total of more than two billion LOC. The performance of BabelRTS was similar to that of state-of-the-art monoglot approaches on monoglot SUTs. On polyglot SUTs, BabelRTS was safer in polyglot mode, selecting more tests for 60% of the commits than in monoglot mode, which missed inter-language dependencies.
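The pattern/action idea can be sketched as a per-language table of dependency patterns plus an action that resolves matches to dependency names. The patterns below (simple `import` regexes for two languages) and the `dependencies` helper are invented stand-ins for illustration, not BabelRTS's actual pattern language.

```python
import re

# Hypothetical per-language patterns: each regex captures the module
# name referenced by a dependency construct in that language.
PATTERNS = {
    ".py":   re.compile(r"^\s*import\s+(\w+)", re.MULTILINE),
    ".java": re.compile(r"^\s*import\s+([\w.]+);", re.MULTILINE),
}

def dependencies(filename: str, source: str) -> set[str]:
    """Action sketch: apply the language's patterns to a file and
    return the names of the files/modules it depends on."""
    for ext, pattern in PATTERNS.items():
        if filename.endswith(ext):
            return set(pattern.findall(source))
    return set()

print(sorted(dependencies("app.py", "import json\nimport util\n")))
```

Adding a language then means adding one entry to the table rather than reimplementing the selection algorithm, which is the extensibility property the abstract claims.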

    LLM-Based Automation of COSMIC Functional Size Measurement From Use Cases

    Gabriele De Vito, Sergio Di Martino, Filomena Ferrucci, Carmine Gravino, et al.
    pp. 1500-1523
    Abstract: COmmon Software Measurement International Consortium (COSMIC) Functional Size Measurement is a method widely used in the software industry to quantify user functionality and measure software size, which is crucial for estimating development effort, cost, and resource allocation. COSMIC measurement is a manual task that requires qualified professionals and considerable effort. To support professionals in COSMIC measurement, we propose an automatic approach, CosMet, that leverages Large Language Models to measure software size starting from use cases specified in natural language. To evaluate the proposed approach, we developed a web tool that implements CosMet using GPT-4 and conducted two studies to assess the approach quantitatively and qualitatively. Initially, we experimented with CosMet on seven software systems, encompassing 123 use cases, and compared the generated results with the ground truth created by two certified professionals. Then, seven professional measurers evaluated the analysis achieved by CosMet and the extent to which the approach reduces the measurement time. The first study's results revealed that CosMet is highly effective in analyzing and measuring use cases. The second study highlighted that CosMet offers a transparent and interpretable analysis, allowing practitioners to understand how the measurement is derived and make necessary adjustments. Additionally, it reduces the manual measurement time by 60-80%.
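For readers unfamiliar with the method: COSMIC assigns one CFP (COSMIC Function Point) to each data movement identified in a functional process, where movements are of four types: Entry (E), Exit (X), Read (R), and Write (W). The mechanical counting step can be sketched as below; the "Login" breakdown is an invented example, since real measurement lies in identifying the movements from the use-case text.

```python
# The four COSMIC data-movement types; each counts as 1 CFP.
COSMIC_TYPES = {"E", "X", "R", "W"}

def functional_size(movements: list[str]) -> int:
    """Total functional size in CFP: one point per data movement."""
    if not all(m in COSMIC_TYPES for m in movements):
        raise ValueError("unknown data-movement type")
    return len(movements)

# Hypothetical "Login" use case: the user enters credentials (E),
# the system reads the account record (R) and displays the result (X).
print(functional_size(["E", "R", "X"]))  # → 3 CFP
```

An LLM-based approach like CosMet automates the hard part, extracting the E/X/R/W movements from natural-language use cases; the summation itself is trivial.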

    Programmer Visual Attention During Context-Aware Code Summarization

    Robert Wallace, Aakash Bansal, Zachary Karas, Ningzhi Tang, et al.
    pp. 1524-1537
    Abstract: Programmer attention represents the visual focus of programmers on parts of the source code in pursuit of programming tasks. Current research on modeling programmer attention has focused on using mouse cursors, keystrokes, or eye tracking equipment to map areas in a snippet of code. These approaches have traditionally mapped attention only for a single method. However, there is a knowledge gap in the literature because programming tasks such as source code summarization require programmers to use contextual knowledge that can only be found in other parts of the project, not only in a single method. To address this knowledge gap, we conducted an in-depth human study with 10 Java programmers, where each programmer generated summaries for 40 methods from five large Java projects over five one-hour sessions. We used eye tracking equipment to map the visual attention of programmers while they wrote the summaries. We also rated the quality of each summary. We found eye-gaze patterns and metrics that characterize common programmer attention behaviors during context-aware code summarization. Specifically, we found that programmers need to read up to 35% fewer words (p < 0.01) over the whole session, and revisit 13% fewer words (p < 0.03), as they summarize each method during a session, while maintaining the quality of summaries. We also found that the amount of source code a participant looks at correlates with a higher quality summary, but this trend follows a bell-shaped curve, such that after a threshold, reading more source code leads to a significant decrease (p < 0.01) in the quality of summaries. We also gathered insight into the types of methods in the project that provide the most contextual information for code summarization based on programmer attention.
Specifically, we observed that programmers spent a majority of their time looking at methods inside the same class as the target method to be summarized. Surprisingly, we found that programmers spent significantly less time looking at methods in the call graph of the target method. We discuss how our empirical observations may aid future studies towards modeling programmer attention and improving context-aware automatic source code summarization.