Fault Localization Using TrustRank Algorithm

Fault Localization Using TrustRank Algorithm: Comparison

Please note this is a comparison between Version 2 by Wendy Huang and Version 1 by Kaisheng Wu.

Fault localization refers to the process of identifying the specific locations or program entities within a software system that are responsible for causing observed failures or errors. It involves analyzing various sources of information, such as execution traces, test cases, and program dependencies, to pinpoint the root causes of failures. The goal of fault localization is to narrow down the search space and provide developers with actionable insights to efficiently and effectively fix the identified faults. By accurately localizing faults, developers can save time and effort in debugging and troubleshooting software systems.

fault localization
spectrum
program execution network
TrustRank algorithm
software

1. Introduction

Software testing is a critical step in ensuring software quality and reliability and can effectively reduce the enormous economic losses and personnel safety risks caused by software faults ^[1]. In the software development process, traditional manual debugging requires huge human resources and time costs, making automated testing particularly important [2,3]^[2][3]. In this process, software fault localization is a key step. Fault localization refers to the process of identifying the specific locations or program entities within a software system that are responsible for causing observed failures or errors. It involves analyzing various sources of information, such as execution traces, test cases, and program dependencies, to pinpoint the root causes of failures. The goal of fault localization is to narrow down the search space and provide developers with actionable insights to efficiently and effectively fix the identified faults. By accurately localizing faults, developers can save time and effort in debugging and troubleshooting software systems. The emergence of program fault localization technology has improved software quality, reduced development resources, and reduced economic losses ^[4]. Therefore, researching and applying software fault localization technology can further improve the accuracy and efficiency of software fault localization technology, which is of great significance to software development and maintenance.

In the past few decades of academic exploration, program fault localization has been widely studied. A large number of fault localization techniques has been proposed, including program spectrum-based fault localization techniques [5,6^[5][6][7],7], mutation-based program fault localization techniques [2^[2][8][9],8,9], slicing-based program localization techniques [10^[10][11],11], information retrieval-based program localization techniques ^[12], and machine learning-based fault localization techniques [3,13]^[3][13]. These methods have undoubtedly contributed to the research of program fault localization, but there is still a large amount of information and features hidden within the program itself and during the program running process that have not been explored by current research. Therefore, it is necessary to find more efficient methods to improve fault localization technology. Among the above research methods, spectrum-based fault localization technology is called the most widely used program fault localization technology due to its lightweight, scalability, and applicability [2,14]^[2][14]. However, traditional fault localization methods only consider the coverage information of test cases, which directly affects the effectiveness of program fault localization. Redundant coverage information can also reduce the accuracy of fault location and consume testing personnel’s energy. In addition, due to the limitations of the suspiciousness value calculation formula (the calculation method is based only on program coverage information), if elements are ranked solely based on the results of the suspiciousness value calculation, it will lead to problems with elements being equally ranked in the list of sorts where identical suspiciousness values are placed in the same position ^[15].

Some researchers focus their studies on the creation and modification of fault localization formulas ^[16], while others attempt to improve the accuracy of fault localization formulas by modifying or enhancing the program spectrum ^[17]. However, these research perspectives overlook the interaction between programs and the data flow within programs. Data dependency can serve as a new research perspective as it can reveal the correlation and influence between variables in a program, thereby aiding in understanding and locating faults. Data dependency can not only reveal the execution flow of a program but also narrow down the scope of fault localization. In complex software systems, faults may be hidden in a large amount of code. By analyzing data dependencies, attention can be focused on code segments directly related to faults, thereby improving fault localization efficiency.

2. Spectrum-Based Fault Localization Method

In software fault localization technology research, SBFL is one of the more mature methods currently applied ^[2]. SBFL is a dynamic program analysis technique ^[18] that integrates coverage information and execution results (pass or fail) of a test case set into a program spectrum form. Then, based on a statistical suspiciousness calculation formula, it calculates a suspiciousness value as a fault probability for program entities (such as statements, predicates, branches, basic blocks, and methods). According to this probability, it sorts program entities from high to low and finally obtains a ranking list to assist workers in locating program entities. The concept of program spectrum was proposed in 1987 ^[19], which is easy to obtain and can intuitively reflect program running information. In 2000, to solve the problem of date data format and storage caused by century-spanning changes, program spectrum was first applied to software fault localization. In order to more intuitively reflect the information in the program spectrum on the location of fault program entities, Jones et al. ^[19] proposed a suspiciousness value calculation formula called Tarantula. The principle of Tarantula is that program entities executed by more failed test cases are more likely to be erroneous than other entities. After Tarantula appeared, many other suspiciousness value calculation formulas were introduced, such as the Jaccard formula based on the clustering analysis method ^[20]; Ochiai evolved from biology-inspired ^[21], Ochiai 2, which considers the influence of unexecuted or passed test cases ^[22], Op2, which performs best in single fault localization ^[22], and Dstar, which has been proven to have the best and most effective comprehensive ranking effect ^[23]. However, there is no suspiciousness value calculation formula that can perfectly apply to all tested programs ^[24]. In order to improve the accuracy and efficiency of software fault localization technology, researchers have explored and improved it from two points of view: feature extraction; and sorting technology. Xie et al. ^[25] proposed a universal data augmentation method to improve the effectiveness of the program spectrum, which balances the number of failed and successful test cases in the original data, providing new ideas for fault localization method research. Some people have also proposed a fault localization formula based on call frequency information, creating a call count structure that depends on the number of times a dependent test case is executed to eliminate the influence of a large number of repeated calls generated by loops on the fault localization formula. In addition, Zhang et al. ^[26] proposed to analyze the relationship between test cases and tested programs through the PageRank algorithm, assign weights to different test cases, and combine fault localization technology with complex network centrality analysis methods for the first time. Overall, researchers are currently trying to reduce the limitations of the program spectrum on SBFL technology. As for the research described in this research, it attempts not only to use program spectrum but also to improve traditional SBFL methods by capturing the link relationship between test cases and program entities using a bidirectional TrustRank algorithm.

3. Program Execution Network

Program execution network can simulate complex data flow during program execution, establish impact propagation and interaction relationships between program entities, and understand program dynamic behavior. The idea of a program execution network comes from a complex network model, which has been widely used by researchers due to its robustness and applicability. Studies have shown that complex networks are powerful tools for researchers to understand complex software programs ^[27], as they can analyze the importance of program entities and understand program context relationships. It is worth mentioning that with the support of the complex network model, a program execution network can comprehensively consider the entire relationship of large programs. Qu et al. ^[28] proposed a dynamic changing network that combines dynamic call relationships during program execution with complex networks to construct a call network with methods as nodes and call relationships as edges. Zakari et al. ^[29] also used software network technology and analyzed fault localization by measuring network metrics. In order to analyze the impact relationship of faults, Vancsics et al. ^[30] constructed a new method based on counting different function call contexts to analyze the propagation of program faults. In order to reduce the impact range of fault nodes in software networks, Zhao et al. ^[31] constructed a directional call network, proposed their own influence propagation relationship calculation method, and analyzed the local centrality of fault nodes in software networks. This research uses a software execution network to represent data flow behavior during program execution and identifies and locates fault program statements’ positions by leveraging complex network analysis capabilities.

4. TrustRank Algorithm

TrustRank is a link-ranking algorithm proposed by Zoltan Gyongyi and Hector Garcia-Molina from Stanford University and Jan Pedersen from Yahoo (Hong Kong, China), which is used to optimize query speed and quality ^[32]. It is a variant of the PageRank algorithm used in Google search engine. In the traditional PageRank algorithm, the ranking of a page is based on the link relationship between pages, and pages with more links are more likely to be considered important. However, this method is vulnerable to link manipulation and spam pages, which can lead to a decrease in the quality of search results. TrustRank mitigates these issues by introducing the concept of trusted sources. TrustRank aims to combat spam by filtering out unreliable networks. This method requires selecting a small group of seed pages that are evaluated by experts. Once reputable seed pages are manually identified, crawling that extends outward from the seed set will look for similar reliable and trustworthy pages. The reliability of TrustRank decreases as the distance between documents and seed sets increases. The intuition behind the TrustRank algorithm is that for every node in a network, if it is an important node, then nodes linked to it should also be equally important due to its influence, and this importance relationship will weaken with increasing distance between nodes. In other words, nodes with closer links to important nodes are relatively more important than those linked to less influential nodes in the same network. Zhang et al. ^[33] used the TrustRank algorithm to differentiate citation values of different-level journals by leaving journals as seeds for evaluating journal rankings. Li et al. ^[34] proposed an auditor suspicious ranking algorithm based on community discovery and TrustRank to solve the problem of lack of supervised learning data for detecting spam senders. The experimental results show that this algorithm performs better than baseline methods in ranking spam senders and can effectively improve their ranking performance in unsupervised environments. Inspired by complex network link ranking algorithms, TrustRank simulates the propagation of program faults by considering the link relationship between nodes. In network analysis, this simulation can help us better understand the structure and behavior of networks. Moreover, TrustRank propagates the degree of influence of a node to its adjacent nodes through influence propagation algorithms. This propagation mechanism can help us evaluate the entire network and better identify important nodes in the network. In addition, different versions of programs have different link relationships, which makes the complex network to be analyzed more diversely. By using the TrustRank algorithm, the influence propagation relationship between these networks can be better identified and their structure better understood. This research starts with the degree of impact of software faults and analyzes the impact of program entities’ importance on software suspiciousness using the TrustRank algorithm.

References

Tassey, G. The Economic Impacts of Inadequate Infrastructure for Software Testing. Natl. Inst. Stand. Technol. 2002, 3, P125.
Pearson, S.; Campos, J.; Just, R.; Fraser, G.; Abreu, R.; Ernst, M.D.; Pang, D.; Keller, B. Evaluating and improving fault localization. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), IEEE 2017, Buenos Aires, Argentina, 20–28 May 2017; pp. 609–620.
Ishimoto, Y.; Kondo, M.; Ubayashi, N.; Kamei, Y. PAFL: Probabilistic Automaton-based Fault Localization for Recurrent Neural Networks. Inf. Softw. Technol. 2023, 155, 107117.
Zou, D.; Liang, J.; Xiong, Y.; Ernst, M.D.; Zhang, L. An empirical study of fault localization families and their combinations. IEEE Trans. Softw. Eng. 2019, 47, 332–347.
Jones, J.A.; Harrold, M.J.; Stasko, J. Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering, Orlando, FL, USA, 19–25 May 2002; pp. 467–477.
Xie, X.; Chen, T.Y.; Kuo, F.-C.; Xu, B. A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2013, 22, 31.
Xie, X.; Kuo, F.C.; Chen, T.Y.; Yoo, S.; Harman, M. Provably optimal and human-competitive results in sbse for spectrum based fault localization. In Proceedings of the International Symposium on Search Based Software Engineering, St. Petersburg, Russia, 24–26 August 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 224–238.
Papadakis, M.; Le Traon, Y. Metallaxis-FL: Mutation-based fault localization. Softw. Test. Verif. Reliab. 2015, 25, 605–628.
Li, Z.; Wang, H.; Liu, Y. Hmer: A hybrid mutation execution reduction approach for mutation-based fault localization. J. Syst. Softw. 2020, 168, 110661.
Reis, S.; Abreu, R.; d’Amorim, M. Demystifying the Combination of Dynamic Slicing and Spectrum-based Fault Localization. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, 10–16 August 2019; pp. 4760–4766.
Cao, H.; Wang, F.; Deng, M.; Li, L. The improved dynamic slicing for spectrum-based fault localization. PeerJ Comput. Sci. 2022, 8, e1071.
Le, T.D.B.; Lo, D.; Le Goues, C.; Grunske, L. A learning-to-rank based fault localization approach using likely invariants. In Proceedings of the 25th International Symposium on Software Testing and Analysis, Saarbrucken, Germany, 18–20 July 2016; pp. 177–188.
Widyasari, R.; Prana, G.A.A.; Haryono, S.A.; Tian, Y.; Zachiary, H.N.; Lo, D. XAI4FL: Enhancing spectrum-based fault localization with explainable artificial intelligence. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, Pittsburgh, PA, USA, 16–17 May 2022; pp. 499–510.
Lemos Ribeiro, H.; Amario de Souza, H.; de Andrioli Araujo, R.P.; Lordello Chaim, M.; Kon, F. Evaluating data-flow coverage in spectrum-based fault localization. In Proceedings of the 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), IEEE, Porto de Galinhas, Brazil, 19–20 September 2019; pp. 1–11.
Sarhan, Q.I.; Beszédes, Á. A survey of challenges in spectrum-based software fault localization. IEEE Access 2022, 10, 10618–10639.
Laghari, G.; Demeyer, S. On the Use of Sequence Mining within Spectrum Based Fault Localisation. In Proceedings of the Acm/Sigapp Symposium on Applied Computing, ACM, Pau, France, 9–13 April 2018; pp. 1916–1924.
Renieres, M.; Reiss, S.P. Fault localization with nearest neighbor queries. In Proceedings of the 18th IEEE International Conference on Automated Software Engineering, Montreal, QC, Canada, 6–10 October 2003.
Laghari, G.; Murgia, A.; Demeyer, S. Fine-tuning spectrum based fault localisation with frequent method item sets. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, Singapore, 3–6 September 2016; pp. 274–285.
Reps, T.; Ball, T.; Das, M.; Larus, J. The use of program profiling for software maintenance with applications to the year 2000 problem. In Software Engineering—ESEC/FSE’97. ESEC SIGSOFT FSE 1997; Lecture Notes in Computer Science; Jazayeri, M., Schauer, H., Eds.; Springer: Berlin/Heidelberg, Germany, 1997; Volume 1301.
Chen, M.Y.; Kiciman, E.; Fratkin, E.; Fox, A.; Brewer, E. Pinpoint: Problem determination in large, dynamic internet services. In Proceedings of the International Conference on Dependable Systems and Networks, IEEE, Washington, DC, USA, 23–26 June 2002; pp. 595–604.
Abreu, R.; Zoeteweij, P.; Golsteijn, R.; van Gemund, A.J. A practical evaluation of spectrum-based fault localization. J. Syst. Softw. 2009, 82, 1780–1792.
Naish, L.; Lee, H.J.; Ramamohanarao, K. A model for spectra-based software diagnosis. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2011, 20, 11.
Wong, W.E.; Debroy, V.; Gao, R.; Li, Y. The DStar method for effective software fault localization. IEEE Trans. Reliab. 2013, 63, 290–308.
Yoo, S.; Xie, X.; Kuo, F.C.; Chen, T.Y.; Harman, M. No pot of gold at the end of program spectrum rainbow: Greatest risk evaluation formula does not exist. RN 2014, 14, 14.
Xie, H.; Lei, Y.; Yan, M.; Yu, Y.; Xia, X.; Mao, X. A universal data augmentation approach for fault localization. In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 22–24 May 2022; pp. 48–60.
Zhang, M.; Li, Y.; Li, X.; Chen, L.; Zhang, Y.; Zhang, L.; Khurshid, S. An empirical study of boosting spectrum-based fault localization via pagerank. IEEE Trans. Softw. Eng. 2019, 47, 1089–1113.
Matas, N. Comparing Network Centrality Measures as Tools for Identifying Key Concepts in Complex Networks: A Case of Wikipedia. J. Digit. Inf. Manag. 2017, 15.
Qu, Y.; Guan, X.; Zheng, Q.; Liu, T.; Zhou, J.; Li, J. Calling network: A new method for modeling software runtime behaviors. ACM SIGSOFT Softw. Eng. Notes 2015, 40, 1–8.
Zakari, A.; Lee, S.P.; Chong, C.Y. Simultaneous localization of software faults based on complex network theory. IEEE Access 2018, 6, 23990–24002.
Vancsics, B.; Horváth, F.; Szatmári, A.; Beszédes, Á. Fault localization using function call frequencies. J. Syst. Softw. 2022, 193, 111429.
Zhao, G.; He, H.; Huang, Y. Fault centrality: Boosting spectrum-based fault localization via local influence calculation. Appl. Intell. 2022, 52, 7113–7135.
Gyöngyi, Z.; Garcia-Molina, H.; Pedersen, J. Combating web spam with trustrank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, BC, Canada, 31 August–3 September 2004; Volume 30, pp. 576–587.
Zhang, X.; Lyu, D.; Ruan, X.; Cheng, Y.; Ju, X. Journal Evaluation Method Based on the Integration of Peer-Review Consensus and Bibliometric Indicators. J. China Soc. Sci. Tech. Inf. 2022, 41, 486–496.
Li, Y.; Zhang, S. A community discovery and TrustRank based approach for spammer ranking. In Proceedings of the 2020 International Conference on Culture-oriented Science & Technology (ICCST), IEEE, Beijing, China, 28–31 October 2020; pp. 279–283.