PREDICTING FAULTS IN SOFTWARE SYSTEMS USING CROSS PROJECT DATA

Bal, Pravas Ranjan

Please use this identifier to cite or link to this item: http://localhost:8081/jspui/handle/123456789/19623

Title:	PREDICTING FAULTS IN SOFTWARE SYSTEMS USING CROSS PROJECT DATA
Authors:	Bal, Pravas Ranjan
Issue Date:	Aug-2021
Publisher:	IIT Roorkee
Abstract:	Software defect prediction (SDP) with limited historical data has attracted significant attention, leading software practitioners to move beyond within project defect prediction (WPDP) and introduce cross project defect prediction (CPDP). CPDP using learning algorithms has since become a staple research area in the SDP domain. Researchers have generally proposed two types of CPDP approach: CPDP with homogeneous metrics (CPDP) and CPDP with heterogeneous metrics (HDP). In the typical CPDP approach, the SDP model is trained and tested on different project datasets that share a homogeneous metric set. In the HDP approach, the SDP model is likewise trained and tested on different project datasets, but their metric sets are heterogeneous. Imbalanced data is a significant issue in the CPDP scenario, and handling imbalanced software defect data for the early prediction of software defects is very challenging for software engineers. In the last two decades, many researchers have used the synthetic minority oversampling technique (SMOTE), SMOTE for regression, and other such methods to preprocess imbalanced software defect data. However, these preprocessing techniques do not produce consistently good accuracy, especially in cross project defect prediction. Another important assumption in learning algorithms is that training and test data must follow similar distributions for good prediction accuracy. This assumption may hold in the WPDP scenario, but not in the CPDP scenario, so researchers have looked at CPDP with matched metrics. However, in CPDP with matched metrics, only a small amount of source data may match the target data, and it is challenging to train learning models on such small-sized source data. Sometimes researchers may also not find a sufficient number of homogeneous metrics for learning models to perform well in the CPDP scenario.
So, imbalanced data, small-sized data, and homogeneous metrics are significant issues in the CPDP scenario. Many learning models and data preprocessing methods have been developed for the CPDP scenario in the last decade; still, achieving good accuracy with CPDP remains a very challenging problem in the SDP domain. To address the issue of imbalanced data, we have proposed a novel variant of the extreme learning machine (ELM), namely weighted regularization ELM (WR-ELM), for imbalance learning, which generalizes imbalanced data to balanced data. The experimental results showed that the proposed WR-ELM algorithm led to improved performance. To address the issue of small-sized data, we have proposed a novel cross data preprocessing method, namely Knowledge Transfer from Target dataset to Source dataset using Correlation (KT-TSC), to improve the CPDP accuracy of learning models. The experimental results demonstrate that deep neural network, k nearest neighbor, decision tree, logistic regression, and Naive Bayes classifiers using the proposed KT-TSC CPDP method show improvements of 20%, 16%, 23%, 16%, and 8%, respectively, in average AUC score compared to traditional CPDP methods. To address the issue of homogeneous metrics, we have proposed a novel heterogeneous data preprocessing method, namely Transfer of Data from Target dataset to Source dataset selected using Relevant metrics (TDTSR), for heterogeneous defect prediction. The experimental results demonstrate that a random forest classifier using the proposed TDTSR heterogeneous data preprocessing method achieves a 90% average AUC score in the HDP scenario. Apart from this, we have proposed an ensemble model, namely the non-linear heterogeneous ensemble using ELM (NH-ELM); regularized ELM (RELM Plus) using Matched Metrics (RELMP-MM); and a cross data preprocessing method, namely unique selection of matched metrics (USMM), to improve the accuracy of learning algorithms on cross project data.
In this thesis, we have mainly focused on cross project defect prediction and heterogeneous defect prediction. The ideas behind cross project defect prediction and heterogeneous defect prediction are not limited to software defect prediction and can be generalized to other application areas.
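The cross project setting described in the abstract (train on one project, test on another with the same metric set, report AUC) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy metric values, the per-project standardisation step, and the nearest-centroid scorer, which only stands in for a real learner and is not one of the methods proposed in the thesis.

```python
from statistics import mean, pstdev

def zscore(rows):
    """Standardise each metric column within a single project."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]

def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [mean(c) for c in zip(*rows)]

def fit(X, y):
    """Nearest-centroid stand-in for a real learner: one centroid per class."""
    defective = centroid([x for x, label in zip(X, y) if label == 1])
    clean = centroid([x for x, label in zip(X, y) if label == 0])
    return defective, clean

def score(model, x):
    """Higher score means closer to the defective centroid than the clean one."""
    defective, clean = model
    d_def = sum((a - b) ** 2 for a, b in zip(x, defective))
    d_cln = sum((a - b) ** 2 for a, b in zip(x, clean))
    return d_cln - d_def

def auc(scores, labels):
    """Fraction of (defective, clean) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: columns are (lines of code, cyclomatic complexity).
# The source project is imbalanced (2 defective modules out of 6).
source_X = [[120, 4], [300, 9], [80, 2], [450, 14], [200, 6], [520, 17]]
source_y = [0, 0, 0, 1, 0, 1]
target_X = [[1000, 20], [2500, 48], [900, 15], [3100, 60], [1500, 41]]
target_y = [0, 1, 0, 1, 0]

# Train on the source project only, then score the unseen target project.
model = fit(zscore(source_X), source_y)
target_scores = [score(model, x) for x in zscore(target_X)]
cross_project_auc = auc(target_scores, target_y)
```

Standardising each project separately is one simple way to reduce the distribution mismatch between source and target that the abstract mentions; it does not address class imbalance or heterogeneous metric sets.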
URI:	http://localhost:8081/jspui/handle/123456789/19623
Research Supervisor/ Guide:	Kumar, Sandeep
Type:	Thesis
Appears in Collections:	DOCTORAL THESES (CSE)

Files in This Item:

File	Description	Size	Format
PRAVAS RANJAN.pdf		7.05 MB	Adobe PDF
