
Classification with imbalanced data: Ensemble method using split balancing and instance hardness / Chongomweru Halimu

By: Chongomweru Halimu [author]
Contributor(s): Dr. Asem Kasem [supervisor] | Universiti Teknologi Brunei School of Computing and Informatics
Material type: Text
Publication details: Bandar Seri Begawan : Universiti Teknologi Brunei, ©2021
Description: 158 pages : tables ; 30 cm
Subject(s): Universiti Teknologi Brunei Project Report | Thesis Writing | Project Report, Academic | Project Report Universiti Teknologi Brunei | Machine learning -- Classification | Ensemble learning (Machine learning) | Imbalanced data problem (Statistics) | Data mining -- Statistical methods
Other classification: RTDS 361 | UTB 120 REPORT THESIS & DISSERTATION, RTDS 361
Holdings
Item type: Reports, Thesis & Dissertation Students
Current library: Universiti Teknologi Brunei Library, at level 2
Call number: UTB 120 REPORT THESIS & DISSERTATION, RTDS 361
Copy number: 1
Status: Not for loan
Notes: Reg. No._UTB [RTDS 361]
Barcode: 850401

Submitted in fulfillment of the requirements for the degree of Doctor of Philosophy.

Abstract

In classification tasks, class imbalance and noise are among the top data quality challenges faced by the machine learning community when dealing with real-world data. A class imbalance problem arises when at least one class (the majority class) has relatively more instances than the other classes (the minority classes). On the other hand, even though various definitions of noise have been presented in the machine learning and statistics literature, most agree that noise introduces inaccuracies that obscure the relationship between the features of an object and its class. The presence of class imbalance and noise in a training set, which is almost unavoidable in real-world domains such as healthcare and finance, generally degrades the classification performance of learning algorithms.

There is an increasing demand for solving the class imbalance problem while paying attention to the drawbacks of the existing methods that address it and to the underlying complexities within datasets. This research is motivated by the growing use of ensemble techniques for imbalanced classification, owing to their ability to improve classification performance, relative to single-learner algorithms, by leveraging the classification power of multiple base learners trained on different bootstraps of the training data. However, ensemble methods are still prone to the negative effects of class imbalance and noise in the training sets. Although some of them address class imbalance by altering the original class distribution, creating a form of balance through over-sampling or under-sampling, this can lead to overfitting or to discarding useful data, respectively, and thus may still hamper performance. Despite the relatively improved performance registered by some existing sampling-based methods, much more work is needed to avoid discarding potentially useful data while minimizing the likelihood of overfitting.

In recognition of these challenges, this thesis focuses on binary classification problems, first investigating the crucial issue of establishing a reliable and suitable performance metric for evaluating learners under class imbalance. To achieve this, the study applies previously proposed criteria for comparing metrics based on their degree of Consistency and degree of Discriminancy, and empirically compares the metrics most commonly used for evaluation under class imbalance, namely the Area Under the Receiver Operating Characteristic Curve (AUC), the Matthews Correlation Coefficient (MCC), and Accuracy. The findings demonstrate that AUC and MCC are statistically consistent with each other, but that AUC is more discriminating than MCC, suggesting that AUC is a more reliable and suitable measure for evaluating binary class imbalance problems than MCC.
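As a minimal, self-contained sketch of the two metrics being compared (pure Python, with hypothetical classifier scores), AUC can be computed as the probability that a random positive outranks a random negative, while MCC depends on a fixed decision threshold:

```python
import math

# Hypothetical (score, label) pairs from some classifier; labels are
# heavily imbalanced (18 negatives, 2 positives).
data = [(0.1, 0)] * 16 + [(0.4, 0), (0.6, 0), (0.7, 1), (0.9, 1)]

def auc(pairs):
    """AUC as the probability a random positive outranks a random negative."""
    pos = [s for s, y in pairs if y == 1]
    neg = [s for s, y in pairs if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mcc(pairs, thresh=0.5):
    """Matthews Correlation Coefficient at a fixed decision threshold."""
    tp = sum(1 for s, y in pairs if s >= thresh and y == 1)
    fp = sum(1 for s, y in pairs if s >= thresh and y == 0)
    tn = sum(1 for s, y in pairs if s < thresh and y == 0)
    fn = sum(1 for s, y in pairs if s < thresh and y == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(f"AUC={auc(data):.3f}  MCC={mcc(data):.3f}")
```

Here the classifier ranks both positives above every negative (AUC of 1.0), yet one false positive at the 0.5 threshold pulls MCC well below 1, illustrating why the two metrics can disagree on the same predictions.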

Secondly, this thesis proposes an ensemble method for binary classification in imbalanced datasets using a split-balancing technique and instance hardness estimation. The proposed method creates an arbitrary number of balanced splits (sBal) of the data, using Instance Hardness (IH) as a weighting mechanism for creating balanced bags. IH is a measure of the degree of complexity in classifying a given instance in a dataset; that is, each instance has a property that indicates its probability of being classified incorrectly regardless of the choice of classifier. An instance with an estimated hardness value above a predefined threshold t_o is categorized as Hard, one with a hardness value below a threshold t_e as Easy, and the rest as Normal.
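The categorization rule above can be sketched directly; the threshold values 0.7 and 0.3 below are illustrative assumptions, not the thesis's predefined settings:

```python
def categorize(hardness, t_o=0.7, t_e=0.3):
    """Label an instance by its estimated hardness.

    t_o and t_e are the upper and lower hardness thresholds named in the
    abstract; the default values here are hypothetical.
    """
    if hardness > t_o:
        return "Hard"
    if hardness < t_e:
        return "Easy"
    return "Normal"

print([categorize(h) for h in (0.9, 0.1, 0.5)])  # ['Hard', 'Easy', 'Normal']
```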

Each of the generated bags contains all the minority instances and a mixture of majority instances with varying degrees of hardness (easy, normal, and hard). This eliminates the possibility of discarding potentially useful data and enables the base learners to train on different balanced bags comprising varied characteristics of the training data, which may minimize overfitting.
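A rough sketch of this bag-construction idea follows: every bag keeps all minority instances and draws an equal number of majority instances cycled across the easy/normal/hard categories. The sampling scheme here is an illustrative assumption, not the thesis's exact weighting mechanism:

```python
import random

def balanced_bags(minority, majority_by_cat, n_bags, seed=0):
    """Build n_bags balanced bags: all minority instances plus an
    equal-sized, category-mixed sample of majority instances."""
    rng = random.Random(seed)
    size = len(minority)          # balance each bag 1:1 with the minority class
    cats = list(majority_by_cat)  # e.g. ['easy', 'normal', 'hard']
    bags = []
    for _ in range(n_bags):
        picked = []
        for i in range(size):
            cat = cats[i % len(cats)]  # cycle categories for a varied mix
            picked.append(rng.choice(majority_by_cat[cat]))
        bags.append(minority + picked)
    return bags

# Hypothetical instances, pre-grouped by hardness category.
minority = ["m1", "m2", "m3", "m4"]
majority = {"easy": ["e1", "e2"], "normal": ["n1", "n2"], "hard": ["h1"]}
bags = balanced_bags(minority, majority, n_bags=3)
```

Because sampling is with replacement within each category, no majority instance is ever discarded outright across the ensemble, while each bag still presents a different balanced view of the data to its base learner.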

Comprehensive experiments were carried out to assess the effectiveness of the proposed method in dealing with noise and class imbalance. Its performance was evaluated on a total of 100 datasets, comprising 30 synthetic datasets with controlled levels of noise and 29 balanced and 41 imbalanced real-world multi-domain datasets, and compared with both traditional ensemble methods (Bagging, Wagging, Random Forest, and AdaBoost) and those specialized for class imbalance problems (Balanced Bagging, Balanced Random Forest, RUSBoost, and EasyEnsemble).

The results reveal that the proposed method brings a substantial improvement in classification performance relative to the compared methods. Furthermore, the Equalized Loss of Accuracy (ELA) was calculated to assess the robustness of the proposed method under different levels of noise. The results indicate that the proposed method is more robust (less affected by noise) than the specialized ensemble method designed for handling class imbalance problems, i.e., Balanced Bagging.
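For reference, a commonly cited formulation of ELA relates accuracy at noise level x% to accuracy on clean data; lower values indicate a more noise-robust classifier. The sketch below assumes that formulation, which may differ in detail from the one used in the thesis:

```python
def ela(acc_clean, acc_noisy):
    """Equalized Loss of Accuracy: ELA_x = (100 - A_x) / A_0,
    with A_0 = accuracy (%) on clean data, A_x = accuracy at noise level x%."""
    return (100.0 - acc_noisy) / acc_clean

# A robust method loses little accuracy under noise (lower ELA is better):
robust, fragile = ela(90.0, 85.0), ela(90.0, 60.0)
print(f"robust: {robust:.3f}  fragile: {fragile:.3f}")
```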

To ascertain and validate the findings, a detailed and comprehensive statistical significance analysis of all results was carried out throughout this research. The study first reviews current practices in the machine learning community and theoretically examines various statistical tests in order to establish and recommend a set of nonparametric statistical tests suitable for validating the present experiments, i.e., the comparison of two or more classifiers. The statistical analysis shows that the proposed method performs significantly better than the compared traditional and specialized ensemble methods for imbalanced problems across many datasets.
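One standard nonparametric test for comparing multiple classifiers over multiple datasets is the Friedman test, whose statistic is computed from average ranks. A pure-Python sketch, using illustrative (not the thesis's) scores:

```python
def avg_ranks(scores):
    """scores[i][j] = score of classifier j on dataset i (higher is better).
    Returns the average rank of each classifier (rank 1 = best), with ties
    receiving their average rank."""
    n, k = len(scores), len(scores[0])
    ranks = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            r = (pos + end) / 2 + 1  # 1-based average rank over the tie group
            for idx in order[pos:end + 1]:
                ranks[idx] += r
            pos = end + 1
    return [r / n for r in ranks]

def friedman_stat(scores):
    """Friedman chi-square statistic over n datasets and k classifiers."""
    n, k = len(scores), len(scores[0])
    R = avg_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in R) - k * (k + 1) ** 2 / 4)

# Hypothetical AUC scores: 4 datasets (rows) x 3 classifiers (columns).
scores = [[0.91, 0.85, 0.80],
          [0.88, 0.84, 0.79],
          [0.93, 0.90, 0.85],
          [0.87, 0.83, 0.86]]
print(f"chi2_F = {friedman_stat(scores):.3f}")
```

If the statistic exceeds the critical chi-square value (here with k - 1 = 2 degrees of freedom), the null hypothesis that all classifiers perform equally is rejected, and a post-hoc test is then typically applied pairwise.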

