Masters Degrees (Statistics and Actuarial Science)
Browsing Masters Degrees (Statistics and Actuarial Science) by Author "Bezuidenhout, Jean-Pierre"
- Exploring the class imbalance problem in text classification
(Stellenbosch : Stellenbosch University, 2023-03) Bezuidenhout, Jean-Pierre; Lamont, Morne; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
ENGLISH SUMMARY: Natural Language Processing (NLP) is a subfield of computer science that focuses on using computers to learn from human language. Over the years, the field has been applied to a wide variety of tasks, which has resulted in many interesting real-world applications. One of these tasks is text classification, where the focus is on developing models that can successfully predict the class label of a textual input from a set of pre-defined category labels. Text classification has previously been applied in the development of automatic spam detection systems and in the analysis of consumer sentiment. Unfortunately, many real-world text data sets have an imbalanced class label distribution. This is often the case for spam data sets, where the majority of observations are labelled as non-spam. In the development of an automatic spam detection system, we want the system to correctly identify spam instances. However, traditional Machine Learning (ML) models are usually overwhelmed by instances in the majority class, which hinders their ability to correctly identify instances in the minority class. The field of imbalanced learning focuses on the manipulation of data and algorithms to address this problem. However, these methods have not been thoroughly explored in the literature. Thus, our objective in this thesis is to contribute new knowledge to the problem of imbalanced class label distributions in the context of text classification. The problem is approached by reviewing the literature to identify ML models that were previously applied to text classification tasks.
Furthermore, data-level and algorithm-level methods that are well suited to imbalanced learning are identified from the literature. The performance of these techniques is investigated by means of an empirical study on real-world movie review data. Simulated scenarios with varying degrees of class imbalance are investigated in order to study the robustness of classifiers on imbalanced data problems, and to analyse the performance of imbalanced learning techniques. For the data set that was analysed, our results suggest that some classifiers are more robust to class imbalance than others, and that performance gains are possible when imbalanced learning techniques are included in the learning process.
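
The summary does not say how the imbalanced scenarios were constructed or which imbalanced learning techniques were applied, so the following is only an illustrative sketch and not the thesis's method. It assumes a binary-labelled review data set held in two parallel lists. The function names `simulate_imbalance` and `random_oversample` are hypothetical. The first downsamples one class to a chosen share of the data, and the second applies random oversampling, one common data-level technique, to rebalance it.

```python
import random
from collections import Counter


def simulate_imbalance(texts, labels, minority_label, minority_fraction, seed=0):
    """Downsample one class so it makes up roughly `minority_fraction` of the data.

    Hypothetical helper: an assumed way to build an imbalanced scenario.
    """
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must be in (0, 1)")
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Solve n_min / (n_min + n_maj) = minority_fraction for n_min.
    target = round(minority_fraction * len(majority) / (1 - minority_fraction))
    target = max(1, min(target, len(minority)))
    keep = majority + rng.sample(minority, target)
    rng.shuffle(keep)
    return [texts[i] for i in keep], [labels[i] for i in keep]


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen examples until every class matches the largest one."""
    rng = random.Random(seed)
    largest = max(Counter(labels).values())
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    keep = []
    for indices in by_class.values():
        keep.extend(indices)
        # Sample with replacement to fill the gap (k may be 0 for the largest class).
        keep.extend(rng.choices(indices, k=largest - len(indices)))
    rng.shuffle(keep)
    return [texts[i] for i in keep], [labels[i] for i in keep]
```

Calling `simulate_imbalance` with several values of `minority_fraction` gives a series of increasingly imbalanced scenarios on which a classifier can be compared. Oversampling should be applied to the training split only, since duplicated examples in a test split would distort the evaluation.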