Exploring the class imbalance problem in text classification

Date
2023-03
Publisher
Stellenbosch : Stellenbosch University
Abstract
ENGLISH SUMMARY: Natural Language Processing (NLP) is a subfield of computer science focused on using computers to learn from human language. Over the years, the field has been applied to a wide variety of tasks, which has resulted in many interesting real-world applications. One of these tasks is text classification, where the focus is on developing models that can successfully predict the class label of a textual input from a set of pre-defined category labels. Text classification has previously been applied in the development of automatic spam detection systems and in the analysis of consumer sentiment. Unfortunately, many real-world text data sets have an imbalanced class label distribution. This is often the case for spam data sets, where the majority of observations are labelled as non-spam. In the development of an automatic spam detection system, we want the system to correctly identify spam instances. However, traditional Machine Learning (ML) models are usually overwhelmed by instances in the majority class, which hinders their ability to correctly identify instances in the minority class. The field of imbalanced learning is focused on the manipulation of data and algorithms to address this problem. However, these methods have not been thoroughly explored in the literature. Thus, our objective in this thesis is to contribute new knowledge to the problem of imbalanced class label distributions in the context of text classification. The problem is approached by reviewing the literature to identify ML models that were previously applied to text classification tasks. Furthermore, methods that manipulate data and algorithms, and that are well suited to the task of imbalanced learning, are identified from the literature. The performance of these techniques is investigated by means of an empirical study focused on real-world movie review data.
Simulated scenarios with varying degrees of class imbalance are investigated in order to study the robustness of classifiers on imbalanced data problems, and to analyse the performance of imbalanced learning techniques. For the data set that was analysed, our results suggest that some classifiers are more robust to class imbalance than others, and that performance gains are possible when imbalanced learning techniques are included in the learning process.
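As a rough illustration of what a simulated imbalance scenario could look like, the sketch below randomly undersamples one class of a labelled text data set until it makes up a chosen fraction of the sample, and then computes inverse-frequency class weights, one simple algorithm-level remedy. This summary does not state which techniques the thesis uses, so the function names (`make_imbalanced`, `balanced_class_weights`), the placeholder reviews, and the 10% minority fraction are all assumptions made for illustration only.

```python
import random
from collections import Counter


def make_imbalanced(texts, labels, minority_label, minority_fraction, seed=0):
    """Randomly undersample one class so that it makes up roughly
    `minority_fraction` of the returned sample (hypothetical helper)."""
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)
    pairs = list(zip(texts, labels))
    majority = [p for p in pairs if p[1] != minority_label]
    minority = [p for p in pairs if p[1] == minority_label]
    # Solve n_min / (n_min + n_maj) = f for n_min, keeping all majority rows.
    n_min = round(minority_fraction * len(majority) / (1 - minority_fraction))
    n_min = min(n_min, len(minority))  # cannot draw more rows than exist
    sample = majority + rng.sample(minority, n_min)
    rng.shuffle(sample)
    return [t for t, _ in sample], [y for _, y in sample]


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


# Placeholder data standing in for a balanced set of movie reviews.
texts = [f"review {i}" for i in range(200)]
labels = ["pos"] * 100 + ["neg"] * 100

# Assumed scenario: negative reviews form about 10% of the sample.
sim_texts, sim_labels = make_imbalanced(texts, labels, "neg", 0.10)
print(Counter(sim_labels))
print(balanced_class_weights(sim_labels))
```

Repeating the call with different values of `minority_fraction` gives a series of scenarios with increasing imbalance. The resulting weights could be passed to any classifier that accepts per-class weights, so that errors on the rare class cost more during training.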
AFRIKAANSE OPSOMMING: Natuurlike taalverwerking is ’n subveld in rekenaarwetenskap wat daarop gefokus is om rekenaars te gebruik om uit menslike taal te leer. Oor die jare is die veld gebruik om ’n wye verskeidenheid take uit te voer wat gelei het tot baie interessante regte-wêreld toepassings. Een van hierdie take is teksklassifikasie, waar die fokus is op die ontwikkeling van modelle wat in staat is om die klastoekenning suksesvol te voorspel vir teksinsette vanaf ’n stel vooraf gedefinieerde klasse. Teksklassifikasie is voorheen toegepas in die ontwikkeling van outomatiese gemorsposopsporingstelsels en in die ontleding van verbruikersentiment. Ongelukkig het baie regte-wêreld teksdata ’n ongebalanseerde klasverspreiding. Dit is dikwels die geval vir gemorsposdatastelle, waar die meeste waarnemings as nie-gemorspos bestempel word. In die ontwikkeling van ’n outomatiese gemorsposopsporingstelsel wil ons hê dat die stelsel gemorsposgevalle korrek identifiseer. Tradisionele Masjienleer (ML) modelle word egter gewoonlik oorweldig deur gevalle in die meerderheidsklas, wat die vermoë van hierdie modelle belemmer om gevalle in die minderheidsklas korrek te identifiseer. Die veld van ongebalanseerde leer is gefokus op die manipulering van data en algoritmes om die probleem wat pas beskryf is, aan te spreek. Hierdie metodes is egter nie deeglik in die literatuur ondersoek nie. Ons doelwit in hierdie tesis is dus om nuwe kennis by te dra tot die probleem van ongebalanseerde klasverspreidings in die konteks van teksklassifikasie. Die probleem word benader deur die literatuur te hersien om ML-modelle te identifiseer wat voorheen op teksklassifikasietake toegepas is. Verder word metodes uit die literatuur geïdentifiseer wat data en algoritmes manipuleer wat goed geskik is vir die taak van ongebalanseerde leer. Die prestasie van hierdie tegnieke word ondersoek deur middel van ’n empiriese studie wat gefokus het op werklike filmresensiedata.
Gesimuleerde scenario’s met verskillende grade van klaswanbalans word ondersoek om die robuustheid van klassifiseerders op ongebalanseerde dataprobleme te bestudeer, en om die prestasie van ongebalanseerde leertegnieke te ontleed. Vir die datastel wat ontleed is, dui die resultate van ons bevindinge daarop dat sommige klassifiseerders meer robuust is vir klaswanbalans as ander, en dat prestasietoenames moontlik is wanneer ongebalanseerde leertegnieke by die leerproses ingesluit word.
Description
Thesis (MCom)--Stellenbosch University, 2023.