Preconditioning for feature selection in classification

Pretorius, Jani (2019-04)

Thesis (MCom)--Stellenbosch University, 2019.

ENGLISH SUMMARY : Increased dimensionality of data is a clear trend that has been observed over the past few decades. However, analysing high-dimensional data in order to predict an outcome can be problematic. In certain cases, such as when analysing genomic data, a predictive model that is both interpretable and accurate is required. Many techniques focus on solving these two components simultaneously; however, when the data are high-dimensional and noisy, such an approach may perform poorly. Preconditioning is a two-stage technique that aims to reduce the noise inherent in the training data before making final predictions. In doing so, it addresses the issues of interpretability and accuracy separately. The literature on this technique focuses on the regression case, but in this thesis, the technique is applied in a classification setting. An overview of the theory surrounding this method is provided, as well as an empirical analysis of the method. A simulation study evaluates the performance of the technique under various scenarios and compares the results to those obtained by standard (non-preconditioned) models. Thereafter, the models are applied to real-world datasets and their performance is compared. Based on the results of the empirical work, it appears that, at their best, preconditioned classifiers can only reach a performance that is on par with standard classifiers. This is in contrast to the regression case, where the literature has shown that preconditioning can outperform standard regression models in high-dimensional settings.

AFRIKAANSE OPSOMMING : ’n Toename in die dimensionaliteit van datastelle is ’n duidelike tendens wat oor die afgelope paar dekades na voorskyn gekom het. Om hoër-dimensionele data te analiseer sodat ’n uitkoms voorspel kan word, kan problematies wees. In sekere gevalle, soos wanneer genetiese data geanaliseer word, word ’n voorspellende model wat beide interpreteerbaar, sowel as akkuraat is, verlang. Baie tegnieke fokus daarop om hierdie twee aspekte gelyktydig op te los, maar wanneer die data van ’n hoë dimensie is en geruis bevat, kan hierdie benadering swak resultate oplewer. Prekondisionering is ’n twee-fase proses wat daarop gemik is om die geruis in die afrigdatastel te verminder voordat ’n finale voorspelling gemaak word. Sodoende spreek dit die kwessies van interpreteerbaarheid en akkuraatheid afsonderlik aan. In die literatuur word daar klem gelê op die regressie geval. In hierdie tesis word die tegniek egter toegepas in ’n klassifikasie konteks. ’n Oorsig van die teorie aangaande hierdie metode word verskaf, sowel as empiriese studies. Simulasie studies evalueer die prestasie van die tegniek onder verskeie omstandighede en vergelyk die uitkomste met dié wat deur standaard (nie-geprekondisioneerde) modelle behaal was. Daarna word die modelle toegepas op regte-wêreld datastelle en hul resultate vergelyk. Gebaseer op die resultate van die empiriese werk wil dit blyk asof geprekondisioneerde klassifikasiemodelle, op hul beste, slegs so goed as standaard klassifikasiemodelle kan presteer. Hierdie bevindinge staan in kontras met die regressie geval, waar die literatuur wys dat prekondisionering standaard regressiemodelle kan uitpresteer in hoë dimensionele gevalle.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/106057