Statistical classification in high-dimensional scenarios with a focus on microarray data sets

Rodseth, Tessa Louise (2017-12)

Thesis (MCom)--Stellenbosch University, 2017.

ENGLISH SUMMARY : High-dimensional data analysis characterises many contemporary problems in statistics and arises in many application areas. This thesis focuses on very high-dimensional problems in which the input predictor variables are gene expression measurements in microarray studies. Accurate analysis of microarray data sets can provide new insight into cancer diagnosis using gene expression profiles and can result in breakthroughs in medical research. K-nearest neighbours (KNN), fastKNN, linear discriminant analysis (and variants thereof), nearest shrunken centroids (NSC) and support vector machines (SVMs) are investigated as binary (and multi-class) classification procedures on microarray data sets. The thesis also addresses the important problem of eliminating redundant input variables before classification procedures are applied to high-dimensional data sets. Several variable selection and dimension reduction procedures suitable for microarray data sets are discussed, with the empirical study focusing on sure independence screening, NSC and fastKNN feature engineering. Principal component analysis and supervised principal component analysis are implemented as the two main dimension reduction techniques. The performance of the classification procedures is evaluated on three real and three synthetic high-dimensional microarray data sets. Comparison of the classification methods in the empirical study led to the conclusion that SVMs are the most accurate procedure on the binary data sets considered, whilst NSC is the most accurate procedure on the multi-class data set.

AFRIKAANS SUMMARY (in translation) : The analysis of high-dimensional data is characteristic of many practical statistical problems today. This thesis focuses on high-dimensional data in which the independent variables represent genetic measurements, as is typical of microarray studies. Accurate analysis of microarray data can lead to new insight into, for example, the diagnosis of cancer where genetic profile data are used. This can in turn lead to breakthroughs in medical research. The KNN technique, the so-called “fastKNN” technique, linear discriminant analysis (and variations thereof), nearest shrunken centroids (NSC) and support vector machines (SVMs) are investigated in this thesis as classification procedures for binary and multi-class microarray problems. The thesis addresses the important problem of eliminating redundant and irrelevant variables from a high-dimensional data set before a classification procedure is applied to it. Several variable selection and dimension reduction procedures suitable for application to microarray data sets are discussed, with the focus placed on “sure independence screening”, NSC and “fastKNN”. These receive particular attention in the empirical part of the study. Principal component analysis and supervised principal component analysis are further implemented as two of the main dimension reduction techniques in this thesis. The quality of the classification procedures is evaluated on three real and three synthetic high-dimensional data sets. Comparison of the procedures in the empirical study leads to the conclusion that SVMs are the most accurate procedure for binary data sets, while NSC was the most accurate procedure for the multi-class data set.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/102771