Advances in random forests with application to classification

Pretorius, Arnu (2016-12)

Thesis (MCom)--Stellenbosch University, 2016.

Thesis

ENGLISH SUMMARY : Since their introduction, random forests have successfully been employed in a vast array of application areas. Fairly recently, a number of algorithms that adhere to Leo Breiman’s definition of a random forest have been proposed in the literature. Breiman’s popular random forest algorithm (Forest-RI), and related ensemble classification algorithms which followed, form the focus of this study. A review of random forest algorithms that were developed since the introduction of Forest-RI is given. This includes a novel taxonomy of random forest classification algorithms, which is based on their sources of randomization, and on deterministic modifications. Also, a visual conceptualization of contributions to random forest algorithms in the literature is provided by means of multidimensional scaling. Towards an analysis of advances in random forest algorithms, decomposition of the expected prediction error into bias and variance components is considered. In classification, such decompositions are not as straightforward as in the case of using squared-error loss for regression. Hence various definitions of bias and variance for classification can be found in the literature. Using a particular bias-variance decomposition, an empirical study of ensemble learners, including bagging, boosting and Forest-RI, is presented. From the empirical results and insights into the way in which certain mechanisms of random forests affect bias and variance, a novel random forest framework, viz. oblique random rotation forests, is proposed. Although not entirely satisfactory, the framework serves as an example of a heuristic approach towards novel proposals based on bias-variance analyses, instead of an ad hoc approach, as is often found in the literature. The analysis of comparative studies regarding advances in random forest algorithms is also considered. It is of interest to critically evaluate the conclusions that can be drawn from these studies, and to infer whether novel random forest algorithms are found to significantly outperform Forest-RI. For this purpose, a meta-analysis is conducted in which an evaluation is given of the state of research on random forests based on all (34) papers that could be found in which a novel random forest algorithm was proposed and compared to already existing random forest algorithms. Using the reported performances in each paper, a novel two-step procedure is proposed, which allows for multiple algorithms to be compared over multiple data sets, and across different papers. The meta analysis results indicate weighted voting strategies and variable weighting in high-dimensional settings to provide significantly improved performances over the performance of Breiman’s popular Forest-RI algorithm.

AFRIKAANSE OPSOMMING : Sedert hulle bekendstelling is random forests met groot sukses in ’n wye verskeidenheid toepassings geimplementeer. ’n Aantal algoritmes wat aan Leo Breiman se definisie van ’n random forest voldoen, is redelik onlangs in die literatuur voorgestel. Breiman se gewilde random forest (Forest-RI) algoritme en verwante ensemble klassifikasie algoritmes wat daaruit ontwikkel is, vorm die fokus van die studie. ’n Oorsig van nuut ontwikkelde random forest algoritmes wat sedert die bekendstellig van Forest-RI voorgestel is, word gegee. Dit sluit ’n nuwe kategoriseringsraamwerk van random forest algoritmes in, wat gebaseer is op hulle bron van ewekansigheid, asook op hulle tipe deterministiese wysigings. Met behulp van meerdimensionele skalering word ’n visuele voorstelling van bydraes in die literatuur ten opsigte van random forest algoritmes ook gegee. Met die oog op ’n analise van ontwikkelings rondom random forest algoritmes, word die opdeling van die verwagte vooruitskattingsfout in ’n sydigheiden variansie komponent beskou. In vergelyking met regressie wanneer die gekwadreerde-fout verliesfunksie gebruik word, is hierdie opdeling in klassifikasie minder voor-die-hand-liggend. Derhalwe kom verskeie definisies van sydigheid en variansie vir klassifikasie in die literatuur voor. Deur gebruik te maak van ’n spesifieke sydigheid-variansie opdeling word ’n empiriese studie van ensemble algoritmes, ingesluit bagging, boosting en Forest-RI, uitgevoer. Uit die empiriese resultate en insigte rakende die manier waarop sekere meganismes van random forests sydigheid en variansie beinvloed, word ’n nuwe random forest raamwerk voorgestel, nl. oblique random rotation forests. Hoewel nie in geheel bevredigend nie, dien die raamwerk as ’n voorbeeld van ’n heuristiese benadering tot nuwe voorstelle gebaseer op sydigheid-variansie analises in plaas van ’n ad hoc benadering, soos wat dikwels gevind word in die literatuur. Verder word vergelykende studies met betrekking tot random forests geanaliseer. Hier is dit van belang om gevolgtrekkings wat uit vergelykende studies gemaak is, krities te evalueer, en om te verifieer of nuwe random forest algoritmes betekenisvol verbeter op Forest-RI. Met bogaande doelwitte in gedagte is ’n meta-analise uitgevoer waarin die stand van random forest navorsing geevalueer is. Die analise is gebaseer op al (34) artikels waarin ’n nuwe random forest algoritme voorgestel is en vergelyk word met reeds bestaande random forest algoritmes. Deur gebruik te maak van die gerapporteerde prestasie-maatstawwe in elke artikel, is ’n nuwe prosedure voorgestel waarvolgens ’n aantal algoritmes oor ’n aantal datastelle en oor verskillende artikels vergelyk kan word. Die resultate van die meta-analise toon aan dat geweegde stem-strategiee en die weging van veranderlikes in hoe-dimensionele data ’n betekenisvolle verbetering lewer op die akkuraatheid van Breiman se gewilde Forest-RI algoritme.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/100096
This item appears in the following collections: