Strategies for combining tree-based learners

Date
2020-04
Publisher
Stellenbosch : Stellenbosch University.
Abstract
ENGLISH ABSTRACT: In supervised statistical learning, an ensemble is a predictive model that is the conglomeration of several other predictive models. Ensembles are applicable to both classification and regression problems and have demonstrated theoretical and practical appeal. Furthermore, due to recent advances in computing, the application of ensemble methods has become widespread. Structurally, ensembles can be characterised according to two distinct aspects. The first is the method employed to train the individual base learning models that constitute the conglomeration. The second is the technique used to combine the predictions of the individual base learners into a single prediction for an observation. This thesis considers the second aspect. Specifically, the focus is on weighting strategies for combining tree models that are trained in parallel on bootstrap-resampled versions of the training sample. The contribution of this thesis is the development of a regularised weighted model. The purpose is two-fold. First, the technique provides flexibility in controlling the bias-variance trade-off when fitting the model. Second, through the application of ℓ2 regularisation, the proposed strategy mitigates issues that plague similar weighting strategies. These issues include an ill-conditioned optimisation problem for finding the weights, and overfitting in low signal-to-noise scenarios. The thesis provides a derivation that outlines the mathematical details of solving for the weights of the individual models. Crucially, the solution relies on methods from convex optimisation, which are also discussed. In addition, the technique is assessed against established ensemble techniques on both simulated and real-world data sets. The results show that the proposal performs well relative to established averaging techniques such as bagging and random forest. It is argued that the proposed approach offers a generalisation of the bagging regression ensemble. In this regard, the bagging regressor is a highly regularised weighted ensemble leveraging ℓ2 regularisation, not merely an equally weighted ensemble. This deduction relies on the imposition of two constraints on the weights, namely a positivity constraint and a normalisation constraint.
Key words: Ensemble, Regression, Regularisation, Convex Optimisation, Bagging, Random Forest
AFRIKAANS ABSTRACT (English translation): In statistical learning theory, an ensemble is a predictive model that consists of several other predictive models. Ensembles can be used for both regression and classification problems, and show theoretical and practical applicability. Ensembles are also increasingly used as a result of recent developments in computing. The typical structure of ensembles involves two aspects. The first is to train the individual models that form part of the ensemble. The second is to combine the predictions of the individual models. The result is a single prediction for a specific observation. This thesis considers the second aspect. As far as possible, the focus is on weighted strategies for combining regression trees, where the trees are trained in parallel on bootstrap samples. The contribution of the thesis is the development of a regularised weighted model. The purpose is two-fold. First, the technique offers flexibility in controlling the bias-variance trade-off when the model is fitted. Second, through the use of ℓ2 regularisation, the proposed strategy avoids problems that arise with similar techniques. These include optimisation problems in which the weights are difficult to determine, as well as overfitting caused by a low signal-to-noise ratio. The thesis proposes a mathematical framework for determining the weights of the individual models. The solution makes use of convex optimisation, which is also discussed in the thesis. Furthermore, the proposed technique is compared with existing ensemble techniques on simulated and real-world data sets. The results show that the proposed technique performs well when compared with bagging and random forest. It is argued that the proposed technique is a generalisation of the bagging regression ensemble. In this regard, bagging regression can be viewed as a highly ℓ2-regularised ensemble technique, and not merely as a weighted ensemble technique. This deduction is based on two constraints placed on the weights, namely that the weights are positive and that the weights are normalised.
Key words: Ensemble, Regression, Regularisation, Convex Optimisation, Bagging, Random Forest
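The abstract's central claim, that bagging corresponds to a heavily ℓ2-regularised weighted ensemble once the weights are constrained to be positive and to sum to one, can be illustrated with a small sketch. The abstract does not state the objective, so this assumes mean squared error plus λ times the squared ℓ2 norm of the weights; the thesis's exact formulation may differ. The solver is plain projected gradient descent, chosen here for brevity and not taken from the thesis. All function names are hypothetical and the prediction matrix is invented toy data standing in for the outputs of three trees.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w_j >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0.0:
            theta = candidate  # keep the threshold from the last valid k
    return [max(x - theta, 0.0) for x in v]


def ensemble_mse(preds, y, w):
    """Mean squared error of the weighted combination of base predictions."""
    errors = [sum(p * wj for p, wj in zip(row, w)) - yi
              for row, yi in zip(preds, y)]
    return sum(e * e for e in errors) / len(y)


def fit_weights(preds, y, lam, n_iter=2000):
    """Minimise MSE + lam * ||w||^2 over the simplex (assumed objective).

    Uses projected gradient descent with a fixed step size derived from an
    upper bound on the Lipschitz constant of the gradient.
    """
    n, m = len(preds), len(preds[0])
    lipschitz = 2.0 * (sum(p * p for row in preds for p in row) / n + lam)
    step = 1.0 / lipschitz
    w = [1.0 / m] * m  # start from the equal weights used by bagging
    for _ in range(n_iter):
        residuals = [sum(p * wj for p, wj in zip(row, w)) - yi
                     for row, yi in zip(preds, y)]
        grad = [2.0 / n * sum(r * row[j] for r, row in zip(residuals, preds))
                + 2.0 * lam * w[j] for j in range(m)]
        w = project_to_simplex([wj - step * g for wj, g in zip(w, grad)])
    return w


# Invented toy data: rows are observations, columns are three "trees".
y = [1.0, 2.0, 3.0, 4.0, 5.0]
preds = [
    [1.1, 0.5, 1.8],
    [2.1, 1.6, 2.9],
    [2.9, 2.4, 3.6],
    [4.2, 3.5, 5.1],
    [4.9, 4.4, 5.7],
]

uniform = [1.0 / 3.0] * 3
w_heavy = fit_weights(preds, y, lam=1e6)  # very strong regularisation
w_free = fit_weights(preds, y, lam=0.0)   # no regularisation

print("weights, lam=1e6:", w_heavy)
print("weights, lam=0:  ", w_free)
print("training MSE, equal weights:", ensemble_mse(preds, y, uniform))
print("training MSE, lam=0 weights:", ensemble_mse(preds, y, w_free))
```

Under these assumptions, a very large λ pushes the constrained weights towards 1/M for each of the M base learners, which is the equal-weight average used by bagging. A small λ lets the weights adapt to the training data, at the risk of overfitting when the signal-to-noise ratio is low.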
Description
Thesis (MCom)--Stellenbosch University, 2020.
Keywords
Ensembles (Mathematics), Regression analysis, Regularization, Convex sets, Trees (Graph theory), Random data (Statistics), Set theory, Learning models (Stochastic processes), Base learner, Prediction (Logic), UCTD