Aspects of model development using regression quantiles and elemental regressions

Date
2007-03
Journal Title
Journal ISSN
Volume Title
Publisher
Stellenbosch : Stellenbosch University
Abstract
ENGLISH ABSTRACT: It is well known that ordinary least squares (OLS) procedures are sensitive to deviations from the classical Gaussian assumptions (outliers) as well as data aberrations in the design space. The two major data aberrations in the design space are collinearity and high leverage. Leverage points can also induce or hide collinearity in the design space. Such leverage points are referred to as collinearity influential points. As a consequence, over the years, many diagnostic tools to detect these anomalies as well as alternative procedures to counter them were developed. To counter deviations from the classical Gaussian assumptions many robust procedures have been proposed. One such class of procedures is the Koenker and Bassett (1978) Regressions Quantiles (RQs), which are natural extensions of order statistics, to the linear model. RQs can be found as solutions to linear programming problems (LPs). The basic optimal solutions to these LPs (which are RQs) correspond to elemental subset (ES) regressions, which consist of subsets of minimum size to estimate the necessary parameters of the model. On the one hand, some ESs correspond to RQs. On the other hand, in the literature it is shown that many OLS statistics (estimators) are related to ES regression statistics (estimators). Therefore there is an inherent relationship amongst the three sets of procedures. The relationship between the ES procedure and the RQ one, has been noted almost “casually” in the literature while the latter has been fairly widely explored. Using these existing relationships between the ES procedure and the OLS one as well as new ones, collinearity, leverage and outlier problems in the RQ scenario were investigated. Also, a lasso procedure was proposed as variable selection technique in the RQ scenario and some tentative results were given for it. These results are promising. Single case diagnostics were considered as well as their relationships to multiple case ones. In particular, multiple cases of the minimum size to estimate the necessary parameters of the model, were considered, corresponding to a RQ (ES). In this way regression diagnostics were developed for both ESs and RQs. The main problems that affect RQs adversely are collinearity and leverage due to the nature of the computational procedures and the fact that RQs’ influence functions are unbounded in the design space but bounded in the response variable. As a consequence of this, RQs have a high affinity for leverage points and a high exclusion rate of outliers. The influential picture exhibited in the presence of both leverage points and outliers is the net result of these two antagonistic forces. Although RQs are bounded in the response variable (and therefore fairly robust to outliers), outlier diagnostics were also considered in order to have a more holistic picture. The investigations used comprised analytic means as well as simulation. Furthermore, applications were made to artificial computer generated data sets as well as standard data sets from the literature. These revealed that the ES based statistics can be used to address problems arising in the RQ scenario to some degree of success. However, due to the interdependence between the different aspects, viz. the one between leverage and collinearity and the one between leverage and outliers, “solutions” are often dependent on the particular situation. In spite of this complexity, the research did produce some fairly general guidelines that can be fruitfully used in practice.
AFRIKAANSE OPSOMMING: Dit is bekend dat die gewone kleinste kwadraat (KK) prosedures sensitief is vir afwykings vanaf die klassieke Gaussiese aannames (uitskieters) asook vir data afwykings in die ontwerpruimte. Twee tipes afwykings van belang in laasgenoemde geval, is kollinearitiet en punte met hoë hefboom waarde. Laasgenoemde punte kan ook kollineariteit induseer of versteek in die ontwerp. Na sodanige punte word verwys as kollinêre hefboom punte. Oor die jare is baie diagnostiese hulpmiddels ontwikkel om hierdie afwykings te identifiseer en om alternatiewe prosedures daarteen te ontwikkel. Om afwykings vanaf die Gaussiese aanname teen te werk, is heelwat robuuste prosedures ontwikkel. Een sodanige klas van prosedures is die Koenker en Bassett (1978) Regressie Kwantiele (RKe), wat natuurlike uitbreidings is van rangorde statistieke na die lineêre model. RKe kan bepaal word as oplossings van lineêre programmeringsprobleme (LPs). Die basiese optimale oplossings van hierdie LPs (wat RKe is) kom ooreen met die elementale deelversameling (ED) regressies, wat bestaan uit deelversamelings van minimum grootte waarmee die parameters van die model beraam kan word. Enersyds geld dat sekere EDs ooreenkom met RKe. Andersyds, uit die literatuur is dit bekend dat baie KK statistieke (beramers) verwant is aan ED regressie statistieke (beramers). Dit impliseer dat daar dus ‘n inherente verwantskap is tussen die drie klasse van prosedures. Die verwantskap tussen die ED en die ooreenkomstige RK prosedures is redelik “terloops” van melding gemaak in die literatuur, terwyl laasgenoemde prosedures redelik breedvoerig ondersoek is. Deur gebruik te maak van bestaande verwantskappe tussen ED en KK prosedures, sowel as nuwes wat ontwikkel is, is kollineariteit, punte met hoë hefboom waardes en uitskieter probleme in die RK omgewing ondersoek. Voorts is ‘n lasso prosedure as veranderlike seleksie tegniek voorgestel in die RK situasie en is enkele tentatiewe resultate daarvoor gegee. Hierdie resultate blyk belowend te wees, veral ook vir verdere navorsing. Enkel geval diagnostiese tegnieke is beskou sowel as hul verwantskap met meervoudige geval tegnieke. In die besonder is veral meervoudige gevalle beskou wat van minimum grootte is om die parameters van die model te kan beraam, en wat ooreenkom met ‘n RK (ED). Met sodanige benadering is regressie diagnostiese tegnieke ontwikkel vir beide EDs en RKe. Die belangrikste probleme wat RKe negatief beinvloed, is kollineariteit en punte met hoë hefboom waardes agv die aard van die berekeningsprosedures en die feit dat RKe se invloedfunksies begrensd is in die ruimte van die afhanklike veranderlike, maar onbegrensd is in die ontwerpruimte. Gevolglik het RKe ‘n hoë affiniteit vir punte met hoë hefboom waardes en poog gewoonlik om uitskieters uit te sluit. Die finale uitset wat verkry word wanneer beide punte met hoë hefboom waardes en uitskieters voorkom, is dan die netto resultaat van hierdie twee teenstrydige pogings. Alhoewel RKe begrensd is in die onafhanklike veranderlike (en dus redelik robuust is tov uitskieters), is uitskieter diagnostiese tegnieke ook beskou om ‘n meer holistiese beeld te verkry. Die ondersoek het analitiese sowel as simulasie tegnieke gebruik. Voorts is ook gebruik gemaak van kunsmatige datastelle en standard datastelle uit die literatuur. Hierdie ondersoeke het getoon dat die ED gebaseerde statistieke met ‘n redelike mate van sukses gebruik kan word om probleme in die RK omgewing aan te spreek. Dit is egter belangrik om daarop te let dat as gevolg van die interafhanklikheid tussen kollineariteit en punte met hoë hefboom waardes asook dié tussen punte met hoë hefboom waardes en uitskieters, “oplossings” dikwels afhanklik is van die bepaalde situasie. Ten spyte van hierdie kompleksiteit, is op grond van die navorsing wat gedoen is, tog redelike algemene riglyne verkry wat nuttig in die praktyk gebruik kan word.
Description
Dissertation (PhD)--University of Stellenbosch, 2007.
Keywords
Regression analysis, Least squares, Estimation theory, Theses -- Statistics and actuarial science, Dissertations -- Statistics and actuarial science
Citation