Extensions of biplot methodology to discriminant analysis with applications of non-parametric principal components

Gardner, Sugnet (2001)

Dissertation (PhD)--Stellenbosch University, 2001.

Thesis

ENGLISH ABSTRACT: Gower and Hand offer a new perspective on the traditional biplot. This perspective provides a unified approach to principal component analysis (PCA) biplots based on Pythagorean distance; canonical variate analysis (CVA) biplots based on Mahalanobis distance; non-linear biplots based on Euclidean embeddable distances; and generalised biplots for use with both continuous and categorical variables. The biplot methodology of Gower and Hand is extended and applied in statistical discrimination and classification. This leads to discriminant analysis by means of PCA biplots, CVA biplots, non-linear biplots and generalised biplots. Properties of these techniques are derived in detail. Classification regions defined for linear discriminant analysis (LDA) are applied in the CVA biplot, leading to discriminant analysis using biplot methodology. Situations where the assumptions of LDA are not met are considered, and various existing alternative discriminant analysis procedures are formulated in terms of biplots: apart from PCA biplots, QDA, FDA and DSM biplots are defined and constructed, and their usage is illustrated. It is demonstrated that biplot methodology naturally provides for managing categorical and continuous variables simultaneously. A simulation study shows that the techniques based on biplot methodology can be applied successfully to the reversal problem with categorical variables in discriminant analysis. Situations occurring in practice where existing discriminant analysis procedures based on distances from means fail are considered. After a discussion of self-consistency and principal curves (a form of non-parametric principal components), discriminant analysis based on distances from principal curves (a form of conditional mean) is proposed. This biplot classification procedure based upon principal curves yields much better results. Bootstrapping is considered as a means of describing variability in biplots.
Variability of samples as well as of axes in biplot displays receives attention. Bootstrap α-regions are defined, and the ability of these regions to describe biplot variability and to detect outliers is demonstrated. Robust PCA and CVA biplots restricting the role of influential observations on biplot displays are also considered. An extensive library of S-PLUS computer programmes is provided for implementing the various discriminant analysis techniques that were developed using biplot methodology. The application of the above theoretical developments and computer software is illustrated by analysing real-life data sets. Biplots are used to investigate the degree of capital intensity of companies and to serve as an aid in the risk management of a financial institution. A particular application of the PCA biplot is the TQI biplot, used in industry to determine the degree to which manufactured items comply with multidimensional specifications. A further interesting application is to determine whether an Old-Cape furniture item is manufactured of stinkwood or embuia. A data set provided by the Western Cape Nature Conservation Board, consisting of measurements of tortoises from the species Homopus areolatus, is analysed by means of biplot methodology to determine whether morphological differences exist among tortoises from different geographical regions. Allometric considerations need to be taken into account, and the resulting small sample sizes in some subgroups severely limit the use of conventional statistical procedures. Biplot methodology is also applied to classification in a diabetes data set, illustrating the combined advantage of using classification with principal curves in a robust biplot, or biplot classification where covariance matrices are unequal.
A discriminant analysis problem where foraging behaviour of deer might eventually result in a change in the dominant plant species is used to illustrate biplot classification of data sets containing both continuous and categorical variables. As an example of the use of biplots with large data sets, a data set consisting of 16828 lemons is analysed using biplot methodology to investigate differences in fruit from various areas of production, cultivars and rootstocks. The proposed α-bags also provide a means of quantifying the graphical overlap among classes. This method is successfully applied in a multidimensional socio-economic data set to quantify the degree of overlap among different race groups. The application of the proposed biplot methodology in practice has an important by-product: it provides the impetus for many a new idea. For example, applying a PCA biplot in industry led to the development of quality regions, and α-bags were constructed to represent thousands of observations in the lemons data set, in turn leading to a means of quantifying the degree of overlap. This illustrates the enormous flexibility of biplots: biplot methodology provides an infrastructure for many novelties when applied in practice.

AFRIKAANS SUMMARY (translated into English): Gower and Hand offer a new perspective on the traditional biplot. This perspective provides a unified approach to principal component analysis (PCA) biplots based on Pythagorean distance; canonical variate analysis (CVA) biplots based on Mahalanobis distance; non-linear biplots based on Euclidean embeddable distances; and generalised biplots for use when both continuous and categorical variables occur. The biplot methodology of Gower and Hand is extended and applied in statistical discrimination and classification. This leads to discriminant analysis by means of PCA biplots, CVA biplots, non-linear biplots and generalised biplots. The properties of these techniques are derived in detail. Applying the concept of a classification region in the CVA biplot paves the way for linear discriminant analysis (LDA) by means of biplot methodology. Situations where the assumptions of LDA are not met receive attention, and various existing alternative discriminant analysis procedures are formulated in terms of biplots: besides PCA biplots, QDA, FDA and DSM biplots are defined and constructed, and their uses are demonstrated. It is shown that biplot methodology provides in a natural way for handling categorical and continuous variables simultaneously. A simulation study shows that techniques based on the biplot methodology that was developed can be used successfully for the so-called reversal problem in discriminant analysis with categorical variables. It is further argued that many practical situations occur in which existing discriminant analysis procedures fail because they are based on distances from means.
After a discussion of self-consistency and principal curves (a form of non-parametric principal components), it is proposed that discriminant analysis be based on distance from principal curves (a form of conditional mean). In this way a biplot classification procedure based on distance from principal curves, which yields much better results, was developed. The variation in the positions of data points in the biplot, as well as of the biplot axes, is studied by means of bootstrap methods. A bootstrap α-region is defined, and it is demonstrated how such an α-region can be used to describe variation in biplots and to identify outliers. Robust PCA and CVA biplots that restrict the role of influential observations on the biplot are discussed. An extensive library of S-PLUS computer programmes was written for implementing the various discriminant analysis techniques developed by means of biplot methodology. The application of the preceding theoretical developments and computer programmes is illustrated using real data sets from practice. Biplots are thus used to investigate the degree of capital intensity of companies and to serve as an aid in the risk management of a financial institution. A particular application of the PCA biplot is the TQI biplot, used in the industrial environment to determine the extent to which manufactured items comply with laid-down multidimensional specifications. A further interesting application is to determine whether an Old-Cape furniture item is made of stinkwood or embuia. A data set provided by Western Cape Nature Conservation concerning the well-known padloper tortoise, Homopus areolatus, was analysed by means of biplots to determine whether morphometric differences exist among padlopers from particular geographical regions.
Allometric principles also had to be taken into account, and the small number of observations in some of the subgroups means that conventional statistical techniques cannot simply be used. The biplot methodology was also applied to classification in a diabetes data set to illustrate the combined use of principal curves in a robust biplot, and biplot classification where covariance matrices are unequal. A discriminant analysis problem in which the grazing preferences of deer may result in a change in the dominant vegetation is used to illustrate biplot classification with data containing both continuous and categorical variables. As an example of the use of biplots with a large data set, a data set consisting of observations of 16828 lemons was analysed by means of biplot methodology in order to investigate differences in fruit from various production regions, cultivars and rootstocks. The α-bags developed here lead to quantification of the graphical overlap of groups. This principle is successfully applied in a multidimensional socio-economic data set to quantify the degree of overlap of different population groups. The application of the proposed biplot methodology in practice leads to an important by-product: it provides the stimulus for new ideas. For example, the application of a PCA biplot in an industrial environment gave rise to the development of the concept of a quality region, and α-bags were constructed to represent thousands of observations in the lemon data set, which in turn led to a method for quantifying the degree of overlap. This illustrates the enormous versatility of biplots: biplot methodology provides the infrastructure for many inventive applications in practice.

