# Robust principal component analysis biplots

Thesis (MSc (Mathematical Statistics))--University of Stellenbosch, 2008.

Thesis

In this study several procedures for finding robust principal components (RPCs) for low and high dimensional data sets are investigated in parallel with robust principal component analysis (RPCA) biplots. These RPCA biplots will be used for the simultaneous visualisation of the observations and variables in the subspace spanned by the RPCs. Chapter 1 contains: a brief overview of the difficulties that are encountered when graphically investigating patterns and relationships in multidimensional data and why PCA can be used to circumvent these difficulties; the objectives of this study; a summary of the work done in order to meet these objectives; certain results in matrix algebra that are needed throughout this study. In Chapter 2 the derivation of the classic sample principal components (SPCs) is first discussed in detail since they are the „building blocks‟ of classic principal component analysis (CPCA) biplots. Secondly, the traditional CPCA biplot of Gabriel (1971) is reviewed. Thirdly, modifications to this biplot using the new philosophy of Gower & Hand (1996) are given attention. Reasons why this modified biplot has several advantages over the traditional biplot – some of which are aesthetical in nature – are given. Lastly, changes that can be made to the Gower & Hand (1996) PCA biplot to optimally visualise the correlations between the variables is discussed. Because the SPCs determine the position of the observations as well as the orientation of the arrows (traditional biplot) or axes (Gower and Hand biplot) in the PCA biplot subspace, it is useful to give estimates of the standard errors of the SPCs together with the biplot display as an indication of the stability of the biplot. A computer-intensive statistical technique called the Bootstrap is firstly discussed that is used to calculate the standard errors of the SPCs without making underlying distributional assumptions. Secondly, the influence of outliers on Bootstrap results is investigated. Lastly, a robust form of the Bootstrap is briefly discussed for calculating standard error estimates that remain stable with or without the presence of outliers in the sample. All the preceding topics are the subject matter of Chapter 3. In Chapter 4, reasons why a PC analysis should be made robust in the presence of outliers are firstly discussed. Secondly, different types of outliers are discussed. Thirdly, a method for identifying influential observations and a method for identifying outlying observations are investigated. Lastly, different methods for constructing robust estimates of location and dispersion for the observations receive attention. These robust estimates are used in numerical procedures that calculate RPCs. In Chapter 5, an overview of some of the procedures that are used to calculate RPCs for lower and higher dimensional data sets is firstly discussed. Secondly, two numerical procedures that can be used to calculate RPCs for lower dimensional data sets are discussed and compared in detail. Details and examples of robust versions of the Gower & Hand (1996) PCA biplot that can be constructed using these RPCs are also provided. In Chapter 6, five numerical procedures for calculating RPCs for higher dimensional data sets are discussed in detail. Once RPCs have been obtained by using these methods, they are used to construct robust versions of the PCA biplot of Gower & Hand (1996). Details and examples of these robust PCA biplots are also provided. An extensive software library has been developed so that the biplot methodology discussed in this study can be used in practice. The functions in this library are given in an appendix at the end of this study. This software library is used on data sets from various fields so that the merit of the theory developed in this study can be visually appraised.