Doctoral Degrees (Statistics and Actuarial Science)
Browsing Doctoral Degrees (Statistics and Actuarial Science) by Advisor "Le Roux, N. J."
- Biplot methodology for analysing and evaluating missing multivariate nominal scaled data (Stellenbosch : Stellenbosch University, 2019-12) Nienkemper-Swanepoel, Johane; Le Roux, N. J.; Lubbe, Sugnet; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH ABSTRACT: This research aims at developing exploratory techniques that are specifically suitable for missing data applications. Categorical data analysis, missing data analysis and biplot visualisation are the three core methodologies that are combined to develop novel techniques. Variants of multiple correspondence analysis (MCA) biplots are used for all visualisations. The first study objective addresses exploratory analysis after multiple imputation (MI). Multiple plausible values are imputed for each missing observation to construct multiple completed data sets for standard analyses. After MI, a biplot visualisation is constructed for each completed data set, and each requires individual exploration to obtain final inference. The number of MIs will greatly affect the accuracy and consistency of the interpretations obtained from several plots. This predicament led to the development of GPAbin, which optimally combines configurations from MIs to obtain a single configuration for final inference. The GPAbin approach builds on two statistical techniques: generalised orthogonal Procrustes analysis (GPA) and Rubin's rules, the combining rules used to combine estimates obtained from MIs. Although MI is a superior missing data handling approach, it could be daunting for the non-technical practitioner. Therefore, an adequate alternative approach could be appealing and contribute to the variety of available methods for the handling of incomplete multivariate categorical data. The second objective aims at confirming whether visualisations obtained from non-imputed data sets are a suitable alternative to visualisations obtained from MIs.
Subset MCA (sMCA) distinguishes between observed and missing subsets of a multivariate categorical data set by creating an additional response category level (CL) for missing responses in the indicator matrix. Missing and observed responses can be visualised separately by only considering the subset of interest in the recoded indicator matrix. The visualisation of the observed responses utilises all available information which would have been forfeited by deletion methods. The third study objective explores the possibility of predicting a complete multivariate categorical data set from MI visualisations obtained from the first study objective. The distances between the coordinates of a biplot in the full space are used to predict plausible responses. Since the aim of this research is to advance missing data visualisations, the visualisations obtained from predicted completed data sets are compared to visualisations of simulated complete data sets. The emphasis is on preserving inference and not recreating the original data. Missing data techniques are typically developed to address a specific missing data problem. It is therefore crucial to understand the cause of missingness in order to apply suitable missing data techniques. The fourth study objective investigates the sMCA biplot of the missing subset of the recoded indicator matrix. Configurations of the incomplete subsets enable the recognition of non‐response patterns which could provide insight into the particular missing data mechanism (MDM). The missing at random (MAR) MDM refers to missing responses that are dependent on the observed information and is expected to be identified by patterns and groupings occurring in the incomplete sMCA biplot. The missing completely at random (MCAR) MDM states that all observations have the same probability of not being captured which could be identified by a random cloud of points in the incomplete sMCA biplot. 
Cluster analysis is applied to confirm distinguishable groupings in the incomplete sMCA biplot which could be used as a guideline to identify the MDM. The proposed methodologies to address the different study objectives are evaluated by means of an extensive simulation study comprising various sample sizes, numbers of variables and numbers of CLs, which are simulated from three different distributions. The findings of the simulation study are applied to a real data set to serve as a guide for the analysis. Functions have been developed for R statistical software to perform all methodology presented in this research. They are included as a tool pack provided as an appendix to assist in the correct handling and unbiased visualisation of multivariate categorical data with missing observations. Keywords: biplots; categorical data; missing data; multiple correspondence analysis; multiple imputation; Procrustes analysis.
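The recoding step behind subset MCA in the abstract above (an additional category level for missing responses in the indicator matrix) can be illustrated with a minimal Python sketch. It is not the thesis's R tool pack: the function names, the `NA` label and the toy data are assumptions made for illustration, and the sketch stops at the recoding and column selection without performing the MCA itself.

```python
from typing import Dict, List, Optional, Tuple

MISSING_LABEL = "NA"  # hypothetical label; assumes no observed category is called "NA"


def recoded_indicator(
    data: List[Dict[str, Optional[str]]],
) -> Tuple[List[str], List[List[int]]]:
    """Build a 0/1 indicator matrix with an extra column per variable
    for missing responses (None), when that variable has any."""
    variables = sorted({var for row in data for var in row})
    columns: List[Tuple[str, str]] = []
    for var in variables:
        observed = sorted({row[var] for row in data if row.get(var) is not None})
        has_missing = any(row.get(var) is None for row in data)
        levels = observed + ([MISSING_LABEL] if has_missing else [])
        columns.extend((var, level) for level in levels)

    matrix: List[List[int]] = []
    for row in data:
        coded = []
        for var, level in columns:
            value = row.get(var)
            if value is None:
                value = MISSING_LABEL  # missing response goes to the extra level
            coded.append(1 if value == level else 0)
        matrix.append(coded)

    names = [f"{var}:{level}" for var, level in columns]
    return names, matrix


def split_subsets(names: List[str]) -> Tuple[List[int], List[int]]:
    """Return column indices of the observed and the missing subsets."""
    suffix = ":" + MISSING_LABEL
    missing_idx = [j for j, name in enumerate(names) if name.endswith(suffix)]
    observed_idx = [j for j, name in enumerate(names) if not name.endswith(suffix)]
    return observed_idx, missing_idx


def take_columns(matrix: List[List[int]], idx: List[int]) -> List[List[int]]:
    """Select the columns of one subset from the recoded indicator matrix."""
    return [[row[j] for j in idx] for row in matrix]
```

Every row of the recoded matrix sums to the number of variables, because a missing response is counted in its own category level rather than dropped. The observed or missing subset is then just a column selection from that matrix.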
- Extensions of biplot methodology to discriminant analysis with applications of non-parametric principal components (Stellenbosch : Stellenbosch University, 2001) Gardner, Sugnet; Le Roux, N. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistical and Actuarial Science. ENGLISH ABSTRACT: Gower and Hand offer a new perspective on the traditional biplot. This perspective provides a unified approach to principal component analysis (PCA) biplots based on Pythagorean distance; canonical variate analysis (CVA) biplots based on Mahalanobis distance; non-linear biplots based on Euclidean embeddable distances; as well as generalised biplots for use with both continuous and categorical variables. The biplot methodology of Gower and Hand is extended and applied in statistical discrimination and classification. This leads to discriminant analysis by means of PCA biplots, CVA biplots, non-linear biplots as well as generalised biplots. Properties of these techniques are derived in detail. Classification regions defined for linear discriminant analysis (LDA) are applied in the CVA biplot, leading to discriminant analysis using biplot methodology. Situations where the assumptions of LDA are not met are considered, and various existing alternative discriminant analysis procedures are formulated in terms of biplots: apart from PCA biplots, QDA, FDA and DSM biplots are defined and constructed, and their usage is illustrated. It is demonstrated that biplot methodology naturally provides for managing categorical and continuous variables simultaneously. It is shown through a simulation study that the techniques based on biplot methodology can be applied successfully to the reversal problem with categorical variables in discriminant analysis. Situations occurring in practice where existing discriminant analysis procedures based on distances from means fail are considered.
After discussing self-consistency and principal curves (a form of non-parametric principal components), discriminant analysis based on distances from principal curves (a form of a conditional mean) is proposed. This biplot classification procedure based upon principal curves yields much better results. Bootstrapping is considered as a means of describing variability in biplots. Variability in samples as well as of axes in biplot displays receives attention. Bootstrap α-regions are defined and the ability of these regions to describe biplot variability and to detect outliers is demonstrated. Robust PCA and CVA biplots restricting the role of influential observations on biplot displays are also considered. An extensive library of S-PLUS computer programmes is provided for implementing the various discriminant analysis techniques that were developed using biplot methodology. The application of the above theoretical developments and computer software is illustrated by analysing real-life data sets. Biplots are used to investigate the degree of capital intensity of companies and to serve as an aid in risk management of a financial institution. A particular application of the PCA biplot is the TQI biplot used in industry to determine the degree to which manufactured items comply with multidimensional specifications. A further interesting application is to determine whether an Old-Cape furniture item is manufactured of stinkwood or embuia. A data set provided by the Western Cape Nature Conservation Board consisting of measurements of tortoises from the species Homopus areolatus is analysed by means of biplot methodology to determine if morphological differences exist among tortoises from different geographical regions. Allometric considerations need to be taken into account and the resulting small sample sizes in some subgroups severely limit the use of conventional statistical procedures.
Biplot methodology is also applied to classification in a diabetes data set illustrating the combined advantage of using classification with principal curves in a robust biplot or biplot classification where covariance matrices are unequal. A discriminant analysis problem where foraging behaviour of deer might eventually result in a change in the dominant plant species is used to illustrate biplot classification of data sets containing both continuous and categorical variables. As an example of the use of biplots with large data sets, a data set consisting of 16828 lemons is analysed using biplot methodology to investigate differences in fruit from various areas of production, cultivars and rootstocks. The proposed α-bags also provide a measure of quantifying the graphical overlap among classes. This method is successfully applied in a multidimensional socio-economical data set to quantify the degree of overlap among different race groups. The application of the proposed biplot methodology in practice has an important byproduct: it provides the impetus for many a new idea, e.g. applying a PCA biplot in industry led to the development of quality regions; α-bags were constructed to represent thousands of observations in the lemons data set, in turn leading to means for quantifying the degree of overlap. This illustrates the enormous flexibility of biplots: biplot methodology provides an infrastructure for many novelties when applied in practice.
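The link between LDA classification regions and the CVA biplot mentioned in the abstract above can be written out as a standard textbook formulation. This is a sketch under stated assumptions rather than the thesis's own notation: the symbols $\mathbf{x}$ (an observation), $\bar{\mathbf{x}}_k$ (the mean of class $k$), $\mathbf{W}$ (the pooled within-class covariance matrix) and $\mathbf{L}$ (the canonical transformation) are introduced here for illustration, and equal prior probabilities are assumed.

```latex
% Squared Mahalanobis distance from observation x to the mean of class k
d_M^2(\mathbf{x}, \bar{\mathbf{x}}_k)
  = (\mathbf{x} - \bar{\mathbf{x}}_k)^{\top} \mathbf{W}^{-1} (\mathbf{x} - \bar{\mathbf{x}}_k)

% LDA rule with equal priors: assign x to the nearest class mean
\hat{k}(\mathbf{x}) = \arg\min_{k} \; d_M^2(\mathbf{x}, \bar{\mathbf{x}}_k)

% With a transformation L satisfying L^T W L = I and y = L^T x,
% Mahalanobis distance becomes ordinary Euclidean distance in the canonical space
d_M^2(\mathbf{x}, \bar{\mathbf{x}}_k) = \lVert \mathbf{y} - \bar{\mathbf{y}}_k \rVert^2
```

The last identity holds in the full canonical space. In a two-dimensional biplot the distances are approximations, except between the class means when the means span at most two dimensions.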
- Multivariate statistical process evaluation and monitoring for complex chemical processes (Stellenbosch : Stellenbosch University, 2015-12) Rossouw, Ruan Francois; Le Roux, N. J.; Coetzer, R. L. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH ABSTRACT: In this study, the development of an innovative, fully integrated process monitoring methodology is presented for a complex chemical facility, originating at the coal feed from different mines up to the processing of the coal to produce raw gas at the gasification plant. The methodology developed is real-time and visual, detects deviations from expected performance across the whole value chain, and provides for the integration and standardisation of data from a number of different data sources and formats. Real-time coal quality analyses from an XRF analyser are summarised and integrated with various data sources from the Coal Supply Facility to provide information on the coal quality of each mine. In addition, simulation models are developed to generate information on the coal quality of each heap and the quality of the reclaimed coal sent to gasification. A real-time multivariate process monitoring approach for the Coal Gasification Facility is presented. This includes a novel approach utilising Generalised Orthogonal Procrustes Analysis to find the optimal units and time period to employ as a reference set. Principal Component Analysis (PCA) and Canonical Variate Analysis (CVA) theory and biplots are evaluated and extended for the real-time monitoring of the plant. A new approach to process deviation monitoring on many variables is presented based on the confidence ( ) value at a specified T²-value. This methodology is proposed as a general data-driven performance index as it is objective, and very little prior knowledge of the system is required.
A new multivariate gasifier performance index (GPI) is developed, which integrates subject matter knowledge with a data-driven approach for real-time performance monitoring. The various software modules required for the implementation of the real-time multivariate process monitoring methodology are developed, and the methodology is made operational and distributed to the clients on an interactive web interface. The methodology has been trademarked by Sasol as the MSPEM™ Technology Package. Following the success of the developed methodology, the MSPEM™ package has been rolled out to many more business units within the Sasol Group. In conclusion, this study presents the development and implementation of the MSPEM™ application for a real-time, integrated and standardised approach to multivariate process monitoring of the Sasol Synfuels Coal Value Chain and Gasification Facility. In summary, the following novel developments were introduced:
• The application of distance measures other than Euclidean measures is introduced for space filling designs for computer experiments in mixture variables.
• An approach utilising Generalised Orthogonal Procrustes Analysis to specify the optimal units and time period to employ as a reference set is developed.
• An approach to process deviation monitoring on many variables is presented based on the confidence ( ) value at a specified T²-value.
• An integrated approach to a reactor performance index is developed and illustrated.
• A comprehensive software infrastructure is developed and implemented.
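The deviation monitoring described in the abstract above relies on a T² value computed against a reference set. Assuming this refers to the usual Hotelling-type statistic, a minimal Python sketch for two process variables follows. It is not the MSPEM™ implementation: the function names and toy reference data are hypothetical, and the confidence value the thesis derives from T² is not reconstructed here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mean_and_cov(reference: Sequence[Point]) -> Tuple[Point, Tuple[float, float, float]]:
    """Sample mean and sample covariance (s11, s12, s22) of a 2-variable reference set."""
    n = len(reference)
    if n < 2:
        raise ValueError("need at least two reference observations")
    m1 = sum(p[0] for p in reference) / n
    m2 = sum(p[1] for p in reference) / n
    s11 = sum((p[0] - m1) ** 2 for p in reference) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in reference) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in reference) / (n - 1)
    return (m1, m2), (s11, s12, s22)


def t_squared(x: Point, mean: Point, cov: Tuple[float, float, float]) -> float:
    """T^2 = (x - mean)' S^{-1} (x - mean), using the closed-form 2x2 inverse."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 * s12
    if det <= 0:
        raise ValueError("covariance matrix is singular or not positive definite")
    d1, d2 = x[0] - mean[0], x[1] - mean[1]
    return (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det


def flag_deviations(
    observations: Sequence[Point], reference: Sequence[Point], limit: float
) -> List[bool]:
    """Flag observations whose T^2 exceeds a user-supplied limit (hypothetical threshold)."""
    mean, cov = mean_and_cov(reference)
    return [t_squared(x, mean, cov) > limit for x in observations]
```

In practice the limit would come from a chosen reference distribution. Here it is left as a plain parameter so the sketch makes no claim about how the thesis sets it.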