Doctoral Degrees (Statistics and Actuarial Science)
Browsing Doctoral Degrees (Statistics and Actuarial Science) by Subject "Correspondence analysis (Statistics)"
Now showing 1 - 3 of 3
- Biplot methodology for analysing and evaluating missing multivariate nominal scaled data (Stellenbosch : Stellenbosch University, 2019-12) Nienkemper-Swanepoel, Johane; Le Roux, N. J.; Lubbe, Sugnet; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
ENGLISH ABSTRACT: This research aims at developing exploratory techniques that are specifically suitable for missing data applications. Categorical data analysis, missing data analysis and biplot visualisation are the three core methodologies that are combined to develop novel techniques. Variants of multiple correspondence analysis (MCA) biplots are used for all visualisations. The first study objective addresses exploratory analysis after multiple imputation (MI). Multiple plausible values are imputed for each missing observation to construct multiple completed data sets for standard analyses. A biplot visualisation is constructed for each completed data set after MI, and each requires individual exploration before final inference can be drawn. The number of MIs will greatly affect the accuracy and consistency of the interpretations obtained from several plots. This predicament led to the development of GPAbin, which optimally combines the configurations from MIs into a single configuration for final inference. The GPAbin approach builds on two statistical techniques: generalised orthogonal Procrustes analysis (GPA) and Rubin's rules, the combining rules used to pool estimates obtained from MIs. Although MI is a superior missing data handling approach, it could be daunting for the non-technical practitioner. Therefore, an adequate alternative approach could be appealing and contribute to the variety of available methods for the handling of incomplete multivariate categorical data. The second objective aims at confirming whether visualisations obtained from non-imputed data sets are a suitable alternative to visualisations obtained from MIs.
Subset MCA (sMCA) distinguishes between observed and missing subsets of a multivariate categorical data set by creating an additional response category level (CL) for missing responses in the indicator matrix. Missing and observed responses can be visualised separately by only considering the subset of interest in the recoded indicator matrix. The visualisation of the observed responses utilises all available information, which would have been forfeited by deletion methods. The third study objective explores the possibility of predicting a complete multivariate categorical data set from the MI visualisations obtained from the first study objective. The distances between the coordinates of a biplot in the full space are used to predict plausible responses. Since the aim of this research is to advance missing data visualisations, the visualisations obtained from predicted completed data sets are compared to visualisations of simulated complete data sets. The emphasis is on preserving inference and not recreating the original data. Missing data techniques are typically developed to address a specific missing data problem. It is therefore crucial to understand the cause of missingness in order to apply suitable missing data techniques. The fourth study objective investigates the sMCA biplot of the missing subset of the recoded indicator matrix. Configurations of the incomplete subsets enable the recognition of non-response patterns, which could provide insight into the particular missing data mechanism (MDM). The missing at random (MAR) MDM refers to missing responses that are dependent on the observed information and is expected to be identified by patterns and groupings occurring in the incomplete sMCA biplot. The missing completely at random (MCAR) MDM states that all observations have the same probability of not being captured, which could be identified by a random cloud of points in the incomplete sMCA biplot.
Cluster analysis is applied to confirm distinguishable groupings in the incomplete sMCA biplot, which could be used as a guideline to identify the MDM. The proposed methodologies to address the different study objectives are evaluated by means of an extensive simulation study comprising various sample sizes, numbers of variables and numbers of CLs, simulated from three different distributions. The findings of the simulation study are then used as a guide for the analysis of a real data set. Functions have been developed for R statistical software to perform all methodology presented in this research. They are included as a tool pack, provided as an appendix, to assist in the correct handling and unbiased visualisation of multivariate categorical data with missing observations. Keywords: biplots; categorical data; missing data; multiple correspondence analysis; multiple imputation; Procrustes analysis.
- Feature selection for multi-label classification (Stellenbosch : Stellenbosch University, 2020-12) Contardo-Berning, Ivona E.; Steel, S. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
ENGLISH ABSTRACT: The field of multi-label learning is a popular new research focus. In the multi-label setting, a data instance can be associated simultaneously with a set of labels instead of only a single label. This dissertation reviews the subject of multi-label classification, emphasising some of the notable developments in the field. Multi-label datasets are typically complex, and dimensionality reduction might aid in their analysis. The notion of feature selection is therefore introduced and discussed briefly in this dissertation. A new procedure for multi-label feature selection is proposed. This new procedure, relevance pattern feature selection (RPFS), utilises the methodology of the graphical technique of Multiple Correspondence Analysis (MCA) biplots to perform feature selection. An empirical evaluation of the proposed technique is performed using a benchmark multi-label dataset and synthetic multi-label datasets. For the benchmark dataset it is shown that the proposed procedure achieves results similar to the full model, while using significantly fewer features. The empirical evaluation of the procedure on the synthetic datasets shows that the results achieved by the reduced sets of features are better than those achieved with a full set of features for the majority of the methods. The proposed procedure is then compared to two established multi-label feature selection techniques using the synthetic datasets. The results again show that the proposed procedure is effective.
- A statistical analysis of student performance for the 2000-2013 period at the Copperbelt University in Zambia (Stellenbosch : Stellenbosch University, 2017-12) Ngoy, Mwanabute; Le Roux, Niel Johannes; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
ENGLISH SUMMARY: Education in general, and tertiary education in particular, are the engines for sustained development of a nation. In this line, the Copperbelt University (CBU) plays a vital role in delivering the necessary knowledge and skills requirements for the development of Zambia and the neighbouring Southern Africa Region. It is thus important to investigate relationships between school and university results at the CBU. The first year and the graduate datasets comprising the CBU data for the 2000-2013 period were analysed using a geometric data analysis approach. The population data of all school results for the whole of Zambia from 2000 to 2003 and from 2006 to 2012 were also used. The findings of this study show that the changes in the cut-off values for university entrance resulted in the CBU admitting school leavers with better school results, i.e. the most recent intakes of first year students had higher school results than the older intakes. However, the adjustment of the cut-off values did not have a major effect on university performance. There was a general tendency for students to achieve higher scores at school level, which did not necessarily translate into higher academic achievement at university. Additionally, certain school subjects (i.e. school Mathematics, Science, Physics, Chemistry, Additional Mathematics, Geography, and Principles of Accounts) and the school average for all school subjects were identified as good indicators of university performance. These variables were also found to be responsible for the group separation/discrimination among the four groups of first year students.
For graduate students, the school average was the major determinant of the degree classification. However, most school variables had limited discrimination power to differentiate between successful and unsuccessful students. Furthermore, it was found that policies of making school results available as grades rather than actual percentages can have a marked influence on expected university achievements. One of the major contributions of this thesis is the use of optimal scores as an alternative imputation method applicable to interval-valued and categorical data. This study also identified years of study which needed more focus in order to enhance the performance of students: the first two years of study for business related programmes, the third year of study for engineering programmes, and the third and fifth years of study for other programmes. Additionally, the study identified certain school variables which were good indicators of university performance and which could be used by the university to admit potentially successful students. It was also found that first year Mathematics had the worst performance at the first year level, despite the students achieving outstanding results in school Mathematics. A clear demarcation was found between the “clear pass” (CP) students, i.e. those who successfully passed the first year of study, and the other first year groups. The “distinction” (DIS) group, i.e. those who completed their undergraduate studies with distinction, was likewise set apart from the other groups. These two groups (CP and DIS) mostly achieved outstanding results at school level compared to the other groups.
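As an illustration of the recoding step described in the first abstract above, where sMCA adds an extra category level for missing responses to the indicator matrix and then considers the observed and missing subsets separately, the following is a minimal sketch in Python. It is not the R tool pack mentioned in that abstract. The function names, the use of None to mark a missing response, and the "missing" label are all assumptions made for this example. Only the recoding and column split are shown; the correspondence analysis of the chosen subset is not included.

```python
# Hypothetical label for the added category level; not taken from the thesis.
MISSING_LEVEL = "missing"


def recode_indicator(data):
    """Build a 0/1 indicator matrix with one column per (variable, level).

    `data` is a list of rows of categorical responses (strings), with None
    marking a missing response (an assumption of this sketch). Every variable
    that has at least one missing response gets an extra MISSING_LEVEL column.
    Returns (Z, labels), where labels[k] describes column k of Z.
    """
    n_vars = len(data[0])
    labels = []
    for j in range(n_vars):
        observed = sorted({row[j] for row in data if row[j] is not None})
        labels.extend((j, level) for level in observed)
        if any(row[j] is None for row in data):
            labels.append((j, MISSING_LEVEL))
    column = {label: k for k, label in enumerate(labels)}

    Z = [[0] * len(labels) for _ in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            level = MISSING_LEVEL if value is None else value
            Z[i][column[(j, level)]] = 1
    return Z, labels


def split_subsets(Z, labels):
    """Split the recoded matrix into observed-level and missing-level columns."""
    miss = [k for k, (_, level) in enumerate(labels) if level == MISSING_LEVEL]
    obs = [k for k, (_, level) in enumerate(labels) if level != MISSING_LEVEL]
    Z_obs = [[row[k] for k in obs] for row in Z]
    Z_mis = [[row[k] for k in miss] for row in Z]
    return Z_obs, Z_mis


# Toy data: two variables, four respondents, two missing responses.
data = [["a", "x"], ["b", None], [None, "y"], ["a", "x"]]
Z, labels = recode_indicator(data)
Z_obs, Z_mis = split_subsets(Z, labels)
```

Every row of the recoded matrix still sums to the number of variables, because a missing response is counted in its own category level rather than being dropped. This is the sense in which the observed subset retains information that deletion methods would forfeit.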