Doctoral Degrees (Statistics and Actuarial Science)
- Item Aspects of model development using regression quantiles and elemental regressions (Stellenbosch : Stellenbosch University, 2007-03) Ranganai, Edmore; De Wet, Tertius; Van Vuuren, J.O.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH ABSTRACT: It is well known that ordinary least squares (OLS) procedures are sensitive to deviations from the classical Gaussian assumptions (outliers) as well as data aberrations in the design space. The two major data aberrations in the design space are collinearity and high leverage. Leverage points can also induce or hide collinearity in the design space. Such leverage points are referred to as collinearity influential points. As a consequence, over the years, many diagnostic tools to detect these anomalies as well as alternative procedures to counter them were developed. To counter deviations from the classical Gaussian assumptions, many robust procedures have been proposed. One such class of procedures is the Koenker and Bassett (1978) regression quantiles (RQs), which are natural extensions of order statistics to the linear model. RQs can be found as solutions to linear programming problems (LPs). The basic optimal solutions to these LPs (which are RQs) correspond to elemental subset (ES) regressions, which consist of subsets of minimum size to estimate the necessary parameters of the model. On the one hand, some ESs correspond to RQs. On the other hand, in the literature it is shown that many OLS statistics (estimators) are related to ES regression statistics (estimators). Therefore there is an inherent relationship amongst the three sets of procedures. The relationship between the ES procedure and the RQ one has been noted almost "casually" in the literature, while the latter has been fairly widely explored. Using these existing relationships between the ES procedure and the OLS one as well as new ones, collinearity, leverage and outlier problems in the RQ scenario were investigated. Also, a lasso procedure was proposed as a variable selection technique in the RQ scenario and some tentative results were given for it. These results are promising. Single case diagnostics were considered as well as their relationships to multiple case ones. In particular, multiple cases of the minimum size to estimate the necessary parameters of the model were considered, corresponding to an RQ (ES). In this way regression diagnostics were developed for both ESs and RQs. The main problems that affect RQs adversely are collinearity and leverage, due to the nature of the computational procedures and the fact that RQs' influence functions are unbounded in the design space but bounded in the response variable. As a consequence of this, RQs have a high affinity for leverage points and a high exclusion rate of outliers. The influential picture exhibited in the presence of both leverage points and outliers is the net result of these two antagonistic forces. Although RQs are bounded in the response variable (and therefore fairly robust to outliers), outlier diagnostics were also considered in order to have a more holistic picture. The investigations comprised analytic means as well as simulation. Furthermore, applications were made to artificial computer-generated data sets as well as standard data sets from the literature. These revealed that the ES-based statistics can be used to address problems arising in the RQ scenario with some degree of success.
However, due to the interdependence between the different aspects, viz. the one between leverage and collinearity and the one between leverage and outliers, “solutions” are often dependent on the particular situation. In spite of this complexity, the research did produce some fairly general guidelines that can be fruitfully used in practice.
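As background to the abstract above (standard theory, not a result of the thesis), the Koenker and Bassett regression quantile at level τ minimises an asymmetric absolute-loss criterion; because the objective is piecewise linear, the problem can be written as a linear programme whose basic optimal solutions fit p observations exactly, which is the link to elemental subset regressions noted above.

```latex
% \tau-th regression quantile (Koenker & Bassett, 1978)
\hat{\beta}(\tau) = \arg\min_{\beta \in \mathbb{R}^{p}} \sum_{i=1}^{n} \rho_{\tau}\!\left(y_i - x_i^{\top}\beta\right),
\qquad
\rho_{\tau}(u) = u\,\bigl(\tau - I(u < 0)\bigr), \quad \tau \in (0,1).
```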
- Item Assessing the influence of observations on the generalization performance of the kernel Fisher discriminant classifier (Stellenbosch : Stellenbosch University, 2008-12) Lamont, Morné Michael Connell; Louw, Nelmarie; Steel, Sarel; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. Kernel Fisher discriminant analysis (KFDA) is a kernel-based technique that can be used to classify observations of unknown origin into predefined groups. Basically, KFDA can be viewed as a non-linear extension of Fisher's linear discriminant analysis (FLDA). In this thesis we give a detailed explanation of how FLDA is generalized to obtain KFDA. We also discuss two methods that are related to KFDA. Our focus is on binary classification. The influence of atypical cases in discriminant analysis has been investigated by many researchers. In this thesis we investigate the influence of atypical cases on certain aspects of KFDA. One important aspect of interest is the generalization performance of the KFD classifier. Several other aspects are also investigated with the aim of developing criteria that can be used to identify cases that are detrimental to the KFD generalization performance. The investigation is done via a Monte Carlo simulation study. The output of KFDA can also be used to obtain the posterior probabilities of belonging to the two classes. In this thesis we discuss two approaches to estimate posterior probabilities in KFDA. Two new KFD classifiers are also derived which use these probabilities to classify observations, and their performance is compared to that of the original KFD classifier. The main objective of this thesis is to develop criteria which can be used to identify cases that are detrimental to the KFD generalization performance. Nine such criteria are proposed and their merit investigated in a Monte Carlo simulation study as well as on real-world data sets. Evaluating the criteria on a leave-one-out basis poses a computational challenge, especially for large data sets. In this thesis we also propose using the smallest enclosing hypersphere as a filter, to reduce the amount of computations. The effectiveness of the filter is tested in a Monte Carlo simulation study as well as on real-world data sets.
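For readers unfamiliar with KFDA, the usual formulation (following Mika et al., given here as background rather than taken from the thesis) maximises a Rayleigh quotient in the feature space induced by a kernel k, with the discriminant expressed through coefficients α on the training points:

```latex
J(\alpha) = \frac{\alpha^{\top} M \alpha}{\alpha^{\top} N \alpha},
\qquad
f(x) = \sum_{i=1}^{n} \alpha_i \, k(x_i, x),
```

where M and N are the between-class and within-class scatter matrices computed from the kernel matrix. The thesis studies how atypical training cases affect the generalization performance of the classifier built from f.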
- ItemBayesian approaches of Markov models embedded in unbalanced panel data(Stellenbosch : Stellenbosch University, 2012-12) Muller, Christoffel Joseph Brand; Mostert, Paul J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.ENGLISH ABSTRACT: Multi-state models are used in this dissertation to model panel data, also known as longitudinal or cross-sectional time-series data. These are data sets which include units that are observed across two or more points in time. These models have been used extensively in medical studies where the disease states of patients are recorded over time. A theoretical overview of the current multi-state Markov models when applied to panel data is presented and based on this theory, a simulation procedure is developed to generate panel data sets for given Markov models. Through the use of this procedure a simulation study is undertaken to investigate the properties of the standard likelihood approach when fitting Markov models and then to assess its shortcomings. One of the main shortcomings highlighted by the simulation study, is the unstable estimates obtained by the standard likelihood models, especially when fitted to small data sets. A Bayesian approach is introduced to develop multi-state models that can overcome these unstable estimates by incorporating prior knowledge into the modelling process. Two Bayesian techniques are developed and presented, and their properties are assessed through the use of extensive simulation studies. Firstly, Bayesian multi-state models are developed by specifying prior distributions for the transition rates, constructing a likelihood using standard Markov theory and then obtaining the posterior distributions of the transition rates. A selected few priors are used in these models. Secondly, Bayesian multi-state imputation techniques are presented that make use of suitable prior information to impute missing observations in the panel data sets. Once imputed, standard likelihood-based Markov models are fitted to the imputed data sets to estimate the transition rates. Two different Bayesian imputation techniques are presented. The first approach makes use of the Dirichlet distribution and imputes the unknown states at all time points with missing observations. The second approach uses a Dirichlet process to estimate the time at which a transition occurred between two known observations and then a state is imputed at that estimated transition time. The simulation studies show that these Bayesian methods resulted in more stable results, even when small samples are available.
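A sketch of the standard likelihood construction referred to in this abstract (general Markov theory, not the Bayesian extensions developed in the dissertation): for a continuous-time multi-state model with transition intensity matrix Q, the transition probabilities over an interval of length t are given by the matrix exponential, and a panel-observed unit contributes the product of these probabilities between successive observation times,

```latex
P(t) = \exp(Qt),
\qquad
L(Q) = \prod_{i} \prod_{j=1}^{m_i - 1}
\bigl[P(t_{i,j+1} - t_{i,j})\bigr]_{s_{i,j}\,,\,s_{i,j+1}},
```

where s_{i,j} denotes the state of unit i at its j-th observation time t_{i,j}.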
- ItemBiplot methodology for analysing and evaluating missing multivariate nominal scaled data(Stellenbosch : Stellenbosch University, 2019-12) Nienkemper-Swanepoel, Johane; Le Roux, N. J.; Lubbe, Sugnet; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.ENGLISH ABSTRACT: This research aims at developing exploratory techniques that are specifically suitable for missing data applications. Categorical data analysis, missing data analysis and biplot visualisation are the three core methodologies that are combined to develop novel techniques. Variants of multiple correspondence analysis (MCA) biplots are used for all visualisations. The first study objective addresses exploratory analysis after multiple imputation (MI). Multiple plausible values are imputed for each missing observation to construct multiple completed data sets for standard analyses. Biplot visualisations are constructed for each completed data set after MI which require individual exploration to obtain final inference. The number of MIs will greatly affect the accuracy and consistency of the interpretations obtained from several plots. This predicament led to the development of GPAbin, to optimally combine configurations from MIs to obtain a single configuration for final inference. The GPAbin approach advances from two statistical techniques: generalised orthogonal Procrustes analysis (GPA) and the combining rules used to combine estimates obtained from MIs, Rubin’s rules. Albeit a superior missing data handling approach, MI could be daunting for the non‐technical practitioner. Therefore, an adequate alternative approach could be appealing and contribute to the variety of available methods for the handling of incomplete multivariate categorical data. The second objective aims at confirming whether visualisations obtained from nonimputed data sets are a suitable alternative to visualisations obtained from MIs. Subset MCA (sMCA) distinguishes between observed and missing subsets of a multivariate categorical data set by creating an additional response category level (CL) for missing responses in the indicator matrix. Missing and observed responses can be visualised separately by only considering the subset of interest in the recoded indicator matrix. The visualisation of the observed responses utilises all available information which would have been forfeited by deletion methods. The third study objective explores the possibility of predicting a complete multivariate categorical data set from MI visualisations obtained from the first study objective. The distances between the coordinates of a biplot in the full space are used to predict plausible responses. Since the aim of this research is to advance missing data visualisations, the visualisations obtained from predicted completed data sets are compared to visualisations of simulated complete data sets. The emphasis is on preserving inference and not recreating the original data. Missing data techniques are typically developed to address a specific missing data problem. It is therefore crucial to understand the cause of missingness in order to apply suitable missing data techniques. The fourth study objective investigates the sMCA biplot of the missing subset of the recoded indicator matrix. Configurations of the incomplete subsets enable the recognition of non‐response patterns which could provide insight into the particular missing data mechanism (MDM). 
The missing at random (MAR) MDM refers to missing responses that are dependent on the observed information and is expected to be identified by patterns and groupings occurring in the incomplete sMCA biplot. The missing completely at random (MCAR) MDM states that all observations have the same probability of not being captured, which could be identified by a random cloud of points in the incomplete sMCA biplot. Cluster analysis is applied to confirm distinguishable groupings in the incomplete sMCA biplot, which could be used as a guideline to identify the MDM. The proposed methodologies to address the different study objectives are evaluated by means of an extensive simulation study comprising various sample sizes, numbers of variables and numbers of CLs, simulated from three different distributions. The findings of the simulation study are applied to a real data set to serve as a guide for the analysis. Functions have been developed for the R statistical software to perform all the methodology presented in this research. They are included as a tool pack, provided as an appendix, to assist in the correct handling and unbiased visualisation of multivariate categorical data with missing observations. Keywords: biplots; categorical data; missing data; multiple correspondence analysis; multiple imputation; Procrustes analysis.
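The following is a minimal, illustrative R sketch of the idea behind combining configurations from multiple imputations (it is not the GPAbin implementation from the thesis): each configuration is aligned to a target by an orthogonal Procrustes rotation and the aligned configurations are averaged. Full generalised Procrustes analysis additionally handles translation, scaling and an iteratively updated consensus target.

```r
# Illustrative sketch only (not the thesis's GPAbin functions):
# align m biplot configurations by orthogonal Procrustes rotation and average them.
procrustes_rotate <- function(X, target) {
  s <- svd(t(X) %*% target)            # rotation minimising ||X Q - target||_F
  X %*% (s$u %*% t(s$v))
}

combine_configs <- function(configs) {
  # configs: list of n x 2 coordinate matrices from the m imputed data sets
  target  <- configs[[1]]
  aligned <- lapply(configs, procrustes_rotate, target = target)
  Reduce(`+`, aligned) / length(aligned)   # element-wise average configuration
}

# toy example: three perturbed, rotated copies of the same configuration
set.seed(1)
base    <- matrix(rnorm(20), ncol = 2)
configs <- lapply(1:3, function(i)
  base %*% qr.Q(qr(matrix(rnorm(4), 2))) + rnorm(20, sd = 0.01))
combined <- combine_configs(configs)
```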
- Item Classifying yield spread movements in sparse data through triplots (Stellenbosch : Stellenbosch University, 2020-03) Van der Merwe, Carel Johannes; De Wet, Tertius; Inghelbrecht, Koen; Vanmaele, Michele; Conradie, W. J. (Willem Johannes); Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH SUMMARY: In many developing countries, including South Africa, the data required to calculate the fair values of financial instruments are not always readily available. Additionally, in some instances, companies that do not have the necessary quantitative skills are reluctant to incorporate the correct fair valuation and fail to employ the appropriate techniques. This problem is most notable with regard to unlisted debt instruments. There are two main inputs with regard to the valuation of unlisted debt instruments, namely the risk-free curve and the yield spread. Investigation into these two components forms the basis of this thesis. Firstly, an analysis is carried out to derive approximations of risk-free curves in areas where data is sparse. Thereafter it is investigated whether there is sufficient evidence of a significant change in yield spreads of unlisted debt instruments. In order to determine these changes, however, a new method that allows for simultaneous visualisation and classification of data was developed - termed triplot classification with polybags. This new classification technique also has the ability to limit misclassification rates. In the first paper, a proxy for the extended zero curve, calculated from other observable inputs, is found through a simulation approach by incorporating two new techniques, namely permuted integer multiple linear regression and aggregate standardised model scoring. It was found that a Nelson-Siegel fit, with a mixture of one-year forward rates as proxies for the long-term zero point, and some discarding of initial data points, performs relatively well in the training and testing data sets. This new method allows for the approximation of risk-free curves where no long-term points are available, and further allows the determinants of the yield curve shape to be identified by considering other available data. The changes in these shape-determining parameters are used in the final paper as determinants for changes in yield spreads. For the second paper, a new classification technique is developed that was used in the final paper. Classification techniques do not easily allow for visual interpretation, nor do they usually allow for the limitation of the false negative and positive error rates. For some areas of research and practical applications these shortcomings are important to address. In this paper, classification techniques are combined with biplots, allowing for simultaneous visual representation and classification of the data, resulting in the so-called triplot. By further incorporating polybags, the ability to limit misclassification-type errors is also introduced. A simulation study as well as an application is provided, showing that the method provides results similar to those of existing methods, but with added visualisation benefits. The paper focuses purely on developing a statistical technique that can be applied to any field. The application that is provided, for example, is on a medical data set. In the final paper the technique is applied to changes in yield spreads.
The third paper considered changes in yield spreads, which were analysed through various covariates to determine whether significant decreases or increases would have been observed for unlisted debt instruments. The methodology does not specifically determine the new spread, but gives evidence on whether the initial implied spread could be left the same, or whether a new spread should be determined. These yield spread movements are classified using various share, interest rate, financial ratio, and economic type covariates in a visually interpretive manner. This also allows for a better understanding of how various factors drive the changes in yield spreads. Finally, as a supplement to each paper, a web-based application was built allowing the reader to interact with all the data and properties of the methodologies discussed. The following links can be used to access these three applications:
- Paper 1: https://carelvdmerwe.shinyapps.io/ProxyCurve/
- Paper 2: https://carelvdmerwe.shinyapps.io/TriplotSimulation/
- Paper 3: https://carelvdmerwe.shinyapps.io/SpreadsTriplot/
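For reference, the Nelson-Siegel form mentioned in the first paper is a standard yield curve model (shown here in one common parameterisation as background, not as a result of the thesis): three coefficients capture the level, slope and curvature of the curve, while a decay parameter governs how quickly the slope and curvature terms die out with maturity τ.

```latex
y(\tau) = \beta_0
 + \beta_1 \,\frac{1 - e^{-\lambda\tau}}{\lambda\tau}
 + \beta_2 \left( \frac{1 - e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau} \right).
```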
- Item Edgeworth-corrected small-sample confidence intervals for ratio parameters in linear regression (Stellenbosch : Stellenbosch University, 2002-03) Binyavanga, Kamanzi-wa; Maritz, J. S.; Steel, S. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistical and Actuarial Science. ENGLISH ABSTRACT: In this thesis we construct a central confidence interval for a smooth scalar non-linear function of the parameter vector β in a single general linear regression model Y = Xβ + ε. We do this by first developing an Edgeworth expansion for the distribution function of a standardised point estimator. The confidence interval is then constructed in the manner discussed. Simulation studies reported at the end of the thesis show the interval to perform well in many small-sample situations. Central to the development of the Edgeworth expansion is our use of the index notation which, in statistics, has been popularised by McCullagh (1984, 1987). The contributions made in this thesis are of two kinds. We revisit the complex McCullagh index notation, modify and extend it in certain respects, and repackage it in a manner that is more accessible to other researchers. Regarding the new contributions, in addition to the introduction of a new small-sample confidence interval, we extend the theory of stochastic polynomials (SPs) in three respects. Firstly, a method, which we believe to be the simplest and most transparent to date, is proposed for deriving their cumulants. Secondly, the theory of the cumulants of SPs is developed both in the context of the Edgeworth expansion and in the regression setting. Thirdly, our new method enables us to propose a natural alternative to the method of Hall (1992a, 1992b) regarding skewness reduction in Edgeworth expansions.
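As a reminder of the kind of expansion involved (the textbook one-term Edgeworth correction for a standardised sample mean, not the thesis's regression-specific result), the normal approximation is adjusted by a skewness term of order n^{-1/2}:

```latex
P\!\left(S_n \le x\right) \approx \Phi(x) - \phi(x)\,\frac{\gamma\,(x^{2} - 1)}{6\sqrt{n}},
```

where S_n is the standardised mean of n observations with skewness γ, and Φ and φ denote the standard normal distribution and density functions. Inverting such an expansion is what yields confidence intervals with improved small-sample coverage.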
- Item Extensions of biplot methodology to discriminant analysis with applications of non-parametric principal components (Stellenbosch : Stellenbosch University, 2001) Gardner, Sugnet; Le Roux, N. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistical and Actuarial Science. ENGLISH ABSTRACT: Gower and Hand offer a new perspective on the traditional biplot. This perspective provides a unified approach to principal component analysis (PCA) biplots based on Pythagorean distance; canonical variate analysis (CVA) biplots based on Mahalanobis distance; non-linear biplots based on Euclidean embeddable distances; as well as generalised biplots for use with both continuous and categorical variables. The biplot methodology of Gower and Hand is extended and applied in statistical discrimination and classification. This leads to discriminant analysis by means of PCA biplots, CVA biplots, non-linear biplots as well as generalised biplots. Properties of these techniques are derived in detail. Classification regions defined for linear discriminant analysis (LDA) are applied in the CVA biplot, leading to discriminant analysis using biplot methodology. Situations where the assumptions of LDA are not met are considered and various existing alternative discriminant analysis procedures are formulated in terms of biplots; apart from PCA biplots, QDA, FDA and DSM biplots are defined, constructed and their usage illustrated. It is demonstrated that biplot methodology naturally provides for managing categorical and continuous variables simultaneously. It is shown through a simulation study that the techniques based on biplot methodology can be applied successfully to the reversal problem with categorical variables in discriminant analysis. Situations occurring in practice where existing discriminant analysis procedures based on distances from means fail are considered. After discussing self-consistency and principal curves (a form of non-parametric principal components), discriminant analysis based on distances from principal curves (a form of conditional mean) is proposed. This biplot classification procedure, based upon principal curves, yields much better results. Bootstrapping is considered as a means of describing variability in biplots. Variability in samples as well as of axes in biplot displays receives attention. Bootstrap α-regions are defined and the ability of these regions to describe biplot variability and to detect outliers is demonstrated. Robust PCA and CVA biplots, restricting the role of influential observations on biplot displays, are also considered. An extensive library of S-PLUS computer programmes is provided for implementing the various discriminant analysis techniques that were developed using biplot methodology. The application of the above theoretical developments and computer software is illustrated by analysing real-life data sets. Biplots are used to investigate the degree of capital intensity of companies and to serve as an aid in risk management of a financial institution. A particular application of the PCA biplot is the TQI biplot used in industry to determine the degree to which manufactured items comply with multidimensional specifications. A further interesting application is to determine whether an Old-Cape furniture item is manufactured of stinkwood or embuia.
A data set provided by the Western Cape Nature Conservation Board, consisting of measurements of tortoises from the species Homopus areolatus, is analysed by means of biplot methodology to determine if morphological differences exist among tortoises from different geographical regions. Allometric considerations need to be taken into account, and the resulting small sample sizes in some subgroups severely limit the use of conventional statistical procedures. Biplot methodology is also applied to classification in a diabetes data set, illustrating the combined advantage of using classification with principal curves in a robust biplot or biplot classification where covariance matrices are unequal. A discriminant analysis problem where foraging behaviour of deer might eventually result in a change in the dominant plant species is used to illustrate biplot classification of data sets containing both continuous and categorical variables. As an example of the use of biplots with large data sets, a data set consisting of 16828 lemons is analysed using biplot methodology to investigate differences in fruit from various areas of production, cultivars and rootstocks. The proposed α-bags also provide a means of quantifying the graphical overlap among classes. This method is successfully applied in a multidimensional socio-economic data set to quantify the degree of overlap among different race groups. The application of the proposed biplot methodology in practice has an important by-product: it provides the impetus for many a new idea, e.g. applying a PCA biplot in industry led to the development of quality regions; α-bags were constructed to represent thousands of observations in the lemons data set, in turn leading to means for quantifying the degree of overlap. This illustrates the enormous flexibility of biplots - biplot methodology provides an infrastructure for many novelties when applied in practice.
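A minimal base-R illustration of the kind of PCA biplot underlying this work (a sketch only, not the S-PLUS library developed in the thesis), using the singular value decomposition of the centred and scaled data:

```r
# Minimal PCA biplot sketch in base R (illustrative; not the thesis's S-PLUS library).
X <- scale(iris[, 1:4])                    # centred and scaled measurements
s <- svd(X)
scores   <- s$u[, 1:2] %*% diag(s$d[1:2])  # sample points in the first two dimensions
loadings <- s$v[, 1:2]                     # directions of the original variables

plot(scores, xlab = "Dim 1", ylab = "Dim 2", pch = 16,
     col = as.numeric(iris$Species), asp = 1)
arrows(0, 0, loadings[, 1] * 3, loadings[, 2] * 3, length = 0.1)
text(loadings[, 1] * 3.3, loadings[, 2] * 3.3, colnames(X))
```

Gower and Hand's biplots replace the variable arrows with calibrated axes, which is the representation extended to discriminant analysis in the thesis.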
- Item Extreme quantile inference (Stellenbosch : Stellenbosch University, 2020-03) Buitendag, Sven; De Wet, Tertius; Beirlant, Jan; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH SUMMARY: A novel approach to performing extreme quantile inference is proposed by applying ridge regression and the saddlepoint approximation to results in extreme value theory. To this end, ridge regression is applied to the log differences of the largest sample quantiles to obtain a bias-reduced estimator of the extreme value index, which is a parameter in extreme value theory that plays a central role in the estimation of extreme quantiles. The utility of the ridge regression estimators for the extreme value index is illustrated by means of simulation results and applications to daily wind speeds. A new pivotal quantity is then proposed with which a set of novel asymptotic confidence intervals for extreme quantiles is obtained. The ridge regression estimator for the extreme value index is combined with the proposed pivotal quantity together with the saddlepoint approximation to yield a set of confidence intervals that are accurate and narrow. The utility of these confidence intervals is illustrated by means of simulation results and applications to Belgian reinsurance data. Multivariate generalizations of sample quantiles are considered with the aim of developing multivariate risk measures, including maximum correlation risk measures and an estimator for the extreme value index. These multivariate sample quantiles are called center-outward quantiles, and are defined as an optimal transportation of the uniformly distributed points in the unit ball S_d to the observed sample points in R^d. A continuous extension of the center-outward quantile is proposed, which yields quantile contours that are nested. Furthermore, maximum correlation risk measures for multivariate samples are presented, as well as an estimator for the extreme value index for multivariate regularly varying samples. These results are applied to Danish fire insurance data and the returns on Google and Apple share prices to illustrate their utility.
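For orientation, the classical estimators that this line of work improves on (standard extreme value theory background, not the ridge-regression or saddlepoint results of the thesis) are the Hill estimator of a positive extreme value index and the corresponding Weissman-type extreme quantile estimator, both built from the k largest order statistics X_{n-k,n} ≤ ... ≤ X_{n,n}:

```latex
\hat{\gamma}_k = \frac{1}{k} \sum_{i=1}^{k} \log \frac{X_{n-i+1,n}}{X_{n-k,n}},
\qquad
\hat{q}_p = X_{n-k,n} \left( \frac{k}{np} \right)^{\hat{\gamma}_k},
```

for a small tail probability p.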
- ItemFeature selection for multi-label classification(Stellenbosch : Stellenbosch University, 2020-12) Contardo-Berning, Ivona E.; Steel, S. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Economics.ENGLISH ABSTRACT : The field of multi-label learning is a popular new research focus. In the multi-label setting, a data instance can be associated simultaneously with a set of labels instead of only a single label. This dissertation reviews the subject of multi-label classification, emphasising some of the notable developments in the field. The nature of multi-label datasets typically means that these datasets are complex and dimensionality reduction might aid in the analysis of these datasets. The notion of feature selection is therefore introduced and discussed briefly in this dissertation. A new procedure for multi-label feature selection is proposed. This new procedure, relevance pattern feature selection (RPFS), utilises the methodology of the graphical technique of Multiple Correspondence Analysis (MCA) biplots to perform feature selection. An empirical evaluation of the proposed technique is performed using a benchmark multi-label dataset and synthetic multi-label datasets. For the benchmark dataset it is shown that the proposed procedure achieves results similar to the full model, while using significantly fewer features. The empirical evaluation of the procedure on the synthetic datasets shows that the results achieved by the reduced sets of features are better than those achieved with a full set of features for the majority of the methods. The proposed procedure is then compared to two established multi-label feature selection techniques using the synthetic datasets. The results again show that the proposed procedure is effective.
- Item A framework for estimating risk (Stellenbosch : Stellenbosch University, 2008-03) Kroon, Rodney Stephen; Steel, S. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. We consider the problem of model assessment by risk estimation. Various approaches to risk estimation are considered in a unified framework. This framework is an extension of a decision-theoretic framework proposed by David Haussler. Point and interval estimation based on test samples and training samples is discussed, with interval estimators being classified based on the measure of deviation they attempt to bound. The main contribution of this thesis is in the realm of training sample interval estimators, particularly covering number-based and PAC-Bayesian interval estimators. The thesis discusses a number of approaches to obtaining such estimators. The first type of training sample interval estimator to receive attention is estimators based on classical covering number arguments. A number of these estimators were generalized in various directions. Typical generalizations included: extension of results from misclassification loss to other loss functions; extending results to allow arbitrary ghost sample size; extending results to allow arbitrary scale in the relevant covering numbers; and extending results to allow arbitrary choice in the use of symmetrization lemmas. These extensions were applied to covering number-based estimators for various measures of deviation, as well as for the special cases of misclassification loss estimators, realizable case estimators, and margin bounds. Extended results were also provided for stratification by (algorithm- and data-dependent) complexity of the decision class. In order to facilitate application of these covering number-based bounds, a discussion of various complexity dimensions and approaches to obtaining bounds on covering numbers is also presented. The second type of training sample interval estimator discussed in the thesis is Rademacher bounds. These bounds use advanced concentration inequalities, so a chapter discussing such inequalities is provided. Our discussion of Rademacher bounds leads to the presentation of an alternative, slightly stronger, form of the core result used for deriving local Rademacher bounds, by avoiding a few unnecessary relaxations. Next, we turn to a discussion of PAC-Bayesian bounds. Using an approach developed by Olivier Catoni, we develop new PAC-Bayesian bounds based on results underlying Hoeffding's inequality. By utilizing Catoni's concept of "exchangeable priors", these results allowed the extension of a covering number-based result to averaging classifiers, as well as its corresponding algorithm- and data-dependent result. The last contribution of the thesis is the development of a more flexible shell decomposition bound: by using Hoeffding's tail inequality rather than Hoeffding's relative entropy inequality, we extended the bound to general loss functions, allowed the use of an arbitrary number of bins, and introduced between-bin and within-bin "priors". Finally, to illustrate the calculation of these bounds, we applied some of them to the UCI spam classification problem, using decision trees and boosted stumps.
- Item The identification and application of common principal components (Stellenbosch : Stellenbosch University, 2014-12) Pepler, Pieter Theo; Uys, Daniel W.; Nel, D. G.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH ABSTRACT: When estimating the covariance matrices of two or more populations, the covariance matrices are often assumed to be either equal or completely unrelated. The common principal components (CPC) model provides an alternative which is situated between these two extreme assumptions: the assumption is made that the population covariance matrices share the same set of eigenvectors, but have different sets of eigenvalues. An important question in the application of the CPC model is to determine whether it is appropriate for the data under consideration. Flury (1988) proposed two methods, based on likelihood estimation, to address this question. However, the assumption of multivariate normality is untenable for many real data sets, making the application of these parametric methods questionable. A number of non-parametric methods, based on bootstrap replications of eigenvectors, is proposed to select an appropriate common eigenvector model for two population covariance matrices. Using simulation experiments, it is shown that the proposed selection methods outperform the existing parametric selection methods. If appropriate, the CPC model can provide covariance matrix estimators that are less biased than when assuming equality of the covariance matrices, and of which the elements have smaller standard errors than the elements of the ordinary unbiased covariance matrix estimators. A regularised covariance matrix estimator under the CPC model is proposed, and Monte Carlo simulation results show that it provides more accurate estimates of the population covariance matrices than the competing covariance matrix estimators. Covariance matrix estimation forms an integral part of many multivariate statistical methods. Applications of the CPC model in discriminant analysis, biplots and regression analysis are investigated. It is shown that, in cases where the CPC model is appropriate, CPC discriminant analysis provides significantly smaller misclassification error rates than both ordinary quadratic discriminant analysis and linear discriminant analysis. A framework for the comparison of different types of biplots for data with distinct groups is developed, and CPC biplots constructed from common eigenvectors are compared to other types of principal component biplots using this framework. A subset of data from the Vermont Oxford Network (VON), of infants admitted to participating neonatal intensive care units in South Africa and Namibia during 2009, is analysed using the CPC model. It is shown that the proposed non-parametric methodology offers an improvement over the known parametric methods in the analysis of this data set, which originated from a non-normally distributed multivariate population. CPC regression is compared to principal component regression and partial least squares regression in the fitting of models to predict neonatal mortality and length of stay for infants in the VON data set. The fitted regression models, using readily available day-of-admission data, can be used by medical staff and hospital administrators to counsel parents and improve the allocation of medical care resources.
Predicted values from these models can also be used in benchmarking exercises to assess the performance of neonatal intensive care units in the Southern African context, as part of larger quality improvement programmes.
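The common principal components hypothesis referred to above can be stated compactly in Flury's formulation: the k population covariance matrices share one orthogonal matrix of eigenvectors but retain their own eigenvalues,

```latex
H_{\mathrm{CPC}}:\quad
\Sigma_i = B \Lambda_i B^{\top}, \qquad i = 1, \dots, k,
```

where B is a common orthogonal matrix and each Λ_i is diagonal. Equality of the covariance matrices and completely unrelated covariance matrices are the two extremes between which this model sits.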
- ItemImproved estimation procedures for a positive extreme value index(Stellenbosch : University of Stellenbosch, 2010-12) Berning, Thomas Louw; De Wet, Tertius; University of Stellenbosch. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.ENGLISH ABSTRACT: In extreme value theory (EVT) the emphasis is on extreme (very small or very large) observations. The crucial parameter when making inferences about extreme quantiles, is called the extreme value index (EVI). This thesis concentrates on only the right tail of the underlying distribution (extremely large observations), and specifically situations where the EVI is assumed to be positive. A positive EVI indicates that the underlying distribution of the data has a heavy right tail, as is the case with, for example, insurance claims data. There are numerous areas of application of EVT, since there are a vast number of situations in which one would be interested in predicting extreme events accurately. Accurate prediction requires accurate estimation of the EVI, which has received ample attention in the literature from a theoretical as well as practical point of view. Countless estimators of the EVI exist in the literature, but the practitioner has little information on how these estimators compare. An extensive simulation study was designed and conducted to compare the performance of a wide range of estimators, over a wide range of sample sizes and distributions. A new procedure for the estimation of a positive EVI was developed, based on fitting the perturbed Pareto distribution (PPD) to observations above a threshold, using Bayesian methodology. Attention was also given to the development of a threshold selection technique. One of the major contributions of this thesis is a measure which quantifies the stability (or rather instability) of estimates across a range of thresholds. This measure can be used to objectively obtain the range of thresholds over which the estimates are most stable. It is this measure which is used for the purpose of threshold selection for the proposed PPD estimator. A case study of five insurance claims data sets illustrates how data sets can be analyzed in practice. It is shown to what extent discretion can/should be applied, as well as how different estimators can be used in a complementary fashion to give more insight into the nature of the data and the extreme tail of the underlying distribution. The analysis is carried out from the point of raw data, to the construction of tables which can be used directly to gauge the risk of the insurance portfolio over a given time frame.
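The heavy-tail assumption underlying a positive extreme value index is usually expressed as a Pareto-type tail (standard background, not specific to the perturbed Pareto estimator proposed in the thesis):

```latex
1 - F(x) = x^{-1/\gamma}\,\ell(x), \qquad \gamma > 0,
```

where ℓ is a slowly varying function; a larger γ corresponds to a heavier right tail and hence to larger, more frequent extreme observations such as severe insurance claims.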
- ItemInfluential data cases when the C-p criterion is used for variable selection in multiple linear regression(Stellenbosch : Stellenbosch University, 2003) Uys, Daniel Wilhelm; Steel, S. J.; Van Vuuren, J. O.; Stellenbosch University. Faculty of Economic and Management Sciences . Dept. of Statistical and Actuarial Science.ENGLISH ABSTRACT: In this dissertation we study the influence of data cases when the Cp criterion of Mallows (1973) is used for variable selection in multiple linear regression. The influence is investigated in terms of the predictive power and the predictor variables included in the resulting model when variable selection is applied. In particular, we focus on the importance of identifying and dealing with these so called selection influential data cases before model selection and fitting are performed. For this purpose we develop two new selection influence measures, both based on the Cp criterion. The first measure is specifically developed to identify individual selection influential data cases, whereas the second identifies subsets of selection influential data cases. The success with which these influence measures identify selection influential data cases, is evaluated in example data sets and in simulation. All results are derived in the coordinate free context, with special application in multiple linear regression.
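As a concrete reminder of the criterion involved (the standard definition, not the selection-influence measures developed in the dissertation), Mallows' Cp for a candidate subset model can be computed in base R as follows; a subset model with negligible bias has Cp close to its number of fitted coefficients.

```r
# Mallows' Cp for a candidate subset model, using the full model's error variance.
# Standard definition; the dissertation's selection-influence measures build on this.
mallows_cp <- function(full_formula, subset_formula, data) {
  full_fit   <- lm(full_formula, data = data)
  sub_fit    <- lm(subset_formula, data = data)
  sigma2_hat <- summary(full_fit)$sigma^2   # error variance estimated from the full model
  n <- nrow(data)
  p <- length(coef(sub_fit))                # number of fitted coefficients, intercept included
  sum(residuals(sub_fit)^2) / sigma2_hat - n + 2 * p
}

# example: Cp of a two-predictor subset within the mtcars data
mallows_cp(mpg ~ ., mpg ~ wt + hp, data = mtcars)
```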
- ItemMulti-label feature selection with application to musical instrument recognition(Stellenbosch : Stellenbosch University, 2013-12) Sandrock, Trudie; Steel, S. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.ENGLISH ABSTRACT: An area of data mining and statistics that is currently receiving considerable attention is the field of multi-label learning. Problems in this field are concerned with scenarios where each data case can be associated with a set of labels instead of only one. In this thesis, we review the field of multi-label learning and discuss the lack of suitable benchmark data available for evaluating multi-label algorithms. We propose a technique for simulating multi-label data, which allows good control over different data characteristics and which could be useful for conducting comparative studies in the multi-label field. We also discuss the explosion in data in recent years, and highlight the need for some form of dimension reduction in order to alleviate some of the challenges presented by working with large datasets. Feature (or variable) selection is one way of achieving dimension reduction, and after a brief discussion of different feature selection techniques, we propose a new technique for feature selection in a multi-label context, based on the concept of independent probes. This technique is empirically evaluated by using simulated multi-label data and it is shown to achieve classification accuracy with a reduced set of features similar to that achieved with a full set of features. The proposed technique for feature selection is then also applied to the field of music information retrieval (MIR), specifically the problem of musical instrument recognition. An overview of the field of MIR is given, with particular emphasis on the instrument recognition problem. The particular goal of (polyphonic) musical instrument recognition is to automatically identify the instruments playing simultaneously in an audio clip, which is not a simple task. We specifically consider the case of duets – in other words, where two instruments are playing simultaneously – and approach the problem as a multi-label classification one. In our empirical study, we illustrate the complexity of musical instrument data and again show that our proposed feature selection technique is effective in identifying relevant features and thereby reducing the complexity of the dataset without negatively impacting on performance.
- Item Multivariate statistical process evaluation and monitoring for complex chemical processes (Stellenbosch : Stellenbosch University, 2015-12) Rossouw, Ruan Francois; Le Roux, N. J.; Coetzer, R. L. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH ABSTRACT: In this study, the development of an innovative, fully integrated process monitoring methodology is presented for a complex chemical facility, originating at the coal feed from different mines up to the processing of the coal to produce raw gas at the gasification plant. The methodology developed is real-time and visual, detects deviations from expected performance across the whole value chain, and also provides for the integration and standardisation of data from a number of different data sources and formats. Real-time coal quality analyses from an XRF analyser are summarised and integrated with various data sources from the Coal Supply Facility to provide information on the coal quality of each mine. In addition, simulation models are developed to generate information on the coal quality of each heap and the quality of the reclaimed coal sent to gasification. A real-time multivariate process monitoring approach for the Coal Gasification Facility is presented. This includes a novel approach utilising Generalised Orthogonal Procrustes Analysis to find the optimal units and time period to employ as a reference set. Principal Component Analysis (PCA) and Canonical Variate Analysis (CVA) theory and biplots are evaluated and extended for the real-time monitoring of the plant. A new approach to process deviation monitoring on many variables is presented, based on the confidence ( ) value at a specified T2-value. This methodology is proposed as a general data-driven performance index, as it is objective and very little prior knowledge of the system is required. A new multivariate gasifier performance index (GPI) is developed, which integrates subject matter knowledge with a data-driven approach for real-time performance monitoring. Various software modules are developed which were required for the implementation of the real-time multivariate process monitoring methodology, which is made operational and distributed to the clients on an interactive web interface. The methodology has been trademarked by Sasol as the MSPEM™ Technology Package. Following the success of the developed methodology, the MSPEM™ package has been rolled out to many more business units within the Sasol Group. In conclusion, this study presents the development and implementation of the MSPEM™ application for a real-time, integrated and standardised approach to multivariate process monitoring of the Sasol Synfuels Coal Value Chain and Gasification Facility. In summary, the following novel developments were introduced:
• The application of distance measures other than Euclidean measures is introduced for space-filling designs for computer experiments in mixture variables.
• An approach utilising Generalised Orthogonal Procrustes Analysis to specify the optimal units and time period to employ as a reference set is developed.
• An approach to process deviation monitoring on many variables is presented, based on the confidence ( ) value at a specified T2-value.
• An integrated approach to a reactor performance index is developed and illustrated.
• A comprehensive software infrastructure is developed and implemented.
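A minimal sketch of the T2-style deviation monitoring idea described above (generic multivariate control-chart logic, not Sasol's MSPEM™ implementation): each new observation's Hotelling-type T2 distance from a reference set is converted to a confidence level, here via a chi-squared approximation.

```r
# Generic T^2-style deviation monitoring sketch (not the MSPEM(TM) implementation).
# Reference data define the in-control mean and covariance; each new observation
# is scored by its squared Mahalanobis distance and the associated confidence level.
monitor_t2 <- function(reference, new_obs) {
  centre <- colMeans(reference)
  covmat <- cov(reference)
  t2   <- mahalanobis(new_obs, centre, covmat)   # Hotelling-type T^2 per observation
  conf <- pchisq(t2, df = ncol(reference))       # chi-squared approximation of confidence
  data.frame(T2 = t2, confidence = conf)
}

set.seed(42)
reference <- matrix(rnorm(200 * 5), ncol = 5)              # in-control reference period
new_obs   <- matrix(rnorm(10 * 5, mean = 0.5), ncol = 5)   # slightly shifted new readings
monitor_t2(reference, new_obs)
```

Observations whose confidence value approaches 1 are flagged as deviating from expected performance; the methodology in the abstract works with the confidence value at a specified T2-value together with biplot displays.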
- ItemA quantitative analysis of investor over-reaction and under-reaction in the South African Equity Market : a mathematical statistical approach(Stellenbosch : Stellenbosch University, 2022-04) Mbonda Tiekwe, Aude Ines; Conradie, Willie; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.ENGLISH SUMMARY: One of the basic foundations of traditional finance is the theory underlying the efficient market hypothesis (EMH). The EMH states that stocks are fairly and accurately priced, making it impossible for investors to use stock selection, technical analysis, or market timing to out-perform the market by earning abnormal returns. Several schools of thought have challenged the EMH by presenting empirical evidence of market anomalies, which seems to contradict the EMH. One such school of thought is behavioural finance, which holds that investors over-react and/or under-react over time, driven by their behavioural biases. The Barberis et al. (1998) theory of conservatism and representativeness heuristics is used to explain investor over-reaction and under-reaction. Investors who exhibit conservatism are slow to update their beliefs in response to recent evidence, and thus under-react to information. Under the influence of the representativeness heuristics, investors tend to produce extreme predictions, and over-react, implying that stocks that under-performed in the past tend to out-perform in the future, and vice-versa (Aguiar et al., 2006). In this study, it is investigated whether South African investors tend to overreact and/or under-react over time, driven by their behavioural biases. The 100 shares with the largest market capitalisation at the end of every calendar year from 2006 to 2016 were considered for the study. These shares had sufficient liquidity and depth of coverage by analysts and investors to be considered for a study on behavioural finance. In total, a sample of 163 shares had sufficient financial statement data on the Iress and Bloomberg databases to be included in the study. Analyses were done using two mathematical statistical techniques i.e. the more mathematical Fuzzy C-Means model and the Bayesian model, together with formal statistical tests. The Fuzzy C-Means model is based on the technique of pattern recognition, and uses the well-known fuzzy c-means clustering algorithm. The Bayesian model is based on the classical Bayes’ theorem, which describes a relationship between the probability of an event conditional upon another event. The stocks in the financials-, industrial- and resources sectors were analysed separately. Over-reaction and under-reaction were both detected, and differed across the three sectors. No clear patterns of the two biases investigated were visible over time. The results of the Fuzzy C-Means model analysis revealed that the resources sector shows the most under-reaction. In the Bayesian model, underreaction was observed more than over-reaction in the resources and industrial sectors. In the financial sector, over-reaction was observed more often. The results of this study imply that a momentum and a contrarian investment strategy can lead to over-performance in the South African equity market, but can also generate under-performance in a poorly performing market. Therefore, no trading strategies can be advised based on the results of this study.
- ItemRegularised Gaussian belief propagation(Stellenbosch : Stellenbosch University, 2018-12) Kamper, Francois; Steel, S. J.; Du Preez, J. A.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.ENGLISH SUMMARY : Belief propagation (BP) has been applied as an approximation tool in a variety of inference problems. BP does not necessarily converge in loopy graphs and, even if it does, is not guaranteed to provide exact inference. Even so, BP is useful in many applications due to its computational tractability. On a high level this dissertation is concerned with addressing the issues of BP when applied to loopy graphs. To address these issues we formulate the principle of node regularisation on Markov graphs (MGs) within the context of BP. The main contribution of this dissertation is to provide mathematical and empirical evidence that the principle of node regularisation can achieve convergence and good inference quality when applied to a MG constructed from a Gaussian distribution in canonical parameterisation. There is a rich literature surrounding BP on Gaussian MGs (labelled Gaussian belief propagation or GaBP), and this is known to suffer from the same problems as general BP on graphs. GaBP is known to provide the correct marginal means if it converges (this is not guaranteed), but it does not provide the exact marginal precisions. We show that our regularised BP will, with sufficient tuning, always converge while maintaining the exact marginal means. This is true for a graph where nodes are allowed to have any number of variables. The selection of the degree of regularisation is addressed through the use of heuristics. Our variant of GaBP is tested empirically in a variety of settings. We show that our method outperforms other variants of GaBP available in the literature, both in terms of convergence speed and quality of inference. These improvements suggest that the principle of node regularisation in BP should be investigated in other inference problems. A by-product of GaBP is that it can be used to solve linear systems of equations; the same is true for our variant and in this context we make an empirical comparison with the conjugate gradient (CG) method.
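For context (standard Gaussian graphical-model background, not the node regularisation scheme proposed in the dissertation), the canonical parameterisation referred to above writes the joint density in terms of a precision matrix K and potential vector h, so that the exact marginal means solve a linear system which GaBP approximates by local message passing:

```latex
p(x) \propto \exp\!\left( -\tfrac{1}{2} x^{\top} K x + h^{\top} x \right),
\qquad
\mu = K^{-1} h, \quad \Sigma = K^{-1}.
```

This is also why a converged GaBP run can be used to solve the linear system Kμ = h, the application compared against the conjugate gradient method at the end of the abstract.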
- Item Some statistical aspects of LULU smoothers (Stellenbosch : University of Stellenbosch, 2007-12) Jankowitz, Maria Dorothea; Conradie, W. J.; De Wet, Tertius; University of Stellenbosch. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. The smoothing of time series plays a very important role in various practical applications. Estimating the signal and removing the noise is the main goal of smoothing. Traditionally linear smoothers were used, but nonlinear smoothers became more popular through the years. From the family of nonlinear smoothers, the class of median smoothers, based on order statistics, is the most popular. A new class of nonlinear smoothers, called LULU smoothers, was developed by using the minimum and maximum selectors. These smoothers have very attractive mathematical properties. In this thesis their statistical properties are investigated and compared to those of the class of median smoothers. Smoothing, together with related concepts, is discussed in general. Thereafter, the class of median smoothers from the literature is discussed. The class of LULU smoothers is defined, their properties are explained and new contributions are made. The compound LULU smoother is introduced and its property of variation decomposition is discussed. The probability distributions of some LULU smoothers with independent data are derived. LULU smoothers and median smoothers are compared according to the properties of monotonicity, idempotency, co-idempotency, stability, edge preservation, output distributions and variation decomposition. A comparison is made of their respective abilities for signal recovery by means of simulations. The success of the smoothers in recovering the signal is measured by the integrated mean square error and the regression coefficient calculated from the least squares regression of the smoothed sequence on the signal. Finally, LULU smoothers are practically applied.
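To make the "minimum and maximum selectors" concrete, here is a small R sketch of the basic L and U operators for window size n = 1 in one common formulation (an illustration only; the thesis treats these operators and their compositions far more generally):

```r
# Sketch of the basic LULU operators for window size n = 1 (one common formulation).
# L removes isolated upward spikes (peaks); U removes isolated downward spikes (pits).
L1 <- function(x) {
  out <- x
  for (i in 2:(length(x) - 1)) {     # endpoints left unchanged for simplicity
    out[i] <- max(min(x[i - 1], x[i]), min(x[i], x[i + 1]))
  }
  out
}
U1 <- function(x) {
  out <- x
  for (i in 2:(length(x) - 1)) {
    out[i] <- min(max(x[i - 1], x[i]), max(x[i], x[i + 1]))
  }
  out
}

set.seed(7)
signal <- sin(seq(0, 4 * pi, length.out = 100))
noisy  <- signal
noisy[c(20, 55, 80)] <- noisy[c(20, 55, 80)] + c(3, -3, 3)  # impulsive noise
smoothed <- L1(U1(noisy))            # a compound smoother: remove pits, then peaks
```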
- ItemA statistical analysis of student performance for the 2000-2013 period at the Copperbelt University in Zambia(Stellenbosch : Stellenbosch University, 2017-12) Ngoy, Mwanabute; Le Roux, Niel Johannes; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.ENGLISH SUMMARY : Education in general, and tertiary education in particular are the engines for sustained development of a nation. In this line, the Copperbelt University (CBU) plays a vital role in delivering the necessary knowledge and skills requirements for the development of Zambia and the neighbouring Southern Africa Region. It is thus important to investigate relationships between school and university results at the CBU. The first year and the graduate datasets comprising the CBU data for the 2000-2013 period were analysed using a geometric data analysis approach. The population data of all school results for the whole Zambia from 2000 to 2003 and from 2006 to 2012 were also used. The findings of this study show that the changes in the cut-off values for university entrance resulted in the CBU admitting school leavers with better school results, i.e. most recent intakes of first year students had higher school results than the older intakes. But the adjustment on the cut-off values did not have a major effect on the university performance. There was a general tendency for students to achieve higher scores at school level which could not translate necessarily into higher academic achievement at university. Additionally, certain school subjects (i.e. school Mathematics, Science, Physics, Chemistry, Additional Mathematics, Geography, and Principles of Accounts) and the school average for all school subjects were identified as good indicators of university performance. These variables were also found to be responsible for the group separation/discrimination among the four groups of the first year students. For graduate students, the school average was the major determinant of the degree classification. However, most school variables had limited discrimination power to differentiate between successful and unsuccessful students. Furthermore, it was found that policies of making school results available as grades rather than actual percentages can have a marked influence on expected university achievements. One of the major contributions of this thesis is the use of optimal scores as an alternative imputation method applicable to interval-valued and categorical data. This study also identified years of study which needed more focus in order to enhance the performance of students: the first two years of study for business related programmes, the third year of study for engineering programmes, and the third and fifth year of study for other programmes. Additionally, the study also identified certain school variables which were good indicators of university performance and which could be used by the university to admit potential successful students. It was also found that the first year Mathematics had the worst performance at the first year level despite the students achieving outstanding results in school Mathematics. It was also found that a clear demarcation exists between the “clear pass” (CP) students, i.e. those who successfully passed the first year of study and other first year groups. Also the “distinction” (DIS) group, i.e. those who completed their undergraduate studies with distinction, was apart from the other groups. 
These two groups (CP and DIS groups) mostly achieved outstanding results at school level as compared to other groups.
- ItemStatistical inference for inequality measures based on semi-parametric estimators(Stellenbosch : Stellenbosch University, 2011-12) Kpanzou, Tchilabalo Abozou; De Wet, Tertius; Neethling, Ariane; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.ENGLISH ABSTRACT: Measures of inequality, also used as measures of concentration or diversity, are very popular in economics and especially in measuring the inequality in income or wealth within a population and between populations. However, they have applications in many other fields, e.g. in ecology, linguistics, sociology, demography, epidemiology and information science. A large number of measures have been proposed to measure inequality. Examples include the Gini index, the generalized entropy, the Atkinson and the quintile share ratio measures. Inequality measures are inherently dependent on the tails of the population (underlying distribution) and therefore their estimators are typically sensitive to data from these tails (nonrobust). For example, income distributions often exhibit a long tail to the right, leading to the frequent occurrence of large values in samples. Since the usual estimators are based on the empirical distribution function, they are usually nonrobust to such large values. Furthermore, heavy-tailed distributions often occur in real life data sets, remedial action therefore needs to be taken in such cases. The remedial action can be either a trimming of the extreme data or a modification of the (traditional) estimator to make it more robust to extreme observations. In this thesis we follow the second option, modifying the traditional empirical distribution function as estimator to make it more robust. Using results from extreme value theory, we develop more reliable distribution estimators in a semi-parametric setting. These new estimators of the distribution then form the basis for more robust estimators of the measures of inequality. These estimators are developed for the four most popular classes of measures, viz. Gini, generalized entropy, Atkinson and quintile share ratio. Properties of such estimators are studied especially via simulation. Using limiting distribution theory and the bootstrap methodology, approximate confidence intervals were derived. Through the various simulation studies, the proposed estimators are compared to the standard ones in terms of mean squared error, relative impact of contamination, confidence interval length and coverage probability. In these studies the semi-parametric methods show a clear improvement over the standard ones. The theoretical properties of the quintile share ratio have not been studied much. Consequently, we also derive its influence function as well as the limiting normal distribution of its nonparametric estimator. These results have not previously been published. In order to illustrate the methods developed, we apply them to a number of real life data sets. Using such data sets, we show how the methods can be used in practice for inference. In order to choose between the candidate parametric distributions, use is made of a measure of sample representativeness from the literature. These illustrations show that the proposed methods can be used to reach satisfactory conclusions in real life problems.
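To illustrate why these estimators are sensitive to the tails, the plug-in Gini estimator based on the empirical distribution function (the nonrobust baseline that the thesis's semi-parametric estimators aim to improve on) can be written and tested in a few lines of R; a single extreme observation visibly shifts the estimate.

```r
# Plug-in (empirical) Gini index -- the nonrobust baseline estimator that the
# semi-parametric, extreme-value-based estimators of the thesis aim to improve on.
gini_empirical <- function(x) {
  x <- sort(x)                       # order statistics x_(1) <= ... <= x_(n)
  n <- length(x)
  2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
}

# heavy-tailed 'income' sample: a few extreme values dominate the estimate
set.seed(11)
income <- rlnorm(1000, meanlog = 10, sdlog = 1)
gini_empirical(income)
gini_empirical(c(income, max(income) * 20))   # one contaminating extreme value shifts the estimate
```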