- Aspects of model development using regression quantiles and elemental regressions (Stellenbosch : Stellenbosch University, 2007-03) Ranganai, Edmore; De Wet, Tertius; Van Vuuren, J.O.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
ENGLISH ABSTRACT: Multi-state models are used in this dissertation to model panel data, also known as longitudinal or cross-sectional time-series data. These are data sets which include units that are observed across two or more points in time. These models have been used extensively in medical studies where the disease states of patients are recorded over time. A theoretical overview of the current multi-state Markov models when applied to panel data is presented and, based on this theory, a simulation procedure is developed to generate panel data sets for given Markov models. Through the use of this procedure a simulation study is undertaken to investigate the properties of the standard likelihood approach when fitting Markov models and then to assess its shortcomings. One of the main shortcomings highlighted by the simulation study is the instability of the estimates obtained by the standard likelihood models, especially when fitted to small data sets. A Bayesian approach is introduced to develop multi-state models that can overcome these unstable estimates by incorporating prior knowledge into the modelling process. Two Bayesian techniques are developed and presented, and their properties are assessed through the use of extensive simulation studies. Firstly, Bayesian multi-state models are developed by specifying prior distributions for the transition rates, constructing a likelihood using standard Markov theory and then obtaining the posterior distributions of the transition rates. A few selected priors are used in these models. Secondly, Bayesian multi-state imputation techniques are presented that make use of suitable prior information to impute missing observations in the panel data sets. Once the missing observations have been imputed, standard likelihood-based Markov models are fitted to the completed data sets to estimate the transition rates. Two different Bayesian imputation techniques are presented.
The first approach makes use of the Dirichlet distribution and imputes the unknown states at all time points with missing observations. The second approach uses a Dirichlet process to estimate the time at which a transition occurred between two known observations and then a state is imputed at that estimated transition time. The simulation studies show that these Bayesian methods yield more stable estimates, even when only small samples are available.
- Edgeworth-corrected small-sample confidence intervals for ratio parameters in linear regression (Stellenbosch : Stellenbosch University, 2002-03) Binyavanga, Kamanzi-wa; Maritz, J. S.; Steel, S. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistical and Actuarial Science.
ENGLISH ABSTRACT: In this thesis we construct a central confidence interval for a smooth scalar non-linear function of the parameter vector β in a single general linear regression model Y = Xβ + ε. We do this by first developing an Edgeworth expansion for the distribution function of a standardised point estimator. The confidence interval is then constructed in the manner discussed. Simulation studies reported at the end of the thesis show the interval to perform well in many small-sample situations. Central to the development of the Edgeworth expansion is our use of the index notation which, in statistics, has been popularised by McCullagh (1984, 1987). The contributions made in this thesis are of two kinds. We revisit the complex McCullagh index notation, modify and extend it in certain respects, and repackage it in a manner that is more accessible to other researchers. As for the new contributions, in addition to the introduction of a new small-sample confidence interval, we extend the theory of stochastic polynomials (SP) in three respects. Firstly, a method, which we believe to be the simplest and most transparent to date, is proposed for deriving cumulants for these. Secondly, the theory of the cumulants of the SP is developed both in the context of Edgeworth expansion and in the regression setting. Thirdly, our new method enables us to propose a natural alternative to the method of Hall (1992a, 1992b) regarding skewness-reduction in Edgeworth expansions.
- A framework for estimating risk (Stellenbosch : Stellenbosch University, 2008-03) Kroon, Rodney Stephen; Steel, S. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
ENGLISH ABSTRACT: When estimating the covariance matrices of two or more populations, the covariance matrices are often assumed to be either equal or completely unrelated. The common principal components (CPC) model provides an alternative which is situated between these two extreme assumptions: the assumption is made that the population covariance matrices share the same set of eigenvectors, but have different sets of eigenvalues. An important question in the application of the CPC model is to determine whether it is appropriate for the data under consideration. Flury (1988) proposed two methods, based on likelihood estimation, to address this question. However, the assumption of multivariate normality is untenable for many real data sets, making the application of these parametric methods questionable. A number of non-parametric methods, based on bootstrap replications of eigenvectors, is proposed to select an appropriate common eigenvector model for two population covariance matrices. Using simulation experiments, it is shown that the proposed selection methods outperform the existing parametric selection methods. If appropriate, the CPC model can provide covariance matrix estimators that are less biased than when assuming equality of the covariance matrices, and of which the elements have smaller standard errors than the elements of the ordinary unbiased covariance matrix estimators. A regularised covariance matrix estimator under the CPC model is proposed, and Monte Carlo simulation results show that it provides more accurate estimates of the population covariance matrices than the competing covariance matrix estimators. Covariance matrix estimation forms an integral part of many multivariate statistical methods. Applications of the CPC model in discriminant analysis, biplots and regression analysis are investigated.
It is shown that, in cases where the CPC model is appropriate, CPC discriminant analysis provides significantly smaller misclassification error rates than both ordinary quadratic discriminant analysis and linear discriminant analysis. A framework for the comparison of different types of biplots for data with distinct groups is developed, and CPC biplots constructed from common eigenvectors are compared to other types of principal component biplots using this framework. A subset of data from the Vermont Oxford Network (VON), of infants admitted to participating neonatal intensive care units in South Africa and Namibia during 2009, is analysed using the CPC model. It is shown that the proposed non-parametric methodology offers an improvement over the known parametric methods in the analysis of this data set which originated from a non-normally distributed multivariate population. CPC regression is compared to principal component regression and partial least squares regression in the fitting of models to predict neonatal mortality and length of stay for infants in the VON data set. The fitted regression models, using readily available day-of-admission data, can be used by medical staff and hospital administrators to counsel parents and improve the allocation of medical care resources. Predicted values from these models can also be used in benchmarking exercises to assess the performance of neonatal intensive care units in the Southern African context, as part of larger quality improvement programmes.
- Improved estimation procedures for a positive extreme value index (Stellenbosch : University of Stellenbosch, 2010-12) Berning, Thomas Louw; De Wet, Tertius; University of Stellenbosch. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
ENGLISH ABSTRACT: In extreme value theory (EVT) the emphasis is on extreme (very small or very large) observations. The crucial parameter when making inferences about extreme quantiles is called the extreme value index (EVI). This thesis concentrates on only the right tail of the underlying distribution (extremely large observations), and specifically on situations where the EVI is assumed to be positive. A positive EVI indicates that the underlying distribution of the data has a heavy right tail, as is the case with, for example, insurance claims data. There are numerous areas of application of EVT, since there are a vast number of situations in which one would be interested in predicting extreme events accurately. Accurate prediction requires accurate estimation of the EVI, which has received ample attention in the literature from a theoretical as well as a practical point of view. Countless estimators of the EVI exist in the literature, but the practitioner has little information on how these estimators compare. An extensive simulation study was designed and conducted to compare the performance of a wide range of estimators, over a wide range of sample sizes and distributions. A new procedure for the estimation of a positive EVI was developed, based on fitting the perturbed Pareto distribution (PPD) to observations above a threshold, using Bayesian methodology. Attention was also given to the development of a threshold selection technique. One of the major contributions of this thesis is a measure which quantifies the stability (or rather instability) of estimates across a range of thresholds. This measure can be used to objectively obtain the range of thresholds over which the estimates are most stable. It is this measure which is used for the purpose of threshold selection for the proposed PPD estimator. A case study of five insurance claims data sets illustrates how data sets can be analyzed in practice.
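As a point of reference for the estimation problem described above, the following minimal sketch computes the classical Hill estimator, one of the many existing estimators of a positive EVI, over several thresholds. It is not the PPD procedure proposed in the thesis. The function names `hill_estimator` and `hill_path`, the simulated Pareto sample and the chosen values of k are all illustrative assumptions.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimate of a positive EVI from the k largest observations."""
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    ordered = sorted(sample, reverse=True)   # ordered[0] is the maximum
    threshold = ordered[k]                   # (k+1)-th largest value
    # Mean log-excess of the k observations above the threshold.
    return sum(math.log(x / threshold) for x in ordered[:k]) / k


def hill_path(sample, ks):
    """Hill estimates over several thresholds, to inspect stability."""
    return {k: hill_estimator(sample, k) for k in ks}


# Illustrative data (assumption): strict Pareto with shape 2, so true EVI = 1/2.
rng = random.Random(2010)
claims = [rng.paretovariate(2.0) for _ in range(5000)]
estimates = hill_path(claims, [100, 250, 500, 1000])
for k, gamma_hat in estimates.items():
    print(f"k = {k:4d}  EVI estimate = {gamma_hat:.3f}")
```

Printing the estimates for several values of k gives a crude view of how much the result depends on the threshold, which is the kind of instability the stability measure mentioned in the abstract is meant to quantify.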
It is shown to what extent discretion can/should be applied, as well as how different estimators can be used in a complementary fashion to give more insight into the nature of the data and the extreme tail of the underlying distribution. The analysis is carried out from the point of raw data, to the construction of tables which can be used directly to gauge the risk of the insurance portfolio over a given time frame.
- Influential data cases when the C-p criterion is used for variable selection in multiple linear regression (Stellenbosch : Stellenbosch University, 2003) Uys, Daniel Wilhelm; Steel, S. J.; Van Vuuren, J. O.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistical and Actuarial Science.
ENGLISH ABSTRACT: In this dissertation we study the influence of data cases when the Cp criterion of Mallows (1973) is used for variable selection in multiple linear regression. The influence is investigated in terms of the predictive power and the predictor variables included in the resulting model when variable selection is applied. In particular, we focus on the importance of identifying and dealing with these so-called selection influential data cases before model selection and fitting are performed. For this purpose we develop two new selection influence measures, both based on the Cp criterion. The first measure is specifically developed to identify individual selection influential data cases, whereas the second identifies subsets of selection influential data cases. The success with which these influence measures identify selection influential data cases is evaluated in example data sets and in simulation. All results are derived in the coordinate-free context, with special application in multiple linear regression.
- Some statistical aspects of LULU smoothers (Stellenbosch : University of Stellenbosch, 2007-12) Jankowitz, Maria Dorothea; Conradie, W. J.; De Wet, Tertius; University of Stellenbosch. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
The smoothing of time series plays a very important role in various practical applications. Estimating the signal and removing the noise is the main goal of smoothing. Traditionally linear smoothers were used, but nonlinear smoothers became more popular through the years. From the family of nonlinear smoothers, the class of median smoothers, based on order statistics, is the most popular. A new class of nonlinear smoothers, called LULU smoothers, was developed by using the minimum and maximum selectors. These smoothers have very attractive mathematical properties. In this thesis their statistical properties are investigated and compared to those of the class of median smoothers. Smoothing, together with related concepts, is discussed in general. Thereafter, the class of median smoothers from the literature is discussed. The class of LULU smoothers is defined, their properties are explained and new contributions are made. The compound LULU smoother is introduced and its property of variation decomposition is discussed. The probability distributions of some LULU smoothers with independent data are derived. LULU smoothers and median smoothers are compared according to the properties of monotonicity, idempotency, co-idempotency, stability, edge preservation, output distributions and variation decomposition. A comparison is made of their respective abilities for signal recovery by means of simulations. The success of the smoothers in recovering the signal is measured by the integrated mean square error and the regression coefficient calculated from the least squares regression of the smoothed sequence on the signal. Finally, LULU smoothers are practically applied.
- Statistical inference for inequality measures based on semi-parametric estimators (Stellenbosch : Stellenbosch University, 2011-12) Kpanzou, Tchilabalo Abozou; De Wet, Tertius; Neethling, Ariane; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
ENGLISH ABSTRACT: Measures of inequality, also used as measures of concentration or diversity, are very popular in economics and especially in measuring the inequality in income or wealth within a population and between populations. However, they have applications in many other fields, e.g. in ecology, linguistics, sociology, demography, epidemiology and information science. A large number of measures have been proposed to measure inequality. Examples include the Gini index, the generalized entropy, the Atkinson and the quintile share ratio measures. Inequality measures are inherently dependent on the tails of the population (underlying distribution) and therefore their estimators are typically sensitive to data from these tails (nonrobust). For example, income distributions often exhibit a long tail to the right, leading to the frequent occurrence of large values in samples. Since the usual estimators are based on the empirical distribution function, they are usually nonrobust to such large values. Furthermore, heavy-tailed distributions often occur in real-life data sets, and remedial action therefore needs to be taken in such cases. The remedial action can be either a trimming of the extreme data or a modification of the (traditional) estimator to make it more robust to extreme observations. In this thesis we follow the second option, modifying the traditional empirical distribution function as estimator to make it more robust. Using results from extreme value theory, we develop more reliable distribution estimators in a semi-parametric setting. These new estimators of the distribution then form the basis for more robust estimators of the measures of inequality. These estimators are developed for the four most popular classes of measures, viz. Gini, generalized entropy, Atkinson and quintile share ratio. Properties of such estimators are studied especially via simulation.
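The sensitivity of the usual estimators to large values can be illustrated with a minimal sketch of the standard plug-in Gini index computed from the empirical distribution. This is the traditional estimator, not the semi-parametric one developed in the thesis. The function name `gini_index` and the small income vectors are illustrative assumptions.

```python
def gini_index(incomes):
    """Plug-in (empirical) Gini index of non-negative values."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive value")
    # Sorted-data form: G = 2 * sum(i * x_(i)) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical incomes, chosen only to show the effect of one extreme value.
equal = [10.0] * 5
baseline = [20.0, 25.0, 30.0, 35.0, 40.0]
contaminated = baseline[:-1] + [4000.0]   # one extreme income replaces 40

print(gini_index(equal))         # perfect equality
print(gini_index(baseline))
print(gini_index(contaminated))
```

Replacing a single observation with an extreme value moves the estimate from a low level of inequality to a high one, which is the nonrobustness that motivates modifying the estimator in the tail.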
Using limiting distribution theory and the bootstrap methodology, approximate confidence intervals are derived. Through the various simulation studies, the proposed estimators are compared to the standard ones in terms of mean squared error, relative impact of contamination, confidence interval length and coverage probability. In these studies the semi-parametric methods show a clear improvement over the standard ones. The theoretical properties of the quintile share ratio have not been studied much. Consequently, we also derive its influence function as well as the limiting normal distribution of its nonparametric estimator. These results have not previously been published. In order to illustrate the methods developed, we apply them to a number of real-life data sets. Using such data sets, we show how the methods can be used in practice for inference. In order to choose between the candidate parametric distributions, use is made of a measure of sample representativeness from the literature. These illustrations show that the proposed methods can be used to reach satisfactory conclusions in real-life problems.
- Variable selection for kernel methods with application to binary classification (Stellenbosch : University of Stellenbosch, 2008-03) Oosthuizen, Surette; Steel, S. J.; University of Stellenbosch. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
The problem of variable selection in binary kernel classification is addressed in this thesis. Kernel methods are fairly recent additions to the statistical toolbox, having originated approximately two decades ago in machine learning and artificial intelligence. These methods are growing in popularity and are already frequently applied in regression and classification problems. Variable selection is an important step in many statistical applications. Thereby a better understanding of the problem being investigated is achieved, and subsequent analyses of the data frequently yield more accurate results if irrelevant variables have been eliminated. It is therefore important to investigate aspects of variable selection for kernel methods. Chapter 2 of the thesis is an introduction to the main part presented in Chapters 3 to 6. In Chapter 2 some general background material on kernel methods is first provided, along with an introduction to variable selection. Empirical evidence is presented substantiating the claim that variable selection is a worthwhile enterprise in kernel classification problems. Several aspects which complicate variable selection in kernel methods are discussed. An important property of kernel methods is that the original data are effectively transformed before a classification algorithm is applied to them. The space in which the original data reside is called input space, while the transformed data occupy part of a feature space. In Chapter 3 we investigate whether variable selection should be performed in input space or rather in feature space. A new approach to selection, so-called feature-to-input space selection, is also proposed. This approach has the attractive property of combining information generated in feature space with easy interpretation in input space. An empirical study reveals that effective variable selection requires utilisation of at least some information from feature space.
Having confirmed in Chapter 3 that variable selection should preferably be done in feature space, the focus in Chapter 4 is on two classes of selection criteria operating in feature space: criteria which are independent of the specific kernel classification algorithm and criteria which depend on this algorithm. In this regard we concentrate on two kernel classifiers, viz. support vector machines and kernel Fisher discriminant analysis, both of which are described in some detail in Chapter 4. The chapter closes with a simulation study showing that two of the algorithm-independent criteria are very competitive with the more sophisticated algorithm-dependent ones. In Chapter 5 we incorporate a specific strategy for searching through the space of variable subsets into our investigation. Evidence in the literature strongly suggests that backward elimination is preferable to forward selection in this regard, and we therefore focus on recursive feature elimination. Zero- and first-order forms of the new selection criteria proposed earlier in the thesis are presented for use in recursive feature elimination and their properties are investigated in a numerical study. It is found that some of the simpler zero-order criteria perform better than the more complicated first-order ones. Up to the end of Chapter 5 it is assumed that the number of variables to select is known. We do away with this restriction in Chapter 6 and propose a simple criterion which uses the data to identify this number when a support vector machine is used. The proposed criterion is investigated in a simulation study and compared to cross-validation, which can also be used for this purpose. We find that the proposed criterion performs well. The thesis concludes in Chapter 7 with a summary and several discussions for further research.
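The idea of backward elimination driven by a criterion computed in feature space can be sketched as follows. The criterion used here, kernel-target alignment with a Gaussian kernel, is a stand-in chosen for illustration and is not claimed to be one of the criteria proposed in the thesis. It is algorithm-independent in the sense that no classifier is fitted. The function names, the value of `gamma` and the simulated data are assumptions.

```python
import math
import random


def gaussian_kernel(x, z, variables, gamma=0.5):
    """Gaussian kernel evaluated on the selected input variables only."""
    sq_dist = sum((x[j] - z[j]) ** 2 for j in variables)
    return math.exp(-gamma * sq_dist)


def alignment(X, y, variables):
    """Kernel-target alignment between the kernel matrix K and y y^T."""
    n = len(X)
    inner = 0.0      # Frobenius inner product <K, y y^T>
    norm_sq = 0.0    # squared Frobenius norm of K
    for i in range(n):
        for j in range(n):
            k_ij = gaussian_kernel(X[i], X[j], variables)
            inner += k_ij * y[i] * y[j]
            norm_sq += k_ij ** 2
    # The Frobenius norm of y y^T equals n when the labels are +1 / -1.
    return inner / (math.sqrt(norm_sq) * n)


def backward_elimination(X, y):
    """Drop variables one at a time; return them in order of elimination."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > 1:
        # Remove the variable whose deletion leaves the best-aligned kernel.
        drop = max(
            remaining,
            key=lambda v: alignment(X, y, [u for u in remaining if u != v]),
        )
        remaining.remove(drop)
        eliminated.append(drop)
    return eliminated + remaining   # last entry = variable kept longest


# Illustrative data (assumption): variable 0 separates the two classes,
# variables 1 and 2 are pure noise.
rng = random.Random(7)
y = [1 if i % 2 == 0 else -1 for i in range(40)]
X = [[2.0 * label + rng.gauss(0.0, 0.5), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
     for label in y]
order = backward_elimination(X, y)
print("elimination order:", order)
```

The last entry of `order` is the variable that survives longest. In this sketch the number of variables to retain is left to the user, which corresponds to the assumption made up to the end of Chapter 5.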