Masters Degrees (Statistics and Actuarial Science)

Recent Submissions

  • Item
    bipl5 : an R package for reactive calibrated axes PCA biplots
    (Stellenbosch : Stellenbosch University, 2024-03) Buys, Ruan; van der Merwe, C. J.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
    ENGLISH SUMMARY: Principal component analysis biplots with calibrated axes are popular and effective multivariate data visualisation tools. Biplots are, however, often difficult to navigate because of cluttered plotting in the central data area and the limitations that accompany static rendering. The bipl5 package proposes three contributions to the biplot display: i) automated orthogonal parallel translation of the axes to the boundary of the plot, to declutter the plot centre; ii) superimposed interclass kernel densities on each axis, to investigate class distributions in the data; iii) rendering of the final plot to a portable, standalone HTML file with embedded reactivity. This article considers the mathematical and computational implementation of bipl5, and showcases its functionality through an illustrative example.
  • Item
    Hidden Markov Models for Natural Language Processing
    (Stellenbosch : Stellenbosch University, 2024-03) Stapelberg, Keara-Linn; Muller, Chris; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
    ENGLISH SUMMARY: Foundational to many Natural Language Processing (NLP) applications is the task of part-of-speech (POS) tagging, which attempts to disambiguate words by labelling them with their correct grammatical part-of-speech. Supervised approaches to this problem train taggers on expert-annotated corpora to learn the patterns that result in the correct labelling. As word and tag context often informs correct labelling, the problem is expressed as a sequence modelling paradigm with a sequence of words W = {w1, w2, ..., wn} and their corresponding grammatical tags T = {t1, t2, ..., tn}. Since the 1990s, probabilistic approaches have become more widely adopted than deterministic “rule-based” approaches. Hidden Markov models (HMMs) remain a popular formalism, where the task of inference is employing a trained tagger to find the most likely hidden sequence of tags given an observed sequence of words. HMMs, however, are limited in their ability to express complex dependencies due to strict Markov assumptions, and their performance degrades when they encounter words not seen in training. Maximum entropy Markov models (MEMMs) and conditional random fields (CRFs) relax these assumptions and allow for the inclusion of a rich set of features that capture contextual and morphological information. Despite advances, improvements to tagging accuracy are still a relevant pursuit in this field, as even small errors may propagate down the NLP pipeline. Parallel to the importance of accuracy are considerations around factors that affect accuracy. These pivotal factors include the quality and abundance of training data, the size and complexity of the tag set, and the presence of unknown words. Simulation procedures are proposed to create these various conditions in the data. A Monte Carlo study compares the taggers and demonstrates the overall robustness of CRFs to these conditions, highlighting extreme cases where CRFs tend to overfit. Finally, future research avenues are discussed.
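As an illustration of the HMM inference task described in this summary (finding the most likely hidden tag sequence for an observed word sequence), the following is a minimal sketch of Viterbi decoding. It is not taken from the thesis: the toy tag set, the probabilities, and the small `unk` floor used for unseen words are all assumptions made for the example.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Most likely tag sequence for `words` under a first-order HMM.

    `unk` is an assumed floor probability for words a tag never emitted
    in training; it is a crude stand-in for proper unknown-word handling.
    """
    # best[i][t]: log-probability of the best path that ends in tag t at position i
    best = [{t: math.log(start_p[t]) + math.log(emit_p[t].get(words[0], unk))
             for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][t])) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + math.log(emit_p[t].get(words[i], unk))
            back[i][t] = prev
    # Trace back from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy model (all numbers invented for illustration)
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6, "dog": 0.05},
}
decoded = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
print(decoded)  # ['DET', 'NOUN', 'VERB']
```

Working in log-probabilities avoids numerical underflow on longer sentences. The fixed floor for unseen words is the simplest possible choice and reflects the weakness with unknown words that the summary attributes to HMMs.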
  • Item
    Advancing ESG objectives with ESG-linked derivatives
    (Stellenbosch : Stellenbosch University, 2024-03) Jansen van Rensburg, Pieter Willem; Mesias, Alfeus; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
    ENGLISH SUMMARY: Environmental, social, and governance (ESG) concerns require the active involvement of the financial sector in advancing ESG compliance. The global derivative market holds significant importance. Leveraging derivatives, exchanges, and clearinghouses offers a pathway for the financial sector to promote ESG objectives. This research assignment examines the particular challenges associated with utilizing derivatives to promote ESG compliance. It explores ESG-linked derivatives, utilizing Monte Carlo simulation, associated with Key Performance Indicators (KPIs) and priced based on the attainment of specific Sustainability Performance Targets (SPTs). These specific ESG-linked derivatives are tailored for this purpose: advancing ESG objectives. ESG-linked derivatives represent an emerging field, demanding further in-depth exploration.
  • Item
    Modern gradient boosting
    (Stellenbosch : Stellenbosch University, 2024-03) Zackey, Matthew David; Uys, Daniel Wilhelm; Steel, Sarel Johannes; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
    ENGLISH SUMMARY: Boosting is a supervised learning procedure that has gained considerable interest in statistical and machine learning owing to its powerful predictive performance. The idea of boosting is to obtain a model ensemble by sequentially fitting base learners to modified versions of the training data. The first complete boosting procedure was Adaptive boosting (AdaBoost), designed for binary classification. Gradient boosting followed AdaBoost and allowed boosting to be applied to any differentiable and continuous loss function. The most frequently used version of gradient boosting is Multiple Additive Regression Trees (MART), where trees are specified as the base learners. In the last several years, there have been numerous extensions to MART, aiming to improve its predictive performance and scalability. Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM) and Categorical Boosting (CatBoost) are three of these extensions, which in this thesis are termed the modern gradient boosting methods. The thesis introduces boosting by reviewing the details of AdaBoost, forward stagewise additive modelling (FSAM) and gradient boosting. Notably, the equivalence of AdaBoost and FSAM with the exponential loss is proven, FSAM for regression with trees is considered and the need for an efficient procedure like gradient boosting is emphasised. Additionally, two derivations of gradient boosting are provided. The first considers gradient boosting as an approximation to steepest descent of the empirical risk, while the second views gradient boosting as taking a quadratic approximation of FSAM. Since trees are a popular choice of base learner in gradient boosting, details are given on MART. The remainder of the thesis studies the modern methods, focusing on the mathematical details of their novelties. Examples, illustrations, and simulations are given for some of these novelties to provide further clarity.
Additionally, empirical studies investigating the generalisation performance of certain novelties are presented. More specifically, these empirical studies consider the performance of XGBoost’s regularisation parameters in tree-building, GOSS from LightGBM, the Plain and Ordered modes in CatBoost, and the cosine similarity to construct the trees in CatBoost. In these experiments, several binary classification datasets are considered with varying characteristics: size, class imbalance, sparsity and the inclusion of categorical features.
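The view of gradient boosting as approximate steepest descent on the empirical risk, mentioned in this summary, can be made concrete with a small sketch. The code below is not from the thesis: it assumes squared-error loss (so the negative gradient is simply the residual), a single numeric feature, one-split regression stumps as base learners, and hypothetical names such as `fit_stump` and `gradient_boost`.

```python
def fit_stump(x, r):
    """Least-squares one-split regression stump fitted to targets r on feature x."""
    best = None
    for s in sorted(set(x))[:-1]:  # every threshold that leaves both sides non-empty
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    _, s, ml, mr = best
    return lambda xi: ml if xi <= s else mr

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Gradient boosting for squared-error loss with stumps as base learners."""
    f0 = sum(y) / len(y)              # initial constant model
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - f)^2 with respect to f is the residual
        resid = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(x, resid)
        stumps.append(h)
        # Shrunken step in the direction of the fitted base learner
        pred = [pi + lr * h(xi) for xi, pi in zip(x, pred)]
    return lambda xi: f0 + lr * sum(h(xi) for h in stumps)

# Toy data (invented): a linear trend with a jump at x = 5
x = list(range(10))
y = [0.5 * xi + (1.0 if xi > 4 else 0.0) for xi in x]
model = gradient_boost(x, y)

mean_y = sum(y) / len(y)
mse_baseline = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_boosted = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(mse_boosted < mse_baseline)
```

For other differentiable losses only the `resid` line changes, since it is replaced by the negative gradient of that loss. The modern methods named in the summary add further refinements (regularisation, sampling schemes, categorical handling) that this sketch does not attempt to reproduce.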
  • Item
    Distribution theory and inference for bivariate extremes
    (Stellenbosch : Stellenbosch University, 2024-03) van Tonder, Jana; Steyn, Matthys Lucas; de Wet, Tertius; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
    ENGLISH SUMMARY: Various scenarios exist where the interest is in the modelling and prediction of rare or extreme events. Extreme value theory is an important branch of statistics, where limit theory is used to analyse extremes and to estimate the tail of the underlying distribution. Extreme value theory is most developed for the univariate case, i.e. modelling the extremes of only a single variable. In many scenarios, however, more than one variable has an effect on the probability of occurrence of extreme events. In such cases, multivariate extreme value theory plays a valuable role in the modelling procedure by taking into account the joint effect of multivariate extremes. In this thesis, the focus is on bivariate extreme value theory, i.e. multivariate extreme value theory restricted to two dimensions. Two approaches are considered: (1) componentwise maxima and (2) a pair of random variables above a large threshold vector. A mathematical derivation of the limiting distribution of normalised componentwise maxima, called the bivariate extreme value distribution, is given. For the threshold exceedance approach, it is shown how the underlying distribution can be approximated by the bivariate extreme value distribution at transformed points. Unfortunately, no parametric form exists for the bivariate extreme value distribution. However, the distribution can be expressed in terms of the two marginal distributions and a dependence function. The latter is important in characterising the dependence structure of the distribution. Various characterisations are proposed in the literature. A few popular dependence functions are discussed, and it is shown how they are related through appropriate transformations. Since dependence plays an important role in bivariate extreme value theory, different measures of extremal dependence are examined.
For an independent and identically distributed random bivariate sample with asymptotic dependence between the two variables, it will be shown how the limit theory, based on the bivariate extreme value distribution, can be applied and how inference can be performed. Different ways of estimating the dependence structure of the bivariate extreme value distribution will be described, which include parametric and non-parametric techniques. When data exhibit asymptotic independence, the bivariate extreme value distribution is not suitable to use in the modelling procedure. Therefore, other models will be explored which better describe the tail of an asymptotically independent distribution. For illustration, the above-mentioned methods will be applied to two South African bivariate environmental datasets. For further interpretation and visualisation, graphs of the estimated distributions and quantile curves will also be given. Finally, it will be demonstrated that an asymptotically dependent model can lead to an overestimation of the joint exceedance probability when working in the tail of an asymptotically independent distribution, which agrees with the findings in the literature.
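The summary does not say which measures of extremal dependence the thesis examines. As one possible illustration, the sketch below estimates a commonly used quantity, the conditional probability that one variable exceeds its u-quantile given that the other does, whose limit as u approaches 1 is often written as chi. The function name `empirical_chi`, the rank transform to the uniform scale, and the choice u = 0.9 are assumptions made for this example.

```python
def empirical_chi(x, y, u=0.9):
    """Empirical P(F_Y(Y) > u | F_X(X) > u), using ranks as pseudo-uniform margins."""
    n = len(x)

    def pseudo_uniform(v):
        # Rank transform: the smallest value maps to 1/(n+1), the largest to n/(n+1)
        order = sorted(range(n), key=lambda i: v[i])
        ranks = [0.0] * n
        for pos, i in enumerate(order):
            ranks[i] = (pos + 1) / (n + 1)
        return ranks

    ux, uy = pseudo_uniform(x), pseudo_uniform(y)
    exceed_x = sum(1 for a in ux if a > u)
    joint = sum(1 for a, b in zip(ux, uy) if a > u and b > u)
    return joint / exceed_x if exceed_x else float("nan")

# Two deterministic extremes (invented data)
x = list(range(100))
chi_same = empirical_chi(x, x)            # large values always occur together
chi_opposite = empirical_chi(x, x[::-1])  # large values never occur together
print(chi_same, chi_opposite)  # 1.0 0.0
```

Estimates that stay well above zero as u increases point towards asymptotic dependence, while estimates that decay towards zero point towards asymptotic independence, the case in which the summary notes that the bivariate extreme value distribution is not a suitable model.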