Categorical CVA biplots

Date
2020-12
Publisher
Stellenbosch : Stellenbosch University
Abstract
ENGLISH ABSTRACT: Data visualisation receives considerable emphasis in the modern era, especially where large amounts of data are involved. Such data are usually high-dimensional and cannot be visualised by conventional means. Biplots, which can be described as a generalisation of the scatterplot, have recently seen a surge in use for visualising multivariate data. They use dimension reduction techniques to construct a two-dimensional representation of the data with non-orthogonal axes. At present, however, no effective biplot construction technique exists that adequately separates classes when categorical data are present. This research therefore builds upon an existing biplot construction technique, using elements of Canonical Variate Analysis (CVA) and non-linear Principal Component Analysis (PCA) to develop a technique that can perform class separation when both numerical and categorical data are present. This novel biplot construction methodology forms the crux of this research assignment. The feasibility of the method was explored using the well-known Iris data set, with two variables binned to form categorical variables. It is shown that the novel method improves upon existing biplot construction techniques in terms of classification accuracy and class separation. It is noted, however, that the method can be extended by incorporating CVA into the iterative algorithm that solves for the optimal categorical level scores. A web-based Shiny application was built as a supplement to this paper and can be found at https://davidrodwell.shinyapps.io/CategoricalCVABiplotApp/. There the user can interact with the data sets, proposed methodology, and functionalities presented in this research.
AFRIKAANS ABSTRACT (translated): In the modern era, great emphasis is placed on the visualisation of data, especially where large data sets are involved. In these cases the data are usually high-dimensional in nature, which means they cannot be represented visually by conventional means. Recent developments have led to an increase in the use of biplots to represent multivariate data, where biplots can be described as a generalisation of a scatterplot. These biplots use dimension reduction techniques to construct a two-dimensional representation of the data on a non-orthogonal system of axes. At present there is no effective biplot construction technique that can separate classes when categorical data are present. This research builds on an existing biplot construction technique by using elements of Canonical Variate Analysis (CVA) and non-linear Principal Component Analysis (PCA) to develop a technique that can separate classes in cases where both numerical and categorical data are present. This new biplot construction methodology forms the crux of this research assignment. The feasibility of the method was also examined by considering the well-known Iris data set, with two variables binned to become categorical variables. It is shown that this new method improves on existing biplot construction techniques in terms of classification accuracy and class separation. It was noted, however, that the method can be extended by incorporating CVA into the iterative algorithm that solves for the optimal categorical level scores. A web-based Shiny application was built as a supplement to this article and can be found at https://davidrodwell.shinyapps.io/CategoricalCVABiplotApp/. There the user can interact with the data sets, proposed methodology, and functionalities presented in this research.
Description
Thesis (MCom)--Stellenbosch University, 2020.
Keywords
Biplots, Canonical Variate Analysis (CVA), Categorical data, Canonical correlation (Statistics), Information visualization, Multivariate analysis, UCTD