Networks and multivariate statistics as applied to biological datasets and wine-related omics

Jacobson, Daniel A. (2013-12)

Thesis (PhD)--Stellenbosch University, 2013.

Thesis

ENGLISH ABSTRACT: Introduction: Wine production is a complex biotechnological process that aims to productively coordinate the interactions and outputs of several biological systems, including grapevine and many microorganisms such as wine yeast and wine bacteria. High-throughput data-generating tools in the fields of genomics, transcriptomics, proteomics, metabolomics and microbiomics are being applied both locally and globally in order to better understand complex biological systems. The datasets available for analysis and mining therefore include de novo datasets created by collaborators as well as publicly available datasets that can be used to gain further insight into the systems under study. To model the complexity inherent in and across these datasets, it is necessary to develop methods and approaches based on network theory and multivariate data analysis, and to explore the intersections between these two approaches to data modelling, mining and interpretation. Networks: The traditional reductionist paradigm of analysing single components of a biological system has not provided tools with which to adequately analyse datasets that attempt to capture systems-level information. Network theory has recently emerged as a new discipline with which to model and analyse complex systems. It has arisen from the study of real and often quite large networks derived empirically from the large volumes of data that have been collected from communications, internet, financial and biological systems. This is in stark contrast to previous theoretical approaches to understanding complex systems, such as complexity theory, synergetics, chaos theory, self-organised criticality and fractals, which were all sweeping theoretical constructs based on small toy models that proved unable to address the complexity of real-world systems.
Multivariate Data Analysis: Principal component analysis (PCA) and partial least squares (PLS) regression are commonly used to reduce the dimensionality of a matrix (and amongst matrices in the case of PLS) in which there are a considerable number of potentially related variables. PCA and PLS are variance-focused approaches in which components are ranked by the amount of variance they each explain. Components are, by definition, orthogonal to one another and, as such, uncorrelated. Aims: This thesis explores the development of computational biology tools that are essential to fully exploit the large datasets being generated by systems-based approaches, in order to gain a better understanding of wine-related organisms such as grapevine (and tobacco as a laboratory-based plant model), plant pathogens, microbes and their interactions. The broad aim of this thesis is therefore to develop computational methods that can be used in an integrated systems-based approach to model and describe different aspects of the winemaking process from a biological perspective. To achieve this aim, computational methods have been developed and applied in the areas of transcriptomics, phylogenomics, chemiomics and microbiomics. Summary: The primary approaches taken in this thesis have been the use of networks and multivariate data analysis methods to analyse high-dimensional datasets. Furthermore, several of the approaches have started to explore the intersection between networks and multivariate data analysis. This would seem to be a logical progression, as both networks and multivariate data analysis are focused on matrix-based data modelling and therefore have many of their roots in linear algebra.
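
The PCA properties stated in the abstract (components ranked by explained variance, orthogonal to one another, and uncorrelated) can be illustrated with a minimal sketch. This is not the thesis's own code: it assumes NumPy, uses a synthetic data matrix rather than any dataset from the thesis, and the function name `pca` and all variable names are hypothetical.

```python
import numpy as np


def pca(X, n_components):
    """PCA via singular value decomposition of the column-centred data matrix.

    Illustrative helper (hypothetical name), not taken from the thesis.
    """
    Xc = X - X.mean(axis=0)                      # centre each variable (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # rows are orthonormal loading vectors
    scores = Xc @ components.T                   # samples projected onto the components
    explained_variance = s[:n_components] ** 2 / (X.shape[0] - 1)
    return components, scores, explained_variance


# Synthetic example (assumption): 200 samples, 6 correlated variables
# generated from 2 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

components, scores, explained_variance = pca(X, n_components=3)

# Components are orthonormal: their Gram matrix is the identity.
gram = components @ components.T

# Scores are uncorrelated: their covariance matrix is diagonal,
# with the explained variances on the diagonal.
score_cov = np.cov(scores, rowvar=False)
```

Because NumPy returns singular values in descending order, `explained_variance` is already ranked from largest to smallest, which matches the ranking of components described above.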

AFRIKAANS ABSTRACT (English translation): Introduction: Wine production is a complex biotechnological process that aims at the productive coordination of the various interactions and outputs of several biological systems. These systems include the grapevine, which is of particular importance, as well as the wine yeast and wine bacteria. High-throughput data generation is currently being applied both globally and locally in the fields of genomics, transcriptomics, proteomics, metabolomics and microbiomics. As such, these types of datasets are available for analysis, mining and exploration. The datasets can be generated de novo, with the help of collaborators, or they can be obtained from public databases, where such datasets are often made available so that further insight can be gained into the system under study. The high-throughput datasets under discussion contain a high degree of inherent complexity, both within themselves and across different datasets. In order to model these datasets and their inherent complexity, it is necessary to develop methods and approaches grounded in network theory and multivariate statistics. Furthermore, it is also necessary to explore the intersections between network theory and multivariate statistics in order to improve the modelling, mining, exploration and interpretation of data. Networks: The traditional reductionist paradigm, in which single components of a biological system are analysed, has so far not delivered adequate methods and tools with which to analyse datasets that strive to capture systems-level information. Network theory has emerged as a new discipline that can be applied to the modelling and analysis of complex systems. It stems from the study of real, often large networks that are derived empirically from the large volumes of data currently emerging from communications, internet, financial and biological systems.
This is in stark contrast to earlier theoretical approaches that sought to understand complex systems through concepts such as complexity theory, synergetics, chaos theory, self-organised criticality and fractals. All of the above are broad theoretical constructs, based on relatively small-scale models, that were unable to offer solutions for the complexity of real-world systems. Multivariate Data Analysis: Principal component analysis (PCA) and partial least squares (PLS) regression are often used to reduce the dimensionality of a matrix (and between matrices in the case of PLS). These matrices often contain a considerable number of potentially related variables. PCA and PLS are variance-driven methods in which components are ranked by the amount of variance that each component explains. Components are by definition orthogonal to one another and, as such, uncorrelated. Aims: This thesis explores the development of various computational biology methods that are essential to fully exploit the large-scale datasets currently being generated by systems-based approaches. The goal is to gain a better understanding and knowledge of wine-related organisms; these organisms include the grapevine (with tobacco as a laboratory-based plant model), plant pathogens and microbes, as well as their interactions. The broad aim of this thesis is therefore to develop computational methods that can be used in an integrated systems-based approach to the modelling and description of different aspects of the winemaking process from a biological standpoint. To achieve this aim, computational methods have been developed and applied in the fields of transcriptomics, phylogenomics, chemiomics and microbiomics. Summary: The primary approach taken in this thesis is the use of networks and multivariate data analysis methods to analyse high-dimensional datasets.
Furthermore, several of the methods begin to explore the common ground between networks and multivariate data analysis. This appears to be a logical progression, since both networks and multivariate data analysis are focused on matrix-based data modelling and are therefore rooted in linear algebra.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/85630