Graph-based semi-supervised learning for the detection of potential disease causing genes

Van Zyl, G. (2020-12)

Thesis (PhD)--Stellenbosch University, 2020.

Thesis

ENGLISH ABSTRACT: It is widely believed that almost all diseases are, to some extent, influenced by individuals’ genetic make-up. Ample insight into this relationship may usher in a new age where preventative, precision medicine is the norm. The identification of human genes associated with diseases (disease genes, in short) is a central step in the realisation of this ambition. The development of computational approaches aimed at identifying putative disease genes among a large pool of candidates (so as to reduce the number of alternatives to be explored in further validation experiments and functional studies) has, therefore, become one of the fundamental problems in bioinformatics.

In the realm of bioinformatics, disease gene classification is primarily based on the principle that “the network neighbour of a disease gene is likely to cause the same or a similar disease.” In this dissertation, a novel computational approach to the disease gene identification problem is proposed. This methodological framework utilises the aforementioned principle and exploits both the modular nature of biological networks and the abundance of available data related to the similarities between genes within a semi-supervised machine learning paradigm.

The proposed disease gene identification methodology is demonstrated practically and found to exhibit significant classification abilities. In addition, the framework is successfully applied to obtain ranked sets of putative disease gene predictions, a number of which are verified by retrieving evidence of their involvement in the origins of diseases from the literature.

AFRIKAANS SUMMARY (translated from Afrikaans): It is generally believed that nearly all diseases are influenced, to some degree, by the genetic composition of individuals. Good insight into this relationship may lead to a new era in which preventative, precision medicine is the norm. The identification of human genes that are linked to diseases (disease genes, in short) is a central step in realising this ideal. The development of computational approaches aimed at identifying putative disease genes among a large number of candidates (thereby reducing the number of alternatives that must be examined in further validation experiments and functional studies) is therefore one of the fundamental problems in bioinformatics.

In the field of bioinformatics, the classification of disease genes is based mainly on the principle that “the network neighbour gene of a disease gene is likely to cause the same or a similar disease.” In this dissertation, a new computational approach to the disease gene identification problem is presented. This methodological framework makes use of the aforementioned principle and exploits both the modular nature of biological networks and the abundance of available data relating to the similarities between genes within a semi-supervised machine learning paradigm.

The proposed methodology for the identification of disease genes is demonstrated practically, and it is found to possess significant classification ability. In addition, the framework is successfully applied to produce rankings of putative disease genes, a number of which are verified by substantiating, from the literature, evidence of their involvement in the origin of diseases.
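
The principle quoted in the abstract, that a network neighbour of a disease gene is likely to be associated with the same or a similar disease, can be illustrated with a generic label propagation scheme on a toy graph. The sketch below is not the method developed in the thesis, whose details are not given in this record; it is a minimal example of graph-based semi-supervised scoring under stated assumptions. The gene names (`G1` to `G5`), the edges, the seed set, and the parameters `alpha` and `iterations` are all hypothetical.

```python
# Toy illustration only: gene names, edges, and the seed set are invented,
# and this is a generic label propagation scheme, not the thesis's method.

def propagate_scores(edges, seeds, alpha=0.5, iterations=50):
    """Spread evidence from known disease genes (seeds) over an undirected graph.

    Each round, a gene's score becomes a blend of the mean score of its
    neighbours (weight alpha) and its own initial label (weight 1 - alpha).
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    for gene in seeds:
        neighbours.setdefault(gene, set())

    # Labelled genes start at 1.0; unlabelled candidates start at 0.0.
    initial = {gene: (1.0 if gene in seeds else 0.0) for gene in neighbours}
    scores = dict(initial)

    for _ in range(iterations):
        updated = {}
        for gene, nbrs in neighbours.items():
            neighbour_mean = sum(scores[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
            updated[gene] = alpha * neighbour_mean + (1 - alpha) * initial[gene]
        scores = updated
    return scores


def rank_candidates(scores, seeds):
    """Return unlabelled genes ordered from highest to lowest score."""
    candidates = [gene for gene in scores if gene not in seeds]
    return sorted(candidates, key=lambda gene: scores[gene], reverse=True)


# A hypothetical chain of interactions: G1 - G2 - G3 - G4 - G5.
EDGES = [("G1", "G2"), ("G2", "G3"), ("G3", "G4"), ("G4", "G5")]
SEEDS = {"G1"}  # assumed to be the only known disease gene

scores = propagate_scores(EDGES, SEEDS)
ranking = rank_candidates(scores, SEEDS)
print(ranking)
```

In this toy chain the candidates closest to the seed gene receive the highest scores, which is the behaviour the neighbour principle predicts. A realistic setting would use weighted edges derived from gene similarity data and would need held-out known disease genes to evaluate the ranking.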

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/109105