Simulating read length, sequencing depth and base-call quality for RNA-sequencing experimental design

dc.contributor.advisor: Tromp, Gerard [en_ZA]
dc.contributor.author: Zimire, Darryn [en_ZA]
dc.contributor.other: Stellenbosch University. Faculty of Medicine and Health Sciences. Dept. of Biomedical Sciences: Molecular Biology and Human Genetics. [en_ZA]
dc.date.accessioned: 2021-11-24T06:46:57Z
dc.date.accessioned: 2021-12-22T14:23:25Z
dc.date.available: 2021-11-24T06:46:57Z
dc.date.available: 2021-12-22T14:23:25Z
dc.date.issued: 2021-12
dc.description: Thesis (MSc)--Stellenbosch University, 2021. [en_ZA]
dc.description.abstract: ENGLISH ABSTRACT: RNA-sequencing (RNA-seq) is a quantitative high-throughput sequencing biotechnology developed to analyse and provide insights into the molecular biology of the transcriptome. An appropriate experimental design and analysis strategy for RNA-seq experiments is essential and requires statistical methods suited to modelling the characteristics of sequencing data, which take the form of a matrix in which the number of reads per genomic feature serves as a digital estimate of relative expression. Sequencing depth, read length and data quality are of particular importance for planning and analysing RNA-seq experiments, as these factors can be decided before conducting the experiment. The number of reads generated for a particular experiment affects the statistical power to draw biological conclusions. Read length, coupled with its associated quality, influences the mappability of the sequencing data and in turn has an impact on information loss. Shorter reads tend to map to multiple locations when aligned to the reference genome or transcriptome. The quality of the data also affects the downstream analysis and can result in data being discarded, diminishing the ability to establish biological insights with confidence from the experimental data. To assist in the design of RNA-seq experiments, I present an RNA-seq data simulator (RSDS), a proof-of-concept computer simulator written in the Python programming language for raw RNA-seq data simulations. RSDS allows for simulation of both single-end and paired-end RNA-seq data with sequencing depth, read length, and base-call quality as tuneable settings. A two-group differential expression experiment can be simulated using RSDS. I describe, validate and implement the RSDS simulator and demonstrate its use for generating raw synthetic RNA-seq data by varying the parameter values of sequencing depth, read length, and base-call quality.
I demonstrate the ability of RSDS to reproduce a transcript expression profile from an input matrix of read counts derived from a real RNA-seq experiment and to produce a two-group differential expression experiment with varying fold-changes and expression levels. [en_ZA]
dc.description.abstract: AFRIKAANS ABSTRACT: The development of quantitative sequencing technologies such as RNA-sequencing (RNA-seq) has provided considerable insight into molecular biology. Proper design and analysis of these experiments require statistical models and techniques that take into account the nature of sequencing data, which usually consist of a matrix of read counts per feature. Of particular importance for developing these methods and designing the experiments is the role of sequencing depth, read length and data quality. The depth of an experiment affects the ability to draw biological conclusions, which means that an experimental design must take into account the trade-off between cost, statistical power and the number of samples examined. Read length, together with its associated quality, is an important consideration for every sequencing experiment because it influences the fate of a read after mapping to a reference genome. Shorter reads tend to map to more than one location when aligned to the reference genome and are often discarded, which leads to a loss of biological information. In this thesis I investigate the effects of sequencing depth, read length and data quality on the design and analysis of RNA-seq experiments. To assist with the design of RNA-seq experiments, I present the RNA-seq Data Simulator (RSDS), a proof-of-concept computer simulator written in the Python programming language for simulating raw RNA-seq data. RSDS provides for simulation of both single-end and paired-end RNA-seq data with sequencing depth, read length and base-call quality as tuneable settings. It also offers the ability to simulate a two-group differential gene expression experiment.
I describe, validate and implement the RSDS simulator and demonstrate its use for producing raw RNA-seq data by varying the parameter values of sequencing depth, read length and base-call quality. I also demonstrate the ability of RSDS to reproduce a transcript expression profile from an input matrix of read counts derived from a real RNA-seq experiment. [af_ZA]
dc.description.version: Masters [en_ZA]
dc.format.extent: 146 pages [en_ZA]
dc.identifier.uri: http://hdl.handle.net/10019.1/123822
dc.language.iso: en_ZA [en_ZA]
dc.publisher: Stellenbosch : Stellenbosch University [en_ZA]
dc.rights.holder: Stellenbosch University [en_ZA]
dc.subject: RNA-sequencing [en_ZA]
dc.subject: Biotechnology [en_ZA]
dc.subject: Molecular biology [en_ZA]
dc.subject: UCTD [en_ZA]
dc.title: Simulating read length, sequencing depth and base-call quality for RNA-sequencing experimental design [en_ZA]
dc.type: Thesis [en_ZA]
Files
Original bundle
Name: zimire_read_2021.pdf
Size: 4.06 MB
Format: Adobe Portable Document Format