Implementing a pipeline for analysing single-cell RNA sequencing data

Date
2023-03
Publisher
Stellenbosch : Stellenbosch University
Abstract
ENGLISH ABSTRACT: Single-cell RNA sequencing (scRNA-seq) has permitted the dissection of gene expression at single-cell resolution and provides novel insights into the composition of apparently homogeneous cell types and the transitions between cell states, thereby deepening our understanding of the cell as a functional unit. The data generated by scRNA-seq are characterised by sparsity, heterogeneity, high dimensionality, and large scale. As a result of biological and technical limitations, scRNA-seq data are “noisier” and more complex than their bulk RNA-seq counterparts. Thus, analysing scRNA-seq data demands new statistical and computational methods. Analytical algorithms employed in scRNA-seq pipelines are prone to producing different results depending on the state at the start of the analysis and the number of iterations of computation, which complicates reproducibility. I developed a highly robust, scalable, and reproducible analysis pipeline for scRNA-seq data, implemented in Nextflow, a workflow management system that complies with current best practices in bioinformatics. The pipeline implements pre-processing and comprehensive downstream analyses for scRNA-seq data. With the publicly available datasets used for testing, the pipeline identified cell types and differentially expressed genes that enabled the identification of cell subtypes. Trajectory inference also showed the differentiation trajectory of cells and identified subclusters within cell populations. In addition, the pipeline documents all steps and transformations, records software packages and versions, and incorporates ontological metadata annotation. Containerisation of pipeline processes ensures that software dependencies are satisfied, contributing to consistent, robust, and reproducible science.
AFRIKAANS SUMMARY (translated into English): Single-cell RNA sequencing (scRNA-seq) makes it possible to study gene expression at single-cell resolution and gives new insight into the composition of seemingly homogeneous cell types and the transition between cell states, thereby deepening our understanding of the cell as a functional unit. The data that scRNA-seq produces are characterised by a sparse distribution of values, large variation, high dimensionality, and large scale. Because of biological and technical limitations, scRNA-seq data are more prone to background noise than bulk RNA sequencing data. scRNA-seq data therefore require new statistical and computational methods. Reproducibility is challenging because the analytical algorithms in scRNA-seq pipelines tend to give different results depending on the starting point and the number of iterations in the computation. I developed a robust, reproducible pipeline that can be applied at any scale to analyse scRNA-seq data, and implemented it in Nextflow, a workflow management system that reflects current best practice in bioinformatics. The pipeline is the first to perform both pre-processing and extensive downstream analysis for scRNA-seq. The pipeline was tested with freely available datasets and identified cell types. Trajectory inference also showed the differentiation of cells and subgroups. In addition, the pipeline keeps a record of all steps, transformations, software packages, and versions, and includes ontological metadata. The pipeline processes are isolated in virtual containers so that software dependencies can be managed. This contributes to sustainable and reproducible science.
Description
Thesis (MSc)--Stellenbosch University, 2023.
Keywords
Pipeline development, Single-cell RNA-seq, Reproducible science, Workflow management, Containerization