On the impact of the pangenome and annotation discrepancies while building protein sequence databases for bacteria proteogenomics

dc.contributor.author: Machado, Karla C. T. [en_ZA]
dc.contributor.author: Fortuin, Suereta [en_ZA]
dc.contributor.author: Tomazella, Gisele Guicardi [en_ZA]
dc.contributor.author: Fonseca, Andre F. [en_ZA]
dc.contributor.author: Warren, Robin Mark [en_ZA]
dc.contributor.author: Wiker, Harald G. [en_ZA]
dc.contributor.author: De Souza, Sandro Jose [en_ZA]
dc.contributor.author: De Souza, Gustavo Antonio [en_ZA]
dc.date.accessioned: 2021-11-03T13:47:52Z
dc.date.available: 2021-11-03T13:47:52Z
dc.date.issued: 2019
dc.description: CITATION: Machado, K. C. T., et al. 2019. On the impact of the pangenome and annotation discrepancies while building protein sequence databases for bacteria proteogenomics. Frontiers in Microbiology, 10:1410, doi:10.3389/fmicb.2019.01410.
dc.description: The original publication is available at https://www.frontiersin.org
dc.description.abstract: ENGLISH ABSTRACT: In proteomics, peptide information within mass spectrometry (MS) data from a specific organism sample is routinely matched against the protein sequence database that best represents that organism. However, if the species or strain in the sample is unknown or genetically poorly characterized, it becomes challenging to choose a database that can represent the sample. Building customized protein sequence databases that merge multiple strains of a given species has become a strategy to overcome such restrictions. However, as more genetic information becomes publicly available and genetic features such as the existence of pan- and core genes within a species are revealed, we questioned how efficiently such merging strategies report relevant information. To test this, we constructed databases containing conserved and unique sequences for 10 different species. Features that are relevant for probabilistic-based protein identification by proteomics were then monitored. As expected, the increase in database complexity correlates with pangenomic complexity. However, Mycobacterium tuberculosis and Bordetella pertussis generated very complex databases despite having low pangenomic complexity. We further tested database performance using MS data from eight clinical strains of M. tuberculosis and from two published datasets from Staphylococcus aureus. We show that, with an approach in which database size is controlled by removing repeated identical tryptic sequences across strains/species, computational time can be reduced drastically as database complexity increases. [en_ZA]
dc.description.uri: https://www.frontiersin.org/articles/10.3389/fmicb.2019.01410/full
dc.description.version: Publisher's version
dc.format.extent: 13 pages [en_ZA]
dc.identifier.citation: Machado, K. C. T., et al. 2019. On the impact of the pangenome and annotation discrepancies while building protein sequence databases for bacteria proteogenomics. Frontiers in Microbiology, 10:1410, doi:10.3389/fmicb.2019.01410
dc.identifier.issn: 1664-302X (online)
dc.identifier.other: doi:10.3389/fmicb.2019.01410
dc.identifier.uri: http://hdl.handle.net/10019.1/123351
dc.language.iso: en_ZA [en_ZA]
dc.publisher: Frontiers Media [en_ZA]
dc.rights.holder: Authors retain copyright [en_ZA]
dc.subject: Genomics [en_ZA]
dc.subject: Protein sequence [en_ZA]
dc.subject: Proteomics [en_ZA]
dc.subject: Mass spectrometry [en_ZA]
dc.title: On the impact of the pangenome and annotation discrepancies while building protein sequence databases for bacteria proteogenomics [en_ZA]
dc.type: Article [en_ZA]
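
The abstract's idea of controlling database size by removing repeated identical tryptic sequences across strains can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the `min_len` parameter, the toy sequences, and the simplified trypsin rule (cleave after K or R unless the next residue is P, with no missed cleavages) are all assumptions made for illustration.

```python
def tryptic_peptides(protein, min_len=7):
    """In-silico digest: cleave after K/R unless followed by P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal remainder that does not end in K/R.
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_len]


def nonredundant_peptides(strains, min_len=7):
    """Map each unique tryptic peptide to the set of strains containing it."""
    seen = {}
    for strain, proteins in strains.items():
        for protein in proteins:
            for pep in tryptic_peptides(protein, min_len):
                seen.setdefault(pep, set()).add(strain)
    return seen


# Hypothetical toy input: two strains sharing one tryptic peptide.
toy_strains = {
    "strain_1": ["AAKGGRPCCK"],
    "strain_2": ["AAKTTTK"],
}
unique = nonredundant_peptides(toy_strains, min_len=3)
for peptide, owners in sorted(unique.items()):
    print(peptide, sorted(owners))
```

Each identical peptide is stored once, however many strains share it, so the searchable entries grow with the number of distinct peptides rather than with the number of merged strains.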
Files
Original bundle
Name: machado_impact_2019.pdf
Size: 1.47 MB
Format: Adobe Portable Document Format
Description: Download article