Heterologous expression and partial characterisation of enzymes predicted in silico by deep feed-forward neural networks

Date
2024-03
Publisher
Stellenbosch : Stellenbosch University
Abstract
ENGLISH ABSTRACT: Inverse protein folding (IPF) involves the prediction of amino acid sequences that will fold into a specified three-dimensional (3D) structure. The implementation of advanced algorithms has expedited progress in structural bioinformatics, with increasing application of machine learning (ML) approaches. This has directly contributed to the unparalleled success of modern protein structure prediction methods that are capable of predicting the 3D atomic coordinates of complex structures from only their protein sequences with near-experimental accuracy. Comparable progress has been seen for reverse folding predictors, which are now largely reliant on ever-evolving neural network architectures. IPF tools that are guided by the physical principles of protein folding offer insight and potential for rational protein design and enzyme engineering that have long been unattainable. However, the outputs of many of these tools have only been assessed in silico by means of ML folding algorithms; consequently, a gap exists between their potential and realised application. SeqPredNN is an in-house feed-forward IPF neural network trained using features extracted from the subset of Protein Data Bank entries that share less than 90% sequence identity. Similarly, ProteinMPNN is a deep learning-based protein sequence design method. Unlike ProteinMPNN, SeqPredNN is yet to be applied in vitro, and therefore lacks experimental validation. This study used both SeqPredNN and ProteinMPNN to predict novel protein sequences for Bacillus subtilis lipase A (LipA) and Streptomyces griseus trypsin (SGT). The SeqPredNN sequence recovery rates were <40%, while the ProteinMPNN predictions were >60% identical to the native sequences. Following the re-introduction of all residues deemed necessary for catalysis, the curated sequences were folded using AlphaFold2. The resulting conformations for both IPF tools were remarkably similar to the native X-ray crystal structures.
Further molecular dynamics simulations and ligand docking showed that, despite vastly different amino acid sequences, the IPF enzymes were expected to possess physicochemical properties largely comparable to those of the native counterparts. To validate this experimentally, the novel proteins were produced in recombinant Escherichia coli and Pichia pastoris strains. Noticeably lower levels of heterologous protein expression were observed for the IPF variants, particularly for SeqPredNN LipA, compared to the native proteins. Furthermore, catalytic activity was significantly reduced or completely lost in the predicted enzymes. This is likely due to modifications of the electrostatic surface potential and active site topology that no longer facilitate correct substrate interactions, which are deemed to be repercussions of the highly unique protein sequences. Preliminary characterisation via circular dichroism yielded empirical secondary structure compositions that differed from the expected values, adding to the lack of congruence between the computational and experimental results. Importantly, this study provides the first experimental implementation of SeqPredNN, and emphasises critical considerations for difficult protein design targets. However, the ultimate utility of a reverse folding tool lies in its ability to predict protein sequences that will achieve structures capable of performing intended functions. Accordingly, while these tools offer an unprecedented starting point for rational protein modification, a more holistic approach with continued experimental validation may be required for greater success.
AFRIKAANS SUMMARY: Inverse protein folding (IPF) involves predicting amino acid sequences that will fold into a specific three-dimensional (3D) structure. The implementation of advanced algorithms has accelerated progress in structural bioinformatics, with increasing application of machine learning (ML) approaches. This has contributed directly to the unprecedented success of modern protein structure prediction methods that are able to predict the 3D atomic coordinates of complex structures from their protein sequences alone, with near-experimental accuracy. Comparable progress has been seen for inverse folding predictors, which now depend largely on evolving neural network architectures. IPF tools, which are guided by the physical principles of protein folding, offer insight and potential for rational protein design and enzyme engineering that have until now been unattainable. However, the outputs of many of these tools have only been assessed in silico using ML folding algorithms; consequently, a gap exists between their potential and realised application. SeqPredNN is an in-house feed-forward IPF neural network trained on features extracted from the subset of Protein Data Bank entries that have less than 90% identity. Similarly, ProteinMPNN is a deep learning-based protein sequence design method. In contrast to ProteinMPNN, SeqPredNN has yet to be applied in vitro, and experimental validation has not yet been obtained. This study used both SeqPredNN and ProteinMPNN to predict novel protein sequences for Bacillus subtilis lipase A (LipA) and Streptomyces griseus trypsin (SGT). The SeqPredNN sequences were <40% identical to the native sequences, while the ProteinMPNN predictions were >60% identical. After all residues deemed necessary for catalysis had been reintroduced, the curated sequences were folded with AlphaFold2.
The resulting conformations for both IPF tools were remarkably similar to the native X-ray crystal structures. Further molecular dynamics simulations and ligand docking showed that the IPF enzymes possess physicochemical properties largely comparable to those of the native enzymes, despite large differences in amino acid sequence. To demonstrate this experimentally, the novel proteins were produced in Escherichia coli and Pichia pastoris strains. Recombinant production levels of the IPF variants were noticeably lower than those of the native proteins, particularly for SeqPredNN LipA. Furthermore, catalytic activity was considerably reduced or completely lost in the predicted enzymes. This is probably due to modifications of the electrostatic surface potential and active site topology of the enzymes that no longer facilitate correct substrate interactions, and these are proposed to be repercussions of the highly unique protein sequences. Preliminary characterisation by circular dichroism yielded empirical secondary structure compositions that differed from the expected values. This added to the lack of agreement between the computational and experimental results. Importantly, this study provides the first experimental implementation of SeqPredNN and emphasises critical considerations for difficult protein design targets. However, the ultimate utility of an inverse folding tool lies in its ability to predict protein sequences that will form structures capable of performing specific functions. Accordingly, while these tools offer an excellent starting point for rational protein modification, a more holistic approach with continued experimental validation may be required for greater success.
Description
Thesis (MSc)--Stellenbosch University, 2024.