Applications of natural language processing for low-resource languages in the healthcare domain

Daniel, Jeanne Elizabeth (2020-03)

Thesis (MSc)--Stellenbosch University, 2020.


ENGLISH ABSTRACT: Since 2014 MomConnect has provided healthcare information and emotional support in all 11 official languages of South Africa to over 2.6 million pregnant and breastfeeding women, via SMS and WhatsApp. However, the service has struggled to scale efficiently with the growing user base and increase in incoming questions, resulting in a current median response time of 20 hours. The aim of our study is to investigate the feasibility of automating the manual answering process. This study consists of two parts: i) answer selection, a form of information retrieval, and ii) natural language processing (NLP), where computers are taught to interpret human language. Our problem is unique in the NLP space, as we work with a closed-domain question-answering dataset, with questions in 11 languages, many of which are low-resource, with English template answers, unreliable language labels, code-mixing, shorthand, typos, spelling errors and inconsistencies in the answering process. The shared English template answers and code-mixing in the questions can be used as cross-lingual signals to learn cross-lingual embedding spaces. We combine these embeddings with various machine learning models to perform answer selection, and find that the Transformer architecture performs best, achieving a top-1 test accuracy of 61.75% and a top-5 test accuracy of 91.16%. It also exhibits improved performance on low-resource languages when compared to the long short-term memory (LSTM) networks investigated. Additionally, we evaluate the quality of the cross-lingual embeddings using parallel English-Zulu question pairs, obtained using Google Translate. Here we show that the Transformer model produces embeddings of parallel questions that are very close to one another, as measured using cosine distance.
This indicates that the shared template answer serves as an effective cross-lingual signal, and demonstrates that our method is capable of producing high-quality cross-lingual embeddings for low-resource languages like Zulu. Further, the experimental results demonstrate that automation using a top-5 recommendation system is feasible.
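The two evaluation measures mentioned in the abstract can be made concrete with a minimal sketch. The code below is illustrative only (it is not the thesis implementation, and all names and the toy data are hypothetical): it shows how top-k answer-selection accuracy and the cosine distance between two embeddings, such as an English question and its Zulu translation, would typically be computed.

```python
# Illustrative sketch, not the thesis code. All names and data are hypothetical.
import math

def top_k_accuracy(scores, true_idx, k):
    """Fraction of questions whose correct template answer appears
    among the k highest-scoring candidates proposed by the model."""
    hits = 0
    for row, truth in zip(scores, true_idx):
        # Rank candidate answers from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(scores)

def cosine_distance(u, v):
    """1 - cosine similarity; values near 0 mean the two embeddings
    (e.g. a parallel English-Zulu question pair) are close."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Toy example: 3 questions scored against 4 template answers.
scores = [[0.1, 0.7, 0.1, 0.1],
          [0.6, 0.2, 0.1, 0.1],
          [0.2, 0.2, 0.5, 0.1]]
true_idx = [1, 2, 2]
print(top_k_accuracy(scores, true_idx, 1))          # prints 0.666... (2 of 3 correct)
print(cosine_distance([1.0, 0.0], [0.9, 0.1]))      # small value: near-parallel embeddings
```

Under this framing, the thesis's top-5 recommendation system corresponds to `k=5`: automation is feasible when the correct template answer is almost always among the five answers surfaced to a human responder.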

AFRIKAANSE OPSOMMING: Sedert 2014 bied MomConnect vir meer as 2.6 miljoen swanger vrouens en jong moeders gesondheidsinligting en emosionele ondersteuning. Die platform maak gebruik van selfoondienste soos SMS en WhatsApp, en is beskikbaar in die 11 amptelike tale van Suid-Afrika, maar sukkel om doeltreffend by te hou met die groeiende gebruikersbasis en aantal inkomende vrae. Weens die volumes is die mediaan reaksietyd van die platform tans 20 ure. Die doel van hierdie studie is om die vatbaarheid van ’n outomatiese antwoordstelsel te ondersoek. Die studie is tweedelig: i) vir ’n gegewe vraag, kies die mees toepaslike antwoord, en ii) natuurlike taalverwerking van die inkomende vrae. Hierdie probleem is uniek in die veld van natuurlike taalverwerking, omdat ons werk met ’n vraag-en-antwoord datastel waar die vrae beperk is tot die gebied van swangerskap en borsvoeding. Verder is die antwoorde gestandaardiseerd en in Engels, terwyl die vrae in al 11 tale kan wees en die meeste van die tale as lae-hulpbron tale geklassifiseer kan word. Boonop is inligting oor die taal van die vrae onbetroubaar, tale word gemeng, daar is spelfoute, tikfoute, korthand (SMS-taal), en die beantwoording van die vrae is nie altyd konsekwent nie. Die gestandaardiseerde Engelse antwoorde, wat gedeel word deur meertalige vrae, asook die gemengde taal in die vrae, kan gebruik word om kruistalige vektorruimtes aan te leer. ’n Kombinasie van kruistalige vektorruimtes en masjienleer-modelle word afgerig om nuwe vrae te beantwoord. Resultate toon dat die beste masjienleer-model die Transformator-model is, met ’n top-1 akkuraatheid van 61.75% en ’n top-5 akkuraatheid van 91.16%. Die Transformator toon ook ’n verbeterde prestasie op die lae-hulpbron tale, in vergelyking met die lang-korttermyn-geheue (LSTM) netwerke wat ook ondersoek is. Die kwaliteit van die kruistalige vektorruimtes word met parallelle Engels-Zulu vertalings geëvalueer, met die hulp van Google Translate.
So wys ons dat die Transformator vektore vir die parallelle vertalings produseer wat baie na aan mekaar in die kruistalige vektorruimte lê, volgens die kosinusafstand. Hierdie resultate demonstreer dat ons metode die vermoë besit om hoë-kwaliteit kruistalige vektorruimtes vir lae-hulpbron tale soos Zulu te leer. Verder demonstreer die resultate van die eksperimente dat ’n top-5 aanbevelingstelsel vir outomatiese beantwoording haalbaar is.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/107969