Language modelling for code-switched automatic speech recognition in five South African languages

Van der Westhuizen, Ewald (2018-12)

Thesis (PhD)--Stellenbosch University, 2018.

Thesis

ENGLISH ABSTRACT: Code-switching refers to natural, spontaneous language alternation by multilingual speakers during a conversation or utterance, and is prevalent in everyday conversations by multilingual South Africans. Automatic speech recognition systems are generally highly optimised for monolingual input and performance deteriorates when presented with mixed-language speech. This thesis addresses the automatic recognition of speech containing code-switching between English and four South African Bantu languages, focussing specifically on the language modelling of English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. Due to the severe scarcity of code-switched speech data in South African languages, it was necessary to first develop a representative corpus. This new and unique 35-hour corpus contains segmented and transcribed code-switched speech from conversations in South African soap operas, which exhibit spontaneous utterances with regular code-switching in the target languages. Insertional, alternational, and intraword intrasentential code-switching are all represented in the data, as are some other special characteristics of fast, spontaneous Bantu speech such as postlexical deletion. The distribution of language switches is extremely sparse, however. In this thesis, a number of data-driven modelling approaches were investigated and applied to address the sparsity by augmenting the training data with synthetically generated data. Postlexical deletion was successfully modelled statistically with joint-sequence models, and these models were used to generate synthetic pronunciations which were demonstrated to lead to improved automatic speech recognition performance. Two new code-switched language modelling approaches were proposed to address data sparsity.
First, parallel language-dependent language modelling (PLDLM), which consists of two monolingual language models with explicit language transitions, was demonstrated to outperform a conventional language-independent language model in terms of recognition word error rate. Second, language models in which word embeddings were used to synthesise probable unseen code-switched bigrams were considered. It was possible to achieve a reduction of up to 31% in language model perplexity across a language switch boundary by including such synthesised code-switch bigrams. Although smaller, improvements in the recognition word error rate were also observed.

AFRIKAANS ABSTRACT: Code-switching is the natural, spontaneous switching between languages by multilingual speakers during a conversation or utterance, and occurs daily in the conversations of multilingual South Africans. Automatic speech recognition systems are generally optimised specifically for monolingual speech and perform poorly on mixed-language speech. This thesis addresses the automatic recognition of speech containing code-switching between English and four South African Bantu languages. It focuses specifically on the language modelling of English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho code-switched speech. Because code-switched speech data in South African languages is scarce, it was necessary to compile a representative speech corpus. This new and unique corpus consists of 35 hours of segmented and transcribed code-switched speech. The data was extracted from conversations in South African television soap operas, which exhibit spontaneous speech with regular code-switching in the aforementioned languages. Several forms of code-switching occur in the data, including intrasentential code-switching, which may occur as an insertion (insertional), as an alternating clause (alternational), or within a word (intraword). The distribution of code-switching examples in the data is, however, extremely sparse. A number of data-driven modelling techniques were investigated to supplement the sparse training data with synthetic data. Vowel deletion, a characteristic phenomenon of fast spontaneous speech, is also observed in the African languages. Vowel deletion was successfully modelled with joint-sequence models. These models were used to generate synthetic pronunciations, which led to an improved word error rate for the automatic speech recogniser. Two new approaches to the language modelling of code-switching were investigated.
The first is a parallel language-dependent language model, which connects two monolingual language models by means of explicit language transitions. This approach was shown to deliver a better word error rate than the conventional language-independent language model. The second approach investigated language models in which word embeddings were applied to synthesise probable code-switched bigrams. Including the synthetic code-switched bigrams in the language models made it possible to reduce the perplexity at a language switch point by up to 31%. An improvement in word error rate was also observed, although it was smaller.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/104997