SUN ETD - Theses and Dissertations
This community is a clearing house for masters and doctorates submitted via Thesis Management
Browsing SUN ETD - Theses and Dissertations by Author "Barends, Umr"
- Automatic orthography standardisation for under-resourced languages (Stellenbosch University, 2024-12) Barends, Umr; Niesler, Thomas; Stellenbosch University. Faculty of Engineering. Dept. of Electrical & Electronic Engineering.
This work addresses the normalization of the orthography of a severely under-resourced language, taking as a specific example the West African language known as Bambara. One aspect of the lack of resources for such languages is that spelling and orthographic conventions are not firmly established. This leads, for example, to variations in how speech is transcribed by mother-tongue speakers, which in turn leads to inconsistencies in the annotations found in a speech corpus. According to our investigation, there is no data available for the normalization of the Bambara language other than the very small corpus used in this work. To our knowledge, this is also the only corpus of transcribed Bambara speech. Normalizing Bambara spelling is important for systems such as automatic speech recognition (ASR) or text-to-speech, where more consistent spelling leads to better performance of such language-model-based systems. The baseline method, known as anagram hashing, uses word anagrams and word n-grams to perform the normalization. This method has been used by other researchers to normalize historical text to modern spellings. In addition, we determine the performance that can be achieved by applying the machine learning methods softmax regression, LSTM and bi-LSTM. Our experiments indicate that the neural network models outperformed the anagram hashing algorithm on the task of normalizing the Bambara orthography. We also found that word-level models performed better than character-level models. Among the machine learning models, the softmax regression model performed best at normalizing the Bambara orthography.
We conclude that machine learning models can normalize orthography automatically with performance superior to the current state of the art, but that the small size of the training set does not allow the recurrent architectures to surpass the performance of softmax regression.
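The abstract names anagram hashing as the baseline without describing it. The following is a minimal sketch of the general idea as it appears in the spelling-correction literature, not the thesis implementation: each word receives an order-independent numeric key, so that spelling variants can be found by arithmetic on keys rather than by comparing strings. The exponent of 5, the function names, and the toy English vocabulary are all assumptions made for illustration. The word n-gram step mentioned in the abstract is not shown.

```python
from collections import defaultdict

POWER = 5  # assumed exponent; a value often quoted for anagram hashing


def anagram_key(word, power=POWER):
    """Order-independent key: sum of each character's code point raised to `power`."""
    return sum(ord(ch) ** power for ch in word.lower())


def build_index(vocabulary):
    """Map each anagram key to the set of vocabulary words that share it."""
    index = defaultdict(set)
    for word in vocabulary:
        index[anagram_key(word)].add(word)
    return index


def candidates(word, index, alphabet, power=POWER):
    """Vocabulary words whose key is reachable from `word` by at most one
    character deletion, insertion, or substitution, found by key arithmetic."""
    base = anagram_key(word, power)
    removed = {ord(ch) ** power for ch in word.lower()}
    added = {ord(ch) ** power for ch in alphabet}
    keys = {base}
    keys.update(base - r for r in removed)                       # deletions
    keys.update(base + a for a in added)                         # insertions
    keys.update(base - r + a for r in removed for a in added)    # substitutions
    found = set()
    for key in keys:
        found |= index.get(key, set())
    return found


# Toy vocabulary (hypothetical, English rather than Bambara).
vocabulary = ["listen", "silent", "stone", "notes"]
index = build_index(vocabulary)
alphabet = set("".join(vocabulary))

print(anagram_key("listen") == anagram_key("silent"))
print(sorted(candidates("stoe", index, alphabet)))
```

Because the key ignores character order, a lookup returns every anagram of a matching variant, so a practical system would still need a ranking step (for instance the n-gram context the abstract mentions) to choose among the candidates.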