Automatic phoneme recognition of South African English

Engelbrecht, Herman Arnold (2004-03)

Thesis (MEng)--University of Stellenbosch, 2004.

Thesis

ENGLISH ABSTRACT: Automatic speech recognition applications have been developed for many languages in other countries but not much research has been conducted on developing Human Language Technology (HLT) for S.A. languages. Research has been performed on informally gathered speech data but until now a speech corpus that could be used to develop HLT for S.A. languages did not exist. With the development of the African Speech Technology Speech Corpora, it has now become possible to develop commercial applications of HLT. The two main objectives of this work are the accurate modelling of phonemes, suitable for the purposes of LVCSR, and the evaluation of the untried S.A. English speech corpus. Three different aspects of phoneme modelling was investigated by performing isolated phoneme recognition on the NTIMIT speech corpus. The three aspects were signal processing, statistical modelling of HMM state distributions and context-dependent phoneme modelling. Research has shown that the use of phonetic context when modelling phonemes forms an integral part of most modern LVCSR systems. To facilitate the context-dependent phoneme modelling, a method of constructing robust and accurate models using decision tree-based state clustering techniques is described. The strength of this method is the ability to construct accurate models of contexts that did not occur in the training data. The method incorporates linguistic knowledge about the phonetic context, in conjunction with the training data, to decide which phoneme contexts are similar and should share model parameters. As LVCSR typically consists of continuous recognition of spoken words, the contextdependent and context-independent phoneme models that were created for the isolated recognition experiments are evaluated by performing continuous phoneme recognition. The phoneme recognition experiments are performed, without the aid of a grammar or language model, on the S.A. English corpus. As the S.A. English corpus is newly created, no previous research exist to which the continuous recognition results can be compared to. Therefore, it was necessary to create comparable baseline results, by performing continuous phoneme recognition on the NTIMIT corpus. It was found that acceptable recognition accuracy was obtained on both the NTIMIT and S.A. English corpora. Furthermore, the results on S.A. English was 2 - 6% better than the results on NTIMIT, indicating that the S.A. English corpus is of a high enough quality that it can be used for the development of HLT.

AFRIKAANSE OPSOMMING: Automatiese spraak-herkenning is al ontwikkel vir ander tale in ander lande maar, daar nog nie baie navorsing gedoen om menslike taal-tegnologie (HLT) te ontwikkel vir Suid- Afrikaanse tale. Daar is al navorsing gedoen op spraak wat informeel versamel is, maar tot nou toe was daar nie 'n spraak databasis wat vir die ontwikkeling van HLT vir S.A. tale. Met die ontwikkeling van die African Speech Technology Speech Corpora, het dit moontlik geword om HLT te ontwikkel vir wat geskik is vir kornmersiele doeleindes. Die twee hoofdoele van hierdie tesis is die akkurate modellering van foneme, geskik vir groot-woordeskat kontinue spraak-herkenning (LVCSR), asook die evaluasie van die S.A. Engels spraak-databasis. Drie aspekte van foneem-modellering word ondersoek deur isoleerde foneem-herkenning te doen op die NTIMIT spraak-databasis. Die drie aspekte wat ondersoek word is sein prosessering, statistiese modellering van die HMM toestands distribusies, en konteksafhanklike foneem-modellering. Navorsing het getoon dat die gebruik van fonetiese konteks 'n integrale deel vorm van meeste moderne LVCSR stelsels. Dit is dus nodig om robuuste en akkurate konteks-afhanklike modelle te kan bou. Hiervoor word 'n besluitnemingsboom- gebaseerde trosvormings tegniek beskryf. Die tegniek is ook in staat is om akkurate modelle te bou van kontekste van nie voorgekom het in die afrigdata nie. Om te besluit watter fonetiese kontekste is soortgelyk en dus model parameters moet deel, maak die tegniek gebruik van die afrigdata en inkorporeer taalkundige kennis oor die fonetiese kontekste. Omdat LVCSR tipies is oor die kontinue herkenning van woorde, word die konteksafhanklike en konteks-onafhanklike modelle, wat gebou is vir die isoleerde foneem-herkenningseksperimente, evalueer d.m.v. kontinue foneem-herkening. Die kontinue foneemherkenningseksperimente word gedoen op die S.A. Engels databasis, sonder die hulp van 'n taalmodel of grammatika. Omdat die S.A. Engels databasis nuut is, is daar nog geen ander navorsing waarteen die result ate vergelyk kan word nie. Dit is dus nodig om kontinue foneem-herkennings result ate op die NTIMIT databasis te genereer, waarteen die S.A. Engels resulte vergelyk kan word. Die resulate dui op aanvaarbare foneem her kenning op beide die NTIMIT en S.A. Engels databassise. Die resultate op S.A. Engels is selfs 2 - 6% beter as die resultate op NTIMIT, wat daarop dui dat die S.A. Engels spraak-databasis geskik is vir die ontwikkeling van HLT.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/49867
This item appears in the following collections: