The automatic and unconstrained segmentation of speech into subword units

Van Vuuren, Van Zyl
Stellenbosch : Stellenbosch University
ENGLISH ABSTRACT: We develop and evaluate several algorithms that segment a speech signal into subword units without using phone or orthographic transcripts. These segmentation algorithms rely on a scoring function, termed the local score, that is applied at the feature level and indicates where the characteristics of the audio signal change. The predominant approach to segmentation in the literature is to apply a threshold to the local score; local maxima (peaks) that lie above the threshold result in the hypothesis of a segment boundary. Scoring mechanisms of a select number of such algorithms are investigated, and it is found that these local scores frequently exhibit clusters of peaks near phoneme transitions that cause spurious segment boundaries. As a consequence, very short segments are sometimes postulated by the algorithms. To counteract this, ad-hoc remedies are proposed in the literature. We propose a dynamic programming (DP) framework for speech segmentation that employs a probabilistic segment length model in conjunction with the local scores. DP offers an elegant way to deal with peak clusters by choosing only the most probable segment length and local score combinations as boundary positions. It is shown to offer a clear performance improvement over selected methods from the literature serving as benchmarks. Multilayer perceptrons (MLPs) can be trained to generate local scores by using groups of feature vectors centred around phoneme boundaries and midway between phoneme boundaries in suitable training data. The MLPs are trained to produce a high output value at a boundary, and a low value at continuity. It was found that the more accurate local scores generated by the MLP, which rarely exhibit clusters of peaks, made the additional application of DP less effective than before.
However, a hybrid approach in which DP is used only to resolve smaller, more ambiguous peaks in the local score was found to offer a substantial improvement on all prior methods. Finally, restricted Boltzmann machines (RBMs) were applied as feature detectors. This provided a means of building multi-layer networks that are capable of detecting highly abstract features. It is found that when local scores are estimated by such deep networks, additional performance gains are achieved.
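The contrast between thresholded peak picking and DP with a segment length model can be illustrated with a minimal sketch. It is not the formulation used in the thesis: the Gaussian length prior, its mean and standard deviation, the use of the log local score as a boundary reward, the `max_len` limit, and all function names are assumptions chosen for illustration only.

```python
import math


def peak_pick(score, threshold):
    """Baseline: every local maximum above the threshold becomes a boundary."""
    return [
        t for t in range(1, len(score) - 1)
        if score[t] > threshold and score[t - 1] < score[t] >= score[t + 1]
    ]


def log_length_prior(length, mean=8.0, std=3.0):
    """Assumed (unnormalised) Gaussian log-prior over segment length in frames."""
    return -0.5 * ((length - mean) / std) ** 2


def dp_segment(score, max_len=20, weight=1.0):
    """Pick boundaries that maximise log local score plus length log-prior."""
    n = len(score)
    eps = 1e-6  # guards against log(0)
    best = [-math.inf] * (n + 1)  # best[t]: best total with a boundary at frame t
    back = [0] * (n + 1)          # back[t]: previous boundary on the best path
    best[0] = 0.0
    for t in range(1, n + 1):
        # The end of the signal is a forced boundary and earns no reward.
        reward = weight * math.log(score[t] + eps) if t < n else 0.0
        for length in range(1, min(max_len, t) + 1):
            cand = best[t - length] + log_length_prior(length) + reward
            if cand > best[t]:
                best[t] = cand
                back[t] = t - length
    # Trace back from the end of the signal to recover interior boundaries.
    bounds = []
    t = n
    while t > 0:
        t = back[t]
        if t > 0:
            bounds.append(t)
    return bounds[::-1]


# Toy local score: a cluster of peaks around frames 7 to 9 and one peak at 16.
toy_score = [0.05] * 24
toy_score[7], toy_score[8], toy_score[9] = 0.8, 0.6, 0.9
toy_score[16] = 0.9

print(peak_pick(toy_score, threshold=0.5))
print(dp_segment(toy_score))
```

On this toy input the threshold baseline marks frames 7, 9 and 16, so the peak cluster yields a spurious two-frame segment, whereas the DP sketch keeps only frames 9 and 16 because the length prior makes the very short segment improbable.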
AFRIKAANSE OPSOMMING: Ons ontwikkel en evalueer verskeie algoritmes wat 'n spraaksein in sub-woord eenhede segmenteer sonder om gebruik te maak van ortografiese of fonetiese transkripsies. Dié algoritmes maak gebruik van 'n funksie, genaamd die lokale tellingsfunksie, wat 'n waarde produseer omtrent die lokale verandering in 'n spraaksein. In die literatuur is daar gevind dat die hoofbenadering tot segmentasie gebaseer is op 'n grenswaarde, waarbo alle lokale maksima (pieke) in die lokale telling lei tot 'n skeiding tussen segmente. 'n Selektiewe groep segmentasie algoritmes is ondersoek en dit is gevind dat lokale tellings geneig is om groeperings van pieke te hê naby aan die skeidings tussen foneme. As gevolg hiervan, word baie kort segmente geselekteer deur die algoritmes. Om dit teen te werk, word ad-hoc metodes voorgestel in die literatuur. Ons stel 'n alternatiewe metode voor wat gebaseer is op dinamiese programmering (DP), wat 'n statistiese verspreiding van lengtes van segmente inkorporeer by segmentasie. DP bied 'n elegante manier om groeperings van pieke te hanteer, deurdat net kombinasies van hoë lokale tellings en segmentwaarskynlikheid, met betrekking tot die lengte van die segment, tot 'n skeiding lei. Daar word gewys dat DP 'n duidelike verbetering in segmentasie akkuraatheid toon bo 'n paar gekose algoritmes uit die literatuur. Meervoudige lae perseptrone (MLPe) kan opgelei word om 'n lokale telling te genereer deur gebruik te maak van groepe eienskapsvektore gesentreerd rondom en tussen foneem skeidings in geskikte opleidingsdata. Die MLPe word opgelei om 'n groot waarde te genereer as 'n foneem skeiding voorkom en 'n klein waarde andersins. Dit is gevind dat die meer akkurate lokale tellings wat deur die MLPe gegenereer word minder groeperings van pieke het, wat dan die addisionele toepassing van die DP minder effektief maak.
'n Hibriede toepassing, waar DP net tussen kleiner en minder duidelike pieke in die lokale telling kies, lei egter tot 'n groot verbetering bo-op alle vorige metodes. As 'n finale stap het ons beperkte Boltzmann masjiene (BBMe) gebruik om patrone in data te identifiseer. Sodoende, verskaf BBMe 'n manier om meervoudige lae netwerke op te bou waar die boonste lae baie komplekse patrone in die data identifiseer. Die toepassing van dié dieper netwerke tot die generasie van 'n lokale telling het tot verdere verbeteringe in segmentasie-akkuraatheid gelei.
Thesis (MEng)--Stellenbosch University, 2014.
Speech segmentation, Segmentation algorithms, UCTD