Improving visual speech synthesis using decision tree models

Date
2016-03
Publisher
Stellenbosch : Stellenbosch University
Abstract
ENGLISH ABSTRACT: Visual speech synthesis is essential for believable virtual character interaction. Traditionally, animation artists recreate by hand the oral motions expected from speech utterances. We present decision tree-based clustering techniques that automate visual speech animation, using a small dataset of phonetically annotated audiovisual speech. Our work focuses on extending existing tree-based clustering algorithms by improving the modelling of coarticulation effects. This is accomplished by capturing the motion of natural speech segments, referred to as dynamic visemes, and conserving their parameters during clustering and speech synthesis. Dynamic visemes are defined as the trajectories of oral features segmented by triphone boundaries. By applying simple search and concatenation criteria, our visual speech synthesis system uses decision trees to better predict which dynamic visemes to use. Experimentation guided all design decisions: it suggested which oral features were of greatest importance, identified an appropriate dynamic viseme length, and found an effective interpolation method for conserving coarticulation. We evaluate the performance of our visual speech synthesis models by computing squared error differences between synthesised and measured feature trajectories. In perceptual tests, participants also compared virtual characters animated by the outputs of each model. Both measured and perceptual tests show that our approaches lead to a clear improvement over a comparable baseline. Through our research, we aimed to make speech synthesis more accessible; the conversational agents are therefore based on the freely available MakeHuman and Blender software components. The customised oral feature motion capture system is also easily reproduced and requires only consumer-grade recording equipment.
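The objective evaluation above compares synthesised and measured oral feature trajectories by squared error. The following is a minimal sketch of that comparison only; the function name, the (frames, features) array layout, and the use of NumPy are illustrative assumptions, not details taken from the thesis.

import numpy as np

def trajectory_squared_error(synthesised: np.ndarray, measured: np.ndarray) -> float:
    # Sum of squared per-frame differences between two time-aligned
    # oral feature trajectories, each shaped (frames, features).
    # Layout and alignment are assumptions made for this illustration.
    if synthesised.shape != measured.shape:
        raise ValueError("trajectories must be aligned to the same length")
    return float(np.sum((synthesised - measured) ** 2))

# Illustrative usage: two 5-frame trajectories of 3 oral features each.
rng = np.random.default_rng(0)
measured = rng.standard_normal((5, 3))
synthesised = measured + 0.1  # a hypothetical model output near the measurement
print(trajectory_squared_error(synthesised, measured))

Comparing whole utterances this way presumes the two trajectories have first been brought to the same frame count, for example by resampling or alignment.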
AFRIKAANSE OPSOMMING: Visual speech synthesis is essential for enabling believable interaction with virtual characters. In the past, animation artists imitated mouth movements from real speech. In this study we present techniques based on clustering by means of decision trees. These techniques are used to automate the animation of visual speech, and make use of a small dataset of phonetically annotated audiovisual speech. Our work focuses on extending existing decision-tree clustering algorithms by improving the modelling of coarticulation effects. This is made possible by first capturing the motion of natural speech segments (visemes), and then conserving their parameters during clustering and speech synthesis. Dynamic visemes are defined as the trajectories of oral features, segmented by triphone boundaries. By applying simple search and concatenation criteria, our visual speech synthesis system can use decision trees to better predict which visemes should be employed. All design decisions were guided by experimentation, e.g. to identify the oral features of greatest importance, to determine an appropriate viseme length, and to find an effective interpolation method that conserves coarticulation. We evaluate the performance of our visual speech synthesis model by computing the squared error between the synthesised and measured feature trajectories. During perceptual tests, participants were asked to judge the believability of virtual characters driven by the model outputs. Both measured and perceptual tests indicated that the proposed techniques show a clear improvement over a suitable baseline. The goal of this research is to make speech synthesis more accessible. For this reason, the conversational agents are built on the freely available MakeHuman and Blender software components. The customised oral-feature motion capture system is also simple to reproduce, and requires only consumer-grade recording equipment.
Description
Thesis (MEng)--Stellenbosch University, 2016.
Keywords
UCTD, Speech synthesis, Speech processing systems, Computer animation, Speech processing systems -- Digital techniques, Decision trees