Speech generation in a spoken dialogue system

Visagie, Albertus Sybrand (2004-12)

Thesis (MScIng)--University of Stellenbosch, 2004.

Thesis

ENGLISH ABSTRACT: Spoken dialogue systems accessed over the telephone network are rapidly becoming more popular as a means to reduce call-centre costs and improve customer experience. It is now technologically feasible to delegate the repetitive and relatively simple tasks conducted in most telephone calls to automatic systems. Such a system uses speech recognition to take input from users. This work focuses on the speech generation component that a specific prototype system uses to convey audible speech output back to the user. Many commercial systems contain general text-to-speech synthesisers. Text-to-speech synthesis is a very active branch of speech processing. It aims to build machines that read text aloud. In some languages this has been a reality for almost two decades. While these synthesisers are often highly intelligible, they almost never sound natural. The output quality of synthetic speech is considered to be a very important factor in the user’s perception of the quality and usability of spoken dialogue systems. The static nature of the spoken dialogue system is exploited to produce a custom speech synthesis component that provides very high quality output speech for the particular application. To this end the current state of the art in speech synthesis is surveyed and summarised. A unit-selection synthesiser is produced that functions in Afrikaans, English and Xhosa. The unit-selection synthesiser selects short waveforms from a recorded speech corpus and concatenates them to produce the required utterances. Techniques are developed for designing a compact corpus and processing it to produce a unit-selection database. Speech modification methods are researched to build a framework for natural-sounding speech concatenation. This framework also provides pitch and duration modification capabilities that will enable research in languages such as Afrikaans and Xhosa, where text-to-speech capabilities are relatively immature.

AFRIKAANS SUMMARY: Telephone-based spoken dialogue systems are becoming steadily more common and are an effective way to lower call-centre costs. It is currently technologically possible to handle a large number of simple transactions with automatic systems. Such systems use speech recognition to receive input from the user. This work focuses on the speech generation component that a specific prototype system uses to play output back to the user. Many commercial systems use generic text-to-speech synthesisers. Such text-to-speech synthesisers remain a very active field of speech research. In general, the research aims to read text and convert it into intelligible speech. Such systems have now existed for at least two decades. Although entirely intelligible, these systems sound unnatural. In telephone-based spoken dialogue systems the quality of the synthetic speech is important for the user’s perception of the system’s quality and usability. The dialogue is mostly static in nature, and this property is exploited to synthesise high-quality speech for a particular application. To achieve this, the current state of affairs in this field was studied and summarised. A cut-and-paste synthesiser was built that works in Afrikaans, English and Xhosa. The synthesiser selects short pieces of speech waveform from a speech corpus and joins them together to produce the required speech. Automatic techniques were developed to design a compact corpus that still contains everything the synthesiser will need to perform its task. Further techniques process the corpus into a form usable for synthesis. Methods of speech modification were investigated in order to make the joined pieces of speech sound more natural and to correct their intonation and tempo. This provides infrastructure for research in languages such as Afrikaans and Xhosa, where text-to-speech capabilities are still immature.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/16460