SUN ETD - Theses and Dissertations
This community is a clearing house for master's theses and doctoral dissertations submitted via Thesis Management.
Browsing SUN ETD - Theses and Dissertations by Author "Baas, Matthew"
- Item: Disentangled Representations in Speech Processing Applications (Stellenbosch University, 2024-12)
Baas, Matthew; Kamper, Herman
Stellenbosch University. Faculty of Engineering. Dept. of Electrical & Electronic Engineering.

A central goal in systems that produce speech is to easily control high-level characteristics of the speech while retaining naturalness. Such a system would enable a range of speech processing applications, from more realistic speech assistants to assistive applications for those with speech disfluencies.

There is, however, a tension: most of the best existing speech processing methods rely on the explicit disentanglement of speech, yet humans process speech as a purely continuous signal without explicit factorizations. The explicit nature of these methods means that they typically identify a handful of meaningful characteristics of speech beforehand and then carefully design systems to measure, disentangle, and modify these characteristics, recombining them to produce output speech. For example, speech synthesis systems might design modules to model speaker identity, prosody, and emotion separately from one another. This has a key limitation: explicitly identifying the discrete set of aspects that comprise speech is a contested and open-ended task, with some aspects intrinsically tied to one another, prohibiting explicit demarcation (e.g. phonemic identity is tied to timing in certain languages).

We observe recent progress in other domains where injecting knowledge from domain experts into a model's design proves inferior to more general methods that make fewer assumptions about the data and task: better performance is obtained when the disentanglement is learnt from the data rather than externally imposed. This leads to our primary aim: to investigate how we might bridge this tension between explicit and implicit disentanglement in speech processing. Our main claim is that continuous methods that implicitly learn the various aspects of speech can yield improved generalization compared to discrete methods with explicit demarcations of speech characteristics. However, we also observe that discrete methods offer easier training and scaling than purely continuous ones.

This thesis is divided into four parts. The first part investigates disentanglement in unconditional speech synthesis. We propose a new generative adversarial network (GAN) based approach for unconditional speech synthesis that generates speech purely from a continuous latent space without explicit conditioning. We introduce new techniques to optimize our model for learning a disentangled latent space in which linear subspaces correspond to meaningful characteristics of speech. In experiments in a constrained setting of limited-vocabulary speech, we confirm that the learnt latent space is more disentangled than those of existing methods. And, critically, we demonstrate a key benefit of learning a disentangled, continuous latent space by showing that our GAN can perform several speech processing tasks unseen during training. Specifically, we show that, with simple linear operations in the latent space, we can perform voice conversion, speech enhancement, speaker verification, and keyword classification, in some cases at a level comparable to task-specific baselines. This investigation shows that learning good continuous latent spaces enables generalization to unseen tasks.
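To make the part-one idea concrete, here is a minimal sketch of a linear latent-space edit. Everything in it is illustrative: the generator G, the labelled latent codes, and the use of a logistic-regression boundary to find an attribute direction are assumptions for exposition, not the thesis's exact training procedure.

```python
# Minimal sketch: editing one attribute via a linear direction in a GAN's
# latent space. The labelled latents and the generator G are hypothetical
# stand-ins; the thesis learns the disentangled space during GAN training.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 128

# Stand-in: latent codes labelled with a binary attribute (e.g. speaker A/B).
z = rng.normal(size=(1000, dim))
labels = (z[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

# Fit a linear boundary; its normal vector is the attribute direction.
clf = LogisticRegression(max_iter=1000).fit(z, labels)
direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

def edit(z_code: np.ndarray, alpha: float) -> np.ndarray:
    """Move a latent code along the attribute direction by alpha."""
    return z_code + alpha * direction

z_src = rng.normal(size=dim)
z_converted = edit(z_src, alpha=3.0)  # e.g. push the code towards speaker B
# waveform = G(z_converted)           # synthesize with the trained generator
```

The point is that once the space is linearly disentangled, each downstream task reduces to simple linear algebra on latent codes.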
The second part uses the insights of the first investigation to develop a model for a hard, practical task: any-to-any voice conversion. The result is k-nearest neighbors voice conversion (kNN-VC), a method that uses the linearly disentangled nature of features produced by existing speech representation models to perform voice conversion. As the name suggests, provided speech is converted to sound like a desired target speaker by simply replacing each feature from the source with the mean of its k nearest neighbors from the target (a minimal sketch of this matching step appears after the abstract). Despite its simplicity, kNN-VC achieves a new state of the art for voice conversion compared to top-performing existing methods. Not using discrete speaker labels enables the model to interpolate between voices, perform inference on unseen languages, and even be adapted to sample new speakers from a text prompt.

The third part of this thesis tackles the tension from the other side: can we give discrete methods benefits similar to those of continuous methods? We investigate this by incorporating disentangled continuous features into a task that necessitates discrete outputs: speech recognition. Specifically, we propose the first discrete diffusion model for speech recognition. Using the disentangled features from the prior investigation as conditioning, we iteratively refine a multinomial distribution over characters until we arrive at a final coherent transcript (the denoising loop is also sketched after the abstract). We demonstrate performance comparable to existing state-of-the-art contrastive models on the LibriSpeech speech recognition benchmark. Compared to the dynamic programming algorithms necessary to decode from contrastive models, the output produced by our discrete diffusion model is readily interpretable, and it allows for the extensions afforded by various diffusion decoding techniques. This shows that adapting continuous-domain methods (denoising diffusion) and disentangled continuous features for discrete domains yields certain benefits.

In the fourth part, we demonstrate the practical usefulness of the first three parts by applying their lessons and methods to improve several existing speech processing tasks. Concretely, we demonstrate how voice conversion applied to unseen languages can be used to improve speech recognition in very low-resource settings (a data-augmentation sketch is given after the abstract). We also investigate how voice conversion can aid those with speech disfluencies by correcting stuttered speech, and we test its generalization limits by investigating human-to-instrument conversions.

In summary, this thesis shows that the learnt disentanglements provided by continuous speech processing models allow for simpler generalization and control, but that discrete/explicit disentanglement methods still retain benefits in ease of training and scalability. It remains an open question how best to combine the strengths of both approaches, or whether doing so is even truly possible.
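A minimal sketch of the kNN-VC matching step described in part two, assuming frame-level features from a pretrained speech representation model are already extracted. The published kNN-VC system matches WavLM features and vocodes the result back to audio; the shapes and k=4 below follow that setup, but the random inputs are placeholders.

```python
# Minimal sketch of the kNN matching step in kNN-VC: replace each source
# frame with the mean of its k nearest target-speaker frames.
import numpy as np

def knn_vc_match(src: np.ndarray, tgt: np.ndarray, k: int = 4) -> np.ndarray:
    """src: (T_src, D) source-utterance features.
    tgt: (T_tgt, D) target-speaker features (the matching set).
    Returns (T_src, D) converted features under cosine similarity."""
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src_n @ tgt_n.T                    # (T_src, T_tgt) similarities
    idx = np.argsort(-sims, axis=1)[:, :k]    # k most similar target frames
    return tgt[idx].mean(axis=1)              # average them per source frame

src_feats = np.random.randn(200, 1024)   # hypothetical source features
tgt_feats = np.random.randn(800, 1024)   # hypothetical target-speaker pool
converted = knn_vc_match(src_feats, tgt_feats, k=4)
# `converted` would then go to a vocoder trained to invert such features.
```

Because conversion is just nearest-neighbor lookup in a shared feature space, no speaker labels or per-speaker training are needed, which is what enables the any-to-any and cross-lingual behavior described above.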
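And a minimal sketch of the part-three idea: a reverse (denoising) loop that iteratively refines a multinomial distribution over characters, conditioned on speech features. The denoiser here is a random stand-in purely to show the control flow; the real model is a trained network, and the schedule and sampling details are simplified assumptions.

```python
# Minimal sketch of multinomial (discrete) diffusion decoding over characters.
import numpy as np

VOCAB = 30   # size of the character vocabulary
STEPS = 10   # number of refinement steps
rng = np.random.default_rng(0)

def denoiser(x_t: np.ndarray, cond: np.ndarray, t: int) -> np.ndarray:
    """Stand-in network returning per-position logits over characters.
    A real model conditions on (noisy transcript, speech features, step)."""
    return rng.normal(size=(x_t.shape[0], VOCAB))

def decode(cond: np.ndarray, t_out: int) -> np.ndarray:
    # Start from pure noise: every position is a random character.
    x_t = rng.integers(0, VOCAB, size=t_out)
    for t in reversed(range(STEPS)):
        logits = denoiser(x_t, cond, t)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        if t > 0:
            # Sample a partially denoised transcript; later steps refine it.
            x_t = np.array([rng.choice(VOCAB, p=p) for p in probs])
        else:
            x_t = probs.argmax(-1)  # final step: most likely characters
    return x_t

speech_feats = np.random.randn(500, 768)  # hypothetical conditioning features
transcript_ids = decode(speech_feats, t_out=120)
```

Each intermediate x_t is itself a readable (if noisy) transcript, which is the interpretability benefit the abstract contrasts with dynamic-programming decoding.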
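Finally, a sketch of the part-four augmentation idea: convert each low-resource utterance into several new voices while keeping its transcript unchanged, enlarging the ASR training set. It reuses knn_vc_match from the kNN-VC sketch above; all data, sizes, and names are hypothetical.

```python
# Minimal sketch of voice-conversion data augmentation for low-resource ASR.
# Assumes knn_vc_match from the kNN-VC sketch above is in scope.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical low-resource corpus: (features, transcript) pairs.
corpus = [(rng.normal(size=(150, 1024)), "halo wereld")]
# Hypothetical target-speaker feature pools (possibly from other languages).
target_pools = [rng.normal(size=(800, 1024)) for _ in range(3)]

augmented = list(corpus)
for feats, text in corpus:
    for pool in target_pools:
        # Same words in a new voice: the transcript carries over unchanged.
        augmented.append((knn_vc_match(feats, pool, k=4), text))
# `augmented` now holds several voice variants per utterance for ASR training.
```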