Browsing by Author "Jacobs, Christiaan"
- Item: Multilingual acoustic word embeddings for zero-resource languages (Stellenbosch : Stellenbosch University, 2023-12) Jacobs, Christiaan; Kamper, Herman; Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering.

ENGLISH ABSTRACT: Developing speech applications with neural networks requires large amounts of transcribed speech data. The scarcity of labelled speech data therefore restricts the development of speech applications to only a few well-resourced languages. To address this problem, researchers are taking steps towards developing speech models for languages where no labelled data is available. In this zero-resource setting, researchers are developing methods that aim to learn meaningful linguistic structures from unlabelled speech alone. Many zero-resource speech applications require speech segments of different durations to be compared. Acoustic word embeddings (AWEs) are fixed-dimensional representations of variable-duration speech segments. Proximity in vector space should indicate similarity between the original acoustic segments, allowing fast and easy comparison between spoken words. To produce AWEs for a zero-resource language, one approach is to use unlabelled data from the target language. Another approach is to exploit the benefits of supervised learning by training a single multilingual AWE model on data from multiple well-resourced languages, and then applying the resulting model to an unseen target language. Previous studies have shown that the supervised multilingual transfer approach outperforms the unsupervised monolingual approach. However, the multilingual approach is still far from reaching the performance of supervised AWE approaches trained on the target language itself.

In this thesis, we make five specific contributions to the development of AWE models and their downstream application. First, we introduce a novel AWE model called the Contrastive RNN and compare it against state-of-the-art AWE models. On a word discrimination task, we show that the Contrastive RNN outperforms all existing models in the unsupervised monolingual setting, with an absolute improvement in average precision ranging from 3.3% to 17.8% across six evaluation languages. In the multilingual transfer setting, the Contrastive RNN performs on par with existing models. As our second contribution, we propose a new adaptation strategy: after a multilingual model is trained, instead of directly applying it to a target language, we first fine-tune the model using unlabelled data from the target language. The Contrastive RNN, although performing on par with the other multilingual variants, showed the highest increase after adaptation, giving an improvement of roughly 5% in average precision on five of the six evaluation languages.

As our third contribution, we take a step back and question the effect a particular set of training languages has on a target language. We specifically investigate the impact of training a multilingual model on languages that belong to the same language family as the target language. We perform multiple experiments on African languages which show the benefit of using related languages over unrelated languages. For example, a multilingual model trained on one-tenth of the data from a related language outperforms a model trained on all the available training data from unrelated languages.
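To make the comparison step described in the abstract concrete, the sketch below shows how two variable-duration speech segments reduce to fixed-dimensional vectors that can be compared with a single distance computation. The embedding function here is a hypothetical placeholder (mean-pooling plus a random projection), not the thesis's Contrastive RNN or any trained model; segment sizes and dimensions are illustrative assumptions.

```python
import numpy as np

# Hypothetical stand-in for a trained AWE model: maps a variable-length
# sequence of acoustic frames (T x D) to one fixed-dimensional vector.
# A real AWE model would be a trained encoder, e.g. a recurrent network.
def embed_segment(frames: np.ndarray, dim: int = 128) -> np.ndarray:
    rng = np.random.default_rng(0)                       # fixed seed: same projection for every segment
    projection = rng.standard_normal((frames.shape[1], dim))
    return frames.mean(axis=0) @ projection              # placeholder: mean-pool, then project

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Distance between two fixed-dimensional embeddings (0 means identical direction)."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two spoken-word segments of different durations (rows are MFCC-like feature frames).
segment_a = np.random.randn(62, 13)   # e.g. a 620 ms word at a 10 ms frame shift
segment_b = np.random.randn(48, 13)   # e.g. a 480 ms word

# Once embedded, comparing the words is a single vector operation rather than
# an alignment (such as dynamic time warping) over the full frame sequences.
d = cosine_distance(embed_segment(segment_a), embed_segment(segment_b))
print(f"embedding distance: {d:.3f}")
```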
As our fourth contribution, we showcase the applicability of AWEs by applying them to a real downstream task: we develop an AWE-based keyword spotting (KWS) system for hate speech detection in radio broadcasts. We validate performance using actual Swahili radio audio extracted from radio stations in Kenya, a country in Sub-Saharan Africa. In developmental experiments, our system falls short of a speech-recognition-based KWS system using five minutes of annotated target data. However, when applying the system to real in-the-wild radio broadcasts, our AWE-based system (requiring less than a minute of template audio) proves to be more robust, nearly matching the performance of a 30-hour speech recognition model.

In the fifth and final contribution, we introduce three novel semantic AWE models. The goal here is that the resulting embeddings should not only be similar for words of the same type but also for words sharing contextual meaning, similar to how textual word embeddings are grouped together based on semantic relatedness. For instance, spoken instances of "football" and "soccer", although acoustically different, should have similar acoustic embeddings. We specifically propose leveraging a pre-trained multilingual AWE model to assist semantic modelling. Our best approach involves clustering word segments using a multilingual AWE model, deriving soft pseudo-word labels from the cluster centroids, and then training a classifier model on the soft vectors. In an intrinsic word similarity task measuring semantics, this multilingual transfer approach outperforms all previous semantic AWE methods. We also show for the first time that AWEs can be used for downstream semantic query-by-example search.
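As a rough illustration of the clustering and soft-labelling steps described for the fifth contribution, here is a minimal sketch using scikit-learn k-means on hypothetical embedding vectors. The softmax-over-negative-distances rule, the cluster count, and all other values are illustrative assumptions rather than the thesis's exact recipe, and the final classifier training on the soft vectors is only indicated in a comment.

```python
import numpy as np
from sklearn.cluster import KMeans

def soft_pseudo_labels(embeddings: np.ndarray, centroids: np.ndarray,
                       temperature: float = 1.0) -> np.ndarray:
    """Soft pseudo-word labels: a softmax over negative distances to the cluster
    centroids (an illustrative choice, not necessarily the thesis's construction)."""
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    logits = -dists / temperature
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Hypothetical AWEs for unlabelled word segments from the target language,
# produced by an already-trained multilingual AWE model.
awe_vectors = np.random.randn(1000, 128)

# Step 1: cluster the embeddings; each cluster stands in for a pseudo-word type.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(awe_vectors)

# Step 2: derive soft pseudo-word labels from distances to the cluster centroids.
targets = soft_pseudo_labels(awe_vectors, kmeans.cluster_centers_)

# Step 3 (not shown): train a classifier-style semantic AWE model on these soft
# target vectors, so that contextually related words receive overlapping labels.
print(targets.shape)  # (1000, 50)
```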