Browsing by Author "Van Staden, Lisa"
- Item: Improving unsupervised acoustic word embeddings using segment- and frame-level information (Stellenbosch : Stellenbosch University, 2021-12)
  Van Staden, Lisa; Kamper, Herman; Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering.

ENGLISH ABSTRACT: Many speech processing tasks involve measuring the acoustic similarity between speech segments. Conventionally, these comparisons are performed using dynamic time warping (DTW), a computationally expensive alignment-based approach. Recent research has shown that fixed-dimensional vectors, which represent speech segments of variable length, can be used in these tasks. These vectors, called acoustic word embeddings (AWEs), allow for efficient comparisons. A number of studies have shown that AWEs can be used in tasks such as unsupervised term discovery (UTD) and query-by-example search in a zero-resource setting, where transcriptions for speech are not available and full speech recognition is therefore not possible. This has motivated work on developing unsupervised AWE methods for this setting. However, the intrinsic quality of supervised AWEs is still vastly superior to that of unsupervised AWEs, which motivates investigating methods to improve the quality of unsupervised AWEs. This is also of interest to the language acquisition field, considering that infants do not require transcriptions to learn speech.

We focus on three different problem areas present in current AWEs. Firstly, we consider the nuisance factors in AWEs. The acoustic properties of different speakers and genders vary dramatically, and in an unsupervised setting these properties, which we call nuisance factors, can still be captured to a large extent.
This is addressed by applying speaker and gender conditioning and adversarial training to existing AWE models, the autoencoder recurrent neural network (AE-RNN) and the correspondence autoencoder recurrent neural network (CAE-RNN). We find that these methods remove some speaker and gender information and marginally improve the AWEs.

Secondly, we consider whether improvements at the frame level have a positive effect on the quality of the AWEs. Many AWE studies have focussed on the word level, but a few other zero-resource studies have instead focussed on developing short-time frame-level speech representations that capture meaningful contrasts such as phonemes. These contrasts occur at a shorter time scale than the whole-word discrimination targeted by most AWE approaches. Three existing representation types are considered: contrastive predictive coding (CPC), autoregressive predictive coding (APC) and the correspondence autoencoder (CAE). These are used as input features to the CAE-RNN and compared to conventional mel-frequency cepstral coefficients (MFCCs). Additionally, we introduce a fourth learned representation method, correspondence autoregressive predictive coding (CAPC), which combines the mechanisms of the frame-level CAE and APC models. We find that better input features have a significant impact on the quality of the AWEs, with the best results obtained using CPC features.

The last problem we consider is the training strategy used for AWE models. Motivated by the idea that human infants are first exposed to speech from only a small number of speakers, which then gradually increases, we apply a speaker-count-based curriculum learning strategy to the AE-RNN and CAE-RNN and compare it to training on multiple speakers from the start. We find that this training strategy makes no difference to the quality of the AWEs.

Taken together, in our experiments we find that the most impactful solution is to use learned frame-level representations as input.
Speaker and gender normalisation has a marginally positive effect on the quality of the AWEs, and the training strategy has no impact. Going forward, these improved AWEs can be used in downstream tasks. Although we only considered AWEs from the AE-RNN and CAE-RNN, the problems we focussed on are not necessarily model-specific, and our findings are relevant to other AWE modelling research.
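To illustrate the efficiency contrast the abstract opens with, here is a minimal sketch (not code from the thesis; the sequence lengths, feature dimension, and embedding dimension are illustrative assumptions) comparing alignment-based DTW, whose cost grows with the product of the two segment lengths, against a single fixed-dimensional vector comparison as used with acoustic word embeddings:

```python
import numpy as np

def dtw_cost(x, y):
    """Alignment-based comparison of two variable-length feature
    sequences (frames x dims) via dynamic time warping.
    Cost grows with len(x) * len(y) frame comparisons."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # per-frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # length-normalised alignment cost

def cosine_distance(a, b):
    """Single O(d) comparison of two fixed-dimensional embeddings."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
seq_a = rng.normal(size=(60, 13))  # e.g. 60 frames of 13-dim MFCCs (assumed)
seq_b = rng.normal(size=(45, 13))  # different length: DTW still applies
emb_a = rng.normal(size=128)       # hypothetical 128-dim AWEs
emb_b = rng.normal(size=128)

print(dtw_cost(seq_a, seq_b))        # 60 * 45 frame comparisons
print(cosine_distance(emb_a, emb_b)) # one vector operation
```

The point of the sketch is only the asymptotics: embedding both segments once and then comparing vectors replaces a quadratic-time alignment with a constant-time distance, which is what makes AWEs attractive for large-scale search tasks such as UTD and query-by-example.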