Improving unsupervised acoustic word embeddings using segment- and frame-level information

Date
2021-12
Publisher
Stellenbosch : Stellenbosch University
Abstract
ENGLISH ABSTRACT: Many speech processing tasks involve measuring the acoustic similarity between speech segments. Conventionally, these speech comparisons are performed using dynamic time warping (DTW), a computationally expensive alignment-based approach. Recent research has shown that fixed-dimensional vectors, which are representations for speech segments of variable length, can be used in these tasks. These vectors, called acoustic word embeddings (AWEs), allow for efficient comparisons. A number of studies have shown that AWEs can be used in tasks such as unsupervised term discovery (UTD) and query-by-example search in a zero-resource setting, where transcriptions for speech are not available and full speech recognition is therefore not possible. Consequently, some studies have focussed on developing unsupervised AWE methods in this setting. However, the intrinsic quality of supervised AWEs is still vastly superior to that of unsupervised AWEs. This serves as motivation to investigate methods to improve the quality of unsupervised AWEs. Additionally, this is also of interest to the language acquisition field, considering that infants do not require transcriptions to learn speech. We focus on three different problem areas present in current AWEs. Firstly, we consider the nuisance factors in AWEs. The acoustic properties of different speakers and genders vary dramatically, and in an unsupervised setting these properties, which we call nuisance factors, can still be captured in the embeddings to a large extent. This is addressed by applying speaker and gender conditioning and adversarial training to existing AWE models, the autoencoder recurrent neural network (AE-RNN) and the correspondence autoencoder recurrent neural network (CAE-RNN). We find that these methods reduce some speaker and gender information and marginally improve the AWEs. Secondly, we consider whether improvements at the frame level will have a positive effect on the quality of the AWEs.
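The efficiency argument above can be illustrated with a minimal sketch on synthetic data. DTW must align every frame pair of two variable-length segments, an O(T1*T2) computation per comparison, whereas embedded segments are compared with a single vector distance. Mean pooling here is only a stand-in for a learned AWE model; all names and data are illustrative, not the thesis's implementation:

```python
import numpy as np

def dtw_cost(x, y):
    """Classic dynamic time warping between two frame sequences of
    shapes (T1, d) and (T2, d); O(T1*T2) per comparison."""
    t1, t2 = len(x), len(y)
    D = np.full((t1 + 1, t2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[t1, t2] / (t1 + t2)  # length-normalised alignment cost

def embed(x):
    """Stand-in acoustic word embedding: mean-pool variable-length
    frames into one fixed-dimensional vector (real AWEs are learned)."""
    return x.mean(axis=0)

def cosine_distance(a, b):
    # One O(d) operation, independent of the segments' durations.
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
word_a = rng.normal(size=(60, 13))           # e.g. 60 frames of 13-dim features
word_a_slow = np.repeat(word_a, 2, axis=0)   # the same "word", spoken slower
word_b = rng.normal(size=(50, 13))           # a different word

# Both metrics rank the same-word pair as closer than the different-word pair.
assert dtw_cost(word_a, word_a_slow) < dtw_cost(word_a, word_b)
assert cosine_distance(embed(word_a), embed(word_a_slow)) < \
       cosine_distance(embed(word_a), embed(word_b))
```

The DTW call touches every frame pair, so searching a large collection with it is costly; with embeddings the per-pair cost no longer depends on segment duration, which is what makes AWEs attractive for tasks such as UTD.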
Many AWE studies have focussed on the word level, but a few other zero-resource studies have instead focussed on developing short-time frame-level speech representations that capture meaningful contrasts such as phonemes. These contrasts are relevant at a shorter time scale than that of most AWE approaches, which focus on discriminating between words. Three existing representation types are considered: contrastive predictive coding (CPC), autoregressive predictive coding (APC) and the correspondence autoencoder (CAE). These are used as input features to the CAE-RNN and compared to using conventional mel-frequency cepstral coefficients (MFCCs). Additionally, we introduce a fourth learned representation method: correspondence autoregressive predictive coding (CAPC), which combines the mechanisms of the frame-level CAE and APC models. We find that better input features have a significant impact on the quality of the AWEs, with the CPC features giving the best results. The last problem we consider is the training strategy used for AWE models. Motivated by the idea that human infants are first exposed to speech from only a small number of speakers, which gradually increases, we apply a speaker-number-based curriculum learning strategy to the AE-RNN and CAE-RNN and compare it to a multi-speaker strategy. We find that this training strategy does not make a difference to the quality of the AWEs. Taken together, in our experiments we find that the most impactful solution is to use learned frame-level representations as input. Speaker and gender normalisation has a marginally positive effect on the quality of the AWEs, and the training strategy has no impact. Going forward, these improved AWEs can be used in downstream tasks. Although we only considered AWEs from the AE-RNN and CAE-RNN, the problems we focussed on are not necessarily model-specific, and our findings are relevant to other AWE modelling research.
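The speaker-number-based curriculum idea can be sketched as follows. The function name, data layout, and stage granularity are hypothetical, not the thesis's implementation; the point is only that each training stage draws from a gradually widening set of speakers:

```python
def speaker_curriculum(utterances, speakers_per_stage=2):
    """Split training data into curriculum stages: stage k draws only
    from the first k*speakers_per_stage speakers (in order of first
    appearance), mimicking an infant's gradually widening exposure.

    `utterances` is a list of (speaker_id, features) pairs.
    Returns a list of stages; the last stage contains all utterances.
    """
    # Collect speakers in order of first appearance.
    speaker_order = []
    for spk, _ in utterances:
        if spk not in speaker_order:
            speaker_order.append(spk)

    stages = []
    for k in range(speakers_per_stage,
                   len(speaker_order) + speakers_per_stage,
                   speakers_per_stage):
        allowed = set(speaker_order[:k])
        stages.append([u for u in utterances if u[0] in allowed])
    return stages

# Toy corpus: four speakers, one utterance each (features elided to strings).
data = [("s1", "a"), ("s2", "b"), ("s3", "c"), ("s4", "d")]
stages = speaker_curriculum(data, speakers_per_stage=2)
# Stage 1 draws only from speakers s1 and s2; the final stage uses all four.
```

A multi-speaker baseline corresponds to training on the final stage from the start, which is the comparison the abstract describes.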
AFRIKAANSE OPSOMMING (translated): Many speech processing tasks involve measuring the acoustic similarity between speech segments. Conventionally, these speech comparisons are performed using dynamic time warping, a frame-alignment-based method that is computationally expensive. Recent research shows that fixed-dimensional vectors, which are representations for speech segments of varying lengths, can be used in these tasks. These vectors, called acoustic word embeddings (AWEs), make it possible to perform speech comparisons efficiently. A number of studies have shown that AWEs can be used in tasks such as unsupervised term discovery and query-by-example search in a zero-resource speech setting, where transcriptions for speech are not available. Some studies have therefore focussed on developing unsupervised AWE modelling methods in this setting. However, the intrinsic quality of supervised AWEs is still far higher than that of unsupervised AWEs. This serves as motivation to investigate methods that can improve the quality of unsupervised AWEs. Furthermore, this is also of interest to the field of language acquisition, since infants do not need transcriptions to learn speech. We focus on three different problem areas present in current AWEs. Firstly, we consider the nuisance factors in AWEs. The acoustic properties of different speakers and genders vary dramatically, and in an unsupervised setting these properties, which we refer to as nuisance factors, can still be captured in the AWEs to a large extent. We address this by applying speaker and gender conditioning, and adversarial training, to existing AWE models, the autoencoder recurrent neural network (AE-RNN) and the correspondence autoencoder recurrent neural network (CAE-RNN). We find that these methods reduce some of the speaker and gender information and slightly improve the quality of the AWEs.
Secondly, we examine whether improvements at the frame level will have a positive effect on the quality of the AWEs. Many AWE studies have focussed on the segment level, but a few other zero-resource studies have instead focussed on developing short-time frame-level speech representations that can capture meaningful contrasts such as phonemes. We consider three different existing representation types: contrastive predictive coding (CPC), autoregressive predictive coding (APC) and the correspondence autoencoder (CAE). These are used as input features for the CAE-RNN and we compare them with using conventional mel-frequency cepstral coefficients. We also introduce a fourth learned representation method: correspondence autoregressive predictive coding, which combines the mechanisms of the frame-level CAE and APC models. We find that these better input features have a large impact on the quality of the AWEs, with the best results coming from the CPC input features. The last problem we consider is the training strategy used for AWE models. Motivated by the idea that infants are initially exposed to speech from only a small number of speakers, we apply a speaker-number-based curriculum learning strategy to the AE-RNN and CAE-RNN and compare it with using a multi-speaker strategy. We find that this training strategy makes no difference to the quality of the AWEs. Taken together, we find that the most effective solution is to use learned frame-level representations as input features. Normalising speaker and gender information in AWEs has a slightly positive impact on their quality, and the difference in training strategy has no impact. Going forward, these improved AWEs can be used in downstream tasks.
Although we only considered AWEs from the AE-RNN and CAE-RNN, the problems we focussed on are not necessarily model-specific. Our findings are therefore relevant to other AWE modelling research.
Description
Thesis (MEng)--Stellenbosch University, 2021.
Keywords
Improving unsupervised acoustic word embeddings; segment- and frame-level information; UCTD; Acoustical engineering; Artificial neural networks; Speech processing systems; Predictive coding; Machine learning