Unsupervised feature learning for speech using correspondence and Siamese networks

Last, Petri-Johan (2020-03)

Thesis (MEng)--Stellenbosch University, 2020.

ENGLISH ABSTRACT: In automatic speech recognition systems, speaker characteristics such as gender, pitch, and speaking rate can affect the performance of the system. Humans, however, are able to understand what is being said regardless of these speaker characteristics. There are therefore features in speech that make up word identities independently of the speaker. Being able to learn such features from speech would benefit downstream speech processing tasks. In this thesis, we perform three experiments. In all three experiments, networks are trained on unlabelled speech data so that their applicability in a zero-resource environment can be evaluated. Terms are automatically discovered using an unsupervised term discovery system, and the training procedure as a whole is unsupervised. The networks learn frame-level acoustic features, which are then evaluated using a word discrimination task that also measures speaker independence. In the first experiment, we compare the correspondence autoencoder (CAE) with the Triamese network. We show that, under the described training conditions, features produced by the CAE outperform those produced by the Triamese network. In the second experiment, we investigate the effect that speaker conditioning has on features produced by the CAE. A speaker matrix is constructed with a randomly initialised speaker representation for each speaker in the training set. Using the speaker labels, a speaker embedding is extracted from this matrix and concatenated to the input of the decoder half of the network. These embeddings are fully trainable, which gives the network more parameters to manipulate and, in theory, makes the encoder half of the network less prone to retaining speaker-specific information. We show that speaker conditioning produces mixed results: it worsens performance on one dataset while improving performance on another.
In the final experiment, we develop a novel CAE-Triamese hybrid network, the CTriamese network. Applying the contrastive loss of the Triamese network to the middle layer of the CAE places an extra constraint on the intermediate representations of the CAE. We show that this network produces features that outperform those produced by both the CAE and the Triamese network on the evaluation task. We also show that, unlike the CAE, the CTriamese network produces features that score higher on the evaluation task when speaker conditioning is introduced.
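The speaker conditioning and the hybrid loss described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not give the form of the contrastive loss or how the two losses are combined, so the margin-based triplet hinge loss, the cosine distance, the weighting parameter `alpha`, and all function names below are assumptions made for illustration. The sketch also only initialises the speaker embeddings; in the thesis they are trained jointly with the network.

```python
import math
import random


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (assumed distance measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_hinge_loss(anchor, positive, negative, margin=0.5):
    """Assumed contrastive loss: pull anchor and positive together, push the negative away."""
    return max(0.0, margin
               + cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative))


def mse(predicted, target):
    """Mean squared error, used here as the CAE reconstruction loss."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def hybrid_loss(reconstruction, target_frame,
                z_anchor, z_positive, z_negative,
                alpha=0.5, margin=0.5):
    """Hypothetical combined objective: reconstruction loss on the decoder output
    plus a contrastive loss on the middle-layer representations (z_*)."""
    reconstruction_term = mse(reconstruction, target_frame)
    contrastive_term = triplet_hinge_loss(z_anchor, z_positive, z_negative, margin)
    return alpha * reconstruction_term + (1.0 - alpha) * contrastive_term


def make_speaker_matrix(speakers, dim, seed=0):
    """One randomly initialised embedding per training speaker (not trained here)."""
    rng = random.Random(seed)
    return {s: [rng.gauss(0.0, 0.1) for _ in range(dim)] for s in speakers}


def condition_decoder_input(z, speaker_matrix, speaker):
    """Look up the speaker embedding and concatenate it to the decoder input."""
    return list(z) + speaker_matrix[speaker]


if __name__ == "__main__":
    speaker_matrix = make_speaker_matrix(["spk_a", "spk_b"], dim=3)
    decoder_input = condition_decoder_input([0.2, -0.1], speaker_matrix, "spk_a")
    print(len(decoder_input))  # middle-layer size plus embedding size
```

Because the decoder receives the speaker embedding directly, it does not need the middle-layer representation to carry speaker identity, which is the intuition the abstract gives for conditioning.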

AFRIKAANSE OPSOMMING: In outomatiese spraakherkenningstelsels kan sprekerkenmerke, soos geslag, toonhoogte en geselssnelheid, die werking van die stelsel beïnvloed. Mense kan egter verstaan wat gesê word, ongeag hierdie sprekerkenmerke. Daar is dus kenmerke in spraak wat woordidentiteite beïnvloed, wat onafhanklik van die sprekerkenmerke is. Om sulke eienskappe uit spraak te leer kan voordelig wees vir spraakverwerkingstake. In hierdie tesis voer ons drie eksperimente uit. In al drie eksperimente word netwerke opgelei op ongemerkde spraakdata, sodat die toepaslikheid daarvan in 'n nulhulpbronomgewing beoordeel kan word. Terme word outomaties ontdek deur 'n termontdekkingstelsel sonder toesig, en die opleidingsprosedure as geheel het geen toesig nie. Die netwerke leer akoestiese kenmerke op raamvlak, wat dan geëvalueer word met behulp van 'n woorddiskriminasie-taak wat ook die sprekersonafhanklikheid meet. In die eerste eksperiment voer ons 'n vergelyking uit tussen die korrespondensie outoenkodeerder (KOE) en Triamese netwerke. Ons toon aan dat kenmerke wat deur die KOE vervaardig is, beter vaar in die evalueringstaak as die wat deur die Triamese netwerk vervaardig is, onder die bogenoemde opleidingsomstandighede. In die tweede eksperiment ondersoek ons die effek wat sprekerkondisionering het op die kenmerke wat deur die KOE vervaardig word. 'n Sprekermatriks word saamgestel met lukrake sprekervoorstellings vir elke spreker in die opleidingstel. Deur toegang tot die spreker identiteite te gebruik, word 'n sprekervoorstelling uit hierdie matriks onttrek en gekoppel aan die ingang van die dekodeerderhelfte van die KOE. Hierdie voorstellings kan deur die netwerk aangepas word, wat die netwerk meer parameters gee om te manipuleer en, in teorie, die kodeerderhelfte van die netwerk minder geneig maak om sprekerspesifieke inligting te hou.
Ons toon aan dat sprekerskondisionering gemengde resultate lewer, aangesien dit die prestasie op een datastel vererger en die prestasie op 'n ander verhoog. In die laaste eksperiment ontwikkel ons 'n nuwe KOE-Triamese basternetwerk, die KTriamese netwerk. Deur die kontrasverlies van die Triamese netwerk op die middelste laag van die KOE toe te pas, het die intermediêre voorstellings van die KOE 'n ekstra beperking. Ons toon aan dat hierdie netwerk kenmerke lewer wat beter vaar in die evalueringstaak as kenmerke wat deur die KOE en Triamese netwerke vervaardig word. Ons toon ook aan dat die KTriamese-netwerk, anders as die KOE, kenmerke produseer wat beter vaar in die evalueringstaak wanneer sprekerkondisionering toegepas word.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/107936