Unsupervised feature learning for speech using correspondence and Siamese networks

Last, Petri-Johan

dc.contributor.advisor	Kamper, M. J.	en_ZA
dc.contributor.advisor	Engelbrecht, H. A.	en_ZA
dc.contributor.author	Last, Petri-Johan	en_ZA
dc.contributor.other	Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering.	en_ZA
dc.date.accessioned	2020-02-24T05:29:35Z
dc.date.accessioned	2020-04-28T12:10:10Z
dc.date.available	2020-02-24T05:29:35Z
dc.date.available	2020-04-28T12:10:10Z
dc.date.issued	2020-03
dc.description	Thesis (MEng)--Stellenbosch University, 2020.	en_ZA
dc.description.abstract	ENGLISH ABSTRACT: In automatic speech recognition systems, speaker characteristics, such as gender, pitch, and talking speed, can affect the performance of the system. Humans, however, are able to understand what is being said, regardless of these speaker characteristics. There are therefore features in speech that make up word identities, regardless of the speaker characteristics. Being able to learn such features from speech would be beneficial to downstream speech processing tasks. In this thesis, we perform three experiments. Throughout all three experiments, networks are trained on unlabelled speech data so that their applicability in a zero-resource environment can be evaluated. Terms are automatically discovered using an unsupervised term discovery system, and the training procedure as a whole is unsupervised. The networks learn frame-level acoustic features, which are then evaluated using a word discrimination task that also measures speaker independence. In the first experiment, we perform a comparison between the correspondence autoencoder (CAE) and Triamese networks. We show that, under the described training conditions, features produced by the CAE outperform those produced by the Triamese network. In the second experiment, we investigate the effect speaker conditioning has on features produced by the CAE. A speaker matrix is constructed with randomised speaker representations for each speaker in the training set. Using the speaker labels, a speaker embedding is extracted from this matrix and concatenated to the input of the decoder half of the network. These embeddings are fully trainable, which gives the network more parameters to manipulate and, in theory, makes the encoder half of the network less prone to retaining speaker-specific information. We show that speaker conditioning produces mixed results, as it worsens performance on one dataset while increasing performance on another.
In the final experiment, we develop a novel CAE-Triamese hybrid network, the CTriamese network. By applying the contrastive loss of the Triamese network to the middle layer of the CAE, the intermediate representations of the CAE face an extra constraint. We show that this network produces features that outperform features produced by both the CAE and Triamese networks on the evaluation task. We also show that, unlike the CAE, the CTriamese network produces features that score higher on the evaluation task when speaker conditioning is introduced.	en_ZA
dc.description.abstract	AFRIKAANSE OPSOMMING: In outomatiese spraakherkenningstelsels kan sprekerkenmerke, soos geslag, toonhoogte en geselssnelheid, die werking van die stelsel beïnvloed. Mense kan egter verstaan wat gesê word, ongeag hierdie sprekerkenmerke. Daar is dus kenmerke in spraak wat woordidentiteite beïnvloed, wat onafhanklik van die sprekerkenmerke is. Om sulke eienskappe uit spraak te leer kan voordelig wees vir spraakverwerkingstake. In hierdie tesis voer ons drie eksperimente uit. In al drie eksperimente word netwerke opgelei op ongemerkde spraakdata, sodat die toepaslikheid daarvan in 'n nulhulpbronomgewing beoordeel kan word. Terme word outomaties ontdek deur 'n termontdekkingstelsel sonder toesig, en die opleidingsprosedure as geheel het geen toesig nie. Die netwerke leer akoestiese kenmerke op raamvlak, wat dan geëvalueer word met behulp van 'n woorddiskriminasie-taak wat ook die sprekersonafhanklikheid meet. In die eerste eksperiment voer ons 'n vergelyking uit tussen die korrespondensie outoenkodeerder (KOE) en Triamese netwerke. Ons toon aan dat kenmerke wat deur die KOE vervaardig is, beter vaar in die evalueringstaak as die wat deur die Triamese netwerk vervaardig is, onder die bogenoemde opleidingsomstandighede. In die tweede eksperiment ondersoek ons die effek wat sprekerkondisionering het op die kenmerke wat deur die KOE vervaardig word. 'n Sprekermatriks word saamgestel met lukrake sprekervoorstellings vir elke spreker in die opleidingstel. Deur toegang tot die sprekeridentiteite te gebruik, word 'n sprekervoorstelling uit hierdie matriks onttrek en gekoppel aan die ingang van die dekodeerderhelfte van die KOE. Hierdie voorstellings kan deur die netwerk aangepas word, wat die netwerk meer parameters gee om te manipuleer en, in teorie, die kodeerderhelfte van die netwerk minder geneig maak om sprekerspesifieke inligting te hou.
Ons toon aan dat sprekerkondisionering gemengde resultate lewer, aangesien dit die prestasie op een datastel vererger en die prestasie op 'n ander verhoog. In die laaste eksperiment ontwikkel ons 'n nuwe KOE-Triamese basternetwerk, die KTriamese netwerk. Deur die kontrasverlies van die Triamese netwerk op die middelste laag van die KOE toe te pas, het die intermediêre voorstellings van die KOE 'n ekstra beperking. Ons toon aan dat hierdie netwerk kenmerke lewer wat beter vaar in die evalueringstaak as kenmerke wat deur die KOE en Triamese netwerke vervaardig word. Ons toon ook aan dat die KTriamese-netwerk, anders as die KOE, kenmerke produseer wat beter vaar in die evalueringstaak wanneer sprekerkondisionering toegepas word.	af_ZA
dc.description.version	Masters	en_ZA
dc.format.extent	xi, 63 leaves : illustrations (some color)
dc.identifier.uri	http://hdl.handle.net/10019.1/107936
dc.language.iso	en	en_ZA
dc.publisher	Stellenbosch : Stellenbosch University	en_ZA
dc.rights.holder	Stellenbosch University	en_ZA
dc.subject	Correspondence autoencoder	en_ZA
dc.subject	Automatic speech recognition	en_ZA
dc.subject	Speech processing systems	en_ZA
dc.subject	Pattern recognition systems	en_ZA
dc.subject	Siamese neural networks	en_ZA
dc.subject	UCTD	en_ZA
dc.title	Unsupervised feature learning for speech using correspondence and Siamese networks	en_ZA
dc.type	Thesis	en_ZA

Files

Original bundle

Name: last_unsupervised_2020.pdf
Size: 1.43 MB
Format: Adobe Portable Document Format

Collections

Masters Degrees (Electrical and Electronic Engineering)