Unsupervised feature learning for speech using correspondence and Siamese networks

dc.contributor.advisor | Kamper, M. J. | en_ZA
dc.contributor.advisor | Engelbrecht, H. A. | en_ZA
dc.contributor.author | Last, Petri-Johan | en_ZA
dc.contributor.other | Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. | en_ZA
dc.date.accessioned | 2020-02-24T05:29:35Z
dc.date.accessioned | 2020-04-28T12:10:10Z
dc.date.available | 2020-02-24T05:29:35Z
dc.date.available | 2020-04-28T12:10:10Z
dc.date.issued | 2020-03
dc.description | Thesis (MEng)--Stellenbosch University, 2020. | en_ZA
dc.description.abstract | ENGLISH ABSTRACT: In automatic speech recognition systems, speaker characteristics such as gender, pitch, and talking speed can affect the performance of the system. Humans, however, are able to understand what is being said regardless of these speaker characteristics. There are therefore features in speech that make up word identities independently of the speaker. Being able to learn such features from speech would benefit downstream speech processing tasks. In this thesis, we perform three experiments. In all three, networks are trained on unlabelled speech data so that their applicability in a zero-resource environment can be evaluated. Terms are discovered automatically using an unsupervised term discovery system, so the training procedure as a whole is unsupervised. The networks learn frame-level acoustic features, which are then evaluated using a word discrimination task that also measures speaker independence. In the first experiment, we compare the correspondence autoencoder (CAE) with the Triamese network. We show that, under the described training conditions, features produced by the CAE outperform those produced by the Triamese network. In the second experiment, we investigate the effect that speaker conditioning has on features produced by the CAE. A speaker matrix is constructed with a randomised speaker representation for each speaker in the training set. Using the speaker labels, a speaker embedding is extracted from this matrix and concatenated to the input of the decoder half of the network. These embeddings are fully trainable, which gives the network more parameters to manipulate and, in theory, makes the encoder half of the network less prone to retaining speaker-specific information. We show that speaker conditioning produces mixed results: it worsens performance on one dataset while improving performance on another.
In the final experiment, we develop a novel CAE-Triamese hybrid network, the CTriamese network. Applying the contrastive loss of the Triamese network to the middle layer of the CAE places an extra constraint on the intermediate representations of the CAE. We show that this network produces features that outperform those produced by both the CAE and the Triamese network on the evaluation task. We also show that, unlike the CAE, the CTriamese network produces features that score higher on the evaluation task when speaker conditioning is introduced. | en_ZA
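The two mechanisms described in the abstract, speaker conditioning at the decoder input and a contrastive term on the middle layer of the CAE, can be illustrated with a small forward-pass sketch. This is not the thesis implementation: the single-layer encoder and decoder, the layer sizes, the cosine-distance triplet hinge used as the contrastive term, the margin, the weighting `alpha`, the choice to condition on the target frame's speaker, and all function names are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; none of these come from the thesis.
N_SPEAKERS, FEAT_DIM, HID_DIM, SPK_DIM = 4, 13, 8, 3

# Speaker matrix: one randomly initialised row per training speaker.
# In a real model these rows would be trained with the network weights.
speaker_matrix = rng.normal(scale=0.1, size=(N_SPEAKERS, SPK_DIM))
W_enc = rng.normal(scale=0.1, size=(FEAT_DIM, HID_DIM))
W_dec = rng.normal(scale=0.1, size=(HID_DIM + SPK_DIM, FEAT_DIM))


def encode(x):
    """Map input frames to the middle-layer representation."""
    return np.tanh(x @ W_enc)


def decode(z, speaker_ids):
    """Concatenate a speaker embedding to the decoder input, then project."""
    spk = speaker_matrix[speaker_ids]  # row lookup by speaker label
    return np.concatenate([z, spk], axis=1) @ W_dec


def cosine_distance(a, b):
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    return 1.0 - num / den


def hybrid_loss(x_a, x_p, x_n, spk_p, margin=0.15, alpha=0.5):
    """Reconstruction loss plus a triplet hinge on the middle layer.

    x_a and x_p are frames from a discovered pair of the same term,
    x_n is a frame from a different term (assumed sampling scheme),
    spk_p holds the speaker label of each target frame x_p (assumption).
    """
    z_a, z_p, z_n = encode(x_a), encode(x_p), encode(x_n)
    # Correspondence objective: reconstruct the paired frame, not the input.
    cae = np.mean((decode(z_a, spk_p) - x_p) ** 2)
    # Contrastive term: same-term pairs closer than different-term pairs.
    hinge = margin + cosine_distance(z_a, z_p) - cosine_distance(z_a, z_n)
    triplet = np.mean(np.maximum(0.0, hinge))
    return alpha * cae + (1.0 - alpha) * triplet, cae, triplet


# Usage on random stand-in data (real inputs would be acoustic frames).
batch = 5
x_a = rng.normal(size=(batch, FEAT_DIM))
x_p = rng.normal(size=(batch, FEAT_DIM))
x_n = rng.normal(size=(batch, FEAT_DIM))
spk_p = rng.integers(0, N_SPEAKERS, size=batch)
total, cae, triplet = hybrid_loss(x_a, x_p, x_n, spk_p)
```

Setting `alpha=1.0` in this sketch leaves only the reconstruction term, which corresponds to a speaker-conditioned CAE without the contrastive constraint.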
dc.description.abstract | AFRIKAANS SUMMARY: In automatic speech recognition systems, speaker characteristics such as gender, pitch, and speaking rate can influence how the system performs. Humans, however, can understand what is said regardless of these speaker characteristics. There are therefore features in speech that influence word identities and that are independent of the speaker characteristics. Learning such features from speech can be beneficial for speech processing tasks. In this thesis, we carry out three experiments. In all three experiments, networks are trained on unlabelled speech data so that their applicability in a zero-resource environment can be assessed. Terms are discovered automatically by an unsupervised term discovery system, and the training procedure as a whole is unsupervised. The networks learn frame-level acoustic features, which are then evaluated using a word discrimination task that also measures speaker independence. In the first experiment, we compare the correspondence autoencoder (CAE) with the Triamese network. We show that, under the training conditions described above, features produced by the CAE perform better on the evaluation task than those produced by the Triamese network. In the second experiment, we investigate the effect that speaker conditioning has on the features produced by the CAE. A speaker matrix is constructed with a random speaker representation for each speaker in the training set. Using the speaker identities, a speaker representation is extracted from this matrix and concatenated to the input of the decoder half of the CAE. These representations can be adjusted by the network, which gives the network more parameters to manipulate and, in theory, makes the encoder half of the network less inclined to retain speaker-specific information.
We show that speaker conditioning yields mixed results, since it worsens performance on one dataset and improves performance on another. In the final experiment, we develop a new CAE-Triamese hybrid network, the CTriamese network. Applying the contrastive loss of the Triamese network to the middle layer of the CAE places an extra constraint on the intermediate representations of the CAE. We show that this network yields features that perform better on the evaluation task than features produced by the CAE and the Triamese network. We also show that, unlike the CAE, the CTriamese network produces features that perform better on the evaluation task when speaker conditioning is applied. | af_ZA
dc.description.version | Masters | en_ZA
dc.format.extent | xi, 63 leaves : illustrations (some color)
dc.identifier.uri | http://hdl.handle.net/10019.1/107936
dc.language.iso | en | en_ZA
dc.publisher | Stellenbosch : Stellenbosch University | en_ZA
dc.rights.holder | Stellenbosch University | en_ZA
dc.subject | Correspondence autoencoder | en_ZA
dc.subject | Automatic speech recognition | en_ZA
dc.subject | Speech processing systems | en_ZA
dc.subject | Pattern recognition systems | en_ZA
dc.subject | Siamese neural networks | en_ZA
dc.subject | UCTD | en_ZA
dc.title | Unsupervised feature learning for speech using correspondence and Siamese networks | en_ZA
dc.type | Thesis | en_ZA
Files
Original bundle
Name: last_unsupervised_2020.pdf
Size: 1.43 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.71 KB
Format: Plain Text