Robust voice activity detection for low-resource automatic speech recognition

Date
2021-12
Publisher
Stellenbosch : Stellenbosch University
Abstract
ENGLISH ABSTRACT: The automatic separation of raw audio into speech and non-speech is an important preprocessing step for many real-world speech processing systems. This task, known as voice activity detection (VAD), has largely been studied using constrained, synthetically corrupted speech data. In this thesis, we present a number of new VAD systems that are specifically designed for noisy in-the-wild audio. Previous research has shown a relationship between improved VAD and better downstream automatic speech recognition (ASR) performance. Our systems are to be used in a preprocessing step for low-resource ASR applied to real-world audio, and should therefore be computationally efficient, yet robust and accurate. We present four different experimental approaches to VAD, as well as experiments with additional speaker diarisation as a method for fully automatic segmentation. Our baseline system is a small convolutional neural network (CNN) classifier. While it provides reasonable performance, with an equal error rate (EER) of 0.221 and an area under the receiver operating characteristic curve (ROC AUC) of 0.848, it results in a noisy, unreliable segmentation. Two approaches aimed at addressing this problem are presented. First, a CNN-based system with hidden Markov model (HMM) smoothing is proposed and found to over-compensate by smoothing the segments too aggressively. Despite this, the system provides a notable improvement over the baseline, achieving an EER of 0.195 and an ROC AUC of 0.873. Second, a CNN-based system with more sophisticated HMM smoothing using Gaussian mixture model (GMM) emissions is proposed and shown to provide better performance, with an EER of 0.166 and an ROC AUC of 0.905. In conjunction with x-vector speaker diarisation, these two systems were used to automatically segment audio for semi-supervised ASR training in a resource-constrained environment and were shown to achieve a word error rate (WER) improvement of 2% absolute.
Finally, a new hybrid architecture for VAD is presented, incorporating both CNN and bidirectional long short-term memory (BiLSTM) layers trained in an end-to-end manner. This model provides robust, state-of-the-art performance, with an EER of 0.107 and an AUC of 0.951, thereby comfortably outperforming a much larger ResNet-based benchmark. Furthermore, this performance is attainable with relatively small model sizes of fewer than 200k parameters.
AFRIKAANS SUMMARY (translated into English): The automatic separation of raw audio into speech and non-speech is an important preprocessing step for many real-world speech processing systems. This task, known as voice activity detection (VAD), has largely been studied using constrained, synthetically corrupted speech data. In this thesis we present a number of new VAD systems that are specifically designed for noisy, in-the-wild audio. Previous research has shown the relationship between improved VAD and better resulting automatic speech recognition (ASR) performance. Our systems will be used in a preprocessing step for resource-constrained ASR applied to real-world audio, and must therefore be computationally efficient, yet robust and accurate. We present four different experimental approaches to VAD, as well as experiments with additional speaker diarisation as a method for fully automatic segmentation. Our baseline system is a small convolutional neural network (CNN) classifier. Although it offers reasonable performance, with an equal error rate (EER) of 0.221 and an area under the receiver operating characteristic curve (ROC AUC) of 0.848, it results in a noisy, unreliable segmentation. Two approaches aimed at addressing this problem are presented. First, a CNN-based system with hidden Markov model (HMM) smoothing is proposed and found to over-compensate by smoothing the segments too aggressively. Despite this, the system offers a notable improvement over the baseline, with an EER of 0.195 and an ROC AUC of 0.873.
Second, a CNN-based system with more sophisticated HMM smoothing using a Gaussian mixture model (GMM) is proposed and shown to offer better performance, with an EER of 0.166 and an ROC AUC of 0.905. These two systems, combined with x-vector speaker diarisation, were used to automatically segment audio for semi-supervised ASR training in a resource-constrained environment, and a 2% absolute improvement in word error rate (WER) was achieved. Finally, a new hybrid architecture for VAD is proposed, containing both CNN and bidirectional long short-term memory (BiLSTM) layers that are trained in an end-to-end manner. This model offers robust, competitive performance, with an EER of 0.107 and an AUC of 0.951, comfortably exceeding the performance of a much larger ResNet-based benchmark. In addition, this performance is attainable with relatively small model sizes of fewer than 200k parameters.
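The idea of smoothing a noisy frame-level segmentation with an HMM can be illustrated with a short sketch. This is not the thesis's implementation: it assumes a two-state HMM (non-speech, speech) with a single fixed self-transition probability, and it uses the classifier's per-frame speech probabilities directly as emission scores rather than GMM emissions. The names `viterbi_smooth`, `p_stay` and `prior_speech` are hypothetical and chosen only for illustration.

```python
import math


def viterbi_smooth(speech_probs, p_stay=0.95, prior_speech=0.5):
    """Return the most likely 0/1 (non-speech/speech) label per frame.

    Assumption: classifier probabilities stand in for HMM emission scores.
    """
    if not speech_probs:
        return []

    eps = 1e-12
    log_stay = math.log(p_stay)
    log_switch = math.log(1.0 - p_stay)

    def log_emit(p):
        # Clip to avoid log(0); index 0 = non-speech, index 1 = speech.
        p = min(max(p, eps), 1.0 - eps)
        return (math.log(1.0 - p), math.log(p))

    first = log_emit(speech_probs[0])
    score = [math.log(1.0 - prior_speech) + first[0],
             math.log(prior_speech) + first[1]]
    backptrs = []

    # Forward pass: best log-score of any path ending in each state.
    for p in speech_probs[1:]:
        emit = log_emit(p)
        new_score = [0.0, 0.0]
        ptr = [0, 0]
        for s in (0, 1):
            stay = score[s] + log_stay
            switch = score[1 - s] + log_switch
            if stay >= switch:
                new_score[s], ptr[s] = stay + emit[s], s
            else:
                new_score[s], ptr[s] = switch + emit[s], 1 - s
        score = new_score
        backptrs.append(ptr)

    # Backward pass: follow the stored pointers from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


frame_probs = [0.9, 0.8, 0.4, 0.9, 0.85, 0.1, 0.2, 0.6, 0.1, 0.15]
print(viterbi_smooth(frame_probs))
```

Thresholding these example probabilities at 0.5 gives a segmentation that flips five times, whereas the smoothed path changes state only once. A larger `p_stay` smooths more aggressively, which is the kind of trade-off the abstract describes when it notes that simple HMM smoothing can over-compensate.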
Description
Thesis (MEng)--Stellenbosch University, 2021.
Keywords
Robust voice activity detection, UCTD, Automatic speech recognition, Speech processing systems, Voice activity detection