Abstract:
We address the automatic generation of acoustic subword
units and an associated pronunciation dictionary for speech recognition.
The speech audio is first segmented into phoneme-like units by detecting
points at which the spectral characteristics of the signal change abruptly.
These audio segments are subsequently subjected to agglomerative
clustering in order to group similar acoustic segments. Finally, the
orthography is iteratively aligned with the resulting transcription, expressed
as a sequence of audio clusters, in order to determine pronunciations of the
training words. The approach is evaluated by applying it to two subsets of the
TIMIT corpus, both of which have a closed vocabulary. It is found that,
when vocabulary words occur often in the training set, the proposed
technique delivers performance that is close to, but lower than, that of a
system based on the TIMIT phonetic transcriptions. When vocabulary words
are not repeated often in the training set, the best system is able to
outperform its counterpart based on the TIMIT phonetic transcriptions,
although recognition performance in both cases is poor.
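
The first two stages described above (segmentation at abrupt spectral changes, then agglomerative clustering of the segments) can be illustrated with a minimal sketch. The abstract does not state the features, distance measure, change threshold, or linkage criterion, so everything below is an assumption made for illustration: frames are generic feature vectors, a spectral change is taken to be a large Euclidean distance between consecutive frames, each segment is summarised by its mean vector, and clusters are merged by closest centroids. The function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def detect_boundaries(frames, threshold):
    """Return the start index of each segment (index 0 is always a start)."""
    # Euclidean distance between each frame and its predecessor (assumed measure).
    jumps = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    # A large jump between frames t and t+1 starts a new segment at t+1.
    return [0] + [int(t) + 1 for t in np.flatnonzero(jumps > threshold)]

def segment_means(frames, starts):
    """Summarise each segment by the mean of its frames."""
    ends = starts[1:] + [len(frames)]
    return np.array([frames[s:e].mean(axis=0) for s, e in zip(starts, ends)])

def agglomerative_labels(points, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two closest centroids."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        centroids = [points[c].mean(axis=0) for c in clusters]
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Assign one integer label per input point.
    labels = np.empty(len(points), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels
```

Applied to a feature matrix, `detect_boundaries` yields segment starts, `segment_means` gives one vector per segment, and `agglomerative_labels` assigns each segment to a cluster; the resulting label sequence plays the role of the cluster-level transcription that the final alignment stage would consume.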