Automatic discovery of subword units and pronunciations for automatic speech recognition using TIMIT
Both authors from Stellenbosch University.
Proceedings of the twenty-first annual symposium of the Pattern Recognition Association of South Africa (PRASA), Stellenbosch, South Africa, November 2010.
We address the automatic generation of acoustic subword units and an associated pronunciation dictionary for speech recognition. The speech audio is first segmented into phoneme-like units by detecting points at which the spectral characteristics of the signal change abruptly. These audio segments are subsequently subjected to agglomerative clustering in order to group similar acoustic segments. Finally, the orthography is iteratively aligned with the resulting transcription in terms of audio clusters in order to determine pronunciations of the training words. The approach is evaluated by applying it to two subsets of the TIMIT corpus, both of which have a closed vocabulary. It is found that, when vocabulary words occur often in the training set, the proposed technique delivers performance that is close to but lower than a system based on the TIMIT phonetic transcriptions. When vocabulary words are not repeated often in the training set, the best system is able to outperform its counterpart based on the TIMIT phonetic transcriptions, although recognition performance in both cases is poor.