Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages

dc.contributor.author: Hermann, Enno
dc.contributor.author: Kamper, Herman
dc.contributor.author: Goldwater, Sharon
dc.date.accessioned: 2023-05-04T09:11:40Z
dc.date.available: 2023-05-04T09:11:40Z
dc.date.issued: 2021-04
dc.description: CITATION: Hermann, E., Kamper, H. & Goldwater, S. 2021. Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages. Computer Speech & Language, 65(2021), 17 pages, doi:10.1016/j.csl.2020.101098.
dc.description: The original publication is available at sciencedirect.com
dc.description.abstract: Subword modeling for zero-resource languages aims to learn low-level representations of speech audio without using transcriptions or other resources from the target language (such as text corpora or pronunciation dictionaries). A good representation should capture phonetic content and abstract away from other types of variability, such as speaker differences and channel noise. Previous work in this area has primarily focused on unsupervised learning from target language data only, and has been evaluated only intrinsically. Here we directly compare multiple methods, including some that use only target language speech data and some that use transcribed speech from other (non-target) languages, and we evaluate using two intrinsic measures as well as on a downstream unsupervised word segmentation and clustering task. We find that combining two existing target-language-only methods yields better features than either method alone. Nevertheless, even better results are obtained by extracting target language bottleneck features using a model trained on other languages. Cross-lingual training using just one other language is enough to provide this benefit, but multilingual training helps even more. In addition to these results, which hold across both intrinsic measures and the extrinsic task, we discuss the qualitative differences between the different types of learned features.
dc.description.version: Publisher's version
dc.format.extent: 17 pages
dc.identifier.citation: Hermann, E., Kamper, H. & Goldwater, S. 2021. Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages. Computer Speech & Language, 65(2021), 17 pages, doi:10.1016/j.csl.2020.101098.
dc.identifier.issn: 0885-2308 (online)
dc.identifier.other: doi:10.1016/j.csl.2020.101098
dc.identifier.uri: http://hdl.handle.net/10019.1/126865
dc.language.iso: en_ZA
dc.publisher: Elsevier Ltd
dc.rights.holder: Authors retain copyright
dc.subject: Computational linguistics
dc.subject: Artificial intelligence
dc.subject: Text processing (Computer science)
dc.subject: Zero-resource speech technology
dc.subject: Subword modeling
dc.subject: Unsupervised feature extraction
dc.title: Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages
dc.type: Article
Files
Original bundle
Name: enno_multilingual_2021.pdf
Size: 1.18 MB
Format: Adobe Portable Document Format
Description: Download article
License bundle
Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon to submission