Hidden Markov Models for Natural Language Processing

Date
2024-03
Publisher
Stellenbosch : Stellenbosch University
Abstract
ENGLISH SUMMARY: Foundational to many Natural Language Processing (NLP) applications is the task of part-of-speech (POS) tagging, which attempts to disambiguate words by labelling them with their correct grammatical part of speech. Supervised approaches to this problem train taggers on expert-annotated corpora to learn the patterns that result in the correct labelling. As word and tag context often informs the correct labelling, the problem is framed as a sequence modelling task with a sequence of words W = {w1, w2, ..., wn} and their corresponding grammatical tags T = {t1, t2, ..., tn}. Since the 1990s, probabilistic approaches have become more widely adopted than deterministic “rule-based” approaches. Hidden Markov models (HMMs) remain a popular formalism, in which inference uses a trained tagger to find the most likely hidden sequence of tags given an observed sequence of words. HMMs, however, are limited in their ability to express complex dependencies because of their strict Markov assumptions, and their performance degrades when they encounter words not seen in training. Maximum entropy Markov models (MEMMs) and conditional random fields (CRFs) relax these assumptions and allow for the inclusion of a rich set of features that capture contextual and morphological information. Despite these advances, improving tagging accuracy remains a relevant pursuit in this field, as even small errors may propagate down the NLP pipeline. Alongside accuracy itself, the factors that affect it also need to be considered. These pivotal factors include the quality and abundance of training data, the size and complexity of the tag set, and the presence of unknown words. Simulation procedures are proposed to create these various conditions in the data. A Monte Carlo study compares the taggers and demonstrates the overall robustness of CRFs to these conditions, while highlighting extreme cases where CRFs tend to overfit. Finally, future research avenues are discussed.
AFRIKAANSE OPSOMMING: Die grondslag vir baie natuurlike taalverwerking (NLP) toepassings is die taak van woordsoort (POS) etikettering, wat poog om woorde te ondubbelsinnig deur hulle te benoem met hul korrekte grammatikale deel van spraak. Toesighoudende benaderings tot hierdie probleem lei merkers op kundige geannoteerde korpus op om die patrone te leer wat lei tot die korrekte etikettering. Aangesien woord- en merkerkonteks dikwels die korrekte etikettering inlig, word die probleem uitgedruk as ’n volgordemodelleringsparadigma met ’n reeks woorde W = {w1, w2, ..., wn} en hul ooreenstemmende grammatikale etikette T = {t1, t2, ..., tn}. Sedert die 1990’s het probabilistiese benaderings meer algemeen aangeneem as deterministiese “reël-gebaseerde” benaderings. Versteekte Markov-modelle (HMMs) bly ’n gewilde formalisme, waar die taak van afleiding is om ’n opgeleide merker te gebruik om die mees waarskynlike versteekte volgorde van merkers te vind gegewe ’n waargenome volgorde van woorde. HMMs is egter beperk in hul vermoë om komplekse afhanklikhede uit te druk as gevolg van streng Markov-aannames en hul prestasie verswak wanneer hulle woorde teëkom wat nie in opleiding gesien word nie. Maksimum entropie Markov-modelle (MEMMs) en voorwaardelike ewekansige velde (CRFs) verslap hierdie aannames en maak voorsiening vir die insluiting van ’n ryk stel kenmerke wat kontekstuele en morfologiese inligting vaslê. Ten spyte van vooruitgang, is verbeterings aan die akkuraatheid van etikettering steeds ’n relevante strewe in hierdie veld, aangesien selfs klein foute in die NLP-pyplyn kan voortplant. Parallel met die belangrikheid van akkuraatheid is oorwegings rondom faktore wat akkuraatheid beïnvloed. Hierdie deurslaggewende faktore sluit in die kwaliteit en oorvloed van opleidingsdata, die grootte en kompleksiteit van die etiketstel, en die teenwoordigheid van onbekende woorde. Simulasieprosedures word voorgestel om hierdie verskillende toestande in die data te skep. ’n Monte Carlo-studie vergelyk die merkers en demonstreer die algehele robuustheid van CRFs vir hierdie toestande, en beklemtoon uiterste gevalle waar CRFs geneig is om te oorpas. Laastens word toekomstige navorsingsweë bespreek.
Description
Thesis (MCom)--Stellenbosch University, 2024.