Hidden Markov Models for Natural Language Processing

dc.contributor.advisor: Muller, Chris [en_ZA]
dc.contributor.author: Stapelberg, Keara-Linn [en_ZA]
dc.contributor.other: Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. [en_ZA]
dc.date.accessioned: 2024-03-04T09:50:31Z
dc.date.accessioned: 2024-04-26T16:55:27Z
dc.date.available: 2024-03-04T09:50:31Z
dc.date.available: 2024-04-26T16:55:27Z
dc.date.issued: 2024-03
dc.description: Thesis (MCom)--Stellenbosch University, 2024. [en_ZA]
dc.description.abstract: ENGLISH SUMMARY: Foundational to many Natural Language Processing (NLP) applications is the task of part-of-speech (POS) tagging, which attempts to disambiguate words by labelling them with their correct grammatical part of speech. Supervised approaches to this problem train taggers on expert-annotated corpora to learn the patterns that result in the correct labelling. As word and tag context often informs the correct labelling, the problem is expressed in a sequence-modelling paradigm with a sequence of words W = {w1, w2, ..., wn} and their corresponding grammatical tags T = {t1, t2, ..., tn}. Since the 1990s, probabilistic approaches have become more widely adopted than deterministic "rule-based" approaches. Hidden Markov models (HMMs) remain a popular formalism, where the task of inference is to employ a trained tagger to find the most likely hidden sequence of tags given an observed sequence of words. HMMs, however, are limited in their ability to express complex dependencies because of their strict Markov assumptions, and their performance degrades when they encounter words not seen in training. Maximum entropy Markov models (MEMMs) and conditional random fields (CRFs) relax these assumptions and allow for the inclusion of a rich set of features that capture contextual and morphological information. Despite these advances, improving tagging accuracy remains a relevant pursuit in this field, as even small errors may propagate down the NLP pipeline. Parallel to the importance of accuracy are considerations around the factors that affect it. These pivotal factors include the quality and abundance of training data, the size and complexity of the tag set, and the presence of unknown words. Simulation procedures are proposed to create these various conditions in the data. A Monte Carlo study compares the taggers and demonstrates the overall robustness of CRFs to these conditions, highlighting extreme cases where CRFs tend to overfit. Finally, future research avenues are discussed. [en_ZA]
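The abstract describes HMM inference as finding the most likely hidden tag sequence given an observed sequence of words; this decoding step is conventionally solved with the Viterbi algorithm. A minimal sketch follows; the tag set, vocabulary, and probabilities are illustrative assumptions for this example, not models or data from the thesis.

```python
import math

# Hypothetical example: tiny tag set and hand-picked probabilities,
# purely to illustrate Viterbi decoding for POS tagging.
UNK = 1e-6  # small floor probability for words unseen in training

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` (log-space Viterbi)."""
    # best[i][t]: log-prob of the best tag path ending in tag t at position i
    best = [{t: math.log(start_p[t]) + math.log(emit_p[t].get(words[0], UNK))
             for t in tags}]
    back = [{}]  # back[i][t]: predecessor tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][t])) for p in tags),
                key=lambda x: x[1])
            best[i][t] = score + math.log(emit_p[t].get(words[i], UNK))
            back[i][t] = prev
    # Backtrace from the highest-scoring final tag.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {"DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
           "NOUN": {"DET": 0.1,  "NOUN": 0.3, "VERB": 0.6},
           "VERB": {"DET": 0.5,  "NOUN": 0.4, "VERB": 0.1}}
emit_p = {"DET":  {"the": 0.9},
          "NOUN": {"dog": 0.5, "barks": 0.1},
          "VERB": {"barks": 0.6}}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# → ['DET', 'NOUN', 'VERB']
```

The `UNK` floor crudely stands in for the unknown-word handling the abstract identifies as an HMM weakness; CRFs address the same problem through morphological features instead.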
dc.description.version: Masters
dc.format.extent: xvi, 93 pages : illustrations, includes annexures
dc.identifier.uri: https://scholar.sun.ac.za/handle/10019.1/130419
dc.language.iso: en_ZA [en_ZA]
dc.publisher: Stellenbosch : Stellenbosch University
dc.rights.holder: Stellenbosch University
dc.subject.lcsh: Computational linguistics -- Statistical methods [en_ZA]
dc.subject.lcsh: Text processing (Computer science) [en_ZA]
dc.subject.lcsh: Natural language processing (Computer science) -- Statistical methods [en_ZA]
dc.subject.name: UCTD
dc.title: Hidden Markov Models for Natural Language Processing [en_ZA]
dc.type: Thesis [en_ZA]
Files
Original bundle
Name: stapelberg_hidden_2024.pdf
Size: 1.46 MB
Format: Adobe Portable Document Format