Browsing by Author "Stapelberg, Keara-Linn"
- Hidden Markov Models for Natural Language Processing (Stellenbosch : Stellenbosch University, 2024-03) Stapelberg, Keara-Linn; Muller, Chris; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH SUMMARY: Foundational to many Natural Language Processing (NLP) applications is the task of part-of-speech (POS) tagging, which attempts to disambiguate words by labelling them with their correct grammatical part-of-speech. Supervised approaches to this problem train taggers on expert-annotated corpora to learn the patterns that result in the correct labelling. As word and tag context often informs correct labelling, the problem is framed as a sequence modelling task with a sequence of words W = {w_1, w_2, ..., w_n} and their corresponding grammatical tags T = {t_1, t_2, ..., t_n}. Since the 1990s, probabilistic approaches have become more widely adopted than deterministic "rule-based" approaches. Hidden Markov models (HMMs) remain a popular formalism, in which inference means using a trained tagger to find the most likely hidden sequence of tags given an observed sequence of words. HMMs, however, are limited in their ability to express complex dependencies because of their strict Markov assumptions, and their performance degrades when they encounter words not seen in training. Maximum entropy Markov models (MEMMs) and conditional random fields (CRFs) relax these assumptions and allow for the inclusion of a rich set of features that capture contextual and morphological information. Despite these advances, improving tagging accuracy remains a relevant pursuit in this field, as even small errors may propagate down the NLP pipeline. Equally important are the factors that affect accuracy. These pivotal factors include the quality and abundance of training data, the size and complexity of the tag set, and the presence of unknown words.
Simulation procedures are proposed to create these various conditions in the data. A Monte Carlo study compares the taggers and demonstrates the overall robustness of CRFs to these conditions, while highlighting extreme cases where CRFs tend to overfit. Finally, future research avenues are discussed.
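
The HMM inference task described in the summary, finding the most likely hidden tag sequence for an observed word sequence, is commonly solved with the Viterbi algorithm. The following is a minimal sketch of that idea and not the thesis's implementation. The function name `viterbi`, the three-tag toy tag set, all probability values, and the `unk_p` fallback for unseen words are illustrative assumptions.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, unk_p=1e-6):
    """Return the most likely tag sequence for `words` under a first-order HMM.

    `unk_p` is an assumed flat emission probability for words not seen in
    training; real taggers usually handle unknown words more carefully.
    """
    def log_emit(tag, word):
        return math.log(emit_p[tag].get(word, unk_p))

    # best[i][t]: log-probability of the best tag path ending in tag t at word i
    best = [{t: math.log(start_p[t]) + log_emit(t, words[0]) for t in tags}]
    back = []  # back[i][t]: best previous tag for tag t at word i + 1

    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: best[-1][s] + math.log(trans_p[s][t]))
            scores[t] = best[-1][prev] + math.log(trans_p[prev][t]) + log_emit(t, word)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy parameters (made up for illustration, not estimated from any corpus).
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6, "dog": 0.05},
}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# ['DET', 'NOUN', 'VERB']
```

In this sketch an unseen word such as "cat" receives the same small emission probability under every tag, so its tag is decided entirely by the transition probabilities. This illustrates why a plain HMM degrades on unknown words, and why feature-based models such as MEMMs and CRFs can instead draw on morphological cues.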