Browsing by Author "Coetsee, Dirko"
Now showing 1 - 1 of 1
Results Per Page
Sort Options
- ItemConditional random fields for noisy text normalisation(Stellenbosch : Stellenbosch University, 2014-12) Coetsee, Dirko; Du Preez, J. A.; Stellenbosch University. Faculty of Engineering. Department of Electrical and Electronic Engineering.ENGLISH ABSTRACT: The increasing popularity of microblogging services such as Twitter means that more and more unstructured data is available for analysis. The informal language usage in these media presents a problem for traditional text mining and natural language processing tools. We develop a pre-processor to normalise this noisy text so that useful information can be extracted with standard tools. A system consisting of a tokeniser, out-of-vocabulary token identifier, correct candidate generator, and N-gram language model is proposed. We compare the performance of generative and discriminative probabilistic models for these different modules. The effect of normalising the training and testing data on the performance of a tweet sentiment classifier is investigated. A linear-chain conditional random field, which is a discriminative model, is found to work better than its generative counterpart for the tokenisation module, achieving a 0.76% character error rate compared to 1.41% for the finite state automaton. For the candidate generation module, however, the generative weighted finite state transducer works better, getting the correct clean version of a word right 36% of the time on the first guess, while the discriminatively trained hidden alignment conditional random field only achieves 6%. The use of a normaliser as a pre-processing step does not significantly affect the performance of the sentiment classifier.