Training neural word embeddings for transfer learning and translation
Thesis (D. Phil)--Stellenbosch University, 2016.
ENGLISH ABSTRACT: In contrast to only a decade ago, it is now easy to collect large text corpora from the Web on any topic imaginable. However, in order for information processing systems to perform a useful task, such as answering a user’s queries on the content of the text, the raw text first needs to be parsed into the appropriate linguistic structures, such as parts of speech, named entities or semantic entities. Contemporary natural language processing systems rely predominantly on supervised machine learning techniques for performing this task. However, the supervision required to train these models is expensive to come by, since human annotators need to mark up relevant pieces of text with the required labels of interest. Furthermore, machine learning practitioners need to manually engineer a set of task-specific features, which represents a wasteful duplication of effort for each new task. An alternative approach is to attempt to automatically learn representations from raw text that are useful for predicting a wide variety of linguistic structures. In this dissertation, we hypothesise that neural word embeddings, i.e. representations that use continuous values to represent words in a learned vector space of meaning, are a suitable and efficient approach for learning representations of natural languages that are useful for predicting various aspects related to their meaning. We show experimental results which support this hypothesis, and present several contributions which make inducing word representations faster and applicable to monolingual and various cross-lingual prediction tasks. The first contribution to this end is SimTree, an efficient algorithm for jointly clustering words into semantic classes while training a neural network language model with the hierarchical softmax output layer. The second is an efficient subsampling training technique for speeding up learning while increasing the accuracy of word embeddings induced using the hierarchical softmax.
The third is BilBOWA, a bilingual word embedding model that can efficiently learn to embed words across multiple languages using only a limited sample of parallel raw text and unlimited amounts of monolingual raw text. The fourth is Barista, a bilingual word embedding model that efficiently uses additional semantic information about how words map into equivalence classes, such as parts of speech or word translations, and includes this information during the embedding process. In addition, this dissertation provides an in-depth overview of the different neural language model architectures, and a detailed, tutorial-style overview of popular techniques for training these models.
AFRIKAANS SUMMARY: No summary available