Ruminating Word Representations with Random Noise Masking

Abstract

We introduce a training method for learning better word representations and improving model performance, which we call GraVeR (Gradual Vector Rumination). The method gradually and iteratively adds random noise and bias to word embeddings after training a model, then re-trains the model from scratch, initialized with the noised word embeddings. Through the re-training process, some of the noise is compensated for, while the rest is exploited to learn better representations. As a result, the word representations become further fine-tuned and specialized to the task. On six text classification tasks, our method improves model performance by a large margin. When GraVeR is combined with other regularization techniques, it yields further improvements. Lastly, we investigate the usefulness of GraVeR.¹

1. Introduction

Most machine learning methodologies can be formulated as obtaining computational representations of real-life objects (e.g., images, language, and sound) and then deriving high-level representations through model architectures. Accordingly, there have been two main approaches to improving model performance: (1) starting with better representations (Melamud et al., 2016; Peters et al., 2018), and (2) building more sophisticated architectures that extract important features and generate higher-level representations (Vaswani et al., 2017; Conneau et al., 2017).

For better initial representations, many NLP researchers use pretrained word vectors trained on substantially large corpora with unsupervised algorithms such as word2vec (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2016). Pretrained word vectors capture the general meanings of words and improve model performance on most NLP tasks (Turian et al., 2010). In addition to these algorithms, word vector post-processing research (Faruqui et al., 2015; Vulić et al., 2017; Mrkšić et al., 2017; Jo & Choi, 2018) has attempted to enrich pretrained representations using external resources. These works simply modify the values of the vector representations, yet show improved performance, which implies that further improvement can be obtained through better initial representations.

When training NLP models, we first initialize word representations with pretrained word vectors and then update both the model parameters and the word representations. However, in the training process, model performance can be limited by the initial word vectors. For example, pretrained word representations encode the general meanings of words, but in some tasks the words may not be used in their general sense. Although this gap in meaning could in principle be closed during training, the training process may fail to do so.
Since pretrained representations are trained on a huge dataset with objective functions based on language modeling, the word vectors are naturally biased toward general and frequently used meanings. Moreover, because the word vectors are updated by gradient descent, their values change only slightly at each step, so they easily converge to local minima. Our method therefore starts from a simple idea: use the word representations fine-tuned by one training run as the pretrained word vectors for the next re-training run, so that the word vectors can learn representations more appropriate to the task. However, the model would then overfit, and the word representations would remain stuck in local minima. Thus, we add random noise and bias to the word representations before each re-training run, in order to prevent the model from overfitting and to move the word representations away from the local minima.
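The iterate-noise-retrain loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `train_from_scratch` stand-in, the Gaussian noise, the fixed `noise_scale`, and the validation-score bookkeeping are all assumptions for exposition (the actual method also masks which dimensions receive noise, and "gradual" refers to a schedule not modeled here).

```python
import numpy as np

rng = np.random.default_rng(0)

def train_from_scratch(emb_init):
    """Hypothetical stand-in for a full training run: re-initializes the
    model, fine-tunes the given embeddings, and returns the updated
    embeddings together with a validation score."""
    emb = emb_init + 0.01 * rng.standard_normal(emb_init.shape)
    score = -np.abs(emb).mean()  # toy metric in place of validation accuracy
    return emb, score

def graver(pretrained_emb, rounds=3, noise_scale=0.1):
    """Iteratively re-train, keeping the best embeddings seen so far and
    perturbing the embeddings with random noise between rounds."""
    emb = pretrained_emb
    best_emb, best_score = emb, -np.inf
    for _ in range(rounds):
        # Re-train the model from scratch, initialized with the
        # (possibly noised) embeddings from the previous round.
        emb, score = train_from_scratch(emb)
        if score > best_score:
            best_emb, best_score = emb, score
        # Add random noise before the next round to avoid overfitting
        # and to push the embeddings away from local minima.
        emb = emb + noise_scale * rng.standard_normal(emb.shape)
    return best_emb, best_score
```

In practice, the early-stopped checkpoint of each round would supply `emb` for the next, and the loop would terminate when validation performance stops improving.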



¹ http://github.com/Sweetblueday/GraVeR

