Ruminating Word Representations with Random Noise Masking

Abstract

We introduce a training method for better word representations and performance, which we call GraVeR (Gradual Vector Rumination). The method gradually and iteratively adds random noise and bias to word embeddings after training a model, and then re-trains the model from scratch, initialized with the noised word embeddings. Through the re-training process, some of the noise can be compensated for, while the rest can be utilized to learn better representations. As a result, we obtain word representations that are further fine-tuned and specialized to the task. On six text classification tasks, our method improves model performance by a large margin. When GraVeR is combined with other regularization techniques, it shows further improvements. Lastly, we investigate the usefulness of GraVeR¹.

1. Introduction

Most machine learning methodologies can be formulated as obtaining computational representations from real-life objects (e.g., images, language, and sound) and then deriving high-level representations using model architectures. Therefore, there have been two main approaches to improving model performance: (1) starting with better representations (Melamud et al., 2016; Peters et al., 2018), and (2) building more sophisticated architectures that can extract important features and generate higher-level representations (Vaswani et al., 2017; Conneau et al., 2017).

For better initial representations, many NLP researchers have used pretrained word vectors trained on substantially large corpora through unsupervised algorithms, such as word2vec (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2016). The pretrained word vectors represent the general meaning of words and increase model performance on most NLP tasks (Turian et al., 2010). In addition to these algorithms, word vector post-processing research (Faruqui et al., 2015; Vulić et al., 2017; Mrkšić et al., 2017; Jo & Choi, 2018) has attempted to enrich the pretrained representations using external resources. These works simply modified the values of the vector representations in some way and showed improved performance, which implies that further improvement can be obtained through better initial representations.

When training NLP models, we first initialize word representations with pretrained word vectors and then update both the model parameters and the word representations. However, in the training process, the model performance can be limited by the initial word vectors. For example, pretrained word representations carry the general meanings of words, but in some tasks the words might not be used in their general sense. Although this gap in meaning can be learned through the training process, training could fail to close it.
Since the pretrained representations are trained on a huge dataset and their objective functions are based on language modeling, the word vectors are naturally biased toward general and frequently used meanings. Besides, the word vectors are updated through gradient descent algorithms, so their values change only slightly; the word vectors are thus prone to converging to local minima. Therefore, our method starts with an idea: use the word representations fine-tuned by one training process as the pretrained word vectors for the next re-training process. The word vectors can then be trained to learn representations more appropriate to the task. However, the model would overfit, and the word representations would be stuck in local minima. Thus, we add random noise and bias to the word representations before each re-training process, in order to prevent the model from overfitting and to move the word representations away from the local minima.

In this paper, we propose a simple training framework to find better representations by adding random noise and bias to the word vectors during iterative training processes, which we call GraVeR (Gradual Vector Rumination). We expect the model to make good use of the re-training processes with noise, both for learning better representations and for model regularization.
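The iterative scheme described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: `train_model` is a hypothetical stand-in for one full training run, and the noise scale and iteration count are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_model(emb):
    """Hypothetical stand-in for one full training run from scratch.
    Returns the fine-tuned embeddings and a validation score."""
    tuned = emb + 0.01 * rng.standard_normal(emb.shape)  # pretend fine-tuning
    score = float(-np.abs(tuned).mean())                 # pretend metric
    return tuned, score

def graver_loop(pretrained, n_iters=3, noise_scale=0.1):
    """Sketch of the iterative scheme: train the model, add random noise
    to the fine-tuned embeddings, and re-train from scratch with them."""
    emb, best = pretrained.copy(), None
    for _ in range(n_iters):
        emb, score = train_model(emb)            # re-train from scratch
        if best is None or score > best:
            best = score
        # Perturb the fine-tuned embeddings before the next run to avoid
        # overfitting / getting stuck in local minima.
        emb = emb + noise_scale * rng.standard_normal(emb.shape)
    return emb, best

vocab_size, dim = 100, 8
pretrained = rng.standard_normal((vocab_size, dim))
final_emb, best_score = graver_loop(pretrained)
```

In the paper's framing, only the word embeddings survive between runs; all other model weights are re-initialized each iteration.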

2. Related Work

The representations fine-tuned by GraVeR can be considered pretrained representations produced by the previous training process. Also, GraVeR utilizes word-level noise, which is used for model regularization.

2.1. Pretrained Representations

Pretrained Embedding Vector is also called pretrained word representation. According to the distributional representation hypothesis (Mikolov et al., 2013b), pretrained embedding vectors are composed of pairs of (token, n-dimensional float vector). Unsupervised algorithms (e.g., word2vec (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2016)) learn the word vectors on substantial corpora to represent the general meanings of words. The pretrained embedding vectors are widely used to initialize the word vectors in models. Pretrained Embedding Model is suggested to get a deep representation of each word in context. Previous research (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2018) trained deep architecture models and then utilized the model weights to represent words by using the outputs of the models. Although recent advanced pretrained representations (Peters et al., 2018; Devlin et al., 2018) show good performances, we take the pretrained embedding vector approach because (1) re-training processes in the pretrained embedding models are very expensive and (2) we use word-level noise, whereas the embedding models use token-level embeddings combined with position embeddings.
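A minimal sketch of how pretrained embedding vectors initialize a model's word vectors: tokens found in the pretrained (token, vector) pairs take their pretrained vector, and the rest are randomly initialized. The toy vocabulary and 4-dimensional vectors are invented for illustration.

```python
import numpy as np

# Toy pretrained vectors: pairs of (token, n-dimensional float vector).
pretrained = {
    "movie": np.array([0.1, -0.2, 0.3, 0.0]),
    "great": np.array([0.5, 0.1, -0.1, 0.2]),
}
vocab = ["<unk>", "movie", "great", "boring"]
dim = 4

rng = np.random.default_rng(0)
# Small random initialization for words without a pretrained vector.
emb = rng.uniform(-0.05, 0.05, size=(len(vocab), dim))
for i, tok in enumerate(vocab):
    if tok in pretrained:
        emb[i] = pretrained[tok]  # overwrite with the pretrained vector
```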

2.2. Word-level Noises

Adding noise to input data is an old idea (Plaut et al., 1986). However, only a small number of studies have examined word-level noise, since noise on words can distort their meaning.

Word Dropping. NLP tasks that use text in the form of sentences and phrases treat each word as a feature. However, too many features can lead models to overfit the training data due to the curse of dimensionality. Therefore, the easiest way to reduce the number of features is to drop words in the sentence at random.

Word Embedding Perturbation. Miyato et al. (2016) perturbed word vectors and used them in an adversarial training framework for model regularization. Cheng et al. (2018) utilized such noise to build a robust machine translation model. Also, Zhang & Yang (2018) considered the perturbation a data augmentation method. These previous works added noise to all word embeddings, which can regularize the model weights, but they ignored the change in word representations and its re-usability. In contrast, our method gradually adds noise to word embeddings by controlling the amount of noise. Moreover, the iterative training processes, which re-use fine-tuned word representations as pretrained word vectors for the next training process, can benefit from the noise and produce better word representations.
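The two kinds of word-level noise discussed above can be sketched as follows; the function names and noise rates are illustrative choices, not from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def word_drop(tokens, p=0.1):
    """Word dropping: remove each word from the sentence with probability p."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else tokens  # never return an empty sentence

def perturb_embeddings(emb, scale=0.01):
    """Word embedding perturbation: add Gaussian noise to every word vector."""
    return emb + scale * rng.standard_normal(emb.shape)

sent = ["this", "movie", "was", "surprisingly", "great"]
noisy_sent = word_drop(sent, p=0.2)

emb = rng.standard_normal((10, 4))   # toy embedding matrix (vocab x dim)
noisy_emb = perturb_embeddings(emb)
```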

2.3. Regularization Techniques

Some research has shown that normalization can be used for model regularization (van Laarhoven, 2017; Luo et al., 2018; Hoffer et al., 2018).

Dropout (Srivastava et al., 2014) is applied to neural network models by masking random neurons with 0. Dropout randomly and temporarily removes neural activations during training, so the masked weights are not updated. As a result, the model is prevented from over-tuning on specific features, which provides regularization.

Batch Normalization (BN) (Ioffe & Szegedy, 2015) normalizes the features according to mini-batch statistics. Batch normalization helps the features avoid internal covariate shift, in which the weight gradients are highly dependent on the gradients of the previous layers. Besides, batch normalization speeds up training.
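A minimal sketch of dropout as described above, assuming NumPy. This uses the common "inverted" formulation, in which surviving activations are rescaled by 1/(1-p) during training so that no rescaling is needed at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each activation with probability p during
    training and rescale the survivors so the expected value is unchanged."""
    if not training or p == 0.0:
        return x  # identity at test time
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

acts = np.ones((2, 6))
out = dropout(acts, p=0.5)        # entries are either 0.0 or 2.0
test_out = dropout(acts, p=0.5, training=False)  # unchanged
```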



http://github.com/Sweetblueday/GraVeR




