NEURAL EMBEDDINGS FOR TEXT

Abstract

We propose a new kind of embedding for natural language text that deeply represents semantic meaning. Standard text embeddings use the outputs from hidden layers of a pretrained language model. In our method, we let a language model learn from the text and then literally pick its brain, taking the actual weights of the model's neurons to generate a vector. We call this representation of the text a neural embedding. We confirm the ability of this representation to reflect the semantics of the text by analyzing its behavior on several datasets and by comparing neural embeddings with state-of-the-art sentence embeddings.

1. INTRODUCTION

Capturing the semantic meaning of text as a vector is a fundamental challenge for natural language processing (NLP) and an area of active research (Giorgi et al., 2021; Zhang et al., 2020; Gao et al., 2021; Huang et al., 2021; Yan et al., 2021; Zhang et al., 2021; Muennighoff, 2022; Alexander Liu, 2022; Chuang et al., 2022). Recent work has focused on fine-tuning pretrained language models with contrastive learning, either supervised (e.g. Reimers & Gurevych (2019); Zhang et al. (2021); Yan et al. (2021)) or unsupervised (e.g. Giorgi et al. (2021); Gao et al. (2021)). In these approaches, the embedding is generated by pooling the outputs of certain layers of the model as it processes a text.

Motivated by the need for deeper semantic representations of text, we propose a different kind of embedding. We allow a language model to fine-tune on a text input, and then measure the resulting changes to the model's own neuronal weights as a neural embedding. We demonstrate that neural embeddings do indeed represent the semantic differences between samples of text. We evaluate neural embeddings on several datasets and compare them with several state-of-the-art sentence embeddings. We observe that neural embeddings correlate better specifically with semantics, while being comparable in other evaluations, and that they behave differently from the known embeddings we considered.

Our contributions:
1. We propose a new kind of text representation: neural embeddings (code in the supplementary material) (Section 2).
2. We evaluate the embeddings on several datasets and by several criteria (Section 3). We show that by these criteria the neural embeddings (1) correlate better with semantic similarity and consistency, and (2) differ strongly from known embeddings in the errors they make and in how they represent the qualities of the text.

2. NEURAL EMBEDDING METHOD

The technique for generating neural embeddings uses micro-tuning, first introduced for the BLANC-tune method of document summary quality evaluation (Vasilyev et al., 2020; https://github.com/PrimerAI/blanc). Micro-tuning is tuning on a single sample only, and the tuned model is used for that sample only. Tuning a pretrained model on a specific narrow domain is a common practice to improve performance; micro-tuning takes this to the extreme, narrowing the 'domain' down to a 'dataset' consisting of just one sample.

For each text sample, we start with the original language model and fine-tune only a few selected layers L_0, L_1, ..., L_m while keeping all other layers frozen. Once the fine-tuning on the text sample is complete, we measure the difference between the new weights W'_j and the original weights W_j of each layer L_j and normalize the resulting vector. We obtain the neural embedding of the text by concatenating the normalized vectors:

E = E_c / |E_c|,    E_c = ∥_{j=0}^{m} (W'_j − W_j) / |W'_j − W_j|    (1)

Here the symbol ∥ means concatenation, i.e. ∥_{j=0}^{m} a_j is the concatenation of a_0, a_1, ..., a_m. A schematic illustration is in Figure 1. For clarity, the algorithm is shown in Appendix A, Figure 8. For example, if we select three layers from the standard BERT base model, and each selected layer has 768 weights, then the resulting embedding has size 768 * 3 = 2304. Before entering Equation 1, the weights of each layer are flattened from their possibly multidimensional tensor form.

Throughout this paper we use the pretrained transformer model bert-base-uncased (Devlin et al., 2019) from the transformers library (Wolf et al., 2020). We found that layers either from the top of the model or from the last transformer block of the model perform best. In the next section we use the following selection (see Appendix A), in the notation of the huggingface transformers library (https://github.com/huggingface/transformers):
1. L_0 = cls.predictions.transform.LayerNorm.weight
2. L_1 = cls.predictions.transform.LayerNorm.bias
3. L_2 = cls.predictions.transform.dense.bias

The micro-tuning task is similar to the model's pretraining objective, the masked token task. In order to keep the tuning to a few epochs with few masking combinations, and to avoid the randomness of masking, we chose a periodic masking strategy, in which the masking (or absence of masking) repeats at every P-th token. Consider the masking blueprint (k, m), with period P = k + m. To obtain an input from the text, we keep the first k tokens of the text and mask the next m tokens, repeating this pattern to the end of the text. We then generate our second input by shifting the pattern by 1 token, then by 2 tokens, and so on up to k + m − 1 tokens. Moreover, we can create inputs from not one but several blueprints: [(k_1, m_1), ..., (k_n, m_n)]. For clarity, this algorithm for creating inputs is shown in Appendix A, Figure 10.

All the inputs, randomly shuffled, constitute a 'dataset' for the micro-tuning. In our evaluations we use 10 epochs of micro-tuning with learning rate 0.01 and a mix of the simplest masking blueprints: [(2, 1), (1, 1), (1, 2), (1, 3)], which results in a single batch of no more than 12 inputs for any text fitting into the model's maximal input size. See Appendix B about the processing time and the factors that reduce it. Our ablation study, removing one layer at a time (Appendix C.1) and one masking blueprint at a time (Appendix C.2), shows the relative importance of the layers and the blueprints.
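The construction in Equation 1 can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the original and micro-tuned layer weights are already available as NumPy arrays, and the function name `neural_embedding` is our own.

```python
import numpy as np

def neural_embedding(original_weights, tuned_weights):
    """Sketch of Equation 1: per-layer normalized weight deltas,
    concatenated and then normalized as a whole."""
    parts = []
    for W, W_new in zip(original_weights, tuned_weights):
        # Flatten possibly multidimensional weight tensors before Eq. 1.
        delta = (np.asarray(W_new) - np.asarray(W)).ravel()
        # Per-layer normalization: (W'_j - W_j) / |W'_j - W_j|.
        parts.append(delta / np.linalg.norm(delta))
    # Concatenate the normalized per-layer vectors (the symbol "||" in Eq. 1)
    # and normalize the result: E = E_c / |E_c|.
    E_c = np.concatenate(parts)
    return E_c / np.linalg.norm(E_c)
```

With three selected layers of 768 weights each, as in the layer selection above, the returned vector has size 768 * 3 = 2304 and unit length.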
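The periodic masking strategy can likewise be sketched as follows. This is an illustrative reconstruction, not the code from Appendix A: the `"[MASK]"` token string, the function names, and the omission of tokenization and shuffling are our own simplifications.

```python
MASK = "[MASK]"

def blueprint_inputs(tokens, k, m):
    """For blueprint (k, m) with period P = k + m: keep k tokens, mask the
    next m tokens, repeat to the end of the text; produce one input per
    shift of the pattern, for shifts 0 .. P-1."""
    P = k + m
    inputs = []
    for shift in range(P):
        masked = [
            tok if (i + shift) % P < k else MASK
            for i, tok in enumerate(tokens)
        ]
        inputs.append(masked)
    return inputs

def micro_tuning_dataset(tokens, blueprints=((2, 1), (1, 1), (1, 2), (1, 3))):
    """Combine inputs from several blueprints; with the default mix this
    yields at most 3 + 2 + 3 + 4 = 12 inputs (shuffling omitted here)."""
    batch = []
    for k, m in blueprints:
        batch.extend(blueprint_inputs(tokens, k, m))
    return batch
```

For example, blueprint (2, 1) applied to the tokens ["a", "b", "c", "d"] with shift 0 keeps two tokens, masks one, and repeats, giving ["a", "b", "[MASK]", "d"].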







Figure 1: Illustration comparing the usual output embeddings (left) and neural embeddings (right). Output embeddings are aggregated from the outputs of certain layers at inference; neural embeddings are taken from the weights of certain layers after micro-tuning.

