NEURAL EMBEDDINGS FOR TEXT

Abstract

We propose a new kind of embedding for natural language text that deeply represents semantic meaning. Standard text embeddings use the outputs from hidden layers of a pretrained language model. In our method, we let a language model learn from the text and then literally pick its brain, taking the actual weights of the model's neurons to generate a vector. We call this representation of the text a neural embedding. We confirm that this representation reflects the semantics of the text by analyzing its behavior on several datasets and by comparing neural embeddings with state-of-the-art sentence embeddings.

1. INTRODUCTION

Capturing the semantic meaning of text as a vector is a fundamental challenge for natural language processing (NLP) and an area of active research (Giorgi et al., 2021; Zhang et al., 2020; Gao et al., 2021; Huang et al., 2021; Yan et al., 2021; Zhang et al., 2021; Muennighoff, 2022; Alexander Liu, 2022; Chuang et al., 2022). Recent work has focused on fine-tuning pretrained language models with contrastive learning, either supervised (e.g. Reimers & Gurevych (2019); Zhang et al. (2021); Yan et al. (2021)) or unsupervised (e.g. Giorgi et al. (2021); Gao et al. (2021)). The embedding is generated by pooling the outputs of certain layers of the model as it processes a text. Motivated by the need for deeper semantic representations of text, we propose a different kind of embedding. We allow a language model to fine-tune on a text input, and then measure the resulting changes to the model's own neuronal weights as a neural embedding. We demonstrate that neural embeddings do indeed represent the semantic differences between samples of text. We evaluate neural embeddings on several datasets and compare them with several state-of-the-art sentence embeddings. We observe that neural embeddings correlate better specifically with semantics, while being comparable in other evaluations. We also find that neural embeddings behave differently from the known embeddings we considered.

Our contributions:
1. We propose a new kind of text representation: neural embeddings (Section 2).
2. We evaluate embeddings on several datasets using several criteria (Section 3). We show that by these criteria neural embeddings (1) correlate better with semantic similarity and consistency, and (2) differ strongly from the known embeddings both in the errors they make and in how they represent the qualities of the text.

2. NEURAL EMBEDDING METHOD

The technique for generating neural embeddings uses micro-tuning, first introduced for the BLANC-tune method of document summary quality evaluation (Vasilyev et al., 2020). Micro-tuning is tuning on a single sample, with the tuned model used only for that sample. Tuning a pretrained model on a specific narrow domain is a common practice for improving performance; micro-tuning takes this to the extreme, narrowing the 'domain' down to a 'dataset' consisting of just one sample. For each text sample, we start with the original language model and fine-tune only a few selected layers L_0, L_1, ..., L_m while keeping all other layers frozen. Once the fine-tuning on the text sample is complete, we measure the difference between the new weights W'_j and the original weights W_j of each layer L_j, normalize each difference, and obtain the neural embedding of the text by concatenating the normalized vectors:

E = E_c / |E_c|,   E_c = ∥_{j=0}^{m} (W'_j - W_j) / |W'_j - W_j|   (1)

Here the symbol ∥ means concatenation, i.e. ∥_{j=0}^{m} a_j is the concatenation of a_0, a_1, ..., a_m. Schematically, our method is illustrated in Figure 1. For clarity, the algorithm is shown in Appendix A, Figure 8. For example, if we select three layers from the standard BERT base model, and each selected layer has 768 weights, then the resulting embedding has size 768 * 3 = 2304. Before entering Equation 1, the weights of each layer are flattened from their possibly multidimensional tensor form. Throughout this paper we use the pretrained transformer (Devlin et al., 2019) model bert-base-uncased from the transformers library (Wolf et al., 2020). We found that layers either from the top of the model or from the last transformer block of the model perform best. In the next section we use the following selection (see Appendix A), in the notation of the huggingface transformers library:
1. L_0 = cls.predictions.transform.LayerNorm.weight
2. L_1 = cls.predictions.transform.LayerNorm.bias
3. L_2 = cls.predictions.transform.dense.bias

The micro-tuning task is similar to the model's pretraining objective, the masked token task. In order to keep the tuning to a few epochs with few masking combinations, and to avoid the randomness of masking, we chose a periodic masking strategy, in which the masking (or absence of masking) repeats every P-th token. Consider the masking blueprint (k, m), with period P = k + m. To obtain an input from the text, we keep the first k tokens of the text and mask the next m tokens, repeating this pattern to the end of the text. We then generate our second input by shifting the pattern by 1 token, followed by shifting by 2 tokens, and so on up to k + m - 1 tokens. Moreover, we can create inputs from not one but several blueprints: [(k_1, m_1), ..., (k_n, m_n)]. For clarity, this algorithm for creating inputs is shown in Appendix A, Figure 10. All the inputs, randomly shuffled, constitute the 'dataset' for the micro-tuning. In our evaluations we use 10 epochs of micro-tuning with learning rate 0.01 and a mix of the simplest masking blueprints: [(2, 1), (1, 1), (1, 2), (1, 3)], which results in a single batch of no more than 12 inputs for any text fitting into the model's maximal input size. See Appendix B about the processing time and the factors that reduce it. Our ablation study, removing one layer (Appendix C.1) or one masking blueprint (Appendix C.2), shows the relative importance of the layers and the blueprints.
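To make the periodic masking concrete, the input-generation step can be sketched as follows. This is a minimal illustration, not the paper's implementation (which is given in Appendix A, Figure 10); the helper name `blueprint_masks` and the convention that shifting moves the pattern to the right are assumptions.

```python
def blueprint_masks(n_tokens, blueprints):
    """Boolean masks for micro-tuning inputs: for each blueprint (k, m)
    with period P = k + m, keep k tokens, mask the next m, repeat the
    pattern, then shift it by 1, 2, ..., P - 1 tokens."""
    masks = []
    for k, m in blueprints:
        P = k + m
        for shift in range(P):  # shift 0 is the unshifted pattern
            # position i is masked when it lands in the masked part of the period
            masks.append([((i - shift) % P) >= k for i in range(n_tokens)])
    return masks

# The blueprints used in the paper give 3 + 2 + 3 + 4 = 12 inputs:
masks = blueprint_masks(6, [(2, 1), (1, 1), (1, 2), (1, 3)])
```

Each blueprint contributes P inputs, so the mix [(2, 1), (1, 1), (1, 2), (1, 3)] yields at most 12 inputs, matching the single-batch bound stated above.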

3. EVALUATIONS

The goal of this section is to demonstrate that neural embeddings do capture the semantic differences between texts, and that on average texts with similar meanings lie close to each other in the neural embedding space. No embedding can be ideal for all purposes; here we attempt to observe the behavior of neural embeddings on different data and from different points of view. We also compare neural embeddings with the following existing sentence embeddings:
1. all-mpnet-base-v2 and all-MiniLM-L6-v2
2. LaBSE (Feng et al., 2022)
3. SGPT-125M and SGPT-5.8B (Muennighoff, 2022)

3.1. CORRELATIONS WITH SEMANTIC, LEXICAL AND SYNTACTIC SIMILARITIES

In this section we review how neural embeddings correlate with the semantic, lexical and syntactic similarities of paraphrases. The separation of these three qualities is possible by utilizing quality-controlled paraphrase generation, introduced in Bandel et al. (2022). We generate paraphrases with different degrees of similarity to the original phrase. By changing one generation control parameter (through the values 0.05, 0.10, 0.15, ..., 0.95) while keeping the other two fixed (with values selected from 0.1, 0.3, 0.5, 0.7, 0.9), we vary one specific similarity (semantic, lexical or syntactic) while keeping the other two fixed. As our original phrases we take the first sentences from the first 100 pairs of phrases from the train dataset of MRPC, the Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005). Our embedding-based similarity score of a generated phrase is simply the scalar product of the neural embeddings of the generated phrase and of the original phrase. Appendix D contains more detail. The correlations obtained for the generated phrases are shown in Figure 2. Each cell value in the figure represents a correlation between the embedding-based score and the generation 'score' (the control parameter value of the generation). For example, the top right corner of the first (left) heatmap in Figure 2 has a dark-blue (high) correlation value; in this case the semantic and syntactic similarities were kept constant at the high value 0.9, while the lexical similarity was varied. All the correlations are positive, and are especially strong for the lexical and semantic qualities. The correlations are also positive for the existing embeddings that we selected for comparison. Since all the correlations are positive, we can visualize and compare neural embeddings against the usual embeddings by the ratio of the neural embedding correlations to the usual embedding correlations.
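The statistic behind each heatmap cell, and the ratio used in the comparison figures, can be sketched as follows. This is a hand-rolled Kendall tau-c (the paper computes Kendall Tau, variant c; a library routine such as scipy's `kendalltau(..., variant='c')` could be used instead), and the score lists are toy stand-ins, not real evaluation data.

```python
def kendall_tau_c(x, y):
    """Stuart's tau-c: 2*m*(C - D) / (n^2 * (m - 1)), where C and D are
    the concordant and discordant pair counts and
    m = min(#distinct values of x, #distinct values of y)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    m = min(len(set(x)), len(set(y)))
    return 2 * m * (concordant - discordant) / (n * n * (m - 1))

# One heatmap cell: correlation of the embedding-based score with the varied
# control parameter. A comparison-figure cell is the ratio of two such cells.
control = [0.05, 0.25, 0.45, 0.65, 0.85]        # varied generation parameter
neural_scores = [0.31, 0.42, 0.55, 0.61, 0.78]  # toy embedding-based scores
other_scores = [0.40, 0.38, 0.52, 0.49, 0.66]
ratio = kendall_tau_c(control, neural_scores) / kendall_tau_c(control, other_scores)
```

A ratio above 1 in a cell means the neural embedding tracks the varied similarity more closely than the embedding it is compared against.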
Figures 3, 4, 5, 6 and 7 show such comparisons with several known embeddings, with the scale intentionally clipped to the same interval (0.5, 1.5). According to these figures, the embedding-based scores that use neural embeddings correlate better with semantic similarity and with lexical similarity. Other embeddings work better for syntactic similarity. A stronger correlation with semantic similarity is unambiguously good for an embedding. Lexical and syntactic similarities are supposedly (at least approximately) untangled from semantics, so a stronger correlation with them may be considered a desired or undesired feature, depending on downstream goals. MRPC sentences are typically long. In Appendix E we repeat our evaluation on short sentences, observing even better performance of neural embeddings in comparison to the other embeddings.

4 https://www.sbert.net/docs/pretrained models.html
5 https://huggingface.co/sentence-transformers/all-mpnet-base-v2
6 https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
7 https://tfhub.dev/google/LaBSE/2
8 https://huggingface.co/sentence-transformers/LaBSE
9 https://github.com/Muennighoff/sgpt
10 https://huggingface.co/Muennighoff/SGPT-125M-weightedmean-nli-bitfit
11 https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-nli-bitfit
12 https://github.com/IBM/quality-controlled-paraphrase-generation
13 https://huggingface.co/datasets/glue/viewer/mrpc/train

3.2. TRIPLETS ANCHOR-POSITIVE-NEGATIVE

If we compare the semantic similarity of texts B and C with regard to text A, and if B is semantically closer to A, then we would want the scalar product of the embeddings of B and A to be higher than the product of the embeddings of C and A:

sim(A, B) > sim(A, C) <=> (E_A * E_B) > (E_A * E_C)   (2)

In this section we take triplets of texts with the similarity sim satisfying sim(A, B) > sim(A, C) from several datasets, and count the fraction of triplets for which Equation 2 is not satisfied. We can get the triplets (A, B, C) from any dataset of texts grouped by semantic similarity, so that semantically similar texts belong to the same group. Such a selection of triplets was used for grouped images in Wang et al. (2014); Hoffer & Ailon (2015) and in related computer vision works. Selecting 'anchor' A and 'positive' B from the same group, and 'negative' C from any other group, provides a test: we should verify (E_A * E_B) > (E_A * E_C). With a dataset in hand, we create all possible triplets and then count the fraction of triplets for which this condition is not satisfied. The datasets we use, listed with the short notations of Tables 1 and 2:
1. mrpc: Composed of all sentence pairs that were annotated as similar in the 'train' dataset of MRPC, the Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005). There are 2474 such pairs, hence we have 2474 groups, each group consisting of two texts (sentences).
2. sts: Composed of all sentence pairs that were annotated with a similarity score of at least 4 (on the scale 1-5) in the STS 'test' (English language) dataset (Cer et al., 2017; May, 2021). There are 338 such pairs, hence we have 338 groups, each group consisting of two texts (sentences).
The samples are selected as the first 500 classes (by labels), with the first 10 texts selected for each class.
6. cestwc/q: Composed as cestwc/t, but using the subset 'quora'.
7. cestwc/c: Composed as cestwc/t, but using the subset 'coco'.
8. summ-g: 1700 generated summaries of the SummEval dataset (Fabbri et al., 2021). Each summary is labeled by the document for which it was generated; there are 100 documents, with 17 generated summaries for each document.
9. summ-r: 1100 human-written summaries of the SummEval dataset. Each summary is labeled by the document for which it was generated; there are 11 reference summaries for each document.
10. text-summ-g: Composed by adding the texts of the documents to the dataset summ-g, with the restriction that only the document texts serve as A, and only the summaries serve as B and C.
11. text-summ-r: Composed as text-summ-g, but with the human-written reference summaries instead of the generated summaries.
The above datasets are diverse: besides paraphrased sentences, there are paraphrased questions (cestwc/q), sentence simplifications (asset), related summaries created from the same text (summ-g and summ-r), and related text-summary pairs (text-summ-g and text-summ-r). In order to have a manageable number of triplets, for some known datasets we took subsets or only parts of them, as specified above. Tables 1 and 2 contain the results of our evaluation. The result for 'text-summ-g' is uniquely sensitive to the random seed: the count wrong = 44 can shift by up to 10 units up or down. Here we considered normalized embeddings (or, equivalently, cosine as the measure of similarity). Using unnormalized embeddings from the models all-mpnet-base-v2, all-MiniLM-L6-v2 and LaBSE does not make a meaningful difference (Appendix F). The embedding column shows the kind of embeddings used. Neural embeddings are obtained using the bert-base-uncased model, the layers listed in Section 2, and micro-tuning with the blueprints [(2, 1), (1, 1), (1, 2), (1, 3)], 10 epochs and learning rate 0.01, as described in Section 2. The error column shows the fraction of 'broken' triplets (A, B, C), i.e. the triplets for which Equation 2 is not satisfied.
The intersect column I shows the fraction of the broken triplets that the considered embeddings have in common with the neural embeddings:

I = |T_emb ∩ T_neural| / min(|T_emb|, |T_neural|)   (3)

Here I is the intersect, T_emb is the set of triplets broken by the considered embeddings, and T_neural is the set of triplets broken by the neural embeddings. The intersect is not listed for the neural embeddings, for which it equals 1. The 'same' column shows the average value of the scalar product (E_A * E_B) over all the triplets, i.e. the average product between embeddings of the texts that were selected as semantically close (same group). Similarly, the 'diff' column shows the average value of the scalar product (E_A * E_C) over all the triplets, i.e. the average product between embeddings of the texts that were selected as semantically different (different groups). In most datasets the neural embeddings and the known embeddings behave differently. Their errors are largely disjoint, with the overlap I below 50% for many datasets in Tables 1 and 2. Another difference between neural and other embeddings is how they spread out in the embedding space. The average product of embeddings is higher for all known embeddings than for neural embeddings. This is true both for semantically similar and for semantically different texts; see the 'same' and 'diff' columns. The datasets containing pairs of sentences deemed not similar provide another way of evaluation: any similar pair must have a higher product of embeddings than any non-similar pair:

sim(A, B) > sim(C, D) <=> (E_A * E_B) > (E_C * E_D)

Table 3 contains the results of this evaluation.
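The triplet test and the intersect statistic can be sketched as follows. This is a minimal illustration under assumed data structures (a list of groups of text ids and a dict of normalized embedding vectors); the helper names are hypothetical, not the paper's code.

```python
import numpy as np
from itertools import combinations

def broken_triplets(groups, emb):
    """groups: list of lists of text ids (one list per semantic group).
    emb: dict mapping text id -> normalized embedding vector.
    Returns the set of (A, B, C) triplets that violate Equation 2,
    i.e. for which (E_A . E_B) <= (E_A . E_C)."""
    broken = set()
    for gi, group in enumerate(groups):
        negatives = [c for gj, g in enumerate(groups) if gj != gi for c in g]
        for a, b in combinations(group, 2):
            for A, B in ((a, b), (b, a)):  # each of the pair can serve as anchor
                for C in negatives:
                    if emb[A] @ emb[B] <= emb[A] @ emb[C]:
                        broken.add((A, B, C))
    return broken

def intersect(t_emb, t_neural):
    """Equation 3: fraction of broken triplets shared with neural embeddings."""
    return len(t_emb & t_neural) / min(len(t_emb), len(t_neural))
```

The error fraction reported in the tables would then be the size of the broken set divided by the total number of triplets.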

3.4. CORRELATIONS WITH HUMAN SCORES

If a dataset contains human-scored pairs of texts or sentences, with the score reflecting at least in some sense a similarity between the texts, then the product of embeddings should correlate with the score. In Table 4 we show examples of such correlations. SummEval (Fabbri et al. (2021)) has each of its 1700 summaries, generated by 17 models for 100 documents, scored by 3 experts. Each summary is scored for coherence, consistency, fluency and relevance (we use the score averaged over the 3 expert scores). Neural embeddings are the best in correlations with consistency, which is the quality that is probably closest to semantics.

4. CONCLUSION

In this paper we introduced neural embeddings, a new method for representing text as vectors. To our knowledge, this is the first attempt to obtain representations in this manner. In Appendix G we list the factors by which the construction of neural embeddings differs from that of the usual embeddings; based on the observations presented there, we conclude that micro-tuning is the most important factor. Our observations in Section 3.1 and in Appendix E show that neural embeddings place more emphasis on semantic and lexical qualities, and less on syntactic qualities. We also observed throughout Section 3 that neural embeddings behave distinctly differently from all the embeddings we included in our comparisons. As an illustration of the difference, in Appendix H we show results for a simple 'ensemble': the concatenation of neural and SGPT-125M embeddings. As expected, in many cases the concatenation outperforms them both.

A NEURAL EMBEDDINGS AND MICRO-TUNING

In Section 2 we defined the neural embedding of a text. For clarity, the pseudocode is shown in Figure 8. Taking the same layers from lower blocks worsens the neural embeddings' performance. In our evaluations we used the first choice; in Appendix C.1 we present the ablation results with one of the layers excluded. Figure 9 is an illustration of obtaining the inputs for a text of 6 tokens with the blueprints [(2, 1), (1, 1), (1, 2), (1, 3)]. There are only 12 inputs for any text fitting into the model's maximal input size, or even fewer than 12 inputs if the text length is less than 4 tokens. This fits into a single batch. For clarity, the pseudocode for obtaining inputs for micro-tuning is shown in Figure 10. In Appendix C.2 we present the ablation results with one of the blueprints excluded. Interestingly, if we do not mask tokens at all (but require prediction for all tokens), we still get reasonable embeddings, albeit with much lower 'quality' (if evaluated by any of the criteria of Section 3). The loss is still defined by cross entropy, no matter how easy the prediction task is, and for the same reason there is always a weight difference in Equation 1.
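The final step of the procedure in Figure 8, assembling Equation 1 from the weight differences, can be sketched in numpy. The weight lists here are stand-ins for the selected BERT layers before and after micro-tuning; obtaining them from the actual model is the (omitted) heavy part.

```python
import numpy as np

def neural_embedding(orig_weights, tuned_weights):
    """Equation 1: concatenate the normalized per-layer weight differences
    and normalize the concatenation."""
    parts = []
    for w, w_tuned in zip(orig_weights, tuned_weights):
        d = (np.asarray(w_tuned) - np.asarray(w)).ravel()  # flatten tensors
        parts.append(d / np.linalg.norm(d))
    e_c = np.concatenate(parts)
    return e_c / np.linalg.norm(e_c)
```

With the three 768-sized layers of Section 2, the result is a unit vector of size 768 * 3 = 2304, as stated there.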

B FASTER MICRO-TUNING

For ease of reproducibility, the neural embeddings can be obtained simply by using the BERT model as it is, with all layers frozen except the selected layers, as described in Section 2. But several simplifications make obtaining neural embeddings almost as fast as a simple forward run of BERT. Since the micro-tuning 'dataset' is extremely small, all the inputs can be kept on GPU through all epochs. More importantly, since all the layers selected for embeddings are located at the top of the model, there is no need to repeat the forward run through all the lower transformer blocks. The hidden states from the last block preceding the layers of interest can be obtained only once, at the first epoch, and then reused in all subsequent epochs. As a result, at each epoch except the first, the micro-tuning deals only with the top of the model (separated from the rest). Again, since the micro-tuning 'dataset' is extremely small, even these hidden states can be kept on GPU, unless we are dealing with a really long text of too many input windows. For example, using simply the whole BERT model (with 10 epochs), the average time for obtaining a neural embedding on a P100 GPU is 0.24 sec. for an MRPC sentence and 0.16 sec. for an STS sentence. With hidden states reused, these times become 0.096 sec. and 0.057 sec. Reducing the number of epochs offers another option: switching from 10 to 5 epochs further decreases these times almost by half (with evaluation results in most cases worse by only a few percent). However, this is still much slower than the 0.006 - 0.008 sec. of 'all-mpnet-base-v2', 'LaBSE' or 'SGPT-125M' (both for MRPC and STS).
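The hidden-state reuse can be illustrated with a deliberately tiny stand-in model: a fixed random 'trunk' plays the role of the frozen lower blocks, and a trainable linear 'head' plays the role of the micro-tuned top. None of this is the actual BERT code; the point is only that the trunk output is computed once and every later epoch touches only the cached hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
trunk = rng.normal(size=(32, 16)) * 0.1  # frozen lower blocks (fixed projection)
head = rng.normal(size=16)               # trainable top (the micro-tuned part)
x = rng.normal(size=(12, 32))            # the <= 12 micro-tuning inputs
y = rng.normal(size=12)                  # targets (stand-in for MLM labels)

# Epoch 1: forward through the frozen trunk once, then cache the hidden states.
hidden = x @ trunk
loss_start = np.mean((hidden @ head - y) ** 2)

# Remaining epochs reuse the cached hidden states; only the head is updated.
for epoch in range(10):
    residual = hidden @ head - y
    head -= 0.01 * (hidden.T @ residual) / len(y)  # gradient step on the head only
```

The saving is exactly the cost of recomputing `x @ trunk` at every epoch; in the real model that is the forward pass through all lower transformer blocks.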

C ABLATION: REMOVING LAYERS AND MASK-BLUEPRINTS C.1 ABLATION: LAYERS

In this section we present ablation study results for excluding one of the layers from the construction of the neural embedding (three layers were used in our evaluations, Section 3). We use the same micro-tuning as in Section 3, but with 5 (not 10) epochs. The neural embedding is compared in Tables 5, 6, 7 and 8 with the embedding versions Neural-L0, Neural-L1 and Neural-L2, where the suffix indicates the layer that was not used. Reminder, the layers are:
1. L_0 = cls.predictions.transform.LayerNorm.weight
2. L_1 = cls.predictions.transform.LayerNorm.bias
3. L_2 = cls.predictions.transform.dense.bias
From Tables 5, 6, 7 and 8 we can conclude that the layer L2 is the most important: the embeddings Neural-L2 never do well in the tables. The layer L0 is the least important and can perhaps be excluded, depending on the task: the embeddings Neural-L0 do well in many cases in the tables.

C.2 ABLATION: MASK-BLUEPRINTS

Here in Tables 9, 10, 11 and 12 we present the results of ablation by excluding masking blueprints from the construction of the neural embedding. Similarly to the ablation study of the layers, we compare the neural embeddings with the embedding versions Neural-B0, Neural-B1, Neural-B2 and Neural-B3, where the suffix indicates the blueprint that was not used. The conclusion from Tables 9, 10, 11 and 12 is simpler than from the ablation of the layers. The importance of the selected blueprints is not very different: excluding any of them leads to worse performance in most cases, yet the blueprint B1 may be less important than the others. The blueprint B1 is the simplest blueprint (1, 1): one token kept and one token masked.

D CONTROLLED-GENERATED PHRASES

In Section 3.1 we used control-generated phrases for comparing the lexical, syntactic and semantic representation of phrases by neural embeddings. We reviewed how neural embeddings correlate with the semantic, lexical and syntactic similarities of paraphrases. In this appendix we provide more detail. The separation of the three qualities (lexical, syntactic and semantic) is possible by utilizing quality-controlled paraphrase generation, introduced in Bandel et al. (2022). We generated paraphrases with different degrees of similarity to the original phrase. By changing one generation control parameter and keeping the other two fixed, we varied one specific similarity (semantic, lexical or syntactic) while keeping the other two fixed. Our embedding-based similarity score of a generated phrase is simply the scalar product of the neural embeddings of the generated phrase and of the original phrase. In Section 3.1 we calculated correlations between the embedding-based similarity and the generation control parameter. As our original phrases we took the first sentences from the first 100 pairs of phrases from the train dataset of MRPC, the Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005). Each evaluation is defined by the following:
1. The selected quality (semantic, lexical or syntactic) for which we generate phrases with different degrees of similarity, by varying the corresponding control parameter.
2. The fixed control parameters for the remaining two qualities. We have chosen our fixed control parameters from the values 0.1, 0.3, 0.5, 0.7, 0.9.
Thus, with 5 × 5 = 25 choices of the fixed control parameters, and with 3 choices of the quality to vary, we have 25 × 3 = 75 evaluations, i.e. 75 correlations to calculate. For each evaluation, we did the following:
1. For each phrase of our 100 original phrases, we generated new phrases with the selected control parameter taking the values 0.05, 0.10, 0.15, ..., 0.95 (and with the other two control parameters fixed). If increasing the selected control parameter by 0.05 to the next value does not change the generated phrase, the generated phrase is ignored.
2. We obtained the embeddings of the original and of the generated phrases.
3. For each generated phrase, the embedding-based score was calculated as the dot product between its embedding and the embedding of the corresponding original phrase.
4. A correlation (Kendall Tau, variant c) was calculated between the embedding-based score and the generation control parameter.
Since we ignored duplicate generated phrases while increasing the control parameter along the values 0.05, 0.10, 0.15, ..., 0.95, there are around 300 - 900 scored phrases for each evaluation. We show the number of generated phrases in Figures 11 and 12. For any choice of the fixed control parameters, the number of generated phrases is not less than 300. All the correlations shown in Section 3.1 have p-value far below 0.05.
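The generate-and-deduplicate loop of step 1 can be sketched as follows. The `generate` and `embed` callables are hypothetical stand-ins for the paraphrase generator and the embedding model, not the actual pipeline.

```python
def score_generated(original, generate, embed, values):
    """Walk the control parameter over `values`; ignore a generated phrase
    when it is unchanged from the previous value; score the rest by the
    dot product with the original phrase's embedding."""
    e_orig = embed(original)
    scored, prev = [], None
    for v in values:
        phrase = generate(original, v)
        if phrase == prev:  # duplicate of the previous value: ignore
            continue
        prev = phrase
        score = sum(a * b for a, b in zip(e_orig, embed(phrase)))
        scored.append((v, score))
    return scored
```

The returned (control value, score) pairs are what the Kendall Tau correlation of step 4 is computed over.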

E CORRELATION OF EMBEDDING-BASED SCORES WITH SIMILARITY OF SHORT CONTROL-GENERATED SENTENCES

In Section 3.1 we considered correlations of embedding-based scores with the similarity of sentences taken from the MRPC dataset. MRPC sentences are typically long. Here we repeat the same review for shorter sentences. As our original sentences we select from the STS test subset (Cer et al., 2017; May, 2021) the first 100 sentences of length between 20 and 40 characters, and generate the results in the same manner as in Section 3.1. The resulting correlations for neural embeddings are shown in Figure 13. The corresponding ratios of the correlations of the neural embedding-based score to the correlations based on other embeddings are shown in Figures 14, 15, 16, 17 and 18. Again, as in Section 3.1, neural embeddings have stronger correlations with semantic and lexical similarities, but now also, in comparison with some of the embeddings, stronger correlations with syntactic similarity.

F UNNORMALIZED SENTENCE EMBEDDINGS

In our evaluations in Section 3 we used normalized embeddings (hence the dot product equals the cosine). Here we point out that using unnormalized embeddings for the models all-mpnet-base-v2, all-MiniLM-L6-v2 and LaBSE does not much affect our observations. The count of errors (the column "wrong" in Table 1) changes by no more than 3 errors when these embeddings are unnormalized. The only difference for Table 3 is that the unnormalized embeddings of the model all-MiniLM-L6-v2 get the count "wrong" 736307 instead of 736308. And the correlations in Table 4 remain the same within the shown precision.

G NEURAL VS HIDDEN LAYER EMBEDDINGS

Neural embeddings, introduced in Section 2, differ from the commonly used 'hidden layer' embeddings in four factors or steps:
1. Micro-tuning: the model gets tuned on an extremely small 'dataset', produced from a single sample.
2. Measuring the difference: the embedding is not simply taken from a (micro-tuned) model, but obtained as a difference between the embeddings from the micro-tuned model and the original model.
3. Using weights: the embedding is taken not from the output of a hidden layer, but from the weights of the model.
4. Combining layers: the results from several layers are concatenated.
Are all these four factors necessary? And if yes, how important is each of them? In order to answer these questions, we compare here the neural embeddings with their simplified versions, stripped of one or more of the factors. These are the simplified versions:
1. cls: embedding picked up from the CLS token on the output of BERT's last transformer block (block 11).
2. avg: average of the embeddings picked up from all tokens of the text on the output of the last block.
3. cls-tuned: the same as 'cls', but obtained from the micro-tuned model.
4. avg-tuned: the same as 'avg', but obtained from the micro-tuned model.
5. cls-diff: obtained as the difference between 'cls-tuned' and 'cls'.
6. avg-diff: obtained as the difference between 'avg-tuned' and 'avg'.
7. neural-1: neural embedding using just one layer, specifically the layer (in huggingface transformers notation) 'bert.encoder.layer.11.output.dense.bias'.
For a fair comparison, here the neural embeddings are also taken from block 11 (not from the cls layers):
1. L_0 = bert.encoder.layer.11.output.LayerNorm.weight
2. L_1 = bert.encoder.layer.11.output.LayerNorm.bias
3. L_2 = bert.encoder.layer.11.output.dense.bias
Thus, in our simple collection of embeddings here, 'cls' and 'avg' represent the usual embeddings. 'cls-tuned' and 'avg-tuned' represent embeddings that include the first step toward neural embeddings: micro-tuning.
'cls-diff' and 'avg-diff' represent embeddings that include the second step toward neural embeddings: measuring the difference. Finally, 'neural-1' represents the neural embedding with a single layer. Notice that we selected the most valuable layer, according to the ablation study in Appendix C.1. The results of the evaluations are presented in Tables 13, 14, 15 and 16. Micro-tuning was done with the same parameters as throughout the evaluations of Section 3, but here we used 5 (not 10) epochs. The tables show results for the normalized versions of the embeddings.

Figure 18: Ratio of correlations of neural embeddings to correlations of SGPT-5.8B embeddings.

For the evaluations of Table 13, the unnormalized versions of the embeddings make fewer errors only in the following cases, marked yellow: for the dataset 'catasaurus', the embeddings 'cls-mtuned' make 57149 errors; for the dataset 'asset/t', the embeddings 'cls' make 1672558 errors, and the embeddings 'cls-mtuned' make 1232707 errors. For the evaluations of Tables 14 and 15, all unnormalized versions are worse (make more errors). In Table 16 we included the unnormalized versions of the embeddings (marked by the suffix '-u') in all cases when they do better than the normalized versions. Overall, from Tables 13, 14, 15 and 16 we can conclude that all the factors matter, and micro-tuning is probably the strongest, most important factor.
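The pooled variants above are ordinary operations on the last block's outputs, and the '-diff' variants subtract the original model's pooled output from the micro-tuned one. A sketch with a numpy stand-in for the hidden states (the helper names are hypothetical):

```python
import numpy as np

def pool_embeddings(hidden_states):
    """hidden_states: (n_tokens, dim) outputs of the last transformer block.
    'cls' takes the first (CLS) token; 'avg' averages over all tokens."""
    cls = hidden_states[0]
    avg = hidden_states.mean(axis=0)
    return cls, avg

def diff_embedding(tuned, original):
    """The 'cls-diff' / 'avg-diff' variants: normalized difference between
    the micro-tuned and the original model's pooled outputs."""
    d = tuned - original
    return d / np.linalg.norm(d)
```

Stacking these operations one by one mirrors the factor-by-factor comparison in Tables 13-16: pooling alone, pooling after micro-tuning, then the difference.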

H NEURAL AND SGPT EMBEDDINGS CONCATENATED

We observed in Section 3 that neural embeddings and SGPT embeddings behave differently: their errors do not overlap greatly, their average products have different ranges, and their correlations with text qualities are different. Here we offer an additional illustration of this point. We create a simple 'ensemble' by concatenating neural embeddings and SGPT embeddings into a single embedding. An ensemble is expected to perform better than either of its components if they perform not too differently but have different properties. Specifically, we used SGPT-125M embeddings here. In Table 17 we observe that this is indeed true. The exceptions are 'text-summ-g', where the neural embedding was orders of magnitude better, 'cestwc/c' with almost a tie, and, strangely, 'catasaurus'. In Table 18, for 'sts', SGPT was much better to start with. In Table 4 the neural and SGPT embeddings were almost equally good at correlating with relevance, and in that case the concatenation is much stronger than either of them. Interestingly, even though evaluation of summary quality was not the purpose of the embeddings, the neural embedding correlations for consistency scores exceed the correlations of all flavors of ROUGE (ROUGE-L, ROUGE-1, ROUGE-2 and ROUGE-3). Similarly, Neural+SGPT correlations are higher than all flavors of ROUGE for relevance, and SGPT correlations are higher for coherence.
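The ensemble itself is a one-liner; a minimal sketch (the function name is an assumption, not the paper's code):

```python
import numpy as np

def ensemble_embedding(e_neural, e_sgpt):
    """Simple 'ensemble': concatenate two normalized embeddings and
    renormalize the result."""
    e = np.concatenate([e_neural, e_sgpt])
    return e / np.linalg.norm(e)
```

A useful property of this construction: when both components are unit vectors, the dot product of two ensemble embeddings is exactly the average of the two component dot products, so the two similarity signals are weighted equally.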



github url will be provided here. The code is in supplementary material.
https://github.com/PrimerAI/blanc
https://github.com/huggingface/transformers
https://huggingface.co/datasets/glue/viewer/mrpc/train
https://huggingface.co/datasets/stsb multi mt
https://huggingface.co/datasets/catasaurus/paraphrase-dataset2
https://github.com/facebookresearch/asset
https://huggingface.co/datasets/cestwc/paraphrase
https://github.com/Yale-LILY/SummEval
https://github.com/IBM/quality-controlled-paraphrase-generation



Figure 1: Illustration comparing the usual output embeddings (left) and neural embeddings (right). The output embeddings are taken from aggregated outputs of certain layers at inference, as shown on the left. Neural embeddings are taken from weights of certain layers at micro-tuning.

Figure 2: Evaluation of neural embeddings on generated phrases with controlled similarity. The original phrases are from MRPC. Each cell represents Kendall Tau (variant c) correlation of the embedding-based score with a selected kind of similarity: lexical similarity on the left heatmap, syntactic similarity on the middle heatmap, and semantic similarity on the right heatmap. The X and Y axes show the similarity by the fixed qualities.

Figure 3: Ratio of correlations of neural embeddings to correlations of all-mpnet-base-v2 embeddings.

Figure 4: Ratio of correlations of neural embeddings to correlations of all-MiniLM-L6-v2 embeddings.

Figure 5: Ratio of correlations of neural embeddings to correlations of LaBSE embeddings.

Figure 6: Ratio of correlations of neural embeddings to correlations of SGPT-125M embeddings.

Figure 7: Ratio of correlations of neural embeddings to correlations of SGPT-5.8B embeddings.

Figure 8: Producing neural embedding for text T . Notice that the weights of layers {L j } are always reset to the original values before obtaining neural embedding for a text. Here the symbol ∥ means concatenation, e.g. ∥ m j=1 a j is a concatenation of a 0 , a 1 , ... , a m .
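The procedure in Figure 8 can be sketched with toy numpy arrays in place of real model parameters. This is a hedged illustration of the bookkeeping only: `micro_tune` is a hypothetical stand-in for the few masked-LM training steps the paper performs, and the layer names are placeholders for huggingface BERT parameters such as `cls.predictions.transform.LayerNorm.weight`. The key points it demonstrates are that the tuning happens on a fresh copy (the original weights are always reset) and that the embedding is the concatenation of the chosen layers' weights.

```python
import copy
import numpy as np

# Toy stand-in for a model's named 1D parameters.
original = {
    "L0.weight": np.zeros(768),
    "L1.bias": np.zeros(768),
    "L2.bias": np.zeros(768),
}

def micro_tune(weights, text):
    # Hypothetical stand-in for a few masked-LM gradient steps on `text`:
    # here we just perturb the weights deterministically per text.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return {name: w + 0.01 * rng.normal(size=w.shape) for name, w in weights.items()}

def neural_embedding(weights, text, layers):
    # Micro-tune a fresh copy of the weights, then concatenate the chosen layers.
    tuned = micro_tune(copy.deepcopy(weights), text)  # originals stay intact
    return np.concatenate([tuned[name].ravel() for name in layers])

emb = neural_embedding(original, "some input text", ["L0.weight", "L1.bias", "L2.bias"])
assert emb.shape == (3 * 768,)
# The original weights were never modified, i.e. they are "reset" for the next text.
assert all(np.all(original[k] == 0) for k in original)
```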

Figure 9: The inputs for micro-tuning on a text of length 6 tokens, obtained from the blueprints [(2, 1), (1, 1), (1, 2), (1, 3)]. The tokens kept (not masked) are colored green. Labels are made for predicting the masked tokens. Altogether there are 12 inputs, represented here by 12 rows of length 6.

Our choice of layers for the evaluations presented in Section 3 is influenced by several factors. Practically, any 2D layers (for example, attention weights) are too large to create an embedding from. The decoder bias layer ('cls.predictions.decoder.bias') is 1D but also too large, with a size of 30522. It is reasonable to choose layers from the top of the model, where most of the processing has been done by the bulk of the transformer layers and the patterns have emerged; this means that the last transformer block and the classification layer are good places to pick the layers from. In terms of huggingface BERT, these are the layers 'bert.encoder.layer.11' and 'cls.predictions'. Empirically, we found that a contiguous set of two or three layers at the end of these blocks works best, e.g.
1. L_0 = cls.predictions.transform.LayerNorm.weight
2. L_1 = cls.predictions.transform.LayerNorm.bias
3. L_2 = cls.predictions.transform.dense.bias
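The mask construction in Figure 9 can be reproduced under one plausible reading of the blueprints, which the figure does not spell out: a blueprint (k, m) means "keep k tokens, then mask m tokens, repeated cyclically", with one input per cyclic shift of the pattern. Under this assumption the four blueprints yield 3 + 2 + 3 + 4 = 12 inputs for a 6-token text, matching the figure.

```python
def blueprint_masks(blueprint, length):
    # One boolean keep-mask per cyclic shift of the repeated (keep, mask) pattern.
    # True = token kept (green in Figure 9), False = token masked.
    keep, mask = blueprint
    period = keep + mask
    pattern = [True] * keep + [False] * mask
    return [[pattern[(i + shift) % period] for i in range(length)]
            for shift in range(period)]

blueprints = [(2, 1), (1, 1), (1, 2), (1, 3)]
rows = [row for bp in blueprints for row in blueprint_masks(bp, 6)]
assert len(rows) == 12            # 3 + 2 + 3 + 4 shifts, as in Figure 9
assert all(len(r) == 6 for r in rows)
```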

Figure 10: Creating inputs and labels from a chunk of text. The length of the chunk, len(T), must not exceed the maximum allowed size of the model input.

Figure 11: Number of control-generated phrases, for lexical similarity (left), syntactic similarity (middle), and semantic similarity (right). The X and Y axes show the similarity by the fixed qualities. The original phrases are from MRPC.

Figure 12: Number of control-generated phrases, for lexical similarity (left), syntactic similarity (middle), and semantic similarity (right). The X and Y axes show the similarity by the fixed qualities. The original phrases are from STS.

Figure 13: Evaluation of neural embeddings on short generated phrases with controlled similarity. The original phrases are from STS. Each cell represents Kendall Tau (variant c) correlation of the embedding-based score with a selected kind of similarity: lexical similarity on the left heatmap, syntactic similarity on the middle heatmap, and semantic similarity on the right heatmap. The X and Y axes show the similarity by the fixed qualities.

Figure 14: Ratio of correlations of neural embeddings to correlations of all-mpnet-base-v2 embeddings.

Figure 15: Ratio of correlations of neural embeddings to correlations of all-MiniLM-L6-v2 embeddings.

Figure 16: Ratio of correlations of neural embeddings to correlations of LaBSE embeddings.

Figure 17: Ratio of correlations of neural embeddings to correlations of SGPT-125M embeddings.

Performance of embeddings, measured as the ability to conform to the ABC-relation of Eq. 2. Column 'error' is the fraction of broken ABC-relations. Column 'total' is the number of triplets, and column 'wrong' is the number of broken triplets; error = wrong/total. I is the intersection with the Neural errors, by Eq. 3. Column 'same' is the average value of the embedding product for the semantically similar texts A and B; column 'diff' is the product for the semantically different texts A and C. The best and next-best embeddings are marked cyan and green, respectively.
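The error measure in this table can be sketched as follows. This assumes Eq. 2 requires the similarity of the semantically close pair (A, B) to exceed that of the distant pair (A, C), with similarity taken as the dot product of unit-normalized embeddings; the toy triplets below (B a noisy copy of A, C independent) stand in for real data.

```python
import numpy as np

def abc_error(triplets):
    # Fraction of triplets (A, B, C) that break the relation sim(A,B) > sim(A,C),
    # i.e. error = wrong / total as in the table.
    def unit(v):
        return v / np.linalg.norm(v)
    wrong = sum(1 for a, b, c in triplets
                if np.dot(unit(a), unit(b)) <= np.dot(unit(a), unit(c)))
    return wrong / len(triplets)

rng = np.random.default_rng(1)
# Toy triplets: B is a slightly perturbed copy of A, C is an unrelated vector.
trips = [(a, a + 0.1 * rng.normal(size=64), rng.normal(size=64))
         for a in (rng.normal(size=64) for _ in range(200))]
print(abc_error(trips))  # near zero for these easy toy triplets
```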

Performance of embeddings, measured as the ability to conform to the ABC-relation of Eq. 2.



Correlations of embedding product with human scores.

Layers: 1. L_0 = cls.predictions.transform.LayerNorm.weight, 2. L_1 = cls.predictions.transform.LayerNorm.bias, 3. L_2 = cls.predictions.transform.dense.bias. Ablation: removal of a layer. Triplets.

Ablation: removal of a layer. Triplets. Texts and summaries.

Ablation: removal of a mask-blueprint. Triplets.

Ablation: removal of a mask-blueprint. Triplets. Summaries and texts.

Ablation: removal of a mask-blueprint. Pairs similar and not similar.

Ablation: removal of a mask-blueprint. Correlations of embedding product with human scores.

Neural embeddings vs micro-tuned embeddings vs usual embeddings. Triplets.

Neural embeddings vs micro-tuned embeddings vs usual embeddings. Triplets.

Neural embeddings vs micro-tuned embeddings vs usual embeddings.

Example of concatenation, measured against the triplets (Eq. 2): Neural, SGPT-125M, and the concatenated embedding 'Neural+SGPT' of the Neural and SGPT-125M embeddings. The row with the lowest error is marked cyan.

Example of concatenation of Neural and SGPT-125M, measured on pairs (similar vs not similar).

Example of concatenation of Neural and SGPT-125M: correlations with human scores.

