HIDDEN SCHEMA NETWORKS

Abstract

Starting from the assumption that a large proportion of semantic content is necessarily relational, we introduce a neural language model that discovers networks of symbols (schemata) from text datasets. Using a variational autoencoder (VAE) framework, our model encodes sentences into sequences of symbols (composed representation), which correspond to the nodes visited by biased random walkers on a global latent graph. We first demonstrate that the model is able to uncover ground-truth graphs from artificially generated datasets of random token sequences. Next, we leverage pretrained BERT and GPT-2 language models as encoder and decoder, respectively, to train our model on language modelling and commonsense knowledge generation tasks. Qualitatively, the model is able to infer schema networks whose nodes (symbols) can be interpreted as encoding different aspects of natural language (as e.g. topics or sentiments). Quantitatively, our results show that the model successfully interprets the inferred symbol sequences, as it achieves state-of-the-art scores on language modeling benchmarks. Source code to reproduce all experiments is provided with the supplementary material.

1. INTRODUCTION

Much of the developmental and causal theories of human cognition are predicated on relational structures of knowledge that naturally exhibit compositionality. Semantic content is intrinsically relational, as one is only able to explain a given unit of knowledge (such as a concept, word or perception) insofar as there are other units of knowledge which relate to it (Block, 1986). Thus we can partially construe a concept through its relationships to other concepts (as when we say "a dog is an animal that barks"), just as we can partially construe it through its relationships to our perceptions (when we say "that is a dog", whilst pointing to a dog on the street) or the words we use (when we use the word dog to refer to the concept dog). Likewise, we can partially construe words not only through their relationships to concepts or percepts, but also through their relationships to other words, as words that occur in the same context tend to have similar meanings (Harris, 1954; Firth, 1957). Note that it is precisely this contextual semantic content of words that we have explicit access to when processing raw text datasets. On the other hand, generalization, reasoning and understanding seem to be inevitably tied to the compositional nature of knowledge. Indeed, the ability to compose a set of knowledge units (and their relations) into new, more complex relational units, which can be deployed to understand and reason about unseen data (a feature usually referred to as combinatorial generalization) is regarded as key to human-level intelligence (Fodor & Pylyshyn, 1988; Fodor & Lepore, 2002; Lake et al., 2017; Battaglia et al., 2018). Relational structures allowing for compositionality thus seem to comprise not a sufficient, but a necessary attribute of any representation scheme that strives for the generalization power of human cognition.
From the computational side, if one is to inform any modern machine learning model with such structural characteristics, one will initially encounter the problem of finding suitable primitives or data structures. In natural language processing (NLP), for example, it has become commonplace to leverage distributed continuous representations of words (Bengio et al., 2003) for different downstream tasks. Such representations are trained to encode average contextual semantics (precisely the kind of semantic content typical of the word co-occurrence relations we mentioned above) into a semantic space, which allows meaning to change continuously within it (Mikolov et al., 2013). Yet, despite earlier attempts (Mitchell & Lapata, 2008), it is unclear whether such representations can be meaningfully composed into representations of, say, unseen sentences and thus mimic the compositional character of natural language. More recently, contextualized continuous word representations inferred by deep learning architectures have shown spectacular results in many NLP tasks (Radford et al., 2018; Devlin et al., 2018; Radford et al., 2019; Brown et al., 2020). Their success stems from those models' ability to infer flexible representations through, inter alia, raw, massive datasets, data-scalable attention mechanisms and minimal inductive biases (Vaswani et al., 2017). These representations are known to not only contain rich contextual word semantics, but also consistently encode sentence-level grammar (Hewitt & Manning, 2019), and the models from which they are obtained seem to implement some notions of compositionality too (Hupkes et al., 2020; Wei et al., 2022). Nevertheless, it is still unclear whether such representations can be composed into representations of novel sentences (Yu & Ettinger, 2020; Bhathena et al., 2020).
In fact, most of their syntactic properties are implicit and therefore inferred only a posteriori, typically through probes which neither guarantee their presence, nor establish how they were obtained in the first place (Rogers et al., 2020). In this work we use a VAE framework (Kingma & Welling, 2013; Rezende et al., 2014) to develop a language model, the Hidden Schema Network model (HSN), that enforces, via inductive biases, a discrete, relational structure for sentence representation which allows for compositionality, while exploiting the well-known advantages of attention models and contextualized, pretrained representations. We first demonstrate that the model is able to uncover ground-truth graphs from artificially generated datasets of random token sequences. Next, we leverage our methodology to translate the implicit lexical and grammatical aspects of language encoded by pretrained BERT and GPT-2 language models into explicit relational structures, and apply the latter to language modelling and commonsense knowledge generation tasks. Our main contribution is then an exploration of a novel way to integrate discrete (symbols), relational (graphs) and continuous (neural representations) machine learning components into an end-to-end, differentiable representation learning algorithm for natural language modelling. Our aim is thus to try to connect the modern NLP paradigm with classical notions of linguistics, and begin to answer the recent calls for neuro-symbolic integration (Garcez & Lamb, 2020; Cartuyvels et al., 2021).

2. RELATED WORK

In cognitive psychology, a schema is (roughly) defined as a large, complex unit of knowledge representing what is typical of a group of instances (Bartlett, 1932; Piaget, 1948; Rumelhart, 2017). Marvin Minsky's frames (Minsky, 1974; 1975) are similar in function to a schema, but perhaps more easily characterized in terms of data structures. We use these terms in a loose fashion, however, our aim being only to be suggestive of the general problem of knowledge representation (Thagard, 1984). We are in fact concerned with representation schemes for natural language processing. Within the context of linguistics, Jackendoff (1978) argues that there must be a level of representation (the so-called conceptual structures) at which information conveyed by language must be compatible with information coming from sensory systems. Conceptual structures must, he goes on, be able to represent all the conceptual distinctions made by natural language, and provide some degree of compositionality. Earlier computational models implementing (some kind of) conceptual structure rely on either hand-coded (semantic) network representations (Quillan, 1966; Collins & Quillian, 1969; Brachman, 1977) or hand-coded databases (McClelland & Rogers, 2003). Other works focus instead on learning semantic representations directly from text data via topic models (Griffiths et al., 2007b), and even infer latent concept graphs through nonparametric priors (Chambers et al., 2010). In sharp contrast with these works, modern, neural-based language models incorporate no explicit linguistic notions, and leverage massive datasets and attention mechanisms, in the form of large pretrained language models and contextualized, continuous word representations.
We build on top of these ideas, while trying to connect back with models of conceptual structure, which necessarily involve discrete representations (van den Oord et al., 2017; Hu et al., 2017; Zhao et al., 2018; Kaiser & Bengio, 2018; Kaiser et al., 2018) .

3. HIDDEN SCHEMA NETWORKS

We address the problem of learning the joint probability distribution over sequences of words, while inferring interpretable representations capturing their semantics. Neural autoregressive language models approximate such distributions with a product over conditional probabilities, such that
$$p(x_{1:T}) = \prod_{i=1}^{T} p_\theta(x_i \mid x_{<i}), \tag{1}$$
where $x_{1:T} = (x_1, x_2, \dots, x_T)$ labels the sequence of words in question, and each conditional is given by (the pdf of) a categorical distribution over some vocabulary of size $V$. The class probabilities of these conditionals are generally computed as $\pi_i = \mathrm{softmax}(W \cdot h_\theta(x_{<i}))$, with $W \in \mathbb{R}^{V \times D}$ trainable, and $D$ the output dimension of $h_\theta$, a deep neural network model with parameter set $\theta$ (Bengio et al., 2003). Models of this form allow for tractable estimation of and sampling from either the joint distribution, or any product of the conditionals in Eq. 1. Indeed, their recent implementation in terms of large-capacity, self-attention architectures such as GPT-2 (Radford et al., 2019) has been shown to generate syntactically correct, diverse and fluent text. Yet, most of the linguistic structure encoded by the output representations of these models is implicit and difficult to interpret (Rogers et al., 2020). In what follows we shall condition the joint distribution of Eq. 1 on an additional latent, discrete representation which can, at least in principle, capture the relational and compositional features of semantic content.
Let us assume there is a set $E = \{e_1, e_2, \dots, e_K\}$ of $K$ symbols that encode some high-level, abstract semantic content of natural language. Let this set be the set of nodes of a hidden (semantic) graph $G$, with adjacency matrix $A$, so that adjacent (connected) symbols are semantically related. These symbols can generically be defined as learnable, dense vectors in $\mathbb{R}^{S}$, for some dimension $S$.
Without loss of generality, however, we opt below for simple indicator ("one-hot") vectors of dimension $K$ instead. We define a schema $e_{j_1:j_L}$ as a sequence of $L \ll K$ symbols $(e_{j_1}, e_{j_2}, \dots, e_{j_L})$, where the indices $j_1, \dots, j_L$ label a subset of connected nodes in $G$. Accordingly, we refer to $G$ as a schema network. The symbols composing the schemata are chosen through an $L$-step stochastic process conditioned on $G$. Partially motivated by research on random walks and human memory search (Griffiths et al., 2007a; Abbott et al., 2012), as well as by the simplicity of their inference, we choose to compose the schemata via biased random walk processes on $G$, and leave exploring different schema processes for future work. Let us now specify the generative model in detail.

3.1. GENERATIVE MODEL

We write the joint probability over a sequence $x_{1:T}$ of $T$ words, together with the hidden graph $G$, as
$$p_\theta(x_{1:T}, A) = \sum_{z_{1:L}} p_\theta(x_{1:T} \mid e_{j_1:j_L})\, p(z_{1:L} \mid A)\, p(A), \tag{2}$$
where $z_{1:L}$ labels the sequence $z_1, \dots, z_L$ of $K$-dimensional, one-hot vectors representing the node labels $j_1, \dots, j_L$ visited by a random walker on $G$, and $\theta$ denotes the trainable model parameters. Note that we introduced the one-hot representation of $j_i$ for notational convenience, as shall become evident below. Next, we specify the different components of Eq. 2.
Prior over (global) graph. A prior on the adjacency matrix $p(A)$ allows us to control the topological properties of $G$. One can choose, for example, random graph models whose degree distribution asymptotically follows a power law (Barabási & Albert, 1999), or unbiased, maximum entropy graph models, with respect to some given constraints (Park & Newman, 2004). For the sake of simplicity we choose a Bernoulli (Erdös-Rényi) random graph model (Solomonoff & Rapoport, 1951; Erdös & Rényi, 1959), for which each link $a_{ij}$ is defined via an independent Bernoulli variable with some fixed, global probability $p \in [0, 1]$, so that
$$p(A) = \prod_{i,j=1}^{K} p^{a_{ij}} (1 - p)^{1 - a_{ij}}.$$
The probability $p$ will be a hyperparameter of our model.
Prior over random walks. The probability $p(z_{1:L} \mid A)$ of a random walk over the nodes of $G$ can generally be written as
$$p(z_{1:L} \mid A) = p(z_1) \prod_{i=2}^{L} p(z_i \mid z_{i-1}, A) = \prod_{m=1}^{K} \rho_m^{\,z_1^m} \prod_{i=2}^{L} \left( \prod_{j=1}^{K} \prod_{k=1}^{K} P_{k,j}^{\,z_i^k z_{i-1}^j} \right), \tag{4}$$
where $p(z_1)$ labels the probability of selecting $j_1$ as the starting point of the walk, and it is given by (the pdf of) a categorical distribution over the nodes of $G$, with class probabilities $\{\rho_i\}_{i=1}^{K}$. Similarly, $p(z_i \mid z_{i-1}, A)$ labels the conditional probability of jumping from $j_{i-1}$ to $j_i$, which we define in terms of a $K \times K$ transition probability matrix $P$.
Now, to allow for biased random walks, let each node $k$ on $G$ be given a positive weight $f_k$, so that the probability of jumping from $j$ to $k$ is proportional to $f_k A_{kj}$. We then write the transition probability matrix as
$$P_{k,j} = \frac{f_k A_{kj}}{\sum_{i=1}^{K} f_i A_{ij}},$$
so that the motion of the random walker is biased according to the node weights $f_k$. These weights should be understood as encoding aspects of the diffusion dynamics that are independent of the topology of the graph (Gómez-Gardenes & Latora, 2008; Lambiotte et al., 2011). Three comments are in order: first, note that one can also train the prior over walks by making the vectors $\rho$ and $f$ learnable. Second, setting the node weights $f = I$ and the class probabilities $\rho = \frac{1}{K} I$, with $I$ the $K$-dimensional vector of ones, yields a uniform random walk over $G$, i.e. a process in which the walker has equal probability of jumping to any of its neighbors. Third, one can also allow for inhomogeneous random walks in which the probability matrix changes at each step of the random walk. Such processes can be parameterized with a sequence of weights $f^{[1]}, f^{[2]}, \dots, f^{[L-1]}$.
Decoder and likelihood. Just as in Eq. 1, we define the joint probability over word sequences as a product of conditional probabilities, this time conditioned on the schema $e_{j_1:j_L}$ too, that is
$$p_\theta(x_{1:T} \mid e_{j_1:j_L}) = \prod_{i=1}^{T} p_\theta(x_i \mid x_{<i}, e_{j_1:j_L}), \qquad \pi_i = \mathrm{softmax}\big(W \cdot h^{\mathrm{dec}}_\theta(x_{<i}, e_{j_1:j_L})\big),$$
with $\pi_i$ the class probabilities of the $i$th conditional, $W \in \mathbb{R}^{V \times D}$ trainable, and $h^{\mathrm{dec}}_\theta$ a deep neural network model. We let $h^{\mathrm{dec}}_\theta$ be a pretrained GPT-2 language model, and modify it to also process the schema $e_{j_1:j_L}$, but remark that any other model for sequence processing (e.g. a recurrent neural network) could be used instead. In more detail, to condition GPT-2 on $e_{j_1:j_L}$ without perturbing its optimized weights too much, we use the pseudo-self-attention (PSA) mechanism introduced by Ziegler et al. (2019).
In a nutshell, this mechanism augments the key and value matrices of GPT-2 in their first L rows with projections of e j1:j L . Figure 1 shows an illustration of the complete decoder model, including the PSA mechanism. Please check Appendix A for the explicit equations of the latter.
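The prior over walks described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`sample_adjacency`, `transition_matrix`, `sample_walk`) are hypothetical, the adjacency entries are drawn independently exactly as the Bernoulli graph prior is written, and a node without any link is treated as an error rather than handled in some model-specific way.

```python
import random

def sample_adjacency(K, p, rng):
    # Assumption: every entry a_ij is an independent Bernoulli(p) draw,
    # mirroring the product form of the Bernoulli graph prior.
    return [[1 if rng.random() < p else 0 for _ in range(K)] for _ in range(K)]

def transition_matrix(A, f):
    # P[k][j] = f[k] * A[k][j] / sum_i f[i] * A[i][j]
    # is the probability of jumping from node j to node k.
    K = len(A)
    P = [[0.0] * K for _ in range(K)]
    for j in range(K):
        norm = sum(f[i] * A[i][j] for i in range(K))
        if norm == 0:
            raise ValueError(f"node {j} has no links to walk along")
        for k in range(K):
            P[k][j] = f[k] * A[k][j] / norm
    return P

def sample_walk(P, rho, L, rng):
    # Draw the start node from rho, then L - 1 jumps from the columns of P.
    nodes = list(range(len(rho)))
    walk = [rng.choices(nodes, weights=rho)[0]]
    for _ in range(L - 1):
        j = walk[-1]
        walk.append(rng.choices(nodes, weights=[P[k][j] for k in nodes])[0])
    return walk

# Usage on a fixed 4-node ring; node 1 carries three times the weight of the others.
A = [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0]]
f = [1.0, 3.0, 1.0, 1.0]
P = transition_matrix(A, f)
walk = sample_walk(P, [0.25] * 4, 5, random.Random(0))
```

Setting all entries of `f` to one and `rho` to `1/K` recovers the uniform random walk mentioned in the text; in the sketch, the bias only reweights jumps among existing neighbors and never creates jumps along missing links.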

3.2. INFERENCE MODEL

The generative model we presented above is hierarchical. The random graph is shared across all sentences and thus constitutes a global latent object. The random walks, in contrast, are local random variables. Our task is to infer the schema and graph posterior distributions that best describe the collection of word sequences in our dataset. To do this, we approximate the true posterior distribution of these variables with a variational posterior of the form
$$q_\phi(z_{1:L}, A \mid x_{1:T}) = q_\phi(z_{1:L} \mid x_{1:T}, A)\, q_\phi(A),$$
where $\phi$ labels the set of trainable parameters. Let us specify each of its components.
Posterior over (global) graph. We model the posterior over the graph by again assigning Bernoulli variables to its links, but we let the probability of observing each link depend on the global symbols
$$q_\phi(A) = \prod_{i,j} p_\phi(e_i, e_j)^{a_{ij}} \big(1 - p_\phi(e_i, e_j)\big)^{1 - a_{ij}}, \quad \text{where} \quad p_\phi(e_i, e_j) = \mathrm{sigmoid}\big(g_\phi(e_i, e_j)\big), \tag{8}$$
with $g_\phi : E \times E \to \mathbb{R}$ a deep neural network, and $p_\phi(e_i, e_j) \in [0, 1]$, for all $e_i \in E$, the link probabilities. Our reasoning here is that the network $g_\phi$ should infer graphs connecting symbols which are semantically related via the encoded sentences.
Posterior over random walks (encoder model). Analogously to Eq. 4, we model the posterior probability over random walks on $G$ as
$$q_\phi(z_{1:L} \mid x_{1:T}, A) = \prod_{m=1}^{K} \rho_m(x_{1:T}, \phi)^{z_1^m} \prod_{i=2}^{L} \left( \prod_{j=1}^{K} \prod_{k=1}^{K} Q^{[i-1]}_{k,j}(x_{1:T}, A, \phi)^{z_i^k z_{i-1}^j} \right), \tag{9}$$
where instead of having a single transition probability matrix, we have a sequence of them, thereby allowing the posterior to capture inhomogeneous random walks. Note that we could have also chosen a mean-field decomposition along the steps of the random walk, simply by either ignoring the dependency on the graph, or making the graph fully connected (see Appendix B.4). Going back to Eq. 9, we model the probabilities over the starting point of the random walks and the transition matrices as follows
$$\rho(x_{1:T}, \phi) = \mathrm{softmax}(h^{\mathrm{enc}}_1), \tag{10}$$
$$Q^{[i]}_{k,j}(x_{1:T}, A, \phi) = \frac{f^{[i]}_k(x_{1:T}, \phi)\, A_{kj}}{\sum_{m} f^{[i]}_m(x_{1:T}, \phi)\, A_{mj}}, \quad \text{with} \quad f^{[1]}, \dots, f^{[L-1]} = \exp(h^{\mathrm{enc}}_{2:L}), \tag{11}$$
where $h^{\mathrm{enc}}_1, h^{\mathrm{enc}}_2, \dots, h^{\mathrm{enc}}_L \in \mathbb{R}^{D}$ is the sequence of outputs of a deep neural network model $h^{\mathrm{enc}}_\phi(x_{1:T})$ processing the input sequence of $T$ words. The model $h^{\mathrm{enc}}_\phi(x_{1:T})$ must then map a sequence of $T$ vectors to a sequence of $L$ vectors. We define $h^{\mathrm{enc}}_\phi$ by a pretrained BERT model (Devlin et al., 2018), followed by a single, randomly initialized Transformer block. The Transformer block processes the $T$ ($D$-dimensional) outputs from BERT as keys and values, together with a set of $L$ learnable vectors $q_{1:L}$ as queries. The right-hand side of Figure 1 illustrates the complete encoder architecture.
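As an illustration of Eqs. 10 and 11, the sketch below turns a sequence of encoder outputs into a start distribution and a sequence of graph-constrained transition matrices. It is a simplified reading of the equations, not the released code: `posterior_walk_params` is a hypothetical name, and the sketch assumes the $L$ encoder outputs have already been mapped to $K$ scores per step (one per node), a projection the main text does not spell out.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def posterior_walk_params(h_enc, A):
    """h_enc: list of L score vectors of length K (assumed projection to K dims).

    Returns rho (start distribution) and L - 1 transition matrices Q[i][k][j],
    the probability of jumping from node j to node k at step i.
    """
    K = len(A)
    rho = softmax(h_enc[0])
    Qs = []
    for h in h_enc[1:]:
        m = max(h)
        # f = exp(h); subtracting the max is for numerical stability only,
        # since a common factor cancels in the ratio below.
        f = [math.exp(x - m) for x in h]
        Q = [[0.0] * K for _ in range(K)]
        for j in range(K):
            norm = sum(f[n] * A[n][j] for n in range(K))
            for k in range(K):
                Q[k][j] = f[k] * A[k][j] / norm if norm > 0 else 0.0
        Qs.append(Q)
    return rho, Qs

# Usage: a triangle graph (no self-loops) and a walk of length L = 2.
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
h_enc = [[0.0, 0.0, 0.0], [0.0, math.log(3.0), 0.0]]
rho, Qs = posterior_walk_params(h_enc, A)
```

In this toy case the start distribution is uniform, and from node 0 the walker prefers node 1 over node 2 by a factor of three, because the second encoder output assigns node 1 a weight of 3 against 1. Entries of `Q` stay exactly zero wherever `A` has no link, which is how the sampled graph constrains the posterior walks.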

3.3. TRAINING OBJECTIVE

To optimize the parameter sets $\{\theta, \phi\}$ of our latent variable model we would, as usual, maximize a variational lower bound on the logarithm of the marginal likelihood $p_\theta(x_{1:T})$ (Bishop, 2006). It is, however, well known that VAE models tend to encounter problems learning representations encoding information about the data (the so-called posterior collapse problem), especially when dealing with et al., 2015). To solve this issue practitioners resort to maximizing the variational lower bound, together with the mutual information between data and representations (Zhao et al., 2018; Fang et al., 2019; Zhao et al., 2019). We follow this same route and show (in Appendix B.3) that maximizing the lower bound and the mutual information corresponds to maximizing the objective
$$\mathcal{L}[\theta, \phi] = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{q_\phi(z_{1:L} \mid x^{(n)}_{1:T}, A)\, q_\phi(A)} \Big[ \log p_\theta\big(x^{(n)}_{1:T} \mid z_{1:L}\big) \Big] - \mathbb{E}_{q_\phi(A)} \mathrm{KL}\big[ q^{*}_\phi(z_{1:L} \mid A);\, p(z_{1:L} \mid A) \big] - \mathrm{KL}\big[ q_\phi(A);\, p(A) \big], \tag{12}$$
where KL labels the Kullback-Leibler divergence (Kullback & Leibler, 1951) between prior and posterior distributions, and $q^{*}_\phi(z_{1:L} \mid A)$ is the aggregated posterior distribution over random walks. The latter is defined as $\mathbb{E}_{p_D(x_{1:T})}[q_\phi(z_{1:L} \mid x_{1:T}, A)]$ and is in general intractable. In practice, we approximate it with an expression identical to Eq. 9, but with the class probabilities and transition matrices (Eqs. 10 and 11) replaced with their data-averaged counterparts. We refer the reader to Appendix B for details on this, as well as for the explicit, closed-form expressions of the Kullback-Leibler terms in Eq. 12. The full training algorithm is presented in Appendix C.
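The last term of Eq. 12 compares two products of independent Bernoulli variables, for which the Kullback-Leibler divergence is a sum of per-link terms. The sketch below uses that standard identity; it is offered as an illustration and may differ in form from the expressions in Appendix B. The names `bernoulli_kl` and `graph_kl`, and the clamping constant `eps`, are assumptions of this example.

```python
import math

def bernoulli_kl(q, p, eps=1e-12):
    # KL[Bernoulli(q); Bernoulli(p)] = q log(q/p) + (1-q) log((1-q)/(1-p)).
    # q is clamped away from 0 and 1 to avoid log(0); eps is an arbitrary choice.
    q = min(max(q, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def graph_kl(link_probs, p):
    # link_probs[i][j] plays the role of the posterior link probability
    # p_phi(e_i, e_j); p is the fixed prior link probability.
    return sum(bernoulli_kl(q, p) for row in link_probs for q in row)

# Usage: a posterior equal to the prior costs nothing, a confident link does.
kl_same = graph_kl([[0.5, 0.5], [0.5, 0.5]], 0.5)
kl_confident = graph_kl([[0.9]], 0.5)
```

Because the term is additive over links, its magnitude grows with $K^2$, so the relative weight of the graph regularizer in the objective depends on the number of symbols.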

4. PROOF OF CONCEPT: INFERRING GROUND-TRUTH RANDOM GRAPHS

Before testing the behaviour of our methodology on natural language data, we evaluate the ability of the model to infer hidden graph structures from sequential data in a controlled experiment. To this end, we define a synthetic language model with an underlying, ground-truth graph $G^{*}$ as follows: given a graph $G^{*}$ with $K$ nodes, and a vocabulary of random tokens $\mathcal{V}$ of size $V$, we assign one random bag of tokens (i.e. one pdf over $\mathcal{V}$) to each node of the graph. Let the $K$ random bags be the $K$ symbols $\{e_1, e_2, \dots, e_K\}$ of the synthetic language model. We then sample $N$ uniform random walks of length $L$ over $G^{*}$, and sample one random token from each symbol (i.e. from each random bag) along the walks. The result is a set of random token sequences of the same length as that of the random walks. Appendix D contains a more detailed description of this generation procedure. Given this set of random token sequences, the task is to infer the hidden ground-truth graph $G^{*}$.
Experimental settings. Following the procedure above we generated two datasets from two random graphs with different topologies: one sampled from the Barabási-Albert model (Barabási & Albert, 1999), the other from the Erdös-Rényi model (Erdös & Rényi, 1959). We set both graphs to have $K = 100$ symbols, and the token sequences to have length $L = 10$. Each dataset has a total of $N = 100000$ token sequences. Further details about the random graph model parameters and the dataset statistics can be found in Appendix D. The synthetic datasets are available in the source code.
A simple proof-of-concept. We consider a problem in which the set of symbols (random bags) $E$ is known, so that the ground-truth graph $G^{*}$ has a fixed labelling. This setting will allow for simple comparison between $G^{*}$ and our inferred graphs. To infer $G^{*}$ we used a simplified version of HSN, namely: we (i) replace BERT in Fig. 1 with a 2-block Transformer encoder (Vaswani et al., 2017); (ii) set the graph model $g_\phi$ (Eq. 8) to a single-layer, feed-forward network; and (iii) note that, since the symbols are known, the likelihood of the model is simply given by $\prod_{i=1}^{L} e_{j_i}$ where, as before, $j_i$ denotes the index of the non-zero component of $z_i$. We train this model by maximizing Eq. 12 and refer to Appendix D for details on hyperparameters, training procedure and model sizes.
Results. Table 1 shows our results for our two synthetic datasets. Specifically, we compute the Area Under the Receiver Operating Characteristic Curve (ROC AUC) of our model $q_\phi(A)$ with respect to $G^{*}$, and the Frobenius norm between $q_\phi$ and two graphs: the ground-truth one $G^{*}$, and a second random graph $G_{\mathrm{rand}}$ sampled from the same random graph model as $G^{*}$. We train ten (10) models in total and display the mean and standard deviation of our results. We also use a different $G_{\mathrm{rand}}$ for each calculation run. The first metric shows that $q_\phi$ correctly predicts the edges of $G^{*}$, whereas the other two metrics show that $G \sim q_\phi(A)$ is closer to $G^{*}$ than to any other random graph sampled from the same distribution. The last two columns in Table 1 show, however, that $q_\phi(A)$ tends to generate denser graphs than the target. Having demonstrated that HSN can indeed infer hidden graph structure from sequential data in a simple setting, we now move to our main problem: language modelling.
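The generation procedure and the Frobenius comparison used in this section can be sketched as follows. This is only an illustration under stated assumptions: the exact way the random bags are drawn is described in Appendix D and is not reproduced here, so the sketch simply normalizes uniform random weights; all function names are hypothetical, and the toy graph is a 5-node ring rather than a Barabási-Albert or Erdös-Rényi sample.

```python
import math
import random

def make_bags(K, V, rng):
    # Assumption: each bag is a random pdf over V tokens, obtained here by
    # normalizing uniform random weights (Appendix D may use another scheme).
    bags = []
    for _ in range(K):
        w = [rng.random() for _ in range(V)]
        s = sum(w)
        bags.append([x / s for x in w])
    return bags

def uniform_walk(A, L, rng):
    # Uniform random walk: start anywhere, then jump to a random neighbor.
    K = len(A)
    walk = [rng.randrange(K)]
    for _ in range(L - 1):
        neighbors = [k for k in range(K) if A[k][walk[-1]]]
        walk.append(rng.choice(neighbors))
    return walk

def sample_sequence(A, bags, L, rng):
    # One token is drawn from the bag of every node visited by the walk.
    walk = uniform_walk(A, L, rng)
    tokens = [rng.choices(range(len(bags[j])), weights=bags[j])[0] for j in walk]
    return walk, tokens

def frobenius_distance(Q, A):
    # Frobenius norm of the difference between link probabilities and a graph.
    n = len(A)
    return math.sqrt(sum((Q[i][j] - A[i][j]) ** 2 for i in range(n) for j in range(n)))

# Usage: 5-node ring, vocabulary of 20 tokens, sequences of length 10.
rng = random.Random(0)
K, V, L = 5, 20, 10
A = [[1 if abs(i - j) in (1, K - 1) else 0 for j in range(K)] for i in range(K)]
bags = make_bags(K, V, rng)
walk, tokens = sample_sequence(A, bags, L, rng)
```

A posterior whose link probabilities match the ring exactly has `frobenius_distance` equal to zero, while an uninformative posterior with every entry at 0.5 has distance `0.5 * K`, which gives a rough scale for reading such a metric.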

5. LANGUAGE MODELLING AND REPRESENTATION LEARNING

Natural language modelling deals with the prediction of the next word in a sentence or document, given a sequence of previously observed words. A natural evaluation metric is therefore the perplexity per word of the model, which is defined as the exponential of the data-averaged, negative log-likelihood of the model, divided by the number of words in the sequence. One complication with this is that latent variable models can only approximately estimate the likelihood function. One can readily see, however, that Eq. 12 is also a lower bound on $\log p_\theta$ (Bishop, 2006) and so we estimate the perplexity of our models with $\exp(-\mathcal{L}/T)$.
Datasets and baselines. We consider three widely used public datasets, namely the Penn Treebank (PTB) (Marcus et al., 1993), Yahoo and Yelp (Yang et al., 2017) corpora. For completeness we include statistics of these datasets in Appendix E. We compare HSN against a pretrained GPT-2, fine-tuned both for a single epoch and until its objective function plateaus. We also compare against two VAE language models: iVAE MI (Fang et al., 2019) and Optimus (Li et al., 2020). The former implements both encoder and decoder as one-layer LSTMs (Hochreiter & Schmidhuber, 1997). The latter uses pretrained BERT and GPT-2 as encoder and decoder, respectively.
Experimental settings. In all experiments we leverage pretrained BERT and GPT-2 models, both with 12 layers, 768 hidden dimensions ($D$) and 12 attention heads. Note that Optimus shares these settings. We use the public HuggingFace implementation of both these models (Wolf et al., 2020). The graph model is set to a 2-layer feed-forward network, each layer with hidden dimension 512, and we also train an inhomogeneous random walk prior model (Eq. 4) by making $\rho$ and the sequence of weights $f^{[1]}, f^{[2]}, \dots, f^{[L-1]}$ trainable. Furthermore, we explore HSNs with $K \in \{50, 100\}$ symbols and hidden random walks of $L \in \{5, 20\}$ steps. Let us label these configurations as HSN($K$, $L$).
Additional details on hyperparameters and training procedures can be found in Appendix E.
Results. Table 2 shows the perplexity of our model, together with the baselines, evaluated on the test set of the three corpora. HSN achieves a much better performance than all baselines under this metric, which implies it successfully interprets the symbol sequences it uses to encode the sentences. Note in particular that HSNs with 50 symbols perform consistently better than their 100-symbol counterparts. As we discuss in Appendix E, 100-symbol HSNs tend to infer networks with many disconnected subgraphs, the largest of which usually has about 50 symbols. It appears then that (about) 50 symbols are enough to encode these corpora. We have repeated these experiments five (5) times, each with a different random initialization of the pink shaded blocks in Fig. 1. We find our mean perplexities to be better than all baselines even within error bars. The reader can find these results in Appendix E. To get a deeper insight into the features of these representations we now explore the structure of the learned global graphs $G$, as well as the semantic content of the schemata.
Structure of hidden schema networks. We characterize the structure of $G$ in terms of five statistics: its (i) diameter $D$, (ii) average distance $l$, (iii) clustering coefficient $C$, (iv) number of connected components $CC$ and (v) degree distribution $P(k)$ (see Appendix E for the definition of these). We report our results in Table 3 for HSN(50, 5). Results for the other HSN configurations can be found in the Appendix, from which we mainly find that longer random walks and larger numbers of symbols generically favor larger $CC$. Going back to Table 3, we observe that the schema networks from each corpus tend to have smaller average distances $l$, and much larger clustering coefficients $C$, than random graphs (with $p = 0.5$) of the same size, where random graphs with $p = 0.5$ correspond to our prior model.
Let us remark that the combination of these two features defines the so-called small-world structure (Watts & Strogatz, 1998). Intuitively, a larger $C$ implies that a random walker starting from a given node $k$ will have a larger number of paths bringing it back to $k$. In such a scenario, random walkers tend to cluster in neighborhoods around their starting point, a property that could help encode different semantic aspects in different regions of $G$. Another consequence is that one could expect schemata composed of repeated symbols. Figure 2 shows the degree distributions of HSN(50, 5). Here we see another aspect in which the schema networks differ from a purely random graph. In particular, the former are more densely connected than the latter.
Schemata and semantics. To qualitatively grasp the semantic content of the learned schemata we take advantage of the labels available to both Yahoo and Yelp corpora. For example, Figure 3 displays the random walk distributions over the schema networks for four (4) subsets of Yahoo, as inferred with HSN(50, 5). Similar plots for all subsets (labels) of both corpora, extracted with all our HSN configurations, can be found in Appendix E. Note how the "hot" symbols per category reside on different regions of the graphs (as suspected already from the large clustering coefficient of $G$) and yet, the "Science & Math" schemata (both nodes and edges) of Yahoo are closer to the "Education & Reference" schemata than to the "Sport" schemata, where closer nodes in the figure indicate well-connected nodes in the underlying graph $G$. We can understand these findings as indicating that the schemata indeed encode semantic notions of their corpora. A similar picture holds for Yelp.
[Table 4, caption fragment: results were extracted from Hwang et al. (2020). COMET(GPT2-XL) was computed by us. The HSN models have K = 50, L = 20. All models use greedy decoding for all text prefixes in the dataset.]
Finally, we have also defined and explored "schema interpolations" (Appendix E) and have investigated how the schemata are attended to by the model (Appendix F). These experiments (qualitatively) show too that the schemata encode different semantic notions of natural language.
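Two of the graph statistics discussed in this section, the clustering coefficient and the average distance, can be computed with a few lines of plain Python. The sketch below is an illustration only: the precise definitions used in the paper are given in Appendix E, and this example assumes the Watts-Strogatz average local clustering coefficient (nodes with fewer than two neighbors contribute zero) and the mean shortest-path length over connected ordered pairs. The function names are hypothetical.

```python
from collections import deque

def clustering_coefficient(adj):
    """Average local clustering of an undirected graph {node: set(neighbors)}."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # convention assumed here: such nodes contribute 0
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_distance(adj):
    """Mean shortest-path length over ordered pairs of connected nodes (BFS)."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Usage: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
C = clustering_coefficient(adj)
l = average_distance(adj)
```

For this toy graph the two triangle-only nodes have local clustering 1, node 2 has 1/3 and the pendant node contributes 0, so the average is 7/12; a small-world network in the sense used above would combine a high value of `C` with a low value of `l` relative to a random graph of the same size.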

6. COMMONSENSE REASONING GENERATION

It has been proposed recently that large, pretrained language models fine-tuned on (natural language) knowledge graph (KG) tuples can express their encoded knowledge through language generation, thereby providing commonsense knowledge on demand (Bosselut et al., 2019; Hwang et al., 2020). These commonsense KGs live, however, in data (i.e. text) space: the nodes and edges are represented by either single words or sequences of them. This observation led us to investigate whether one could use the COMET framework of Bosselut et al. (2019), together with the inductive biases of HSN, to translate the implicit knowledge of pretrained models into KGs in representation space. Arguably so abstract a KG could encompass larger commonsense KGs in data space. With this intuition in mind, let us revisit the COMET framework.
Task, datasets and baselines. Consider a training KG of natural language tuples of the form $(s, r, o)$, where $s = (x^s_1, \dots, x^s_{|s|})$ labels the phrase subject of the tuple, $r = x^r$ is the relation token and $o = (x^o_1, \dots, x^o_{|o|})$ is the phrase object of the tuple. The task is to generate the object $o$, given $s$ and $r$; in other words, to infer the distribution $p(o \mid [s, r])$. COMET fine-tunes its pretrained models by maximizing the likelihood of the object, conditioned on the sequence $[s, r] = (x^s_1, \dots, x^s_{|s|}, x^r)$ (Bosselut et al., 2019). In contrast, HSN is trained to auto-encode the complete sequence $[s, r, o] = (x^s_1, \dots, x^s_{|s|}, x^r, x^o_1, \dots, x^o_{|o|})$ and is evaluated on object generation tasks, conditioned not only on $[s, r]$ but also on the schema $e_{j_1:j_L}$. For this preliminary study we focus on the ATOMIC dataset (Sap et al., 2019a), evaluate the quality of the generated objects with both BLEU-2 (Papineni et al., 2002) and BERT Score (Zhang et al., 2020) metrics, and compare against GPT-2, GPT-2-XL and BART, all trained within the COMET framework (Hwang et al., 2020).
Results.
Table 4 shows HSN outperforms all baselines, which entails it successfully infers and interprets schemata encoding the KG tuples. These schemata, however, are inferred via a posterior of the form $q_\phi(z_{1:L/2} \mid [s, r], A)\, q_\phi(z_{L/2+1:L} \mid [s, r, o], z_{L/2}, A)$ (see Appendix G for details). Yet, in practice, one does not have access to any object during inference. The classical solution, à la Kalman Filter, is to replace $q_\phi(z_{L/2+1:L} \mid [s, r, o], z_{L/2}, A)$ with a local prior model of the form $p_\theta(z_{L/2+1:L} \mid [s, r], z_{L/2}, A)$, and train the latter via the KL term in Eq. 12. Maximizing the mutual information, however, averages out all local information from the prior and hinders its learning (see e.g. HSN[prior] in Table 4). An alternative is to train, in the spirit of knowledge distillation (Hinton et al., 2015), a third-party model on the inferred schemata, to predict $z_{L/2+1:L}$ conditioned on $z_{1:L/2}$. Our preliminary results, reported as HSN[KD] in Table 4, improve upon HSN[prior] and even outperform COMET(GPT2) in the BERT Score.

7. CONCLUSION

We introduced a novel representation learning algorithm for natural language modelling that infers discrete, relational representations which allow for compositionality. Experiments show our model learns representations encoding high-level semantics of natural sentences, thereby adding some novel layers of interpretability to large, pretrained language models.

8. REPRODUCIBILITY STATEMENT

We provide source code to reproduce our results as supplementary material. The README.rst file within it contains instructions to install and run the corresponding libraries. The synthetic datasets of section 4 are also provided within the source code file, in the data directory. All other datasets we used are available online and are automatically downloaded by our training scripts. We additionally provide explicit derivations and/or details for all mathematical expressions of the main text in the Appendix. Details on hyper-parameter selection and training can also be found in the Appendix.

APPENDIX

A PSEUDO-SELF ATTENTION MECHANISM REVISITED

The attention mechanism of the original Transformer (Vaswani et al., 2017) is defined as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(D^{-1/2}\, Q \cdot K^{T}\right) V, \qquad (13)$$

where $Q$, $K$ and $V \in \mathbb{R}^{T \times D}$ are sets of queries, keys and values, respectively, each given by a sequence of $T$ $D$-dimensional vectors packed into a matrix. In practice, these queries, keys and values are projected many times with different learnable, linear maps. The Attention operation (Eq. 13) is performed on these different projections in parallel, whose outputs are then concatenated and projected once more with a final, linear map. The complete operation is known as Multi-head Attention (Vaswani et al., 2017), and we use this notation in Fig. 1 of the main text.

Now, the question is how to condition GPT-2 on the schema $e_{j_1:j_L}$. Given a sequence of input representations $u_{1:T}$, the self-attention mechanism in GPT-2 is obtained by choosing $Q = u_{1:T} \cdot W_Q$, $K = u_{1:T} \cdot W_K$ and $V = u_{1:T} \cdot W_V$, all in $\mathbb{R}^{T \times D}$, with $W_Q$, $W_K$ and $W_V \in \mathbb{R}^{D \times D}$ pretrained matrices. We leverage a pseudo-self attention (PSA) mechanism (Ziegler et al., 2019) that augments the key and value matrices in their first $L$ rows with projections of $e_{j_1:j_L}$, so that

$$\tilde{K} = \begin{pmatrix} e_{j_1:j_L} \cdot W^e_K + p_{enc} \\ K \end{pmatrix}, \qquad \tilde{V} = \begin{pmatrix} e_{j_1:j_L} \cdot W^e_V + p_{enc} \\ V \end{pmatrix} \in \mathbb{R}^{(L+T) \times D},$$

where $p_{enc}$ is a positional encoding, just as the one used in the original Transformer implementation (Vaswani et al., 2017). The latter informs GPT-2 about the ordering of the symbols in the schema, as selected by the random walk process. PSA is then simply given by Eq. 13 with the keys and values replaced with the augmented ones, $\tilde{K}$ and $\tilde{V}$. The matrices $W^e_K$ and $W^e_V$ are randomly initialized, learnable parameters mapping the schemata onto the $D$-dimensional space of the decoder self-attention, and we have as many of them as layers in GPT-2. This mechanism therefore allows GPT-2 to attend to the projected schema at each of its layers, with a minimal addition of untrained parameters (Ziegler et al., 2019).
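The pseudo-self attention step described above can be illustrated with a small NumPy sketch. It is a single-head toy version under several assumptions that are not stated in the text: the function and argument names are hypothetical, the positional encoding is taken to be the sinusoidal one of Vaswani et al. (2017), and a causal mask is applied to the token part of the keys (as in GPT-2) while every token may attend to all $L$ schema symbols.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(length, dim):
    # Sinusoidal positional encoding (assumed form of p_enc).
    pos = np.arange(length)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def pseudo_self_attention(u, e, W_q, W_k, W_v, W_ek, W_ev):
    """Single-head PSA sketch.

    u:    (T, D)   decoder input representations u_{1:T}
    e:    (L, De)  schema symbols e_{j_1:j_L}
    W_q, W_k, W_v: (D, D)  pretrained projections
    W_ek, W_ev:    (De, D) new, randomly initialized projections
    """
    T, D = u.shape
    L = e.shape[0]
    q, k, v = u @ W_q, u @ W_k, u @ W_v
    p_enc = sinusoidal_positions(L, D)
    # Augment keys and values: the first L rows hold the projected schema.
    k_aug = np.concatenate([e @ W_ek + p_enc, k], axis=0)  # (L+T, D)
    v_aug = np.concatenate([e @ W_ev + p_enc, v], axis=0)  # (L+T, D)
    scores = q @ k_aug.T / np.sqrt(D)                      # (T, L+T)
    # Assumption: tokens see every symbol, but only past/current tokens.
    mask = np.concatenate(
        [np.ones((T, L), dtype=bool), np.tril(np.ones((T, T), dtype=bool))],
        axis=1,
    )
    weights = softmax(np.where(mask, scores, -np.inf), axis=-1)
    return weights @ v_aug, weights
```

The returned `weights[:, :L]` block is the token-to-symbol attention, i.e. the kind of quantity that the attention study in Appendix F inspects.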

B TRAINING OBJECTIVE

The Evidence Lower Bound (ELBO) of the Hidden Schema Network model reads

$$\mathcal{L}[\theta, \phi] = \frac{1}{N} \sum_{n=1}^{N} \Big( E_{q_\phi(z_{1:L}|x^{(n)}_{1:T}, A)\, q_\phi(A)} \log p_\theta(x^{(n)}_{1:T}|z_{1:L}) - E_{q_\phi(A)} KL\big[q_\phi(z_{1:L}|x^{(n)}_{1:T}, A);\, p(z_{1:L}|A)\big] \Big) - KL[q_\phi(A);\, p(A)], \qquad (15)$$

where $KL[\cdot]$ denotes the Kullback-Leibler (KL) divergence. Note that this is not the training objective of the main text. There we maximize the ELBO together with the mutual information between sentences and schemata. We give details about this modified objective in subsection B.3 below. Before getting into that, let us first calculate the explicit expressions for the two divergences above.

B.1 KULLBACK-LEIBLER BETWEEN RANDOM WALKS

For notational convenience we will not write the explicit dependence on the graph $A$ in what follows. Using the explicit product form of the probabilities over walks leads to

$$KL[q_\phi(z_{1:L}|x^{(n)}_{1:T});\, p(z_{1:L})] = \sum_{i=2}^{L} E_{\bar{q}_\phi(z_{i-1}|x^{(n)}_{1:T})\, q_\phi(z_i|z_{i-1}, x^{(n)}_{1:T})} \log \frac{q_\phi(z_i|z_{i-1}, x^{(n)}_{1:T})}{p(z_i|z_{i-1})} + KL[q_\phi(z_1);\, p(z_1)], \qquad (16)$$

where $\bar{q}_\phi(z_i|x^{(n)}_{1:T})$ is the aggregated probability over all walks until step $i$. Since the random walks are Markovian, $\bar{q}$ can be explicitly written as

$$\bar{q}_\phi(z_i|x^{(n)}_{1:T}) = \prod_{1 \le j < i} Q^{[j]}(x^{(n)}_{1:T}, \phi) \cdot \rho(x^{(n)}_{1:T}, \phi), \qquad (17)$$

where the (posterior) class probabilities over the walks' starting points $\rho$, and the transition matrices $Q^{[i]}$, are defined in Eqs. 10 and 11 of the main text. Using the definitions in Eqs. 4 and 9 we can write the argument of the expectation value in Eq. 16 above as

$$\log \frac{q_\phi(z_i|z_{i-1}, x^{(n)}_{1:T})}{p(z_i|z_{i-1})} = \sum_{k,j} z^k_i z^j_{i-1} \log \frac{Q^{[i-1]}_{k,j}(x^{(n)}_{1:T}, \phi)}{P_{k,j}},$$

which means we only need to compute the expectation of the product $z^k_i z^j_{i-1}$. This can easily be shown to be

$$E_{\bar{q}_\phi(z_{i-1}|x^{(n)}_{1:T})\, q_\phi(z_i|z_{i-1}, x^{(n)}_{1:T})} \big[ z^k_i z^j_{i-1} \big] = Q^{[i-1]}_{k,j}(x^{(n)}_{1:T}, \phi)\, \bar{\rho}^{[i-1]}_j(x^{(n)}_{1:T}, \phi),$$

where $\bar{\rho}^{[i]}_j(x^{(n)}_{1:T}, \phi)$ is the $j$th class probability of $\bar{q}_\phi(z_i|x^{(n)}_{1:T})$, defined in Eq. 17. Finally, the second KL term in Eq. 16 can be directly evaluated as

$$KL[q_\phi(z_1);\, p(z_1)] = \sum_{j=1}^{K} \rho_j(x^{(n)}_{1:T}, \phi) \log \frac{\rho_j(x^{(n)}_{1:T}, \phi)}{\rho_j},$$

where $\rho_j(x^{(n)}_{1:T}, \phi)$ and $\rho_j$ are, respectively, the posterior and prior class probabilities for the random walks' starting points.

Putting it all together we write

$$KL[q_\phi(z_{1:L}|x^{(n)}_{1:T});\, p(z_{1:L})] = \sum_{i=2}^{L} \sum_{k,j=1}^{K} Q^{[i-1]}_{k,j}(x^{(n)}_{1:T}, \phi)\, \bar{\rho}^{[i-1]}_j(x^{(n)}_{1:T}, \phi) \log \frac{Q^{[i-1]}_{k,j}(x^{(n)}_{1:T}, \phi)}{P_{k,j}} + \sum_{j=1}^{K} \rho_j(x^{(n)}_{1:T}, \phi) \log \frac{\rho_j(x^{(n)}_{1:T}, \phi)}{\rho_j}. \qquad (21)$$

B.2 KULLBACK-LEIBLER BETWEEN RANDOM GRAPH MODELS

Since both prior and posterior graph models treat each edge in $G$ as a Bernoulli random variable, we can write directly

$$KL[q_\phi(A);\, p(A)] = \sum_{ij} \left[ p_\phi(e_i, e_j) \log \frac{p_\phi(e_i, e_j)}{p} + \big(1 - p_\phi(e_i, e_j)\big) \log \frac{1 - p_\phi(e_i, e_j)}{1 - p} \right],$$

where $p_\phi(e_i, e_j)$ is the posterior link probability, which is conditioned on the symbols connected by the link, and $p$ is the global prior probability over all links, as defined in Eq. 3 of the main text.
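Both closed-form divergences of subsections B.1 and B.2 are cheap to evaluate numerically. The NumPy sketch below is one possible way of doing so; the function names, the array layout (transition matrices stored as column-stochastic `(K, K)` arrays with `Q[k, j] = q(z_i = k | z_{i-1} = j)`) and the small `eps` used to guard the logarithms are assumptions made for illustration.

```python
import numpy as np

def random_walk_kl(rho_post, Q_post, rho_prior, P_prior, eps=1e-12):
    """KL between posterior and prior random walks (Eq. 21).

    rho_post:  (K,)         posterior start probabilities
    Q_post:    (L-1, K, K)  posterior transition matrices Q^{[1]}, ..., Q^{[L-1]}
    rho_prior: (K,)         prior start probabilities
    P_prior:   (K, K)       prior transition matrix
    """
    # KL between the distributions over starting points.
    kl = np.sum(rho_post * (np.log(rho_post + eps) - np.log(rho_prior + eps)))
    rho_bar = rho_post.copy()  # aggregated probability at step i-1 (Eq. 17)
    for Q in Q_post:
        log_ratio = np.log(Q + eps) - np.log(P_prior + eps)
        kl += np.sum(Q * rho_bar[None, :] * log_ratio)
        rho_bar = Q @ rho_bar  # propagate the aggregated probability one step
    return float(kl)

def bernoulli_graph_kl(p_post, p_prior, eps=1e-12):
    """KL between Bernoulli edge models (subsection B.2).

    p_post:  (K, K) posterior link probabilities p_phi(e_i, e_j)
    p_prior: scalar global prior link probability p
    """
    p_post = np.clip(p_post, eps, 1.0 - eps)
    terms = p_post * np.log(p_post / p_prior) \
        + (1.0 - p_post) * np.log((1.0 - p_post) / (1.0 - p_prior))
    return float(terms.sum())
```

For small $K$ and $L$ the value returned by `random_walk_kl` can be checked against a brute-force sum over all $K^L$ walks, which is a convenient sanity test of Eq. 21.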

B.3 MAXIMIZING MUTUAL INFORMATION

We would like to maximize the mutual information between the word sequences in our dataset and the schema representations. We have argued that the training objective in the main text already includes such a mutual information term. To see that this is indeed the case we need to work out some identities. Let us, for simplicity of notation, consider two discrete variables $z$ and $x$, the latter of which follows an unknown distribution $p_D(x)$. The following identities hold:

$$\begin{aligned}
-E_{p_D(x)} KL[q(z|x);\, p(z)] &= E_{p_D(x)} E_{q(z|x)} \big[ \log p(z) - \log q(z|x) \big] \\
&= H_q(z|x) + \sum_x p_D(x) \sum_z q(z|x) \big[ \log p(z) + \log q^*(z) - \log q^*(z) \big] \\
&= H_q(z|x) - H_{q^*}(z) + \sum_z q^*(z) \big[ \log p(z) - \log q^*(z) \big] \\
&= -I(z; x) - E_{q^*(z)} \log \frac{q^*(z)}{p(z)} \\
&= -I(z; x) - KL[q^*(z);\, p(z)], \qquad (23)
\end{aligned}$$

where

$$H_q(z|x) = -\sum_x p_D(x) \sum_z q(z|x) \log q(z|x)$$

is the conditional entropy with respect to distribution $q$ (see e.g. page 17 in Cover & Thomas, 1991) and

$$H_{q^*}(z) = -\sum_z q^*(z) \log q^*(z)$$

is the entropy of distribution $q^*(z)$, which we define as the marginal (data-aggregated) distribution

$$q^*(z) = \sum_x p_D(x)\, q(z|x).$$

Finally, we used the definition of mutual information

$$I(x; z) = H_{q^*}(z) - H_q(z|x). \qquad (27)$$

See e.g. page 20 in Cover & Thomas (1991). It follows from Eq. 23 that maximizing the ELBO (Eq. 15), together with the mutual information between word sequences and schemata, simply amounts to replacing the KL between the approximate posterior and prior random walk distributions with the KL between the aggregated posterior and prior random walk distributions. To wit,

$$-\frac{1}{N} \sum_{n=1}^{N} E_{q_\phi(A)} KL\big[q_\phi(z_{1:L}|x^{(n)}_{1:T}, A);\, p(z_{1:L}|A)\big] + I(z_{1:L}; x_{1:T}|A) = -E_{q_\phi(A)} KL\big[q^*_\phi(z_{1:L}|A);\, p(z_{1:L}|A)\big],$$

where we introduced the aggregated posterior over random walks with respect to the word sequences

$$q^*_\phi(z_{1:L}) = E_{p(x_{1:T})}\, q_\phi(z_{1:L}|x_{1:T}) \approx \frac{1}{N} \sum_{n=1}^{N} q_\phi(z_{1:L}|x^{(n)}_{1:T}).$$
In practice we approximate this quantity with

$$q^*_\phi(z_{1:L}) \approx q^*_\phi(z_1) \prod_{i=2}^{L} q^*_\phi(z_i|z_{i-1}, A),$$

where $q^*_\phi(z_1)$ is a categorical distribution whose class probabilities $\rho^*_j(\phi)$ are the average of those from our approximate posterior (Eq. 10 in the main text),

$$\rho^*_j(\phi) = \frac{1}{N} \sum_{n=1}^{N} \rho_j(x^{(n)}_{1:T}, \phi),$$

and the transition probabilities $q^*_\phi(z_i|z_{i-1}, A)$ have transition probability matrices

$$Q^{*[i]}_{k,j}(A, \phi) = \frac{1}{N} \sum_{n=1}^{N} Q^{[i]}_{k,j}(x^{(n)}_{1:T}, A, \phi).$$

B.4 MEAN-FIELD SOLUTION

Instead of modeling the posterior over random walks with Eq. 9 of the main text, we could consider a mean-field decomposition along the time component, by ignoring the dependency on the graph $G$:

$$q_\phi(z_{1:L}|x_{1:T}) = \prod_{i=1}^{L} q_\phi(z_i|x_{1:T}), \qquad (33)$$

where at each step of the walk we have a step-dependent categorical distribution

$$q_\phi(z_i|x_{1:T}) = \prod_{j=1}^{K} \Big( \rho^{[i]}_j(x_{1:T}, \phi) \Big)^{z^j_i},$$

whose class probabilities live in the $K$-simplex. We could model the latter via

$$\rho^{[1]}, \dots, \rho^{[L]} = \mathrm{softmax}(h^{enc}_1, \dots, h^{enc}_L),$$

where $h^{enc}_1, \dots, h^{enc}_L$ are the outputs of our encoder neural network model, shown in Figure 1 of the main text. Inserting the mean-field approximation of Eq. 33 into Eq. 15 yields

$$\begin{aligned}
KL[q_\phi(z_{1:L}|x^{(n)}_{1:T});\, p(z_{1:L}|A)] &= \sum_{i=2}^{L} \Big\{ E_{q_\phi(z_i|x^{(n)}_{1:T})} \log q_\phi(z_i|x^{(n)}_{1:T}) - E_{q_\phi(z_i|x^{(n)}_{1:T})\, q_\phi(z_{i-1}|x^{(n)}_{1:T})} \log p(z_i|z_{i-1}) \Big\} + KL[q_\phi(z_1);\, p(z_1)] \\
&= \sum_{j}^{K} \rho^{[1]}_j(x_{1:T}, \phi) \log \frac{\rho^{[1]}_j(x_{1:T}, \phi)}{\rho_j} + \sum_{i=2}^{L} \sum_{j}^{K} \rho^{[i]}_j(x_{1:T}, \phi) \log \rho^{[i]}_j(x_{1:T}, \phi) \\
&\quad - \sum_{i=2}^{L} \sum_{k,j}^{K} \rho^{[i]}_k(x_{1:T}, \phi)\, \rho^{[i-1]}_j(x_{1:T}, \phi) \log P_{k,j}, \qquad (36)
\end{aligned}$$

where the last line uses the fact that, under the mean-field posterior, consecutive steps are independent, so that $E\big[z^k_i z^j_{i-1}\big] = \rho^{[i]}_k(x_{1:T}, \phi)\, \rho^{[i-1]}_j(x_{1:T}, \phi)$.

B.5 FULLY CONNECTED GRAPH

We can replace the adjacency matrix $A$ in the definition of the transition probability matrix of our posterior, $Q(x_{1:T}, A, \phi)$, with that of a fully connected graph. The aggregated posterior over all walks up to step $i$ (Eq. 17 above) reduces in this case to

$$\begin{aligned}
\bar{\rho}^{[i]}_k(x_{1:T}, \phi) &= \sum_{j}^{K} \frac{f^{[i-1]}_k(x_{1:T}, \phi)\, A_{k,j}}{\sum_m f^{[i-1]}_m(x_{1:T}, \phi)\, A_{m,j}}\, \bar{\rho}^{[i-1]}_j(x_{1:T}, \phi) \\
&= \frac{f^{[i-1]}_k(x_{1:T}, \phi)}{\sum_m f^{[i-1]}_m(x_{1:T}, \phi)} \left( \sum_{j}^{K} \bar{\rho}^{[i-1]}_j(x_{1:T}, \phi) \right) \\
&= \frac{f^{[i-1]}_k(x_{1:T}, \phi)}{\sum_m f^{[i-1]}_m(x_{1:T}, \phi)},
\end{aligned}$$

which is equivalent to that of the mean-field approximation of section B.4 with $\bar{\rho}^{[i]}_k = \rho^{[i]}_k$.
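The reduction of section B.5 is easy to verify numerically. The following NumPy sketch builds the transition matrix for an all-ones adjacency matrix and checks that one step of the walk forgets the previous distribution and returns the softmax of the encoder outputs. The helper name `transition_matrix` and the random inputs are illustrative assumptions.

```python
import numpy as np

def transition_matrix(f, A):
    # Q[k, j] = f[k] * A[k, j] / sum_m f[m] * A[m, j]  (columns sum to one)
    W = f[:, None] * A
    return W / W.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
K = 5
h = rng.normal(size=K)           # stand-in for an encoder output h^enc_i
f = np.exp(h)                    # node weights f^{[i-1]}
A_full = np.ones((K, K))         # fully connected graph

rho_prev = rng.random(K)
rho_prev /= rho_prev.sum()       # arbitrary aggregated probability at step i-1

rho_next = transition_matrix(f, A_full) @ rho_prev
mean_field = f / f.sum()         # softmax(h), the mean-field class probabilities

fully_connected_matches_mean_field = bool(np.allclose(rho_next, mean_field))
```

With a sparse adjacency matrix the columns of $Q$ differ from one another, so `rho_next` does depend on `rho_prev`; this dependence is exactly what the graph adds on top of the mean-field model.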

C HIDDEN SCHEMA NETWORKS ALGORITHM

Algorithm 1: HSN Training (ϕ, ψ)

foreach minibatch $x_{1:T} \sim p(D)$ do
  (1) Sample schema network from posterior graph model: $A \sim q_\phi(A)$.
  (2) Compute parameters of posterior random walk model: $h^{enc}_1, h^{enc}_2, \dots, h^{enc}_L = h^{enc}_\phi(x_{1:T})$, $\rho(\phi) = \mathrm{softmax}(h^{enc}_1)$, $Q^{[i]}_{k,j}(\phi) = \frac{f^{[i]}_k(\phi)\, A_{kj}}{\sum_m f^{[i]}_m(\phi)\, A_{mj}}$, with $f^{[1]}, \dots, f^{[L-1]} = \exp(h^{enc}_{2:L})$.
  (3) Compute parameters of prior random walk model: $P_{k,j} = \frac{f_k A_{kj}}{\sum_{i=1}^{K} f_i A_{ij}}$.
  (4) Sample random walks from posterior distribution: $z_{1:L} \sim q_\phi(z_{1:L}|x_{1:T}, A)$.
  (5) Decode sentence: for $i = 0$ to $T - 1$ do $x_i \sim p_\theta(x_i|x_{<i}, e_{j_1:j_L})$, with class probabilities $\pi_i = \mathrm{softmax}(W \cdot h^{dec}_\theta(x_{<i}, e_{j_1:j_L}))$; end
  (6) Compute loss and back-propagate: $\mathcal{L}[\theta, \phi] = \frac{1}{N} \sum_{n=1}^{N} E_{q_\phi(z_{1:L}|x^{(n)}_{1:T}, A)\, q_\phi(A)} \log p_\theta(x^{(n)}_{1:T}|z_{1:L}) - E_{q_\phi(A)} KL\big[q^*_\phi(z_{1:L}|A);\, p(z_{1:L}|A)\big] - KL[q_\phi(A);\, p(A)]$.
end

D ON SYNTHETIC DATASET EXPERIMENTS

In this section we give additional details of and results from our proof-of-concept experiments.

D.1 SYNTHETIC LANGUAGE MODEL

We generate our synthetic dataset as follows. First, we sample a single, fixed graph G* with K nodes from a predefined random graph model. Second, we define a set of random tokens V, of size V, to be our vocabulary. We create each token as a random 3-tuple from the Latin alphabet, and choose to have at least one order of magnitude more tokens than nodes in G* (that is, V ≫ K). Third, we assign a random bag of tokens to each node in G*. These random bags can simply be understood as probability distributions over V, and can be represented as V-dimensional vectors whose components live on the simplex. Note in particular that, by construction, tokens can be shared among the different nodes of G*. Finally, let us identify the K random bags with the K symbols {e_1, e_2, ..., e_K} of the synthetic language model. To generate synthetic sentences we sample uniform, L-step random walks on G*, whose transition matrix is given by Eq. 4 in the main text, with f = I. Having obtained a set of random walks on G*, we sample one random token from each of the symbols (i.e. from each random bag) along the walks.
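The generative procedure above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the code shipped with the paper: the function name and its arguments are hypothetical, the graph is drawn from a simple Erdős-Rényi sampler written by hand, isolated nodes are patched with one extra edge so that every walk can continue, walks start at a uniformly random node, and each bag contains `bag_size` equally likely tokens.

```python
import numpy as np

def make_synthetic_dataset(K=20, V=200, bag_size=2, L=10, N=100, p_edge=0.5, seed=0):
    rng = np.random.default_rng(seed)

    # 1. Ground-truth graph G*: symmetric adjacency matrix without self-loops.
    upper = np.triu(rng.random((K, K)) < p_edge, k=1)
    A = (upper | upper.T).astype(int)
    for i in np.flatnonzero(A.sum(axis=0) == 0):  # assumption: no isolated nodes
        j = (i + 1) % K
        A[i, j] = A[j, i] = 1

    # 2. Vocabulary of V distinct random 3-letter tokens.
    letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))
    vocab = set()
    while len(vocab) < V:
        vocab.add("".join(rng.choice(letters, size=3)))
    vocab = sorted(vocab)

    # 3. One random bag of tokens per node (tokens may be shared across nodes).
    bags = np.stack([rng.choice(V, size=bag_size, replace=False) for _ in range(K)])

    # 4. Uniform L-step random walks on G*.
    walks = np.empty((N, L), dtype=int)
    walks[:, 0] = rng.integers(K, size=N)
    for n in range(N):
        for i in range(1, L):
            neighbours = np.flatnonzero(A[:, walks[n, i - 1]])
            walks[n, i] = rng.choice(neighbours)

    # 5. Emit one token from the bag of every visited node.
    picks = rng.integers(bag_size, size=(N, L))
    token_ids = bags[walks, picks]
    sentences = [[vocab[t] for t in row] for row in token_ids]
    return A, bags, walks, sentences
```

Each returned sentence is a list of `L` three-letter tokens, and `walks` holds the ground-truth symbol sequence (the schema) that produced it.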

D.2 EXPERIMENTAL SETTINGS

Here we give additional details for reproducibility.

Datasets

• Following the procedure above we generated two datasets from two random graphs with different topologies: one sampled from the Barabási-Albert model (Barabási & Albert, 1999), the other from the Erdős-Rényi model (Erdős & Rényi, 1959). We generate these graphs using NetworkX, a Python language software package for network structures (Hagberg et al., 2008). Specifically, we generate Barabási-Albert graphs by attaching 3 edges from each new node to old ones, and Erdős-Rényi graphs with an edge probability of 0.5. We set both graphs to have K = 100 symbols.
• We define each random bag of tokens in G* to have two tokens only (each with equal probability).
• We use a vocabulary of 1000 random tokens.
• Once the graph is fixed, we set the token sequence length to L = 10 (L = 11) for the Erdős (Barabási) datasets and generate a total of N = 100000 token sequences from each random graph.

Hidden Schema Network (HSN) settings

• We train randomly initialized embeddings of dimension 256, one for each token. We sample these from a normal distribution with zero mean and a standard deviation of 0.01.
• The posterior graph model is defined via a single feed-forward neural network with 256 hidden units.
• The prior graph model has the edge probability p as hyperparameter. We cross-validated it over the set p = {0.1, 0.2, 0.5, 0.6, 0.8} and found that HSN could fit the Barabási dataset only with small values {0.1, 0.2}. HSN could fit the Erdős dataset with larger values {0.5, 0.6}.
• The posterior random walk model is defined by replacing BERT with a 2-block Transformer encoder (Vaswani et al., 2017), each block with 2 heads, 256 hidden units and a dropout probability of 0.2.
• The prior random walk model was set to a uniform random walk.

Training details

• We use a batch size of 256 and train with Adam (Kingma & Ba, 2014), with a learning rate of 0.0001, in all experiments.
• To sample from both the graph and random walk posterior models we use the Gumbel-Softmax trick (Jang et al., 2016), with a constant temperature of 0.75.
• We train the models for 200 epochs.
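As a reminder of what the Gumbel-Softmax relaxation in the training details computes, here is a forward-only NumPy sketch. The function name is hypothetical and the code does not track gradients; in an actual training loop one would use an automatic differentiation framework so that the relaxed samples stay differentiable with respect to the logits.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=0.75, rng=None):
    """Draw relaxed one-hot samples from categorical logits.

    logits: (..., K) unnormalized log-probabilities
    Returns an array of the same shape whose last axis sums to one.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise obtained from uniform samples.
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    # Temperature-scaled softmax of the perturbed logits.
    y = (logits + g) / temperature
    y = y - y.max(axis=-1, keepdims=True)
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)
```

Lower temperatures push the samples towards one-hot vectors. The index of the largest component is always distributed according to the softmax of the logits, whatever the temperature.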

D.3 ADDITIONAL RESULTS

Table 5 displays the mean and standard deviation of some additional results on our proof-of-concept experiments. We trained ten models in total. We first trained a simple LSTM network to infer the correct symbol order in each random token sequence. We noticed that a network with 256 hidden units was enough to solve this task perfectly. Indeed, the negative log-likelihood (NLL) of these models corresponds to choosing the 2-token random bag sequence (i.e. the schema) that yields the correct token sequence without errors. The HSN performs equally well on the Barabási dataset, and slightly worse on the Erdős dataset. In fact, the Erdős dataset proved to be more challenging to learn with the HSN in all regards. See, for example, the AUC scores or the Frobenius norms of HSN on this dataset, as compared to the Barabási case. We think this might be due to the fact that Barabási graphs have more structure, simply because of their sparsity, which arguably makes them easier to infer with our inductive bias. Note also how increasing the prior edge probability p affects the average number of edges of the inferred graphs.

E ON LANGUAGE MODELLING EXPERIMENTS

In this section we give additional details of and results from our language modelling and representation learning experiments.

E.1 EXPERIMENTAL SETTINGS

Here we give additional details for reproducibility.

Datasets

• We consider three widely used public datasets, namely the Penn Treebank (PTB) (Marcus et al., 1993), Yahoo and Yelp (Yang et al., 2017) corpora.

HSN settings

• In all experiments we leveraged pretrained BERT and GPT-2 models, both with 12 layers, 768 hidden dimensions (D) and 12 attention heads. We used the public HuggingFace implementation of both these models (Wolf et al., 2020).
• The posterior graph model is set to a 2-layer feed-forward network, each layer with hidden dimension 512.
• We cross-validated the prior edge probability over the set of values p = {0.1, 0.2, 0.5, 0.6} and found p = 0.5 (a maximum entropy prior) to yield the best results. All results we report correspond to this (p = 0.5) case.
• We also train an inhomogeneous random walk prior model by making ρ and the sequence of weights f^[1], f^[2], ..., f^[L-1] trainable. We initialized them by sampling from a normal distribution with zero mean and standard deviation of 0.01.
• We experimented with HSN of K = {50, 100} symbols and random walks of length L = {5, 20}.

Training details

• We used a batch size of 32 and trained with Adam (Kingma & Ba, 2014), with a learning rate of 0.00001, in all experiments.
• To sample from both the graph and random walk posterior models we used the Gumbel-Softmax trick (Jang et al., 2016), with a constant temperature of 1.0.
• We used a cyclical schedule to anneal both KL terms in our training objective from zero to one (Fu et al., 2019). When the annealing weight (usually called β in the literature) is finite, we used a KL threshold scheme (Li et al., 2019), with a threshold value of 0.1.
• We trained the models for 100 epochs, although they usually needed only about 60 epochs to converge (in the NLL).
• We applied word dropout to the input of the decoder model with probability 0.3 in the following cases: (i) for all models trained on PTB; and (ii) for all models with L = 50 trained on all datasets.
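The annealing and thresholding items in the training details can be pictured with the following plain-Python sketch. Only the 0-to-1 range of the weight and the threshold value 0.1 come from the text. The number of cycles, the fraction of each cycle spent ramping up, the choice of applying the threshold to each KL term as a whole, and all function names are assumptions made for illustration.

```python
def cyclical_beta(step, total_steps, n_cycles=4, ramp_fraction=0.5):
    """KL weight that ramps linearly from 0 to 1 within every cycle.

    n_cycles and ramp_fraction are illustrative values, not taken from the paper.
    """
    cycle_len = total_steps / n_cycles
    position = (step % cycle_len) / cycle_len  # position within the cycle, in [0, 1)
    return min(1.0, position / ramp_fraction)

def thresholded_kl(kl_value, threshold=0.1):
    # KL thresholding: the penalty never drops below the threshold, so the
    # optimizer gains nothing from pushing the KL term all the way to zero.
    return max(kl_value, threshold)

def regularised_loss(nll, kl_walk, kl_graph, step, total_steps):
    # Negative ELBO with both KL terms annealed by the same cyclical weight.
    beta = cyclical_beta(step, total_steps)
    return nll + beta * (thresholded_kl(kl_walk) + thresholded_kl(kl_graph))
```

For example, with `total_steps=100` the weight is 0 at steps 0, 25, 50 and 75, and stays at 1 during the second half of every 25-step cycle.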

E.2 ADDITIONAL RESULTS

Here we report results complementing the conclusions of the main text.

Language modelling. Table 6 displays our perplexity results on all datasets, just as in the main text. In the last four rows we additionally report the mean and standard deviation we obtained when repeating the experiments with the HSN model five times, with different initializations. The conclusion of the main text, viz. that our results outperform all baselines, remains unaltered, even within error bars. We additionally report in Table 7 the mean values of the KL for five 100-symbol HSN runs.

Graph statistics. We characterize the structure of G in terms of five statistics: (i) the diameter D, which measures the maximum path length over all node pairs in G; (ii) the average distance l, which instead measures the average shortest path length between all node pairs; (iii) the clustering coefficient C, which represents the probability that two neighbors of a randomly chosen node are themselves neighbors; (iv) the number of connected components CC; and (v) the degree distribution P(k), which represents the probability that a randomly chosen node will have k neighbors. Table 8 reports the statistics of our inferred graphs for all datasets and all model configurations. We can see that increasing the random walk length from 5 to 20 increases the number of connected components of the graphs. As a consequence, subsets of word sequences are mapped onto smaller subgraphs, the largest of which has about 50 symbols. One could argue that, since longer random walks imply a larger set of possible schema configurations, the number of symbols required to describe our three corpora can simply decrease. In other words, fewer symbols are needed by long schemata. Similarly, directly increasing the number of symbols also leads to a larger number of connected components. Indeed, even the short schemata in Yelp and Yahoo do not use all available symbols to model the corpora.

Representation learning.
We can get a graphical picture of the features we just discussed in Figures 8-10 below. Very importantly, we see that the schema distribution is different for each category of each corpus in all model configurations. In other words, we do not observe any kind of mode collapse. Finally, we have also explored "schema interpolations": given two schemata $e_{j_1:j_L}$ and $e_{m_1:m_L}$, we find the shortest path (of length $l$) on G connecting the end of $e_{j_1:j_L}$ with the beginning of $e_{m_1:m_L}$. Our interpolation steps are the schemata $\{e_{j_{1+i}:j_{L+i}} : 0 \le i \le l + L\}$ along the path.

F WHICH SYMBOLS DO WORDS ATTEND TO? A PRELIMINARY STUDY ON YELP REVIEWS

In this section we investigate how symbols are used by HSN when generating text. We do this by exploring the decoder attention matrix between the symbols and the generated tokens. Reading the attention weights, we can examine which symbols are most important for the generation of any given token, i.e. which symbols are attended to more strongly. In more detail, we select, for a given token in a given sentence, the symbol with the highest attention value. We can then compute the distribution of most attended symbols when generating that token over the complete dataset. Thus, for a model trained on the Yelp dataset, we examine which symbols the decoder of HSN attends to when processing the words good, great and bad. Figure 4 shows the most attended symbol distribution for layers 1 (first), 5 (middle) and 12 (last), when averaging the attention matrices over all attention heads. Figures 5, 6 and 7 show these distributions for each head separately. Note how, for a fixed token, the distribution of attention changes as one moves between heads and layers, although there are also some repeating patterns. We can quantify these features by computing the Kullback-Leibler (KL) divergence between these distributions. The KL values are shown in Table 9.
Interestingly enough, the distribution of symbols that are attended to when processing the word great is closer to the distributions of symbols attended by the word good, than to the distributions of symbols attended by the word bad. 
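The analysis of this section boils down to two small computations: a histogram of the most attended symbol per token occurrence, and a KL divergence between two such histograms. The NumPy sketch below shows one way to carry them out. The array layout, the function names and the additive smoothing used to keep the KL finite when a symbol is never attended to are assumptions, since the text does not specify these details.

```python
import numpy as np

def most_attended_symbol_distribution(attention, schemata, K):
    """Histogram of the most attended symbol over token occurrences.

    attention: (N, L) attention weights from one generated token (e.g. every
               occurrence of "good") to the L schema positions
    schemata:  (N, L) integer symbol indices j_1, ..., j_L of each schema
    K:         total number of symbols in the graph
    """
    top_position = attention.argmax(axis=1)
    top_symbol = schemata[np.arange(len(schemata)), top_position]
    counts = np.bincount(top_symbol, minlength=K)
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-8):
    # Additive smoothing (an assumption) avoids log(0) for unused symbols.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```

Computing `kl_divergence` between the histograms of two tokens, separately for every head and layer, gives per-head values that can then be averaged as described in the caption of Table 9.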



Note that throughout the paper we refer only to compositionality of representations and not to the compositional functions that can be implemented by the models we use. The latter, functional compositionality, is studied by e.g. Hupkes et al. (2020).

Explicitly, $j_i$ denotes the index of the non-zero component of $z_i$, i.e. $j_i = \{k \in [1, K] : z^k_i = 1\}$, with the superindex $k$ denoting the components of $z_i$.

We could, of course, now study the harder problem for which the symbols are unknown. However, the learned graph model would not be aligned with G*, making the graph comparison non-trivial.

Note, in particular, that BART (Lewis et al., 2020) has 400M parameters, whereas HSN has 250M.



Figure 1: Left: Diagram of the Hidden Schema Network model. Center: Decoder architecture as a modified GPT-2 model of M layers, with a pseudo-self-attention mechanism to attend to the schema $e_{j_1:j_L}$. Please see the Appendix for details. The "c" operation denotes concatenation. Right: Encoder architecture as a BERT model, followed by a single Transformer block. In both the center and right panels, purple shaded blocks represent submodules with pretrained parameters. Pink shaded blocks represent submodules with randomly initialized parameters.

Figure 2: Empirical degree distributions of the graphs inferred from each corpus. Results correspond to HSN with L = 5, K = 50. We also show the distribution for random graphs with p = 0.5. The graphs are sampled 500 times.

Figure 3: Schema distributions inferred by HSN(50, 5) for four labels of the Yahoo corpus. The node positions in the figure are consistent among labels and were computed using a force-directed embedding of the global graph G.

Figure 4: Distribution of most attended symbols when generating tokens good, bad, great for HSN(100, 5) trained on the Yelp data set. The decoder attention matrices between symbols and output are averaged over all attention heads for layers 1, 5 and 12.

Figure 5: Distribution of most attended symbols when generating tokens good, bad, great for HSN(100, 5) trained on the Yelp data set. The distribution is computed from the decoder attention matrices between symbols and output for each attention head for layer 1.

Figure 6: Distribution of most attended symbols when generating tokens good, bad, great for HSN(100, 5) trained on the Yelp data set. The distribution is computed from the decoder attention matrices between symbols and output for each attention head for layer 5.

Figure 7: Distribution of most attended symbols when generating tokens good, bad, great for HSN(100, 5) trained on the Yelp data set. The distribution is computed from the decoder attention matrices between symbols and output for each attention head for layer 12.

Figure 10: Schema distributions inferred from each category of the Yelp dataset. The node positions in the figure are consistent among labels and were computed using a force-directed embedding of the global graph G.


Left: Perplexity per word (lower is better) on three datasets. GPT2 [one epoch] and Optimus results were extracted from Li et al. (2020). iVAEMI was taken from Fang et al. (2019). GPT2 [fine-tuned] was computed by us. End-of-sequence tokens are kept during evaluation.

Random 611.69 ± 17.61 2.00 ± 0.00 1.50 ± 0.01 0.50 ± 0.02 1.00 ± 0.00 50.00 ± 0.00

Statistics of Schema Networks per corpus with K = 50 and L = 5. Random denotes an Erdős-Rényi model with p = 0.5 for the corresponding K.

Metrics of object generation quality for ATOMIC dataset. COMET(GPT2-XL) and COMET(BART)

Inference on ground-truth random graphs. Here we use the notation HS(p) to denote Hidden Schema Network models with prior graph distributions whose edge probability is set to p.



Kullback-Leibler divergence for 100-symbol HSN models (trained 5 times) in all datasets

Statistics of inferred graphs for all datasets

• The PTB training set has a total of 38219 sentences, with an average length of about 22 words. The validation and test sets have 5527 and 5462 sentences, respectively. The minimum (maximum) sentence length in PTB is 2 (78) words.
• The Yahoo training set has a total of 100000 sentences, with an average length of about 80 words. The validation and test sets have 10000 sentences each. The minimum (maximum) sentence length in Yahoo is 21 (201) words. The vocabulary size is 200000 words.
• The Yelp training set has a total of 100000 sentences, with an average length of about 97 words. The validation and test sets have 10000 sentences each. The minimum (maximum) sentence length in Yelp is 21 (201) words. The vocabulary size is 90000 words.

show interpolations of random instances from all datasets. Note how the model successfully interpolates between categories in both Yelp and Yahoo.

Kullback-Leibler divergence between the distributions of most attended symbols, when generating the tokens good, bad and great. Results are computed with HSN(100, 5) trained on Yelp. The KL values are computed for each head separately and then averaged.

G ON COMMONSENSE REASONING GENERATION

In this section we expatiate on the details of our approach to commonsense reasoning generation.

First, we modify the encoder component of HSN to process the tuples $(s, r, o)$ through a posterior of the form $q_\phi(z_{1:L/2}|[s, r], A)\, q_\phi(z_{L/2+1:L}|[s, r, o], z_{L/2}, A)$, so that the first half of the schema depends on subject and relation only, whereas the second half depends on the entire 3-tuple. As will become evident below, this decoupling is necessary for the inference of novel objects.

Each of the posterior distributions above is modelled with the same architecture, as shown in Fig. 1, but sharing a single pretrained BERT model. That is, we have two copies of all pink-shaded blocks in Fig. 1, one for $q_\phi(z_{1:L/2}|[s, r], A)$, the other for $q_\phi(z_{L/2+1:L}|[s, r, o], z_{L/2}, A)$, and a single pretrained BERT model.

Using such a 2-component encoder model we are able to successfully infer schema representations for the KG tuples, as shown in Table 4. The task is however to infer new objects, given only subject-relation pairs. We thus need a way to infer schema representations without relying on the phrase object $o$.

The classical solution to this inference problem is to replace, à la Kalman Filter, the posterior $q_\phi(z_{L/2+1:L}|[s, r, o], z_{L/2}, A)$ with a local prior model of the form $p_\theta(z_{L/2+1:L}|[s, r], z_{L/2}, A)$, and train the prior via the KL term in Eq. 12.

As shown in Section B, maximizing the mutual information between data and representations averages out all local information in the KL term, and thus hinders the learning of the prior. See e.g. HSN[prior] in Table 4: the samples from the prior are not close enough to those of the posterior, hence the significant drop in performance of the model.

An alternative is to train, in the spirit of knowledge distillation (Hinton et al., 2015), a third-party model on the inferred schemata, to predict $z_{L/2+1:L}$ conditioned on $z_{1:L/2}$.

Indeed, given the inferred schemata from the training KG, we consider a sequence-to-sequence model which inputs $z_{1:L/2}$, together with the subject-relation pair, and outputs $z_{L/2+1:L}$. That is, a model of the form $p_\theta(z_{L/2+1:L}|z_{1:L/2}, [s, r])$. Specifically we use (i) a bidirectional LSTM network with hidden dimension of 512 to encode the first half of the schemata, (ii) a pretrained BERT model to encode the subject-relation pair, and (iii) an LSTM network of dimension 512 as an autoregressive decoder model. The initial (hidden) states of the latter are determined by an MLP which inputs the representations from the LSTM and BERT encoder models. The model is trained on schemata sampled from the posterior.

Our preliminary results, HSN[KD] in Table 4, show that this approach improves upon the untrained prior model, and even outperforms the stand-alone COMET(GPT-2) model.

G.1 ATOMIC DATASET

For this preliminary study we focus only on the ATOMIC dataset of Sap et al. (2019b). It contains 877K (s, r, o) tuples covering a variety of social commonsense knowledge around specific If-Then events. In more detail, ATOMIC splits its commonsense knowledge into nine categories, covering the event's causes, its effects on the agent, and its effect on other direct (or implied) participants. We use the training splits from Sap et al. (2019b).

with considerable irony the case also shows how completely japan has turned the tables on u.s. business
(1) in brief the chancellor of the exchequer nigel lawson's decisions were justified by their intended political and financial convenience and credit
(2) analysts said they expect the federal authority to be totally revamped giving japanese manufacturers more clear way to measure their exports.
(3) but others say inco commission has been inadequate
(4) in 1970 banco exterior an agency run by banco exterior <unk> de <unk> <unk> was attempting to reduce liabilities and raise the sale of certain works by the division the amended filings also point out that under a new agreement <unk> has an <unk> obligation to sell farmers to axa upon an acquisition of b.a.t

(4) what kind of rules does gravity apply? if a certain weight is placed in a container, the net force applied on it hits the water surface and the right weight will turn into gravity how does a photovoltaic system that feeds back into the power grid get on the same phase angle? or? should i say does it need to be the same as the _UNK's?

Interpolate: Business & Finance Family -Relationships
at 35, am i too old to go to college to become a psychiatrist? i'm 37, and i just started my second semester in a 2 year college... you need to be prepare for the financial aspects, but the social ones are no problem...
(1) what would be a good title for a _UNK _UNK? i have _UNK in _UNK and there are no real courses done for it but i do love the job and i've already done my freshman year. i am currently teaching placement at _UNK and need the same as the average undergraduate student...
(2) has anyone here applied in the past 4 months or is it better to get a try out y _UNK a slightly better long term career _UNK ...
(3) lately im having trouble with my fiancee, how do i bring him back? it obvious at this point that you can't " bring us together ". try playing games.
(4) could i still go out with this guy and still be friends and respect him.? i don't want to just fell in love with the guy that i was with. i want 2 be with friend's girl and still be friends... how do i know if my man, is inlove with me? well... some questions, how old are _UNK? -are you wealthy?, is he wealthy?, how long have you been together...

Table 11: Interpolation between four random instances from the Yahoo dataset

Interpolate: Very bad - bad

do not use this company!! they told me within one hour, then i called again they said the driver have 90 mins. 90 minutes later, they said the driver is in traffic and wait for 15 minutes, i checked google map no accident, all green on all freeway...

(1) i ordered for pick up as my daughter hadn't been told that or even ordered online. when i spoke to the young lady, who was _UNK, she carried on a conversation with not a manager. it's bad customer service and i wouldn't even bother with this place...

(2) place was clean... when i called to let them know i 'd get something else, the person that answering the phone wouldn't understand me... really? i gave this restaurant a b + for the cleanliness of the food and the friendliness of the staff

(3) i had the quesadilla and the carnitas tacos. i felt every bite of these were so rubbery and the potatoes were off. i feel like the service and the quality of food can do much better.

(4) somewhat disappointed. i did it once and loved it but today, today's water is bitter and salty... and the mint and cherry blossom _UNK'flavors just taste that way. the food quality doesn't match the place at all. i think it's ok for a pub but this place is supposed to be a nice place for professional lunches. i had the chicken flatbread and the chicken was more like subway chicken! with so many options around that area i won't pick this place for lunch.

Interpolate: Very bad - Very Good

skip it... there are much better options out there! the " hot " food was not hot, and the flavor was only mediocre at most.

(1) indifferent to locals. the kids size pizzas were a billion times worse than a pizza hut. the quality of food was just awful. i wouldn't recommend this to a significant other for what it is.

(2) this new mexican spot is ok, bordering on childish. i went with friends and ordered a carne asada burro... it wasn't off the hook ; what made this place great were the chips & salsa sucked. yuck! ...

(3) wow. _UNK you give so much frosting!! we were a groupon special for a cupcake for the princess of chocolate, and we were pretty stoked. they were _UNK and creative. they even suggested we try the coconut ... we 'll definitely be back soon.

(4) went for the first time during a recent trip to vegas. our server jeff made special recommendations for our friends and i. it was fantastic most of the food was light and fresh... i would highly recommend this place! i had dinner at republic kitchen tonight for the first time and was very impressed with the service, the decor, the menu, and the food quality... i am going back sunday for their brunch and jazz!

