HIDDEN SCHEMA NETWORKS

Abstract

Starting from the assumption that a large proportion of semantic content is necessarily relational, we introduce a neural language model that discovers networks of symbols (schemata) from text datasets. Using a variational autoencoder (VAE) framework, our model encodes sentences into sequences of symbols (composed representations), which correspond to the nodes visited by biased random walkers on a global latent graph. We first demonstrate that the model is able to uncover ground-truth graphs from artificially generated datasets of random token sequences. Next, we leverage pretrained BERT and GPT-2 language models as encoder and decoder, respectively, to train our model on language modeling and commonsense knowledge generation tasks. Qualitatively, the model infers schema networks whose nodes (symbols) can be interpreted as encoding different aspects of natural language (e.g., topics or sentiments). Quantitatively, our results show that the model successfully interprets the inferred symbol sequences, achieving state-of-the-art scores on language modeling benchmarks. Source code to reproduce all experiments is provided with the supplementary material.

1. INTRODUCTION

Much of the developmental and causal theories of human cognition are predicated on relational structures of knowledge that naturally exhibit compositionality. Semantic content is intrinsically relational, as one is only able to explain a given unit of knowledge (such as a concept, word or perception) insofar as there are other units of knowledge which relate to it (Block, 1986). Thus we can partially construe a concept through its relationships to other concepts (as when we say "a dog is an animal that barks"), just as we can partially construe it through its relationships to our perceptions (when we say "that is a dog" while pointing to a dog on the street) or to the words we use (when we use the word dog to refer to the concept dog). Likewise, we can partially construe words not only through their relationships to concepts or percepts, but also through their relationships to other words, since words that occur in the same context tend to have similar meanings (Harris, 1954; Firth, 1957). Note that it is precisely this contextual semantic content of words that we have explicit access to when processing raw text datasets. On the other hand, generalization, reasoning and understanding seem to be inevitably tied to the compositional nature of knowledge. Indeed, the ability to compose a set of knowledge units (and their relations) into new, more complex relational units, which can be deployed to understand and reason about unseen data (a feature usually referred to as combinatorial generalization), is regarded as key to human-level intelligence (Fodor & Pylyshyn, 1988; Fodor & Lepore, 2002; Lake et al., 2017; Battaglia et al., 2018). Relational structures allowing for compositionality thus seem to comprise not a sufficient, but a necessary attribute of any representation scheme that strives for the generalization power of human cognition.
From the computational side, if one is to inform any modern machine learning model with such structural characteristics, one will initially encounter the problem of finding suitable primitives or data structures. In natural language processing (NLP), for example, it has become commonplace to leverage distributed continuous representations of words (Bengio et al., 2003) for different downstream tasks. Such representations are trained to encode average contextual semantics (precisely the kind of semantic content, typical of word co-occurrence relations, we mentioned above) into a semantic space, which allows meaning to change continuously within it (Mikolov et al., 2013). Yet, despite earlier attempts (Mitchell & Lapata, 2008), it is unclear whether such representations can be meaningfully composed into representations of, say, unseen sentences, and thus mimic the compositional character of natural language. More recently, contextualized continuous word representations inferred by deep learning architectures have shown spectacular results in many NLP tasks (Radford et al., 2018; Devlin et al., 2018; Radford et al., 2019; Brown et al., 2020). Their success stems from those models' ability to infer flexible representations through, inter alia, raw, massive datasets, data-scalable attention mechanisms and minimal inductive biases (Vaswani et al., 2017). These representations are known not only to contain rich contextual word semantics, but also to consistently encode sentence-level grammar (Hewitt & Manning, 2019), and the models from which they are obtained seem to implement some notions of compositionality too (Hupkes et al., 2020; Wei et al., 2022). Nevertheless, it is still unclear whether such representations can be composed into representations of novel sentences (Yu & Ettinger, 2020; Bhathena et al., 2020).
In fact, most of their syntactic properties are implicit and therefore inferred only a posteriori, typically through probes which neither guarantee their presence, nor establish how they were obtained in the first place (Rogers et al., 2020). In this work we use a VAE framework (Kingma & Welling, 2013; Rezende et al., 2014) to develop a language model, the Hidden Schema Network model (HSN), that enforces, via inductive biases, a discrete, relational structure for sentence representation which allows for compositionality¹, while exploiting the well-known advantages of attention models and contextualized, pretrained representations. We first demonstrate that the model is able to uncover ground-truth graphs from artificially generated datasets of random token sequences. Next, we leverage our methodology to translate the implicit lexical and grammatical aspects of language encoded by pretrained BERT and GPT-2 language models into explicit relational structures, and apply the latter to language modeling and commonsense knowledge generation tasks. Our main contribution is thus an exploration of a novel way to integrate discrete (symbols), relational (graphs) and continuous (neural representations) machine learning components into an end-to-end, differentiable representation learning algorithm for natural language modeling. Our aim is to connect the modern NLP paradigm with classical notions of linguistics, and to begin answering recent calls for neuro-symbolic integration (Garcez & Lamb, 2020; Cartuyvels et al., 2021).
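To make the encoding scheme concrete: a sentence representation is the sequence of nodes visited by a biased random walker on a global symbol graph. The sketch below is a minimal, hypothetical illustration (the toy graph, bias weights and function name are ours, not the paper's); in the actual model the step biases are produced by the encoder network and the whole pipeline is trained end-to-end.

```python
import random

def biased_random_walk(adj, bias, start, length, rng):
    """Return a list of `length` node labels (a 'schema') from a walk on
    the symbol graph `adj`; at each step a neighbor is sampled with
    probability proportional to its bias weight."""
    walk = [start]
    node = start
    for _ in range(length - 1):
        nbrs = adj[node]
        node = rng.choices(nbrs, weights=[bias[n] for n in nbrs], k=1)[0]
        walk.append(node)
    return walk

# Toy undirected symbol graph with 5 symbols (placeholder labels).
adj = {
    "s0": ["s1", "s2"],
    "s1": ["s0", "s2", "s3"],
    "s2": ["s0", "s1", "s4"],
    "s3": ["s1", "s4"],
    "s4": ["s2", "s3"],
}
# Illustrative per-symbol bias weights; in the model these come from the encoder.
bias = {"s0": 1.0, "s1": 2.0, "s2": 1.0, "s3": 0.5, "s4": 1.0}

rng = random.Random(0)
schema = biased_random_walk(adj, bias, "s0", length=6, rng=rng)
print(schema)  # a sequence of 6 symbol labels forming the sentence's schema
```

Each symbol in the walk is then mapped to an embedding the decoder can attend to, so the discrete graph structure constrains which symbol sequences (and hence which sentence representations) are reachable.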



¹ Note that throughout the paper we refer only to compositionality of representations and not to the compositional functions that can be implemented by the models we use. The latter, functional compositionality, is studied by e.g. Hupkes et al. (2020).



Figure 1: Left: Diagram of the Hidden Schema Network model. Center: Decoder architecture as a modified GPT-2 model of M layers, with a pseudo-self-attention mechanism to attend to the schema e_{j_1:j_L}. Please see the Appendix for details. The "c" operation denotes concatenation. Right: Encoder architecture as a BERT model, followed by a single Transformer block. In both the center and right figures, purple shaded blocks represent submodules with pretrained parameters, and pink shaded blocks represent submodules with randomly initialized parameters.
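The pseudo-self-attention mechanism mentioned in the caption lets the pretrained GPT-2 layers condition on the schema by projecting the schema embeddings into each layer's key/value space and prepending them to the token keys and values, leaving the pretrained query path untouched. A minimal single-head numpy sketch of this idea (all names, weights and dimensions below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_self_attention(tokens, schema, Wq, Wk, Wv, Wk_s, Wv_s):
    """Single-head sketch: token queries attend over the concatenation of
    projected schema embeddings and token keys/values, with a causal mask
    on the token positions only (every token may look at the whole schema,
    but not at future tokens)."""
    T, L = tokens.shape[0], schema.shape[0]
    q = tokens @ Wq                                           # (T, d_h)
    k = np.concatenate([schema @ Wk_s, tokens @ Wk], axis=0)  # (L+T, d_h)
    v = np.concatenate([schema @ Wv_s, tokens @ Wv], axis=0)  # (L+T, d_h)
    scores = q @ k.T / np.sqrt(q.shape[-1])                   # (T, L+T)
    mask = np.zeros((T, L + T), dtype=bool)
    mask[:, L:] = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal part
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ v                                # (T, d_h)

# Toy dimensions: 4 tokens of width 16, a schema of 3 symbols of width 8,
# one attention head of width 8. All weight matrices are random placeholders.
rng = np.random.default_rng(0)
tokens, schema = rng.normal(size=(4, 16)), rng.normal(size=(3, 8))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
Wk_s, Wv_s = (rng.normal(size=(8, 8)) for _ in range(2))
out = pseudo_self_attention(tokens, schema, Wq, Wk, Wv, Wk_s, Wv_s)
print(out.shape)  # (4, 8): one contextualized vector per token
```

The design point is that only the schema projections (Wk_s, Wv_s here) need fresh, randomly initialized parameters, matching the caption's pink blocks, while the token-side projections can be taken from the pretrained model, matching the purple blocks.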

