HIDDEN SCHEMA NETWORKS

Abstract

Starting from the assumption that a large proportion of semantic content is necessarily relational, we introduce a neural language model that discovers networks of symbols (schemata) from text datasets. Using a variational autoencoder (VAE) framework, our model encodes sentences into sequences of symbols (composed representations), which correspond to the nodes visited by biased random walkers on a global latent graph. We first demonstrate that the model is able to uncover ground-truth graphs from artificially generated datasets of random token sequences. Next, we leverage pretrained BERT and GPT-2 language models as encoder and decoder, respectively, to train our model on language modeling and commonsense knowledge generation tasks. Qualitatively, the model infers schema networks whose nodes (symbols) can be interpreted as encoding different aspects of natural language (e.g., topics or sentiments). Quantitatively, our results show that the model successfully interprets the inferred symbol sequences, achieving state-of-the-art scores on language modeling benchmarks. Source code to reproduce all experiments is provided with the supplementary material.
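To make the latent mechanism concrete, the following is a minimal sketch of a biased random walk over a fixed symbol graph, which is the kind of process that generates the symbol sequences described above. All names and the toy graph are illustrative assumptions, not the paper's actual parameterization (in particular, the model learns the graph and biases end-to-end rather than fixing them by hand).

```python
import numpy as np

def biased_random_walk(adj, bias, start, length, rng):
    """Sample a walk of `length` nodes on the graph given by the 0/1
    adjacency matrix `adj`; at each step the next node is drawn among
    the current node's neighbours with probability proportional to the
    neighbours' (positive) bias weights."""
    walk = [start]
    for _ in range(length - 1):
        current = walk[-1]
        neighbours = np.flatnonzero(adj[current])
        if neighbours.size == 0:  # dead end: stay in place
            walk.append(current)
            continue
        probs = bias[neighbours] / bias[neighbours].sum()
        walk.append(int(rng.choice(neighbours, p=probs)))
    return walk

# Toy latent graph over 4 symbols (an undirected 4-cycle) with
# hand-chosen node biases; symbol 1 is visited more often than symbol 3.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
bias = np.array([1.0, 2.0, 1.0, 0.5])
rng = np.random.default_rng(0)
walk = biased_random_walk(adj, bias, start=0, length=6, rng=rng)
```

In the full model, a sentence is encoded into such a walk (its sequence of visited symbols), and the decoder is conditioned on that symbol sequence to reconstruct the sentence.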

1. INTRODUCTION

Many of the developmental and causal theories of human cognition are predicated on relational structures of knowledge that naturally exhibit compositionality. Semantic content is intrinsically relational: one can only explain a given unit of knowledge (such as a concept, word, or perception) insofar as there are other units of knowledge that relate to it (Block, 1986). Thus we can partially construe a concept through its relationships to other concepts (as when we say "a dog is an animal that barks"), just as we can partially construe it through its relationships to our perceptions (when we say "that is a dog" while pointing to a dog on the street) or to the words we use (when we use the word dog to refer to the concept dog). Likewise, we can partially construe words not only through their relationships to concepts or percepts, but also through their relationships to other words, as words that occur in the same contexts tend to have similar meanings (Harris, 1954; Firth, 1957). Note that it is precisely this contextual semantic content of words that we have explicit access to when processing raw text datasets. On the other hand, generalization, reasoning, and understanding seem inevitably tied to the compositional nature of knowledge. Indeed, the ability to compose a set of knowledge units (and their relations) into new, more complex relational units, which can be deployed to understand and reason about unseen data (a feature usually referred to as combinatorial generalization), is regarded as key to human-level intelligence (Fodor & Pylyshyn, 1988; Fodor & Lepore, 2002; Lake et al., 2017; Battaglia et al., 2018). Relational structures allowing for compositionality thus seem to constitute not a sufficient, but a necessary attribute of any representation scheme that strives for the generalization power of human cognition.
From the computational side, if one is to inform any modern machine learning model with such structural characteristics, one initially encounters the problem of finding suitable primitives or data structures. In natural language processing (NLP), for example, it has become commonplace to leverage distributed continuous representations of words (Bengio et al., 2003) for different downstream tasks. Such representations are trained to encode average contextual semantics (precisely the kind of semantic content, typical of word co-occurrence relations, that we mentioned above) into a semantic space within which meaning can change continuously (Mikolov et al., 2013). Yet, despite earlier attempts (Mitchell & Lapata, 2008), it is unclear whether such representations can be meaningfully composed into representations of, say, unseen sentences, and thus mimic the compositional character of natural language. More recently, contextualized continuous word

