SYMBOL-SHIFT EQUIVARIANT NEURAL NETWORKS

Abstract

Neural networks have been shown to have poor compositional abilities: while they can produce sophisticated output given sufficient data, they generalize patchily and fail to generalize to new symbols (e.g., replacing a name in a sentence with a less frequent one, or with one not seen during training). In this paper, we define a class of models whose outputs are equivariant to entity permutations (an analog being convolutional networks, whose outputs are invariant to translation) without requiring entities to be specified or detected in a pre-processing step. We then show how two question-answering models can be made robust to entity permutation using a novel differentiable hybrid semantic-symbolic representation. The benefits of this approach are demonstrated on a set of synthetic NLP tasks where sample complexity and generalization are significantly improved, even allowing models to generalize to words never seen in the training set. When using only 1K training examples for bAbI, we obtain a test error of 1.8% and fail only one task, while the best results reported so far obtained an error of 9.9% and failed 7 tasks.

1. INTRODUCTION

Previous work has shown that neural networks fail to generalize to new symbols (Lake & Baroni, 2018; Sinha et al., 2019; Hupkes et al., 2019). In particular, Lake & Baroni (2018) showed that seq2seq models can perfectly learn a set of rules given enough data, yet fail to generalize these learned rules to new symbols. We illustrate the generalization issue of current models in the context of question answering (QA) on the first task of bAbI (Weston et al., 2015). This dataset defines a set of tasks testing which types of reasoning a question-answering system can achieve (e.g., several supporting facts, compound reference, positional reasoning). Each task consists of a set of stories with an associated question such as: "John took the apple. John traveled to the hallway. Who has the apple?" Clearly, we would expect a QA system to answer the previous example when "John" is replaced by "Sasha", "Bob", or any possible name, even one not seen during training. To investigate whether QA models perform abstraction over symbols, we run an experiment where the training and test sets of the first bAbI task are regenerated with an increasing number of names. Fig. 1 shows how the performance of Memory Networks (MN) (Sukhbaatar et al., 2015) and third-order tensor product RNNs (TPR) (Schlag & Schmidhuber, 2018) dramatically drops as the number of names increases, in contrast to their symbolic counterparts SMN and STPR proposed in this paper. Both symbolic models reach errors well below the 5% threshold even when the number of names and the vocabulary become considerably larger than in the original task. The main contribution of this paper is a hybrid semantic/symbolic representation that is equivariant to entity permutation. The main advantage and novelty of our approach is that entities need not be identified in advance: we rely solely on differentiation to determine whether a word acts like an entity.
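The permutation equivariance referred to above can be made concrete with a minimal NumPy sketch. This is not the paper's architecture; the dimensions and random weights are illustrative. A layer that applies the same transformation to every entity slot, plus a permutation-invariant (sum) pooling term, satisfies f(PX) = P f(X): swapping two entities (e.g., two names) simply swaps the corresponding output rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared weights: the same map is applied to every entity slot.
W_self = rng.normal(size=(8, 8))
W_pool = rng.normal(size=(8, 8))

def equivariant_layer(X):
    """X: (n_entities, d). Output rows permute exactly as the input rows do."""
    pooled = X.sum(axis=0, keepdims=True)   # permutation-invariant summary
    return np.tanh(X @ W_self + pooled @ W_pool)

X = rng.normal(size=(5, 8))                 # 5 entity slots, e.g. 5 names
P = np.eye(5)[rng.permutation(5)]           # random permutation matrix

# Equivariance check: f(P X) equals P f(X)
assert np.allclose(equivariant_layer(P @ X), P @ equivariant_layer(X))
```

Because the summary is a sum over slots, it is unchanged by any reordering, so the only slot-dependent part of the output is the pointwise term, which permutes with the input.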
We show how to extend two question-answering models to handle this hybrid representation and demonstrate in extensive experiments the benefits of such an approach: sample complexity is significantly improved, better compositionality is obtained, and symbolic models reach better accuracy on the studied tasks, in particular when trained with less data. The paper starts by reviewing related work; we then formally introduce what it means to permute entities. We then define layers that are robust to such perturbations and show how two recent question-answering models can be adapted in this context. Finally, experiments are conducted to assess the benefits of our method.

2. RELATED WORKS

Improving the compositionality of neural networks has been an important ongoing effort in the past years. The SCAN dataset proposed by Lake & Baroni (2018) initially showed how standard neural network baselines can fail to generalize to new symbols when learning a set of artificially constructed rules. Several approaches were proposed to solve this issue. For instance, Lake (2019) designed meta-learning episodes that led the model to solve the task, and Nye et al. (2020) showed how one could infer symbolic neural programs with a similar meta-learning procedure. Alternatively, Gordon et al. (2020) proposed to design an equivariant model (a model whose latent representations are unchanged when permuting symbols). A common limit of these approaches is that they require specifying which words are symbols in advance (Lake (2019) and Nye et al. (2020) also require a substantial amount of supervision and the design of meta-episodes). An exception is Russin et al. (2019), which proposed to decompose syntax and semantics for SCAN. None of these approaches can generalize to an arbitrarily large number of entities, or to entities not seen during training, as the one we propose does. The problem of compositionality becomes much easier if symbols (or entities) are detected beforehand. For instance, Li et al. (2015) showed that replacing entities by dedicated token placeholders leads to significant improvements in question answering. The same approach has also been applied in machine translation and data-to-text generation (Luong et al., 2015; Serban et al., 2016; Lebret et al., 2016) to enable sequence-to-sequence models to generalize to unseen words at inference time. While specifying entities in advance (or detecting them in a pre-processing step with named-entity recognition (Marsh & Perzanowski, 1998)) before applying a model may give compositionality, we would clearly prefer models that can automatically infer whether a word should behave as a symbol. While positional encoding (Graves et al., 2014; Vaswani et al., 2017) may give some compositionality, as it allows reasoning over positions, this solution is not practical for language, since inter-word distances are not fixed: for instance, the distance between a noun and its verb varies, and positional embedding is not enough to achieve compositionality (Hupkes et al., 2019). An interesting line of research has been the study of equivariant models, whose representations are invariant (or equivariant) to symmetries present in the data (Zaheer et al., 2017; Ravanbakhsh et al., 2017). Adding invariance to data symmetries has been theoretically shown to drastically reduce sample complexity (Sannai & Imaizumi, 2019).

For instance, convolutional neural networks require significantly less training data and achieve much better performance than an MLP because they are invariant to image translation. Gordon et al. (2020) proposed the first NLP model provably capable of handling symmetries between symbols, albeit requiring such symmetries to be specified in advance. Tensor product representations (TPR) (Smolensky, 1990) store complex relations between values and variables with distributed representations and offer some compositionality. Recently, Schlag & Schmidhuber (2018) proposed an architecture able to learn TPR parameters by differentiation and obtained state-of-the-art results on bAbI at the time of publication. However, the compositionality of the proposed approach is limited (as shown in Fig. 1) by the fact that every entity needs to be seen sufficiently many times for a proper entity vector to be found; in addition, the model has been shown to learn orthogonal representations for entities, which requires as many hidden dimensions as the total number of entities.

Figure 1: Test error on the first bAbI task when increasing the number of names. Errors of the symbolic models are all below 1%.
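The tensor product binding discussed above can be sketched in a few lines of NumPy. This is an illustration of Smolensky-style role/filler binding with hand-chosen orthonormal roles, not the learned mechanism of Schlag & Schmidhuber (2018); vector dimensions and names are hypothetical. It also shows why orthogonal role representations matter: with orthonormal roles, unbinding retrieves each filler exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Two orthonormal role vectors (columns of Q from a QR decomposition).
roles = np.linalg.qr(rng.normal(size=(d, d)))[0][:, :2]
john, apple = rng.normal(size=d), rng.normal(size=d)   # filler vectors

# Bind each filler to its role with an outer product; superimpose the bindings
# into a single distributed memory (the tensor product representation).
memory = np.outer(john, roles[:, 0]) + np.outer(apple, roles[:, 1])

# Unbind: multiplying by a role vector retrieves its filler exactly,
# because the cross-term vanishes when the roles are orthonormal.
assert np.allclose(memory @ roles[:, 0], john)
assert np.allclose(memory @ roles[:, 1], apple)
```

If the roles were not orthogonal, each retrieval would be contaminated by cross-talk from the other bindings, which is one way to see the hidden-dimension requirement noted above: storing many entities without interference needs as many mutually orthogonal directions as entities.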

