SYMBOL-SHIFT EQUIVARIANT NEURAL NETWORKS

Abstract

Neural networks have been shown to have poor compositional abilities: while they can produce sophisticated outputs given sufficient data, they generalize patchily and fail to handle new symbols (e.g. replacing a name in a sentence with a less frequent one, or with one not seen yet). In this paper, we define a class of models whose outputs are equivariant to entity permutations (an analogy being convolutional networks, whose outputs are invariant to translation) without requiring entities to be specified or detected in a pre-processing step. We then show how two question-answering models can be made robust to entity permutation using a novel differentiable hybrid semantic-symbolic representation. The benefits of this approach are demonstrated on a set of synthetic NLP tasks where sample complexity and generalization are significantly improved, even allowing models to generalize to words never seen in the training set. When using only 1K training examples for bAbi, we obtain a test error of 1.8% and fail only one task, while the best results reported so far obtain an error of 9.9% and fail 7 tasks.

1. INTRODUCTION

Previous work has shown how neural networks fail to generalize to new symbols (Lake & Baroni, 2018; Sinha et al., 2019; Hupkes et al., 2019). In particular, Lake & Baroni (2018) showed that seq2seq models are able to perfectly learn a set of rules given enough data; however, they fail to generalize these learned rules to new symbols. We illustrate the generalization issue of current models in the context of question answering (QA) on the first task of bAbi (Weston et al., 2015). This dataset defines a set of tasks testing which types of reasoning can be achieved by a question-answering system (e.g. several supporting facts, compound reference, positional reasoning, etc.). Each task consists of a set of stories with an associated question, such as: "John took the apple. John traveled to the hallway. Who has the apple?" Clearly, we would expect a QA system to be able to answer the previous example when "John" is replaced by "Sasha", "Bob", or any possible name, even one not seen during training. To investigate whether QA models perform abstraction over symbols, we perform an experiment where the training and test sets of the first bAbi task are regenerated with an increasing number of names. Fig. 1 shows how the performance of Memory-Networks (MN) (Sukhbaatar et al., 2015) and Third-order tensor product RNN (TPR) (Schlag & Schmidhuber, 2018) dramatically drops as the number of names increases, in contrast to their symbolic counterparts SMN and STPR proposed in this paper. Both symbolic models reach errors well below the 5% threshold even when the number of names and the vocabulary become considerably larger than in the original task. The main contribution of this paper is the proposal of a hybrid semantic/symbolic representation that is equivariant to entity permutation. The main advantage and novelty of our approach is that entities are not required to be identified in advance, as we rely solely on differentiation to determine whether a word acts like an entity.
We show how to extend two question-answering models to handle this hybrid representation and demonstrate the benefits of the approach in extensive experiments: sample complexity is significantly improved, better compositionality is obtained, and symbolic models reach better accuracy on the studied tasks, in particular when trained with less data. The paper starts by reviewing related work; we then formally introduce what it means to permute entities, define layers that are robust to such perturbations, and show how two recent question-answering models can be adapted in this context. Finally, experiments are conducted to assess the benefits of our method.

2. RELATED WORKS

Improving the compositionality of neural networks has been an important ongoing effort in the past years. The SCAN dataset proposed by Lake & Baroni (2018) initially showed how standard neural network baselines can fail to generalize to new symbols when learning a set of artificially constructed rules. Several approaches were proposed to address this issue. For instance, Lake (2019) designed meta-learning episodes that led the model to solve the task, and Nye et al. (2020) showed how one could infer symbolic neural programs with a similar meta-learning procedure. Alternatively, Gordon et al. (2020) proposed to design an equivariant model (a model whose latent representations are unchanged when permuting symbols). A common limit of these approaches is that they require specifying in advance which words are symbols (Lake (2019) and Nye et al. (2020) also require a substantial amount of supervision and the design of meta-episodes). An exception is Russin et al. (2019), who proposed to decompose syntax and semantics for SCAN. None of these approaches can generalize to arbitrarily large numbers of entities, or to entities not seen in training, as the one we propose does. The problem of compositionality becomes much easier if symbols (or entities) are detected beforehand. For instance, Li et al. (2015) showed that replacing entities with dedicated token placeholders leads to significant improvements in question answering. The same approach has also been applied in Machine Translation and Data-to-Text generation (Luong et al., 2015; Serban et al., 2016; Lebret et al., 2016) to enable sequence-to-sequence models to generalize to unseen words at inference time. While specifying entities in advance (or detecting them in a pre-processing step with Named-Entity Recognition (Marsh & Perzanowski, 1998)) before applying a model may give compositionality, we would clearly prefer models able to infer automatically whether a word should behave as a symbol or not.
While positional encoding (Graves et al., 2014; Vaswani et al., 2017) may give some compositionality, as it allows reasoning over positions, this solution is not practical for language because inter-word distances are not fixed. For instance, the distance between a noun and its verb varies, and positional embedding is not enough to achieve compositionality (Hupkes et al., 2019). An interesting line of research has been the study of equivariant models, whose representations are invariant (or equivariant) to symmetries present in the data (Zaheer et al., 2017; Ravanbakhsh et al., 2017). Adding invariance to data symmetries has been theoretically shown to drastically reduce sample complexity (Sannai & Imaizumi, 2019). For instance, convolutional neural networks require significantly less training data and achieve much better performance than an MLP because they are invariant to image translation. Gordon et al. (2020) proposed the first NLP model provably capable of handling symmetries between symbols, albeit requiring such symmetries to be specified in advance. Tensor product representations (TPR) (Smolensky, 1990) store complex relations between values and variables with distributed representations and offer some compositionality. Recently, Schlag & Schmidhuber (2018) proposed an architecture able to learn TPR parameters by differentiation and obtained state-of-the-art results on bAbi at the time of publication. However, the compositionality of their approach is limited (as shown in Fig. 1) by the fact that every entity needs to be seen sufficiently many times for a proper entity vector to be found; in addition, the model has been shown to learn orthogonal representations for entities, which requires as many hidden dimensions as the total number of entities.
Finally, the VARS approach (Vankov & Bowers, 2020) consists in outputting a one-hot vector representing a symbolic variable that is randomly assigned to different positions during training to enforce compositionality. While this grants some compositionality, the approach is limited since one must draw symbol permutations so that each object is seen in all possible one-hot values. In addition, one must specify in advance which object or word behaves as a symbol, and the method only supports symbolic outputs: it cannot represent symbolic inputs nor perform computations with a hybrid representation (combining semantic and symbolic representations) as we propose.

3. SYMBOL-SHIFT EQUIVARIANCE

When we learn to answer "John" given a specific context, we would like to be able to answer "Sasha" if both names were permuted in the context. In what follows, we introduce the notion of symbol-shift equivariance: a condition restricting possible permutations, since some permutations may perturb the sentence grammar (permuting "John" with "why") or cause ambiguity (permuting "John" with "Mary" if the question involves a gender). We assume all words are taken from a vocabulary V, a discrete set of n words. We are interested in providing an answer a ∈ V given a context consisting of a question q = [q_1, …, q_{n_q}] ∈ V^{n_q} and a list of sentences (or stories) x = [x_1, …, x_T] with x_i = [x_{i1}, …, x_{in_i}] ∈ V^{n_i}. We denote X = (x, q) and Φ(X) = a the function that predicts the answer a ∈ V given the context X = (x, q) ∈ X. Let Γ : V → V be a word permutation. Given a sequence [y_1, …, y_n], we define Γ(y) = [Γ(y_1), …, Γ(y_n)], applying the permutation to each word of the sequence; similarly, Γ(X) = ([Γ(x_1), …, Γ(x_T)], Γ(q)). Assuming each word has an associated vector parameter, we say that a permutation is a symbol-shift if it preserves vector parameters. For instance in Figure 2, a map permuting "John" with "Sasha" is a symbol-shift, as both words share the same parameters, but a map permuting "John" and "lemon" is not. Formally, assuming each word i ≤ n has an associated set of parameters v_i ∈ R^D, we say that a permutation Γ : V → V is a symbol-shift if v_i = v_{Γ(i)} for all i ≤ n. We are now ready to define symbol-shift equivariance. A critical advantage is that we do not need to specify symmetries between entities in advance: we instead rely on vector semantics whose embeddings are learned end-to-end.

Definition 1. Let Φ : X → V be a function mapping a context to an answer. We say that Φ is symbol-shift equivariant if, for any symbol-shift Γ and any X ∈ X, Φ(Γ(X)) = Γ(Φ(X)).
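To make Definition 1 concrete, here is a minimal Python sketch (illustrative only, not the paper's code) of how a symbol-shift acts on a context; the word lists and the permutation `gamma` are invented for illustration:

```python
def apply_shift(gamma, context):
    """Apply a word permutation Gamma to every sentence of a (stories, question) pair."""
    stories, question = context
    return ([[gamma.get(w, w) for w in s] for s in stories],
            [gamma.get(w, w) for w in question])

stories = [["john", "took", "the", "apple"],
           ["john", "went", "to", "the", "hallway"]]
question = ["who", "has", "the", "apple"]
# A symbol-shift may only swap words sharing the same vector parameters,
# e.g. two first names; swapping "john" with "why" would not qualify.
gamma = {"john": "sasha", "sasha": "john"}

shifted = apply_shift(gamma, (stories, question))
# Definition 1 then demands Phi(shifted) == gamma(Phi(original)):
# if the model answers "john" on the original, it must answer "sasha" here.
```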

4. SYMBOLIC QUESTION ANSWERING

In this section, we show how to define symbol-shift equivariant models. The main idea consists of concatenating two representations: a standard semantic representation in R^d and a symbolic representation in R^m, where m denotes the number of distinct words in the stories and question. The symbolic representation is constructed such that, for i ≤ m, the i-th component corresponds to the i-th word appearing in the context. For instance in Figure 3, there are m = 5 words in the context and "apple" is the fourth word by order of appearance, so its symbolic embedding is the fourth one-hot vector. The model's symbolic output has its largest probability on the fourth word of the context, which is dereferenced to "apple". We now describe formally how the symbolic representations are constructed and how we perform linear transformations and projections back to the original vocabulary. Finally, we derive the symbolic counterparts of Memory-Networks and TPR, which will be proved to be symbol-shift equivariant.

Mapping words into and from symbolic representations. In each input example, the set of words present in the stories x and the question q is denoted C_X = {q} ∪ {x_i, i ≤ T}, and we let m = |C_X| be the number of distinct words in the context. To project a word to its symbolic representation in R^m, we represent each unique word by the one-hot vector of the position of its first appearance, using the bijection ϕ_X : C_X → [1, m]. To dereference a symbolic representation in R^m back to its vocabulary id, we define the matrix B_ϕ ∈ R^{n×m} that maps one-hot vectors of symbolic representations to one-hot vectors in R^n representing the word id in the vocabulary, as shown in Fig. 3, such that B_ϕ e^m_j = e^n_{ϕ^{-1}(j)} for j ≤ m, where e^k_l ∈ R^k denotes the l-th one-hot vector in R^k.
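The order-of-appearance map ϕ_X and the dereference matrix B_ϕ can be sketched in a few lines of numpy (an illustrative toy with a made-up vocabulary; indices are 0-based here rather than the paper's 1-based convention):

```python
import numpy as np

def order_of_appearance(context_words):
    """phi_X: map each distinct word to its order of first appearance."""
    phi = {}
    for w in context_words:
        if w not in phi:
            phi[w] = len(phi)
    return phi

def dereference_matrix(phi, vocab):
    """B_phi in R^{n x m}: sends the one-hot of context position j
    to the vocabulary one-hot of the word first appearing at j."""
    B = np.zeros((len(vocab), len(phi)))
    for w, j in phi.items():
        B[vocab[w], j] = 1.0
    return B

vocab = {w: i for i, w in enumerate(["john", "took", "the", "apple", "who", "orange"])}
context = ["john", "took", "the", "apple", "who", "took", "the", "apple"]
phi = order_of_appearance(context)   # m = 5 distinct words
B = dereference_matrix(phi, vocab)   # shape (6, 5)

# dereferencing the symbolic one-hot of "apple" recovers its vocabulary id
e_apple = np.zeros(len(phi)); e_apple[phi["apple"]] = 1.0
assert np.argmax(B @ e_apple) == vocab["apple"]
```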
Note that given a one-hot symbolic representation p ∈ R^m, the i-th coordinate of B_ϕ p ∈ R^n is given by:

[B_ϕ p]_i = p_{ϕ(i)} if i ∈ C_X, 0 otherwise    (1)

hence B_ϕ dereferences a symbolic representation p ∈ R^m to a word vector B_ϕ p ∈ R^n.

Hybrid semantic-symbolic embeddings. We embed words as the concatenation of a standard semantic word embedding and a symbolic embedding, parametrized respectively by A ∈ R^{d×n} and α ∈ [0, 1]^n. The semantic embedding maps a word x ∈ [1, n] to Ae_x ∈ R^d, while the symbolic embedding of x consists of the one-hot vector of the order of appearance of x in its context, multiplied by a learned parameter. More precisely, it is defined as α_x e_{ϕ(x)} ∈ R^m, where α_x is the output of a sigmoid unit on a learnable parameter (i.e. 0 < α_x < 1) indicating how much the word should behave as a symbol, and e_{ϕ(x)} is the one-hot vector of the order of appearance of x in its context. The final embedding of a word x is the concatenation of the semantic and symbolic parts:

x → Ae_x ⊕ α_x e_{ϕ(x)} ∈ R^{d+m}    (2)

where ⊕ denotes the concatenation operator. Note that all parameters are differentiable, allowing the model to learn both word semantics and how much each word should behave as a symbol. The symbolic part will be shown to make the model robust to symbol permutation while being able to represent an arbitrary number of symbols and generalize to new ones. For instance in Fig. 3, one can see that permuting "John" with "Sasha" would not change the embeddings (as long as both names share the same parameters), since both words would still appear in the same order in the context.

Symbolic projection. Given an internal state h = h_sem ⊕ h_sym ∈ R^{d+m}, we interpret it as a distribution over the vocabulary p ∈ R^n with:

p_sem = softmax(B h_sem)    p_sym = B_ϕ softmax(h_sym)    (3)
p = β p_sem + (1 − β) p_sym    (4)

where B ∈ R^{n×d} and β ∈ [0, 1] are parameters to learn.
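The hybrid embedding (Eq. 2) and the mixture projection (Eqs. 3 and 4) can be sketched as follows (an illustrative numpy toy with random placeholder parameters; `A`, `alpha`, `B_proj`, `B_phi` and `beta` stand in for the learned quantities):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d, m = 6, 4, 5                                   # vocab size, semantic dim, context size
A = rng.normal(size=(d, n))                         # semantic embedding matrix
alpha = 1.0 / (1.0 + np.exp(-rng.normal(size=n)))   # sigmoid "symbolness" in (0, 1)

def embed(word_id, phi_pos):
    """Hybrid embedding: A e_x concatenated with alpha_x e_phi(x) (Eq. 2)."""
    sym = np.zeros(m)
    sym[phi_pos] = alpha[word_id]
    return np.concatenate([A[:, word_id], sym])

B_proj = rng.normal(size=(n, d))   # semantic output matrix B
B_phi = np.eye(n)[:, :m]           # toy dereference matrix: context word j is vocab word j
beta = 0.7

def project(h):
    """Mixture projection of Eqs. 3-4."""
    h_sem, h_sym = h[:d], h[d:]
    p_sem = softmax(B_proj @ h_sem)
    p_sym = B_phi @ softmax(h_sym)
    return beta * p_sem + (1.0 - beta) * p_sym

p = project(embed(3, 3))
assert abs(p.sum() - 1.0) < 1e-9   # a valid distribution over the vocabulary
```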
The final distribution is a mixture of two distributions p_sem and p_sym. The first, p_sem, is a semantic output: it increases the answer probability of a word whose semantic embedding is close to the semantic state h_sem. The second, p_sym, interprets h_sym as probabilities over words of the context, using B_ϕ to dereference positions. Indeed, denoting p̃ = softmax(h_sym) ∈ R^m and using Eq. 1, the i-th coefficient of p_sym is given by:

[p_sym]_i = [B_ϕ p̃]_i = p̃_{ϕ(i)} if i ∈ C_X, 0 otherwise    (5)

Symbolic transformation. We perform a linear transformation of an internal representation h = h_sem ⊕ h_sym ∈ R^{d+m} with:

h_sem ⊕ h_sym → W h_sem ⊕ (λI + γ11^T) h_sym ∈ R^{d+m}    (6)

where W ∈ R^{d×d}, λ ∈ R and γ ∈ R are parameters to learn, I ∈ R^{m×m} is the identity matrix and 1 = [1, …, 1]^T ∈ R^{m×1}. The linear transformation of the symbolic part is taken from Zaheer et al. (2017), where it was shown to be the unique form of a linear parametric permutation-equivariant function. In our case, the symbolic transformation is equivariant to permutations of symbolic coordinates, allowing the model to be independent of the choice of a particular bijection ϕ.
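A quick numerical check (a sketch, not from the paper's code) that the λI + γ11ᵀ map commutes with coordinate permutations, which is what makes the symbolic part independent of the particular bijection ϕ:

```python
import numpy as np

def symbolic_linear(h_sym, lam, gamma):
    """(lam * I + gamma * 1 1^T) h_sym, the permutation-equivariant linear map."""
    return lam * h_sym + gamma * h_sym.sum() * np.ones_like(h_sym)

rng = np.random.default_rng(1)
h = rng.normal(size=5)
perm = rng.permutation(5)
lam, gamma = 0.8, -0.3

# permuting coordinates then transforming == transforming then permuting
assert np.allclose(symbolic_linear(h[perm], lam, gamma),
                   symbolic_linear(h, lam, gamma)[perm])
```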

4.1. SYMBOLIC MODELS

We are now ready to show how question-answering models can be extended to be symbol-shift equivariant. We recall the definitions of Memory-Networks (Sukhbaatar et al., 2015) and Third-order tensor product RNN (Schlag & Schmidhuber, 2018) before deriving their symbolic counterparts.

Memory-Networks. Memory-Networks iteratively update an internal question representation with K ≥ 1 attention layers (or hops) before projecting the final representation to the answer distribution. The model parameters consist of K + 1 embedding matrices A^(0), …, A^(K) ∈ R^{d×n}. The query representation is initially set to u^(0) = Σ_{j=1}^{n_q} A^(0) e_{q_j} and iteratively updated with:

m_i = Σ_{j=1}^{n_i} A^(k) e_{x_ij},   c_i = Σ_{j=1}^{n_i} A^(k+1) e_{x_ij}
u^(k+1) = u^(k) + Σ_i [softmax(m^T u^(k))]_i c_i    (8)

In words, the internal representation is updated with the output memories c_i weighted by the similarity between the current question representation u^(k) and the input memories m_i. The final internal representation u^(K) ∈ R^d is mapped to the answer distribution with p = softmax(A^(K)T u^(K)) ∈ R^n, so that the probability of each word being the answer is a function of the similarity between its embedding in A^(K) and the final question representation u^(K). In addition, the paper proposes temporal and positional encodings to distinguish word and story order, which we discuss in the appendix.

Symbolic Memory-Networks. We extend Memory-Networks by using K + 1 symbolic embeddings from Eq. 2 and concatenating semantic and symbolic representations. This model has parameters A^(k) ∈ R^{d×n} and α^(k) ∈ R^n for 0 ≤ k ≤ K, as well as β ∈ [0, 1]. Instead of mapping internal representations into R^d, we map them to R^{d+m} with:

m_i = Σ_{j=1}^{n_i} A^(k) e_{x_ij} ⊕ α^(k)_{x_ij} e_{ϕ(x_ij)},   c_i = Σ_{j=1}^{n_i} A^(k+1) e_{x_ij} ⊕ α^(k+1)_{x_ij} e_{ϕ(x_ij)}

The internal representation is updated K times with Eq. 8 to obtain a final internal representation u^(K) ∈ R^{d+m}.
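A single Memory-Networks hop (Eq. 8) reduces to attention over story memories; the following is a minimal numpy sketch with random placeholder memories (`M_in` and `M_out` stand in for the embedded stories m_i and c_i):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memory_hop(u, M_in, M_out):
    """u^{(k+1)} = u^{(k)} + sum_i [softmax(m^T u)]_i c_i (Eq. 8)."""
    attn = softmax(M_in @ u)        # similarity of u with each input memory m_i
    return u + M_out.T @ attn       # add attention-weighted output memories c_i

rng = np.random.default_rng(2)
d, T = 8, 4                          # hidden dim, number of stories
u = rng.normal(size=d)               # question representation u^{(0)}
M_in = rng.normal(size=(T, d))       # input memories m_i
M_out = rng.normal(size=(T, d))      # output memories c_i
u_next = memory_hop(u, M_in, M_out)
```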
To obtain the output predictive distribution, we project u^(K) ∈ R^{d+m} as described in Eq. 4 with B^T = A^(K) and β ∈ [0, 1].

TPR. Third-order tensor product RNN encodes stories with a non-standard RNN whose representation is a tensor product representation (Schlag & Schmidhuber, 2018). More precisely, the model state consists of a tensor F_t ∈ R^{E×R×E} that is initialized with zeros and updated after each story with F_t = F_{t-1} + ΔF_t. The update term ΔF_t depends on learned parametric representations of entities and roles, obtained by mapping the story representation s_t through MLPs. The story representation is obtained by summing the words of a story: s_t = Σ_{j=1}^{n_t} A e_{x_tj} ⊙ p_j, where ⊙ denotes the componentwise product, A denotes an embedding matrix and the p_j are positional embeddings. Once the story representation is obtained, entity and role representations are computed with MLPs: e^(i)_t = f_{e^(i)}(s_t; θ_{e^(i)}) ∈ R^E and r^(j)_t = f_{r^(j)}(s_t; θ_{r^(j)}) ∈ R^R for 1 ≤ i < 3, 1 ≤ j < 4, where f is an MLP and θ its parameters. The update term ΔF_t is given by a closed-form formula designed to write entity information into F_t given the entity and role embeddings e^(i)_t and r^(j)_t; we detail it in the appendix for space reasons. The internal representation F_t is updated after reading each story, and to perform inference, the final internal representation F_T is used to decode the entity and role representations of the question.
Similarly to the story embedding, entity and role representations of the question are first obtained by mapping the question embedding s_Q = Σ_{j=1}^{n_q} A e_{q_j} ⊙ p_j through MLPs:

n = f_n(s_Q; θ_n) ∈ R^R,   l_j = f_{l_j}(s_Q; θ_{l_j}) ∈ R^E,   1 ≤ j < 4    (11)

Given those representations, the distribution over possible answers is obtained by first computing the internal representations:

î^(1) = (F_T • n) •_{34} l^(1),   î^(2) = (F_T • î^(1)) •_{34} l^(2),   î^(3) = (F_T • î^(2)) •_{34} l^(3)    (12)

where •_{34} denotes the tensor inner product. Finally, the answer distribution is obtained with the projection:

p = softmax(B Σ_{k=1}^{3} LN(î^(k))) ∈ R^n    (13)

where LN denotes layer normalization (Ba et al., 2016) and B ∈ R^{n×E} are projection parameters. For the MLPs, Schlag & Schmidhuber (2018) propose two hidden layers with internal dimension d and tanh activations; hyperparameters of the method are given in the appendix. The parameters of the model are word and positional embeddings, MLP parameters, and projection parameters.

Symbolic TPR. We modify TPR to handle symbolic representations for entities, roles and intermediate representations, all of which are embedded into R^{d+m}. Stories are embedded symbolically with:

s_t = Σ_{j=1}^{n_t} (A e_{x_tj} ⊙ p_j) ⊕ (α_{x_tj} e_{ϕ(x_tj)}) ∈ R^{d+m}

Then, we use symbolic MLPs to find entity and role representations:

e^(i)_t = f̃_{e^(i)}(s_t; θ_{e^(i)}) ∈ R^{d+m},   r^(j)_t = f̃_{r^(j)}(s_t; θ_{r^(j)}) ∈ R^{d+m}

where f̃ denotes a symbolic MLP with linear transformations as described in Eq. 6. The tensor representation is updated with the same equations as TPR to obtain a final representation F_T ∈ R^{(d+m)^3}. Given the symbolic embedding of the question s_Q = Σ_{j=1}^{n_q} (A e_{q_j} ⊙ p_j) ⊕ (α_{q_j} e_{ϕ(q_j)}), we extract symbolic roles and entities with a symbolic MLP:

n = f̃_n(s_Q; θ_n) ∈ R^{d+m},   l_j = f̃_{l_j}(s_Q; θ_{l_j}) ∈ R^{d+m}

We then obtain the hidden representation h = Σ_{k=1}^{3} LN(î^(k)) ∈ R^{d+m} with Eq. 12, which is projected to an answer distribution as described in Eq. 4, where B ∈ R^{n×d} and β ∈ [0, 1] are additional projection parameters.

A key result justifying the symbolic qualifier is that SMN and STPR are symbol-shift equivariant.

Theorem. SMN and STPR are symbol-shift equivariant.

The proof is in appendix A for space reasons. The main idea consists in remarking that embeddings are invariant to symbol-shift and, consequently, latent representations are invariant too since the model is deterministic. The proof then shows that having identical latents gives equivariance given our symbolic projection construction. We observe that in practice the learned semantic embeddings of different entities will not be exactly the same as in Figure 2: in particular, the semantic embeddings of "apple" and "orange" may be close but different. Our experiments will show that having symbol-shift equivariant models is a good inductive bias improving accuracy and compositionality of the studied models even if entity vectors are not exactly aligned.

5. EXPERIMENTS

We perform experiments on bAbi tasks (Weston et al., 2015) with version v1.2 of the dataset. The dataset consists of a set of 20 synthetic question-answering tasks designed to test the capabilities of a dialog agent. In each example, an answer must be provided given a question and stories consisting of a sequence of sentences. We use the two versions, which contain 1K and 10K training examples per task respectively. Given our computing budget, we perform all experiments in the single-task setting where models are trained on the 20 tasks independently. Every experiment is repeated with 10 different seeds and we report mean/std over those runs. The performance of MN, SMN, TPR and STPR is reported in Fig. 4 and Tab. 1. Fig. 4 depicts the average error per task over time and shows that the symbolic approaches significantly improve the sample efficiency of both methods, which converge much faster and also converge to better values. In Tab. 1, we report the average error obtained at convergence for all methods. Again, the symbolic models SMN and STPR outperform their non-symbolic counterparts. Because the embeddings of entities learned by TPR are orthogonal (Schlag & Schmidhuber, 2018), we argue that most dimensions are used to represent this basis (requiring at least as many dimensions as the number of possible entities), while STPR requires fewer dimensions since orthogonal representations are available to the model by construction. To see this effect, we run a smaller model with d = 20 and dropout set to 0.5 for both STPR and TPR, called respectively STPR-sm and TPR-sm. Results indicate that STPR-sm performs almost as well in the 10K setting while reaching an average test error of 5.6% when trained with 1K examples, as opposed to TPR-sm, whose performance is severely hurt by the dimension reduction as it can no longer represent orthogonal bases for the different entities.
To the best of our knowledge, the best reported result with 1K examples is 9.9% from the QRN model (Seo et al., 2017), which reported the best run over 10 seeds while we report the average over seeds. Using the same procedure, we obtain a test error of 1.8% and pass 19/20 tasks (passing means reaching an error lower than 5%). This is a significant improvement on the problem of passing all tasks with only 1K examples, but the problem still stands (in particular when reporting mean test error instead of the best error over multiple seeds). Zero-shot entities. In Tab. 2, we show the accuracy obtained on two bAbi tasks in a zero-shot setting where we leave the training dataset unchanged but perturb the test dataset by introducing unseen entities. Precisely, we replace the 6 rooms present in the task (kitchen, bedroom, office, garden, hallway, bathroom) with kitchenette, guest-room, open-space, entry, terrace, toilet in the test set only, see Fig. 5. We also measure test accuracy when introducing unseen people and objects in the test set with the same procedure (both tasks contain objects, rooms, and people). For all models, the semantic embeddings of words unseen in training are set to the zero vector. Symbolic models outperform their semantic-only counterparts in this setting as they can perform abstraction over symbols rather than relying only on specific embeddings for each name. Experiment details. Hyperparameters on bAbi are kept identical across tasks, and we reuse the hyperparameters of Sukhbaatar et al. (2015) for Memory-Networks and of Schlag & Schmidhuber (2018) for TPR (they are given in the appendix). For the symbolic variants, we use the same hyperparameters as the semantic versions. We used the public implementations of Junki (2015) and Alexander (2019) for MN and TPR, based on Pytorch (Paszke et al., 2019), which we adapted to support symbolic computation.

6. CONCLUSION

We introduced a novel hybrid semantic and symbolic representation able to handle internal transformations and projections of the final representation back to the initial vocabulary in a way that is symbol-shift equivariant. The main advantage of our approach is that we rely on differentiation to detect whether words should behave as symbols, thereby sidestepping the need to detect or specify entities in advance. Our experiments showed that symbol-shift equivariant models can significantly decrease the amount of training data required. In particular, we were able to solve 19/20 bAbi tasks using only 1K training examples. We also showed that our approach performs well in challenging zero-shot settings and when the number of entities in a task becomes very large. An interesting area for future work consists in extending other architectures such as Transformer (Vaswani et al., 2017) models, with the hope of diminishing the vast amount of data they currently require. Finally, another interesting application would be to use this symbolic representation to ensure fairness in question answering and other applications.

It remains to show that f_2(X) is equivariant. Denoting the probabilities over symbols p̃ = softmax(h_sym(X)) = softmax(h_sym(Γ(X))) ∈ R^m and using Eq. 5, we obtain:

[f_2(X)]_i = p̃_{ϕ(i)} if i ∈ C_X, 0 otherwise;   [f_2(Γ(X))]_i = p̃_{ϕ'(i)} if i ∈ C_{Γ(X)}, 0 otherwise

Then, for any i ≤ n,

[Γ(f_2(X))]_i = [f_2(X)]_{Γ^{-1}(i)}    (26)
             = p̃_{ϕ(Γ^{-1}(i))} if Γ^{-1}(i) ∈ C_X, 0 otherwise    (27)
             = p̃_{ϕ'(i)} if i ∈ C_{Γ(X)}, 0 otherwise    (28)
             = [f_2(Γ(X))]_i

B ARTIFICIAL DATASET EXPERIMENT

To investigate how models generalize with different numbers of entities, we generate artificial datasets where questions have the form: x_1 = v_1. … x_N = v_N. x_i = ?, where we expect v_i as the answer, v_i being the last value assigned to x_i, with x_j ∈ V_1 and v_j ∈ V_2 for all j.

C EMBEDDINGS LEARNED BY MEMORY-NETWORKS

Fig. 8 shows the word embeddings of the first layer of a MN after learning the first bAbi task generated with 10K names. The words on the upper left are verbs, stop-words and locations, while the words on the bottom right are names (the figure is zoomable). The model tries to diagonalize entity embeddings given the available dimensions (e.g. finding orthogonal embeddings for names). This means that attention models may also require a number of dimensions at least as large as the number of entities, and hence struggle to generalize to large numbers of entities or to entities not seen at training time.

The main differences in the implementation we used for TPR are that the RADAM optimizer was used in Schlag & Schmidhuber (2018); we used ADAM instead, both to match the Memory-Networks setting and because this optimizer is not available in Pytorch. For Memory-Networks, we reuse the Junki (2015) implementation, where temporal encoding regularization is not present.

Memory-Networks temporal encoding

To allow models to preserve temporal information, Sukhbaatar et al. (2015) propose additional parameters T_A ∈ R^{T×d} and T_C ∈ R^{T×d}, where T denotes the maximum number of stories, added to m_i and c_i as follows:

m_i = Σ_{j=1}^{n_i} A^(k) e_{x_ij} + T_A(i),   c_i = Σ_{j=1}^{n_i} A^(k+1) e_{x_ij} + T_C(i)

In the case of Symbolic Memory-Networks, we proceed similarly with an additional symbolic temporal encoding parametrized by α_A ∈ R^T and α_C ∈ R^T, and update m_i and c_i in the same manner.
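As an illustrative numpy sketch of the temporal encoding (random placeholder parameters, not the paper's code), each summed memory just receives an additional vector indexed by its story position:

```python
import numpy as np

rng = np.random.default_rng(3)
d, T_max, n = 8, 10, 6
A_k = rng.normal(size=(d, n))        # embedding matrix A^{(k)}
T_A = rng.normal(size=(T_max, d))    # temporal encoding, one row per story index

def input_memory(word_ids, story_index):
    """m_i = sum_j A^{(k)} e_{x_ij} + T_A(i)."""
    return A_k[:, word_ids].sum(axis=1) + T_A[story_index]

m0 = input_memory([0, 1, 2], story_index=0)
```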



Footnote 1: From now on, we drop the subscript notation when there is no ambiguity and denote C, ϕ; we later omit superscript dimensions as they are always implicitly defined.
Footnote 2: Note that if α_x = 0 for all words x, the model reduces to a standard semantic-only model.



Figure 1: Test error on the first bAbi task when increasing the number of names. Errors of symbolic models are all below 1%.

Figure 2: Illustration of symbol-shift equivariance. Left: representation of word parameters (identical words are depicted close so that they can be distinguished). Middle and right: two cases where Φ is symbol-shift equivariant for two different symbol-shifts.

Figure 3: Illustration of symbolic representation in a case where d = 2 and m = 5.

Figure 4: Learning curves of average accuracy when training with 1K and 10K examples of bAbi.

We study how well models perform when the number of variables and values increases. Three datasets are generated with |V_1| = |V_2| = k for k ∈ {10, 100, 1000}; each dataset has 10K training examples and 1K non-overlapping test examples, and each story has N = 10 assignments. While the task is elementary, both MN and TPR struggle to generalize to larger numbers of entities, in contrast to their symbolic counterparts SMN and STPR, which solve the task even with thousands of entities.

Figure 7: Artificial dataset error convergence when increasing the number of values and variables.

In all experiments, we used a batch size of 32 and ADAM with a learning rate of 0.001. In the case of TPR models, we follow Schlag & Schmidhuber (2018) and set β_1 = 0.6, β_2 = 0.4. Gradient norms are clipped to 5.0. All models are trained with early stopping using 10% of the training set as validation, and we perform 80000 gradient updates, decaying the learning rate by 2 two times (a setup close to both Sukhbaatar et al. (2015), which decays the learning rate 4 times, and Alexander (2019), which trains for significantly longer).

John took the apple in the kitchen. John took the apple in the kitchenette. John took the key in the kitchen. Sasha took the apple in the kitchen. John went to the bedroom. John went to the guest-room. John went to the bedroom. Test error when training on original bAbi tasks and evaluating error on different test datasets. For zero-shot test datasets, test entities (e.g. rooms, objects, people) are not seen during training.

Aggregate error on bAbi.

Per task error on bAbi with 1K training examples.

APPENDIX

A PROOF OF SYMBOL-SHIFT EQUIVARIANCE

Theorem. SMN and STPR are symbol-shift equivariant.

Proof. Let Φ ∈ {SMN, STPR}; we show that for any symbol-shift Γ and context X ∈ X, Φ(Γ(X)) = Γ(Φ(X)). Since the output of Φ is considered as a distribution over words, i.e. a vector Φ(X) ∈ R^n, we first define the action of Γ on a vector v ∈ R^n by v' = Γ(v), the vector verifying v'_{Γ(i)} = v_i. Since Γ is a symbol-shift, all word parameters (which comprise A, B and α) are preserved, i.e. for each word i ≤ n:

Ae_i = Ae_{Γ(i)},   α_i = α_{Γ(i)}    (16)

Denote ϕ = ϕ_X and ϕ' = ϕ_{Γ(X)} the two bijections mapping context words in C_X and C_{Γ(X)} to their order of appearance, see Fig. 6. Using the fact that permuting words does not change the entities' order of occurrence, we have:

ϕ'(Γ(x)) = ϕ(x)    (17)

We first show that the embedding defined in Eq. 2 is invariant to symbol-shift, i.e. that E(x) = E(Γ(x)) for all x ∈ X, where E(x) = Ae_x ⊕ α_x e_{ϕ_X(x)}. Using Eq. 16, we get that Ae_x = Ae_{Γ(x)}. Then, using Eq. 17 and 16, we get that α_{Γ(x)} e_{ϕ'(Γ(x))} = α_{Γ(x)} e_{ϕ(x)} = α_x e_{ϕ(x)}. Consequently, since embeddings are invariant under symbol-shift and the model is deterministic, the same latent representation is obtained for both X and Γ(X): denoting the latent function h(X) = h_sem(X) ⊕ h_sym(X), we have h(X) = h(Γ(X)). Given that the output is defined as in Eq. 4, it is sufficient to show that the two functions f_1(X) = B h_sem(X) and f_2(X) = B_{ϕ_X} softmax(h_sym(X)) are equivariant, as softmax and (constant) linear transformations are trivially equivariant. The equivariance of f_1(X) follows from a sequence of equalities using Eq. 16 and the fact that the latent function is invariant. In addition, both Memory-Networks and Symbolic Memory-Networks use the positional embedding used in TPR.

E TPR UPDATE EQUATIONS

After reading each story s_t, F_t is updated by adding the sum of three tensors ΔF_t = W_t + M_t + B_t, each built from tensor outer products of the entity and role embeddings e^(i)_t, r^(j)_t and tensor inner products with F_t (e.g. a retrieval of the form ŵ_t = F_t •_{34} (e^(1)_t ⊗ r^(1)_t)), where ⊗ denotes the tensor outer product and •_{ij} the tensor inner product. The term W_t is a write term that encodes the association (e^(1)_t, r^(1)_t, e^(2)_t) while retrieving the previously assigned entity and removing its previous association. The term M_t associates this removed entity to a different relation r^(2)_t (also removing the entity previously assigned to r^(2)_t). Finally, the term B_t is called a backlink as it allows associative recall; we refer to the original paper (Schlag & Schmidhuber, 2018) for detailed intuitions on the three update terms.

