IMPROVING OUT-OF-DISTRIBUTION GENERALIZATION WITH INDIRECTION REPRESENTATIONS

Abstract

We propose a generic module named Indirection Layer (InLay), which leverages indirection and data internal relationships to construct symbolic indirect representations that improve the out-of-distribution generalization capabilities of various neural architectures. InLay receives data input in the form of a sequence of objects and treats it as a complete weighted graph whose vertices are the objects and whose edge weights are scalars representing relationships between vertices. The input is first mapped via indirection to a symbolic graph with data-independent and trainable vertices. This symbolic graph is then propagated, resulting in new vertex features whose indirection will be used for prediction steps afterward. Theoretically, we show that the distances between indirection representations are bounded by the distances between corresponding graphs, implying that unseen samples with very different surface statistics can still be close in the representation space to the seen samples if they share similar internal relationships. We demonstrate that InLay is consistently effective in improving out-of-distribution generalization throughout a comprehensive suite of experiments, including IQ problems, distorted image classification, and few-shot domain adaptation NLP classification. We also conduct ablation studies to verify different design choices of InLay.

We introduce our main contribution, namely the Indirection Layer (InLay). InLay takes a sequence of objects as input and transforms the sequence into a new indirect graph-structured representation. Concretely, let $X = (x_1, x_2, \dots, x_k) \in \mathbb{R}^{k \times n}$ be the input sequence for InLay, where $k$ is the number of objects and each $x_i \in \mathbb{R}^n$ represents an object. For example, an object may be an image in IQ problems, a patch of an image in the image classification task, or a paragraph in the few-shot NLP classification task (see Section 4).
To better exploit data internal relationships, we treat each input sequence as a directed complete weighted graph (with no self-loops) whose vertices represent the objects and whose edges represent relationships as scalars in $[-1, 1]$. Specifically, for each sequence $X$, we denote by $G_X$ its corresponding graph. We define $\mathcal{G}_k$ to be the space of all directed weighted complete graphs $G$ with $k$ vertices and edge weights in $[-1, 1]$. From now on, we write $G$ instead of $G_X$ when it is not necessary to specify $X$, and we denote by $A_G$ the adjacency matrix of $G$. This adjacency matrix captures the internal relationships of the corresponding data sequence.

Remark 2.1. (Canonical indexing assumption) Because the vertices of a graph may be permuted, a graph $G$ with $k$ vertices may not have a unique adjacency matrix. To ensure that $A_G$ is well defined, we assume that (when computing the adjacency matrix) the $i$-th vertex represents the $i$-th element of the input sequence. We show in Appendix C that the indirection representations are still maintained if the canonical indexing assumption is not obeyed.

1. INTRODUCTION

There is considerable evidence that deep learning models may fail drastically in out-of-distribution (OOD) testing circumstances (Geirhos et al., 2018; Keysers et al., 2020). One widely accepted reason is that neural networks tend to learn surface statistics of data (Lake et al., 2017) and thus cannot generalize to new samples with different statistics. On the other hand, humans excel at generalizing, and it has long been believed that the ability to think symbolically is the key for humans to quickly adapt to new situations (Mitchell, 2021). A powerful concept that can bridge concrete data and symbols is indirection, which binds two objects together and uses one to refer to the other. In computer science, indirection is widely used via pointers: data is bound to its memory address, and programs use the memory address to refer to that data.

The capacity to draw analogies is yet another trait that facilitates human generalization. Several cognitive science theories have been proposed to explain analogy, and the Structure-Mapping Theory (SMT) (Gentner, 1983) is one of the most successful among them. SMT argues that it is not object attributes but the relationships between objects that are transferred in an analogy. For example, the hydrogen atom is analogous to the solar system not because they share the same sizes or temperatures but because they both have entities revolving around a center due to an attractive force. This suggests that the internal relationships of a situation contain essential information for generalization. In this paper, we propose a method that simultaneously leverages indirection and data internal relationships to construct indirection representations, which can be interpreted as symbolic representations that respect the similarities between internal relationships.
For instance, two IQ problems with similar hidden rules (i.e., similar internal relationships) should have similar indirection representations, even though they contain completely different shapes or images. To this end, we implement our method in the form of a generic module named Indirection Layer (InLay), which can construct indirection representations from either encoded or raw low-sensory data and can be equipped with various models to improve their OOD generalization capabilities. InLay receives a sequence of objects as input and produces a sequence of the same length consisting of the associated indirection representations. The input sequence is viewed as a complete weighted graph where each edge weight represents the relationship between the two corresponding objects, and thus the adjacency matrix of this graph captures the internal relationships of the input. The core operation of InLay consists of two steps: indirection and graph propagation (see Fig. 1 for an illustration). The input is first processed through indirection to transfer all edge weights to another symbolic graph whose vertices are data-independent and trainable. This symbolic graph is then propagated, resulting in updated vertex features that serve as the indirection representations of the input. These indirection representations are used as new representations for the subsequent prediction steps. We show both theoretically and empirically that InLay can help to improve OOD generalization.
Theoretically, we show that InLay indirection preserves the internal structures of graphs, and that the distances between indirection representations are bounded by the cut distances between corresponding graphs. Thanks to these theoretical properties, the indirection representation of a new data instance can be located near a seen one if they share similar internal relationships (although the surface features may be entirely different), and thus the two instances have a higher chance of being interpreted similarly. Empirically, we show that InLay consistently helps different models to improve their OOD generalization capabilities in a comprehensive suite of experiments involving numerous datasets and OOD scenarios, including IQ problems with unseen objects and unseen rules, distorted image classification, and few-shot domain adaptation NLP classification. We also conduct ablation experiments to study the necessity of different design choices in InLay and provide a practical analysis of the success of InLay.

We aim to learn suitable representations for the graph such that the internal relationships of the input sequence are transferable to novel settings. To this end, we contribute the Indirection Layer (InLay), which leverages indirection and data internal relationships to construct indirection representations. InLay is a generic and flexible module that can be equipped into different models to construct indirection representations from either encoded data or raw low-sensory data (e.g., for the case of Vision Transformer; see Section 4.2) in two steps: indirection and graph propagation.

Indirection

For each $X$, the adjacency matrix $A_{G_X} \in \mathbb{R}^{k \times k}$ of $G_X$ represents the internal relationships between objects in $X$. Each component $a^X_{ij}$ of $A_{G_X}$ is computed as

$$a^X_{ij} = \tanh\left(\frac{Qx_i \cdot Kx_j}{\sqrt{4n}}\right) \ \text{if } i \neq j, \qquad a^X_{ij} = 0 \ \text{if } i = j,$$

where $\cdot$ is the inner product and $Q, K \in \mathbb{R}^{n \times 4n}$ are trainable weights that project $x_i$ and $x_j$ onto a higher dimensional space so that a linear kernel may represent the relationship between $x_i$ and $x_j$. The choice of tanh as a non-linear transformation is important: it maps the dot products to $[-1, 1]$, allowing InLay to possess nice theoretical properties regarding boundedness of distances (see Section 3); and tanh allows negative similarities between objects, which may help to represent opposite relations, e.g., translations to the left and to the right. See Section 4.1.1 and Appendix G for experimental details.

In indirection, each object is bound to a symbol. We denote by $V^{\mathrm{ind}} = (v^{\mathrm{ind}}_1, v^{\mathrm{ind}}_2, \dots, v^{\mathrm{ind}}_k) \in \mathbb{R}^{k \times n}$ the set of symbols, where each $v^{\mathrm{ind}}_i \in \mathbb{R}^n$ is data-independent and trainable. Let $\mathcal{G}^{\mathrm{ind}}_k$ be the subset of $\mathcal{G}_k$ that consists of all graphs whose set of vertices is $V^{\mathrm{ind}}$ (i.e., each vertex represents some $v^{\mathrm{ind}}_i$ and no two vertices represent the same $v^{\mathrm{ind}}_i$). The space $\mathcal{G}^{\mathrm{ind}}_k$ can be interpreted as the space of symbolic graphs with fixed vertices. We define the indirection operator $I$ as follows.

Definition 2.2. Given an input sequence $X = (x_1, x_2, \dots, x_k)$ and its corresponding graph $G_X \in \mathcal{G}_k$, the indirection operator $I$ is a mapping from $\mathcal{G}_k$ to $\mathcal{G}^{\mathrm{ind}}_k$ that maps $G_X$ to $I(G_X)$ so that $A_{G_X} = A_{I(G_X)}$ and the $i$-th vertex of $I(G_X)$ represents $v^{\mathrm{ind}}_i$.

Remark 2.3. Definition 2.2 is introduced for the case when the canonical indexing assumption (see Remark 2.1) is obeyed. The vertex order emerges when computing the adjacency matrix. A more general definition is given in Appendix C.
The indirection operator $I$ maps each object $x_i$ to its associated symbol $v^{\mathrm{ind}}_i$ while assuming the pairwise relationship between $v^{\mathrm{ind}}_i$ and $v^{\mathrm{ind}}_j$ is the same as the one between $x_i$ and $x_j$ (see Fig. 1). That is, $I$ ignores the concrete features of objects but still maintains the relationships between them.
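The relationship matrix defined above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `adjacency`, the random initialization of `Q` and `K`, and the row-vector convention (`X @ Q` rather than `Q x_i`) are assumptions made for the example.

```python
import numpy as np

def adjacency(X, Q, K):
    """Relationship matrix A_{G_X}: tanh of scaled dot products, zero diagonal.

    X has shape (k, n); Q and K have shape (n, 4n).
    """
    proj_dim = Q.shape[1]                       # equals 4n in the text
    scores = (X @ Q) @ (X @ K).T / np.sqrt(proj_dim)
    A = np.tanh(scores)                         # every weight lies in [-1, 1]
    np.fill_diagonal(A, 0.0)                    # complete graph without self-loops
    return A

# Illustrative sizes and random weights (assumptions, not values from the paper).
rng = np.random.default_rng(0)
k, n = 5, 8
X = rng.normal(size=(k, n))
Q = rng.normal(size=(n, 4 * n)) / np.sqrt(n)
K = rng.normal(size=(n, 4 * n)) / np.sqrt(n)
A = adjacency(X, Q, K)
```

Because `Q` and `K` differ, `A` is in general not symmetric, which matches the directed-graph view of the input sequence.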

Graph propagation

After indirection, each data graph $G$ is mapped to a symbolic graph $I(G)$. This operation can be interpreted as follows: at first, the edge weights of $I(G)$ are unspecified; then the indirection operator $I$ assigns edge weights from the data to $I(G)$. Once it has received this information from the data, $I(G)$ is propagated and the updated vertex features are the indirection representations of the input sequence. Formally, for an input sequence $X$, if we denote by $r_X$ the indirection representations of $X$, then $r_X = A_{G_X} V^{\mathrm{ind}}$. This symbolic $r_X$ is used as a new representation for $X$ in the subsequent prediction steps. To summarize, for each input sequence $X$, InLay constructs the associated indirection representation $r_X$:

$$r_X = \tanh\left(\frac{XQ(XK)^\top}{\sqrt{4n}}\right) V^{\mathrm{ind}}. \tag{1}$$

This equation is closely related to self-attention, except for three points: (1) data are projected onto a higher dimensional space by matrix multiplication with $Q$ and $K$; (2) the softmax operator is replaced by tanh; and, most importantly, (3) the value $V^{\mathrm{ind}}$ is not computed based on the data $X$. While the first two differences empirically enhance InLay's performance (see Section 4.4), the third one stands for the core idea of indirection in InLay. An ablation study on $V^{\mathrm{ind}}$ will also be conducted in Section 4.4 to demonstrate the role of each element.

Initialization of $V^{\mathrm{ind}}$ may greatly affect the overall performance. To reduce this effect, we replace $V^{\mathrm{ind}}$ in Eq. (1) by $\psi(V^{\mathrm{ind}})$, where $\psi : \mathbb{R}^n \to \mathbb{R}^n$ is a trainable 2-layer neural network applied to the rows of $V^{\mathrm{ind}}$. We also use multiple heads to compute the adjacency matrix so that local information of feature vectors is better utilized. The number of heads is tuned for each specific task.
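As a rough single-head sketch of Eq. (1) with the $\psi$ refinement of the symbols: the helper names `inlay_forward` and `psi`, the ReLU activation, the hidden width, and all random weights below are assumptions for illustration only; the diagonal is zeroed to follow the componentwise definition of the adjacency matrix.

```python
import numpy as np

def psi(V, W1, b1, W2, b2):
    """Hypothetical 2-layer network applied to each row (symbol) of V."""
    return np.maximum(V @ W1 + b1, 0.0) @ W2 + b2

def inlay_forward(X, Q, K, symbols):
    """r_X = A_{G_X} @ symbols, with A_{G_X} built from tanh-scaled dot products."""
    proj_dim = Q.shape[1]                                   # 4n in the text
    A = np.tanh((X @ Q) @ (X @ K).T / np.sqrt(proj_dim))
    np.fill_diagonal(A, 0.0)                                # no self-loops
    return A @ symbols                                      # graph propagation

rng = np.random.default_rng(0)
k, n, hidden = 4, 6, 16
X = rng.normal(size=(k, n))
Q = rng.normal(size=(n, 4 * n)) / np.sqrt(n)
K = rng.normal(size=(n, 4 * n)) / np.sqrt(n)
V_ind = rng.normal(size=(k, n))                             # data-independent symbols
W1, b1 = rng.normal(size=(n, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, n)), np.zeros(n)

symbols = psi(V_ind, W1, b1, W2, b2)
r_X = inlay_forward(X, Q, K, symbols)
# -X has entirely different features but the same pairwise dot products,
# hence the same adjacency matrix and the same indirection representation.
r_neg = inlay_forward(-X, Q, K, symbols)
```

The pair `X` and `-X` is a small instance of the point made in the text: inputs that differ in their concrete features but share the same adjacency matrix receive identical indirection representations, because the values come from the symbols rather than from the data.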

3.1. BOUNDEDNESS WITH RESPECT TO THE CUT DISTANCE

Graph spectrum and Laplacian are important graph characteristics that can be computed entirely from the adjacency matrix of a graph. From Definition 2.2, it follows that the indirection operator $I$ preserves the graph spectrum and Laplacian, which means $I$ preserves the internal structure of the graph. To some extent, this agrees with the Structure-Mapping Theory (Gentner, 1983), which states that not the attributes but the internal relationships are transferred in an analogy. In other words, learning internal relationships may already be enough to capture the essence of a situation. Further details on the Structure-Mapping Theory will be given in Section 5. Next, we investigate how distances between graphs may constrain distances between indirection representations. Before defining graph distance, we define isomorphism between graphs in $\mathcal{G}_k$.

Definition 3.1. Given two graphs $G = (V, E) \in \mathcal{G}_k$ and $G' = (V', E') \in \mathcal{G}_k$ with associated adjacency matrices $A_G = (a_{ij})_{i,j=1,\dots,k}$ and $A_{G'} = (a'_{ij})_{i,j=1,\dots,k}$, we say $G$ and $G'$ are isomorphic, denoted by $G \cong G'$, if there exists a bijection $\phi : V \to V'$ so that $a_{ij} = a'_{\phi(i)\phi(j)}$ for every $i, j \in V$.

Two isomorphic graphs can be interpreted as being identical up to isomorphism, and thus a graph distance defined on $\mathcal{G}_k$ should respect this property, i.e., the distance between two isomorphic graphs is 0. One such distance is the cut distance $\delta$ (Borgs et al., 2008), which is a useful tool for comparing similarities between structures (Liu et al., 2018) and also for studying the convergence of sequences of graphs (Borgs et al., 2008). A formal definition of $\delta$ is given in Appendix A. It follows from the definition of $\delta$ that $\delta(G, G') = 0$ if and only if $G$ is isomorphic to $G'$. Moreover, if $G_1 \cong G_2$ then $\delta(G_1, G') = \delta(G_2, G')$ for any $G'$. Since $G \cong I(G)$, the indirection operator $I$ preserves the $\delta$ distance, i.e., $\delta(G, G') = \delta(I(G), I(G'))$ for every $G, G' \in \mathcal{G}_k$.
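To make Definition 3.1 concrete, the following sketch checks isomorphism of two small weighted directed graphs by brute force over vertex bijections, and confirms numerically that relabeling vertices leaves the characteristic polynomial (and therefore the spectrum) unchanged. The function name `is_isomorphic` and the random test graph are illustrative assumptions; the exhaustive search is only practical for small $k$.

```python
import itertools
import numpy as np

def is_isomorphic(A, B, tol=1e-9):
    """True if some bijection phi satisfies A[i, j] == B[phi(i), phi(j)] for all i, j."""
    k = A.shape[0]
    for perm in itertools.permutations(range(k)):
        p = list(perm)
        if np.allclose(A, B[np.ix_(p, p)], atol=tol):
            return True
    return False

rng = np.random.default_rng(1)
k = 4
A = rng.uniform(-1.0, 1.0, size=(k, k))
np.fill_diagonal(A, 0.0)

perm = [2, 0, 3, 1]
B = A[np.ix_(perm, perm)]          # same graph with relabeled vertices

C = A.copy()
C[0, 1] += 0.5                     # changes one relationship (total edge weight differs)

same_graph = is_isomorphic(A, B)
changed_graph = is_isomorphic(A, C)
# Characteristic polynomial coefficients are invariant under relabeling.
same_spectrum = np.allclose(np.poly(A), np.poly(B))
```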
We have shown that the indirection operator I admits invariant properties with respect to the graph spectrum, Laplacian and the cut graph distance δ . The following result shows that the distances between indirection representations are bounded by cut distances between corresponding graphs. For each G ∈ G k , we denote r G = A G V ind to be its associated indirection representation. Theorem 3.2. For any two graphs G ∈ G k and G ∈ G k , the following inequality holds: r G -r G ∞ ≤ k 2 + k 2 δ (G, G ) V ind ∞ , where . ∞ is the matrix infinity norm (see Definition A.3 in Appendix A). Proof. See Appendix B. Note that even when δ (G, G ) = 0, r G may still be different from r G . This is because by design, InLay also takes into account the ordinal information of input sequence, which may be important in some specific use cases, e.g., when the input is sequence of image patches. Theorem 3.2 shows that if G and G are close, their indirection representations will not be far away from each other as well. This is an important property since the original vertex representations of G and G may be arbitrarily far though G and G are isomorphic, e.g., two IQ problems with the same hidden rules but different images may be represented very differently. Theorem 3.2 also shows the necessity of training V ind to obtain appropriate V ind ∞ : if V ind ∞ is too large, the bound in Ineq. (2) is loose; conversely, if V ind ∞ is too small, the bound may be too strict so that indirection representations are not well separated enough. An empirical ablation study on V ind will be given in Section 4.4.

3.2. CONNECTION BETWEEN INLAY AND STRUCTURAL ANALOGY

Current machine learning methods follow the manifold hypothesis and tend to interpolate on the learned manifold during testing. This ability to interpolate is usually referred to as making value analogies, i.e., making analogies between data features. However, value analogy may not be enough in more extreme generalization cases, when the surface statistics of testing samples differ vastly from those of the training data. Structural analogy is believed to be necessary for ML models to reach higher levels of generalization (Chollet, 2021). When making a structural analogy, concrete information is partly ignored while structural information is compared, e.g., two IQ problems with the same hidden rules are structurally analogous even though the data (e.g., images) are entirely different between the problems. In InLay, the structural information is maintained in the form of adjacency matrices, which are computed based on data features. To some extent, InLay can be interpreted as a hybrid of value analogy and structural analogy.

Structural analogy requires a metric to measure the similarities between structures. Among different metrics, the cut distance $\delta$ is one of the few able to compare directed weighted graphs (Tantardini et al., 2019). For instance, a recent work by Liu et al. (2018) leverages the cut distance to compare complex networks, including artificial networks and real networks of chemical molecules. Theorem 3.2 draws a connection between InLay and the cut distance by showing that the distances between indirection representations are bounded by the corresponding cut distances, and thus emphasizes the structural inductive bias that InLay brings into deep learning models.

4. EXPERIMENTS

In this section, we conduct several experiments with different scenarios of OOD generalization to show that InLay can adapt to various models and improve their performances on various datasets. The OOD testing scenarios include IQ problems with unseen images and unseen rules, distorted image classification, and domain adaptation on few-shot NLP classification, all of which require the ability to understand the problem in a systematic and symbolic way in order to generalize on new OOD circumstances. Throughout these experiments, we show that InLay consistently helps models to perform better. We also provide an ablation study on the necessities of different design choices in InLay, as well as a practical analysis of the success of InLay. In practice, we optionally use context normalization (Webb et al., 2020a) to further improve InLay. If context normalization is applied in InLay, there will be two such layers: one to normalize the original representation X, and one to normalize the symbolic representation r X .
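For orientation, context normalization can be sketched as a per-feature standardization across the objects of one sequence, followed by a learned scale and shift. The exact form used by Webb et al. (2020a) and in InLay is not given in this text, so the function below, including the names `gamma`, `beta` and `eps`, is an assumption rather than the authors' layer.

```python
import numpy as np

def context_norm(Z, gamma, beta, eps=1e-8):
    """Standardize each feature over the k objects of one sequence (rows of Z),
    then apply an elementwise scale (gamma) and shift (beta)."""
    mu = Z.mean(axis=0, keepdims=True)
    sigma = Z.std(axis=0, keepdims=True)
    return gamma * (Z - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
k, n = 6, 4
X = rng.normal(loc=3.0, scale=5.0, size=(k, n))   # stand-in for X or r_X
gamma, beta = np.ones(n), np.zeros(n)             # would be trainable in practice
X_normed = context_norm(X, gamma, beta)
```

Under this reading, applying the layer once to $X$ and once to $r_X$ corresponds to the two normalization layers mentioned above.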

4.1. OUT-OF-DISTRIBUTION IQ PROBLEMS

IQ problems are powerful testbeds for the OOD generalization capability of deep learning models. Despite their simple appearance, IQ problems are challenging in the sense that they require models to understand the hidden rules, instead of just surface features, in order to solve new problems with unseen objects or even unseen (but related) rules. There is evidence that current deep learning models may fail when facing problems with unseen objects (Webb et al., 2020b). In this experiment, we show that models coupled with InLay achieve better performance on two IQ datasets: FINE (Pham et al., 2022) and RAVEN (Zhang et al., 2019). Examples are given in Fig. 2b.

We use a vanilla 6-layer Vision Transformer (ViT) (Dosovitskiy et al., 2020) as the base model and test it, with or without InLay, on different datasets, namely SVHN (Netzer et al., 2011) and CIFAR10 and CIFAR100 (Krizhevsky, 2009). The models are trained on original images in two cases, with and without data augmentation, and tested on images with various distortions, including an image transformation (90-degree rotation) and color transformations (color jitter, grayscale). In the case of data augmentation, we use all distortions for augmentation except the one used for testing. If InLay is equipped, we divide each 32 × 32 image into overlapping patches of size 8 × 8 with stride 4. These patches are the vertices of the graph that represents the current image. The patch indirection representations are reassembled to form a new image of the same size as the original one. This new image is then fed into ViT. Context normalization is not used in this task. Results are shown in Table 3. As expected, ViT performs poorly when the images are distorted. On average, InLay improves its performance by 3.3% on SVHN, 0.9% on CIFAR10, and 5.3% on CIFAR100 when there is no data augmentation, and by 1.8% on SVHN, 12.8% on CIFAR10, and 7.2% on CIFAR100 when data augmentation is included. We can also observe that (1) ViT with data augmentation but without InLay is still mostly worse than ViT with InLay but without data augmentation; and (2) InLay helps improve ViT both with and without data augmentation. Note that adding InLay to ViT should not be interpreted as equivalent to adding a Transformer layer. Empirically, the performance of a 7-layer ViT differs only slightly from that of a 6-layer ViT (see Appendix E), while it is clear that adding InLay may boost performance significantly.
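The patching step described for the ViT experiment can be sketched as follows. With 8 × 8 patches and stride 4, a 32 × 32 image yields (32 - 8) / 4 + 1 = 7 positions per side, i.e., 49 graph vertices. The function name `extract_patches`, the channels-last layout, and the flattening of each patch into a vector are assumptions for illustration; how the patch representations are reassembled into an image is not specified here and is left out.

```python
import numpy as np

def extract_patches(img, size=8, stride=4):
    """Split an (H, W, C) image into overlapping patches, one flattened patch per row."""
    H, W, _ = img.shape
    patches = [
        img[r:r + size, c:c + size].reshape(-1)
        for r in range(0, H - size + 1, stride)
        for c in range(0, W - size + 1, stride)
    ]
    return np.stack(patches)

rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32, 3))      # stand-in for one CIFAR-sized image
X = extract_patches(image)                 # sequence of k = 49 objects for InLay
```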

4.3. FEW-SHOT NLP DOMAIN ADAPTATION

Humans can handle NLP classification tasks given a small number of examples. While humans can quickly adapt to such new scenarios, deep language models may not. Gao et al. (2019) proposed the FewRel 2.0 dataset, which consists of few-shot NLP classification tasks where the domains of the train and test tasks differ vastly. Specifically, the training texts are taken from the Wikipedia corpus, while the texts for testing originate from the PubMed and UMLS databases, which contain large amounts of biomedical literature and science. This creates a big obstacle for few-shot language models: their performance drops drastically, as reported in the original paper. Inspired by this result, we conduct an experiment on the FewRel 2.0 dataset to show that InLay works well for language models. Different few-shot models, including Prototypical Network (Snell et al., 2017), SNAIL (Mishra et al., 2017), Graph Neural Network (Garcia and Bruna, 2017), and MTB (Soares et al., 2019), are trained with a BERT encoder (Devlin et al., 2018) on 5-way-1-shot and 10-way-1-shot tasks. All models are equipped with context normalization. Since the test set is not publicly available, we report results on the validation set, which shares the same domain as the test set. Results are shown in Table 4. Except for ProtoNet, InLay helps the other models improve: by 19.8% for SNAIL, 21.3% for GNN, and 2.0% for MTB on average. The case of ProtoNet can be explained as follows: ProtoNet depends on the distances between data instances, which is similar in spirit to InLay. Because ProtoNet already has this inductive bias, InLay cannot help to improve it.

In each experiment, we modify one design choice and keep the others fixed. We consider three main design choices: the activation function used to compute adjacency matrices, the projection onto a higher dimensional space to compute dot products, and the trainability and data-independence of $V^{\mathrm{ind}}$.
We also consider the case when the indirection representations are treated as relative positional encoding to be added to the original input. Results are reported in Fig. 3a . In the ablation for activation function, replacing tanh with softmax significantly decreases the performance, while the result when no activation is applied is only slightly lower. This is because softmax does not allow negative values; moreover, it imposes the constraint of summing-to-one on edges of graph, which is unnecessary in the theoretical analysis. On the other side, projecting data onto higher dimensional spaces also plays a vital role in InLay as it helps linearize the relations between objects so that dot products may manage to represent those relations, and not doing so may lead to a drastic drop in performance. Maintaining a trainable set of symbols V ind is beneficial for InLay, and the performances with randomly sampled V ind from Gaussian tend to decrease when the standard deviations of the Gaussians increase. This can be explained by Theorem 3.2: increase of standard deviations leads to bigger V ind ∞ , which loosens the bound in Ineq. (2). Keeping V ind data-independent is also important, and treating the indirection representations as relative positional encoding is not efficient.
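The difference between the two activations discussed in the ablation can be seen on a toy score matrix. The numbers below are arbitrary illustrative values, not results from the paper.

```python
import numpy as np

# Arbitrary raw relationship scores for two objects against three others.
scores = np.array([[2.0, -1.0, 0.5],
                   [-3.0, 0.0, 1.0]])

tanh_weights = np.tanh(scores)                       # signed weights in [-1, 1]

shifted = scores - scores.max(axis=1, keepdims=True) # for numerical stability
softmax_weights = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

has_negative_tanh = bool((tanh_weights < 0).any())   # opposite relations are expressible
row_sums = softmax_weights.sum(axis=1)               # each row is forced to sum to one
```

With tanh, a strongly negative score stays negative, so opposite relations remain distinguishable; with softmax, every weight is positive and each row is renormalized, which is the constraint the ablation identifies as unnecessary.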

4.4.2. FURTHER PRACTICAL ANALYSIS

Besides the theoretical analysis in Section 3, we further provide practical evidence showing why InLay may help models to generalize better. We again use the OOD classification tasks as a testbed. In short, we would like to show that InLay reduces the distance between an image and its distorted version, so that models may recognize the similarity between the two images more easily. The absolute distance may not be a fair metric, since scaling two vectors by the same factor may already reduce the distance between them. Instead, we compute the relative distances between vectors, i.e., the relative distance between $u$ and $v$ is

$$\frac{2\,\|u - v\|_\infty}{\|u\|_\infty + \|v\|_\infty}.$$

Relative distances between images and their distorted versions are computed in two cases: with InLay and without InLay. Results are shown in Fig. 3b, and it is clear that the relative distances in the InLay case are lower than those without InLay. This can be partly explained by Theorem 3.2: the original image corresponds to $G$, and the distorted image corresponds to $G'$. Since the distorted image is closely related to the original one, the distance between their corresponding graphs is small, and thus the distance between their indirection representations $r_G$ and $r_{G'}$ also tends to be small according to Ineq. (2).
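A small sketch of the relative distance for vectors, where the infinity norm is the largest absolute entry. The function name `relative_distance` and the example vectors are assumptions for illustration.

```python
import numpy as np

def relative_distance(u, v):
    """2 * ||u - v||_inf / (||u||_inf + ||v||_inf) for 1-D arrays u and v."""
    numerator = 2.0 * np.max(np.abs(u - v))
    denominator = np.max(np.abs(u)) + np.max(np.abs(v))
    return numerator / denominator

u = np.array([1.0, 2.0])
v = np.array([2.0, 0.0])

abs_dist = np.max(np.abs(u - v))                            # 2.0
abs_dist_scaled = np.max(np.abs(0.1 * u - 0.1 * v))         # shrinks with the scale
rel_dist = relative_distance(u, v)                          # 2 * 2 / (2 + 2) = 1.0
rel_dist_scaled = relative_distance(0.1 * u, 0.1 * v)       # unchanged by scaling
```

Scaling both vectors by 0.1 shrinks the absolute distance by the same factor but leaves the relative distance unchanged, which is why the relative version is the fairer comparison here.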

5. RELATED WORK

Systematic generalization has attracted attention recently in the deep neural networks community. One approach is to train a mixture of experts as functional modules, and these experts either compete (Parascandolo et al., 2018) or are composed by attention mechanism (Rahaman et al., 2021) to solve a task. Fedus et al. (2021) proposed the Switch Transformer to simplify routing algorithms in mixture-of-experts models to reduce communication and computational costs. Another approach is to design architectures that mimic human's ability to think and reason sequentially. One well-known early model following this approach is the Module Network (Andreas et al., 2016) which attacks the image-QA tasks by parsing the query into sequential sub-queries, each of which is solved by a module in the form of neural networks. The MAC recurrent network (Hudson and Manning, 2018) and the Neural State Machine (Hudson and Manning, 2019) follow a similar idea, however, in MAC the query is explicitly and expressively decomposed by a sequence of RNN-type MAC cells, while Neural State Machine relies on probabilistic graphs representing underlying semantics to reason sequentially. Recently, Wei et al. (2022) proposed the idea of chain of thought to improve the ability of large language models to perform complex reasoning. Our InLay also follows the idea of injecting symbolic inductive bias, however, we focus on representations instead of functional modules. Models equipped with InLay can be interpreted as 2-step reasoning: the low-level sensory data is first represented symbolically by InLay, then processed by the following models. Indirection is one of the most useful ideas that has been long applied in different areas of computer science. One of the most illustrative examples for indirection is the concept of pointer. Recently, there have been works leveraging indirection to improve generalization capabilities of deep learning models. 
ESBN (Webb et al., 2020b) uses an RNN controller to sequentially produce a symbolic key for each object and reasons on keys only. The keys in ESBN are computed based on the controller and the similarities between objects, which is similar to InLay; however, the keys are produced sequentially, which may increase the computational cost. Recently, Pham et al. (2022) proposed FINE, a fast-weight approach that utilizes indirection on functional spaces and has achieved promising performance on different OOD testing scenarios of IQ problems.

The idea of transferring relationships in InLay is inspired by the Structure-Mapping Theory (SMT) (Gentner, 1983), a revolutionary theory of analogy in cognitive science. Analogy is a vital concept for explaining human cognition, and it has long been argued that analogy-making underlies humans' ability to flexibly adapt to new situations (Gentner et al., 2001). Before SMT, it had been assumed that in a strong analogy, the base and the target should share several attributes in common (Tversky, 1977). SMT, in contrast, argues that not attributes but the relationships between objects are transferred in an analogy; in other words, the essence of a situation lies in internal relationships instead of concrete attributes. From this point of view, the theoretical results in Section 3 can be interpreted as a justification for SMT in the case of InLay: transferring only the relationships does not lose significant information such as graph characteristics and graph topology.

It is also worth noting that, from the few-shot learning perspective, InLay can be categorized as a fast-weight (Malsburg, 1994) embedding learning model, in which the attentional weight $A_{G_X}$ is computed on the fly and the indirection representation $r_X$ is computed accordingly. The trainable set of symbols $V^{\mathrm{ind}}$ can be interpreted as positional encodings, and the output $r_X$ of InLay is a relative positional encoding with respect to the input $X$.
The idea of relative positional encoding (Shaw et al., 2018) , as a replacement for the absolute positional encoding in Transformer, has been widely investigated and several variants have been proposed (Dai et al., 2019; Huang et al., 2018) . It is worth noting that the relative positional encoding is added to the original input, while r X in InLay plays the role of the input for the following model. InLay thus should not be considered as a variant of a Transformer layer or relative positional encoding; instead, its design highlights the idea of indirection and symbolic representations.

6. CONCLUSION

In this paper, we propose InLay as a separate module that can be plugged into different models to improve OOD generalization. InLay leverages the idea of indirection to redirect data representation based on a trainable set of symbols. Viewing each data point as a complete weighted graph, we prove theoretically that InLay preserves graph internal structure and graph topology, and the distances between refined representations are bounded by the distances between corresponding graphs. We show the effectiveness of InLay through a comprehensive suite of experiments, including different OOD testing scenarios on IQ problems, distorted image classification, and few-shot NLP domain adaptation classification tasks. We also conduct ablation experiments to study necessities of different design choices in InLay, as well as further practical analysis on the success of InLay. InLay opens up several future directions. From the theoretical side, it is worth investigating how the manifold containing original data representations is transformed during InLay, and why this manifold transformation can help generalization. From the practical view, stacking multiple InLay's to form a hierarchical indirection network is a promising idea. Theorem 3.2, diam R G ≤ 2 V ind ∞ for every G ∈ G k , where diam stands for the diameter of a set. We now focus on the distance between R G and R G , and we use the popular Hausdorff metric to measure this distance. The distance between R G and R G is meaningful in the sense that for any r ∈ R G , we can find r ∈ R G so that r -r ∞ ≤ d H (R G , R G ). If d H (R G , R G ) is small and r is observed, then r is likely to be treated similarly as r. The following last theorem shows that the Hausdorff distance d H with respect to the . ∞ norm between R G and R G also depends on the distance between G and G . Theorem B.1. Given two graphs G ∈ G k and G ∈ G k . The following inequality holds: d H (R G , R G ) ≤ k 3 V ind ∞ δ(G, G ). 
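Moreover, if $G$ and $G'$ are not isomorphic and $\operatorname{rank} V_{\mathrm{ind}} = k$, then $R_G \cap R_{G'} = \emptyset$.

Proof of Theorem B.1. Denote $[G] = \{\tilde{G} \in \mathcal{G}_k : \tilde{G} \cong G\}$. We will prove that for every $G_1 \in [G]$, there exists $G_2 \in [G']$ so that $\|r_{G_1} - r_{G_2}\|_\infty \le k^3 \|V_{\mathrm{ind}}\|_\infty \, \delta_\square(G, G')$. Following the definition of $\delta_\square$, for $G_1 \in [G]$, there exists $G_2 \in [G']$ so that $\delta_\square(G_1, G') = d_\square(G_1, G_2)$. On the other hand, since $G_1 \cong G$, it follows that $\delta_\square(G_1, G') = \delta_\square(G, G')$, and hence $d_\square(G_1, G_2) = \delta_\square(G, G')$. This leads to

$$\|r_{G_1} - r_{G_2}\|_\infty = \|A_{G_1} V_{\mathrm{ind}} - A_{G_2} V_{\mathrm{ind}}\|_\infty \le \|A_{G_1} - A_{G_2}\|_\infty \|V_{\mathrm{ind}}\|_\infty \le k^3 d_\square(G_1, G_2) \|V_{\mathrm{ind}}\|_\infty = k^3 \delta_\square(G, G') \|V_{\mathrm{ind}}\|_\infty.$$

This means $d(r_{G_1}, R_{G'}) = \inf_{\tilde{G} \in [G']} \|r_{G_1} - r_{\tilde{G}}\|_\infty \le k^3 \delta_\square(G, G') \|V_{\mathrm{ind}}\|_\infty$ for every $G_1 \in [G]$. Similarly, $d(r_{G_2}, R_G) = \inf_{\tilde{G} \in [G]} \|r_{G_2} - r_{\tilde{G}}\|_\infty \le k^3 \delta_\square(G, G') \|V_{\mathrm{ind}}\|_\infty$ for every $G_2 \in [G']$. This leads to

$$d_H(R_G, R_{G'}) = \max\Big\{ \sup_{G_1 \in [G]} d(r_{G_1}, R_{G'}),\ \sup_{G_2 \in [G']} d(r_{G_2}, R_G) \Big\} \le k^3 \|V_{\mathrm{ind}}\|_\infty \, \delta_\square(G, G').$$

Finally, if $\operatorname{rank} V_{\mathrm{ind}} = k$ and $G$ and $G'$ are not isomorphic, suppose there exists $r \in R_G \cap R_{G'}$. Since $r \in R_G$, there exists $G_1 \in [G]$ so that $r = r_{G_1} = A_{G_1} V_{\mathrm{ind}}$. Similarly, there exists $G_2 \in [G']$ so that $r = A_{G_2} V_{\mathrm{ind}}$. This leads to $A_{G_1} V_{\mathrm{ind}} = A_{G_2} V_{\mathrm{ind}}$, and since $\operatorname{rank} V_{\mathrm{ind}} = k$, we obtain $A_{G_1} = A_{G_2}$. This means $G_1 \cong G_2$, and hence $G \cong G'$, which is a contradiction. Hence $R_G \cap R_{G'} = \emptyset$.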
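Since every isomorphism class is finite, the sets of indirection representations are finite as well, and the Hausdorff distance between them can be computed by brute force for very small $k$. The following is a minimal sketch under stated assumptions, not the authors' code: graphs are given by their adjacency matrices, representations are computed as $A V_{\mathrm{ind}}$, the symbol matrix is taken to be the identity (so that its rank is $k$), and all function names are hypothetical.

```python
from itertools import permutations


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inf_norm(a):
    """Matrix infinity norm: the maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in a)


def permuted(a, p):
    """Adjacency matrix of the isomorphic graph obtained by relabeling vertices with p."""
    k = len(a)
    return [[a[p[i]][p[j]] for j in range(k)] for i in range(k)]


def rep_set(a, v_ind):
    """All representations A~ V_ind over the isomorphism class of the graph with adjacency a."""
    return [matmul(permuted(a, p), v_ind) for p in permutations(range(len(a)))]


def dist(r1, r2):
    """Distance between two representations in the matrix infinity norm."""
    return inf_norm([[x - y for x, y in zip(u, v)] for u, v in zip(r1, r2)])


def hausdorff(s1, s2):
    """Hausdorff distance between two finite sets of representations."""
    forward = max(min(dist(r1, r2) for r2 in s2) for r1 in s1)
    backward = max(min(dist(r1, r2) for r1 in s1) for r2 in s2)
    return max(forward, backward)


# Illustrative data (assumed, not from the paper): k = 3 and V_ind = identity.
v_ind = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a = [[0.0, 0.5, 0.1], [0.5, 0.0, 0.3], [0.1, 0.3, 0.0]]
a_iso = permuted(a, (2, 0, 1))                                  # isomorphic to a
a_other = [[0.0, 0.9, 0.9], [0.9, 0.0, 0.9], [0.9, 0.9, 0.0]]   # not isomorphic to a

same_class = hausdorff(rep_set(a, v_ind), rep_set(a_iso, v_ind))
other_class = hausdorff(rep_set(a, v_ind), rep_set(a_other, v_ind))
```

In this toy setting, `same_class` is zero because isomorphic graphs generate the same set of representations, while `other_class` is strictly positive, in line with the disjointness statement for full-rank symbols.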

C WELL-DEFINEDNESS OF INDIRECTION REPRESENTATION

In this section, we consider the case when the canonical assumption (see Remark 2.1) is not obeyed. First, we need a more general definition for the indirection operator (Definition 2.2).

learning rate $2 \cdot 10^{-5}$. The indirection representations are computed as in Eq. (1) with 32 attention heads. Other details are similar to the original paper and codes are also adapted from the original paper.



Figure 1: Indirection Layer. Concrete data representation is viewed as a complete graph with weighted edges. The indirection operator maps this graph to a symbolic graph with the same edge weights, but whose vertices are data-independent and trainable. This symbolic graph is propagated, and the updated node features are the indirection representations. Different concrete inputs may share the same indirection representations if their corresponding graphs have the same adjacency matrices. This illustrates the core idea of InLay: constructing indirection representations by transferring internal relationships through indirection.


Figure 3: (a) Ablation study. Test accuracies of ViT equipped with InLay on grayscale images when a design choice of InLay is replaced or removed. (b) Relative distances between original and distorted representations with and without InLay in different testing cases.

4.4 ABLATION AND ANALYSIS

4.4.1 ABLATION ON INLAY DESIGN CHOICES

We conduct ablation experiments to study the necessity of different design choices of InLay. All experiments are conducted on the OOD classification task (see Section 4.2) with ViT and grayscale testing images. In each experiment, we modify one design choice and keep the others fixed. We consider three main design choices: the activation function used to compute adjacency matrices, the projection onto a higher-dimensional space to compute dot products, and the trainability and data-independence of $V_{\mathrm{ind}}$. We also consider the case when the indirection representations are treated as relative positional encoding to be added to the original input. Results are reported in Fig. 3a.

Figure 4: Illustration of Theorem 3.2 and Theorem B.1.

InLay constructs indirection representations from either encoded or raw data. If the prediction model is equipped with an encoder, InLay sits between the encoder and the prediction model to transform encoded representations into indirection representations (see Fig. 2a for an illustration). When there is no encoder, e.g., when the prediction model is Vision Transformer, InLay directly constructs indirection representations from raw data. For a fair comparison, models with and without InLay are all trained with the same training settings, including batch size, learning rate, number of training iterations, optimizer, etc., and we only report test results after the last iteration. Average results are reported in the main text; full results with standard deviation are given in Appendix D. More training details are given in Appendix K.
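The construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: it uses a single head, assumes that edge weights are the tanh of scaled dot products between linearly projected objects (the projection matrices `w_q` and `w_k` and the scaling factor are assumptions; Eq. (1) in the main text is authoritative), and propagates the symbolic graph as $r_X = A_X V_{\mathrm{ind}}$. All names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def adjacency(x, w_q, w_k):
    """Edge weights computed from the concrete input only (assumed form: tanh of scaled dot products)."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = math.sqrt(len(w_q[0]))
    scores = matmul(q, transpose(k))
    return [[math.tanh(s / scale) for s in row] for row in scores]


def inlay(x, w_q, w_k, v_ind):
    """Indirection representations: propagate the data-independent symbols over the data-dependent edges."""
    return matmul(adjacency(x, w_q, w_k), v_ind)


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
k, n, d_proj, d_sym = 4, 3, 5, 2     # objects, object dim, projection dim, symbol dim
x = rand_matrix(k, n, rng)           # concrete input sequence
w_q = rand_matrix(n, d_proj, rng)
w_k = rand_matrix(n, d_proj, rng)
v_ind = rand_matrix(k, d_sym, rng)   # trainable symbols, independent of the data

r_x = inlay(x, w_q, w_k, v_ind)      # one representation per object

# A different concrete input with the same internal relationships:
# negating every object leaves all pairwise dot products unchanged.
x_neg = [[-v for v in row] for row in x]
r_x_neg = inlay(x_neg, w_q, w_k, v_ind)
```

Here `x` and `x_neg` differ in every coordinate, yet they induce the same adjacency matrix in this sketch and therefore receive the same indirection representations.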

Average test accuracy (%) without/with InLay on FINE dataset.

/56.9 27.1/31.4 29.8/58.5 30.3/61.0 35.2/69.2 30.0/34.5 29.6/56.0 31.9/48.7
PrediNet 26.7/28.5 25.5/28.3 25.6/29.0 26.4/29.9 27.5/31.6 26.8/27.8 26.4/28.4 33.6/35.9

Average test accuracy (%) without/with InLay on RAVEN dataset.

4.1.1 FINE DATASET

FINE dataset consists of IQ problems with geometric transformations as hidden rules. To succeed in this dataset, models should treat objects as symbols and learn the relationship between these symbols.

Average test accuracy (%) without/with InLay on OOD classification task with ViT.

Average validation accuracy (%) without/with InLay on FewRel 2.0.

4.2 OUT-OF-DISTRIBUTION CLASSIFICATION

Humans can consistently recognize objects in different positions, angles, or colors. Current deep learning models may not. Geirhos et al. (2018) show that when test images are injected with kinds of distortions other than the ones seen in training, deep neural networks may fail drastically on image classification tasks. We take inspiration from that result and conduct similar experiments to test whether InLay can help models improve their performance on OOD image classification tasks.

Average test accuracy (%) without/with InLay on FINE dataset.

Average test accuracy (%) of NTM (with or without InLay) on FINE dataset with different activation functions.

APPENDIX

A DEFINITION OF $\delta_\square$ AND $\|\cdot\|_\infty$

We first define the cut distance $d_\square$ between graphs with the same set of vertices.

Definition A.1. (Borgs et al., 2008) Given two graphs $G = (V, E) \in \mathcal{G}_k$ and $G' = (V, E') \in \mathcal{G}_k$ with associated adjacency matrices $A_G = (a_{ij})_{i,j=1,\dots,k}$ and $A_{G'} = (a'_{ij})_{i,j=1,\dots,k}$. The cut distance $d_\square$ between $G$ and $G'$ is

$$d_\square(G, G') = \frac{1}{k^2} \max_{S, T \subseteq V} \Big| \sum_{i \in S,\, j \in T} (a_{ij} - a'_{ij}) \Big|.$$

The distance $d_\square$ can also be interpreted as the distance between the internal relationships (i.e., the adjacency matrices) of two graphs. However, one drawback of $d_\square$ is that it is not invariant under isomorphism. The generalized cut distance $\delta_\square$ is proposed to overcome this drawback.

Definition A.2. (Borgs et al., 2008) Given two graphs $G = (V, E) \in \mathcal{G}_k$ and $G' = (V, E') \in \mathcal{G}_k$ with associated adjacency matrices $A_G = (a_{ij})_{i,j=1,\dots,k}$ and $A_{G'} = (a'_{ij})_{i,j=1,\dots,k}$. The generalized cut distance $\delta_\square$ between $G$ and $G'$ is computed as

$$\delta_\square(G, G') = \min_{\tilde{G} \cong G} d_\square(\tilde{G}, G'),$$

where $\tilde{G}$ shares the same set of vertices with $G'$.

Next, we define the matrix infinity norm $\|\cdot\|_\infty$ induced from the vector max norm.

Definition A.3. For a given matrix $A = (a_{ij})_{i=1,\dots,k;\ j=1,\dots,n}$, its infinity norm is computed as

$$\|A\|_\infty = \max_{1 \le i \le k} \sum_{j=1}^{n} |a_{ij}|.$$

In words, the matrix infinity norm is the max row sum.

Proposition A.4. The matrix infinity norm is sub-multiplicative, i.e., $\|AB\|_\infty \le \|A\|_\infty \|B\|_\infty$.
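For very small graphs, the two distances can be evaluated by exhaustive search, which makes their behavior under isomorphism concrete. The sketch below is illustrative only: it assumes the normalized form of the cut distance, $d_\square(G, G') = \frac{1}{k^2} \max_{S,T \subseteq V} |\sum_{i \in S, j \in T} (a_{ij} - a'_{ij})|$, following Borgs et al. (2008); it enumerates all subset pairs and all vertex permutations, so it is only usable for tiny $k$, and every function name is hypothetical.

```python
from itertools import combinations, permutations


def subsets(k):
    """All non-empty subsets of {0, ..., k-1}."""
    return [s for r in range(1, k + 1) for s in combinations(range(k), r)]


def cut_distance(a, b):
    """Brute-force cut distance between two graphs on the same vertex set (assumed normalized form)."""
    k = len(a)
    subs = subsets(k)
    best = max(
        abs(sum(a[i][j] - b[i][j] for i in s for j in t))
        for s in subs
        for t in subs
    )
    return best / k**2


def permuted(a, p):
    """Adjacency matrix of the isomorphic graph obtained by relabeling vertices with p."""
    k = len(a)
    return [[a[p[i]][p[j]] for j in range(k)] for i in range(k)]


def generalized_cut_distance(a, b):
    """Minimum cut distance over all relabelings of the first graph."""
    return min(cut_distance(permuted(a, p), b) for p in permutations(range(len(a))))


def inf_norm(a):
    """Matrix infinity norm: the maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in a)


# Illustrative data (assumed, not from the paper).
a = [[0.0, 0.5, 0.1], [0.5, 0.0, 0.3], [0.1, 0.3, 0.0]]
b = permuted(a, (1, 2, 0))   # the same graph with relabeled vertices

d_ab = cut_distance(a, b)                    # sensitive to the labeling
delta_ab = generalized_cut_distance(a, b)    # invariant under isomorphism

# Bound used in the proofs: ||A - B||_inf <= k^3 * d(A, B).
diff = [[x - y for x, y in zip(u, v)] for u, v in zip(a, b)]
bound_holds = inf_norm(diff) <= len(a) ** 3 * d_ab + 1e-12
```

On this example the plain cut distance between the two labelings is positive while the generalized cut distance vanishes, which is the drawback and its remedy described above.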

B PROOF OF THEOREM 3.2 AND MORE THEORETICAL RESULTS

In this section, we provide proofs for theoretical results in the main text. An illustration of the theoretical results is given in Fig. 4.

From the definition of $\delta_\square$ (Definition A.2), there exists, where $A_{G_{\mathrm{ind}}}$ and $A_{I(G')}$ are the adjacency matrices of $G_{\mathrm{ind}}$ and $I(G')$ respectively, it follows from the definition of $d_\square$ that $\|E\|_\infty \le k^3 \varepsilon$ (since the absolute value of each element of $E$ is less than $k^2 \varepsilon$ due to Definition A.1 of $d_\square$, and the infinity matrix norm is the max row sum where each row has $k$ elements). Finally, note that

Theorem 3.2 focuses on the distance between two single indirection representations. Now we move our attention to the distance between sets of indirection representations associated with isomorphic classes of graphs. For each

Definition C.1. Given an input sequence $X = (x_1, x_2, \dots, x_k)$ and its corresponding graph $G_X \in \mathcal{G}_k$, the indirection operator $I$ is a mapping from $\mathcal{G}_k$ to $\mathcal{G}^{\mathrm{ind}}_k$ that maps $G_X$ to $I(G_X)$ so that 1. $A_{G_X} = A_{I(G_X)}$ and 2. if the $i$-th vertex of $G_X$ represents $x_j$, then the $i$-th vertex of $I(G_X)$ represents $v^{\mathrm{ind}}_j$.

Consider an input sequence $X = (x_1, x_2, \dots, x_k)$ and its corresponding graph $G_X$ with adjacency matrix $A_{G_X}$, which is computed based on the assumption in Remark 2.1. The associated indirection representation computed by Eq. (1) is $r_X$, i.e., the $i$-th element of $r_X$ is the indirection representation for $x_i$.

Now consider an arbitrary graph $G'_X$ that also represents $X$, with adjacency matrix $A_{G'_X}$ and set of vertices $\{v_1, v_2, \dots, v_k\}$. This means there exists a permutation $\sigma$ (with an associated permutation matrix $P$) so that $v_{\sigma(i)}$ represents $x_i$ for all $i$, and

This means the $\sigma(i)$-th element of $r'_X$ is the $i$-th element of $r_X$. On the other hand, after graph propagation, $u_{\sigma(i)}$ represents the $\sigma(i)$-th element of $r'_X$, which is the $i$-th element of $r_X$.

Since $v_{\sigma(i)}$ represents $x_i$ and $v_{\sigma(i)}$ is mapped to $u_{\sigma(i)}$ by the indirection operator $I$, it follows that the $i$-th element of $r_X$ is the indirection representation for $x_i$. This shows that the indirection representations of the $x_i$'s are unchanged when the vertices of $G_X$ are permuted.
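The invariance argument can also be checked numerically. The sketch below is a hedged illustration rather than the authors' code: it assumes propagation of the form $r = A V_{\mathrm{ind}}$ and the relabeling convention in which vertex $\sigma(i)$ of the new graph represents $x_i$, so that both the adjacency entries and the symbols move together with the vertices. The function names and data are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relabel(a, v_ind, sigma):
    """Relabel vertices so that vertex sigma[i] of the new graph represents object i.

    The edge weight between objects i and j moves to position (sigma[i], sigma[j]),
    and the symbol assigned to object i moves to row sigma[i].
    """
    k = len(a)
    a_new = [[0.0] * k for _ in range(k)]
    v_new = [None] * k
    for i in range(k):
        v_new[sigma[i]] = v_ind[i]
        for j in range(k):
            a_new[sigma[i]][sigma[j]] = a[i][j]
    return a_new, v_new


# Illustrative data (assumed, not from the paper).
a = [[0.2, 0.5, -0.1], [0.4, 0.0, 0.3], [-0.6, 0.7, 0.1]]
v_ind = [[1.0, 2.0], [-0.5, 0.3], [0.8, -1.2]]
sigma = (2, 0, 1)

r = matmul(a, v_ind)                        # representations under the canonical labeling
a_new, v_new = relabel(a, v_ind, sigma)
r_new = matmul(a_new, v_new)                # representations under the permuted labeling

# Object i is found at row sigma[i] of r_new and keeps the same representation.
max_gap = max(
    abs(r_new[sigma[i]][c] - r[i][c]) for i in range(len(a)) for c in range(len(v_ind[0]))
)
```

The value `max_gap` is zero up to floating-point rounding, i.e., each object receives the same indirection representation regardless of how the vertices are ordered.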

F RUNNING TIME OF INLAY

We report running time (s/iter) of ViT and ViT+InLay on CIFAR10 dataset. Models are trained on a single Tesla V100-SXM2 GPU. Overall, ViT+InLay requires roughly 10% more computational time.

G MORE ABLATION STUDIES ON TANH ACTIVATION

Readers may observe in Fig. 3a that InLay without any activation still achieves almost equal performance. However, that is just a special case; Table shows results of similar ablation experiments with NTM on the FINE dataset with different activation functions. Among all of them, the tanh activation achieves the best average performance.

H ABOUT INPUT SEQUENCE LENGTH

H.1 WHEN THE SEQUENCE IS TOO LONG

When the number of nodes is large, graph neural networks usually suffer from over-smoothing, the phenomenon in which all nodes become nearly identical after being updated. However, we show that InLay may suffer only mildly from this issue. In the OOD classification task, we increase

All input sequences in our experiments are of fixed length. The case of varying input sequence length can be treated as the fixed-length case if the maximum sequence length is known: we can pad any sequence with empty nodes to reach that maximum length. We conduct experiments on the OOD classification task to illustrate this idea: for each sequence of image patches, we randomly remove some (2 to 6) patches, so that the resulting sequences have different lengths; we then pad with zero tensors so that all sequences have the same length. We train ViT+InLay on the CIFAR10 dataset and test on images with grayscale distortion. The test accuracy is 61.4%, which is not much different from the 62.7% of the fixed-length case.

A more challenging scenario is when the testing sequences are longer than the training ones. The current design of InLay does not allow it to deal with this situation. We believe this is a promising direction for future work.
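The padding procedure used in the variable-length experiment can be sketched as follows. This is a minimal illustration, not the experimental code: the helper `pad_sequence`, the toy patch dimension, and the maximum length of 16 are assumptions chosen for the example; only the removal of 2 to 6 patches and the zero padding come from the description above.

```python
import random


def pad_sequence(seq, max_len, dim):
    """Append all-zero objects ("empty nodes") until the sequence has max_len objects."""
    if len(seq) > max_len:
        raise ValueError("sequence is longer than the maximum length")
    padding = [[0.0] * dim for _ in range(max_len - len(seq))]
    return [list(obj) for obj in seq] + padding


rng = random.Random(0)
max_len, dim = 16, 4    # illustrative sizes (assumed)
patches = [[rng.random() for _ in range(dim)] for _ in range(max_len)]

# Randomly remove between 2 and 6 patches, keeping the original order of the rest.
n_removed = rng.randint(2, 6)
kept_idx = sorted(rng.sample(range(max_len), max_len - n_removed))
shortened = [patches[i] for i in kept_idx]

# Pad back to the fixed length expected by the model.
padded = pad_sequence(shortened, max_len, dim)
```

After padding, every sequence again has `max_len` objects, so a model built for fixed-length input can consume it unchanged.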

I COMPARISON WITH OTHER INDIRECTION APPROACHES

We compare InLay with other indirection approaches, namely ESBN (Webb et al., 2020b) and FINE (Pham et al., 2022), on the FINE dataset. We incorporate InLay with Transformer as it shows the best performance among different models. For FINE, we use the NICE backbone as suggested in the original paper. Results are shown in Table 14. Overall, Transformer+InLay shows competitive results, with the best performance on 5/8 tasks and the second-best performance on 2/8 tasks.

J MORE ABLATION EXPERIMENTS ON CONTEXT NORMALIZATION

We further conduct ablation experiments to show the necessity of context normalization in InLay.

with transformed images. We use Adam optimizer (Kingma and Ba, 2014) with learning rates ranging from $10^{-5}$ to $3 \cdot 10^{-4}$, depending on the specific model and transformation. We train all models with batch size 32 for 200 epochs. The indirection representations are computed as in Eq. (1) with 1 attention head.

The training set contains 5,000 IQ problems, while the testing set contains 10,000 IQ problems of unseen images and unseen rules. Specifically:

• With translation, models are trained on problems with translation vectors $(a, b)$ with $a \in \{0, 3, 6, 9\}$ and $b \in \{0, \pm 3, \pm 6, \pm 9\}$, and tested with $a \in \{-3, -6, -9\}$, i.e., train on problems with translations to the right and test on problems with translations to the left.
• With rotation, models are trained on problems with rotation angle α ∈ {0

We use a 3-layer CNN encoder with kernel size 3 and stride 2 to encode 80 × 80 images to feature vectors of size 256. We use Adam optimizer with learning rates ranging from $10^{-4}$ to $3 \cdot 10^{-4}$ and gradient clipping 1. All models are trained with batch size 32 for 250 epochs. The indirection representations are computed as in Eq. (1) with 1 attention head.

We apply the Dynamic Residual Tree (DRT) as follows: we first apply DRT to feature vectors, then pass the resulting vectors through InLay to obtain indirection representations, then apply DRT once again, and these final resulting vectors are the input for prediction models. Other details are similar to the original paper and codes are also adapted from the original paper.

K.2 OOD IMAGE CLASSIFICATION

We use Adam optimizer with learning rate $5 \cdot 10^{-4}$. All models are trained with batch size 32 for 200 epochs. The indirection representations are computed as in Eq. (1) with 32 attention heads.

We use a 6-layer ViT with patch size 8, 16 attention heads and dropout rate 0.1. The dimension of the feedforward layer is 2048.

K.3 FEW-SHOT NLP DOMAIN ADAPTATION

We use BERT to encode paragraphs to feature vectors of size 768. Models are trained with batch size 32 on 5-way-1-shot tasks and batch size 16 on 10-way-1-shot tasks. We use SGD optimizer with

