RELATIONAL LEARNING WITH VARIATIONAL BAYES

Abstract

In psychology, relational learning refers to the ability to recognize and respond to relationships among objects irrespective of the nature of those objects. Relational learning has long been recognized as a hallmark of human cognition and a key question in artificial intelligence research. In this work, we propose an unsupervised learning method for addressing the relational learning problem, in which we learn the underlying relationship between a pair of data points irrespective of the nature of those data. The central idea of the proposed method is to encapsulate the relational learning problem with a probabilistic graphical model in which we perform inference to learn about data relationships and to carry out other relational processing tasks.

1. INTRODUCTION

The American Psychological Association defines relational learning as follows (VandenBos & APA, 2007):

Definition 1.1 (Relational learning). Learning to differentiate among stimuli on the basis of relational properties rather than absolute properties.

In other words, relational learning refers to the ability to recognize and respond to relationships (called relational properties) among objects irrespective of the nature of those objects (called absolute properties). Relational learning has long been recognized as a hallmark of human cognition, and there has been substantial research showing that adequate cognitive capacity is necessary for relational processing (Biederman, 1987; Medin et al., 1993; Holyoak, 2012; Doumas & Hummel, 2013; Gentner, 2016). As a machine learning application, relational learning can provide new insight into data analysis by dissecting the information in the data into relational property and absolute property. However, to discover relationship patterns among raw and unknown data, relational learning is only truly useful if it can be achieved without supervised data. A key challenge in learning relational property with machine learning-based methods is that relational property is an abstract construct: unlike absolute property, which is based on observable data and can be quantitatively measured, relational property is difficult to quantify objectively, especially when the learning is unsupervised.

In this work, we propose an unsupervised learning method, variational relation learning (VRL), for addressing the relational learning problem. The proposed method is completely unsupervised, which means that the learning requires neither a labeled training dataset nor training examples that are known to have the same (or different) relational property.
At its core, VRL encapsulates the relational learning problem with a probabilistic graphical model (PGM) in which we perform inference to learn about relational property and to carry out other relational processing tasks. Furthermore, our main learning algorithm is derived from the PGM using first principles, which gives us the flexibility to use any compatible computational inference method and still retain the desired properties of the proposed method. Our contribution in this paper is threefold. First, we propose a PGM that encapsulates the relational learning problem. Second, we formulate various relational processing tasks as performing inference and learning in the PGM. Third, we propose an efficient and effective learning algorithm that can be trained end-to-end and unsupervised.

2. PROBLEM DEFINITION

We begin by formulating the relational learning problem as a machine learning problem: we observe a paired dataset $X = \{(a^{(i)}, b^{(i)}) \mid i \in [1..N]\}$ consisting of $N$ i.i.d. samples generated from a joint distribution $p(a \in A, b \in B)$; our goal is to learn a relational property between $a^{(i)}$ and $b^{(i)}$ irrespective of their absolute property. Furthermore, we want the learning to be unsupervised, i.e., we do not require a labeled dataset, such as $(a^{(i)}, b^{(i)}, z^{(i)})$ where $z^{(i)}$ is a target variable indicating the relational property of $(a^{(i)}, b^{(i)})$, nor do we require training examples that are known to have the same (or different) relational property. Two distinct features separate our problem formulation from other unsupervised learning problem formulations:

1. We dissect the information in $X$ into relational property and absolute property; relational property characterizes the relationship between $a^{(i)}$ and $b^{(i)}$, whereas absolute property represents specific features that independently describe $a^{(i)}$ and $b^{(i)}$.
2. Our goal is to learn a relational property among $X$ irrespective of its absolute property, i.e., we want to learn a relational property that is decoupled from the absolute property.

In addition, we are interested in two related relational processing tasks: relational discrimination and relational mapping (VandenBos & APA, 2007). Relational discrimination allows us to differentiate $(a^{(i)}, b^{(i)})$ from $(a^{(j)}, b^{(j)})$ based on their relational properties, while relational mapping allows us to apply the relational property of $(a^{(i)}, b^{(i)})$ to a different set of data, for example, to deduce that $b^{(j)}$ is related to $a^{(j)}$ in the same way that $b^{(i)}$ is related to $a^{(i)}$.

3. METHOD

Here we introduce the proposed VRL method for addressing the relational learning problem and discuss various optimization challenges unique to VRL.

3.1. VARIATIONAL RELATION LEARNING

The proposed VRL method consists of two parts: first, we encapsulate the relational learning problem with a PGM, called VRL-PGM; we then formulate various relational processing tasks as performing inference and learning in VRL-PGM.

The VRL-PGM model, shown in Fig. 1, generates data $a$, $z$, and $b$ by sampling from PDFs that come from parametric families of distributions, $p_\theta(a)$, $p_\theta(z)$, and $p_\theta(b \mid a, z)$, that are differentiable almost everywhere with respect to (w.r.t.) $a$, $z$, and $\theta$. In practice, we observe only a set of independent realizations $\{(a^{(i)}, b^{(i)}) \mid i \in [1..N]\}$, while the true parameter $\theta^*$ and the corresponding latent variables $z^{(i)}$ are unobserved. A well-known property of the PGM shown in Fig. 1 is that the random variables $a$ and $z$ are independent when no variables are observed, but not conditionally independent when $b$ is observed (Bishop, 2006), i.e.,

$$p_\theta(a, z) = p_\theta(a)\, p_\theta(z), \qquad p_\theta(a, z \mid b) \neq p_\theta(a \mid b)\, p_\theta(z \mid b).$$

In VRL-PGM, the absolute property can be interpreted as representing the dependency between $a$ and $b$, i.e., features in $a$ that can be used to predict $b$, while the relational property, represented by the latent variable $z$, can be interpreted as any additional information that is not found in $a$ but helps to better predict $b$. The key requirement that the learned relational property be decoupled from the absolute property is enforced by the construction of VRL-PGM, in which $z$, which represents the relational property, and $a$, from which the absolute property is derived, are independent. The proposed VRL-PGM reflects our priorities and compromises in using a PGM to represent the abstract relational learning problem: we sacrifice some identifiability of the original abstract problem (e.g., VRL-PGM artificially introduces causal relationships between $a$ and $b$ and between $z$ and $b$), but we gain a rigorous and mathematically tractable PGM while achieving our primary objective of learning a decoupled (independent) relational property. Additional discussion of the connection between VRL-PGM and the relational learning problem is provided in appendix D.1.

Having established VRL-PGM, our primary learning objective is to approximate the unknown true likelihood function $p_\theta(b \mid a, z)$ and the posterior $p_\theta(z \mid a, b)$ (note that once $b$ is observed, the random variables $z$ and $a$ are no longer independent). Learning $p_\theta(z \mid a, b)$ provides us with a way to infer the relational property $z^{(i)}$ of $(a^{(i)}, b^{(i)})$; moreover, it serves as a basis for performing relational discrimination, where we compare relational properties between different pairs of data. Learning $p_\theta(b \mid a, z)$ allows us to perform relational mapping, where we use the relational property of $(a^{(i)}, b^{(i)})$ to map from $a^{(j)}$ to $b^{(j)}$, i.e., $b^{(j)} \sim p_\theta(b \mid a^{(j)}, z^{(i)})$ where $z^{(i)} \sim p_\theta(z \mid a^{(i)}, b^{(i)})$.
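To make the two relational processing tasks concrete, the following toy sketch may help. It is not the learned model: it assumes each datum is a single angle in degrees, and it replaces the posterior and the likelihood with hand-written deterministic stand-ins. The function names `infer_relation`, `apply_relation`, and `same_relation` are hypothetical and introduced only for illustration.

```python
def infer_relation(a: float, b: float) -> float:
    """Toy stand-in for z ~ p(z | a, b): the rotation offset from a to b (degrees)."""
    return (b - a) % 360.0


def apply_relation(a: float, z: float) -> float:
    """Toy stand-in for b ~ p(b | a, z): rotate a by the offset z (degrees)."""
    return (a + z) % 360.0


def same_relation(pair_i, pair_j, tol: float = 1e-9) -> bool:
    """Relational discrimination: compare the inferred relations of two pairs."""
    diff = abs(infer_relation(*pair_i) - infer_relation(*pair_j)) % 360.0
    return min(diff, 360.0 - diff) <= tol


# Relational mapping: infer z from the pair (a_i, b_i), then apply it to a_j.
z_i = infer_relation(10.0, 82.0)
b_j = apply_relation(200.0, z_i)
```

In this toy setting the relation (the offset) does not depend on the absolute value of either angle, which mirrors the decoupling requirement; in VRL both stand-ins are replaced by learned distributions.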

3.2. VARIATIONAL LOWER BOUND

We estimate the parameters of $p_\theta(b \mid a, z)$ by following the maximum-likelihood (ML) principle, and approximate the true posterior $p_\theta(z \mid a, b)$ with a variational Bayesian approach. More specifically, we use a variational distribution $q_\phi(z \mid a, b)$, parameterized by $\phi$, to approximate the unknown (and often intractable) true posterior by maximizing a variational lower bound (Bishop, 2006). To derive such a lower bound, we first write the log-evidence as

$$\log p_\theta(X) = \log p_\theta\big(\{(a^{(i)}, b^{(i)}) \mid i \in [1..N]\}\big) = \sum_{i=1}^{N} \log p_\theta(a^{(i)}, b^{(i)}),$$

where each term in the summation can be expressed as:

$$\log p_\theta(a^{(i)}, b^{(i)}) = D_{\mathrm{KL}}\big(q_\phi(z \mid a^{(i)}, b^{(i)}) \,\|\, p_\theta(z \mid a^{(i)}, b^{(i)})\big) + \mathbb{E}_{q_\phi(z \mid a^{(i)}, b^{(i)})}\big[\log p_\theta(z, a^{(i)}, b^{(i)}) - \log q_\phi(z \mid a^{(i)}, b^{(i)})\big]. \quad (1)$$

The first term on the RHS is the KL-divergence from $p_\theta(z \mid a^{(i)}, b^{(i)})$ to $q_\phi(z \mid a^{(i)}, b^{(i)})$, which provides a measure of dissimilarity between the two distributions; the second term on the RHS continues as:

$$\begin{aligned}
&\mathbb{E}_{q_\phi(z \mid a^{(i)}, b^{(i)})}\big[\log p_\theta(z, a^{(i)}, b^{(i)}) - \log q_\phi(z \mid a^{(i)}, b^{(i)})\big] \\
&\quad = \mathbb{E}_{q_\phi(z \mid a^{(i)}, b^{(i)})}\big[\log p_\theta(b^{(i)} \mid a^{(i)}, z)\, p_\theta(z)\, p_\theta(a^{(i)}) - \log q_\phi(z \mid a^{(i)}, b^{(i)})\big] \\
&\quad = \mathbb{E}_{q_\phi(z \mid a^{(i)}, b^{(i)})}\big[\log p_\theta(b^{(i)} \mid a^{(i)}, z) + \log p_\theta(z) - \log q_\phi(z \mid a^{(i)}, b^{(i)})\big] + \log p_\theta(a^{(i)}), \quad (2)
\end{aligned}$$

where in the second line we use the fact that the random variables $a$ and $z$ are independent. Substituting Eq. (2) back into Eq. (1) and rearranging terms (using $\log p_\theta(a^{(i)}, b^{(i)}) - \log p_\theta(a^{(i)}) = \log p_\theta(b^{(i)} \mid a^{(i)})$) gives us:

$$\log p_\theta(b^{(i)} \mid a^{(i)}) = D_{\mathrm{KL}}\big(q_\phi(z \mid a^{(i)}, b^{(i)}) \,\|\, p_\theta(z \mid a^{(i)}, b^{(i)})\big) + \mathcal{L}(\theta, \phi; a^{(i)}, b^{(i)}), \quad (3)$$

where

$$\mathcal{L}(\theta, \phi; a^{(i)}, b^{(i)}) = \mathbb{E}_{q_\phi(z \mid a^{(i)}, b^{(i)})}\big[\log p_\theta(b^{(i)} \mid a^{(i)}, z) + \log p_\theta(z) - \log q_\phi(z \mid a^{(i)}, b^{(i)})\big]. \quad (4)$$

Since the KL-divergence is non-negative, $\mathcal{L}(\theta, \phi; a^{(i)}, b^{(i)})$ (abbreviated as $\mathcal{L}^{(i)}$ for notational compactness) serves as a lower bound for the conditional log-likelihood $\log p_\theta(b^{(i)} \mid a^{(i)})$. Maximizing $\mathcal{L}^{(i)}$ w.r.t. $\theta$ and $\phi$ gives us both an ML estimate for $p_\theta(b \mid a, z)$ (by maximizing the first term inside the expectation in Eq. (4)) and a lower KL-divergence, i.e., a $q_\phi(z \mid a^{(i)}, b^{(i)})$ that better approximates the true posterior $p_\theta(z \mid a^{(i)}, b^{(i)})$, because the conditional log-likelihood $\log p_\theta(b^{(i)} \mid a^{(i)})$ does not depend on $\phi$.

The lower bound $\mathcal{L}^{(i)}$ can be maximized with gradient ascent methods; however, its gradient w.r.t. $\phi$ is difficult to obtain: the expectation in Eq. (4) is taken w.r.t. the distribution $q_\phi(z \mid a^{(i)}, b^{(i)})$, which is itself a function of $\phi$ (Paisley et al., 2012). To obtain efficient estimators for both $\mathcal{L}^{(i)}$ and its gradients, we adopt the reparameterization trick developed in Kingma & Welling (2014), where the random variable $z$ is expressed as a transformation of another random variable $\epsilon \sim p(\epsilon)$ that is independent of $a$, $b$, and $\phi$: $z = g(\epsilon, a^{(i)}, b^{(i)}, \phi)$, where $g$ is some differentiable and invertible transformation. Given such a change of variable, the lower bound $\mathcal{L}^{(i)}$ can be rewritten as:

$$\mathcal{L}^{(i)} = \mathbb{E}_{p(\epsilon)}\big[\log p_\theta(b^{(i)} \mid a^{(i)}, z) + \log p_\theta(z) - \log q_\phi(z \mid a^{(i)}, b^{(i)})\big], \quad (5)$$

where $z = g(\epsilon, a^{(i)}, b^{(i)}, \phi)$ and $\epsilon \sim p(\epsilon)$. Note that the expectation in Eq. (5) is taken w.r.t. $p(\epsilon)$, and we can now approximate $\mathcal{L}^{(i)}$ with a Monte Carlo estimator:

$$\tilde{\mathcal{L}}^{(i)} = \frac{1}{L} \sum_{l=1}^{L} \Big[\log p_\theta(b^{(i)} \mid a^{(i)}, z^{(i,l)}) + \log p_\theta(z^{(i,l)}) - \log q_\phi(z^{(i,l)} \mid a^{(i)}, b^{(i)})\Big], \quad (6)$$

where $z^{(i,l)} = g(\epsilon^{(i,l)}, a^{(i)}, b^{(i)}, \phi)$ and $\epsilon^{(i,l)} \sim p(\epsilon)$. The lower bound for a minibatch of data $X^M = \{(a^{(i)}, b^{(i)}) \mid i \in [1..M]\}$ can be approximated by

$$\mathcal{L}(\theta, \phi; X^M) = \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}^{(i)}.$$

Finally, the gradients

$$\nabla_{\theta,\phi}\, \mathcal{L}(\theta, \phi; X^M) = \frac{N}{M} \sum_{i=1}^{M} \nabla_{\theta,\phi}\, \tilde{\mathcal{L}}^{(i)}$$

can be computed in a straightforward manner and used to update the parameters $\theta$ and $\phi$ with stochastic optimization methods, such as SGD. Additional discussion is provided in appendix D.2, where we explain how the optimization of the variational lower bound in Eq. (4) naturally encourages the independence of $z$ and $a$; we also discuss possible extensions to Eq. (4) that explicitly safeguard against introducing dependency between $z$ and $a$.
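As a rough illustration of the reparameterized Monte Carlo estimator of the lower bound, the sketch below assumes a one-dimensional latent variable, a standard normal prior, and a Gaussian variational distribution with mean `mu` and standard deviation `sigma`. The likelihood is passed in as a callable; the toy linear-Gaussian likelihood in the usage lines is an assumption chosen only because its marginal and posterior are known in closed form, and the function names are hypothetical.

```python
import math
import random


def log_normal_pdf(x: float, mean: float, std: float) -> float:
    """Log density of a univariate Gaussian N(x; mean, std^2)."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def lower_bound_estimate(log_lik, mu: float, sigma: float, num_samples: int, rng) -> float:
    """Monte Carlo estimate of E_q[log p(b|a,z) + log p(z) - log q(z|a,b)].

    z is drawn via the reparameterization z = mu + sigma * eps, eps ~ N(0, 1),
    so the randomness does not depend on the variational parameters.
    """
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        total += log_lik(z) + log_normal_pdf(z, 0.0, 1.0) - log_normal_pdf(z, mu, sigma)
    return total / num_samples


# Toy model (assumption): p(b | a, z) = N(b; a + z, 1) with a = 0, b = 1.
a, b = 0.0, 1.0
toy_log_lik = lambda z: log_normal_pdf(b, a + z, 1.0)
log_marginal = log_normal_pdf(b, a, math.sqrt(2.0))   # log p(b | a) in closed form
rng = random.Random(0)
tight = lower_bound_estimate(toy_log_lik, (b - a) / 2.0, math.sqrt(0.5), 10, rng)
loose = lower_bound_estimate(toy_log_lik, 0.0, 1.0, 20000, rng)
```

When the variational distribution equals the true posterior of the toy model, the KL term vanishes and the estimate matches the conditional log-likelihood; any other choice yields a strictly smaller value in expectation.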

3.3. OPTIMIZATION CHALLENGES

The proposed VRL method introduces unique challenges to the variational lower bound optimization problem (see Sønderby et al. (2016) and Bowman et al. (2016) for other known challenges). To explain these challenges, we first break down VRL's parameter updating process into the following steps (using a single datapoint as an example): (1) a datapoint $(a^{(i)}, b^{(i)})$ is selected; (2) $z^{(i)} \sim q_{\phi_k}(z \mid a^{(i)}, b^{(i)})$ is sampled using the current parameter $\phi_k$; (3) $\mathcal{L}^{(i)}$ is evaluated using $\phi_k$, $\theta_k$; (4) the gradients $g = \nabla_{\theta_k, \phi_k} \mathcal{L}^{(i)}$ are calculated; (5) the gradients $g$ are used to update $\phi_k$, $\theta_k$ and obtain new parameters $\phi_{k+1}$, $\theta_{k+1}$. This parameter updating process can be depicted with the information flow diagram shown in Fig. 2a. Ideally, we would like every path in Fig. 2a to contribute to the evaluation of all the terms in its reachable nodes in order to obtain meaningful gradients for updating its associated parameters; however, there are two situations where this is not the case.

Figure 2: Information flow diagrams depicting VRL's parameter updating process, where each path uses its associated parameters to propagate information in the forward direction and gradients in the backward direction: (a) unmodified parameter updating process, where overfitting occurs when the learning of $p_\theta(b^{(i)} \mid a^{(i)}, z^{(i)})$ relies only on the dash-dotted path (deterministic-mapping) or the dashed path (information-shortcut); (b) parameter updating process improved with RPDA.

The first situation, called information-shortcut, occurs when the learning of $p_\theta(b^{(i)} \mid a^{(i)}, z^{(i)})$ relies entirely on the dashed path in Fig. 2a; more specifically, the dashed (shortcut) path directly propagates $b^{(i)}$ through $z^{(i)}$ to $p_\theta(b^{(i)} \mid a^{(i)}, z^{(i)})$ and, as a result, the relational property $z^{(i)}$ may learn only to encode the absolute property of $b^{(i)}$. The second situation, called deterministic-mapping, occurs when $b^{(i)}$ can be fully characterized by $a^{(i)}$; in this case, the learning of $p_\theta(b^{(i)} \mid a^{(i)}, z^{(i)})$ may rely only on the dash-dotted path in Fig. 2a. While both situations can be viewed as an overfitting problem, deterministic-mapping says more about the data itself and exposes a potential limitation of VRL: a decoupled relational property is primarily learned by exploring information that is not found in $a$ but helps to better predict $b$, and when $a$ fully characterizes $b$, VRL no longer needs to explore information beyond $a$ to predict $b$, which may prevent VRL from learning a meaningful relational property. On the other hand, information-shortcut is caused by short-cutting the parameter updating process, which we may overcome with additional regularization techniques.

Here we propose two approaches for mitigating the information-shortcut problem by disrupting the flow of information passing through the shortcut path. In the first approach, we restrict the flow of information by constraining the expressiveness of the latent variable $z$; for example, by adopting an informative prior with a restrictive constraint, such as $p_\theta(z) = \mathcal{N}(z; 0, \sigma^2 I)$ with $\sigma \ll 1$, or by representing $z$ with a discrete random variable (assuming we know a priori that the underlying relational properties are discrete). In the second approach, we propose a novel data augmentation strategy, relation-preserving data augmentation (RPDA), that aims to eliminate the shortcut path. First, we define a set of relation-preserving functions $D = \{ d(a, b; r) \mid r \in R \text{ (some index set)},\ d : A \times B \to A \times B \}$ where the data relationship is preserved in the following sense:

$$p_\theta(z \mid a, b) = p_\theta(z \mid \tilde{a}, \tilde{b}) \quad \text{for } (\tilde{a}, \tilde{b}) = d(a, b; r),\ \forall r \in R.$$

With RPDA, the Monte Carlo estimator in Eq. (6) becomes:

$$\tilde{\mathcal{L}}^{(i)}_{\mathrm{RPDA}} = \frac{1}{L} \sum_{l=1}^{L} \Big[\log p_\theta(b^{(i)} \mid a^{(i)}, z^{(i,l)}) + \log p_\theta(z^{(i,l)}) - \log q_\phi(z^{(i,l)} \mid \tilde{a}^{(i)}, \tilde{b}^{(i)})\Big], \quad (7)$$

where $(\tilde{a}^{(i)}, \tilde{b}^{(i)}) = d(a^{(i)}, b^{(i)}; r^{(i)})$, $r^{(i)} \sim U(R)$, $z^{(i,l)} = g(\epsilon^{(i,l)}, \tilde{a}^{(i)}, \tilde{b}^{(i)}, \phi)$, and $\epsilon^{(i,l)} \sim p(\epsilon)$. Note that due to the relation-preserving property of $D$, we have $q_\phi(z^{(i,l)} \mid \tilde{a}^{(i)}, \tilde{b}^{(i)}) = q_\phi(z^{(i,l)} \mid a^{(i)}, b^{(i)})$ and, therefore, $\tilde{\mathcal{L}}^{(i)}_{\mathrm{RPDA}}$ is equivalent to $\tilde{\mathcal{L}}^{(i)}$ in Eq. (6). When we optimize with $\tilde{\mathcal{L}}^{(i)}_{\mathrm{RPDA}}$, the parameter updating process can be redrawn as in Fig. 2b, where the learning of $p_\theta(b^{(i)} \mid a^{(i)}, z^{(i)})$ can no longer rely solely on the shortcut path to propagate $b^{(i)}$, since the shortcut path now carries $\tilde{b}^{(i)}$, which differs from $b^{(i)}$ by a non-deterministic factor $r^{(i)}$.

In practice, it may seem unrealistic to assume that we can construct a set of RPDA functions $D$ without extensive knowledge of the underlying relational property. However, we can treat data augmentation as a form of regularization and construct a $D$ that reflects our prior knowledge and beliefs about the underlying system (Ronneberger et al., 2015; Perez & Wang, 2017). For example, if we want the learning to be rotation invariant (a common theme in computer vision applications), we can construct a $D$ that consists of image rotation augmentations, e.g., $d(a, b; r) = (\mathrm{rot}(a, r), \mathrm{rot}(b, r))$ where $\mathrm{rot}(x, r)$ rotates the image $x$ by $r \in R = [0, 360)$ degrees (note that both $a$ and $b$ are rotated by the same amount). Additional remarks on the practical applicability of RPDA are provided in appendix D.3, and a detailed ablation study is provided in appendix C. To summarize this section, the proposed VRL method with RPDA is described in Algorithm 1.
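The rotation example d(a, b; r) = (rot(a, r), rot(b, r)) can be sketched in a few lines. As a simplifying assumption, this sketch restricts r to quarter turns (multiples of 90 degrees) on images stored as nested lists, rather than the continuous range [0, 360) described above, so that it needs no image library. The names `rotate_quarter_turns` and `rpda_rotate` are hypothetical.

```python
import random


def rotate_quarter_turns(image, k: int):
    """Rotate a 2-D nested-list image clockwise by k * 90 degrees."""
    rotated = [list(row) for row in image]
    for _ in range(k % 4):
        rotated = [list(row) for row in zip(*rotated[::-1])]
    return rotated


def rpda_rotate(a, b, rng):
    """Apply the SAME randomly drawn rotation to both images of a pair.

    Because both members are rotated by the same amount, the relative
    rotation between them (the relational property) is unchanged.
    """
    k = rng.randrange(4)  # assumption: r drawn uniformly from {0, 90, 180, 270} degrees
    return rotate_quarter_turns(a, k), rotate_quarter_turns(b, k), k


rng = random.Random(0)
a = [[1, 2], [3, 4]]
b = rotate_quarter_turns(a, 1)      # b is a rotated by one quarter turn
a_aug, b_aug, k = rpda_rotate(a, b, rng)
```

After augmentation, `b_aug` is still `a_aug` rotated by one quarter turn, while the absolute orientation of both images has changed by the random amount `k`.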

4. RELATED WORK

Machine learning approaches for relational processing have gained increasing interest and attention in the literature. Most of these methods focus on high-level cognitive tasks, such as visual Q&A and state prediction for complex-physics systems, and derive their relational processing capabilities from learning with cleverly designed neural networks (Hill et al., 2019; Santoro et al., 2017; Raposo et al., 2017; Battaglia et al., 2018; 2016; Wu et al., 2015; Reed et al., 2015; van Steenkiste et al., 2018; Chang et al., 2016; Fragkiadaki et al., 2015). Our work differs from these methods in two ways: (1) our primary focus is addressing the relational learning problem, where we want to learn a decoupled relational property; (2) we enforce the decoupling requirement on the learned relational property with a PGM, which gives us the flexibility to use any compatible inference method or function approximation and still satisfy the decoupling requirement.

Algorithm 1 VRL with RPDA
    procedure VRL($X$, $p(\epsilon)$, $D$)
        If RPDA is not available, $D = \{ \mathrm{id}(\cdot) \mid (a, b) = \mathrm{id}(a, b) \}$
        Initialize parameters $\theta$, $\phi$
        while parameters ($\theta$, $\phi$) have not converged do
            Sample minibatch $X^M = \{(a^{(i)}, b^{(i)}) \mid i \in [1..M]\}$ from $X$.
            Sample $\epsilon^{(i,l)} \sim p(\epsilon)$, $r^{(i)} \sim U(R)$, $i = 1, ..., M$, $l = 1, ..., L$.
            Run RPDA and obtain $(\tilde{a}^{(i)}, \tilde{b}^{(i)}) = d(a^{(i)}, b^{(i)}; r^{(i)})$, $i = 1, ..., M$.
            Compute gradients $g = \nabla_{\theta,\phi} \mathcal{L}(\theta, \phi; X^M) = \frac{N}{M} \sum_{i=1}^{M} \nabla_{\theta,\phi} \tilde{\mathcal{L}}^{(i)}_{\mathrm{RPDA}}$ (see Eq. (7)).
            Update parameters $\theta$, $\phi$ using gradients $g$ (e.g., SGD).
        end while
        return $\theta$, $\phi$
    end procedure

Conventional unsupervised learning methods can also be applied to our problem setting (Kingma & Welling, 2014; Goodfellow et al., 2014; Mikolov et al., 2013a; b; Song et al., 2007); however, these methods learn a single representation of the data that superimposes information about its relational and absolute properties. The difficulty of decoupling the relational property from the learned representation constitutes a major obstacle to relational reasoning.

Other related work includes methods for learning disentangled representations, with applications in style transfer, image-to-image translation, domain adaptation, etc. (Huang et al., 2018; Chen et al., 2016; Tenenbaum & Freeman, 1997; Higgins et al., 2017; Bousmalis et al., 2016; Mathieu et al., 2016; Tulyakov et al., 2017; Denton & Birodkar, 2017; Villegas et al., 2017; Donahue et al., 2017; Shen et al., 2017). Most of these methods strive to learn disentangled representations of content and style (or pose for video sequence data), where content is generically defined as the underlying spatial structure, and style as the rendering of that structure. In comparison, our work can be viewed as learning disentangled representations of relational and absolute property; however, we argue that style-content separation is fundamentally different from relational-absolute separation. More specifically, we consider both style and content information to be absolute property (both describe features of an individual datum), while relational property provides new information on the (abstract) relationship between the paired data.

5. EXPERIMENT

In this section, we present experimental results from applying the proposed VRL method to a set of relational learning tasks designed with the MNIST dataset (LeCun & Cortes, 2010).

5.1. MNIST RELATIONAL LEARNING TASK

To set up a relational learning task, a paired dataset $X = \{(a^{(i)}, b^{(i)}) \mid i \in [1..N]\}$ was generated by the following steps: (1) the MNIST dataset $\{(x^{(i)}, y^{(i)}) \mid i \in [1..T]\}$, where $x^{(i)}$ and $y^{(i)}$ are the digit images and their labels, was augmented by applying five evenly spaced rotations to each image $x^{(i)}$ to get $\{(x^{(i,j)}, y^{(i)}) \mid i \in [1..T], j \in [1..5]\}$ (examples of augmented images are shown in Fig. 3, with $x^{(i,1)}$: 0°, $x^{(i,2)}$: 72°, $x^{(i,3)}$: 144°, $x^{(i,4)}$: 216°, $x^{(i,5)}$: 288°); (2) each datapoint $(a^{(i)}, b^{(i)})$ of $X$ was chosen to be a pair of randomly rotated versions of the same digit image, i.e., $X = \{(a^{(i)} = x^{(j,k)}, b^{(i)} = x^{(j,l)}) \mid j \sim U([1..T]),\ k, l \sim U([1..5])\}$. Note that there are five uniquely defined rotational relationships between $(a^{(i)}, b^{(i)})$ in $X$, and since they are randomly selected, the rotational relationship (relational property) is decoupled from the absolute properties of $(a^{(i)}, b^{(i)})$. Here, we use $X$ to assess VRL's capability of discovering the underlying relational property (rotational relationships) irrespective of the absolute property. Additional MNIST relational learning tasks are introduced in appendix B.1.

The MNIST relational learning experiment presented above may seem contrived at first glance but, upon deeper examination, it represents a novel problem setting that exemplifies a key relational learning challenge for existing unsupervised learning methods: learning a decoupled relational property (see Sec. 4 for related work). We argue that any unsupervised method that can successfully solve the above problem (or any relational learning problem) must, at a minimum, simultaneously accomplish the following two goals:

1. An unsupervised learning mechanism that captures and preserves the data relationships (relational property), e.g., capturing and preserving the rotational relationship between $(a^{(i)}, b^{(i)})$ during learning.
2. An unsupervised learning mechanism that decouples absolute property from the learned data relationships, e.g., learning a rotational relationship between $(a^{(i)}, b^{(i)})$ that does not depend on the absolute property (digit representation, rotation, etc.) of the individual $a^{(i)}$, $b^{(i)}$.

While most existing unsupervised methods are well-equipped to accomplish the first goal, we argue that the second goal presents a major challenge; more specifically, to the best of our knowledge, most existing unsupervised methods focus on modeling all aspects of $(a^{(i)}, b^{(i)})$ and learn data relationships either jointly with the absolute properties of $a^{(i)}$, $b^{(i)}$ or as their derivatives (including graph-based methods that use edges to represent relationships). For such methods, the learned data relationships are necessarily entangled with the absolute properties of the data, and therefore the fundamental relational learning challenge, decoupling the relational property, remains unresolved. A key insight into why the proposed VRL framework is capable of overcoming this challenge is that VRL takes a more targeted approach and learns data relationships as a stand-alone entity (represented by the latent variable $z$) that is designed, through the construction of VRL-PGM, to be independent of $a$ (however, we note that in VRL the learned relational property $z$ is independent only of $a$, not of $b$; a discussion of this compromise and its implications is provided in appendix D.1).
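The pair-sampling procedure of the MNIST task can be sketched at the level of indices, without loading any images. In this sketch a pair is represented by the image index j and two rotation indices k and l; the `relation_deg` field is computed only so that the decoupling can be checked, and it would not be given to an unsupervised learner. The function name `sample_pair` and the dictionary layout are hypothetical.

```python
import random

# The five evenly spaced rotations applied to every digit image (degrees).
ROTATIONS_DEG = [0, 72, 144, 216, 288]


def sample_pair(num_images: int, rng):
    """Sample one datapoint (a, b) = (x^(j,k), x^(j,l)) as index tuples.

    j picks the digit image; k and l independently pick the rotations of a and b.
    The relative rotation (l - k) takes one of five values and does not depend on j.
    """
    j = rng.randrange(num_images)
    k = rng.randrange(len(ROTATIONS_DEG))
    l = rng.randrange(len(ROTATIONS_DEG))
    relation_deg = (ROTATIONS_DEG[l] - ROTATIONS_DEG[k]) % 360
    return {"a": (j, k), "b": (j, l), "relation_deg": relation_deg}


rng = random.Random(0)
pairs = [sample_pair(100, rng) for _ in range(2000)]
relations = {p["relation_deg"] for p in pairs}
```

Because k and l are drawn independently of j and of each other, each of the five relative rotations occurs with every digit image and every absolute orientation of a.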

5.2. IMPLEMENTATION

For these experiments, we adopted a two-dimensional latent variable $z \in \mathbb{R}^2$ and let the prior $p_\theta(z)$ be the bivariate normal distribution. We let $p_\theta(b \mid a^{(i)}, z^{(i,l)})$ be a multivariate Bernoulli distribution whose probability parameters are computed from a given $a^{(i)}$ and $z^{(i,l)}$ with an autoencoder-like neural network $f^{\mathrm{dec}}_\theta\big(f^{\mathrm{enc}}_\theta(a^{(i)}), z^{(i,l)}\big)$. We let the approximated posterior $q_\phi(z \mid a^{(i)}, b^{(i)})$ be a bivariate Gaussian distribution with a diagonal covariance, $\mathcal{N}(z; \mu^{(i)}, (\sigma^{(i)})^2 I)$, where $\mu^{(i)}$ and $\sigma^{(i)}$ are the outputs of a neural network $f^{q}_\phi(a^{(i)}, b^{(i)})$. For RPDA, we constructed a $D$ that consists of image rotation augmentations: $D = \{ (\mathrm{rot}(a, r), \mathrm{rot}(b, r)) \mid r \in [0, 360) \}$. The detailed experimental setup is described in appendix A.

5.3. RESULTS

Relational discrimination. We trained VRL on $X$ and used the approximated posterior $q_\phi(z \mid a, b)$ to infer the relational property of a hold-out dataset. Figure 4a shows a scatter plot of the relational property inferred by VRL, where we can see that the approximated posterior accurately clusters data with the same rotational relationship together and separates data with different rotational relationships. Here, we compared VRL with the variational autoencoder (VAE) (Kingma & Welling, 2014) and InfoGAN (Chen et al., 2016), both of which can learn data representations of $X$ in a completely unsupervised manner. In the application of VAE, we adopted a 2-D latent space and used the encoder trained on $X$ to infer the latent variable of a hold-out dataset; the resulting scatter plot is shown in Fig. 4b, where we see that VAE failed to discriminate data based on their relational property (rotational relationship). Next, InfoGAN has demonstrated its ability to learn disentangled representations (represented by structured latent codes $c_1, c_2, ..., c_L$) through generative modelling. Although inferring latent codes for a given data point is a non-trivial task for InfoGAN, we can examine the learned latent representation by manipulating the latent codes and visually inspecting the generated images. We followed the setup of Chen et al. (2016) and modeled the latent codes with one categorical code $c_1 \sim \mathrm{Cat}(K = 10, p = 0.1)$ (to model discontinuous variation in the data) and two continuous codes $c_2, c_3 \sim \mathrm{Unif}(-1, 1)$ (to capture continuous variations in style). Figure 5 shows examples of generated images from manipulating the latent codes; none of the latent codes (or a combination of them) distinctively captures the full range of the relational property (rotational relationship). Figures 4b and 5 illustrate a major challenge of using VAE and InfoGAN (and other related methods) for relational learning: these methods learn a single representation that encodes both the relational and the absolute property of $X$, and it is difficult to dissect the relational property from the learned representation. The detailed experimental setup for both VAE and InfoGAN is described in appendix A.

Figure 4: Scatter plots of inferred latent variables (axes $z_1$, $z_2$; legend: rotational relationships +0, +72, +144, +216, +288): (a) relational properties inferred by VRL; (b) latent variables inferred by VAE.

Relational mapping. We evaluated VRL's learned likelihood function $p_\theta(b \mid a, z)$ by visualizing the predicted images given $a$ and $z$. We chose $a$ from a hold-out dataset and $z$ from: (1) direct sampling in the latent space; (2) the relational property inferred from a source datapoint $(a_s, b_s)$. Figure 6a shows predicted images with $z$ sampled from the latent space shown in Fig. 4a. Figure 6b shows examples of relational mappings from $a^{(c)}$ to $b^{(r,c)}$ obtained by applying the relational property inferred from a source datapoint $(a_s, b^{(r)}_s)$. At first glance, the results in Fig. 6b resemble those of style transfer, but they are fundamentally different: in style transfer, the image $b^{(r,c)}$ is generated by applying the style of b [...]

Figure 6: Examples of images predicted by VRL: (a) images predicted from sampled latent variables $z^{(1)}, ..., z^{(5)}$ (sampling the centroid of each cluster in Fig. 4a, e.g., "+" → $z^{(3)}$, "×" → $z^{(4)}$, "♦" → $z^{(5)}$); (b) relational mapping $b^{(r,c)} \sim p_\theta(b \mid a^{(c)}, z^{(r)})$ where $z^{(r)} \sim q_\phi(z \mid a_s, b^{(r)}_s)$.

Additional experimental results and discussion are provided in appendix B and D.4, respectively.

6. DISCUSSION AND CONCLUSION

A core component of the proposed VRL method is approximating the intractable posterior with variational inference (VI) methods. There is a vast literature on VI that we can leverage to further improve and extend VRL; for example, prior works have proposed flexible and complex approximate posterior distributions that we can use to learn rich posterior approximations for characterizing the relational property (Rezende & Mohamed, 2015; Dinh et al., 2014). Another interesting idea to explore is learning a generative model for VRL: in this work, the primary learning objective is maximizing a variational lower bound derived from VRL-PGM; however, with the advent of computationally efficient methods for learning a generative model, it would be interesting to include another learning objective that directly models the data-generating aspect of VRL-PGM (Goodfellow et al., 2014; Larsen et al., 2015). The proposed method comes with both advantages and disadvantages: the main advantage of VRL lies in its relational learning capabilities; however, this may also be one of its disadvantages. More specifically, VRL can learn a decoupled relational property even when it is coupled with the absolute property, i.e., VRL is oblivious to the coupling information between the two properties (an example with a coupled relational property is provided in appendix B.1.1). Nevertheless, such information may be of interest to the user, and in this regard, VRL provides only a partial view of the data. In conclusion, the proposed VRL method is an efficient and effective unsupervised learning method for addressing the relational learning problem, where our goal is to learn a decoupled relational property. By dissecting the information in the data into decoupled relational and absolute properties, we hope VRL can bring new insight into everyday data analysis and ultimately find applications in a wide variety of problems.

A EXPERIMENTAL SETUP

Recall that we adopted a two-dimensional latent variable z ∈ R 2 and let the prior p θ ( z ) be the bivariate normal distribution. For binary valued data (e.g., MNIST dataset), we let the likelihood function p θ ( b ∈ B | a (i) , z (i,l) ) of VRL be a multivariate Bernoulli distribution whose probability parameters p (i,l) are computed from a given a (i) and z (i,l) with an autoencoder-like neural network f dec θ f enc θ (a (i) ), z (i,l) : f enc θ :a (i) → Conv(3x3x8) → Conv(3x3x32) → Conv(3x3x128) → FC(20) → h (i) ∈ R 20 f dec θ :[h (i) ∈ R 20 , z (i,l) ∈ R 2 ] → FC → Conv T (3x3x128) → Conv T (3x3x32) → Conv T (3x3x8) → Conv(1x1x1) Sigmoid ----→ [0, 1] dim(B) , where Conv(•) is a strided (stride 2) convolutional layer and Conv T (•) is a transposed convolutional layer. We used batch-normalization after most layers and leaky rectified linear units (with leaky rate 0.01) as nonlinear activation function. We let the approximated posterior q φ ( z | a (i) , b (i) ) of VRL be a bivariate Gaussian distribution with a diagonal covariance N (z; µ (i) , (σ (i) ) 2 I) where µ (i) and σ (i) are the output of a neural network f q φ (a (i) , b (i) ): f q φ : [a (i) , b (i) ] → Conv(3x3x8) → Conv(3x3x32) → Conv(3x3x128) → FC(4) → [µ (i) , σ (i) ]. We sampled from the posterior z (i,l) ∼ N (z; l) where (i,l) ∼ p( ) = N (0, I) and set L = 1. We used image rotation augmentations: D = { (rot(a, r), rot(b, r)) | r ∈ [0, 360) } for constructing RPDA. In this case, the learning objective µ (i) , (σ (i) ) 2 I) using z (i,l) = g( (i,l) , a (i) , b (i) , φ) = µ (i) + σ (i) (i, L (i) RPDA in Eq. ( 7) can further be derived as (see next section for the derivation): L (i) RPDA = 1 2 2 j=1 1 + log((σ (i) j ) 2 ) -(µ (i) j ) 2 -(σ (i) j ) 2 + k b (i) k log p (i,1) k + (1 -b (i) k ) log(1 -p (i,1) k ). 
where $(\mu^{(i)}, \sigma^{(i)}) = f^{q}_\phi(a^{(i)}, b^{(i)})$, $p^{(i,1)} = f^{dec}_\theta(f^{enc}_\theta(\tilde{a}^{(i)}), z^{(i,1)})$, $z^{(i,1)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(i,1)}$, $\epsilon^{(i,1)} \sim \mathcal{N}(0, I)$, $(\tilde{a}^{(i)}, \tilde{b}^{(i)}) = (\mathrm{rot}(a^{(i)}, r^{(i)}), \mathrm{rot}(b^{(i)}, r^{(i)}))$, and $r^{(i)} \sim U([0, 360))$. Parameters $\theta$ and $\phi$ were jointly trained to maximize Eq. (8) using the Adam optimizer (learning rate = 0.0001, $\beta_1$ = 0.9, $\beta_2$ = 0.999) (Kingma & Ba, 2014) with minibatches of size $M = 100$.

In our VAE implementation we used the following network architecture for the encoder, Enc($a^{(i)}, b^{(i)}$), and the decoder, Dec($z^{(i,l)}$):

Enc: $[a^{(i)}, b^{(i)}]$ → Conv(3x3x16) → Conv(3x3x64) → Conv(3x3x256) → FC(4) → $[\mu^{(i)} \in \mathbb{R}^2, \sigma^{(i)} \in \mathbb{R}^2]$

Dec: $[z^{(i,l)} \in \mathbb{R}^2]$ → FC → ConvT(3x3x256) → ConvT(3x3x64) → ConvT(3x3x16) → Conv(1x1x1) → Sigmoid → $[0, 1]^{\dim(\mathcal{B})}$.

We used the Adam optimizer with the same hyperparameter setup as before. For the InfoGAN implementation, we followed the setup described in Chen et al. (2016), except that each datapoint consists of a paired MNIST image $(a^{(i)}, b^{(i)})$.

Under review as a conference paper at ICLR 2021

A.1 DERIVATION OF TRAINING OBJECTIVE FUNCTION

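Our derivation of Eq. (8) largely follows the work of Kingma & Welling (2014); for completeness, we outline the key steps here. First, Eq. (4) in Section 3.2 can equivalently be expressed as:

$$\mathcal{L}^{(i)} = -D_{KL}\big(q_\phi(z \mid a^{(i)}, b^{(i)}) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z \mid a^{(i)}, b^{(i)})}\big[\log p_\theta(b^{(i)} \mid a^{(i)}, z)\big] = -D_{KL}\big(q_\phi(z \mid a^{(i)}, b^{(i)}) \,\|\, p_\theta(z)\big) + \mathbb{E}_{p(\epsilon)}\big[\log p_\theta(b^{(i)} \mid a^{(i)}, z)\big], \quad (9)$$

where $z = g(\epsilon, a^{(i)}, b^{(i)}, \phi)$, $\epsilon \sim p(\epsilon)$, and $g$ is some differentiable and invertible transformation. We can approximate $\mathcal{L}^{(i)}$ in Eq. (9) with a Monte Carlo estimator $\tilde{\mathcal{L}}^{(i)}$:

$$\tilde{\mathcal{L}}^{(i)} = -D_{KL}\big(q_\phi(z \mid a^{(i)}, b^{(i)}) \,\|\, p_\theta(z)\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(b^{(i)} \mid a^{(i)}, z^{(i,l)}), \quad (10)$$

where $z^{(i,l)} = g(\epsilon^{(i,l)}, a^{(i)}, b^{(i)}, \phi)$ and $\epsilon^{(i,l)} \sim p(\epsilon)$. Based on the RPDA functions $\mathcal{D} = \{ d(a, b; r) \mid r \in \mathcal{R} \}$ and the relation-preserving assumption that $q_\phi(z \mid a^{(i)}, b^{(i)}) = q_\phi(z \mid \tilde{a}^{(i)}, \tilde{b}^{(i)})$ for $(\tilde{a}^{(i)}, \tilde{b}^{(i)}) = d(a^{(i)}, b^{(i)}; r)$, $\forall r$ (see Section 3.3), we can express $\tilde{\mathcal{L}}^{(i)}$ in Eq. (10) equivalently as $\tilde{\mathcal{L}}^{(i)}_{RPDA}$:

$$\tilde{\mathcal{L}}^{(i)}_{RPDA} = -D_{KL}\big(q_\phi(z \mid a^{(i)}, b^{(i)}) \,\|\, p_\theta(z)\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(\tilde{b}^{(i)} \mid \tilde{a}^{(i)}, z^{(i,l)}), \quad (11)$$

where $z^{(i,l)} = g(\epsilon^{(i,l)}, a^{(i)}, b^{(i)}, \phi)$, $\epsilon^{(i,l)} \sim p(\epsilon)$, $(\tilde{a}^{(i)}, \tilde{b}^{(i)}) = d(a^{(i)}, b^{(i)}; r^{(i)})$, and $r^{(i)} \sim U(\mathcal{R})$. The first term in Eq. (11) is the KL-divergence from $p_\theta(z)$ to $q_\phi(z \mid a^{(i)}, b^{(i)})$, which can be computed analytically given that we assume $p_\theta(z) = \mathcal{N}(0, I)$ and $q_\phi(z \mid a^{(i)}, b^{(i)}) = \mathcal{N}(z; \mu^{(i)}, (\sigma^{(i)})^2 I)$:

$$-D_{KL}\big(q_\phi(z \mid a^{(i)}, b^{(i)}) \,\|\, p_\theta(z)\big) = \frac{1}{2} \sum_{j=1}^{2} \Big( 1 + \log\big((\sigma^{(i)}_j)^2\big) - (\mu^{(i)}_j)^2 - (\sigma^{(i)}_j)^2 \Big). \quad (12)$$

The likelihood function $p_\theta(\tilde{b}^{(i)} \mid \tilde{a}^{(i)}, z^{(i,l)})$ is defined as a multivariate Bernoulli distribution whose probability parameters $p^{(i,l)}$ are computed from the neural network $f^{dec}_\theta(f^{enc}_\theta(\tilde{a}^{(i)}), z^{(i,l)})$, and we have:

$$\log p_\theta(\tilde{b}^{(i)} \mid \tilde{a}^{(i)}, z^{(i,l)}) = \sum_{k} \Big[ \tilde{b}^{(i)}_k \log p^{(i,l)}_k + (1 - \tilde{b}^{(i)}_k) \log\big(1 - p^{(i,l)}_k\big) \Big]. \quad (13)$$

By substituting Eq. (12) and (13) back into (11) and recalling that $z^{(i,l)} = g(\epsilon^{(i,l)}, a^{(i)}, b^{(i)}, \phi) = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(i,l)}$, $p(\epsilon) = \mathcal{N}(0, I)$, $\mathcal{D} = \{ (\mathrm{rot}(a, r), \mathrm{rot}(b, r)) \mid r \in [0, 360) \}$, and $L = 1$, we arrive at Eq. (8).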
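The two terms of Eq. (8) can be checked numerically with a short sketch. The code below is a minimal illustration in plain Python, not the training code used in our experiments: the function names are hypothetical, the decoder probabilities `p` are passed in directly instead of being produced by a network, and the probability clipping is an added numerical safeguard.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (hypothetical helper)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def neg_kl_to_standard_normal(mu, sigma):
    """First term of Eq. (8): -KL(N(mu, diag(sigma^2)) || N(0, I))."""
    return 0.5 * sum(1.0 + math.log(s * s) - m * m - s * s
                     for m, s in zip(mu, sigma))


def bernoulli_log_likelihood(b, p, eps=1e-7):
    """Second term of Eq. (8): sum_k b_k log p_k + (1 - b_k) log(1 - p_k)."""
    total = 0.0
    for bk, pk in zip(b, p):
        pk = min(max(pk, eps), 1.0 - eps)  # clipping is an assumption, not in Eq. (8)
        total += bk * math.log(pk) + (1.0 - bk) * math.log(1.0 - pk)
    return total


def lower_bound_estimate(mu, sigma, b, p):
    """Single-sample (L = 1) estimate: KL term plus Bernoulli log-likelihood."""
    return neg_kl_to_standard_normal(mu, sigma) + bernoulli_log_likelihood(b, p)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, sigma = [0.3, -0.2], [0.9, 1.1]  # toy posterior outputs
    z = reparameterize(mu, sigma, rng)   # would be fed to the decoder
    b = [1, 0, 1, 1]                     # toy 4-pixel binary target
    p = [0.8, 0.1, 0.7, 0.6]             # toy decoder probabilities
    print(z, lower_bound_estimate(mu, sigma, b, p))
```

The KL term vanishes when the posterior equals the prior ($\mu = 0$, $\sigma = 1$) and is negative otherwise, which is why it acts as a regularizer on $z$.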

B.1 ADDITIONAL MNIST RELATIONAL LEARNING EXPERIMENTS

Here, we provide additional experimental results based on the paired MNIST dataset constructed in Section 5.

B.1.1 MNIST EXAMPLE WITH COUPLED RELATIONAL PROPERTY

First, to further test the robustness of the proposed method, we considered a scenario where the underlying relational property is coupled with the absolute property. An example dataset $X_2$ was constructed with the rotational relationship of each datapoint completely determined by its digit label:

$$X_2 = \{ (a^{(i)} = x^{(j,k)},\ b^{(i)} = x^{(j,l+1)}) \mid j \sim U([1..T]),\ k \sim U([1..5]),\ l = (k + \lfloor y^{(j)}/2 \rfloor) \bmod 5 \}.$$

In this case, it is possible to infer the relational property solely from the image representation of the digit (the absolute property of $a^{(i)}$); for example, $a^{(i)} \in$ ['0', '1'] → 0° (read: if $a^{(i)}$ is recognized as either digit 0 or 1, the rotational relationship between $(a^{(i)}, b^{(i)})$ is 0°), $a^{(i)} \in$ ['2', '3'] → 72°, $a^{(i)} \in$ ['4', '5'] → 144°, $a^{(i)} \in$ ['6', '7'] → 216°, $a^{(i)} \in$ ['8', '9'] → 288°. The question then arises: is VRL capable of learning a decoupled relational property even when it is coupled with the absolute property? To test this idea, we trained VRL (with the same setup as that described in appendix A) on $X_2$ but validated it on a hold-out dataset with a decoupled relational property (much like how $X$ was constructed). The resulting scatter plot is shown in Fig. 7, where we can see that the approximated posterior $q_\phi(z \mid a, b)$ accurately clusters data with the same rotational relationship together and separates data with different rotational relationships. This result shows that VRL was indeed capable of learning a decoupled relational property irrespective of the digit representation of $a^{(i)}$. If this were not the case, we would expect to see a scatter plot with heavily overlapping relationship labels, since the validation dataset was constructed with random rotational relationships.
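The label-to-rotation coupling described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, `labels` stands in for the MNIST digit labels $y^{(j)}$, and the index formula is transcribed as written in the construction of $X_2$.

```python
import random


def coupled_rotation_degrees(label):
    """Rotation (in degrees) implied by a digit label: 72 * floor(label / 2)."""
    return 72 * (label // 2)


def sample_x2_indices(labels, rng):
    """Sample (j, k, l) as in the X_2 construction; b uses rotation index l + 1."""
    j = rng.randint(1, len(labels))    # j ~ U([1..T]), 1-based image index
    k = rng.randint(1, 5)              # k ~ U([1..5]), rotation index of a
    l = (k + labels[j - 1] // 2) % 5   # l = (k + floor(y_j / 2)) mod 5
    return j, k, l


if __name__ == "__main__":
    rng = random.Random(0)
    toy_labels = [0, 3, 4, 7, 9]       # hypothetical digit labels
    print([coupled_rotation_degrees(y) for y in toy_labels])
    print(sample_x2_indices(toy_labels, rng))
```

Every pair of consecutive digits shares one rotation, so a model could in principle read the relationship off the digit identity alone; the hold-out evaluation with random rotations is what rules this out.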
Examples of images predicted by the learned likelihood function $p_\theta(b \mid a, z)$ are shown in Fig. 8, where Fig. 8a shows predicted images based on direct sampling in the latent space (shown in Fig. 7), and Fig. 8b shows examples of relational mappings. Figures 8a and 8b further corroborate our claim that VRL is capable of learning a decoupled relational property and generalizing it to unseen data, e.g., VRL trained on $X_2$ learned to rotate any digit by any amount despite not having seen most of the digit-rotation pairs during training. Note that this experiment is an example of the deterministic-mapping discussed in Section 3.3; nevertheless, VRL was able to utilize all paths in Fig. 2b effectively to learn about the data.

Figure 8: Examples of images predicted by a VRL model that was trained on $X_2$: (a) images predicted from sampled latent variables (sampling the centroid of each cluster in Fig. 7: " " → $z^{(1)}$, " " → $z^{(2)}$, "+" → $z^{(3)}$, "×" → $z^{(4)}$, "♦" → $z^{(5)}$) …

B.1.2 MNIST EXAMPLE WITH MULTIPLE RELATIONAL PROPERTY

Next, we set up a more complex relational learning task that includes both rotational and resizing relationships between $(a^{(i)}, b^{(i)})$: each MNIST image $x^{(i)}$ was augmented with five evenly spaced rotations and three different resizing transformations to get $\{ (x^{(i,j,k)}, y^{(i)}) \mid i \in [1..T], j \in [1..5], k \in [1..3] \}$. Examples of augmented images are shown in Fig. 9, where top, middle, and bottom row images are resized by a factor of ×0.66, ×1, and ×1.5, respectively.

Figure 9: Examples of a MNIST digit augmented with rotational and resizing transformations (from left to right: $x^{(i,1,k)}$: 0°, $x^{(i,2,k)}$: 72°, $x^{(i,3,k)}$: 144°, $x^{(i,4,k)}$: 216°, $x^{(i,5,k)}$: 288°; from top to bottom: $x^{(i,j,1)}$: ×0.66, $x^{(i,j,2)}$: ×1, $x^{(i,j,3)}$: ×1.5).

We considered a relational learning task where each datapoint $(a^{(i)}, b^{(i)})$ has a decoupled rotational and/or resizing relational property; more specifically, we constructed a dataset $X_3$ in the following way:

$$X_3 = \{ (a^{(i)} = x^{(j,k,m)},\ b^{(i)} = x^{(j,l,n)}) \mid j \sim U([1..T]),\ k, l \sim U([1..5]),\ m \sim U([1, 2]),\ n \sim U(V) \}, \quad V = \begin{cases} [1, 2], & \text{if } m = 1 \\ [2, 3], & \text{if } m = 2. \end{cases}$$

Note that in this case, $b^{(i)}$ is either the same size as $a^{(i)}$ or ×1.5 larger than $a^{(i)}$, and there are a total of 10 different relationships between $(a^{(i)}, b^{(i)})$ in $X_3$ (combinations of 5 rotational and 2 resizing transformations). We trained VRL on $X_3$ with the same model setup as that described in appendix A and used the trained model to infer the relational property of a hold-out dataset. The inference result is shown in Fig. 10, where we can see that the approximated posterior $q_\phi(z \mid a, b)$ accurately clusters data with the same relationship together and separates data with different relationships. Examples of images predicted by the learned likelihood function $p_\theta(b \mid a, z)$ are shown in Fig. 11, where Fig. 11a shows predicted images based on direct sampling in the latent space (shown in Fig. 10), and Fig. 11b shows examples of relational mappings.
These results are consistent with the findings presented and discussed in Section 5.3.
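The count of 10 relationships in $X_3$ can be verified by enumerating the index combinations. The sketch below uses hypothetical helper names and represents a relationship as a (rotation offset, resize step) pair, which is a labeling chosen here only for illustration.

```python
import random


def sample_x3_indices(T, rng):
    """Sample (j, k, l, m, n) following the X_3 construction."""
    j = rng.randint(1, T)                         # image index
    k = rng.randint(1, 5)                         # rotation index of a
    l = rng.randint(1, 5)                         # rotation index of b
    m = rng.choice([1, 2])                        # size index of a
    n = rng.choice([1, 2] if m == 1 else [2, 3])  # size index of b, n ~ U(V)
    return j, k, l, m, n


def relationship(k, l, m, n):
    """Relative rotation in degrees and relative resize step (0: same, 1: x1.5)."""
    return ((l - k) % 5) * 72, n - m


def enumerate_relationships():
    """Collect every distinct relationship reachable under the X_3 construction."""
    found = set()
    for k in range(1, 6):
        for l in range(1, 6):
            for m, sizes in ((1, (1, 2)), (2, (2, 3))):
                for n in sizes:
                    found.add(relationship(k, l, m, n))
    return found


if __name__ == "__main__":
    print(sorted(enumerate_relationships()))
```

Five relative rotations combined with two relative resize steps give the ten relationship clusters that a well-trained posterior is expected to separate.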

B.1.3 MNIST EXAMPLE WITH CONTINUOUS RELATIONAL PROPERTY

Finally, we present an example with a continuous relational property. Based on the MNIST dataset $\{ (x^{(i)}, y^{(i)}) \mid i \in [1..T] \}$, we constructed a paired dataset $X_4$ in the following way:

$$X_4 = \{ (a^{(i)} = x^{(j)},\ b^{(i)} = \mathrm{rot}(x^{(j)}, r^{(i)})) \mid j \sim U([1..T]),\ r^{(i)} \sim U([0, 360)) \},$$

where $\mathrm{rot}(x^{(j)}, r^{(i)})$ rotates the image $x^{(j)}$ by $r^{(i)} \in [0, 360)$ degrees. In this case, $b^{(i)}$ is a random rotation of $a^{(i)}$ and there is a continuous (and decoupled) rotational relationship between $(a^{(i)}, b^{(i)})$ in $X_4$. We trained VRL on $X_4$ with the same model setup as that described in appendix A and used the trained model to infer the relational property of a hold-out dataset. A scatter plot of the relational property inferred by the approximated posterior $q_\phi(z \mid a, b)$ is shown in Fig. 12, and examples of images predicted by the learned likelihood function $p_\theta(b \mid a, z)$ are shown in Fig. 13, where Fig. 13a shows predicted images based on direct sampling in the latent space (denoted by markers "×" in Fig. 12), and Fig. 13b shows examples of relational mappings. From Fig. 12 and 13 we can see that VRL learned a decoupled relational property that encodes a continuous data (rotational) relationship; however, there is a small region in Fig. 12 with overlapping relational property that leads to an ambiguous interpretation of the rotational relationship (120° vs. 240°).

Figure 11: Examples of images predicted by a VRL model that was trained on $X_3$: (a) images predicted from sampled latent variables (sampling the centroid of each cluster in Fig. 10 to obtain $z^{(1)}, \dots, z^{(10)}$) …
This ambiguity is likely caused by compressing the continuous data (rotational) relationship down to a two-dimensional latent space, $z \in \mathbb{R}^2$, and motivates us to adopt a higher-dimensional latent space, e.g., $z \in \mathbb{R}^3$. Figure 14 shows the inference result from repeating the previous experiment with $z \in \mathbb{R}^3$; we can see that VRL learned a three-dimensional relational property that unambiguously represents the underlying continuous data (rotational) relationship.

Figure 12: … by the degrees of rotation between the corresponding datapoints (markers "×" denote sampled latent variables $z^{(1)}, \dots, z^{(5)}$ used for image prediction in Fig. 13a).

B.2.1 YALE FACE EXAMPLE WITH LEARNING EMOTION CHANGE

Here we present another relational learning example based on the Yale face dataset (Belhumeur et al., 1997). The Yale face dataset consists of 15 subjects, each with 8 facial expressions and 3 lighting configurations. We extracted three facial expressions (happy, surprised, sad) of each subject to form $\{ x^{(i,j)} \mid i \in [1..15], j \in [1..3] \}$. Examples of face images are shown in Fig. 15a, where we center-cropped, resized to 64 × 64, and normalized pixel values to be within [0, 5]. We constructed a dataset $X_{Fe}$ where each datapoint $(a^{(i)}, b^{(i)})$ represents a subject with different facial expressions (emotions):

$$X_{Fe} = \{ (a^{(i)} = x^{(j,k)},\ b^{(i)} = x^{(j,l)}) \mid j \sim U([1..15]),\ k \sim U([1..3]),\ l \sim U([1..3] \setminus k) \}.$$

Our initial intention was to apply VRL to $X_{Fe}$ to learn about the "emotional change" between $(a^{(i)}, b^{(i)})$ irrespective of the subject; however, because $X_{Fe}$ is an extremely limited dataset that consists of only 45 different images (15 subjects, each with 3 facial expressions), we settled for the less ambitious goal of learning an undirected emotional change, i.e., "happy→sad" = "sad→happy", simply denoted as "happy-sad". In this case, there are three different emotional changes between $(a^{(i)}, b^{(i)})$ in $X_{Fe}$: "happy-sad", "happy-surprised", and "surprised-sad". To set up a VRL model for training, we adopted a two-dimensional latent variable $z \in \mathbb{R}^2$ and let the prior $p_\theta(z)$ be the bivariate normal distribution.
Since this is a real-valued dataset, we let $p_\theta(b^{(i)} \mid a^{(i)}, z^{(i,l)})$ be a multivariate Gaussian distribution with a fixed diagonal covariance, $\mathcal{N}(b^{(i)}; \mu^{(i,l)}_b, \sigma^2 I)$, where $\sigma = 0.1$ and $\mu^{(i,l)}_b$ is computed from a given $a^{(i)}$ and $z^{(i,l)}$ with an autoencoder-like neural network $f^{dec}_\theta(f^{enc}_\theta(a^{(i)}), z^{(i,l)})$:

$f^{enc}_\theta$: $a^{(i)}$ → Conv(3x3x4) → Conv(3x3x16) → Conv(3x3x64) → FC(50) → $h^{(i)} \in \mathbb{R}^{50}$

$f^{dec}_\theta$: $[h^{(i)} \in \mathbb{R}^{50}, z^{(i,l)} \in \mathbb{R}^2]$ → FC → ConvT(3x3x64) → ConvT(3x3x16) → ConvT(3x3x4) → Conv(1x1x1) → $\mu^{(i,l)}_b$

and we have:

$$\log p_\theta(b^{(i)} \mid a^{(i)}, z^{(i,l)}) = -\frac{1}{2} \frac{\| b^{(i)} - \mu^{(i,l)}_b \|^2}{0.01} + \text{const}. \quad (14)$$

Again, we let the approximated posterior $q_\phi(z \mid a^{(i)}, b^{(i)})$ be a bivariate Gaussian distribution with a diagonal covariance, $\mathcal{N}(z; \mu^{(i)}, (\sigma^{(i)})^2 I)$, where $\mu^{(i)}$ and $\sigma^{(i)}$ are the output of a neural network $f^{q}_\phi(a^{(i)}, b^{(i)})$:

$f^{q}_\phi$: $[a^{(i)}, b^{(i)}]$ → Conv(3x3x4) → Conv(3x3x16) → Conv(3x3x64) → FC(4) → $[\mu^{(i)}, \sigma^{(i)}]$

Next, based on the premise that we are only interested in learning an undirected emotional change, we have $p_\theta(z \mid a, b) = p_\theta(z \mid b, a)$. Finally, we combine the above settings with $z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(i,l)}$, $p(\epsilon) = \mathcal{N}(0, I)$, and $L = 1$, and substitute Eq. (12) and (14) back into (11) to derive the following lower bound estimator:

$$\tilde{\mathcal{L}}^{(i)}_{RPDA} = \frac{0.01}{2} \sum_{j=1}^{2} \Big( 1 + \log\big((\sigma^{(i)}_j)^2\big) - (\mu^{(i)}_j)^2 - (\sigma^{(i)}_j)^2 \Big) - \frac{1}{2} \| \tilde{b}^{(i)} - \mu^{(i,1)}_b \|^2 + \text{const}, \quad (15)$$

where $(\mu^{(i)}, \sigma^{(i)}) = f^{q}_\phi(a^{(i)}, b^{(i)})$, $\mu^{(i,1)}_b = f^{dec}_\theta(f^{enc}_\theta(\tilde{a}^{(i)}), z^{(i,1)})$, $z^{(i,1)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(i,1)}$, $\epsilon^{(i,1)} \sim \mathcal{N}(0, I)$, $(\tilde{a}^{(i)}, \tilde{b}^{(i)}) = \mathrm{swap}(\mathrm{rot}(a^{(i)}, r^{(i)}), \mathrm{rot}(b^{(i)}, r^{(i)}))$, and $r^{(i)} \sim U([0, 360))$. The rest of the training setup remains the same as that described in appendix A.
We trained VRL on X Fe and then used the approximated posterior q φ ( z | a, b ) to infer the relational property of X Fe . The inference result is shown in Fig. 15b , where we can see that VRL learned a relational property that accurately differentiates emotional changes irrespective of the subject.
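The Gaussian likelihood term and the undirected swap used in this experiment can be illustrated with a small sketch. The helper names below are hypothetical, images are represented as flat lists of floats, and `random_swap` assumes that the swap exchanges the two images with probability 1/2, which is one plausible reading of the undirected setting.

```python
import math
import random


def gaussian_log_likelihood(b, mu_b, sigma=0.1):
    """log N(b; mu_b, sigma^2 I), including the constant term."""
    sq_err = sum((x - m) ** 2 for x, m in zip(b, mu_b))
    const = -0.5 * len(b) * math.log(2.0 * math.pi * sigma ** 2)
    return -0.5 * sq_err / sigma ** 2 + const


def random_swap(a, b, rng):
    """Return (a, b) or (b, a) with equal probability (assumed swap rule)."""
    return (b, a) if rng.random() > 0.5 else (a, b)


if __name__ == "__main__":
    rng = random.Random(0)
    b, mu_b = [0.1, 0.0], [0.0, 0.0]  # toy 2-pixel "images"
    print(gaussian_log_likelihood(b, mu_b))
    print(random_swap("happy", "sad", rng))
```

With $\sigma = 0.1$, a squared error of 0.01 lowers the log-likelihood by exactly 0.5, matching the $-\frac{1}{2}\|\cdot\|^2 / 0.01$ form of the likelihood term.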

B.2.2 YALE FACE EXAMPLE WITH LEARNING ILLUMINATION CONDITION CHANGE

Next, we present an experiment that learns relationships among "illumination condition changes" in the Extended Yale Face Database B (Georghiades et al., 2001). The Extended Yale Face Database B contains 16128 images of 28 human subjects under 9 poses and 64 illumination conditions. We extracted four illumination conditions (source of illumination: left, front, right, top) of each subject to form $\{ x^{(i,j)} \mid i \in [1..28], j \in [1..4] \}$. Examples of face images are shown in Fig. 16a, where we center-cropped, resized to 64 × 64, and normalized pixel values to be within [0, 5]. We constructed a dataset $X_{Fl}$ where each datapoint $(a^{(i)}, b^{(i)})$ represents a subject with different illumination conditions (lightings):

$$X_{Fl} = \{ (a^{(i)} = x^{(j,k)},\ b^{(i)} = x^{(j,l)}) \mid j \sim U([1..28]),\ k \sim U([1..4]),\ l \sim U([1..4] \setminus k) \}.$$

Like the "learning undirected emotional changes" example presented in appendix B.2.1, we apply VRL to $X_{Fl}$ to learn about the "undirected illumination changes" between $(a^{(i)}, b^{(i)})$ irrespective of the subject, i.e., "left→right" = "right→left", simply denoted as "left-right". In this case, there are six different illumination changes between $(a^{(i)}, b^{(i)})$ in $X_{Fl}$: "left-right" ("L-R"), "front-top" ("F-T"), "left-front" ("L-F"), "left-top" ("L-T"), "front-right" ("F-R"), and "right-top" ("R-T"). We trained VRL on $X_{Fl}$ with the exact same model and training setup as used in appendix B.2.1 and then used the approximated posterior $q_\phi(z \mid a, b)$ to infer the relational property of $X_{Fl}$. The inference result is shown in Fig. 16b, where we can see that VRL correctly identifies four relationships ("L-T", "R-T", "F-R", "L-F", each represented by a distinct cluster in Fig. 16b) but collapses "F-T" and "L-R" into a single indistinguishable cluster.
At first glance, this counterintuitive result (the collapsing of "F-T" and "L-R") seems to indicate that VRL was not able to learn meaningful representations for "F-T" and "L-R"; however, there is a logical explanation for this unexpected result. Upon closer examination, we argue that it is indeed possible to combine the relationships "F-T" and "L-R" without loss of information. In fact, we can give the combined relationship a meaningful name: "Opposite direction of illumination condition", or "Oppo." for short (in this interpretation it is more intuitive to rewrite "front" as "down"). With the newly formed compressed set of relationships $R_c$ = {"L-T", "R-T", "F-R", "L-F", "Oppo."}, it is easy to see that $R_c$ is a valid set of relationships that fully and unambiguously characterizes the data (thus no loss of information) since:

1. any $(a^{(i)}, b^{(i)})$ can be characterized by one and only one relationship in $R_c$;
2. for any given $a^{(i)}$, each compatible relationship in $R_c$ applied to $a^{(i)}$ leads to a unique $b^{(i)}$.

And while we lost the identifiability of "F-T" and "L-R", combining them into "Oppo." is consistent with the relational learning goal: "Oppo." represents a relative relationship that does not depend on the illumination condition (absolute property) of each individual image. We stress that the proposed VRL method is not always guaranteed to learn a compact set of decoupled relationships, and we leave the investigation of this idea to future work. Finally, we note that although the initial expectation of learning a complete set of relationships is certainly reasonable and intuitive, it is quite surprising that VRL is capable of discovering, in a completely unsupervised manner, a non-obvious set of relationships that is equally valid and yet more compact.

To summarize the Yale face experiments presented in appendix B.2, we make the following two observations.
First, when comparing our results with existing unsupervised learning methods on face images (Song et al., 2007), we can see that existing methods cluster face images by their absolute properties (e.g., subject identity, individual emotion, individual illumination condition, etc.) while the proposed VRL method clusters images by relationships (e.g., emotional change, illumination condition change). Second, it is worth noting that VRL does not learn the facial expression or illumination condition for each of $(a^{(i)}, b^{(i)})$ independently and then classify them based on their difference, but instead directly learns the emotional/lighting relationship between $(a^{(i)}, b^{(i)})$.
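The two validity conditions for the compressed relationship set $R_c$ discussed in appendix B.2.2 can be checked by brute force over the four illumination conditions. The sketch below uses hypothetical helper names; relationship labels are built from sorted condition letters, so "L-F" appears as "F-L".

```python
from itertools import combinations

CONDITIONS = ["L", "F", "R", "T"]  # left, front, right, top
OPPOSITE = (frozenset("FT"), frozenset("LR"))


def full_label(pair):
    """Undirected label for a pair of conditions, e.g. ('T', 'L') -> 'L-T'."""
    return "-".join(sorted(pair))


def compressed_label(pair):
    """Label under R_c: 'F-T' and 'L-R' are merged into 'Oppo.'."""
    return "Oppo." if frozenset(pair) in OPPOSITE else full_label(pair)


def is_unambiguous(label_fn):
    """Condition 2: for every a, each relationship leads to at most one b."""
    for a in CONDITIONS:
        labels = [label_fn((a, b)) for b in CONDITIONS if b != a]
        if len(set(labels)) != len(labels):
            return False
    return True


def ambiguous_label(pair):
    """Counter-example: merging 'L-F' and 'L-T' breaks condition 2."""
    merged = (frozenset("LF"), frozenset("LT"))
    return "merged" if frozenset(pair) in merged else full_label(pair)


if __name__ == "__main__":
    pairs = list(combinations(CONDITIONS, 2))
    print(sorted({compressed_label(p) for p in pairs}))
    print(is_unambiguous(compressed_label), is_unambiguous(ambiguous_label))
```

Condition 1 holds by construction because each pair maps to exactly one label, and the check confirms that merging the two "opposite" pairs keeps condition 2 intact, whereas merging two relationships that share a condition does not.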

C ABLATION STUDY ON RPDA

In this work, we introduced relation-preserving data augmentation (RPDA) as a practical data augmentation strategy for overcoming the information-shortcut problem, a unique overfitting problem introduced by VRL (see Section 3.3). In order to evaluate the contribution of RPDA, we performed an ablation study on the paired MNIST dataset $X$ constructed in Section 5.1. First, recall that in Section 5.2 we constructed RPDA functions $\mathcal{D}$ with image rotation augmentations and optimized the following lower bound estimator (cf. Eq. (7)):

$$\tilde{\mathcal{L}}^{(i)}_{RPDA} = \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(\tilde{b}^{(i)} \mid \tilde{a}^{(i)}, z^{(i,l)}) + \log p_\theta(z^{(i,l)}) - \log q_\phi(z^{(i,l)} \mid a^{(i)}, b^{(i)}), \quad (16)$$

where $z^{(i,l)} = g(\epsilon^{(i,l)}, a^{(i)}, b^{(i)}, \phi)$, $\epsilon^{(i,l)} \sim p(\epsilon)$, $(\tilde{a}^{(i)}, \tilde{b}^{(i)}) = d(a^{(i)}, b^{(i)}; r^{(i)})$, and $r^{(i)} \sim U(\mathcal{R})$.

In the first ablation study, we experimented with removing RPDA from the VRL training, which amounts to optimizing the original lower bound estimator $\tilde{\mathcal{L}}^{(i)}$ in Eq. (6). With the rest of the training setup unchanged (see appendix A), we trained VRL on $X$ without RPDA and used the trained model to infer the relational property of a hold-out dataset; the inference result is shown in Fig. 17a.

In the second ablation study, we repeated the VRL training, but instead of following the proposed RPDA strategy, we applied the RPDA functions $\mathcal{D}$ in a conventional data augmentation routine. More specifically, we optimized the lower bound $\tilde{\mathcal{L}}^{(i)}$ in Eq. (6) over a minibatch of data augmented with $\mathcal{D}$, $\tilde{X}^M = \{ (\tilde{a}^{(i)}, \tilde{b}^{(i)}) \mid i \in [1..M] \}$, which leads to optimizing the following lower bound estimator:

$$\tilde{\mathcal{L}}^{(i)}_{DA} = \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(\tilde{b}^{(i)} \mid \tilde{a}^{(i)}, z^{(i,l)}) + \log p_\theta(z^{(i,l)}) - \log q_\phi(z^{(i,l)} \mid \tilde{a}^{(i)}, \tilde{b}^{(i)}), \quad (17)$$

where $z^{(i,l)} = g(\epsilon^{(i,l)}, \tilde{a}^{(i)}, \tilde{b}^{(i)}, \phi)$, $\epsilon^{(i,l)} \sim p(\epsilon)$, $(\tilde{a}^{(i)}, \tilde{b}^{(i)}) = d(a^{(i)}, b^{(i)}; r^{(i)})$, and $r^{(i)} \sim U(\mathcal{R})$.
With the rest of the training setup unchanged (see appendix A), we trained VRL on $X$ by maximizing $\tilde{\mathcal{L}}^{(i)}_{DA}$ and used the trained model to infer the relational property of a hold-out dataset; the inference result is shown in Fig. 17b. Comparing the results from both ablation studies, shown in Fig. 17 (… RPDA applies $\mathcal{D}$ only to selected terms in Eq. (16) according to the RPDA strategy), we can draw the conclusion that the improvements brought by RPDA (and its key innovation) come not from what data augmentation functions are applied but from how they are applied.

Figure 17: Scatter plots of the relational properties (showing only the mean $\mu$) generated from the VRL ablation studies, panels (a) and (b); axes $z_1$ and $z_2$; relationship labels 0°, 72°, 144°, 216°, 288°.
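The distinction between the two augmentation routines can be made concrete with a toy sketch. It assumes that the RPDA strategy feeds the original pair to the approximated posterior and the augmented pair to the likelihood, while the conventional routine feeds the augmented pair to both; the function names are hypothetical, and each "image" is reduced to a single orientation angle so that the rotational relationship is just an angle difference.

```python
def rotate_pair(a, b, r):
    """Toy relation-preserving augmentation: rotate both orientations by r degrees."""
    return (a + r) % 360, (b + r) % 360


def relation(a, b):
    """Relative rotation from a to b in degrees."""
    return (b - a) % 360


def rpda_inputs(a, b, r):
    """Assumed RPDA routing: posterior sees (a, b), likelihood sees the augmented pair."""
    return {"posterior": (a, b), "likelihood": rotate_pair(a, b, r)}


def conventional_da_inputs(a, b, r):
    """Conventional augmentation: every term sees the augmented pair."""
    augmented = rotate_pair(a, b, r)
    return {"posterior": augmented, "likelihood": augmented}


if __name__ == "__main__":
    a, b, r = 10, 82, 300
    print(relation(a, b), relation(*rotate_pair(a, b, r)))
    print(rpda_inputs(a, b, r))
    print(conventional_da_inputs(a, b, r))
```

In this toy setting, a latent code that memorizes the absolute orientation of `b` helps the conventional routine (the target keeps that orientation) but not the RPDA routine (the target is re-rotated), whereas the relative rotation remains useful in both.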

D ADDITIONAL REMARKS

D.1 REMARKS ON VRL-PGM

The central idea of the proposed VRL method is to encapsulate the relational learning problem with a probabilistic graphical model, VRL-PGM, and then formulate various relational processing tasks as performing inference and learning in the graphical model. Here we discuss aspects of the original relational learning problem (see Section 2) that differ from the proposed VRL-PGM (see Section 3.1). First, the original problem specifies that the relational property be decoupled from both a's and b's absolute properties; however, the latent variable z that is used to represent the relational property in VRL-PGM is only independent of a but not b. Second, we note that the original problem is inherently undirected with no cause-effect relationship between a and b, whereas VRL-PGM is based on a directed acyclic graph (DAG) that artificially introduces a conditional dependency between a and b. However, we argue that the application of VRL does not require the true conditional dependency between (a, b) to be known in advance, only that it is maintained consistently throughout learning and inference, i.e., VRL can be applied in the same way to learn about the relational property between (b, a), where we swap a and b. These two discrepancies represent the compromises we made in adopting VRL-PGM in exchange for a rigorous and tractable method for learning a decoupled relational property. We can further view VRL's optimization challenges, deterministic-mapping and information-shortcut (introduced in Section 3.3), as consequences of these compromises: deterministic-mapping can be viewed as caused by the causal relationship VRL-PGM introduced between a and b, and information-shortcut can be viewed as caused by the causal relationship VRL-PGM introduced between z and b.

D.2 REMARKS ON INDEPENDENCE BETWEEN z AND a

Recall that the assumption of independence between z and a is central to the VRL learning of a decoupled relational property, and in this work we rely on the construction of VRL-PGM to support this assumption; however, the learning objective (Eq. 4) derived from VRL-PGM does not explicitly force z to be independent of a (nor penalize learning a dependent z). Here, we explain how optimizing Eq. 4 affects the learning of independent z and a. The main learning objective of VRL is described in Eq. 4; dissecting its terms, we observe that the learning is heavily guided by maximizing the likelihood term $\log p_\theta(b^{(i)} \mid a^{(i)}, z)$, since this is the only term that is constrained by the data, while all other terms in Eq. 4 can be viewed as regularization on the unobserved z. Next, when maximizing $\log p_\theta(b^{(i)} \mid a^{(i)}, z)$ we are only learning to predict $b^{(i)}$ from $a^{(i)}$ and $z^{(i)}$; and since $a^{(i)}$ has already been conditioned on, there is no incentive for $z^{(i)}$ to learn redundant (dependent) information from $a^{(i)}$. This effect is further "encouraged" because the learning objective Eq. 4 also includes regularization terms on z for learning a compact representation, which "squeeze out" any redundant information that does not help with predicting $b^{(i)}$. In the information flow diagram shown in Fig. 2a, one may argue that since $a^{(i)}$ also propagates information through the latent variable z, it may introduce dependency between a and z; however, the information propagated from $a^{(i)}$ through $z^{(i)}$ is mainly used to maximize the likelihood term for predicting $b^{(i)}$, and since there is already a direct propagation path from $a^{(i)}$ to $b^{(i)}$ (as part of the likelihood function $p_\theta(b^{(i)} \mid a^{(i)}, z)$), there is, again through regularization on z, no incentive for z to carry redundant information from a (only decoupled relationship information derived from a and b).
In short, while there are no explicit penalties on learning dependent z and a in VRL's learning objective, the independence is naturally encouraged through the interplay between the different terms in Eq. 4. This effect is also corroborated by our experimental results, where we can see that VRL can indeed learn a decoupled (independent) relational property z through optimizing Eq. 4. Ideally, the learning of independent z and a can be achieved through VRL's learning mechanism discussed above; however, in practice there may be numerous reasons that could cause this to fail, such as insufficient training data, failure to reach the global optimum, non-identifiability of the model, etc. Therefore, for pragmatic reasons, it may be desirable to explicitly safeguard against introducing dependency between z and a. Here, we propose a straightforward extension to VRL for achieving this goal: we can append to Eq. (4) any non-positive function that measures the dependency between a and z, with its maximum attained when they are independent, without invalidating the lower bound. For example, we can append the negative mutual information between z and a, $-I(z, a) = -D_{KL}\big(p_\theta(z, a) \,\|\, p_\theta(z) p_\theta(a)\big)$, to Eq. (4) to obtain:

$$\mathcal{L}^{(i)} = \mathbb{E}_{q_\phi(z \mid a^{(i)}, b^{(i)})}\big[ \log p_\theta(b^{(i)} \mid a^{(i)}, z) + \log p_\theta(z) - \log q_\phi(z \mid a^{(i)}, b^{(i)}) \big] - I(z, a). \quad (18)$$

Since $I(z, a) \geq 0$, and $I(z, a) = 0$ if and only if z and a are independent, the addition of $-I(z, a)$ in Eq. (18) neither invalidates the original bound in Eq. (4) nor sacrifices the quality of the lower bound (z and a are independent in VRL-PGM). We leave the investigation of this idea to future work.
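The two properties of mutual information used in the argument above (non-negativity, and zero exactly under independence) can be illustrated for discrete variables. The sketch below is a toy computation on a small joint probability table with a hypothetical function name; it is not an estimator for the continuous $z$ and image-valued $a$ in VRL, for which a tractable approximation would be needed.

```python
import math


def mutual_information(joint):
    """I(z, a) in nats for a discrete joint table, joint[i][j] = p(z = i, a = j)."""
    p_z = [sum(row) for row in joint]        # marginal over a
    p_a = [sum(col) for col in zip(*joint)]  # marginal over z
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p:                            # skip zero-probability cells
                mi += p * math.log(p / (p_z[i] * p_a[j]))
    return mi


if __name__ == "__main__":
    independent = [[0.12, 0.28], [0.18, 0.42]]  # outer product of marginals
    dependent = [[0.5, 0.0], [0.0, 0.5]]        # z determines a
    print(mutual_information(independent), mutual_information(dependent))
```

Subtracting this quantity from the objective therefore leaves the bound unchanged when z and a are independent and lowers it otherwise.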

D.3 REMARKS ON RPDA

Here we would like to comment on the practical applicability of the proposed RPDA strategy. More specifically, we would like to convey the idea that in many practical problem settings, the RPDA functions $\mathcal{D}$ can be designed without any knowledge of the underlying relational property. For example, as we have explained in Sec. 3.3, in many computer vision applications, rotation invariance is a desirable property for the learned model; in spectral imaging applications, for instance, the orientation of the images is often not preserved or not enforced (only that it is consistent between the same paired images) (Ronneberger et al., 2015). In such a problem setting, we can safely use an image rotation function for constructing $\mathcal{D}$. As another example, consider discrete time-series data $a[t]$, $b[t]$ that represent the input and output of a linear time-invariant (LTI) system (commonly assumed in signal processing and control theory (Oppenheim & Schafer, 2009)), where we want to learn a relational property that characterizes the system's impulse response. By linearity and time invariance, the input $\alpha a[t - \tau]$ produces the output $\alpha b[t - \tau]$ for all $\alpha \in \mathbb{R}$, $\tau \in \mathbb{Z}$, and we can construct $\mathcal{D}$ with $d(a[t], b[t]; \alpha, \tau) = (\alpha a[t - \tau], \alpha b[t - \tau])$, $(\alpha, \tau) \in \mathcal{R} = \mathbb{R} \times \mathbb{Z}$. In all of the above examples, the construction of the RPDA functions $\mathcal{D}$ reflects our prior knowledge and belief about the underlying system and is not based on the data relationships; therefore, in many instances, RPDA can be designed without any knowledge of the underlying relational property. However, we also note that RPDA is not central to the theory of the proposed method (VRL can be applied without RPDA), but is rather a practical data augmentation strategy for addressing a unique optimization challenge (information-shortcut) of VRL learning. In all of our experiments, we find RPDA to be effective and crucial in overcoming the information-shortcut problem (as illustrated in appendix C).
But just like any data augmentation, this is problem dependent, and we advocate starting without RPDA and applying it only when an information-shortcut is suspected.
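The LTI example can be checked numerically: scaling and delaying the input of a convolutional system scales and delays its output in the same way, so the augmented pair is again an input-output pair of the same system. The sketch below uses hypothetical helper names, a toy impulse response `h`, and restricts the delay to $\tau \geq 0$ with zero padding for simplicity.

```python
import math


def convolve(x, h):
    """Linear convolution y[t] = sum_s x[s] h[t - s] for finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def scale_and_delay(x, alpha, tau):
    """alpha * x[t - tau] for a non-negative integer delay tau (zero-padded)."""
    return [0.0] * tau + [alpha * v for v in x]


def rpda_lti(a, b, alpha, tau):
    """d(a, b; alpha, tau) = (alpha a[t - tau], alpha b[t - tau])."""
    return scale_and_delay(a, alpha, tau), scale_and_delay(b, alpha, tau)


def sequences_match(x, y, tol=1e-12):
    """True if two sequences have equal length and elementwise-close values."""
    return len(x) == len(y) and all(
        math.isclose(u, v, abs_tol=tol) for u, v in zip(x, y))


if __name__ == "__main__":
    h = [1.0, 0.5]                 # toy impulse response
    a = [1.0, 2.0, 3.0]            # toy input
    b = convolve(a, h)             # system output
    a_aug, b_aug = rpda_lti(a, b, alpha=2.0, tau=2)
    print(sequences_match(convolve(a_aug, h), b_aug))
```

Because the impulse response `h` is untouched by the augmentation, the relational property to be learned is preserved while the absolute scale and timing of the signals vary.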

D.4 REMARKS ON EXPERIMENTAL RESULTS

Here we would like to give an overall summary and make general remarks on the experimental results we have presented in this work (included in Sec. 5 and appendix B). First, we presented four MNIST experiments that represent a diverse set of relational learning problems: decoupled relational property (Sec. 5.1), coupled relational property (appendix B.1.1), multiple relational properties (appendix B.1.2), and continuous relational property (appendix B.1.3). Although these experiments are introduced with well-controlled ground-truth relationships (so that we can easily validate and interpret the results), the application of the proposed VRL method is completely unsupervised, without any prior knowledge of the underlying relationships. Furthermore, we deliberately designed and successfully solved all four MNIST tasks using the exact same model and training setup, even though each experiment represents a very different relational learning scenario (discrete vs. continuous, coupled vs. decoupled). In addition, we presented two Yale face tasks with high-level perception relationships: change of emotions (appendix B.2.1) and change of illumination conditions (appendix B.2.2). Again, the application of VRL to the two Yale face tasks is completely unsupervised, and we deliberately designed and successfully solved these two tasks using the exact same model and training setup. When comparing our model and training setups between the MNIST and Yale face experiments, their differences can all be attributed to the need to accommodate different data types, e.g., increasing the network size for larger face images (vs. smaller MNIST images) and modifying the learning objective to suit the real-valued Yale dataset (vs. the binary-valued MNIST dataset), and not to accommodating different relationships.
Taking this into consideration, our results show that we have solved both classes of problems with the exact same principled method, even though each class of problems represents a very different kind of relationship (the relationships in MNIST are geometric, whereas the relationships in Yale are high-level perception, e.g., sentiment, external environmental factors). We believe our results further demonstrate that the proposed VRL method is robust, stable, and generalizable to many different relationships.



Definition (Relational discrimination). In conditioning, a discrimination based on the relationship between or among stimuli rather than on absolute features of the stimuli.

Definition (Relational mapping). The ability to apply what one knows about one set of elements to a different set of elements.



Figure 1: VRL-PGM: a probabilistic graphical model for representing the relational learning problem; the observed random variables a and b are generated from some random process (parameterized by θ) involving an unobserved random variable z.

Figure 3: Examples of an MNIST digit augmented with five evenly spaced rotations (from left to right: $x^{(i,1)}$: 0°, $x^{(i,2)}$: 72°, $x^{(i,3)}$: 144°, $x^{(i,4)}$: 216°, $x^{(i,5)}$: 288°).

Figure 4: Scatter plots of the 2-D latent variable (showing only the mean µ) of hold-out datasets inferred by VRL in (a) and by VAE in (b) (relationship labels: : 0°, : 72°, +: 144°, ×: 216°, ♦: 288°).

Figure 5: Manipulating latent codes of InfoGAN on MNIST, where each row represents random samples obtained by varying the continuous latent code $c_2$ from -2 to 2 in (a) and $c_3$ from -2 to 2 in (b) while the other latent codes and noise are fixed; different rows correspond to different values of the categorical code $c_1$.

$b_s^{(r)}$ to the content of $a^{(c)}$, whereas VRL generates image $b^{(r,c)}$ by applying the relationship of $(a_s, b_s^{(r)})$ to the image $a^{(c)}$. It is evident from Fig. 6b that the predicted images $b^{(r,c)}$ do not share a similar style with $b_s^{(r)}$; rather, each $b^{(r,c)}$ has the same rotational relationship w.r.t. $a^{(c)}$ that $b_s^{(r)}$ has w.r.t. $a_s$.
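The difference between copying content and applying a relationship can be illustrated with a toy sketch. Everything here is a hypothetical stand-in and not the trained VRL networks: the data are 2-D points instead of images, `encode_relation` plays the role of the mean of $q_\phi(z \mid a, b)$ by returning the rotation from `a` to `b` as a (cos, sin) pair, and `decode` plays the role of the mean of $p_\theta(b \mid a, z)$ by applying that rotation to a new input.

```python
import numpy as np

def encode_relation(a, b):
    """Hypothetical stand-in for the mean of q_phi(z | a, b).

    Returns the rotation that takes a to b, encoded as (cos, sin).
    """
    angle = np.arctan2(b[1], b[0]) - np.arctan2(a[1], a[0])
    return np.array([np.cos(angle), np.sin(angle)])

def decode(a, z):
    """Hypothetical stand-in for the mean of p_theta(b | a, z).

    Applies the rotation encoded by z to the point a.
    """
    c, s = z
    rotation = np.array([[c, -s], [s, c]])
    return rotation @ a

# Source pair (a_s, b_s): b_s is a_s rotated by 90 degrees.
a_s = np.array([1.0, 0.0])
b_s = np.array([0.0, 1.0])

# A different input a_c that shares nothing with b_s except the relationship.
a_c = np.array([0.0, 2.0])

z_r = encode_relation(a_s, b_s)   # infer the relationship from the source pair
b_rc = decode(a_c, z_r)           # apply it to the new input
print(b_rc)                       # approximately [-2, 0]
```

The output is `a_c` rotated by 90 degrees, not a copy of `b_s`, which mirrors the behaviour described for the predicted images in this toy setting.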


Figure 6: Examples of images predicted by VRL: (a) images predicted from sampled latent variables (sampling the centroid of each cluster in Fig. 4a: " " → $z^{(1)}$, " " → $z^{(2)}$, "+" → $z^{(3)}$, "×" → $z^{(4)}$, "♦" → $z^{(5)}$), where each image $b^{(r,c)}$, $1 \le r \le 5$, $1 \le c \le 10$, was predicted from an image $a^{(c)}$ (shown in the top row) and a pre-selected latent variable $z^{(r)}$ using $b^{(r,c)} \sim p_\theta(b \mid a^{(c)}, z^{(r)})$; (b) examples of relational mappings of top-row images by applying relationships inferred from pairs of source images $(a_s, b_s^{(r)})$ (shown in the left-most column with $a_s, b_s^{(1)}, \ldots, b_s^{(5)}$ arranged from top to bottom), where each image $b^{(r,c)}$, $1 \le r \le 5$, $1 \le c \le 10$, was generated by $b^{(r,c)} \sim p_\theta(b \mid a^{(c)}, z^{(r)})$ with $z^{(r)} \sim q_\phi(z \mid a_s, b_s^{(r)})$.

Figure 7: Scatter plot of the relational property (showing only the mean µ) of a hold-out dataset inferred by a VRL model that was trained on $X_3$ (relationship labels: : 0°, : 72°, +: 144°, ×: 216°, ♦: 288°).

of images predicted by the learned likelihood function $p_\theta(b \mid a, z)$ are shown in Fig. 8, where Fig. 8a shows predicted images based on direct sampling in the latent space (shown in Fig. 7), and Fig. 8b shows examples of relational mappings. Figures 8a and 8b further corroborate our claim that

Figure 8: Examples of images predicted by a VRL model that was trained on $X_2$: (a) images predicted from sampled latent variables (sampling the centroid of each cluster in Fig. 7: " " → $z^{(1)}$, " " → $z^{(2)}$, "+" → $z^{(3)}$, "×" → $z^{(4)}$, "♦" → $z^{(5)}$); (b) examples of relational mappings of top-row images by applying relationships inferred from pairs of source images $(a_s, b_s^{(r)})$ shown in the left-most column with $a_s, b_s^{(1)}, \ldots, b_s^{(5)}$ arranged from top to bottom.

Figure 10: Scatter plot of the relational property (showing only the mean µ) of a hold-out dataset inferred by a VRL model that was trained on $X_3$ (relationship labels: (blue): 0°; : 72°; +: 144°; ×: 216°; ♦: 288°; : 0°, ×1.5; : 72°, ×1.5; : 144°, ×1.5; : 216°, ×1.5; (cyan): 288°, ×1.5).

Figure 11: Examples of images predicted by a VRL model that was trained on $X_3$: (a) images predicted from sampled latent variables (sampling the centroid of each cluster in Fig. 10: " " (blue) → $z^{(1)}$, " " → $z^{(2)}$, "+" → $z^{(3)}$, "×" → $z^{(4)}$, "♦" → $z^{(5)}$, " " → $z^{(6)}$, " " → $z^{(7)}$, " " → $z^{(8)}$, " " → $z^{(9)}$, " " (cyan) → $z^{(10)}$); (b) examples of relational mappings of top-row images by applying relationships inferred from pairs of source images $(a_s, b_s^{(r)})$ shown in the left-most column with $a_s, b_s^{(1)}, \ldots, b_s^{(10)}$ arranged from top to bottom.

Fig. 12), and Fig. 13b shows examples of relational mappings. From Figs. 12 and 13 we can see that VRL learned a decoupled relational property that encodes a continuous (rotational) data relationship; however, there is a small region in Fig. 12 with overlapping relational property that leads to an ambiguous interpretation of the rotational relationship (120° vs. 240°). This ambiguity is likely caused by compressing the continuous (rotational) data relationship down to a two-dimensional latent space, $z \in \mathbb{R}^2$, and motivates us to adopt a higher-dimensional latent space, e.g., $z \in \mathbb{R}^3$. Figure 14 shows the inference result from repeating the previous experiment with $z \in \mathbb{R}^3$; we can see that VRL learned a three-dimensional relational property that unambiguously represents the underlying continuous (rotational) data relationship.

Figure 12: Scatter plot of the relational property (showing only the mean µ) of a hold-out dataset inferred by a VRL model that was trained on $X_4$; each point is color-coded (best viewed in color) by the degrees of rotation between the corresponding datapoint (markers "×" denote the sampled latent variables $z^{(1)}, \ldots, z^{(5)}$ used for image prediction in Fig. 13a).

Figure 13: Examples of images predicted by a VRL model that was trained on $X_4$: (a) images predicted from sampled latent variables (denoted by markers "×" in Fig. 12); (b) examples of relational mappings of top-row images by applying relationships inferred from pairs of source images $(a_s, b_s^{(r)})$ shown in the left-most column with $a_s$, $b_s^{(1)}$, ..., $b$

Figure 14: Scatter plot of the three-dimensional relational property (showing only the mean µ) of a hold-out dataset inferred by a VRL model (with $z \in \mathbb{R}^3$) that was trained on $X_4$; each plot shows a different vantage point of the 3-D scatter plot, and each point is color-coded (best viewed in color) by the degrees of rotation between the corresponding datapoint.

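) and this prompted us to construct RPDA functions $D$ with random image rotation and swapping operations:

$$D = \{\, \mathrm{swap}(\mathrm{rot}(a, r), \mathrm{rot}(b, r)) \mid r \in [0, 360) \,\}, \quad \text{where} \quad \mathrm{swap}(a, b) = \begin{cases} (a, b), & p = 0.5 \\ (b, a), & p = 0.5 \end{cases}$$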
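The construction of $D$ above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `rot`, `swap`, and `rpda` are chosen here to mirror the notation, rotation uses a simple nearest-neighbour resampling about the image centre (the interpolation scheme and the sign convention of the angle are assumptions), and pixels rotated in from outside the image are set to zero.

```python
import numpy as np

def rot(img, angle_deg):
    """Rotate a 2-D image about its centre by angle_deg (nearest neighbour)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    y0, x0 = ys - cy, xs - cx
    # Inverse mapping: for every output pixel, look up its source pixel.
    src_x = np.rint(np.cos(theta) * x0 + np.sin(theta) * y0 + cx).astype(int)
    src_y = np.rint(-np.sin(theta) * x0 + np.cos(theta) * y0 + cy).astype(int)
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(img)
    out[inside] = img[src_y[inside], src_x[inside]]
    return out

def swap(a, b, rng):
    """Return (a, b) or (b, a), each with probability 0.5."""
    return (a, b) if rng.random() < 0.5 else (b, a)

def rpda(a, b, rng):
    """Draw one element of D: rotate both images by the same random angle
    r in [0, 360), then randomly swap their order."""
    r = rng.uniform(0.0, 360.0)
    return swap(rot(a, r), rot(b, r), rng)

# Usage on a toy pair of 28x28 "images".
rng = np.random.default_rng(0)
a = rng.random((28, 28))
b = rng.random((28, 28))
a_aug, b_aug = rpda(a, b, rng)
```

Both images in a pair receive the same rotation angle `r`, so the augmentation changes the absolute appearance of the pair while leaving the relationship between the two images intact up to their ordering.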

Figure 16: Learning illumination condition changes among the Yale face dataset: (a) examples of subjects under different illumination conditions (source of illumination): left, front, right, and top; (b) scatter plot of the relational property (showing only the mean µ) inferred by the approximate posterior (relationship labels: : "left-right", ♦: "front-top", : "left-front", +: "left-top", ×: "front-right", : "right-top").

a, b; r), ∀r. Assuming we have access to such a D, the proposed RPDA strategy then seeks to optimize a modified lower bound estimator L

RPDA is a critical component necessary for VRL to learn a meaningful and decoupled relational property, especially when flexible function approximations such as deep neural networks are used; second, by looking at the difference between L DA applies RPDA functions D to every term in Eq. (17), whereas L

