LEVERAGING AFFINITY CYCLE CONSISTENCY TO ISOLATE FACTORS OF VARIATION IN LEARNED REPRESENTATIONS
Anonymous

Abstract

Identifying the dominant factors of variation across a dataset is a central goal of representation learning. Generative approaches lead to descriptions that are rich enough to recreate the data, but often only a partial description is needed to complete downstream tasks or to gain insights about the dataset. In this work, we operate in the setting where limited information is known about the data in the form of groupings, or set membership, and the task is to learn representations which isolate the factors of variation that are common across the groupings. Our key insight is the use of affinity cycle consistency (ACC) between the learned embeddings of images belonging to different sets. In contrast to prior work, we demonstrate that ACC can be applied with significantly fewer constraints on the factors of variation, across a remarkably broad range of settings, and without any supervision for half of the data. By curating datasets from Shapes3D, we quantify the effectiveness of ACC through mutual information between the learned representations and the known generative factors. In addition, we demonstrate the applicability of ACC to the tasks of digit style isolation and synthetic-to-real object pose transfer and compare to generative approaches utilizing the same supervision.

1. INTRODUCTION

Isolating desired factors of variation in a dataset requires learning representations that retain only the information pertaining to those factors while suppressing, or being invariant to, the remaining "nuisance" factors. This is a fundamental task in representation learning and is of great practical importance for numerous applications. For example, image retrieval based on specific attributes (e.g. object pose, shape, or color) requires representations that have effectively isolated those particular factors. In designing approaches for such a task, the possible structure of the learned representation is inextricably linked to the types of supervision available. As an example, complete supervision of the desired factors of variation provides maximum flexibility in obtaining fully disentangled representations, where there is a simple and interpretable mapping between elements of the representation and the factors of variation (Bengio et al., 2013). However, such supervision is unrealistic for most tasks, since many common factors of variation in image data, such as 3D pose or lighting, are difficult to annotate at scale in real-world settings. At the other extreme, unsupervised representation learning makes the fewest limiting assumptions about the data but does not allow control over the discovered factors of variation. The challenge is in designing a learning process that best utilizes the supervision that can realistically be obtained in different real-world scenarios.

In this paper, we consider weak supervision in the form of set membership (Kulkarni et al., 2015; Denton & Birodkar, 2017). Specifically, this weak set supervision assumes only that we can curate subsets of the training data in which only the factors of variation to be isolated vary, while the remaining nuisance factors are fixed to the same values. We refer to the factors that vary within a set as the active factors, and to those that are held at the same fixed values as the inactive factors.
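The set supervision just described can be made concrete with a small sketch. It assumes a toy synthetic dataset in which every sample carries its generative factors, in the spirit of curating sets from a labeled synthetic dataset such as Shapes3D; the factor names, values, and the helper `group_by_inactive_factors` are hypothetical and chosen only for illustration. The labels serve solely to simulate curation: in the weak-supervision setting, the grouping would come from how the data were collected, and no factor values would be attached to the images.

```python
from collections import defaultdict
from itertools import product


def group_by_inactive_factors(samples, inactive_keys):
    """Group samples so that the inactive factors are constant within each set."""
    sets = defaultdict(list)
    for sample in samples:
        # Samples that share all inactive factor values land in the same set.
        key = tuple(sample[name] for name in inactive_keys)
        sets[key].append(sample)
    return dict(sets)


# Toy stand-in for a labeled synthetic dataset (hypothetical factors and values).
samples = [
    {"object": obj, "background": bg, "pose": pose}
    for obj, bg, pose in product(["car_a", "car_b"], ["white", "grey"], [0, 90, 180])
]

# Pose is the active factor; object identity and background are inactive.
sets = group_by_inactive_factors(samples, inactive_keys=("object", "background"))
```

Within each resulting set only `pose` varies, which mirrors the assumption that the active factors change inside a set while the inactive factors stay fixed.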
Figure 1: Affinity cycle consistency (ACC) yields embeddings which isolate factors of variation in a dataset P(I). It leverages weak supervision in the form of set membership, such as in the set of images P(I|d_0) rendered around a given synthetic car (top, left). A cycle consistency loss encourages finding correspondence between sets of inputs by extracting common factors that vary within both sets and suppressing factors which do not. We show that ACC isolates nontrivial factors of variation, such as pose in the example above, even when only one of the sets has been grouped. Importantly, this allows the incorporation of data with no supervision at all, such as the images of real cars (bottom, left). The learned representations (right, t → ∞) contain only the isolated factor of variation (contrast the alignment here with the untrained representations shown in the middle, t = 0).

To illustrate this set supervision, consider the problem of isolating 3D object pose from images belonging to an object category (say, car images). The weak set supervision assumption can be satisfied by simply imaging each object from multiple viewpoints. Note, this would not require consistency or correspondence in viewpoints across object instances, nor any target pose values attached to the images. In practice, collecting multiple views of an object in a static environment is much more reasonable than collecting views of different objects with identical poses.

In this paper we propose a novel approach for isolating factors of variation by formulating the problem as one of finding alignment between two sets with some common active factors of variation. Considering the application of synthetic-to-real object pose transfer, Figure 1 illustrates two sample sets of car images where pose is the only active factor in the first set P(I|d_0) of synthetic car images and the second set P(I) is comprised of both real and synthetic car images.
Given these sets, without any other supervision, the aim is to automatically learn embeddings that can find meaningful correspondences between the points in the two sets. The key idea behind our approach is a novel utilization of cycle consistency. A cycle consistent mapping can be described broadly as some non-trivial mapping that brings an input back to itself, and in our case the mapping is between sets of points in embedding space. We denote our application of cycle consistency as affinity cycle consistency (ACC); it uses a differentiable version of soft nearest neighbors, since the correspondences forming the cycle are not known a priori.¹ Further, no explicit pairwise correspondence between the input sets is needed; it is found by the loss. We posit that this process of finding correspondences is crucial to isolating the desired factors of variation: to match across sets, the representations must ignore commonality within a set (the inactive factors) and focus on the active factors common to both sets. For example, ACC-learned embeddings from the two sets of car images in Figure 1 can isolate the object pose factor, as that is the common active factor across both sets.

We also show how our ACC model can be generalized to the partial set supervision setting: ACC can learn to isolate factors of variation even when set supervision is provided for only one set, while the second set is virtually unrestricted. This has practical importance as it allows us to integrate unsupervised data during training. In Section 4.3 we show how this process can be applied to isolate 3D pose in real images without ever seeing any supervised real images during training.

In the following two sections we cover the related work and formally introduce our ACC method. Given the novelty of our approach for isolating factors of variation, we present a progression of experiments to develop an intuition for the technique as it operates in different scenarios.
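The cycle described above, in which each embedding in one set is mapped to a soft nearest neighbor in the other set and then back again, can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions: the squared Euclidean distance, the `temperature` parameter, the cross-entropy form of the cycle-back penalty, and all function names are choices made here for illustration, and they may differ from the formal definition given in the method section.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return (diff ** 2).sum(axis=-1)


def cycle_consistency_loss(emb_a, emb_b, temperature=1.0):
    """Assumed cross-entropy cycle loss for embeddings of two sets A and B."""
    # Step 1: soft nearest neighbor in B for every point in A
    # (a softmax-weighted average of B, so the step stays differentiable).
    weights_ab = softmax(-pairwise_sq_dists(emb_a, emb_b) / temperature, axis=1)
    soft_nn = weights_ab @ emb_b
    # Step 2: cycle back, scoring every point in A against each soft neighbor.
    cycle_probs = softmax(-pairwise_sq_dists(soft_nn, emb_a) / temperature, axis=1)
    # Step 3: the cycle is consistent when point i returns to index i.
    idx = np.arange(emb_a.shape[0])
    return -np.log(cycle_probs[idx, idx] + 1e-12).mean()


# Toy embeddings: B holds the same points as A in a different order.
emb_a = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
emb_b = emb_a[[2, 0, 3, 1]]
aligned_loss = cycle_consistency_loss(emb_a, emb_b)

# Degenerate case: every point of B collapses onto a single location.
collapsed_loss = cycle_consistency_loss(emb_a, np.zeros_like(emb_a))
```

No correspondence between the two sets is passed in; the ordering of `emb_b` is irrelevant to the loss. When the embeddings match across sets the loss is close to zero, whereas if one set collapses to a single point every cycle returns the same distribution and the loss cannot fall below log n for n points.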
In Section 4.1 we evaluate ACC in various settings using the synthetic Shapes3D dataset where the latent factor values are known, allowing a quantitative analysis. Later, in Section 4.2 we demonstrate the use of ACC in isolating handwritten digit style from its content (class id). In Section 4.3, we show how ACC can be applied in its most general form to isolate 3D object pose in real images with a training



¹ This specific loss has been used previously in Dwibedi et al. (2019) to align different videos of the same action, and here we show that it is much more general. We term it affinity cycle consistency, as opposed to the prior work's terminology, temporal cycle consistency, to reflect this generality.

