LEVERAGING AFFINITY CYCLE CONSISTENCY TO ISOLATE FACTORS OF VARIATION IN LEARNED REPRESENTATIONS

Anonymous

Abstract

Identifying the dominant factors of variation across a dataset is a central goal of representation learning. Generative approaches lead to descriptions that are rich enough to recreate the data, but often only a partial description is needed to complete downstream tasks or to gain insights about the dataset. In this work, we operate in the setting where limited information is known about the data in the form of groupings, or set membership, and the task is to learn representations which isolate the factors of variation that are common across the groupings. Our key insight is the use of affinity cycle consistency (ACC) between the learned embeddings of images belonging to different sets. In contrast to prior work, we demonstrate that ACC can be applied with significantly fewer constraints on the factors of variation, across a remarkably broad range of settings, and without any supervision for half of the data. By curating datasets from Shapes3D, we quantify the effectiveness of ACC through mutual information between the learned representations and the known generative factors. In addition, we demonstrate the applicability of ACC to the tasks of digit style isolation and synthetic-to-real object pose transfer and compare to generative approaches utilizing the same supervision.

1. INTRODUCTION

Isolating desired factors of variation in a dataset requires learning representations that retain information only pertaining to those desired factors while suppressing or being invariant to the remaining "nuisance" factors. This is a fundamental task in representation learning which is of great practical importance for numerous applications. For example, image retrieval based on certain specific attributes (e.g. object pose, shape, or color) requires representations that have effectively isolated those particular factors. In designing approaches for such a task, the possibilities for the structure of the learned representation are inextricably linked to the types of supervision available. As an example, complete supervision of the desired factors of variation provides maximum flexibility in obtaining fully disentangled representations, where there is a simple and interpretable mapping between representation elements and the factors of variation (Bengio et al., 2013). However, such supervision is unrealistic for most tasks since many common factors of variation in image data, such as 3D pose or lighting, are difficult to annotate at scale in real-world settings. At the other extreme, unsupervised representation learning makes the fewest limiting assumptions about the data but does not allow control over the discovered factors of variation. The challenge is in designing a learning process that best utilizes the supervision that can be realistically obtained in different real-world scenarios. In this paper, we consider weak supervision in the form of set membership (Kulkarni et al., 2015; Denton & Birodkar, 2017). Specifically, this weak set supervision assumes only that we can curate subsets of training data within which only the desired factors of variation to be isolated vary, while the remaining nuisance factors are fixed to the same values. We will refer to the factors that vary within a set as the active factors, and those that are held fixed as inactive.
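To make the active/inactive distinction concrete, the following is a minimal sketch of curating such sets from data annotated with generative factors, in the spirit of a Shapes3D-style setup. The factor names, indices, and the `curate_sets` helper are illustrative assumptions, not part of the paper's method; in practice the sets would come from the data-collection process (e.g. imaging one object from many viewpoints) rather than from factor labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy factor table: each row describes one sample by three hypothetical
# discrete generative factors, e.g. (hue, shape, pose); 200 samples.
factors = rng.integers(0, 4, size=(200, 3))
active = [2]       # e.g. pose: allowed to vary within a set
inactive = [0, 1]  # e.g. hue and shape: held fixed within a set


def curate_sets(factors, inactive):
    """Group sample indices so that, within each group, the inactive
    (nuisance) factors share a single fixed value."""
    sets = {}
    for idx, row in enumerate(factors):
        key = tuple(row[f] for f in inactive)
        sets.setdefault(key, []).append(idx)
    return sets


sets = curate_sets(factors, inactive)

# By construction, the inactive factors are constant inside every set,
# while the active factor is free to vary across that set's members.
for key, members in sets.items():
    assert all(tuple(factors[i][f] for f in inactive) == key for i in members)
```

Note that this sketch only formalizes what set membership asserts; it does not require that the active factor take consistent or corresponding values across different sets.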
To illustrate this set supervision, consider the problem of isolating 3D object pose from images belonging to an object category (say, car images). The weak set supervision assumption can be satisfied by simply imaging each object from multiple viewpoints. Note that this would not require consistency or correspondence in viewpoints across object instances, nor any target pose val-

