ADDRESSING THE TOPOLOGICAL DEFECTS OF DISENTANGLEMENT
Anonymous authors
Paper under double-blind review

Abstract

A core challenge in Machine Learning is to disentangle natural factors of variation in data (e.g. object shape vs. pose). A popular approach to disentanglement consists in learning to map each of these factors to distinct subspaces of a model's latent representation. However, this approach has shown limited empirical success to date. Here, we show that this approach to disentanglement introduces topological defects (i.e. discontinuities in the encoder) for a broad family of transformations acting on images, encompassing simple affine transformations such as rotations and translations. Moreover, motivated by classical results from group representation theory, we propose an alternative, more flexible approach to disentanglement which relies on distributed equivariant operators, potentially acting on the entire latent space. We theoretically and empirically demonstrate the effectiveness of our approach to disentangle affine transformations. Our work lays a theoretical foundation for the recent success of a new generation of models using distributed operators for disentanglement.

1. INTRODUCTION

Learning disentangled representations is arguably key to building robust, fair, and interpretable ML systems (Bengio et al., 2013; Lake et al., 2017; Locatello et al., 2019a). However, it remains unclear how to achieve disentanglement in practice. Current approaches aim to map different factors of variation in the data to distinct subspaces of a latent representation, but have achieved only limited empirical success (Higgins et al., 2016; Burgess et al., 2018). More work on the theoretical foundations of disentanglement could provide the key to the development of more successful approaches.

In its original formulation, disentanglement consists in isolating statistically independent factors of variation in data into independent latent dimensions. This perspective has led to a range of theoretical studies investigating the conditions under which these factors are identifiable (Locatello et al., 2019b; Shu et al., 2020; Locatello et al., 2020; Hauberg, 2019; Khemakhem et al., 2020). More recently, Higgins et al. (2018) proposed an alternative perspective connecting disentanglement to group theory (see Appendix A for a primer on group theory). In this framework, the factors of variation are different subgroups acting on the dataset, and the goal is to learn representations where separate subspaces are equivariant to distinct subgroups, a promising formalism since many transformations found in the physical world are captured by group structures (Noether, 1915). However, the fundamental principles for how to design models capable of learning such equivariances remain to be discovered (but see Caselles-Dupré et al. (2019)).

Here we attack the problem of disentanglement through the lens of topology (Munkres, 2014). We show that for a very broad class of transformations acting on images, encompassing all affine transformations (e.g. translations, rotations), an encoder that maps these transformations into dedicated latent subspaces would necessarily be discontinuous. Building on this insight, we reframe disentanglement by distinguishing its objective from its traditional implementation, resolving the discontinuities of the encoder. Guided by classical results from group representation theory (Scott & Serre, 1996), we then theoretically and empirically demonstrate the capacity of a model equipped with distributed equivariant operators in latent space to disentangle a range of affine image transformations, including translations, rotations, and combinations thereof.

2. EMPIRICAL LIMITATIONS OF TRADITIONAL DISENTANGLEMENT

In this section we empirically explore the limitations of traditional disentanglement approaches, in both unsupervised (variational autoencoder and variants) and supervised settings.

VAE, beta-VAE and CCI-VAE

We show that, consistent with results from prior literature, a variational autoencoder model (VAE) and its variants are successful at disentangling the factors of variation on a simple dataset. We train a VAE, beta-VAE and CCI-VAE (Kingma & Welling, 2014; Higgins et al., 2016; Burgess et al., 2018) on a dataset composed of a single class of MNIST digits (the "4s"), augmented with 10 evenly spaced rotations (all details of the models and datasets are in App. B). After training, we qualitatively assess the success of the models at disentangling the rotation transformation through traditional latent traversals: we feed an image of the test set to the network and obtain its corresponding latent representation. We then sweep a range of values for each latent dimension while freezing the other dimensions, obtaining a sequence of image reconstructions for each of these sweeps. We present in Fig. 1A examples of latent traversals along a single latent dimension, selected to be visually closest to a rotation (see Fig. 5 for latent traversals along all other latent dimensions). We find that all these models are mostly successful at the task of disentangling rotation for this simple dataset, in the sense that a sweep along a single dimension of the latent space maps to diverse orientations of the test image.

We then show that on a slightly richer dataset (MNIST with all digit classes), a VAE model and its variants fail to disentangle shape from pose. We train all three models studied (VAE, beta-VAE, CCI-VAE) on MNIST augmented with rotations, and find that all these models fail to disentangle rotation from other shape-related factors of variation (see Fig. 1B for the most visually compelling sweep and Fig. 6 for sweeps along all latent dimensions). We further quantify the failure of disentanglement by measuring the variance along each latent in response to a digit rotation, averaged over many digits (see Fig. 1C and details of the analysis in App. E).
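The variance analysis of Fig. 1C can be sketched as follows. As a stand-in for the paper's trained VAE encoder we use a random linear map, and we reduce the rotation group to the four 90-degree rotations; both are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(img, W):
    """Stand-in linear encoder (the paper uses a trained VAE encoder)."""
    return W @ img.ravel()

# Toy setup: 8x8 "digits", 10-dim latent, 4 rotations (90-degree steps).
W = rng.standard_normal((10, 64))
digits = rng.random((20, 8, 8))

# For each digit, encode all rotations and measure per-latent variance.
var_per_latent = np.zeros(10)
for img in digits:
    orbit = np.stack([encode(np.rot90(img, k), W) for k in range(4)])
    var_per_latent += orbit.var(axis=0)
var_per_latent /= len(digits)

# Traditional disentanglement would concentrate this variance in a single
# latent; in the paper's experiments it is instead spread across all latents.
print(var_per_latent.round(2))
```
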
We find that the information about the transformation is distributed across the latents, in contradiction with the conventional notion of disentanglement. One possibility would be that the direction of variance is confined to a subspace, but that this subspace is not aligned with any single latent. In order to rule out this possibility, we carry out a PCA-based analysis on the latent representation (Fig. 1D and App. E) and show that the variance in latent representation corresponding to image rotation is not confined to a low-dimensional subspace.

Supervised Disentanglement We further explore the limitations of traditional disentanglement in a supervised framework. We train an autoencoder on pairs of input and target digit images (Fig. 1E), where the target image is a rotated version of the input image with a discrete rotation angle indexed by an integer value k. The input image is fed into the encoder to produce a latent representation. This latent representation is then multiplied by a matrix operator ψ_k, parameterized by the known transformation parameter k. This matrix operator, which we call the disentangled operator, is composed of a 2-by-2 diagonal block with a rotation matrix and an identity matrix along the other dimensions (shown in Fig. 1E). The disentangled operator (i) is consistent with the cyclic structure of the group of rotations and (ii) only operates on the first two latent dimensions, ensuring all other dimensions are invariant to the application of the operator. The transformed latent is then decoded and compared to the target image using an L2 loss (in addition, the untransformed latent is decoded and compared to the original image for regularization purposes). The only trainable parameters are the encoder and decoder weights. We use the same architecture for the encoder and decoder of this model as for the VAE models in the previous section.
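For concreteness, the disentangled operator described above can be written down explicitly; a minimal numpy sketch, with an illustrative latent dimension of 10 and K = 10 discrete rotations:

```python
import numpy as np

def disentangled_operator(k, K=10, dim=10):
    """2x2 rotation block acting on the first two latents; identity elsewhere."""
    theta = 2 * np.pi * k / K
    psi = np.eye(dim)
    psi[:2, :2] = [[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]]
    return psi

# (i) Cyclic structure: applying the generator K times returns the identity.
psi1 = disentangled_operator(1)
assert np.allclose(np.linalg.matrix_power(psi1, 10), np.eye(10))

# (ii) Only the first two latent dimensions change; the rest are invariant.
z = np.arange(10, dtype=float)
z_rot = disentangled_operator(3) @ z
assert np.allclose(z_rot[2:], z[2:])
```
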
This supervised disentanglement model partly succeeds in mapping rotation to a single latent on rotated MNIST (Fig. 1E, top row). However, there remain some digits for which disentanglement fails (Fig. 1E, bottom row). It is difficult to evaluate the capacity of the model to learn to rotate many different images with MNIST, because MNIST is only composed of 10 classes of shapes corresponding to the 10 different digits. To further expose the limitations of this model, we design a custom dataset composed of 2000 simple shapes in all possible orientations. When trained on this extensive dataset, we find that the model fails to capture rotations on many shapes. Instead, it replaces the shape of the input image with a mismatched stereotypical shape (Fig. 1F). We reproduce all these results with translation in the appendix.

In conclusion, we find that common disentanglement methods are limited in their ability to disentangle pose from shape in a relatively simple dataset, even with strong supervision (see also Locatello et al. (2019b)). We cannot empirically discard the possibility that a larger model, trained for longer on even more examples of transformed shapes, could eventually learn to disentangle pose from shape. However, in the next section we will prove, using arguments from topology, that under the current definition of disentanglement, an autoencoder cannot possibly learn a perfectly disentangled representation for all poses and shapes. In Sec. 3.4 and Sec. 4, we will show that another type of model, inspired by group representation theory, can properly disentangle pose from shape.

3. REFRAMING DISENTANGLEMENT

In this section, we formally prove that traditional disentanglement by a continuous encoder is mathematically impossible for a large family of transformations, including all affine transformations. We then provide a more flexible definition of disentanglement that does not suffer from the same theoretical issues.

3.1. MATHEMATICAL IMPOSSIBILITY OF DISENTANGLEMENT

We first consider a simple example case where disentanglement is impossible. We consider the space of all images of 3 pixels, X = R^3, and the transformation acting on this space to be the group of integer finite translations, assuming periodic boundary conditions of the image in order to satisfy the group axiom of invertibility (Fig. 2A, see App. A for definitions). Given an image, the set of images resulting from the application of all possible translations to this image is called the orbit of this image. We note that the space of images R^3 is composed of an infinite set of disjoint orbits.

Can we find an encoder f which maps every point of image space X to a disentangled space Z? To conform to the conventional definition of disentanglement (Higgins et al., 2018) (see App. C for a formal definition), Z should be composed of two subspaces, namely (i) an equivariant subspace Z_E containing all and only the information about the transformation (i.e. location along the orbit) and (ii) an invariant subspace Z_I, invariant to the transformation but containing all other information about the image (i.e. identity of the orbit). Each orbit should thus lie in a plane parallel to Z_E (otherwise some information about the transformation would leak into Z_I), and all orbits projected onto Z_E should map onto each other (otherwise some information about the identity of the orbit would leak into Z_E).

We now consider the orbit containing the black image [0,0,0]. Since all translations of the black image are the black image itself, this orbit contains only one point. And yet, the image of this orbit in Z_E should conform to the image of other orbits, which generally consist of 3 distinct points. Since a function cannot map a single point to 3 points, an encoder f ensuring disentanglement for all images cannot exist. Using similar topological arguments, we formally prove the following theorem (App. C.1), generalizing the observation above to a large family of transformations including translations, rotations and scalings.

Theorem 1: Disentanglement into subspaces by a continuous encoder is impossible for any finite group acting on Euclidean space R^N.
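The 3-pixel example can be checked directly: a generic orbit has 3 points, while the orbit of the black image collapses to a single point (a minimal numpy sketch):

```python
import numpy as np

def orbit(img):
    """All cyclic integer translations of a 3-pixel image (the group Z/3)."""
    return {tuple(np.roll(img, k)) for k in range(3)}

generic = np.array([1.0, 2.0, 3.0])
black = np.zeros(3)

print(len(orbit(generic)))  # 3 distinct points
print(len(orbit(black)))    # 1 point: the black image is fixed by every shift
```

No function can map the single point of the black orbit onto the 3 points that generic orbits project to in Z_E, which is the contradiction used in the text.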

3.2. PRACTICAL EXAMPLES OF TOPOLOGICAL DEFECTS

The formal theorem of the previous section does not tell us how hard it would be to approximate disentanglement in practice. We show next that a disentangling encoder f would need to be discontinuous around all images that present a symmetry with respect to the transformation, which makes this function very discontinuous in practice. As an example, we consider the image of an equilateral triangle undergoing rotation (Fig. 2B, color changes are for visualisation purposes). Due to the symmetries of the triangle, a rotation of 120° of this image returns the image itself. Now we consider the same image with an infinitesimal perturbation on one corner of the triangle, breaking the symmetry of the image. A rotation of 120° of this perturbed image returns an image that is infinitesimally close to the original image. And yet the equivariant part of the encoder f_E (i.e. the projection of f onto the equivariant subspace Z_E) should map these two images to disjoint points in the equivariant subspace Z_E, in order to properly encode the rotation transformation. Generalizing this argument to all symmetric images, we see that a disentangling encoder would be discontinuous in the neighborhood of all images that present a symmetry with respect to the transformation to disentangle. This is incompatible with most deep learning frameworks, where the encoder is usually a neural network implementing a continuous function. We provide a formal proof of the discontinuity of f_E in App. C.2. The invariant encoder f_I (i.e. the projection of f onto the invariant subspace Z_I) also presents topological defects around symmetric images. We provide both a visual proof and a formal proof of these defects in App. C.2.
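The symmetry argument can be reproduced numerically. The sketch below uses a 4-fold symmetric image under 90-degree rotations instead of the triangle under 120-degree rotations (an illustrative substitution):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.random((5, 5))

# Symmetrize: the image is now exactly invariant under 90-degree rotation.
sym = sum(np.rot90(base, k) for k in range(4))
assert np.allclose(np.rot90(sym), sym)

# An infinitesimal perturbation breaks the symmetry: the rotated image is
# a distinct point, yet arbitrarily close to the original in pixel space.
eps = np.zeros((5, 5))
eps[0, 0] = 1e-6
pert = sym + eps
assert np.linalg.norm(np.rot90(pert) - pert) > 0   # distinct images
assert np.linalg.norm(pert - sym) < 1e-5           # but arbitrarily close

# An equivariant encoder f_E must nonetheless map pert and rot90(pert) to
# well-separated codes, forcing a discontinuity in any neighborhood of sym.
```
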
3.3. A MORE FLEXIBLE DEFINITION OF DISENTANGLEMENT

An underlying assumption behind the traditional definition of disentanglement is that the data is naturally acted upon by a set of transformations that are orthogonal to each other, and that modify well-separated aspects of the data samples. However, in many cases this separation between factors of variation of the data is not possible (as also noted by Higgins et al. (2018)). We notice that the current definition of disentanglement unnecessarily conflates the objective of isolating factors of variation with the algorithm of mapping these factors into distinct subspaces of the internal representation. In order to build a model that respects the structure of the data and its transformations, the latent space should instead preserve the entanglement between factors of variation that are not independent.

A model equipped with a latent operator is equivariant to a transformation if encoding a sample and then applying the latent operator is equivalent to transforming the sample first and then encoding it. Formally, we say a model f : X → Z is equivariant to a transformation g_k with parameter k if for any input x ∈ X:

f(φ_k(x)) = ψ_k(f(x)), ∀k ∈ K,   (1)

where K is the space of transformation parameters. We thus turn to a definition of disentanglement in which the transformations are modelled as distributed operators (i.e. not restricted to a subspace) in the latent space.

Definition 1. A representation is disentangled with respect to a set of transformations if there is a family of controllable operators, potentially acting on the entire representation, where each operator corresponds to the action of a single transformation and the resulting model is equivariant.

These operators are controllable in the sense that they have an explicit form, thus allowing the user to manipulate the latent representation by applying the operator. This definition, more flexible than traditional disentanglement in the choice of the latent operators, obeys the same desiderata of identification and isolation of the factors of variation present in the data.
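Equivariance in the sense of Eq. 1 can be checked numerically. In the sketch below (illustrative, not the paper's model), the transformation and latent operator are both cyclic shifts; a circulant linear encoder passes the check, while a generic linear encoder does not:

```python
import numpy as np

K, N = 4, 4

def phi(k, x):
    """Transformation in image space: cyclic shift by k pixels."""
    return np.roll(x, k)

def psi(k, z):
    """Latent operator: cyclic shift by k positions."""
    return np.roll(z, k)

def is_equivariant(f, atol=1e-8):
    """Check Eq. 1, f(phi_k(x)) = psi_k(f(x)), on random probe inputs."""
    rng = np.random.default_rng(0)
    for _ in range(10):
        x = rng.standard_normal(N)
        for k in range(K):
            if not np.allclose(f(phi(k, x)), psi(k, f(x)), atol=atol):
                return False
    return True

# A circulant matrix commutes with cyclic shifts, hence is equivariant;
# a generic linear encoder is not.
c = np.array([1.0, 0.5, 0.25, 0.125])
W_circ = np.stack([np.roll(c, i) for i in range(N)])
W_rand = np.random.default_rng(1).standard_normal((N, N))
print(is_equivariant(lambda x: W_circ @ x))  # True
print(is_equivariant(lambda x: W_rand @ x))  # False
```
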

3.4. THE SHIFT OPERATOR FOR AFFINE TRANSFORMATIONS

A group requires three simple properties: composing two transformations yields another transformation of the collection, every transformation has an inverse, and the result of a sequence of compositions does not depend on the order in which pairs are combined (associativity). A collection of transformations with a rule for combining two transformations into another satisfying these three simple requirements has a group structure (see App. A for a formal definition). With a group structure, we can decompose transformations into subgroups, describe how to represent transformations as matrices, and build flexible disentangled models.

Using classical results from the linear representation of finite groups (Scott & Serre, 1996), we now show that a carefully chosen distributed operator in latent space, the shift operator ψ_k (shown in Fig. 3A), is linearly isomorphic to specific transformations that include integer pixel translations and rotations. With this operator, we can learn a latent space equivariant to any affine transformation using a simple linear autoencoder.

Consider a linear encoder model f = W. If we want W to be an equivariant invertible linear mapping between X and Z, Equation 1 rewrites as follows:

W φ_k(x) = ψ_k(W x), ∀x ∈ X (= R^N), ∀k ∈ K,   (2)

where φ_k and ψ_k are the representations of g_k ∈ G on the image and latent space respectively, as defined in App. A. For W to be equivariant, Equation 2 must hold for every image x. As W is invertible, Equation 2 is true if and only if, ∀k ∈ K, the two representations ψ_k and φ_k are isomorphic:

φ_k = W^{-1} ψ_k W, ∀k ∈ K.   (3)

We consider additional properties on φ corresponding to the assumptions (i) that G is cyclic of order K with generator g_0 and (ii) that φ is isomorphic to the regular representation of G (see Scott & Serre (1996)). These properties are respected by all cyclic linear transformations of finite order K of the images (see App. A for definitions). Special cases of these transformations include all discrete and cyclic affine transformations, such as integer pixel translations with periodic boundary conditions, or rotations.
Definition and properties of the shift operator Given that the encoder and decoder are linear and invertible, the two representations φ and ψ must be isomorphic. Two representations are isomorphic if and only if they have the same character (see App. A for the definition of characters). We thus want to choose ψ such that it preserves the character of the representation φ corresponding to the action of G on the dataset of images. Importantly, we will see that our proposed operator needs to be distributed, in the sense that it should act on the full latent space, contrary to conventional disentanglement models.

Let us consider the matrix M_k of order K = |G| that corresponds to a shift of the elements of a K-dimensional vector by k positions. We construct from M_k the shift operator as a representation of the group's action on the latent space. For each g_k ∈ G, its corresponding shift operator ψ_k is the block-diagonal matrix of order N composed of N/K repetitions of M_k:

M_k := [[0, 0, ..., 1], [1, 0, ..., 0], [0, 1, 0, ...], ..., [0, ..., 1, 0]]^k,  ψ_k := diag(M_k, M_k, ..., M_k).   (4)

We show in App. C.3 that the shift operator has the same character as the representation φ corresponding to the action of G on the dataset of images. Thus, the two are isomorphic, and using this shift operator ensures that an equivariant invertible linear mapping W exists between image space and the latent space equipped with the shift operator. In Appendix C.3.2, we show that we can also replace this shift operator by a complex diagonal operator, which is more computationally efficient to multiply with the latent. Although these operators cannot theoretically be mapped to continuous transforms, we note that any continuous transform can be approximated using finite groups.

The shift operator computes a shift of the latent space, and we use this form of operator to represent any finite cyclic group of affine transformations (i.e. either rotation, translation in x, or translation in y). The role of the encoder is to construct a latent space where transformations can be represented as shifts. Importantly, the shift operator does not require knowledge of the specific transformation in advance, only the cycle order of each group, an assumption we will relax with the weakly supervised model of Sec. 4.2, leaving a fuller treatment of this extension for future work. In the remainder of the paper, we denote ψ_{t,k,N} the shift operator of dimension N that corresponds to t ∈ T, where T is a cyclic group of transformations (e.g. rotations) of order K with generator t_0, such that t = t_0^k. In the next section, we show how these theoretical results lead to practical and effective disentangling models for affine transformations such as rotations and translations.
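Eq. 4 is easy to instantiate: the block-diagonal structure is a Kronecker product of an identity with M_k. A short numpy sketch (with illustrative sizes K = 4, N = 12) verifying the representation property and the character used in App. C.3:

```python
import numpy as np

def M(k, K):
    """Cyclic shift matrix of order K, raised to the k-th power (Eq. 4)."""
    M1 = np.roll(np.eye(K), 1, axis=0)  # shifts a K-vector by one position
    return np.linalg.matrix_power(M1, k)

def shift_operator(k, K, N):
    """Block-diagonal operator: N/K repetitions of M_k along the diagonal."""
    return np.kron(np.eye(N // K), M(k, K))

K, N = 4, 12
# The map k -> psi_k is a representation: composition matches addition mod K.
for k1 in range(K):
    for k2 in range(K):
        lhs = shift_operator(k1, K, N) @ shift_operator(k2, K, N)
        rhs = shift_operator((k1 + k2) % K, K, N)
        assert np.allclose(lhs, rhs)

# Character of repeated regular representation: trace N for the identity
# element, 0 for every other group element.
print([int(np.trace(shift_operator(k, K, N))) for k in range(K)])  # [12, 0, 0, 0]
```
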

4. DISTRIBUTED DISENTANGLEMENT IN PRACTICE

Our empirical (Sec. 2) and theoretical (Sec. 3) findings converge to show the difficulties of disentangling even simple affine transformations into distinct subspaces. Here we show that, using distributed equivariant operators instead, it is practically possible to learn to disentangle these affine transformations, according to our more flexible definition of disentanglement.

4.1. THE SUPERVISED SHIFT OPERATOR MODEL

Guided by our theoretical results, we train a supervised non-variational autoencoder on pairs of samples and their transformed versions (with a known transformation indexed by k), using the distributed shift operator from Sec. 3.4 (shown in Fig. 3A) instead of the disentangled operator from Sec. 2. We feed the original sample x to a linear invertible (or quasi-invertible, see App. B.2) encoder that produces a latent representation. The latent representation is then multiplied by the shift operator matrix parametrized by k. The transformed latent is then decoded, and the L2 loss between the two reconstructions (of x and its transformed version) and their respective ground-truth images is back-propagated. As predicted by character theory, our proposed model is able to correctly structure the latent space, such that applying the shift operator to the latent code at test time emulates the learned transformation (see Fig. 3A, test MSE reported in Table 2, and the LSBD disentanglement measure from Anonymous (2021) reported in App. E.1.1). Interestingly, test MSE is lower for rotations than for translations. We believe this is because, in the case of translations, the changes in the image induced by each transformation are less visually striking than with rotations. Consistent with the theory, the same linear autoencoder equipped with a disentangled operator fails at learning the transformation (Fig. 3B). We reproduce these results for translation in App. E.
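A minimal numpy sketch of this training procedure, under simplifying assumptions: the "images" are random 8-dimensional vectors, the ground-truth transformation is a pixel translation by 2k (a cyclic group of order 4), and the paper's extra regularization term (decoding the untransformed latent) is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 8                         # group order, image/latent dimension
M1 = np.roll(np.eye(K), 1, axis=0)
psi = [np.kron(np.eye(N // K), np.linalg.matrix_power(M1, k)) for k in range(K)]

def transform(x, k):
    """Ground-truth transformation: pixel translation by 2k (cyclic, order 4)."""
    return np.roll(x, 2 * k)

E = rng.standard_normal((N, N)) * 0.1   # linear encoder
D = rng.standard_normal((N, N)) * 0.1   # linear decoder

def batch_err(E, D, n=50):
    """Average L2 error between decoded shifted latents and targets."""
    r = np.random.default_rng(42)
    errs = []
    for _ in range(n):
        x = r.standard_normal(N)
        k = r.integers(K)
        errs.append(np.sum((D @ psi[k] @ E @ x - transform(x, k)) ** 2))
    return float(np.mean(errs))

init_err = batch_err(E, D)
lr = 0.02
for step in range(4000):
    x = rng.standard_normal(N)
    k = rng.integers(K)                  # known transformation parameter
    z_t = psi[k] @ (E @ x)               # apply shift operator in latent space
    err = D @ z_t - transform(x, k)      # L2 loss gradient pieces
    D -= lr * np.outer(err, z_t)
    E -= lr * np.outer(psi[k].T @ (D.T @ err), x)
final_err = batch_err(E, D)
print(final_err < init_err)  # training reduces the transformed-reconstruction MSE
```

Only E and D are trained; the shift operator itself is fixed, mirroring the fact that the paper's only trainable parameters are the encoder and decoder weights.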
We thus show that, unlike prior approaches, our theoretically motivated model is able to learn to disentangle affine transformations from examples.

4.2. THE WEAKLY SUPERVISED SHIFT OPERATOR MODEL

Here we show that our method can also learn to disentangle transformations in a weakly supervised setting where the model is not given the transformation parameter between pairs of transformed images (e.g. the rotation angle) during training. We consider the case of a single transformation for simplicity. We encode samples by pairs x_1, x_2 (with x_2 a transformed version of x_1) into z_1 and z_2 respectively, and use a phase correlation technique to identify the shift between z_1 and z_2, as in Reddy & Chatterji (1996). An L2 loss is applied on reconstructed samples, with the original sample transformed according to all possible k and weighted according to soft-max scores given by the cross-correlation method (see App. B.3 for details). Here and in the remainder of the experiments, we use the complex version of the shift operator for computational efficiency (shown in Fig. 3C). This weakly supervised version of the model has an extra free parameter, the number of latent transformations, which is not known a priori. Let us denote this number K_L; it can be different from the ground-truth order of the group K. We explore the effect of different values of K_L in App. B.3. The results of this model (with K_L = 10 latent transformations) are shown in Fig. 3C. The weakly supervised shift operator model works almost as well as its supervised counterpart, and this is confirmed by test MSE (see Table 2). The same model can successfully be trained on MNIST digits (Fig. 14).
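The phase correlation step can be sketched in numpy; the latent dimension and the shift below are illustrative:

```python
import numpy as np

def estimate_shift(z1, z2):
    """Phase correlation: recover k such that z2 = roll(z1, k)."""
    F1, F2 = np.fft.fft(z1), np.fft.fft(z2)
    cross = F2 * np.conj(F1)
    # Normalizing by the magnitude isolates the phase; its inverse FFT
    # peaks at the shift k (Reddy & Chatterji, 1996).
    corr = np.fft.ifft(cross / (np.abs(cross) + 1e-12)).real
    return int(np.argmax(corr))

def softmax_scores(z1, z2):
    """Soft-max over circular cross-correlation, as weights over candidate k."""
    corr = np.fft.ifft(np.fft.fft(z2) * np.conj(np.fft.fft(z1))).real
    e = np.exp(corr - corr.max())
    return e / e.sum()

rng = np.random.default_rng(0)
z1 = rng.standard_normal(10)
z2 = np.roll(z1, 3)
print(estimate_shift(z1, z2))             # 3
print(int(np.argmax(softmax_scores(z1, z2))))  # 3
```

The soft-max scores play the role of the per-k loss weights described above, so the model needs no ground-truth transformation parameter.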

4.3. MULTIPLE TRANSFORMATIONS: STACKING SHIFT OPERATORS

So far, we have only considered the case of a single type of transformation at a time. When working with real images, there is more than one type of transformation jointly acting on the data, for example a rotation followed by a translation. Here we show how we can adapt our proposed shift operator model to the case of multiple transformations.

Stacked shift operator model In the case of multiple transformations, a group element is a composition of consecutive single transformations. For example, elements of the Special Euclidean group (i.e. translations and rotations) are a composition of the form a_y a_x h, where a_x is an element of the x-translation subgroup, a_y of the y-translation subgroup, and h of the image rotation subgroup. Our theory in App. C.3 ensures that each of these subgroups' actions in image space is linearly isomorphic to the repeated regular representation ψ. We can thus match the structure of the learning problem by simply stacking linear layers and operators. We build a stacked version of our shift operator model, in which we use one complex shift operator for each type of transformation, and apply these operators to the latent code in a sequential manner, akin to Tai et al. (2019). Specifically, consider an image x ∈ X that undergoes a consecutive set of transformations g_n, g_{n-1}, ..., g_1, with g_i ∈ G_i ∀i. We encode x into z with a linear invertible encoder, z = W x. We then apply the operator ψ_1(g_1) on the latent space, corresponding to the representation of g_1 on the latent space Z. We then apply a linear layer L_1 before using the operator corresponding to G_2. The resulting latent code after all operators have been applied is z′ = ψ_n(g_n) L_{n-1} ... ψ_2(g_2) L_1 ψ_1(g_1) z. The transformed latent code z′ is fed to the linear decoder, in order to produce a reconstruction that will be compared with the ground-truth transformed image x′, as in the single transformation case.
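The sequential application of operators and interleaved linear layers can be sketched as follows; the shift operators are built as in Eq. 4, and the group orders (4 and 3) and latent dimension (12) are arbitrary illustrative choices:

```python
import numpy as np

def shift_op(K, N):
    """Factory for the shift operator of a cyclic group of order K (Eq. 4)."""
    M1 = np.roll(np.eye(K), 1, axis=0)
    return lambda k: np.kron(np.eye(N // K), np.linalg.matrix_power(M1, k))

def apply_stacked(z, ops_and_params, layers):
    """z' = psi_n(g_n) L_{n-1} ... L_1 psi_1(g_1) z for stacked transformations."""
    for i, (psi, k) in enumerate(ops_and_params):
        z = psi(k) @ z
        if i < len(ops_and_params) - 1:
            z = layers[i] @ z   # trainable linear layer between operators
    return z

rng = np.random.default_rng(0)
N = 12
psi_rot = shift_op(4, N)   # e.g. 4 rotations
psi_tx = shift_op(3, N)    # e.g. 3 x-translations (illustrative orders)
L1 = rng.standard_normal((N, N))   # stands in for a trained linear layer

z = rng.standard_normal(N)
z_out = apply_stacked(z, [(psi_rot, 1), (psi_tx, 2)], [L1])
assert np.allclose(z_out, psi_tx(2) @ L1 @ psi_rot(1) @ z)
```
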
Translations in X and Y Consider the conjunction of translations in the x and y axes of the image. This is a finite group of 2D translations. This group is a direct product of two cyclic groups, and it is abelian (i.e. commutative). We refer the interested reader to App. D.1 for details on direct products. To tackle this case with the stacked shift operator model, we first use the shift operator ψ_{x,k,N} corresponding to the translation in x, then apply a linear layer denoted L_1, before using the operator ψ_{y,k′,N} corresponding to the translation in y: z′ = ψ_{y,k′,N} L_1 ψ_{x,k,N} z. We train this stacked model on translated shapes with 5 integer translations in both the x and y axes (i.e. the group order is 25). Results reported in Fig. 3D show that the stacked shift operator model is able to correctly handle the group of 2D translations.
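The direct-product structure can be illustrated directly in image space; a minimal numpy sketch (the 5x5 image and translation counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((5, 5))

def tx(img, k):
    """Integer translation in x (periodic boundary conditions)."""
    return np.roll(img, k, axis=1)

def ty(img, k):
    """Integer translation in y (periodic boundary conditions)."""
    return np.roll(img, k, axis=0)

# Direct product structure: x- and y-translations commute (abelian group).
assert np.allclose(ty(tx(img, 2), 3), tx(ty(img, 3), 2))

# Group order: 5 translations per axis -> 25 elements, one per pair (kx, ky).
orbit = {tuple(tx(ty(img, ky), kx).ravel()) for kx in range(5) for ky in range(5)}
print(len(orbit))  # 25 for a generic image
```

This commutativity is what allows the 2D translation operator to be assembled from the two subgroup shift operators; it fails for the rotation-translation case below.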

Translations and rotations

We consider a discrete and finite version of the Special Euclidean group, where A is the finite group of 2D translations presented in the previous section and H a finite cyclic group of rotations. This group has a semi-direct product structure (see App. D.3 for details) and is non-commutative, contrary to the 2D translations case. With the stacked shift operator model, we first apply the operator ψ_{h,j,N} corresponding to the rotation h, then the operator for the x-translation, then the one for the y-translation. The resulting transformed latent code is z' = ψ_{y,k',N} L_2 ψ_{x,k,N} L_1 ψ_{h,j,N} z. We train this stacked model on discrete rotations followed by integer translations, using 5 integer translations along each of the x and y axes and 4 rotations. Results reported in Fig. 3E and MSE Table 2 show that the model is able to structure the latent space such that the group structure is respected (see also Fig. 17).

Insight from representation theory on the structure of hidden layers When dealing with multiple transformations (e.g. rotations and translations), we know the form of the operator for every subgroup (the shift operator), but we do not know a priori the form of the resulting operator for the entire group. In App. D.1 and D.4, we derive from representation theory the operator for the entire group in the 2D translation and Special Euclidean group cases, and show that it can be built from each subgroup's operator in a non-trivial way. Importantly, we show that the resulting operator for the discrete finite Special Euclidean case has a block matrix form based on the representations of both translations and rotations. This is expected: this group is non-commutative, so the correct operator cannot be diagonal, since otherwise any two operators corresponding to two elements would commute. Equipped with this theory, we can derive insights about the form that intermediate layers should take after training. In particular, we show (App. D.2) that the layer L_1 should be a block diagonal matrix consisting of repetitions of a permutation matrix that reorders the elements of ψ_{x,k,N} z. Similarly, in the case of translations and rotations together, L_2 must reorder ψ_{y,k',N} z, and L_1 must be the product of two matrices L_1 = PQ, where Q is an N × N block diagonal matrix and P reorders the rows of the vector Q ψ_{h,j,N} z (see App. D.5). In future work, we plan to explore the use of these insights to regularize the internal layers of stacked shift operator models.

5. DISCUSSION

Finding representations that are equivariant to transformations present in data is a daunting problem with no single solution. A large body of work (Cohen et al., 2020; 2018; Esteves et al., 2018; Greydanus et al., 2019; Romero et al., 2020; Finzi et al., 2020; Tai et al., 2019) proposes to hard-code equivariances in the neural architecture, which requires a priori knowledge of the transformations present in the data. In another line of work, Falorsi et al. (2018); Davidson et al. (2018); Falorsi et al. (2019) show that the topology of the data manifold should be preserved by the latent representation, but these studies do not address the problem of disentanglement. Higgins et al. (2018) proposed the framework of disentanglement as equivariance that we build upon here. Our work extends their original contribution in multiple ways. First, we show that traditional disentanglement introduces topological defects (i.e. discontinuities in the encoder), even in the case of simple affine transformations. Second, we conceptually reframe disentanglement, allowing equivariant operators to act on the entire latent space, so as to resolve these topological defects. Finally, we show that models equipped with such operators successfully learn to disentangle simple affine transformations. An important direction for future work will be to expand the reach of the theory to a broader family of transformations. In particular, it is unclear how the proposed approach should be adapted to learn transformations which are not affine or linear in image space, such as local deformations, compositional transformations (acting on different objects present in an image), and out-of-plane rotations of objects in images (but see Dupont et al. (2020) for an empirical success using a variant of the shift operator). Another important direction would be to extend the theory and proposed models to continuous Lie groups.
Moreover, our current implementation of disentanglement relies on some supervision, by including pairs of transformed images (with or without knowledge of the parameter of the transformation occurring between them). It would, moreover, be important to understand how disentangled representations can be learned without such pairs of transformed images (see Anselmi et al. (2019); Zhou et al. (2020) for relevant work). Finally, our work lays a theoretical foundation for the recent success of a new family of methods that, instead of enforcing disentangled representations to be restricted to distinct subspaces, use operators (hard-coded or learned) acting on the entire latent space (Connor & Rozell, 2020; Connor et al., 2020; Dupont et al., 2020; Giannone et al., 2020; Quessard et al., 2020) (see also Memisevic & Hinton (2010); Cohen & Welling (2014); Sohl-Dickstein et al. (2017) for precursor methods). These methods work well where traditional disentanglement methods fail: for instance, by learning to generate full 360° in-plane rotations of MNIST digits (Connor & Rozell, 2020), and even out-of-plane rotations of 3D objects (Dupont et al., 2020). These methods use distributed operators in combination with non-linear autoencoder architectures, an interesting direction for future theoretical investigations.

A PRIMER ON GROUP THEORY

A group G is a set equipped with a binary operation • satisfying the following axioms:
• Associativity: (g_k • g_k') • g_k'' = g_k • (g_k' • g_k'') for all g_k, g_k', g_k'' ∈ G.
• Identity element: there exists an element e_G ∈ G such that, for every g_k ∈ G, g_k • e_G = e_G • g_k = g_k, and e_G is unique.
• Inverse element: for each g_k ∈ G, there exists an element of G, denoted g_k^{-1}, such that g_k • g_k^{-1} = g_k^{-1} • g_k = e_G.
In the paper, for clarity, we do not write the operation • explicitly unless needed. Finite cyclic groups We will be interested in finite groups, composed of a finite number of elements (this number is the order of G). A cyclic group is a special type of group that is generated by a single element, called the generator g_0, such that each element can be obtained by repeatedly applying the group operation • to g_0 or its inverse. Every element of a cyclic group can thus be written as g_k = g_0^k. Note that every cyclic group is abelian (i.e. its elements commute). A group G that is both finite and cyclic has a finite order K such that g_0^K = e_G.
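As a minimal illustration, the integers modulo K under addition form a finite cyclic group: 1 is a generator, and applying it K times returns the identity 0.

```python
K = 5  # group order

# Every element of the cyclic group Z/K is a power (repeated application) of the generator
generator = 1
elements = {(generator * k) % K for k in range(K)}

# Applying the generator K times gives back the identity: g_0^K = e_G
identity_check = (generator * K) % K

# The group is abelian: the operation (addition mod K) commutes
commutes = ((2 + 3) % K) == ((3 + 2) % K)
```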

Representation and equivariance

Informally, a model is equivariant to a group of transformations if encoding a sample and then transforming the code gives the same result as encoding the transformed sample. Higgins et al. (2018) show that disentanglement can be viewed as a special case of equivariance, where the transformation of the code is restricted to a subspace. We provide a formal definition of equivariance below, after introducing notations. In the framework of group theory, we consider φ a linear representation of the group G acting on X (Scott & Serre, 1996): φ : G → GL(X). Each element of the group g_k ∈ G is represented by a matrix φ(g_k) = φ_k, and φ_k is a matrix with specific properties: 1. φ is a homomorphism: φ(g_k g_k') = φ(g_k) φ(g_k') for all g_k, g_k' ∈ G. 2. φ_k is invertible, with φ_k^{-1} = φ(g_k^{-1}), since φ(g_k^{-1}) φ(g_k) = φ(g_k^{-1} g_k) = φ(e_G) = I, where I is the identity matrix in GL(X). The set of matrices φ_k forms a linear representation of G, and they multiply with the vectors in X as follows: φ_k : X (= R^N) → X, such that ∀x ∈ X, φ_k(x) ∈ X. The character of φ is the function χ_φ that returns, for each g_k ∈ G, the trace of φ(g_k) = φ_k, i.e. χ_φ(g_k) = Tr(φ_k). Importantly, the character of a representation completely characterizes the representation up to a linear isomorphism (i.e. a change of basis) (Scott & Serre, 1996). The character table of a representation is composed of the values of the character evaluated at each element of the group. Similarly, we denote by ψ : G → GL(Z) a linear representation of the action of G on the latent space Z, such that ∀k, ψ(g_k) = ψ_k. While corresponding to the action of the same group element g_k ∈ G, ψ_k does not have to be the same as φ_k, as it represents the action of G on Z and not on X: ψ_k : Z (= C^N) → Z, such that ∀z ∈ Z, ψ_k(z) ∈ Z. Note that we consider a complex latent space.
With these notions, the model f : X → Z is formally equivariant to the group of transformations G acting on the data if, for every element g_k of the group, with actions φ_k and ψ_k on the spaces X and Z respectively, we have: ∀x ∈ X, ∀k ∈ K, f(φ_k(x)) = ψ_k(f(x)). Finally, a group action on image space R^N is said to be affine if it affects the image through an affine change of coordinates.
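The equivariance condition f(φ_k(x)) = ψ_k(f(x)) can be checked numerically in a simple case: for cyclic pixel translations, the discrete Fourier transform is a linear invertible encoder under which the translation acts as a diagonal complex phase operator in latent space. This is an illustrative example, not the model trained in the paper.

```python
import numpy as np

N, k = 8, 3
x = np.random.default_rng(1).standard_normal(N)

# phi_k: cyclic translation of a 1-D "image" with periodic boundary conditions
phi_k_x = np.roll(x, k)

# f: the DFT, a linear invertible encoder; psi_k: diagonal complex operator on Z
f = np.fft.fft
psi_k = np.exp(-2j * np.pi * k * np.arange(N) / N)

# Equivariance: encoding the transformed image equals transforming the code
lhs = f(phi_k_x)   # f(phi_k(x))
rhs = psi_k * f(x) # psi_k(f(x))
```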

B EXPERIMENTAL DETAILS

B.1 DATASET GENERATION

Simple shapes We construct a dataset of randomly generated simple shapes of 28x28 pixels, each consisting of 5 randomly chosen points connected by straight lines. We normalize pixel values to ensure they lie within [0, 1]. For all experiments, we use a dataset of 2000 shapes. For each shape, we apply 10 counterclockwise rotations by evenly spaced angles (see (2014)). For translations along the x-axis or y-axis, we apply 10 translations using numpy.roll (see Van Der Walt et al. (2011)). This ensures periodic boundary conditions, such that once a pixel is shifted beyond one edge of the frame, it reappears on the other edge. For experiments with supervision involving pairs, we construct every possible combination of pairs x_1, x_2 and apply every transformation to both x_1 and x_2. For datasets containing multiple transformations, we first rotate, then translate along the x-axis, then translate along the y-axis. We use a 50% train-test split, then further split the training set into 20%-80% for validation and training. MNIST (LeCun et al., 2010) Similarly to simple shapes, we construct rotated and translated versions of MNIST by applying the same transformations to the original MNIST digits. We normalize pixel values to lie within [0, 1]. For supervised experiments, we similarly construct every combination of pairs x_1, x_2 and apply transformations to both x_1 and x_2. Since constructing every combination of transformed pairs would lead to a dataset many multiples of the size of the original MNIST, we randomly sample from the training set to match the original number of samples. We use the full test set, augmented with transformations, for reporting losses.
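A minimal sketch of the shapes-and-translations generation described above (the helper names are ours, and the rasterization details of the original dataset may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_shape(size=28, n_points=5):
    """Rasterize a shape from n_points random vertices connected by straight lines."""
    img = np.zeros((size, size))
    pts = rng.integers(0, size, size=(n_points, 2))
    for (r0, c0), (r1, c1) in zip(pts, np.roll(pts, -1, axis=0)):
        steps = max(abs(r1 - r0), abs(c1 - c0), 1)
        rows = np.linspace(r0, r1, steps + 1).round().astype(int)
        cols = np.linspace(c0, c1, steps + 1).round().astype(int)
        img[rows, cols] = 1.0
    return img

def translate(img, kx, ky):
    """Integer translation with periodic boundaries, as in the paper (numpy.roll)."""
    return np.roll(np.roll(img, kx, axis=1), ky, axis=0)

shape = random_shape()
shifted = translate(shape, 3, 0)
```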

B.2 MODEL ARCHITECTURES AND TRAINING

We implement all models using PyTorch with the Adam optimizer (Paszke et al., 2019; Kingma & Ba, 2014) .

Variational Autoencoder and Variants

We implement existing state-of-the-art disentanglement methods β-VAE and CCI-VAE, which aim to learn factorized representations corresponding to factors of variation in the data. In our case, the factors of variation are the image content and the transformation used (rotation or translations). We use the loss from CCI-VAE, made up of a reconstruction mean squared error and a Kullback-Leibler divergence scaled by β:

L = (1/m) Σ_{i=1}^m (x_i − f_D(f(x_i)))² + β |KL(q(z|x), p(z)) − C|    (9)

where m is the number of samples and the Kullback-Leibler divergence is estimated as KL(q(z|x), p(z)) = −0.5 Σ_{i=1}^d (1 + ln(σ_i²) − μ_i² − σ_i²). We use the encoder/decoder architectures from Burgess et al. (2018), comprised of 4 convolutional layers, each with 28 channels, 4x4 kernels, and a stride of 2, followed by 2 fully connected layers of 256 units. We apply a ReLU activation after each layer. The latent distribution is generated from 30 units: 15 units for the mean and 15 units for the log-variance of a Gaussian distribution. The decoder is comprised of the transposed architecture of the encoder with a final sigmoid activation.

Autoencoder with latent operators For the standard autoencoder and the autoencoders with the shift/disentangled latent operators, we use supervised training with pairs (x_1, x_2) and a transformation parameter k corresponding to the transformation between x_1 and x_2. The loss is the sum of reconstruction losses for x_1 and x_2:

L = (1/m) Σ_{i=1}^m (x_{1,i} − f_D(f(x_{1,i})))² + (1/m) Σ_{i=1}^m (x_{2,i} − f_D(ψ_k(f(x_{1,i}))))²

where m is the number of samples and ψ is the disentangled or shift latent operator. For a standard autoencoder, only the first reconstruction term is present in the loss function. For the non-linear autoencoders, we use the same architecture as above, based on CCI-VAE. In the linear case, we use a single fully connected layer over the 28x28 input and an 800-dimensional latent space.
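The CCI-VAE objective in Equation 9 can be sketched as follows; this is a NumPy illustration, and the function name and the per-batch averaging convention are our assumptions.

```python
import numpy as np

def cci_vae_loss(x, x_rec, mu, log_var, beta=100.0, C=0.0):
    """CCI-VAE objective: MSE reconstruction + beta * |KL(q(z|x) || N(0, I)) - C|.
    mu, log_var: (batch, d) Gaussian parameters produced by the encoder."""
    m = x.shape[0]
    recon = np.sum((x - x_rec) ** 2) / m
    # Closed-form KL between a diagonal Gaussian and the standard normal prior
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var)) / m
    return recon + beta * np.abs(kl - C)

# When q(z|x) equals the prior (mu=0, log_var=0) and reconstruction is perfect,
# both terms vanish
loss_at_prior = cci_vae_loss(np.zeros((4, 10)), np.zeros((4, 10)),
                             np.zeros((4, 3)), np.zeros((4, 3)))
```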
We use 800 dimensions to approximate the number of pixels (28x28) and ensure that K=10 divides the latent dimension. This is an approximation of the correct theoretical operator (which should be invertible) that works well in practice.

Weakly supervised shift operator model We use linear encoders and decoders with a 784-dimensional latent space to match the number of pixels: the weakly supervised model uses the complex version of the shift operator, so we can perfectly match the size of the image. Training is done with an L2 loss on all possible reconstructions of x_1 (for each training pair i), weighted by scores α_{i,k}. Appendix B.3 below gives a detailed explanation of the computation of the scores α_{i,k}:

L = (1/m) Σ_{i=1}^m (x_{1,i} − f_D(f(x_{1,i})))² + (1/m) Σ_{i=1}^m Σ_{k=1}^{K_L} α_{i,k} (x_{2,i} − f_D(ψ_k(f(x_{1,i}))))²

where m is the number of samples.

Method We experimentally show in Section 4.2 that the proposed operator ψ_k works well in practice. Additionally, we developed a method for inferring the parameter k of the transformation that needs to act on the sample. We encode samples by pairs x_1, x_2 (with x_2 a transformed version of x_1) into z_1 and z_2 respectively, and use a classical phase correlation technique (Reddy & Chatterji, 1996) to identify the shift between z_1 and z_2, described below. Then, we use the complex diagonal shift operator parametrized by the inferred transformation parameter k. Importantly, in the weakly supervised version of the shift operator, the model has an extra free parameter: the number of latent transformations, which is not known a priori. We denote this number by K_L; it can differ from the ground-truth order of the group K. To infer the latent transformation between z_1 and z_2, we compute the cross-power spectrum between the two codes z_1 and z_2, both complex vectors of size N, and obtain a complex vector of size N.
We repeat this vector K_L times, obtaining a K_L × N matrix, of which we compute the inverse Fourier transform. The resulting matrix should have rows that are approximately 0, except at the row k corresponding to the shift between the two images (see Reddy & Chatterji (1996)). Thus, we compute the mean of the real part of the inverse Fourier result over frequencies (i.e. the mean over the N values in the second dimension). This gives us a K_L-dimensional vector, which we use as a vector of scores for each k to be the correct shift between z_1 and z_2. During training, we compute the soft-max of these scores with a temperature parameter τ, which gives us K_L weights α_k. We transform z_1 with all K_L possible shift operators, decode into K_L reconstructions x'_k, and weight the mean squared error between x_2 and each x'_k by α_k before back-propagating. This results, for each sample pair (x_{1,i}, x_{2,i}), in the loss:

L = (1/m) Σ_{i=1}^m (x_{1,i} − f_D(f(x_{1,i})))² + (1/m) Σ_{i=1}^m Σ_{k=1}^{K_L} α_{i,k} (x_{2,i} − f_D(ψ_k(f(x_{1,i}))))²

where m is the number of samples and α_{i,k} are the scores for the pair (x_{1,i}, x_{2,i}). At test time, we use the transformation with maximum score α_{i,k}.

Dataset            K_L = 10            K_L = 21
Shapes (10,0,0)    0.001 ± 0.0013      0.0005 ± 0.0001
Shapes (0,10,0)    0.0097 ± 0.0052     0.0038 ± 0.0013
Shapes (0,0,10)    0.0115 ± 0.0049     0.005 ± 0.0033
MNIST (10,0,0)     0.0035 ± 0.0039     0.0074 ± 0.0083

Table 1: Comparison of test mean squared error (MSE) ± standard deviation of the mean over random seeds for different K_L. Numbers in parentheses refer to the number of rotations, the number of translations on the x-axis, and the number of translations on the y-axis, respectively.

Effect of the number of latent transformations In the weakly supervised shift operator model, the number of latent transformations K_L is a free parameter. Interestingly, when using rotations, the best cross-validated number of transformations is 10, which matches the ground-truth order of the group.
For translations (either along the x or the y axis), the best results are obtained with K_L = 21, which is larger than the ground-truth order of the group K. Table 1 compares test MSE for both values of K_L. We think that in the case of translations, the changes in the image induced by each shift are less visually striking than with rotations, and a larger K_L gives the model extra flexibility to identify the group elements while respecting the group structure (namely its cyclic aspect).
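The phase-correlation scoring can be sketched as follows. This is our interpretation of the procedure: we score each candidate shift by correlating the normalized cross-power spectrum against the phase ramp that this shift would induce, rather than reproducing the exact repeat-and-inverse-FFT implementation described above.

```python
import numpy as np

def infer_shift_scores(z1, z2, K_L):
    """Score each candidate shift k between complex codes z1 and z2."""
    N = z1.shape[0]
    cross = z2 * np.conj(z1)
    cross = cross / (np.abs(cross) + 1e-12)   # keep only the phase
    ks = np.arange(K_L)[:, None]              # candidate shifts, shape (K_L, 1)
    # Phase ramp that the diagonal shift operator with parameter k would induce
    ramps = np.exp(2j * np.pi * ks * (np.arange(N) % K_L) / K_L)
    return (cross[None, :] * np.conj(ramps)).real.mean(axis=1)

# If z2 is z1 shifted by k=3 under the diagonal shift operator, k=3 scores highest
K_L, N = 10, 20
rng = np.random.default_rng(2)
z1 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
psi3 = np.exp(2j * np.pi * 3 * (np.arange(N) % K_L) / K_L)
z2 = psi3 * z1
scores = infer_shift_scores(z1, z2, K_L)

# Soft-max with temperature tau turns the scores into the weights alpha_k
tau = 0.1
alpha = np.exp(scores / tau)
alpha = alpha / alpha.sum()
```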

B.4 HYPER-PARAMETERS

General hyper-parameters We sweep across several sets of hyper-parameters for our experiments and report results for the model with the lowest validation loss. To avoid over-fitting, results for any given model are also stored during training for the parameters yielding the lowest validation loss. For experiments with simple shapes, we sweep across combinations of general hyper-parameters. In addition to these general parameters, we also sweep across choices of β ∈ {4, 10, 100, 1000} and latent dimension ∈ {10, 30} for the variational autoencoder models and variants. We repeat all experiments across four seeds used to initialize random number generation and weight initialization. Weakly supervised shift operator hyper-parameters We perform a sweep over hyper-parameters as described above. Additionally, for the weakly supervised model, we sweep over the temperature τ of the soft-max that shapes the scores of each transformation, over values τ ∈ {0.01, 0.1, 1.0}, and over the number of transformations composing the operator family (i.e. the order of the group), over values 10 and 21, where 10 is the ground-truth order of the group.

Stacked shift operator model hyper-parameters

We perform a sweep over hyper-parameters as described above. The only exception is that for the case of the Special Euclidean group, we train only for 5 epochs, and try batch sizes {32, 64} for MNIST and {16, 32} for simple shapes, as the number of generated samples is high. Similarly, for the case of 2D translations, we use batch sizes {16, 32} for simple shapes.

C REFRAMING DISENTANGLEMENT: FORMAL PROOFS

We consider the action of a finite group on image space R N . Using tools from topology, we show that it is impossible to learn a representation which disentangles the action of this group with a continuous encoder f .

C.1 TOPOLOGICAL PROOF AGAINST DISENTANGLEMENT

We consider a finite group G of cardinality |G| that acts on R^N. Given an image x ∈ R^N, the orbit containing that image is given by {g_1 x, g_2 x, ..., g_K x}. We consider an encoder f : R^N → M that disentangles the group action. The image of f is composed of an equivariant subspace and an invariant subspace. We define f_E : R^N → M_E as the projection of f on its equivariant subspace, and f_I : R^N → M_I as the projection of f on its invariant subspace. f_E is equivariant to the group action; in equations: ∀x, ∀g, f_E(gx) = g f_E(x). For the disentanglement to be complete, f_E should not contain any information about the identity of the orbit the image belongs to. If O_1 and O_2 are two distinct orbits, we thus have: ∀x_1 ∈ O_1, ∀x_2 ∈ O_2, ∃g ∈ G, f_E(x_1) = g f_E(x_2). f_I is invariant to the group action: ∀x, ∀g, f_I(gx) = f_I(x). We also assume that the representation contains all the information needed to reconstruct the input image: ∀x_1 ∈ O_1, ∀x_2 ∈ O_2, f_I(x_1) ≠ f_I(x_2). This last assumption corresponds to assuming that every image can be perfectly identified from its latent representation by a decoder (i.e. a perfect autoencoder). We also assume that both the encoder and decoder are continuous functions. This is a reasonable assumption, as most deep network architectures are differentiable and thus continuous. In the language of topology, the encoder f is a homeomorphism: a continuous invertible function whose inverse is also a continuous function. Also called a topological isomorphism, a homeomorphism is a function that preserves all topological properties of its input space (see Munkres (2014), p.105). Here we prove that f cannot preserve the topology of R^N while disentangling the group action of G. Consider f_E|_O, the restriction of f_E to a single orbit of an image x without any particular symmetry.
f_E|_O inherits continuity from f, and it can easily be shown that f_E|_O is invertible (otherwise, information about the transformation on this orbit would be irremediably lost, and f could thus not be invertible). f_E|_O is thus also a homeomorphism, and so it preserves the topology of the orbit O in image space, which is a set of |G| disconnected points. By equation 14, we know that the restrictions of f_E to all other orbits have images contained in the same topological space. As a consequence, the image of f_E itself is a set of |G| disconnected points. Since f_E is a projection of f, the image of f must be composed of at least |G| disconnected parts (this follows from the fact that the projection of a connected space cannot be disconnected). However, this is impossible, because the domain of f is R^N, which is connected, and f is a homeomorphism, thus preserving the connectedness of R^N. In summary, we have shown that, for topological reasons, a continuous invertible encoder cannot possibly disentangle the action of a finite group acting on image space R^N. In the next section, we show that topological defects arise in the neighborhood of all images presenting a symmetry with respect to the transformation.

C.2 TOPOLOGICAL DEFECTS ARISE IN THE NEIGHBORHOOD OF SYMMETRIC IMAGES

We proved in the previous section that it is impossible to map a finite group acting on R^N to a disentangled representation with a continuous invertible encoder. In this section, in order to gain intuition about why disentanglement is impossible, we show that topological defects appear in the neighborhood of images that present a symmetry with respect to the group action.

C.2.1 f_E IS NOT CONTINUOUS ABOUT SYMMETRIC IMAGES: FORMAL PROOF

Let us consider an image x_s that presents a symmetry with respect to the group action: ∃g ∈ G, g ≠ e_G, g x_s = x_s. Let us further assume that an infinitesimal perturbation of this image along a direction u breaks the symmetry of the image: ∀ 0 < ε < E, x_ε := x_s + εu, g x_ε ≠ x_ε (18). Since f_E preserves the information about the transformation, |f_E(g x_ε) − f_E(x_ε)| > C > 0 (19), where C is the smallest distance between two disconnected points of M_E. We assume that the group action is continuous: g x_ε = g x_s + O(ε) = x_s + O(ε). We can rewrite equation 19 as: |f_E(x_s + O(ε)) − f_E(x_s + O(ε))| > C > 0 (21), which is in contradiction with the continuity hypothesis on the encoder. We have thus shown that the equivariant part of the encoder f_E presents discontinuities around all images that present a symmetry. Note that for both rotations and translations, the uniform image is an example of a symmetric image with respect to these transformations.

C.2.2 f_I IS NOT DIFFERENTIABLE ABOUT SYMMETRIC IMAGES: VISUAL PROOF

As an example, we consider an equilateral triangle which is perturbed either at its top corner, at its left corner, or at both corners (Fig. 4). When perturbed at either one of its corners, the perturbation moves the image within the same orbit, because the triangle perturbed at its left corner is a rotated version of the triangle perturbed at its top corner. The gradient of f_I along these two directions at the equilateral triangle image should thus be the same (so as not to leak information about the transformation into the invariant subspace Z_I). The simultaneous perturbation of the two corners moves the image to a different orbit, so the gradient of f_I along this direction should not be aligned with the previous gradients (so as to preserve all information about the identity of the orbit). And yet, if the function f_I were differentiable everywhere, this gradient would be a linear combination of the former gradients, and thus all three gradients would be collinear. The function f_I can thus not be differentiable everywhere. This imperative is incompatible with many deep learning frameworks, where the encoder is implemented by a neural network that is differentiable everywhere (with a notable exception for networks equipped with ReLU nonlinearities, which are differentiable almost everywhere). Figure 4: Visual proof that the invariant part of the encoder f_I cannot be differentiable about symmetric figures. We assume f_I is differentiable and show a contradiction. We consider an equilateral triangle which is perturbed at its top corner, left corner, or both corners. When perturbed at either one of its corners, the perturbation brings the image to the same orbit, because of the symmetry. In latent space, the perturbation should thus move the latent representation in the same direction.
The simultaneous perturbation of the two corners brings the image to a different orbit, and yet, since this perturbation is a simple linear combination of the single-corner perturbations, its effect in latent space can only be collinear to theirs. This collinearity leads to the encoder not being injective, and thus losing information about the identity of the image.

C.2.3 f_I IS NOT DIFFERENTIABLE ABOUT SYMMETRIC IMAGES: FORMAL PROOF

Next, we show that, under an extra assumption on the encoder, namely that it is differentiable everywhere, it is also impossible to implement the invariant part of the encoder f_I. Note that this extra assumption is true for networks equipped with differentiable non-linearities, such as tanh or sigmoid, but not for networks equipped with ReLUs. Let us consider an image x_s presenting a symmetry w.r.t. the group action. We consider perturbations of that image along two distinct directions u and gu. By symmetry, it is easy to see that:

∂f_I/∂u |_{x_s} = ∂f_I/∂(gu) |_{x_s}    (22)

As a consequence, the derivative along the perturbation direction u' = (u + gu)/2 is equal to the derivative along one or the other direction:

∂f_I/∂u' |_{x_s} = (1/2) (∂f_I/∂u |_{x_s} + ∂f_I/∂(gu) |_{x_s}) = ∂f_I/∂u |_{x_s}    (23)

In the general case, x' = x_s + εu' does not belong to the same orbit as x = x_s + εu. f_I is thus losing information about which orbit the perturbed image x' belongs to, which contradicts the assumption shown in Equation 16.

Datasets are structured by group transformations that act on the data samples. Our goal is for the model to reflect that structure; specifically, we want our model to be equivariant to the transformations that act on the data. We defined group equivariance in Section A. We show here that a carefully chosen distributed operator in latent space, the shift operator, is linearly isomorphic to all cyclic linear transformations of images of finite order K (i.e. the group G of transformations is cyclic and finite, and it acts linearly on image space). The practical consequence of this fact is that it is possible to learn a mapping equivariant to any such affine transformation using this operator, with linear invertible encoder and decoder architectures. Consider a linear encoder model f = W. If we want W to be an equivariant invertible linear mapping between X and Z, Equation 8 rewrites as follows:

W φ_k(x) = ψ_k(W x), ∀x ∈ X (= R^N), ∀k ∈ K    (24)

where φ_k and ψ_k are the representations of g_k ∈ G on the image and latent space respectively, as defined in A. For W to be equivariant, Equation 24 must be true for every image x. As W is invertible, Equation 24 is true if and only if, ∀k ∈ K, the two representations ψ_k and φ_k are isomorphic:

∀k ∈ K, φ_k = W^{-1} ψ_k W    (25)

We consider additional properties on φ, corresponding to the assumptions (i) that G is cyclic of order K with generator g_0 and (ii) that φ is isomorphic to the regular representation of G (see Scott & Serre (1996)): 1. φ is cyclic, i.e. such that φ_k = φ_0^k where φ_0 = φ(g_0), and thus φ_0^K = I. 2. The character of φ is such that
χ_φ(e) = N and χ_φ(g_k) = 0 for g_k ≠ e. The second property might seem counter-intuitive, but it just means that the transformation leaves no pixel unchanged (i.e. it permutes all the pixels). In the case of rotations, this is approximately true, since only the rotation origin remains in place. Given the second property, the character table of φ is:

        e    g_0    g_0^2    ...    g_0^{K-1}
χ_φ     N    0      0        ...    0

We have just seen that if the encoder and decoder are linear and invertible, the two representations φ and ψ must be isomorphic. Two representations are isomorphic if and only if they have the same character (Scott & Serre, 1996, Theorem 4, Corollary 2). We thus want to choose ψ such that it preserves the character of the representation φ corresponding to the action of G on the dataset of images. Importantly, we will see that our proposed operator needs to be distributed, in the sense that it should act on the full latent code. Let M_k be the matrix of order K = |G| that shifts the elements of a K-dimensional vector by k positions; it is the permutation matrix corresponding to a shift of 1, exponentiated by k:

M_k := ( 0 0 ... 0 1
         1 0 ... 0 0
         0 1 ... 0 0
         ...
         0 0 ... 1 0 )^k    (26)

We construct from M_k the shift operator as a representation of the group's action on the latent space. For each g_k ∈ G such that g_k = g_0^k, the corresponding shift operator is the block diagonal matrix of order N composed of N/K repetitions of M_k:

ψ_k := diag(M_k, M_k, ..., M_k)    (27)

Let us compute the character table of this representation. First, for the identity, it is trivial to see that χ_ψ(e) = N. Second, for any g_k ≠ e, we have

χ_ψ(g_k) = Tr(ψ_k) = 0    (28)

since all diagonal elements of ψ_k are then 0. Therefore, the character table of the shift operator is the same as χ_φ:

        e    g_0    g_0^2    ...    g_0^{K-1}
χ_ψ     N    0      0        ...    0

Using this shift operator ensures that an equivariant invertible linear mapping W exists between image space and the latent space equipped with the shift operator. Note that the character of the disentangled operator does not match the character table of φ, and so we verify once again, in this linear autoencoder setting, that the disentangled operator is unfit to be equivariant to affine transformations of images. When K does not divide N, we use a latent code dimension that is slightly different from N but divisible by K. This is an approximation of the correct theoretical operator, and we verify that it works well in practice. In the next Section C.3.2, we show that we can also replace this shift operator by a complex diagonal operator, which is more computationally efficient to multiply with the latent code.
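The block-diagonal shift operator and its character can be checked numerically; this is a sketch, and `block_shift_operator` is our helper name.

```python
import numpy as np

def block_shift_operator(k, K, N):
    """Shift operator psi_k: N/K diagonal blocks, each the permutation matrix M_k
    shifting a K-vector by k positions (K must divide N)."""
    M_k = np.roll(np.eye(K), k, axis=0)   # cyclic permutation shifting by k
    out = np.zeros((N, N))
    for i in range(N // K):
        out[i*K:(i+1)*K, i*K:(i+1)*K] = M_k
    return out

K, N = 10, 40
# The character (trace) is N at the identity and 0 at every other element,
# matching the character table of the regular representation phi
characters = [np.trace(block_shift_operator(k, K, N)) for k in range(K)]
```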

C.3.2 COMPLEX DIAGONAL SHIFT OPERATOR

In order to optimize computational time, we can also consider for M_k the following complex diagonal matrix:

M_k := Diag(1, ω, ω², ..., ω^{K-1})^k    (29)

with ω = e^{2iπ/K}. The shift operator in this case is a diagonal matrix composed of repetitions of M_k:

ψ_{k,N} := Diag(1, ω, ..., ω^{K-1}, 1, ω, ..., ω^{K-1}, ...)^k    (30)

Let us compute the character table of this representation. First, for the identity, it is trivial to see that χ_ψ(e) = N. Second, for any g_k ≠ e, we have

χ_ψ(g_k) = Tr(ψ_k) = Σ_{n=0}^{N-1} (ω^k)^n = (1 − (ω^k)^N) / (1 − ω^k) = 0    (31)

since (ω^k)^N = e^{2iπkN/K} = 1, as we assume that K divides N. Again, the character table of the shift operator is the same as χ_φ. Using this operator speeds up computation, since it requires only multiplying the diagonal values (a vector of size N) with the latent code (a vector of size N as well) instead of performing a matrix multiplication. Note that when using this complex version of the shift operator, the encoding and decoding layers of the autoencoder should be complex as well. When K does not divide N, we still use a latent code of size N, and the operator ψ_k is a diagonal matrix of order N, but the last cycle 1, ω, ... is unfinished (it does not go up to ω^{K-1}). The character of this representation is then no longer equal to 0 for non-identity elements g_k ≠ e, but equals a small value ≪ N, and this approximation works well in practice.
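The complex diagonal operator and its character can be checked in a few lines; `diag_shift_operator` is our helper name, and the operator is stored as a length-N vector since applying it is an elementwise product.

```python
import numpy as np

def diag_shift_operator(k, K, N):
    """Complex diagonal shift operator stored as a length-N vector (psi_k z is an
    elementwise product). Entries are (1, w, ..., w^{N-1})^k with w = exp(2i*pi/K);
    since w^K = 1 they cycle with period K. K is assumed to divide N."""
    w = np.exp(2j * np.pi / K)
    return w ** (k * np.arange(N))

K, N = 10, 40
# Character of the representation: the trace is the sum of the diagonal entries
characters = [diag_shift_operator(k, K, N).sum() for k in range(K)]

# Applying the operator costs O(N): an elementwise product instead of a matmul
z = np.arange(N, dtype=complex)
z_shifted = diag_shift_operator(3, K, N) * z
```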

D THE CASE OF MULTIPLE TRANSFORMATIONS: FORMAL DERIVATIONS

D.1 TRANSLATIONS IN BOTH AXES AS A DIRECT PRODUCT

To cover the case of 2D translations (acting on both the x and y axes of the image), we consider an abelian group G that is the direct product of two subgroups, G = A_x × A_y. Both A_x and A_y are normal in G, because every subgroup of an abelian group is normal. Moreover, we take A_x and A_y to be cyclic of orders K and K′ respectively, which is the case for integer translations of an image with periodic boundary conditions. We denote by a_{x,0} and a_{y,0} the generators of A_x and A_y, and write each translation as a = (a_{x,k}, a_{y,k′}) with a_{x,k} = a_{x,0}^k and a_{y,k′} = a_{y,0}^{k′}. We show that the shift operator can handle this case, with some differences. For simplicity, we assume that K, K′, and KK′ all divide N. Following Scott & Serre (1996), we write group elements g ∈ G as g_{(k,k′)} = (a_{x,k}, a_{y,k′}). The order of the group is |G| = KK′. If we consider the regular representation of 2D translations over the image space, as in Section C.3, its character table is

      e    g_{(0,1)}    g_{(0,2)}    ...    g_{(K-1,K′-1)}
χ_φ   N    0            0            ...    0

We consider two representations over the latent space: ψ^x : A_x → GL(C^K), a linear representation of A_x, and ψ^y : A_y → GL(C^{K′}), a linear representation of A_y, each of the diagonal form described in Section C.3.2. Group theory (Scott & Serre, 1996) tells us that the corresponding representation ψ of G is the tensor product of ψ^x and ψ^y, i.e. for g = g_{(k,k′)} ∈ G:

ψ_{(k,k′)} = ψ(g) = ψ((a_{x,k}, a_{y,k′})) = (ψ^x ⊗ ψ^y)(a_{x,k}, a_{y,k′}) = ψ^x(a_{x,k}) ⊗ ψ^y(a_{y,k′})   (32)

Let us consider the two shift operators:

ψ_{x,k} := Diag(1, ω_1, ω_1^2, ..., ω_1^{K-1})^k   (33)
ψ_{y,k′} := Diag(1, ω_2, ω_2^2, ..., ω_2^{K′-1})^{k′}   (34)

where ω_1 = e^{2iπ/K} and ω_2 = e^{2iπ/K′}. The tensor product ψ_{x,k} ⊗ ψ_{y,k′} is then the diagonal matrix of order KK′:

ψ_{x,k} ⊗ ψ_{y,k′} := Diag(1, ω_2^{k′}, ω_2^{2k′}, ..., ω_2^{k′(K′-1)}, ω_1^k, ω_1^k ω_2^{k′}, ..., ω_1^{k(K-1)} ω_2^{k′(K′-1)})   (35)
The character of ψ_{x,k} ⊗ ψ_{y,k′} is

χ_{ψ_{x,k} ⊗ ψ_{y,k′}} = (Σ_{n=0}^{K′-1} ω_2^{nk′}) × (Σ_{n=0}^{K-1} ω_1^{nk})   (36)

If k ≠ 0, Σ_{n=0}^{K-1} (ω_1^k)^n = (1 - (ω_1^k)^K)/(1 - ω_1^k) = 0, since (ω_1^k)^K = e^{2iπk} = 1 (and similarly for ω_2 if k′ ≠ 0). Hence, for (k, k′) ≠ (0, 0), the product (Σ_{n=0}^{K′-1} ω_2^{nk′}) × (Σ_{n=0}^{K-1} ω_1^{nk}) = 0. For (k, k′) = (0, 0), χ_{ψ_{x,k} ⊗ ψ_{y,k′}} = KK′. Thus, the character table of ψ^x ⊗ ψ^y is

      e     g_{(0,1)}    ...    g_{(K-1,K′-1)}
χ_ψ   KK′   0            ...    0

For 2D translations, we will use a diagonal operator, denoted ψ, that repeats ψ_{x,k} ⊗ ψ_{y,k′} N/(KK′) times (assuming KK′ divides N); for g_{(k,k′)} = (a_{x,k}, a_{y,k′}):

ψ_{k,k′,N} := Diag(ψ_{x,k} ⊗ ψ_{y,k′}, ..., ψ_{x,k} ⊗ ψ_{y,k′})   (N/(KK′) blocks)   (37)

Thus, the character table of ψ is

      e    g_{(0,1)}    ...    g_{(K-1,K′-1)}
χ_ψ   N    0            ...    0

We see that ψ has the same character table as φ, the representation in image space, and hence is a suited operator for this case.
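The tensor-product character computation can be verified numerically; the sketch below (names illustrative) stores the diagonal operator ψ_{k,k′,N} as a vector and checks that its character table is N at the identity and 0 on every other group element:

```python
import numpy as np

K, Kp, N = 3, 5, 30                 # orders of A_x, A_y; K*Kp divides N
w1, w2 = np.exp(2j*np.pi/K), np.exp(2j*np.pi/Kp)

def psi_2d(k: int, kp: int) -> np.ndarray:
    """Diagonal of psi_{k,k',N}: N/(K*Kp) repetitions of psi_{x,k} (x) psi_{y,k'}."""
    block = np.kron(w1 ** (k * np.arange(K)), w2 ** (kp * np.arange(Kp)))
    return np.tile(block, N // (K * Kp))

# Character table of the repeated tensor product: N at e, 0 elsewhere.
assert np.isclose(psi_2d(0, 0).sum(), N)
assert all(np.isclose(psi_2d(k, kp).sum(), 0.0)
           for k in range(K) for kp in range(Kp) if (k, kp) != (0, 0))
```

`np.kron` on the two diagonal vectors is exactly the diagonal of the Kronecker product of the two diagonal matrices, so no full matrices are ever materialized.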

D.2 STACKED SHIFT OPERATORS FOR THE DIRECT PRODUCT GROUP

For the case of 2D translations, where G = A_x × A_y, we first apply the operator corresponding to a_{x,k}, the translation in x, then insert a linear layer denoted L_1, before applying the operator corresponding to a_{y,k′}, the translation in y. When we operate on the latent code, we compute z′ = ψ_{y,k′,N} L_1 ψ_{x,k,N} z, where ψ_{x,k,N} is the operator representing the translation in x. It is a matrix of order N in which ψ_{x,k} is repeated N/K times:

ψ_{x,k,N} := Diag(ψ_{x,k}, ..., ψ_{x,k}) = Diag(1, ω_1^k, ω_1^{2k}, ..., ω_1^{k(K-1)}, 1, ω_1^k, ..., ω_1^{k(K-1)}, ...)   (39)

Similarly, the operator corresponding to a_{y,k′} is a matrix of order N in which ψ_{y,k′} is repeated N/K′ times:

ψ_{y,k′,N} := Diag(ψ_{y,k′}, ..., ψ_{y,k′}) = Diag(1, ω_2^{k′}, ..., ω_2^{k′(K′-1)}, 1, ω_2^{k′}, ..., ω_2^{k′(K′-1)}, ...)   (40)

For the product ψ_{y,k′,N} L_1 ψ_{x,k,N}, when applied to z, to match the result of ψ_{k,k′,N} applied to z, we need a permutation matrix P that operates on a matrix of order KK′ made of K′ repeated blocks of ψ_{x,k} and returns a matrix of order KK′ with:
• 1 in the first K rows, at columns c = 1, K+1, ..., KK′-K+1;
• ω_1^k in rows K+1 to 2K, at columns c = 2, K+2, ..., KK′-K+2; and so on,
• until ω_1^{k(K-1)} in rows KK′-K+1 to KK′, at columns c = K, 2K, ..., KK′.
The layer L_1 is then the matrix of order N made of this permutation matrix P, repeated N/(KK′) times in block-diagonal form, such that (ψ_{y,k′,N} L_1 ψ_{x,k,N}) z = ψ_{k,k′,N} z.
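The stacked construction can be tested numerically. The sketch below (hypothetical helper names; K = 2, K′ = 3, N = 12) builds the two repeated diagonal operators and one fixed permutation L_1, and checks, for every group element, that ψ_{y,k′,N} L_1 ψ_{x,k,N} equals the 2D operator ψ_{k,k′,N} up to the single fixed permutation L_1 — an ordering a linear decoder can absorb:

```python
import numpy as np

K, Kp, N = 2, 3, 12
w1, w2 = np.exp(2j*np.pi/K), np.exp(2j*np.pi/Kp)

psi_x = lambda k:  np.tile(w1 ** (k * np.arange(K)),  N // K)    # diag of psi_{x,k,N}
psi_y = lambda kp: np.tile(w2 ** (kp * np.arange(Kp)), N // Kp)  # diag of psi_{y,k',N}
psi_2d = lambda k, kp: np.tile(np.kron(w1 ** (k * np.arange(K)),
                                       w2 ** (kp * np.arange(Kp))), N // (K * Kp))

# One fixed permutation sigma with sigma(i) = floor(i / Kp) (mod K), made bijective
# by assigning the indices of each residue class in order.
buckets = {b: [j for j in range(N) if j % K == b] for b in range(K)}
sigma = np.array([buckets[(i // Kp) % K].pop(0) for i in range(N)])
L1 = np.zeros((N, N)); L1[np.arange(N), sigma] = 1.0             # permutation matrix

for k in range(K):
    for kp in range(Kp):
        prod = np.diag(psi_y(kp)) @ L1 @ np.diag(psi_x(k))
        # Stripping the fixed permutation recovers the 2D shift operator.
        assert np.allclose(L1.T @ prod,
                           np.diag(psi_2d(k, kp)[np.argsort(sigma)]))
```

The key point is that the same L_1 works for all (k, k′): the permutation only rearranges which repeated copy of each 1D character meets which, so that every character of A appears exactly once per block.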

D.3.1 DEFINITION OF A SEMI-DIRECT PRODUCT

A semi-direct product G = A ⋊ H of two groups A and H is a group such that:
• A is normal in G.
• There is a homomorphism f : H → Aut(A), where Aut(A) is the group of automorphisms of A. For a ∈ A, we denote f(h)(a) by h(a); in other words, f(h) describes how H acts on A.
• The semi-direct product G = A ⋊ H is defined to be the product A × H with the multiplication law (a_1, h_1)(a_2, h_2) = (a_1 h_1(a_2), h_1 h_2).
Note that this enforces h(a) = hah^{-1} (viewing A and H as subgroups of G).
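The multiplication law above can be made concrete with a small sketch, taking A = Z_5 × Z_5 (2D translations) and H = Z_4 acting by quarter turns (all names are illustrative assumptions, chosen to match the group used in Section D.4):

```python
KA, NH = 5, 4                      # |A_x| = |A_y| = 5 translations; 4 rotations

def rot(j, a):
    """Action of the j-th quarter turn on a translation a = (x, y), modulo KA."""
    x, y = a
    for _ in range(j % NH):
        x, y = (-y) % KA, x        # 90-degree rotation: (x, y) -> (-y, x)
    return (x, y)

def mul(g1, g2):
    """Semi-direct product law: (a1, h1)(a2, h2) = (a1 + h1(a2), h1 h2)."""
    (a1, j1), (a2, j2) = g1, g2
    bx, by = rot(j1, a2)
    return (((a1[0] + bx) % KA, (a1[1] + by) % KA), (j1 + j2) % NH)

def inv(g):
    """Inverse: (a, h)^{-1} = (h^{-1}(a^{-1}), h^{-1})."""
    a, j = g
    bx, by = rot(-j, ((-a[0]) % KA, (-a[1]) % KA))
    return ((bx, by), (-j) % NH)

g, t = ((2, 3), 1), ((1, 4), 0)
assert mul(g, inv(g)) == ((0, 0), 0)           # inverses compose to the identity
assert mul(mul(g, t), inv(g))[1] == 0          # A is normal: conjugates stay in A
g2, g3 = ((0, 1), 3), ((4, 4), 2)
assert mul(mul(g, g2), g3) == mul(g, mul(g2, g3))   # associativity spot check
```

The normality assertion is exactly the defining property above: conjugating a pure translation by any group element leaves the rotation part trivial.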

D.3.2 IRREDUCIBLE REPRESENTATION OF A SEMI-DIRECT PRODUCT

In this section, we also assume A is abelian. One can derive the irreducible representations of G in this case, as explained in Scott & Serre (1996), Section 8.2. First, consider the irreducible characters of A: they are of degree 1, since A is abelian, and form the character group X = Hom(A, C*). (We use X to match Scott & Serre (1996)'s notation, but note that X does not denote the image space here, but the group of characters of A.) H acts on this group by:

(h·χ)(a) = χ(h^{-1}ah)   (42)

Second, consider a system of representatives of the orbits of H in X. Denote the elements of this system χ_i, i ∈ X/H. For a given χ_i, denote by H_i the stabilizer of χ_i, i.e. the set of h with h·χ_i = χ_i. This means

(h·χ_i)(a) = χ_i(h^{-1}ah) = χ_i(a), ∀a ∈ A   (43)

Third, extend the characters of A to characters of G_i = A ⋊ H_i by setting

χ_i(ah) = χ_i(a), h ∈ H_i, a ∈ A   (44)

The χ_i are then also characters of degree 1 of G_i. Fourth, consider the irreducible representations of H_i: Scott & Serre (1996) propose to use an irreducible representation ρ of H_i; combining it with the canonical projection G_i → H_i, we get irreducible representations ρ of G_i. Irreducible representations of G_i are obtained by taking the tensor product χ_i ⊗ ρ. Finally, the irreducible representations of G are computed by taking the representations induced by the χ_i ⊗ ρ. Etingof et al. (2009) show that the character of the induced representation is

χ_{Ind_{G_i}^G(χ_r ⊗ ρ)}(a, h) = (1/|H_i|) Σ_{h′ ∈ H s.t. h′^{-1}hh′ ∈ H_i} χ_r(h′(a)) χ_ρ(h′^{-1}hh′)

D.4 REPRESENTATIONS OF THE (DISCRETE FINITE) SPECIAL EUCLIDEAN GROUP

In this section, we focus on the specific case of the semi-direct product G = A ⋊ H, where A = A_x × A_y is the group of 2D translations and H is the group of rotations. Hence, A is abelian and H is a cyclic group. We will derive the irreducible representations of this group using the method presented in Section D.3.2.

D.4.1 A NOTE ON THE DISCRETE FINITE SPECIAL EUCLIDEAN GROUP

While the Special Euclidean group has the structure of a semi-direct product in the continuous case, we will consider its discrete and finite version; that is, we consider a finite number of translations and rotations. With integer-valued translations, which are of interest when working with images (considering translations by a whole number of pixels), we cannot consider all rotations. Indeed, rotations by 2π/8, for example, break the normality of the subgroup of translations.

Proof. Take the translation element a = (1, 1) (a one-pixel translation in x and in y, respectively), and the rotation h of angle 2π/8. Consider the composition hah^{-1}: applied to a point of coordinates (i, j), this gives hah^{-1}(i, j) = h(h^{-1}(i, j) + (1, 1)) = (i, j) + h(1, 1) = (i, j) + (0, √2), and the translation (0, √2) is not an integer translation. Thus, A is not normal in G in this case.

In what follows, we consider rotations that preserve the normality of the group of integer 2D translations of the image: namely, rotations by multiples of 2π/4 (i.e. π/2, π, 3π/2 and the identity). Nonetheless, we think the approach is insightful, and approximate solutions could be found with this method for other angles. Furthermore, to ease the derivations, we consider that both K ≥ 2 and K′ ≥ 2 are odd (and thus the product KK′ is odd), such that the stabilizers of the character group of A are either the entire H or only the identity. We leave the exploration of even K and K′ for future work.
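The proof above can be checked numerically: conjugating a unit translation by a quarter turn stays on the integer lattice, while conjugating by an eighth turn does not (helper names are illustrative):

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def conjugated_translation(theta, t=np.array([1.0, 1.0])):
    """h a h^{-1} applied to the origin: the translation vector h(t)."""
    return rotation(theta) @ t

quarter = conjugated_translation(np.pi / 2)    # (-1, 1): still an integer translation
eighth = conjugated_translation(np.pi / 4)     # (0, sqrt(2)): leaves the pixel grid
assert np.allclose(quarter, np.round(quarter))
assert not np.allclose(eighth, np.round(eighth))
```
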

D.4.2 FINDING THE ORBITS

As we consider integer translations with periodic boundary conditions, the characters of A are indexed by pairs and evaluated on elements of the 2D discrete, finite translation group of the form a = (a_{x,k}, a_{y,k′}), with a_{x,k} = a_{x,0}^k and a_{y,k′} = a_{y,0}^{k′}. The characters of this group are evaluated as χ_{(x_1,y_1)}(a) = e^{i2π(x_1 k/K + y_1 k′/K′)}, with x_1 ∈ {1, ..., K} and y_1 ∈ {1, ..., K′}. We consider two cases: 1. χ_{(0,0)}; 2. χ_{(x_1,y_1)} with x_1 ≠ 0 or y_1 ≠ 0. Note that the total number of orbits of H in X is (1/|H|) Σ_{χ_x ∈ X} |H_x|, where H_x is the stabilizer of χ_x. Either χ_x = e_X is the identity element of X and |H_x| = |H|, or χ_x ≠ e_X and the only stabilizer is e_H (as shown in Section D.6.1), thus |H_x| = 1. We therefore have:

(1/|H|) Σ_{χ_x ∈ X} |H_x| = (1/|H|)(1 × |H| + (|A| - 1) × 1) = 1 + (|A| - 1)/|H|

Hence, there are (|A| - 1)/|H| orbits in case 2 (the total number of orbits minus the orbit considered in the first case, that of χ_{(0,0)}).
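The orbit count can be checked for a small odd grid. The sketch below (illustrative values K = K′ = 3, |H| = 4 quarter turns) enumerates the orbits of the character indices under the rotation action and recovers 1 + (|A| - 1)/|H| orbits:

```python
from itertools import product

K = 3                                   # K = K' = 3 (odd), so |A| = 9
H = 4                                   # quarter-turn rotations

def orbit(c):
    """Orbit of a character index (x1, y1) under (x1, y1) -> (-y1, x1) mod K."""
    x, y = c
    out = set()
    for _ in range(H):
        out.add((x, y))
        x, y = (-y) % K, x
    return frozenset(out)

orbits = {orbit(c) for c in product(range(K), repeat=2)}
A = K * K
assert len(orbits) == 1 + (A - 1) // H                      # 1 + 8/4 = 3 orbits
assert sum(1 for o in orbits if o != frozenset({(0, 0)})) == (A - 1) // H
```

Every non-trivial orbit has size exactly |H|, consistent with the stabilizers being trivial for odd K and K′.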

D.4.3 ACTION OF H ON THE CHARACTERS OF A

Elements of H act on the characters of A as:

(h·χ_{(x_1,y_1)})(a) = χ_{(x_1,y_1)}(h^{-1}ah)   (47)
= χ_{(x_1,y_1)}(h^{-1}(a))   (48)
= e^{i2π((x_1/K)(cos(-θ)x - sin(-θ)y) + (y_1/K′)(sin(-θ)x + cos(-θ)y))}   (49)
= e^{i2π(((x_1/K)cos(θ) + (y_1/K′)sin(-θ))x + ((x_1/K)(-sin(-θ)) + (y_1/K′)cos(-θ))y)}   (50)
= e^{i2π(((x_1/K)cos(θ) - (y_1/K′)sin(θ))x + ((x_1/K)sin(θ) + (y_1/K′)cos(θ))y)}   (51)
= χ_{(x_1 cos(θ) - y_1 sin(θ), x_1 sin(θ) + y_1 cos(θ))}(a)   (52)
= χ_{h((x_1,y_1))}(a)   (53)

where θ is the angle of rotation of h (here K = K′, since rotations require a square grid).

D.4.4 CASE 1: χ_{(0,0)} (ORBIT OF THE ORIGIN)

A representative is χ_{(0,0)} = 1. The stabilizer group H_i is the entire H, and the irreducible representations of H are of the form h_0^j ↦ e^{i2πnj/|H|} for n ∈ {1, ..., |H|}, where |H| is the total number of rotations. Thus we use the tensor products 1 ⊗ e^{i2πnj/|H|} as representations of G. There are |H| such irreducible representations of G, all of degree 1, since the group of rotations is abelian. The resulting representations, corresponding to the irreducibles we get from combining this orbit with each of the irreducibles of H, can be represented in matrix form:

ρ(a, h) := Diag(1, e^{i2π/|H|}, ..., e^{i2π(|H|-1)/|H|})^j

where j is such that h = h_0^j and h_0 is the generator of the group of rotations.

D.4.5 CASE 2: χ_{(x_1,y_1)} ((|A|-1)/|H| OF THEM)

A representative can be taken to be χ_r = χ_{(x_1,y_1)}, where x_1 and y_1 are now fixed. Its stabilizer group H_i is only {e_H} (see Berndt (2007) and the proof in Section D.6.1). We now select an irreducible representation ρ of {e_H}: we select the trivial representation 1, which, combined with the canonical projection, is also a representation of G_i. We then take the tensor product χ_r ⊗ 1 as a representation of G_i = A ⋊ {e_H}. Let us now derive the representation of the entire group, ρ = Ind_{G_i}^G(χ_r ⊗ 1).

Induced representation Ind_{G_i}^G(χ_r ⊗ 1). First, we need a set of representatives of the left cosets of G_i = A ⋊ {e_H} in G.
The left cosets are defined as (a, h)G_i = {(a, h)(a_k, e_H), ∀a_k ∈ A}. There are |H| cosets (one per h), which we can denote (e_A, h)G_i. We take a representative g_i for each coset, namely (e_A, h_i) = (e_A, h_i)(e_A, e_H) ∈ (e_A, h_i)G_i (see the footnote), so the h_i are now fixed.

The representation (ρ, W) is induced by (χ_r ⊗ 1, V) if W = ⊕_{i=1}^{|H|} ρ(g_i)V, with g_i = (e_A, h_i) and ρ(g_i) described below. For each g and each g_i, there exist j(i) ∈ {1, ..., |H|} and f_j ∈ G_i such that gg_i = g_{j(i)} f_j. Indeed:

gg_i = (a, h)(e_A, h_i)   (56)
= (a, hh_i)   (57)
= (e_A, hh_i)((hh_i)^{-1}(a), e_H)

so g_{j(i)} = (e_A, hh_i). In other words, the action of g = (a, h) on an element w ∈ W permutes the representatives:

ρ(g)w = ρ((a, h)) Σ_{i=1}^{|H|} ρ(g_i)v_i = Σ_{i=1}^{|H|} ρ(g_{j(i)}) χ_r((hh_i)^{-1}(a)) v_i

Now, we need to find, for each element of G, the resulting permutation of the coset representatives. G is generated by the a ∈ A and h ∈ H: any element (a, h) ∈ G can be written as (a, e_H)(e_A, h). For (a, e_H), we get g_{j(i)} = (e_A, h_i) = g_i, so there is no permutation, and the induced representation is the diagonal matrix

ρ(a, e_H) = Diag(χ_r(h_1^{-1}(a)), χ_r(h_2^{-1}(a)), ..., χ_r(h_{|H|}^{-1}(a)))

For (e_A, h), we get g_{j(i)} = (e_A, hh_i), and the resulting induced representation is the permutation matrix

ρ(e_A, h) = P_h

since every non-zero entry χ_r((hh_{j^{-1}(i)})^{-1}(e_A)) = χ_r(e_A) equals 1. The permutation matrix P_h above represents how h acts on its own group: if hh_i = h_j, then P_h has a 1 at column j of its i-th row. For example, if hh_{|H|} = h_1, then j(|H|) = 1 and j^{-1}(1) = |H|, so row 1 has the value χ_r((hh_{|H|})^{-1}(e_A)) = 1 at the |H|-th column. Let us denote, for clarity, χ_{r,i} = χ_r(h_i^{-1}(a)). The representation of (a, h) is then

ρ(a, h) = ρ(a, e_H) ρ(e_A, h) = Diag(χ_{r,1}, χ_{r,2}, ..., χ_{r,|H|}) P_h

For each orbit, we get one such representation, of degree |H|.
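The induced representation can be instantiated numerically. The sketch below (illustrative names; A = Z_5 × Z_5, H = Z_4 quarter turns, χ_r = χ_{(1,0)}) builds ρ(a, h) = Diag(χ_r(h_i^{-1}(a))) P_h with the bookkeeping convention that ρ(e_A, h_0^j) sends the i-th coset to the (i+j)-th; the text's row/column convention may be the transpose, which for a cyclic (abelian) H yields an equivalent representation:

```python
import numpy as np

KA, NH = 5, 4                                  # A = Z_5 x Z_5, H = Z_4

def rot(j, a):                                 # action of h_0^j on a translation
    x, y = a
    for _ in range(j % NH):
        x, y = (-y) % KA, x
    return (x, y)

def chi_r(a):                                  # representative character chi_{(1,0)}
    return np.exp(2 * np.pi * 1j * a[0] / KA)

def rho(a, j):
    """Induced representation: rho(a, h) = Diag(chi_r(h_i^{-1}(a))) P_h."""
    D = np.diag([chi_r(rot(-i, a)) for i in range(NH)])
    P = np.zeros((NH, NH))
    for i in range(NH):                        # h h_i = h_{i+j}: coset i -> coset i+j
        P[(i + j) % NH, i] = 1.0
    return D @ P

def mul(g1, g2):                               # semi-direct product law
    (a1, j1), (a2, j2) = g1, g2
    bx, by = rot(j1, a2)
    return (((a1[0] + bx) % KA, (a1[1] + by) % KA), (j1 + j2) % NH)

# rho is a homomorphism: rho(g1 g2) = rho(g1) rho(g2).
g1, g2 = ((1, 3), 1), ((2, 4), 3)
assert np.allclose(rho(*mul(g1, g2)), rho(*g1) @ rho(*g2))
# Character: degree |H| at the identity, trace 0 for a non-trivial rotation part.
assert np.isclose(np.trace(rho((0, 0), 0)), NH)
assert np.isclose(np.trace(rho((1, 2), 1)), 0.0)
```

The zero trace for h ≠ e_H is immediate from the structure: P_h has no non-zero diagonal entry, so neither does Diag(·) P_h.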

D.4.6 RESULTING REPRESENTATION

The resulting representation is a block-diagonal matrix of size |A||H|. The first |H| elements on the diagonal correspond to the irreducible representations of h = h_0^j: this is the shift operator we have been considering in the single-transformation case. Second, each matrix product M_r(a, h) P_h, for representative r, is repeated |H| times, as it is of degree |H| (see the calculation of the degree in Section D.6.2), with

M_r(a) = Diag(χ_r(h_1^{-1}(a)), χ_r(h_2^{-1}(a)), ..., χ_r(h_{|H|}^{-1}(a)))

If we assume that |A||H| divides N, we can use an operator ψ_{a,h,N} that repeats ρ(a, h) N/(|A||H|) times into a matrix of order N:

ψ_{a,h,N} := Diag(ρ(a, h), ρ(a, h), ..., ρ(a, h))   (66)

For h = e_H, the representation is the diagonal matrix

ρ(a, e_H) = Diag(1, ..., 1, M_{r_1}(a), ..., M_{r_1}(a), ..., M_{r_R}(a), ..., M_{r_R}(a))

with |H| ones and each M_r(a) repeated |H| times. Its trace is

Tr(ρ(a, e_H)) = |H| + |H| Σ_{h_i ∈ H} Σ_{r=r_1}^{R} χ_r(h_i^{-1}(a))

as each block-diagonal matrix is repeated |H| times. If we interchange the order of summation, we get:

Tr(ρ(a, e_H)) = |H| + |H| Σ_{r=r_1}^{R} Σ_{h_i ∈ H} χ_r(h_i^{-1}(a))   (69)
= |H| + |H| Σ_{r=r_1}^{R} Σ_{h_i ∈ H} (h_i·χ_r)(a)   (70)

where h_i·χ_r represents the action of h_i on χ_r. The action of every h_i on χ_r gives all the elements in the orbit of χ_r, so the double sum results in the sum over all characters χ ∈ X of A, apart from the orbit of χ_{(0,0)}, which contains only χ_{(0,0)}. Consequently, the character table of ρ is: χ_ρ(e_A, e_H) = |A||H|, and χ_ρ(a, h) = 0 for any (a, h) ≠ (e_A, e_H) (Equations 71-72), where a is the 2D translation, composed of a_{x,k}, the translation on the x-axis, and a_{y,k′}, the translation on the y-axis, and h is the rotation.

D.5 STACKED SHIFT OPERATORS FOR THE SPECIAL EUCLIDEAN GROUP

Using the stacked shift operators model, we first apply the operator corresponding to the rotation, then the one for the translation in x, then the one for the translation in y.
The resulting transformed latent code is z′ = ψ_{y,k′,N} L_2 ψ_{x,k,N} L_1 ψ_{h,j,N} z, with a_{y,k′} = a_{y,0}^{k′}, a_{x,k} = a_{x,0}^k and h = h_0^j. The operators for translations, ψ_{x,k,N} and ψ_{y,k′,N}, are described in Equations 39 and 40; we repeat them here for clarity:

ψ_{x,k,N} := Diag(ψ_{x,k}, ..., ψ_{x,k}) = Diag(1, ω_1^k, ω_1^{2k}, ..., ω_1^{k(K-1)}, 1, ω_1^k, ..., ω_1^{k(K-1)}, ...)   (78)

ψ_{y,k′,N} := Diag(ψ_{y,k′}, ..., ψ_{y,k′}) = Diag(1, ω_2^{k′}, ..., ω_2^{k′(K′-1)}, 1, ω_2^{k′}, ..., ω_2^{k′(K′-1)}, ...)   (79)

where ω_1 = e^{2iπ/K} and ω_2 = e^{2iπ/K′}, and ψ_{h,j,N} is the repetition of the shift operator for the rotation N/|H| times:

ψ_{h,j,N} := Diag(ψ_{h,j}, ψ_{h,j}, ..., ψ_{h,j})   (80)

The representations of the three transformations ψ_{x,k,N}, ψ_{y,k′,N}, ψ_{h,j,N} are linked to the representation that the theory gives us in Equation 66. First, recall that elements of H act on a character χ_r = χ_{(x_1,y_1)} by:

χ_{(x_1,y_1)}(h_i^{-1}(a)) = χ_{h_i((x_1,y_1))}(a)

Thus, as h_i spans H in a given matrix M_r, we obtain |H| distinct characters built from χ_r, i.e.

M_r(a) = Diag(χ_{h_1(r)}(a), χ_{h_2(r)}(a), ..., χ_{h_{|H|}(r)}(a))

For another matrix M_{r′}, we again obtain |H| distinct characters, distinct from those present in M_r (otherwise they would be in the same orbit). We have (|A|-1)/|H| representatives (excluding χ_{(0,0)}), thus |A| - 1 characters of A evaluated at a are obtained by considering all the matrices M_r. The remaining character is χ_{(0,0)} = 1. Hence, the diagonal matrix ρ(a, e_H) of order |A||H| contains all the characters of A evaluated at the element a, each one repeated |H| times (since each matrix M_r is repeated |H| times), and the first diagonal elements of ρ(a, e_H) are 1. So we see that what we need is |H| copies of each character of A.
As explained in Section D.4.2, these characters are of the form:

χ_{(x_1,y_1)}(a) = e^{i(2π x_1 k/K + 2π y_1 k′/K′)} = e^{i2π x_1 k/K} e^{i2π y_1 k′/K′} = ω_1^{x_1 k} ω_2^{y_1 k′}.

The characters of each translation group are χ_{x_1} = ω_1^{x_1 k} for the x-translation and χ_{y_1} = ω_2^{y_1 k′} for the y-translation, thus:

χ_{(x_1,y_1)}(a) = χ_{(x_1,y_1)}(a_{x,k}, a_{y,k′}) = ω_1^{x_1 k} ω_2^{y_1 k′} = χ_{x_1}(a_{x,k}) χ_{y_1}(a_{y,k′})

Since ψ_{x,k} is repeated N/K times in ψ_{x,k,N} and ψ_{y,k′} is repeated N/K′ times in ψ_{y,k′,N}, we can take L_2 to reorder ψ_{y,k′,N} into blocks of K repeated diagonal elements, such that when multiplied with ψ_{x,k,N} as ψ_{y,k′,N} L_2 ψ_{x,k,N}, we obtain N/(KK′) = N/|A| blocks of size |A|, each containing every character of A once. Furthermore, we can see that the resulting matrix ψ_{y,k′,N} L_2 ψ_{x,k,N} is composed of N/(|A||H|) blocks of size |A||H|, containing every character of A repeated |H| times. This resulting matrix is then multiplied with the matrix L_1 ψ_{h,j,N}. So let us now turn to the representation of the rotation h, that is, ψ_{h,j,N}. It is composed of the elements that appear in the upper-left diagonal of ρ(a, h); but we also need to make the permutation matrices P_h appear. Recall that the diagonal matrix of order |H| corresponding to the shift representation of h = h_0^j (not repeated N/|H| times) is

ψ_{h,j} = Diag(1, e^{i2πj/|H|}, e^{i2π·2j/|H|}, ..., e^{i2πj(|H|-1)/|H|})

If we right-multiply ψ_{h,j} by the matrix C with entries C_{mn} = e^{i2πmn/|H|}, and left-multiply it by the matrix B with entries B_{mn} = e^{-i2πmn/|H|} (for m, n ∈ {0, ..., |H|-1}), we obtain a matrix filled with 0 except, in each row r, one value |H| at the column c such that h_0^{-r} h h_0^c = e_H, i.e.
hh_0^c = h_0^r: this is exactly what P_h represents, as we defined it such that if hh_i = h_j then P_h has a 1 at column j of its i-th row. Thus,

P_h = (1/|H|) B ψ_{h,j} C   (88)

Consider the block-diagonal matrix M, where the first |H| diagonal elements are all ones, and the matrix B is then repeated |A| - 1 times on the block diagonal:

M = Diag(1, ..., 1, B, B, ..., B)   (|H| ones, then |A| - 1 copies of B)   (89)

Consider also M′, the repetition of ψ_{h,j} (of order |H|) |A| times:

M′ = Diag(ψ_{h,j}, ψ_{h,j}, ..., ψ_{h,j})   (90)

Thus,

M M′ C′ = (1/|H|) Diag(ψ_{h,j}, P_h, P_h, ..., P_h)   (92)

where C′ is the corresponding block-diagonal repetition of C, the first |H| diagonal elements are those of ψ_{h,j}, and P_h is repeated |A| - 1 times. The operator ψ_{h,j,N} is N/|H| repetitions of ψ_{h,j}, hence it can also be seen as N/(|A||H|) repetitions of M′. Thus, let us denote by Q the N × N block-diagonal matrix made of N/(|A||H|) repetitions of M. Multiplying Q with ψ_{h,j,N}, we get M M′ repeated N/(|A||H|) times:

Q ψ_{h,j,N} = Diag(M M′, M M′, ..., M M′)   (93)

This is the matrix that is multiplied with ψ_{y,k′,N} L_2 ψ_{x,k,N}. However, the order of the rows in the result does not correspond to the ordering of the characters in ψ_{y,k′,N} L_2 ψ_{x,k,N}. If we want to match the operator in Equation 66, such that ψ_{a,h,N} z = (ψ_{y,k′,N} L_2 ψ_{x,k,N} L_1 ψ_{h,j,N}) z, we need L_1 to be the product of two matrices, P and Q, such that P reorders the rows of the vector Q ψ_{h,j,N} z to match the ordering of the characters in ψ_{y,k′,N} L_2 ψ_{x,k,N}.
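The identity P_h = (1/|H|) B ψ_{h,j} C can be verified directly: B and C are DFT-like matrices, and conjugating the diagonal of roots of unity yields exactly the cyclic permutation by j steps (illustrative |H| = 4):

```python
import numpy as np

H = 4                                          # order of the rotation group
n = np.arange(H)
C = np.exp(2 * np.pi * 1j * np.outer(n, n) / H)    # C_{mn} = e^{ i 2 pi m n / H }
B = np.conj(C)                                     # B_{mn} = e^{-i 2 pi m n / H }

for p in range(H):                                 # h = h_0^p
    psi_h = np.diag(np.exp(2 * np.pi * 1j * p * n / H))   # shift representation of h
    P = (B @ psi_h @ C) / H                               # claimed permutation P_h
    # P_h has a single 1 per row r, at column c with r = p + c (mod H),
    # i.e. h h_0^c = h_0^r.
    expected = np.zeros((H, H))
    expected[(n + p) % H, n] = 1.0
    assert np.allclose(P, expected)
```

The computation behind the assertion is the geometric-series identity used throughout this appendix: (1/H) Σ_m e^{2iπm(p + c - r)/H} equals 1 exactly when r ≡ p + c (mod H) and 0 otherwise.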
The result of

(ψ_{y,k′,N} L_2 ψ_{x,k,N}) P Q ψ_{h,j,N} z   (94)

is then a vector that is a permuted version of the vector we would get with ψ_{a,h,N} z, and a linear decoder can learn to reorder it if we want to exactly match ψ_{a,h,N} z.

D.6 ADDITIONAL PROOFS

D.6.1 STABILIZERS OF THE NON-TRIVIAL CHARACTERS

We show here that for χ_r = χ_{(x_1,y_1)} with x_1 ≠ 0 or y_1 ≠ 0, the stabilizer group H_i is only {e_H}.

Proof. Let us consider the stabilizer group of χ_{(x_1,y_1)}. It is composed of the elements h ∈ H such that, for all a ∈ A, (h·χ_{(x_1,y_1)})(a) = χ_{h((x_1,y_1))}(a) = χ_{(x_1,y_1)}(a). In other words, the character of the rotated index h((x_1, y_1)) must be the same as the character corresponding to (x_1, y_1), for any a. Viewing character indices as 2D vectors, we must have (x_1, y_1) = (x_1 cos θ - y_1 sin θ, x_1 sin θ + y_1 cos θ), i.e. the rotated vector corresponding to the character must equal itself. Because we employ periodic boundary conditions, this is for example the case for θ = π or θ = -π when x_1 = K/2 and y_1 = K′/2, since the inverse of (K/2, K′/2) is itself. But we restrict ourselves to odd K and K′, so that there is no rotation angle for which the rotation of (x_1, y_1) ≠ (0, 0) gives back the same vector.

D.6.2 DEGREES OF THE REPRESENTATIONS

Case 1. In Case 1, we have |H| representations of degree 1.

Case 2. Recall that in Case 2, H_i = {e_H} and we use ρ = 1, so the character of the induced representation is

χ_{Ind_{G_i}^G(χ_r ⊗ 1)}(a, h) = Σ_{h′ ∈ H s.t. h′^{-1}hh′ = e_H} χ_r(h′(a)) χ_1(e_H)

The condition h′^{-1}hh′ = e_H forces h = e_H. If h = e_H, h′ spans the entire H and we have

χ_{Ind_{G_i}^G(χ_r ⊗ 1)}(a, e_H) = Σ_{h′ ∈ H} χ_r(h′(a)) χ_1(e_H) = Σ_{h′ ∈ H} χ_r(h′(a))

If h ≠ e_H, there is no element h′ for which the condition holds (otherwise h = h′ e_H h′^{-1} = e_H, a contradiction). Thus:

χ_{Ind_{G_i}^G(χ_r ⊗ 1)}(a, h) = Σ_{h′ ∈ H} χ_r(h′(a)) if h = e_H, and 0 if h ≠ e_H.

To obtain the degree of the representation, we calculate the character at the identity element:

χ_{Ind_{G_i}^G(χ_r ⊗ 1)}(e_A, e_H) = Σ_{h′ ∈ H s.t. h′^{-1}e_H h′ = e_H} χ_r(h′(e_A)) χ_1(e_H)   (97)
= Σ_{h′ ∈ H} χ_r(e_A) χ_1(e_H) = |H| χ_r(e_A) χ_1(e_H) = |H|   (98)

where we use the facts that χ_r(e_A) = 1 (A is abelian, hence its irreducible characters are of degree 1, and h′(e_A) = e_A), that χ_1(e_H) = |{e_H}| = 1, and that h′^{-1}e_H h′ = e_H is true for all h′ ∈ H. Hence, the degree of the induced representation is |H|. Thus, in Case 2, each orbit induces a unique induced representation of degree |H|, and we have (|A|-1)/|H| orbits in the second case. This shows that we have derived all the irreducible representations of G: summing the squared degrees of the irreducible representations gives |H| × 1² + ((|A|-1)/|H|) × |H|² = |H| + (|A|-1)|H| = |A||H| = |G|.

Table 2: Test mean squared error (MSE) ± standard deviation of the mean over random seeds. Numbers in parentheses refer to the number of rotations, the number of translations on the x-axis, and the number of translations on the y-axis, in this order. The case of multiple transformations does not apply to the weakly supervised shift operator, and we did not experiment on translated MNIST with the weakly supervised shift operator, as its performance on translated simple shapes and rotated MNIST already showed its relevance.

Dataset           Disentangled       Shift              Weak. sup. shift (K_L = 10)
Shapes (10,0,0)   0.0208 ± 5.2e-6    0.0002 ± 1.8e-6    0.001 ± 1.3e-3
Shapes (0,10,0)   0.0352 ± 9.0e-6    0.0052 ± 9.6e-6    0.0097 ± 5.2e-3
Shapes (0,0,10)   0.0353 ± 7.6e-6    0.0052 ± 1.5e-5    0.0115 ± 4.9e-3
Shapes (0,5,5)    n/a                0.0047 ± 1.9e-5    n/a
Shapes (4,5,5)    n/a                0.0049 ± 7.7e-6    n/a
Shapes (5,5,5)    n/a                0.0021 ± 5.9e-6    n/a
MNIST (10,0,0)    0.0660 ± 8.2e-5    0.0004 ± 5.6e-6    0.0035 ± 3.9e-3
MNIST (0,10,0)    0.0838 ± 3.2e-5    0.0079 ± 5.1e-5    n/a
MNIST (0,0,10)    0.0857 ± 5.1e-5    0.0062 ± 3.2e-5    n/a
MNIST (4,5,5)     n/a                0.004 ± 4.0e-5     n/a

E.2 ADDITIONAL ANALYSES

Latent Variance and PCA Analysis. The analyses in Fig. 1.C and 1.D quantitatively measure disentanglement in the latent representation for the VAE and its variants on rotated MNIST. We first apply every transformation to a given shape and compute the variance of the latent representation as we vary the transformation. The final figure shows the average variance across all test samples. For the PCA analysis, we seek to determine whether the transformation acts on a subspace of the latent representation. We first compute the ranked eigenvalues of the latent representations of each shape, with all transformations applied to the input. We then normalize the ranked eigenvalues by the sum of all eigenvalues, to obtain the proportion of variance explained by each latent dimension. Finally, we plot the average of the normalized ranked eigenvalues across all test samples.

Additional latent traversals. Figures 5 and 6 show all latent traversals for the model with the best validation loss. We see the success of disentanglement in the case of a single digit, and the failure of any latent to capture rotation in the case of multiple digits. We show a more granular traversal, with 50 plots per latent, for each baseline in Fig. 7. In Figures 9 and 10, we show comparable results for the case of translations along the x-axis and the y-axis.

Additional results for the distributed operator models. Fig. 14 shows additional results on MNIST. Fig. 14a shows that the weakly supervised shift operator performs well on rotated MNIST, and in Fig. 14b we see that the stacked shift operator model is able to correctly encode the multiple-transformations case on MNIST. Fig. 15 shows the performance of the weakly supervised shift operator on translated simple shapes (either x or y translations). The effect of the number of latent transformations is explored in App. B.3, which shows that, in the case of translation, the best model is obtained with 21 transformations.
Nonetheless, to be able to feed the model the correct 10 transformations needed in these plots, the reported plots are for the model with 10 latent transformations. Fig. 16 shows example results for the case of the Special Euclidean group (i.e. rotations in conjunction with translations), in a setting where the semi-direct product structure is broken (see Appendix D.4.1). Here, we used a rotation group of order 5; we see that the stacked shift operator model nonetheless performs well.
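The variance and PCA analyses described above can be sketched in a few lines of numpy (function names and array shapes are illustrative assumptions; the paper's exact pipeline may differ). The toy input places a rotation factor in a 2-dimensional latent subspace, so the top two normalized eigenvalues capture essentially all the variance:

```python
import numpy as np

def transformation_variance(latents):
    """Per-dimension latent variance as the transformation varies (Fig. 1.C-style).

    latents: array (n_samples, n_transforms, d) of codes for every sample
    under every transformation of the factor (e.g. all rotations).
    """
    return latents.var(axis=1).mean(axis=0)

def ranked_explained_variance(latents):
    """Average normalized ranked eigenvalues of the latent covariance (Fig. 1.D-style)."""
    specs = []
    for per_sample in latents:                       # (n_transforms, d)
        cov = np.cov(per_sample, rowvar=False)
        eig = np.sort(np.linalg.eigvalsh(cov))[::-1]
        specs.append(eig / eig.sum())                # proportion of variance explained
    return np.mean(specs, axis=0)

# Toy latent: rotation lives in a 2-dim subspace of an 8-dim code.
t = np.linspace(0, 2 * np.pi, 36, endpoint=False)
z = np.stack([np.stack([np.cos(t), np.sin(t)] + [np.zeros_like(t)] * 6, axis=1)
              for _ in range(10)])                   # (10 samples, 36 transforms, 8 dims)
ev = ranked_explained_variance(z)
assert ev[:2].sum() > 0.99                           # two dims explain ~all variance
```

A distributed equivariant code would instead spread this variance across many dimensions, which is exactly why the ranked-eigenvalue curve discriminates the two regimes.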



Footnotes:
• or quasi-invertible, see App. B.2.
• An element (a, h) is in the same coset as (e_A, h), since (a, h)^{-1}(e_A, h) = (h^{-1}(a^{-1}), h^{-1})(e_A, h) = (h^{-1}(a^{-1}), e_H) ∈ G_i.



Figure 1: Failure modes of common disentanglement approaches. A. Latent traversal best capturing rotation for a VAE, β-VAE, and CCI-VAE for rotated MNIST restricted to a single digit class ("4"). B. Same as panel A for all 10 MNIST classes. C. Variance of single latents in response to image rotation, averaged over many test images. D. Ranked eigenvalues of the latent covariance matrix in response to image rotation, averaged over many test images. E. A supervised disentangling model successfully reconstructs some digits (top) but fails on other examples (bottom). F. Failure cases of the supervised model trained on a dataset of 2000 rotated shapes (see also Fig. 8).

Figure 2: Visual proof of the topological defects of disentanglement. A. Top left: O_1, O_2 and O_3 are three examples of orbits of 3-pixel images transformed by translation. Bottom: (left) orbits visualized in image space (points constitute the orbits; continuous lines are for visualization purposes); (right) orbits in latent space. When projected onto the equivariant subspace Z_E (gray dotted lines), all orbits should collapse onto each other. Yet the orbit of a uniformly black image (red dot) contains a single point and thus cannot be mapped onto the other orbits. B. Discontinuity of f_E around symmetric images. Top: consider an image of an equilateral triangle, with an infinitesimal perturbation on one corner (black dot), undergoing rotation (color changes are for visualization purposes). Bottom: (left) after a rotation of 120°, the orbit in image space (here projected onto 3 dimensions for visualization) almost loops back on itself; (right) in Z_E, each angle of rotation corresponds to a distinct point in space. Therefore, the encoder f_E is discontinuous (as shown by the red arrows).

The operators φ_k and ψ_k capture how the transformation g_k acts on the input space and the representation space, respectively. With this view,

The flexibility of Def. 1 unlocks a powerful toolbox for understanding and building disentangled models. Transformations in data often have additional structure describing how to (1) undo a transformation (invertibility), (2) leave a sample unchanged (identity), and (3) rearrange parentheses

the same character (Scott & Serre, 1996, Theorem 4, Corollary 2) (see App. A for a definition

version of the shift operator. These assumptions allow us to guarantee that a linear equivariant model can be learned with pairs of examples (see our training objectives in App. B.2). Note that the shift operator only handles cyclic groups of finite order (or a product of such groups). In order to tackle continuous transformations, a discretisation step could be added, and we leave the exploration of

Figure 3: Success and flexibility of the proposed distributed shift operator models. A. The proposed shift operator model successfully learns rotation on simple shapes. B. The disentangled operator fails to learn rotation. C. The weakly supervised shift operator model, using the complex version of the shift operator, successfully rotates simple shapes. Note that the model maps ground-truth counter-clockwise rotations to clockwise rotations, while respecting the cyclic structure of the group. D. The stacked shift operator model succeeds on a conjunction of translations. E. The stacked shift operator model succeeds on a conjunction of translations and rotations. Numbers above plots indicate rotation angle and/or translation in x and y, respectively.

Figures 17 and 18 show pairs of samples and the reconstructions by the stacked shift operator model, in the cases of (i) translation in both the x and y axes and (ii) rotations and translations in both axes. In appendix Figure 16, we also explore the case where the order of the group of rotations is 5, breaking the semi-direct product structure (see the note in Appendix D.4.1), and show that the stacked shift operator nonetheless performs well.

Moreover, in cases where the latent operators cannot be determined in advance (unlike the affine case), these operators could be learned as in Connor et al. A benefit of this approach is that multiple operators can be learned in the same subspace, instead of the stacking strategy that we needed to use in the case of hard-coded shift operators.

Appendix

where d is the latent dimension (see Kingma & Welling (2014)). For a standard VAE, we use C = 0 and β = 1.0. For β-VAE, we sweep over choices of β with C = 0. For CCI-VAE, we sweep over choices of β and linearly increase C throughout training from C = 0 to C = 36.0.
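The linearly increasing capacity C described above can be sketched as a simple schedule. This is our own illustration (the function names and the choice of β are ours; the target C = 36.0 is from the text), showing the KL term of the CCI-VAE objective, β·|KL − C|:

```python
# Sketch (ours) of the CCI-VAE capacity schedule described above:
# C grows linearly from 0 to c_max over training, and the KL penalty
# of the objective is beta * |KL - C| (reconstruction term omitted).
def capacity_at(step, total_steps, c_max=36.0):
    """Linear schedule from C = 0 at step 0 to C = c_max at the end of training."""
    return c_max * min(step / total_steps, 1.0)

def cci_vae_kl_term(kl, step, total_steps, beta, c_max=36.0):
    """KL penalty term of the CCI-VAE loss at a given training step."""
    c = capacity_at(step, total_steps, c_max)
    return beta * abs(kl - c)
```

With C = 0 and β = 1.0 this penalty reduces to the standard VAE KL term, matching the special cases described above.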

learning rates: 0.0005, 0.001

For MNIST, we sweep across combinations of:
• 5 seeds: 0, 10, 20, 30, 40
• 4 batch sizes: 8, 16, 32, 64
• 2 learning rates: 0.0005, 0.001
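The full MNIST sweep enumerates every combination of the values above. A small sketch (ours) of how such a grid can be generated:

```python
from itertools import product

# Sketch (ours) enumerating the MNIST hyper-parameter sweep above:
# 5 seeds x 4 batch sizes x 2 learning rates = 40 runs in total.
seeds = [0, 10, 20, 30, 40]
batch_sizes = [8, 16, 32, 64]
learning_rates = [0.0005, 0.001]

runs = [
    {"seed": s, "batch_size": b, "lr": lr}
    for s, b, lr in product(seeds, batch_sizes, learning_rates)
]
assert len(runs) == 5 * 4 * 2  # 40 configurations
```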

CHARACTER THEORY OF THE DISENTANGLEMENT OF FINITE DISCRETE LINEAR TRANSFORMATIONS

C.3.1 THE SHIFT OPERATOR

$$\mathrm{Tr}(\rho(a, e_H)) = |H| + |H| \sum_{\chi \in X,\ \chi \neq \chi_{0,0}} \chi(a) \tag{71}$$

The sum of all the irreducible characters, for $a \neq e_A$, is 0 (Scott & Serre, 1996, Corollary 2): $\sum_{\chi \in X} \chi(a) = 0$. Hence,

$$\mathrm{Tr}(\rho(a, e_H)) = |H| + |H|(-1) = 0. \tag{72}$$
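The identity used above, that the irreducible characters of a finite abelian group sum to zero at any non-identity element, can be checked numerically for a cyclic group. This verification sketch is ours:

```python
import numpy as np

# Numerical check (ours) of the character identity used above: for the
# cyclic group Z/K, the irreducible characters are chi_j(a) = omega^(j*a)
# with omega = exp(2*pi*i/K), and their sum over j vanishes when a != e.
K = 7
omega = np.exp(2j * np.pi / K)
for a in range(1, K):  # every non-identity element
    total = sum(omega ** (j * a) for j in range(K))
    assert abs(total) < 1e-9
# At the identity element (a = 0) the sum is instead the group order K
identity_sum = sum(omega ** (j * 0) for j in range(K))
assert abs(identity_sum - K) < 1e-9
```

This is the step that forces the sum over non-trivial characters to equal −1, giving the cancellation $|H| + |H|(-1) = 0$.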

and $\omega_2 = e^{\frac{2i\pi}{K}}$. And $\psi_{h,j,N}$ is the repetition of the shift operator for the rotation $\frac{N}{|H|}$ times, as follows,
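The appearance of the root of unity $\omega_2 = e^{2i\pi/K}$ reflects the fact that the cyclic shift matrix is diagonalized by the Fourier basis, with the K-th roots of unity as eigenvalues. This can be verified directly (our sketch):

```python
import numpy as np

# Sketch (ours): the K x K cyclic shift (permutation) matrix is
# diagonalized by the discrete Fourier basis, so its eigenvalues are
# exactly the K-th roots of unity omega^k = exp(2*pi*i*k/K).
K = 6
S = np.roll(np.eye(K), 1, axis=0)          # cyclic shift matrix
eigvals = np.linalg.eigvals(S)
roots = np.exp(2j * np.pi * np.arange(K) / K)
# Every eigenvalue matches some K-th root of unity
for ev in eigvals:
    assert np.min(np.abs(roots - ev)) < 1e-8
```

This is why the complex (diagonalized) version of the shift operator acts by multiplication with powers of $\omega_2$.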

(with the block repeated $(|A| - 1)$ times), (91) which is of order $|A||H|$. Assuming the encoder takes care of the right-multiplication by $C$ and the scaling by $|H|$ (which it can learn to do), we do not write the scaling and right-multiplication by $C$.

Disentanglement Measure. We compute the LSBD disentanglement metric to quantify how well the shift operator captures the factors of variation compared to the disentangled operator (see Anonymous (2021)). Note that traditional disentanglement metrics are not appropriate, as they describe how well factors of variation are restricted to subspaces, in contrast to our proposed framework using distributed latent operators. LSBD, on the other hand, measures how well latent operators capture each factor of variation, quantifying disentanglement even for distributed operators. Using LSBD, we quantify the advantage of the shift operator (LSBD of 0.0020) over the disentangled operator (LSBD of 0.0106) for the models in Fig. 3.A and 3.B.

Figure 5: Single Rotated MNIST Digit Label: Latent traversals for VAE (left), β-VAE (middle), CCI-VAE (right) trained on a single rotated MNIST digit (10 rotations). Latent traversal spans the range [-6, 6] for each latent dimension.

Figure 6: Rotated MNIST: Latent traversals for VAE (left), β-VAE (middle), CCI-VAE (right) trained on all rotated MNIST digits (10 rotations). Latent traversal spans the range [-6, 6] for each latent dimension. Note in this case, the best validation model for VAE contained 30 latent dimensions whereas β-VAE and CCI-VAE contain 10.

Figure 7: Granular latent traversals for VAE, β-VAE, CCI-VAE trained on all rotated MNIST digits (10 rotations). Latent traversal spans the range [-6, 6] with step size 0.25, yielding 50 plots per latent dimension. Note in this case, the best validation model for VAE contained 30 latent dimensions whereas β-VAE and CCI-VAE contain 10.

Figure 8: Non-linear disentangled operator with latent rotations

Figure 10: Single MNIST Digit Label Translated along Y-Axis: Latent traversals for VAE (left), β-VAE (middle), CCI-VAE (right) trained on a single MNIST digit translated along the y-axis (10 translations). Latent traversal spans the range [-6, 6] for each latent dimension.

Figure 11: MNIST Translated along X-Axis: Latent traversals for VAE (left), β-VAE (middle), CCI-VAE (right) trained on all MNIST digits translated along the x-axis (10 translations). Latent traversal spans the range [-6, 6] for each latent dimension.

Figure 12: MNIST Translated along Y-Axis: Latent traversals for VAE (left), β-VAE (middle), CCI-VAE (right) trained on all MNIST digits translated along the y-axis (10 translations). Latent traversal spans the range [-6, 6] for each latent dimension.

Figure 13: Non-linear disentangled operator with latent translations

Figure 17: Pairs of test samples and their reconstructions for the stacked shift model with 5 translations in both x and y.

B.1 Dataset generation
B.2 Model architectures and training
B.3 Weakly supervised shift operator training procedure
B.4 Hyper-parameters
.6 Details of Section D.4

{0°, 36°, 72°, . . . , 324°} using scikit-image's rotation functionality (see van der Walt et al. (2014)).

At test time, we use the transformation with maximum score $\alpha_{i,k}$.

Stacked shift operator model. We use linear encoders and decoders with a 784-dimensional latent space to match the pixel size, since the stacked model uses the complex version of the shift operator. Intermediate layers $L_i$ are invertible linear layers of size 784 as well. Training is done with an L2 loss on reconstructed samples, as in the autoencoder with the shift latent operator (see Equation 10).

B.3 WEAKLY SUPERVISED SHIFT OPERATOR TRAINING PROCEDURE
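The test-time selection of the transformation with maximum score $\alpha_{i,k}$ mentioned above can be sketched as follows. This is our own illustration (the function name is ours, and the scores $\alpha$ are a placeholder input, standing in for the weights inferred during weakly supervised training):

```python
import numpy as np

# Sketch (ours) of the test-time selection step described above: among
# the K candidate transformations, apply the one whose inferred score
# alpha[k] is maximal. Here the transformations are cyclic shifts and
# `alpha` is a placeholder for the scores produced during training.
def apply_best_transformation(z, alpha, K):
    """Apply the cyclic shift with the highest score alpha_k to latent z."""
    assert len(alpha) == K
    k_star = int(np.argmax(alpha))
    return np.roll(z, k_star, axis=-1), k_star

z = np.arange(4.0)
alpha = np.array([0.1, 0.7, 0.15, 0.05])
z_out, k = apply_best_transformation(z, alpha, K=4)
assert k == 1
```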

Its character table is indexed by the group elements $(e_A, e_H), (a_1, e_H), (a_2, e_H), \ldots, (a_{|A|}, h_{|H|})$.

For the identity element $(e_A, e_H)$, $\rho(a, h)$ is the identity matrix, and its trace is $|A||H|$. When $h \neq e_H$, $P_h$ has no diagonal element and the trace is 0. When $a \neq e_A$, if $h \neq e_H$ we are in the previous case; if $h = e_H$, $P_h$ is the identity matrix and the representation is $\rho(a, e_H)$.

The character of $\psi$ is $\frac{N}{|A||H|}$ times the character of $\chi_\rho$. In the 2D translation case, we can use the theoretical form of the representation to gain insight on what the intermediate layers $L_i$ should be for the case of 2D translations in conjunction with rotations. Elements of this group are $(a, h)$

Test Mean Squared Error. Table 2 reports test MSE for the disentangled, supervised shift, and weakly supervised shift operators.

Code availability

All code is available at https://anonymous.4open.science/r/5b7e2cbb-54dc-4fde-bc2c-8f75d29fc15a/.

