EQUIVARIANT DISENTANGLED TRANSFORMATION FOR DOMAIN GENERALIZATION UNDER COMBINATION SHIFT

Abstract

Machine learning systems may encounter unexpected problems when the data distribution changes in the deployment environment. A major reason is that certain combinations of domains and labels are not observed during training but appear in the test environment. Although various invariance-based algorithms can be applied, we find that the performance gain is often marginal. To formally analyze this issue, we provide a unique algebraic formulation of the combination shift problem based on the concepts of homomorphism, equivariance, and a refined definition of disentanglement. The algebraic requirements naturally derive a simple yet effective method, referred to as equivariant disentangled transformation (EDT), which augments the data based on the algebraic structures of labels and makes the transformation satisfy the equivariance and disentanglement requirements. Experimental results demonstrate that invariance may be insufficient, and it is important to exploit the equivariance structure in the combination shift problem.

1. INTRODUCTION

The way we humans perceive the world is combinatorial: we tend to cognize a complex object or phenomenon as a combination of simpler factors of variation. Further, we have the ability to recognize, imagine, and process novel combinations of factors that we have never observed, so that we can survive in this rapidly changing world. Such ability is usually referred to as generalization. However, despite recent super-human performance on certain tasks, machine learning systems still lack this generalization ability, especially when only a limited subset of all combinations of factors is observable (Sagawa et al., 2020; Träuble et al., 2021; Goel et al., 2021; Wiles et al., 2022). In risk-sensitive applications such as driver-assistance systems (Alcorn et al., 2019; Volk et al., 2019) and computer-aided medical diagnosis (Castro et al., 2020; Bissoto et al., 2020), performing well only on a given subset of combinations but not on unobserved combinations may cause unexpected and catastrophic failures in a deployment environment.

Domain generalization (Wang et al., 2021a) is a problem where we need to deal with combinations of two factors: domains and labels. Recently, Gulrajani & Lopez-Paz (2021) questioned the progress of domain generalization research, claiming that several algorithms are not significantly superior to an empirical risk minimization (ERM) baseline. In addition to the model selection issue raised by Gulrajani & Lopez-Paz (2021), we conjecture that this is due to the ambitious goal of the usual domain generalization setting: generalizing to a completely unknown domain. Is it really possible to understand art if we have only seen photographs (Li et al., 2017)? Besides, the datasets used for evaluation usually have almost uniformly distributed domains and classes for training, which may be unrealistic to expect in real-world applications.

A more practical but still challenging learning problem is to learn all domains and labels, but only given a limited subset of the domain-label combinations for training. We refer to the usual setting of domain generalization as domain shift and this new setting as combination shift. An illustration is given in Fig. 1. Combination shift is more feasible because all domains are at least partially observable during training but is also more challenging because the distribution of labels can vary significantly across domains. The learning goal is to improve generalization with as few combinations as possible.

To solve the combination shift problem, a straightforward way is to apply the methods designed for domain shift. One approach is based on the idea that the prediction of labels should be invariant to the change of domains (Ganin et al., 2016; Sun & Saenko, 2016; Arjovsky et al., 2019; Creager et al., 2021). However, we find that the performance improvement is often marginal. Recent works (Wiles et al., 2022; Schott et al., 2022) also provided empirical evidence showing that invariance-based domain generalization methods offer limited improvement. On the other hand, they also showed that data augmentation and pre-training could be more effective.

To analyze this phenomenon, a unified perspective on different methods is desired. In this work, we provide an algebraic formulation for both invariance-based methods and data augmentation methods to investigate why invariance may be insufficient and how we should learn data augmentations. We also derive a simple yet effective method from the algebraic requirements, referred to as equivariant disentangled transformation (EDT), to demonstrate its usefulness. Our main contributions are as follows:

- We provide an algebraic formulation for the combination shift problem. We show that invariance is only half the story and it is important to exploit the equivariance structure.
- We present a refined definition of disentanglement beyond the one based on group action (Higgins et al., 2018), which may be interesting in its own right.
- Based on this algebraic formulation, we derive (a) what combinations are needed to effectively learn augmentations; (b) what augmentations are useful for improving generalization; and (c) what regularization can be derived from the algebraic constraints, which can serve as guidance for designing data augmentation methods.
- As a proof of concept, we demonstrate that learning data augmentations based on the algebraic structures of labels is a promising approach for the combination shift problem.

2. PROBLEM: DOMAIN GENERALIZATION UNDER COMBINATION SHIFT

Throughout the following sections, we study the problem of transforming a set of features $X$ to a set of targets $Y$ via a function $f : X \to Y$. Here, $X$ can be a set of images, texts, audio, or more structured data, while $Y$ is the space of outputs. Further, the target $Y$ may have multiple components. For example, $Y_1$ is the set of domain indices and $Y_2$ is the set of target labels. Ideally, all combinations of domains and target labels would be uniformly observable. However, in reality, this may not be the case because of selection bias, uncontrolled variables, or changing environments (Sagawa et al., 2020; Träuble et al., 2021).

Let $Y_i^{\mathrm{train}}$ and $Y_i^{\mathrm{test}}$ denote the sets of $i$-th components (the supports of the marginal distributions) observed in the training and test data. In the usual domain generalization setting (Wang et al., 2021a; Gulrajani & Lopez-Paz, 2021), the goal is to generalize to a completely unseen domain, i.e., domain shift. We have $Y_2^{\mathrm{train}} = Y_2^{\mathrm{test}}$ but $Y_1^{\mathrm{train}} \cap Y_1^{\mathrm{test}} = \emptyset$. However, it is unclear how different domains should relate and why a model can generalize without knowledge of the unknown domain (Wiles et al., 2022).

In this work, we focus on a more practical condition, called combination shift and illustrated in Fig. 1, where all test domains and labels can be observed separately during training, i.e., $Y_i^{\mathrm{test}} \subseteq Y_i^{\mathrm{train}}$ ($i = 1, 2$), but not all their combinations. An example is the spurious relationship problem (Torralba & Efros, 2011), such as the co-occurrence of objects and their backgrounds (Sagawa et al., 2020). In an extreme case, the combinations in the training and test sets could be disjoint, which requires completely out-of-distribution generalization. We survey related problems and approaches in more detail in Appendix D.
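The two conditions above can be checked mechanically on a finite split. The following is a minimal sketch, assuming each example is summarized by a (domain, label) pair; the function names and the toy colors are illustrative and not part of the original formulation.

```python
def marginal_support(combos, i):
    """Set of i-th components (0 = domain, 1 = label) seen in the combinations."""
    return {c[i] for c in combos}


def shift_type(train, test):
    """Classify a split of (domain, label) combinations (illustrative helper)."""
    train_dom, test_dom = marginal_support(train, 0), marginal_support(test, 0)
    train_lab, test_lab = marginal_support(train, 1), marginal_support(test, 1)
    if test_dom <= train_dom and test_lab <= train_lab:
        # Every test domain and label was observed separately during training.
        return "no shift" if test <= train else "combination shift"
    if train_lab == test_lab and train_dom.isdisjoint(test_dom):
        # Same labels, but the test domains were never observed.
        return "domain shift"
    return "other"


# Toy data: two training domains, one unseen combination, one unseen domain.
train = {("red", 0), ("red", 1), ("blue", 0)}
test_combination = {("blue", 1)}
test_domain = {("green", 0), ("green", 1)}

print(shift_type(train, test_combination))
print(shift_type(train, test_domain))
```

In the extreme case mentioned above, `train` and `test` share no combination at all while still sharing every domain and every label.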

3. FORMULATION: EQUIVARIANCE TO PRODUCT ALGEBRA ACTIONS

This section outlines the concepts needed to formally describe the problem and our proposed method. See Appendices B and C for a more detailed review and concrete examples. Those who are interested in the proposed method itself may skip this section and jump directly to Section 4.

Because the domain generalization problem involves at least two sets, domains and labels, it is natural to study their product structure, which is manifested as statistical independence or operational disentanglement. We focus on the latter and use the following definition:

Definition 1. Let $\{\mathcal{A}_i = (A_i, \{f_i^j : A_i^{n_j} \to A_i\}_{j \in J_i})\}_{i \in I}$ be algebras indexed by $i \in I$, each of which consists of the underlying set $A_i$ and a collection of operations $f_i^j$ of arity $n_j$ indexed by $j \in J_i$. Let $\mathcal{A} = \prod_{i \in I} \mathcal{A}_i$ be the product algebra whose underlying set is the product set $A = \prod_{i \in I} A_i$. Let $A$ act on sets $X$ and $Y$ via actions $\mathrm{act}_X : A \times X \to X$ and $\mathrm{act}_Y : A \times Y \to Y$. A transformation $f : X \to Y$ is disentangled if it is equivariant to $\mathrm{act}_X$ and $\mathrm{act}_Y$.

In short, a disentangled transformation is a function equivariant to actions by a product algebra. Note that a definition of disentangled representations based on product group action has been given in Higgins et al. (2018), which is the special case where $\{\mathcal{A}_i\}_{i \in I}$ are all groups. We emphasize that the concept of disentanglement is rooted in the product, not in the group or the action. We will unwind this definition and discuss the reasons for this extension as well as its limitations below.

3.1. HOMOMORPHISM AND EQUIVARIANCE

An algebra consists of one or more sets, a collection of operations on these sets, and a collection of universally quantified equational axioms that these operations need to satisfy. A homomorphism between algebras is a function between the underlying sets that preserves the algebraic structure.

A (left) action of a set $A$ on another set $X$ is simply a binary function $\mathrm{act} : A \times X \to X$. An action is equivalent to its exponential transpose or currying, a function $\mathrm{act} : A \to X^X$ from $A$ to the set of endofunctions $X^X$, also known as a representation of $A$ on $X$. An action is faithful if distinct elements of $A$ are mapped to distinct endofunctions, and trivial if all elements are mapped to the identity function $\mathrm{id}_X$.

Let $\mathrm{act}_X$ and $\mathrm{act}_Y$ be actions of $A$ on $X$ and $Y$, respectively. A function $f : X \to Y$ is equivariant to $\mathrm{act}_X$ and $\mathrm{act}_Y$ if

$$\forall a \in A, \quad f \circ \mathrm{act}_X(a) = \mathrm{act}_Y(a) \circ f. \tag{1}$$

Specifically, if $\mathrm{act}_Y$ is trivial, $f$ is called invariant to $\mathrm{act}_X$:

$$\forall a \in A, \quad f \circ \mathrm{act}_X(a) = f. \tag{2}$$

In summary, for an underlying set $X$, an algebra over $X$ describes the structure of the set $X$ itself, while an action or a representation of another algebraic structure $A$ on $X$ describes the structure of a subset of the endofunctions $X^X$. Homomorphisms and equivariant functions describe how the structures of the set and of the endofunctions are preserved, respectively. An equivariant map can also be considered as a homomorphism between two algebras whose operations are all unary and indexed by elements in the set $A$.

Note that equivariance alone, which captures only the structure of endofunctions, may not fully characterize a learning problem, because not all operations are unary. In some problems, it would be necessary to consider the preservation of the structure of other operations with the concept of algebra homomorphism. See also Appendices A to C.
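Eqs. (1) and (2) can be verified exhaustively when all sets are finite. The sketch below is only an illustration under assumed toy choices: $A = \{0, 1, 2\}$ acts on $X = \{0, \dots, 5\}$ by adding $2a$ modulo 6 and on $Y = \{0, 1, 2\}$ by adding $a$ modulo 3. None of these names come from the original text.

```python
A = range(3)  # toy index set of actions
X = range(6)  # toy feature set


def act_x(a, x):
    """Assumed action on X: shift by 2a modulo 6."""
    return (x + 2 * a) % 6


def act_y(a, y):
    """Assumed action on Y: shift by a modulo 3."""
    return (y + a) % 3


def trivial(a, y):
    """Trivial action: every element acts as the identity."""
    return y


def is_equivariant(f, act_src, act_tgt):
    """Check f(act_src(a, x)) == act_tgt(a, f(x)) for all a and x, as in Eq. (1)."""
    return all(f(act_src(a, x)) == act_tgt(a, f(x)) for a in A for x in X)


def half(x):
    return x // 2  # equivariant: tracks the part of x changed by the action


def parity(x):
    return x % 2  # invariant: ignores the part of x changed by the action


def residue(x):
    return x % 3  # neither equivariant to act_y nor invariant


print(is_equivariant(half, act_x, act_y))      # equivariance, Eq. (1)
print(is_equivariant(parity, act_x, trivial))  # invariance, Eq. (2)
print(is_equivariant(residue, act_x, act_y))
```

Invariance is tested with the same helper by passing the trivial action on the target side, which mirrors how Eq. (2) is obtained from Eq. (1).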

3.2. MONOID AND GROUP

Let us focus on the endofunctions $X^X$ for now. A way to describe the structure of a subset of endofunctions $X^X$ is to specify an algebra $A$ and an action of $A$ on $X$ preserving the algebraic structure. For example, an important operation is the function composition $\circ : X^X \times X^X \to X^X$, which can be described by how an action preserves a binary operation $\cdot : A \times A \to A$:

$$\forall a_1, a_2 \in A, \quad \mathrm{act}(a_1 \cdot a_2) = \mathrm{act}(a_1) \circ \mathrm{act}(a_2). \tag{3}$$

Since function composition is associative, $(A, \cdot)$ should be a semigroup. If we also want to include the identity function $\mathrm{id}_X$, then there should exist an identity element $e \in A$ (a nullary operation), which makes $(A, \cdot, e)$ a monoid.

Remark 1 (Group). If we only consider invertible endofunctions, then $A$ becomes a group (Higgins et al., 2018). However, only considering groups could be too restrictive. For example, periodic boundary conditions are required (Higgins et al., 2018; Caselles-Dupré et al., 2019; Quessard et al., 2020; Painter et al., 2020) for two-dimensional environments (e.g., dSprites (Matthey et al., 2017)), so that all the movements are invertible and have a cyclic group structure. This is only possible in synthetic environments such as games, not in the real world. Another example is the 3D Shapes dataset (Burgess & Kim, 2018), which consists of images of three-dimensional objects with different shapes, colors, orientations, and sizes. It is acceptable to model the shape, color, and orientation with permutation groups or cyclic groups. However, it is unreasonable to require that increasing the size of the largest object turns it into the smallest one. Quantities such as size, count, or price are naturally represented by the natural numbers, whose addition only has a monoid structure. Therefore, it is important to consider endofunctions in general, not only the invertible ones.

In this work, we mainly focus on monoid actions that only describe the function composition and identity function.
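To make the distinction between monoid and group actions concrete, the following sketch checks Eq. (3) and the identity requirement for an action of $(\mathbb{N}, +)$ on a finite range of sizes. The saturating behavior at `MAX_SIZE` is an assumption made here so that the set stays finite; it is not taken from the original text.

```python
MAX_SIZE = 5
SIZES = list(range(MAX_SIZE + 1))  # toy set X of object sizes
STEPS = list(range(4))             # a few elements of the monoid (N, +)


def grow(a, x):
    """Assumed action: increase size x by a, saturating at MAX_SIZE."""
    return min(x + a, MAX_SIZE)


def preserves_composition(act):
    """Eq. (3): acting by a1 + a2 equals acting by a2 and then by a1."""
    return all(
        act(a1 + a2, x) == act(a1, act(a2, x))
        for a1 in STEPS for a2 in STEPS for x in SIZES
    )


def preserves_identity(act):
    """The identity element 0 must act as the identity function."""
    return all(act(0, x) == x for x in SIZES)


def is_invertible(act, a):
    """On a finite set, an endofunction is invertible iff it is injective."""
    return len({act(a, x) for x in SIZES}) == len(SIZES)


print(preserves_composition(grow), preserves_identity(grow))
print(is_invertible(grow, 0), is_invertible(grow, 1))
```

The action satisfies both monoid requirements, yet growing by one unit is not invertible because the two largest sizes are mapped to the same value, so no group action can describe it.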

3.3. PRODUCT AND DISENTANGLEMENT

Finally, we are in a position to introduce the concept of disentanglement used in Definition 1. Let the product algebra $A = A_1 \times A_2$ act on $Y = Y_1 \times Y_2$ componentwise via an action $\mathrm{act}_Y$. We also assume that there is an action $\mathrm{act}_X$ of $A$ on $X$ that manipulates the features. After properly choosing the algebras and actions, the problem can then be formulated as finding a function equivariant to $\mathrm{act}_X$ and $\mathrm{act}_Y$.

Note that it is usually unnecessary and sometimes impossible to decompose $X$ into a product, i.e., a decomposition $X = X_1 \times X_2$ may not exist. For example, when $X$ is a set of objects with different shapes and colors, there does not exist an object without color. In this case, we can only equip the endofunctions $X^X$ with a product structure.

4. METHOD: EQUIVARIANT DISENTANGLED TRANSFORMATION

In this section, we present our proposed method based on an algebraic formulation of the combination shift problem. The basic idea is that if we choose the algebra properly, the algebraic requirements of the transformation naturally lead to useful architectures and regularization. In the following discussion, we assume that $\mathrm{act}_{Y_i}(a_i, y_i) = y_i'$ for some $a_i \in A_i$ and $y_i, y_i' \in Y_i$, $i = 1, 2$. We denote an instance whose labels are $y_1$ and $y_2$ by $x_{y_1, y_2}$.

4.1. MONOID STRUCTURE

First, we discuss how to choose the algebra that is suitable for our problem and derive the algebraic requirements. As discussed in Section 3.2, we only require that the algebras $A_1$ and $A_2$ are monoids, which means that there exist associative binary operations $\cdot_i : A_i \times A_i \to A_i$ and identity elements $e_i \in A_i$ for $i = 1, 2$. Then, according to Eq. (3) (action commutes with composition), we can derive that a product action $\mathrm{act}(a_1, a_2)$ on $X$ or $Y$ can be decomposed in two ways:

$$\mathrm{act}(a_1, a_2) = \mathrm{act}(a_1, e_2) \circ \mathrm{act}(e_1, a_2) = \mathrm{act}(e_1, a_2) \circ \mathrm{act}(a_1, e_2). \tag{4}$$

Or equivalently, the following diagram commutes (when the action is on $Y = Y_1 \times Y_2$):

$$
\begin{tikzcd}
(y_1, y_2) \arrow[r, "{\mathrm{act}(e_1, a_2)}"] \arrow[d, "{\mathrm{act}(a_1, e_2)}"'] \arrow[rd, "{\mathrm{act}(a_1, a_2)}" description] & (y_1, y_2') \arrow[d, "{\mathrm{act}(a_1, e_2)}"] \\
(y_1', y_2) \arrow[r, "{\mathrm{act}(e_1, a_2)}"'] & (y_1', y_2')
\end{tikzcd}
\tag{5}
$$

Thus, we can focus on the endofunctions of the form $\mathrm{act}(a_1, e_2)$ and $\mathrm{act}(e_1, a_2)$, whose compositions constitute all endofunctions of interest.

Remark 2 (Size). Denoting the cardinality of a set $A$ by $|A|$ and the image of a function $f$ on a set $X$ by $f[X]$, i.e., the set $\{f(x) \mid x \in X\}$, we can prove that $|\mathrm{act}([A_1], e_2)| \le |A_1|$, $|\mathrm{act}(e_1, [A_2])| \le |A_2|$, and $|\mathrm{act}([A_1], [A_2])| = |\mathrm{act}([A_1], e_2)| \times |\mathrm{act}(e_1, [A_2])| \le |A_1| \times |A_2|$. The equality holds when the actions are faithful. Thanks to the monoid structure and the product structure, we can reduce the number of endofunctions that we need to deal with from $|A_1| \times |A_2|$ to at most $|A_1| + |A_2|$.

We can further reduce the number if $A_1$ or $A_2$ has a smaller generating set. For example, although the monoid $(\mathbb{N}, +)$ of natural numbers under addition has infinitely many elements, it can be generated from the singleton $\{1\}$. In this case, we can focus on a single endofunction that increases the value by one unit, and all other endofunctions are compositions of this special endofunction.
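The decomposition in Eq. (4) and the counting argument in Remark 2 can be illustrated on a small finite example. The sketch below assumes a componentwise cyclic action with 5 values for the first factor and 10 for the second; these sizes and all helper names are illustrative choices, not the paper's implementation.

```python
from itertools import product

N1, N2 = 5, 10                 # assumed sizes of the two factors
A1, A2 = range(N1), range(N2)  # elements of the two monoids
E1, E2 = 0, 0                  # identity elements
Y = list(product(range(N1), range(N2)))


def act(a1, a2):
    """Curried product action: the endofunction on Y indexed by (a1, a2)."""
    return lambda y: ((y[0] + a1) % N1, (y[1] + a2) % N2)


def compose(g, h):
    return lambda y: g(h(y))


def table(g):
    """Graph of an endofunction on the finite set Y, used to compare functions."""
    return tuple(g(y) for y in Y)


# Eq. (4): both orders of the component actions give the product action.
eq4_holds = all(
    table(act(a1, a2))
    == table(compose(act(a1, E2), act(E1, a2)))
    == table(compose(act(E1, a2), act(a1, E2)))
    for a1 in A1 for a2 in A2
)

# Remark 2: number of distinct endofunctions overall vs. per component.
n_product = len({table(act(a1, a2)) for a1 in A1 for a2 in A2})
n_components = (len({table(act(a1, E2)) for a1 in A1})
                + len({table(act(E1, a2)) for a2 in A2}))


def power(g, k):
    """k-fold composition of g, with the identity function for k = 0."""
    result = lambda y: y
    for _ in range(k):
        result = compose(g, result)
    return result


# A single generator per factor reproduces every component endofunction.
generated = all(table(power(act(1, E2), k)) == table(act(k, E2)) for k in A1)

print(eq4_holds, n_product, n_components, generated)
```

Here the actions are faithful, so the product contains $5 \times 10$ distinct endofunctions while only $5 + 10$ component endofunctions (or two generators) need to be handled.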

4.2. EQUIVARIANCE REQUIREMENT

Then, consider a function $f : X \to Y$ that extracts only necessary information and preserves the algebraic structure of interest. We require it to be equivariant to two actions $\mathrm{act}_X$ and $\mathrm{act}_Y$. Recall that we can consider endofunctions only of the form $\mathrm{act}(a_1, e_2)$ and $\mathrm{act}(e_1, a_2)$. Based on Eq. (1) (action commutes with transformation), we can derive the algebraic requirement shown in the following commutative diagram:

$$
\begin{tikzcd}
x_{y_1, y_2} \arrow[r, "f"] \arrow[d, "{\mathrm{act}_X(a_1, e_2)}"'] & (y_1, y_2) \arrow[r, "p_1"] \arrow[rr, bend left, "p_2"] \arrow[d, "{\mathrm{act}_Y(a_1, e_2)}"] & y_1 \arrow[d, "{\mathrm{act}_{Y_1}(a_1)}"] & y_2 \arrow[d, "\mathrm{id}_{Y_2}"] \\
x_{y_1', y_2} \arrow[r, "f"'] & (y_1', y_2) \arrow[r, "p_1"'] \arrow[rr, bend right, "p_2"'] & y_1' & y_2
\end{tikzcd}
\tag{6}
$$

With the projections $p_1$ and $p_2$, we can see that this requirement results in the following four conditions: (a) $f_1 = p_1 \circ f$ is equivariant to $\mathrm{act}_X(-, e_2)$ and $\mathrm{act}_{Y_1}$; (b) $f_1$ is invariant to $\mathrm{act}_X(e_1, -)$; and dually, (c) $f_2 = p_2 \circ f$ is equivariant to $\mathrm{act}_X(e_1, -)$ and $\mathrm{act}_{Y_2}$; (d) $f_2$ is invariant to $\mathrm{act}_X(-, e_2)$. The symbol $-$ is a placeholder, into which arguments can be inserted.

[Figure 2: Equivariant Disentangled Transformation (EDT). All diagrams commute. Notation: component 1, $\mathrm{act}(a_1, e_2)$; component 2, $\mathrm{act}(e_1, a_2)$; augmentation, $\mathrm{act}(a_1, a_2)$; prediction, $f : X \to Y_1 \times Y_2$. Training: select suitable data pairs and learn component augmentations separately (Eq. (7)); regularize augmentations (Eqs. (8) and (9)), simultaneously or alternately; train a prediction model (Eq. (10)).]
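Conditions (a) to (d) can be checked one by one on a toy instance. In the sketch below, an instance is a two-character string such as "R3" (a color letter followed by a digit), and both factors are acted on cyclically; this encoding and every name in the code are assumptions made for illustration only.

```python
COLORS = "RGB"
N_DIGITS = 4
X = [c + str(d) for c in COLORS for d in range(N_DIGITS)]  # toy instances
A1, A2 = range(len(COLORS)), range(N_DIGITS)
E1, E2 = 0, 0


def act_x(a1, a2, x):
    """Assumed action on X: shift the color by a1 and the digit by a2."""
    color = COLORS[(COLORS.index(x[0]) + a1) % len(COLORS)]
    digit = (int(x[1]) + a2) % N_DIGITS
    return color + str(digit)


def act_y1(a1, y1):
    return (y1 + a1) % len(COLORS)


def act_y2(a2, y2):
    return (y2 + a2) % N_DIGITS


def f1(x):
    return COLORS.index(x[0])  # first component of f: the color index


def f2(x):
    return int(x[1])           # second component of f: the digit


cond_a = all(f1(act_x(a1, E2, x)) == act_y1(a1, f1(x)) for a1 in A1 for x in X)
cond_b = all(f1(act_x(E1, a2, x)) == f1(x) for a2 in A2 for x in X)
cond_c = all(f2(act_x(E1, a2, x)) == act_y2(a2, f2(x)) for a2 in A2 for x in X)
cond_d = all(f2(act_x(a1, E2, x)) == f2(x) for a1 in A1 for x in X)


def entangled(x):
    """A color predictor that leaks digit information."""
    return (COLORS.index(x[0]) + int(x[1])) % len(COLORS)


entangled_invariant = all(
    entangled(act_x(E1, a2, x)) == entangled(x) for a2 in A2 for x in X
)

print(cond_a, cond_b, cond_c, cond_d, entangled_invariant)
```

The last check shows how a predictor that mixes the two factors violates condition (b) even though it may fit any training set in which color and digit are correlated.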

4.3. ALGORITHM

Finally, we present a method directly derived from the algebraic requirements of the transformation, referred to as equivariant disentangled transformation (EDT) and illustrated in Fig. 2. Since the formulation above naturally generalizes to the case of multiple factors $Y = Y_1 \times \cdots \times Y_n$, we present the method in the general form.

Architecture

Since the output space $Y$ and the selected endofunctions on it are manually designed, the action $\mathrm{act}_Y$ on $Y$ is known and fixed. However, the action $\mathrm{act}_X$ on $X$ is usually not available. So our first goal is to learn a set of endofunctions $\alpha_i^j : X \to X$ representing $\mathrm{act}_X(e_1, \dots, a_i^j, \dots, e_n)$ indexed by $a_i^j \in A_i$, $i = 1, \dots, n$. These endofunctions can be considered as learned augmentations of data that only modify a single factor while keeping other factors fixed.

Second, we need to approximate the equivariant function $f$ using a trainable function $\phi : X \to Y$. Due to the universal property of the product, any function to a product arises from component functions $\phi_i : X \to Y_i$, $i = 1, \dots, n$. Therefore, we can train a model for each component and make these models satisfy the algebraic requirements specified below.

Data selection and augmentation

To train an augmentation $\alpha_i^j$, we need to collect pairs of instances $x$ and $x'$ such that $\mathrm{act}_X((e_1, \dots, a_i^j, \dots, e_n), x) = x'$, in other words, pairs of the form $x_{y_1, \dots, y_i, \dots, y_n}$ and $x_{y_1, \dots, y_i', \dots, y_n}$, where $\mathrm{act}_{Y_i}(a_i^j, y_i) = y_i'$. Then, denoting the set of all measures on $X$ by $\mathcal{P}X$, we can learn the augmentations by minimizing a statistical distance $d : \mathcal{P}X \times \mathcal{P}X \to \mathbb{R}_{\ge 0}$:

$$\ell_0(\alpha_i^j) = d(\alpha_i^j(x), x'). \tag{7}$$

With a slight abuse of notation, here $x$ and $x'$ also represent the empirical distributions. Choices of the statistical distance $d$ include the expected pairwise distance (Kingma & Welling, 2014), maximum mean discrepancy (Li et al., 2015; Dziugaite et al., 2015; Muandet et al., 2017), Jensen-Shannon divergence (Goodfellow et al., 2014), and Wasserstein metric (Arjovsky et al., 2017; Gulrajani et al., 2017; Miyato et al., 2018).

Remark 3 (Cycle consistency). It is possible to use all pairs of the form $x_{y_1, \dots, y_i, \dots, y_n}$ and $x_{y_1', \dots, y_i', \dots, y_n'}$, i.e., pairs of instances whose $i$-th labels correspond to the action, but whose other labels could be different. For example, if $A_i$ is a group, we can simultaneously train two models that are the inverse of each other with a cycle consistency constraint (Zhu et al., 2017; Goel et al., 2021). With this constraint, the learned augmentation is likely an approximation of $\mathrm{act}_X(e_1, \dots, a_i^j, \dots, e_n)$. However, it is still possible to obtain approximations of $\mathrm{act}_X(q_1, \dots, a_i^j, \dots, q_n)$ and its inverse, where $q_1, \dots, q_n$ are not necessarily the identity elements. This happens especially when there are more than two factors and not all combinations are available, which is demonstrated in Section 5.

The rich algebraic structure yields various constraints, which can be used as regularization for augmentations. Next, we present three regularization techniques derived from the basic product monoid structure. Note that we can introduce more constraints if we choose a richer algebra.

Regularization 1 (Compositionality of augmentations). According to Eq. (3), if $\alpha_i^{j \cdot k}$ is the approximated action of $a_i^j \cdot_i a_i^k$, we can simply define it as $\alpha_i^{j \cdot k} = \alpha_i^j \circ \alpha_i^k$. If we need to approximate it directly, the algebraic requirement leads to the following regularization:

$$\ell_1(\alpha_i^j, \alpha_i^k, \alpha_i^{j \cdot k}) = d(\alpha_i^j(\alpha_i^k(x)), \alpha_i^{j \cdot k}(x)). \tag{8}$$

A special case is when we know the composition is the identity function $\alpha_i^{j \cdot k} = \mathrm{id}_X$, i.e., $a_i^j$ is the inverse of $a_i^k$. This regularization is then equivalent to the "cycle consistency loss" in the CycleGAN model (Zhu et al., 2017) or the "isomorphism loss" in the GroupifiedVAE model (Yang et al., 2022). Another example is for modifying instances with real-valued targets. We could use multi-scale augmentations (e.g., $\alpha^1$ increases the value by 1 unit and $\alpha^5$ increases the value by 5 units) to reduce the cumulative error and gradient computation, and this regularization ensures that these augmentations are consistent with each other (e.g., $(\alpha^1)^5 \approx \alpha^5$).

Regularization 2 (Commutativity of augmentations). According to the diagram in Eq. (5), we can derive the following regularization for $i \ne j$, which means that the order of augmentations for different factors should not matter:

$$\ell_2(\alpha_i^k, \alpha_j^l) = d(\alpha_j^l(\alpha_i^k(x)), \alpha_i^k(\alpha_j^l(x))). \tag{9}$$

This can be interpreted as a commutativity requirement: the augmentations are grouped by the factors they modify, and augmentations from different groups should commute, but augmentations within the same group are usually not commutative. Again, we point out that this is only based on the product monoid structure and is not group-specific.

In Fig. 3, we illustrate a concrete example of compositionality and commutativity regularization based on the dSprites dataset (Matthey et al., 2017). The movement of position can be modeled via the additive monoid of natural numbers, while the change of shape can be formulated by a permutation/cyclic group. Suitable training example pairs can be used for learning augmentations directly ($\ell_0$), but such pairs may be limited. Algebraic regularization terms (e.g., $\ell_1$ and $\ell_2$) introduce inductive biases so that more relationships between training examples can be used as supervision.

Regularization 3 (Equivariance of transformation). According to the diagram in Eq. (6), we can derive the following equivariance and invariance regularization:

$$
\ell_3(\alpha_i^j, \phi_k) =
\begin{cases}
d(\phi_i(\alpha_i^j(x)), \mathrm{act}_{Y_i}(a_i^j, \phi_i(x))) & i = k, \\
d(\phi_k(\alpha_i^j(x)), \phi_k(x)) & i \ne k.
\end{cases}
\tag{10}
$$

It is a good strategy to learn the augmentations first and then use them to improve the transformation (Goel et al., 2021). However, we can see from this regularization that if the transformation is well trained, it can be used for improving the augmentations too.
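The three regularization terms can be written down compactly once a distance $d$ is fixed. The sketch below is a simplified illustration, not the training code of EDT: instances are plain 2-D tuples, the augmentations are hand-written functions instead of trained networks, and $d$ is taken to be the mean squared distance over a paired batch (one of several possible choices). All names are hypothetical.

```python
def d(xs, ys):
    """Mean squared distance between two paired batches of vectors."""
    total = sum(sum((u - v) ** 2 for u, v in zip(x, y)) for x, y in zip(xs, ys))
    return total / len(xs)


def apply(fn, xs):
    return [fn(x) for x in xs]


def loss_compositionality(alpha_j, alpha_k, alpha_jk, xs):
    """Eq. (8): applying alpha_k then alpha_j should match alpha_jk."""
    return d(apply(alpha_j, apply(alpha_k, xs)), apply(alpha_jk, xs))


def loss_commutativity(alpha, beta, xs):
    """Eq. (9): augmentations of different factors should commute."""
    return d(apply(beta, apply(alpha, xs)), apply(alpha, apply(beta, xs)))


def loss_equivariance(alpha, phi, act_y, xs):
    """Eq. (10), case i = k: the prediction follows the known action on Y_i."""
    return d(apply(phi, apply(alpha, xs)), apply(act_y, apply(phi, xs)))


def loss_invariance(alpha, phi, xs):
    """Eq. (10), case i != k: the prediction of another factor is unchanged."""
    return d(apply(phi, apply(alpha, xs)), apply(phi, xs))


# Hypothetical augmentations on 2-D instances (factor 0 and factor 1).
shift0 = lambda x: (x[0] + 1.0, x[1])        # changes factor 0 only
shift0_twice = lambda x: (x[0] + 2.0, x[1])  # direct model of two steps
shift1 = lambda x: (x[0], x[1] + 1.0)        # changes factor 1 only
leaky = lambda x: (x[0] + 1.0, 2.0 * x[1])   # also disturbs factor 1

phi0 = lambda x: (x[0],)          # predictor of factor 0
phi1 = lambda x: (x[1],)          # predictor of factor 1
act_y0 = lambda y: (y[0] + 1.0,)  # known action on Y_0 for one step

batch = [(0.0, 1.0), (2.0, 3.0)]
print(loss_compositionality(shift0, shift0, shift0_twice, batch))
print(loss_commutativity(shift0, shift1, batch), loss_commutativity(leaky, shift1, batch))
print(loss_equivariance(shift0, phi0, act_y0, batch))
print(loss_invariance(shift0, phi1, batch), loss_invariance(leaky, phi1, batch))
```

The well-behaved augmentations give zero for every term, while the `leaky` augmentation is penalized by both the commutativity and the invariance terms, which is the kind of signal these regularizers are meant to provide.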

5. EXPERIMENTS

As a proof of concept, we conduct experiments to support the following claims: Learning data augmentation is a promising approach for the combination shift problem. Cycle consistency may be insufficient, and additional constraints need to be considered. We should regularize the data augmentations so that they satisfy the algebraic requirements.

5.1. COMBINATION SHIFT

First, we experimentally demonstrate the insufficiency of the invariance-based approach and the potential of the augmentation-based approach for the combination shift problem.

Data

We colored the grayscale images from the MNIST dataset (LeCun et al., 1998) with 5 colors to create a semi-synthetic setting. Therefore, there are 5 domains (colors) and 10 classes (digits). We tested the methods in the most extreme case where the combinations of domains and classes of the training and test sets are disjoint. We selected five types of combinations as the training set: AXIS: all red digits and zeros of all colors; STEP: three digits for each color (shown in Fig. 6 in Appendix E); RAND-0.5/-0.7/-0.9: combinations randomly selected with a fixed ratio.

Method

In addition to an ERM baseline, we evaluated four invariance-based methods: IRM (Arjovsky et al., 2019), CORAL (Sun & Saenko, 2016), DANN (Ganin et al., 2016), and Fish (Shi et al., 2022); and two augmentation-based methods: Mixup (Zhang et al., 2018) and MixStyle (Zhou et al., 2021). Model architectures and hyperparameters are given in Appendix E.
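As an illustration of the AXIS setting described above, the sketch below enumerates the training and test combinations and checks that they form a disjoint combination-shift split. It assumes colors and digits are indexed from 0 and that index 0 stands for red; the STEP and RAND settings are not reproduced because their exact patterns are only given in the appendix.

```python
from itertools import product

N_COLORS, N_DIGITS = 5, 10
RED, ZERO = 0, 0  # assumption: color index 0 stands for red

all_combos = set(product(range(N_COLORS), range(N_DIGITS)))

# AXIS: all red digits and zeros of all colors.
axis_train = {(c, dgt) for (c, dgt) in all_combos if c == RED or dgt == ZERO}
axis_test = all_combos - axis_train


def covers_marginals(train, test):
    """Every test color and digit appears in some training combination."""
    return all({t[i] for t in test} <= {t[i] for t in train} for i in (0, 1))


print(len(axis_train), len(axis_test))
print(covers_marginals(axis_train, axis_test), axis_train.isdisjoint(axis_test))
```

Under these assumptions, only 14 of the 50 combinations are available for training, yet every color and every digit is observed at least once.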

Results

We can see from Table 1 that the ERM baseline and invariance-based methods perform poorly if only limited combinations of domains and classes are observable. The high variance indicates that the learned representation may still depend on the domains. As more combinations become observable in training, the differences in performance of all methods become less statistically significant. On the other hand, the augmentation-based methods usually provide higher performance improvements, although the mixup method may deteriorate performance depending on the setting. MixStyle performs consistently well, partially because it is specifically designed for image styles and thus lends itself well to this setting. With the algebraic constraints, EDT may capture the underlying distribution better and offer larger improvements.

5.2. DATA AUGMENTATION

Next, we discuss potential issues of the augmentation-based method (Goel et al., 2021) based on CycleGAN (Zhu et al., 2017), which matches the bidirectionally transformed distributions and regularizes the composition to be the identity function. There are two major issues with this approach. Firstly, it is designed only for two domains (e.g., female and male). Secondly and more importantly, as discussed in Remark 3, when there are more than two factors, cycle consistency alone may not guarantee the identity of non-transformed factors. The comparison on the 3D Shapes dataset (Burgess & Kim, 2018) is shown in Fig. 8 in Appendix E. We can observe that although the floor hue is transformed as desired and the reconstructed images are almost identical to the original ones, other factors such as the object/wall hues are also changed. In contrast, the algebraic requirements of EDT ensure the approximated augmentations are consistent with the desired actions.

5.3. ALGEBRAIC REGULARIZATION

Finally, we further compare heuristic and learned data augmentations and demonstrate the usefulness of algebraic regularization. We used the dSprites dataset (Matthey et al., 2017) and considered one factor as target label and the others as domains. Some methods are no longer applicable because of the continuous or even periodic values of factors and the multiplicatively increasing number of combinations. In Table 2 , we can see that MixStyle provides no significant performance gain in this setting because the heuristic augmentation does not match the underlying mechanism anymore (See also Fig. 10 in Appendix E). In Fig. 4 , we provide the results of an ablation study of the compositionality (ℓ 1 ) and commutativity (ℓ 2 ) regularization, showing that these regularization terms can reduce errors accumulated by compositions of augmentations and increase the number of supervision signals for learning augmentations, as illustrated in Fig. 3 .

6. CONCLUSION

Unlike the usual goal of generalizing to an unseen domain, we formulated the problem of combination shift as learning the knowledge of each factor (domains and labels) and generalizing to unseen combinations of factors, which makes deployment more feasible but training more challenging. We found that invariance-based methods may not work well in this setting, but augmentation-based methods usually excel. To formally analyze data augmentations and provide a guideline on augmentation design, we presented an algebraic formulation of the problem, which also leads to a refined definition of disentanglement. We demonstrated the usefulness of constraints derived from algebraic requirements, discussed potential issues of the existing augmentation method based on cycle consistency, and showed the importance of algebraic regularization. We then pointed out several promising research directions, such as incorporating algebra homomorphism and multi-sorted algebra to discuss a wider range of data augmentation operations. We hope that our algebraic formulation can be used to derive practical algorithms in applications and inspire further studies in this direction.

A LIMITATIONS AND FUTURE WORK

In this section, we discuss the limitations of this work and potential future work directions.

A.1 ALGEBRA HOMOMORPHISM

In this work, we only formulated data augmentations of the endofunction form $\alpha : X \to X$, i.e., modifications of only one input. However, there are other operations that do not fall into this form. We suggest using algebra homomorphisms to capture their relations. Here we give three examples:

Component combination

If the instance can be divided into multiple components, then we can recombine the components from multiple instances to generate a new instance: $\alpha : X^n \to X$. This is especially useful when there are many factors and the combinations in the training set are sparse.

Style transfer

Another example is when we cannot divide the instances but can combine their characteristics, such as style transfer (Gatys et al., 2016). An example is given in Fig. 5, where $\oplus : X \times X \to X$ is the binary operation that takes the "style" of the first image and the "content" of the second image, and $p_1 \times p_2 : Y \times Y \to Y$ is the corresponding operation in the label space $Y$. Then, we need to ensure that this binary operation is compatible with other augmentations. For example, if the object in the content image changes, the object in the generated image should change accordingly, while the generated image should not change regardless of the object in the style image.

Crowd counting

Counting the number of objects or people in an image is an example where we can exploit the structure of the natural numbers $\mathbb{N}$. In addition to the monotone function requirement induced by the total order of the natural numbers $\mathbb{N}$ (Liu et al., 2018), the free monoid structure $(\mathbb{N}, +)$ may induce other useful constraints. For example, the count of two parts should be the sum of the counts of each part. This requirement can be formulated as an algebra homomorphism.
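The additivity requirement for counting can be phrased as a monoid homomorphism test. The sketch below is a deliberately simplified illustration: an image region is represented as a list of detections and combining two regions is list concatenation, which is an assumption made here rather than a model of real images.

```python
def count_exact(region):
    """Hypothetical counter: one unit per detection in the region."""
    return len(region)


def count_capped(region, cap=2):
    """Hypothetical counter that saturates, as a biased model might."""
    return min(len(region), cap)


def is_additive(count, regions):
    """Homomorphism check from (lists, +, []) to (N, +, 0) on sample regions."""
    preserves_sum = all(
        count(r1 + r2) == count(r1) + count(r2) for r1 in regions for r2 in regions
    )
    return preserves_sum and count([]) == 0


regions = [[], ["p1"], ["p2", "p3"], ["p4", "p5", "p6"]]
print(is_additive(count_exact, regions))
print(is_additive(count_capped, regions))
```

A counter that saturates can still be monotone in the number of objects, so the additivity check captures a constraint that the order structure alone does not.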

A.2 STATISTICS AND APPROXIMATION

Similarly to previous work (Higgins et al., 2018) , we focused more on the algebraic aspect. We admit that there is still a gap between formulation and practice, because algebra only describes exact equality (=), but sometimes we are more interested in approximate equality (≈). It would be useful to define concepts such as commutativity over a metric space, so that we can analyze errors and introduce statistical tools, to get the best of both worlds.

A.3 STATE AND MULTI-SORTED ALGEBRA

Another issue is that we only considered endofunctions X → X so all data augmentations are applicable to all instances in a "stateless" way, which may not hold true in more complex situations. As a future work, we may consider general functions X i → X j and define which functions are composable and which are not. Also, it could be useful to discuss operations on multiple sets based on multi-sorted algebra, such as graphs (de Haan et al., 2020) .

B A BRIEF REVIEW OF ALGEBRA

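In this section, we review the algebraic concepts used in this work. We refer the readers to Dummit & Foote (1991) (abstract algebra), Bergman (2015) (universal algebra), and Awodey (2010) (category theory) for further readings.

B.1 ALGEBRA

Definition 2 (Algebra). A (single-sorted) algebra consists of a set $A$, called the underlying set of the algebra, a collection of operations $\{f_i : A^{n_i} \to A\}_{i \in I}$, and a collection of universally quantified equational axioms that those operations satisfy.

For example, elementary algebra is the study of the set of numbers with arithmetic operations such as addition, subtraction, multiplication, division, and exponentiation. Linear algebra is the study of the set of vectors with operations of vector addition and scalar multiplication. Some algebras with only one binary operation are listed below.

Definition 3 (Magma). A magma is a set $A$ equipped with a binary operation $\cdot : A \times A \to A$.

Definition 4 (Semigroup). A semigroup is a magma $(S, \cdot)$ whose binary operation is associative:

$$\forall s_1, s_2, s_3 \in S, \quad (s_1 \cdot s_2) \cdot s_3 = s_1 \cdot (s_2 \cdot s_3).$$

Definition 5 (Monoid). A monoid is a semigroup $(M, \cdot)$ that has an identity element $e \in M$ (a nullary operation $e : \mathbf{1} \to M$):

$$\forall m \in M, \quad e \cdot m = m \cdot e = m.$$

Definition 6 (Group). A group is a monoid $(G, \cdot, e)$, and every element has an inverse (a unary operation $(-)^{-1} : G \to G$):

$$\forall g \in G, \quad g \cdot g^{-1} = g^{-1} \cdot g = e.$$

Definition 7 (Abelian group). An Abelian group is a group $(G, \cdot, e, (-)^{-1})$ whose binary operation is commutative:

$$\forall g_1, g_2 \in G, \quad g_1 \cdot g_2 = g_2 \cdot g_1.$$

B.2 HOMOMORPHISM

Definition 8 (Homomorphism). A homomorphism between two algebras $(A, \{f_i^A\}_{i \in I})$ and $(B, \{f_i^B\}_{i \in I})$ of the same type is a function between the underlying sets $h : A \to B$ such that

$$\forall a_1, \dots, a_{n_i} \in A, \quad h(f_i^A(a_1, \dots, a_{n_i})) = f_i^B(h(a_1), \dots, h(a_{n_i}))$$

holds for all corresponding operations $f_i^A : A^{n_i} \to A$ and $f_i^B : B^{n_i} \to B$. In other words, the following diagram commutes for all $i \in I$:

$$
\begin{array}{ccc}
A^{n_i} & \xrightarrow{\ h^{n_i}\ } & B^{n_i} \\
{\scriptstyle f_i^A}\big\downarrow & & \big\downarrow{\scriptstyle f_i^B} \\
A & \xrightarrow{\ \ h\ \ } & B
\end{array}
\tag{16}
$$

An invertible homomorphism is called an isomorphism. For example, the $\exp$ and $\log$ functions form a pair of isomorphisms between $(\mathbb{R}, +)$ and $(\mathbb{R}_{+}, \times)$ because $\exp(x + y) = \exp(x) \times \exp(y)$ and $\log(x \times y) = \log(x) + \log(y)$.

Definition 12 (Representation). A representation of a set $A$ on a set $X$ is a function $\mathrm{act} : A \to X^X$.

Definition 13 (Algebra preservation). A representation $\mathrm{act} : A \to X^X$ preserves an algebra over $A$ if it is a homomorphism from $A$ to $X^X$.

A magma/semigroup action preserves composition (a binary operation):

$$\forall a_1, a_2 \in A, \ \forall x \in X, \quad \mathrm{act}(a_1 \cdot a_2, x) = \mathrm{act}(a_1, \mathrm{act}(a_2, x)). \tag{18}$$

$$
\begin{array}{ccc}
A \times A \times X & \xrightarrow{\ \mathrm{id}_A \times \mathrm{act}\ } & A \times X \\
{\scriptstyle \cdot\, \times\, \mathrm{id}_X}\big\downarrow & & \big\downarrow{\scriptstyle \mathrm{act}} \\
A \times X & \xrightarrow{\ \ \mathrm{act}\ \ } & X
\end{array}
$$

Or equivalently,

$$\forall a_1, a_2 \in A, \quad \mathrm{act}(a_1 \cdot a_2) = \mathrm{act}(a_1) \circ \mathrm{act}(a_2).$$

$$
\begin{array}{ccc}
A \times A & \xrightarrow{\ \mathrm{act} \times \mathrm{act}\ } & X^X \times X^X \\
{\scriptstyle \cdot}\big\downarrow & & \big\downarrow{\scriptstyle \circ} \\
A & \xrightarrow{\ \ \mathrm{act}\ \ } & X^X
\end{array}
$$

A monoid action preserves identity (a nullary operation):

$$\forall x \in X, \quad \mathrm{act}(e, x) = x,$$

or equivalently,

$$\mathrm{act}(e) = \mathrm{id}_X. \tag{23}$$

$$
\begin{array}{ccc}
\mathbf{1} & = & \mathbf{1} \\
{\scriptstyle e}\big\downarrow & & \big\downarrow{\scriptstyle \mathrm{id}_X} \\
A & \xrightarrow{\ \ \mathrm{act}\ \ } & X^X
\end{array}
$$

A group action preserves inverse (a unary operation):

$$\forall a \in A, \ \forall x \in X, \quad \mathrm{act}(a^{-1}, \mathrm{act}(a, x)) = x,$$

or equivalently,

$$\forall a \in A, \quad \mathrm{act}(a^{-1}) = \mathrm{act}(a)^{-1}. \tag{26}$$

$$
\begin{array}{ccc}
A & \xrightarrow{\ \ \mathrm{act}\ \ } & X^X \\
{\scriptstyle (-)^{-1}}\big\downarrow & & \big\downarrow{\scriptstyle (-)^{-1}} \\
A & \xrightarrow{\ \ \mathrm{act}\ \ } & X^X
\end{array}
\tag{27}
$$

B.5 EQUIVARIANCE

Definition 14 (Equivariance). A function $f : X \to Y$ is equivariant to two actions $\mathrm{act}_X : A \times X \to X$ and $\mathrm{act}_Y : A \times Y \to Y$ if

$$\forall a \in A, \ \forall x \in X, \quad f(\mathrm{act}_X(a, x)) = \mathrm{act}_Y(a, f(x)). \tag{28}$$

$$
\begin{array}{ccc}
A \times X & \xrightarrow{\ \mathrm{id}_A \times f\ } & A \times Y \\
{\scriptstyle \mathrm{act}_X}\big\downarrow & & \big\downarrow{\scriptstyle \mathrm{act}_Y} \\
X & \xrightarrow{\ \ f\ \ } & Y
\end{array}
\tag{29}
$$

Or equivalently,

$$\forall a \in A, \quad f \circ \mathrm{act}_X(a) = \mathrm{act}_Y(a) \circ f, \tag{30}$$

that is, the diagram

$$
\begin{array}{ccc}
X & \xrightarrow{\ \ f\ \ } & Y \\
{\scriptstyle \mathrm{act}_X(a)}\big\downarrow & & \big\downarrow{\scriptstyle \mathrm{act}_Y(a)} \\
X & \xrightarrow{\ \ f\ \ } & Y
\end{array}
$$

commutes for all $a \in A$. This justifies that an equivariant map is a homomorphism between two algebras whose operations are all unary and indexed by elements in the set $A$.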
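The definitions of homomorphism, action, and equivariance can be checked numerically on small examples. The sketch below is illustrative only: the cyclic group of order 4 acting on tuples by cyclic shifts, and the "index of the maximum" map, are examples chosen here for convenience and do not appear in the paper.

```python
import math
from itertools import permutations


def exp_is_homomorphism(samples):
    """exp : (R, +) -> (R_+, x) satisfies exp(x + y) == exp(x) * exp(y)."""
    return all(
        math.isclose(math.exp(x + y), math.exp(x) * math.exp(y), rel_tol=1e-12)
        for x in samples
        for y in samples
    )


def log_is_homomorphism(samples):
    """log : (R_+, x) -> (R, +) satisfies log(u * v) == log(u) + log(v)."""
    return all(
        math.isclose(math.log(u * v), math.log(u) + math.log(v),
                     rel_tol=1e-9, abs_tol=1e-12)
        for u in samples
        for v in samples
    )


N = 4  # the cyclic group Z_4 = {0, 1, 2, 3} under addition mod 4


def act_x(a, x):
    """Action on tuples of length N: cyclic shift to the right by a."""
    return tuple(x[(i - a) % N] for i in range(N))


def act_y(a, y):
    """Action on positions {0, ..., N-1}: shift the index by a."""
    return (y + a) % N


def argmax(x):
    """f : X -> Y, the position of the (unique) largest entry."""
    return max(range(N), key=lambda i: x[i])


XS = list(permutations(range(N)))  # tuples with a unique maximum
GROUP = range(N)

preserves_composition = all(
    act_x((a + b) % N, x) == act_x(a, act_x(b, x))
    for a in GROUP for b in GROUP for x in XS
)
preserves_identity = all(act_x(0, x) == x for x in XS)
preserves_inverse = all(act_x((-a) % N, act_x(a, x)) == x
                        for a in GROUP for x in XS)
is_equivariant = all(argmax(act_x(a, x)) == act_y(a, argmax(x))
                     for a in GROUP for x in XS)
```

The three `preserves_*` flags correspond to the composition, identity, and inverse requirements of a group action, and `is_equivariant` corresponds to Eq. (28) for this particular pair of actions.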

B.6 PRODUCT

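Definition 15 (Product). A product $A \times B$ of two objects $A$ and $B$ and the corresponding projections $p_1 : A \times B \to A$ and $p_2 : A \times B \to B$ satisfy that for any object $C$ and morphisms $f_1 : C \to A$ and $f_2 : C \to B$, there is a unique morphism $f : C \to A \times B$ such that $f_1 = p_1 \circ f$ and $f_2 = p_2 \circ f$, as indicated in

$$
\begin{array}{ccccc}
 & & C & & \\
 & {\scriptstyle f_1}\swarrow & \big\downarrow{\scriptstyle f} & \searrow{\scriptstyle f_2} & \\
A & \xleftarrow{\ p_1\ } & A \times B & \xrightarrow{\ p_2\ } & B
\end{array}
$$

Consider two morphisms $f : C \to A$ and $g : D \to B$. Based on the universal property of $A \times B$, there exists a unique morphism $f \times g : C \times D \to A \times B$ such that the following diagram commutes:

$$
\begin{array}{ccccc}
C & \xleftarrow{\ p_1\ } & C \times D & \xrightarrow{\ p_2\ } & D \\
{\scriptstyle f}\big\downarrow & & \big\downarrow{\scriptstyle f \times g} & & \big\downarrow{\scriptstyle g} \\
A & \xleftarrow{\ p_1\ } & A \times B & \xrightarrow{\ p_2\ } & B
\end{array}
$$

For example, let both $C$ and $D$ be $Y_1 \times Y_2$, $f = p_1$, and $g = p_2$. Then, the following diagram represents "recombination of components":

$$
\begin{array}{ccccc}
Y_1 \times Y_2 & \xleftarrow{\ p_1\ } & (Y_1 \times Y_2) \times (Y_1 \times Y_2) & \xrightarrow{\ p_2\ } & Y_1 \times Y_2 \\
{\scriptstyle p_1}\big\downarrow & & \big\downarrow{\scriptstyle p_1 \times p_2} & & \big\downarrow{\scriptstyle p_2} \\
Y_1 & \xleftarrow{\ p_1\ } & Y_1 \times Y_2 & \xrightarrow{\ p_2\ } & Y_2
\end{array}
$$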
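For plain sets, the product and the induced morphisms can be written directly with pairs. The following sketch uses hypothetical helper names (`pairing`, `times`, `recombine`) and toy labels (a color and a digit) that are introduced here for illustration only.

```python
def p1(pair):
    """First projection p1 : A x B -> A."""
    return pair[0]


def p2(pair):
    """Second projection p2 : A x B -> B."""
    return pair[1]


def pairing(f1, f2):
    """The unique f : C -> A x B with p1 . f == f1 and p2 . f == f2."""
    return lambda c: (f1(c), f2(c))


def times(f, g):
    """f x g : C x D -> A x B, applied componentwise."""
    return lambda cd: (f(cd[0]), g(cd[1]))


# "Recombination of components": take C = D = Y1 x Y2, f = p1, g = p2.
recombine = times(p1, p2)

# Toy labels in Y1 x Y2 = colors x digits.
y_a = ("red", 3)
y_b = ("blue", 7)
y_new = recombine((y_a, y_b))  # color of y_a, digit of y_b
```

Here `recombine` takes the first component of its first argument and the second component of its second argument, which is exactly the operation $p_1 \times p_2$ in the last diagram.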

C ALGEBRA IN SUPERVISED LEARNING

In this section, we look ahead to the application of algebraic theory to supervised learning.

C.1 SUPERVISED LEARNING

Let $X$ be the set of inputs and $Y$ the set of outputs. In supervised learning, we want to find a function $f : X \to Y$ that satisfies some properties. Generally, this is achieved by collecting a set of pairs $\{(x_i, y_i) \in X \times Y\}_{i \in I}$ as training examples and defining a measure of "goodness" of functions. For example, for a pair $(x_i, y_i)$, we expect $f$ to map $x_i$ to $y_i$. Let us consider this procedure from an algebraic perspective.

Nullary operation First, we point out that identifying an element $x$ from a set $X$ can be considered as a nullary operation $x : \mathbf{1} \to X$, and evaluating a function $f : X \to Y$ at an element $x$ is simply function composition $f \circ x : \mathbf{1} \to Y$. Then, requiring $f(x) = y$ is equivalent to saying that $f$ should be an algebra homomorphism:

$$
\begin{array}{ccc}
\mathbf{1} & = & \mathbf{1} \\
{\scriptstyle x}\big\downarrow & & \big\downarrow{\scriptstyle y} \\
X & \xrightarrow{\ \ f\ \ } & Y
\end{array}
$$

Therefore, a function that can predict all training examples perfectly is simply a homomorphism from the algebra $(X, \{x_i : \mathbf{1} \to X\}_{i \in I})$ to the algebra $(Y, \{y_i : \mathbf{1} \to Y\}_{i \in I})$, where all operations are nullary. This perspective frames direct supervision as an algebraic requirement. However, it is still not practically useful: the training examples are usually finite and cannot enumerate the set of inputs, whereas machine learning is needed precisely when the inputs in a test environment are not exactly the same as the inputs for training. Two things are missing: first, we need an assumption to relate training and test data; second, we need not only "yes or no" but also "how much". As discussed in Appendix A, pure algebra only deals with exact equality, so integrating algebra and statistical learning is an important research direction.

Unary operation Many works introducing algebraic theory, especially group theory, into machine learning, including this work, have focused on unary operations and their relations. A unary operation or an endofunction $\alpha_X : X \to X$ transforms a set of states to itself. A homomorphism between $(X, \alpha_X)$ and $(Y, \alpha_Y)$ just relates these unary operations:

$$
\begin{array}{ccc}
X & \xrightarrow{\ \ f\ \ } & Y \\
{\scriptstyle \alpha_X}\big\downarrow & & \big\downarrow{\scriptstyle \alpha_Y} \\
X & \xrightarrow{\ \ f\ \ } & Y
\end{array}
$$

Usually, there are multiple unary operations, which themselves form an algebra. Magma/semigroup describes composition, monoid describes identity, and group describes invertibility. An invertible unary operation/endofunction is also called a symmetry. The structure of these unary operations can be described by an action preserving the algebraic structure, which was extensively used in this work.

Binary and n-ary operations As also covered in Appendix A, not all operations are unary operations. It would be useful to include $n$-ary operations and their relations as algebraic requirements for $f$:

$$
\begin{array}{ccc}
X^n & \xrightarrow{\ f^n\ } & Y^n \\
{\scriptstyle \alpha_X}\big\downarrow & & \big\downarrow{\scriptstyle \alpha_Y} \\
X & \xrightarrow{\ \ f\ \ } & Y
\end{array}
$$

Specifically, operad theory could be useful for analyzing a collection of finitary operations obeying equational axioms. Moreover, future research could continue to explore $n$-ary functions from an algebraic perspective. For example, $f : X \to Y$ and $g : A \to B$ may relate two binary functions $\alpha_X : X \times X \to A$ and $\alpha_Y : Y \times Y \to B$ in the following sense:

$$
\begin{array}{ccc}
X \times X & \xrightarrow{\ f \times f\ } & Y \times Y \\
{\scriptstyle \alpha_X}\big\downarrow & & \big\downarrow{\scriptstyle \alpha_Y} \\
A & \xrightarrow{\ \ g\ \ } & B
\end{array}
$$

which could be used for formulating relation-preserving functions, such as equality (learning from similarity) and order (learning to rank), or metrics, such as isometries, contractions, and Lipschitz continuous functions.
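The last square, which relates two binary functions through $f \times f$ and $g$, can be tested on finite samples. The sketch below is a toy illustration with hypothetical function names; it checks order preservation (with $g$ the identity on truth values) and distance preservation up to a rescaling $g$.

```python
from itertools import product


def relates(f, g, alpha_x, alpha_y, xs):
    """Check g(alpha_x(a, b)) == alpha_y(f(a), f(b)) for all pairs in xs."""
    return all(
        g(alpha_x(a, b)) == alpha_y(f(a), f(b))
        for a, b in product(xs, repeat=2)
    )


def leq(a, b):
    """Order relation as a binary function into {False, True}."""
    return a <= b


def dist(a, b):
    """Distance on the integers as a binary function into N."""
    return abs(a - b)


def identity(r):
    return r


def double(r):
    return 2 * r


def increasing(x):
    return 2 * x + 1


def decreasing(x):
    return -x


def shift(x):
    return x + 5


XS = range(-3, 4)

order_preserved = relates(increasing, identity, leq, leq, XS)
order_broken = relates(decreasing, identity, leq, leq, XS)
isometry = relates(shift, identity, dist, dist, XS)
scaled_metric = relates(increasing, double, dist, dist, XS)
```

With these choices, the increasing map preserves order, the decreasing map does not, the shift is an isometry, and the map $x \mapsto 2x + 1$ doubles every distance, which is captured by taking $g$ to be `double`.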

C.2 BINARY CLASSIFICATION

Now, let us consider a concrete example, binary classification. Let $\mathbf{n}$ be a set whose cardinality is $n$, $\mathbf{1}$ a singleton (a set of a single element), $+$ the disjoint union of sets (union of labeled/indexed elements), and $\cong$ the isomorphism between two sets (a bijective function). In binary classification, $Y$ is simply a set of two elements $\mathbf{2} \cong \mathbf{1} + \mathbf{1}$. In other words, we only have a space with the concept of sameness or equality and no other operations. The learning process is to find a function $f : X \to \mathbf{2}$, which decomposes into a pair of functions $f = f_1 + f_2$, where $f_i : X_i \to \mathbf{1}$ $(i = 1, 2)$. This results in a decomposition of $X$ into two sets $X \cong X_1 + X_2$, i.e., classification of elements in $X$.

Let us examine the unary operations (endofunctions) on $\mathbf{2}$. There are in total four endofunctions on $\mathbf{2}$, which form a monoid under composition. Only two of them are invertible: the identity and the one that swaps the two elements, which constitute a representation of the symmetric group $S_2$ on $\mathbf{2}$.
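The count of endofunctions on a two-element set is easy to verify by enumeration. In the sketch below, an endofunction is encoded (as an assumption of this example) by the tuple of its values `(f(0), f(1))`; the helper `fibers` is likewise a hypothetical name illustrating the decomposition $X \cong X_1 + X_2$ induced by a classifier.

```python
from itertools import product

TWO = (0, 1)

# Each endofunction f : 2 -> 2 is stored as the tuple (f(0), f(1)).
ENDOS = list(product(TWO, repeat=2))
IDENTITY = (0, 1)
SWAP = (1, 0)


def compose(f, g):
    """Return f . g, i.e. x |-> f(g(x))."""
    return tuple(f[g[x]] for x in TWO)


is_closed = all(compose(f, g) in ENDOS for f in ENDOS for g in ENDOS)
is_associative = all(
    compose(compose(f, g), h) == compose(f, compose(g, h))
    for f in ENDOS for g in ENDOS for h in ENDOS
)
has_identity = all(
    compose(IDENTITY, f) == f == compose(f, IDENTITY) for f in ENDOS
)

# Invertible endofunctions are exactly the bijections.
INVERTIBLE = [f for f in ENDOS if sorted(f) == list(TWO)]


def fibers(classifier, xs):
    """Split xs into (X_1, X_2) according to a classifier X -> {0, 1}."""
    return ([x for x in xs if classifier(x) == 0],
            [x for x in xs if classifier(x) == 1])
```

The two constant maps are the non-invertible elements of the monoid, and the identity together with the swap forms the group $S_2$.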

C.3 REGRESSION

To formulate regression, we usually let $Y$ be the set of real numbers $\mathbb{R}$. However, from an algebraic perspective, many operations of real numbers are not needed in the learning process. For example, we rarely consider the product or ratio of two target values. On the other hand, the order, scale, and zero point are of central interest. Thus, if there exist a minimal value and a unit interval of targets, we can isomorphically transform the target and let $Y$ be the set of natural numbers $\mathbb{N}$. If we cannot determine a minimal value but we are still able to quantize the target values, we can take a step further and consider the algebra of integers $\mathbb{Z}$ and the negation operation.

There are two important operations of natural numbers: $0 : \mathbf{1} \to \mathbb{N}$ as a nullary operation that identifies the number zero, and the successor function $S : \mathbb{N} \to \mathbb{N}$ as a unary operation that maps a number $n$ to the next number $S(n)$. Let $x_n \in X$ be an instance whose label is $n$. If $X$ also has the structure of natural numbers, then there exist an element $x_0$ that has the minimal value and a unary operation $T : X \to X$ that takes an instance as input and outputs another instance whose label is one unit higher. The requirement of $f$ being a homomorphism means that the instance with the minimal value is mapped to $0$, i.e., $f(x_0) = 0$, and the operation $T$ corresponds to the successor function $S$ in the following way:

$$
\begin{array}{ccc}
x_n & \overset{f}{\longmapsto} & n \\
{\scriptstyle T}\big\downarrow & & \big\downarrow{\scriptstyle S} \\
x_{S(n)} & \overset{f}{\longmapsto} & S(n)
\end{array}
\tag{40}
$$

Given the number zero $0$ and the successor function $S$ of natural numbers $\mathbb{N}$, we can define a commutative monoid with $0$ as the identity element and a monoid operation $+$ defined recursively by $a + 0 := a$ and $a + S(b) := S(a + b)$. This is the free monoid $(\mathbb{N}, +)$ generated from a generator $\{1 := S(0)\}$. Then, we can consider the case when the free monoid $(\mathbb{N}, +)$ acts on $X$ and on $\mathbb{N}$ itself. A function equivariant to free monoid actions is a function $f : X \to \mathbb{N}$ such that the following diagram commutes:

$$
\begin{array}{ccc}
(m, x_n) & \overset{\mathrm{id}_{\mathbb{N}} \times f}{\longmapsto} & (m, n) \\
{\scriptstyle \mathrm{act}_X}\big\downarrow & & \big\downarrow{\scriptstyle +} \\
x_{n+m} & \overset{f}{\longmapsto} & n + m
\end{array}
\tag{41}
$$

Note that when $m$ is the generator $1$, this diagram reduces to Eq. (40). The crowd counting example in Appendix A can be illustrated in the following diagram:

$$
\begin{array}{ccc}
(x_m, x_n) & \overset{f \times f}{\longmapsto} & (m, n) \\
{\scriptstyle \oplus}\big\downarrow & & \big\downarrow{\scriptstyle +} \\
x_{m+n} & \overset{f}{\longmapsto} & m + n
\end{array}
$$

which means that the count of two parts should be the sum of the counts of each part. This requirement is formulated as a homomorphism of binary operations $\oplus : X \times X \to X$ and $+ : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$.
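The successor-based construction and the two counting requirements can be illustrated with a toy model in which an "image" is simply a tuple of object names. This representation, and the names `act_x` and `merge`, are assumptions made for this sketch; they are not the models used in the experiments.

```python
def successor(n):
    """S : N -> N."""
    return n + 1


def add(a, b):
    """Monoid operation on N defined recursively from 0 and S:
    a + 0 := a and a + S(b) := S(a + b)."""
    return a if b == 0 else successor(add(a, b - 1))


def count(x):
    """f : X -> N, the number of objects in a toy 'image'."""
    return len(x)


def act_x(m, x):
    """Action of (N, +) on X: add m extra objects to the image."""
    return x + ("object",) * m


def merge(x1, x2):
    """Binary operation (+) on X: put two image parts together."""
    return x1 + x2


IMAGES = [(), ("a",), ("a", "b"), ("a", "b", "c")]

# Equivariance to the free monoid action, Eq. (41).
is_equivariant = all(
    count(act_x(m, x)) == add(count(x), m)
    for m in range(5) for x in IMAGES
)

# Homomorphism of binary operations: count(x1 (+) x2) == count(x1) + count(x2).
is_additive = all(
    count(merge(x1, x2)) == add(count(x1), count(x2))
    for x1 in IMAGES for x2 in IMAGES
)
```

For a learned counting model the two conditions would hold only approximately, so in practice they would be used as soft constraints rather than exact tests.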

C.4 DISCUSSION

As discussed in Section 3.1, equivariance alone may not fully characterize a learning problem. For example, in binary classification, if we only require the transformation $f : X \to Y$ to be equivariant to actions by the symmetric group $S_2$, then $f$ is only unique up to permutation. Similarly, in regression, $f$ is only unique up to a shift by a natural number or an integer. This may not cause a problem, but we still need some information to determine the optimal solution, for example, the zero point (a nullary operation) in regression.

Similarly to Higgins et al. (2018), we focused on the algebraic aspect of disentanglement. It is worth noting that this formulation is not yet compatible with some definitions of disentanglement based on statistical independence, probability metrics, or causal mechanisms (Higgins et al., 2017; Suter et al., 2019; Locatello et al., 2019; Shu et al., 2020; Tokui & Sato, 2022). In statistical learning, we usually want to find a conditional distribution $f : X \to PY$, where $PY$ denotes all probability measures on $Y$, instead of merely a deterministic transformation $f : X \to Y$. To extend this framework and fully capture the statistical aspect of disentanglement, we need to further incorporate the structure of probability measures, which is left for future work.
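The "unique up to permutation" remark can be made concrete with a small example. In the sketch below, the input set, the sign-flip action on inputs, and the anchor example are all hypothetical choices made for illustration.

```python
XS = (-2, -1, 1, 2)   # toy inputs
S2 = (0, 1)           # 0 = identity element, 1 = swap


def act_x(a, x):
    """S2 acts on inputs by flipping the sign."""
    return -x if a == 1 else x


def act_y(a, y):
    """S2 acts on labels {0, 1} by swapping them."""
    return 1 - y if a == 1 else y


def f(x):
    """A classifier: label 1 for positive inputs."""
    return int(x > 0)


def g(x):
    """The same classifier with the two labels permuted."""
    return 1 - f(x)


def is_equivariant(h):
    return all(h(act_x(a, x)) == act_y(a, h(x)) for a in S2 for x in XS)


def fits_anchor(h, anchor=(1, 1)):
    """A single labeled example (a nullary operation) fixes the ambiguity."""
    x, y = anchor
    return h(x) == y
```

Both `f` and `g` satisfy the equivariance requirement, so equivariance alone cannot distinguish them; the single labeled example `(1, 1)` is satisfied only by `f`.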

D LITERATURE REVIEW: DISTRIBUTION SHIFT

In this section, we review related work on distribution shift in a broader sense. The difference between the training and test data in supervised learning is an important problem and has been studied for years. The distribution shift problem (Quiñonero-Candela et al., 2008) refers to the general case where the training and test data are drawn from related but different distributions: $p_{\mathrm{train}}(X, Y) \neq p_{\mathrm{test}}(X, Y)$. The difference can be measured by some distribution divergence (Ben-David et al., 2010; Albuquerque et al., 2019). Distribution shift can be subcategorized by the distribution assumptions:

Covariate shift: $p_{\mathrm{train}}(Y \mid X) = p_{\mathrm{test}}(Y \mid X)$ (Sugiyama …)

… (Quiñonero-Candela et al., 2008), implying that the tasks are indexed by a categorical (Blanchard et al., 2011) or continuous (Wang et al., 2020) domain variable. All three types of distribution shift mentioned above may happen when there are multiple domains. To solve this problem, domain-invariant representation learning (Ganin et al., 2016; Sun & Saenko, 2016; Arjovsky et al., 2019; Creager et al., 2021; Shi et al., 2022) has been widely used, which aims to extract features invariant to domain change. In this work, we showed the limitations of invariance-based methods in the combination shift problem.

A closely related concept is disentanglement (Bengio et al., 2013), which can be defined via statistical independence (Suter et al., 2019; Locatello et al., 2019; Shu et al., 2020; Tokui & Sato, 2022) or product group action (Higgins et al., 2018; Caselles-Dupré et al., 2019; Quessard et al., 2020; Painter et al., 2020; Wang et al., 2021b; Yang et al., 2022). Our work follows the latter direction. We provided a refined definition of disentanglement based on algebra in Definition 1, which can be seen as an extension of Higgins et al. (2018). We also discussed potential directions for further extension in Appendix A, including algebra homomorphism, statistics, non-endofunctions, and multi-sorted algebra.
Various methods have been developed based on the concept of disentanglement. One approach is based on variants of the variational autoencoder (VAE) (Kingma & Welling, 2014; Higgins et al., 2017). Another promising approach is based on either heuristic (Zhang et al., 2018; Shorten & Khoshgoftaar, 2019; Chen et al., 2020; Zhou et al., 2021) or learned (Ratner et al., 2017; Volpi et al., 2018; Wang et al., 2021c; Goel et al., 2021) data augmentation. Learning data augmentation is the central interest of our work.

AXIS: all red digits and zeros of all colors

STEP: three digits for each color, shown in Fig. 6

RAND-0.5/-0.7/-0.9: combinations randomly selected with a fixed ratio 0.5, 0.7, or 0.9. All domains and classes were ensured to appear at least once.

(middle rows) augmented data (red to orange) and reconstructed data (orange to red) transformed by a CycleGAN model (Zhu et al., 2017; Goel et al., 2021); (bottom row) augmented data transformed by EDT, which satisfies the algebraic constraints.

E.2 3D SHAPES

Data The 3D Shapes dataset contains images of three-dimensional objects with 6 factors (floor hue, wall hue, object hue, scale, shape, and orientation), whose dimensions are 10, 10, 10, 8, 4, and 15. The size of the dataset is 480 000.

Data selection Since the goal is to improve generalization using as few combinations as possible, we used a set of properly selected combinations of factors. Concretely, we first randomly select an instance, and then randomly change one factor at a time. An example of a path of transformations is shown in Fig. 7. We used 10 random paths, so there are at most 570 training examples (only around 0.1% of all data).
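The data-selection procedure can be sketched as a random walk over factor combinations. The sketch below rests on two assumptions that are not stated in the text: each path is taken to contain 57 combinations (chosen only because 10 × 57 = 570 matches the stated upper bound, and 57 happens to equal the sum of the factor dimensions), and at each step a factor is picked uniformly at random. The function name `random_path` is hypothetical.

```python
import math
import random

# Factor dimensions of 3D Shapes as listed in the text.
FACTOR_SIZES = {
    "floor_hue": 10, "wall_hue": 10, "object_hue": 10,
    "scale": 8, "shape": 4, "orientation": 15,
}
TOTAL = math.prod(FACTOR_SIZES.values())  # number of all combinations

# ASSUMPTION: combinations per path; not specified in the text.
PATH_LENGTH = sum(FACTOR_SIZES.values())
NUM_PATHS = 10


def random_path(sizes, length, rng):
    """Start from a random combination and change one factor at a time."""
    names = list(sizes)
    state = {name: rng.randrange(sizes[name]) for name in names}
    path = [tuple(state[name] for name in names)]
    while len(path) < length:
        name = rng.choice(names)  # ASSUMPTION: uniform choice of factor
        others = [v for v in range(sizes[name]) if v != state[name]]
        state[name] = rng.choice(others)
        path.append(tuple(state[n] for n in names))
    return path


rng = random.Random(0)
paths = [random_path(FACTOR_SIZES, PATH_LENGTH, rng) for _ in range(NUM_PATHS)]

# Paths may revisit combinations, hence "at most" 570 distinct examples.
train_set = {combo for path in paths for combo in path}
fraction = len(train_set) / TOTAL
```

Consecutive combinations on a path differ in exactly one factor, so each adjacent pair can serve as a training pair for the augmentation associated with that factor.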

Model and optimization

Because there is only one image for each combination of factors, there is no need to use distribution matching. We used pixel-wise binary cross-entropy as the learning objective for ℓ 0 , ℓ 1 , and ℓ 2 . Other hyperparameters are the same as those used above.

E.3 DSPRITES

Data The dSprites dataset contains images of 2D shapes generated from 6 ground truth independent latent factors: color, shape, scale, rotation, and x and y positions of a sprite, whose dimensions are 1, 3, 6, 40, 32, and 32. The size of the dataset is 737 280.

Data selection Note that there is no bijection between the factors and the images because of the intrinsic symmetries of the shapes, e.g., $C_4$ of the square and $C_2$ of the ellipse. Therefore, we only considered a subset of the original dataset where the orientation only ranges from 0° to 90°, which resulted in a dataset of size 184 320. The split of training and test data was similar to the above. Thus, we used only 830/184 320 ≈ 0.5% of the data for learning augmentations.

Results Additionally, we show the augmented images in Fig. 9. We can see that these augmentations are not equally easy to learn: the shape and position augmentations perform relatively well, but modifying the scale and orientation may cause shape distortion.

E.4 HEURISTIC AUGMENTATION

Fig. 10 shows the images from the colored MNIST, 3D Shapes, and dSprites datasets augmented by Mixup (Zhang et al., 2018) and MixStyle (Zhou et al., 2021). We can observe that MixStyle actually modifies the colors of the images in the colored MNIST dataset, which may explain why its performance is good in Table 1. Thus, our results also support the claim "heuristic augmentation improves generalization if the augmentation describes an attribute" from the empirical study of Wiles et al. (2022). When it is hard to design augmentations by hand, learning augmentations from data and regularizing these augmentations based on the algebraic constraints is a promising way to improve generalization, which is the main claim of our paper.
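Returning to the dSprites subset described in Appendix E.3, the stated dataset sizes and the role of the square's $C_4$ symmetry can be checked with a few lines. The binary grids `SQUARE` and `L_SHAPE` below are hypothetical toy patterns, not dSprites images, and the figure of 10 orientation values is taken from the range 0° to 90° covering a quarter of the 40 original values.

```python
import math

# Factor dimensions of dSprites as listed in the text.
FULL = {"color": 1, "shape": 3, "scale": 6,
        "orientation": 40, "pos_x": 32, "pos_y": 32}
# Restricting the orientation to a quarter turn keeps 10 of the 40 values.
SUBSET = dict(FULL, orientation=10)

full_size = math.prod(FULL.values())
subset_size = math.prod(SUBSET.values())
train_fraction = 830 / subset_size


def rotate90(grid):
    """Rotate a square grid (tuple of row tuples) by 90 degrees clockwise."""
    return tuple(zip(*grid[::-1]))


# Toy patterns: a centered square and an asymmetric L-shape.
SQUARE = ((0, 0, 0, 0),
          (0, 1, 1, 0),
          (0, 1, 1, 0),
          (0, 0, 0, 0))
L_SHAPE = ((1, 0, 0, 0),
           (1, 0, 0, 0),
           (1, 1, 0, 0),
           (0, 0, 0, 0))

# The square looks identical after a quarter turn, so its orientation label
# cannot be recovered from the image; the L-shape has no such symmetry.
square_is_ambiguous = rotate90(SQUARE) == SQUARE
l_shape_is_ambiguous = rotate90(L_SHAPE) == L_SHAPE
```

The ambiguity of the square under a quarter turn is the reason a map from images back to factors cannot be well defined on the full orientation range.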



MNIST (LeCun et al., 1998): http://yann.lecun.com/exdb/mnist/

3D Shapes (Burgess & Kim, 2018): https://github.com/deepmind/3d-shapes (Apache License 2.0)

dSprites (Matthey et al., 2017): https://github.com/deepmind/dsprites-dataset (Apache License 2.0)



Figure 1: Domain generalization under domain shift (an unseen domain) and combination shift (unseen combinations of domains and labels). Domain: color, label: digit, training: , test: .

(a) Compositionality of multi-scale augmentations (b) Commutativity of two disentangled augmentations

Figure 3: Regularizing compositionality and commutativity (and other algebraic structures) of augmentations is a way to introduce inductive biases and exploit the relationships between training examples, which is useful especially when the combinations of factors are scarce in the training data.

(a) Without compositionality regularization, the error may accumulate after a few compositions. (b) Without commutativity regularization, pairs for learning augmentations may be insufficient.

Figure 4: Randomly selected 5 images (top row) in the dSprites dataset (Matthey et al., 2017) and augmented images (bottom 4 rows) of position (Fig. 4a, left ⇝ right) and shape (Fig. 4b, square ⇝ ellipse ⇝ heart ⇝ square), without (left) and with (right) regularization.

Figure 5: A homomorphism preserving binary operations ⊕ and p 1 × p 2 .

Definition 9 (Exponential). Given sets $A$ and $B$, the function set $B^A$ is the set of all functions from $A$ to $B$. Given a set $A$ and a function set $B^A$, there exists an evaluation map $\epsilon : B^A \times A \to B$ that sends a function $f : A \to B$ and a value $a \in A$ to the evaluation $\epsilon(f, a) = f(a) \in B$.

Definition 10 (Exponential transpose). For a binary function $f : A \times B \to C$, its exponential transpose (also known as currying) is a function $\hat{f} : A \to C^B$ such that $\forall a \in A, \forall b \in B, \ f(a, b) = \hat{f}(a)(b)$.

Definition 11 (Action). A (left) action of a set $A$ on a set $X$ is a binary function $\mathrm{act} : A \times X \to X$.

Figure 6: A set of combinations of the colored MNIST data with only 15/50 = 30% data for training. Shaded combinations are used for testing.

Factors changed along the path in Fig. 7: floor hue, wall hue, object hue, object hue, object hue, object hue, floor hue, object hue, wall hue, floor hue, scale.

Figure 7: A path of transformations of data (left to right, top to bottom) of the 3D Shapes dataset.

Model and optimization We used a simple 3-layer MLP (64 × 64 → 256 → 64 → output) with ReLU activation as the prediction model, cross-entropy (classification) or mean squared error (regression) as the learning objectives, and an Adam optimizer (Kingma & Ba, 2015) with a batch size of 32 and a learning rate of $1 \times 10^{-4}$.

(a) Shape: square, ellipse, heart (b) Scale: 6 values linearly spaced in [0.5, 1] (c) Orientation: 10 values in [0°, 90°] (d) Position X: 32 values in [0, 1] (e) Position Y: 32 values in [0, 1]

Figure 9: Augmented training examples of the dSprites dataset

For two objects $Y_1$ and $Y_2$, we can consider their product $Y = Y_1 \times Y_2$, which is defined via a pair of canonical projections $p_1 : Y_1 \times Y_2 \to Y_1$ and $p_2 : Y_1 \times Y_2 \to Y_2$. This means that we can divide the product into parts and process each part separately without losing information. Specifically, (a) if $Y_1$ and $Y_2$ are just sets, $Y$ is their Cartesian product; (b) if $Y_1$ and $Y_2$ have algebraic structures, $Y$ is the product algebra and the operations are defined componentwise; and (c) if $A_1$ and $A_2$ act on $Y_1$ and $Y_2$, respectively, then the product algebra $A = A_1 \times A_2$ can act on $Y = Y_1 \times Y_2$ componentwise. Additionally, if we let $PY$ be the set of all measures on $Y$, then $PY_1 \times PY_2$ is the set of joint distributions where two components are statistically independent, while $PY = P(Y_1 \times Y_2)$ is the set of all possible joint distributions. We reiterate that product is the common denominator for all the definitions of disentanglement. In Definition 1, we only considered the product structure of endofunctions.

We can use this definition to formulate the domain generalization problem as follows. We assume that $Y = Y_1 \times Y_2$ has two components, where $Y_1$ is the set of domain indices and $Y_2$ is the set of other target labels. We choose a structure of a subset of the endofunctions $Y^Y$, described by two algebras $A_1$ and $A_2$ and two actions $\mathrm{act}_{Y_1}$ and $\mathrm{act}_{Y_2}$. Then, we let the product algebra

The classification accuracy (%, "mean (standard deviation)" of 5 trials) on the colored MNIST data. For each setting (column), the method with the highest mean accuracy and those methods that are not statistically significantly different from the best one (via one-tailed t-tests with a significance level of 0.05), if any, are highlighted in boldface.

The misclassification rate (%) of shape and mean squared errors (×100) of scale, orientation, and positions on the dSprites dataset ("mean (standard deviation)" of 5 trials).

Label shift: $p_{\mathrm{train}}(X \mid Y) = p_{\mathrm{test}}(X \mid Y)$, e.g., class imbalance (Johnson & Khoshgoftaar, 2019) and long-tailed class distribution (Zhang et al., 2021)

Concept shift: $p_{\mathrm{train}}(X) = p_{\mathrm{test}}(X)$, e.g., noisy labels (Song et al., 2022)

Distribution shift is also closely related to robust optimization (Ben-Tal & Nemirovski, 2002) and fairness in machine learning (Barocas et al., 2019).


The remaining combinations were used as the test set.

Model We used U-Net (Ronneberger et al., 2015) for the image-to-image data augmentations with 3 layers of downscale/upscale modules and a sigmoid as the last layer. We used a convolutional neural network with spectral normalization (Miyato et al., 2018) as the discriminator for distribution matching (Goodfellow et al., 2014) between images ($\ell_0$, $\ell_1$, and $\ell_2$). To reduce the number of models, the discriminator was conditioned on the factors via additive embedding. We used the same architecture as the discriminator for the classifier, except that the output dimension was set to 10. The learning objective for the classifier ($\ell_3$) is the cross-entropy/negative log-likelihood.

Optimization We used an Adam optimizer (Kingma & Ba, 2015) with a batch size of 32 and a learning rate of $1 \times 10^{-3}$ for the augmentations and $1 \times 10^{-4}$ for the discriminator and the classifier. The model was trained for 10 000 iterations.

Infrastructure The experiments were conducted on an NVIDIA Tesla V100 GPU.

