COOPERATIVE ADVERSARIAL LEARNING VIA CLOSED-LOOP TRANSCRIPTION

Abstract

Generative models based on the adversarial process are sensitive to network architectures and difficult to train. This paper proposes a generative model that implements cooperative adversarial learning via closed-loop transcription. The encoder and decoder are trained simultaneously, and training includes a cooperative process in addition to the adversarial one. In the adversarial process, the encoder acts as a critic and maximizes the distance between the original and transcribed images, where the distance is measured by rate reduction in the feature space; in the cooperative process, the encoder and the decoder jointly minimize this distance to improve transcription quality. Cooperative adversarial learning inherits the concepts and properties of both Auto-Encoding and GAN, and it is unique in that the encoder actively controls the training process, because it is trained in both learning processes in two different roles. Experiments demonstrate that, without regularization techniques, our generative model is robust to network architectures and easy to train, that sample-wise reconstruction performs well in terms of sample features, and that disentangled visual attributes are well modeled in independent principal components.

1. INTRODUCTION

The minimax game provides an unsupervised learning method that is widely used in generative models such as generative adversarial nets (GAN) (Goodfellow et al., 2014; Chen et al., 2016; Radford et al., 2015) and the recently proposed closed-loop transcription framework (CTRL) (Dai et al., 2022). Generative modeling based on a two-player minimax game faces several problems: instability during training, the difficulty of maintaining balance between the discriminator and the generator (as in GAN) or between the encoder and the decoder (as in CTRL), and sensitivity to network architectures (He et al., 2016a; b). Maintaining balance and stability in the adversarial process has attracted considerable attention. The mainstream approach is to constrain the discriminator (Kurach et al., 2019) with regularization techniques such as weight normalization (Salimans & Kingma, 2016), weight clipping (Arjovsky et al., 2017), gradient penalty (Gulrajani et al., 2017), spectral normalization (Miyato et al., 2018), and adversarial Lipschitz regularization (Terjék, 2019).

Different from the mainstream regularization methods, this paper considers the feasibility of letting the discriminator actively adapt to the rhythm of the generator. Maintaining balance in adversarial generative models is difficult because the generator and the discriminator merely play against each other, so the balance breaks sooner or later once the discriminator learns faster than the generator. In contrast, generative models based on Auto-Encoding, such as the variational Auto-Encoder (VAE) (Kingma & Welling, 2013; Lopez et al., 2018), tend to be stable and do not face instability and collapse problems. The reason is that the encoder and decoder in the Auto-Encoding framework learn and update cooperatively, in the same direction, to improve reconstruction quality and reduce data dimensions. In short, the models work cooperatively rather than against each other.

Inspired by this idea of cooperation, this paper combines cooperative learning and adversarial learning in a single generative model and proposes a generative model via cooperative adversarial learning (CoA-CTRL). CoA-CTRL employs the closed-loop transcription framework (CTRL) of Dai et al. (2022) and Ma et al. (2022) and naturally combines the learning strategies of the adversarial process and the cooperative process. First, like the discriminator in GAN, the encoder in CoA-CTRL acts as a critic that maximizes the feature distance between the real data and the transcribed data. Second, consistent with Auto-Encoding, the encoder and decoder cooperatively minimize the difference between the real data and the transcribed data. Confrontation and cooperation alternate and interact, which actively keeps the system in balance.

2. RELATED WORK

Auto-Encoding and its variants. Auto-Encoding is a typical neural network approach to representation learning and data dimension reduction (Kramer, 1991; Hinton & Zemel, 1993; Hinton & Salakhutdinov, 2006). Auto-Encoding aims to learn the encoder $E_\theta$ and the decoder $D_\eta$ simultaneously, as expressed by equation (1). Generally, Auto-Encoding learns from the L2 pixel-wise distance.

$$\min_{\theta,\eta} L(\theta, \eta) = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - D_\eta(E_\theta(x_i)) \right\|_2^2 \tag{1}$$

Generative adversarial nets (GAN). Generative adversarial nets (GAN) provide a generative model based on the adversarial process (Goodfellow et al., 2014; Chen et al., 2016). GAN includes a discriminator and a generator. The discriminator evaluates the quality of the generated images, and the generator tries to fool the discriminator. The two networks are trained as a two-player minimax game with the value function $V(G(\eta), D(\theta))$ shown in equation (2), where $G(\eta)$ and $D(\theta)$ denote the generator and the discriminator, respectively.

$$\min_{\eta} \max_{\theta} V(\eta, \theta) = \mathbb{E}_{x \sim p(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \tag{2}$$

MCR$^2$ and CTRL. Recently, Chan et al. (2022) and Yu et al. (2020) proposed a new learning objective, the principle of maximal coding rate reduction (MCR$^2$), which learns low-dimensional intrinsic structures from high-dimensional data and obtains representations that are discriminative between classes. An encoder $f(x, \theta)$ maps high-dimensional data $X$ to low-dimensional features $Z$. As shown in equation (3), MCR$^2$ uses the coding rate $R(Z, \epsilon)$ to measure the compactness of the learned features $Z$ as a whole, subject to the distortion $\epsilon$. The rate reduction $\Delta R$ measures distance in the feature space. In the special case of two classes, as shown in equation (6), there are two sets of features, $Z$ and $\hat{Z}$. The distance between $Z$ and $\hat{Z}$ can be measured by the coding rate reduction $\Delta R(Z, \hat{Z})$, that is, the difference between the coding rate of $Z \cup \hat{Z}$ and the average of their individual coding rates ($R_c$).
$$R(Z, \epsilon) = \frac{1}{2} \log\det\left(I + \alpha Z Z^{*}\right) \tag{3}$$

$$\underbrace{X \xrightarrow{\; f(x,\theta) \;} Z \xrightarrow{\; g(z,\eta) \;} \hat{X} \xrightarrow{\; f(x,\theta) \;} \hat{Z}}_{h(x,\theta,\eta) \,=\, f \circ g \circ f} \tag{4}$$

$$h(x, \theta, \eta) = f(g(f(x, \theta), \eta), \theta) \tag{5}$$

$$\Delta R(Z, \hat{Z}) = R(Z \cup \hat{Z}) - \underbrace{\frac{1}{2}\left(R(Z) + R(\hat{Z})\right)}_{R_c} \tag{6}$$

$$\min_{\eta} \max_{\theta} T(\theta, \eta) = \Delta R(f(X, \theta), h(X, \theta, \eta)) = \Delta R(Z(\theta), \hat{Z}(\theta, \eta)) \tag{7}$$

CTRL (Dai et al., 2022) provides a closed-loop framework based on MCR$^2$, consisting of an encoder $f(x, \theta)$ and a decoder $g(z, \eta)$. As equation (7) shows, CTRL aims to transcribe data through a minimax game on the coding rate reduction, in which $h(x, \theta, \eta)$ is the closed-loop map given by equations (4) and (5). The first segment ($x \to z \to \hat{x}$) in (4) resembles Auto-Encoding, and the second segment ($z \to \hat{x} \to \hat{z}$) resembles GAN. While GAN generates images from random Gaussian noise, in CTRL, as (4) shows, the decoder $g(z, \eta)$ maps from the features $Z$ (which are encoded from $X$) to $\hat{X}$, and then the encoder $f(x, \theta)$ maps $\hat{X}$ to the features $\hat{Z}$. The distance between $X$ and $\hat{X}$ is described by the coding rate reduction $\Delta R$ in equation (6). $R(Z \cup \hat{Z})$ is the coding rate of the joint space of $Z$ and $\hat{Z}$, and $R_c$ is the average of the coding rates of $Z$ and $\hat{Z}$. The encoder maximizes $\Delta R$, and the decoder minimizes it. In this adversarial learning process, CTRL is consistent with GAN (Goodfellow et al., 2014).

Chan et al. (2022) show that the gradient of $\Delta R$ vanishes when $\Delta R$ is large enough, which leads to low learning efficiency for the encoder $f(x, \theta)$ and the decoder $g(z, \eta)$. The $\Delta R$ (the yellow ball) in the left part of Figure 1 illustrates this situation. This is consistent with Arjovsky & Bottou (2017), who show that the gradient of the generator vanishes when the distributions of real and fake images do not intersect in GAN. Therefore, keeping $\Delta R$ small is important. For the minimax game, the $\Delta R$ in the right part of Figure 1 is better than that in the left part.

Figure 1: Two representations with different rate reductions (Yu et al., 2020; Chan et al., 2022). $R(Z \cup \hat{Z})$ is illustrated by the number of $\epsilon$-balls in the joint space of $Z$ and $\hat{Z}$ (all the balls). $R_c$ is illustrated by the number of balls in the subspaces of $Z$ (green balls) and $\hat{Z}$ (red balls). $\Delta R$ is their difference (yellow balls). While MCR$^2$ prefers the left representation for its large rate reduction, the right one is better for the minimax game.
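The coding rate of equation (3) and the rate reduction of equation (6) can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names, the column-wise feature layout, the unit-norm features, and the synthetic data are assumptions made here, and $\alpha = d / (n\epsilon^2)$ with $\epsilon^2 = 0.5$ follows the definition and setting given elsewhere in the paper.

```python
import numpy as np


def coding_rate(Z, eps2=0.5):
    """Coding rate R(Z, eps) = 1/2 * logdet(I + alpha * Z Z^T), as in equation (3).

    Z has shape (d, n): n feature vectors of dimension d stored as columns.
    """
    d, n = Z.shape
    alpha = d / (n * eps2)
    # slogdet avoids overflow/underflow that log(det(...)) can hit for larger d.
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * (Z @ Z.T))
    return 0.5 * logdet


def rate_reduction(Z, Z_hat, eps2=0.5):
    """Rate reduction Delta R(Z, Z_hat) of equation (6) for two feature sets."""
    joint = np.concatenate([Z, Z_hat], axis=1)  # Z and Z_hat treated as one set
    r_c = 0.5 * (coding_rate(Z, eps2) + coding_rate(Z_hat, eps2))
    return coding_rate(joint, eps2) - r_c


def unit_columns(A):
    """Scale every column to unit norm (an assumption of this sketch)."""
    return A / np.linalg.norm(A, axis=0, keepdims=True)


rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 64))
Z[4:] = 0.0                      # features living in the first 4 coordinates
Z = unit_columns(Z)
Z_far = rng.standard_normal((8, 64))
Z_far[:4] = 0.0                  # features living in the last 4 coordinates
Z_far = unit_columns(Z_far)

dr_same = rate_reduction(Z, Z)       # identical feature sets
dr_far = rate_reduction(Z, Z_far)    # feature sets in orthogonal subspaces
print(f"Delta R (identical sets):       {dr_same:.6f}")
print(f"Delta R (orthogonal subspaces): {dr_far:.6f}")
```

In this toy setting the rate reduction is numerically zero when the two feature sets coincide and clearly positive when they occupy different subspaces, which is the sense in which $\Delta R$ acts as a distance in the feature space.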

3. COOPERATIVE ADVERSARIAL LEARNING

3.1. CLOSED-LOOP TRANSCRIPTION: ONE ENCODER, TWO ROLES

The way closed-loop transcription combines the structures of Auto-Encoding and GAN is ingenious, as it gives the encoder two different roles. As the left part of Figure 2 and equation (4) show, in the segment $x \to z \to \hat{x}$ the encoder acts as the encoder in Auto-Encoding, while in the segment $z \to \hat{x} \to \hat{z}$ it acts as the discriminator in GAN. Unlike earlier works such as VAE-GAN (Larsen et al., 2016) and BiGAN (Donahue et al., 2016; Dumoulin et al., 2016), which add a discriminator to evaluate the decoder, closed-loop transcription trains only the encoder and the decoder, and it is the encoder that evaluates the performance of the decoder.

3.2. COOPERATIVE ADVERSARIAL LEARNING: TWO ROLES, LEARN TWICE

Detailed analysis of the encoder's two roles. CTRL's two-role setting and closed-loop framework add to the complexity of the objective distance function $\Delta R(Z, \hat{Z})$ as well as of the training process. As shown in equations (4) and (5), the encoder $f(x, \theta)$ is used twice in CTRL's closed-loop map $H(x, \theta, \eta)$ (written $h$ in equations (4) and (5)). We can expand the closed-loop map as shown in equation (8), in which $\hat{Z}(\theta, \eta)$ is expanded to $\hat{Z}(\theta_1, \eta, \theta_2)$. Here $\theta_1$ and $\theta_2$ refer to the encoder's first use ($X \to Z$) and second use ($\hat{X} \to \hat{Z}$). We can likewise expand the distance function $\Delta R$ as shown in equation (9), in which $\Delta R(Z, \hat{Z})$ is expanded to $\Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2))$. $\theta_1$, $\theta_2$, and $\theta_3$ all refer to the same encoder $f(x, \theta)$, but we mark them differently because the encoder appears three times in the distance function, each time with a different meaning. $\theta_2$ and $\theta_3$ represent the encoder when it functions as a discriminator that transcribes the data $\hat{X}$ and $X$, respectively, and $\theta_1$ represents the encoder when it acts as an encoder in the segment $X \to Z \to \hat{Z}$.

$$H(X, \theta, \eta) = f(g(f(X, \theta), \eta), \theta) = \hat{Z}(\theta, \eta) = \hat{Z}(\theta_1, \eta, \theta_2) \tag{8}$$

$$\Delta R(Z, \hat{Z}) = \Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2)) \tag{9}$$

In the backward propagation process, as shown in the right part of Figure 2, the objective function $\Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2))$ yields three gradients for the encoder $f(x, \theta)$. The three gradients are $g_{\theta_1}$, $g_{\theta_2}$, and $g_{\theta_3}$, given in equations (10), (11), and (12).

$$g_{\theta_1} = \nabla_{\theta_1} \Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2)) \tag{10}$$

$$g_{\theta_2} = \nabla_{\theta_2} \Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2)) \tag{11}$$

$$g_{\theta_3} = \nabla_{\theta_3} \Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2)) \tag{12}$$

Cooperative adversarial learning based on encoder's two roles. Although CTRL gives the encoder two roles, the original CTRL does not exploit this dual identity in its learning strategy (see Algorithm 2 in the Appendix).
The original CTRL simply follows the minimax game, in which the encoder functions merely as a discriminator and its original role as an encoder is ignored. In this paper, we propose cooperative adversarial learning, shown in Algorithm 1, which makes use of both of the encoder's roles (encoder and discriminator). Cooperative adversarial learning comprises two processes:

(1) Adversarial process. We use only the gradients $g_{\theta_2}$ and $g_{\theta_3}$ (the encoder functioning as a discriminator) to maximize $\Delta R$, which provides an iterative learning signal. As shown in Algorithm 1, the encoder $\theta$ updates itself in its role as discriminator via $\theta_2$ (facing input $\hat{X}$) and $\theta_3$ (facing input $X$) to enlarge $\Delta R$. This process is consistent with GAN and the original CTRL and is expressed by equation (13).

(2) Cooperative process. We use $g_{\theta_1}$ (the encoder functioning as the encoder in Auto-Encoding), $g_{\theta_2}$, $g_{\theta_3}$, and $g_{\eta}$ together to compress and transcribe data following the learning signal $\Delta R$. As shown in Algorithm 1, we optimize $\theta_1$, $\theta_2$, $\theta_3$, and $\eta$, which are all the elements of the closed-loop transcription. Equation (14) expresses this process.

$$\text{adversarial:} \quad \max_{\theta} \Delta R(f(X, \theta), H(X, \theta, \eta)) = \Delta R(Z(\theta), \hat{Z}(\theta, \eta)) \tag{13}$$

$$\updownarrow$$

$$\text{cooperation:} \quad \min_{\theta,\eta} \Delta R(f(X, \theta), H(X, \theta, \eta)) = \Delta R(Z(\theta), \hat{Z}(\theta, \eta)) \tag{14}$$

Algorithm 1 Cooperative Adversarial Learning
Require: $\alpha$, learning rate. ratio, CoA-ratio. $\epsilon^2$, coding rate parameter. bs, batch size.
Require: $\theta$, initial parameters of the encoder. $\eta$, initial parameters of the decoder.
while $\eta$ has not converged do
    Adversarial process to provide the iterative learning signal $\Delta R$:
    Sample $X \leftarrow \{x^{(i)}\}_{i=1}^{bs}$, a batch from the real data
    $Z = f(X, \theta)$, $\hat{X} = g(Z, \eta)$, $\hat{Z} = f(\hat{X}, \theta)$
    $g_{\theta_2} \leftarrow \nabla_{\theta_2} \Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2))$
    $g_{\theta_3} \leftarrow \nabla_{\theta_3} \Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2))$
    $g_{\theta} \leftarrow g_{\theta_2} + g_{\theta_3}$
    $\theta \leftarrow \theta + \text{ratio} \times \alpha \times \text{Adam}(g_{\theta})$
    Cooperative process to compress and transcribe via the learning signal $\Delta R$:
    $Z = f(X, \theta)$, $\hat{X} = g(Z, \eta)$, $\hat{Z} = f(\hat{X}, \theta)$
    $g_{\theta_1} \leftarrow \nabla_{\theta_1} \Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2))$
    $g_{\theta_2} \leftarrow \nabla_{\theta_2} \Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2))$
    $g_{\theta_3} \leftarrow \nabla_{\theta_3} \Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2))$
    $g_{\theta} \leftarrow g_{\theta_1} + g_{\theta_2} + g_{\theta_3}$
    $g_{\eta} \leftarrow \nabla_{\eta} \Delta R(Z(\theta_3), \hat{Z}(\theta_1, \eta, \theta_2))$
    $\theta \leftarrow \theta - \alpha \times \text{Adam}(g_{\theta})$
    $\eta \leftarrow \eta - \alpha \times \text{Adam}(g_{\eta})$
end while

Auto-Encoding optimizes reconstruction quality with a pixel-wise loss, whereas CoA-CTRL optimizes it with the distribution distance ($\Delta R$) in the feature space. The feature distribution is defined by the encoder itself, which makes this a self-consistent learning strategy within the closed-loop framework. When the encoder finds the distribution distance and maximizes it in equation (13), it provides an iterative learning signal ($\Delta R$) for the reconstruction task in equation (14). The encoder takes on two roles and learns twice: it minimizes what it maximizes, which is a self-consistent learning strategy (Ma et al., 2022). In this sense, CoA-CTRL is an adaptive learning strategy that naturally unifies an adversarial process consistent with GAN and a cooperative process consistent with Auto-Encoding, which contributes to its learning quality and stability. In summary, cooperative adversarial learning follows the value function $T(\theta, \eta)$ given in equation (15):

$$\min_{\theta,\eta} \max_{\theta} T(\theta, \eta) = \Delta R(f(X, \theta), H(X, \theta, \eta)) = \Delta R(Z(\theta), \hat{Z}(\theta, \eta)) \tag{15}$$
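The split of the encoder gradient into $g_{\theta_1}$, $g_{\theta_2}$, and $g_{\theta_3}$ and one iteration of the two processes can be illustrated on a toy problem. The sketch below is not the authors' code and makes several simplifying assumptions that are ours: the encoder and decoder are plain linear maps, gradients are obtained by central finite differences instead of back-propagation, plain gradient steps replace Adam, and all function names and sizes are hypothetical.

```python
import numpy as np

EPS2 = 0.5  # coding rate parameter eps^2 (value taken from the appendix settings)


def coding_rate(Z):
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + d / (n * EPS2) * (Z @ Z.T))
    return 0.5 * logdet


def delta_r(X, th1, eta, th2, th3):
    """Delta R(Z(th3), Z_hat(th1, eta, th2)) with a linear encoder and decoder."""
    Z = th3 @ X               # encoder judging the real data X
    X_hat = eta @ (th1 @ X)   # encoder in its auto-encoding role, then decoder
    Z_hat = th2 @ X_hat       # encoder judging the transcribed data X_hat
    joint = np.concatenate([Z, Z_hat], axis=1)
    return coding_rate(joint) - 0.5 * (coding_rate(Z) + coding_rate(Z_hat))


def num_grad(fn, W, h=1e-5):
    """Central finite-difference gradient of the scalar function fn at W."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        G[idx] = (fn(Wp) - fn(Wm)) / (2 * h)
    return G


def encoder_grads(X, theta, eta):
    """Partial gradients g_th1, g_th2, g_th3, all evaluated at the shared theta."""
    g1 = num_grad(lambda w: delta_r(X, w, eta, theta, theta), theta)
    g2 = num_grad(lambda w: delta_r(X, theta, eta, w, theta), theta)
    g3 = num_grad(lambda w: delta_r(X, theta, eta, theta, w), theta)
    return g1, g2, g3


def coa_step(X, theta, eta, lr=1e-2, ratio=1.5):
    """One iteration: adversarial ascent on theta, then cooperative descent."""
    # Adversarial process: only the discriminator paths g_th2 + g_th3 are used.
    _, g2, g3 = encoder_grads(X, theta, eta)
    theta = theta + ratio * lr * (g2 + g3)
    # Cooperative process: all three encoder paths and the decoder descend Delta R.
    g1, g2, g3 = encoder_grads(X, theta, eta)
    g_eta = num_grad(lambda w: delta_r(X, theta, w, theta, theta), eta)
    theta = theta - lr * (g1 + g2 + g3)
    eta = eta - lr * g_eta
    return theta, eta


rng = np.random.default_rng(0)
X = rng.standard_normal((6, 32))            # 32 samples of dimension D = 6
theta = 0.5 * rng.standard_normal((3, 6))   # linear encoder, d = 3
eta = 0.5 * rng.standard_normal((6, 3))     # linear decoder

g1, g2, g3 = encoder_grads(X, theta, eta)
g_tied = num_grad(lambda w: delta_r(X, w, eta, w, w), theta)  # all roles tied
theta_new, eta_new = coa_step(X, theta, eta)
print("max |g1 + g2 + g3 - g_tied| =", np.abs(g1 + g2 + g3 - g_tied).max())
```

The comparison with `g_tied` checks the chain-rule identity behind the notation: the gradient with respect to the shared encoder parameters is the sum of the three role-specific gradients, so the adversarial process uses only part of it while the cooperative process uses all of it.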

3.3. COOPERATIVE ADVERSARIAL RATIO (COA-RATIO): A HYPERPARAMETER TO ADJUST THE COOPERATIVE PROCESS AND THE ADVERSARIAL PROCESS

Active control of the adversarial process and the cooperative process. As discussed above, the encoder works and learns in two different roles, one in the adversarial process and one in the cooperative process. Through the encoder, we can therefore actively control both processes. We introduce a hyperparameter, the cooperative adversarial ratio (CoA-ratio), defined as the ratio of the encoder's learning rate in the adversarial process to its learning rate in the cooperative process. In every iteration, the encoder's learning rate is multiplied by the CoA-ratio before the update of equation (13) and restored to its previous value before the update of equation (14).

Balance achieved through active control. The discriminator and generator in GAN work in opposite directions. The learning process of the discriminator is hard to control, and therefore the balance of training is hard to keep. Because the encoder learns twice, in the cooperative and adversarial processes, with different learning rates set by the CoA-ratio, it becomes the controller and regulator of the training process. The training process and the loss value $\Delta R$ can thus be controlled easily, without much attention to constrained network design.
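The learning-rate switching described in this section can be written as a small helper. This is only a sketch under stated assumptions: the function name is hypothetical, the default values (base learning rate 0.0001, CoA-ratio 1.5) are taken from the appendix settings, and decaying linearly to zero is our assumption, since the paper states only that linear decay is applied.

```python
def encoder_learning_rates(step, total_steps, base_lr=1e-4, coa_ratio=1.5):
    """Return (adversarial_lr, cooperative_lr) for the encoder at a given step."""
    # Linear decay towards zero over training (assumed end value).
    decayed = base_lr * max(0.0, 1.0 - step / total_steps)
    adversarial_lr = coa_ratio * decayed  # used for the update of equation (13)
    cooperative_lr = decayed              # restored for the update of equation (14)
    return adversarial_lr, cooperative_lr


adv_lr, coop_lr = encoder_learning_rates(step=0, total_steps=100_000)
mid_adv, mid_coop = encoder_learning_rates(step=50_000, total_steps=100_000)
print(adv_lr, coop_lr, mid_adv / mid_coop)
```

Whatever the decay schedule, the ratio between the two encoder learning rates stays equal to the CoA-ratio throughout training, which is what makes it a single knob for the balance between the two processes.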

3.4. ADVANTAGES AND DIFFERENCES

Training stability. Learning twice lets the encoder actively balance the learning processes through its two roles. As equation (15) shows, the encoder not only maximizes $\Delta R$ but also minimizes it, which relieves the need to design special networks or tune parameters. Compared with earlier balancing techniques, CoA-CTRL adds no constraint techniques or extra computing steps; it is simple and computationally efficient. In the experiments, the classic deep nets ResNet18, ResNet50, and ResNet101 (He et al., 2016a; b) are used to validate CoA-CTRL's active balance.

Sample-consistent reconstruction. As mentioned in Section 3.2, CoA-CTRL naturally unifies the learning strategies of Auto-Encoding and GAN, which helps the encoder learn better and faster. It also benefits sample-wise reconstruction. Sample-wise consistent reconstruction $g(f(x)) \approx x$ is the ideal solution to $\Delta R(Z(\theta), \hat{Z}(\theta, \eta)) \approx 0$, and $\Delta R(Z(\theta), \hat{Z}(\theta, \eta))$ is determined by both the encoder $f(x, \theta)$ and the decoder $g(z, \eta)$. CTRL (Dai et al., 2022) minimizes $\Delta R(\theta, \eta)$ merely through the decoder $g(z, \eta)$, which yields an approximate optimization that results in poor sample-wise consistency. If we instead optimize $\Delta R(\theta, \eta)$ through both the decoder $g(z, \eta)$ and the encoder $f(x, \theta)$, the optimization becomes simpler and the ideal sample-consistent solution is easier to reach. In addition, cooperative adversarial learning via closed-loop transcription produces a well-disentangled feature space. Later experiments demonstrate this advantage.

Simpler. Because Auto-Encoding, its variants (Kingma & Welling, 2013), and GAN have all gained much attention in generative modeling, many works have attempted to combine Auto-Encoding and GAN, such as BiGAN (Donahue et al., 2016), ALI (Dumoulin et al., 2016), adversarial autoencoders (Makhzani et al., 2015), and VAE-GAN (Larsen et al., 2016). Different from those attempts, CoA-CTRL implements closed-loop transcription (Dai et al., 2022), introduces the cooperative process, and requires no additional discriminator. The encoder learns twice, in different roles, within one iteration without investing more computing resources. Stability and balance are controlled actively without regularization techniques, which saves computing resources compared with techniques like spectral normalization (Miyato et al., 2018). Also, the original CTRL depends on large batch sizes, while cooperative adversarial learning could reduce this demand.
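The relation between sample-wise consistency and $\Delta R \approx 0$ used in the sample-consistent reconstruction argument can be made explicit with a short derivation. It is a sketch under two assumptions of ours: $Z$ and $\hat{Z}$ contain the same number $n$ of features, and $\alpha$ in equation (3) is computed from the number of columns of the matrix it multiplies.

```latex
\begin{aligned}
A &= I + \tfrac{d}{n\epsilon^{2}}\, Z Z^{*}, \qquad
\hat{A} = I + \tfrac{d}{n\epsilon^{2}}\, \hat{Z}\hat{Z}^{*}, \\
R(Z \cup \hat{Z}) &= \tfrac{1}{2}\log\det\!\Big(I + \tfrac{d}{2n\epsilon^{2}}\big(ZZ^{*} + \hat{Z}\hat{Z}^{*}\big)\Big)
  = \tfrac{1}{2}\log\det\!\Big(\tfrac{1}{2}A + \tfrac{1}{2}\hat{A}\Big), \\
\Delta R(Z, \hat{Z}) &= \tfrac{1}{2}\Big[\log\det\!\Big(\tfrac{1}{2}A + \tfrac{1}{2}\hat{A}\Big)
  - \tfrac{1}{2}\log\det A - \tfrac{1}{2}\log\det \hat{A}\Big] \;\ge\; 0 .
\end{aligned}
```

The inequality follows from the concavity of $\log\det$ on positive definite matrices, with equality exactly when $A = \hat{A}$, that is, when $ZZ^{*} = \hat{Z}\hat{Z}^{*}$. Perfect sample-wise reconstruction gives $\hat{Z} = Z$ and hence $\Delta R = 0$, but $\Delta R = 0$ by itself only requires the two feature sets to share the same second-moment matrix. This is consistent with the point made above that driving $\Delta R$ down does not, on its own, guarantee sample-wise consistency.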

4. EXPERIMENTS

4.1. SETTING

In this paper, we intend to demonstrate two main advantages of CoA-CTRL: first, its robustness to different network architectures, achieved through cooperative adversarial learning; second, its sample-consistent reconstruction. We use two types of encoders, deep encoders and normal encoders, and both types are paired with one type of decoder. We conduct the experiments with deep encoders on the diverse data sets CIFAR-10 (Krizhevsky et al., 2009) and STL-10 (Coates et al., 2011), as well as the facial data set Celeb-A (Liu et al., 2015), to demonstrate CoA-CTRL's active balance. We conduct the experiments with normal encoders on MNIST (LeCun et al., 1998), CIFAR-10, and ImageNet-1k (Russakovsky et al., 2015), to demonstrate CoA-CTRL's sample-wise consistency. More details of the experimental settings can be found in Appendix A.2.

4.2. EMPIRICAL VERIFICATION OF COA-CTRL'S ACTIVE BALANCE

4.2.1. ACTIVE BALANCE TO NET ARCHITECTURES

To verify CoA-CTRL's active balance, we conduct several comparative tests on CIFAR-10, using ResNet18, ResNet50, and ResNet101 as the encoder and the widely used 8-layer ResNet (De8) (Miyato et al., 2018) as the decoder. We intend to show that CoA-CTRL performs well and remains stable even with an unbalanced combination of encoder and decoder. We apply the same settings to GAN and CTRL to compare their stability and performance with those of CoA-CTRL. The results in Table 1 show that CoA-CTRL works well, while GAN and CTRL fail and collapse during training. To quantify CoA-CTRL's performance, we evaluate it with the widely used Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017). We further compare CoA-CTRL's IS and FID with those of other major generative models in Table 2.

To examine the roles that the feature dimension (nz) and the batch size (bs) play in CoA-CTRL's performance and stability, we additionally vary nz and bs to see whether CoA-CTRL can maintain balance. As shown in Table 1, CoA-CTRL works well with different combinations of nz and bs, and it performs better when nz and bs are increased together. We also examine the loss value ∆R during training. As shown in Figure 3, CoA-CTRL keeps ∆R low and stable and R(Z ∪ Ẑ) high, while CTRL shows an unstable training loss ∆R. CoA-CTRL's stable training loss contributes to its training success and strong performance.

4.2.2. ASSOCIATION BETWEEN THE LOSS VALUE ∆R AND PERFORMANCE

As shown in Table 1 and Figure 3, CoA-CTRL's performance appears to be associated with ∆R. As discussed in Section 3.3, the CoA-ratio adjusts the encoder's learning rate in the adversarial process relative to the cooperative process, which influences the loss value ∆R. We therefore set the CoA-ratio to 1.25, 1.5, 1.75, and 2.0 to explore the influence of ∆R on CoA-CTRL's performance. We conduct the experiment on CIFAR-10, using ResNet18 as the encoder, with nz and batch size set to 128 and 512. As shown in Table 12 and Figure 6 in the Appendix, different CoA-ratios are associated with different values of ∆R and R, and the IS and FID scores are strongly influenced by ∆R; encouragingly, ∆R changes in a stable manner. This paper only points out that the CoA-ratio, ∆R, and performance are strongly associated. Future work will explore the mechanisms.

4.2.3. EXCELLENT PERFORMANCE ACHIEVED THROUGH A SIMPLE DECODER

We compare CoA-CTRL's performance with other generative models on CIFAR-10 and STL-10. The results in Table 2 are cited directly from the relevant papers, except those for CoA-CTRL. For CoA-CTRL, we use ResNet18 as the encoder on both CIFAR-10 and STL-10. Table 2 shows that CoA-CTRL, with a simple decoder and experimental setting, performs better than other generative models, such as GANs with regularization techniques (SNGAN, WGAN-GP, WGAN-ALP) (Miyato et al., 2018; Gulrajani et al., 2017; Terjék, 2019), self-supervised GAN (Chen et al., 2019), latent optimisation GAN (LOGAN) (Wu et al., 2019), the complex-model GAN BigGAN (Brock et al., 2018), a recent combination of GAN and VAE (DCVAE) (Parmar et al., 2021), and the original CTRL (Dai et al., 2022). Compared with CTRL, CoA-CTRL lowers the FID by 10.06 on CIFAR-10 and by 20.06 on STL-10, a clear and substantial improvement.

4.3. SAMPLE-WISE CONSISTENCY

CoA-CTRL performs well on sample-wise reconstruction in terms of sample features, as demonstrated by our experiments on several mainstream data sets using normal encoders. Figure 4 shows CoA-CTRL's reconstruction performance on MNIST compared with CTRL. CoA-CTRL's reconstructions are almost identical to the original inputs, and they are better and more consistent than those of CTRL. For CIFAR-10 and ImageNet-1k (Russakovsky et al., 2015), we use the networks listed in Table 5 to Table 9 in the Appendix and run 20,000 iterations on both data sets. Figure 4 also displays CoA-CTRL's performance on CIFAR-10 and ImageNet-1k. CoA-CTRL reconstructs well in terms of features, color, and classes, which benefits from cooperative adversarial learning and from a loss function based on the feature space.

4.4. DISENTANGLED FEATURE SPACE

The latent space of GAN has no definite meaning, and GAN lacks an inverse map from data to the latent space. Several later works discuss this issue (Chen et al., 2016; Karras et al., 2019; 2020; Tov et al., 2021). In contrast, the latent space in CoA-CTRL has clear and disentangled meanings, and CoA-CTRL possesses dual consistent maps, x → z and z → x. The images in Figure 8 in the Appendix are samples of CIFAR-10 generated along independent principal components. We select the top 10 components, with each row corresponding to one component, from top to bottom. Different shapes, styles, backgrounds, and other visual attributes are well modeled in different principal components, and the images vary with the scale value. In addition, we test CoA-CTRL's unsupervised representation; as Table 13 in the Appendix shows, CoA-CTRL's performance is competitive with other methods (Springenberg, 2015; Kingma & Welling, 2013; Donahue et al., 2016; Dumoulin et al., 2016; Makhzani et al., 2015; Parmar et al., 2021; Dai et al., 2022).
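Generating samples along principal components can be sketched as follows. The paper does not spell out the procedure, so this is an illustration under our own assumptions: features are stored one per row, the traversal starts from the feature mean, the range of scale values is hypothetical, and random features stand in for the encoder output $f(X, \theta)$. The function names are likewise ours.

```python
import numpy as np


def principal_directions(Z, k=10):
    """Mean and top-k principal directions of features Z with shape (n, d)."""
    mean = Z.mean(axis=0)
    # Rows of Vt are orthonormal directions ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Z - mean, full_matrices=False)
    return mean, Vt[:k]


def traverse(z, direction, scales):
    """Features obtained by moving z along one direction by each scale value."""
    return np.stack([z + s * direction for s in scales])


rng = np.random.default_rng(0)
Z = rng.standard_normal((512, 128))       # stand-in for encoder features (nz = 128)
mean, components = principal_directions(Z, k=10)
scales = np.linspace(-2.0, 2.0, 9)        # hypothetical range of scale values
rows = [traverse(mean, v, scales) for v in components]
# Each entry of `rows` holds the features for one image row; passing them
# through the trained decoder g(z, eta) would produce that row of images.
print(len(rows), rows[0].shape)
```

Because the directions are orthonormal, moving along one component leaves the coordinates along the others unchanged, which is what allows each row to be read as the effect of a single component.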

5. DISCUSSION AND CONCLUSION

In this paper, we propose cooperative adversarial learning and, based on this new learning method and closed-loop transcription, build a promising generative model that offers active balance, better generative performance, and a disentangled latent space. We also find it competitive in unsupervised representation. Although cooperative adversarial learning provides a way to balance deep nets, some questions remain open, for example, whether a deeper encoder would yield better performance, and how large a CoA-ratio or ∆R is best for the training process and model performance. These questions deserve further exploration.



where $X \in \mathbb{R}^{D \times n}$ denotes the data samples, $Z \in \mathbb{R}^{d \times n}$ denotes the features, and $\alpha = \frac{d}{n\epsilon^{2}}$.



Figure 2: Forward propagation process (a) and backward propagation process (b) of CTRL. In the backward propagation process, gradients flow back through the encoder $f(x, \theta)$ three times, which produces the three gradients $g_{\theta_1}$, $g_{\theta_2}$, and $g_{\theta_3}$.

Figure 3: Loss evaluation of CTRL and CoA-CTRL in the training process on CIFAR-10, using ResNet18. CoA-CTRL keeps ∆R in a stable curve even with an unbalanced setting of a deep encoder and a shallow decoder.

Figure 4: Comparison of sample-wise reconstruction. CoA-CTRL performs well in sample-wise reconstruction on MNIST, CIFAR-10, and ImageNet-1k.

Figure 6: Loss evaluation of CoA-CTRL in the training process on CIFAR-10, using ResNet18 as the encoder and De8 as the decoder. CoA-ratios of 1.25, 1.5, 1.75, and 2.0 are used.

Table 1: Stability and performance of CoA-CTRL compared with GAN and CTRL on CIFAR-10. Experiments show that CoA-CTRL performs well in a stable manner even with unbalanced settings of a deep encoder and a shallow decoder. Avg. of R and Avg. of ∆R refer to the average values of R(Z ∪ Ẑ) and ∆R in the training process. ↑ means higher is better, and ↓ means lower is better.

Table 2: Comparison of the performance of CoA-CTRL and other generative methods on CIFAR-10 and STL-10. Data for the other generative models are cited from the relevant papers. For CoA-CTRL (1), nz is 128 and bs is 512; for CoA-CTRL (2), nz is 256 and bs is 1024.

Decoder for MNIST

Encoder for CIFAR-10

Decoder for ImageNet

Encoder for ImageNet

$x \in \mathbb{R}^{64 \times 64 \times 3}$
4 × 4, stride=2, pad=1 conv 64 lReLU
4 × 4, stride=2, pad=1 conv. BN 128 lReLU
4 × 4, stride=2, pad=1 conv. BN 256 lReLU
4 × 4, stride=2, pad=1 conv 512 lReLU
4 × 4, stride=1, pad=0 conv 1024

Decoder for Celeb-A

Decoder for STL-10

Class-wise accuracy performance with respect to unsupervised representation on MNIST

A APPENDIX

A.1 LEARNING STRATEGY OF THE ORIGINAL CTRL

Algorithm 2 Original CTRL's learning strategy
Require: $\alpha$, learning rate. ratio, CoA-ratio. $\epsilon^2$, coding rate parameter. bs, batch size.
Require: $\theta$, initial parameters of the encoder. $\eta$, initial parameters of the decoder.
while $\eta$ has not converged do
    Sample $X \leftarrow \{x^{(i)}\}_{i=1}^{bs}$, a batch from the real data

We conduct the experiments using the nets in DCGAN (Radford et al., 2015) and some other simple nets on MNIST, CIFAR-10, and ImageNet-1k, to demonstrate CoA-CTRL's sample-wise consistency. The details of the networks can be found in Table 3 to Table 7. The experiments are fair comparisons with the original CTRL; the only difference is the learning strategy. For experiments with normal encoders, the encoder and the decoder have similar sizes. We set the hyperparameters $\beta_1$ and $\beta_2$ of the Adam optimizer (Kingma & Ba, 2014) to 0.0 and 0.9 for MNIST, and to 0.5 and 0.9 for CIFAR-10 and ImageNet-1k. We set the learning rate to 0.0001 and apply linear decay. $\epsilon^2$ is set to 0.5. We set the CoA-ratio to 1.3 for MNIST and 1.5 for the other data sets. For MNIST, we set the batch size to 256 and run 10,000 iterations. For CIFAR-10, we set the batch size to 512 and run 20,000 iterations. For ImageNet-1k, we set the batch size to 128 and run 20,000 iterations.

A.2.2 EXPERIMENTS USING DEEP ENCODERS

We conduct experiments using deep encoders on CIFAR-10, STL-10, and Celeb-A with the following settings. Adam (Kingma & Ba, 2014) is used as the optimizer. The learning rate is set to 0.0001, and linear decay is applied. The hyperparameters $\beta_1$ and $\beta_2$ are set to 0.0 and 0.9, respectively. We fix $\epsilon^2$ at 0.5 in all experiments. The CoA-ratio is set to 1.5, nz is set to 128, and the batch size is 512. For the decoder, we adopt the widely used networks in DCGAN (Radford et al., 2015), SNGAN (Miyato et al., 2018), and CTRL (Dai et al., 2022). The details can be found in Table 5, Table 9, and Table 10. For the encoder, we apply the deep nets 18-layer pre-activation ResNet (ResNet18), 50-layer pre-activation ResNet (ResNet50), and 101-layer pre-activation ResNet (ResNet101) to verify CoA-CTRL's stability and robustness to deep nets.

For ResNet18, ResNet50, and ResNet101 in this paper, we use pre-activation (He et al., 2016b) and average pooling for downsampling, which contributes to better feature extraction. For STL-10 and Celeb-A, we add a downsampling step at the first ResBlock of ResNet18. Spectral normalization (Miyato et al., 2018), batch normalization (Ioffe & Szegedy, 2015), and other regularization techniques are not applied; instead, simple, standard convolution layers without constraints are employed.

We run 10,000 iterations on MNIST, 100,000 iterations on CIFAR-10, and 150,000 iterations on STL-10 and Celeb-A. We resize MNIST to 32 × 32, STL-10 to 48 × 48, and Celeb-A to 64 × 64.

