COOPERATIVE ADVERSARIAL LEARNING VIA CLOSED-LOOP TRANSCRIPTION

Abstract

Generative models based on the adversarial process are sensitive to network architectures and difficult to train. This paper proposes a generative model that implements cooperative adversarial learning via closed-loop transcription. During training, the encoder and decoder are trained simultaneously through not only an adversarial process but also a cooperative process. In the adversarial process, the encoder acts as a critic to maximize the distance between the original and transcribed images, where the distance is measured by rate reduction in the feature space; in the cooperative process, the encoder and decoder jointly minimize this distance to improve transcription quality. Cooperative adversarial learning combines the concepts and properties of Auto-Encoding and GAN, and it is unique in that the encoder actively controls the training process, as it is trained in both learning processes in two different roles. Experiments demonstrate that, without regularization techniques, our generative model is robust to network architectures and easy to train, that sample-wise reconstruction performs well in terms of sample features, and that disentangled visual attributes are well modeled in independent principal components.

1. INTRODUCTION

The minimax game provides an unsupervised learning method that is widely used in generative models such as generative adversarial nets (GAN) (Goodfellow et al., 2014; Chen et al., 2016; Radford et al., 2015) and the recently proposed closed-loop transcription framework (CTRL) (Dai et al., 2022). Generative modeling based on the two-player minimax game faces several problems: instability of the training process, difficulty in maintaining the balance between the discriminator and the generator (as in GAN) or between the encoder and the decoder (as in CTRL), and sensitivity to network architectures (He et al., 2016a;b). Maintaining balance and stability in the adversarial process has attracted much attention. The mainstream approach is to constrain the discriminator (Kurach et al., 2019). Various regularization techniques have been proposed, such as weight normalization (Salimans & Kingma, 2016), weight clipping (Arjovsky et al., 2017), gradient penalty (Gulrajani et al., 2017), spectral normalization (Miyato et al., 2018), and adversarial Lipschitz regularization (Terjék, 2019). Different from these mainstream regularization methods, this paper considers the feasibility of letting the discriminator actively adapt to the rhythm of the generator. Maintaining balance in adversarial generative models is difficult because the generator and the discriminator merely play against each other, and the balance breaks sooner or later once the discriminator learns faster than the generator. In contrast, generative models based on Auto-Encoding, such as the variational Auto-Encoder (VAE) (Kingma & Welling, 2013; Lopez et al., 2018), tend to be stable and do not face instability and collapse problems. The reason is that the encoder and decoder in the Auto-Encoding framework learn and update themselves cooperatively, working in the same direction to improve reconstruction quality and reduce data dimensions.
In short, the models work cooperatively rather than against each other. Inspired by this idea, this paper attempts to combine cooperative learning and adversarial learning in a generative model, and proposes a generative model via cooperative adversarial learning (CoA-CTRL). CoA-CTRL employs the closed-loop transcription framework (CTRL) proposed by (Dai et al., 2022; Ma et al., 2022) and naturally combines the learning strategies of the adversarial and cooperative processes. First, like the discriminator in GAN, the encoder in CoA-CTRL acts as a critic to maximize the feature distance between the real data and the transcribed data. Second, consistent with Auto-Encoding, the encoder and decoder cooperatively minimize the difference between the real data and the transcribed data. Confrontation and cooperation between the two alternate, which actively keeps the system in balance.

2. RELATED WORK

Auto-Encoding and its variants. Auto-Encoding is a typical neural network for representation learning and data dimension reduction (Kramer, 1991; Hinton & Zemel, 1993; Hinton & Salakhutdinov, 2006). Auto-Encoding learns the encoder E_θ and the decoder D_η simultaneously, as demonstrated by equation (1). Generally, Auto-Encoding learns from the L2 pixel-wise distance:

min_{θ,η} L(θ, η) = (1/N) Σ_{i=1}^{N} ||x_i − D_η(E_θ(x_i))||²₂    (1)

Generative adversarial nets (GAN). Generative adversarial nets (GAN) provide a generative model based on the adversarial process (Goodfellow et al., 2014; Chen et al., 2016). GAN consists of a discriminator and a generator: the discriminator evaluates the generated images, and the generator tries to fool the discriminator. The two networks are trained through a two-player minimax game over the value function V(η, θ) shown in equation (2), where G(η) and D(θ) denote the generator and the discriminator, respectively:

min_η max_θ V(η, θ) = E_{x∼p(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]    (2)

MCR² and CTRL. Recently, Chan et al. (2022) and Yu et al. (2020) proposed a new learning objective, the principle of maximal coding rate reduction (MCR²), which learns the low-dimensional intrinsic structures of high-dimensional data and obtains discriminative representations between classes. An encoder f(x, θ) maps the high-dimensional data X to the low-dimensional features Z. As shown in equation (3), MCR² measures the compactness of the learned features Z with the coding rate R(Z, ε), subject to the distortion ε, and the rate reduction ∆R measures distance in the feature space. In the special case of two classes, as shown in equation (6), there are two feature sets Z and Ẑ. The distance between Z and Ẑ can be measured by the coding rate reduction ∆R(Z, Ẑ), that is, the difference between the coding rate of Z ∪ Ẑ and the average of their individual coding rates (denoted R_c).
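The two objectives in equations (1) and (2) can be illustrated with a small NumPy sketch; the linear encoder/decoder and the discriminator outputs below are hypothetical stand-ins for illustration only, not the networks used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples of dimension D, features of dimension d.
N, D, d = 8, 16, 4
X = rng.normal(size=(N, D))

# Hypothetical linear stand-ins for the encoder E_theta and decoder D_eta.
W_enc = rng.normal(size=(D, d)) / np.sqrt(D)
W_dec = rng.normal(size=(d, D)) / np.sqrt(d)

def autoencoder_loss(X):
    """Equation (1): mean squared L2 reconstruction error ||x - D_eta(E_theta(x))||^2."""
    X_rec = (X @ W_enc) @ W_dec
    return np.mean(np.sum((X - X_rec) ** 2, axis=1))

def gan_value(d_real, d_fake):
    """Equation (2): V = E[log D(x)] + E[log(1 - D(G(z)))],
    given discriminator outputs on real and generated samples."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

print(autoencoder_loss(X))                           # positive reconstruction error
print(gan_value(np.full(N, 0.9), np.full(N, 0.1)))   # confident discriminator
```

Note that when the discriminator outputs 0.5 everywhere (maximal confusion), the value function reduces to 2·log(0.5), the equilibrium value of the original GAN game.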
R(Z, ε) = (1/2) log det(I + α Z Z*)    (3)

X --f(x,θ)--> Z --g(z,η)--> X̂ --f(x,θ)--> Ẑ,   h(x, θ, η) = f ∘ g ∘ f    (4)

h(x, θ, η) = f(g(f(x, θ), η), θ)    (5)

∆R(Z, Ẑ) = R(Z ∪ Ẑ) − (1/2)(R(Z) + R(Ẑ)),   with R_c = (1/2)(R(Z) + R(Ẑ))    (6)

min_η max_θ T(θ, η) = ∆R(f(X, θ), h(X, θ, η)) = ∆R(Z(θ), Ẑ(θ, η))    (7)

where X ∈ R^{D×n} refers to the data samples, Z ∈ R^{d×n} refers to the features, and α = d/(nε²).

CTRL (Dai et al., 2022) provides a closed-loop framework based on MCR², consisting of an encoder f(x, θ) and a decoder g(z, η). As equation (7) shows, CTRL transcribes data by minimaxing the coding rate reduction, in which h(x, θ, η) captures the closed-loop map demonstrated by equations (4) and (5). The first segment (x → z → x̂) in (4) resembles Auto-Encoding, and the second segment (z → x̂ → ẑ) resembles GAN. While GAN generates images from random Gaussian noise, in CTRL, as (4) displays, the decoder g(z, η) maps from the features Z (which are encoded from X), and the encoder f(x, θ) then maps X̂ to the features Ẑ. The distance between X and X̂ is measured by the rate reduction ∆R(Z, Ẑ) in the feature space.
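The quantities in equations (3), (6), and (7) can be sketched in a few lines of NumPy. This is an illustrative sketch assuming α = d/(nε²) as in the MCR² formulation; the choice ε = 0.5 and the toy linear maps standing in for f and g are arbitrary:

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    # Equation (3): R(Z, eps) = 1/2 * logdet(I + alpha * Z Z^T), alpha = d / (n * eps^2),
    # where Z is a (d, n) matrix of d-dimensional features for n samples.
    d, n = Z.shape
    alpha = d / (n * eps ** 2)
    return 0.5 * np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)[1]

def rate_reduction(Z, Z_hat, eps=0.5):
    # Equation (6): Delta R(Z, Z_hat) = R(Z u Z_hat) - (R(Z) + R(Z_hat)) / 2.
    union = np.concatenate([Z, Z_hat], axis=1)
    return coding_rate(union, eps) - 0.5 * (coding_rate(Z, eps) + coding_rate(Z_hat, eps))

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 32))       # features of real samples
Z_hat = rng.normal(size=(4, 32))   # features of transcribed samples

print(rate_reduction(Z, Z))        # identical feature sets: zero rate reduction
print(rate_reduction(Z, Z_hat))    # distinct feature sets: positive rate reduction

# Closed-loop map h = f o g o f from equations (4)-(5), with toy linear stand-ins.
F = rng.normal(size=(4, 16)) / 4.0   # encoder f
G = rng.normal(size=(16, 4)) / 2.0   # decoder g
X = rng.normal(size=(16, 32))        # data, one sample per column
Z2 = F @ X                           # Z = f(X)
Z2_hat = F @ (G @ Z2)                # Z_hat = f(g(f(X)))
print(rate_reduction(Z2, Z2_hat))    # the objective T(theta, eta) in equation (7)
```

Note that ∆R(Z, Ẑ) vanishes when the two feature sets coincide and is nonnegative otherwise, which is what makes it usable as a distance for the minimax game in equation (7).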

