CLR-GAN: Improving GANs Stability and Quality via Consistent Latent Representation and Reconstruction

Conference: ECCV 2024
Code: https://github.com/Petecheco/CLR-GAN
Area: Others
Keywords: GAN training stability, latent space consistency, Generative Adversarial Networks, image generation quality, discriminator reconstruction

TL;DR

This paper proposes the CLR-GAN training paradigm. The discriminator is required to recover the latent code given to the generator, and the generator is required to reconstruct real input images, which establishes consistency constraints between the latent spaces of G and D. The authors report that this makes GAN training fairer and more stable, improving FID by 31.22% on CIFAR-10 and 39.5% on AFHQ-Cat.

Background & Motivation

Background: Generative Adversarial Networks (GANs) have received widespread attention for their image generation capabilities. Representative methods include the StyleGAN series, BigGAN, and ProjectedGAN. GAN training is essentially a game between the generator G and the discriminator D: G tries to generate realistic images to fool D, while D tries to distinguish real from fake. Although various improvements (spectral normalization, progressive training, regularization, etc.) have been introduced in recent years, training instability remains the core bottleneck.

Limitations of Prior Work: The core difficulty of GAN training lies in the unfair game between G and D. In the standard GAN framework, D has explicit supervision signals (real/fake labels), while G can only learn indirectly through the gradients of D. This asymmetry leads to several specific problems: (1) D can easily become too powerful, leading to generator collapse (mode collapse); (2) when G is too powerful, D loses the ability to provide useful gradients; (3) the training process is extremely sensitive to hyperparameters, requiring extensive tuning for different datasets and architectures.

Key Challenge: In traditional GAN training, G and D are two independent adversarial networks with no structural constraints between them. The real/fake probability output by D discards a large amount of structural information about the generated and real images. The authors identify this wasted information, together with the asymmetry between G and D, as the root cause of training instability.

Goal: (1) How to establish a closer, more symmetrical relationship between G and D during the training process; (2) How to leverage the feature space of D to provide richer training signals.

Key Insight: The authors propose that G and D can be viewed as inverse processes of each other. If G maps the latent code \(z\) to the image \(x\), then D should be able to recover \(z\) from the image \(x\); conversely, if D can map real images to a certain feature space, G should also be able to reconstruct those real images. Through this mutual inverse relationship, consistency constraints are established between the latent spaces of G and D.

Core Idea: By requiring D to perform the extra task of recovering G's input latent code, and G to perform the extra task of reconstructing the real images observed by D, the latent space consistency constraint is leveraged to train G and D on equal footing.

Method

Overall Architecture

On top of standard GAN training, CLR-GAN adds two tasks: (1) Latent code restoration: given a fake image generated by G, D outputs the real/fake decision and also restores the latent code \(z\) that G used; (2) Real image reconstruction: G receives intermediate features (or another representation) from D and reconstructs the corresponding real image. Together, these two complementary tasks establish a bidirectional mapping between the latent spaces of G and D.

The overall training process maintains the alternating training framework of GANs: Loss of D = Discrimination Loss + Latent Restoration Loss; Loss of G = Adversarial Loss + Reconstruction Loss + Consistency Loss.
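
The two objectives can be written compactly as follows. This is a sketch only: the weights \(\lambda_z\), \(\lambda_{rec}\), \(\lambda_{cyc}\) and the symbols \(\mathcal{L}^{adv}\), \(\mathcal{L}_{rec}\), \(\mathcal{L}_{cyc}\) are notation assumed here for illustration and are not taken from the paper.

\[
\mathcal{L}_D = \mathcal{L}_D^{adv} + \lambda_z \,\| D_{proj}(G(z)) - z \|_2,
\qquad
\mathcal{L}_G = \mathcal{L}_G^{adv} + \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{cyc}\,\mathcal{L}_{cyc}
\]

Here \(D_{proj}\) denotes the latent-restoring output of D, \(\mathcal{L}_{rec}\) the pixel loss on reconstructed real images, and \(\mathcal{L}_{cyc}\) the cycle consistency term.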

Key Designs

  1. Latent Code Restoration:

    • Function: Enabling the discriminator D to recover the input latent code \(z\) from generated images.
    • Mechanism: An additional projection head is added to the end of the D network to output a vector \(z'\) with the same dimension as \(z\). During training, the restoration loss \(\|D_{proj}(G(z)) - z\|\) is calculated on the generated image \(G(z)\), requiring the latent code recovered by D to be consistent with the original latent code used by G. This forces D not only to distinguish real from fake but also to understand the generation process and the latent space structure of G.
    • Design Motivation: In standard GANs, D only outputs a scalar (real/fake), discarding a large amount of structural information about the images. Latent code restoration forces D to retain more structural details, providing richer gradient signals for G. Meanwhile, D must understand the latent space of G to complete the restoration task, which implicitly prevents D from performing "lazy" discrimination.
  2. Real Image Reconstruction:

    • Function: Enabling the generator G to reconstruct real images from the features/encodings of D.
    • Mechanism: The features from the intermediate layers of D are mapped to the input space of G through a projection layer, and then fed into G for reconstruction. The reconstruction loss constrains the output of G to be consistent with the original real image. This requires G to generate images not only from random latent codes but also to reconstruct original images from real image features extracted by D.
    • Design Motivation: In standard GANs, G only learns the mapping from the latent space to the image space without direct contact with real images. Requiring G to reconstruct real images provides direct pixel-level supervision, helping it learn more realistic image features. It also builds a bridge mapping the feature space of D to the input space of G.
  3. Consistency Criterion:

    • Function: Ensuring a consistent bidirectional mapping relationship exists between the latent spaces of G and D.
    • Mechanism: Based on the bidirectional mapping established by the two tasks (G\(\rightarrow\)D latent code restoration + D\(\rightarrow\)G reconstruction), a consistency constraint is imposed: feeding the restored latent code from D back into G should reconstruct the original image, and feeding the reconstructed result of G back into D should restore the correct latent code. This forms a cycle consistency constraint, similar to the concept of CycleGAN but applied in the latent space.
    • Design Motivation: Separate restoration and reconstruction tasks might independently find different mapping relationships. The consistency constraint ensures these two mapping relations are mutually inverse, establishing a tight, structured relationship between G and D.
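
The three auxiliary terms described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: `G` and `D_proj` are stand-in linear maps in plain Python, the name `auxiliary_losses` and the example vectors are hypothetical, and squared \(L_2\) for latent terms with \(L_1\) for pixel terms is just one of the choices this summary mentions.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def l2_sq(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def l1(a, b):
    """L1 distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Toy stand-ins (assumptions): G lifts a 2-d latent code to a 3-d "image",
# D_proj maps a 3-d "image" back to a 2-d code in G's input space.
W_G = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_D = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

def G(z):
    return matvec(W_G, z)

def D_proj(x):
    return matvec(W_D, x)

def auxiliary_losses(z, x_real):
    x_fake = G(z)
    z_restored = D_proj(x_fake)   # D recovers the latent code G used
    code_real = D_proj(x_real)    # D's projected representation of a real image
    x_rec = G(code_real)          # G reconstructs the real image from it
    return {
        # restored latent code should match the original latent code
        "restoration": l2_sq(z_restored, z),
        # reconstructed image should match the real image
        "reconstruction": l1(x_rec, x_real),
        # both cycles should close: G(D(G(z))) ~ G(z) and D(G(D(x))) ~ D(x)
        "consistency": l1(G(z_restored), x_fake) + l2_sq(D_proj(x_rec), code_real),
    }

losses = auxiliary_losses(z=[0.5, -1.0], x_real=[1.0, 2.0, 0.25])
print(losses)
```

In this toy setup `D_proj` is an exact left inverse of `G`, so the restoration and consistency terms vanish, while the reconstruction term is nonzero only because the projection drops the third coordinate of the real sample. In a real network each term would be computed on batches and backpropagated through the discriminator or generator as appropriate.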

Loss & Training

The total loss consists of four parts:

  • Adversarial loss: Standard GAN loss (hinge loss, non-saturating loss, etc.).
  • Latent code restoration loss: \(L_{2}\) distance constraint ensuring agreement between the latent code recovered by D and the true latent code.
  • Reconstruction loss: \(L_{1}\) or \(L_{2}\) pixel loss constraining the quality of the real images reconstructed by G.
  • Consistency loss: Cycle consistency constraints.
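
For reference, the two adversarial losses named above have the following standard forms. The sketch below uses plain Python on lists of discriminator logits; the function names are hypothetical, and this summary does not state which variant the paper uses for each architecture.

```python
import math

def mean(values):
    return sum(values) / len(values)

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def hinge_d_loss(real_logits, fake_logits):
    # D is penalized when real logits fall below +1 or fake logits rise above -1.
    real_term = mean([max(0.0, 1.0 - r) for r in real_logits])
    fake_term = mean([max(0.0, 1.0 + f) for f in fake_logits])
    return real_term + fake_term

def hinge_g_loss(fake_logits):
    # G raises D's logits on generated samples.
    return -mean(fake_logits)

def non_saturating_d_loss(real_logits, fake_logits):
    # Equivalent to binary cross-entropy on logits with labels real=1, fake=0.
    return mean([softplus(-r) for r in real_logits]) + mean([softplus(f) for f in fake_logits])

def non_saturating_g_loss(fake_logits):
    # G maximizes log D(G(z)) instead of minimizing log(1 - D(G(z))).
    return mean([softplus(-f) for f in fake_logits])

print(hinge_d_loss([2.0, 0.5], [-2.0, 0.0]))
print(hinge_g_loss([-2.0, 0.0]))
```

In a CLR-GAN-style setup, the restoration, reconstruction, and consistency terms would be added to whichever of these adversarial losses the base GAN already uses.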

Training maintains the standard alternating update of D and G. The CLR-GAN paradigm is architecture-agnostic and can be applied to different GAN architectures (such as DCGAN, StyleGAN, etc.).

Key Experimental Results

Main Results

| Dataset | Metric (FID↓) | Ours | Baseline GAN | Gain |
|---|---|---|---|---|
| CIFAR-10 | FID | Significant Improvement | Baseline | 31.22% |
| AFHQ-Cat | FID | Significant Improvement | Baseline | 39.5% |
| Other Datasets | FID | Significant Improvement | Baseline | Consistent Improvement |

On multiple datasets and various GAN architectures, CLR-GAN consistently brings significant FID improvements.
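
Since FID is lower-is-better, the Gain column is most naturally read as a relative reduction with respect to the baseline; that reading is an assumption, as this summary does not define it. The small sketch below shows the arithmetic. The FID values in it are hypothetical and chosen only to reproduce the percentage; they are not the paper's numbers.

```python
def relative_fid_gain(fid_baseline, fid_ours):
    """Relative FID reduction in percent (positive means improvement)."""
    return 100.0 * (fid_baseline - fid_ours) / fid_baseline

# Hypothetical values for illustration only, not results from the paper.
example_baseline = 10.0
example_ours = 6.878
gain = relative_fid_gain(example_baseline, example_ours)
print(round(gain, 2))
```

Under this reading, a 31.22% gain means the improved model's FID is about 0.69 times the baseline's FID, whatever the absolute values are.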

Ablation Study

| Configuration | FID | Description |
|---|---|---|
| Full CLR-GAN | Best | All components work together |
| W/o Latent Code Restoration | Degraded | Latent code restoration is the core contribution |
| W/o Real Reconstruction | Degraded | Reconstruction task provides crucial extra supervision |
| W/o Consistency Constraint | Degraded | Consistency constraint ensures coordination of bidirectional mapping |
| Different GAN Architectures | Consistent improvement | Validation of the architecture-agnostic nature of the method |

Key Findings

  • CLR-GAN is effective across datasets of different scales and GAN architectures of varying complexities.
  • The training process is more stable, with reduced sensitivity to hyperparameters like the learning rate.
  • The diversity of generated images is also enhanced (not only better quality, but also reduced mode collapse).
  • The increase in computational overhead is minimal, since the additional projection heads and reconstruction task add few parameters relative to the entire network.

Highlights & Insights

  • Novel Perspective: Viewing G and D as mutual inverse processes rather than simple adversaries, and redefining the training objectives of GANs via latent space consistency, is an elegant formulation.
  • Architecture-Agnostic: Implemented as a training paradigm rather than a specific architecture, it can be applied plug-and-play to boost the performance of existing GANs.
  • Significant Performance: Achieves 31.22% and 39.5% FID improvements on CIFAR-10 and AFHQ-Cat respectively, demonstrating notable effectiveness.
  • Training Stability: Mitigates the training instability of GANs effectively by introducing additional structural constraints.

Limitations & Future Work

  • The additional restoration and reconstruction tasks introduce new hyperparameters (weights for each loss term) that require tuning.
  • Latent code restoration might become difficult in high-dimensional latent spaces (such as the W+ space in StyleGAN).
  • The computational overhead of the reconstruction task on high-resolution images warrants attention.
  • Compared to recent Diffusion Models, the improved GANs may still have a gap in image quality.
  • One could explore extending the concept of CLR-GAN to conditional generation tasks (e.g., text-to-image).

Related Work & Insights

  • GAN Training Stability: Common methods like spectral normalization and gradient penalty focus on the regularization of D, while CLR-GAN approaches from the perspective of the G-D relationship.
  • BiGAN/ALI: Early works also explored encoder-decoder symmetry in GANs, but the consistency constraint in CLR-GAN is more explicit.
  • CycleGAN: Cycle consistency has been proven effective in image-to-image translation; CLR-GAN introduces it into the latent space.
  • Insights: The asymmetry problem in adversarial training also exists in other adversarial frameworks, making CLR-GAN's mutual inverse constraint concept potentially widely applicable.

Rating

  • Novelty: ⭐⭐⭐⭐ Treating G and D as mutually inverse processes is a novel perspective, and the latent space consistency constraint is elegantly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated across multiple datasets and architectures with thorough ablations.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation and well-described methodology.
  • Value: ⭐⭐⭐ In the era dominated by diffusion models, the marginal value of GAN improvements is somewhat reduced, but the idea itself remains highly inspiring.