FreeAugment: Data Augmentation Search Across All Degrees of Freedom

Conference: ECCV 2024
arXiv: 2409.04820
Code: https://tombekor.github.io/FreeAugment-web
Area: Others
Keywords: Data Augmentation, AutoML, Differentiable Optimization, Gumbel-Sinkhorn, Bi-level Optimization

TL;DR

This paper proposes FreeAugment, the first fully differentiable search method capable of simultaneously and globally optimizing the four degrees of freedom (number, type, sequence, and magnitude of transformations) of data augmentation strategies. By learning the depth distribution via Gumbel-Softmax and the permutation distribution via Gumbel-Sinkhorn to avoid duplicate sampling, it achieves state-of-the-art (SOTA) performance across multiple benchmarks.

Background & Motivation

Data augmentation is a core technique for improving generalization in deep learning, yet different tasks and domains require different augmentation strategies. Data augmentation search (DAS) aims to reduce the engineering burden of manually designing augmentation pipelines.

Existing DAS methods suffer from key limitations:

  • Limited depth: Most methods (such as AutoAugment, DADA, and DDAS) fix the policy depth to 2, failing to explore richer combinations of transformations.
  • Independent sampling leading to duplication: Methods such as SLACK and DRA use independent Gumbel-Softmax sampling at each layer, which may select the same transformation more than once.
  • Non-global optimization: Although DeepAA breaks the two-layer limitation, it adopts a greedy layer-by-layer stacking strategy, which is not globally optimal.
  • Exhaustive depth search: Some methods enumerate candidate depth values, which is highly inefficient.

Core Motivation: Can we simultaneously optimize all four degrees of freedom of the augmentation strategy—namely, the number of transformations (depth), transformation type, transformation sequence, and transformation magnitude—and perform end-to-end optimization within a fully differentiable framework?

Method

Overall Architecture

FreeAugment models the data augmentation strategy as a probabilistic model \(\mathcal{P}_\phi\), where the parameters \(\phi = (\delta, \Pi, \mu)\) control three groups of degrees of freedom: depth, type and order, and magnitude. Given an input image \(X_0\), a policy is sampled as follows:

  1. Sample a depth one-hot vector \(\mathbf{d}\) from a Gumbel-Softmax distribution.
  2. Sample a permutation matrix \(\mathbf{P}\) from a Gumbel-Sinkhorn distribution.
  3. Sample a magnitude matrix \(\mathbf{M}\) from a parameterized uniform distribution.
  4. Apply the transformations sequentially across \(K\) augmentation layers, then mix the intermediate images via a weighted sum using the depth vector.

The final augmented image is: \(X' = \sum_{k=0}^{K} \mathbf{d}_k \cdot X_k\)
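The mixing step can be illustrated with a minimal sketch. It assumes, following the formula above, that \(\mathbf{d}\) has \(K+1\) entries indexed from 0 (depth 0 being the unaugmented input); the `apply_policy` helper and the toy list-based "image" and transformations are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, List

Image = List[float]  # toy stand-in for an image tensor


def apply_policy(x0: Image, layers: List[Callable[[Image], Image]], d: List[float]) -> Image:
    """Return sum_k d[k] * X_k, where X_k is x0 after the first k layers."""
    assert len(d) == len(layers) + 1  # d covers depths 0..K
    xk = list(x0)
    out = [d[0] * v for v in xk]  # depth 0: the unaugmented input
    for k, layer in enumerate(layers, start=1):
        xk = layer(xk)  # X_k = layer_k(X_{k-1})
        out = [o + d[k] * v for o, v in zip(out, xk)]
    return out


# Hypothetical toy transformations.
brighten = lambda x: [v + 0.1 for v in x]
invert = lambda x: [1.0 - v for v in x]
flip = lambda x: x[::-1]

x0 = [0.0, 0.5, 1.0]
d = [0.0, 0.0, 1.0, 0.0]  # one-hot depth vector selecting depth 2
result = apply_policy(x0, [brighten, invert, flip], d)
```

With a one-hot \(\mathbf{d}\), the weighted sum simply selects the image produced after the chosen number of layers, here `invert(brighten(x0))`.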

Key Designs

  1. The First Degree of Freedom — Depth Search (Gumbel-Softmax): A Gumbel-Softmax distribution is induced via a learnable logits vector \(\delta\) to represent the probability distribution of the policy depth. Each \(\delta_k\) denotes the unnormalized log-probability of choosing depth \(k\):

    \(\mathbb{P}(d_k = 1 \mid t) = \frac{e^{(\delta_k + g_k)/t}}{\sum_{i=1}^{K} e^{(\delta_i + g_i)/t}}\)

A Straight-Through (ST) gradient estimator is utilized to achieve discrete forward and continuous backward gradient propagation, making the depth distribution learnable. The maximum depth is set to \(K=7\) in the experiments.
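As a rough sketch of the sampling step, the snippet below draws one relaxed sample from the logits and takes its argmax as the one-hot forward value. It uses only the Python standard library, so it shows the forward pass only; in an autodiff framework the straight-through trick would return the hard vector while backpropagating through the soft one (PyTorch's `torch.nn.functional.gumbel_softmax` with `hard=True` works this way). The function name and the zero-initialized logits are assumptions made for illustration.

```python
import math
import random


def gumbel_softmax_sample(logits, t, rng):
    """Return (soft, hard): relaxed probabilities and their one-hot argmax."""
    # Gumbel(0, 1) noise via the inverse CDF: g = -log(-log(u)), u ~ Uniform(0, 1).
    noisy = [
        (l - math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12)))) / t
        for l in logits
    ]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    soft = [e / z for e in exps]
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard


rng = random.Random(0)
delta = [0.0] * 7  # assumed initialization: one logit per depth, K = 7
soft, hard = gumbel_softmax_sample(delta, t=1.0, rng=rng)
```

Lower temperatures push `soft` toward `hard`, which is why annealing the temperature gradually sharpens the learned depth distribution.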

  2. The Second and Third Degrees of Freedom - Type and Sequence Search (Gumbel-Sinkhorn): The search for transformation types and sequential order is unified as learning a permutation distribution mapping \(N\) types of transformations to \(K\) augmentation layers. The core idea is to learn an \(N \times K\) logit matrix \(\Pi\), and sample the permutation matrix through the following steps:

    • Pad \(\Pi\) to an \(N \times N\) square matrix \(\bar{\Pi}\) using minimal negative values.
    • Perform \(L\) Sinkhorn iterations on the perturbed matrix to obtain a doubly stochastic matrix (DSM).
    • Crop the first \(K\) columns to obtain \(\mathbf{P} = \bar{\mathbf{P}}_{1:N, 1:K}\).

    \(\bar{\mathbf{P}} = S^L((\bar{\Pi} + G) / t)\)

The Sinkhorn operation normalizes both rows and columns, so the probability mass of each transformation is shared across layers instead of being drawn independently per layer. This structurally discourages sampling the same transformation twice and is one of the central contributions of the work: the duplicate sampling rate drops by roughly an order of magnitude compared with independent Gumbel-Softmax.
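The pad, perturb, normalize, and crop steps can be sketched in plain Python as follows. This is a log-space Sinkhorn written for illustration under stated assumptions: the padding constant `pad_value` is a placeholder (the summary does not give the value used), and the function names are hypothetical rather than taken from the authors' code.

```python
import math
import random


def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def gumbel_sinkhorn(logits, t, n_iters, rng, pad_value=-10.0):
    """logits: N x K nested list (N >= K). Returns an N x K relaxed permutation."""
    n, k = len(logits), len(logits[0])
    # 1) Pad to N x N with a constant negative logit (assumed placeholder value).
    square = [list(row) + [pad_value] * (n - k) for row in logits]

    # 2) Perturb with Gumbel(0, 1) noise and divide by the temperature.
    def gumbel():
        return -math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12)))

    log_p = [[(v + gumbel()) / t for v in row] for row in square]

    # 3) Sinkhorn iterations in log space: normalize rows, then columns.
    for _ in range(n_iters):
        row_lse = [logsumexp(row) for row in log_p]
        log_p = [[v - row_lse[i] for v in row] for i, row in enumerate(log_p)]
        col_lse = [logsumexp([log_p[i][j] for i in range(n)]) for j in range(n)]
        log_p = [[log_p[i][j] - col_lse[j] for j in range(n)] for i in range(n)]

    # 4) Crop the first K columns and leave log space.
    return [[math.exp(log_p[i][j]) for j in range(k)] for i in range(n)]


rng = random.Random(0)
N, K = 5, 3
Pi = [[0.0] * K for _ in range(N)]  # assumed uniform initialization
P = gumbel_sinkhorn(Pi, t=0.5, n_iters=20, rng=rng)
```

Each of the \(K\) columns of `P` sums to 1 (one transformation per layer), while each row sums to at most about 1 once the iterations have converged, which is what keeps one transformation from dominating several layers at once.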

  3. The Fourth Degree of Freedom - Magnitude Search (Differentiable Uniform Distribution): The magnitude of each transformation \(\tau_i\) at the \(k\)-th layer is sampled from a parameterized uniform distribution. Differentiability is achieved via the reparameterization trick:

    \(M_{ik} = [\sigma(h_{ik}) - \sigma(l_{ik})] \cdot \epsilon + \sigma(l_{ik}), \quad \epsilon \sim \text{Uniform}(0,1)\)

The sigmoid function constrains both bounds to \((0,1)\), and the lower and upper bound parameters \(l_{ik}, h_{ik}\) are learned through backpropagation. For non-differentiable transformations (e.g., Posterize, Solarize), the ST estimator is used.
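A minimal sketch of the reparameterized sample, using only the standard library. The names `sample_magnitude`, `l`, and `h` are illustrative (with `h` standing for the upper-bound parameter in the formula above); only the forward computation is shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_magnitude(l, h, rng):
    """M = (sigmoid(h) - sigmoid(l)) * eps + sigmoid(l), with eps ~ Uniform(0, 1)."""
    eps = rng.random()  # randomness is independent of the learnable parameters
    lo, hi = sigmoid(l), sigmoid(h)
    return (hi - lo) * eps + lo


rng = random.Random(0)
l, h = -1.0, 2.0  # hypothetical bound parameters for one (transformation, layer) pair
samples = [sample_magnitude(l, h, rng) for _ in range(1000)]
```

Because \(\epsilon\) does not depend on the parameters, the sample is differentiable in both bounds: \(\partial M / \partial h = \sigma'(h)\,\epsilon\) and \(\partial M / \partial l = \sigma'(l)\,(1 - \epsilon)\).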

Loss & Training

A bi-level optimization framework is adopted:

  • Outer loop (policy level): Minimize the cross-entropy loss \(\mathcal{L}_{val}(\theta^*(\phi))\) on the validation set.
  • Inner loop (model level): Train the model parameters \(\theta\) on the augmented training set.

A single-step approximation is used to alternately optimize \(\theta\) and \(\phi\). Although policy parameter updates involve second-order derivatives, the dimension of \(\phi\) is much smaller than \(\theta\), making the computation feasible.
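To show where the second-order term comes from, here is one common form of the single-step approximation, in the style of differentiable architecture search. Whether FreeAugment uses exactly this update is an assumption; the step sizes \(\eta\) and \(\beta\) and the training loss symbol \(\mathcal{L}_{train}\) are introduced here for illustration only.

    \(\theta'(\phi) = \theta - \eta \, \nabla_\theta \mathcal{L}_{train}(\theta, \phi)\)

    \(\phi \leftarrow \phi - \beta \, \nabla_\phi \mathcal{L}_{val}(\theta'(\phi))\)

Since the validation loss depends on \(\phi\) only through \(\theta'\), the chain rule gives

    \(\nabla_\phi \mathcal{L}_{val}(\theta'(\phi)) = -\eta \, \nabla^2_{\phi\theta} \mathcal{L}_{train}(\theta, \phi) \, \nabla_{\theta'} \mathcal{L}_{val}(\theta')\)

The mixed second derivative has one row per policy parameter, so a small \(\phi\) keeps this product inexpensive.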

Key training details:

  • Around 10% of the dataset is used for searching, divided 50/50 into training and validation sets.
  • Progressive warm-up of the three sets of parameters: magnitude (50 epochs), type (65 epochs), and depth (80 epochs).
  • The temperature is annealed exponentially from 1.0 to 0.5, with \(L=20\) Sinkhorn iterations.
  • The augmentation policy is sampled independently for each image to reduce gradient variance.
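One plausible reading of the exponential temperature schedule is geometric interpolation between the two endpoints. The sketch below assumes that form; the function name, the `total_epochs` argument, and the example value of 100 epochs are hypothetical, since the summary only states the start and end temperatures.

```python
def temperature(epoch, total_epochs, t_start=1.0, t_end=0.5):
    """Geometrically interpolate from t_start (epoch 0) to t_end (last epoch)."""
    frac = epoch / max(total_epochs - 1, 1)
    return t_start * (t_end / t_start) ** frac


schedule = [temperature(e, 100) for e in range(100)]
```

Under this form the temperature shrinks by the same ratio every epoch, so the relaxed samples sharpen steadily rather than abruptly.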

Key Experimental Results

Main Results

| Dataset | Model | Metric | FreeAugment | Best Baseline (Acc) | Gain |
|---|---|---|---|---|---|
| CIFAR-10 | WRN-40-2 | Top-1 Acc | 96.54 | SLACK (96.29) | +0.25 |
| CIFAR-10 | WRN-28-10 | Top-1 Acc | 97.66 | DeepAA (97.56) | +0.10 |
| CIFAR-100 | WRN-40-2 | Top-1 Acc | 80.04 | SLACK (79.87) | +0.17 |
| ImageNet-100 | ResNet-18 | Top-1 Acc | 86.62 | SLACK (86.19) | +0.43 |
| DomainNet Avg | ResNet-18 | Top-1 Acc | 62.93 | TA(Wide) (61.71) | +1.22 |

The method achieves optimal or near-optimal performance across all six sub-domains of DomainNet, demonstrating its strong cross-domain generalization capability.

Ablation Study

| Configuration | Top-1 Acc (CIFAR-100, WRN-40-2) | Description |
|---|---|---|
| Freeze uniform magnitude | 79.64 | Magnitude is not learned |
| Freeze uniform type & order | 79.54 | Permutation is not learned |
| Freeze uniform depth | 79.61 | Depth is not learned |
| Joint Learning (FreeAugment) | 80.04 | All degrees of freedom are jointly optimized |

Key Findings

  • Joint optimization outperforms freezing any single degree of freedom: Each degree of freedom makes an independent contribution to the final performance.
  • Variable depth outperforms fixed depth: The learned depth distribution yields higher performance than any fixed depth value.
  • Gumbel-Sinkhorn significantly reduces duplication: Compared with independent Gumbel-Softmax, the duplicate-transformation sampling rate drops roughly tenfold, and performance saturates at \(L=20\) Sinkhorn iterations.
  • Robustness to hyperparameters: The same search configuration is used across all datasets and architectures.
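To give a feel for why independent per-layer sampling produces duplicates, the sketch below computes the chance of at least one repeat in the simplest case: each layer draws uniformly and independently from \(N\) transformations. This is a back-of-the-envelope illustration under that uniform assumption, not a rate measured in the paper; \(N=14\) and \(K=7\) are the values quoted in this summary.

```python
def duplicate_probability(n, k):
    """P(at least one repeat) when k layers each draw uniformly from n transformations."""
    p_distinct = 1.0
    for j in range(k):
        p_distinct *= (n - j) / n  # the j-th draw must avoid the j earlier picks
    return 1.0 - p_distinct


p_depth2 = duplicate_probability(14, 2)
p_depth7 = duplicate_probability(14, 7)
```

Under these assumptions a repeat occurs in about 7% of depth-2 policies and about 84% of depth-7 policies, so duplication becomes the norm as depth grows unless the sampler couples the layers.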

Highlights & Insights

  1. Introducing permutation learning to data augmentation search: The search for transformation types and sequential order is unified using Gumbel-Sinkhorn, mathematically preventing duplicate sampling, which is more elegant than heuristic constraints.
  2. Depth modeled as a learnable probability distribution: There is no longer a need to determine the number of transformations via enumeration or greedy stacking; the depth distribution is learned jointly with the rest of the policy.
  3. Fully end-to-end differentiable: The search space of all four degrees of freedom is implemented via differentiable relaxation, eliminating the need for reinforcement learning or evolutionary algorithms.
  4. Cross-domain generalization: The same set of search hyperparameters is effective across multiple domains, including natural images, sketches, and paintings.

Limitations & Future Work

  • The search space still adopts the 14 standard transformations from AutoAugment, without exploring a richer library of transformations.
  • Only about 10% of the dataset is utilized in the search phase; the search efficiency on larger-scale datasets remains to be validated.
  • The Sinkhorn operation introduces extra computational overhead (\(L=20\) iterations), which could become a bottleneck in resource-constrained scenarios.
  • The performance on downstream dense prediction tasks, such as object detection and semantic segmentation, has not been discussed.

Related Work & Insights

  • Gumbel-Sinkhorn (Mena et al., ICLR 2018): The core tool used in this work, enabling differentiable permutation sampling via the Sinkhorn operator.
  • SLACK (Marrie et al., CVPR 2023): Augmentation search using KL regularization, but independent sampling still suffers from duplicate selection.
  • DeepAA (Zheng et al.): The first method to break the two-layer limitation, but it relies on a greedy stacking strategy.
  • The permutation learning scheme presented in this paper can be transferred to other problems requiring ordered sampling without replacement (e.g., operation selection in NAS).

Rating

  • Novelty: ⭐⭐⭐⭐ — Introducing permutation learning to the data augmentation search space is a novel contribution, and the unified optimization framework for the four degrees of freedom is elegantly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comparative experiments across multiple datasets and domains are thorough, and the ablation study is properly designed.
  • Writing Quality: ⭐⭐⭐⭐ — The methodology is clearly described with intuitive diagrams and rigorous mathematical derivations.
  • Value: ⭐⭐⭐⭐ — This work offers a more holistic and elegant solution for data augmentation search, though its value under real-world large-scale applications warrants further validation.