
Maximizing Incremental Information Entropy for Contrastive Learning

Conference: ICLR 2026 arXiv: 2603.12594 Code: To be confirmed Area: Self-Supervised Learning Keywords: Contrastive Learning, Information Entropy, Incremental Entropy, Information Bottleneck, Learnable Augmentation

TL;DR

This paper proposes IE-CL (Incremental-Entropy Contrastive Learning), a framework that explicitly maximizes entropy gain between augmented views—rather than merely maximizing mutual information—by treating the encoder as an information bottleneck and jointly optimizing a learnable transformation module (for entropy generation) and an encoder regularizer (for entropy preservation). IE-CL consistently improves contrastive learning performance on CIFAR-10/100, STL-10, and ImageNet under small-batch settings, with its core modules serving as plug-and-play components compatible with existing frameworks.

Background & Motivation

Background: Self-supervised contrastive learning has become a central paradigm in representation learning, typically built upon mutual information maximization (e.g., InfoNCE) to learn invariant features across augmented views. Methods such as SimCLR, MoCo, and BYOL have achieved remarkable success.

Limitations of Prior Work:

  • Static data augmentation strategies (random cropping, color jittering, etc.) maintain fixed distributions throughout training and cannot adaptively adjust augmentation difficulty in response to learning progress.
  • Rigid invariance constraints force the encoder to produce identical representations for all augmentations, potentially causing over-compression of useful information, i.e., an excessively tight information bottleneck.
  • Although mutual information maximization is intuitively sound, it overlooks how the information increments introduced by the augmentation process itself affect representation quality.

Key Challenge: Stronger data augmentations introduce greater informational variation, which should theoretically support more robust feature learning; however, excessively strong augmentations may violate semantic preservation boundaries and break the semantic consistency of positive pairs. A unified framework for balancing "entropy generation" and "semantic preservation" is lacking.

Goal: To design a theoretically grounded contrastive learning framework that maximizes information entropy gain between augmented views while maintaining semantic consistency.

Key Insight: The encoder is redefined as an information bottleneck, and the optimization objective is reformulated from "mutual information maximization" to "incremental information entropy maximization," decomposing entropy into two independently optimizable sub-objectives: generation and preservation.

Core Idea: Adaptively generate information entropy via a learnable transformation module, and preserve entropy via encoder regularization—jointly overcoming the dual limitations of static augmentation and rigid invariance.

Method

Theoretical Framework: Incremental Information Entropy Decomposition

The central theoretical contribution of IE-CL is the decomposition of information flow in contrastive learning into two stages:

Stage 1 — Entropy Generation: The data augmentation/transformation process generates an augmented view \(\tilde{x}\) from the original sample \(x\). Conventional methods employ fixed random augmentations, leaving the amount of informational variation introduced (entropy increment \(\Delta H\)) uncontrolled. IE-CL introduces a learnable transformation module \(\mathcal{T}_\phi\) that renders the entropy increment of the augmentation process optimizable, with the objective of maximizing \(\Delta H(\tilde{x} | x)\) subject to a constraint on semantic shift.
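
A minimal sketch of what such a learnable transformation module might look like, assuming a residual perturbation network and a pixel-space deviation proxy for the entropy increment. The class name, the proxy, and the budget constraint are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableTransform(nn.Module):
    """T_phi: predicts a bounded residual perturbation of the input image."""
    def __init__(self, channels: int = 3, hidden: int = 32, max_shift: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1), nn.Tanh(),
        )
        self.max_shift = max_shift  # hard bound on perturbation amplitude

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual form keeps x_tilde near x by construction.
        return (x + self.max_shift * self.net(x)).clamp(0.0, 1.0)

def entropy_generation_loss(x: torch.Tensor, x_tilde: torch.Tensor,
                            budget: float = 0.05) -> torch.Tensor:
    # Proxy for the entropy increment Delta-H(x_tilde | x): mean squared
    # pixel deviation. Maximize it (negative sign) while softly penalizing
    # deviations beyond the semantic-shift budget (penalty weight is arbitrary).
    deviation = F.mse_loss(x_tilde, x)
    return -deviation + 10.0 * F.relu(deviation - budget)
```

The deviation could equally be measured in a feature space; the key design point is that the maximization is bounded by an explicit semantic-shift budget.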

Stage 2 — Entropy Preservation: The encoder \(f_\theta\) maps augmented views into the representation space. Information bottleneck theory dictates that the encoding process inevitably compresses information. IE-CL applies regularization to the encoder to encourage retention of the entropy increments generated in Stage 1, preventing useful variation from being over-compressed.
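
To make the two stages concrete, one plausible accounting of the decomposition (this formulation is my reading of the setup, not an equation copied from the paper) is

\[
\Delta H_{\text{retained}} = \underbrace{H(\tilde{x} \mid x)}_{\text{generated by } \mathcal{T}_\phi} - \underbrace{\Big( H(\tilde{x} \mid x) - H\big(f_\theta(\tilde{x}) \mid f_\theta(x)\big) \Big)}_{\text{compressed away by } f_\theta},
\]

so Stage 2 amounts to keeping \(H(f_\theta(\tilde{x}) \mid f_\theta(x))\) from collapsing. A hedged sketch of such a regularizer, using a Gaussian log-determinant entropy proxy on the embedding residuals (the proxy choice and function name are assumptions):

```python
import torch

def entropy_preservation_loss(z: torch.Tensor, z_tilde: torch.Tensor,
                              eps: float = 1e-4) -> torch.Tensor:
    """z, z_tilde: (batch, dim) embeddings of x and T_phi(x).
    The entropy of a Gaussian grows with log det of its covariance, so we
    use -logdet(cov) of the embedding residuals as the preservation loss."""
    r = z_tilde - z                       # variation carried into z-space
    r = r - r.mean(dim=0, keepdim=True)   # center residuals over the batch
    cov = (r.T @ r) / (r.shape[0] - 1)    # (dim, dim) sample covariance
    cov = cov + eps * torch.eye(cov.shape[0], device=cov.device)
    return -torch.logdet(cov)             # lower loss = more retained entropy
```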

Joint Optimization Objective

The overall loss function comprises three components:

  1. Contrastive Loss: The standard InfoNCE loss ensures positive pairs are brought closer and negative pairs are pushed apart in the representation space.
  2. Entropy Generation Loss: Encourages the learnable transformation \(\mathcal{T}_\phi\) to produce high-entropy augmented views, enlarging the informational discrepancy between positive pairs.
  3. Entropy Preservation Regularization: Constrains the encoder \(f_\theta\) to retain the incremental information introduced by the transformation, preventing excessive compression at the information bottleneck.

The three terms are jointly optimized via a weighted sum, achieving a balance between "generating sufficient useful informational variation" and "preserving that variation during encoding."
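
In symbols, and with the weight names \(\lambda_{\text{gen}}, \lambda_{\text{pres}}\) chosen here for illustration, the weighted sum reads

\[
\mathcal{L} = \mathcal{L}_{\text{InfoNCE}} + \lambda_{\text{gen}}\, \mathcal{L}_{\text{gen}} + \lambda_{\text{pres}}\, \mathcal{L}_{\text{pres}}.
\]

A self-contained sketch of the joint objective, reusing the hypothetical helpers from the earlier sketches (`iecl_loss` and the default weights are assumptions, not the paper's API):

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Standard InfoNCE over a batch: row i of z1 is positive with row i of z2."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                         # (B, B) similarity matrix
    labels = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, labels)

def iecl_loss(x, encoder, transform, lam_gen: float = 0.5, lam_pres: float = 0.1):
    x_tilde = transform(x)                     # Stage 1: entropy generation
    z, z_tilde = encoder(x), encoder(x_tilde)  # Stage 2: encode both views
    return (info_nce(z, z_tilde)
            + lam_gen * entropy_generation_loss(x, x_tilde)
            + lam_pres * entropy_preservation_loss(z, z_tilde))
```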

Learnable Transformation Module

Unlike fixed augmentation pipelines, \(\mathcal{T}_\phi\) is a parameterized transformation network that adaptively adjusts its augmentation strategy throughout training. Early in training it produces relatively mild transformations to avoid destroying semantics; as the encoder improves, transformation difficulty gradually increases, yielding a curriculum learning-like effect without manual curriculum design.

Plug-and-Play Design

The entropy generation module and entropy preservation regularization of IE-CL can be integrated as independent components into existing frameworks such as SimCLR, MoCo, and BYOL without modifying the underlying architecture.
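
Conceptually, the integration reduces to adding two loss terms on top of whatever objective the host framework already computes. A hypothetical wrapper (function name and weights are mine, reusing the helpers sketched above):

```python
def add_iecl_terms(base_loss, x, x_tilde, z, z_tilde,
                   lam_gen: float = 0.5, lam_pres: float = 0.1):
    """Add the two IE-CL terms to any host framework's own loss
    (e.g., SimCLR's InfoNCE or BYOL's prediction loss)."""
    return (base_loss
            + lam_gen * entropy_generation_loss(x, x_tilde)      # Stage 1 term
            + lam_pres * entropy_preservation_loss(z, z_tilde))  # Stage 2 term
```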

Key Experimental Results

Main Results: Performance Gains in Small-Batch Contrastive Learning

Linear evaluation accuracy (%):

| Dataset   | Method        | Batch=128 | Batch=256 | Batch=512 |
|-----------|---------------|-----------|-----------|-----------|
| CIFAR-10  | SimCLR        | 90.1      | 91.3      | 92.0      |
| CIFAR-10  | SimCLR+IE-CL  | 91.8      | 92.5      | 93.0      |
| CIFAR-100 | SimCLR        | 63.2      | 65.1      | 66.8      |
| CIFAR-100 | SimCLR+IE-CL  | 65.9      | 67.0      | 68.3      |
| STL-10    | SimCLR        | 85.6      | 87.2      | 88.1      |
| STL-10    | SimCLR+IE-CL  | 87.4      | 88.5      | 89.2      |

IE-CL yields its most pronounced improvements in the small-batch regime, with gains of 1.7–2.7 percentage points at batch size 128 versus 1.0–1.5 at batch size 512, narrowing the performance gap between small- and large-batch training.

Comparison with Other Contrastive Learning Methods

| Method          | CIFAR-10 (Linear Eval) | CIFAR-100 (Linear Eval) | ImageNet (Top-1) |
|-----------------|------------------------|-------------------------|------------------|
| SimCLR          | 91.3                   | 65.1                    | 69.3             |
| MoCo v2         | 91.8                   | 66.4                    | 71.1             |
| BYOL            | 92.0                   | 67.2                    | 74.3             |
| IE-CL (SimCLR)  | 92.5                   | 67.0                    | 70.8             |
| IE-CL (MoCo)    | 92.9                   | 68.1                    | 72.4             |

As a plug-and-play module, IE-CL delivers consistent gains on both SimCLR and MoCo, demonstrating framework-agnostic applicability.

Ablation Study

| Component                              | CIFAR-100 Acc. (%) |
|----------------------------------------|--------------------|
| Baseline (SimCLR)                      | 65.1               |
| + Entropy Generation Module            | 66.3 (+1.2)        |
| + Entropy Preservation Regularization  | 66.0 (+0.9)        |
| + Both Combined                        | 67.0 (+1.9)        |

Each component contributes on its own, and their combination yields the best result. Note, however, that the combined gain (+1.9) is slightly below the sum of the individual gains (+1.2 and +0.9), suggesting the two components' contributions partially overlap rather than compounding synergistically.

Highlights & Insights

  • Information-Theoretic Innovation: The paper reformulates the contrastive learning objective from "mutual information maximization" to "incremental information entropy maximization," providing a more refined optimization direction that not only attends to shared information between views but also explicitly models the informational variation introduced by the augmentation process.
  • Small-Batch Friendliness: Contrastive learning typically relies heavily on large batch sizes (4096+). IE-CL compensates for the scarcity of negative samples by enlarging the informational discrepancy per sample pair, yielding significant improvements at batch sizes of 128–512.
  • Plug-and-Play: The core modules integrate seamlessly into SimCLR, MoCo, and BYOL, lowering the barrier to adoption.
  • Bridging Theory and Practice: Information bottleneck theory has predominantly served as a post-hoc analytical tool in contrastive learning; IE-CL elevates it to a front-end design principle.

Limitations & Future Work

  • Semantic Safety of the Learnable Transformation: It remains unclear how to guarantee that \(\mathcal{T}_\phi\) does not generate augmentations that violate semantic boundaries. The paper relies on the implicit constraint of the contrastive loss, without an explicit semantic preservation guarantee.
  • Limited Gains on ImageNet: Improvements on large-scale datasets (~1.5%) are less pronounced than on smaller benchmarks, possibly because the inherent diversity of large datasets already provides sufficient informational variation.
  • Computational Overhead: The learnable transformation module increases forward-pass computation, yet the paper does not report detailed training time comparisons.
  • Future Directions: Adversarial augmentation generation (producing more challenging yet semantically consistent transformations) and integration with masked self-supervised methods such as MAE warrant further exploration.
Comparison with Related Methods

  • vs. SimCLR/MoCo: These methods employ fixed augmentation strategies with a mutual information maximization objective. IE-CL replaces these with learnable augmentation and incremental entropy maximization, which is theoretically superior as it directly optimizes information gain.
  • vs. AdCo/HardCL: AdCo increases learning difficulty through adversarial negative sample generation; HardCL improves efficiency via hard positive sample mining. IE-CL provides a unified information-theoretic interpretation—these methods fundamentally increase the amount of informational variation.
  • vs. VICReg/Barlow Twins: These methods prevent representational collapse through variance/redundancy regularization. IE-CL's entropy preservation regularization offers a complementary perspective—not only preventing collapse but also actively preserving useful variation.

Rating

  • Novelty: ⭐⭐⭐⭐ The incremental information entropy perspective is novel, elevating the information bottleneck from an analytical tool to a design principle.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated across multiple datasets and frameworks; the small-batch analysis is insightful, though large-scale experiments could be more comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ The theoretical framework is clearly presented and the motivation is well articulated.
  • Value: ⭐⭐⭐⭐ Offers a new optimization perspective for contrastive learning; the plug-and-play property enhances practical applicability.