Towards Generalized Representations for Low-Light Understanding: When Signal Constancy Meets Semantic Enrichment
Conference: CVPR 2026
Paper: CVF Open Access
Code: https://lyf1212.github.io/UniPrior (Project Page)
Area: Low-light understanding / Representation learning / Domain adaptation
Keywords: Low-light adaptation, Illumination-invariant prior, Visual Foundation Models, Contrastive learning, Test-time augmentation
TL;DR
UniPrior unifies an illumination-invariant signal prior with the semantic prior of Visual Foundation Models (DINOv2/CLIP). Without using any real low-light data for training, it lets models trained on daytime data generalize robustly to unseen nocturnal and low-light scenes, setting new zero-shot state-of-the-art results on classification, segmentation, and face detection.
Background & Motivation
Background: There are three mainstream paradigms for enabling machines to "see clearly" at night: 1) enhance-then-understand, which uses methods like Zero-DCE or Retinex to brighten images before feeding them to downstream models; 2) supervised adaptation, which trains directly on paired, annotated low-light data; and 3) domain adaptation, which transfers annotated knowledge from the daytime domain to an unannotated low-light target domain.
Limitations of Prior Work: Each approach has significant drawbacks. Enhancement methods are optimized for human vision, often introducing artifacts and losing task-related details, which hampers machine perception (as highlighted in Fig.1/Fig.2c). Supervised methods suffer from the scarcity of low-light annotations and overfit to specific training sets. Domain adaptation methods are limited by biased degradation patterns and scene distributions in the training data; switching datasets (e.g., from DarkFace to ExDark) drops accuracy by 1.72%, and reducing data from 4800 to 600 samples drops it by another 2.27%.
Key Challenge: Low-light degradation is highly diverse (low visibility, noise, motion blur, neon lights, night reflections, etc.), making it impractical to cover through exhaustive sampling. Current methods lack a universal, illumination-independent semantic prior to anchor "what the object essentially is," leading to failure when encountering unseen degradations.
Goal: To construct representations that are both stable (no signal shift under different illumination) and discriminative (rich in high-level semantic cues) without touching any real low-light data, facilitating the zero-shot transfer of daytime models to any low-light condition.
Key Insight: The authors observe that two types of priors are naturally complementary: illumination-invariant priors (such as the physical ISP prior in QuadPrior) provide stable representations by suppressing illumination perturbations at the signal layer but are vulnerable to complex noise; Visual Foundation Models (VFMs) (DINOv2, CLIP) carry robust semantic priors that remain stable across scenes and degradations. Integrating both yields both "signal constancy" and "semantic enrichment."
Core Idea: Use illumination-invariant priors to establish signal constancy for stabilizing feature distributions, then employ VFM self-correlation map-guided contrastive learning to achieve semantic enrichment by aligning with the VFM semantic space. Finally, inject this unified prior into pixel-level enhancement for sample-wise test-time adaptation (TTA). This translates to a three-layer prior alignment trained with zero real low-light data.
Method
Overall Architecture
UniPrior is a two-stage "unified prior" framework. Its inputs are a downstream model trained on daytime data and synthetic low-light images (daytime images dimmed via Dark-ISP); its output is an adapted model capable of zero-shot generalization to real low-light scenes. The pipeline comprises three modules across two stages, one offline and one at test time:
- Stage 1 (Offline Adaptation Training): At the high-level feature side, two objectives are pursued: ① Signal Constancy: Illumination-invariant priors are fed as auxiliary inputs into the backbone, with a lightweight decoder performing cross-illumination reconstruction regularization to force the backbone to learn compact, illumination-independent representations; ② Semantic Enrichment: DINOv2 self-correlation maps guide contrastive learning and correlation alignment to align the task backbone's feature space with the VFM semantic space.
- Stage 2 (Test-Time Augmentation): On real low-light data, a machine-oriented enhancer (CoLIE backbone) performs sample-wise pixel optimization. By employing light-effect suppression loss, signal prior consistency, and CLIP intermediate manifold loss, "semantic alignment" and "pixel correction" are coupled to further adapt to unseen low-light distributions.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Daytime Model + Synthetic LL Images<br/>(Dark-ISP synthesis)"] --> B["Signal Constancy<br/>IIP Conditioning<br/>+ Decoder Regularization"]
B --> C["Semantic Enrichment<br/>VFM Self-correlation Guided<br/>Contrastive + Correlation Alignment"]
C -->|Stage 1: Offline Adaptation| D["Machine-oriented Enhancer<br/>CoLIE + Light-effect Suppression<br/>+ CLIP Intermediate Manifold"]
D -->|Stage 2: Sample-wise TTA| E["Zero-shot Low-light Understanding<br/>Classification/Segmentation/Detection"]
Key Designs
1. Signal Constancy: Illumination-Invariant Prior Conditioning + Decoder Regularization as Information Bottleneck
This component addresses the "feature drift under illumination" problem. For an image \(I\in\mathbb{R}^{h\times w\times 3}\), an illumination-invariant prior \(p_{iip}=G_{iip}(I)\in\mathbb{R}^{h\times w\times 6}\) is extracted (\(G_{iip}\) is initialized with QuadPrior weights and fine-tuned online). The original image and prior are channel-concatenated (9 channels) and fused into the backbone via a convolution \(K_{merge}\). A critical trick is zero initialization: weights corresponding to the prior channels are initialized to 0, allowing the network to gradually learn to utilize the prior without disrupting its daytime pre-trained capabilities—removing zero-init causes accuracy to plummet from 74.52% to 62.25%.
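The effect of zero initialization can be illustrated with a minimal NumPy sketch. It assumes that \(K_{merge}\) extends a pretrained 3-channel first-layer kernel to 9 input channels and uses a 1x1 kernel for brevity; the helper names `make_merge_kernel` and `conv1x1`, the random stand-in weights, and the random prior are all hypothetical and not taken from the paper.

```python
import numpy as np

def make_merge_kernel(pretrained_rgb_kernel, prior_channels=6):
    """Extend an (out, 3, kh, kw) kernel to accept 3 + prior_channels inputs.

    The new prior-channel slices are zero-initialised, so the merged layer
    initially ignores the prior and reproduces the pretrained response.
    """
    out_ch, _, kh, kw = pretrained_rgb_kernel.shape
    zeros = np.zeros((out_ch, prior_channels, kh, kw), dtype=pretrained_rgb_kernel.dtype)
    return np.concatenate([pretrained_rgb_kernel, zeros], axis=1)

def conv1x1(x, kernel):
    """Apply a 1x1 convolution: x is (C, H, W), kernel is (out, C, 1, 1)."""
    return np.einsum("oc,chw->ohw", kernel[:, :, 0, 0], x)

rng = np.random.default_rng(0)
k_rgb = rng.normal(size=(8, 3, 1, 1))   # stand-in for pretrained daytime weights
image = rng.normal(size=(3, 4, 4))      # I, channels-first
prior = rng.normal(size=(6, 4, 4))      # placeholder for p_iip = G_iip(I)

k_merge = make_merge_kernel(k_rgb)
merged_input = np.concatenate([image, prior], axis=0)   # 9-channel input

out_pretrained = conv1x1(image, k_rgb)
out_merged = conv1x1(merged_input, k_merge)
```

At initialization `out_merged` equals `out_pretrained`, so the daytime behaviour is untouched; the prior only starts to influence the output once training moves the zeroed weights.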
Two training constraints are used. First, an outlier-robust prior consistency loss: synthetic low-light images generated via inverse-ISP are used to force the synthetic prior \(G_{iip}(I_{low})\) to match the daytime prior \(G_{iip}(I_{normal})\). To handle artifacts in synthetic data, a difference map \(d=|G_{iip}(I_{low})-G_{iip}(I_{normal})|\) is calculated, and its \(\alpha\)-quantile \(d_\alpha\) serves as an adaptive threshold: differences above \(d_\alpha\) are treated as outliers and excluded from the consistency loss \(\mathcal{L}_{\text{iip-consis}}\).
Second, Decoder Regularization: A lightweight decoder \(D_{prior}\) (approx. 10% of backbone parameters) reconstructs the daytime illumination-invariant prior \(p_{iip}^{normal}\) from low-light intermediate features \(\{f^i\}\), via \(\mathcal{L}_{\text{iip-decode}}=\|D_{prior}(\{f^i\})-p_{iip}^{normal}\|_2^2\). The intentionally small decoder acts as an information bottleneck, forcing the backbone to distill the most compact, illumination-independent information.
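The two constraints above can be sketched in NumPy as follows. This is a minimal illustration, assuming the consistency term is an L1 average over pixels whose difference does not exceed \(d_\alpha\); the function names, the value of `alpha`, and the toy data are assumptions, and the paper's exact masking rule may differ.

```python
import numpy as np

def robust_prior_consistency(p_low, p_normal, alpha=0.9):
    """Masked L1 between two prior maps, ignoring differences above the alpha-quantile."""
    d = np.abs(p_low - p_normal)        # difference map d
    d_alpha = np.quantile(d, alpha)     # adaptive threshold d_alpha
    keep = d <= d_alpha                 # assumed: large differences are the outliers
    return float(d[keep].mean())

def decoder_loss(decoded_prior, p_normal):
    """Squared L2 reconstruction error, matching the form of L_iip-decode."""
    return float(np.sum((decoded_prior - p_normal) ** 2))

# Toy priors of shape (h, w, 6) with one synthetic-artifact outlier.
p_normal = np.zeros((4, 4, 6))
p_low = np.full((4, 4, 6), 0.1)
p_low[0, 0, 0] = 50.0

plain_l1 = float(np.abs(p_low - p_normal).mean())
robust_l1 = robust_prior_consistency(p_low, p_normal, alpha=0.9)
perfect_decode = decoder_loss(p_normal, p_normal)
```

In this toy case the single corrupted value dominates the plain L1 average, while the quantile-masked version stays at the typical difference of 0.1.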
2. Semantic Enrichment: Guidance via VFM Self-correlation Maps (rather than static features)
A core insight (Fig.4) is that in DINOv2's raw features, an object might be indistinguishable from its surroundings, but its self-correlation map precisely highlights semantically relevant regions. Thus, the model should align "correlation structures" rather than "absolute feature values."
Contrastive Enhancement: For VFM features \(f_{vfm}\), a normalized self-similarity matrix \(\mathcal{S}=\big(\tfrac{f_{vfm}}{\|f_{vfm}\|_2^2}\big)\cdot\big(\tfrac{f_{vfm}}{\|f_{vfm}\|_2^2}\big)^\top\) is computed and binarized using an adaptive threshold \(\mathcal{S}_\alpha\) to obtain a mask. For an anchor point \(p\), the mask defines its semantic cluster region \(\mathcal{M}\): features inside \(\mathcal{M}\) are positive samples, while those outside are negatives. InfoNCE is then applied to the synthetic low-light anchor \(f\), daytime positives \(f^+\), and negatives \(f^-\), giving the contrastive loss \(\mathcal{L}_{\text{contra}}\).
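A small NumPy sketch of mask-guided contrastive learning is given here for illustration. It uses the standard InfoNCE form with a temperature `tau`, and it normalizes features by their plain L2 norm (cosine similarity), which is an assumption about the intended normalization. The toy features, the quantile level 0.5 used for \(\mathcal{S}_\alpha\), and all function names are hypothetical.

```python
import numpy as np

def self_correlation(features):
    """Cosine self-similarity of N patch features: (N, C) -> (N, N)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return unit @ unit.T

def cosine_to_anchor(anchor, others):
    """Cosine similarity between one anchor (C,) and a set of vectors (K, C)."""
    return (others @ anchor) / (np.linalg.norm(others, axis=1) * np.linalg.norm(anchor))

def info_nce(anchor, positives, negatives, tau=0.1):
    """Standard InfoNCE for a single anchor, averaged over its positives."""
    pos = np.exp(cosine_to_anchor(anchor, positives) / tau)        # shape (P,)
    neg = np.exp(cosine_to_anchor(anchor, negatives) / tau).sum()  # scalar
    return float(np.mean(-np.log(pos / (pos + neg))))

# Toy VFM patch features forming two semantic clusters (placeholders, not DINOv2 output).
f_vfm = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
S = self_correlation(f_vfm)
mask = S > np.quantile(S, 0.5)      # adaptive threshold S_alpha (quantile level assumed)

p = 0                               # anchor patch index
f_day = f_vfm.copy()                # stand-in for daytime backbone features
positives, negatives = f_day[mask[p]], f_day[~mask[p]]

aligned_anchor = np.array([1.0, 0.05])   # low-light feature that stays in its cluster
drifted_anchor = np.array([0.05, 1.0])   # low-light feature that drifted to the other cluster
loss_aligned = info_nce(aligned_anchor, positives, negatives)
loss_drifted = info_nce(drifted_anchor, positives, negatives)
```

The loss is small when the low-light anchor stays close to the patches that the VFM mask marks as its semantic cluster, and large when it drifts toward patches outside the mask.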
Correlation Alignment: Since VFM capacity far exceeds the task backbone, forcing the small model to mimic VFM feature distributions is difficult. Instead, alignment is performed at the self-correlation map level. For backbone features \(f\) and DINOv2 features \(f_{dino}\), self-correlation maps \(\mathcal{S}, \mathcal{S}_{dino}\) are computed, followed by softmax and cross-entropy: \(\mathcal{L}_{\text{align}}=\text{CE}(\text{Softmax}(\mathcal{S}),\text{Softmax}(\mathcal{S}_{dino}))\).
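The alignment on self-correlation maps can be sketched as below. This is a hedged illustration: it treats the DINOv2 map as the target distribution in a row-wise cross-entropy, which is one reading of the \(\text{CE}(\cdot,\cdot)\) argument order, and the feature shapes and names are placeholders rather than real backbone or DINOv2 outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_correlation(features):
    """Cosine self-similarity of N patch features: (N, C) -> (N, N)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return unit @ unit.T

def correlation_alignment(f_backbone, f_dino):
    """Row-wise cross-entropy between softmaxed self-correlation maps.

    The DINOv2 map is used as the target distribution (an assumption).
    """
    p_target = softmax(self_correlation(f_dino))
    log_q = np.log(softmax(self_correlation(f_backbone)))
    return float(-(p_target * log_q).sum(axis=1).mean())

rng = np.random.default_rng(0)
f_small = rng.normal(size=(5, 16))                  # narrow backbone features
f_dino = np.hstack([f_small, np.zeros((5, 48))])    # wider features, same correlation structure (contrived)
f_other = rng.normal(size=(5, 16))                  # backbone features with unrelated structure

loss_matched = correlation_alignment(f_small, f_dino)
loss_other = correlation_alignment(f_other, f_dino)
```

Because only the N-by-N maps are compared, the backbone and the VFM can have different channel widths (16 versus 64 here), which is how this form of alignment sidesteps the capacity gap.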
3. Machine-oriented Enhancement: Injecting Unified Priors into Pixel Correction (Sample-wise TTA)
Stage 2 employs CoLIE as the enhancement backbone to optimize pixels per sample. Beyond human-centric enhancement losses \(\mathcal{L}_{enh}\), three machine-oriented objectives are added:
- Light-effect Suppression Loss: An overexposure mask \(\mathcal{M}_{LE}\) is extracted (extending the signal prior to handle glare/headlights), and \(\mathcal{L}_{LE}=\tfrac{1}{hw}\sum_{i,j}\mathcal{M}_{LE}(i,j)\) penalizes the overexposed area to prevent amplification of harmful artifacts.
- Signal Prior Consistency: The same outlier-robust L1 constraint is applied between the enhanced image prior \(p_{iip}^{enh}\) and input prior \(p_{iip}^{low}\).
- CLIP Intermediate Manifold Loss: To erase "illumination attributes," the enhanced image is pushed towards an "intermediate manifold" between daytime and low-light text embeddings in CLIP space using a maximum entropy loss \(\mathcal{L}_{\text{clip-inter}}=\sum_i p_{sim}^i\log p_{sim}^i\).
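Two of the test-time objectives above can be sketched in NumPy. This is a rough illustration only: the brightness threshold used to build the mask, the temperature `tau`, and the 2-D "text embeddings" are hypothetical stand-ins, since the mask extraction and the CLIP prompts are not specified here.

```python
import numpy as np

def light_effect_loss(mask_le):
    """L_LE: mean of the overexposure mask, i.e. the fraction of flagged area."""
    return float(mask_le.mean())

def clip_intermediate_loss(image_emb, text_embs, tau=0.01):
    """Negative entropy of the image-to-text similarity distribution.

    Minimising this value maximises entropy, which pushes the image to be
    equally close to the 'daytime' and 'low-light' text embeddings.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / tau
    logits = logits - logits.max()              # stabilise the softmax
    p_sim = np.exp(logits) / np.exp(logits).sum()
    return float(np.sum(p_sim * np.log(p_sim + 1e-12)))

# Enhanced image with a saturated 2x2 glare patch; simple brightness threshold as a mask stand-in.
enhanced = np.zeros((4, 4, 3))
enhanced[:2, :2] = 1.0
mask_le = (enhanced.mean(axis=-1) > 0.95).astype(float)
loss_le = light_effect_loss(mask_le)

# Placeholder text embeddings for "daytime" and "low-light" prompts.
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_midway = clip_intermediate_loss(np.array([1.0, 1.0]), text_embs)
loss_daylike = clip_intermediate_loss(np.array([1.0, 0.0]), text_embs)
```

An embedding that sits midway between the two text anchors reaches the minimum value of \(-\log 2\), while an embedding that matches one anchor gives a value near zero, so minimising the loss moves the enhanced image toward the intermediate manifold.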
Loss & Training
Two-stage training:
Stage 1 (Offline Adaptation): Combined training of high-level components using Dark-ISP synthetic pairs.
\[\mathcal{L}_{\text{high}}=\lambda_{\text{task}}\mathcal{L}_{\text{task}}+\lambda_{\text{con}}\mathcal{L}_{\text{iip-consis}}+\lambda_{\text{dec}}\mathcal{L}_{\text{iip-decode}}+\lambda_{\text{contra}}\mathcal{L}_{\text{contra}}+\lambda_{\text{ca}}\mathcal{L}_{\text{align}}\]
Stage 2 (Test-Time Augmentation): Sample-wise optimization of low-level enhancement on real data.
\[\mathcal{L}_{\text{low}}=\lambda_{\text{enh}}\mathcal{L}_{\text{enh}}+\lambda_{\text{entropy}}\mathcal{L}_{\text{entropy}}+\lambda_{\text{consis}}\mathcal{L}_{\text{consis}}\]
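No real low-light data is used for training: Stage 1 sees only daytime images and their Dark-ISP synthetic counterparts, and Stage 2 touches real low-light images only at test time, one sample at a time.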
Key Experimental Results
Main Results
On CoDaN low-light classification (ResNet-18 baseline), UniPrior significantly outperforms all zero-shot adaptation methods with minimal overhead (parameters 11.70M→11.72M, FLOPs 1.80G→2.04G):
| Category | Method | Top-1 ACC (%) | Remarks |
|---|---|---|---|
| Baseline | ResNet-18 | 53.32 | Direct Inference |
| Enhancement | GEFU | 60.92 | Optimized for humans, suboptimal for machines |
| Zero-shot | Sim-MinMax | 65.87 | Contrastive learning, lacks universal prior |
| Zero-shot | DAI-Net | 68.44 | Prev. SOTA |
| Ours | UniPrior | 74.52 | +6.08 over SOTA, negligible overhead |
| Ours | UniPrior + TTA | 75.72 | Further gain via sample-wise enhancement |
UniPrior also sets new zero-shot state-of-the-art results on segmentation (mIoU) and face detection (mAP):
| Task / Dataset | Metric | Best Prev. (Zero-shot) | Ours | Ours + TTA |
|---|---|---|---|---|
| Seg. / Nighttime Driving | mIoU | 44.9 (Sim-MinMax) | 48.2 | 48.6 |
| Seg. / ACDC-Night | mIoU | 27.6 (Sim-MinMax) | 27.9 | 28.5 |
| Detection / DarkFace | mAP | 28.0 (DAI-Net) | 31.3 | 33.6 |
Ablation Study
| Configuration | CoDaN ACC (%) | ND mIoU (%) | Explanation |
|---|---|---|---|
| Full (Ours) | 74.52 | 48.20 | Complete model |
| w/o zero-init | 62.25 | 39.31 | Largest performance drop |
| with naive feature align | 63.26 | 41.72 | L1 align instead of self-correlation |
| w/o contrastive aug. | 68.89 | 47.07 | Removed contrastive augmentation |
| w/o correlation align. | 69.71 | 46.67 | Removed correlation alignment |
| w/o prior decoder | 70.26 | 45.23 | Loss of information bottleneck |
| w/o invariant prior | 67.24 | 44.62 | Removed signal layer prior |
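To make the table easier to read, the drop of each ablation relative to the full model can be computed directly from the reported numbers. The short script below only restates the table arithmetic; the variable names are illustrative.

```python
# Values copied from the ablation table: (CoDaN ACC %, ND mIoU %).
full_acc, full_miou = 74.52, 48.20
ablations = {
    "w/o zero-init": (62.25, 39.31),
    "with naive feature align": (63.26, 41.72),
    "w/o contrastive aug.": (68.89, 47.07),
    "w/o correlation align.": (69.71, 46.67),
    "w/o prior decoder": (70.26, 45.23),
    "w/o invariant prior": (67.24, 44.62),
}

# Drop in points relative to the full model, rounded to two decimals.
drops = {
    name: (round(full_acc - acc, 2), round(full_miou - miou, 2))
    for name, (acc, miou) in ablations.items()
}

# Rank configurations by how much CoDaN accuracy they lose.
ranked = sorted(drops, key=lambda name: drops[name][0], reverse=True)

for name in ranked:
    acc_drop, miou_drop = drops[name]
    print(f"{name:26s} -{acc_drop:.2f} ACC  -{miou_drop:.2f} mIoU")
```

Sorted this way, removing zero-init (12.27 points) and swapping in naive feature alignment (11.26 points) stand clearly apart from the other configurations, which lose between 4.26 and 7.28 points of accuracy.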
Key Findings
- Zero-init and "Correlation Map Alignment" are critical: Removing zero-init drops CoDaN accuracy to 62.25%, and replacing self-correlation alignment with naive L1 feature alignment drops it to 63.26%, both below 64%. This indicates that the gain comes from these specific designs: preserving daytime capabilities and capturing semantic structure rather than absolute feature values.
- Decoder Regularization as an Information Bottleneck is effective: Removing the decoder (only ~10% of backbone parameters) costs 4.26 points of CoDaN accuracy (74.52% to 70.26%) and 2.97 points of ND mIoU, supporting the "compact distillation" motivation.
- Enhancement can harm segmentation: Human-oriented enhancement (e.g., EnlightenGAN) can drop ND mIoU to 25.2 (lower than baseline 34.3), highlighting the necessity of machine-oriented designs.
Highlights & Insights
- The "Correlation Map vs. Feature" Insight: Rather than mimicking high-dimensional DINOv2 features, aligning the softmax-normalized self-correlation matrix (contextual patterns) bypasses the capacity gap and captures semantic essence.
- Cross-level Unified Prior: The signal layer (invariant prior), feature layer (VFM semantics), and pixel layer (enhancement) are coupled through a single unified prior framework, which is presented as the main reason for its generalization to unseen degradations.
- Transferable Tricks: ① Zero-init for "painless integration" of new modalities; ② Outlier-robust loss using quantile thresholds for synthetic data; ③ Maximum entropy in CLIP space to erase nuisance attributes (illumination).
Limitations & Future Work
- Dependency on Pre-trained Quality: If foundation models (DINOv2/CLIP) are weak in specific domains (e.g., thermal, underwater), UniPrior's performance ceiling will be limited.
- TTA Overhead: Single-image inference with TTA takes 2.593s (vs. 0.005s for the main model, roughly 500× slower), making it unsuitable for real-time systems such as autonomous driving unless optimized.
- Synthetic Gap: Stage 1 relies on Dark-ISP synthesis. While outlier-robust losses help, a large gap between synthetic and real sensor noise may still affect signal constancy.
Related Work & Insights
- vs. Sim-MinMax: Both use contrastive learning, but Sim-MinMax lacks universal priors. UniPrior leads by 8.65 percentage points in CoDaN classification (74.52% vs. 65.87%).
- vs. DAI-Net: DAI-Net uses reflectance decomposition; UniPrior integrates VFM semantic priors and gains +3.3 mAP on DarkFace face detection (31.3 vs. 28.0), or +5.6 mAP with TTA (33.6).
- vs. Enhancement-only: Pure enhancement often introduces artifacts. UniPrior's machine-oriented approach couples the alignment and correction, leading to superior perception.
Rating
- Novelty: ⭐⭐⭐⭐⭐ (Self-correlation alignment insight + 3-layer prior integration)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (3 tasks, complexity analysis, detailed ablation)
- Writing Quality: ⭐⭐⭐⭐ (Clear logic, though some notation is dense)
- Value: ⭐⭐⭐⭐⭐ (New zero-shot SOTA, transferable tricks)