Towards Generalized Representations for Low-Light Understanding: When Signal Constancy Meets Semantic Enrichment¶

Conference: CVPR 2026
Paper: CVF Open Access
Code: https://lyf1212.github.io/UniPrior (Project Page)
Area: Low-light understanding / Representation learning / Domain adaptation
Keywords: Low-light adaptation, Illumination-invariant prior, Visual Foundation Models, Contrastive learning, Test-time augmentation

TL;DR¶

UniPrior unifies the "illumination-invariant signal prior" with the "semantic prior from Visual Foundation Models (DINOv2/CLIP)." Without using any real low-light data for training, it enables models trained on daytime data to generalize robustly to unseen nocturnal/low-light scenes, setting new zero-shot state-of-the-art results across classification, segmentation, and face detection.

Background & Motivation¶

Background: There are three mainstream paradigms for enabling machines to "see clearly" at night: 1) enhance-then-understand, which uses methods like Zero-DCE or Retinex to brighten images before feeding them to downstream models; 2) supervised adaptation, which trains directly on paired, annotated low-light data; and 3) domain adaptation, which transfers annotated knowledge from the daytime domain to an unannotated low-light target domain.

Limitations of Prior Work: Each approach has significant drawbacks. Enhancement methods are optimized for human vision, often introducing artifacts and losing task-related details, which hampers machine perception (as highlighted in Fig.1/Fig.2c). Supervised methods suffer from the scarcity of low-light annotations and overfit to specific training sets. Domain adaptation methods are limited by biased degradation patterns and scene distributions in the training data; switching datasets (e.g., from DarkFace to ExDark) drops accuracy by 1.72%, and reducing data from 4800 to 600 samples drops it by another 2.27%.

Key Challenge: Low-light degradation is highly diverse (low visibility, noise, motion blur, neon lights, night reflections, etc.), making it impractical to cover through exhaustive sampling. Current methods lack a universal, illumination-independent semantic prior to anchor "what the object essentially is," leading to failure when encountering unseen degradations.

Goal: To construct representations that are both stable (no signal shift under different illumination) and discriminative (rich in high-level semantic cues) without touching any real low-light data, facilitating the zero-shot transfer of daytime models to any low-light condition.

Key Insight: The authors observe that two types of priors are naturally complementary. Illumination-invariant priors (such as the physical ISP prior in QuadPrior) provide stable representations by suppressing illumination perturbations at the signal layer, but they are vulnerable to complex noise. Visual Foundation Models (VFMs) such as DINOv2 and CLIP carry robust semantic priors that remain stable across scenes and degradations. Integrating the two yields both "signal constancy" and "semantic enrichment."

Core Idea: Use illumination-invariant priors to establish signal constancy for stabilizing feature distributions, then employ VFM self-correlation map-guided contrastive learning to achieve semantic enrichment by aligning with the VFM semantic space. Finally, inject this unified prior into pixel-level enhancement for sample-wise test-time adaptation (TTA). This translates to a three-layer prior alignment trained with zero real low-light data.

Method¶

Overall Architecture¶

UniPrior is a two-stage "unified prior" framework. The input consists of a downstream model trained on daytime data plus synthetic low-light data (dimmed via Dark-ISP); the output is an adapted model capable of zero-shot generalization to real low-light. The pipeline comprises three modules across two training stages:

Stage 1 (Offline Adaptation Training): On the high-level feature side, two objectives are pursued: ① Signal Constancy: Illumination-invariant priors are fed as auxiliary inputs into the backbone, and a lightweight decoder performs cross-illumination reconstruction regularization to force the backbone to learn compact, illumination-independent representations; ② Semantic Enrichment: DINOv2 self-correlation maps guide contrastive learning and correlation alignment, aligning the task backbone's feature space with the VFM semantic space.
Stage 2 (Test-Time Augmentation): On real low-light data, a machine-oriented enhancer (CoLIE backbone) performs sample-wise pixel optimization. By employing light-effect suppression loss, signal prior consistency, and CLIP intermediate manifold loss, "semantic alignment" and "pixel correction" are coupled to further adapt to unseen low-light distributions.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Daytime Model + Synthetic LL Images<br/>(Dark-ISP synthesis)"] --> B["Signal Constancy<br/>IIP Conditioning<br/>+ Decoder Regularization"]
    B --> C["Semantic Enrichment<br/>VFM Self-correlation Guided<br/>Contrastive + Correlation Alignment"]
    C -->|Stage 1: Offline Adaptation| D["Machine-oriented Enhancer<br/>CoLIE + Light-effect Suppression<br/>+ CLIP Intermediate Manifold"]
    D -->|Stage 2: Sample-wise TTA| E["Zero-shot Low-light Understanding<br/>Classification/Segmentation/Detection"]

Key Designs¶

1. Signal Constancy: Illumination-Invariant Prior Conditioning + Decoder Regularization as Information Bottleneck

This component addresses the "feature drift under illumination" problem. For an image $I\in\mathbb{R}^{h\times w\times 3}$, an illumination-invariant prior $p_{iip}=G_{iip}(I)\in\mathbb{R}^{h\times w\times 6}$ is extracted ($G_{iip}$ is initialized with QuadPrior weights and fine-tuned online). The original image and prior are channel-concatenated (9 channels) and fused into the backbone via a convolution $K_{merge}$. A critical trick is zero initialization: the weights corresponding to the prior channels are initialized to 0, so the network starts from its daytime pre-trained behavior and learns to use the prior gradually. Removing zero-init causes classification accuracy to plummet from 74.52% to 62.25%.
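The zero-initialization trick can be illustrated with a minimal NumPy sketch. It assumes (this is not stated in the summary) that $K_{merge}$ extends a pretrained 3-channel convolution to 9 input channels; `make_merge_weights` and `conv2d_valid` are hypothetical helper names, and the arrays use a channel-first layout for convenience.

```python
import numpy as np

def make_merge_weights(w_rgb, n_prior=6):
    """Extend (C_out, 3, k, k) RGB weights to 3 + n_prior input channels.

    The extra prior channels start at exactly zero (zero initialization).
    Assumption: K_merge is built from pretrained first-layer weights.
    """
    c_out, _, kh, kw = w_rgb.shape
    w_prior = np.zeros((c_out, n_prior, kh, kw), dtype=w_rgb.dtype)
    return np.concatenate([w_rgb, w_prior], axis=1)

def conv2d_valid(x, w):
    """Naive 'valid' cross-correlation. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, _, kh, kw = w.shape
    h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
w_rgb = rng.normal(size=(4, 3, 3, 3))   # stand-in for pretrained weights
image = rng.random((3, 8, 8))           # RGB image, channel-first
prior = rng.random((6, 8, 8))           # stand-in for the 6-channel prior

k_merge = make_merge_weights(w_rgb)
out_rgb = conv2d_valid(image, w_rgb)
out_merged = conv2d_valid(np.concatenate([image, prior], axis=0), k_merge)
```

At initialization `out_merged` matches `out_rgb`, so adding the prior input does not perturb the pretrained features; the prior only starts to contribute once training moves its weights away from zero.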

Two training constraints are used. First, an outlier-robust prior consistency loss: synthetic low-light images generated via inverse-ISP are used to force the synthetic prior $G_{iip}(I_{low})$ to match the daytime prior $G_{iip}(I_{normal})$. To handle artifacts in synthetic data, a difference map $d=|G_{iip}(I_{low})-G_{iip}(I_{normal})|$ is calculated, and an $\alpha$-quantile $d_\alpha$ serves as an adaptive threshold to filter outliers:

\[\mathcal{F}=\begin{cases}0,& d>d_\alpha\\1,&\text{otherwise}\end{cases},\qquad \mathcal{L}_{\text{iip-consis}}=\big\|\mathcal{F}\odot(G_{iip}(I_{low})-G_{iip}(I_{normal}))\big\|_2^2\]
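A minimal NumPy sketch of this masked loss follows. It assumes the quantile is taken globally over all pixels and channels and uses `alpha=0.9` purely for illustration; neither choice is specified in the summary, and `iip_consistency_loss` is a hypothetical name.

```python
import numpy as np

def iip_consistency_loss(prior_low, prior_normal, alpha=0.9):
    """Squared-L2 consistency loss that ignores the largest differences.

    Elements whose absolute difference exceeds the alpha-quantile are
    treated as synthesis artifacts and masked out (F = 0).
    """
    diff = prior_low - prior_normal
    d = np.abs(diff)
    d_alpha = np.quantile(d, alpha)          # adaptive threshold
    keep = (d <= d_alpha).astype(diff.dtype)  # F = 1 where d <= d_alpha
    loss = float(np.sum((keep * diff) ** 2))
    return loss, keep

# Toy example: a small uniform offset plus one large artifact.
prior_normal = np.zeros((4, 4, 6))
prior_low = np.full((4, 4, 6), 0.01)
prior_low[0, 0, 0] = 10.0

loss, keep = iip_consistency_loss(prior_low, prior_normal)
unmasked = float(np.sum((prior_low - prior_normal) ** 2))
```

In this toy case the single artifact dominates `unmasked`, while the masked `loss` only reflects the small, consistent offset.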

Second, Decoder Regularization: A lightweight decoder $D_{prior}$ (approx. 10% of backbone parameters) reconstructs the daytime illumination-invariant prior $p_{iip}^{normal}$ from low-light intermediate features $\{f^i\}$, via $\mathcal{L}_{\text{iip-decode}}=\|D_{prior}(\{f^i\})-p_{iip}^{normal}\|_2^2$. The intentionally small decoder acts as an information bottleneck, forcing the backbone to distill the most compact, illumination-independent information.

2. Semantic Enrichment: Guidance via VFM Self-correlation Maps (rather than static features)

A core insight (Fig.4) is that in DINOv2's raw features, an object might be indistinguishable from its surroundings, but its self-correlation map precisely highlights semantically relevant regions. Thus, the model should align "correlation structures" rather than "absolute feature values."

Contrastive Enhancement: For VFM features $f_{vfm}$, a normalized self-similarity matrix $\mathcal{S}=\big(\tfrac{f_{vfm}}{\|f_{vfm}\|_2}\big)\cdot\big(\tfrac{f_{vfm}}{\|f_{vfm}\|_2}\big)^\top$ is computed and binarized using an adaptive threshold $\mathcal{S}_\alpha$. For an anchor point $p$, the resulting mask defines the semantic cluster region $\mathcal{M}$: features inside $\mathcal{M}$ are positive samples, while those outside are negatives. InfoNCE is applied to the synthetic low-light anchor $f$, daytime positive $f^+$, and negatives $f^-$:

\[\mathcal{L}_{\text{contra}}=-\log\frac{\sigma(f,f^+)}{\sigma(f,f^+)+\sum_{\mathcal{M}[i]=0}\sigma(f,f^-_i)},\quad \sigma(x,y)=\exp(x\cdot y/\tau)\]
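The mask construction and the InfoNCE term can be sketched in NumPy as below. Several details are assumptions made for illustration only: the features are L2-normalized (cosine similarity), the threshold is the median of the anchor's correlation row, the positive is the daytime feature at the anchor location, and `tau=0.1`. All function names are hypothetical.

```python
import numpy as np

def self_correlation(feats):
    """Cosine self-similarity of N token features, shape (N, C) -> (N, N)."""
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return unit @ unit.T

def semantic_mask(corr_row, alpha=0.5):
    """Binarize one anchor's correlation row with a quantile threshold."""
    return corr_row >= np.quantile(corr_row, alpha)

def info_nce(anchor, positive, negatives, tau=0.1):
    """-log( s(f, f+) / (s(f, f+) + sum_i s(f, f-_i)) ), s = exp(dot / tau)."""
    pos = np.exp(anchor @ positive / tau)
    neg = np.exp(negatives @ anchor / tau).sum()
    return float(-np.log(pos / (pos + neg)))

# Toy VFM tokens: three tokens from one object, three from another.
f_vfm = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
                  [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]])
p = 0                                        # anchor location
mask = semantic_mask(self_correlation(f_vfm)[p])

# Stand-in backbone features (unit-normalized) for the daytime image.
f_day = f_vfm / np.linalg.norm(f_vfm, axis=1, keepdims=True)
negatives = f_day[~mask]                     # tokens outside the mask

loss_aligned = info_nce(f_day[0], f_day[p], negatives)  # anchor stays in cluster
loss_drifted = info_nce(f_day[3], f_day[p], negatives)  # anchor drifted away
```

A low-light anchor that stays close to its daytime cluster gives a loss near zero, while an anchor that has drifted toward the other object is penalized heavily.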

Correlation Alignment: Since VFM capacity far exceeds the task backbone, forcing the small model to mimic VFM feature distributions is difficult. Instead, alignment is performed at the self-correlation map level. For backbone features $f$ and DINOv2 features $f_{dino}$, self-correlation maps $\mathcal{S}, \mathcal{S}_{dino}$ are computed, followed by softmax and cross-entropy: $\mathcal{L}_{\text{align}}=\text{CE}(\text{Softmax}(\mathcal{S}),\text{Softmax}(\mathcal{S}_{dino}))$.
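The following NumPy sketch shows why aligning correlation maps sidesteps the capacity gap: the two feature sets may have different channel widths, yet both maps are $N\times N$. It assumes a row-wise softmax and treats the DINOv2 map as the target distribution; the summary does not spell out either detail, and the function names are hypothetical.

```python
import numpy as np

def self_correlation(feats):
    """Cosine self-similarity of N token features, shape (N, C) -> (N, N)."""
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return unit @ unit.T

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correlation_alignment_loss(f, f_dino):
    """Cross-entropy between row-wise softmaxed self-correlation maps.

    Assumption: the DINOv2 map is the (fixed) target distribution.
    """
    p = softmax(self_correlation(f))          # backbone distribution
    q = softmax(self_correlation(f_dino))     # target distribution
    return float(-(q * np.log(p)).sum(axis=1).mean())

rng = np.random.default_rng(0)
f = rng.normal(size=(5, 8))        # 5 tokens, 8-dim backbone features
f_dino = rng.normal(size=(5, 16))  # 5 tokens, 16-dim stand-in VFM features

loss_cross = correlation_alignment_loss(f, f_dino)
loss_self = correlation_alignment_loss(f_dino, f_dino)  # lower bound (entropy)
```

Because cross-entropy is minimized when the two distributions coincide, `loss_self` is a lower bound for `loss_cross`, and no projection layer is needed to match channel dimensions.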

3. Machine-oriented Enhancement: Injecting Unified Priors into Pixel Correction (Sample-wise TTA)

Stage 2 employs CoLIE as the enhancement backbone to optimize pixels per sample. Beyond human-centric enhancement losses $\mathcal{L}_{enh}$, three machine-oriented objectives are added:

Light-effect Suppression Loss: An overexposure mask $\mathcal{M}_{LE}$ is extracted (extending the signal prior to handle glare/headlights), and $\mathcal{L}_{LE}=\tfrac{1}{hw}\sum_{i,j}\mathcal{M}_{LE}[i,j]$ penalizes the overexposed area to prevent amplification of harmful artifacts.
Signal Prior Consistency: The same outlier-robust L1 constraint is applied between the enhanced image prior $p_{iip}^{enh}$ and input prior $p_{iip}^{low}$.
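CLIP Intermediate Manifold Loss: To erase "illumination attributes," the enhanced image is pushed towards an "intermediate manifold" between daytime and low-light text embeddings in CLIP space using a maximum entropy loss $\mathcal{L}_{\text{clip-inter}}=\sum_i p_{sim}^i\log p_{sim}^i$. Minimizing this negative entropy maximizes the entropy of $p_{sim}$, so the enhanced image is not confidently matched to either illumination description.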
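Two of these objectives are simple enough to sketch directly in NumPy. The sketch takes the overexposure mask and the image-text similarity scores as given inputs, since how they are produced is not detailed here; it also assumes $p_{sim}$ is a softmax over those similarity scores. No CLIP model is called, and the function names are hypothetical.

```python
import numpy as np

def light_effect_loss(mask_le):
    """Mean of the overexposure mask: the fraction of flagged pixels."""
    return float(np.mean(mask_le))

def clip_intermediate_loss(sim_logits):
    """Negative entropy of softmax(sim_logits); lowest when p is uniform."""
    z = np.asarray(sim_logits, dtype=float)
    e = np.exp(z - z.max())
    p = e / e.sum()
    return float(np.sum(p * np.log(p)))

# Toy overexposure mask: 4 of 16 pixels flagged as glare.
mask_le = np.zeros((4, 4))
mask_le[:2, :2] = 1.0
le = light_effect_loss(mask_le)

# Similarities to two prompts (e.g. a daytime and a low-light description).
confident = clip_intermediate_loss([5.0, 0.0])  # clearly one illumination
ambiguous = clip_intermediate_loss([0.0, 0.0])  # equally close to both
```

The ambiguous case reaches the minimum value of $-\log 2$ for two prompts, which is the state the loss drives the enhanced image toward.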

Loss & Training¶

Two-stage training:

Stage 1 (Offline Adaptation): Combined training of high-level components using Dark-ISP synthetic pairs. $$\mathcal{L}_{\text{high}}=\lambda_{\text{task}}\mathcal{L}_{\text{task}}+\lambda_{\text{con}}\mathcal{L}_{\text{iip-consis}}+\lambda_{\text{dec}}\mathcal{L}_{\text{iip-decode}}+\lambda_{\text{contra}}\mathcal{L}_{\text{contra}}+\lambda_{\text{ca}}\mathcal{L}_{\text{align}}$$

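Stage 2 (Test-Time Augmentation): Sample-wise optimization of low-level enhancement on real data. $$\mathcal{L}_{\text{low}}=\lambda_{\text{enh}}\mathcal{L}_{\text{enh}}+\lambda_{\text{entropy}}\mathcal{L}_{\text{entropy}}+\lambda_{\text{consis}}\mathcal{L}_{\text{consis}}$$ Here $\mathcal{L}_{\text{entropy}}$ corresponds to the maximum-entropy CLIP term $\mathcal{L}_{\text{clip-inter}}$ and $\mathcal{L}_{\text{consis}}$ to the signal prior consistency term.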

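No real low-light data is used for training: Stage 1 relies only on daytime images and their synthetic low-light counterparts, and Stage 2 optimizes the enhancer per sample on the test image itself.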

Key Experimental Results¶

Main Results¶

On CODaN low-light classification (ResNet-18 baseline), UniPrior outperforms all zero-shot adaptation methods with minimal overhead (parameters 11.70M→11.72M, FLOPs 1.80G→2.04G):

| Category | Method | Top-1 ACC (%) | Remarks |
| --- | --- | --- | --- |
| Baseline | ResNet-18 | 53.32 | Direct inference |
| Enhancement | GEFU | 60.92 | Optimized for humans, suboptimal for machines |
| Zero-shot | Sim-MinMax | 65.87 | Contrastive learning, lacks universal prior |
| Zero-shot | DAI-Net | 68.44 | Prev. SOTA |
| Ours | UniPrior | 74.52 | +6.08 over prev. SOTA, negligible overhead |
| Ours | UniPrior + TTA | 75.72 | Further gain via sample-wise enhancement |

Segmentation (mIoU) and Face Detection (mAP) also set new zero-shot SOTAs:

| Task / Dataset | Metric | Best Prev. (Zero-shot) | Ours | Ours + TTA |
| --- | --- | --- | --- | --- |
| Seg. / Nighttime Driving | mIoU | 44.9 (Sim-MinMax) | 48.2 | 48.6 |
| Seg. / ACDC-Night | mIoU | 27.6 (Sim-MinMax) | 27.9 | 28.5 |
| Detection / DarkFace | mAP | 28.0 (DAI-Net) | 31.3 | 33.6 |

Ablation Study¶

| Configuration | CODaN ACC (%) | ND mIoU (%) | Explanation |
| --- | --- | --- | --- |
| Full (Ours) | 74.52 | 48.20 | Complete model |
| w/o zero-init | 62.25 | 39.31 | Largest performance drop |
| with naive feature align | 63.26 | 41.72 | L1 align instead of self-correlation |
| w/o contrastive aug. | 68.89 | 47.07 | Removed contrastive augmentation |
| w/o correlation align. | 69.71 | 46.67 | Removed correlation alignment |
| w/o prior decoder | 70.26 | 45.23 | Loss of information bottleneck |
| w/o invariant prior | 67.24 | 44.62 | Removed signal layer prior |
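The per-component drops relative to the full model can be recomputed from the table with a few lines of Python. The numbers are copied from the rows above; the variable names are only illustrative.

```python
# Ablation numbers copied from the table: (classification ACC %, ND mIoU %).
full_acc, full_miou = 74.52, 48.20
ablations = {
    "w/o zero-init": (62.25, 39.31),
    "with naive feature align": (63.26, 41.72),
    "w/o contrastive aug.": (68.89, 47.07),
    "w/o correlation align.": (69.71, 46.67),
    "w/o prior decoder": (70.26, 45.23),
    "w/o invariant prior": (67.24, 44.62),
}

# Drop of each variant relative to the full model, in percentage points.
drops = {
    name: (round(full_acc - acc, 2), round(full_miou - miou, 2))
    for name, (acc, miou) in ablations.items()
}

# List variants from the largest to the smallest accuracy drop.
for name, (d_acc, d_miou) in sorted(drops.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:26s} ACC -{d_acc:.2f}   mIoU -{d_miou:.2f}")

largest = max(drops, key=lambda name: drops[name][0])
```

Sorting by accuracy drop makes it easy to see which components matter most on each metric, and that the ranking differs slightly between classification and segmentation.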

Key Findings¶

Zero-init and "Correlation Map Alignment" are critical: Removing zero-init (62.25%) or replacing self-correlation alignment with naive L1 feature alignment (63.26%) drops accuracy below 64%, indicating that the gain comes from these specific designs: preserving daytime capabilities and capturing semantic structures rather than absolute values.
Decoder Regularization as an Information Bottleneck is effective: Removing a decoder with only ~10% of the backbone's parameters costs 4.26 points of classification accuracy (74.52% to 70.26%) and 2.97 points of ND mIoU, supporting the "compact distillation" motivation.
Enhancement can harm segmentation: Human-oriented enhancement (e.g., EnlightenGAN) can drop ND mIoU to 25.2 (lower than baseline 34.3), highlighting the necessity of machine-oriented designs.

Highlights & Insights¶

The "Correlation Map vs. Feature" Insight: Rather than mimicking high-dimensional DINOv2 features, aligning the softmax-normalized self-correlation matrix (contextual patterns) bypasses the capacity gap and captures semantic essence.
Cross-level Unified Prior: The signal layer (invariant prior), feature layer (VFM semantics), and pixel layer (enhancement) are coupled through a single unified prior framework, which is the root cause of its generalization to unseen degradations.
Transferable Tricks: ① Zero-init for "painless integration" of new modalities; ② Outlier-robust loss using quantile thresholds for synthetic data; ③ Maximum entropy in CLIP space to erase nuisance attributes (illumination).

Limitations & Future Work¶

Dependency on Pre-trained Quality: If foundation models (DINOv2/CLIP) are weak in specific domains (e.g., thermal, underwater), UniPrior's performance ceiling will be limited.
TTA Overhead: Single image inference with TTA takes 2.593s (vs. 0.005s for the main model), making it unsuitable for real-time systems like autonomous driving unless optimized.
Synthetic Gap: Stage 1 relies on Dark-ISP synthesis. While outlier-robust losses help, a large gap between synthetic and real sensor noise may still affect signal constancy.

vs. Sim-MinMax: Both use contrastive learning, but Sim-MinMax lacks universal priors. UniPrior leads by 8.65 points in classification accuracy (74.52% vs. 65.87%, without TTA).
vs. DAI-Net: DAI-Net uses reflectance decomposition; UniPrior integrates VFM semantic priors to gain +5.6 mAP in face detection with TTA (33.6 vs. 28.0), or +3.3 mAP without TTA.
vs. Enhancement-only: Pure enhancement often introduces artifacts. UniPrior's machine-oriented approach couples the alignment and correction, leading to superior perception.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ (Self-correlation alignment insight + 3-layer prior integration)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (3 tasks, complexity analysis, detailed ablation)
Writing Quality: ⭐⭐⭐⭐ (Clear logic, though some notation is dense)
Value: ⭐⭐⭐⭐⭐ (New zero-shot SOTA, transferable tricks)