DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning¶
Conference: ICLR 2026 arXiv: 2602.00795 Code: None Area: Reinforcement Learning Keywords: Few-Shot Learning, Vision-Language Alignment, Reinforcement Learning Gating, Dual-Level Semantics, Cross-Modal Fusion
TL;DR¶
This paper proposes DVLA-RL, a framework that employs Dual-level Semantic Construction (DSC) to generate complementary low-level attributes and high-level descriptions, and uses RL-gated Attention (RLA) to dynamically balance the contributions of self-attention and cross-attention across different network layers. This achieves hierarchical vision-language alignment from low to high levels, attaining state-of-the-art performance on 9 few-shot learning benchmarks.
Background & Motivation¶
Few-shot learning (FSL) aims to generalize to novel categories from only a handful of examples. Current semantic-based FSL methods leverage LLM-generated textual semantics to enhance visual representations, but suffer from two key limitations:
Single-level semantic bottleneck: Existing methods rely either solely on high-level descriptions (e.g., SemFew generates class-level descriptions) or solely on low-level attributes (e.g., ECER generates specific entity attributes), and thus fail to simultaneously provide fine-grained discrimination and holistic class understanding.
Static fusion modules: Existing methods employ fixed MLPs to fuse cross-modal information, unable to adaptively adjust the vision-language alignment strategy across different network depths—shallower layers should focus on local details, while deeper layers should emphasize global semantics.
Core innovations: (1) constructing complementary dual-level semantics (attributes + descriptions); (2) being the first to introduce RL into vision-language alignment for FSL, enabling dynamic gating of cross-modal fusion.
Method¶
Overall Architecture¶
DVLA-RL consists of two core modules:
- Dual-level Semantic Construction (DSC): Generates dual-level semantics comprising low-level attributes and high-level descriptions.
- RL-gated Attention (RLA): Dynamically balances cross-modal attention via an RL policy.
Key Design 1: Dual-level Semantic Construction (DSC)¶
Step 1: Visual Attribute Extraction
Conditioned on the class name and support samples, an LLM (Qwen2.5-VL-32B) is queried: "What are the key distinguishing attributes of the CLASS in the given image? List concise attributes", yielding a candidate attribute set \(A^{C^i_{sup}} = \{a_1, \dots, a_s\}\).
Step 2: Progressive Top-k Selection
Each attribute is encoded via the CLIP text encoder, and cosine similarity \(s_j = \cos(T^{(i)}, a_j)\) is computed against the current template embedding, with the initial template being "A photo of a {CLASS}". At each step, the most relevant attribute is selected and the template is updated; after \(k\) iterations, the most discriminative attributes are retained, suppressing hallucinations and redundant attributes from the LLM. Each selected attribute is embedded into the template "A photo of a {CLASS}, which has {attribute}" for low-level alignment.
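The greedy selection loop above can be sketched in plain Python. Here `embed` is a stand-in for the CLIP text encoder (any text-to-vector function), and the template-update rule, which folds all previously selected attributes into the prompt, is one plausible reading of the paper; the exact rule may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def progressive_top_k(embed, class_name, attributes, k):
    """Greedily pick the k attributes most similar to the evolving template.

    `embed` is a placeholder for the CLIP text encoder; the template update
    (accumulating selected attributes) is an assumption, not the paper's
    verbatim rule.
    """
    remaining = list(attributes)
    selected = []
    template = f"A photo of a {class_name}"
    for _ in range(min(k, len(remaining))):
        t_vec = embed(template)
        # score every remaining candidate against the current template
        scores = [cosine(t_vec, embed(a)) for a in remaining]
        best = max(range(len(remaining)), key=scores.__getitem__)
        selected.append(remaining.pop(best))
        # fold the chosen attributes back into the template for the next round
        template = f"A photo of a {class_name}, which has {', '.join(selected)}"
    return selected
```

With a real CLIP encoder plugged in as `embed`, each iteration re-scores the shrinking candidate pool against the enriched template, which is what suppresses redundant or hallucinated attributes.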
Step 3: Attribute Description Synthesis
The selected attributes are synthesized into a fluent scientific description \(D_i\) by the LLM, providing holistic semantics complementary to the local attributes. For example: "The Komondor is a … dog with massive size and uniquely corded white coat."
Key Design 2: RL-gated Attention (RLA)¶
Given visual tokens \(H_{\mathrm{img}}\) and textual semantics \(H_{\mathrm{text}}\), RLA computes two parallel attention pathways:
- Image-guided path (cross-attention): \(\hat{H} = \mathrm{Attn}(W^q_\text{text}\bar{H}_\text{text}, W^k_\text{img}\bar{H}_\text{img}, W^v_\text{img}\bar{H}_\text{img})\)
- Text-guided path (self-attention): \(\tilde{H} = \mathrm{Attn}(W^q_\text{text}\bar{H}_\text{text}, W^k_\text{text}\bar{H}_\text{text}, W^v_\text{text}\bar{H}_\text{text})\)
These are fused via stochastic gating: \(H = \alpha \hat{H} + (1-\alpha) \tilde{H}\), where \(\alpha \sim \pi_\theta(\cdot|s)\).
State representation: \(s = \phi([\mathrm{GAP}(\bar{H}_\text{img}) \| \mathrm{GAP}(\bar{H}_\text{text}) \| \cos(\mathrm{GAP}(\bar{H}_\text{img}), \mathrm{GAP}(\bar{H}_\text{text}))])\)
Policy distribution: \(\pi_\theta(\alpha|s) = \mathrm{Beta}(\kappa p_\theta(s), \kappa(1 - p_\theta(s)))\), where \(\kappa\) controls the trade-off between exploration and determinism.
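A minimal, dependency-free sketch of the Beta-gated fusion follows. `p` plays the role of \(p_\theta(s)\) (the policy head's output in \((0,1)\)), `kappa=10` matches the paper's hyperparameter, and the attention outputs are reduced to plain lists for illustration; a real implementation would operate on tensors.

```python
import random

def beta_gate(h_cross, h_self, p, kappa=10.0, rng=random):
    """Sample alpha ~ Beta(kappa*p, kappa*(1-p)) and fuse the two pathways:
    H = alpha * H_cross + (1 - alpha) * H_self.

    `p` stands for the policy mean p_theta(s); vectors are lists of floats.
    """
    p = min(max(p, 1e-3), 1.0 - 1e-3)  # keep both Beta parameters positive
    alpha = rng.betavariate(kappa * p, kappa * (1.0 - p))
    fused = [alpha * c + (1.0 - alpha) * s for c, s in zip(h_cross, h_self)]
    return alpha, fused
```

Because \(\mathbb{E}[\alpha] = p_\theta(s)\) and the variance shrinks as \(\kappa\) grows, large \(\kappa\) makes the gate nearly deterministic while small \(\kappa\) keeps it exploratory, which is exactly the trade-off \(\kappa\) controls in the paper.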
Loss & Training¶
RL Reward: \(R_t = \lambda_\text{sim} \cdot \cos(U \cdot \mathrm{GAP}(H), \mathbf{t}^\star) + \lambda_\text{imp} \cdot (\mathrm{Acc}_t - \mathrm{Acc}_{t-1})\)
- The first term encourages vision-text alignment (cosine similarity with CLIP ground-truth text embeddings).
- The second term measures intra-episode accuracy improvement.
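The reward is straightforward to compute once the pooled features are projected into CLIP text space. In this sketch, `h_proj` stands for \(U \cdot \mathrm{GAP}(H)\) and `t_star` for the ground-truth CLIP text embedding \(\mathbf{t}^\star\); the default weights follow the paper's \(\lambda_\text{sim}=0.5\), \(\lambda_\text{imp}=1.0\).

```python
import math

def rl_reward(h_proj, t_star, acc_t, acc_prev, lam_sim=0.5, lam_imp=1.0):
    """R_t = lam_sim * cos(h_proj, t_star) + lam_imp * (acc_t - acc_prev).

    `h_proj` is a placeholder for the projected pooled features U @ GAP(H);
    `t_star` for the CLIP ground-truth text embedding.
    """
    dot = sum(a * b for a, b in zip(h_proj, t_star))
    norm = (math.sqrt(sum(a * a for a in h_proj))
            * math.sqrt(sum(b * b for b in t_star)))
    return lam_sim * (dot / norm) + lam_imp * (acc_t - acc_prev)
```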
Policy gradient: \(\nabla_\theta \mathcal{J} = \mathbb{E}[(R_t - b_t) \nabla_\theta \log \pi_\theta(\alpha|s)] + \tau \nabla_\theta \mathsf{H}(\pi_\theta(\cdot|s))\)
Entropy regularization is included to prevent premature policy collapse; an exponential moving average baseline is used to reduce variance.
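The pieces a REINFORCE step needs, the Beta log-density and the EMA baseline, can be written with the standard library alone. The momentum value 0.9 is an assumption for illustration; the paper does not specify it. Differentiating the log-probability with respect to the policy parameters is left to an autograd framework in practice.

```python
import math

def beta_log_prob(alpha, a, b):
    """Log-density of Beta(a, b) evaluated at alpha in (0, 1)."""
    return ((a - 1.0) * math.log(alpha) + (b - 1.0) * math.log(1.0 - alpha)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

class EMABaseline:
    """Exponential moving average of past rewards, used as the
    variance-reducing baseline b_t in the policy gradient.
    The momentum 0.9 is an illustrative choice, not from the paper."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = None

    def update(self, reward):
        if self.value is None:
            self.value = reward
        else:
            self.value = self.momentum * self.value + (1.0 - self.momentum) * reward
        return self.value

def reinforce_weight(reward, baseline):
    """Advantage (R_t - b_t) that multiplies grad log pi_theta(alpha|s);
    the baseline is read before being updated with the new reward."""
    b = baseline.value if baseline.value is not None else 0.0
    baseline.update(reward)
    return reward - b
```

In a full training step, `(R_t - b_t) * grad(beta_log_prob(alpha, kappa*p, kappa*(1-p)))` plus the entropy term \(\tau \nabla_\theta \mathsf{H}\) gives the update in the policy-gradient equation above.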
Total loss: \(\mathcal{L}_\text{total} = \mathcal{L}_\text{sup} + \lambda \mathcal{L}_\text{RL}\), where \(\mathcal{L}_\text{sup}\) is cross-entropy loss based on a prototypical classifier.
Training procedure: Two stages—(1) large-scale pre-training for 300–800 epochs; (2) episodic meta-tuning for 100 epochs. RL hyperparameters: \(\kappa=10\), \(\lambda_\text{sim}=0.5\), \(\lambda_\text{imp}=1.0\), \(\lambda=0.1\), \(\tau=0.2\).
Key Experimental Results¶
Main Results: General Few-Shot Classification¶
| Method | miniImageNet 1-shot | miniImageNet 5-shot | tieredImageNet 1-shot | CIFAR-FS 1-shot |
|---|---|---|---|---|
| SemFew (CVPR'24) | 78.94 | 86.49 | 82.37 | 84.34 |
| ECER (AAAI'25) | 81.14 | - | 81.81 | 86.01 |
| CPL (TPAMI'25) | 72.82 | 87.93 | 78.05 | 78.82 |
| DVLA-RL | 81.69 | 88.25 | 83.02 | 87.18 |
Main Results: Fine-Grained Few-Shot Classification¶
| Method | CUB 1-shot | CUB 5-shot | Dogs 1-shot | Cars 1-shot |
|---|---|---|---|---|
| SUITED (AAAI'25) | 86.02 | 94.13 | 76.55 | 89.97 |
| BSFA (TCSVT'23) | 86.00 | 92.53 | 69.58 | 88.93 |
| DVLA-RL | 91.93 | 95.06 | 89.64 | 92.95 |
On fine-grained tasks, DVLA-RL surpasses the second-best method by 5.4–15.3 percentage points in the 1-shot setting, demonstrating that dual-level semantics are particularly effective at capturing subtle inter-class distinctions.
Cross-Domain Few-Shot Classification¶
| Method | CUB 1-shot | Places 1-shot | ChestX 1-shot |
|---|---|---|---|
| MEFP (NeurIPS'24) | 51.55 | 52.06 | 22.82 |
| SVasP (AAAI'25) | 49.49 | 59.07 | 23.23 |
| DVLA-RL | 67.46 | 69.26 | 23.47 |
In cross-domain settings, DVLA-RL outperforms the second-best method by 15.9 percentage points on CUB and 10.2 on Places, demonstrating strong domain transfer capability.
Ablation Study¶
Ablation experiments validate the necessity of each component:
- Removing DSC (using only class-name templates): ~3–5% drop under 1-shot.
- Fixing \(\alpha\) (removing RL gating): significant performance degradation, confirming adaptive fusion is superior to static fusion.
- Removing low-level attributes or high-level descriptions: both cause performance drops, confirming the complementarity of dual-level semantics.
- Removing Progressive Top-k: degraded attribute quality leads to lower performance.
Key Findings¶
- In shallower RLA layers, the policy tends toward larger \(\alpha\) (more cross-attention → focus on attribute details); in deeper layers, toward smaller \(\alpha\) (more self-attention → integration of global semantics).
- The Beta distribution policy exhibits clear adaptive behavior across different episodic tasks.
Highlights & Insights¶
- First application of RL to vision-language alignment in FSL: The Beta distribution policy combined with the REINFORCE algorithm elegantly achieves layer-adaptive fusion.
- Complementary dual-level semantics: Low-level attributes provide fine-grained discriminative cues, while high-level descriptions offer holistic class understanding; Progressive Top-k effectively suppresses LLM hallucinations.
- Large margins on fine-grained and cross-domain tasks (5–16%) indicate the method is particularly effective at capturing subtle differences and enabling domain transfer.
- Lightweight design: the RLA module introduces minimal additional parameters, and RL training is stable.
Limitations & Future Work¶
- Reliance on an LLM (Qwen2.5-VL-32B) for attribute generation introduces inference latency.
- Attributes and descriptions can be precomputed, but novel classes still require LLM inference.
- Gains are limited in extreme cross-domain scenarios such as ChestX (under one percentage point), indicating that vision-language alignment under severe domain shift remains challenging.
- Hyperparameters such as \(\kappa\) in RL gating require validation-set tuning.
Related Work & Insights¶
- Compared to SemFew, which uses only high-level descriptions, and ECER, which uses only low-level entities, the dual-level design of DSC represents a natural unification of both approaches.
- The RL gating concept is generalizable to any scenario requiring adaptive cross-modal fusion (e.g., VQA, image-text retrieval).
- The Progressive Top-k selection mechanism can be applied to other tasks that require filtering high-quality information from LLM outputs.
- A Beta distribution policy is more suitable than Bernoulli or Gaussian for continuous gating over the \([0,1]\) interval.
Rating¶
- Novelty: ⭐⭐⭐⭐ (RL gating + dual-level semantics is a meaningful and novel combination)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (9 benchmarks, 3 scenarios, 20+ baselines)
- Writing Quality: ⭐⭐⭐⭐ (clear structure, complete formulations)
- Value: ⭐⭐⭐⭐ (significant SOTA results, good methodological generalizability)