Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models
Conference: ICML 2026
arXiv: 2605.08145
Code: None
Area: Multimodal VLM
Keywords: Modality redundancy, PID decomposition, self-captioning, robust instruction tuning, modality corruption
TL;DR
This work uses Pointwise Partial Information Decomposition (PPID) to quantify vision-text modality interactions and proposes a Multimodal Interaction Gate that automatically selects samples dominated by "image-unique information" and lets the VLM generate its own captions for them. Injecting these captions into the text side converts unique visual signals into redundant shared signals. As a result, visual hallucination under blurry or corrupted inputs drops by up to 38.3% and consistency improves by up to 16.8% (SmolVLM family, \(\tau = 50\%\)).
Background & Motivation
Background: Current mainstream VLM instruction tuning (e.g., LLaVA, SmolVLM series) deliberately reduces text-image redundancy, concentrating task-relevant information solely in the image to enforce "visual grounding" and suppress text-only shortcuts.
Limitations of Prior Work: This excessive grounding strategy backfires—once the image is corrupted by noise/occlusion or the text is already ambiguous, the model lacks shared information for mutual compensation, immediately exposing hallucinations and inconsistent outputs. Existing robustness solutions (e.g., redundancy-based objectives by Wörtwein/Nguyen et al.) are only effective when the data is already redundant, failing on grounding-centric datasets.
Key Challenge: Visual grounding and modality robustness are at odds at the data level—reducing redundancy benefits grounding, increasing redundancy benefits robustness, yet current dataset curation relies entirely on intuition, lacking a quantifiable redundancy control knob.
Goal: (1) Propose a quantification framework using PID to decompose modality interaction into redundant \(R\), unique \(U_V, U_T\), and synergistic \(S\) components; (2) Design a systematic data augmentation algorithm to explicitly increase exploitable redundancy \(R\) while preserving the structure of synergy-dominated samples.
Key Insight: The authors observe that grounding-centric datasets typically exhibit a "visual unique \(U_V\) dominant" distribution. Translating this exclusive visual information into text directly converts it into redundancy, while keeping the image unchanged and maintaining \(I(X_V; Y)\).
Core Idea: Let the VLM "write captions" for self-selected samples, transferring image-unique information to the text side, converting \(U_V\) into \(R\), thus systematically increasing modality redundancy without altering the image.
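The bookkeeping behind this idea can be written out with the standard PID identities. This is a sketch at the aggregate level, under the assumption that captions leave \(U_T\) and \(S\) untouched; the paper's estimator works per sample, and \(\Delta R\) is a symbol introduced here for illustration:

\[
\begin{aligned}
I(X_V, X_T; Y) &= R + U_V + U_T + S,\\
I(X_V; Y) &= R + U_V,\\
I(X_T; Y) &= R + U_T.
\end{aligned}
\]

Since the image is unchanged, \(I(X_V; Y) = R + U_V\) stays fixed. If captions raise redundancy by \(\Delta R\), then

\[
R' = R + \Delta R, \qquad U_V' = U_V - \Delta R, \qquad I(X_T'; Y) = I(X_T; Y) + \Delta R,
\]

so the text side gains exactly the information that stops being exclusive to the image.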
Method
Overall Architecture
Input: a grounding-centric instruction dataset \(\mathcal{D}=\{(x_V, x_T, y)\}\). The pipeline has three steps: (1) a PPID estimator \(\mathcal{F}\) estimates the four interaction quantities \(r, u_V, u_T, s\) for each sample; (2) the Multimodal Interaction Gate selects the subset \(S_{valid}\) of samples where \(u_V\) dominates, capped by the injection ratio \(\tau\), and sends them to the VLM itself or a smaller caption model to generate descriptions \(c_n\), which are concatenated with the original text as \(x_T' = \text{Concat}(x_T, c_n)\); (3) the VLM (SmolVLM, LLaVA-OneVision-1.5) is fine-tuned on the augmented dataset \(\mathcal{D}'\) with standard LoRA SFT, so the training objective is unchanged and the only change is on the data side.
Key Designs
- Sample-level Interaction Estimator based on PPID:
    - Function: Approximates \(r, u_V, u_T, s\) in embedding space, serving as the selection signal for the MI Gate.
    - Mechanism: For each sample and each modality \(m \in \{V, T\}\), compute the pointwise specificity \(i^+(x_m;y)=h(x_m)\) and ambiguity \(i^-(x_m;y)=h(x_m|y)\), so that \(i(x_m;y)=i^+(x_m;y)-i^-(x_m;y)\). Redundant specificity is the minimum across modalities, \(r^+ = \min_m i^+(x_m;y)\), redundant ambiguity is \(r^- = \min_m i^-(x_m;y)\), and \(r = r^+ - r^-\). The unique terms \(u_V, u_T\) follow from \(i(x_m;y)=r+u_m\), and synergy is the remainder of the joint information, \(s = i(x_V,x_T;y) - r - u_V - u_T\). Entropies are estimated with KNIFE, a differentiable Gaussian-mixture estimator, and the classifier is a 3-layer MLP.
    - Design Motivation: Sample-level rather than aggregate estimation is needed to identify precisely which samples can be safely converted; this turns modality interaction from a "dataset-level label" into a "per-sample signal".
- Multimodal Interaction Gate:
    - Function: Selects convertible samples and controls the injection ratio without disrupting synergy structure.
    - Mechanism: Define \(S_{valid}=\{n \mid u_{V,n}=\max(r_n,u_{V,n},u_{T,n},s_n)\}\), i.e., a sample is eligible only when \(u_V\) is its largest interaction term. Then select \(k = \min(\lfloor \tau N \rfloor, |S_{valid}|)\) samples globally, generate \(c_n\) with a captioner, and append it to the text. Synergy-dominated samples (e.g., from UR-FUNNY) are explicitly bypassed to avoid introducing \(u_T\) noise.
    - Design Motivation: Experiments (Table 2) show that forcibly captioning synergy-dominated samples makes \(U_T\) surge by +750%, replacing synergy with text-unique information, so the Gate must exclude them. The ratio \(\tau\) acts as a "redundancy strength" knob on the training side and correlates monotonically with downstream robustness.
- Self-Captioning SFT Workflow:
    - Function: Uses the VLM itself as the captioner, closing the loop on training-data augmentation.
    - Mechanism: Before training, 25% or 50% of Cauldron samples are captioned by the target VLM (or a smaller SmolVLM-2B), and the captions are appended to the text side, followed by standard SFT with LoRA. Caption generation and training are decoupled, so the captioning cost can be amortized.
    - Design Motivation: Avoids introducing confounding knowledge from large external models, so that redundancy is the only independent variable. Hypothesis 4 states that caption errors average out as the injection ratio increases; in practice, even a 2B captioner raises \(R\) by 243% and reduces \(U_V\) by 43%, indicating that small models suffice.
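
The estimator arithmetic and the gate rule described in the list above can be sketched in Python. This is a minimal illustration, not the authors' implementation. The names `Interaction`, `pointwise_pid`, `interaction_gate`, and `append_captions` are hypothetical; the pointwise entropies are assumed to come from an external estimator (the paper uses KNIFE); and ranking eligible samples by \(u_V\) is an assumption, since the summary only says that \(k\) samples are selected globally.

```python
from dataclasses import dataclass
from math import floor
from typing import Callable, List, Sequence


@dataclass
class Interaction:
    r: float    # redundancy
    u_v: float  # vision-unique information
    u_t: float  # text-unique information
    s: float    # synergy


def pointwise_pid(h_v: float, h_v_given_y: float,
                  h_t: float, h_t_given_y: float,
                  i_joint: float) -> Interaction:
    """Decompose one sample from pointwise entropies (assumed precomputed).

    h_m is the specificity i^+(x_m; y), h_m_given_y the ambiguity i^-(x_m; y),
    and i_joint is the joint pointwise information i(x_V, x_T; y).
    """
    r = min(h_v, h_t) - min(h_v_given_y, h_t_given_y)  # r = r^+ - r^-
    u_v = (h_v - h_v_given_y) - r                      # from i(x_V; y) = r + u_V
    u_t = (h_t - h_t_given_y) - r                      # from i(x_T; y) = r + u_T
    s = i_joint - r - u_v - u_t                        # synergy is the remainder
    return Interaction(r, u_v, u_t, s)


def interaction_gate(scores: Sequence[Interaction], tau: float) -> List[int]:
    """Return indices of samples to caption: u_V-dominant, at most floor(tau * N)."""
    valid = [n for n, z in enumerate(scores) if z.u_v >= max(z.r, z.u_t, z.s)]
    k = min(floor(tau * len(scores)), len(valid))
    # Assumption: prefer the samples with the largest u_V when k < |S_valid|.
    valid.sort(key=lambda n: scores[n].u_v, reverse=True)
    return sorted(valid[:k])


def append_captions(texts: Sequence[str], images: Sequence[object],
                    selected: Sequence[int],
                    captioner: Callable[[object], str]) -> List[str]:
    """Build x_T' = Concat(x_T, c_n) for selected samples; leave the rest unchanged."""
    chosen = set(selected)
    return [f"{text} {captioner(image)}" if n in chosen else text
            for n, (text, image) in enumerate(zip(texts, images))]
```

Synergy-dominated samples never enter `valid`, so they pass through `append_captions` untouched, which mirrors the bypass behavior described for the Gate.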
Loss & Training
The training loss is standard LoRA SFT next-token prediction, with no new objectives introduced; all robustness gains come from data-side \(R\) injection. Captioning uses temperature 0 and length constraints to avoid irrelevant drift. Task-specific settings fully utilize the MI Gate; for open-ended general settings where \(y\) is undefined, a weakened version randomly selects 25%/50% of samples for captioning.
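
For the open-ended setting, where no label \(y\) is available to drive the Gate, the weakened selection could look like the sketch below. The function name `random_caption_subset` and the fixed seed are assumptions made for illustration; the summary only states that 25% or 50% of samples are chosen at random.

```python
import random
from typing import List


def random_caption_subset(num_samples: int, ratio: float, seed: int = 0) -> List[int]:
    """Pick int(ratio * num_samples) distinct sample indices uniformly at random."""
    k = int(num_samples * ratio)
    rng = random.Random(seed)  # local generator so the choice is reproducible
    return sorted(rng.sample(range(num_samples), k))
```

Fixing the seed keeps the captioned subset identical across model sizes, which makes comparisons between runs easier to interpret.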
Key Experimental Results
Main Results
| Model Family | \(\tau\) | \(\Delta Acc \uparrow\) | \(\Delta VI \downarrow\) | \(\Delta LI\) | \(\Delta Consist. \uparrow\) |
|---|---|---|---|---|---|
| SmolVLM (256M/500M/2B) | 25% | +2.7% | -23.6% | +9.5% | +8.5% |
| SmolVLM (256M/500M/2B) | 50% | +4.0% | -38.3% | +15.2% | +16.8% |
| LLaVA-OneVision (4B/8B) | 25% | +2.4% | -34.4% | +2.9% | +6.2% |
| LLaVA-OneVision (4B/8B) | 50% | +2.5% | -6.5% | -6.8% | +5.5% |
Ablation Study
| Configuration | \(R\) (baseline value / relative change) | \(U_V\) (baseline value / relative change) | \(U_T\) | Notes |
|---|---|---|---|---|
| Baseline (Hateful Memes train) | \(0.0553\) | \(0.3465\) | \(-0.0125\) | Original data |
| + Random text concatenation | +23% | -2% | \(0\) | Adding text alone is insufficient; semantics are necessary |
| + SmolVLM-2B caption | +243% | -43% | \(0\) | Small captioner suffices |
| + Qwen2.5-32B caption | +319% | -51% | \(0\) | Larger captioner yields limited marginal gains |
| Synergy-dominated UR-FUNNY + caption | +0% | +0% | +750% | Failure case, validates Hypothesis 5 |
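
To make the relative changes in the table easier to compare, the sketch below converts them into implied absolute values. It assumes the percentages are relative to the Hateful Memes baseline row; the resulting numbers are derived arithmetic, not values reported in the paper, and `implied_value` is a hypothetical helper.

```python
def implied_value(baseline: float, percent_change: float) -> float:
    """Apply a relative change given in percent to a baseline value."""
    return baseline * (1.0 + percent_change / 100.0)


# Baseline row of the ablation table (Hateful Memes train).
BASELINE = {"R": 0.0553, "U_V": 0.3465}

# Relative changes (in percent) copied from the table rows.
CHANGES = {
    "random text": {"R": 23, "U_V": -2},
    "SmolVLM-2B caption": {"R": 243, "U_V": -43},
    "Qwen2.5-32B caption": {"R": 319, "U_V": -51},
}

implied = {
    name: {key: implied_value(BASELINE[key], pct) for key, pct in row.items()}
    for name, row in CHANGES.items()
}

for name, row in implied.items():
    print(f"{name}: R={row['R']:.4f}, U_V={row['U_V']:.4f}")
```

Under this assumption, the SmolVLM-2B captions bring \(R\) and \(U_V\) to roughly the same magnitude (about 0.19 and 0.20), whereas random text leaves \(U_V\) dominant.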
Key Findings
- Increasing \(\tau\) (higher caption injection ratio) monotonically improves performance stability \(\Delta P\) under modality corruption, consistently across five SmolVLM/LLaVA sizes (256M→8B), showing that caption noise from small captioners averages out.
- There is a trade-off in redundancy enhancement: while visual hallucination VI decreases, language-induced and mixed errors rise slightly, as the model indeed uses the text channel more frequently—this directly validates Hypothesis 1.
- On general benchmarks (MMMU, MMStar, MathVista, TextVQA), redundancy enhancement often brings unexpected "positive side effects," e.g., 8B model's MMMU score rises from 41.4 to 49.9, attributed to more robust multimodal fusion also improving general grounding tasks.
Highlights & Insights
- By using PID's redundant specificity/ambiguity, "redundancy" is grounded as a sample-level, quantifiable scalar, giving data augmentation a measurable target signal for the first time; this estimator is decoupled from downstream models and can be applied to any pretrained multimodal backbone.
- The MI Gate provides an elegant "single knob": \(\tau\) allows continuous adjustment between robustness and grounding, offering a reproducible protocol for dataset curation rather than relying on intuition.
- The synergy-bypass detail is crucial: using UR-FUNNY, the authors show that "not adding captions" is part of the design—this explicit refusal to convert certain samples can be transferred to any PID-driven data augmentation method, avoiding "more augmentation, worse results."
Limitations & Future Work
- The estimator relies on a trained auxiliary classifier and entropy estimator; for open-ended generation tasks (no discrete \(y\)), it degrades to "random caption addition," losing Gate selection ability and reducing result consistency (for 4B models, \(\Delta LI\) becomes positive at \(\tau=50\%\)).
- Only vision+text modalities are considered, with conversion in the image→text direction; reverse (text→image, requiring diffusion models) and audio/video modalities are only demonstrated as proof-of-concept, with cost and error control not systematically quantified.
- The upper bound of caption quality determines the upper bound of \(r\); for fine-grained structure, spatial relations, OCR, etc., a 2B captioner is likely to miss key unique information, so \(r\) fails to rise and \(u_V\) fails to fall. Some form of captioner capability check is needed.
Related Work & Insights
- vs Wörtwein et al. 2024 / Nguyen et al. 2025: They encode redundancy into the training objective, which only helps when the data is already redundant; this work proactively creates redundancy on the data side, making the two approaches complementary.
- vs LLaVA-1.5 / Cauldron-style grounding data: These works deliberately reduce redundancy to enhance grounding; this work does the opposite and shows that sacrificing grounding for robustness is worthwhile under modality corruption.
- vs Mixture-of-Interaction Experts (Xin et al. 2025): They use PID to guide expert division of labor, still "using" interaction; this work "modifies" interaction, offering a completely different algorithmic path.
- vs HallusionBench / GQA-corruption evaluation protocols: This work is not a new benchmark, but systematically links existing robustness protocols with PID metrics for the first time, providing interpretable information-theoretic indicators for "robustness improvement."
- vs Simple text augmentation: Concatenating random text without sample selection (the Random text control in the paper) raises \(R\) by only 23% and leaves \(U_V\) nearly unchanged (-2%), indicating that the gain comes from semantically relevant captions on Gate-selected samples rather than from extra text alone.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First to apply sample-level PPID to data augmentation and systematically validate the feasibility of "converting unique to redundant."
- Experimental Thoroughness: ⭐⭐⭐⭐ 5 model sizes across two VLM families, plus modality corruption, general benchmarks, failure cases, and a bi-directional proof-of-concept.
- Writing Quality: ⭐⭐⭐⭐ 5 hypotheses matched with experiments, clear argument chain; formulas are dense but figures are intuitive.
- Value: ⭐⭐⭐⭐ Provides a quantifiable dial for multimodal instruction data curation, immediately usable in engineering with very low overhead.