
Explore How to Inject Beneficial Noise in MLLMs

Conference: AAAI 2026
arXiv: 2511.12917
Code: https://github.com/zhuruishu0848/MuNG
Area: Multimodal VLM
Keywords: Positive-Incentive Noise, Parameter-Efficient Fine-Tuning, Multimodal Large Language Models, Variational Inference, Cross-Modal Alignment

TL;DR

This paper proposes the Multimodal Noise Generator (MuNG), which dynamically generates "beneficial noise" from image-text pairs via a variational inference framework and injects it into the frozen visual features of an MLLM. The approach suppresses task-irrelevant semantics and enhances cross-modal representation alignment, requiring only ~1% additional parameters while outperforming full fine-tuning and PEFT methods such as LoRA.

Background & Motivation

Current MLLMs (e.g., LLaVA, Qwen-VL) still exhibit notable deficiencies in spatial relationship understanding, hallucination suppression, and over-reliance on language priors. Full fine-tuning (Full-FT) is effective but incurs prohibitive computational costs and is prone to overfitting, particularly damaging the general knowledge acquired during pretraining when fine-tuning data is limited. Existing parameter-efficient fine-tuning methods (LoRA, Adapter, VPT, etc.) largely follow a unimodal optimization paradigm—either fine-tuning the LLM Decoder (LoRA, Adapter) or prepending learnable prompts on the visual side (VPT)—and thus neglect the cross-modal co-optimization needed for vision–language alignment, making it difficult to address distributional shift and alignment requirements in downstream tasks.

The authors identify a previously overlooked direction: rather than fine-tuning model parameters, one can instead modify the inputs fed to the LLM Decoder. Inspired by the theory of Positive-Incentive Noise (π-noise), they propose a novel paradigm—injecting carefully designed beneficial noise into the visual features of a frozen MLLM in order to reduce task complexity and improve model performance.

Core Problem

How to design a lightweight, multimodal-aware noise generation mechanism such that the injected noise reduces the conditional entropy of VQA tasks (i.e., simplifies the task), thereby substantially improving MLLM performance without altering the main model parameters?

The essence of this problem is: can a small, plug-in noise generator exploit cross-modal information to dynamically produce input-specific "beneficial perturbations," enabling the model to better focus on question-relevant visual semantics in the high-dimensional feature space while suppressing irrelevant interference?

Method

Overall Architecture

MuNG is inserted between the Feature Alignment Layer and the LLM Decoder of the MLLM. The overall pipeline is:

  1. Visual and textual inputs are encoded separately to obtain features \(X_V\) and \(X_L\).
  2. MuNG takes \(X_V\), \(X_L\) (and the target answer \(A\) during training) as input, models cross-modal relationships via a cross-attention mechanism, and outputs the mean \(\mu\) and standard deviation \(\sigma\) of the noise distribution.
  3. Noise \(\mathcal{E} = \mu + \sigma \cdot \epsilon\) is generated by sampling \(\epsilon\) from a standard normal distribution using the reparameterization trick.
  4. The noise is injected additively into the visual features \(X_V\) to produce enhanced visual representations.
  5. The enhanced visual features, together with the language features, are fed into the frozen LLM Decoder to produce the final output.
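The injection itself is simple; a minimal numpy sketch of steps 3–4, with a random linear map standing in for MuNG and all shapes and names purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 16 visual tokens, 8 text tokens, feature dim 32.
X_V = rng.standard_normal((16, 32))   # visual features after the alignment layer
X_L = rng.standard_normal((8, 32))    # language features

# Stand-in for MuNG: any module mapping (X_V, X_L) -> (mu, log_sigma) with the
# same shape as X_V; here just a random linear map over concatenated features.
W = rng.standard_normal((64, 64)) * 0.01
ctx = np.concatenate([X_V, np.broadcast_to(X_L.mean(axis=0), X_V.shape)], axis=1)
mu, log_sigma = np.split(ctx @ W, 2, axis=1)

# Reparameterization trick: sample eps ~ N(0, I) and shift/scale it, so that in
# a real training setup gradients can flow through mu and sigma.
eps = rng.standard_normal(mu.shape)
noise = mu + np.exp(log_sigma) * eps

# Additive injection: the original visual pathway is preserved.
X_V_enhanced = X_V + noise
```

The enhanced features replace \(X_V\) at the decoder input; everything downstream stays frozen.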

Key Designs

  1. Theoretical Foundation of π-noise: From an information-theoretic perspective, the authors define VQA task complexity as the conditional entropy \(H(\mathcal{T}) = \mathbb{E}[-\log p(A|X_V, X_L)]\). If injecting noise \(\mathcal{E}\) satisfies \(I(\mathcal{T}, \mathcal{E}) = H(\mathcal{T}) - H(\mathcal{T}|\mathcal{E}) > 0\)—i.e., the noise reduces the conditional entropy of the task—then the noise is classified as "beneficial." Since \(H(\mathcal{T})\) is constant for a given model, maximizing mutual information is equivalent to minimizing \(H(\mathcal{T}|\mathcal{E})\), providing rigorous theoretical justification for noise injection.

  2. Variational Approximation: Since directly computing \(p(A|X_V, X_L, \mathcal{E})\) is intractable, the authors introduce variational inference. Exploiting the non-negativity of KL divergence to derive a variational upper bound, and approximating the expectation via Monte Carlo sampling, the following optimizable loss function is obtained:

\[
L \approx \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ -\log q\!\left(A_i \mid X_V^i, X_L^i, G_\theta(\epsilon_j^i, A_i, X_V^i, X_L^i)\right) \right]
\]

where \(m\) is the number of noise samples drawn per training example and \(G_\theta\) is the noise generator.
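The bound takes only a few lines to derive; a sketch consistent with the notation above:

```latex
% Cross-entropy upper-bounds entropy, since KL(p || q) >= 0:
\mathbb{E}_p[-\log q] \;=\; \mathbb{E}_p[-\log p] + \mathrm{KL}(p \,\|\, q)
                      \;\ge\; \mathbb{E}_p[-\log p],
% hence, with the variational distribution q in place of the intractable p:
H(\mathcal{T}|\mathcal{E})
  \;=\; \mathbb{E}\!\left[-\log p(A \mid X_V, X_L, \mathcal{E})\right]
  \;\le\; \mathbb{E}\!\left[-\log q(A \mid X_V, X_L, \mathcal{E})\right].
```

Replacing the expectation over \(\mathcal{E} = G_\theta(\epsilon, \cdot)\) with \(m\) Monte Carlo samples of \(\epsilon\) per example then gives the loss \(L\). Here \(q\) is realized by the frozen LLM Decoder operating on the noise-enhanced features.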

  3. Cross-Attention-Based Multimodal Noise Generator (MuNG): In the concrete implementation, the noise generator uses visual features as Query and textual/answer features as Key/Value. A cross-attention mechanism dynamically analyzes cross-modal relationships and outputs \(\mu\) and \(\log(\sigma)\). During inference, answer \(A\) is not used; noise is generated from \(X_V\) and \(X_L\) alone. Ablation studies confirm that the cross-attention architecture substantially outperforms an MLP-based architecture.
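As a rough illustration of this design, a minimal single-head numpy sketch (class name, dimensions, and weight initialization are all assumptions for the example, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MuNGSketch:
    """Hypothetical single-head cross-attention noise generator."""
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d)
        self.Wq = rng.standard_normal((d, d)) * scale
        self.Wk = rng.standard_normal((d, d)) * scale
        self.Wv = rng.standard_normal((d, d)) * scale
        self.W_mu = rng.standard_normal((d, d)) * scale
        self.W_logsig = rng.standard_normal((d, d)) * scale
        self.d = d

    def forward(self, X_V, X_L, rng):
        # Visual tokens as Query; text (plus answer, at training time) as Key/Value.
        Q, K, V = X_V @ self.Wq, X_L @ self.Wk, X_L @ self.Wv
        attn = softmax(Q @ K.T / np.sqrt(self.d))   # (n_visual, n_text)
        ctx = attn @ V                              # question-conditioned context
        mu, log_sigma = ctx @ self.W_mu, ctx @ self.W_logsig
        eps = rng.standard_normal(mu.shape)         # reparameterization trick
        return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(1)
gen = MuNGSketch(d=32)
X_V = rng.standard_normal((16, 32))
X_L = rng.standard_normal((8, 32))
noise = gen.forward(X_V, X_L, rng)
X_V_enhanced = X_V + noise                          # additive injection
```

At inference time, only \(X_L\) built from the question is passed as Key/Value; the generator's parameters are the only trainable weights in the whole pipeline.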

  4. Noise Injection Strategy: Noise is injected additively into visual features (rather than multiplicatively), preserving the original inference pathway as much as possible. The injection point is chosen after the feature alignment layer and before the LLM Decoder, since the features at this stage have already been partially aligned by the pretrained model and carry richer integrated semantic information; injecting at a location close to the output layer also reduces the number of parameters involved in backpropagation.

Loss & Training

  • Loss Function: The loss is essentially a standard autoregressive language modeling loss, but with visual features augmented by noise generated by MuNG. During training, the question and target answer are concatenated as input, while the loss is computed only over the answer tokens (consistent with LLM SFT practice).
  • Training Strategy: All parameters of the pretrained encoders and LLM Decoder are frozen; only the small number of MuNG parameters are trained. When the LLM Decoder has not been pretrained on multimodal data (e.g., the pretraining stage of LLaVA-1.5), a small number of low-rank LoRA adapters can be additionally introduced for joint training.
  • \(m\) samples of \(\epsilon\) are drawn per training example to estimate the training loss.
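The answer-only masking can be sketched as follows (a minimal numpy example with toy shapes; in the actual method this loss would additionally be averaged over the \(m\) noise samples drawn per example):

```python
import numpy as np

def masked_nll(logits, targets, answer_mask):
    """Autoregressive NLL computed only over answer tokens.

    logits:      (T, vocab) next-token logits from the decoder
    targets:     (T,) gold next-token ids (question + answer concatenated)
    answer_mask: (T,) bool, True only at answer positions
    """
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    # Question tokens contribute nothing to the loss.
    return (nll * answer_mask).sum() / answer_mask.sum()

rng = np.random.default_rng(0)
T, V = 10, 50
logits = rng.standard_normal((T, V))
targets = rng.integers(0, V, size=T)
mask = np.array([0] * 6 + [1] * 4, dtype=bool)   # first 6 tokens are the question
loss = masked_nll(logits, targets, mask)
```

This is the standard SFT loss; MuNG changes only the visual features the decoder conditions on, not the objective itself.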

Key Experimental Results

Qwen2.5-VL-3B (Fine-tuned on MMPR-v1.1)

| Method   | Trainable Params | MME-P | MME-C | VQAv2 | GQA   | VisWiz | MM-Vet | POPE  | MMB   | SQA   | Avg   |
|----------|------------------|-------|-------|-------|-------|--------|--------|-------|-------|-------|-------|
| Base*    | -                | 1563  | 584   | 76.68 | 79.68 | 65.00  | 54.30  | 86.32 | 73.50 | 47.33 | 68.97 |
| Full-FT* | 100%             | 1555  | 587   | 76.51 | 80.68 | 66.20  | 43.26  | 85.50 | 71.93 | 52.00 | 68.01 |
| LoRA*    | 7.82%            | 1624  | 613   | 79.88 | 79.25 | 65.30  | 55.33  | 86.50 | 73.29 | 53.33 | 70.41 |
| DoRA*    | 7.99%            | 1567  | 639   | 78.77 | 79.25 | 65.40  | 54.06  | 86.37 | 73.41 | 48.67 | 69.42 |
| MuNG*    | 0.67%            | 1613  | 625   | 79.92 | 79.54 | 66.50  | 54.46  | 86.95 | 73.64 | 53.33 | 70.62 |

Qwen2.5-VL-7B (Fine-tuned on MMPR-v1.1)

| Method   | Trainable Params | MME-P | MME-C | Sum(%) | MM-Vet | POPE  | Avg   |
|----------|------------------|-------|-------|--------|--------|-------|-------|
| Base*    | -                | 1694  | 611   | 82.32  | 72.00  | 87.01 | 80.44 |
| Full-FT* | 100%             | 1693  | 631   | 82.98  | 69.00  | 87.28 | 79.75 |
| LoRA*    | 6.38%            | 1646  | 627   | 81.16  | 72.20  | 86.74 | 80.03 |
| MuNG*    | 1.83%            | 1717  | 610   | 83.11  | 71.00  | 87.41 | 80.51 |

LLaVA-1.5-7B (Fine-tuned on LLaVA-Instruct-150K)

| Method  | Trainable Params | SQA  | POPE | MM-Vet | Avg  |
|---------|------------------|------|------|--------|------|
| Full-FT | 100%             | 67.2 | 85.9 | 31.1   | 59.1 |
| LoRA    | 4.61%            | 68.3 | 86.4 | 30.2   | 59.0 |
| DoRA    | 4.63%            | 68.4 | 87.2 | 33.3   | 60.2 |
| MuNG    | 2.78%            | 70.0 | 86.9 | 32.4   | 59.3 |

Efficiency Comparison (Qwen2.5-VL-3B)

| Method  | Trainable Params | Training Time (relative) | TTFT (s) | TPOT (ms) | Avg   |
|---------|------------------|--------------------------|----------|-----------|-------|
| Full-FT | 100%             | 5.17×                    | 0.9      | 20.5      | 68.01 |
| LoRA    | 7.99%            | 2.42×                    | 2.5      | 21.4      | 70.41 |
| MuNG    | 0.67%            | 1.00×                    | 3.2      | 20.5      | 70.62 |

Ablation Study

  • Noise Generator Architecture: Cross-Attention + additive injection + noise sampling yields the optimal combination (Avg 71.89). The MLP architecture falls far behind (Avg ~42), and multiplicative injection is substantially inferior to additive injection.
  • Noise vs. Pure Cross-Attention: Using cross-attention for feature extraction alone (without noise sampling) yields Avg 70.49; adding beneficial noise sampling improves this to 71.89, demonstrating that the gain does not stem solely from the cross-attention structure but that the noise itself provides critical semantic guidance.
  • Noise vs. Random Gaussian Noise: Directly adding Gaussian noise achieves Avg 71.48, lower than beneficial noise at 71.89, indicating that the key lies not in randomness per se but in the semantic guidance carried by the noise.
  • LoRA Rank Ablation (LLaVA): LoRA rank=32 + MuNG (Avg 63.1) > LoRA rank=32 (Avg 61.9) > LoRA rank=128 (Avg 61.6), demonstrating strong complementarity between MuNG and low-rank LoRA.

Highlights & Insights

  • Theory-Driven Method Design: Starting from the π-noise theory, the training objective is rigorously derived through variational inference. This is not an ad hoc engineering trick but a principled method grounded in information theory.
  • Extreme Parameter Efficiency: MuNG surpasses Full-FT and LoRA/DoRA (which use 7%+ parameters) with only 0.67%–1.83% additional parameters, while also achieving the shortest training time (1× baseline).
  • Compelling Noise Visualizations: Visualizations clearly demonstrate that the noise generated by MuNG precisely suppresses semantic regions irrelevant to the query (e.g., when asked about the number of zebras, the noise suppresses giraffe features), validating the design principle of "reducing task entropy."
  • Plug-and-Play Framework: MuNG serves as a plug-in between the feature alignment layer and the LLM Decoder without modifying the main model architecture, and is orthogonal to methods such as LoRA and can be combined with them.
  • Cross-Model Generalization: Effectiveness is demonstrated across three different models: Qwen2.5-VL-3B/7B and LLaVA-1.5-7B.

Limitations & Future Work

  • Limited Effectiveness Without Multimodal Pretraining of the LLM Decoder: When the LLM Decoder has not been exposed to multimodal data (e.g., the pretrain-stage LLaVA model), MuNG alone suffers a substantial performance drop and must be combined with LoRA to recover effectiveness. This suggests that MuNG's efficacy relies on the LLM Decoder already possessing basic multimodal understanding.
  • Slight Increase in Inference TTFT: MuNG requires running the noise generator at inference time, increasing TTFT from 0.9 s (Full-FT) to 3.2 s (Qwen2.5-VL-3B), which may be a concern in latency-sensitive applications.
  • Validation Limited to VQA-Type Tasks: All experiments are conducted on visual question answering and understanding benchmarks; effectiveness on generative tasks (e.g., image captioning, visual grounding) has not been verified.
  • Requires Target Answers During Training: The noise generator uses answer information \(A\) during training; while this is not needed at inference time, it means training is restricted to supervised data and cannot be directly applied in unsupervised or self-supervised settings.
  • Noise Injected Only on the Visual Side: The current work explores noise injection only on visual features, without attempting injection on textual or cross-modal features.
  • Limited Improvement on LLaVA: On LLaVA-1.5-7B, MuNG's average score (59.3) offers no advantage over DoRA (60.2) and is slightly lower.
Comparison & Discussion

  1. vs. LoRA/DoRA: LoRA/DoRA modify internal parameters of the LLM Decoder via low-rank matrices, operating in the model parameter space; MuNG instead modifies the data fed to the LLM Decoder without changing model parameters. MuNG uses far fewer parameters than LoRA (0.67% vs. 7.82%), achieving comparable or slightly better performance on Qwen, but underperforming DoRA on LLaVA.
  2. vs. VPT (Visual Prompt Tuning): VPT appends learnable prompt tokens after the visual embedding layer and remains a unimodal optimization approach; MuNG generates noise using cross-modal information, constituting genuine multimodal co-optimization.
  3. vs. Adversarial Noise Injection: Adversarial perturbations (FGSM/PGD) are designed to cause model errors, whereas MuNG generates positive-incentive noise aimed at simplifying the task and guiding the model toward better responses. The two approaches are diametrically opposed in their design objectives and optimization directions.
  4. Noise as a New Perspective on Regularization: The conventional view holds that noise is harmful; this paper demonstrates that carefully designed noise can serve as a "task simplifier." This idea can be extended to other modalities (audio, point clouds) or tasks (detection, segmentation).
  5. Connection to Data Augmentation: The authors distinguish MuNG from adversarial noise but do not thoroughly discuss its relationship to feature-level augmentation methods (e.g., Dropout, Manifold Mixup). MuNG can be interpreted as a form of "semantically-aware feature augmentation."
  6. Potential Connection to Hallucination Suppression: By suppressing task-irrelevant semantics to reduce task entropy, MuNG's objective is closely aligned with hallucination suppression in MLLMs—hallucinations largely stem from the model attending to irrelevant visual regions. Applying MuNG specifically to hallucination suppression tasks merits exploration.

Rating

  • Novelty: ⭐⭐⭐⭐ Introducing π-noise theory into MLLM fine-tuning is a novel perspective, though noise injection and variational inference are not in themselves entirely new techniques.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Extensive experiments are conducted on two mainstream MLLMs with thorough ablation and visualization analyses; however, improvements on LLaVA are limited, and generative tasks are not evaluated.
  • Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear and experimental organization is sound, but the discussion of related feature augmentation methods in the Related Work section is insufficient.
  • Value: ⭐⭐⭐⭐ The work offers a new MLLM fine-tuning paradigm (modifying inputs rather than model parameters) with meaningful implications for the PEFT field, though practical deployment requires weighing the latency overhead introduced by increased TTFT.