# AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs
Conference: AAAI 2026
arXiv: 2601.02771
Code: https://github.com/ChangPtR/AbdMLLM
Area: Multimodal VLM
Keywords: visual abductive reasoning, MLLM, diffusion model, contrastive learning, pictorial thinking
## TL;DR
Inspired by the dual-mode human cognitive process of verbal abduction and pictorial imagination, this paper proposes AbductiveMLLM, which enhances visual abductive reasoning in MLLMs via two collaborative components — a Reasoner (causal contrastive learning for hypothesis selection) and an Imaginer (diffusion-model-based pictorial reasoning) — achieving state-of-the-art performance on the VAR and YouCookII benchmarks.
## Background & Motivation
Visual Abductive Reasoning (VAR) requires AI systems to infer the most plausible explanation from incomplete visual observations, representing a core capability of human cognition. The central challenges are:
- Insufficient abductive capacity in MLLMs: Despite strong performance on tasks such as VQA, MLLMs like GPT-4o exhibit a significant gap from humans in causal reasoning — GPT-4o-mini achieves only a CIDEr of 7.30 on VAR, far below the human score of 147.79.
- Limitations of existing methods: Traditional small models (REASONER, UPD-Trans) focus exclusively on verbal reasoning, overlooking the role of pictorial thinking in human cognition — the ability to mentally imagine plausible scenes, not merely reason in language.
- Core starting point: Simulating the synergy between verbal abduction and pictorial abduction observed in human cognition.
## Method
### Overall Architecture
AbductiveMLLM consists of two components trained jointly end-to-end:
1. Reasoner (verbal domain): A blind LLM generates candidate hypotheses → causal contrastive learning filters them → top-\(k\) hypotheses serve as prior guidance for MLLM inference.
2. Imaginer (visual domain): A Stable Diffusion-based model that uses the Reasoner's output embeddings and visual observations to generate "imagined" scenes, which in turn provide feedback to verbal reasoning.
Task definition: Given a video sequence \(\mathcal{V}=\{O_1,\dots,O_{t-1},H,O_t,\dots,O_{T-1}\}\), where \(H\) is an unobserved event, the goal is to infer the most plausible verbal explanation \(E_h\) for \(H\).
### Key Designs
Design 1: Causal-Aware Hypothesis Generation and Filtering (CHG)
Implemented in two steps:
Step 1 — Candidate Hypothesis Generation: A pretrained MLLM generates video captions \(\mathcal{C}=\{C_t\}_{t=1}^{T-1}\) for each observed event, after which GPT-4o-mini is prompted at high temperature (1.4) to produce \(L\) diverse candidate hypotheses \(\mathcal{Y}=\{Y_i\}_{i=1}^{L}\).
Step 2 — Causal Contrastive Filtering: The video sequence is partitioned into an initial segment \(\mathcal{I}\), a process segment \(\mathcal{P}\), and a final segment \(\mathcal{F}\). These are mapped into a joint causal space via a visual encoder \(\Phi_V\) and a text encoder \(\Phi_T\). Training employs the NT-Xent loss, pulling the causally valid pair \(\boldsymbol{X}_{\mathcal{I}}+\boldsymbol{X}_{\mathcal{P}}\) and \(\boldsymbol{X}_{\mathcal{F}}\) together while pushing apart pairs formed with hard-negative hypotheses (standard NT-Xent form, with temperature \(\tau\) and negative set \(\mathcal{N}\)): \(\mathcal{L}_{NT\text{-}Xent}=-\log\frac{\exp(\langle \boldsymbol{X}_{\mathcal{I}}+\boldsymbol{X}_{\mathcal{P}},\,\boldsymbol{X}_{\mathcal{F}}\rangle/\tau)}{\exp(\langle \boldsymbol{X}_{\mathcal{I}}+\boldsymbol{X}_{\mathcal{P}},\,\boldsymbol{X}_{\mathcal{F}}\rangle/\tau)+\sum_{Y^-\in\mathcal{N}}\exp(\langle \boldsymbol{X}_{\mathcal{I}}+\boldsymbol{X}_{Y^-},\,\boldsymbol{X}_{\mathcal{F}}\rangle/\tau)}\)
At inference, each candidate hypothesis is scored as \(\text{Score}(Y_i)=\langle \boldsymbol{X}_{\mathcal{I}}+\boldsymbol{X}_{Y_i}, \boldsymbol{X}_{\mathcal{F}}\rangle\), and the top-\(k\) (\(k=3\)) hypotheses are retained. The key distinction from standard contrastive learning is that positive samples are defined by causal validity rather than surface similarity — hypotheses that resemble the video content but lack causal grounding are excluded.
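The inference-time scoring and top-\(k\) selection can be sketched as follows (a minimal NumPy sketch; `score_hypotheses` and the toy embeddings are illustrative, not the paper's code):

```python
import numpy as np

def score_hypotheses(x_init, x_final, x_hyps, k=3):
    """Rank candidate hypotheses by causal compatibility.

    x_init  : (d,)   embedding of the initial segment, X_I
    x_final : (d,)   embedding of the final segment, X_F
    x_hyps  : (L, d) embeddings of the L candidate hypotheses, X_{Y_i}
    Returns the indices of the top-k hypotheses.
    """
    # Score(Y_i) = <X_I + X_{Y_i}, X_F>: a hypothesis scores high if,
    # added to the initial state, it points toward the observed outcome.
    scores = (x_init[None, :] + x_hyps) @ x_final
    return np.argsort(scores)[::-1][:k]

# Toy 2-D example: hypothesis 1 aligns with X_F, hypothesis 0 opposes it.
x_i = np.array([1.0, 0.0])
x_f = np.array([0.0, 1.0])
hyps = np.array([[0.0, -1.0], [0.0, 1.0], [1.0, 0.0]])
top = score_hypotheses(x_i, x_f, hyps, k=2)  # → indices [1, 2]
```

Note that the additive form \(\boldsymbol{X}_{\mathcal{I}}+\boldsymbol{X}_{Y_i}\) treats the hypothesis as a state change applied to the premise, which is what distinguishes this scoring from plain text–video similarity.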
Design 2: Pictorial Reasoning via the Imaginer Diffusion Model
Three lightweight adapters are introduced into the U-Net of Stable Diffusion:
- V-Adapter (visual cross-attention): Injects visual priors from the observed video using a local–global hybrid representation:
- Local representation: CLIP computes per-frame similarity \(\gamma^i\) to explanation \(E_h\); high-scoring frames are concatenated as \(\boldsymbol{c}_{local}\)
- Global representation: Weighted average \(\boldsymbol{c}_{global}=\sum_{i=1}^{N}\gamma^i \boldsymbol{c}_v^i\)
- Cross-attention: \(\text{V-Adapter}(\boldsymbol{Q},\boldsymbol{K}_v,\boldsymbol{V}_v)=\text{Softmax}(\frac{\boldsymbol{Q}\boldsymbol{K}_v^{\top}}{\sqrt{d_k}})\boldsymbol{V}_v\)
- T-Adapter (temporal convolution): Models inter-frame temporal dependencies using depthwise separable 3D convolutions: \(\text{T-Adapter}(\boldsymbol{x})=\boldsymbol{x}+\text{Conv3D}_{up}(\text{Conv3D}_{down}(\boldsymbol{x}))\)
- F-Adapter (FFN adapter): Enhances spatial representations in parallel with the FFN: \(\text{F-Adapter}(\boldsymbol{x})=\boldsymbol{x}+\text{FC}_{up}(\text{GELU}(\text{FC}_{down}(\boldsymbol{x})))\)
Design 3: Two-Stage End-to-End Training
- Stage I: Separate training — the MLLM is fine-tuned with LoRA (\(\mathcal{L}_{CE}\)); the Imaginer freezes SD weights and trains only the adapters (\(\mathcal{L}_{Diffusion}\)) with Min-SNR loss weighting.
- Stage II: Joint end-to-end fine-tuning — \(\mathcal{L}=\mathcal{L}_{CE}+\alpha\mathcal{L}_{Diffusion}\), with \(\alpha=5\).
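The Min-SNR weighting mentioned in Stage I can be sketched as below. This follows the standard Min-SNR-\(\gamma\) form for \(\epsilon\)-prediction; the paper's exact \(\gamma\) and schedule are assumptions here, and the linear-beta schedule is only an example.

```python
import numpy as np

def min_snr_weight(alphas_cumprod, t, gamma=5.0):
    """Min-SNR loss weighting for epsilon-prediction diffusion training.

    SNR(t) = alpha_bar_t / (1 - alpha_bar_t); the per-timestep loss weight
    is min(SNR, gamma) / SNR, which caps the contribution of low-noise
    timesteps that would otherwise dominate the objective."""
    snr = alphas_cumprod[t] / (1.0 - alphas_cumprod[t])
    return np.minimum(snr, gamma) / snr

# Example: cumulative alpha products from a simple linear-beta schedule.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)
w = min_snr_weight(alphas_cumprod, np.array([10, 500, 990]))
# Early (low-noise) timesteps get weights well below 1; very noisy
# timesteps, where SNR < gamma, keep weight exactly 1.
```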
### Loss & Training
The total loss is \(\mathcal{L}=\mathcal{L}_{CE}+\alpha\mathcal{L}_{Diffusion}\), with \(\alpha=5\) yielding the best performance. Stage I trains for 2 epochs (contrastive learning module trained for 10 epochs with 100 hard negatives per positive sample); Stage II joint fine-tuning runs for 1 epoch. Training uses 4× A800 80GB GPUs.
## Key Experimental Results
### Main Results
Results on the VAR test set:
| Method | BLEU@4 | METEOR | ROUGE | CIDEr | BERT-S |
|---|---|---|---|---|---|
| Human | 11.35 | 19.36 | 36.92 | 147.79 | 40.59 |
| REASONER | 3.44 | 9.05 | 22.89 | 30.75 | 30.64 |
| UPD-Trans | 5.40 | 11.16 | 25.62 | 41.66 | 30.80 |
| GPT-4o-mini | 0.63 | 7.38 | 13.64 | 7.30 | 12.27 |
| Qwen2VL-7B | 2.41 | 11.29 | 21.61 | 29.25 | 30.01 |
| Qwen2VL-7B (FT) | 5.67 | 12.77 | 27.11 | 50.82 | 36.03 |
| AbductiveMLLM | 6.54 | 13.41 | 27.95 | 57.04 | 36.80 |
Results on the YouCookII test set:
| Method | BLEU@4 | METEOR | ROUGE | CIDEr | BERT-S |
|---|---|---|---|---|---|
| REASONER | 3.54 | 9.47 | 24.62 | 32.99 | 23.19 |
| Qwen2VL-7B (FT) | 5.66 | 12.62 | 28.64 | 68.44 | 29.09 |
| AbductiveMLLM | 6.16 | 13.46 | 30.06 | 77.70 | 30.77 |
### Ablation Study
Core component ablation (VAR test set):
| CHG | Imaginer | BLEU@4 | METEOR | ROUGE | CIDEr | BERT-S |
|---|---|---|---|---|---|---|
| ✗ | ✗ | 5.67 | 12.77 | 27.11 | 50.82 | 36.03 |
| ✓ | ✗ | 6.33 | 12.96 | 27.21 | 53.60 | 36.31 |
| ✗ | ✓ | 6.35 | 13.07 | 27.52 | 55.00 | 36.40 |
| ✓ | ✓ | 6.54 | 13.41 | 27.95 | 57.04 | 36.80 |
Imaginer adapter ablation:
| Variant | CIDEr | BERT-S |
|---|---|---|
| Full model | 57.04 | 36.80 |
| w/o V-Adapter | 54.51 | 36.68 |
| w/o T-Adapter | 54.99 | 36.68 |
| w/o F-Adapter | 54.52 | 36.63 |
Top-\(k\) hypothesis count: \(k=3\) is optimal (CIDEr 57.04); performance drops to 53.66 at \(k=10\).
### Key Findings
- CHG and Imaginer independently contribute approximately +2.78 and +4.18 CIDEr, respectively; their combination yields +6.22.
- The Imaginer (pictorial reasoning) contributes more to semantic metrics (METEOR/ROUGE), indicating that visual imagination enriches language generation.
- Even the best model, AbductiveMLLM, remains far below human performance (57.04 vs. 147.79 CIDEr); the strongest MLLM baseline, Qwen2VL-7B (FT), reaches only 50.82.
- The model is robust to the \(\alpha\) coefficient across the range 1–9.
## Highlights & Insights
- This work is the first to incorporate pictorial thinking into visual abductive reasoning, simulating the dual-mode cognitive process observed in humans.
- Causal contrastive learning — rather than surface similarity matching — is central to effective hypothesis filtering, capturing the causal chain from premises to conclusions.
- The diffusion model serves as a reasoning guide rather than a high-fidelity image generator; the denoising loss in latent space drives the model toward visually plausible outcomes.
- The lightweight adapter design (V/T/F-Adapter) makes video-grounded reasoning on Stable Diffusion practically feasible.
## Limitations & Future Work
- A substantial gap to human performance remains (CIDEr 57.04 vs. 147.79), underscoring that abductive reasoning remains a major challenge for AI.
- The Imaginer is built on SD-v1-4 (256×256 resolution); upgrading to more powerful generative models may yield further improvements.
- Hypothesis generation relies on GPT-4o-mini, inheriting its knowledge and reasoning limitations.
- Evaluation is limited to two datasets; broader generalizability remains to be verified.
## Related Work & Insights
- vs. REASONER: REASONER is a conventional small model with a causal decoder; AbductiveMLLM employs an MLLM and a diffusion model to realize dual-mode verbal and pictorial reasoning, improving CIDEr from 30.75 to 57.04.
- vs. UPD-Trans: UPD-Trans introduces probabilistic distillation but remains confined to verbal reasoning; AbductiveMLLM's Imaginer supplements this with pictorial thinking, achieving comprehensive gains (+15.38 CIDEr).
- vs. KN-VLM: KN-VLM augments reasoning with an external knowledge base but retains a conventional architecture; AbductiveMLLM instead leverages the intrinsic knowledge of MLLMs alongside the imaginative capacity of generative models.
## Rating
- Novelty: ⭐⭐⭐⭐ First to introduce pictorial thinking into VAR; the dual-mode Reasoner+Imaginer design is genuinely innovative.
- Experimental Thoroughness: ⭐⭐⭐⭐ Two datasets, comprehensive ablations (components, hypothesis count, coefficient, adapters), with detailed analysis.
- Writing Quality: ⭐⭐⭐⭐ Motivation grounded in human cognition is clearly articulated; methodology is described in detail.
- Value: ⭐⭐⭐⭐ The paradigm of using diffusion models as reasoning guides rather than generators is broadly transferable, though the gap to human performance remains large.