
UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

Conference: CVPR 2026
arXiv: 2603.05075
Code: Available (Project Page)
Area: Audio/Speech (Multimodal Benchmark)
Keywords: multimodal benchmark, any-to-any, interleaved multimodal, evaluation suite, agentic model

TL;DR

This paper proposes UniM, the first unified any-to-any interleaved multimodal benchmark (31K samples, 7 modalities, 30 domains), accompanied by a three-dimensional evaluation suite and an agentic baseline UniMA based on traceable evidence reasoning, revealing critical deficiencies of existing MLLMs under the interleaved multimodal paradigm.

Background & Motivation

1. State of the Field

Multimodal large language models (MLLMs) have rapidly evolved from early vision-language understanding to unified frameworks that simultaneously support understanding and generation (e.g., NExT-GPT, AnyGPT, MIO). Interleaved multimodal learning has become a core capability for next-generation systems.

2. Limitations of Prior Work

Existing interleaved multimodal benchmarks (MMIE, CoMM, ISG-Bench, OpenING, etc.) suffer from three critical shortcomings:

  • Narrow modality coverage: Limited to text and image only, unable to evaluate broader modality combinations such as audio, video, documents, code, and 3D.
  • Single-capability evaluation: Each data instance tests only a single capability, failing to reflect compound reasoning that interweaves multiple capabilities in real-world scenarios.
  • Insufficient domain diversity: Concentrated in general domains, neglecting professional scenarios such as natural science and social science.

3. Root Cause

Model capabilities have expanded to any-to-any multimodal conversion, yet a systematic evaluation benchmark to match this development is absent — existing benchmarks lag far behind in evaluation dimensions, modality coverage, and difficulty gradation.

4. Paper Goals

To construct a unified interleaved multimodal benchmark that simultaneously covers multiple modalities (7), multiple domains (30), multiple capabilities (multi-task per instance), and multiple difficulty levels (3 tiers), along with a compatible evaluation methodology and baseline model.

5. Starting Point

Starting from real-world data (public datasets, social media, knowledge bases such as Wikipedia and YouTube), the paper constructs a large-scale interleaved multimodal dataset in open-ended QA format, where both inputs and outputs are interleaved sequences of arbitrary modalities.

6. Core Idea

Three main contributions: (1) UniM dataset — the first unified any-to-any interleaved multimodal benchmark; (2) UniM evaluation suite — a three-dimensional assessment covering semantic correctness, structural integrity, and interleaved coherence; (3) UniMA — an agentic baseline model based on traceable evidence reasoning.

Method

Overall Architecture

UniM adopts an open-ended QA format in which inputs and outputs are interleaved sequences of arbitrary modality combinations, with non-textual content represented by placeholder tags (e.g., <<image1>>, <<video2>>). The dataset comprises 31,026 high-quality instances spanning 7 modalities (text, image, audio, video, document, code, 3D) and 30 domains (grouped into three major categories: natural science, social science, and general domain), partitioned by rule-based criteria into three difficulty levels: Easy, Medium, and Hard.
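To make the format concrete, here is a minimal sketch of what a single instance could look like. The field names and file references are hypothetical illustrations for this summary, not the paper's actual schema.

```python
# Hypothetical UniM-style instance; field names and asset paths are illustrative only.
sample = {
    "id": "ns-chem-00042",
    "domain": "Natural Science",
    "difficulty": "Medium",
    # Interleaved input: text with placeholder tags pointing at non-textual assets.
    "question": (
        "Given the spectrum in <<image1>> and the narrated procedure in <<audio1>>, "
        "identify the compound and illustrate its structure."
    ),
    "input_assets": {"<<image1>>": "spectrum.png", "<<audio1>>": "procedure.wav"},
    # Interleaved reference answer: the expected output also mixes text and placeholders.
    "reference_answer": (
        "The compound is ethanol; its structure is shown in <<image2>> "
        "and the key reasoning steps are summarized in <<document1>>."
    ),
    "reference_assets": {"<<image2>>": "ethanol.png", "<<document1>>": "steps.pdf"},
}
```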

Key Designs

1. Three-Dimensional Evaluation Suite

Traditional metrics (e.g., accuracy) are inadequate for open-ended multimodal generation. The paper designs three complementary evaluation dimensions:

Semantic Quality and Correctness Score (SQCS):

  • Function: Evaluates semantic alignment and perceptual quality of generated content.
  • Mechanism: All modality outputs are converted to caption-like text representations; an LLM-as-Judge scores semantic correctness (SC), and a modality-specific reference-free quality assessment yields a generation quality score (GQ).
  • Formula: \(\text{SQCS} = \text{SC} \cdot (\eta^{\text{SQCS}} + (1 - \eta^{\text{SQCS}}) \cdot \text{GQ})\), where \(\eta^{\text{SQCS}} = 0.7\).

Response Structure Integrity (StS/LeS):

  • Function: Evaluates whether the model adheres to the modality type and quantity requirements defined by the task.
  • Mechanism: StS (Strict Structure Score) requires an exact match in both modality type and placeholder count; LeS (Lenient Structure Score) only requires consistent modality-type coverage.
  • Design Motivation: Decouples structural compliance from semantic correctness to independently measure instruction-following ability.

Interleaved Coherence Score (ICS):

  • Function: Evaluates cross-modal logical coherence and stylistic consistency.
  • Mechanism: \(\text{ICS} = \eta^{\text{ICS}} \cdot \text{HC} + (1 - \eta^{\text{ICS}}) \cdot \text{SH}\), where HC measures cross-modal semantic-structural consistency, SH measures writing-style/visual-aesthetic consistency, and \(\eta^{\text{ICS}} = 0.8\).
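As a rough illustration of how these three dimensions could be computed, here is a minimal sketch assuming the component scores (SC, GQ, HC, SH) are already available in [0, 1] (the paper's tables report them as percentages); the placeholder parsing for StS/LeS is a simplified stand-in for the actual evaluation protocol.

```python
import re
from collections import Counter

ETA_SQCS, ETA_ICS = 0.7, 0.8  # weights reported in the paper

def sqcs(sc: float, gq: float) -> float:
    """SQCS = SC * (eta + (1 - eta) * GQ), with scores in [0, 1]."""
    return sc * (ETA_SQCS + (1 - ETA_SQCS) * gq)

def ics(hc: float, sh: float) -> float:
    """ICS = eta * HC + (1 - eta) * SH."""
    return ETA_ICS * hc + (1 - ETA_ICS) * sh

def modality_counts(text: str) -> Counter:
    """Count placeholder tags per modality, e.g. '<<image1>>' -> 'image'."""
    return Counter(re.findall(r"<<([a-z]+)\d+>>", text))

def structure_scores(pred: str, ref: str) -> tuple[float, float]:
    """StS: exact match of modality types and counts; LeS: matching modality-type coverage."""
    p, r = modality_counts(pred), modality_counts(ref)
    return float(p == r), float(set(p) == set(r))
```

As a sanity check against Table 1, the Natural Science AnyGPT row gives sqcs(0.137, 0.379) ≈ 0.111, matching the reported SQCS_abs of 11.1.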

2. Supporting Rate Correction

  • Function: Distinguishes between a model's absolute and relative capabilities.
  • Mechanism: A supporting rate \(\tau\) is introduced as a conditional correction, \(\mathcal{X}^{rel} = \tau \cdot \mathcal{X}^{abs}\), mitigating evaluation bias caused by models not supporting certain modalities.
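A tiny sketch of how this correction could be applied is given below; the exact definition of \(\tau\) (e.g., whether it is computed per instance or per model) is not spelled out in this summary, so the fraction-of-supported-modalities reading is an assumption.

```python
def supporting_rate(required: set[str], supported: set[str]) -> float:
    """Assumed definition: fraction of required output modalities the model can produce."""
    return len(required & supported) / len(required) if required else 1.0

def relative_score(abs_score: float, tau: float) -> float:
    """X_rel = tau * X_abs: discount the absolute score by the supporting rate."""
    return tau * abs_score

# Example: a model supporting only text and image on a task that also requires audio.
tau = supporting_rate({"text", "image", "audio"}, {"text", "image"})  # 2/3
print(relative_score(0.60, tau))  # 0.40
```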

3. UniMA Agentic Baseline Model

Receiving Module: Converts non-textual modalities into task-conditioned dense captions (TCDC), forming a unified text space.

Traceable Evidence Reasoning Module (TER): The core reasoning engine, operating through a four-step Structured Evidence Reasoning Chain (SERC):

  • Step 1: Generate the TCDC and a rewritten question → improve semantic correctness.
  • Step 2: Determine whether data analysis is involved → invoke a code interpreter to generate a data report.
  • Step 3: Organize modal content, textual content, and tool lists → improve SQCS, ICS, and StS/LeS respectively.
  • Step 4: Integrate all evidence to generate a final report draft.

Key mechanisms: a Checker detects factual and logical errors in the report; a Judger performs backtracking and corrective reasoning; a reliable traceable reasoning process is achieved through an iterative "generate → check → backtrack → regenerate" cycle.

Generating Module: Produces interleaved multimodal output based on the verified final report.
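Putting the three modules together, here is a hedged sketch of the overall loop; the callable interfaces (caption_fn, reason_fn, checker_fn, generator_fn) and the fixed round budget are hypothetical stand-ins for UniMA's actual components.

```python
from typing import Callable

def unima_pipeline(
    question: str,
    assets: dict[str, str],
    caption_fn: Callable[[str, dict], str],     # Receiving Module: assets -> task-conditioned dense captions
    reason_fn: Callable[[str, str, str], str],  # TER/SERC: (question, captions, feedback) -> report draft
    checker_fn: Callable[[str], list[str]],     # Checker: report -> list of factual/logical issues
    generator_fn: Callable[[str], dict],        # Generating Module: verified report -> interleaved output
    max_rounds: int = 3,
) -> dict:
    captions = caption_fn(question, assets)            # unify non-text modalities in the text space (TCDC)
    feedback = ""
    report = reason_fn(question, captions, feedback)   # initial generation
    for _ in range(max_rounds):                        # generate -> check -> backtrack -> regenerate
        issues = checker_fn(report)
        if not issues:                                 # Checker finds no factual/logical errors
            break
        feedback = "; ".join(issues)                   # Judger-style corrective signal for the next pass
        report = reason_fn(question, captions, feedback)
    return generator_fn(report)                        # produce the final interleaved multimodal answer
```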

Loss & Training

UniMA is constructed as an agentic pipeline rather than an end-to-end trained model, integrating dedicated multimodal encoders/decoders. Its core relies on the structured reasoning process of TER rather than gradient optimization. In the evaluation suite, \(\eta^{\text{SQCS}} = 0.7\) and \(\eta^{\text{ICS}} = 0.8\) are determined by optimal alignment with human evaluation.

Key Experimental Results

Main Results

Table 1: Semantic Quality and Correctness Score (SQCS) and Supporting Rate

| Domain | Model | SC | GQ | SQCS_abs | τ | SQCS_rel |
|---|---|---|---|---|---|---|
| Natural Science | AnyGPT | 13.7 | 37.9 | 11.1 | 90.4 | 10.7 |
| Natural Science | NExT-GPT | 8.4 | 23.4 | 6.2 | 62.0 | 2.9 |
| Natural Science | MIO | 19.7 | 29.1 | 15.9 | 59.2 | 10.0 |
| Natural Science | UniMA | 59.8 | 79.7 | 57.3 | 100 | 57.3 |
| Social Science | AnyGPT | 18.0 | 23.8 | 15.5 | 94.7 | 14.7 |
| Social Science | NExT-GPT | 16.8 | 31.9 | 13.3 | 89.0 | 10.8 |
| Social Science | MIO | 25.2 | 32.8 | 21.4 | 80.8 | 16.1 |
| Social Science | UniMA | 76.2 | 81.0 | 72.7 | 100 | 72.7 |
| General Domain | UniMA | 64.7 | 83.6 | 62.2 | 100 | 62.2 |

Table 2: Interleaved Coherence Score (ICS)

| Domain | Model | HC | SH | ICS_abs | ICS_rel |
|---|---|---|---|---|---|
| Natural Science | AnyGPT | 39.9 | 46.3 | 41.8 | 38.5 |
| Natural Science | NExT-GPT | 23.5 | 26.1 | 24.9 | 16.3 |
| Natural Science | MIO | 49.4 | 63.7 | 52.1 | 31.8 |
| Natural Science | UniMA | 68.4 | 71.9 | 69.1 | 69.1 |
| Social Science | AnyGPT | 31.3 | 35.3 | 32.1 | 29.2 |
| Social Science | MIO | 46.3 | 55.0 | 51.6 | 42.0 |
| Social Science | UniMA | 73.1 | 76.5 | 73.8 | 73.8 |
| General Domain | MIO | 68.3 | 77.7 | 60.0 | 45.7 |
| General Domain | UniMA | 68.7 | 74.3 | 69.8 | 69.8 |

Ablation Study

Table 3: UniMA Ablation Study

| Configuration | SQCS | ICS | StS | LeS |
|---|---|---|---|---|
| UniMA (Full) | 85.1 | 63.4 | 52.7 | 82.6 |
| w/o TER | 72.9 (−12.2) | 56.6 (−6.8) | 16.4 (−36.3) | 21.8 (−60.8) |
| w/o TCDC | 78.4 (−6.7) | 57.7 (−5.7) | 46.2 (−6.5) | 82.1 (−0.5) |
| w/o Verification | 72.9 (−12.2) | 54.7 (−8.7) | 38.3 (−14.4) | 66.8 (−15.8) |

Key findings: removing TER causes the largest drops in StS/LeS (−36.3/−60.8), demonstrating that traceable reasoning is critical for structural integrity; removing the verification submodule leads to across-the-board degradation, indicating that the check–backtrack–regenerate mechanism is indispensable for reliable output.

Key Findings

  1. Existing models perform extremely poorly: baseline SQCS is mostly below 20%, and the StS/LeS of NExT-GPT and MIO is mostly below 5%, demonstrating that existing MLLMs fall far short of the requirements of interleaved multimodal learning.
  2. Supporting rate severely constrains relative performance: AnyGPT general domain StS drops from 12.5% to 9.8% (rel); MIO natural science SQCS drops from 15.9% to 10.0% (rel) — incomplete modality support is a core bottleneck.
  3. Significant domain variation: Social science achieves the highest SQCS (common concepts + descriptive reasoning); general domain achieves the highest ICS (open-domain data better matches training distribution); natural science performs worst (requires precise terminology and structured logic).
  4. UniMA leads by a large margin: its StS/LeS is 2–6× higher than AnyGPT's and 15–40× higher than those of NExT-GPT and MIO.
  5. Difficulty sensitivity: Only UniMA exhibits a performance gradient consistent with task difficulty; baseline models already fail on the simplest tasks and cannot differentiate task complexity.

Highlights & Insights

  • High value in problem formulation: The paper is the first to systematically define "any-to-any interleaved multimodal learning" and provide a complete evaluation framework, filling the assessment gap across 7 modalities, 30 domains, and multiple difficulty levels.
  • Elegant evaluation suite design: The three dimensions of SQCS/StS-LeS/ICS decouple semantics, structure, and coherence; Pearson correlation with human evaluation reaches 0.974/0.960.
  • Supporting rate correction (\(\tau\)) fairly handles incomplete modality support across models, accounting for both absolute and relative capabilities.
  • TER module design: Traceable evidence combined with check–backtrack mechanisms effectively improves structured output quality within an agentic framework.

Limitations & Future Work

  • UniMA is fundamentally an agentic pipeline (multi-module integration) rather than an end-to-end unified model; its advantages partly stem from engineering integration rather than breakthroughs in model capability.
  • Among the 7 modalities, code (2.6%) and 3D (1.4%) are severely underrepresented, raising questions about the representativeness of evaluation for these modality types.
  • The evaluation relies heavily on LLM-as-Judge, introducing biases inherent to the evaluator model itself.
  • The absence of a human performance baseline makes it difficult to assess what UniMA's ~60% SQCS represents in absolute terms.
  • A portion of the data expansion uses GPT-5-mini to generate candidate instances, potentially introducing synthetic data bias.
  • Comparison with MMIE/CoMM: UniM expands modalities from 2 to 7, domains from ~10 to 30, and interleaved combinations from 3–4 to 41, representing an order-of-magnitude leap.
  • Relationship with NExT-GPT/AnyGPT: These models serve as evaluated baselines; experiments expose their critical limitations in interleaved settings.
  • Inspiration from TER: The traceable evidence reasoning chain yields significant gains on complex multimodal tasks; the iterative "generate → check → backtrack → regenerate" paradigm is worth adopting.
  • Evaluation methodology inspiration: Decomposing multimodal evaluation into three orthogonal dimensions — semantics, structure, and coherence — is more informative than a single metric.

Rating

⭐⭐⭐⭐ An important benchmark contribution that systematically defines and evaluates any-to-any interleaved multimodal learning for the first time, with large data scale and thoughtful evaluation design. However, the UniMA baseline is engineering-heavy, and the core contributions lie in the dataset and evaluation methodology rather than model innovation.