Evian: Towards Explainable Visual Instruction-tuning Data Auditing

Conference: ACL 2026
arXiv: 2604.20544
Code: N/A
Area: Interpretability
Keywords: Data Auditing, Visual Instruction Tuning, Explainable Evaluation, Data Quality, Multimodal Large Language Models

TL;DR

This paper proposes a Decomposition-then-Evaluation paradigm and the EVIAN framework, which decomposes each response in visual instruction tuning data into three components (visual description, subjective reasoning, and factual claims) and evaluates them along three orthogonal dimensions: image-text consistency, logical coherence, and factual accuracy. Qwen2-VL-2B fine-tuned on the 10K subset selected by EVIAN outperforms the same model trained on the full 300K dataset.

Background & Motivation

Background: Large Vision-Language Models (LVLMs) rely on Visual Instruction Tuning (VIT) to align visual perception with language understanding, yet the quality of training data varies considerably.

Limitations of Prior Work: (1) Large-scale data synthesis (e.g., LLaVA-Instruct-150K) improves instruction following but introduces noise; (2) existing filtering methods (e.g., CLIP score) employ coarse-grained, single-dimensional scoring that cannot detect subtle semantic defects such as logical fallacies and factual errors; (3) the LLM-as-a-Judge paradigm suffers from bias, instability, and reasoning shortcuts.

Key Challenge: Existing data filtering compresses multiple error types into a single opaque score, making it impossible to distinguish between visual misrepresentation, factual inaccuracy, and reasoning defects.

Goal: To construct an explainable, fine-grained data auditing framework that decomposes responses into verifiable cognitive components for multi-dimensional evaluation.

Key Insight: Responses are treated as composite structures consisting of visual descriptions, subjective reasoning, and factual claims, rather than indivisible text blocks.

Core Idea: By decomposing the complex auditing task into verifiable sub-tasks targeting distinct cognitive components, data quality assessment can be made more precise than coarse-grained scoring, with logical coherence identified as the most critical factor in data quality.

Method

Overall Architecture

EVIAN operates in two phases. Phase 1 (Response Decomposition) decomposes responses into a labeled structured form and a pure visual summary via a three-step chain-of-thought (semantic annotation → visual distillation → fluent synthesis). Phase 2 (Multi-dimensional Evaluation) scores responses along three orthogonal dimensions—logical coherence \(S_L\), factual accuracy \(S_K\), and image-text consistency \(S_V\)—on a 1–5 scale, with the final score computed as \(S_{\text{overall}} = (S_L + S_K + S_V) / 3\).
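
The aggregation in Phase 2 can be sketched in a few lines. This is a minimal illustration of the averaging formula above, not the authors' code; the class name `AuditScores` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditScores:
    """Per-sample scores on the 1-5 scale (hypothetical container)."""
    logic: float   # S_L, logical coherence
    fact: float    # S_K, factual accuracy
    visual: float  # S_V, image-text consistency

    def __post_init__(self):
        # Reject values outside the 1-5 scale used by the auditor.
        for name in ("logic", "fact", "visual"):
            value = getattr(self, name)
            if not 1.0 <= value <= 5.0:
                raise ValueError(f"{name} score {value} is outside the 1-5 scale")

    @property
    def overall(self) -> float:
        # S_overall = (S_L + S_K + S_V) / 3
        return (self.logic + self.fact + self.visual) / 3


scores = AuditScores(logic=2.0, fact=5.0, visual=5.0)
print(round(scores.overall, 2))  # 4.0
```

Because the three scores are stored separately, a weak \(S_L\) stays visible even when the mean looks acceptable, as in the example values above.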

Key Designs

Three-Step Chain-of-Thought Decomposition:
- Function: Decomposes complex responses into independently verifiable cognitive components.
- Mechanism: Step 1 (Semantic Annotation) marks subjective reasoning with <INFER> tags and factual claims with <KNOW> tags, leaving unannotated content as pure visual description; Step 2 (Visual Distillation) removes or rewrites tagged content to retain only objective descriptions; Step 3 (Fluent Synthesis) organizes the fragmented distilled results into coherent paragraphs.
- Design Motivation: Decomposition enables each component to be evaluated independently along the most appropriate dimension, avoiding the ambiguity of mixed evaluation.
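
A minimal sketch of how an annotated response might be split into its three component types. It assumes the annotator emits paired tags of the form `<INFER>...</INFER>` and `<KNOW>...</KNOW>` (the closing-tag syntax is an assumption), and it simplifies Step 2 by dropping tagged spans instead of rewriting them. The function name `split_components` is hypothetical.

```python
import re

# Assumed tag syntax: <INFER>...</INFER> and <KNOW>...</KNOW>.
TAG_PATTERN = re.compile(r"<(INFER|KNOW)>(.*?)</\1>", re.DOTALL)


def split_components(annotated: str) -> dict:
    """Split an annotated response into inferences, facts, and visual text."""
    inferences, facts = [], []
    for tag, body in TAG_PATTERN.findall(annotated):
        (inferences if tag == "INFER" else facts).append(body.strip())
    # Simplified "visual distillation": remove tagged spans, keep the rest.
    visual = TAG_PATTERN.sub("", annotated)
    visual = re.sub(r"\s+", " ", visual).strip()
    return {"infer": inferences, "know": facts, "visual": visual}


annotated = (
    "A tall iron tower stands behind a wet street. "
    "<INFER>It has probably rained recently.</INFER> "
    "<KNOW>The Eiffel Tower is in Paris.</KNOW>"
)
parts = split_components(annotated)
print(parts["infer"])   # ['It has probably rained recently.']
print(parts["know"])    # ['The Eiffel Tower is in Paris.']
print(parts["visual"])  # A tall iron tower stands behind a wet street.
```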

Three-Dimensional Orthogonal Evaluation System:
- Function: Separately assesses logical reasoning, factual knowledge, and visual alignment quality.
- Mechanism: \(S_L\) evaluates the logical validity of reasoning within <INFER> tags (i.e., whether visual evidence supports the inference); \(S_K\) fact-checks the knowledge claims within <KNOW> tags; \(S_V\) measures the consistency between the pure visual summary and the image, prioritizing consistency over completeness.
- Design Motivation: Different types of defects require different evaluation criteria; orthogonal separation prevents cross-dimensional interference.

Controlled Defect Injection Benchmark:
- Function: Provides a systematic test platform with 300K samples.
- Mechanism: Fifteen semantic defect categories are designed (5 for visual consistency + 5 for logical coherence + 5 for factual accuracy), and subtle context-dependent defects are injected through a three-stage pipeline (content analysis → context-aware error selection → guided rewriting).
- Design Motivation: Existing datasets lack systematically injected, controllable errors, making it impossible to quantitatively evaluate the fine-grained detection capability of auditing pipelines.
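
One possible reading of the three-stage injection pipeline, sketched with stubs. The subtype identifiers are placeholders (only the count of five per dimension is given here), and `analyze` and `rewrite` stand in for model calls whose actual prompts and outputs are not described in this summary.

```python
import random
from typing import Callable, Optional, Set, Tuple

DIMENSIONS = ("visual_consistency", "logical_coherence", "factual_accuracy")
# Placeholder subtype ids: 5 per dimension, 15 in total.
TAXONOMY = {dim: [f"{dim}_type_{i}" for i in range(1, 6)] for dim in DIMENSIONS}


def inject_defect(
    response: str,
    analyze: Callable[[str], Set[str]],
    rewrite: Callable[[str, str], str],
    rng: random.Random,
) -> Tuple[str, Optional[str]]:
    """Content analysis -> context-aware error selection -> guided rewriting."""
    # Stage 1: decide which dimensions this response can plausibly support.
    applicable = analyze(response)
    # Stage 2: choose a defect only among subtypes of applicable dimensions.
    candidates = [d for dim in DIMENSIONS if dim in applicable for d in TAXONOMY[dim]]
    if not candidates:
        return response, None
    defect = rng.choice(candidates)
    # Stage 3: rewrite the response so that it contains the chosen defect.
    return rewrite(response, defect), defect


def stub_analyze(text: str) -> Set[str]:
    return {"logical_coherence"}  # hypothetical analysis result


def stub_rewrite(text: str, defect: str) -> str:
    return f"{text} [defect: {defect}]"  # hypothetical rewriting


corrupted, label = inject_defect(
    "A dog runs on grass.", stub_analyze, stub_rewrite, random.Random(0)
)
print(sum(len(v) for v in TAXONOMY.values()))  # 15
print(label.startswith("logical_coherence"))   # True
```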

Loss & Training

Qwen3-235B is used for response decomposition, and Qwen2.5-VL-7B serves as the automated auditor for scoring. Downstream validation fine-tunes Qwen2-VL-2B on the selected 10K subset. All experiments share the same architecture and SFT procedure.
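
How the 10K subset is drawn from the scored pool is not spelled out here. The sketch below assumes the simplest rule, keeping the k samples with the highest overall score; the function name `select_top_k` and the record layout are hypothetical.

```python
import heapq


def select_top_k(scored_samples, k):
    """Return the k samples with the highest overall score (assumed rule)."""
    return heapq.nlargest(k, scored_samples, key=lambda s: s["overall"])


# Toy pool with made-up scores, for illustration only.
pool = [
    {"id": "a", "overall": 4.7},
    {"id": "b", "overall": 3.0},
    {"id": "c", "overall": 4.3},
    {"id": "d", "overall": 2.3},
]
subset = select_top_k(pool, k=2)
print([s["id"] for s in subset])  # ['a', 'c']
```

In the setting described above, `k` would be 10,000 and the pool would hold the audited candidate samples.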

Key Experimental Results

Main Results (Fine-tuning Qwen2-VL-2B on 10K Subset)

Method	MME	MMBench	ScienceQA	A-OKVQA	POPE	Avg
Random	1475.76	0.5353	0.6614	0.7092	75.50	63.18
Full Data (300K)	1553.05	0.5953	0.6267	0.6934	78.17	63.77
SCALE (Prev. SOTA)	1814.97	0.6318	0.6916	0.7066	73.81	67.41
EVIAN (Ours)	1876.89	0.6463	0.7115	0.7493	79.87	70.20

Ablation Study

Configuration	Avg	Note
EVIAN (Full)	70.20	Full framework achieves best performance
w/o Decomposition	67.93	Removing decomposition causes a drop of 2.27
w/o \(S_L\) (Logical Coherence)	57.27	Largest drop when logical coherence is removed (↓12.93)
w/o \(S_K\) (Factual Accuracy)	64.21	Removing factual accuracy causes a drop of 5.99
Only \(S_V\) (Image-Text Consistency)	65.36	Visual consistency alone is acceptable but POPE drops sharply to 68.56
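
The drops quoted in the table can be recomputed from the Avg column. The short check below uses only the table's numbers; the drop for the "Only \(S_V\)" row is not stated in the table and is derived here by the same subtraction.

```python
FULL_AVG = 70.20
ablations = {
    "w/o Decomposition": 67.93,
    "w/o S_L": 57.27,
    "w/o S_K": 64.21,
    "Only S_V": 65.36,
}

# Drop relative to the full framework, rounded to two decimals.
drops = {name: round(FULL_AVG - avg, 2) for name, avg in ablations.items()}

for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: -{drop:.2f}")
# w/o S_L: -12.93
# w/o S_K: -5.99
# Only S_V: -4.84
# w/o Decomposition: -2.27
```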

Key Findings

- Logical coherence is the most critical dimension: Removing \(S_L\) causes Avg to collapse from 70.20 to 57.27, because relying solely on \(S_K\) and \(S_V\) selects samples that are factually correct but logically inconsistent, producing contradictory supervision signals.
- "Less is more": The 10K subset selected by EVIAN (3.3% of 300K) yields better training outcomes than the full 300K dataset.
- In the score distribution, 92.3% of original high-quality samples receive scores ≥ 3.0, while defective samples cluster around 3.0 (JSD = 0.35, AUC = 0.86).
- Cross-architecture validation on InternVL2-2B confirms that the gains stem from data quality rather than inductive bias alignment between the auditor and the target model.
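
For readers unfamiliar with the two separation metrics quoted above, here is a minimal pure-Python sketch. It assumes JSD is taken in base 2 (range 0 to 1) over discrete score distributions and that AUC is the probability that a clean sample outscores a defective one; the paper's exact binning and conventions may differ, and the toy scores are made up.

```python
import math


def jensen_shannon_divergence(p, q):
    """JSD in base 2 between two discrete distributions of equal length."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def auc(positive_scores, negative_scores):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))


# Toy values for illustration only.
clean = [4.7, 4.3, 5.0, 3.7]
defective = [3.0, 3.3, 2.7, 4.3]
print(auc(clean, defective))                               # 0.90625
print(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]))   # 1.0
print(jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0
```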

Highlights & Insights

- The central insight of the Decomposition-then-Evaluation paradigm is that splitting an audit into verifiable sub-tasks makes a complex audit reliable.
- The work challenges the prevailing assumption that more data is better, surpassing full-data training with only 3.3% of the data.
- The counter-intuitive finding that logical coherence, rather than visual alignment or factual accuracy, is the most critical data quality factor carries broad implications.
- The defect injection benchmark features a systematic taxonomy covering three major categories (consistency, reasoning, and knowledge), each with five error subtypes.

Limitations & Future Work

- The framework depends on large multimodal models for decomposition and evaluation, potentially inheriting their biases and blind spots.
- Errors introduced during the decomposition phase propagate to the subsequent evaluation stages, so robustness to decomposition errors still needs improvement.
- High computational cost (multiple invocations of large models) limits applicability to very large-scale datasets.
- Other data quality dimensions such as stylistic diversity and pedagogical value are not modeled.

Comparison with Related Work

- vs. SCALE: SCALE employs multi-stage filtering (modality quality, relevance, clarity, task rarity) but performs no component-level decomposition; EVIAN achieves more precise fine-grained auditing through cognitive component decomposition.
- vs. CLIPScore/BLIP: Similarity-based coarse-grained filtering cannot capture logical fallacies and factual errors.
- vs. LLM-as-a-Judge: Directly prompting models for holistic scores introduces bias and instability; EVIAN mitigates this through structured decomposition.

Rating

- Novelty: ⭐⭐⭐⭐ The Decomposition-then-Evaluation paradigm is novel, and the 15-category defect taxonomy is systematic.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Includes multi-baseline comparison, comprehensive ablation, cross-architecture validation, and a 300K-sample benchmark.
- Writing Quality: ⭐⭐⭐⭐ Well-structured, richly illustrated, and analytically thorough.
- Value: ⭐⭐⭐⭐ Offers important guidance for multimodal data curation; the finding that logical coherence should be prioritized has wide-ranging implications.