ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking¶
Conference: NeurIPS 2025 arXiv: 2511.09833 Code: None Area: Data Annotation / MLLM Applications Keywords: data annotation, critical thinking, MLLM, error estimation, human-in-the-loop
TL;DR¶
This paper proposes ACT (Annotation with Critical Thinking), a data pipeline in which an MLLM annotates all samples in bulk, a second MLLM acting as a critic estimates the error probability of each annotation, and only high-suspicion samples are routed to human reviewers. Combined with a theoretically derived ACT loss function, the approach achieves a 70–90% reduction in human annotation cost across six cross-modal datasets while keeping the downstream performance gap under 2% on most datasets.
Background & Motivation¶
Background: Supervised learning depends on high-quality labeled data, yet manual annotation is expensive and difficult to scale. Automatic annotation via LLMs/MLLMs is cheap but still lags behind human quality.
Limitations of Prior Work: (1) Pure MLLM annotation accuracy falls 5–20% short of human annotation, causing notable downstream performance degradation; (2) existing methods such as CDI require training an additional XGBoost detector and generalize poorly; (3) some approaches are restricted to white-box models and cannot leverage powerful black-box models such as GPT-4o; (4) the normalized sampling rule used in existing active M-estimation collapses under low-budget conditions.
Key Challenge: How can the annotation capability of MLLMs be maximally exploited under a limited human annotation budget while maintaining data quality close to fully human-annotated data?
Key Insight: MLLMs are assigned dual roles as both annotator and critic — first generating labels, then performing self- or cross-critique — so that human effort is precisely allocated to the most suspicious samples.
Method¶
Overall Architecture¶
ACT is a three-stage, training-free pipeline: (1) Annotation stage: MLLM \(f^{(m)}\) generates labels \(\hat{y}_i^{(m)}\) for all \(N\) samples; (2) Error estimation stage: a separate MLLM \(g\) acts as a critic and estimates the error probability \(\hat{\epsilon}_i = g(\mathbf{x}_i, \hat{y}_i^{(m)})\) for each annotation; (3) Correction stage: budget-aware sampling \(\delta_i(B) \sim \mathbb{B}(\pi_B(\hat{\epsilon}_i))\) selects samples for human review subject to \(\sum \delta_i(B) \leq B\). Downstream training uses a specially designed ACT loss function.
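The three stages above can be sketched as follows. This is a minimal illustration, not the paper's implementation (no code is released): `annotate`, `criticize`, and `human_review` are hypothetical stand-ins for the two MLLM calls and the human loop, and the thresholded sampling rule is used for the correction stage.

```python
import random

def act_pipeline(samples, annotate, criticize, human_review, budget, tau=0.5):
    """Sketch of the ACT pipeline: annotate -> estimate error -> correct.

    annotate(x)     -> machine label y_hat       (stage 1, annotator MLLM f)
    criticize(x, y) -> error probability eps_hat (stage 2, critic MLLM g)
    human_review(x) -> true label                (stage 3, human reviewer)
    """
    labels = [annotate(x) for x in samples]                   # stage 1
    eps = [criticize(x, y) for x, y in zip(samples, labels)]  # stage 2
    reviewed = 0
    for i, e in enumerate(eps):                               # stage 3
        pi = 1.0 if e >= tau else 0.0   # thresholded rule pi_B(eps_i)
        # delta_i ~ Bernoulli(pi), subject to sum(delta_i) <= B
        if reviewed < budget and random.random() < pi:
            labels[i] = human_review(samples[i])
            reviewed += 1
    return labels, eps
```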
Key Designs¶
- MLLM Criticizer Strategy Family:
- Function: Seven criticizer strategies spanning black-box and white-box settings are designed to enable MLLMs to estimate annotation error probabilities.
- Mechanism: Black-box strategies include Naïve direct estimation, CoT-based estimation, multiple-choice grading (MC), and Devil's Advocate (reviewing the annotator's CoT before judging); white-box strategies use logit probabilities \(\hat{\epsilon} = \mathbb{P}(\text{"yes"}) / (\mathbb{P}(\text{"yes"}) + \mathbb{P}(\text{"no"}))\) or CoT perplexity (PPL) as indirect error signals. Experiments show that CoT-based criticism achieves the highest ABS improvement of up to 22.46%, and cross-criticism (using a different model for annotation and criticism) generally outperforms self-criticism.
- Design Motivation: Different task–model combinations favor different strategies; systematic exploration provides actionable deployment guidance. The training-free design allows the pipeline to be applied directly with any MLLM.
- Budget-Aware Sampling:
- Function: Determines which samples are routed to human reviewers under a fixed budget \(B\).
- Mechanism: Three sampling rules are proposed — normalized \(\pi_B(\hat{\epsilon}_i) = B \cdot \hat{\epsilon}_i / \sum \hat{\epsilon}_i\), exponential-weighted \(\pi_B(\hat{\epsilon}_i) = 1/(1 + e^{-\beta(\hat{\epsilon}_i - \alpha)})\), and thresholded \(\pi_B(\hat{\epsilon}_i) = \mathbf{1}(\hat{\epsilon}_i \geq \tau)\). Theorem 5.2 proves that the parameter gap between the ACT loss and the true loss is upper-bounded by a quantity depending on \(q\) (the lower bound of the transformed error probability for selected samples); exponential-weighted and thresholded rules push \(q\) toward 1, whereas the normalized rule yields \(q \to 0\) under low budgets, causing collapse.
- Design Motivation: The normalized sampling rule used in prior work produces highly unstable loss behavior under limited human budgets — yielding a 76.34% gap from full supervision on the Cars dataset — whereas exponential-weighted and thresholded rules remain stable at 1.69%.
- ACT Loss Function:
- Function: A theoretically grounded loss function is designed so that models trained on ACT-annotated data approach the performance of models trained on fully human-annotated data.
- Mechanism: \(\mathcal{L}_\theta^{(ACT)} = \frac{1}{N}\sum_{i=1}^{N}\left(\ell_{\theta,i}^{(m)} + (\ell_{\theta,i} - \ell_{\theta,i}^{(m)}) \frac{\delta_i(B)}{\pi_B(\hat{\epsilon}_i)}\right)\), where \(\ell_{\theta,i}^{(m)}\) is the machine-annotation loss and \(\ell_{\theta,i}\) is the true-label loss estimated from human annotations. Proposition 5.1 proves that the ACT loss is an unbiased estimator of the true loss, with variance minimized in two cases: a perfect annotator or a perfectly accurate critic.
- Design Motivation: Naively mixing human and machine annotations introduces label noise, while using only human-annotated samples wastes the already-labeled machine-annotated data. The ACT loss achieves unbiased estimation via importance weighting, and the exponential-weighted/thresholded rules prevent weight explosion.
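The three sampling rules can be written out directly from their formulas. A minimal sketch, assuming `eps` holds the critic's error estimates; the function names and default hyperparameter values are illustrative, not from the paper.

```python
import math

def normalized_rule(eps, budget):
    """pi_i = B * eps_i / sum(eps), clipped to 1. Under low budgets pi -> 0,
    so the importance weights 1/pi explode -- the collapse mode the paper reports."""
    s = sum(eps)
    return [min(1.0, budget * e / s) for e in eps]

def exp_weighted_rule(eps, alpha=0.5, beta=10.0):
    """pi_i = 1 / (1 + exp(-beta * (eps_i - alpha))): pushes pi toward 1 for
    suspicious samples, keeping the importance weights bounded."""
    return [1.0 / (1.0 + math.exp(-beta * (e - alpha))) for e in eps]

def thresholded_rule(eps, tau=0.5):
    """pi_i = 1 if eps_i >= tau else 0; the rule recommended in practice
    (a single knob tau instead of alpha and beta)."""
    return [1.0 if e >= tau else 0.0 for e in eps]
```

With the thresholded rule the lower bound \(q\) of the selected samples' sampling probability is exactly 1, which is what keeps the bound in Theorem 5.2 tight.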
Loss & Training¶
The ACT loss is an improvement over active M-estimation. Its core mechanism uses the sampling probability \(\pi_B(\hat{\epsilon}_i)\) for importance-weighted correction: samples selected for review use their true loss \(\ell_{\theta,i}\) computed from human labels, while unselected samples use the machine-annotation loss \(\ell_{\theta,i}^{(m)}\). The thresholded rule is recommended in practice: only the threshold \(\tau\) needs to be set, which is simpler than tuning the two hyperparameters \(\alpha, \beta\) of the exponential-weighted rule. Downstream tasks are trained with standard cross-entropy loss and tuned hyperparameters.
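The loss computation itself is a few lines. A minimal sketch, assuming per-sample losses are already computed (e.g. per-sample cross-entropy); the function name and argument layout are illustrative, not from the paper.

```python
def act_loss(machine_loss, human_loss, delta, pi):
    """ACT loss: (1/N) * sum_i [ l_m_i + (l_i - l_m_i) * delta_i / pi_i ].

    machine_loss[i] : loss under the machine label (available for all samples)
    human_loss[i]   : loss under the human label (used only when delta[i] == 1)
    delta[i]        : 1 if sample i was routed to a human reviewer, else 0
    pi[i]           : sampling probability for sample i (> 0 whenever delta[i] == 1)

    Since E[delta_i] = pi_i, E[delta_i / pi_i] = 1, so in expectation this
    equals the true loss (1/N) * sum_i l_i -- the unbiasedness of Prop. 5.1.
    """
    n = len(machine_loss)
    total = 0.0
    for lm, lh, d, p in zip(machine_loss, human_loss, delta, pi):
        total += lm + (lh - lm) * (d / p if d else 0.0)
    return total / n
```

With the thresholded rule, \(\pi \in \{0, 1\}\), so selected samples contribute exactly their human-label loss and unselected samples their machine-label loss, with no importance weight larger than 1.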
Key Experimental Results¶
Main Results: Downstream Task Test Accuracy (%)¶
| Training Data – Loss | CIFAR-10 | Fashion | Cars | Emotion | Irony | VQA-RAD |
|---|---|---|---|---|---|---|
| Full Human – CE | 88.66±0.97 | 93.01±0.63 | 87.88±0.36 | 81.82±0.57 | 70.18±3.23 | 67.81±1.47 |
| Full Machine – CE | 81.55±1.93 | 82.86±0.84 | 83.68±0.17 | 78.96±2.40 | 60.71±5.43 | 61.03±2.05 |
| ACT – Normalized Loss | 64.70±5.46 | 69.27±7.25 | 11.54±0.96 | 79.87±0.88 | 65.66±2.00 | 62.55±3.01 |
| ACT – Exp-Weighted Loss | 87.73±0.36 | 89.73±0.35 | 86.19±0.14 | 81.44±0.51 | 68.49±3.20 | 67.73±1.33 |
| ACT – Thresholded Loss | 87.95±0.35 | 89.16±0.89 | 86.00±0.26 | 81.41±0.64 | 68.21±1.94 | 67.02±1.32 |
| Human–ACT Gap | 0.71% | 3.28% | 1.69% | 0.38% | 1.69% | 0.08% |
| Human Budget Ratio | 11.52% | 21.81% | 9.56% | 17.98% | 33.79% | 30.15% |
Ablation Study: Criticizer Strategy ABS (%) Comparison (GPT-4o annotation + CoT)¶
| Critic Model | Naïve | CoT | MC | Devil |
|---|---|---|---|---|
| GPT-4o (self-criticism) | 41.2 | 53.8 | 48.6 | 50.1 |
| Gemini-1.5-Pro | 45.3 | 56.2 | 51.4 | 52.7 |
| Claude 3.5 Sonnet | 43.7 | 54.9 | 52.1 | 55.3 |
| InternVL 2.5 | 38.5 | 44.1 | 40.3 | 42.8 |
Key Findings¶
- Seven core insights are identified, among them: GPT-4o is the best general-purpose annotator; CoT benefits criticism more than annotation (ABS improvement up to 22.46%); cross-criticism outperforms self-criticism; black-box models are stronger critics; and annotation capability and criticizing capability are positively correlated.
- Normalized sampling collapses completely on Cars (11.54%), while exponential-weighted and thresholded sampling remain robust (86%+).
- White-box strategies (logit/PPL) outperform black-box strategies on 2 of 6 datasets, but results are inconsistent.
Highlights & Insights¶
- The annotate–criticize–correct three-stage pipeline is elegantly designed and entirely training-free, allowing plug-and-play use with any MLLM.
- Seven systematic insights provide actionable best-practice guidance for real-world deployment.
- The ACT loss function offers theoretical guarantees (unbiased estimation + variance control), and the exponential-weighted/thresholded rules substantially outperform the normalized rule used in prior work.
- The study conducts systematic exploration across three domains (NLP, CV, VQA), six datasets, and six MLLMs with seven criticizer strategies and three sampling rules, constituting a highly comprehensive experimental design.
- The finding that annotation capability and criticizing capability are positively correlated simplifies model selection: use the top-1 model as the annotator and the top-2 model as the critic.
Limitations & Future Work¶
- Validation is limited to classification tasks; generative tasks such as text summarization and open-ended QA are not covered.
- Critic accuracy is bounded by the MLLM's capability ceiling, and a 5–15% false positive rate limits the maximum achievable performance.
- Budget setting is based on annotator accuracy (an "ideal budget"), and practical budget allocation strategies are not discussed in depth.
- Performance on non-English scenarios such as Chinese or low-resource languages is not validated.
Related Work & Insights¶
- vs. CDI: CDI requires training an XGBoost detector and uses normalized sampling (which collapses under low budgets), whereas ACT is fully training-free and employs stable thresholded sampling.
- vs. LLM-as-a-Judge: The ACT criticizer design is closely related to LLM self-evaluation; the finding that cross-criticism outperforms self-criticism echoes the self-evaluation bias literature.
- vs. Active Learning: Traditional active learning requires model retraining within the annotation loop, whereas the ACT pipeline requires no training at any stage.
- Insights: The budget-aware sampling paradigm is generalizable to any human–AI collaboration scenario; the positive correlation between annotator and critic capability simplifies pipeline configuration.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combined design of criticizer + budget-aware sampling + ACT loss is practical and novel, though the core idea of LLM mutual evaluation is not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Six datasets, six MLLMs, seven criticizer strategies, three sampling rules, and complete ablations constitute an exceptionally systematic study.
- Writing Quality: ⭐⭐⭐⭐ The seven insights are clearly summarized and theoretical analysis is tightly integrated with empirical results.
- Value: ⭐⭐⭐⭐⭐ The work has direct practical value for reducing AI data annotation costs and provides strong guidance for practitioners.