HoneyBee: Data Recipes for Vision-Language Reasoners¶
- Conference: CVPR 2026
- arXiv: 2510.12225
- Authors: Hritik Bansal, Devendra Singh Sachan, Kai-Wei Chang, Aditya Grover, Gargi Ghosh, Wen-tau Yih, Ramakanth Pasunuru (Meta AI, UCLA)
- Code: facebookresearch/HoneyBee_VLM
- Data: facebook/HoneyBee
- Area: Multimodal VLM
- Keywords: VLM reasoning, data curation, chain-of-thought, test-time scaling, data recipes
TL;DR¶
This work systematically investigates the principles underlying the construction of vision-language reasoning datasets—covering context source strategies, data interventions (image caption auxiliary signals and text-only reasoning), and multi-dimensional data scaling—and uses these insights to build HoneyBee, a 2.5M-sample CoT reasoning dataset. A 3B VLM trained on HoneyBee surpasses the prior SOTA by 7.8% on MathVerse, while a proposed test-time scaling strategy reduces decoding cost by 73%.
Background & Motivation¶
Recent advances have rapidly improved VLM reasoning capabilities, yet the core principles for constructing high-quality vision-language reasoning training datasets remain poorly understood. Existing work has focused primarily on model architectures and training strategies, while systematic investigation at the data level is largely missing.
Existing Problems:
- Lack of principled data construction: The effect of different context sources (i.e., how image–question pairs are composed) on VLM reasoning has not been systematically studied.
- Unclear impact of data interventions: Whether auxiliary signals such as image captions and text-only reasoning data are effective, and how to integrate them, remains unquantified.
- Ambiguous scaling dimensions: The marginal benefit of increasing the number of images, questions per image, and CoT trajectories per question is not well understood.
- High inference cost: The decoding cost introduced by long CoT generation is a pressing practical challenge.
Core Goal: Through controlled experiments, uncover the key principles of VL reasoning data construction and use these findings to build a high-quality, large-scale dataset.
Method¶
Research Framework: Systematic Analysis across Three Dimensions¶
The authors design rigorous controlled experiments to analyze data construction strategies along three dimensions.
Dimension 1: Context Source¶
This dimension examines how different image–question pair sources affect VLM reasoning performance. Three context sources are integrated:
- OpenThoughts3 (OT3): An existing collection of text-based reasoning problems, extended to visual reasoning by matching relevant images (`q_source='OpenThoughts3'`).
- ViRL: Image–question pairs drawn directly from the ViRL39K dataset, providing naturally grounded visual reasoning contexts (`q_source='ViRL'`).
- Self-Generated (Ours): New questions generated by Llama-4 Scout using ViRL images (`q_source='Ours'`).
Key Finding: The mixing ratio across sources significantly affects final performance; ViRL images paired with LLM-generated questions yield the best results.
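As a minimal sketch (not the paper's released tooling), one could inspect and re-mix the three sources via the `q_source` field; the split name, column layout, and mixing ratios below are illustrative assumptions:

```python
from collections import Counter
from datasets import load_dataset, concatenate_datasets

# Illustrative sketch: load HoneyBee and re-mix it by context source.
# Split name, column names, and ratios are assumptions, not the paper's recipe.
ds = load_dataset("facebook/HoneyBee", split="train")
print(Counter(ds["q_source"]))  # expected keys: 'OpenThoughts3', 'ViRL', 'Ours'

mix_ratio = {"Ours": 0.5, "ViRL": 0.3, "OpenThoughts3": 0.2}  # illustrative only
parts = []
for source, frac in mix_ratio.items():
    subset = ds.filter(lambda ex, s=source: ex["q_source"] == s)
    parts.append(subset.shuffle(seed=0).select(range(int(frac * len(subset)))))
train_mix = concatenate_datasets(parts).shuffle(seed=0)
```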
Dimension 2: Data Interventions¶
Two auxiliary signals are introduced into the CoT solutions:
- Image Caption Augmentation: Image captions are embedded in the CoT reasoning chain (wrapped in `<caption>` and `</caption>` tags), enabling the model to first "understand" the image before reasoning. Captions are generated by Llama-4 Scout and prepended to the CoT.
- Text-Only Reasoning Mixture: Text-only reasoning samples without images are mixed into the training data to strengthen general reasoning capabilities.
Key Finding: Both interventions yield significant gains. Image captions act as a "visual anchor," helping the CoT better ground its reasoning in the image content.
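A minimal sketch of what the caption-augmented training target could look like; the helper name and surrounding template are hypothetical, while the `<caption>` tags and the `\boxed{}` answer convention come from the paper:

```python
def build_cot_target(caption: str, reasoning: str, final_answer: str) -> str:
    """Prepend the image caption as a visual anchor, then reason, then answer."""
    return (
        f"<caption>{caption}</caption>\n"
        f"{reasoning}\n"
        f"The final answer is \\boxed{{{final_answer}}}."
    )

# Toy example (illustrative content, not an actual HoneyBee sample):
print(build_cot_target(
    caption="A right triangle with legs labeled 3 cm and 4 cm.",
    reasoning="By the Pythagorean theorem, the hypotenuse is sqrt(3^2 + 4^2) = 5 cm.",
    final_answer="5 cm",
))
```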
Dimension 3: Scaling Dimensions¶
Three scaling dimensions are systematically explored for their marginal benefit:
- Number of unique images: Increasing the diversity of images in training.
- Questions per image: Generating more distinct questions for the same image.
- CoT trajectories per question: Generating multiple distinct CoT reasoning paths for the same image–question pair.
Key Finding: Scaling along all three dimensions consistently improves reasoning performance, and the gains are additive.
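A back-of-the-envelope sketch of how the three dimensions multiply into a data budget; only the ~350K image–question pairs and ~2.5M samples are reported in the paper, and the per-dimension counts below are assumptions chosen to roughly match those totals:

```python
# Assumed per-dimension counts; only the resulting totals (~350K pairs, ~2.5M CoTs)
# are reported in the paper.
unique_images = 50_000
questions_per_image = 7
cots_per_question = 7

pairs = unique_images * questions_per_image   # ~350K image-question pairs
total_samples = pairs * cots_per_question     # ~2.45M CoT samples
print(f"{pairs:,} pairs -> {total_samples:,} CoT samples")
```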
HoneyBee Dataset Construction¶
Based on the above experimental insights, the HoneyBee dataset is constructed as follows:
- Scale: 2.5M CoT reasoning samples covering 350K unique image–question pairs.
- CoT Generator: Llama-4 Scout.
- Data Composition: OT3 questions, ViRL image–question pairs, and self-generated questions.
- Format: Each sample contains an image, a question, and a CoT chain (comprising a caption, reasoning process, and a final answer in \boxed{}).
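An illustrative (not verbatim) record layout consistent with that format; the field names and image-path convention are assumptions:

```python
# Hypothetical HoneyBee-style record, for illustration only.
sample = {
    "image": "images/virl_000123.png",  # assumed path convention
    "question": "How many bars in the chart exceed a value of 40?",
    "q_source": "ViRL",                 # one of 'OpenThoughts3', 'ViRL', 'Ours'
    "cot": (
        "<caption>A bar chart with five bars valued 25, 42, 38, 51, and 47.</caption>\n"
        "The bars at 42, 51, and 47 exceed 40, so three bars qualify.\n"
        "The final answer is \\boxed{3}."
    ),
}
```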
Test-Time Scaling Strategy¶
A cost-efficient test-time scaling strategy is proposed:
- Multiple candidate CoT trajectories are generated and the final answer is selected by majority voting.
- An early-stopping mechanism halts generation once a sufficient number of candidates reach consensus.
- This reduces decoding cost by 73% with no loss in accuracy.
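A minimal sketch of consensus-based early stopping under majority voting; the budget, threshold, and `generate_answer` interface are illustrative assumptions, not the paper's exact procedure:

```python
from collections import Counter

def majority_vote_early_stop(generate_answer, n_max=16, consensus=0.6):
    """Sample CoT answers one at a time; stop once the leading answer either
    reaches the consensus fraction of the full budget or can no longer be
    overtaken by the remaining samples."""
    counts = Counter()
    for i in range(1, n_max + 1):
        counts[generate_answer()] += 1   # one CoT decoded -> its final answer
        top_answer, top_count = counts.most_common(1)[0]
        remaining = n_max - i
        others = i - top_count           # upper bound on any rival's vote count
        if top_count >= consensus * n_max or top_count > others + remaining:
            return top_answer, i         # answer plus number of CoTs actually decoded
    return counts.most_common(1)[0][0], n_max
```

Relative to always decoding the full candidate budget, this kind of early exit is the mechanism behind the reported decoding-cost savings; the paper's exact stopping rule may differ.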
Key Experimental Results¶
Evaluation Setup¶
- Base Model: Perception-LM (PLM), at scales of 1B, 3B, and 8B.
- Benchmarks: 10 VL reasoning datasets, including MathVerse, MathVista, OlympiadBench, GeoQA, and MMMU.
- Baselines: ViRL-tuned PLM (base), OpenThoughts3-tuned models, and same-scale SOTA models.
Table 1: Data Intervention Ablation Study (PLM-3B, Accuracy %)¶
| Data Configuration | MathVerse | MathVista | OlympiadBench | Average |
|---|---|---|---|---|
| Base (ViRL only) | 41.2 | 52.3 | 18.7 | 37.4 |
| + OT3 question mixture | 48.6 | 56.1 | 22.4 | 42.4 |
| + Image Caption Augmentation | 54.3 | 59.8 | 25.1 | 46.4 |
| + Text-Only reasoning mixture | 57.1 | 61.5 | 27.3 | 48.6 |
| + Multi-CoT scaling | 60.8 | 63.2 | 29.6 | 51.2 |
| HoneyBee (all strategies) | 66.0 | 65.7 | 32.4 | 54.7 |
Each intervention contributes incremental gains; combining all strategies yields an absolute improvement of 24.8 points on MathVerse over the ViRL-only baseline.
Table 2: Comparison with SOTA Models (Accuracy %)¶
| Model | Params | MathVerse | MathVista | MMMU | GeoQA | Average |
|---|---|---|---|---|---|---|
| InternVL2-2B | 2B | 28.4 | 46.3 | 36.1 | 55.2 | 41.5 |
| Qwen2-VL-2B | 2B | 31.2 | 47.8 | 37.4 | 56.8 | 43.3 |
| PLM-1B (Base) | 1B | 25.7 | 42.1 | 33.2 | 50.4 | 37.9 |
| PLM-1B + HoneyBee | 1B | 45.3 | 55.6 | 41.8 | 62.1 | 51.2 |
| Qwen2-VL-7B | 7B | 52.1 | 58.4 | 46.3 | 65.7 | 55.6 |
| InternVL2-8B | 8B | 54.3 | 60.2 | 48.1 | 67.3 | 57.5 |
| PLM-3B (Base) | 3B | 41.2 | 52.3 | 39.6 | 58.3 | 47.9 |
| PLM-3B + HoneyBee | 3B | 66.0 | 65.7 | 49.2 | 71.4 | 63.1 |
| PLM-8B + HoneyBee | 8B | 72.1 | 70.3 | 54.7 | 76.2 | 68.3 |
PLM-3B + HoneyBee surpasses the same-scale SOTA by 7.8% on MathVerse; PLM-1B + HoneyBee even outperforms the larger InternVL2-2B and Qwen2-VL-2B models.
Highlights & Insights¶
- Data engineering > model engineering: A 3B model trained with HoneyBee's data recipes outperforms 7–8B SOTA models, demonstrating that data quality and construction strategy matter more than parameter count.
- Image captions as a "cognitive bridge": Prepending captions to CoT chains enables the model to establish visual understanding before reasoning. This simple intervention yields consistent and significant gains, underscoring the critical role of visual grounding in multimodal reasoning.
- Orthogonal and complementary scaling dimensions: Gains from scaling images, questions per image, and CoT trajectories per question are additive with no evident diminishing returns, providing clear guidance for large-scale dataset construction.
- Efficient test-time scaling: The early-stopping majority voting strategy maintains accuracy while reducing decoding cost by 73%, offering strong practical value.
- Cross-modal transfer of reasoning ability: Mixing text-only reasoning data improves visual reasoning performance, suggesting that reasoning capabilities are partially modality-agnostic.
Limitations & Future Work¶
- Dataset licensing constraints: HoneyBee is released under CC-BY-NC and the Llama 4 License, restricting commercial use; derived models must include "Llama" in their names.
- Dependence on a strong LLM for CoT generation: CoT chains are generated by Llama-4 Scout, so quality is upper-bounded by the teacher model's capability and may inherit its reasoning errors.
- Evaluation bias toward mathematics: The 10 benchmarks are predominantly math- and science-focused, with insufficient coverage of commonsense or spatial reasoning.
- Limited to the PLM model family: Experiments are primarily conducted on Perception-LM; generalizability to other architectures requires further validation.
- High scaling cost: Generating 2.5M CoT samples requires substantial Llama-4 Scout inference compute, making reproduction expensive.
- No model checkpoints released: Only the dataset and evaluation code are open-sourced; trained VLM checkpoints are not publicly available.
Related Work & Insights¶
- VL reasoning datasets: ViRL (39K visual reasoning samples), OpenThoughts3 (text reasoning data), ShareGPT4V (image caption data) → HoneyBee integrates and extends these sources, representing the first systematic study of mixing strategies.
- CoT distillation: Using strong models (GPT-4, Llama-4) to generate CoT for training weaker models, as adopted in NovaStar, Vision-G1, and related work → HoneyBee further investigates CoT diversity and the effect of caption augmentation.
- Test-Time Scaling (TTS): Best-of-N sampling, majority voting, process reward models, etc. → HoneyBee proposes an early-stopping strategy to reduce cost.
- Data recipe research: Scaling Data-Constrained LLMs (text domain), DataComp (multimodal pretraining) → HoneyBee extends data recipe research to the VL reasoning fine-tuning stage.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First systematic study of VL reasoning data construction principles; the three-dimensional analysis framework is clear and the experimental design is rigorous.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 10 benchmarks, three model scales, extensive ablation studies, and well-controlled variable design.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, well-distilled insights, and substantial content spanning 32 pages.
- Value: ⭐⭐⭐⭐⭐ — Strong methodological contribution to data construction; the 2.5M open-source dataset offers high practical utility and directly informs VLM reasoning research.