OneThinker: All-in-one Reasoning Model for Image and Video
Conference: CVPR 2026
arXiv: 2512.03043
Code: https://github.com/tulerfeng/OneThinker (Available)
Area: Multimodal VLM / LLM Reasoning
Keywords: Multimodal Reasoning Generalist, Reinforcement Learning, GRPO, Multi-task Reward Normalization, Image-Video Unification
TL;DR
OneThinker utilizes an 8B model to unify 10 basic visual tasks across image and video (QA, captioning, spatio-temporal grounding, tracking, and segmentation) into a "think-then-structured-output" reasoning paradigm. It introduces EMA-GRPO to resolve optimization imbalances caused by significant differences in reward magnitudes and densities across multiple tasks, outperforming specialized models of comparable size across 31 benchmarks.
Background & Motivation
Background: Following DeepSeek-R1, the paradigm of using rule-based rewards with GRPO for Reinforcement Learning (RL) to stimulate reasoning capabilities has been extensively applied to Multimodal Large Language Models (MLLMs): Vision-R1 for image QA, Video-R1 for video QA, VLM-R1 for detection, Seg-R1 for segmentation, and Time-R1 for temporal localization. Each work focuses on pushing reasoning limits within its specific task and modality.
Limitations of Prior Work: These "reasoning models" are predominantly single-task and single-modality: each processes either images or videos, and handles either QA or grounding. The few attempts at multi-task learning (e.g., VideoChat-R1) are restricted to narrow subsets, jointly training on only 3 spatio-temporal perception tasks with 18k samples, confined exclusively to the video modality. This fragmentation limits versatility in real-world deployment and prevents beneficial knowledge transfer between tasks and modalities.
Key Challenge: Integrating heterogeneous visual tasks into a single model for multi-task RL encounters an unavoidable reward imbalance problem. Reward scales and densities differ vastly across tasks: mathematical QA provides sparse 0/1 rewards, while grounding provides continuous, narrow-range IoU rewards. Standard GRPO normalizes using "standard deviation within each prompt group," which biases towards low-variance samples (intra-task imbalance); meanwhile, removing normalization (as in Dr.GRPO) allows sparse rewards (math) to dominate gradients, suppressing dense, small-magnitude rewards (detection) (inter-task imbalance).
Goal: To train a true "Multimodal Reasoning Generalist"—a single model capable of simultaneously handling a series of basic visual tasks across images and videos while maintaining stable training under heterogeneous rewards. This is decomposed into two sub-problems: (1) constructing a large-scale, modality-balanced dataset covering all tasks, and (2) designing an RL algorithm that simultaneously addresses intra-task and inter-task imbalances.
Key Insight: The authors take the fact that "vision naturally encompasses both static images and dynamic videos" as a starting point. Since the real world requires unified reasoning for diverse visual tasks, all tasks are cast into the same textual interface (<think> reasoning + <answer> structured output) to enable joint training within a unified RL framework.
Core Idea: Replace the per-group standard deviation that GRPO computes at each step with a task-level standard deviation estimated from Exponential Moving Averages (EMA) of the reward moments, and use it for advantage normalization. Each task gets its own smooth, adaptive normalization scale, which addresses intra-task and inter-task imbalance at the same time.
Method
Overall Architecture
OneThinker aims to solve 10 heterogeneous visual tasks with an 8B model via a three-stage pipeline: First, Data Construction (OneThinker-600k covers 8 categories; Seed1.5-VL is used for annotation and filtering to create a 340k high-quality CoT subset); Second, SFT Cold Start using the 340k CoT data to teach the Qwen3-VL-Instruct-8B base model the "think-before-answer" format; Finally, EMA-GRPO Reinforcement Learning on the full 600k corpus to refine reasoning capabilities.
Crucially, all tasks are unified into a single textual interface: the model generates internal reasoning within <think>...</think> and task-specific results within <answer>...</answer>. For perception tasks (grounding/tracking/segmentation), <answer> contains structured representations (time intervals, bounding boxes, sparse points) in a predefined JSON schema, allowing for automatic parsing and rule-based reward calculation. The total reward for each task is \(R = R_{\text{acc}} + R_{\text{format}}\), where \(R_{\text{acc}}\) is a task-specific accuracy reward and \(R_{\text{format}}\) scores adherence to the required output format. The resulting heterogeneous rewards are handled by EMA-GRPO during the RL phase.
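The interface above can be illustrated with a minimal parsing sketch. It is not the authors' implementation: the regular expression, the binary format score, and the `"boxes"` key in the sample answer are assumptions made for illustration, since the source only states that reasoning goes in <think> and a JSON-structured result goes in <answer>.

```python
import json
import re

# Assumed pattern: one <think> block followed by one <answer> block.
# Whitespace tolerance around the tags is an assumption of this sketch.
_RESPONSE_RE = re.compile(
    r"\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)


def parse_response(text):
    """Return (reasoning, answer) if the response follows the interface, else None."""
    match = _RESPONSE_RE.fullmatch(text)
    if match is None:
        return None
    return match.group("think").strip(), match.group("answer").strip()


def format_reward(text, expects_json=False):
    """Assumed binary format score: 1.0 for a well-formed response, else 0.0.

    For perception tasks (expects_json=True) the answer must also parse as JSON.
    """
    parsed = parse_response(text)
    if parsed is None:
        return 0.0
    if expects_json:
        try:
            json.loads(parsed[1])
        except json.JSONDecodeError:
            return 0.0
    return 1.0


# Hypothetical grounding response; the "boxes" key is illustrative only.
sample = (
    "<think>The dog is left of the chair.</think>"
    '<answer>{"boxes": [10, 20, 110, 220]}</answer>'
)
reasoning, answer = parse_response(sample)
total_format_score = format_reward(sample, expects_json=True)
```

Keeping the format check separate from the accuracy check mirrors the additive form \(R = R_{\text{acc}} + R_{\text{format}}\): a response can be well-formed and wrong, or malformed and unscorable.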
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Image + Video<br/>10 Visual Tasks"] --> B["Unified Text Interface<br/>think Reasoning + answer Structured Output"]
B --> C["Dataset Construction<br/>OneThinker-600k → Seed1.5-VL Labeling CoT → SFT-340k"]
C --> D["SFT Cold Start<br/>Qwen3-VL-8B learns think-then-answer"]
D --> E["Task-specific Reward Design<br/>IoU / Gaussian Kernel / Reward Model Score"]
E --> F["EMA-GRPO Reinforcement Learning<br/>Task-level EMA Std. Dev. for Advantage Normalization"]
F --> G["Unified Multimodal Reasoning Generalist<br/>31 Benchmarks / 10 Tasks"]
Key Designs
1. Unified Text Interface + Task-specific Verifiable Rewards: Casting 10 Heterogeneous Tasks into the "Think \(\rightarrow\) Structured Output" Format
Multi-task RL requires comparable and automatically verifiable rewards. OneThinker requires <think> reasoning and <answer> structured output for every task, with tailored accuracy rewards. Rule-based QA covers several answer types: multiple-choice, numerical, and math answers use equivalence checks; regression answers use Mean Relative Accuracy; OCR answers use Word Error Rate. For open-ended QA and captioning without unique answers, an external reward model (POLAR-7B) provides similarity scores \(R_{\text{acc}}=\mathrm{RM}(q,\hat a,a)\). Temporal grounding uses \(R_{\text{acc}}=\mathrm{tIoU}([\hat s,\hat e],[s,e])\), spatial grounding uses \(\mathrm{sIoU}(\hat b,b)\), spatio-temporal grounding sums the two as \(\mathrm{tIoU}+\overline{\mathrm{sIoU}}\), and tracking uses the average box IoU \(\overline{\mathrm{sIoU}}\).
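The grounding and tracking rewards above reduce to interval and box IoU. The sketch below shows one way to compute them; the `[x1, y1, x2, y2]` box layout, the pairing of frames by position, and the choice to score missing predicted frames as zero are assumptions of this sketch, not details given in the source.

```python
def temporal_iou(pred, gt):
    """tIoU between two [start, end] intervals."""
    if pred[1] <= pred[0] or gt[1] <= gt[0]:
        return 0.0  # degenerate interval
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union


def box_iou(pred, gt):
    """sIoU between two boxes given as [x1, y1, x2, y2] (assumed layout)."""
    if pred[2] <= pred[0] or pred[3] <= pred[1]:
        return 0.0  # degenerate predicted box
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def mean_box_iou(pred_boxes, gt_boxes):
    """Average sIoU over ground-truth frames.

    Frames are paired by position; frames without a prediction contribute 0
    (both are assumptions of this sketch).
    """
    if not gt_boxes:
        return 0.0
    total = sum(box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes))
    return total / len(gt_boxes)


def spatio_temporal_reward(pred_span, gt_span, pred_boxes, gt_boxes):
    """R_acc = tIoU + mean sIoU, as described for spatio-temporal grounding."""
    return temporal_iou(pred_span, gt_span) + mean_box_iou(pred_boxes, gt_boxes)
```

Note that the spatio-temporal reward ranges over \([0,2]\) while the other IoU rewards range over \([0,1]\), which is one concrete source of the reward-scale differences that EMA-GRPO is designed to absorb.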
Segmentation is handled indirectly: instead of masks, the model predicts a bounding box and a set of positive/negative points, which are fed into SAM2 to generate the final mask (for video segmentation the model also identifies a keyframe time \(\hat t\)). To avoid the high latency of running SAM2 during rollouts, the mask reward is replaced by a Gaussian kernel \(\mathcal{G}(d)=\exp\!\big(-d^2/(2\sigma^2)\big)\) that maps the "minimum matching distance from predicted points to ground truth points" to \([0,1]\) (spatial \(\sigma=50\), temporal \(\sigma=1\)). The image segmentation reward is \(R_{\text{acc}}=\mathrm{sIoU}(\hat b,b)+\mathcal{G}(\mathrm{dis}_+)+\mathcal{G}(\mathrm{dis}_-)\), and video segmentation adds a \(\mathcal{G}(|\hat t-t|)\) term for the keyframe time.
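A minimal sketch of the Gaussian-kernel proxy reward follows. The source does not spell out how the "minimum matching distance" is computed, so the `matching_distance` helper below uses the mean nearest-neighbour distance as a stand-in; that choice, the function names, and the handling of empty point sets are assumptions. The box term is passed in as a precomputed sIoU value.

```python
import math


def gaussian_kernel(distance, sigma):
    """G(d) = exp(-d^2 / (2 * sigma^2)); equals 1 at d = 0 and decays to 0."""
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


def matching_distance(pred_points, gt_points):
    """Stand-in for the paper's matching distance (assumption):
    mean distance from each predicted point to its nearest ground-truth point.
    """
    if not pred_points or not gt_points:
        return math.inf  # no match possible, so the kernel returns 0
    nearest = [
        min(math.hypot(px - gx, py - gy) for gx, gy in gt_points)
        for px, py in pred_points
    ]
    return sum(nearest) / len(nearest)


def image_seg_reward(siou, pred_pos, gt_pos, pred_neg, gt_neg, sigma=50.0):
    """R_acc = sIoU + G(dis_+) + G(dis_-), with spatial sigma = 50."""
    return (
        siou
        + gaussian_kernel(matching_distance(pred_pos, gt_pos), sigma)
        + gaussian_kernel(matching_distance(pred_neg, gt_neg), sigma)
    )


def video_seg_reward(siou, pred_pos, gt_pos, pred_neg, gt_neg,
                     pred_time, gt_time, sigma_t=1.0):
    """Image reward plus G(|t_hat - t|) for the keyframe, temporal sigma = 1."""
    return (
        image_seg_reward(siou, pred_pos, gt_pos, pred_neg, gt_neg)
        + gaussian_kernel(abs(pred_time - gt_time), sigma_t)
    )
```

With \(\sigma=50\), a point that is 50 units from its match still earns \(e^{-0.5}\approx 0.61\), so the reward stays informative for rough predictions while still favouring exact ones.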
2. OneThinker-600k Dataset + Seed1.5-VL CoT Labeling: Training Corpus for a Multimodal Generalist
To master logic, knowledge, spatial, and temporal reasoning, large-scale, diverse, and balanced data is required. The authors curated OneThinker-600k from public datasets, covering 8 task categories across image/video modalities. Using Seed1.5-VL, they generated CoT annotations for the 600k samples, filtered by task-specific thresholds and quality checks to produce OneThinker-SFT-340k for cold starting.
3. EMA-GRPO: Curing Intra-task and Inter-task Imbalance via Task-level EMA Standard Deviation
This is the core algorithm. Standard GRPO divides by the standard deviation of each prompt group, so groups with very low reward variance (prompts that are almost always solved or almost never solved) receive disproportionately strong updates, while moderate-difficulty prompts with higher variance are under-optimized (intra-task imbalance). Dr.GRPO removes the normalization, which avoids this intra-task bias but lets tasks with sparse, large-magnitude rewards (math) drown out tasks with dense, small-magnitude rewards (grounding) (inter-task imbalance).
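Both imbalances can be seen on toy numbers. The sketch below is illustrative only: the reward values are invented, and the use of the population standard deviation with a small epsilon is an assumption about how the group normalization is implemented.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO: centre by the group mean, divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr.GRPO: centre by the group mean, no std normalization."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


# Invented rollout groups of size 8.
hard_math = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # low variance, 0/1 reward
medium_math = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # highest variance, 0/1 reward
grounding = [0.52, 0.55, 0.50, 0.58, 0.54, 0.51, 0.56, 0.53]  # narrow-range IoU

# Intra-task imbalance: the low-variance group gets the larger peak advantage.
peak_hard = max(grpo_advantages(hard_math))
peak_medium = max(grpo_advantages(medium_math))

# Inter-task imbalance: without normalization, math advantages dwarf grounding.
peak_math_raw = max(dr_grpo_advantages(medium_math))
peak_grounding_raw = max(dr_grpo_advantages(grounding))
```

On these numbers the single-success math group peaks at an advantage of about 2.65 against 1.0 for the balanced group under standard GRPO, and under Dr.GRPO the math group peaks at 0.5 against roughly 0.044 for the grounding group.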
EMA-GRPO maintains the first and second moments of rewards for each task \(\tau\). For the rewards \(\{R_i\}\) of task \(\tau\) in the current batch, compute \(\mu^\tau(t)=\mathrm{mean}(\{R_i\})\) and \(\nu^\tau(t)=\mathrm{mean}(\{R_i^2\})\), then update the moments with decay factor \(\beta=0.99\):

\[
m_1^\tau(t)=\beta\,m_1^\tau(t-1)+(1-\beta)\,\mu^\tau(t),\quad m_2^\tau(t)=\beta\,m_2^\tau(t-1)+(1-\beta)\,\nu^\tau(t).
\]

The task-level standard deviation \(\sigma^\tau(t)=\sqrt{m_2^\tau(t)-(m_1^\tau(t))^2}\) is used to normalize the advantage:

\[
A_i^\tau(t)=\frac{R_i-\mathrm{mean}(\{R_j\})}{\sigma^\tau(t)}.
\]

This strategy ensures that rollouts within the same task share a normalization scale, preventing bias toward specific groups (solving intra-task imbalance). Across different tasks, independent \(\sigma^\tau(t)\) values level the gradient contributions, preventing sparse reward tasks from dominating (solving inter-task imbalance).
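The update rules above translate into a small amount of bookkeeping per task. The following is a sketch under stated assumptions rather than the released code: the class name `EmaGrpoNormalizer`, seeding the moments with the first batch's statistics, the epsilon added to the denominator, and taking \(\mathrm{mean}(\{R_j\})\) over the rollout group of one prompt are all choices made here for illustration.

```python
import math


class EmaGrpoNormalizer:
    """Tracks per-task EMA reward moments and normalizes advantages with them."""

    def __init__(self, beta=0.99, eps=1e-6):
        self.beta = beta  # decay factor from the text
        self.eps = eps    # numerical guard (assumption)
        self.m1 = {}      # task -> EMA of mean(R)
        self.m2 = {}      # task -> EMA of mean(R^2)

    def update(self, task, batch_rewards):
        """Fold the current batch's rewards for one task into the EMA moments."""
        mu = sum(batch_rewards) / len(batch_rewards)
        nu = sum(r * r for r in batch_rewards) / len(batch_rewards)
        if task not in self.m1:
            # Initialization is not specified in the text; seeding with the
            # first batch statistics is an assumption of this sketch.
            self.m1[task], self.m2[task] = mu, nu
        else:
            b = self.beta
            self.m1[task] = b * self.m1[task] + (1.0 - b) * mu
            self.m2[task] = b * self.m2[task] + (1.0 - b) * nu

    def sigma(self, task):
        """Task-level std: sqrt(m2 - m1^2), clamped at zero for safety."""
        variance = self.m2[task] - self.m1[task] ** 2
        return math.sqrt(max(variance, 0.0))

    def advantages(self, task, group_rewards):
        """A_i = (R_i - group mean) / sigma_task for one prompt's rollout group."""
        mean = sum(group_rewards) / len(group_rewards)
        scale = self.sigma(task) + self.eps
        return [(r - mean) / scale for r in group_rewards]


# Usage with invented rewards: two math groups of different difficulty.
normalizer = EmaGrpoNormalizer()
hard = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
medium = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
normalizer.update("math", hard + medium)
adv_hard = normalizer.advantages("math", hard)
adv_medium = normalizer.advantages("math", medium)
```

Because both groups are divided by the same task-level scale, the ratio of their peak advantages is just the ratio of their centred rewards (0.875 to 0.5), rather than being inflated for the low-variance group as under per-group normalization.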
Loss & Training
RL employs the standard GRPO objective with EMA-normalized advantages:

\[
\mathbb{E}_{q,\{o_i\}}\Big[\tfrac{1}{G}\sum_{i=1}^{G}\big(\min(r_i A_i^\tau,\ \mathrm{clip}(r_i,1-\epsilon,1+\epsilon)A_i^\tau)-\beta_{\mathrm{KL}}D_{\mathrm{KL}}(\pi_\theta\|\pi_{\mathrm{ref}})\big)\Big],
\]

where \(r_i=\pi_\theta(o_i|q)/\pi_{\theta_{\text{old}}}(o_i|q)\). Training used 32 H800 GPUs with Qwen3-VL-Instruct-8B as the base model. SFT: batch 32, lr \(1\times10^{-5}\). RL: batch 128, lr \(2\times10^{-6}\), group size 8, \(\beta_{\mathrm{KL}}=0.01\). Max response length is 4096 tokens, with at most 128 frames per video.
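To make the clipped objective concrete, the sketch below evaluates it for one rollout group from plain numbers. It works at the sequence level and takes the KL estimate as an input, both simplifications; the clipping range `epsilon=0.2` is an assumed value, since the source does not state it, while `beta_kl=0.01` comes from the training setup.

```python
import math


def clipped_term(logp_new, logp_old, advantage, epsilon=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) with r = pi_new / pi_old."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


def group_objective(logps_new, logps_old, advantages, kl_estimates,
                    epsilon=0.2, beta_kl=0.01):
    """Average over the G rollouts of (clipped surrogate - beta_KL * KL).

    kl_estimates holds one precomputed KL(pi_theta || pi_ref) estimate per
    rollout; how it is estimated is outside the scope of this sketch.
    """
    terms = [
        clipped_term(new, old, adv, epsilon) - beta_kl * kl
        for new, old, adv, kl in zip(logps_new, logps_old, advantages, kl_estimates)
    ]
    return sum(terms) / len(terms)
```

The clipping is asymmetric in effect: with a positive advantage the term stops growing once the ratio exceeds \(1+\epsilon\), while with a negative advantage a ratio above \(1+\epsilon\) is still penalized in full.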
Key Experimental Results
Evaluated across 31 benchmarks covering the 10 tasks, using a reproduced Qwen3-VL-Instruct-8B as the baseline.
Main Results
Image/Video QA Highlights:

| Task | Benchmark | Metric | Qwen3-VL-8B | OneThinker-8B |
|------|-----------|--------|-------------|---------------|
| Image QA | MMMU | acc | 60.2 | 70.6 |
| Image QA | MathVerse | acc | 58.1 | 64.3 |
| Image QA | ScienceQA | acc | 92.0 | 96.5 |
| Video QA | LongVideo-Reason | acc | 71.5 | 79.2 |
| Video QA | VideoMathQA | acc | 24.3 | 35.0 |
| Video QA | VideoHolmes | acc | 40.9 | 48.7 |

Perception Task Highlights (grounding / tracking / segmentation):

| Task | Benchmark | Metric | Qwen3-VL-8B | OneThinker-8B |
|------|-----------|--------|-------------|---------------|
| Temporal grounding | ActivityNet | mIoU | 29.1 | 45.9 |
| Spatial grounding | RefCOCO testA | acc | 92.2 | 93.7 |
| Spatio-temporal grounding | STVG | sIoU | 13.6 | 36.7 |
| Tracking | GOT-10k | AO | 33.7 | 73.0 |
| Tracking | GOT-10k | [email protected] | 28.9 | 84.4 |
| Video Segmentation | ReasonVOS | J&F | 19.6 | 54.9 |
| Video Segmentation | MeViS | J&F | 22.9 | 52.7 |
The largest gains are on perception tasks: tracking AO on GOT-10k rose from 33.7 to 73.0, and video segmentation J&F more than doubled on both ReasonVOS (19.6 to 54.9) and MeViS (22.9 to 52.7).
Ablation Study
Removing/replacing key components (averages across task-related benchmarks):

| Configuration | QA | Temporal grounding | Spatial grounding | Spatio-temporal grounding | Tracking | Segmentation | Description |
|---------------|----|--------------------|-------------------|---------------------------|----------|--------------|-------------|
| Qwen3-VL-8B | 65.0 | 30.8 | 86.6 | 19.5 | 33.7 | 50.0 | Base |
| OneThinker-SFT | 67.0 | 31.8 | 87.8 | 27.1 | 48.1 | 62.8 | SFT only, no RL |
| OneThinker-GRPO | 67.2 | 46.9 | 86.5 | 34.5 | 65.5 | 62.3 | Standard GRPO |
| OneThinker-DrGRPO | 67.6 | 46.3 | 88.2 | 34.0 | 67.8 | 61.2 | Dr.GRPO |
| OneThinker-8B | 69.8 | 49.7 | 89.2 | 38.1 | 73.0 | 64.2 | Full EMA-GRPO |
Key Findings
- EMA-GRPO outperforms both alternatives on every task group: replacing it with standard GRPO or Dr.GRPO lowers every column of the ablation table (e.g., Tracking 73.0 \(\rightarrow\) 65.5/67.8), which supports the need to address both intra-task and inter-task imbalances. Both baselines also fall below the SFT-only model on segmentation (62.3 and 61.2 vs. 62.8), whereas EMA-GRPO improves on it (64.2).
- Cross-task/Cross-modality Knowledge Transfer is evident: Ablating temporal grounding data significantly hurts video QA and tracking (temporal localization enhances sequence reasoning); removing spatial grounding lowers image QA and segmentation scores. Notably, removing image QA data severely degrades video QA (61.1 \(\rightarrow\) 58.2), indicating that reasoning learned from image QA transfers to video.
- Zero-shot Generalization: OneThinker outperforms the base model on unseen tasks (point tracking, rotation detection, etc.), showing the benefit of unified reasoning.
Highlights & Insights
- Addressing "Reward Heterogeneity" as the primary obstacle in multi-task RL: EMA-GRPO provides a clean insight—intra-task and inter-task imbalances are two sides of the same coin. A task-level adaptive standard deviation resolves both. This approach is transferable to any multi-task RL scenario with varying reward scales.
- Conversion of Segmentation to "Box+Points+Time \(\rightarrow\) SAM2": By reducing high-dimensional masks to structured predictions and Gaussian kernel rewards, the authors circumvented high RL rollout latency while incorporating segmentation into a rule-based RL framework.
- Image QA data as a benefit for Video QA: High-quality image reasoning data is a viable shortcut to improving video reasoning models when video-specific reasoning data is scarce.
Limitations & Future Work
- Lack of Mask-level Reward: Due to SAM2 latency, proxy rewards (boxes/points) were used, which may limit the upper bound of segmentation quality compared to direct mask optimization.
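- Heavy Dependency on External Components: CoT labeling relies on Seed1.5-VL, rewards for open-ended QA and captioning rely on POLAR-7B, and final segmentation masks are produced by SAM2. Annotation quality, reward quality, and mask quality are therefore bounded by these external components.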
- Scope of Generalization: Zero-shot verification is preliminary; the stability of EMA-GRPO with even larger task sets or additional modalities (audio/3D) remains to be explored.
Related Work & Insights
- vs. Video-R1 / Vision-R1: These models achieve high performance on single tasks/modalities. OneThinker covers 10 tasks and 2 modalities, surpassing Video-R1 on LongVideo-Reason (79.2 vs 67.2) through generalized knowledge transfer.
- vs. VideoChat-R1: While VideoChat-R1 jointly trains on 18k samples for 3 video tasks, OneThinker scales to 600k samples and 10 tasks across images and videos, highlighting the importance of balanced, large-scale data for generalists.
- vs. Standard GRPO / Dr.GRPO: EMA-GRPO specifically targets imbalances in heterogeneous multi-task settings, consistently outperforming both in ablation studies.
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐