IterResearch: Rethinking Long-Horizon Agents with Interaction Scaling¶

Conference: ICLR 2026
arXiv: 2511.07327
Code: Available
Area: LLM Efficiency
Keywords: Deep Research Agent, Iterative Workspace, MDP Framework, Interaction Scaling, Reinforcement Learning

TL;DR¶

IterResearch is an MDP-based iterative paradigm for deep research that replaces mono-contextual linear accumulation with periodic workspace reconstruction. This lets agents scale to 2048 interactions within a 40K context length, with BrowseComp accuracy rising from 3.5% to 42.5% as the interaction budget grows from 2 to 2048. The resulting agent surpasses open-source agents by an average of 14.5 percentage points across 6 benchmarks.

Background & Motivation¶

Deep research agents (e.g., OpenAI Deep Research, Gemini Deep Research) construct knowledge through autonomous reasoning and information retrieval. However, existing open-source approaches adopt a mono-contextual paradigm — appending all retrieved information and reasoning steps to a continuously expanding context window. This leads to two fundamental problems:

Context Suffocation: As the context fills up, the space available for model reasoning progressively shrinks, forcing increasingly terse responses and ultimately degenerating into premature or superficial conclusions.

Noise Contamination: Irrelevant search results and early exploration errors are permanently embedded in the context, producing cascading interference.

Core Idea: Effective long-horizon research requires periodic synthesis and strategic forgetting: the agent periodically compresses its findings into an evolving report, then continues exploring from the report rather than from the full history. The state size therefore stays roughly constant, \(O(1)\), instead of growing linearly with the number of steps, \(O(t)\).

Method¶

Overall Architecture¶

IterResearch models deep research as an MDP \(\langle\mathcal{S},\mathcal{D},\mathcal{E},\mathcal{T},R\rangle\). At each iteration, the agent performs "think–update report–execute action" on a reconstructed workspace; after the environment returns results, the next workspace is reconstructed, retaining only the question, the evolving report, and the most recent round's context.

Key Designs¶

Iterative Workspace Reconstruction:
- Function: Maintains a constant agent workspace size rather than linear growth.
- Mechanism: The state \(s_t = (q, \mathcal{M}_t, \{a_{t-1}, \text{TR}_{t-1}\})\) has three components: the fixed question \(q\), the evolving report \(\mathcal{M}_t\) (a compressed summary of the findings so far), and the previous action \(a_{t-1}\) together with the result \(\text{TR}_{t-1}\) that the environment returned for it. Each decision is \(d_t = [\text{Think}_t, \mathcal{M}_{t+1}, a_t]\), and the transition function reconstructs the workspace as \(s_{t+1} = (q, \mathcal{M}_{t+1}, \{a_t, \text{TR}_t\})\). Earlier rounds are "strategically forgotten": they survive only through what the report retains.
- Comparison: In the mono-contextual paradigm the state grows with the number of rounds, \(|s_t| = O(t)\); in IterResearch it stays roughly constant, \(|s_t| = O(1)\).
- Design Motivation: Reports are naturally generated by the LLM, leveraging its information compression and relevance filtering capabilities without additional algorithmic intervention.
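
The reconstruction step above can be illustrated with a small sketch. It is not the authors' implementation: `toy_policy` and `toy_env` are hypothetical stand-ins for the LLM and the search environment, and the fixed `report_budget` is an assumption that mimics a bounded report. The point is only to contrast the size of the reconstructed workspace with a context that appends every round.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workspace:
    question: str       # q, fixed across rounds
    report: str         # M_t, the evolving report
    last_action: str    # a_{t-1}
    last_response: str  # TR_{t-1}

    def size(self) -> int:
        return sum(len(x) for x in (self.question, self.report,
                                    self.last_action, self.last_response))


@dataclass(frozen=True)
class Decision:
    think: str       # Think_t, discarded after the round
    new_report: str  # M_{t+1}
    action: str      # a_t


def transition(state: Workspace, decision: Decision, response: str) -> Workspace:
    """s_{t+1} = (q, M_{t+1}, {a_t, TR_t}); nothing else is carried over."""
    return Workspace(state.question, decision.new_report, decision.action, response)


def toy_policy(state: Workspace, t: int, report_budget: int = 200) -> Decision:
    # Hypothetical stand-in for the LLM: fold the last result into the report,
    # then keep only the most recent `report_budget` characters.
    merged = state.report + f" | finding {t}: " + state.last_response[:40]
    return Decision(think=f"round {t} reasoning",
                    new_report=merged[-report_budget:],
                    action=f"search(query_{t})")


def toy_env(action: str) -> str:
    # Hypothetical environment returning a fixed-length result.
    return f"result for {action} " + "x" * 100


def run(rounds: int):
    state = Workspace("Who proposed X?", "", "", "")
    mono_context = state.question  # mono-contextual baseline: append everything
    iter_sizes, mono_sizes = [], []
    for t in range(rounds):
        decision = toy_policy(state, t)
        response = toy_env(decision.action)
        mono_context += decision.think + decision.action + response
        state = transition(state, decision, response)
        iter_sizes.append(state.size())
        mono_sizes.append(len(mono_context))
    return iter_sizes, mono_sizes


iter_sizes, mono_sizes = run(50)
print("iterative workspace, max size:", max(iter_sizes))
print("mono-contextual context, final size:", mono_sizes[-1])
```

In this toy setting the reconstructed workspace stays under a fixed bound while the appended context grows with every round, which is the \(O(1)\) versus \(O(t)\) contrast described above.
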
Efficiency-Aware Policy Optimization (EAPO):
- Function: Trains the agent to explore efficiently rather than search aimlessly.
- Mechanism (two components):
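  - Geometric Discount Reward: \(r_t = \gamma^{T-t} \cdot R_T\), where \(T\) is the final round, \(R_T\) is the reward for the final answer, and \(\gamma\) is the discount factor. A trajectory that reaches a correct answer in fewer rounds has a smaller exponent \(T-t\) at each step and therefore a higher per-step reward, which creates implicit efficiency pressure.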
  - Adaptive Downsampling: Since the iterative paradigm naturally decomposes each trajectory into multiple training samples (one per round), the number of samples varies across questions. The total sample count is truncated to the largest multiple of the data-parallel (DP) size: \(|\mathcal{C}_{\text{train}}| = \lfloor|\mathcal{C}|/\text{DP}_{\text{size}}\rfloor \times \text{DP}_{\text{size}}\).
- Implemented on the GSPO algorithm; the training objective includes PPO-style clipping and within-group advantage normalization.
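
The two EAPO components can be sketched in a few lines. This is an illustration under assumptions, not the paper's code: the function names are hypothetical, the value of \(\gamma\) in the usage lines is arbitrary, and uniform random subsampling is assumed as the way to reach the truncated sample count (the notes above only give the target size).

```python
import random


def discounted_rewards(final_reward: float, num_rounds: int, gamma: float) -> list[float]:
    """Per-round rewards r_t = gamma^(T - t) * R_T for t = 1..T."""
    T = num_rounds
    return [gamma ** (T - t) * final_reward for t in range(1, T + 1)]


def adaptive_downsample(samples: list, dp_size: int, seed: int = 0) -> list:
    """Keep floor(|C| / DP_size) * DP_size samples.

    Assumption: the surplus samples are dropped uniformly at random.
    """
    keep = (len(samples) // dp_size) * dp_size
    return random.Random(seed).sample(samples, keep)


# A 3-round success earns more at its first round than a 5-round success.
short = discounted_rewards(1.0, num_rounds=3, gamma=0.9)
long = discounted_rewards(1.0, num_rounds=5, gamma=0.9)
print(short[0] > long[0])

# 37 per-round samples on 8 data-parallel ranks are cut down to 32.
batch = adaptive_downsample(list(range(37)), dp_size=8)
print(len(batch))
```

Truncating to a multiple of the data-parallel size means every rank receives the same number of samples, at the cost of dropping fewer than `dp_size` samples per batch.
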
Two-Stage Training Pipeline:
- Stage 1 RFT: Rejection sampling fine-tuning to enable the model to acquire basic competency in the iterative paradigm.
- Stage 2 RL: Further optimization of search strategy and reasoning ability via EAPO.
- Backbone: Qwen3-30B-A3B (balancing performance and efficiency).
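
As a rough illustration of Stage 1, rejection sampling can be read as "keep only trajectories that end in a correct answer, then train on their rounds". The sketch below assumes exact string matching as the correctness check and uses hypothetical `Round` and `Trajectory` containers. The actual acceptance criterion and data format are not specified in these notes.

```python
from typing import NamedTuple


class Round(NamedTuple):
    workspace: str  # serialized s_t shown to the model
    decision: str   # serialized d_t produced by the model


class Trajectory(NamedTuple):
    rounds: list[Round]
    final_answer: str


def rejection_sample(trajectories: list[Trajectory], reference: str) -> list[tuple[str, str]]:
    """Keep correct trajectories and emit one (input, target) pair per round."""
    def is_correct(answer: str) -> bool:
        # Assumption: case-insensitive exact match stands in for the real judge.
        return answer.strip().lower() == reference.strip().lower()

    kept = [t for t in trajectories if is_correct(t.final_answer)]
    return [(r.workspace, r.decision) for t in kept for r in t.rounds]


sampled = [
    Trajectory([Round("s0", "d0"), Round("s1", "d1")], "Paris"),
    Trajectory([Round("s0", "d0'")], "Lyon"),
]
sft_pairs = rejection_sample(sampled, reference="paris")
print(len(sft_pairs))
```

Because each round is its own training sample, a single accepted trajectory contributes as many pairs as it has rounds.
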

Three Core Findings¶

Interaction Scaling: Scaling interactions from 2 to 2048 improves BrowseComp accuracy from 3.5% to 42.5%.
Cross-Paradigm Knowledge Transfer: Trajectories generated by IterResearch also improve performance when used to train mono-contextual agents.
As a Prompting Strategy: When applied directly to frontier models such as GPT-4o/Claude without any training, the paradigm improves BrowseComp performance by 12.7–19.2 pp over ReAct.

Key Experimental Results¶

Main Results¶

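| Model | HLE | BC (BrowseComp) | BC-zh | GAIA | Xbench-DS | SEAL-0 |
|---|---|---|---|---|---|---|
| WebSailor-72B | 9.8 | 12.0 | 30.1 | 55.4 | 55.0 | 19.8 |
| MiroThinker-32B | 19.1 | 17.2 | 29.4 | 64.1 | 56.0 | — |
| IterResearch-30B-A3B | 28.8 | 37.3 | 45.2 | 72.8 | 71.0 | 39.6 |
| Gain (vs. best open-source agent) | +8.8 | +20.1 | +15.8 | +8.7 | +15.0 | +18.9 |
| OpenAI DeepResearch | 26.6 | 51.5 | 42.9 | 67.4 | — | — |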

Ablation Study on Interaction Scaling¶

| Max Interactions | BrowseComp Accuracy |
|---|---|
| 2 | 3.5% |
| 32 | ~15% |
| 128 | ~28% |
| 512 | ~35% |
| 2048 | 42.5% |

Key Findings¶

Outperforms the best open-source agent by an average of 14.5 pp across 6 benchmarks.
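Surpasses OpenAI DeepResearch on HLE (28.8 vs. 26.6), BC-zh (45.2 vs. 42.9), and GAIA (72.8 vs. 67.4), while trailing it on BrowseComp (37.3 vs. 51.5).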
Scaling interactions to 2048 yields a 12× performance improvement, suggesting that failures on long-horizon tasks may stem primarily from insufficient exploration capacity rather than model capability.
Applied to GPT-4o as a training-free prompting strategy, the paradigm gains 19.2 pp on BrowseComp over ReAct, which suggests it is useful independently of the trained model.
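
The two headline numbers can be recomputed from the figures reported above. The snippet uses only values from the Gain row of the main results and the end points of the interaction-scaling ablation.

```python
# Per-benchmark gains over the best open-source agent (Gain row, in pp).
gains = {
    "HLE": 8.8,
    "BC": 20.1,
    "BC-zh": 15.8,
    "GAIA": 8.7,
    "Xbench-DS": 15.0,
    "SEAL-0": 18.9,
}
mean_gain = sum(gains.values()) / len(gains)

# BrowseComp accuracy at 2048 versus 2 maximum interactions.
scaling_ratio = 42.5 / 3.5

print(f"mean gain: {mean_gain:.2f} pp")
print(f"scaling ratio: {scaling_ratio:.1f}x")
```

The mean of the six gains is 14.55 pp, in line with the reported average of 14.5 pp, and the accuracy ratio is about 12.1, in line with the reported 12×.
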

Highlights & Insights¶

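The MDP-based "strategic forgetting" design is elegant: the evolving report serves as a compressed state representation, so each decision conditions only on the current workspace, as the Markov property requires. How much task-relevant history that state actually retains still depends on report quality.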
The interaction scaling finding is significant — it suggests that current agent failures are more attributable to insufficient exploration than to inadequate capability.
Cross-paradigm knowledge transfer and the zero-training prompting strategy extend the applicability of the proposed approach.

Limitations & Future Work¶

Report quality is a critical bottleneck — if important information is lost during summarization, subsequent reasoning will be adversely affected.
Per-round reconstruction requires re-reading the report, potentially introducing redundant computation.
Training is conducted only on Qwen3-30B-A3B; performance on larger or smaller models remains to be validated.
The discount factor \(\gamma\) in the geometric discount reward may be a sensitive hyperparameter.

Related Work¶

vs. WebThinker/WebDancer: These methods use the mono-contextual paradigm and are therefore exposed to context suffocation as trajectories grow.
vs. InftyThink: A similar iterative-plus-summarization idea, but applied to reasoning tasks; IterResearch targets information retrieval agents.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ MDP modeling combined with iterative workspace reconstruction represents an important breakthrough in the deep research agent paradigm.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Validated across 6 benchmarks with multi-dimensional analyses including interaction scaling, cross-paradigm transfer, and zero-training prompting.
Writing Quality: ⭐⭐⭐⭐⭐ Problem motivation is clearly articulated and the method is rigorously formalized.
Value: ⭐⭐⭐⭐⭐ Directly advances the state of the art in deep research agents with high practical utility.