DynaAct: Large Language Model Reasoning with Dynamic Action Spaces¶
Conference: NeurIPS 2025 · arXiv: 2511.08043 · Code: https://github.com/zhaoxlpku/DynaAct · Area: LLM Reasoning / Decision Optimization · Keywords: Dynamic Action Space, Submodular Function, MCTS, LLM Reasoning, Subset Selection
TL;DR¶
DynaAct frames action space construction in LLM reasoning as a subset selection problem, dynamically assembling a compact action space at each step via a submodular function that balances utility and diversity. It outperforms rStar, RAP, and other baselines on six benchmarks, surpassing rStar by 6.8 percentage points on MATH-500.
Background & Motivation¶
Background: Complex LLM reasoning is commonly formulated as an MDP: a state space and action space are defined, and search strategies such as MCTS are used to find optimal reasoning paths. RAP employs automatically generated sub-questions as actions, while rStar manually defines 5 action types.
Limitations of Prior Work: (a) Manually defined action spaces (e.g., rStar's 5 actions) are overly domain-specific and lack scalability; (b) Automatically generated action spaces (e.g., RAP's on-the-fly sub-questions) exhibit high redundancy, making exhaustive search computationally expensive.
Key Challenge: A high-quality action space must simultaneously satisfy two conflicting properties—scalability (automatically learned from data, generalizable across domains) and compactness (retaining only a small number of high-value candidate actions at each step).
Goal: To automatically construct an action space that is both general and compact for LLM reasoning.
Key Insight: Action space construction is recast as a subset selection problem, leveraging the diminishing marginal returns property of submodular functions to ensure that selected subsets balance utility and diversity.
Core Idea: A submodular function combined with a greedy algorithm dynamically selects the optimal 5 actions at each step from a large surrogate action space.
Method¶
Overall Architecture¶
The framework proceeds in three stages: (1) Surrogate Action Space Estimation—general reasoning patterns (observation sketches) are extracted from a diverse corpus to form the complete action space \(\mathcal{A}\); (2) Submodular Function Definition—a scoring function \(F\) that balances utility and diversity is defined, with an embedding model trained via Q-learning; (3) Dynamic Action Space Construction—at each step, a greedy algorithm selects \(m=5\) actions from \(\mathcal{A}\) to form \(\mathcal{A}_t\), which is then evaluated by MCTS to determine the optimal action.
Key Designs¶
- Surrogate Action Space Estimation:
- Function: Automatically extracts reusable reasoning patterns from multi-domain problem corpora.
- Mechanism: The Open-Platypus corpus (24,652 problems) is partitioned into \(k=2500\) groups; Llama-3.1-70B extracts an observation sketch from each group. After merging and deduplication, 40,822 observations constitute \(\mathcal{A}\).
- Design Motivation: Eliminates manual definition by automatically discovering cross-domain, generalizable reasoning operations from data.
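As a rough illustration (not the paper's code), the extraction step can be sketched as follows; `call_llm` is a hypothetical stand-in for the Llama-3.1-70B prompt that turns a group of problems into observation sketches, and deduplication here is naive exact matching:

```python
def build_surrogate_space(problems, k, call_llm):
    """Partition a corpus into k groups, extract observation sketches per
    group via an LLM, then merge and deduplicate into the space A."""
    groups = [problems[i::k] for i in range(k)]  # simple round-robin partition
    sketches = []
    for group in groups:
        sketches.extend(call_llm(group))  # one list of sketches per group
    # Merge + deduplicate (exact match after normalization).
    seen, space = set(), []
    for s in sketches:
        key = s.strip().lower()
        if key not in seen:
            seen.add(key)
            space.append(s)
    return space
```

The paper reports 24,652 problems, k = 2,500 groups, and 40,822 observations after deduplication; the grouping and dedup heuristics above are illustrative only.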
- Submodular Function Design:
- Function: Defines the scoring function \(F(\mathcal{A}_t; s_t) = \alpha f_\text{util}(\mathcal{A}_t; s_t) + \beta f_\text{div}(\mathcal{A}_t)\).
- Mechanism:
- Utility term: \(f_\text{util} = \log(\sum_{a \in \mathcal{A}_t} \exp(\mathbf{e}(s_t)^T \mathbf{e}(a)))\); LogSumExp guarantees submodularity.
- Diversity term: \(f_\text{div} = \sum_{a_i} \min_{a_j \neq a_i} (1 - \mathbf{e}(a_i)^T \mathbf{e}(a_j))\); encourages maximal dissimilarity in the embedding space.
- Weights: \(\alpha=0.9\), \(\beta=0.1\).
- Design Motivation: The utility term ensures that selected actions are relevant to the current reasoning state; the diversity term prevents redundancy.
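A minimal sketch of the scoring function above, assuming unit-normalized embeddings represented as plain Python lists (function and variable names are illustrative, not from the paper's code):

```python
import math

def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def f_util(state_emb, subset):
    # LogSumExp over state-action similarities; the log of a modular
    # (additive) set function is monotone submodular.
    return math.log(sum(math.exp(_dot(state_emb, e)) for e in subset))

def f_div(subset):
    # Sum over actions of the cosine distance to the nearest other
    # selected action; larger values mean a more spread-out subset.
    total = 0.0
    for i, ei in enumerate(subset):
        dists = [1.0 - _dot(ei, ej) for j, ej in enumerate(subset) if j != i]
        total += min(dists) if dists else 0.0
    return total

def score(state_emb, subset, alpha=0.9, beta=0.1):
    # F(A_t; s_t) = alpha * f_util + beta * f_div, with the paper's weights.
    return alpha * f_util(state_emb, subset) + beta * f_div(subset)
```

For example, with state embedding `[1, 0]`, adding an orthogonal action to the subset raises both terms, so the combined score strictly increases.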
- Embedding Model Training (Q-learning):
- Function: Trains the embedding function \(\mathbf{e}(\cdot)\) such that \(\mathbf{e}(s_t)^T \mathbf{e}(a)\) approximates the Q-value.
- Mechanism: Observation sketches serve as demonstration data. The loss is defined as \(\mathcal{L} = (\mathbf{e}(s_t)^T \mathbf{e}(a) - (r + \log\sum_{a'} \exp(\mathbf{e}(s_{t+1})^T \mathbf{e}(a'))))^2\), where \(r=1\) for correct actions and \(r=0\) otherwise.
- Implementation: Llama-3.2-1B serves as the embedding backbone, trained on 83,083 state–action pairs.
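The TD objective can be written down directly from the loss formula. This sketch only computes the squared error for a single transition; gradient updates, batching, and terminal-state handling are omitted, and `td_loss` is an illustrative name rather than the paper's API:

```python
import math

def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def td_loss(e_s, e_a, r, e_s_next, next_action_embs):
    """Squared TD error with a soft-Q (LogSumExp) bootstrap target:
    (e(s)^T e(a) - (r + log sum_a' exp(e(s')^T e(a'))))^2."""
    q = _dot(e_s, e_a)
    target = r + math.log(
        sum(math.exp(_dot(e_s_next, e)) for e in next_action_embs)
    )
    return (q - target) ** 2
```

In the paper, rewards are binary (r = 1 for a correct action, 0 otherwise) and the embeddings come from a Llama-3.2-1B backbone trained on 83,083 state-action pairs.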
- Greedy Action Selection:
- Function: Selects \(m=5\) actions from \(\mathcal{A}\) at each step.
- Mechanism: Exploiting submodularity, the greedy algorithm guarantees a \((1-1/e)\) approximation ratio with complexity \(O(m^2|\mathcal{A}|)\).
- Implementation Optimization: \(\mathbf{e}(a)\) can be precomputed and cached; only \(\mathbf{e}(s_t)\) needs to be encoded online.
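Putting the pieces together, a naive (non-lazy) greedy loop matches the stated \(O(m^2|\mathcal{A}|)\) cost: each of the \(m\) rounds rescores every remaining candidate, and each rescoring touches a subset of size at most \(m\). A sketch under the same assumptions as above (unit-normalized list embeddings; names are illustrative):

```python
import math

def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def _score(state_emb, subset, alpha=0.9, beta=0.1):
    # F = alpha * LogSumExp utility + beta * nearest-neighbor diversity.
    util = (math.log(sum(math.exp(_dot(state_emb, e)) for e in subset))
            if subset else 0.0)
    div = 0.0
    for i, ei in enumerate(subset):
        dists = [1.0 - _dot(ei, ej) for j, ej in enumerate(subset) if j != i]
        div += min(dists) if dists else 0.0
    return alpha * util + beta * div

def greedy_select(state_emb, action_embs, m=5):
    """Greedily add the action with the largest marginal gain in F.
    For monotone submodular F this achieves a (1 - 1/e) approximation."""
    chosen, pool = [], list(range(len(action_embs)))
    for _ in range(min(m, len(pool))):
        current = _score(state_emb, [action_embs[i] for i in chosen])
        gains = []
        for i in pool:  # rescore every remaining candidate
            cand = [action_embs[j] for j in chosen] + [action_embs[i]]
            gains.append((_score(state_emb, cand) - current, i))
        _, best = max(gains)
        chosen.append(best)
        pool.remove(best)
    return chosen
```

Note how diversity shapes the result: with state `[1, 0]` and candidates `[0, 1]`, `[1, 0]`, `[0.6, 0.8]`, the first pick is the highest-utility action `[1, 0]`, and the second pick trades a little utility for spread. In practice the action embeddings would be precomputed and cached, as the text notes, so only the state embedding is encoded online.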
Search Strategy¶
- MCTS performs 16 rollouts; the world model is Llama-3.1-8B-Instruct.
- Only the embedding model requires training; the base LLM remains frozen throughout.
Key Experimental Results¶
Main Results¶
| Type | Benchmark | Zero-shot CoT | SC@maj16 | RAP | rStar | DynaAct |
|---|---|---|---|---|---|---|
| General | MMLU | 68.87 | 69.66 | 69.46 | 68.61 | 70.22 |
| General | MMLU-Pro | 43.45 | 49.36 | 48.70 | 48.81 | 51.40 |
| Reasoning | GPQA | 31.82 | 34.34 | 38.89 | 36.87 | 39.39 |
| Reasoning | ARC-C | 81.06 | 80.63 | 85.41 | 86.43 | 88.31 |
| Math | GSM8K | 76.12 | 86.66 | 87.79 | 87.11 | 89.16 |
| Math | MATH-500 | 45.40 | 52.00 | 51.60 | 54.20 | 61.00 |
DynaAct achieves state-of-the-art results across all six benchmarks, outperforming rStar by 6.8 percentage points on MATH-500.
Ablation Study¶
| Configuration | ARC-C | MATH-500 | Note |
|---|---|---|---|
| DynaAct (full) | 88.31 | 61.00 | Full model |
| - util | 87.63 | 53.40 | Remove utility term; MATH-500 drops 7.6 points |
| - div | 86.52 | 53.80 | Remove diversity term; drops 7.2 points |
| - q-learning | 87.80 | 55.80 | No embedding training; drops 5.2 points |
| - submodular | 85.15 | 52.00 | Remove submodular function; drops 9.0 points |
Efficiency Comparison¶
| Method | Relative Time ↓ | Accuracy ↑ |
|---|---|---|
| DynaAct | 1.00 | 61.00 |
| rStar | 0.95 | 54.20 |
| RAP | 1.12 | 51.60 |
DynaAct incurs only 5% more latency than rStar while scoring 6.8 points higher on MATH-500, and is faster than RAP, whose on-the-fly sub-question generation introduces substantial overhead.
Key Findings¶
- The submodular function is central to performance: removing it causes a 9-point drop on MATH-500, demonstrating that a dynamic, compact action space is far more critical than random or fixed alternatives.
- Both utility and diversity terms are indispensable: removing either reduces MATH-500 accuracy from 61% to approximately 53%.
- Compactness confers clear advantages: even with \(m=5\), DynaAct improves consistently as rollout count increases; RAP shows negligible gains with \(m=5\) or \(m=10\) and requires \(m=15\) to exhibit improvement.
- Critical-step coverage: DynaAct achieves 0.63 vs. rStar's 0.47, confirming that its action selection is meaningfully more relevant.
Highlights & Insights¶
- Formalizing action space construction as subset selection: This is an underexplored research problem, orthogonal to policy learning and reward modeling, and opens a new direction for MDP-based LLM reasoning.
- Elegant submodular function design: The LogSumExp utility term naturally satisfies submodularity; using the sum of nearest-neighbor distances in the embedding space as a diversity measure is concise and effective.
- Precomputed embeddings with greedy linear complexity: Action embeddings can be cached, requiring only online encoding of the current state, thereby introducing negligible additional latency.
Limitations & Future Work¶
- The surrogate action space is extracted from Open-Platypus, introducing domain bias; coverage may be insufficient for specific vertical domains such as programming or scientific experimentation.
- Substantial redundancy remains among the 40,822 observations; \(\mathcal{A}\) itself could be further compressed.
- Extracting observation sketches relies on Llama-3.1-70B, which is costly in resource-constrained, smaller-model deployments.
- The current design uses a single embedding function \(\mathbf{e}(\cdot)\) with limited representational capacity; multi-head or hierarchical embeddings warrant exploration.
Related Work & Insights¶
- vs. rStar: rStar's manually defined 5-action space lacks scalability and underperforms DynaAct on MATH-500 and general-purpose tasks.
- vs. RAP: RAP generates sub-questions on the fly as the action space, resulting in high redundancy and large latency; DynaAct's pre-extraction combined with dynamic selection is more efficient.
- vs. test-time scaling: DynaAct focuses on action space quality rather than search strategy or data quality, constituting an orthogonal contribution.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Modeling action space construction as submodular subset selection is a novel and theoretically grounded perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Six benchmarks, complete ablations, efficiency comparisons, compactness analysis, and utility analysis are all provided.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear, with strong integration between theory and experiments.
- Value: ⭐⭐⭐⭐ Provides an important infrastructural contribution to MDP-based LLM reasoning.