Towards Trustworthy Multi-Turn LLM Agents via Behavioral Guidance¶
Conference: AAAI 2026 · arXiv: 2512.11421 · Code: None · Area: LLM Agents
Keywords: multi-turn interaction, behavioral guidance, trustworthy agent, task profiler, reasoning module, generation module
TL;DR¶
This paper proposes a task completion framework in which a Task Profiler, a Reasoning Module, and a Generation Module co-evolve to enable verifiable and reliable behavioral guidance for LLM agents in multi-turn interactive environments.
Background & Motivation¶
Background: LLM agents have made progress on task completion through mechanisms such as memory, tool use, and reflection (e.g., ReAct, Reflexion, Toolformer), but these mechanisms are largely implicit and difficult to inspect or steer.
Limitations of Prior Work: In multi-turn tasks, agents lack reliability and verifiability — their reasoning processes cannot be audited, and the generated behaviors cannot be guaranteed to consistently satisfy task constraints. Different tasks demand different styles of behavioral guidance (rapid local responses vs. long-horizon cumulative constraints), and LLM agents tend to drift across inconsistent reasoning patterns.
Key Challenge: Agents must flexibly handle diverse task structures while simultaneously maintaining verifiable reasoning consistency and reliable constraint adherence — a fundamental tension between flexibility and controllability.
Key Insight: Tasks are modeled in a reinforcement learning formulation (observation–action–reward loop), and a three-tier architecture is designed: a Task Profiler meta-learns structural task features and selects strategies; a Reasoning Module extracts reusable condition–action rules from historical trajectories; and a Generation Module ensures that outputs always satisfy all constraints. All three components co-evolve across multiple execution epochs.
Method¶
Overall Architecture¶
The framework augments an RL prompting backbone with three components: (1) a Task Profiler that analyzes environment variables and selects reasoning and generation strategies; (2) a Reasoning Module that learns observation–action mapping rules from past trajectories and stores them in a Rule Bank; and (3) a Generation Module that selects verification or deterministic generation strategies according to task complexity.
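Since no code is released (Code: None above), here is a minimal Python sketch of how a single trajectory might flow through the three components. Every interface name here (`current_strategy`, `apply_rules`, `produce`, the gym-style `env`) is a hypothetical stand-in, not the authors' API.

```python
def run_trajectory(env, profiler, reasoner, generator, llm):
    """Roll out one observation-action-reward trajectory through the
    three-tier framework. All interfaces are hypothetical sketches."""
    # The Task Profiler has already selected strategies for this task.
    strategy = profiler.current_strategy()
    observation, done = env.reset(), False
    trajectory = []
    while not done:
        # Reasoning Module: reuse validated condition-action rules that match.
        guidance = reasoner.apply_rules(observation, strategy)
        # Generation Module: produce a validity-checked action.
        action = generator.produce(observation, guidance, strategy, llm)
        observation, reward, done = env.step(action)
        trajectory.append((observation, action, reward))
    return trajectory
```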
Key Designs¶
- **Task Profiler**
    - Functions as a cognitive strategy engine (LLM-based) that analyzes the structural characteristics of the task environment.
    - Outputs task features including temporal dependency type (sequential vs. cumulative), constraint intensity, and suitable reasoning and generation strategies.
    - Runs for the first time after a warm-up period (epoch \(k\)) and is refreshed at the end of each subsequent epoch.
    - Acts as a meta-learner that determines how to generate behaviors rather than directly solving the task.
- **Reasoning Module**
    - Analyzes high-reward trajectories and extracts rules of the form "if [observation condition] then [optimal action]."
    - Rules are stored in the Rule Bank, accumulating across trajectories and epochs with associated success rates and usage histories (see the sketch after this list).
    - Adapts to the Task Profiler's guidance: sequential tasks focus on single-step transition reasoning, while cumulative tasks aggregate long-horizon information.
    - Rules stabilize after multi-round trajectory validation, transitioning from ad hoc reasoning to generalized, consistent reasoning.
    - When familiar conditions recur, validated rules can be applied directly.
- **Generation Module**
    - Selects appropriate generation strategy tools based on Task Profiler guidance.
    - For lightly constrained tasks: directly checks the validity of the native LLM output.
    - For heavily constrained tasks (e.g., Wordle, Sudoku): employs deterministic enumeration or guided sampling.
    - Performs validity checking before each action is submitted and automatically falls back to deterministic enumeration upon violation (illustrated in the same sketch below).
    - Ensures every output is verifiably valid with respect to environmental feedback and reasoning rules.
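No reference implementation is available, but the Rule Bank and the validate-then-fallback generation step can be pictured with a minimal Python sketch. Everything below is an assumption about plausible structure: `Rule`, `RuleBank`, `propose_action`, and the helper callables (`llm_generate`, `is_valid`, `enumerate_valid`) are hypothetical names, and simple substring matching stands in for the LLM-based condition matching the paper describes.

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    """A condition-action rule distilled from high-reward trajectories."""
    condition: str    # e.g. "feedback says the last guess was too low"
    action: str       # e.g. "guess the midpoint of the remaining interval"
    successes: int = 0
    uses: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.uses if self.uses else 0.0

@dataclass
class RuleBank:
    """Accumulates rules across trajectories and epochs, with usage stats."""
    rules: list[Rule] = field(default_factory=list)

    def match(self, observation: str):
        # Substring matching is a stand-in for the LLM-based condition
        # matching described in the paper.
        candidates = [r for r in self.rules if r.condition in observation]
        return max(candidates, key=lambda r: r.success_rate, default=None)

def propose_action(observation, rule_bank, llm_generate, is_valid, enumerate_valid):
    """Validity-checked generation with a deterministic-enumeration fallback."""
    rule = rule_bank.match(observation)
    action = rule.action if rule else llm_generate(observation)
    if not is_valid(action, observation):
        # Constraint violated: fall back to enumerating the actions that
        # remain valid given the environment's feedback so far.
        action = next(iter(enumerate_valid(observation)))
    return action
```

Under these assumptions, the final validity check is what yields the paper's near-zero invalid guess rate on Wordle: no action reaches the environment without passing `is_valid`, and the enumeration fallback always offers a valid action as long as one exists.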
Loss & Training¶
Rather than conventional training, the framework adopts iterative execution based on RL prompting. Each epoch consists of \(T\) trajectories, each a complete observation–action–reward sequence. GPT-4.1-mini is used as the underlying LLM (intentionally a non-reasoning model, to isolate the framework's contribution). Evaluation spans 30 epochs × 20 trajectories per epoch, reported with 95% confidence intervals.
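Concretely, the execution schedule might look like the sketch below, building on the `run_trajectory` sketch above. `extract_rules`, `refresh`, and `high_reward` are hypothetical names, and `warmup` stands in for the paper's unspecified warm-up epoch \(k\).

```python
def run_framework(env, profiler, reasoner, generator, llm,
                  n_epochs=30, n_trajectories=20, warmup=1):
    """Iterative execution: no weight updates, only prompt-level co-evolution.
    `warmup` stands in for the paper's unspecified warm-up epoch k."""
    for epoch in range(n_epochs):
        episodes = [run_trajectory(env, profiler, reasoner, generator, llm)
                    for _ in range(n_trajectories)]
        # End-of-epoch co-evolution:
        reasoner.extract_rules(high_reward(episodes))  # distill condition-action rules
        if epoch + 1 >= warmup:
            profiler.refresh(episodes)  # re-profile the task, reselect strategies
        # The Generation Module adapts implicitly: on the next epoch it
        # consumes the updated strategies and rules.
```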
Key Experimental Results¶
Main Results¶
| Task / Metric | Baseline (no framework) | Baseline + ICL | Guided Agent | Effect |
|---|---|---|---|---|
| GmN mean reward (post-stabilization) | ~15–20 | ~15–20 | ~45–50 | 2–3× improvement |
| GmN reward trend | no improvement | no improvement | steady increase and convergence | continuous learning |
| Wordle task completion rate | low | slight improvement | significant improvement | constraint adherence |
| Wordle invalid guess rate | high | medium | near zero | ensured by Generation Module |
Ablation Study¶
| Component | GmN Effect | Wordle Effect |
|---|---|---|
| RL prompting backbone only | baseline level, no learning curve | frequent constraint violations |
| + Task Profiler | improved strategy selection | correctly identifies cumulative constraint structure |
| + Reasoning Module | rules progressively stabilize, reward increases | elimination and filtering rules accumulate |
| + Generation Module | full effect | invalid outputs reduced to near zero |
Key Findings¶
- Baseline agents (with or without ICL) show no sustained improvement across all 30 epochs, demonstrating that mere exposure to past trajectories is insufficient for reliable behavior.
- The Guided Agent exhibits performance drops at epochs 8, 11, and 13, corresponding to exploration–exploitation switching during new rule formation, consistent with RL theory.
- Rule stabilization after epoch 15 marks the transition from ad hoc reasoning to generalized, consistent reasoning.
- The reasoning consistency ratio (proportion of correctly applied learned rules) rises steadily across epochs.
- In Wordle, the deterministic enumeration fallback mechanism guarantees zero invalid outputs even when Reasoning Module estimates are imprecise.
Highlights & Insights¶
- The three-component co-evolution design is conceptually profound: the Task Profiler refines its understanding across epochs, the Reasoning Module accumulates better rules, and the Generation Module adapts to updated reasoning states, forming a positive feedback loop.
- The dual objective of verifiability + reliability is precisely defined: reasoning can be inspected and audited (the rule bank is auditable), and behaviors consistently satisfy constraints (enforced by the Generation Module).
- The meta-learning role of the Task Profiler is novel: rather than solving the task directly, it determines how the task should be solved — analogous to the concept of flexible human learning in cognitive science.
- The natural evolution of rules from overfitting to generalization is compelling and consistent with the exploration–exploitation theory in RL.
Limitations & Future Work¶
- Validation is limited to two simple game tasks (GmN and Wordle); applicability to real-world multi-turn agent scenarios remains unverified.
- The Task Profiler is currently implemented as a simple LLM prompt; more complex tasks may require data-driven analysis strategies.
- Details of Rule Bank management (registration, testing, filtering) are not sufficiently elaborated in the main text.
- GPT-4.1-mini is used to isolate the framework's contribution, but the effect of combining the framework with stronger reasoning models remains unknown.
Related Work & Insights¶
| Aspect | ReAct / Reflexion | Ours |
|---|---|---|
| Reasoning approach | implicit chain-of-thought | explicit condition–action rule bank |
| Verifiability | reasoning process not auditable | rules are inspectable; applications are traceable |
| Constraint assurance | relies on LLM self-regulation | Generation Module enforces validation + fallback mechanism |
| Cross-epoch learning | no persistent memory | Rule Bank accumulates and validates across epochs |
vs. Constitutional AI: Constitutional AI embeds behavioral constraints by retraining the model; the proposed framework learns and applies rules dynamically at runtime without modifying the underlying model, offering greater deployment flexibility.
Rating¶
| Dimension | Score | Rationale |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ | The three-component co-evolution framework is distinctively designed; the meta-learning role of the Task Profiler is novel. |
| Technical Depth | ⭐⭐⭐ | The framework design is sound, but individual component implementations are relatively straightforward (LLM prompt-based). |
| Experimental Thoroughness | ⭐⭐⭐ | Only two simple games are evaluated; analysis is detailed but task diversity is insufficient. |
| Practical Value | ⭐⭐⭐⭐ | The framework concept is applicable to all multi-turn agent scenarios; verifiability is a core requirement for industrial deployment. |