
ProPer Agents: Proactivity Driven Personalized Agents for Advancing Knowledge Gap Navigation

Conference: ACL 2026
arXiv: 2601.09926
Code: GitHub
Area: LLM Agent / Personalized Assistant / Proactivity Calibration
Keywords: Proactive Assistant, Knowledge Gap, Dimensional Modeling, Personalized Agent, Calibrated Proactivity

TL;DR

ProPer frames proactive assistance as the problem of "discovering and calibrating unspoken task dimensions." Using a Dimension Generating Agent (DGA), a post-hoc reranker, and a Response Generating Agent (RGA), it selectively fills knowledge gaps and raises GPT-5-judged response scores and win rates over the base models on medical, coding, and shopping recommendation tasks.

Background & Motivation

Background: Traditional conversational assistants follow an ask-and-respond pattern, simply answering what the user asks. Proactive assistants aim to anticipate user needs, such as supplementing risks, suggesting constraints, asking for missing information, or providing more complete recommendations.

Limitations of Prior Work: Many proactive systems either ask questions too frequently, increasing the user's burden, or perform context-based extrapolation, which often inserts unwanted suggestions at the wrong time. They typically handle "known unknowns" that the user has already expressed but lack explicit modeling for "unknown unknowns."

Key Challenge: Good proactivity is not about "saying more," but about intervening in moderation on key dimensions the user has not yet recognized. Too little intervention yields responses that miss critical risks; too much can appear intrusive, off-topic, or disrespectful of user intent.

Goal: Propose a controllable, proactive, personalized agent framework that explicitly models user knowledge gaps and uses a calibration mechanism to decide which implicit dimensions are worth including in the final response.

Key Insight: The authors introduce "dimensions" as an intermediate representation. A dimension is a structured factor to be considered when completing a task, such as input scale, risk constraints, preference trade-offs, disease severity, budget, or alternatives.

Core Idea: First, a DGA learns to generate implicit dimensions that are "currently unrealized by the user but task-relevant." Then, a reranker controls the quantity, relevance, and diversity of these dimensions. Finally, an RGA supplements these dimensions without interrupting the user's original intent.

Method

Overall Architecture

The inputs to ProPer are the current user query, conversation history, and an optional persona. The system first generates a baseline response, then uses the DGA to extract user-explicit dimensions, system-covered dimensions, and candidate implicit dimensions. Next, a reranker selects "activated dimensions" from the candidates within a budget. Finally, the RGA generates an updated response based on the original query, baseline response, explicit dimensions, and activated dimensions. The entire process decomposes proactivity into two steps: "gap discovery" and "calibrated intervention."
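The data flow described above can be sketched as a small pipeline. This is a minimal illustration only: the names `Dimension`, `DGAOutput`, `proper_respond`, and the `toy_*` stand-ins are hypothetical and not taken from the paper's code, and each stage is passed in as a plain callable so that no particular model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Dimension:
    text: str                 # free-text description of the task dimension
    confidence: float = 1.0   # DGA confidence (only meaningful for candidates)


@dataclass
class DGAOutput:
    user_explicit: List[Dimension]    # dimensions the user stated
    system_covered: List[Dimension]   # dimensions the baseline already covers
    candidates: List[Dimension]       # candidate implicit dimensions


def proper_respond(
    query: str,
    history: List[str],
    persona: Optional[str],
    base_llm: Callable[[str, List[str], Optional[str]], str],
    dga: Callable[[str, List[str], str], DGAOutput],
    rerank: Callable[[DGAOutput, int], List[Dimension]],
    rga: Callable[[str, str, List[Dimension], List[Dimension]], str],
    budget: int,
) -> Tuple[str, List[Dimension]]:
    baseline = base_llm(query, history, persona)           # 1. baseline response
    dims = dga(query, history, baseline)                   # 2. gap discovery
    activated = rerank(dims, budget)                       # 3. calibration within budget
    final = rga(query, baseline, dims.user_explicit, activated)  # 4. updated response
    return final, activated


# Toy stand-ins (assumptions for illustration) so the sketch runs end to end.
def toy_base(query, history, persona):
    return "Use a hash map."


def toy_dga(query, history, baseline):
    return DGAOutput(
        user_explicit=[Dimension("correctness")],
        system_covered=[Dimension("data structure choice")],
        candidates=[Dimension("input scale", 0.9), Dimension("memory limit", 0.4)],
    )


def toy_rerank(dims, budget):
    # Placeholder: keep the most confident candidates only.
    return sorted(dims.candidates, key=lambda d: d.confidence, reverse=True)[:budget]


def toy_rga(query, baseline, explicit, activated):
    if not activated:
        return baseline
    notes = "; ".join(d.text for d in activated)
    return f"{baseline} Also consider: {notes}."


final, activated = proper_respond(
    "How do I dedupe a list?", [], None,
    toy_base, toy_dga, toy_rerank, toy_rga, budget=1,
)
```

Keeping the stages as separate callables mirrors the split between "gap discovery" and "calibrated intervention": the reranker can be swapped or retuned without touching the generator.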

Key Designs

  1. Dimension Generating Agent (DGA):

    • Function: Generates implicit dimensions that may be missing in the current task.
    • Mechanism: The DGA is fine-tuned using dimension-level supervision from successful interaction trajectories. During training, it learns dimensions that explicitly appear in user queries and reference answers and contribute to task success; during inference, it proposes candidate implicit dimensions based on the current user state and outputs confidence scores.
    • Design Motivation: While standard LLMs often add content based on linguistic fluency when expanding responses, the DGA isolates "what should be added" into a structured prediction problem.
  2. Post-hoc Calibrated Reranker:

    • Function: Selects a few dimensions truly worth proactive intervention from the candidates.
    • Mechanism: All dimensions are encoded using BGE-small. The reranker selects a set within a budget \(k\), aiming to balance DGA confidence, alignment with unmet explicit needs, and non-redundancy among candidates. \(\lambda_1\) and \(\lambda_2\) control gap activation and diversity.
    • Design Motivation: Proactivity requires restraint. Without reranking, the system might cram all potential dimensions into the response; with budget and diversity constraints, the response provides more targeted assistance.
  3. Response Generating Agent (RGA):

    • Function: Integrates selected dimensions into the final response without rewriting the user's intent.
    • Mechanism: The RGA is a prompt-driven generation module taking the baseline response, user query, explicit dimensions, and activated dimensions as input. The prompt requires maintaining the baseline structure and prioritizing brief supplements; if an implicit dimension requires specific user information, it asks at most one clarifying question.
    • Design Motivation: A common failure of proactive assistants is being "over-eager." The RGA's constraints ensure the model remains calibrated in tone, scope, and intervention level after discovering gaps.
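One way to picture the reranker's trade-off is a greedy, budgeted selection. The sketch below is an assumption, not the paper's objective: it scores each candidate as confidence plus \(\lambda_1\) times its best cosine similarity to an unmet explicit need, minus \(\lambda_2\) times its best similarity to an already selected dimension. Toy 2-D vectors stand in for BGE-small embeddings, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rerank(
    cand_vecs: List[Sequence[float]],   # embeddings of candidate implicit dimensions
    confidences: List[float],           # DGA confidence per candidate
    gap_vecs: List[Sequence[float]],    # embeddings of unmet explicit needs
    k: int,                             # budget: maximum number of activated dimensions
    lam1: float,                        # weight on gap alignment (assumed form)
    lam2: float,                        # weight on redundancy penalty (assumed form)
) -> List[int]:
    """Greedily pick up to k candidate indices under the assumed score."""
    selected: List[int] = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            gap = max((cosine(cand_vecs[i], g) for g in gap_vecs), default=0.0)
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0
            )
            return confidences[i] + lam1 * gap - lam2 * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is distinct but less confident.
vecs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
conf = [0.9, 0.85, 0.6]
picked = rerank(vecs, conf, gap_vecs=[], k=2, lam1=0.0, lam2=1.0)  # picks indices 0 and 2
```

With `lam2=0.0` the same call would return the two near-duplicates, which is the "cram everything in" behavior the diversity term is meant to prevent.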

Loss & Training

Supervision for the DGA comes from dimension annotation: GPT-5 is used to extract user-explicit and system-explicit dimensions from raw interactions to form structured JSON training samples. The reranker does not involve training a large model but uses a fixed objective function to select subsets from candidate sets. The RGA primarily relies on domain-specific prompts, defining proactivity boundaries for medical, coding, and shopping recommendation domains respectively.
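To make the "structured JSON training samples" concrete, here is a hedged sketch of what one annotated record and its conversion to a prompt/target pair might look like. The field names, the example text, and the helper functions are all hypothetical; the paper's actual schema may differ.

```python
import json
from typing import Dict

# Hypothetical schema: the keys below are illustrative assumptions.
REQUIRED_KEYS = ("query", "user_explicit_dimensions", "system_explicit_dimensions")


def validate_sample(sample: Dict) -> None:
    """Check that a record has the assumed fields and non-empty string dimensions."""
    for key in REQUIRED_KEYS:
        if key not in sample:
            raise ValueError(f"missing field: {key}")
    for key in REQUIRED_KEYS[1:]:
        if not all(isinstance(d, str) and d.strip() for d in sample[key]):
            raise ValueError(f"{key} must be a list of non-empty strings")


def to_training_pair(sample: Dict) -> Dict[str, str]:
    """Turn an annotated interaction into a (prompt, target) pair for fine-tuning."""
    validate_sample(sample)
    target = json.dumps(
        {
            "user_explicit": sample["user_explicit_dimensions"],
            "system_explicit": sample["system_explicit_dimensions"],
        },
        ensure_ascii=False,
    )
    return {"prompt": sample["query"], "target": target}


# Made-up example record, for illustration only.
sample = {
    "query": "Which laptop should I buy for video editing?",
    "user_explicit_dimensions": ["use case: video editing"],
    "system_explicit_dimensions": ["budget", "GPU memory", "display color accuracy"],
}
pair = to_training_pair(sample)
```

Serializing the target as JSON keeps the DGA's output machine-checkable, which is what lets the downstream reranker treat dimensions as discrete items.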

Key Experimental Results

Main Results

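Evaluation covers three domains: Medical (MD), Code-Contests (Code), and PWAB (shopping). GPT-5 serves as the judge, scoring each response on a 0-5 scale (μScore is the mean score) and deciding pairwise wins (Win%). Each pair of adjacent rows below is one head-to-head comparison.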

| Comparison | MD μScore / Win% | Code μScore / Win% | PWAB μScore / Win% |
|---|---|---|---|
| Llama-8B | 2.19 / 10.52 | 1.26 / 15.51 | 2.34 / 6.83 |
| Llama-8B + ProPer | 3.86 / 89.48 | 2.13 / 84.49 | 4.06 / 93.17 |
| Qwen-8B | 2.93 / 18.73 | 2.24 / 24.76 | 3.12 / 12.50 |
| Qwen-8B + ProPer | 4.03 / 81.27 | 2.84 / 75.24 | 4.29 / 87.50 |
| GPT-4 vs Llama-ProPer | 3.28 / 29.74 | 3.19 / 68.93 | 3.46 / 23.61 |
| Llama-8B + ProPer vs GPT-4 | 3.73 / 70.26 | 2.08 / 31.07 | 4.11 / 76.39 |
| GPT-4 vs Qwen-ProPer | 3.26 / 19.26 | 3.11 / 43.63 | 3.53 / 17.40 |
| Qwen-8B + ProPer vs GPT-4 | 4.03 / 80.74 | 2.97 / 56.37 | 4.24 / 82.60 |
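As a quick sanity check on how to read the table, the sketch below copies the win rates from the rows above and confirms that each head-to-head pair sums to 100%, then computes the unweighted mean of ProPer's win rate against its own base model. The unweighted average is an assumption about aggregation; a sample-weighted figure could differ.

```python
# (opponent win %, ProPer win %) per domain, copied from the table above.
head_to_head = {
    "Llama-8B vs Llama-8B + ProPer": [(10.52, 89.48), (15.51, 84.49), (6.83, 93.17)],
    "Qwen-8B vs Qwen-8B + ProPer": [(18.73, 81.27), (24.76, 75.24), (12.50, 87.50)],
    "GPT-4 vs Llama-8B + ProPer": [(29.74, 70.26), (68.93, 31.07), (23.61, 76.39)],
    "GPT-4 vs Qwen-8B + ProPer": [(19.26, 80.74), (43.63, 56.37), (17.40, 82.60)],
}

# Every pair of win rates should be complementary (no ties reported).
complementary = all(
    abs(opp + proper - 100.0) < 0.01
    for pairs in head_to_head.values()
    for opp, proper in pairs
)

# Unweighted mean of ProPer's win rate against its own base model (six settings).
vs_base = [
    proper
    for name, pairs in head_to_head.items()
    if not name.startswith("GPT-4")
    for _, proper in pairs
]
mean_vs_base = round(sum(vs_base) / len(vs_base), 2)

# Settings where GPT-4 still wins more than half the comparisons.
gpt4_ahead = [
    (name, idx)
    for name, pairs in head_to_head.items()
    if name.startswith("GPT-4")
    for idx, (opp, _) in enumerate(pairs)
    if opp > 50.0
]
```

The only setting where GPT-4 stays ahead is the coding domain against Llama-8B + ProPer (domain index 1), which matches the observation that coding leaves less room for proactive supplementation.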

Ablation Study

| \((\lambda_1, \lambda_2)\) | Llama MD | Qwen MD | Llama Code | Qwen Code | Llama PWAB | Qwen PWAB |
|---|---|---|---|---|---|---|
| (8.0, 1.0) | 4.00 | 4.15 | 2.11 | 2.81 | 3.96 | 3.71 |
| (2.0, 0.5) | 3.75 | 4.01 | 2.12 | 2.89 | 4.06 | 3.91 |
| (0.0, 0.2) | 3.70 | 3.91 | 2.08 | 2.79 | 4.17 | 3.80 |
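The sweep can be read column by column to see which setting scores highest for each model and domain. The sketch below copies the scores from the ablation table and takes the per-column argmax; the variable names are illustrative.

```python
columns = ["Llama MD", "Qwen MD", "Llama Code", "Qwen Code", "Llama PWAB", "Qwen PWAB"]

# Mean judge scores per (lambda_1, lambda_2) setting, copied from the table above.
sweep = {
    (8.0, 1.0): [4.00, 4.15, 2.11, 2.81, 3.96, 3.71],
    (2.0, 0.5): [3.75, 4.01, 2.12, 2.89, 4.06, 3.91],
    (0.0, 0.2): [3.70, 3.91, 2.08, 2.79, 4.17, 3.80],
}

# For each column, pick the setting with the highest score.
best = {
    col: max(sweep, key=lambda cfg, i=i: sweep[cfg][i])
    for i, col in enumerate(columns)
}

# Largest spread between best and worst setting in any single column.
max_spread = round(
    max(
        max(scores[i] for scores in sweep.values())
        - min(scores[i] for scores in sweep.values())
        for i in range(len(columns))
    ),
    2,
)
```

In these numbers the medical columns favor the strongest gap activation (8.0, 1.0), the coding columns favor (2.0, 0.5), and the two shopping columns disagree, with spreads of at most 0.30 points; the best setting is therefore domain- and model-dependent rather than global.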

Multi-turn Robustness

| Domain | Multi-turn Sim Samples | ProPer Wins | Explanation |
|---|---|---|---|
| Medical | 12 | 11 | Risks, constraints, and needs emerge per turn; proactive dimensions are useful. |
| Code-Contests | 12 | 9 | Tasks are more defined; baseline is occasionally sufficient for narrow problems. |
| PWAB | 12 | 12 | Shopping preferences and trade-offs are well-suited for implicit dimension completion. |
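With only 12 dialogues per domain, it helps to see how much evidence each win count carries. The sketch below computes a one-sided exact sign test (the probability of at least that many wins out of 12 if ProPer and the baseline were equally likely to win). This analysis is not from the paper; it assumes independent dialogues and treats every non-win as a loss rather than a tie.

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """One-sided exact binomial tail: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, x) for x in range(wins, n + 1)) / 2 ** n


# Win counts copied from the multi-turn table above.
multi_turn = {"Medical": (11, 12), "Code-Contests": (9, 12), "PWAB": (12, 12)}

p_values = {domain: sign_test_p(w, n) for domain, (w, n) in multi_turn.items()}

# Pooled win rate across the 36 simulated dialogues.
total_wins = sum(w for w, _ in multi_turn.values())
total_n = sum(n for _, n in multi_turn.values())
pooled_rate = total_wins / total_n
```

Under these assumptions the tail probabilities are 13/4096 for Medical, 299/4096 for Code-Contests, and 1/4096 for PWAB, so the coding result (about 0.073) is the one that a sample of this size supports least firmly.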

Key Findings

  • ProPer outperforms the base LLM on approximately 84% of samples, which suggests the gain comes from explicit knowledge gap modeling rather than from a larger model.
  • Improvements are most significant in Medical and Shopping Recommendations, as these tasks naturally involve risks, preferences, constraints, and trade-offs; coding contest goals are more explicit, leaving less room for proactive supplementation.
  • The performance drop when removing the DGA is more significant than when removing the reranker or RGA, suggesting that "discovering the gap" is more fundamental than "how to phrase the supplement."
  • CoT prompting improves the base LLM but remains inferior to ProPer, suggesting that simply letting the model self-reflect is insufficient to reliably discover user "unknown unknowns."
  • Small-sample multi-turn experiments show no obvious drift in ProPer's proactivity; it maintains moderate intervention, at least over short trajectories.

Highlights & Insights

  • Dimensions serve as a highly useful intermediate representation. They are not as heavy as full plans and are more structured than simple keywords, making them suitable for carrying information that "the user didn't say but the task requires."
  • The paper shifts the definition of proactivity from "whether to proactively ask questions" to "selecting which gaps are worth filling," a definition closer to a real assistant experience.
  • The budget \(k\) and parameters \((\lambda_1, \lambda_2)\) turn proactivity into controllable knobs. Different domains can adopt different intervention intensities rather than a global prompt for all tasks.
  • Gains in medical cases are illustrative: ProPer doesn't just provide answers; it supplements risk frameworks, vaccine backgrounds, practical protections, and comorbidity factors, helping users build more complete problem representations.

Limitations & Future Work

  • Primary evaluation relies on a GPT-5 judge, which may favor more detailed or structured responses; user studies are needed to measure trust, perceived intrusiveness, and long-term task success.
  • Implicit dimensions are currently free text; while highly interpretable, they lack standardization, potentially leading to redundancy, inconsistent wording, and difficulties in cross-domain comparisons.
  • \((\lambda_1, \lambda_2)\) are fixed through parameter sweeps rather than dynamically learned based on user state; real systems need to adapt based on user proficiency, risk preference, and session stage.
  • Multi-turn experiments only include 12 simulated dialogues per domain; more rounds, complex interactions, and real user feedback remain to be validated.
  • ProPer does not maintain a persistent user model or utilize multimodal/environmental states; personalized proactivity remains relatively shallow.
  • Despite strong performance in the medical domain, the safety, compliance, and liability requirements for real clinical assistants go far beyond what benchmark scores capture.

Comparison with Related Work

  • vs clarification-based agents: Clarification systems primarily handle needs the user knows but hasn't clarified; ProPer focuses more on task dimensions the user has not yet realized.
  • vs context extrapolation agents: Many proactive agents extrapolate from environmental states or historical behavior; ProPer explicitly generates dimensions and controls intervention via a reranker.
  • vs CoT/self-refine: CoT allows models to reflect on response flaws but lacks an explicit representation of knowledge gaps; ProPer's DGA acts more like an inspectable gap generator.

Rating

  • Novelty: ⭐⭐⭐⭐ Using dimensions to model unknown unknowns and decomposing proactivity calibration into DGA/reranker/RGA is a clear and effective approach.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers three domains, component ablations, and multi-turn robustness, though human evaluation and long-term interactions are limited.
  • Writing Quality: ⭐⭐⭐⭐ Concept definitions are complete, and the methodology diagrams and Research Questions (RQs) are well-organized.
  • Value: ⭐⭐⭐⭐ Highly insightful for building non-intrusive yet helpful personalized agents.