Aligning VLM Assistants with Personalized Situated Cognition

Conference: ACL 2025
arXiv: 2506.00930
Code: https://github.com/NLPGM/PCogAlign
Area: Multimodal VLM
Keywords: personalized alignment, situated cognition, Role-Set, VLM assistant, reward model

TL;DR

Drawing on the sociological concept of the "Role-Set" to characterize user diversity, this paper proposes the PCogAlign framework. It uses a cognition-aware, action-oriented reward model to select personalized responses for training VLM assistants, so that users in different roles receive advice tailored to their needs within the same visual scene.

Background & Motivation

Background: VLMs have achieved general alignment goals such as instruction following, hallucination reduction, and value alignment through techniques like visual instruction tuning, visual preference optimization, and safety alignment. Currently, VLMs generate "one-size-fits-all" uniform responses for all users.

Limitations of Prior Work: Even when facing the same visual scene, people from different backgrounds have different cognitions and expectations. For example, seeing a broken swing, a child expects comfort and safety guidance, while a repairman expects professional repair advice. General alignment fails to satisfy such personalized needs.

Key Challenge: Human diversity stems from countless factors (age, socioeconomic status, etc.) that are almost impossible to fully operationalize in experiments, so a reasonable simplification is required. In addition, there is no clear standard for judging whether personalized alignment has succeeded.

Goal: (1) How to define diverse individuals? (2) How to evaluate whether personalized alignment is achieved? (3) How to train VLMs to generate personalized responses?

Key Insight: Drawing on the concept of the "Role-Set" from Merton's role-set theory, a combination of "Role@Location" entries (e.g., "Teacher@School, Parent@Home") is used to succinctly characterize individual diversity. Alignment is judged successful if the response helps the individual take the optimal action.

Core Idea: Use Role-Sets to define individual differences, and utilize an action-oriented reward model to select the optimal personalized responses for alignment training.

Method

Overall Architecture

The input is a triplet \(s = (RS, v, q)\) (Role-Set, visual scene, user query), and the output is a personalized response \(r\). The optimization objective is: \(\theta^* = \arg\max_\theta \mathbb{E}_{s \sim S_{\text{train}}} P_A(a^* | r=f_\theta(s), c)\), which aims to maximize the probability of the user taking the optimal action \(a^*\) given the response generated by the VLM.

The PCogAlign framework consists of three steps: (a) estimating the user's situated cognition \(c = C(s)\) and optimal action \(a^*\); (b) sampling multiple candidate personalized responses through cooperative agents; (c) selecting the optimal response for alignment training using a cognition-aware action-oriented reward model.
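The three steps can be sketched as one selection routine. This is a minimal illustration under stated assumptions, not the authors' implementation: `Sample`, `pcogalign_select`, and the four callables (`estimate`, `key_g`, `res_g`, `reward`) are hypothetical names standing in for VLM prompts and the trained reward model, and feeding the previous draft back into KeyG is an assumption about how the iteration proceeds.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Sample:
    """One input triplet s = (RS, v, q)."""
    role_set: Tuple[str, ...]  # e.g. ("Teacher@School", "Parent@Home")
    scene: str                 # text stand-in for the image v (assumption)
    query: str                 # user query q


def pcogalign_select(
    s: Sample,
    estimate: Callable[[Sample], Tuple[str, str]],
    key_g: Callable[[Sample, str, str, str], str],
    res_g: Callable[[Sample, str], str],
    reward: Callable[[Sample, str, str, str], float],
    n: int = 4,
) -> str:
    """Return the best of n candidate responses for sample s."""
    if n < 1:
        raise ValueError("n must be at least 1")

    # Step (a): estimate situated cognition c and optimal action a*.
    cognition, best_action = estimate(s)

    # Step (b): alternate KeyG (plan keypoints) and ResG (write a response).
    candidates: List[str] = []
    draft = ""  # no draft exists before the first round
    for _ in range(n):
        keypoints = key_g(s, cognition, best_action, draft)
        draft = res_g(s, keypoints)
        candidates.append(draft)

    # Step (c): Best-of-N selection with the cognition-aware reward.
    return max(candidates, key=lambda r: reward(s, cognition, best_action, r))
```

Because \(c\) and \(a^*\) do not depend on \(\theta\), the call to `estimate` could be cached once per sample; the selected response then serves as the SFT target.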

Key Designs

  1. Role-Set and Benchmark Construction (PCogAlignBench):

    • Function: Defines 8 social locations (home, community, museum, airport, store, school, hospital, restaurant), with 3-5 roles per location (a total of 32 roles), constructing 20 different Role-Sets through compositional constraints.
    • Mechanism: Divides the 20 Role-Sets into two subsets, LS1 and LS2 (partitioned by different location sets), supporting cross-role-set generalization testing (e.g., LS1→LS2). A hierarchical collection strategy (Role-Set -> scene type -> scene phrase -> scene description -> search image) is used to ensure data diversity.
    • Design Motivation: Simplifies the modeling of human diversity while maintaining sufficient expressiveness. Dividing into two subsets is designed to test the model's generalization ability to unseen Role-Sets.
  2. Cognition and Action Estimation (Step a):

    • Function: Uses the VLM itself to estimate the user's situated cognition \(c = C(s)\) and optimal action \(a^*\) in specific visual scenes.
    • Mechanism: Prompts the VLM using in-context learning with human-written exemplars. Cognition encompasses three layers: cognition of the visual scene state, physical/mental state, and appropriate action. Action is divided into external physical behaviors and internal psychological feelings.
    • Design Motivation: \(c\) and \(a^*\) are independent of the VLM parameters \(\theta\) and can be pre-computed.
  3. Cooperative Agent Sampling for Personalized Responses (Step b):

    • Function: Designs two cooperative agents, KeyG (Keypoint Generator) and ResG (Response Generator), to iteratively sample candidate responses.
    • Mechanism: KeyG generates core keypoints on "how to consider user cognition and enhance physical/mental states" based on user cognition and desired actions, and ResG regenerates responses according to these keypoints. This process is iterated to collect \(N\) candidate responses.
    • Design Motivation: Direct generation of personalized responses by VLMs has unstable quality; dual-agent collaboration improves personalization by planning keypoints before generation.
  4. Cognition-Aware Action-Oriented Reward Model (Step c):

    • Function: Trains a reward model to evaluate response quality, choosing the optimal response using a Best-of-N strategy.
    • Mechanism: Collects preference data using "negative Role-Sets": for individual I1 (Teacher@School), a response generated for I2 (Student@School) serves as a negative sample that is inappropriate for I1. Actions (physical behavior + psychological feelings) are incorporated into the preference pairs used to train the cognition-aware reward model.
    • Design Motivation: General preference data is not applicable to personalized scenarios; using role-swapping to automatically construct negative samples avoids large-scale manual annotation.
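The Role-Set representation and the negative Role-Set trick from the designs above can be illustrated with a small data-construction sketch. All names here (`RoleSet`, `parse_role`, `PreferencePair`, `build_negative_pairs`) are hypothetical, and the sketch omits the action fields that the paper adds to each pair; it only shows how role-swapping yields chosen/rejected pairs without manual labels.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A Role-Set is a tuple of "Role@Location" entries.
RoleSet = Tuple[str, ...]


def parse_role(entry: str) -> Tuple[str, str]:
    """Split "Teacher@School" into ("Teacher", "School")."""
    role, location = entry.split("@", 1)
    return role, location


@dataclass(frozen=True)
class PreferencePair:
    role_set: RoleSet           # the individual the pair is judged for
    chosen: str                 # response written for this individual
    rejected: str               # response written for another individual
    negative_role_set: RoleSet  # who the rejected response was written for


def build_negative_pairs(responses: Dict[RoleSet, str]) -> List[PreferencePair]:
    """For one scene and query, treat each individual's own response as
    chosen and every other individual's response as rejected."""
    pairs: List[PreferencePair] = []
    for role_set, own in responses.items():
        for other_set, other in responses.items():
            if other_set != role_set:
                pairs.append(PreferencePair(role_set, own, other, other_set))
    return pairs
```

With two individuals such as `("Teacher@School",)` and `("Student@School",)`, this produces two pairs, one per direction of the swap.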

Loss & Training

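After the optimal response \(r^*\) is selected, the VLM is trained with the standard SFT loss, i.e. the negative log-likelihood of \(r^*\) given the input: \(\theta^* = \arg\min_\theta \mathbb{E}_{s \sim S_{\text{train}}} \left[ -\log P_\theta(r^* \mid s) \right]\), where \(P_\theta(r^* \mid s)\) is the probability that the VLM \(f_\theta\) assigns to \(r^*\).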
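For a single training sample, the negative log-likelihood of \(r^*\) given \(s\) can be computed as below. This sketch assumes the usual autoregressive factorization (the probability of the response is the product of per-token probabilities), which the summary does not state explicitly; `sft_nll` is a hypothetical helper name.

```python
import math
from typing import Sequence


def sft_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one target response.

    token_probs[t] is the probability the model assigns to the t-th token
    of r*, given the input s and the preceding target tokens.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each token probability must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)
```

Averaging this quantity over the training set gives the expectation that the objective minimizes; in practice a deep learning framework's cross-entropy loss over the response tokens computes the same value.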

Key Experimental Results

Main Results

| Method | LS1→LS1 P.Score | LS1→LS2 P.Score | LS2→LS1 P.Score | LS2→LS2 P.Score |
|---|---|---|---|---|
| Standard Prompt | Baseline | Baseline | Baseline | Baseline |
| RS Prompt (Adding role descriptions) | Gain | Gain | Gain | Gain |
| PCogAlign | Optimal | Optimal | Optimal | Optimal |

  • PCogAlign outperforms comparison methods across all four settings.
  • Cross-role-set generalization (e.g., LS1→LS2) maintains a leading edge despite some performance decay.

Ablation Study

| Configuration | Effect | Explanation |
|---|---|---|
| Full PCogAlign | Optimal | Full framework |
| w/o Cognition Estimation | Decline | Lacking situated cognition leads to insufficiently personalized responses |
| w/o Cooperative Sampling | Decline | Directly generated responses lack adequate personalization |
| w/o Action-oriented Reward | Decline | Fails to select responses that truly help user action |
| Replacement with General Reward Model | Significant Decline | General preferences cannot capture role specificity |

Key Findings

  • High agreement between automatic and human evaluation: On 200 samples, automatic evaluation aligns with human evaluation in 88% of cases, validating the reliability of LLM-as-judge in personalized evaluation.
  • Generalization across Role-Sets: Significant improvements are still observed in the LS1→LS2 setting (completely different Role-Sets), suggesting that the framework has learned a generalizable mapping from roles to behaviors.
  • RSA (Role-Set Awareness) shows the largest improvement among the five evaluation dimensions: This suggests that the framework's most significant contribution is enabling the VLM to truly "see" the user's role and adjust its responses accordingly.

Highlights & Insights

  • Clever adoption of sociological concepts: Modeling human diversity using the Role-Set theory is both theoretically grounded and operationalizable. This interdisciplinary approach is highly valuable.
  • Constructing preference data from negative Role-Sets: Automatically constructing preference data from the intuition that "a response suitable for Role A is unsuitable for Role B" eliminates the need for large-scale manual annotation. This idea could transfer to other personalization tasks.
  • Action-oriented evaluation perspective: Rather than evaluating "whether the response is personalized", it evaluates "whether the response helps the user take better actions". This outcome-oriented evaluation approach is closer to actual application needs.

Limitations & Future Work

  • The 20 Role-Sets are still limited and cannot cover all real-world diversity.
  • Data construction relies on GPT-4o and search engines, which may introduce bias and noise (despite quality control measures).
  • The definition of "optimal action" in evaluations relies on LLM estimations, which may be controversial in scenarios such as moral dilemmas.
  • Experiments are restricted to static image scenarios; the applicability to video and multi-turn dialogue scenarios remains unexplored.

Related Work

  • vs. General VLM Alignment (visual preference optimization, etc.): General alignment strives for "the best for everyone," whereas PCogAlign aims for "the best for specific individuals," making them complementary.
  • vs. LLM Personalized Assistants (Jang et al. 2023, etc.): Prior works primarily focused on the personalization of text styles or values, without involving visual scenarios; this paper introduces both visual scenes and role diversity.

Rating

  • Novelty: ⭐⭐⭐⭐ First to introduce Role-Set theory to VLM personalized alignment, with a novel task definition.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers 4 settings, an ablation study, and human evaluation, which is relatively comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Clear problem definition with intuitive methodology flowcharts.
  • Value: ⭐⭐⭐⭐ Opens up a new direction for VLM personalization research, with open-source benchmarks and code.