Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

Conference: ACL 2026
arXiv: 2604.11741
Code: None
Area: Multimodal VLM
Keywords: Imperfect Information Reasoning, Murder Mystery Game, Multi-Agent Data Generation, Vision-Language Model, Reinforcement Learning

TL;DR

The paper proposes a collaborative multi-agent framework that automatically generates high-quality murder mystery game scripts and training data. A two-stage training strategy (CoT fine-tuning followed by GRPO reinforcement learning with ScoreAgent reward shaping) then strengthens VLM multi-hop reasoning under imperfect information, with reported gains in narrative reasoning, fact extraction, and deception resistance on WhodunitBench.

Background & Motivation

Background: Vision-Language Models (VLMs) excel at perception tasks but still degrade on complex multi-hop reasoning that involves imperfect information, deception, and multi-player social interaction. Murder mystery games are a social reasoning task in which players must infer hidden truths from partial clues, which makes them an ideal testbed for this kind of reasoning.

Limitations of Prior Work: (1) The murder mystery domain lacks large-scale, high-quality datasets for fine-tuning and evaluation. (2) Manual production of high-quality scripts is costly and difficult to scale. (3) Existing VLMs struggle with role consistency (murderers needing to deceive, innocents needing to cooperate) and multimodal multi-hop reasoning (combining text and visual clues). (4) Role-playing and interactive discussions lack standard answers, making pure SFT insufficient for training these behaviors.

Key Challenge: VLMs need to perform reliable reasoning in imperfect and deceptive information environments but lack appropriate training data and methodologies.

Goal: (1) Construct a scalable multi-agent data synthesis framework. (2) Design a two-stage training strategy suitable for reasoning under imperfect information.

Key Insight: Strong LLMs (e.g., Gemini 2.5 Pro) are utilized as Agents to collaboratively generate game scripts, and an Agent-monitored training strategy is then used to enhance the target VLM.

Core Idea: A collaborative pipeline of Generation Agents (Story Outline → Character Scripts → Clues → Dialogue → QA) and Evaluation Agents (Quality Control + Reward Shaping) constructs the training data, followed by two-stage training (SFT + GRPO with ScoreAgent) to enhance the VLM.

Method

Overall Architecture

The framework consists of two main modules: (1) Data Generation Module: Multiple specialized Agents (OutlineAgent, CharacterAgent, ClueAgent, RoleplayAgent, QaAgent, CriticAgent) collaborate to generate scripts and training data; (2) Model Enhancement Module: Stage 1 SFT establishes basic reasoning capabilities → Stage 2 GRPO Reinforcement Learning optimizes role-specific behaviors under ScoreAgent monitoring.
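
As a rough illustration of the Data Generation Module, the sketch below wires the named agents into a pipeline with a critic feedback loop. Everything in it is hypothetical: the `Script` container, the stub agent functions, the `max_revisions` limit, and the convention that the critic returns `None` to accept a draft are illustrative assumptions, not the paper's implementation. In practice each stub would wrap an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Script:
    """Hypothetical container for one generated murder mystery script."""
    outline: str = ""
    characters: List[str] = field(default_factory=list)
    clues: List[str] = field(default_factory=list)
    dialogue: List[str] = field(default_factory=list)
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)
    feedback: Optional[str] = None  # latest critic feedback, None once accepted
    revisions: int = 0              # how many times the critic asked for a revision


Agent = Callable[[Script], Script]


def run_pipeline(draft_agents: List[Agent],
                 critic: Callable[[Script], Optional[str]],
                 downstream_agents: List[Agent],
                 max_revisions: int = 3) -> Script:
    """Draft, critique and revise, then run the remaining agents in order."""
    script = Script()
    for _ in range(max_revisions + 1):
        for agent in draft_agents:
            script = agent(script)
        script.feedback = critic(script)
        if script.feedback is None:  # critic accepted the draft
            break
        script.revisions += 1
    for agent in downstream_agents:
        script = agent(script)
    return script


# Stub agents (assumed behavior, stand-ins for LLM calls).
def outline_agent(s: Script) -> Script:
    note = f" (revised: {s.feedback})" if s.feedback else ""
    s.outline = "Victim found in the library" + note
    return s


def character_agent(s: Script) -> Script:
    s.characters = ["butler", "heir", "doctor"]
    return s


def critic_agent(s: Script) -> Optional[str]:
    # Toy rule: reject the first draft once, then accept.
    return None if s.revisions >= 1 else "add a stronger motive"


def clue_agent(s: Script) -> Script:
    s.clues = ["torn letter (text)", "muddy boots (image)"]
    return s


def roleplay_agent(s: Script) -> Script:
    s.dialogue = [f"{c}: I was nowhere near the library." for c in s.characters]
    return s


def qa_agent(s: Script) -> Script:
    s.qa_pairs = [("Whose boots were muddy?", "unknown in this toy example")]
    return s


result = run_pipeline([outline_agent, character_agent], critic_agent,
                      [clue_agent, roleplay_agent, qa_agent])
print(result.revisions, result.outline)
```

Keeping the critic inside the drafting loop means later agents (clues, dialogue, QA) only ever see a draft the critic has accepted or the last draft once the revision budget runs out.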

Key Designs

  1. Multi-Agent Script Generation Framework:

    • Function: Automatically generates diverse, high-quality murder mystery scripts and training data.
    • Mechanism: A pipeline of six specialized agents: OutlineAgent constructs the crime narrative (motives + secrets) → CharacterAgent refines daily actions and interactions → CriticAgent evaluates and provides feedback across four dimensions: plot complexity, character development, difficulty, and logical consistency → ClueAgent generates multimodal clues (visual + text) → RoleplayAgent simulates multi-round dialogues → QaAgent generates reasoning chains and QA pairs ranging from single-hop to multi-hop.
    • Design Motivation: Generating an entire script with a single model often leads to logical inconsistencies. Specialized division of labor combined with CriticAgent feedback ensures script quality.
  2. ScoreAgent-Monitored GRPO Reinforcement Learning:

    • Function: Optimizes VLM role consistency and reasoning quality.
    • Mechanism: Reward functions are tailored to the data type. For non-verifiable data (self-introductions, discussions), ScoreAgent (LLM-as-Judge) scores role consistency; for discussions, a choice score \(S_{\text{choice}}\) is added (1 point for questioning suspects, 0.5 for others, 0 for self). For verifiable data (QA), the reward is a weighted combination of answer correctness, format correctness, and clue matching accuracy.
    • Design Motivation: SFT establishes base capabilities but cannot handle role-playing behaviors without gold labels. GRPO utilizes ScoreAgent evaluations to distinguish between superior and inferior role-playing performances.
  3. Reasoning Chain Generation under Imperfect Information:

    • Function: Provides reasoning examples under conditions of incomplete information for training.
    • Mechanism: Reasoning chains are generated automatically from each player's partial view: a player sees only their own clues and public information, so reaching the answer requires multi-hop inference over incomplete evidence.
    • Design Motivation: Traditional reasoning data assumes complete information, whereas the core challenge of murder mysteries is imperfect information and deception.
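
The reward design described in item 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the QA weights `(0.6, 0.1, 0.3)`, the choice to simply add \(S_{\text{choice}}\) to the ScoreAgent score, and the definition of clue matching accuracy as the fraction of gold clues cited are all hypothetical, since the summary gives only the components and the 1 / 0.5 / 0 values. The ScoreAgent score is passed in as a plain number; in practice it would come from an LLM judge.

```python
from typing import Iterable, Set, Tuple


def choice_score(speaker: str, target: str, suspects: Set[str]) -> float:
    """S_choice: 1 for questioning a suspect, 0.5 for anyone else, 0 for oneself."""
    if target == speaker:
        return 0.0
    return 1.0 if target in suspects else 0.5


def discussion_reward(role_score: float, speaker: str, target: str,
                      suspects: Set[str]) -> float:
    """Non-verifiable reward: judge score plus S_choice (plain sum is an assumption)."""
    return role_score + choice_score(speaker, target, suspects)


def clue_match_accuracy(cited: Iterable[str], gold: Iterable[str]) -> float:
    """Assumed definition: fraction of gold clues that the answer cites."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(set(cited) & gold_set) / len(gold_set)


def qa_reward(answer_correct: bool, format_ok: bool,
              cited: Iterable[str], gold: Iterable[str],
              weights: Tuple[float, float, float] = (0.6, 0.1, 0.3)) -> float:
    """Verifiable reward: weighted sum of the three components (weights assumed)."""
    w_ans, w_fmt, w_clue = weights
    return (w_ans * float(answer_correct)
            + w_fmt * float(format_ok)
            + w_clue * clue_match_accuracy(cited, gold))


# Usage: a player questions a suspect, and a QA answer cites one of two gold clues.
print(discussion_reward(0.8, speaker="heir", target="butler", suspects={"butler"}))
print(qa_reward(True, True, cited=["c1"], gold=["c1", "c2"]))
```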

Key Experimental Results

Main Results (WhodunitBench)

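| Method | MMR | CMD | RP | DM | LSU | TIU | MIU |
|---|---|---|---|---|---|---|---|
| GPT-4V | 58.75 | 26.43 | 6.43 | 24.2% | 92.40 | 51.88 | 69.25 |
| Gemini-1.5-Pro | 57.39 | 19.20 | 7.22 | 16.9% | - | - | - |
| Qwen2.5-VL-3B | baseline | - | - | - | - | - | - |
| Qwen2.5-VL-3B + Ours | Significant Gain | Gain | Gain | Gain | Gain | Gain | Gain |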

Ablation Study

| Configuration | Description |
|---|---|
| SFT Only | Establishes basic reasoning but poor role consistency |
| SFT + RL w/o ScoreAgent | Inaccurate reward signals, limited improvement |
| SFT + ScoreAgent GRPO | Improvement in both role consistency and reasoning quality |

Key Findings

  • The multi-agent framework successfully generates diverse and logically consistent data, with the CriticAgent feedback mechanism significantly enhancing script quality.
  • Two-stage training is consistently effective across both 3B and 7B model scales.
  • ScoreAgent's role-specific reward design enables the model to learn distinct behavior patterns for murderers versus innocents.
  • GRPO provides particularly significant improvements for role-playing behaviors, where SFT shows limited effectiveness for tasks without standard answers.
  • Low-score examples exhibit clear characteristics: off-topic remarks, self-contradiction, and premature identity exposure.
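
One reason GRPO suits tasks without standard answers is that it needs only relative scores within a group of sampled responses, not gold labels. The sketch below shows the standard group-relative advantage used by GRPO (each reward normalized by the group mean and standard deviation). The reward values are made up for illustration, and the use of the population standard deviation and the `eps` constant are assumptions; the paper's exact training configuration is not given in this summary.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled discussion turns for one prompt, scored by a judge (made-up values).
advantages = group_relative_advantages([1.8, 1.2, 0.4, 0.6])
print(advantages)
```

Responses scored above the group mean receive a positive advantage and are reinforced, while those below are discouraged; if every response in the group gets the same score, all advantages are zero and the group contributes no learning signal.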

Highlights & Insights

  • Using murder mystery games as a training ground for VLM reasoning is a well-chosen task: it covers imperfect information, deception detection, multi-hop reasoning, and multimodal integration in a single setting.
  • The differentiated reward design of ScoreAgent (different functions for verifiable vs. non-verifiable data) is a practical solution that avoids training a separate reward model for tasks without gold-standard answers.
  • Scalability of the data generation framework: By adding or adjusting specialized Agents, the framework can be adapted to other game-theoretic tasks such as Werewolf or courtroom simulations.

Limitations & Future Work

  • WhodunitBench contains only 50 scripts, leading to a limited evaluation scale.
  • Script generation relies on Gemini 2.5 Pro, so script quality is tied to that model and generation is costly.
  • Role-playing evaluation still relies primarily on LLM-as-Judge, which is subjective.
  • Real-world multi-player interaction training between multiple VLMs has not been explored.
  • Visual clues are currently simple and do not involve complex scene understanding (e.g., surveillance video analysis).
  • The diversity of training data is limited by the creativity of the generation Agents.

Comparison with Related Work

  • vs. WhodunitBench (Xie et al., 2024): WhodunitBench provides an evaluation platform but lacks data. This paper provides a data generation framework and training methodology.
  • vs. AgentInstruct / MATRIX: While those focus on general synthetic data, this work concentrates on structured data generation for imperfect information game scenarios.
  • vs. Reason-RFT / SRPO: Unlike general reasoning enhancement methods, the ScoreAgent design here is specialized for role consistency.

Rating

  • Novelty: ⭐⭐⭐⭐ Utilizing murder mystery as a VLM reasoning training scenario is novel; the multi-agent framework design is thorough.
  • Experimental Thoroughness: ⭐⭐⭐ WhodunitBench scale is limited, and some specific numerical data are incomplete.
  • Writing Quality: ⭐⭐⭐⭐ The framework description is clear, though the length is substantial.
  • Value: ⭐⭐⭐⭐ Makes unique contributions to VLM reasoning training under imperfect information.