Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data¶

Conference: ACL2025
arXiv: 2505.11900
Authors: Philipp Christmann, Gerhard Weikum (Max Planck Institute for Informatics)
Code: reqap.mpi-inf.mpg.de
Area: NLP Understanding
Keywords: personal data QA, heterogeneous data, question decomposition, operator tree, on-device deployment

TL;DR¶

This paper proposes the ReQAP method, which constructs executable operator trees via recursive question decomposition to achieve complex question answering over heterogeneous (structured + unstructured) personal data, supporting lightweight on-device deployment.

Background & Motivation¶

Growing demand for personal data management: User devices generate massive amounts of daily data (calendars, fitness logs, shopping records, streaming history, etc.), and users require convenient querying capabilities over this heterogeneous data.
Data privacy as a core constraint: Sensitive personal data must be processed entirely on local devices, which strictly limits computational power and memory resources.
Limitations of two dominant paradigms in existing methods:
- Verbalization (RAG): Serializes all data into text to feed into LLMs. However, it exceeds context windows when relevant events reach hundreds, and LLMs struggle with complex operations like aggregation and grouping.
- Translation (NL2SQL/CodeGen): Translates questions into SQL queries. However, it cannot handle unstructured text (e.g., email bodies, social media posts) and relies on complete database schemas.
Key Challenge: How to achieve complex analytical question-answering capabilities over heterogeneous data (structured and unstructured) while ensuring lightweight on-device deployment.

Method¶

Overall Architecture: ReQAP¶

ReQAP bridges the verbalization and translation paradigms, operating in two phases:

Phase I: QUD (Question Understanding and Decomposition)

Performs recursive decomposition of user questions to generate executable operator trees.
Key Innovation: Instead of generating the entire operator tree at once (one-shot, which is error-prone), it makes multiple LLM calls. Each call generates a partial operator tree, leaving sub-questions for subsequent recursive calls.
Training Pipeline: First, a large model (GPT-4o) is used with In-Context Learning (8-shot) to generate (question, operator tree) pairs, which are then distilled into a 1B small model (LLaMA-3.2-1B) for on-device inference.

Phase II: OTX (Operator Tree Execution)

Executes the generated operator tree from bottom to top. Core operators include:

Operator	Function
RETRIEVE	Retrieves matching events from personal data sources (core operator)
EXTRACT	Extracts specified key-value pairs from events (core operator)
JOIN	Connects two sets of event lists based on conditions
GROUP_BY	Groups events by a specified key
FILTER	Filters events based on conditions
MAP / APPLY	Applies function transformations
ARGMIN/MAX, SUM, AVG	Aggregates operations

RETRIEVE Operator (5-Step Pipeline)¶

SPLADE Pre-filtering: Sparse retrieval to obtain a candidate event pool (aiming for high recall).
Schema Discovery: Identifies high-frequency key-value patterns in the candidate pool.
Schema Classification: Evaluates schema relevance using a Cross-encoder to classify patterns as fully relevant, fully irrelevant, or partially relevant.
Event Classification: Conducts binary classification on each event in the partially relevant schemas.
De-duplication: Merges events with overlapping times (preventing double-counting the same event across calendar, email, or social media).

EXTRACT Operator¶

Uses a small-scale seq2seq model (BART-base) to extract semantic information from unstructured text.
Example: Extracting cuisine="Italian" from the email body text containing "pizza oven".
Efficiency Optimization: Constructs a static lookup map for keys that cover \(\ge 70\%\) of inputs to avoid redundant inference.

Data Model¶

All data sources are unified into a chronologically sorted event list, where each event is a key-value dictionary covering calendar, email, social media, fitness, shopping, streaming, etc.

Key Experimental Results¶

PerQA Benchmark¶

The authors constructed the PerQA dataset: containing 20 fictional characters with approximately 40K events each, and 3,567 complex questions.

Main Results (Table 3): Hit@1 / Rlx-Hit@1 on PerQA Test Set

Method	GPT-4o (>100B)	LLaMA-3.3 (70B)	SFT (1B)
RAG	0.149 / 0.20	0.123 / 0.18	0.029 / 0.06
CodeGen	0.319 / 0.44	0.239 / 0.33	0.315 / 0.47
ReQAP	0.386 / 0.52	0.322 / 0.46	0.380 / 0.53

All ReQAP variants significantly outperform the baselines (McNemar test, \(p < 0.05\)).
The 1B SFT variant achieves performance close to GPT-4o (Hit@1: 0.380 vs 0.386) and even achieves the best Rlx-Hit@1 (0.53).

Ablation Study (Table 5): PerQA Dev Set

Variant	Hit@1	Rlx-Hit@1
ReQAP (SFT)	0.396	0.54
w/o Recursive Decomposition (one-shot)	0.356	0.50
w/o Cross-encoder (SPLADE-only)	0.269	0.36
w/o EXTRACT (key matching only)	0.138	0.23

The removal of the EXTRACT operator causes the largest performance drop (~-65%), demonstrating that extracting information from unstructured text is a crucial capability.

Performance Across Different Question Complexities (Table 4, GPT-4o)

Question Type	RAG	CodeGen	ReQAP
Ordering	0.167	0.440	0.529
Grouping	0.172	0.444	0.537
Temporal	0.129	0.290	0.417
Aggregation	0.130	0.228	0.296
Join	0.073	0.176	0.236
Multi-source	0.196	0.237	0.365

User Study¶

20 undergraduate students used their actual personal data in an offline Docker environment. 28% of the answers were completely correct, and 41% were nearly correct. 94% of use cases mapped to isomorphic operator trees in PerQA, validating the representativeness of the benchmark design.

Highlights & Insights¶

Elegant recursive decomposition strategy: Mitigates error accumulation in one-shot generation of complex operator trees by breaking the task down into multiple recursive steps, each producing simpler subtrees.
On-device deployability: The 1B variant achieves competitive performance with GPT-4o, with the entire pipeline designed considering strict computational constraints.
RETRIEVE 5-step pipeline: Bridges high recall and high efficiency, significantly reducing computational load via source pruning and schema classification.
EXTRACT operator bridging the structured/unstructured divide: Extracts structured fields online from unstructured texts like emails and social media posts.
Robust privacy protection: During the user study, the system was deployed offline with Docker, ensuring personal data never left local devices.

Limitations & Future Work¶

Limited data modalities: Currently supports only specific structured/text sources (calendar, email, streaming, etc.), lacking support for multimodal data such as photos and location trajectories.
QUD as the primary bottleneck: Error analysis reveals that 50% of the failures originate from incorrect operator tree generation.
Evaluation limitations: Assessed only on the synthetic PerQA dataset and a small-scale user study (20 participants), lacking large-scale validation in real-world scenarios.
Room for improvement in analytical queries: The best Hit@1 for aggregation tasks remains below 40%, and join-heavy queries achieve only 23.6%.

TimelineQA (Tan et al., 2023): The closest pioneering work, but limited by templates (42 tasks) and insufficient data diversity. ReQAP achieves a substantial performance advantage on this benchmark (SFT: 0.313 vs 0.135).
Text2SQL (Li et al., 2024; Liu et al., 2024b): Performs well in purely structured scenarios but fails to handle unstructured text.
RAG-based methods (Oguz et al., 2022; Badaro et al., 2023): Effective given sparse evidence, but underperform on intensive aggregation and numerical computation tasks.
Question decomposition methods (Jia et al., 2024; Saeed et al., 2024): Designed for localized, specific scenarios and not generalized to heterogeneous data QA.

Rating¶

Novelty: ⭐⭐⭐⭐ (Recursive decomposition + operator tree + RETRIEVE/EXTRACT bridging the two major paradigms)
Experimental Thoroughness: ⭐⭐⭐⭐ (Self-created benchmark + ablations + multi-scale model comparison + user study; lacks extensive external datasets)
Writing Quality: ⭐⭐⭐⭐⭐ (Clear organization, unified running example, and intuitive diagrams)
Value: ⭐⭐⭐⭐ (Personal data QA is a highly practical demand, and the edge-deployment strategy offers solid industrial value)