Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering

Conference: ACL 2025
arXiv: 2505.19112
Code: zchuz/SiGIR-MHQA
Area: NLP Understanding
Keywords: Multi-hop Question Answering, Retrieval-Augmented Generation, Self-Critique, Iterative Reasoning, Beam Search

TL;DR

SiGIR trains a model end-to-end to iteratively decompose questions, retrieve, reason, and evaluate its own intermediate steps. At inference time, the self-critique scores guide an iteration-level beam search that selects the best reasoning path, outperforming the previous state of the art (SOTA) by an average of 8.6% across three multi-hop QA datasets.

Background & Motivation

Large Language Models (LLMs) face three major challenges in knowledge-intensive multi-hop reasoning: (1) decomposing a complex question into sub-questions in a single pass is difficult, and an early decomposition error derails all subsequent reasoning; (2) iterative retrieval methods plan poorly for complex questions and cannot express retrieval intent clearly, which leads to inaccurate retrieval; (3) without feedback on intermediate steps, errors cascade. Existing methods such as IRCoT, Self-RAG, and Auto-RAG partially alleviate these issues, but they still do not effectively use intermediate-step quality evaluation to guide the reasoning process. This paper focuses on two aspects: iterative question decomposition to enable precise retrieval, and intermediate-step feedback to guide reasoning, which together expand the reasoning space and reduce error propagation.

Method

Overall Architecture

SiGIR (Self-Critique Guided Iterative Reasoning) consists of two phases: training and inference.

The training phase constructs the SC-Reasoner (Self-Critique Reasoner) in three steps:

Iterative Reasoner (R): Synthesizes iterative reasoning trajectories (including sub-question decomposition, retrieval triggering, and knowledge reasoning) using DeepSeek-V2.5. The data is organized in an interleaved format to train a small model, yielding the base reasoner.
Critic Model (C): Reasoning trajectories are generated from non-overlapping corpora using R rather than the LLM, so that positive and negative samples are balanced. An LLM then evaluates the retrieval quality, reasoning quality, and overall quality of each trajectory, and these judgments are used to train an independent critic model.
SC-Reasoner (R_sc): C annotates the sub-processes of each reasoning trajectory with reward signals, which are appended as self-scores after special tokens. End-to-end training on the annotated trajectories then yields a single model with both reasoning and self-evaluation capabilities.

SC-Reasoner Inference Mode

The model performs five operations in each step: (1) determining whether the question needs decomposition; (2) decomposing one atomic sub-question at a time; (3) triggering external retrieval and conducting knowledge reasoning; (4) self-evaluating retrieval and reasoning quality (generating \(r_{\text{retr}}\) and \(r_{\text{reas}}\)); (5) reducing the original question to identify unresolved parts. This process is repeated until the final answer is reached.
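The five operations above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `StubReasoner`, its method names, the `{prev}` template convention, and the toy knowledge base are all hypothetical stand-ins for the fine-tuned SC-Reasoner and an external retriever.

```python
from dataclasses import dataclass


@dataclass
class Step:
    sub_question: str
    evidence: str
    answer: str
    r_retr: float  # self-assessed retrieval quality
    r_reas: float  # self-assessed reasoning quality


class StubReasoner:
    """Scripted stand-in for the trained model (hypothetical interface)."""

    def __init__(self, templates, kb):
        self.templates = list(templates)  # "{prev}" is filled with the previous answer
        self.kb = kb                      # toy corpus: sub-question -> supporting fact

    def needs_decomposition(self, remaining):
        return len(remaining) > 0

    def next_sub_question(self, remaining, prev_answer):
        return remaining[0].format(prev=prev_answer)

    def retrieve_and_reason(self, sub_q):
        evidence = self.kb.get(sub_q, "")
        answer = evidence if evidence else "unknown"
        return evidence, answer

    def self_critique(self, evidence, answer):
        # Toy scores: 1.0 when something was retrieved / answered, else 0.0.
        return (1.0 if evidence else 0.0), (1.0 if answer != "unknown" else 0.0)


def run(reasoner, max_iters=8):
    remaining, steps, prev = list(reasoner.templates), [], ""
    for _ in range(max_iters):
        if not reasoner.needs_decomposition(remaining):            # (1) decide
            break
        sub_q = reasoner.next_sub_question(remaining, prev)        # (2) one atomic sub-question
        evidence, answer = reasoner.retrieve_and_reason(sub_q)     # (3) retrieve + reason
        r_retr, r_reas = reasoner.self_critique(evidence, answer)  # (4) self-evaluate
        steps.append(Step(sub_q, evidence, answer, r_retr, r_reas))
        remaining = remaining[1:]                                  # (5) reduce the question
        prev = answer
    return prev, steps


kb = {"Who wrote Book X?": "Alice", "Where was Alice born?": "Paris"}
reasoner = StubReasoner(["Who wrote Book X?", "Where was {prev} born?"], kb)
final_answer, trace = run(reasoner)
```

In the real system each numbered call would be a segment of the model's own generation rather than a separate method. The stub only makes the control flow explicit.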

Self-Critique Guided Iterative Reasoning (Iteration-level Beam Search)

The inference phase employs iteration-level beam search:

Branch Exploration: At each reasoning step, branch expansion is performed on sub-question decomposition, retrieval, and reasoning via temperature sampling to generate multiple candidate reasoning paths.
Candidate Selection: Each candidate path is scored by its cumulative process reward, where the per-iteration reward \(r_c = r_{\text{retr}} + r_{\text{reas}}\) is accumulated over iterations, and only the top-\(k\) candidates are kept.
Final Selection: After reasoning concludes, the final answer trajectory is selected using cumulative process rewards or outcome rewards.
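The three stages above can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: the `expand` callback, the `Path` container, and the default widths are hypothetical, and the toy expander draws random rewards in place of the model's self-critique scores. Final selection here uses the cumulative process reward, one of the two options the text mentions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    steps: tuple = ()
    reward: float = 0.0  # cumulative process reward: sum of (r_retr + r_reas) per iteration
    done: bool = False


def beam_search(expand, beam_width=2, branches=3, max_iters=4, seed=0):
    """Iteration-level beam search guided by self-assigned rewards (sketch)."""
    rng = random.Random(seed)
    beam = [Path()]
    for _ in range(max_iters):
        if all(p.done for p in beam):
            break
        candidates = []
        for path in beam:
            if path.done:                      # finished paths are carried over unchanged
                candidates.append(path)
                continue
            for _ in range(branches):          # branch exploration via sampling
                step, r_retr, r_reas, done = expand(path, rng)
                candidates.append(
                    Path(path.steps + (step,), path.reward + r_retr + r_reas, done)
                )
        # Candidate selection: keep the top-k paths by cumulative reward.
        candidates.sort(key=lambda p: p.reward, reverse=True)
        beam = candidates[:beam_width]
    # Final selection by cumulative process reward.
    return max(beam, key=lambda p: p.reward)


def toy_expand(path, rng):
    """Hypothetical expander: random rewards in [0, 1); every path ends after 3 steps."""
    depth = len(path.steps) + 1
    return f"step{depth}", rng.random(), rng.random(), depth >= 3


best = beam_search(toy_expand)
```

Setting `beam_width=1` corresponds to keeping a single candidate per iteration, and `branches=1` to disabling exploration. Note that a plain reward sum favors longer paths when paths finish at different depths. The sketch sidesteps this because all toy paths have the same length.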

Self-Improvement Mechanism

SC-Reasoner can perform reasoning and self-evaluate quality on unlabeled data, retaining high-quality trajectories for subsequent training, thereby achieving self-improvement in data-sparse scenarios.
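A minimal sketch of the filtering step, assuming trajectories are kept when their mean self-assigned step reward clears a threshold. The selection rule, the `step_rewards` field, and the threshold value are illustrative assumptions. The text above only states that high-quality trajectories are retained.

```python
def select_for_self_training(trajectories, threshold=0.8):
    """Keep trajectories whose mean self-assigned step reward meets the threshold."""
    kept = []
    for traj in trajectories:
        rewards = traj["step_rewards"]
        if rewards and sum(rewards) / len(rewards) >= threshold:
            kept.append(traj)
    return kept


# Toy self-scored trajectories on unlabeled questions (made-up values).
pool = [
    {"question": "q1", "step_rewards": [1.0, 1.0]},
    {"question": "q2", "step_rewards": [1.0, 0.5]},
    {"question": "q3", "step_rewards": []},
]
selected = select_for_self_training(pool)
```

The selected trajectories would then be added to the training set for the next round.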

Key Experimental Results

Main Results (Table 1)

F1 results on three multi-hop QA datasets (with Mistral-7B as the default backbone):

Method	2WikiMQA	HotpotQA	MuSiQue
IRCoT	46.80	49.03	21.43
Self-RAG	31.60	54.60	22.00
Auto-RAG	56.06	50.03	22.08
DR-Distillation	70.51	58.06	22.74
SiGIR (Ours)	74.47	63.09	37.15

SiGIR achieves its largest improvement on MuSiQue, particularly on the difficult 3-hop and 4-hop questions: \(+47.1\%\) on 3-hop and \(+24.37\%\) on 4-hop. It also achieves competitive results with Qwen2.5 and LLaMA2 backbones.

Ablation Study (Table 2)

Setting	2WikiMQA	HotpotQA	MuSiQue
Full SiGIR	74.47	63.09	37.15
w/o Guided Search	72.37	60.86	35.47
w/o Reward Signals	67.23	54.94	29.45
w/o Self-Critique	66.96	56.86	28.62
Coarse-grained Reward	73.40	59.85	34.32

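Removing reward signals leads to an approximately 10% performance drop (7.2 to 8.2 F1 points per dataset in the table above). Comparing the Coarse-grained Reward row with Full SiGIR shows that fine-grained rewards are more effective than coarse-grained rewards in search-based reasoning.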
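To make the size of the reward-signal ablation concrete, the snippet below computes absolute and relative F1 drops directly from the Full SiGIR and w/o Reward Signals rows of the table. The helper name `f1_drops` is illustrative. Only the numbers come from the table.

```python
full = {"2WikiMQA": 74.47, "HotpotQA": 63.09, "MuSiQue": 37.15}
without_reward = {"2WikiMQA": 67.23, "HotpotQA": 54.94, "MuSiQue": 29.45}


def f1_drops(full, ablated):
    """Return {dataset: (absolute drop in F1 points, relative drop in percent)}."""
    out = {}
    for name, base in full.items():
        diff = base - ablated[name]
        out[name] = (round(diff, 2), round(100 * diff / base, 1))
    return out


drops = f1_drops(full, without_reward)
for name, (points, percent) in drops.items():
    print(f"{name}: -{points} F1 points ({percent}% relative)")
```

On these numbers the absolute drop is similar across datasets (roughly 7 to 8 F1 points), while the relative drop is largest on MuSiQue because its baseline F1 is lowest.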

Search and Selection Strategy Analysis (Table 4)

Strategy	Average F1
Guided Search (Full)	58.23
Keep Only 1 Candidate	56.81
No Sub-question Branching	50.03
Cumulative Process Reward Selection	58.23
Outcome Reward Selection	55.99
Random Selection	55.81

Performance drops by 13.57% when the sub-question exploration width is set to 1, demonstrating the importance of diverse sub-question exploration.

Highlights & Insights

Unified Framework: Unifies reasoning and self-critique in a single model, avoiding the inference overhead of a decoupled generator and critic (yielding a 4.46× throughput improvement).
Iterative Decomposition + Beam Search: Avoids the high risk of one-pass decomposition, mitigating error propagation through step-by-step decomposition and searching.
Self-Improvement Capability: With only 40% of the labeled data and two rounds of self-improvement, the model's performance approaches that of training on 100% of the data (a gap of only 2.2%).
Retrieval System Flexibility: The method naturally supports hybrid retrieval (sparse + dense), which yields an average improvement of 5.7%.
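As an illustration of what hybrid sparse + dense retrieval can look like, the sketch below merges two ranked lists with reciprocal rank fusion (RRF). This is an assumption for illustration only: the summary does not say how SiGIR combines sparse and dense results, and RRF is a commonly used fusion rule swapped in here. The document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; a higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse_hits = ["d1", "d2", "d3"]  # e.g. from a BM25-style retriever (toy IDs)
dense_hits = ["d3", "d1", "d4"]   # e.g. from an embedding retriever (toy IDs)
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

Documents ranked highly by both retrievers rise to the top of the fused list, which is the usual motivation for combining the two signals.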

Limitations & Future Work

Sub-question exploration adopts a sampling approach rather than a dedicated decomposition module, which may limit the breadth of exploration.
Training uses only the Supervised Fine-Tuning (SFT) paradigm; reinforcement learning is not used to optimize reasoning trajectories.
Beam search increases computational overhead during inference (although partially mitigated by GenCritic).
The method is validated only on English multi-hop QA datasets; cross-lingual and cross-domain generalizability remains unknown.

Related Work

Retrieval-Augmented Generation (RAG): RAG has evolved from single-pass retrieval to multi-round iterative retrieval (e.g., IRCoT, Self-RAG). SiGIR adds self-critique signals on top of this to guide the retrieval direction.
Multi-hop Question Answering: Prior work divides into question decomposition methods (e.g., ProbTree, BeamAggR) and iterative reasoning methods (e.g., Auto-RAG, DR-Distillation). SiGIR combines iterative decomposition with reward-guided search.
Inference-Time Scaling: Existing approaches include sampling ensembles (e.g., Self-Consistency) and MCTS search. SiGIR's iteration-level beam search balances performance and efficiency.

Rating

Novelty: ⭐⭐⭐⭐ The idea of unifying self-critique and iterative reasoning into beam search is novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets + four backbone models + rich ablation analysis + retrieval system experiments.
Writing Quality: ⭐⭐⭐⭐ Clear structure, intuitive diagrams.
Value: ⭐⭐⭐⭐ Provides practical reference value for knowledge-intensive multi-hop reasoning.