REFLEX: Self-Refining Explainable Fact-Checking via Verdict-Anchored Style Control
Conference: ACL 2026
arXiv: 2511.20233
Code: The paper states that the code is open-sourced, but no specific URL is available.
Area: AIGC Detection / Explainable Fact-Checking
Keywords: Explainable Fact-Checking, activation steering, hallucination suppression, verdict-anchored explanation, self-refinement
TL;DR
REFLEX couples verdict prediction and explanation generation in fact-checking. By constructing internal steering vectors from self-disagreement samples between the backbone and fine-tuned models, it enhances verdict Macro-F1 and produces shorter, more consistent, and less misleading explanations without relying on search APIs or closed-source teacher models.
Background & Motivation
Background: Automatic fact-checking systems typically provide both a veracity verdict and an explanation. As LLMs are increasingly applied to fact-checking, mainstream solutions either direct the model to generate both or rely on retrieval, Google Search APIs, closed-source teacher models, or multi-agent dialogues to provide evidence and reasoning traces.
Limitations of Prior Work: Such external dependencies introduce two issues. First, retrieved evidence, teacher distillation, and multi-turn agent interactions may introduce hallucinations or propagate errors. Second, external APIs and multi-agent workflows increase latency, making them unsuitable for real-time fact-checking. Crucially, explanations generated by LLMs may appear plausible but remain inconsistent with the final verdict, potentially misleading human judgment through deceptive narrative styles.
Key Challenge: Fact-checking explanations comprise both factual content and reasoning/narrative style. Existing methods often mix these: focusing solely on external evidence may amplify noise, while pure fine-tuning might bake knowledge conflicts from local training signals into model behavior. The authors argue for decoupling "fact-sensitive signals" and "style/reasoning-sensitive signals" within internal model representations.
Goal: The paper aims to achieve more accurate verdicts and more faithful explanations using a single model, few training samples, and minimal external dependencies. Specifically, it seeks to identify which samples reflect reasoning gains from fine-tuning and which reflect knowledge loss, and to control the generation process accordingly.
Key Insight: The authors observe prediction discrepancies between the backbone and SFT models on the same training samples. Samples shifting from "incorrect to correct" after fine-tuning are viewed as activated reasoning styles, while those shifting from "correct to incorrect" are viewed as perturbed factual knowledge. This cross-stage self-disagreement provides internal supervision without requiring manually constructed contrastive samples.
Core Idea: Use backbone/SFT self-disagreement samples to decompose steering vectors into Inference Vectors and Knowledge Vectors. Then, adaptively select and intervene in the generation based on verdict probability gains, anchoring the explanation to the verdict rather than superficial styles.
Method
Instead of retrieving evidence before writing explanations, REFLEX reformulates fact-checking as a dialogue-style single-turn QA task and identifies controllable explanation directions within the model. The workflow involves training a fact-checker, extracting "good" and "bad" directions from backbone-SFT disagreements, and using these to refine explanations during inference.
Overall Architecture
Input is a claim (optionally with evidence); output is a veracity label and explanation. REFLEX operates in three steps.
The first step is Dialogue-style Fact-Checker Training. The paper converts traditional document-style supervision into QA/dialogue training, enabling the model to generate \(v\) or \(v;exp\) in a single turn. The authors suggest that since the backbone contains significant factual knowledge, limited supervision is more effective for activating existing knowledge and shaping task style than simple document completion.
The second step is Adaptive Sample Selection. After training, the backbone and SFT models are used for inference on the training set, and samples are categorized into quadrants based on whether predictions match the gold verdict. Q2 samples (backbone incorrect, SFT correct) are defined as Reasoning Gain; Q4 samples (backbone correct, SFT incorrect) are defined as Knowledge Loss.
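The quadrant assignment can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record fields (`base`, `sft`, `gold`) and the bucket names for the two agreement cases are assumptions, since only Q2 and Q4 are defined above.

```python
from collections import defaultdict


def assign_quadrant(base_pred, sft_pred, gold):
    """Bucket one training sample by backbone/SFT correctness.

    Only Q2 (reasoning gain) and Q4 (knowledge loss) are named in the text;
    the two agreement buckets use hypothetical labels.
    """
    base_ok = base_pred == gold
    sft_ok = sft_pred == gold
    if not base_ok and sft_ok:
        return "Q2_reasoning_gain"
    if base_ok and not sft_ok:
        return "Q4_knowledge_loss"
    return "both_correct" if base_ok else "both_incorrect"


def split_by_quadrant(records):
    """Group records (dicts with 'base', 'sft', 'gold' keys) by quadrant."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[assign_quadrant(rec["base"], rec["sft"], rec["gold"])].append(rec)
    return dict(buckets)
```

Only the two disagreement buckets feed the later steering step; the agreement buckets carry no contrast between the backbone and the SFT model.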
The third step is Self-Explanation Guided Steering (S-EGS). Steering directions are extracted from Q2/Q4 and decomposed into Inference Vectors and Knowledge Vectors. During inference, the direction that maximizes the probability gap of the gold verdict is used to control decoder block activations, followed by cleaning explanation segments contradictory to the optimal direction.
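A toy sketch of the verdict-anchored selection idea follows. It assumes that steering adds a scaled direction to a hidden-state vector and that a callable returns the verdict probability for a given hidden state; `steer`, `select_direction`, `toy_verdict_prob`, and the scale `alpha` are hypothetical names, and the paper's actual intervention on decoder block activations may differ.

```python
import math


def steer(hidden, direction, alpha=1.0):
    """Shift one hidden-state vector along a steering direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def select_direction(hidden, candidates, verdict_prob, alpha=1.0):
    """Pick the candidate direction with the largest verdict-probability gain.

    Returns (name, gain); name is None when no candidate improves on the
    unsteered baseline, in which case no intervention would be applied.
    """
    baseline = verdict_prob(hidden)
    best_name, best_gain = None, 0.0
    for name, direction in candidates.items():
        gain = verdict_prob(steer(hidden, direction, alpha)) - baseline
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


def toy_verdict_prob(hidden, weights=(1.0, 0.5)):
    """Stand-in for the model: a logistic score over the hidden state."""
    score = sum(w * h for w, h in zip(weights, hidden))
    return 1.0 / (1.0 + math.exp(-score))
```

Comparing against the unsteered baseline is what anchors the choice to the verdict: a direction that only changes the explanation's surface style, without raising the verdict probability, is never selected.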
Key Designs
- Cross-stage Self-Disagreement Sample Selection:
    - Function: Automatically identifies samples for control based on backbone/SFT prediction differences instead of relying on manually annotated counterfactual pairs.
    - Mechanism: Let \(\hat{v}^{base}\) and \(\hat{v}^{sft}\) denote backbone and SFT predictions. If \(\hat{v}^{base}\neq v^{gold}\) and \(\hat{v}^{sft}=v^{gold}\), the sample reflects reasoning gain; if \(\hat{v}^{base}=v^{gold}\) and \(\hat{v}^{sft}\neq v^{gold}\), it reflects knowledge loss.
    - Design Motivation: It is difficult to manually construct clean contrastive samples that change only explanation style without altering facts. Cross-stage disagreement provides more natural weak supervision to capture fine-tuning-activated reasoning and fine-tuning-damaged factual representations.
- Decoupling Knowledge and Inference Vectors:
    - Function: Splits the traditional single steering vector into a factual-consistency-oriented Knowledge Vector (KV) and a reasoning-style-oriented Inference Vector (IV).
    - Mechanism: IVs are derived from samples where fine-tuning corrected backbone errors, representing reasoning/style signals to be amplified. KVs are derived from samples where fine-tuning broke correct backbone judgments, representing factual conflict directions to be suppressed. Intervention is performed via logistic probes at the decoder block level to minimize overhead.
    - Design Motivation: Explanation hallucinations are often not pure factual errors but results of entangled factual content and explanation style. Decoupling allows REFLEX to preserve consistent factual representations while enhancing verdict-consistent explanation styles.
- Verdict-Anchored Explanation Refinement:
    - Function: Ensures steering selection serves the verdict rather than just explanation fluency.
    - Mechanism: For candidate directions, the method compares the probability gap of the gold verdict between steered and unsteered outputs, selecting the direction with the maximum gain. It then calculates cosine similarity \(a_{l,t}=h_{l,t}\cdot s_l/(\|h_{l,t}\|\|s_l\|)\) between token hidden states and the optimal direction, treating redundant segments with high negative similarity as noise and removing them using Ratcliff-Obershelp pattern matching.
    - Design Motivation: The goal of a fact-checking explanation is to faithfully support the verdict. Using the verdict probability gap anchors the explanation control to the final task objective.
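The refinement step in the last design can be illustrated with the standard library alone. The sketch below computes the cosine alignment \(a_{l,t}\) and then drops explanation sentences that closely match flagged segments using `difflib.SequenceMatcher`, whose matching is a close relative of Ratcliff-Obershelp gestalt pattern matching. The thresholds, the segment-level (rather than token-level) granularity, and the way matches are turned into deletions are all assumptions made for illustration.

```python
import math
from difflib import SequenceMatcher


def cosine(h, s):
    """a = h . s / (||h|| ||s||); returns 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(h, s))
    norm = math.sqrt(sum(a * a for a in h)) * math.sqrt(sum(b * b for b in s))
    return dot / norm if norm else 0.0


def noisy_segments(segments, segment_states, direction, threshold=-0.5):
    """Flag segments whose hidden state is strongly anti-aligned with the direction.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return [
        seg
        for seg, h in zip(segments, segment_states)
        if cosine(h, direction) < threshold
    ]


def remove_segments(sentences, noisy, min_ratio=0.8):
    """Drop sentences that closely match any flagged segment."""
    kept = []
    for sent in sentences:
        if any(SequenceMatcher(None, sent, n).ratio() >= min_ratio for n in noisy):
            continue
        kept.append(sent)
    return kept
```

Fuzzy matching rather than exact string comparison tolerates small differences in punctuation or whitespace between a flagged segment and the sentence it came from.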
Loss & Training
The training phase uses a standard cross-entropy objective for joint verdict and explanation generation. The paper compares four input-output configurations: \(c\to v\), \(c\to v;exp\), \(c;evi\to v\), and \(c;evi\to v;exp\). For RAW-FC and LIAR-RAW, the authors chose the configuration without evidence but with explanations (\(c\to v;exp\)), because evidence introduced noise and amplified hallucinations in most settings. AVeriTeC was handled according to its own task format, since its explanations naturally depend on evidence. Inference temperature is fixed at 0.
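The four configurations differ only in whether evidence enters the input and whether an explanation follows the verdict in the target. A minimal sketch of building such training pairs is shown here; the `Claim:` / `Evidence:` prompt template and the `build_example` helper are hypothetical, and only the `v` versus `v;exp` target shape comes from the description above.

```python
def build_example(claim, verdict, explanation=None, evidence=None):
    """Build one single-turn training pair.

    evidence=None and explanation=None gives c -> v;
    evidence=None with an explanation gives c -> v;exp;
    passing evidence gives the c;evi -> ... variants.
    """
    prompt = f"Claim: {claim}"
    if evidence is not None:
        prompt += f"\nEvidence: {evidence}"
    target = verdict if explanation is None else f"{verdict}; {explanation}"
    return {"prompt": prompt, "target": target}
```

Under this layout, the setting chosen for RAW-FC and LIAR-RAW corresponds to calling `build_example` with an explanation and without evidence.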
Key Experimental Results
Main Results
The main experiment compares REFLEX with externally dependent solutions on RAW-FC and LIAR-RAW. Key data from Table 1 indicate that REFLEX outperforms RAV and L-Defense on RAW-FC using only a single open-source backbone and 465 self-extracted samples.
| Method | External Dependency | Training Explanation Scale | RAW-FC Macro-F1 | LIAR-RAW Macro-F1 | Notes |
|---|---|---|---|---|---|
| ChatGPT | Closed API | None | 44.43 / 39.31 | 25.11 / 21.90 | Evidence inclusion worsened results |
| HiSS | Google Search API | None | 53.90 | 37.50 | Retrieval-based evidence |
| FactLLaMA | Google Search API | LLaMA2-7B | 55.65 | 30.44 | Relies on external search |
| L-Defense | ChatGPT + RoBERTa-Large | 32,240 | 61.20 | 30.53 | Uses massive GPT-3.5 distillation |
| RAV | 3 LLaMA-3.1-70B-Instruct | Not reported | 59.19 | 25.40 | Multi-agent approach |
| Ours (REFLEX / S-EGS) | None | 465 self-extracted samples | 64.99 | 50.59 | +3.79 over L-Defense, +5.80 over RAV |
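For readers less familiar with the metric in the table, Macro-F1 is the unweighted mean of per-class F1 scores, so every verdict class counts equally regardless of its frequency. The sketch below is a plain-Python reference computation; the `macro_f1` helper is illustrative and not taken from the paper's evaluation code.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true positives, false positives, or false negatives
        # cannot occur here, but guard against division by zero anyway.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because each class is averaged with equal weight, a model that ignores a rare verdict class is penalized more heavily than it would be under accuracy.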
Ablation Study
The authors conducted ablations across backbones, cross-dataset transfer, pairing strategies, and vector types.
| Ablation Setting | Key Metric | Description |
|---|---|---|
| S-EGS across backbones | Up to 5.03 Macro-F1 gain | Better than SFT in most LLaMA-2/Qwen-3 settings |
| Cross-dataset transfer: LLaMA-2 R→L | Target LIAR-RAW Macro-F1 50.59 | Directions from strong source models help weak targets (+7.54) |
| Cross-dataset transfer: LLaMA-2 L→R | Target RAW-FC Macro-F1 47.20 | Weak source directions hurt strong targets (-13.39) |
| Vertical steering w/o exp (LLaMA-2 RAW-FC) | 34.01 (+7.57) | Explanation-guided signals help verdict-only output |
Key Findings
- REFLEX is highly data-efficient: 465 self-extracted samples on RAW-FC outperformed 32,240 GPT-3.5 distilled explanations in L-Defense.
- Evidence conditioning is not always beneficial: omitting evidence often yielded better results on RAW-FC and LIAR-RAW, suggesting external evidence can introduce noise.
- Transferability is conditional: Source Macro-F1 is highly correlated with target gain (Pearson correlation 0.95), meaning effective steering directions originate from strong source configurations.
- KV and IV behave differently: KV samples exhibit lower misleadingness, while IV samples show higher informativeness and soundness.
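The transferability finding rests on a Pearson correlation between source Macro-F1 and target gain. As a reminder of what that statistic measures, here is a small stdlib sketch; the `pearson` helper is illustrative, and no values from the paper are reproduced.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)
```

In this setting it would be called with one entry per transfer configuration, for example `pearson(source_f1, target_gain)`, where both lists are supplied by the reader.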
Highlights & Insights
- Using self-disagreement between the pre- and post-fine-tuning models as a supervision signal is the most distinctive contribution, avoiding the difficulty of constructing counterfactual pairs in fact-checking.
- Faithfulness is not equated simply with more evidence; the paper highlights that external evidence can sometimes hinder model performance.
- Verdict probability anchoring is a transferable design for tasks where "plausible but unhelpful" explanations are common (e.g., medical triage, legal QA).
- The discussion on transferability is realistic, specifying when the approach generalizes (strong-to-weak) versus when it degrades.
Limitations & Future Work
- Model scale is limited to 7B-8B parameters due to resource constraints.
- LIAR-RAW uses a three-class simplification, which may not fully represent fine-grained 6-way political fact-checking.
- Reduced reliance on external retrieval increases dependency on internal knowledge, which may become outdated.
- Evaluation still partially relies on LLM-as-a-judge; future work could include stricter factual consistency audits.
Related Work & Insights
- vs. HiSS / FactLLaMA: Those rely on search APIs for new information; REFLEX is faster and avoids retrieval hallucinations but is exposed to stale internal knowledge.
- vs. L-Defense: While L-Defense uses massive distillation, REFLEX demonstrates that high-signal internal samples are more effective.
- vs. ITI / CAA: While traditional methods use single truthfulness directions, REFLEX contributes decoupled KV/IV steering anchored by verdict probability.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Decoupling KV/IV via cross-stage disagreement effectively addresses the entanglement of facts and style.
- Experimental Thoroughness: ⭐⭐⭐⭐☆ Covers various backbones, transferability, and human evaluation, though limited to 7B-8B models.
- Writing Quality: ⭐⭐⭐⭐☆ Solid motivation and explanation, though many detailed settings in the appendix require cross-referencing.
- Value: ⭐⭐⭐⭐⭐ High relevance for explainable fact-checking and internal activation control.
Related Papers
- [ICLR 2026] Calibrating Verbalized Confidence with Self-Generated Distractors
- [ACL 2026] mdok-style at SemEval-2026 Task 10: Finetuning LLMs for Conspiracy Detection
- [ACL 2026] MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization
- [ACL 2026] DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection
- [ACL 2026] From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment