LegalReasoner: Step-wised Verification-Correction for Legal Judgment Reasoning

Conference: ACL 2025
arXiv: 2506.07443
Code: LegalHK Dataset
Area: Other
Keywords: Legal Judgment Prediction, Step-by-step Verification, Reasoning Correction, Process Verifier, Dispute Points

TL;DR

This paper proposes the LegalReasoner framework to enhance the reliability of legal judgment prediction through dispute point identification, step-by-step reasoning, logical validation of each step using a process verifier, and an expert-designed attribution-based correction strategy. Combined with the newly released LegalHK dataset containing 58,130 Hong Kong court cases, the framework improves the concordance rate with court judgments on LLAMA-3.1-70B from 72.37% to 80.27%.

Background & Motivation

Background: Legal Judgment Prediction (LJP) aims to predict court rulings automatically from case facts and claims, which is valuable for supporting court decision-making and improving judicial efficiency. In recent years, LLMs have demonstrated strong capabilities in legal text understanding, but using LLMs directly for legal reasoning remains challenging.

Limitations of Prior Work: Existing LJP methods are prone to logical errors when dealing with complex legal cases. Legal reasoning requires rigorous step-by-step argument—from fact finding to legal application and finally to the ruling, with each step needing to be logically sound. However, the reasoning chains generated by LLMs often feature skipped steps, contradictions, and deviations from legal clauses, and they lack quality control mechanisms for intermediate steps. Once an intermediate step is incorrect, the error propagates along the reasoning chain, leading to a wrong final judgment.

Key Challenge: LLMs possess the capability to generate long reasoning chains but lack mechanisms to autonomously identify and correct reasoning errors. Legal reasoning demands extremely high accuracy for every step—a single fact-finding error or misapplication of law can lead to a completely different judgment.

Goal: To design an integrated "reasoning-verification-correction" framework for legal judgment prediction that performs multi-dimensional verification at each step of reasoning and automatically attributes and corrects errors when detected.

Key Insight: The framework draws on how judges reason in legal practice: they first identify dispute points to decompose a complex case, then analyze and argue each dispute point step by step, with every step supported by facts and legal provisions. LegalReasoner formalizes this process as a computational workflow.

Core Idea: A four-stage framework: identify dispute points first → perform step-by-step reasoning → verify each step with a process verifier across three dimensions (correctness, progression, and potential) → apply expert-designed attribution-correction strategies when an error is detected.

Method

Overall Architecture

The workflow of LegalReasoner: input the claims and facts of a case → the Dispute Point Identification module decomposes the case into several disputes → perform step-by-step reasoning for each dispute point → after each reasoning step is completed, the process verifier conducts logical validation across three dimensions (correctness, progression, and potential) → if validation passes, proceed to the next reasoning step → if an error is detected, the expert attribution module analyzes the cause of the error and generates a correction prompt → the corrected reasoning step replaces the original step to continue subsequent reasoning → after reasoning for all dispute points is complete, synthesize them to form the final judgment.
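The workflow above can be read as a nested loop: over dispute points, then over reasoning steps. The following is a minimal sketch of that control flow, not the authors' implementation. Every function name, the `threshold` value, and the `max_steps` cap are assumptions made for illustration; each callable stands in for an LLM or verifier call.

```python
from typing import Dict, List

DIMENSIONS = ("correctness", "progression", "potential")


def run_legal_reasoner(case, identify, propose, verify, correct, synthesize,
                       threshold=0.5, max_steps=8):
    """Hypothetical control loop; all callables are placeholders for model calls."""
    chains: Dict[str, List[str]] = {}
    for dispute in identify(case):
        steps: List[str] = []
        for _ in range(max_steps):
            step = propose(case, dispute, steps)
            if step is None:          # the generator signals that the chain is complete
                break
            scores = verify(dispute, steps, step)
            failed = [d for d in DIMENSIONS if scores[d] < threshold]
            if failed:                # replace the flawed step before continuing
                step = correct(dispute, steps, step, failed)
            steps.append(step)
        chains[dispute] = steps
    return synthesize(case, chains)


# Toy stand-ins so the loop can be exercised without any model.
def identify(case):
    return ["liability", "damages"]

def propose(case, dispute, steps):
    return f"{dispute}: draft {len(steps) + 1}" if len(steps) < 2 else None

def verify(dispute, steps, step):
    # Pretend the first step of every chain fails the correctness check.
    low = 0.2 if not steps else 0.9
    return {"correctness": low, "progression": 0.9, "potential": 0.9}

def correct(dispute, steps, step, failed):
    return f"{step} [revised for {', '.join(failed)}]"

def synthesize(case, chains):
    return chains  # a real system would merge the chains into one judgment

judgment = run_legal_reasoner("facts and claims", identify, propose,
                              verify, correct, synthesize)
print(judgment)
```

In this toy run the first step of each chain is rewritten by `correct`, and the second passes verification unchanged. A real system would likely re-verify the corrected step as well; the summary does not say whether the paper does so.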

Key Designs

Dispute Point Identification Module:
- Function: Decomposes complex cases into independent dispute points to reduce reasoning complexity.
- Mechanism: Analyzes claims and defense arguments of both parties to identify points of disagreement (e.g., disputes over damage compensation, liability determination, or legal application). Each dispute point is formalized as a sub-question that requires independent argumentation. LLMs extract the list of dispute points from the case description using specialized prompt templates.
- Design Motivation: Legal cases usually involve multiple interwoven disputes, and directly reasoning over the entire case easily confuses different issues. After decomposition, the reasoning chain for each sub-question is shorter, clearer, and much easier to verify.
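As an illustration of prompt-based dispute point extraction, the sketch below builds a prompt and parses a numbered list from the model's reply. The template wording, the function names, and the numbered-list output format are all assumptions; the paper's actual templates are not reproduced here.

```python
import re
from typing import List

# Hypothetical template, not the paper's wording.
DISPUTE_PROMPT = (
    "You are assisting a judge.\n"
    "Claims:\n{claims}\n\nFacts:\n{facts}\n\n"
    "List the points on which the parties disagree, one per line, numbered."
)


def build_dispute_prompt(claims: str, facts: str) -> str:
    """Fill the template with the case's claims and facts."""
    return DISPUTE_PROMPT.format(claims=claims, facts=facts)


def parse_dispute_points(llm_output: str) -> List[str]:
    """Extract numbered lines such as '1. ...' or '2) ...' from the reply."""
    points = []
    for line in llm_output.splitlines():
        match = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if match:
            points.append(match.group(1))
    return points


reply = (
    "Here are the disputes:\n"
    "1. Whether the defendant breached the lease\n"
    "2) Amount of damages owed\n"
)
points = parse_dispute_points(reply)
print(points)
```

Each extracted string can then be treated as an independent sub-question for step-by-step reasoning.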
Process Verifier — Three-dimensional Step-by-step Verification:
- Function: Performs multi-dimensional logical verification at each reasoning step to detect errors in a timely manner.
- Mechanism: Trains a specialized verification model to score each reasoning step across three dimensions: correctness (whether the step is logically sound, fact citations are accurate, and legal clauses are correctly applied); progression (whether the step makes substantial progress compared to the previous step rather than stalling or deviating); and potential (whether the reasoning direction has the potential to correctly reach the final conclusion). The scores across the three dimensions are synthesized to determine if a correction is needed.
- Design Motivation: Traditional methods that only verify the final outcome cannot pinpoint exactly which step failed. Step-by-step verification coupled with three-dimensional analysis not only detects errors but also identifies the error type (factual, logical, or directional), providing precise guidance for subsequent corrections.
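The summary does not specify how the three scores are combined into a decision. One simple possibility, shown here purely as an assumption, is a per-dimension threshold: any dimension scoring below it is flagged, and a step needs correction if at least one dimension is flagged. The `StepScores` class and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepScores:
    """Verifier outputs for one reasoning step, each assumed to lie in [0, 1]."""
    correctness: float
    progression: float
    potential: float


def flagged_dimensions(scores: StepScores, threshold: float = 0.5) -> List[str]:
    """Return the names of the dimensions that fall below the threshold."""
    return [name for name, value in vars(scores).items() if value < threshold]


def needs_correction(scores: StepScores, threshold: float = 0.5) -> bool:
    """A step is sent to correction if any dimension is flagged (assumed rule)."""
    return bool(flagged_dimensions(scores, threshold))


stalled = StepScores(correctness=0.9, progression=0.3, potential=0.8)
print(flagged_dimensions(stalled), needs_correction(stalled))
```

Returning the flagged dimension names, rather than a single pass/fail bit, is what lets a later stage choose a correction strategy by error type.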
Expert Attribution and Correction Strategy:
- Function: Executes targeted reasoning correction based on the verifier's diagnostic results.
- Mechanism: Selects corresponding correction strategies according to the error dimensions flagged by the verifier (correctness, progression, or potential). Correctness error → backtrack to inspect fact citations and legal applications; progression error → refocus on the core question of the current dispute point; potential error → adjust the reasoning direction or add alternative paths. Correction strategies are implemented as expert-designed prompt templates.
- Design Motivation: Different types of reasoning errors require different remedy strategies. Correctness errors are typically local issues (e.g., factual errors), progression issues may relate to reasoning strategies, and potential errors might require fundamental direction adjustments. A one-size-fits-all correction strategy yields subpar results.
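The mapping from error dimension to correction strategy can be sketched as a lookup over prompt templates. The template texts below paraphrase the strategies described above and are not the expert-designed templates from the paper; the priority order used when several dimensions fail is also an assumption.

```python
from typing import Iterable

# Hypothetical templates paraphrasing the three strategies.
CORRECTION_TEMPLATES = {
    "correctness": (
        "Re-check the facts cited and the legal provisions applied in this step, "
        "then rewrite it:\n{step}"
    ),
    "progression": (
        "This step does not advance the analysis. Refocus on the dispute "
        "'{dispute}' and rewrite it:\n{step}"
    ),
    "potential": (
        "This line of reasoning is unlikely to resolve '{dispute}'. "
        "Propose an alternative direction to replace:\n{step}"
    ),
}

# Assumed priority: fix local errors before changing strategy or direction.
PRIORITY = ("correctness", "progression", "potential")


def build_correction_prompt(dispute: str, step: str, failed: Iterable[str]) -> str:
    """Pick the template for the highest-priority failed dimension."""
    failed = set(failed)
    for dimension in PRIORITY:
        if dimension in failed:
            return CORRECTION_TEMPLATES[dimension].format(dispute=dispute, step=step)
    raise ValueError("no failed dimension given")


prompt = build_correction_prompt(
    "liability", "The tenant is liable.", ["potential", "correctness"]
)
print(prompt)
```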

Loss & Training

The process verifier is trained using annotations from the LegalHK dataset, which includes dispute point labels, reasoning chain annotations, and step-wise verification labels. Training adopts multitask learning to optimize verification accuracy across the three dimensions simultaneously. The LLMs used in the LegalReasoner framework are trained with supervised fine-tuning (SFT).
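The exact training objective is not given in this summary. A common way to realize multitask learning over three verification dimensions is a weighted sum of per-dimension binary cross-entropy losses; the sketch below assumes binary step labels and equal weights, both of which are illustrative choices rather than details from the paper.

```python
import math

DIMENSIONS = ("correctness", "progression", "potential")


def binary_cross_entropy(prob, label, eps=1e-7):
    """BCE for one predicted probability and one 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def multitask_loss(probs, labels, weights=None):
    """Weighted sum of the three per-dimension losses (equal weights by default)."""
    weights = weights or {dim: 1.0 for dim in DIMENSIONS}
    return sum(
        weights[dim] * binary_cross_entropy(probs[dim], labels[dim])
        for dim in DIMENSIONS
    )


probs = {"correctness": 0.5, "progression": 0.5, "potential": 0.5}
labels = {"correctness": 1, "progression": 1, "potential": 0}
loss = multitask_loss(probs, labels)
print(loss)
```

With every prediction at 0.5, each dimension contributes ln 2, so an uninformed verifier starts at 3 ln 2 under equal weights.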

Key Experimental Results

Main Results

Model	Base LLM (concordance rate)	+CoT	+LegalReasoner	Gain
LLAMA-3.1-70B	72.37%	73.x%	80.27%	+7.9pp
LLAMA-3.1-8B	Lower baseline	Slight improvement	Significant improvement	Larger gain
Other LLMs	Baseline	Baseline+CoT	Baseline+LegalReasoner	Consistent improvement
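The reported gain on LLAMA-3.1-70B can be expressed in three ways from the two concordance rates in the table. The short calculation below uses only those two figures; treating "100 minus concordance" as a disagreement rate is an interpretive assumption.

```python
base, ours = 72.37, 80.27  # concordance rates (%) reported for LLAMA-3.1-70B

gain_pp = ours - base                             # absolute gain in percentage points
relative_gain = gain_pp / base * 100              # gain relative to the baseline rate
error_reduction = gain_pp / (100 - base) * 100    # share of disagreements removed

print(f"{gain_pp:.2f} pp absolute")
print(f"{relative_gain:.1f}% relative")
print(f"{error_reduction:.1f}% fewer disagreements with the court")
```

The absolute figure matches the 7.9 pp in the table; the other two put the same gain in terms of the baseline rate and of the remaining disagreements.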

Ablation Study

Configuration	Concordance Rate	Description
Full LegalReasoner	80.27%	Full framework
w/o Dispute Point Identification	Decreased	Reasoning directly without decomposition performs poorly on complex cases
w/o Process Verifier	Decreased	No intermediate step verification, leading to error accumulation
w/o Correction Strategy	Decreased	Errors are detected but not corrected
Correctness verification only	Decreased	Single-dimension verification is insufficient
Three-dimensional verification	Best	Complementary verification dimensions

Key Findings

The three-dimensional design of the process verifier is indispensable: correctness contributes the most, while progression and potential dimensions provide irreplaceable complementary information.
Dispute point identification helps most on complex, multi-dispute cases; gains are smaller on simpler cases.
Targeted correction strategies are crucial—generic "rethink" prompts perform far worse than specialized strategies tailored to specific error types.
The 58,130 annotated case records in the LegalHK dataset provide an important resource for legal reasoning research.
The effect is most prominent on LLAMA-3.1-70B, indicating that the framework unleashes greater value on stronger base models.

Highlights & Insights

Elegant design of the three-dimensional process verifier: Correctness, progression, and potential cover three complementary aspects of reasoning-chain quality (whether each step is right, whether progress is being made, and whether the chain is heading in the right direction). The verification scheme is not specific to legal reasoning and could plausibly transfer to other settings that require step-by-step verification, such as mathematical and scientific reasoning.
Formalization of judges' reasoning patterns: The design (dispute decomposition → step-by-step argumentation → verification → correction) closely follows the actual flow of legal reasoning, which makes it more domain-specific than general CoT or Tree-of-Thought approaches.
Scarcity value of the LegalHK dataset: 58,130 court cases featuring dispute points, reasoning chains, and verification labels represent a highly scarce and high-quality annotated resource in the legal AI domain.

Limitations & Future Work

This work is currently validated only on Hong Kong court cases; its applicability to civil law systems (e.g., mainland China, continental Europe) and other common law jurisdictions requires further investigation.
The training of the process verifier relies on a massive amount of manually annotated reasoning chains and verification labels, which incurs high annotation costs.
The correction strategies are based on expert-designed templates and may need extension when tackling entirely new types of legal reasoning errors.
Moving from 72.37% to 80.27% still leaves significant room for improvement; actual judicial assistance may necessitate even higher accuracy.

Related Work & Insights

vs Direct CoT: Direct CoT lacks intermediate verification mechanisms, allowing errors to accumulate along the reasoning chain.
vs Self-Consistency: Self-Consistency leverages majority voting across multiple sample paths but does not utilize intermediate verification signals during the reasoning process.
vs Process Reward Model (PRM): PRMs verify the correctness of each step but do not actively correct them. LegalReasoner introduces attribution and correction mechanisms on top of verification.
The verification-correction paradigm presented in this paper could be generalized into a reasoning-enhancement framework for other domains.
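To make the contrast with Self-Consistency concrete, the sketch below shows its majority vote over the final answers of several sampled reasoning paths. The function name and the sample answers are illustrative; the point is that only final answers enter the vote.

```python
from collections import Counter
from typing import List


def self_consistency(final_answers: List[str]) -> str:
    """Return the most frequent final answer among the sampled reasoning paths."""
    return Counter(final_answers).most_common(1)[0][0]


# Final answers from three hypothetical sampled reasoning paths.
votes = ["claim allowed", "claim dismissed", "claim allowed"]
print(self_consistency(votes))
```

Because the vote never inspects intermediate steps, a flaw shared by most sampled paths goes undetected. Step-wise verification and correction target exactly that gap.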

Rating

Novelty: ⭐⭐⭐⭐ The combination of three-dimensional process verification and expert-guided attribution-correction in the LJP domain is novel, although individual component ideas have precedents.
Experimental Thoroughness: ⭐⭐⭐⭐ Includes ablation studies and multi-model comparisons; the release of the LegalHK dataset enhances the academic contribution.
Writing Quality: ⭐⭐⭐⭐ The methodology is clearly described, and appropriate legal terminology is used.
Value: ⭐⭐⭐⭐ Direct application value for legal AI, and the process verification framework exhibits solid transferability.