LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues

Conference: ACL 2026 arXiv: 2604.19464 Code: None Area: Legal NLP / Interpretability Keywords: Legal issue relevance assessment, neuro-symbolic reasoning, feature selection, legal AI, structured factor classification

TL;DR

This paper proposes LePREC, a neuro-symbolic framework inspired by the analytical process of legal professionals. It uses LLMs to generate reasoning question–answer pairs that convert unstructured legal text into structured features, which are then fed into a sparse linear model for relevance classification. On the LIC dataset, constructed from 769 Malaysian contract law cases, LePREC achieves a 30–40% relative improvement over LLM baselines such as GPT-4o.

Background & Motivation

Background: More than half of the global population struggles to meet their civil justice needs. Within the IRAC (Issue-Rule-Application-Conclusion) framework, legal issue identification is a critical first step, encompassing both the generation of candidate legal issues and the assessment of their relevance. Although LLMs have demonstrated strong language capabilities, their precision in real-world legal settings remains insufficient.

Limitations of Prior Work: Existing legal AI benchmarks are largely confined to simplified or synthetic scenarios (e.g., textbook cases) and lack expert-annotated datasets grounded in real court decisions. Directly applying GPT-4o to legal issue relevance assessment yields only 62% precision, as LLMs fail to distinguish between issues that are "factually related" and those that "genuinely concern the core dispute of the case."

Key Challenge: When assessing relevance, legal professionals must consider multiple layers of context—jurisdictional constraints, procedural background, and case-specific factors—whereas LLMs tend to perform superficial fact matching and lack deep legal reasoning capacity. End-to-end "black-box" approaches cannot produce such fine-grained judgments.

Goal: (1) Construct LIC, the first legal issue relevance assessment dataset grounded in real court cases; (2) Propose LePREC, a data-efficient and interpretable neuro-symbolic framework that reformulates legal reasoning as statistical classification over structured factors.

Key Insight: The authors observe that legal professionals follow a two-stage analytical process—first identifying key analytical factors (brainstorming), then weighing those factors to reach a judgment. This decomposition naturally maps onto the neuro-symbolic paradigm: the neural component extracts factors, while the symbolic component performs the weighing and reasoning.

Core Idea: Legal issue relevance assessment is reformulated from "evaluating the fact–issue relationship" to "classifying factor–issue relevance." LLMs generate binary reasoning questions as structured features, and a sparse linear model learns explicit algebraic weights, enabling interpretable and data-efficient relevance judgments.

Method

Overall Architecture

LePREC consists of two stages: (1) the neural component, in which an LLM generates binary reasoning questions from fact–issue pairs and computes answer probabilities, converting unstructured legal text into structured feature vectors; and (2) the symbolic component, in which a sparse linear model is applied over the structured features to learn explicit weights for relevance classification. The input is a (fact set, candidate legal issue) pair, and the output is a binary relevance label (Relevant/Irrelevant).
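
As a concrete illustration, here is a minimal Python sketch of this two-stage interface. All names are hypothetical and the LLM verifier is stubbed out; this shows the shape of the pipeline, not the authors' implementation.

```python
import numpy as np

def answer_probability(question: str, facts: list[str], issue: str) -> float:
    """Stub for the generative verifier G_q: the probability that the LLM
    answers 'yes' to a binary reasoning question about a (facts, issue) pair."""
    raise NotImplementedError  # replace with an actual LLM call

def extract_features(question_pool: list[str], facts: list[str], issue: str) -> np.ndarray:
    """Neural stage: map an unstructured (facts, issue) pair to a structured
    feature vector f = G_Q(X, Y_j) in R^h, one answer probability per question."""
    return np.array([answer_probability(q, facts, issue) for q in question_pool])

def predict_relevance(w: np.ndarray, f: np.ndarray) -> int:
    """Symbolic stage: explicit linear rule y_hat = sign(w^T f);
    +1 = Relevant, -1 = Irrelevant."""
    return 1 if float(w @ f) > 0 else -1
```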

Key Designs

  1. LIC Dataset Construction and Incremental Issue Generation:

    • Function: Provides the first benchmark for legal issue relevance assessment grounded in real court cases.
    • Mechanism: Facts and issues are extracted from 769 Malaysian contract law cases using GPT-4o. To increase candidate issue diversity, an incremental generation strategy is adopted: given a fact list \(\mathbf{X}=\{\mathbf{x}_1,\ldots,\mathbf{x}_m\}\), issues are generated by progressively adding facts, \(\hat{\mathcal{Y}}=\bigcup_{i=1}^{m}\hat{\mathcal{Y}}_i\), where \(\hat{\mathcal{Y}}_i\) denotes the issues generated from the first \(i\) facts, rather than providing all facts at once (see the sketch after this list). Relevance labels are assigned by senior legal experts, with Fleiss' \(\kappa = 0.659\).
    • Design Motivation: By varying the contextual "depth," the approach encourages the LLM to attend to different combinations of facts, uncovering subtle candidate issues that single-pass generation may miss. The incremental method outperforms baselines on quality metrics (FBD, EMBD) and diversity metrics (Self-BLEU, Distinct-N).
  2. Neural Component: Reasoning Question Generation and Answering:

    • Function: Converts unstructured legal text into structured symbolic features.
    • Mechanism: For each fact–issue pair in LIC\(_U\), the LLM generates binary reasoning questions, which are accumulated into a shared question pool \(\mathcal{Q}\) (totaling 2,486 questions). For each question \(q_t \in \mathcal{Q}\), a generative verifier computes an answer probability \(G_{q_t}(\mathbf{X}, \hat{Y}_j) \in (0,1)\), which is collected into a feature vector \(\mathbf{f} = G_{\mathcal{Q}}(\mathbf{X}, \hat{Y}_j) \in \mathbb{R}^h\).
    • Design Motivation: Probability scores are used instead of direct binary answers, as preliminary experiments show that direct answers are unreliable. Continuous probability information proves critical for classification, consistently outperforming binary label variants.
  3. Symbolic Component: Relevance-Aware Linear Prediction:

    • Function: Achieves interpretable relevance classification through explicit algebraic operations.
    • Mechanism: Prediction is performed as \(\hat{y}_j = \text{sign}(\mathbf{w}^\top \mathbf{f})\). The linear model performs relevance-aware feature weighting via learned coefficients: it automatically down-weights noisy or redundant features (addressing the challenge of conflicting signals from semantically similar questions) and applies adaptive weighting to domain-specific questions rather than globally discarding them (addressing the challenge of narrow-domain questions introducing noise in irrelevant cases).
    • Design Motivation: Linear models combine symbolic interpretability (explicit weight coefficients and transparent algebraic composition) with practical advantages (high data efficiency and parameter counts comparable to the training set size), while also supporting statistical analysis of reasoning question contributions.
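
Below is the sketch referenced in design 1: a minimal rendering of the incremental generation strategy \(\hat{\mathcal{Y}}=\bigcup_{i=1}^{m}\hat{\mathcal{Y}}_i\), with the per-prefix LLM call left as a hypothetical stub.

```python
def generate_issues(facts: list[str]) -> set[str]:
    """Stub for a single LLM call that proposes candidate legal issues
    from a list of facts."""
    raise NotImplementedError

def incremental_issue_generation(facts: list[str]) -> set[str]:
    """Union the issues generated from progressively longer fact prefixes:
    Y_hat = Y_hat_1 ∪ ... ∪ Y_hat_m, where Y_hat_i uses facts x_1..x_i."""
    candidates: set[str] = set()
    for i in range(1, len(facts) + 1):
        candidates |= generate_issues(facts[:i])  # issues from the first i facts
    return candidates
```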

Loss & Training

The neural component uses GPT-4o for question generation; this process is model-agnostic, as subsequent sparse feature selection automatically retains the most predictive factors. The symbolic component employs standard linear classifiers (SVC, LR, Ridge, etc.) trained via 5-fold stratified cross-validation on LIC\(_L\); L1-regularized variants are used for the feature selection experiments.
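
A sketch of how this training setup might look with scikit-learn, assuming the verifier probabilities have already been precomputed into a feature matrix (file names and hyperparameters are illustrative, not the paper's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# F: (n_pairs, h) matrix of verifier answer probabilities; y: relevance labels.
F = np.load("lic_features.npy")  # hypothetical precomputed feature matrix
y = np.load("lic_labels.npy")

classifiers = {
    "SVC": LinearSVC(C=1.0),
    "LR": LogisticRegression(max_iter=5000),
    "L1-LR": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    scores = []
    for train_idx, test_idx in cv.split(F, y):
        clf.fit(F[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], clf.predict(F[test_idx])))
    print(f"{name}: mean F1 = {np.mean(scores):.4f}")
```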

Key Experimental Results

Main Results

RQ1: SOTA LLM Baselines (Direct Judgment)

Method F1 Accuracy Precision Recall
Claude 54.55 70.91 66.00 56.19
GPT-4o 57.80 70.91 64.46 58.07
GenQwen 63.70 68.59 63.84 63.92
LegalBERT 52.31 41.28 52.10 50.79

RQ2: LePREC Framework (Neural + Symbolic)

Method F1 Accuracy Precision Recall
SVC\(_\text{Phi}\) 80.19 82.66 79.67 81.01
LR\(_\text{Phi}\) 79.70 82.49 79.58 80.05
Ridge\(_\text{Phi}\) 80.10 82.91 80.06 80.28
L1Reg\(_\text{Phi}\) 80.01 83.34 81.13 79.32
LDA\(_\text{Phi}\) 79.56 83.50 81.77 78.39

Ablation Study

Configuration F1 Notes
Linear models (SVC/LR/Ridge) 79.70–80.19% Best; consistent and stable
Tree/distance models (RF/KNN) 74–75% Slightly lower but competitive
Deep learning (Transformer/FFN) 75.44/75.65% Nonlinearity provides no additional gain
LLM-Select feature selection 45–58% Fails; LLMs cannot identify predictive questions
L1 SVC feature selection 77.60% Only 2.5 percentage-point drop

Key Findings

  • LePREC achieves approximately 16.5 percentage points of F1 improvement over the best LLM baseline (GenQwen 63.70%), reaching 80.19%.
  • Linear models (SVC, LR, Ridge) perform most consistently across all classifiers (79.70–80.19% F1), demonstrating that simple linear weighting suffices to capture legal reasoning patterns.
  • Stability analysis reveals no universal "golden question set": only 0.04–0.53% of features are selected by L1 LR consistently across all folds, and the feature sets chosen by L1 LR and L1 SVC overlap by only 38% (see the sketch after this list).
  • Interviews with legal practitioners confirm that lawyers do not reason from fixed checklists but draw judgments from a broad, context-sensitive pool of analytical factors.
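
The overlap statistic above can be reproduced in spirit with a few lines. In this sketch, the selected features are the nonzero-coefficient indices of two fitted L1-regularized models, and Jaccard similarity is an assumed choice of overlap metric:

```python
import numpy as np

def selected_features(coef: np.ndarray, tol: float = 1e-8) -> set[int]:
    """Indices of reasoning questions kept (nonzero weight) by an
    L1-regularized linear model."""
    return set(np.flatnonzero(np.abs(coef).ravel() > tol))

def feature_overlap(a: set[int], b: set[int]) -> float:
    """Jaccard overlap between two selected-feature sets (assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Usage with hypothetical fitted models:
# feature_overlap(selected_features(l1_lr.coef_), selected_features(l1_svc.coef_))
```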

Highlights & Insights

  • Reformulating legal reasoning as statistical classification over structured factors is a conceptually elegant application of the neuro-symbolic paradigm to legal AI, simultaneously achieving interpretability and high performance.
  • The finding that "no universal core question set exists" is supported by both quantitative evidence (feature selection instability) and qualitative evidence (practitioner interviews), revealing a fundamental characteristic of legal reasoning.
  • The question generation process is model-agnostic—sparse feature selection automatically filters model-specific noise—giving the framework strong generalizability.

Limitations & Future Work

  • The dataset focuses exclusively on Malaysian contract law (a Commonwealth legal system) and has not been validated in other legal traditions such as civil law systems.
  • The framework relies on LLMs to generate reasoning questions; alternative question acquisition methods may yield new insights.
  • The linear model assumes that a linear combination can capture relevance patterns, and extracting high-level insights from detailed weight distributions requires careful analysis.
  • Deployment in actual legal practice requires additional validation to mitigate potential biases.

Comparison with Prior Approaches

  • vs. Direct LLM Judgment (GPT-4o/Claude): Direct LLM judgment achieves only 55–58% F1, whereas LePREC reaches 80% F1 by decomposing the reasoning process, demonstrating that structured approaches substantially outperform end-to-end black-box methods.
  • vs. LegalBERT: Legal pre-trained models exhibit high variance due to insufficient training data (F1 = 52.31±13.4); LePREC addresses this through data-efficient linear models.
  • vs. GCI (causal inference approach): GCI's strict causal discovery over-constrains the feature space, whereas LePREC's correlation-based approach preserves a broader set of signals.

Rating

  • Novelty: ⭐⭐⭐⭐ The reformulation of legal reasoning as structured factor classification is novel, and the neuro-symbolic decomposition aligns well with legal practice.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three RQs are systematically addressed, with comparisons across 14 classifiers, stability analysis, and practitioner interviews—remarkably comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ The paper is clearly structured with rigorous logic and a well-motivated experimental progression.
  • Value: ⭐⭐⭐⭐ The work introduces a new interpretable and data-efficient paradigm for legal AI, and the LIC dataset fills an important gap in the field.