CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation
Conference: ACL 2025
arXiv: 2508.05534
Code: None (JPMorgan AI Research)
Area: Text Generation
Keywords: legal text generation, copy mechanism, decoding strategy, faithfulness, RAG
TL;DR
CoCoLex is a training-free decoding strategy that builds a copy distribution from the Euclidean distance between the current decoding hidden state and the hidden states of context tokens. A confidence score derived from prediction entropy dynamically balances copying from the context against free generation. The method consistently improves faithfulness and correctness across five legal benchmarks, with the largest gains on long-text generation tasks.
Background & Motivation
The legal domain has immense demand for LLMs (e.g., contract drafting, legal research, compliance checks) but requires extremely high accuracy and faithfulness to source documents: any paraphrasing can alter the legal meaning, and inaccurate outputs can lead to legal liability. RAG mitigates hallucinations by retrieving external knowledge, but it does not guarantee that the model effectively uses the retrieved context. Existing context-aware decoding methods (such as CAD) amplify context influence by contrasting probability distributions, but they do not explicitly push the model to copy faithful expressions from the context.
Key Challenge: Legal texts are templated and rely on verbatim quotation, which requires highly accurate copying of precise phrasing, whereas standard autoregressive decoding tends to paraphrase rather than quote. Pointer-generator networks require training a copy gate, making them unsuitable for plug-and-play use with modern LLMs.
Key Insight: The hidden state representations and prediction uncertainty that the LLM already produces during decoding can be used to build a training-free mechanism that interpolates between copying and generation: copy more from the context when the model is uncertain, and generate freely when it is confident.
Method
Overall Architecture
Building upon the standard RAG decoding pipeline, CoCoLex executes the following operations at each decoding step: (1) obtains the model's standard vocabulary distribution \(p_\theta(y_t)\); (2) constructs a copy distribution \(p_{\text{copy}}(y_t)\) using the similarity between the hidden state of the current step and the hidden states of context tokens; (3) computes a prediction entropy-based confidence score \(\lambda_t\); (4) dynamically interpolates the two distributions with \(\lambda_t\) as the weight to obtain the final sampling distribution.
Key Designs
- Copy-based Decoding based on Hidden State Similarity:
  - Function: Constructs a probability distribution that encourages directly copying tokens from the context.
  - Mechanism: When processing context tokens, extract and store the hidden state vectors \(h_i\) of all context tokens along with their corresponding next tokens. At decoding step \(t\), compute the Euclidean (\(L_2\)) distance between the current hidden state \(h_t\) and all context hidden states, convert these to similarity scores \(s_t(i) = \exp(-\text{dist}_t(i))\) via exponential decay, and then sum and normalize all similarity scores mapped to the same vocabulary token \(v\) to obtain \(p_{\text{copy}}(y_t=v)\).
  - Efficiency Optimization: Aggregates only the top-\(k\) most similar context vectors, since low-similarity tokens contribute negligibly.
  - Design Motivation: Hidden state similarity reflects how well what the model internally believes "should be output at the current position" aligns with "what was actually output at a certain position in the context." This requires no extra forward pass, as hidden states are produced anyway during autoregressive decoding.
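The copy distribution described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `copy_distribution`, the toy two-dimensional "hidden states", and the default `k` are assumptions made for illustration, and a real system would operate on final-layer hidden states as batched tensors.

```python
import math
from collections import defaultdict


def copy_distribution(h_t, context_states, next_tokens, k=2):
    """Return {token: probability} built from the k nearest context states.

    h_t            -- current decoding hidden state (sequence of floats)
    context_states -- stored hidden states h_i, one per context position
    next_tokens    -- the token that followed each context position
    k              -- number of nearest context positions to aggregate (assumed value)
    """
    # Euclidean (L2) distance from the current state to every stored state.
    dists = [math.dist(h_t, h_i) for h_i in context_states]
    # Keep only the k closest (most similar) context positions.
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Exponential-decay similarity s_t(i) = exp(-dist), summed per token.
    scores = defaultdict(float)
    for i in nearest:
        scores[next_tokens[i]] += math.exp(-dists[i])
    total = sum(scores.values())
    if total == 0.0:  # every similarity underflowed to zero; nothing to copy
        return {}
    # Normalize so the scores form a probability distribution.
    return {tok: s / total for tok, s in scores.items()}
```

With toy states `[(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]`, next tokens `["shall", "may", "shall"]`, and a query state `(0.0, 0.0)`, the two nearest positions both map to `"shall"`, so with `k=2` all copy mass lands on that token; with `k=3` a small amount of mass goes to `"may"`.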
- Confidence-Guided Interpolation based on Entropy:
  - Function: Dynamically balances the ratio between copying and free generation.
  - Mechanism: Calculates the entropy \(H_t\) of the model's output distribution at each step, normalizes it, and obtains the confidence score \(\lambda_t = \exp(-H_t^{\text{norm}})\) via an exponential transformation. At low entropy (high confidence), \(\lambda_t\) approaches 1, favoring the model's own distribution; at high entropy (low confidence), \(\lambda_t\) shrinks, shifting weight toward the copy distribution.
  - Smoothing Mechanism: Smooths \(\lambda_t\) over a sliding window of recent confidence values to prevent abrupt changes that destabilize generation.
  - Final Distribution: \(p(y_t) = \lambda_t \cdot p_\theta(y_t) + (1-\lambda_t) \cdot p_{\text{copy}}(y_t)\)
  - Design Motivation: The model is more prone to hallucinate when it is uncertain, and should thus "look for answers" in the context; when it is confident (e.g., generating grammatical words or conjunctions), it should maintain free generation to preserve fluency.
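The confidence score and the interpolation can be sketched as follows. Several details are assumptions, since this summary does not specify them: entropy is normalized by the log of the vocabulary size (so \(H_t^{\text{norm}} \in [0, 1]\) and \(\lambda_t \in [e^{-1}, 1]\)), smoothing is a plain mean over the last few values, and the names `confidence`, `interpolate`, and `SmoothedConfidence` are hypothetical.

```python
import math
from collections import deque


def normalized_entropy(probs):
    """Shannon entropy of `probs` divided by log(len(probs)), giving [0, 1].

    Assumes len(probs) > 1; dividing by the log vocabulary size is an
    assumed normalization, not necessarily the one used in the paper.
    """
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def confidence(probs):
    """lambda_t = exp(-H_norm): close to 1 for peaked distributions."""
    return math.exp(-normalized_entropy(probs))


def interpolate(p_model, p_copy, lam):
    """p(y_t) = lam * p_model(y_t) + (1 - lam) * p_copy(y_t), per token."""
    return [lam * pm + (1.0 - lam) * pc for pm, pc in zip(p_model, p_copy)]


class SmoothedConfidence:
    """Mean of the most recent `window` confidence values (assumed scheme)."""

    def __init__(self, window=3):
        self.history = deque(maxlen=window)  # old values drop out automatically

    def update(self, lam):
        self.history.append(lam)
        return sum(self.history) / len(self.history)
```

A peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` yields a higher confidence than a uniform one, so the interpolation keeps more of the model's own distribution; under the normalization assumed here, a uniform distribution gives the minimum value \(e^{-1}\).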
- CoCoLex+ (Full-Document Copy Extension):
  - Function: Expands the copy scope from top-\(k\) retrieved chunks to the entire document.
  - Mechanism: Divides the document into overlapping segments and encodes them separately to obtain the hidden state representations of all tokens. When a token appears in several overlapping segments, its single representation is taken from the segment in which it has the most preceding (autoregressive) context. During inference, only the top-\(k\) retrieved segments are used as explicit textual context, but the copy distribution can draw on the full-document hidden states.
  - Design Motivation: Relevant information in legal documents is often scattered across different paragraphs, and limiting the scope to retrieved chunks can miss critical references.
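To make the "most autoregressive context" rule concrete, the sketch below picks, for each token position, the overlapping segment that gives it the most preceding tokens. The fixed segment length, fixed stride, and the function names are assumptions for illustration; the summary only states that segments overlap and that each token keeps one representation.

```python
def segment_starts(num_tokens, seg_len, stride):
    """Start offsets of overlapping segments that together cover all tokens.

    Assumes 0 < stride <= seg_len so consecutive segments leave no gaps.
    """
    starts = list(range(0, max(num_tokens - seg_len, 0) + 1, stride))
    # Add a final segment if the regular stride left the document tail uncovered.
    if starts[-1] + seg_len < num_tokens:
        starts.append(num_tokens - seg_len)
    return starts


def best_segment_per_token(num_tokens, seg_len, stride):
    """For each token position, the start of the segment to take its state from.

    Inside a segment starting at s, a token at position pos sees pos - s
    preceding tokens, so the earliest covering segment gives the most context.
    """
    starts = segment_starts(num_tokens, seg_len, stride)
    best = []
    for pos in range(num_tokens):
        covering = [s for s in starts if s <= pos < s + seg_len]
        best.append(min(covering))
    return best
```

For a 10-token document with segments of length 4 and stride 2, tokens 4 and 5 are covered by the segments starting at 2 and 4; both take their hidden state from the segment starting at 2, where they have more left context.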
Loss & Training
- Completely Training-Free: A pure decoding-stage strategy that does not modify model parameters.
- Can be stacked and combined with other decoding methods (such as AdaCAD), offering complementary improvements.
Key Experimental Results
Main Results (Five legal benchmarks, two models)
Mistral-7B-Instruct-v0.3 Results:
| Dataset | Metric | Regular | CAD | AdaCAD | CoLex | CoCoLex |
|---|---|---|---|---|---|---|
| CUAD | Cor-AS / Fth-AS | 68.24 / 76.31 | 69.57 / 79.55 | 69.63 / 79.56 | 70.65 / 80.66 | 71.06 / 80.96 |
| OALQA | Cor-AS / Fth-AS | 41.39 / 59.85 | 42.90 / 59.00 | 42.49 / 59.44 | 48.61 / 60.14 | 49.84 / 60.87 |
| ObliQA | Cor-AS / Fth-AS | 73.35 / 90.84 | 71.14 / 89.73 | 71.04 / 89.61 | 85.35 / 93.48 | 86.01 / 95.96 |
| AQuAECHR | Cor-AS / Fth-AS | 52.79 / 89.66 | 49.15 / 89.28 | 48.69 / 89.37 | 59.79 / 91.85 | 60.10 / 92.27 |
| CLERC | Cor-AS / Fth-AS | 42.38 / 74.02 | 34.98 / 66.35 | 35.11 / 66.46 | 54.94 / 78.62 | 58.12 / 79.54 |
Human Evaluation (AQuAECHR, Legal Experts, 5-point scale)
| Method | Correctness | Faithfulness | Fluency | Coherence |
|---|---|---|---|---|
| Regular | 4.40 | 4.24 | 4.88 | 4.88 |
| AdaCAD | 4.24 | 3.84 | 4.80 | 4.84 |
| CoCoLex | 4.64 | 4.44 | 4.96 | 4.92 |
Ablation Study
CoCoLex vs CoLex (the static interpolation version without confidence guidance):
| Configuration | Key Metric | Notes |
|---|---|---|
| CoLex (static interpolation) | Inferior to CoCoLex across all datasets | Static ratios fail to adapt to copying needs of different tokens |
| CoCoLex (dynamic interpolation) | Consistently optimal | Confidence guidance allows deterministic tokens like entity names to retain the model's choices, while prioritizing copying for uncertain tokens |
| AdaCAD + CoCoLex (Ada + CoCo) | Gains on CUAD, drops on CLERC/AQuAECHR | Complementarity depends on whether AdaCAD itself is effective on the task |
Inference Time Comparison (Relative to Regular Decoding)
| Method | CUAD | AQuAECHR |
|---|---|---|
| Regular | 1.00x | 1.00x |
| CAD | 1.75x | 1.71x |
| CoCoLex | 1.51x | 1.62x |
| CoCoLex+ | 1.96x | 2.96x |
Key Findings
- CAD/AdaCAD are effective in short-text generation (CUAD), but they instead reduce correctness and faithfulness in long-text generation (CLERC, AQuAECHR) and severely damage fluency and coherence.
- CoCoLex exhibits its most prominent advantage in long-text generation tasks, where a higher proportion of tokens in lengthy expressions require precise copying.
- The domain-specialized legal model Saul is unexpectedly outperformed by general-purpose Mistral; however, CoCoLex substantially compensates for Saul's deficiencies (e.g., boosting Saul's Cor-AS from 32.26 to 52.74 on OALQA).
- CoCoLex runs at roughly 1.5x to 1.6x the cost of regular decoding (1.51x on CUAD, 1.62x on AQuAECHR), below CAD's roughly 1.7x (1.75x and 1.71x), because it does not require an extra context-free forward pass.
- CoCoLex+ brings large improvements to Saul on CLERC (Cor-AS from 34.95 to 44.91), indicating that weaker models benefit more from full-document copying.
Highlights & Insights
- Using hidden state distance as a copy signal is an elegant design: instead of relying on attention weights (where attention patterns exhibit large discrepancies across different layers), it utilizes the Euclidean distance of the final-layer hidden states, reflecting more directly "what token is required at the current position".
- The key intuition behind confidence guidance: model uncertainty \(\approx\) potential hallucination \(\approx\) look for answers in the context. Empirically, CoCoLex consistently outperforms CoLex (static interpolation), which supports the value of dynamic adjustment.
- Being training-free makes it plug-and-play for any RAG pipeline: it requires no parameter modification or training data collection, which is highly valuable for real-world deployment.
- The full-document copy mechanism of CoCoLex+ is particularly suited for legal scenarios, where legal documents are often dozens of pages long with key references scattered throughout.
- The improvement for weaker models is especially large, suggesting that in resource-constrained scenarios the copy mechanism can serve as a cost-effective way to improve the correctness and faithfulness of smaller models.
Limitations & Future Work
- Experiments are conducted under an Oracle document setup (assuming perfect retrieval), leaving the robustness under real-world retrieval noise unevaluated.
- It does not handle scenarios requiring cross-document reasoning, multi-source synthesis, or resolving contradictory precedents; it copies from the context but does not reason over it.
- The copying granularity is restricted to the token level; phrase- or clause-level copying units remain unexplored.
- The full-document encoding in CoCoLex+ incurs substantial inference overhead on extremely long documents (2.96x on AQuAECHR).
- Validated only in the legal domain; effectiveness in general domains is yet to be explored.
Related Work & Insights
- vs Pointer Generator Networks (See et al. 2017): PGN requires learning a copy gating mechanism, whereas CoCoLex is training-free and directly leverages decoding hidden states.
- vs CAD (Shi et al. 2023): CAD amplifies contextual influence by contrasting context-aware and context-free logits but does not copy explicitly; CoCoLex explicitly constructs a copy distribution, yielding greater improvements in faithfulness.
- vs kNN-LM (Khandelwal et al. 2019): kNN-LM retrieves from an external datastore of the pre-training corpus; CoCoLex retrieves from the current context, targeting faithfulness rather than perplexity.
- vs AdaCAD (Wang et al. 2024): AdaCAD dynamically adjusts contrastive strength without explicit copying; its mechanism is complementary to CoCoLex, making combined use effective in specific scenarios.
Rating
- Novelty: ⭐⭐⭐⭐ Blends the retrieval concept of kNN-LM with the copying concept of Pointer Networks into a training-free decoding strategy, featuring an elegantly designed confidence-guided dynamic interpolation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage across five legal benchmarks, two models, human evaluations, CoCoLex+ extensions, combination experiments, and inference time analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear methodological descriptions, complete formal derivations, and detailed experimental analyses.
- Value: ⭐⭐⭐⭐ Addresses highly demanding practical needs in legal AI, with the plug-and-play nature of the training-free decoding strategy making it highly valuable for deployment.