Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models

Conference: AAAI 2026
arXiv: 2512.18004
Code: github
Area: Multimodal VLM
Keywords: Handwritten document recognition, OCR, vision-language models, legal document translation, low-resource languages

TL;DR

This paper systematically compares traditional OCR+machine translation (OCR-MT) pipelines against vision large language models (vLLMs) on the task of translating handwritten Marathi legal documents into English. The study finds that neither approach meets legal-grade deployment requirements: OCR-MT suffers severely from cascading errors, while vLLMs exhibit critical hallucination issues. Nevertheless, vLLMs demonstrate potential for unified end-to-end processing.

Background & Motivation

State of the Field

The Indian judicial system is among the most complex legal systems in the world. Grassroots courts and police stations still rely heavily on handwritten documents, including First Information Reports (FIRs), case diaries, witness statements, and court proceedings. These documents are critical to criminal and civil proceedings, yet their handwritten and unstructured nature renders filing, retrieval, and analysis extremely difficult.

Limitations of Prior Work

Challenges in handwritten text recognition: Handwriting styles vary widely and writing quality is inconsistent; conventional OCR systems (e.g., Tesseract, EasyOCR, PaddleOCR) perform poorly on handwritten legal documents.

Low-resource language challenges: Indian languages such as Marathi lack large-scale digitized corpora, posing data scarcity problems for both OCR and translation models.

Cascading error propagation: In OCR-MT pipelines, recognition errors from the OCR stage directly degrade downstream translation quality, causing loss of legal semantics.

Specificity of legal terminology: Legal documents contain specialized terminology, official stamps, signatures, and structured tables, increasing the difficulty of recognition and translation.

Root Cause

Digitizing legal documents urgently requires accurate and scalable translation systems, yet both existing approaches, modular OCR-MT and end-to-end vLLMs, face fundamental limitations when handling handwritten legal documents in low-resource languages.

Starting Point

The paper constructs a unified evaluation framework that systematically compares the two paradigms (OCR-MT vs. vLLM) in realistic legal document scenarios, providing actionable baselines and directional guidance for future research.

Method

Overall Architecture

Rather than proposing a novel method, this work constructs a systematic comparative experimental framework to evaluate two major categories of approaches on the task of handwritten Marathi legal document translation.

Key Designs

OCR-MT Pipeline (6 combinations):
- OCR tools: Three OCR engines—Tesseract, EasyOCR, and PaddleOCR
- Translation models: IndicTrans2 (a Transformer encoder-decoder model supporting 22 Indian languages) and Sarvam-1 (a 2B-parameter model optimized for 10 Indian languages)
- Workflow: Scanned document image → OCR text extraction → machine translation → English output
- Design Motivation: The modular architecture facilitates pinpointing performance bottlenecks—whether quality degradation originates in the OCR stage or the translation stage
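
The workflow above can be sketched as two pluggable stages. This is a minimal illustration, not the authors' code: the stage signatures, the stub functions, and the file name `fir_page_01.png` are all assumptions, and real OCR or translation libraries would be wrapped to fit them. The point it shows is that keeping the intermediate OCR text alongside the final translation is what makes it possible to tell which stage a given error came from.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List

# Hypothetical stage signatures: an OCR engine maps an image path to Marathi
# text, and a translator maps Marathi text to English text.
OcrFn = Callable[[str], str]
MtFn = Callable[[str], str]


@dataclass
class PipelineResult:
    ocr_name: str
    mt_name: str
    ocr_text: str      # intermediate output, kept so OCR errors can be inspected
    translation: str   # final English output


def run_pipeline(image_path: str, ocr_name: str, ocr: OcrFn,
                 mt_name: str, mt: MtFn) -> PipelineResult:
    """Scanned image -> OCR text -> machine translation -> English."""
    ocr_text = ocr(image_path)
    translation = mt(ocr_text)
    return PipelineResult(ocr_name, mt_name, ocr_text, translation)


def run_all_combinations(image_path: str,
                         ocr_engines: Dict[str, OcrFn],
                         translators: Dict[str, MtFn]) -> List[PipelineResult]:
    """Run every OCR x MT pairing (3 x 2 = 6 in the setup described above)."""
    return [
        run_pipeline(image_path, o_name, o_fn, m_name, m_fn)
        for (o_name, o_fn), (m_name, m_fn)
        in product(ocr_engines.items(), translators.items())
    ]


# Placeholder stages so the sketch runs without any OCR or MT library installed.
def _stub_ocr(tag: str) -> OcrFn:
    return lambda image_path: f"<{tag} text from {image_path}>"


def _stub_mt(tag: str) -> MtFn:
    return lambda text: f"<{tag} translation of {text}>"


OCR_ENGINES = {name: _stub_ocr(name) for name in ("Tesseract", "EasyOCR", "PaddleOCR")}
TRANSLATORS = {name: _stub_mt(name) for name in ("IndicTrans2", "Sarvam-1")}

# "fir_page_01.png" is a made-up file name used only for illustration.
results = run_all_combinations("fir_page_01.png", OCR_ENGINES, TRANSLATORS)
for r in results:
    print(r.ocr_name, "+", r.mt_name)
```

Because each result carries both `ocr_text` and `translation`, the OCR output can be scored on its own before the translation is judged.
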
vLLM End-to-End Translation (3 model families):
- Model selection: Chitrarth (an Indian-language vision-language bridging model), Maya-8B (a multilingual instruction-tuned model), and Ovis2 (34B int4-quantized and 16B variants)
- Mechanism: Handwritten document images are fed directly into vLLMs, with zero-shot prompting instructing the model to produce English translations
- Design Motivation: Bypasses the intermediate OCR step to avoid cascading errors, leveraging the multimodal reasoning capabilities of vLLMs
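
In contrast to the two-stage pipeline, the end-to-end setting is a single call: image plus instruction in, English out. The sketch below only illustrates that shape. The `VlmFn` interface, the prompt wording, and the file name are assumptions; the source does not quote its prompt, and each real model (Chitrarth, Maya-8B, Ovis2) has its own loading and inference API that would need to be adapted to this signature.

```python
from typing import Callable

# Hypothetical interface: a vision-language model takes an image path and a
# text prompt and returns generated text. Real models expose different APIs.
VlmFn = Callable[[str, str], str]

# Assumed prompt wording, written for illustration only.
ZERO_SHOT_PROMPT = (
    "The image is a handwritten legal document in Marathi. "
    "Translate its full content into English. "
    "Do not add information that is not in the image."
)


def translate_document(image_path: str, model: VlmFn,
                       prompt: str = ZERO_SHOT_PROMPT) -> str:
    """Single-step translation: no intermediate OCR text is produced."""
    return model(image_path, prompt).strip()


# Stand-in model so the sketch runs without loading any weights.
def fake_model(image_path: str, prompt: str) -> str:
    return f"  [English translation of {image_path}]  "


# "fir_page_01.png" is a made-up file name used only for illustration.
output = translate_document("fir_page_01.png", fake_model)
print(output)
```

Note the trade-off this makes visible: with no intermediate text to inspect, a fabricated output cannot be traced to a recognition step, which is part of why hallucination is harder to diagnose in this setting.
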
Evaluation Protocol Design:
- OCR evaluation: Character Error Rate (CER) and Word Error Rate (WER) are used to measure the fidelity of Marathi text extraction
- Translation evaluation: Human evaluation along three dimensions—fluency (grammatical correctness), adequacy (preservation of original meaning), and correctness (alignment with the gold standard)
- Dataset: Approximately 60 scanned PDF Marathi legal documents from real legal sources, translated by native speakers and reviewed by legal language experts
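
For reference, CER and WER are conventionally defined as the edit distance between the OCR output and the reference, divided by the reference length, counted in characters and words respectively. The following is a minimal sketch of that standard definition, not the evaluation script used in the paper. One assumption to flag: characters are counted as Unicode code points, so a Devanagari consonant and its vowel sign count as two characters; a grapheme-based count would give different numbers. The example strings are illustrative and not taken from the dataset.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points (no grapheme clustering)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Illustrative strings only: one wrong consonant in a three-code-point word.
marathi_cer = cer("गाव", "गाय")
latin_cer = cer("village road", "village rod")
latin_wer = wer("village road", "village rod")
print(marathi_cer, latin_cer, latin_wer)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, since insertions are counted against the reference length.
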

Key Experimental Results

Main Results

Method	Representative Model	Handwritten Text Performance	Translation Quality	Main Issues
OCR-MT	EasyOCR + IndicTrans2	Acceptable for print; poor for handwriting	Severely affected by OCR errors	Cascading errors; loss of legal semantics
OCR-MT	PaddleOCR + Sarvam-1	Worst	Mixed-language output	Weakest handwriting support
OCR-MT	Tesseract + IndicTrans2	Moderate	Incomplete translations	Lack of handwriting adaptation
vLLM	Chitrarth	Fails to recognize	Complete hallucination	Generates fictitious meeting content
vLLM	Maya-8B	Partial recognition	Irrelevant output	Misidentifies legal documents as study guides
vLLM	Ovis2-34B (int4)	Partial recognition	Partially correct but fabricated content	Recognizes structure but introduces semantic errors
vLLM	Ovis2-16B	Relatively best	Partial translation	Incomplete and partially incoherent

Ablation Study (OCR Model Comparison)

OCR Model	Print Performance	Handwriting Performance	Overall Assessment
EasyOCR	Good	Moderate (still struggles)	Best among the three
PaddleOCR	Moderate	Poor	Errors in digit and date recognition
Tesseract	Moderate	Poor	Limited low-resource language support

Key Findings

The OCR stage is the primary bottleneck in the OCR-MT pipeline: EasyOCR achieves the best performance among the three OCR tools, yet still fails to handle inconsistent handwriting styles effectively.
Severe error propagation: In the OCR-MT pipeline, "Gaav" (meaning "village") comes out as the transliteration "Gaon" rather than the translation "Village," the kind of error that causes the downstream translation to fail.
Hallucination in vLLMs: Chitrarth generates descriptions of fictitious meetings, including non-existent names, dates, and locations; Maya-8B misidentifies legal documents as study guides and produces irrelevant output.
Structural recognition advantage of vLLMs: The Ovis2 series partially recognizes document structure (e.g., account numbers, names, locations), but content accuracy remains insufficient.
High-stakes nature of legal documents: In the legal domain, vLLM hallucinations pose serious risks because the generated text can sound plausible while being entirely fabricated.

Highlights & Insights

Clear problem definition: The work is grounded in the real needs of the Indian judicial system, targeting a task of genuine practical value.
Comprehensive comparative framework: Covers 9 combinations across the OCR-MT and vLLM paradigms, with rich evaluation dimensions.
Exposes fundamental issues of vLLMs in high-stakes domains: Hallucination is not merely a performance concern but a matter of safety and trustworthiness.
Dataset contribution: A high-quality handwritten Marathi legal document dataset is constructed, translated by native speakers and reviewed by legal experts.
Future research directions: Concrete directions are proposed, including hybrid OCR-vLLM pipelines, domain-specific fine-tuning, and prompt engineering.

Limitations & Future Work

Small dataset scale: Only approximately 60 documents, insufficient to support large-scale quantitative evaluation.
Absence of automatic evaluation metrics: Translation quality is assessed primarily by human evaluators, which limits reproducibility.
No fine-tuning experiments: All vLLMs are evaluated under zero-shot settings; the potential of fine-tuning remains unexplored.
Single language pair: Coverage is limited to Marathi→English; other Indian languages are not addressed.
Hybrid approaches not implemented: Although combining OCR structural cues with vLLM contextual translation is suggested, no corresponding experiments are conducted.
Missing edge deployment analysis: Although the paper expresses concern for low-resource deployment environments, it reports no computational efficiency or model compression experiments.

Related Work & Insights

VISTA-OCR / olmOCR: Introduce generative, layout-aware OCR pipelines that may be better suited to the complex layouts of legal documents.
Nirnayak: A pioneering work on OCR applications in the Indian legal domain, though constrained by OCR error propagation.
TransDocAnalyser: A framework specifically targeting FIR documents, combining FastRCNN+ViT encoders with a BERT decoder.
PLATTER: An end-to-end handwriting OCR framework supporting 10 Indian languages, serving as a potential upgrade to the OCR module in this work.
Insight: A hybrid OCR+vLLM approach—using OCR for structural detection and vLLMs for contextual translation—may represent the most promising direction.
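
One way the hybrid idea could be wired together is sketched below. This is speculative: the source proposes the direction but reports no implementation, so the region labels, the two stage signatures, the prompt text, and the decision to skip stamps and signatures are all assumptions made for illustration. A layout or OCR stage supplies bounding boxes, and a vision-language model translates each region in reading order.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Region:
    box: Box
    kind: str  # e.g. "text", "table", "stamp"; these labels are hypothetical


# Hypothetical stage signatures, not the API of any specific library.
LayoutFn = Callable[[str], List[Region]]      # image path -> detected regions
RegionVlmFn = Callable[[str, Box, str], str]  # image path, box, prompt -> English


def hybrid_translate(image_path: str,
                     detect_layout: LayoutFn,
                     translate_region: RegionVlmFn,
                     skip_kinds: Sequence[str] = ("stamp", "signature"),
                     ) -> List[Tuple[Region, str]]:
    """Use detected structure to scope what the vision-language model sees."""
    regions = detect_layout(image_path)
    # Approximate reading order: top-to-bottom, then left-to-right.
    ordered = sorted(regions, key=lambda r: (r.box[1], r.box[0]))
    translated = []
    for region in ordered:
        if region.kind in skip_kinds:
            continue  # non-textual regions are not sent for translation
        prompt = f"Translate this Marathi {region.kind} region into English."
        translated.append((region, translate_region(image_path, region.box, prompt)))
    return translated


# Stand-in stages so the sketch runs without any model installed.
def fake_layout(image_path: str) -> List[Region]:
    return [
        Region((0, 200, 100, 250), "text"),
        Region((0, 0, 100, 50), "stamp"),
        Region((0, 60, 100, 120), "table"),
    ]


def fake_region_vlm(image_path: str, box: Box, prompt: str) -> str:
    return f"region at y={box[1]}"


# "fir_page_01.png" is a made-up file name used only for illustration.
translated = hybrid_translate("fir_page_01.png", fake_layout, fake_region_vlm)
for region, text in translated:
    print(region.kind, "->", text)
```

Restricting each model call to one region keeps every output tied to a known location on the page, which could make fabricated content easier to spot, though whether this actually reduces hallucination is untested.
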

Rating

Novelty: ⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐