Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models
Conference: AAAI 2026
arXiv: 2512.18004
Code: github
Area: Multimodal VLM
Keywords: Handwritten document recognition, OCR, vision-language models, legal document translation, low-resource languages
TL;DR
This paper systematically compares traditional OCR+machine translation (OCR-MT) pipelines against vision large language models (vLLMs) on the task of translating handwritten Marathi legal documents into English. The study finds that neither approach meets legal-grade deployment requirements: OCR-MT suffers severely from cascading errors, while vLLMs exhibit critical hallucination issues. Nevertheless, vLLMs demonstrate potential for unified end-to-end processing.
Background & Motivation
State of the Field
The Indian judicial system is among the most complex legal systems in the world. Grassroots courts and police stations still rely heavily on handwritten documents, including First Information Reports (FIRs), case diaries, witness statements, and court proceedings. These documents are critical to criminal and civil proceedings, yet their handwritten and unstructured nature renders filing, retrieval, and analysis extremely difficult.
Limitations of Prior Work
- Challenges in handwritten text recognition: Handwriting styles vary widely and writing quality is inconsistent; conventional OCR systems (e.g., Tesseract, EasyOCR, PaddleOCR) perform poorly on handwritten legal documents.
- Low-resource language challenges: Indian languages such as Marathi lack large-scale digitized corpora, posing data scarcity problems for both OCR and translation models.
- Cascading error propagation: In OCR-MT pipelines, recognition errors from the OCR stage directly degrade downstream translation quality, causing loss of legal semantics.
- Specificity of legal terminology: Legal documents contain specialized terminology, official stamps, signatures, and structured tables, increasing the difficulty of recognition and translation.
Root Cause
The digitization of legal documents urgently requires accurate and scalable translation systems, yet existing technical approaches—whether modular OCR-MT or end-to-end vLLMs—face fundamental limitations when handling handwritten, low-resource language legal documents.
Starting Point
The paper constructs a unified evaluation framework that systematically compares the two paradigms (OCR-MT vs. vLLM) in realistic legal document scenarios, providing actionable baselines and directional guidance for future research.
Method
Overall Architecture
Rather than proposing a novel method, this work constructs a systematic comparative experimental framework to evaluate two major categories of approaches on the task of handwritten Marathi legal document translation.
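As a rough sketch of that comparison grid, the two paradigms can be viewed as two ways of turning a page image into English text. The engine and model names below are the ones evaluated in the paper; everything else (the stub factories `make_ocr`, `make_mt`, `make_vllm`, the `PROMPT` wording, and the `Image` alias) is a hypothetical placeholder for illustration and does not call any real OCR or model API.

```python
from itertools import product
from typing import Callable, Dict

Image = bytes  # hypothetical stand-in for a scanned page image

# Hypothetical stubs: each returns a tagged string so the data flow is visible.
def make_ocr(name: str) -> Callable[[Image], str]:
    return lambda image: f"<marathi text from {name}>"

def make_mt(name: str) -> Callable[[str], str]:
    return lambda marathi: f"<english via {name} of {marathi}>"

def make_vllm(name: str) -> Callable[[Image, str], str]:
    return lambda image, prompt: f"<english from {name}>"

OCR_ENGINES = {n: make_ocr(n) for n in ("Tesseract", "EasyOCR", "PaddleOCR")}
MT_MODELS = {n: make_mt(n) for n in ("IndicTrans2", "Sarvam-1")}
VLLMS = {n: make_vllm(n) for n in ("Chitrarth", "Maya-8B", "Ovis2")}

# Illustrative zero-shot instruction; the paper's exact prompt is not given here.
PROMPT = "Translate this handwritten Marathi legal document into English."

def ocr_mt(ocr: Callable[[Image], str], mt: Callable[[str], str]) -> Callable[[Image], str]:
    """Cascade: whatever the OCR stage emits, errors included, feeds the translator."""
    return lambda image: mt(ocr(image))

def end_to_end(vllm: Callable[[Image, str], str]) -> Callable[[Image], str]:
    """Single stage: the image and the prompt go straight to the model."""
    return lambda image: vllm(image, PROMPT)

def build_systems() -> Dict[str, Callable[[Image], str]]:
    systems: Dict[str, Callable[[Image], str]] = {}
    for (o_name, o), (m_name, m) in product(OCR_ENGINES.items(), MT_MODELS.items()):
        systems[f"{o_name} + {m_name}"] = ocr_mt(o, m)
    for v_name, v in VLLMS.items():
        systems[v_name] = end_to_end(v)
    return systems

if __name__ == "__main__":
    page = b"scanned-page-bytes"
    for name, system in build_systems().items():
        print(name, "->", system(page))
```

Writing the cascade as `mt(ocr(image))` makes the error-propagation concern concrete: the translator never sees the image, only the OCR text, so it cannot recover from a misread character.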
Key Designs
- OCR-MT Pipeline (3 OCR engines × 2 translation models = 6 combinations):
    - OCR tools: Tesseract, EasyOCR, and PaddleOCR
    - Translation models: IndicTrans2 (a Transformer encoder-decoder model supporting 22 Indian languages) and Sarvam-1 (a 2B-parameter model optimized for 10 Indian languages)
    - Workflow: Scanned document image → OCR text extraction → machine translation → English output
    - Design Motivation: The modular architecture makes it easier to tell whether quality degradation originates in the OCR stage or the translation stage
- vLLM End-to-End Translation (3 model families):
    - Model selection: Chitrarth (an Indian-language vision-language bridging model), Maya-8B (a multilingual instruction-tuned model), and Ovis2 (34B int4-quantized and 16B variants)
    - Mechanism: Handwritten document images are fed directly into vLLMs, with zero-shot prompting instructing the model to produce English translations
    - Design Motivation: Bypasses the intermediate OCR step to avoid cascading errors, leveraging the multimodal reasoning capabilities of vLLMs
- Evaluation Protocol Design:
    - OCR evaluation: Character Error Rate (CER) and Word Error Rate (WER) are used to measure the fidelity of Marathi text extraction
    - Translation evaluation: Human evaluation along three dimensions: fluency (grammatical correctness), adequacy (preservation of original meaning), and correctness (alignment with the gold standard)
    - Dataset: Approximately 60 scanned PDF Marathi legal documents from real legal sources, translated by native speakers and reviewed by legal language experts
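The two OCR metrics named in the evaluation protocol can be sketched as follows. This is a minimal illustration that assumes the standard definitions (Levenshtein edit distance divided by reference length, over characters for CER and over whitespace-separated words for WER); the summary does not state which implementation or text normalization the paper uses, and the function names are illustrative.

```python
from typing import Sequence

def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over Unicode code points.

    Caveat: Python strings iterate by code point, so a Devanagari vowel sign
    counts as its own character; some evaluations use grapheme clusters instead.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

if __name__ == "__main__":
    # Made-up strings for illustration only, not taken from the dataset.
    gold = "गाव मोठे आहे"
    ocr_output = "गाव मोठ आहे"
    print(f"CER = {cer(gold, ocr_output):.3f}")
    print(f"WER = {wer(gold, ocr_output):.3f}")
```

Both rates can exceed 1.0 when the OCR output contains many spurious insertions, so they are better read as error ratios than as percentages of the text that is wrong.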
Key Experimental Results
Main Results
| Method | Representative Model | Handwritten Text Performance | Translation Quality | Main Issues |
|---|---|---|---|---|
| OCR-MT | EasyOCR + IndicTrans2 | Acceptable for print; poor for handwriting | Severely affected by OCR errors | Cascading errors; loss of legal semantics |
| OCR-MT | PaddleOCR + Sarvam-1 | Worst | Mixed-language output | Weakest handwriting support |
| OCR-MT | Tesseract + IndicTrans2 | Moderate | Incomplete translations | Lack of handwriting adaptation |
| vLLM | Chitrarth | Fails to recognize | Complete hallucination | Generates fictitious meeting content |
| vLLM | Maya-8B | Partial recognition | Irrelevant output | Misidentifies legal documents as study guides |
| vLLM | Ovis2-34B (int4) | Partial recognition | Partially correct but fabricated content | Recognizes structure but introduces semantic errors |
| vLLM | Ovis2-16B | Relatively best | Partial translation | Incomplete and partially incoherent |
Ablation Study (OCR Model Comparison)
| OCR Model | Print Performance | Handwriting Performance | Overall Assessment |
|---|---|---|---|
| EasyOCR | Good | Moderate (still struggles) | Best among the three |
| PaddleOCR | Moderate | Poor | Errors in digit and date recognition |
| Tesseract | Moderate | Poor | Limited low-resource language support |
Key Findings
- The OCR stage is the primary bottleneck in the OCR-MT pipeline: EasyOCR achieves the best performance among the three OCR tools, yet still fails to handle inconsistent handwriting styles effectively.
- Severe error propagation: In one example, the OCR-MT pipeline transliterates "Gaav" (meaning "village") as "Gaon" rather than translating it as "Village," and the error causes the downstream translation to fail completely.
- Hallucination in vLLMs: Chitrarth generates descriptions of fictitious meetings, including non-existent names, dates, and locations; Maya-8B misidentifies legal documents as study guides and produces irrelevant output.
- Structural recognition advantage of vLLMs: The Ovis2 series partially recognizes document structure (e.g., account numbers, names, locations), but content accuracy remains insufficient.
- High-stakes nature of legal documents: In the legal domain, vLLM hallucinations pose serious risks—generating plausible-sounding yet entirely fabricated text.
Highlights & Insights
- Clear problem definition: The work is grounded in the real needs of the Indian judicial system, targeting a task of genuine practical value.
- Comprehensive comparative framework: Covers 6 OCR-MT combinations and 3 vLLM model families, with OCR fidelity measured by CER/WER and translation quality judged by humans on fluency, adequacy, and correctness.
- Exposes fundamental issues of vLLMs in high-stakes domains: Hallucination is not merely a performance concern but a matter of safety and trustworthiness.
- Dataset contribution: A high-quality handwritten Marathi legal document dataset is constructed, translated by native speakers and reviewed by legal experts.
- Future research directions: Concrete directions are proposed, including hybrid OCR-vLLM pipelines, domain-specific fine-tuning, and prompt engineering.
Limitations & Future Work
- Small dataset scale: Only approximately 60 documents, insufficient to support large-scale quantitative evaluation.
- Absence of automatic translation metrics: CER and WER cover only the OCR stage; translation quality relies primarily on human assessment, limiting reproducibility.
- No fine-tuning experiments: All vLLMs are evaluated under zero-shot settings; the potential of fine-tuning remains unexplored.
- Single language pair: Coverage is limited to Marathi→English; other Indian languages are not addressed.
- Hybrid approaches not implemented: Although combining OCR structural cues with vLLM contextual translation is suggested, no corresponding experiments are conducted.
- Missing edge deployment analysis: Although the paper expresses concern for low-resource deployment environments, it reports no computational efficiency or model compression experiments.
Related Work & Insights
- VISTA-OCR / olmOCR: Generative, layout-aware OCR pipelines that may be better suited to the complex layouts of legal documents.
- Nirnayak: A pioneering work on OCR applications in the Indian legal domain, though constrained by OCR error propagation.
- TransDocAnalyser: A framework specifically targeting FIR documents, combining FastRCNN+ViT encoders with a BERT decoder.
- PLATTER: An end-to-end handwriting OCR framework supporting 10 Indian languages, serving as a potential upgrade to the OCR module in this work.
- Insight: A hybrid OCR+vLLM approach—using OCR for structural detection and vLLMs for contextual translation—may represent the most promising direction.
Rating
- Novelty: ⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐