Do Language Models Understand Honorific Systems in Javanese?
Conference: ACL 2025
arXiv: 2502.20864
Code: https://github.com/JavaneseHonorifics/Unggah-Ungguh
Area: LLM/NLP - Low-Resource Language Evaluation
Keywords: honorifics, Javanese, low-resource, Unggah-Ungguh, linguistic evaluation
TL;DR
This work constructs the first Javanese honorific corpus, Unggah-Ungguh (4,024 sentences covering four honorific levels), and systematically evaluates the capability of LLMs to understand the Javanese honorific system across four tasks: classification, style transfer, cross-lingual translation, and dialogue generation. The results reveal that even the strongest closed-source model (GPT-4o) achieves a zero-shot classification accuracy of only 53.5% and shows a severe bias toward specific honorific levels.
Background & Motivation
Background: Javanese has over 98 million speakers. One of its core features is a complex honorific system known as Unggah-Ungguh Basa, which comprises four levels: Ngoko (most informal), Ngoko Alus (semi-formal), Krama (formal), and Krama Alus (most formal). The choice of level depends on the social relationships among the speaker, the listener, and the referent.
Limitations of Prior Work: (1) Existing Javanese corpora are severely imbalanced across honorific levels, with most text in Ngoko; (2) no Javanese corpus is annotated with honorific levels for NLP tasks. The gap matters in practice: as LLMs are increasingly deployed as personal assistants, their ability to understand and generate appropriate honorifics directly affects cultural sensitivity and user trust.
Key Challenge: The honorific system requires models to capture not only semantics but also pragmatic information such as social hierarchy, conversational roles, and situational context. This is hard for existing models, particularly in low-resource settings.
Goal: To systematically evaluate the capabilities of LLMs in understanding and generating Javanese across the four honorific levels, and to identify their biases and limitations.
Key Insight: Build a balanced honorific corpus and design four benchmark tasks covering both comprehension and generation.
Core Idea: By constructing the first balanced corpus annotated with four Javanese honorific levels and proposing four evaluation tasks, this work reveals that current LLMs suffer from a severe lack of understanding of complex honorific systems.
Method
Overall Architecture
Construct the Unggah-Ungguh corpus → Design four evaluation tasks → Perform comparative evaluation using both fine-tuned models and zero-/few-shot general-purpose models. The fine-tuned models are used for classification tasks and serve as automatic evaluation tools for subsequent tasks. General-purpose models include closed-source (GPT-4o, Gemini 1.5 Pro) and open-source (Llama 3.1 8B, Gemma2 9B, Sailor2 8B, SahabatAI) models.
Key Designs
- Unggah-Ungguh Corpus Construction:
  - Function: Manually curate a labeled corpus of 4,024 sentences from four authoritative reference books, such as Kamus Unggah-Ungguh Basa Jawa.
  - Mechanism: Since the original sources were not digitized, the pipeline involved scanning → OCR → two-stage native-speaker correction. The second stage, an independent audit, identified and corrected 58 errors (1.5%). The final corpus has a Shannon entropy of 1.88 over its honorific labels, higher than that of nine existing datasets, making it the most balanced of those compared.
  - Design Motivation: Existing Javanese corpora have highly imbalanced honorific distributions (mostly concentrated in Ngoko), making it impossible to fairly evaluate model capabilities.
- Task 1: Honorific Level Classification:
  - Function: Classify the input text into one of the four honorific levels.
  - Mechanism: Fine-tune Javanese BERT, DistilBERT, and GPT-2, and compare them against LSTM and dictionary-based (rule-based) baselines. Javanese DistilBERT achieves the highest accuracy, 95.65%, and is used as the automatic evaluator for Task 4.
  - Design Motivation: To evaluate the models' ability to recognize honorific levels, a fundamental step in understanding the Javanese honorific system.
- Task 2: Honorific Style Transfer:
  - Function: Rewrite a given text from one honorific level into another (e.g., Ngoko → Krama Alus).
  - Mechanism: Prompt models zero-shot and evaluate whether the output changes the honorific level while preserving the meaning.
  - Design Motivation: Honorific transfer requires precise lexical substitution and grammatical adjustments, serving as a direct test of the model's depth of understanding of the honorific system.
- Task 3: Cross-Lingual Honorific Translation:
  - Function: Translate between Javanese and Indonesian at specific honorific levels.
  - Mechanism: Indonesian lacks an explicit honorific system, whereas Javanese has a rich honorific hierarchy. The KL divergence between the vocabulary distributions of the two languages is as high as 2.26, indicating a large lexical discrepancy.
  - Design Motivation: To test whether models can preserve honorific information in cross-lingual scenarios.
- Task 4: Dialogue Generation:
  - Function: Generate dialogues that use appropriate honorifics given the social status of two speakers (e.g., student and teacher) and the context.
  - Mechanism: Manually design 160 evaluation scenarios and use the fine-tuned DistilBERT to automatically assess whether the generated text uses the correct honorific level.
  - Design Motivation: This is the most challenging task: models must simultaneously track role relationships, apply honorific rules, and maintain conversational coherence.
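The balance figure quoted for the corpus (Shannon entropy of 1.88) can be illustrated with a short sketch. It assumes the entropy is taken over the four honorific labels and measured in bits; with four labels the natural-log maximum is ln 4 ≈ 1.39, so a value of 1.88 only fits a base-2 scale, where the maximum is 2.0. The label lists below are invented toy data, not the actual corpus counts.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy data (hypothetical): a perfectly balanced set and a Ngoko-heavy set.
balanced = ["ngoko", "ngoko_alus", "krama", "krama_alus"] * 25
skewed = (["ngoko"] * 85 + ["ngoko_alus"] * 5
          + ["krama"] * 5 + ["krama_alus"] * 5)

balanced_entropy = shannon_entropy(balanced)  # upper bound for four classes
skewed_entropy = shannon_entropy(skewed)      # much lower when one level dominates

print(f"balanced: {balanced_entropy:.2f} bits")
print(f"skewed:   {skewed_entropy:.2f} bits")
```

The closer the value is to 2.0, the more evenly the four levels are represented, which is why a higher entropy is read as a more balanced corpus.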
Key Experimental Results
Main Results (Task 1: Honorific Classification)
| Model | Accuracy | F1 |
|---|---|---|
| Dictionary-Based | 88.37 | 88.64 |
| LSTM | 93.47 | 91.34 |
| Javanese BERT (Fine-tuned) | 93.91 | 93.97 |
| Javanese DistilBERT (Fine-tuned) | 95.65 | 95.66 |
| GPT-4o (Zero-shot) | 53.50 | 40.70 |
| Gemini 1.5 Pro (Zero-shot) | 50.70 | 45.40 |
| Llama 3.1 8B (Zero-shot) | 43.00 | 24.00 |
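The strength of the dictionary-based row suggests that word choice alone carries much of the signal. The sketch below shows one way such a lexical baseline could work; it is not the paper's implementation. The three word lists are a tiny hand-picked illustration, and `classify_level` with its decision rule (base register from Ngoko vs. Krama word counts, "Alus" when a Krama Inggil word appears) is a hypothetical simplification.

```python
# Toy lexicon (assumption): a handful of common words per register.
NGOKO = {"aku", "kowe", "mangan", "turu", "lunga", "lagi"}
KRAMA = {"kula", "sampeyan", "nedha", "tilem", "kesah", "saweg"}
KRAMA_INGGIL = {"panjenengan", "dhahar", "sare", "tindak"}


def classify_level(sentence):
    """Guess the honorific level of a sentence from lexicon hits only."""
    tokens = sentence.lower().split()
    n_ngoko = sum(t in NGOKO for t in tokens)
    n_krama = sum(t in KRAMA for t in tokens)
    has_inggil = any(t in KRAMA_INGGIL for t in tokens)

    # Base register: Krama only if Krama words outnumber Ngoko words.
    if n_krama > n_ngoko:
        return "Krama Alus" if has_inggil else "Krama"
    return "Ngoko Alus" if has_inggil else "Ngoko"


examples = [
    "aku lagi mangan",     # plain Ngoko vocabulary
    "kula saweg nedha",    # plain Krama vocabulary
    "bapak lagi dhahar",   # Ngoko base with a Krama Inggil verb
    "bapak saweg dhahar",  # Krama base with a Krama Inggil verb
]
predictions = [classify_level(s) for s in examples]
for sentence, level in zip(examples, predictions):
    print(f"{sentence!r} -> {level}")
```

A real baseline would need a full lexicon and handling for affixes and unknown words, which this sketch ignores.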
Ablation Study (GPT-4o Classification Performance Per Level)
| Honorific Level | Precision | Recall | F1 |
|---|---|---|---|
| Ngoko | 78.00 | 91.10 | 84.00 |
| Ngoko Alus | 0.00 | 0.00 | 0.00 |
| Krama | 53.50 | 26.00 | 35.00 |
| Krama Alus | 29.90 | 82.40 | 43.80 |
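The per-level numbers can be cross-checked with the standard formula F1 = 2PR / (P + R). The sketch below recomputes each F1 from the precision and recall in the table and then averages them. Treating the F1 in the main results table as a macro average is an assumption, but under it the average comes out at roughly 40.7, in line with the 40.70 reported for GPT-4o.

```python
# Precision and recall (in %) copied from the per-level table above.
per_level = {
    "Ngoko": (78.00, 91.10),
    "Ngoko Alus": (0.00, 0.00),
    "Krama": (53.50, 26.00),
    "Krama Alus": (29.90, 82.40),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_scores = {level: f1(p, r) for level, (p, r) in per_level.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for level, score in f1_scores.items():
    print(f"{level}: {score:.2f}")
print(f"macro F1: {macro_f1:.2f}")
```

The calculation also shows how a single level with F1 of 0 pulls the macro average far below the accuracy of 53.50.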
Key Findings
- Fine-tuned specialized models (DistilBERT at 95.65%) far outperform general LLMs (GPT-4o at 53.5%), illustrating that Javanese honorifics remain a major low-resource challenge.
- GPT-4o completely fails to identify the Ngoko Alus level (F1=0), showing a severe level bias.
- Closed-source models are biased toward the two extreme levels (Ngoko and Krama Alus) in classification and largely ignore the intermediate levels.
- The dictionary-based (rule-based) baseline is already strong at 88.37%, because honorific levels are largely realized through lexical substitution.
- In cross-lingual translation, KL divergence and Jensen-Shannon distance indicate a significant vocabulary gap between Javanese and Indonesian.
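The two vocabulary-gap measures named above can be sketched as follows. This is a minimal illustration, not the paper's computation: the whitespace tokenization, add-one smoothing, base-2 logarithm, and the four toy sentences are all assumptions, so the numbers it prints are not comparable to the reported 2.26.

```python
import math
from collections import Counter


def unigram_dist(sentences, vocab):
    """Add-one smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return [(counts[w] + 1) / total for w in vocab]


def kl_divergence(p, q):
    """KL(p || q) in bits; q must be non-zero wherever p is."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m))


# Toy sentences (hypothetical), Javanese Krama vs. Indonesian.
javanese = ["kula nedha sekul", "kula tilem"]
indonesian = ["saya makan nasi", "saya tidur"]

vocab = sorted({tok for s in javanese + indonesian for tok in s.split()})
p = unigram_dist(javanese, vocab)
q = unigram_dist(indonesian, vocab)

kl_pq = kl_divergence(p, q)
js_pq = js_distance(p, q)
print(f"KL(Javanese || Indonesian) = {kl_pq:.3f} bits")
print(f"JS distance = {js_pq:.3f}")
```

KL divergence is asymmetric and unbounded, while the Jensen-Shannon distance is symmetric and, with base-2 logs, stays between 0 and 1, which makes the two measures complementary.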
Highlights & Insights
- This work is the first to systematically evaluate LLM performance on the Javanese honorific system, filling a gap in the pragmatic evaluation of low-resource languages.
- The finding that GPT-4o is completely "blind" to Ngoko Alus is an important warning: apparent multilingual capability can mask a failure to handle fine-grained cultural distinctions.
- The rigorous corpus construction process (scanning → OCR → two-stage native-speaker validation) provides a blueprint for the digitization of other low-resource languages.
Limitations & Future Work
- The scale of the corpus is relatively small (4,024 sentences), which might be insufficient to train larger models.
- Only four honorific levels were evaluated, while real-world usage has even more fine-grained distinctions.
- The effect of fine-tuning general LLMs (e.g., fine-tuning Llama with Unggah-Ungguh) was not tested.
Related Work & Insights
- vs Japanese Honorific Corpus (Liu & Kobayashi, 2022): The Javanese honorific system is more complex (four levels vs the Japanese "Keigo/Kenjougo" binary distinction) and exhibits a lower Yule's K value (105.43 vs 125.54), demonstrating higher lexical diversity.
- vs Wongso et al. (2021): The latter pre-trained Javanese models but did not address the honorific system.
- vs Marreddy et al. (2022): The latter noted that low-resource language models perform poorly due to the lack of labeled data; this work directly addresses this pain point by constructing a specialized corpus.
Rating
- Novelty: ⭐⭐⭐⭐ First evaluation benchmark for Javanese honorifics, with a clear and unique problem definition.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage of four tasks with comparisons across multiple model types, but the corpus size remains limited.
- Writing Quality: ⭐⭐⭐⭐ Informative linguistic background and clearly organized experiments.
- Value: ⭐⭐⭐⭐ Holds significant reference value for research in low-resource NLP and culturally sensitive AI.