The authors construct MiLiC-Eval, the first standardized LLM evaluation benchmark for China's minority languages (Tibetan, Uyghur, Kazakh, and Mongolian). It contains 24k instances across 9 tasks, revealing the severe deficiencies of current LLMs in processing non-mainstream writing systems.
Background: LLMs exhibit outstanding performance in high-resource languages such as English and Chinese, but support for thousands of low-resource languages remains severely inadequate, particularly for those using non-mainstream writing systems.
Key Gap: Despite having tens of millions of speakers, China's Tibetan, Uyghur, Kazakh, and Mongolian languages have been severely marginalized in NLP research, and they lack standardized evaluation benchmarks.
Limitations of Prior Work: Existing multilingual benchmarks (e.g., XTREME, MEGA) exhibit insufficient coverage of low-resource writing systems, limited task diversity, and a lack of cross-task comparability. Furthermore, some rely on machine-translated data, leading to distorted evaluations.
Key Challenge: These languages use non-Latin scripts (e.g., traditional Mongolian, Tibetan), which complicates both tokenization and language modeling.
MiLiC-Eval covers 9 task categories with a total of 24,000 instances across 4 minority languages (Tibetan [bo], Uyghur [ug], Kazakh [kk], and Mongolian [mn]). The benchmark design adheres to three core principles: focusing on non-mainstream writing systems, cross-lingual/cross-task alignment, and fine-grained skill assessment.
Focus on Non-Mainstream Writing Systems: This work is the first to benchmark Arabic-script Kazakh and traditional Mongolian (rather than their mainstream Cyrillic-script counterparts), targeting the scripts where LLMs are weakest.
Cross-Lingual and Cross-Task Alignment: Parallel data for 6 tasks is provided across 6 languages (including Chinese and English). Using the same text for multiple tasks avoids evaluation bias introduced by a single task format.
Hierarchical Skill Assessment Framework: The 9 task categories are mapped into two dimensions: linguistic ability (vocabulary \(\rightarrow\) grammar \(\rightarrow\) pragmatics) and problem-solving capability (topic modeling \(\rightarrow\) context understanding \(\rightarrow\) generation \(\rightarrow\) reasoning).
Mongolian Is the Weakest Language: Even the best open-source model, Gemma-3, performs only slightly better than random on Mongolian. Tokenization of traditional Mongolian is extremely inefficient (GPT-4.1 requires 432 tokens per sentence, compared to only 54 for Cyrillic Mongolian, an eightfold difference).
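The tokenization gap can be made concrete with a little arithmetic. The sketch below uses only the two per-sentence token counts reported above; the function names and the 4,096-token budget are illustrative assumptions, not part of the benchmark.

```python
# Tokens per sentence for GPT-4.1, as reported in the text above.
TOKENS_PER_SENTENCE = {
    "traditional_mongolian": 432,
    "cyrillic_mongolian": 54,
}


def token_inflation(costly: int, baseline: int) -> float:
    """Return how many times more tokens `costly` uses than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return costly / baseline


def sentences_in_budget(budget: int, tokens_per_sentence: int) -> int:
    """Return how many whole sentences fit in a fixed token budget."""
    return budget // tokens_per_sentence


ratio = token_inflation(
    TOKENS_PER_SENTENCE["traditional_mongolian"],
    TOKENS_PER_SENTENCE["cyrillic_mongolian"],
)
print(f"Traditional script costs {ratio:.1f}x the tokens of Cyrillic")

# Hypothetical 4,096-token budget, chosen only for illustration.
HYPOTHETICAL_BUDGET = 4096
for script, cost in TOKENS_PER_SENTENCE.items():
    fit = sentences_in_budget(HYPOTHETICAL_BUDGET, cost)
    print(f"{script}: {fit} sentences fit")
```

Under these numbers, the same content in the traditional script consumes eight times the context and, for APIs billed per token, roughly eight times the cost.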
Specialized Multilingual Adaptation Models Struggle to Excel: Although models like EMMA-500 and BayLing-2 are explicitly trained on minority language data, they underperform compared to native multilingual LLMs (e.g., Gemma-3, Qwen-3).
Imbalanced Skill Profiles: LLMs possess basic vocabulary understanding and topic modeling abilities, but remain highly deficient in tasks requiring syntactic knowledge, such as generation and translation.
Severe Script-Switching Errors: In Mongolian headline generation, GPT-4o-mini erroneously switches to the Cyrillic script in 36% of cases; for Kazakh, the figure rises to 95%.
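One plausible way to measure such a rate is to classify each output by its dominant Unicode script and count mismatches. The following is a minimal sketch under that assumption; it is not the authors' procedure, the helper names are hypothetical, and it checks only the basic Unicode blocks (ignoring, for example, Arabic presentation forms).

```python
from collections import Counter
from typing import Optional, Sequence

# Basic Unicode block ranges (inclusive) for the scripts discussed here.
SCRIPT_RANGES = {
    "cyrillic": (0x0400, 0x04FF),
    "arabic": (0x0600, 0x06FF),
    "tibetan": (0x0F00, 0x0FFF),
    "mongolian": (0x1800, 0x18AF),
}


def dominant_script(text: str) -> Optional[str]:
    """Return the script with the most characters in `text`, or None."""
    counts: Counter = Counter()
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def script_switch_rate(outputs: Sequence[str], expected: str) -> float:
    """Fraction of outputs whose dominant script is not `expected`.

    Outputs with no recognized script are counted as mismatches.
    """
    if not outputs:
        return 0.0
    switched = sum(1 for out in outputs if dominant_script(out) != expected)
    return switched / len(outputs)


# Toy outputs (not benchmark data): two in traditional Mongolian script,
# written as escapes, and two in Cyrillic.
outputs = [
    "\u182e\u1823\u1829\u182d\u1823\u182f",
    "Монгол",
    "\u182c\u1821\u182f\u1821",
    "хэл",
]
print(script_switch_rate(outputs, expected="mongolian"))
```

Majority voting over characters keeps the check robust to digits, punctuation, and the occasional foreign-script proper noun inside an otherwise correct output.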
Distortion Caused by Machine-Translated Evaluation Data: Evaluating on NLLB-translated data lowers measured performance by up to 52%, with the largest drop in mathematical reasoning.