TigerLLM - A Family of Bangla Large Language Models¶
Conference: ACL 2025
arXiv: 2503.10995
Code: github.com/mraihan-gmu/TigerLLM
Area: LLM/NLP
Keywords: Low-resource languages, Bangla LLM, High-quality data, Continual pre-training, Model distillation
TL;DR¶
Addressing the severe shortage of LLMs for Bangla (the 5th most spoken language globally), this work develops a high-quality textbook corpus Bangla-TextBook (10M tokens) and native instruction dataset Bangla-Instruct (100K). The trained TigerLLM family surpasses all open-source alternatives and outperforms GPT-3.5 across six benchmarks.
Background & Motivation¶
Bangla has approximately 237 million native speakers and is the 5th most widely spoken language globally, yet it remains severely underrepresented in NLP.
Limitations of Prior Work in Bangla LLMs¶
Non-standardized training processes:
- Models such as titu-Gemma and Bong-LLaMA lack technical documentation and academic publications.
- Post-finetuning performance actually drops below that of the base models (e.g., titu-LLM achieves only 0.06 on MMLU-bn, far below Gemma-2 base's 0.35).
- Results are not reproducible.

Low data quality:
- Most projects rely on translated versions of Alpaca-Instruct and OpenOrca.
- These datasets were generated by early GPT-3.5, which has limited support for Bangla.
- Translation via Google Translate further degrades quality.

Issues with training corpora:
- Dependence is primarily on OSCAR and Common Crawl, with inadequate quality control.
- High-quality educational Bangla content is heavily lacking.
Method¶
Overall Architecture¶
The development of TigerLLM comprises three core contributions:
- Bangla-TextBook Corpus: Collected from grades 6-12 textbooks published by the National Curriculum and Textbook Board (NCTB) of Bangladesh.
- Bangla-Instruct Dataset: Native Bangla instruction data generated using GPT-4o and Claude-3.5-Sonnet.
- TigerLLM Model Family: Continually pre-trained and finetuned based on LLaMA 3.2 (1B) and Gemma 2 (9B).
Key Designs¶
Bangla-TextBook Corpus Construction:
- Source: 163 open-source textbooks published by the National Curriculum and Textbook Board of Bangladesh.
- Grade Range: Grades 6-12, covering multiple academic disciplines.
- Scale: 9,897,623 tokens, 697,903 sentences.
- Core Philosophy: Data quality over quantity (inspired by "Textbooks Are All You Need", Gunasekar et al., 2023).
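To get a feel for the corpus scale, the reported figures can be turned into simple averages. This is only a back-of-the-envelope sketch: the function name `corpus_stats` is hypothetical, and the per-sentence figure depends on whichever tokenizer produced the reported token count, which the summary does not specify.

```python
# Figures reported for the Bangla-TextBook corpus.
TOKENS = 9_897_623
SENTENCES = 697_903
TEXTBOOKS = 163


def corpus_stats(tokens: int, sentences: int, books: int) -> dict:
    """Return simple per-sentence and per-book averages (hypothetical helper)."""
    return {
        "tokens_per_sentence": tokens / sentences,
        "tokens_per_book": tokens / books,
        "sentences_per_book": sentences / books,
    }


stats = corpus_stats(TOKENS, SENTENCES, TEXTBOOKS)
for name, value in stats.items():
    print(f"{name}: {value:,.1f}")
```

The averages come out to roughly 14 tokens per sentence and about 61K tokens per textbook, which underlines how small the corpus is compared with web-scale crawls.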
Bangla-Instruct Generation Pipeline (Four Stages):
- Seed and Instruction Generation:
    - 500 seed tasks were created by 50 undergraduate/postgraduate volunteers from major Bangladeshi universities.
    - The seeds cover 5 subject areas and 10 categories.
    - In each round, \(k=8\) seeds are sampled, and Claude is used to generate new instruction candidates.
- Task Type Classification:
    - GPT-4o classifies each instruction into one of three categories: open-ended, classification, or generation.
    - A minimum response length threshold is established.
- Response Drafting:
    - Claude generates comprehensive responses based on the instructions and task types.
    - The version with the highest self-consistency score is retained.
- Multi-stage Filtering:
    - GPT-4o applies a four-dimensional filter: Language (\(\mathcal{L}\)), Culture (\(\mathcal{C}\)), Quality (\(\mathcal{Q}\)), and Novelty (\(\mathcal{N}\)).
    - Approximately 63% of the (instruction, response) pairs pass the filter.
    - Complexity distribution: 40% basic, 40% intermediate, 20% advanced.
    - Verified pairs are added to the seed pool, and the loop repeats until 100K high-quality pairs are gathered.
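The four stages can be sketched as a single loop. The code below is a minimal illustration of the control flow only, not the authors' implementation: `build_instruct_set`, `Pair`, and the four callables are hypothetical names, the teacher-model calls (Claude, GPT-4o) are replaced by injected functions, and the self-consistency selection of stage 3 is assumed to happen inside `draft`.

```python
import itertools
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Pair:
    instruction: str
    response: str
    task_type: str


def build_instruct_set(
    seeds: List[str],
    generate: Callable[[List[str]], List[str]],  # stage 1: sampled seeds -> candidates
    classify: Callable[[str], str],              # stage 2: instruction -> task type
    draft: Callable[[str, str], str],            # stage 3: (instruction, type) -> response
    passes_filter: Callable[[Pair], bool],       # stage 4: language/culture/quality/novelty
    target: int,
    k: int = 8,
    max_rounds: int = 10_000,
    rng: Optional[random.Random] = None,
) -> List[Pair]:
    """Grow a verified (instruction, response) set from a seed pool."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    accepted: List[Pair] = []
    for _ in range(max_rounds):
        if len(accepted) >= target:
            break
        sampled = rng.sample(pool, min(k, len(pool)))
        for instruction in generate(sampled):
            task_type = classify(instruction)
            pair = Pair(instruction, draft(instruction, task_type), task_type)
            if passes_filter(pair):
                accepted.append(pair)
                pool.append(instruction)  # verified instructions feed later rounds
    return accepted[:target]


# Toy stand-ins so the loop runs without any API access (all hypothetical).
_ids = itertools.count()

def toy_generate(sampled: List[str]) -> List[str]:
    return [f"instruction-{next(_ids)}" for _ in range(4)]

def toy_classify(instruction: str) -> str:
    return "open-ended"

def toy_draft(instruction: str, task_type: str) -> str:
    return f"response to {instruction}"

def toy_filter(pair: Pair) -> bool:
    # Rejects every third candidate, so about two thirds pass.
    return int(pair.instruction.split("-")[1]) % 3 != 0

pairs = build_instruct_set(
    [f"seed-{i}" for i in range(500)],
    toy_generate, toy_classify, toy_draft, toy_filter,
    target=50,
)
print(len(pairs), pairs[0])
```

Passing the model calls in as functions keeps the loop testable offline. At the reported pass rate of about 63%, reaching 100K verified pairs would require on the order of 160K drafted candidates.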
Model Selection and Evolution:
- Candidate base models: LLaMA 3.2 (1B, 3B), Gemma 2 (2B, 9B), Pangea (7B).
- LLaMA 3.2 (1B) and Gemma 2 (9B) were selected as the final base models after screening.
- Pangea was eliminated due to subpar performance on Bangla.
Loss & Training¶
Continual Pre-training:
- Hardware: 8 \(\times\) NVIDIA A100 (40GB), 512GB RAM.
- Utilizes the Bangla-TextBook corpus.
- Trained for approximately 120 hours (gradient checkpointing enabled).
- Hyperparameters were empirically determined through multiple trials.

Finetuning:
- Hardware: Single NVIDIA A100 (40GB), Google Colab.
- Full-parameter finetuning is used instead of LoRA to achieve better learning effects.
- Flash Attention is used for acceleration.
- Key parameters: maximum sequence length of 2048, batch size of 8, gradient accumulation steps of 4, trained for 3 epochs.
- Learning rate \(5 \times 10^{-5}\), weight decay 0.02, 10% warmup steps.
- Trained for approximately 96 hours.
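The finetuning hyperparameters imply a concrete optimizer schedule, which the arithmetic below works out. It is a sketch under stated assumptions: that all 100K Bangla-Instruct pairs are used for training (no held-out split), that the batch size of 8 is per optimizer micro-step on the single GPU, and that the reported hours are wall-clock time with every GPU busy. None of these is confirmed by the summary.

```python
import math

# Values reported in the finetuning setup.
BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 4
EPOCHS = 3
WARMUP_RATIO = 0.10
NUM_GPUS = 1

# Assumption: every Bangla-Instruct pair is a training example.
NUM_EXAMPLES = 100_000

# Sequences consumed per optimizer update.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_GPUS
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)

# Rough compute budget, assuming wall-clock hours with all GPUs in use.
pretrain_gpu_hours = 8 * 120
finetune_gpu_hours = 1 * 96

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps} ({warmup_steps} warmup)")
print(f"GPU-hours: {pretrain_gpu_hours} pre-training, {finetune_gpu_hours} finetuning")
```

Under these assumptions each optimizer update sees 32 sequences, and three epochs amount to 9,375 updates, of which the first 937 are warmup.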
Key Experimental Results¶
Main Results¶
Performance on Six Bangla Benchmarks (Pass@1):
| Model | MMLU-bn | PangBench | BanglaQuaD | mHumanEval | BEnQA | BanglaRQA |
|---|---|---|---|---|---|---|
| GPT-3.5 | 0.55 | 0.55 | 0.50 | 0.56 | 0.50 | 0.49 |
| GPT-4o-mini | 0.67 | 0.62 | 0.65 | 0.56 | 0.60 | 0.60 |
| Gemma 2 (27B) | 0.35 | 0.51 | 0.43 | 0.64 | 0.50 | 0.56 |
| LLaMA 3.2 (11B) | 0.22 | 0.19 | 0.21 | 0.15 | 0.18 | 0.20 |
| Titu-LLM | 0.06 | 0.19 | 0.08 | 0.02 | 0.17 | 0.21 |
| Bong-LLaMA | 0.05 | 0.12 | 0.08 | 0.02 | 0.15 | 0.13 |
| TigerLLM (1B) | 0.61 | 0.55 | 0.68 | 0.61 | 0.59 | 0.62 |
| TigerLLM (9B) | 0.72 | 0.68 | 0.70 | 0.63 | 0.65 | 0.68 |
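Key Findings:
- TigerLLM (9B) outperforms GPT-3.5 and GPT-4o-mini on all six benchmarks. The only baseline that scores higher anywhere in the table is Gemma 2 (27B) on the coding benchmark mHumanEval (0.64 vs. 0.63).
- TigerLLM (1B), with only 1B parameters, matches or exceeds GPT-3.5 on all six benchmarks (tying at 0.55 on PangBench) and beats every open-source baseline on five of the six, trailing only Gemma 2 (27B) on mHumanEval (0.61 vs. 0.64).
- Existing finetuned Bangla models (Titu-LLM, Bong-LLaMA) score far below the Gemma 2 (27B) base model on every benchmark, so their finetuning appears to have caused regression rather than improvement.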
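The comparisons can be checked directly against the table. The snippet below transcribes the reported scores and lists, for any pair of models, the benchmarks on which the first scores strictly lower than the second. The helper names `losses` and `mean_score` are hypothetical, and the unweighted mean is just a convenience summary, not a metric from the paper.

```python
BENCHMARKS = ["MMLU-bn", "PangBench", "BanglaQuaD", "mHumanEval", "BEnQA", "BanglaRQA"]

# Scores transcribed from the results table.
SCORES = {
    "GPT-3.5":         [0.55, 0.55, 0.50, 0.56, 0.50, 0.49],
    "GPT-4o-mini":     [0.67, 0.62, 0.65, 0.56, 0.60, 0.60],
    "Gemma 2 (27B)":   [0.35, 0.51, 0.43, 0.64, 0.50, 0.56],
    "LLaMA 3.2 (11B)": [0.22, 0.19, 0.21, 0.15, 0.18, 0.20],
    "Titu-LLM":        [0.06, 0.19, 0.08, 0.02, 0.17, 0.21],
    "Bong-LLaMA":      [0.05, 0.12, 0.08, 0.02, 0.15, 0.13],
    "TigerLLM (1B)":   [0.61, 0.55, 0.68, 0.61, 0.59, 0.62],
    "TigerLLM (9B)":   [0.72, 0.68, 0.70, 0.63, 0.65, 0.68],
}


def losses(model: str, baseline: str) -> list:
    """Benchmarks where `model` scores strictly below `baseline`."""
    return [
        bench
        for bench, m, b in zip(BENCHMARKS, SCORES[model], SCORES[baseline])
        if m < b
    ]


def mean_score(model: str) -> float:
    """Unweighted mean over the six benchmarks."""
    return sum(SCORES[model]) / len(BENCHMARKS)


for tiger in ("TigerLLM (1B)", "TigerLLM (9B)"):
    for baseline in SCORES:
        if baseline.startswith("TigerLLM"):
            continue
        print(f"{tiger} vs {baseline}: loses on {losses(tiger, baseline) or 'nothing'}")

for model in SCORES:
    print(f"{model}: mean {mean_score(model):.3f}")
```

Ties (such as 0.55 vs. 0.55 on PangBench) are not counted as losses because the comparison is strict.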
Ablation Study¶
Verification of Data Quality vs. Quantity:
- TigerLLM only uses 10M tokens for pre-training + 100K instructions for finetuning.
- In contrast, titu-Gemma uses 4.4B tokens, and titu-LLaMA uses 37B tokens.
- TigerLLM achieves results far exceeding large-scale alternatives with a fraction of the data.
- This validates the hypothesis that "high-quality data is superior to massive low-quality data".
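The size of that "fraction" follows from the reported token counts. The short calculation below uses the rounded 10M figure for Bangla-TextBook. The variable names are illustrative, and the comparison ignores differences in tokenizers between projects.

```python
# Pre-training corpus sizes as reported (in tokens).
TIGERLLM_TOKENS = 10_000_000  # rounded; the exact count is 9,897,623
OTHER_CORPORA = {
    "titu-Gemma": 4_400_000_000,
    "titu-LLaMA": 37_000_000_000,
}

# How many times more pre-training tokens each alternative used.
ratios = {name: tokens / TIGERLLM_TOKENS for name, tokens in OTHER_CORPORA.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:,.0f}x the pre-training tokens of TigerLLM")
```

In other words, the alternatives were pre-trained on roughly 440 and 3,700 times as many tokens.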
Loss Curves of Pre-training and Finetuning:
- Continual Pre-training: Loss steadily decreases, indicating that the model effectively absorbs Bangla knowledge.
- Finetuning: Loss converges quickly, achieving sound performance within 3 epochs.
Key Findings¶
- Data quality is overwhelmingly more important than data quantity: 10M tokens of textbook corpus \(>\) 37B tokens of web crawl data.
- Native-language instructions outperform translated instructions: Bangla-Instruct (natively generated) is far superior to translated Alpaca/OpenOrca.
- Full-parameter finetuning outperforms LoRA: Full-parameter finetuning yields superior results when resources permit.
- Potential of Small Models: A 1B model, powered by high-quality data, can surpass 11B-27B base models.
- Systemic issues in existing Bangla LLMs: Improper training leads to regression rather than improvement after finetuning.
Highlights & Insights¶
- Validation of "Textbooks Are All You Need" in a low-resource language: Successfully applies the philosophy of Phi-1 to Bangla, proving the universal value of high-quality curated data.
- Multicultural expansion of self-instruction generation: 500 hand-crafted seed tasks preserve cultural authenticity, preventing cultural artifacts commonly found in translated data.
- Completely Open-sourced: The corpus, instruction data, and models are fully open-sourced, demonstrating high reproducibility and community value.
- Pragmatic compute solutions: The entire training pipeline requires only \(8 \times\) A100 (pre-training) + \(1 \times\) A100 (finetuning), making it highly practical for resource-constrained teams.
- Systemic diagnosis of current issues: Provides an in-depth analysis of the root causes of failure in other Bangla LLMs.
Limitations & Future Work¶
- Narrow linguistic domains of the corpus: The corpus only stems from grades 6-12 textbooks, lacking domains like news, literature, and technical documentation.
- Restricted model scale: Only 1B and 9B variants are explored, without assessing whether larger model scales would yield further gains.
- Limited instruction types: The 100K instructions cover restricted task types, potentially failing to capture the complexity of real-world scenarios.
- Absence of deep qualitative analysis: The paper does not analyze error modes or failure cases of the models.
- Sparse evaluation benchmarks: Existing Bangla benchmarks are not fully comprehensive, potentially underestimating or overestimating certain capabilities.
Related Work & Insights¶
- Phi-1/Textbooks Are All You Need (Gunasekar et al., 2023): The philosophy of high-quality small data triumphing over low-quality massive data directly inspired Bangla-TextBook.
- BanglaBERT (Sami et al., 2022): A monolingual Bangla BERT that proves the efficacy of monolingual specialization.
- Self-Instruct (Wang et al., 2023): Methodology for instruction data generation, upgraded in this work by leveraging GPT-4o and Claude as teacher models.
- BLOOM/Aya: Multilingual open models that still exhibit massive performance gaps on low-resource languages.
- Insight: The critical bottleneck for low-resource LLMs is data quality rather than model scale.
Rating¶
- Novelty: ⭐⭐⭐ — Methodologically, the work combines existing techniques, but the problem formulation and data engineering are highly valuable.
- Practicality: ⭐⭐⭐⭐⭐ — Delivers the first high-quality, open-source LLM for 237 million Bangla speakers.
- Experimental Thoroughness: ⭐⭐⭐ — Evaluates across 6 benchmarks but lacks extensive ablation and qualitative/error analysis.
- Writing Quality: ⭐⭐⭐⭐ — Clear motivation, with a thoroughly described pipeline for data engineering.