TigerLLM - A Family of Bangla Large Language Models¶

Conference: ACL 2025
arXiv: 2503.10995
Code: github.com/mraihan-gmu/TigerLLM
Area: LLM/NLP
Keywords: Low-resource languages, Bangla LLM, High-quality data, Continual pre-training, Model distillation

TL;DR¶

This work addresses the severe shortage of LLMs for Bangla (the 5th most spoken language globally) by building a high-quality textbook corpus, Bangla-TextBook (about 10M tokens), and a native instruction dataset, Bangla-Instruct (100K instruction-response pairs). The resulting TigerLLM family surpasses all open-source alternatives and outperforms GPT-3.5 across six benchmarks.

Background & Motivation¶

Bangla has approximately 237 million native speakers and is the 5th most widely spoken language globally, yet it remains severely underrepresented in NLP research.

Limitations of Prior Work in Bangla LLMs¶

Non-standardized training processes:
- Models such as titu-Gemma and Bong-LLaMA lack technical documentation and academic publications.
- Post-finetuning performance actually drops below that of the base models (e.g., titu-LLM achieves only 0.06 on MMLU-bn, far below Gemma-2 base's 0.35).
- Results are not reproducible.

Low data quality:
- Most projects rely on translated versions of Alpaca-Instruct and OpenOrca.
- These datasets were generated by early GPT-3.5, which has limited support for Bangla.
- Translation via Google Translate further degrades quality.

Issues with training corpora:
- Existing models depend primarily on OSCAR and Common Crawl, with inadequate quality control.
- High-quality educational Bangla content is largely missing.

Method¶

Overall Architecture¶

The development of TigerLLM comprises three core contributions:

Bangla-TextBook Corpus: Collected from grades 6-12 textbooks published by the National Curriculum and Textbook Board (NCTB) of Bangladesh.
Bangla-Instruct Dataset: Native Bangla instruction data generated using GPT-4o and Claude-3.5-Sonnet.
TigerLLM Model Family: Continually pre-trained and finetuned based on LLaMA 3.2 (1B) and Gemma 2 (9B).

Key Designs¶

Bangla-TextBook Corpus Construction:
- Source: 163 open-source textbooks published by the National Curriculum and Textbook Board of Bangladesh.
- Grade Range: Grades 6-12, covering multiple academic disciplines.
- Scale: 9,897,623 tokens, 697,903 sentences.
- Core Philosophy: Data quality over quantity (inspired by "Textbooks Are All You Need", Gunasekar et al., 2023).

Bangla-Instruct Generation Pipeline (Four Stages):

Seed and Instruction Generation:
- 500 seed tasks were created by 50 undergraduate/postgraduate volunteers from major Bangladeshi universities.
- Covers 5 subject areas and 10 categories.
- In each round, \(k=8\) seeds are sampled, and Claude is used to generate new instruction candidates.
Task Type Classification:
- GPT-4o classifies each instruction into three categories: open-ended, classification, or generation.
- A minimum response length threshold is then established.
Response Drafting:
- Claude generates comprehensive responses based on the instructions and task types.
- The version with the highest self-consistency score is retained.
Multi-stage Filtering:
- GPT-4o applies a four-dimensional filter: Language (\(\mathcal{L}\)), Culture (\(\mathcal{C}\)), Quality (\(\mathcal{Q}\)), and Novelty (\(\mathcal{N}\)).
- Approximately 63% of the (instruction, response) pairs pass the filter.
- Complexity distribution: 40% basic, 40% intermediate, 20% advanced.
- Verified pairs are added to the seed pool, looping until 100K high-quality pairs are gathered.
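
The four stages above can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the functions `generate_candidates`, `classify_task`, `draft_response`, and `passes_filter` are hypothetical stand-ins for the Claude and GPT-4o calls, the filter only really checks novelty (the language, culture, and quality checks are simulated with a random draw at the reported pass rate), and it is assumed that the target count excludes the seed tasks.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    instruction: str
    response: str


TASK_TYPES = ("open-ended", "classification", "generation")


def generate_candidates(seeds, rng, n_new=4):
    """Hypothetical stand-in for the Claude call that proposes new instructions."""
    return [
        f"{rng.choice(seeds).instruction} / variant {rng.randrange(10**9)}"
        for _ in range(n_new)
    ]


def classify_task(instruction, rng):
    """Hypothetical stand-in for the GPT-4o task-type classifier."""
    return rng.choice(TASK_TYPES)


def draft_response(instruction, task_type):
    """Hypothetical stand-in for Claude drafting a response."""
    return f"[{task_type}] draft answer for: {instruction}"


def passes_filter(pair, seen, rng, pass_rate=0.63):
    """Novelty is checked for real; the other three checks are simulated."""
    is_novel = pair.instruction not in seen
    other_checks_ok = rng.random() < pass_rate
    return is_novel and other_checks_ok


def build_dataset(seed_pairs, target, k=8, seed=0):
    rng = random.Random(seed)
    pool = list(seed_pairs)              # seed pool grows as pairs are verified
    seen = {p.instruction for p in pool}
    accepted = []
    while len(accepted) < target:
        sampled = rng.sample(pool, k)    # stage 1: sample k seeds per round
        for instruction in generate_candidates(sampled, rng):
            if len(accepted) >= target:
                break
            task_type = classify_task(instruction, rng)          # stage 2
            pair = Pair(instruction, draft_response(instruction, task_type))  # stage 3
            if passes_filter(pair, seen, rng):                   # stage 4
                seen.add(pair.instruction)
                accepted.append(pair)
                pool.append(pair)        # verified pairs feed later rounds
    return accepted


seeds = [Pair(f"seed task {i}", f"seed answer {i}") for i in range(20)]
dataset = build_dataset(seeds, target=50)
print(len(dataset))  # 50
```

Feeding verified pairs back into the pool is what lets 500 seeds grow into a much larger set, which is also why the novelty check matters: without it the loop would tend to recycle near-duplicates of earlier instructions.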

Model Selection and Evolution:
- Candidate base models: LLaMA 3.2 (1B, 3B), Gemma 2 (2B, 9B), Pangea (7B).
- LLaMA 3.2 (1B) and Gemma 2 (9B) were selected as the final base models after screening.
- Pangea was eliminated due to subpar performance on Bangla.

Loss & Training¶

Continual Pre-training:
- Hardware: 8 \(\times\) NVIDIA A100 (40GB), 512GB RAM.
- Utilizes the Bangla-TextBook corpus.
- Trained for approximately 120 hours (gradient checkpointing enabled).
- Hyperparameters were empirically determined through multiple trials.

Finetuning:
- Hardware: Single NVIDIA A100 (40GB), Google Colab.
- Full-parameter finetuning is used instead of LoRA to achieve better learning effects.
- Flash Attention is used for acceleration.
- Key parameters: maximum sequence length of 2048, batch size of 8, gradient accumulation steps of 4, trained for 3 epochs.
- Learning rate \(5 \times 10^{-5}\), weight decay 0.02, 10% warmup steps.
- Trained for approximately 96 hours.
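
To make the finetuning settings concrete, the short calculation below derives the effective batch size, the number of optimizer steps, and the warmup length from the reported values. It is a rough sketch that assumes all 100K Bangla-Instruct pairs are used for training on a single device with no held-out split; the summary does not state this, so the step counts are estimates. The GPU-hour figures simply multiply the reported device counts by the reported wall-clock hours.

```python
import math

# Reported finetuning settings
per_device_batch = 8
grad_accum_steps = 4
epochs = 3
warmup_percent = 10

# Assumptions (not stated in the summary): one device, all 100K pairs used for training
num_devices = 1
num_examples = 100_000

effective_batch = per_device_batch * grad_accum_steps * num_devices  # 32
steps_per_epoch = math.ceil(num_examples / effective_batch)          # 3125
total_steps = steps_per_epoch * epochs                               # 9375
warmup_steps = total_steps * warmup_percent // 100                   # 937

# Compute budget from the reported hardware and durations
pretrain_gpu_hours = 8 * 120   # 960
finetune_gpu_hours = 1 * 96    # 96

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
print(pretrain_gpu_hours, finetune_gpu_hours)
```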

Key Experimental Results¶

Main Results¶

Performance on Six Bangla Benchmarks (Pass@1):

Model	MMLU-bn	PangBench	BanglaQuaD	mHumanEval	BEnQA	BanglaRQA
GPT-3.5	0.55	0.55	0.50	0.56	0.50	0.49
GPT-4o-mini	0.67	0.62	0.65	0.56	0.60	0.60
Gemma 2 (27B)	0.35	0.51	0.43	0.64	0.50	0.56
LLaMA 3.2 (11B)	0.22	0.19	0.21	0.15	0.18	0.20
Titu-LLM	0.06	0.19	0.08	0.02	0.17	0.21
Bong-LLaMA	0.05	0.12	0.08	0.02	0.15	0.13
TigerLLM (1B)	0.61	0.55	0.68	0.61	0.59	0.62
TigerLLM (9B)	0.72	0.68	0.70	0.63	0.65	0.68

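Key Findings:
- TigerLLM (9B) outperforms GPT-3.5 and GPT-4o-mini on all six benchmarks. The only model that edges past it anywhere is Gemma 2 (27B) on the coding benchmark mHumanEval (0.64 vs. 0.63).
- TigerLLM (1B), with only 1B parameters, matches or beats GPT-3.5 on every benchmark (tying on PangBench at 0.55) and trails the open-source alternatives only on mHumanEval, where Gemma 2 (27B) scores 0.64 against its 0.61.
- Existing finetuned models (Titu-LLM, Bong-LLaMA) do not hold up under evaluation: they score well below the base models.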
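
The comparisons in the table can be checked mechanically. The snippet below is a small sketch that copies the scores from the table and reports, for any two models, the benchmarks where the first wins, ties, or loses; the helper names `compare` and `average` are illustrative and not part of any released code.

```python
BENCHMARKS = ("MMLU-bn", "PangBench", "BanglaQuaD", "mHumanEval", "BEnQA", "BanglaRQA")

# Scores copied from the results table, in BENCHMARKS order
SCORES = {
    "GPT-3.5":         (0.55, 0.55, 0.50, 0.56, 0.50, 0.49),
    "GPT-4o-mini":     (0.67, 0.62, 0.65, 0.56, 0.60, 0.60),
    "Gemma 2 (27B)":   (0.35, 0.51, 0.43, 0.64, 0.50, 0.56),
    "LLaMA 3.2 (11B)": (0.22, 0.19, 0.21, 0.15, 0.18, 0.20),
    "Titu-LLM":        (0.06, 0.19, 0.08, 0.02, 0.17, 0.21),
    "Bong-LLaMA":      (0.05, 0.12, 0.08, 0.02, 0.15, 0.13),
    "TigerLLM (1B)":   (0.61, 0.55, 0.68, 0.61, 0.59, 0.62),
    "TigerLLM (9B)":   (0.72, 0.68, 0.70, 0.63, 0.65, 0.68),
}


def compare(model_a, model_b):
    """Return (wins, ties, losses) of model_a against model_b as benchmark names."""
    wins, ties, losses = [], [], []
    for name, a, b in zip(BENCHMARKS, SCORES[model_a], SCORES[model_b]):
        if a > b:
            wins.append(name)
        elif a == b:
            ties.append(name)
        else:
            losses.append(name)
    return wins, ties, losses


def average(model):
    """Unweighted mean over the six benchmarks."""
    return sum(SCORES[model]) / len(BENCHMARKS)


for rival in ("GPT-3.5", "GPT-4o-mini", "Gemma 2 (27B)"):
    wins, ties, losses = compare("TigerLLM (9B)", rival)
    print(rival, "wins:", len(wins), "ties:", ties, "losses:", losses)
```

An unweighted average is only a convenience here, since the six benchmarks measure different skills and are not necessarily on comparable scales.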

Ablation Study¶

Verification of Data Quality vs. Quantity:
- TigerLLM uses only 10M tokens for pre-training plus 100K instructions for finetuning.
- In contrast, titu-Gemma uses 4.4B tokens and titu-LLaMA uses 37B tokens.
- TigerLLM achieves results far exceeding these large-scale alternatives with a fraction of the data.
- This supports the hypothesis that high-quality data is superior to massive low-quality data.
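
For a sense of scale, the ratios behind "a fraction of the data" can be computed directly. This sketch uses the exact Bangla-TextBook token count reported earlier and treats the 4.4B and 37B figures as round numbers, so the ratios are approximate.

```python
# Pre-training token counts as reported in this summary
tigerllm_tokens = 9_897_623          # Bangla-TextBook
titu_gemma_tokens = 4_400_000_000    # approx. 4.4B
titu_llama_tokens = 37_000_000_000   # approx. 37B

gemma_ratio = titu_gemma_tokens / tigerllm_tokens
llama_ratio = titu_llama_tokens / tigerllm_tokens

# Roughly 440x and 3,700x more pre-training data than TigerLLM
print(f"titu-Gemma: {gemma_ratio:.0f}x, titu-LLaMA: {llama_ratio:.0f}x")
```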

Loss Curves of Pre-training and Finetuning:
- Continual Pre-training: Loss steadily decreases, indicating that the model effectively absorbs Bangla knowledge.
- Finetuning: Loss converges quickly, achieving sound performance within 3 epochs.

Key Findings¶

Data quality is overwhelmingly more important than data quantity: 10M tokens of textbook corpus \(>\) 37B tokens of web crawl data.
Native-language instructions outperform translated instructions: Bangla-Instruct (natively generated) is far superior to translated Alpaca/OpenOrca.
Full-parameter finetuning outperforms LoRA: Full-parameter finetuning yields superior results when resources permit.
Potential of Small Models: A 1B model, powered by high-quality data, can surpass 11B-27B base models.
Systemic issues in existing Bangla LLMs: Improper training leads to regression rather than improvement after finetuning.

Highlights & Insights¶

Validation of "Textbooks Are All You Need" in a low-resource language: Successfully applies the philosophy of Phi-1 to Bangla, proving the universal value of high-quality curated data.
Multicultural expansion of self-instruction generation: 500 hand-crafted seed tasks preserve cultural authenticity, preventing cultural artifacts commonly found in translated data.
Completely Open-sourced: The corpus, instruction data, and models are fully open-sourced, demonstrating high reproducibility and community value.
Pragmatic compute solutions: The entire training pipeline requires only \(8 \times\) A100 (pre-training) + \(1 \times\) A100 (finetuning), making it highly practical for resource-constrained teams.
Systemic diagnosis of current issues: Provides an in-depth analysis of the root causes of failure in other Bangla LLMs.

Limitations & Future Work¶

Narrow linguistic domains of the corpus: The corpus is drawn only from grades 6-12 textbooks and lacks domains such as news, literature, and technical documentation.
Restricted model scale: Only 1B and 9B variants are explored, without assessing whether larger model scales would yield further gains.
Limited instruction types: The 100K instructions cover restricted task types, potentially failing to capture the complexity of real-world scenarios.
Absence of deep qualitative analysis: The paper does not analyze error modes or failure cases of the models.
Sparse evaluation benchmarks: Existing Bangla benchmarks are not fully comprehensive, potentially underestimating or overestimating certain capabilities.

Related Work & Insights¶

Phi-1/Textbooks Are All You Need (Gunasekar et al., 2023): The philosophy of high-quality small data triumphing over low-quality massive data directly inspired Bangla-TextBook.
BanglaBERT (Bhattacharjee et al., 2022): A monolingual Bangla BERT that demonstrates the efficacy of monolingual specialization.
Self-Instruct (Wang et al., 2023): Methodology for instruction data generation, upgraded in this work by leveraging GPT-4o and Claude as teacher models.
BLOOM/Aya: Multilingual open models that still exhibit massive performance gaps on low-resource languages.
Insight: The critical bottleneck for low-resource LLMs is data quality rather than model scale.

Rating¶

Novelty: ⭐⭐⭐ — Methodologically, the work combines existing techniques, but the problem formulation and data engineering are highly valuable.
Practicality: ⭐⭐⭐⭐⭐ — Delivers the first high-quality, open-source LLM for 237 million Bangla speakers.
Experimental Thoroughness: ⭐⭐⭐ — Evaluates across 6 benchmarks but lacks extensive ablation and qualitative/error analysis.
Writing Quality: ⭐⭐⭐⭐ — Clear motivation, with a thoroughly described pipeline for data engineering.