
TigerLLM - A Family of Bangla Large Language Models

Conference: ACL 2025
arXiv: 2503.10995
Code: github.com/mraihan-gmu/TigerLLM
Area: LLM/NLP
Keywords: Low-resource languages, Bangla LLM, High-quality data, Continual pre-training, Model distillation

TL;DR

To address the severe shortage of LLMs for Bangla (the 5th most spoken language globally), this work builds a high-quality textbook corpus, Bangla-TextBook (about 10M tokens), and a native instruction dataset, Bangla-Instruct (100K instruction-response pairs). The resulting TigerLLM family surpasses all open-source alternatives and outperforms GPT-3.5 across six benchmarks.

Background & Motivation

Bangla has approximately 237 million native speakers and is the 5th most widely spoken language globally, yet it is severely underrepresented in NLP research.

Limitations of Prior Work in Bangla LLMs

Non-standardized training processes:

  • Models such as titu-Gemma and Bong-LLaMA lack technical documentation and academic publications.
  • Post-finetuning performance actually drops below that of the base models (e.g., titu-LLM achieves only 0.06 on MMLU-bn, far below Gemma-2 base's 0.35).
  • Results are not reproducible.

Low data quality:

  • Most projects rely on translated versions of Alpaca-Instruct and OpenOrca.
  • These datasets were generated by early GPT-3.5, which has limited support for Bangla.
  • Translation via Google Translate further degrades quality.

Issues with training corpora:

  • Corpora depend primarily on OSCAR and Common Crawl, with inadequate quality control.
  • High-quality educational Bangla content is largely missing.

Method

Overall Architecture

The development of TigerLLM comprises three core contributions:

  1. Bangla-TextBook Corpus: Collected from grades 6-12 textbooks published by the National Curriculum and Textbook Board (NCTB) of Bangladesh.
  2. Bangla-Instruct Dataset: Native Bangla instruction data generated using GPT-4o and Claude-3.5-Sonnet.
  3. TigerLLM Model Family: Continually pre-trained and finetuned based on LLaMA 3.2 (1B) and Gemma 2 (9B).

Key Designs

Bangla-TextBook Corpus Construction:

  • Source: 163 open-source textbooks published by the National Curriculum and Textbook Board of Bangladesh.
  • Grade Range: Grades 6-12, covering multiple academic disciplines.
  • Scale: 9,897,623 tokens, 697,903 sentences.
  • Core Philosophy: Data quality over quantity (inspired by "Textbooks Are All You Need", Gunasekar et al., 2023).

Bangla-Instruct Generation Pipeline (Four Stages):

  1. Seed and Instruction Generation:

    • 500 seed tasks were created by 50 undergraduate/postgraduate volunteers from major Bangladeshi universities.
    • Covers 5 subject areas and 10 categories.
    • In each round, \(k=8\) seeds are sampled, and Claude is used to generate new instruction candidates.
  2. Task Type Classification:

    • GPT-4o classifies each instruction into one of three categories: open-ended, classification, or generation.
    • A minimum response length threshold is established at this stage.
  3. Response Drafting:

    • Claude generates comprehensive responses based on the instructions and task types.
    • The draft with the highest self-consistency score is retained.
  4. Multi-stage Filtering:

    • GPT-4o applies a four-dimensional filter: Language (\(\mathcal{L}\)), Culture (\(\mathcal{C}\)), Quality (\(\mathcal{Q}\)), and Novelty (\(\mathcal{N}\)).
    • Approximately 63% of the (instruction, response) pairs pass the filter.
    • Complexity distribution: 40% basic, 40% intermediate, 20% advanced.
    • Verified pairs are added to the seed pool, looping until 100K high-quality pairs are gathered.
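
The four stages above form a loop that can be sketched as follows. This is only an illustrative outline, not the authors' implementation: every LLM call is replaced by a hypothetical stub (`generate_candidates`, `classify`, `draft_response`, `passes_filter`), the number of candidates per round (`n=4`) is an assumption, and the filter is simulated with the reported pass rate of about 63% instead of real language, culture, quality, and novelty checks.

```python
import random
from dataclasses import dataclass

TASK_TYPES = ("open-ended", "classification", "generation")


@dataclass(frozen=True)
class Pair:
    instruction: str
    response: str
    task_type: str


def generate_candidates(seeds, rng, n=4):
    """Hypothetical stand-in for the Claude call that proposes new instructions."""
    out = []
    for _ in range(n):
        base = rng.choice(seeds).split(" #")[0]
        out.append(f"{base} #{rng.getrandbits(32)}")
    return out


def classify(instruction, rng):
    """Hypothetical stand-in for the GPT-4o task-type classifier."""
    return rng.choice(TASK_TYPES)


def draft_response(instruction, task_type):
    """Hypothetical stand-in for drafting; the real step keeps the most self-consistent draft."""
    return f"[{task_type}] response to: {instruction}"


def passes_filter(pair, rng, pass_rate=0.63):
    """Simulates the joint outcome of the language, culture, quality, and novelty checks."""
    return rng.random() < pass_rate


def build_dataset(seed_tasks, target, k=8, rng_seed=0):
    rng = random.Random(rng_seed)
    pool = list(seed_tasks)          # seed pool grows as pairs are verified
    accepted = []
    while len(accepted) < target:
        seeds = rng.sample(pool, k)  # stage 1: sample k seeds per round
        for instruction in generate_candidates(seeds, rng):
            task_type = classify(instruction, rng)                # stage 2
            response = draft_response(instruction, task_type)    # stage 3
            pair = Pair(instruction, response, task_type)
            if passes_filter(pair, rng):                          # stage 4
                accepted.append(pair)
                pool.append(pair.instruction)  # verified pairs feed back into the pool
    return accepted[:target]


dataset = build_dataset([f"seed task {i}" for i in range(500)], target=1000)
print(len(dataset))
```

In a real pipeline the stubs would wrap API calls and the loop would run until 100K pairs are collected; the feedback step (`pool.append`) is what lets later rounds build on previously verified instructions.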

Model Selection and Evolution:

  • Candidate base models: LLaMA 3.2 (1B, 3B), Gemma 2 (2B, 9B), Pangea (7B).
  • LLaMA 3.2 (1B) and Gemma 2 (9B) were selected as the final base models after screening.
  • Pangea was eliminated due to subpar performance on Bangla.

Loss & Training

Continual Pre-training:

  • Hardware: 8 \(\times\) NVIDIA A100 (40GB), 512GB RAM.
  • Data: the Bangla-TextBook corpus.
  • Duration: approximately 120 hours, with gradient checkpointing enabled.
  • Hyperparameters were determined empirically over multiple trials.

Finetuning:

  • Hardware: a single NVIDIA A100 (40GB) on Google Colab.
  • Full-parameter finetuning is used instead of LoRA for better learning.
  • Flash Attention is used for acceleration.
  • Key parameters: maximum sequence length of 2048, batch size of 8, gradient accumulation steps of 4, 3 epochs.
  • Learning rate \(5 \times 10^{-5}\), weight decay 0.02, 10% warmup steps.
  • Duration: approximately 96 hours.
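
To make the finetuning settings concrete, the short calculation below derives the effective batch size, the number of optimizer steps, and the warmup length from the figures listed above. It assumes, purely for illustration, that all 100K Bangla-Instruct pairs are used for training on a single device and that warmup is rounded down; the variable names are hypothetical and not taken from the released code.

```python
import math

# Values stated in the summary.
batch_size = 8
grad_accum_steps = 4
epochs = 3
max_seq_len = 2048

# Assumption: every one of the 100K pairs is a training example.
num_examples = 100_000

effective_batch = batch_size * grad_accum_steps            # sequences per optimizer step
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = total_steps // 10                            # 10% warmup, rounded down
max_tokens_per_step = effective_batch * max_seq_len         # upper bound if every sequence is full

# Rough compute budget from the reported wall-clock times.
gpu_hours = 8 * 120 + 1 * 96                                # pre-training + finetuning

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
print(max_tokens_per_step, gpu_hours)
```

Under these assumptions each optimizer step sees 32 sequences, and the reported times correspond to roughly a thousand A100-hours in total.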

Key Experimental Results

Main Results

Performance on Six Bangla Benchmarks (Pass@1):

| Model | MMLU-bn | PangBench | BanglaQuaD | mHumanEval | BEnQA | BanglaRQA |
|---|---|---|---|---|---|---|
| GPT-3.5 | 0.55 | 0.55 | 0.50 | 0.56 | 0.50 | 0.49 |
| GPT-4o-mini | 0.67 | 0.62 | 0.65 | 0.56 | 0.60 | 0.60 |
| Gemma 2 (27B) | 0.35 | 0.51 | 0.43 | 0.64 | 0.50 | 0.56 |
| LLaMA 3.2 (11B) | 0.22 | 0.19 | 0.21 | 0.15 | 0.18 | 0.20 |
| Titu-LLM | 0.06 | 0.19 | 0.08 | 0.02 | 0.17 | 0.21 |
| Bong-LLaMA | 0.05 | 0.12 | 0.08 | 0.02 | 0.15 | 0.13 |
| TigerLLM (1B) | 0.61 | 0.55 | 0.68 | 0.61 | 0.59 | 0.62 |
| TigerLLM (9B) | 0.72 | 0.68 | 0.70 | 0.63 | 0.65 | 0.68 |
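
The comparisons in this table are easy to check mechanically. The snippet below is a small sketch that transcribes the scores above and reports the leading model per benchmark and the benchmarks on which one model strictly beats another; the helper names `leader` and `strict_wins` are hypothetical.

```python
BENCHMARKS = ["MMLU-bn", "PangBench", "BanglaQuaD", "mHumanEval", "BEnQA", "BanglaRQA"]

# Scores transcribed from the results table, in BENCHMARKS order.
SCORES = {
    "GPT-3.5":         [0.55, 0.55, 0.50, 0.56, 0.50, 0.49],
    "GPT-4o-mini":     [0.67, 0.62, 0.65, 0.56, 0.60, 0.60],
    "Gemma 2 (27B)":   [0.35, 0.51, 0.43, 0.64, 0.50, 0.56],
    "LLaMA 3.2 (11B)": [0.22, 0.19, 0.21, 0.15, 0.18, 0.20],
    "Titu-LLM":        [0.06, 0.19, 0.08, 0.02, 0.17, 0.21],
    "Bong-LLaMA":      [0.05, 0.12, 0.08, 0.02, 0.15, 0.13],
    "TigerLLM (1B)":   [0.61, 0.55, 0.68, 0.61, 0.59, 0.62],
    "TigerLLM (9B)":   [0.72, 0.68, 0.70, 0.63, 0.65, 0.68],
}


def leader(benchmark):
    """Return the model with the highest score on one benchmark."""
    i = BENCHMARKS.index(benchmark)
    return max(SCORES, key=lambda model: SCORES[model][i])


def strict_wins(a, b):
    """Return the benchmarks on which model a scores strictly higher than model b."""
    return [name for name, x, y in zip(BENCHMARKS, SCORES[a], SCORES[b]) if x > y]


for name in BENCHMARKS:
    print(f"{name:11s} best: {leader(name)}")

print("1B > GPT-3.5 on:", strict_wins("TigerLLM (1B)", "GPT-3.5"))
print("9B > GPT-4o-mini on:", strict_wins("TigerLLM (9B)", "GPT-4o-mini"))
```

Using strict inequality matters here because TigerLLM (1B) and GPT-3.5 tie on PangBench at 0.55.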

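Key Findings:

  • TigerLLM (9B) outperforms GPT-3.5 and GPT-4o-mini on all six benchmarks. The only model that beats it anywhere is Gemma 2 (27B) on the coding benchmark mHumanEval (0.64 vs. 0.63).
  • TigerLLM (1B), with only 1B parameters, ties GPT-3.5 on PangBench (0.55) and beats it on the other five benchmarks. It also beats every open-source model in the table on all benchmarks except mHumanEval, where Gemma 2 (27B) leads (0.64 vs. 0.61).
  • The existing finetuned models (Titu-LLM, Bong-LLaMA) score far below Gemma 2 (27B) on every benchmark, consistent with the post-finetuning regression described earlier.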

Ablation Study

Verification of Data Quality vs. Quantity:

  • TigerLLM uses only about 10M tokens for pre-training plus 100K instructions for finetuning.
  • In contrast, titu-Gemma uses 4.4B tokens, and titu-LLaMA uses 37B tokens.
  • TigerLLM far exceeds these large-scale alternatives with a fraction of the data.
  • This supports the hypothesis that high-quality data is superior to massive low-quality data.
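
The size gap can be quantified with simple arithmetic on the figures reported in this summary. The sketch below uses the exact Bangla-TextBook counts and the rounded 4.4B and 37B token figures, so the resulting ratios are approximate; the variable names are illustrative only.

```python
# Bangla-TextBook statistics as reported.
TIGERLLM_TOKENS = 9_897_623
SENTENCES = 697_903
TEXTBOOKS = 163

# Rounded pre-training sizes reported for the comparison models.
OTHER_TOKENS = {"titu-Gemma": 4.4e9, "titu-LLaMA": 37e9}

tokens_per_sentence = TIGERLLM_TOKENS / SENTENCES
tokens_per_textbook = TIGERLLM_TOKENS / TEXTBOOKS
scale = {name: tokens / TIGERLLM_TOKENS for name, tokens in OTHER_TOKENS.items()}

print(f"tokens per sentence: {tokens_per_sentence:.1f}")
print(f"tokens per textbook: {tokens_per_textbook:,.0f}")
for name, factor in scale.items():
    print(f"{name} uses about {factor:,.0f}x more pre-training tokens")
```

The comparison corpora are therefore several hundred to several thousand times larger than Bangla-TextBook, which averages about 14 tokens per sentence.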

Loss Curves of Pre-training and Finetuning:

  • Continual Pre-training: Loss steadily decreases, indicating that the model effectively absorbs Bangla knowledge.
  • Finetuning: Loss converges quickly, achieving sound performance within 3 epochs.

Key Findings

  1. Data quality is overwhelmingly more important than data quantity: 10M tokens of textbook corpus \(>\) 37B tokens of web crawl data.
  2. Native-language instructions outperform translated instructions: Bangla-Instruct (natively generated) is far superior to translated Alpaca/OpenOrca.
  3. Full-parameter finetuning outperforms LoRA: Full-parameter finetuning yields superior results when resources permit.
  4. Potential of Small Models: A 1B model, powered by high-quality data, beats the LLaMA 3.2 (11B) base model on all six benchmarks and the Gemma 2 (27B) base model on five of six (all except mHumanEval).
  5. Systemic issues in existing Bangla LLMs: Improper training leads to regression rather than improvement after finetuning.

Highlights & Insights

  1. Validation of "Textbooks Are All You Need" in a low-resource language: Successfully applies the philosophy of Phi-1 to Bangla, proving the universal value of high-quality curated data.
  2. Multicultural expansion of self-instruction generation: 500 hand-crafted seed tasks preserve cultural authenticity, preventing cultural artifacts commonly found in translated data.
  3. Completely Open-sourced: The corpus, instruction data, and models are fully open-sourced, demonstrating high reproducibility and community value.
  4. Pragmatic compute solutions: The entire training pipeline requires only \(8 \times\) A100 (pre-training) + \(1 \times\) A100 (finetuning), making it highly practical for resource-constrained teams.
  5. Systemic diagnosis of current issues: Provides an in-depth analysis of the root causes of failure in other Bangla LLMs.

Limitations & Future Work

  1. Narrow linguistic domains of the corpus: The corpus only stems from grades 6-12 textbooks, lacking domains like news, literature, and technical documentation.
  2. Restricted model scale: Only 1B and 9B variants are explored, without assessing whether larger model scales would yield further gains.
  3. Limited instruction types: The 100K instructions cover restricted task types, potentially failing to capture the complexity of real-world scenarios.
  4. Absence of deep qualitative analysis: The paper does not analyze error modes or failure cases of the models.
  5. Sparse evaluation benchmarks: Existing Bangla benchmarks are not fully comprehensive, potentially underestimating or overestimating certain capabilities.

Related Work & Insights

  • Phi-1/Textbooks Are All You Need (Gunasekar et al., 2023): The philosophy of high-quality small data triumphing over low-quality massive data directly inspired Bangla-TextBook.
  • BanglaBERT (Bhattacharjee et al., 2022): A monolingual Bangla BERT that demonstrates the efficacy of monolingual specialization.
  • Self-Instruct (Wang et al., 2023): Methodology for instruction data generation, upgraded in this work by leveraging GPT-4o and Claude as teacher models.
  • BLOOM/Aya: Multilingual open models that still exhibit massive performance gaps on low-resource languages.
  • Insight: The critical bottleneck for low-resource LLMs is data quality rather than model scale.

Rating

  • Novelty: ⭐⭐⭐ — Methodologically, the work combines existing techniques, but the problem formulation and data engineering are highly valuable.
  • Practicality: ⭐⭐⭐⭐⭐ — Delivers the first high-quality, open-source LLM for 237 million Bangla speakers.
  • Experimental Thoroughness: ⭐⭐⭐ — Evaluates across 6 benchmarks but lacks extensive ablation and qualitative/error analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Clear motivation, with a thoroughly described pipeline for data engineering.