SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning

Conference: ACL 2025
arXiv: 2506.02412
Code: None
Area: Multimodal/Educational AI
Keywords: Intelligent Tutoring Systems, Multilingual Dialogue, Picture Description, Language Learning, Scaffolding

TL;DR

This paper proposes SingaKids, a multilingual, multimodal dialogic tutor for language learning, designed for primary school students. Built around a picture description task, it integrates dense image captioning, multilingual dialogue, speech understanding, and child-friendly speech generation, and supports interactive learning in four languages: English, Chinese, Malay, and Tamil.

Background & Motivation

Generative AI shows immense potential for personalized learning in education. However, in language learning scenarios, applications aimed at children still face multiple challenges:

Inconsistent Cross-lingual Performance: Most LLMs perform exceptionally in high-resource languages like English, but their performance drops significantly in low-resource languages such as Malay and Tamil. This poses a major barrier to educational applications in Singapore's multilingual environment.

Lack of Child-Friendly Design: Existing systems are mostly designed for adults and lack considerations for children's cognitive load, attention span, and developmental appropriateness. Children require simplified instructions, engaging dialogic patterns, and age-appropriate scaffolding support.

Disconnect Between Dialogic Pedagogy and Practice: Traditional intelligent tutoring systems heavily rely on rule-based systems or massive human-annotated datasets. Although the new generation of LLM-driven conversational tutors reduces data requirements, how to effectively integrate pedagogical and learning science principles remains an open question.

SingaKids is proposed against this backdrop to establish a multilingual interactive learning environment tailored for Singaporean primary school students through a picture description task.

Method

Overall Architecture

The system architecture comprises three key modules, forming a complete educational dialogue pipeline:

Multimodal Understanding Module:
- Scene Understanding: Extracting keywords, objects, and events from images.
- Multilingual Automatic Speech Recognition (ASR): Transcribing students' spoken responses into text.
- Speech Evaluation: Assessing students' speaking proficiency.
Multilingual LLM Core Module:
- Multilingual Semantic Understanding: Interpreting student responses in context.
- Language Evaluation: Assessing the linguistic accuracy and completeness of descriptions.
- Scaffolding Guidance: Determining the appropriate level of support.
- Pedagogical Anchoring: Establishing high-level instructional goals (e.g., vocabulary comprehension or sentence construction).
Output Module:
- Multilingual Text-to-Speech (TTS): Converting text into natural and engaging speech.
- Keyword Highlighting: Emphasizing important keywords or pronunciation errors.
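
The three modules above can be read as one turn of a dialogue loop. The following is a minimal sketch of that flow, not the authors' implementation: the names `TurnInput`, `TurnOutput`, and `run_turn`, the stand-in components, and the keyword heuristic (highlight caption words the student has not said yet) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TurnInput:
    """One student turn: captions for the picture plus the recorded speech."""
    image_captions: List[str]  # output of scene understanding
    audio: bytes               # raw student speech
    language: str              # "en", "zh", "ms", or "ta"


@dataclass
class TurnOutput:
    reply_text: str
    reply_audio: bytes
    highlighted_keywords: List[str] = field(default_factory=list)


def run_turn(
    turn: TurnInput,
    transcribe: Callable[[bytes, str], str],
    generate_reply: Callable[[str, List[str], str], str],
    synthesize: Callable[[str, str], bytes],
) -> TurnOutput:
    """Chain the modules: understanding -> LLM core -> output."""
    transcript = transcribe(turn.audio, turn.language)
    reply = generate_reply(transcript, turn.image_captions, turn.language)
    # Hypothetical heuristic: highlight caption words not yet mentioned.
    spoken = set(transcript.lower().split())
    keywords = sorted({
        word
        for caption in turn.image_captions
        for word in caption.lower().split()
        if len(word) > 3 and word not in spoken
    })
    return TurnOutput(reply, synthesize(reply, turn.language), keywords)


# Stand-in components so the sketch runs without any models.
def fake_asr(audio: bytes, lang: str) -> str:
    return audio.decode("utf-8")


def fake_llm(transcript: str, captions: List[str], lang: str) -> str:
    return f"Good try! You said: '{transcript}'. What else do you see?"


def fake_tts(text: str, lang: str) -> bytes:
    return text.encode("utf-8")


result = run_turn(
    TurnInput(["a girl feeds ducks"], b"a girl", "en"),
    fake_asr, fake_llm, fake_tts,
)
print(result.reply_text)
print(result.highlighted_keywords)
```

Passing the components in as callables keeps the sketch consistent with the module-by-module design: each stand-in could be swapped for a real ASR, LLM, or TTS model without changing `run_turn`.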

Key Designs

1. Dense Image Captioning

Function: Generating rich descriptions for each key event in the image.
Mechanism: Adopting a two-stage approach: event bounding box proposal followed by caption generation.
- The first stage utilizes person/object detection (Liu et al., 2024a), human segmentation (Kirillov et al., 2023), and depth estimation (Bhat et al., 2023) for probabilistic reasoning.
- The second stage applies chain-of-thought prompting on InternVL2.5 to integrate global contextual understanding into individual event captions.
Design Motivation: State-of-the-art multimodal LLMs (especially smaller ones) perform suboptimally on dense image content, tending to generate generic descriptions and being prone to hallucination.
Effect: Achieved 75% sentence-level accuracy on the image test set.

2. Optimizing Multilingual ASR

Function: Improving speech recognition capabilities for Malay and Tamil, particularly for children's speech.
Mechanism: Fine-tuning Whisper-large-V3 on large-scale, locally collected speech data.
- Tamil: 2,800 hours, Malay: 1,000 hours, sourced from over 1,000 native speakers across different ages and linguistic backgrounds.
Design Motivation: Preliminary analysis revealed a significant performance gap in low-resource languages and children's speech.

Language	Test Set	WER Before Fine-Tuning	WER After Fine-Tuning
Malay	Conversational Speech	40.5%	28.4%
Malay	Children's Speech	20.3%	5.1%
Tamil	Bloom Speech	10.3%	7.1%
Tamil	Children's Speech	13.7%	7.9%
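
To make the table easier to compare across rows, the sketch below shows how word error rate (WER) is conventionally computed (word-level edit distance divided by reference length) and derives the relative WER reduction for each row. The WER figures are copied from the table; the example sentences and the function names are illustrative assumptions, and the source does not say which scoring tool was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative WER reduction in percent."""
    return (before - after) / before * 100


# (language, test set) -> (WER before, WER after), values from the table above.
results = {
    ("Malay", "Conversational Speech"): (40.5, 28.4),
    ("Malay", "Children's Speech"): (20.3, 5.1),
    ("Tamil", "Bloom Speech"): (10.3, 7.1),
    ("Tamil", "Children's Speech"): (13.7, 7.9),
}

for (language, test_set), (before, after) in results.items():
    print(f"{language} / {test_set}: {relative_reduction(before, after):.1f}% relative reduction")

# Illustrative sentences: two dropped words out of six reference words.
example_wer = word_error_rate("the boy is flying a kite", "the boy flying kite")
print(f"example WER: {example_wer:.3f}")
```

In relative terms, the largest gain is on Malay children's speech (roughly a three-quarters reduction), while the other three test sets improve by roughly 30% to 42%.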

3. Optimizing Dialogic LLM

Multilingual Capability Enhancement:

Base Model: Qwen1.5-4B (balancing performance and efficiency).
Two-stage optimization pipeline:
- Stage 1: Continual pre-training on 14B tokens of a quadrilingual mixed dataset, applying balanced sampling rates to boost Malay and Tamil performance.
- Stage 2: Enhancing multilingual instruction-following capabilities through multi-task learning and cross-lingual alignment, including a multilingual role-playing corpus.
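
The source does not specify how the "balanced sampling rates" in Stage 1 are computed. One common way to upweight low-resource languages is exponent-smoothed sampling, where the probability of drawing from language i is proportional to n_i ** alpha for corpus size n_i and 0 <= alpha <= 1. The sketch below illustrates that technique only; the corpus sizes are hypothetical and are not the paper's data split.

```python
from typing import Dict


def smoothed_sampling_rates(token_counts: Dict[str, float], alpha: float = 0.5) -> Dict[str, float]:
    """Return sampling probabilities p_i proportional to n_i ** alpha.

    alpha = 1 reproduces the raw corpus proportions; alpha = 0 samples
    every language uniformly; values in between upweight small corpora.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    weights = {lang: count ** alpha for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical corpus sizes in arbitrary units (NOT taken from the paper).
corpus = {"en": 1000, "zh": 800, "ms": 150, "ta": 50}
raw_share = {lang: count / sum(corpus.values()) for lang, count in corpus.items()}
rates = smoothed_sampling_rates(corpus, alpha=0.5)

for lang in corpus:
    print(f"{lang}: raw share {raw_share[lang]:.3f} -> sampling rate {rates[lang]:.3f}")
```

With alpha below 1, Malay and Tamil are sampled more often than their raw share of the corpus, at the cost of repeating their data more frequently during training.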

Scaffolding Guidance Enhancement:

Grounded in dialogic pedagogy theory (Alexander, 2006), where teachers foster the exchange of ideas through probing, cueing, elaborating, or reviewing.
Synthesizing dialogue samples with GPT-4 to train the smaller model to deliver scaffolded interactions based on student responses.
Building student persona classification (based on the Big Five framework) integrating both cognitive and non-cognitive aspects.
Side benefit: Scaffolding training improves system robustness against inappropriate language and out-of-domain inputs.
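
As a rough illustration of how synthetic scaffolding dialogues might be requested from a teacher model, the sketch below builds one prompt per combination of student persona and dialogic move (probing, cueing, elaborating, reviewing). It is an assumption-laden sketch: the `StudentPersona` fields, the prompt wording, and the function name are hypothetical, and no model API is called.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

# Dialogic moves named in the text (Alexander, 2006).
SCAFFOLD_MOVES = ("probing", "cueing", "elaborating", "reviewing")


@dataclass(frozen=True)
class StudentPersona:
    proficiency: str  # cognitive aspect, e.g. "low" or "high" (hypothetical field)
    trait: str        # non-cognitive aspect, e.g. a Big Five trait label (hypothetical field)


def build_synthesis_prompt(caption: str, persona: StudentPersona, move: str, language: str) -> str:
    """Compose a text prompt asking a teacher model for one tutoring dialogue."""
    if move not in SCAFFOLD_MOVES:
        raise ValueError(f"unknown scaffolding move: {move}")
    return (
        f"You are a friendly primary school language tutor speaking {language}.\n"
        f"The picture shows: {caption}\n"
        f"The student has {persona.proficiency} proficiency and is {persona.trait}.\n"
        f"Write a short tutor-student dialogue in which the tutor mainly uses "
        f"the '{move}' strategy to help the student describe the picture."
    )


personas: List[StudentPersona] = [
    StudentPersona("low", "shy and easily discouraged"),
    StudentPersona("high", "curious and talkative"),
]
prompts = [
    build_synthesis_prompt("a girl feeds ducks by a pond", persona, move, "English")
    for persona, move in product(personas, SCAFFOLD_MOVES)
]
print(len(prompts))
print(prompts[0])
```

Enumerating persona and move combinations is one simple way to get coverage across student types; the resulting dialogues would then serve as fine-tuning data for the smaller model.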

4. Optimizing Multilingual TTS

Framework: VITS (non-autoregressive, balancing speech quality and efficiency).
Data: Malay (22h adult + 9h child), Tamil (63h adult + 1.5h child).
Supporting multi-speaker generation utilizing one-hot speaker embeddings.
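
To show what one-hot speaker conditioning means in practice, the sketch below encodes a speaker as a one-hot vector and projects it through a weight matrix, which is equivalent to selecting that speaker's row. The speaker names, the 3-dimensional weights, and the helper functions are hypothetical; how the conditioning vector is wired into VITS is not described in the source.

```python
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """Return a vector of zeros with a single 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(vector: List[float], weights: List[List[float]]) -> List[float]:
    """Multiply a row vector of length n by an n x d weight matrix."""
    return [
        sum(v * row[k] for v, row in zip(vector, weights))
        for k in range(len(weights[0]))
    ]


# Hypothetical speaker inventory and projection weights (one row per speaker).
SPEAKERS = ["ms_adult", "ms_child", "ta_adult", "ta_child"]
WEIGHTS = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

speaker_vector = one_hot(SPEAKERS.index("ms_child"), len(SPEAKERS))
conditioning = project(speaker_vector, WEIGHTS)
print(speaker_vector)
print(conditioning)
```

Because the one-hot vector has a single non-zero entry, the projection simply returns the row for `ms_child`, so each speaker ends up with its own learned conditioning vector.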

Loss & Training

The system employs a module-by-module optimization strategy, with components trained independently and integrated afterward:
- ASR: Fine-tuned based on Whisper-large-V3.
- LLM: Continual pre-training + Instruction tuning + Scaffolding enhancement.
- TTS: Multi-speaker VITS training.

All experiments were conducted on Nvidia A100 40/80GB GPUs.

Key Experimental Results

TTS Evaluation (Main Results)

Metric	Malay (Adult)	Malay (Child)	Tamil (Adult)	Tamil (Child)
MOS (Subjective)	>3.50	>3.50	>3.50	>3.50
CER (Speech Intelligibility)	<10%	<10%	<10%	<10%

In an evaluation with 20 native listeners, speech intelligibility exceeded 90%.
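
For readers unfamiliar with the metric, a mean opinion score (MOS) is the average of listener ratings on a 1 to 5 scale. The sketch below computes a MOS and a normal-approximation 95% confidence interval for one voice. The 20 ratings are invented purely for illustration and are not the study's data; the function name is also an assumption.

```python
import math
import statistics
from typing import List, Tuple


def mos_with_ci(ratings: List[int], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (MOS, lower bound, upper bound) for ratings on a 1-5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    mean = statistics.mean(ratings)
    # Half-width of the interval: z * sample standard deviation / sqrt(n).
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, mean - half_width, mean + half_width


# Hypothetical ratings from 20 listeners for a single voice (illustration only).
ratings = [4, 4, 3, 5, 4, 3, 4, 4, 5, 3, 4, 4, 3, 4, 5, 4, 3, 4, 4, 4]
mos, low, high = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} (95% CI {low:.2f} to {high:.2f})")
```

With only 20 listeners the interval is fairly wide, which is worth keeping in mind when comparing voices whose scores are close to the 3.50 threshold.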

User Study & Scaffolding Analysis (Ablation Study)

Scaffolding Type	High-Performing Students	Low-Performing Students
Feeding back	69%	43%
Explanation	21%	9%
Hints	5%	12%
Social-emotional	17%	31%

An empirical study on 35 Grade 1 and 2 students (IRB-2024-218) demonstrates that the system adaptively tailors its pedagogical strategies based on student performance.
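
The source does not describe how the percentages in the table were derived. As one plausible reading, the sketch below assumes each tutor turn is annotated with the student's performance group and one or more scaffolding labels, and reports the share of turns in each group that carry each label. The annotated turns and the function name are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def scaffolding_shares(turns: List[Tuple[str, Set[str]]]) -> Dict[str, Dict[str, float]]:
    """Percentage of turns per group that carry each scaffolding label."""
    turn_totals: Counter = Counter()
    label_counts: Dict[str, Counter] = defaultdict(Counter)
    for group, labels in turns:
        turn_totals[group] += 1
        label_counts[group].update(labels)
    return {
        group: {label: 100 * count / turn_totals[group] for label, count in counts.items()}
        for group, counts in label_counts.items()
    }


# Hypothetical annotated tutor turns: (student group, scaffolding labels).
annotated_turns = [
    ("high", {"feeding back"}),
    ("high", {"feeding back", "explanation"}),
    ("low", {"hints", "social-emotional"}),
    ("low", {"feeding back"}),
]

shares = scaffolding_shares(annotated_turns)
for group, by_label in shares.items():
    for label, pct in sorted(by_label.items()):
        print(f"{group}: {label} {pct:.0f}%")
```

Under this multi-label assumption the percentages within a group need not add up to 100%, since a single turn can count toward several scaffolding types.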

Key Findings

Adaptive Scaffolding is Effective: High-performing students receive more feedback and explanations, guiding them toward deeper understanding, whereas low-performing students receive more hints and social-emotional support.
Significant ASR Improvements on Children's Speech: WER on Malay children's speech fell from 20.3% to 5.1%.
Substantial Boost in Multilingual Abilities: Through continual pre-training and cross-lingual alignment, both translation and instruction-following abilities improved.
Scaffolding Training Enhances Robustness: When facing inappropriate or out-of-domain inputs, the system successfully steers the students back to the picture description task.

Highlights & Insights

Comprehensive Systems Engineering Approach: Instead of an isolated model-level innovation, the work integrates four components (ASR, LLM, TTS, and image understanding) into a fully functional educational system.
Blending Scaffolding Theory with AI: Systematically incorporating dialogic pedagogy from learning sciences into LLM training, achieving adaptive teaching using personalized student personas.
Focus on Low-Resource Languages: Tailoring optimization specifically to Malay and Tamil, demonstrating a strong commitment to linguistic equity.
Real-World Validation: Conducting user studies with actual primary school students rather than relying solely on automated evaluation metrics.

Limitations & Future Work

Hallucination Issues: LLMs still carry the risk of hallucinations and biases, which might lead to communication errors in educational settings.
Noisy Environments: Classroom noise and children's typical speech patterns elevate ASR errors, necessitating noise-robust speech recognition and speaker diarization.
Student Disengagement: Some students withdraw when encountering persistent challenges; the system needs better mechanisms to trigger modeling strategies.
Guiding Lower-Grade Students: The system cannot yet fully replace adult/parent guidance for younger children.
Visual Complexity: When an image contains too many objects, children are easily distracted, requiring auxiliary visual highlighting.

Related Work & Insights

Evolution of Intelligent Tutoring Systems (ITS): Moving from rule-based engines to generative LLM-driven dialogic tutors.
Application prospects of multimodal LLMs in education.
Dialogic pedagogy theory (Alexander, 2006) provides a solid theoretical foundation for AI educational system designs.
Personalized learning pathways and adaptive feedback remain core directions for educational AI.

Rating

Novelty: ⭐⭐⭐ (System integration innovation, limited single-module innovation)
Experimental Thoroughness: ⭐⭐⭐ (Quantitative evaluation carried out for each module, but user study size is relatively small)
Writing Quality: ⭐⭐⭐⭐ (Well-structured, comprehensive system exposition)
Value: ⭐⭐⭐⭐ (High practical value for real-world educational settings)