COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus
Conference: ACL 2025
arXiv: 2506.15372
Code: None
Area: Multilingual / Multimodal
Keywords: Indian languages, Multimodal, Comment-sensitive, Summarization, Headline generation
TL;DR
Builds COSMMIC, the first comment-sensitive multimodal multilingual dataset for Indian languages—covering 9 Indian languages, 4,959 article-image pairs, and 24,484 reader comments; proposes comment filtering (IndicBERT) and image classification (CLIP) enhancement schemes, and establishes summarization and headline generation benchmarks using GPT-4 and LLaMA3.
Background & Motivation
Background: Reader comments contain reactions, sentiments, and supplementary information about articles, which can enhance the quality of summarization and headline generation; however, existing datasets are almost exclusively in English.
Limitations of Prior Work: There is a lack of comment-sensitive multimodal datasets for Indian languages; existing multilingual summarization resources do not combine reader comments and associated images.
Key Challenge: India has hundreds of millions of internet users who speak many different languages, yet NLP resources for those languages remain scarce.
Goal: To construct the first Indian language dataset that simultaneously covers multilingual, multimodal, and comment-sensitive dimensions.
Key Insight: Mainstream Indian news websites already publish articles, images, and reader comments side by side, so crawling them yields aligned article-image-comment data.
Core Idea: Comments are not merely noise; after quality filtering, they can provide valuable contextual signals to enhance generation tasks.
Method
Overall Architecture
Data construction: Crawling from news websites in 9 languages \(\rightarrow\) Comment quality filtering (IndicBERT classification) \(\rightarrow\) Image relevance filtering (CLIP matching) \(\rightarrow\) GPT-4/LLaMA3 summarization and headline baselines.
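The flow above can be sketched as a small data-cleaning function. This is only an illustration under stated assumptions: the `Record` type, the `build_record` helper, the scorer callables, and both thresholds are hypothetical names and values, not taken from the paper. In the described pipeline, the comment scorer would be the IndicBERT classifier and the image scorer would be a CLIP similarity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Record:
    """One crawled item: article text, optional image reference, raw comments."""
    language: str
    article: str
    image: Optional[str]
    comments: List[str] = field(default_factory=list)


def build_record(
    raw: Record,
    comment_score: Callable[[str, str], float],  # hypothetical: (article, comment) -> P(useful)
    image_score: Callable[[str, str], float],    # hypothetical: (article, image) -> similarity
    comment_threshold: float = 0.5,              # assumed cut-off, not from the paper
    image_threshold: float = 0.25,               # assumed cut-off, not from the paper
) -> Record:
    """Filter comments first, then the image, and return a cleaned record."""
    kept = [c for c in raw.comments
            if comment_score(raw.article, c) >= comment_threshold]
    image = raw.image
    if image is not None and image_score(raw.article, image) < image_threshold:
        image = None  # drop an image judged irrelevant to the article
    return Record(raw.language, raw.article, image, kept)


# Toy stand-ins for the real models, used only to exercise the flow.
def toy_comment_score(article: str, comment: str) -> float:
    words = set(article.lower().split())
    return 1.0 if any(w in words for w in comment.lower().split()) else 0.0


def toy_image_score(article: str, image: str) -> float:
    return 0.9 if "flood" in image else 0.1


raw = Record(
    language="hi",
    article="Flood relief reaches the city",
    image="ad_banner.jpg",
    comments=["relief came late", "buy cheap phones"],
)
clean = build_record(raw, toy_comment_score, toy_image_score)
```

Passing the scorers in as callables keeps the sketch independent of any particular model library; the cleaned records would then be fed to the GPT-4 and LLaMA3 baselines.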
Key Designs
- 9-Language Data Crawling and Standardization:
    - Function: Crawls article-image-comment triplets in 9 languages from mainstream Indian news websites
    - Mechanism: Covers Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, and Odia, with 400-600 articles per language
    - Design Motivation: India is one of the countries with the highest linguistic diversity in the world; covering 9 major languages ensures representation
- Comment Quality Filtering (IndicBERT):
    - Function: Filters low-quality/irrelevant comments using an IndicBERT classifier
    - Mechanism: Trains a binary classifier to judge whether comments are relevant to the article and informative, retaining high-quality comments
    - Design Motivation: Many raw comments are spam, irrelevant, or of low quality; directly using them without filtering introduces noise and degrades generation quality
- Image Relevance Filtering (CLIP):
    - Function: Uses CLIP to determine if the article's associated image is relevant to the article content
    - Mechanism: Computes the CLIP similarity between the image and the article text to filter out irrelevant images
    - Design Motivation: News illustrations are sometimes advertisements or irrelevant images; retaining only relevant images is necessary to leverage multimodal enhancement
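The image relevance check described above comes down to comparing two embedding vectors. The sketch below assumes the text and image embeddings have already been produced by a CLIP-style encoder and shows only the similarity and thresholding step. The function names and the `0.25` threshold are illustrative assumptions; the summary does not state the cut-off that was used.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_u * norm_v)


def is_relevant_image(
    text_emb: Sequence[float],
    image_emb: Sequence[float],
    threshold: float = 0.25,  # assumed value, to be tuned on held-out data
) -> bool:
    """Keep the image only if its embedding is close enough to the text's."""
    return cosine_similarity(text_emb, image_emb) >= threshold
```

In practice the threshold would need to be calibrated per dataset, since raw CLIP similarities between text and images tend to occupy a narrow range.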
Loss & Training
The IndicBERT comment filter is trained using binary cross-entropy. Baseline experiments perform zero-shot and few-shot summarization and headline generation using GPT-4 and LLaMA3.
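For reference, the standard binary cross-entropy objective over \(N\) labelled comments can be written as follows. The symbols are assumptions introduced here for illustration: \(y_i \in \{0, 1\}\) marks whether comment \(i\) is labelled useful, and \(\hat{p}_i\) is the classifier's predicted probability that it is.

\[
\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\, y_i \log \hat{p}_i + (1 - y_i)\log\bigl(1 - \hat{p}_i\bigr) \Bigr]
\]

The loss is small when useful comments receive \(\hat{p}_i\) near 1 and the rest receive \(\hat{p}_i\) near 0.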
Key Experimental Results
Main Results
Summarization ROUGE scores (averaged across 9 languages):
| Model | Input | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|
| GPT-4 | Article | 32.5 | 12.1 | 28.3 |
| GPT-4 | Article + Comments | 35.2 | 14.3 | 30.8 |
| LLaMA3 | Article | 28.7 | 9.8 | 24.5 |
| LLaMA3 | Article + Comments | 31.4 | 11.5 | 27.1 |
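As a reminder of what these columns measure, here is a minimal sketch of ROUGE-N and ROUGE-L F1. It assumes whitespace tokenization and no stemming, which is a simplification: the tokenization and ROUGE implementation used in the paper are not stated in this summary, and Indian-language text generally needs language-aware tokenization. The functions return values in [0, 1]; the table reports them scaled by 100.

```python
from collections import Counter
from typing import List


def _f1(overlap: int, n_ref: int, n_cand: int) -> float:
    """Harmonic mean of precision (overlap/n_cand) and recall (overlap/n_ref)."""
    if overlap == 0 or n_ref == 0 or n_cand == 0:
        return 0.0
    precision = overlap / n_cand
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N F1 based on clipped n-gram overlap."""
    ref = _ngrams(reference.split(), n)
    cand = _ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())  # Counter intersection clips counts
    return _f1(overlap, sum(ref.values()), sum(cand.values()))


def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence (LCS)."""
    ref, cand = reference.split(), candidate.split()
    prev = [0] * (len(cand) + 1)  # LCS lengths for the previous reference token
    for r in ref:
        curr = [0]
        for j, c in enumerate(cand, start=1):
            if r == c:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(ref), len(cand))
```

A per-language score would be the mean of these values over that language's test articles, and the table's figures would then be the mean over the 9 languages.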
Ablation Study
Impact of comment filtering:
| Comment Quality | ROUGE-L Change |
|---|---|
| No Comments | Baseline |
| + All Comments (Unfiltered) | +1.2 |
| + Filtered Comments | +2.5 |
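The gains can be cross-checked against the main results table with a few lines of arithmetic. The numbers below are copied from the two tables in this section; the variable names are illustrative, and it is an assumption that the ablation deltas are directly comparable with the per-model rows.

```python
# ROUGE-L values copied from the main results table
rouge_l_scores = {
    "GPT-4": {"article": 28.3, "article+comments": 30.8},
    "LLaMA3": {"article": 24.5, "article+comments": 27.1},
}

# Gain from adding comments, per model (rounded to the table's precision)
gains = {
    model: round(scores["article+comments"] - scores["article"], 1)
    for model, scores in rouge_l_scores.items()
}

# Ratio of the filtered gain to the unfiltered gain from the ablation table
filtered_gain, unfiltered_gain = 2.5, 1.2
ratio = round(filtered_gain / unfiltered_gain, 2)

print(gains, ratio)
```

Both per-model gains are close to the +2.5 reported for filtered comments, and the ratio is slightly above 2.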
Key Findings
- Comments improve summarization quality: Average ROUGE-L increases by 2.5 points once filtered comments are added to the input.
- Comment filtering is crucial: Filtered comments yield roughly double the gain of unfiltered ones (+2.5 vs. +1.2 ROUGE-L).
- GPT-4 outperforms LLaMA3 in every setting (e.g., 28.3 vs. 24.5 ROUGE-L with article-only input); the gap is reported to be even wider for low-resource languages.
- Adding images on top of comments brings limited further benefit: Text + comments already captures most of the gain.
Highlights & Insights
- First Indian language dataset simultaneously covering three dimensions: Multilingual + Multimodal + Comment-sensitive.
- Comment filtering methodology: Shows empirically that comments are not mere noise; once filtered, they are a valuable signal.
- Coverage of 9 languages: Provides infrastructure for low-resource Indian language NLP research.
- GPT-4 vs LLaMA3 baseline: Provides a reference point for future work.
Limitations & Future Work
- Does not cover all Indian languages (e.g., Punjabi and Assamese are missing).
- The distribution of the number of comments is uneven across different languages.
- Baseline experiments are mainly zero/few-shot, without fine-tuning.
- The multimodal enhancement effect of images is limited.
Related Work & Insights
- vs XL-Sum (English + multilingual summarization): Lacks reader comments; this work introduces the comment dimension.
- vs IndicNLPSuite: Covers various Indian language NLP tasks but lacks summarization/comments.
- Insight: NLP research in low-resource languages requires a "trinity" dataset (multimodal + multilingual + contextual signals).
Rating
- Novelty: ⭐⭐⭐⭐ A dataset construction work, with innovation in three-dimensional fusion.
- Experimental Thoroughness: ⭐⭐⭐⭐ Baseline experiments are relatively basic.
- Writing Quality: ⭐⭐⭐⭐ Data construction is clearly described.
- Value: ⭐⭐⭐⭐⭐ Fills the gap for low-resource Indian language NLP research.