COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus

Conference: ACL 2025
arXiv: 2506.15372
Code: None
Area: Multilingual / Multimodal
Keywords: Indian languages, Multimodal, Comment-sensitive, Summarization, Headline generation

TL;DR

Builds COSMMIC, the first comment-sensitive multimodal multilingual dataset for Indian languages, covering 9 languages, 4,959 article-image pairs, and 24,484 reader comments. It proposes comment filtering (IndicBERT) and image classification (CLIP) as enhancement schemes, and establishes summarization and headline generation benchmarks using GPT-4 and LLaMA3.

Background & Motivation

Background: Reader comments contain reactions, sentiments, and supplementary information about articles, which can enhance the quality of summarization and headline generation; however, existing datasets are almost exclusively in English.
Limitations of Prior Work: There is a lack of comment-sensitive multimodal datasets for Indian languages; existing multilingual summarization resources do not combine reader comments and associated images.
Key Challenge: India has hundreds of millions of internet users who speak many different languages, yet NLP resources for those languages remain severely lacking.
Goal: To construct the first Indian language dataset that simultaneously covers multilingual, multimodal, and comment-sensitive dimensions.
Key Insight: Articles, images, and reader comments can all be crawled together from mainstream Indian news websites.
Core Idea: Comments are not merely noise; after quality filtering, they can provide valuable contextual signals to enhance generation tasks.

Method

Overall Architecture

Data construction: Crawling from news websites in 9 languages \(\rightarrow\) Comment quality filtering (IndicBERT classification) \(\rightarrow\) Image relevance filtering (CLIP matching) \(\rightarrow\) GPT-4/LLaMA3 summarization and headline baselines.
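The stages above can be sketched as a small preprocessing function. This is only an illustration: the `Sample` fields, the `preprocess` helper, and the two toy predicates are hypothetical names, and the predicates stand in for the IndicBERT classifier and the CLIP matcher so the sketch runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Sample:
    """One article with its image and reader comments (field names are illustrative)."""
    language: str
    article: str
    image_path: Optional[str]
    comments: List[str] = field(default_factory=list)


def preprocess(
    sample: Sample,
    keep_comment: Callable[[str, str], bool],
    image_is_relevant: Callable[[str, str], bool],
) -> Sample:
    """Apply the two filtering stages in the order the pipeline lists them."""
    # Stage 1: keep only comments the (stand-in) classifier accepts.
    kept = [c for c in sample.comments if keep_comment(sample.article, c)]
    # Stage 2: drop the image if the (stand-in) relevance check rejects it.
    image = sample.image_path
    if image is not None and not image_is_relevant(sample.article, image):
        image = None
    return Sample(sample.language, sample.article, image, kept)


# Toy stand-ins (assumptions, not the paper's models).
def toy_keep_comment(article: str, comment: str) -> bool:
    return len(comment.split()) >= 3  # placeholder for an IndicBERT classifier


def toy_image_is_relevant(article: str, image_path: str) -> bool:
    return not image_path.startswith("ads/")  # placeholder for CLIP matching


raw = Sample("hi", "Article text", "ads/banner.jpg", ["ok", "This adds useful context"])
clean = preprocess(raw, toy_keep_comment, toy_image_is_relevant)
```

Passing the filters in as callables keeps the record handling separate from the models, so either stage can be swapped without touching the rest.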

Key Designs

9-Language Data Crawling and Standardization:
- Function: Crawls article-image-comment triplets in 9 languages from mainstream Indian news websites
- Mechanism: Covers Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, and Odia, with 400-600 articles per language
- Design Motivation: India is one of the countries with the highest linguistic diversity in the world; covering 9 major languages ensures representation
Comment Quality Filtering (IndicBERT):
- Function: Filters low-quality/irrelevant comments using an IndicBERT classifier
- Mechanism: Trains a binary classifier to judge whether comments are relevant to the article and informative, retaining high-quality comments
- Design Motivation: Many raw comments are spam, irrelevant, or of low quality; directly using them without filtering introduces noise and degrades generation quality
Image Relevance Filtering (CLIP):
- Function: Uses CLIP to determine if the article's associated image is relevant to the article content
- Mechanism: Computes the CLIP similarity between the image and the article text to filter out irrelevant images
- Design Motivation: News illustrations are sometimes advertisements or irrelevant images; retaining only relevant images is necessary to leverage multimodal enhancement
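Both filters reduce to scoring an item and keeping it if the score clears a threshold. The sketch below illustrates that idea with toy numbers. The embeddings, the classifier probabilities, and both thresholds (0.25 and 0.5) are assumptions made for illustration; the summary does not state the values used. In practice the vectors would come from CLIP's image and text encoders, and the probabilities from the trained IndicBERT classifier.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def image_is_relevant(
    image_emb: Sequence[float], text_emb: Sequence[float], threshold: float = 0.25
) -> bool:
    """Keep the image only if it is similar enough to the article text."""
    return cosine_similarity(image_emb, text_emb) >= threshold


def filter_comments(
    scored: List[Tuple[str, float]], threshold: float = 0.5
) -> List[str]:
    """Keep comments whose predicted relevance probability meets the threshold."""
    return [text for text, prob in scored if prob >= threshold]


# Toy embeddings standing in for CLIP outputs.
text_emb = [0.9, 0.1, 0.0]
photo_emb = [0.8, 0.2, 0.1]   # points in nearly the same direction as the text
ad_emb = [0.0, 0.1, 0.9]      # nearly orthogonal to the text

# Toy probabilities standing in for IndicBERT classifier outputs.
kept = filter_comments([("adds a detail", 0.91), ("spam link", 0.07)])
```

The thresholds control the trade-off between discarding useful items and letting noise through, so they would normally be tuned on held-out annotated data.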

Loss & Training

The IndicBERT comment filter is trained using binary cross-entropy. Baseline experiments perform zero-shot and few-shot summarization and headline generation using GPT-4 and LLaMA3.
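For reference, binary cross-entropy averages \(-[y \log p + (1 - y) \log(1 - p)]\) over the training examples, where \(p\) is the predicted probability that a comment should be kept and \(y\) is its 0/1 label. The minimal sketch below computes it in plain Python. The function name and the clipping constant `eps` are illustrative choices, not details taken from the paper.

```python
import math
from typing import Sequence


def binary_cross_entropy(
    probs: Sequence[float], labels: Sequence[int], eps: float = 1e-12
) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clip to avoid log(0) when a prediction is exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# A confident correct prediction costs little; a confident wrong one costs a lot.
good = binary_cross_entropy([0.9, 0.1], [1, 0])
bad = binary_cross_entropy([0.1, 0.9], [1, 0])
```

An uninformative classifier that always outputs 0.5 scores ln 2 (about 0.693), which is a useful sanity check during training.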

Key Experimental Results

Main Results

Summarization generation ROUGE scores (averaged across 9 languages):

Model	Input	ROUGE-1	ROUGE-2	ROUGE-L
GPT-4	Article	32.5	12.1	28.3
GPT-4	Article + Comments	35.2	14.3	30.8
LLaMA3	Article	28.7	9.8	24.5
LLaMA3	Article + Comments	31.4	11.5	27.1
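The gain from adding comments can be read off the table by subtracting the article-only row from the article-plus-comments row for each model. The short script below does that arithmetic; the numbers are copied from the table above, and the variable and function names are illustrative.

```python
# (ROUGE-1, ROUGE-2, ROUGE-L) for each model and input setting, from the table.
rouge = {
    ("GPT-4", "Article"): (32.5, 12.1, 28.3),
    ("GPT-4", "Article + Comments"): (35.2, 14.3, 30.8),
    ("LLaMA3", "Article"): (28.7, 9.8, 24.5),
    ("LLaMA3", "Article + Comments"): (31.4, 11.5, 27.1),
}


def comment_gain(model: str) -> tuple:
    """Per-metric difference between the with-comments and article-only rows."""
    base = rouge[(model, "Article")]
    with_comments = rouge[(model, "Article + Comments")]
    # Round to one decimal to match the table's precision.
    return tuple(round(w - b, 1) for w, b in zip(with_comments, base))


gains = {model: comment_gain(model) for model in ("GPT-4", "LLaMA3")}
print(gains)
```

Both models gain 2.7 ROUGE-1 points; the ROUGE-L gains are 2.5 for GPT-4 and 2.6 for LLaMA3.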

Ablation Study

Impact of comment filtering:

Comment Quality	ROUGE-L Change
No Comments	Baseline
+ All Comments (Unfiltered)	+1.2
+ Filtered Comments	+2.5
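As background on the metric in this table, ROUGE-L is based on the longest common subsequence (LCS) between a reference summary and a generated one. The sketch below computes the F1 variant with whitespace tokenization. Both choices are assumptions: the summary does not say which ROUGE implementation or tokenizer the paper used, and Indian-language text may need a language-aware tokenizer. The function returns a value in [0, 1], whereas the tables report scores on a 0 to 100 scale.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 from LCS-based precision and recall (whitespace tokens)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
```

In the example, five of the six tokens form a common subsequence, so precision and recall are both 5/6.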

Key Findings

Comments improve summarization quality: average ROUGE-L rises by about 2.5 points after incorporating comments (GPT-4: 28.3 to 30.8; LLaMA3: 24.5 to 27.1).
Comment filtering is crucial: filtered comments yield roughly double the gain of unfiltered ones (+2.5 vs. +1.2 ROUGE-L).
GPT-4 outperforms LLaMA3 in every setting (e.g., 28.3 vs. 24.5 ROUGE-L with article-only input), and the gap is even wider for low-resource languages.
Adding images on top of comments brings limited further gain: text plus comments already captures most of the benefit.

Highlights & Insights

First Indian language dataset simultaneously covering three dimensions: Multilingual + Multimodal + Comment-sensitive.
Comment filtering methodology: Shows that comments are not simply noise; once filtered, they are valuable signals.
Coverage of 9 languages: Provides infrastructure for low-resource Indian language NLP research.
GPT-4 vs LLaMA3 baseline: Provides a reference point for future work.

Limitations & Future Work

Does not cover all Indian languages (e.g., Punjabi and Assamese are missing).
The distribution of the number of comments is uneven across different languages.
Baseline experiments are mainly zero/few-shot, without fine-tuning.
The multimodal enhancement effect of images is limited.

Related Work & Insights

vs. XL-Sum (English + multilingual summarization): Lacks reader comments; this work introduces the comment dimension.
vs. IndicNLPSuite: Covers various Indian language NLP tasks but lacks summarization and comments.
Insight: NLP research in low-resource languages requires a "trinity" dataset (multimodal + multilingual + contextual signals).

Rating

Novelty: ⭐⭐⭐⭐ A dataset construction work, with innovation in three-dimensional fusion.
Experimental Thoroughness: ⭐⭐⭐⭐ Baseline experiments are relatively basic.
Writing Quality: ⭐⭐⭐⭐ Data construction is clearly described.
Value: ⭐⭐⭐⭐⭐ Fills the gap for low-resource Indian language NLP research.