GeoDiv: Framework for Measuring Geographical Diversity in Text-to-Image Models¶

Conference: ICLR 2026
arXiv: 2602.22120
Code: GitHub
Area: Text-to-Image Generation / Fairness Evaluation
Keywords: Geographical Diversity, Text-to-Image Models, Socioeconomic Bias, VLM Evaluation, Interpretable Metrics

TL;DR¶

This paper proposes GeoDiv, a framework that uses the world knowledge embedded in LLMs and VLMs to evaluate the geographical diversity of T2I models along two axes: the Socioeconomic Visual Index (SEVI) and the Visual Diversity Index (VDI). Applied to current models, it reveals a systematic bias toward depicting countries such as India and Nigeria as impoverished.

Background & Motivation¶

Background: T2I models (e.g., Stable Diffusion, FLUX.1) are widely deployed in commercial applications, yet their generated outputs frequently lack geographical diversity and perpetuate stereotypical depictions of different regions.
Limitations of Prior Work: Existing diversity metrics either rely on annotated datasets (e.g., GeoDE) or focus solely on low-level visual similarity (e.g., Vendi-Score), failing to capture the multidimensional nature of geographical diversity in an interpretable manner.
Key Challenge: Geographical diversity spans multiple dimensions — economic, environmental, and cultural — which cannot be comprehensively captured by a single metric; moreover, existing approaches have limited capacity for fine-grained bias detection at the country level.
Key Insight: Exploit the implicit world knowledge of LLMs/VLMs to design an interpretable and automated evaluation framework.
Core Idea: Decompose geographical diversity into four interpretable dimensions — Affluence and Maintenance (constituting SEVI) and Entity Appearance and Background Appearance (constituting VDI) — and quantify diversity using Hill Numbers.

Method¶

Overall Architecture¶

The GeoDiv pipeline operates as follows. Given an entity \(e\) and a country \(c\), an LLM generates entity-specific attribute questions, which are used alongside a fixed set of background questions. A VQA model then predicts answer distributions over the image set, and VDI is computed from these distributions as a normalized Hill Number. In parallel, the VQA model scores each image to produce SEVI.

Key Designs¶

Visual Diversity Index (VDI):
- Function: Evaluates visual variation across images along two axes — entity appearance and background appearance.
- Mechanism: An ensemble of LLMs generates attribute question–answer pairs; a VQA model predicts answer distributions; diversity is quantified using normalized Hill Numbers.
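- Diversity score: \(\text{Diversity-Score} = \frac{\exp(H(\hat{P}_k)) - 1}{|\hat{\mathcal{A}}_k| - 1}\), where \(H(\cdot)\) denotes Shannon entropy, \(\hat{P}_k\) is the predicted answer distribution for question \(k\), and \(\hat{\mathcal{A}}_k\) is the corresponding answer set. The score is 0 when all images receive the same answer and 1 when the answers are spread uniformly over the answer set.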
- Design Motivation: Questions vary in the number of possible answers; normalization enables fair comparison across questions.
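
The diversity score above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `diversity_score` and its arguments are hypothetical, and it assumes that each image contributes one selected answer and that `num_options` is the size of the answer set for the question.

```python
import math
from collections import Counter


def diversity_score(answers, num_options):
    """Normalized Hill number of order 1 for a single question.

    answers:     list of answer labels, one per image (assumed input format)
    num_options: size of the answer set for this question (assumed >= 2)
    """
    if num_options < 2:
        raise ValueError("num_options must be at least 2")
    if not answers:
        raise ValueError("answers must not be empty")

    # Empirical answer distribution over the image set.
    counts = Counter(answers)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]

    # Shannon entropy with the natural logarithm, so exp(entropy) is the
    # effective number of distinct answers.
    entropy = -sum(p * math.log(p) for p in probs)

    # Rescale from [1, num_options] to [0, 1].
    return (math.exp(entropy) - 1) / (num_options - 1)


# Two equally frequent answers out of four options give an effective
# count of 2, so the normalized score is (2 - 1) / (4 - 1).
score = diversity_score(["brick", "wood", "brick", "wood"], num_options=4)
```

Because the denominator depends on the number of options, a question with ten possible answers and a question with three are both mapped onto the same 0 to 1 range, which is the comparison the normalization is meant to enable.
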
Socioeconomic Visual Index (SEVI):
- Function: Rates the Affluence and Maintenance of generated images, each on a 1–5 scale.
- Mechanism: A VLM directly scores each image; Hill Numbers are then applied to the resulting score distributions to quantify diversity.
- Design Motivation: Integrates socioeconomic indicators with visual analysis to ensure consistency across subjective concepts.
Reliability Mechanisms:
- Visibility Step: Filters out images in which the target attribute is not visible, reducing VQA hallucinations.
- Multi-Select: Allows multiple answer selections to avoid distribution distortion caused by forced single-choice.
- NOTA Option: Appends a "None of the Above" option (selected in only 2.6% of cases) to reduce guessing.
- Large-Scale Human Validation: Local annotators from 14 countries verify the alignment between SEVI and human judgments.
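
How the first three reliability mechanisms could shape the answer distribution is sketched below in Python. This is an illustration under stated assumptions, not the paper's code: the response format (a dict with `visible` and `selected` keys), the function name `answer_distribution`, the choice to count every selected option once, and the choice to keep "None of the Above" as its own category are all hypothetical.

```python
from collections import Counter

NOTA = "None of the Above"  # assumed label for the NOTA option


def answer_distribution(responses):
    """Build an answer distribution from per-image VQA responses.

    responses: list of dicts, one per image, in an assumed format:
        {"visible": bool, "selected": [answer labels]}
    """
    counts = Counter()
    for response in responses:
        # Visibility step: skip images where the attribute cannot be seen.
        if not response["visible"]:
            continue
        # Multi-select: every chosen option is counted (assumption: each
        # selection carries equal weight).
        for answer in response["selected"]:
            counts[answer] += 1

    total = sum(counts.values())
    if total == 0:
        return {}
    return {answer: count / total for answer, count in counts.items()}


responses = [
    {"visible": True, "selected": ["brick", "wood"]},
    {"visible": False, "selected": ["glass"]},  # dropped by visibility step
    {"visible": True, "selected": ["brick"]},
    {"visible": True, "selected": [NOTA]},
]
dist = answer_distribution(responses)
```

In this toy input the non-visible image contributes nothing, so "glass" never enters the distribution, and the image with two selections adds mass to both "brick" and "wood" instead of being forced into one of them.
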

Evaluation Setup¶

Gemini-2.5-flash is used as the VQA/VLM backbone (best accuracy: 86%; human correlation: ρ = 0.76 for Affluence and 0.69 for Maintenance).
Evaluation covers 4 T2I models, 10 entities, and 16 countries, for a total of 160,000 synthetic images (250 per combination of model, entity, and country, assuming an even split).

Key Experimental Results¶

Main Results¶

VQA Model	VDI Entity Acc.	VDI Background Acc.	SEVI-Affluence ρ	SEVI-Maintenance ρ
Gemini-2.5-flash	0.87	0.85	0.76	0.69
gpt-4o	0.85	0.81	0.76	0.76
Qwen2.5-VL	0.85	0.77	0.69	0.71
LLaVA-v1.6	0.70	0.66	0.65	0.68
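
The ρ columns compare model-assigned SEVI scores with human ratings. The summary does not state which correlation coefficient is used; assuming it is Spearman's rank correlation, a dependency-free Python sketch of the computation looks like this. The function names and the toy ratings are hypothetical and are not data from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Toy example: model scores vs. human scores for five images.
model_scores = [1, 2, 3, 4, 5]
human_scores = [2, 4, 6, 8, 10]
rho = spearman_rho(model_scores, human_scores)
```

A rank-based coefficient suits ordinal 1–5 ratings because it only asks whether the model orders images the same way annotators do, not whether the numeric scores match exactly.
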

Key Findings: Country-Level Bias¶

Country Group	Avg. Affluence	Avg. Maintenance	Diversity Score
India / Nigeria / Colombia	2.31	3.34	Low
Japan / UAE / UK	3.53	4.30	Low
FLUX.1 (Global)	3.82	4.73	Extremely low (0.15)

Key Findings¶

FLUX.1 produces the most refined images but exhibits the lowest diversity, revealing a trade-off between refinement and diversity.
Overall geographical diversity has declined in newer model versions.
Background diversity (0.31) is substantially lower than entity diversity (0.44); for example, mountains appear in only 12% of images.
Comparison with Vendi-Score: only entity diversity correlates moderately with Vendi-Score (ρ = 0.56); correlations for the other dimensions are low.

Highlights & Insights¶

GeoDiv is the first systematic and interpretable evaluation framework for geographical diversity in T2I models, supporting arbitrary extension to new entities and countries.
The discovery of FLUX.1's "high quality, low diversity" trade-off offers direct guidance for model development.
All data, annotations, and code are publicly released.

Limitations & Future Work¶

Coverage is limited to 16 countries and 10 entities; extending to more regions may reveal additional bias patterns.
The framework relies on the world knowledge of LLMs/VLMs, which may themselves harbor biases.
Cultural representation remains a limitation, with annotator–VQA model disagreements observed for certain countries.

Comparison with Related Work¶

vs. Vendi-Score: Vendi-Score measures only visual variation and cannot capture the socioeconomic dimension.
vs. GRADE: GRADE evaluates diversity only for everyday objects and does not address the complexity of the geographical dimension.

Rating¶

Novelty: ⭐⭐⭐⭐ First framework to decompose geographical diversity into four interpretable dimensions via SEVI + VDI.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 160K images, large-scale human validation, multi-model and multi-country comparisons.
Writing Quality: ⭐⭐⭐⭐ Clear structure with insightful findings.
Value: ⭐⭐⭐⭐ Direct applicability to fairness evaluation of T2I models.