DermaCon-IN: A Multi-concept Annotated Dermatological Image Dataset of Indian Skin Disorders

Conference: NeurIPS 2025
arXiv: 2506.06099
Code: GitHub / Harvard Dataverse
Area: Medical Imaging
Keywords: dermatology dataset, Indian skin tone, concept bottleneck model, hierarchical classification, explainable AI

TL;DR

This work introduces DermaCon-IN, the first densely annotated dermatological image dataset predominantly featuring Indian skin tones (5,450 images from 3,002 patients, spanning 245 diagnoses). It provides three-level hierarchical diagnostic labels, 47 lesion descriptors, and 49 anatomical site annotations, and reports benchmark evaluations with CNN, ViT, and concept bottleneck model architectures.

Background & Motivation

Background: Skin diseases constitute the fourth largest non-fatal disease burden globally. AI-assisted diagnosis is regarded as a promising approach to address the shortage of dermatological resources; however, existing models are predominantly trained and evaluated on datasets collected from Europe and North America.

Limitations of Prior Work:
- Severe dataset bias: ISIC, HAM10000, and similar datasets focus on neoplastic conditions such as melanoma, neglecting tropical high-prevalence diseases such as fungal infections and scabies.
- Underrepresentation of skin tones: Over 75% of Fitzpatrick17k consists of Type I–III (lighter skin), with a critical lack of darker skin tones (Type IV–VI).
- One-dimensional annotations: Most datasets provide only diagnostic labels, lacking anatomical site information and lesion morphology descriptors.
- Data source bias: SD-198 and Fitzpatrick17k are derived from teaching atlases rather than prospective clinical collection.

Key Challenge: The geographic, skin-tone, and disease-spectrum biases in existing datasets cause AI model accuracy to drop by 30–40% for darker-skinned populations (as demonstrated on the DDI dataset), precluding equitable service to global populations.

Key Insight: Prospective collection from Indian outpatient clinics to construct a dataset with broad disease coverage, representative skin tones, and multi-dimensional annotations.

Core Idea: To provide the first dermatological dataset predominantly featuring South Asian skin tones, incorporating a triple annotation scheme covering diagnostic hierarchy, anatomical sites, and lesion descriptors simultaneously.

Method

Overall Architecture

Dataset construction follows a collect → annotate → quality control → benchmark evaluation pipeline. Clinical images were collected from three tertiary hospitals in South India; four board-certified dermatologists applied the Rook classification system to produce three-level hierarchical labels and concept annotations. Final benchmarking was conducted using multiple architectures for classification and explainability evaluation.

Key Designs

Three-Level Hierarchical Diagnostic Labels
- Function: Constructs an 8 main-class → 19 sub-class → 245 specific disease hierarchy following the Rook Textbook of Dermatology.
- Mechanism: Main classes are organized by etiology (infectious, inflammatory, pigmentary, keratotic, etc.); sub-classes handle mixed or co-existing conditions (e.g., inflammatory + fungal infection); specific labels correspond to ICD-11 codes.
- Design Motivation: Reflects the clinical diagnostic workflow (broad category first, then refinement) and supports both coarse-grained and fine-grained modeling. The disease label distribution follows a long-tail pattern (log-normal exponent 1.8), consistent with real-world outpatient frequencies.
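
The three-level hierarchy can be pictured as a lookup from each specific disease to its sub-class and main class, which lets one set of fine-grained labels serve both coarse and fine tasks. The sketch below is only an illustration: the `HIERARCHY` table, the label strings, and the helper names `roll_up` and `label_counts` are hypothetical and are not taken from the released dataset or its code.

```python
from collections import Counter

# Hypothetical excerpt of a three-level label table:
# specific disease -> (sub-class, main class).
# The label strings are illustrative, not copied from the dataset.
HIERARCHY = {
    "tinea pedis": ("fungal infection", "infectious"),
    "scabies": ("infestation", "infectious"),
    "psoriasis vulgaris": ("papulosquamous", "inflammatory"),
    "vitiligo": ("hypopigmentation", "pigmentary"),
}


def roll_up(disease, level):
    """Map a fine-grained disease label to the requested hierarchy level."""
    sub_class, main_class = HIERARCHY[disease]
    if level == "disease":
        return disease
    if level == "sub":
        return sub_class
    if level == "main":
        return main_class
    raise ValueError(f"unknown level: {level}")


def label_counts(diseases, level):
    """Count images per label at the requested granularity."""
    return Counter(roll_up(d, level) for d in diseases)


# Toy per-image labels; coarse counts are derived from the fine labels.
image_labels = ["tinea pedis", "scabies", "tinea pedis", "vitiligo", "psoriasis vulgaris"]
main_counts = label_counts(image_labels, "main")
sub_counts = label_counts(image_labels, "sub")
print(main_counts)
print(sub_counts)
```

Deriving coarse labels from the fine ones, rather than storing them separately, keeps the three levels consistent by construction.
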
Dual-Dimension Concept Annotations (96 concepts)
- 47 lesion descriptors: scaling, erythema, vesicles, hyperpigmentation, etc., following standard clinical dermatological terminology.
- 49 anatomical sites: scalp, palms, soles, trunk, etc., retaining full anatomical context from the images.
- Design Motivation: Dermatological diagnosis relies on combined reasoning from lesion morphology and anatomical location (e.g., scalp scaling → psoriasis vs. pedal scaling → tinea pedis). Independently annotating both concept types enables interpretability research with concept bottleneck models.
- Pearson correlation analysis validates the clinical plausibility of concept–disease associations (e.g., vitiligo ↔ pigmentary disorders, \(r = +0.71\)).
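
For binary annotations, a concept–class Pearson correlation reduces to the correlation between two 0/1 indicator vectors over images. The following is a minimal sketch under that assumption; the toy vectors and the names `pearson`, `concept_present`, and `in_class` are made up for illustration and do not reproduce the paper's analysis or its reported values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)


# Toy data: one entry per image.
# concept_present[i] = 1 if the concept is annotated on image i.
# in_class[i] = 1 if image i belongs to the disease class of interest.
concept_present = [1, 1, 1, 0, 0, 0, 1, 0]
in_class = [1, 1, 1, 0, 0, 0, 0, 0]

r = pearson(concept_present, in_class)
print(f"concept-class correlation: {r:.2f}")
```
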
Concept Bottleneck Model (CBM) Benchmark
- Function: Constructs a CBM on a Swin-B backbone, performing interpretable classification through a concept layer.
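- Architecture: Image → Swin Encoder → concept logits \(c^\ell \in \mathbb{R}^{B+D}\) → sigmoid → concept vector \(c\) → classifier → diagnosis. Here \(B\) and \(D\) appear to denote the number of anatomical (body) site concepts and lesion descriptor concepts, respectively, so \(B + D = 49 + 47 = 96\).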
- Two hierarchical CBM designs are explored:
  - Type 1 (cascaded): concepts → sub-class prediction → main-class prediction, enforcing classification consistency.
  - Type 2 (parallel): concepts independently predict sub-classes and main classes simultaneously via multi-task learning regularization.
- Type 2 outperforms Type 1 across all metrics.
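
The Type 2 (parallel) design can be sketched as two independent heads reading the same sigmoid concept vector. This is a rough pure-Python illustration, not the authors' implementation: the Swin encoder is replaced by random concept logits, the heads are assumed to be single linear layers (the summary does not specify their form), and the names `type2_cbm_head` and `random_head` are hypothetical.

```python
import math
import random

N_SITES, N_DESCRIPTORS = 49, 47          # B and D from the text
N_CONCEPTS = N_SITES + N_DESCRIPTORS     # 96 concepts
N_SUB, N_MAIN = 19, 8                    # sub-classes and main classes


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_head(n_out, n_in, rng):
    """Assumed linear head with small random weights and zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def type2_cbm_head(concept_logits, sub_head, main_head):
    """Parallel heads: both predictions depend only on the concept vector."""
    concepts = [sigmoid(z) for z in concept_logits]
    sub_probs = softmax(linear(*sub_head, concepts))
    main_probs = softmax(linear(*main_head, concepts))
    return concepts, sub_probs, main_probs


rng = random.Random(0)
# Stand-in for the encoder output of a single image.
concept_logits = [rng.gauss(0.0, 1.0) for _ in range(N_CONCEPTS)]
concepts, sub_probs, main_probs = type2_cbm_head(
    concept_logits,
    random_head(N_SUB, N_CONCEPTS, rng),
    random_head(N_MAIN, N_CONCEPTS, rng),
)
print(len(concepts), len(sub_probs), len(main_probs))
```

A Type 1 (cascaded) variant would instead feed the sub-class output into the main-class head, so the main-class prediction could not disagree with the sub-class pathway.
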

Loss & Training

The concept layer is supervised with BCE loss for binary classification of 96 concepts.
The classification layer uses cross-entropy with weighted sampling to address class imbalance.
Models are pretrained on ImageNet-22k; inputs are resized to 512×512; subject-wise stratified 80:20 splits are applied.
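
A subject-wise split assigns whole patients, not individual images, to train or test, so that no patient contributes images to both sides. The sketch below illustrates that idea together with simple inverse-frequency sample weights for class-imbalanced sampling. It is an assumption-laden illustration: the record layout, the function names, and the weighting formula are hypothetical, and the stratification by label mentioned above is deliberately omitted for brevity.

```python
import random
from collections import Counter, defaultdict


def subject_wise_split(records, test_fraction=0.2, seed=0):
    """Split (image_id, patient_id, label) records by patient.

    The fraction applies to patients, so image counts only approximate 80:20.
    """
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec[1]].append(rec)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test = [r for p in patients[:n_test] for r in by_patient[p]]
    train = [r for p in patients[n_test:] for r in by_patient[p]]
    return train, test


def inverse_frequency_weights(labels):
    """One sampling weight per item: rarer classes get larger weights."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy records: 10 hypothetical patients with 2 images each.
records = [
    (f"img{i}", f"p{i // 2}", "infectious" if i % 3 else "pigmentary")
    for i in range(20)
]
train, test = subject_wise_split(records)
train_patients = {r[1] for r in train}
test_patients = {r[1] for r in test}
sample_weights = inverse_frequency_weights([r[2] for r in train])
print(len(train), len(test), train_patients & test_patients)
```

With these weights, every class present in the training set receives the same total sampling mass, which is one common way to counter a long-tailed label distribution.
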

Key Experimental Results

Main Results: 8-Class Main Category Classification

Model	Pretraining	Accuracy (%)	Balanced Acc. (%)	F1 Score (%)
ResNet50	—	47.45	23.93	46.43
ResNet50	ImageNet	64.31	38.77	63.31
DenseNet121	ImageNet	65.20	37.31	64.37
ViT-B/16-384	ImageNet	66.95	35.78	65.78
Swin-B/4W12-384	ImageNet	70.41	45.06	69.69

The Swin Transformer achieves the best performance consistently across all metrics.
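
The large gap between accuracy and balanced accuracy in the table is what one would expect under class imbalance. Assuming balanced accuracy here follows the conventional definition (the mean of per-class recall), the toy example below shows how a classifier that ignores a rare class can still score well on plain accuracy. The labels and predictions are invented for illustration.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy imbalanced set: 8 common-class images, 2 rare-class images.
y_true = ["common"] * 8 + ["rare"] * 2
# A degenerate classifier that always predicts the common class.
y_pred = ["common"] * 10

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(acc, bal_acc)
```
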

Ablation Study: CBM Configurations

Configuration	Concepts	Accuracy (%)	Macro AUC (%)
Direct MC classification (no CBM)	—	70.41	78.51
CBM-D (descriptors only)	47	68.57	85.18
CBM-B (anatomical sites only)	49	68.38	84.96
CBM full concepts	96	68.12	82.78
Type 2 hierarchical CBM – MC	96	69.90	77.01

Key Findings

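Introducing the concept bottleneck costs roughly 2 percentage points of accuracy (70.41 → 68.57 for the descriptor-only CBM) but raises Macro AUC substantially (78.51 → 85.18), indicating that the concept layer provides better inter-class discriminability. The AUC gain is smaller with all 96 concepts (82.78) and absent for the Type 2 hierarchical CBM (77.01).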
Single-stream concepts (descriptors only or anatomical sites only) yield comparable performance; however, combining both streams produces a competitive suppression effect, in which the model tends to activate one concept group while inhibiting the other.
Grad-CAM analysis confirms that concept activations localize to semantically correct anatomical regions.
Concept-weight alignment, measured by Spearman correlation, is statistically significant (\(p < 0.05\)) for pigmentary and keratotic disorder classes but poor for neoplastic classes, likely due to insufficient sample sizes.

Highlights & Insights

First densely annotated dataset for Indian skin tones: covers Fitzpatrick IV–VI and MST 4–9 tones, addressing a critical gap in global dermatological AI fairness.
Elegant triple annotation scheme: the combination of diagnostic hierarchy, lesion descriptors, and anatomical sites systematically formalizes the clinical reasoning pathway in dermatology (morphology + location → diagnosis) at the dataset level for the first time.
The discovery of concept competition (suppression of one concept group in dual-stream CBMs) reveals a representational bottleneck in multi-concept learning, warranting further investigation.
An inter-annotator Cohen's Kappa of 0.84 ensures high data quality.

Limitations & Future Work

Facial images required privacy-preserving de-identification (eye occlusion or cropping), which compromises modeling capability for facial dermatoses.
Rare diseases in the long-tail distribution have very few samples, necessitating few-shot or long-tail learning strategies.
The absence of pixel-level segmentation annotations precludes support for lesion localization and segmentation tasks.
Data are sourced exclusively from Karnataka in South India, without coverage of North India or other South Asian regions.
The concept competition issue in CBMs requires novel regularization or attention mechanisms to resolve.

vs. Fitzpatrick17k: Fitzpatrick17k is derived from teaching atlases rather than clinical settings, consists of over 75% lighter skin tones (Type I–III), and lacks fungal and viral infections; DermaCon-IN is prospectively collected in clinical settings and covers infectious diseases.
vs. SkinCon: SkinCon retrospectively adds descriptors to an existing dataset; DermaCon-IN collects lesion and anatomical site concepts simultaneously at the point of acquisition.
vs. DDI: DDI contains only 656 images across 78 classes; DermaCon-IN is substantially larger (5,450 images / 245 classes) with denser annotations.
vs. PASSION: PASSION focuses on 4 diseases in African pediatric populations; DermaCon-IN covers the full spectrum of 245 adult dermatological conditions.

Rating

Novelty: ⭐⭐⭐⭐ Fills the gap in Indian skin tone data and introduces a novel triple annotation scheme.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-architecture comparisons, CBM exploration, and Grad-CAM qualitative analysis provide comprehensive evaluation.
Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; related work comparison tables are detailed.
Value: ⭐⭐⭐⭐⭐ The dataset contribution carries significant implications for fairness research and global applicability.