BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models¶
Conference: ICLR 2026 | arXiv: 2510.20095 | Code: https://imageomics.github.io/biocap | Area: Multimodal VLM | Keywords: biological foundation model, synthetic captions, contrastive learning, species classification, CLIP
TL;DR¶
This paper proposes BioCAP, which trains a biological multimodal foundation model by using an MLLM to generate Wikipedia-knowledge-guided synthetic descriptive captions (rather than relying solely on species labels). BioCAP achieves an average improvement of 8.8% over BioCLIP across 10 species classification benchmarks and a 21.3% gain on text-image retrieval tasks.
Background & Motivation¶
Background: The biological domain contains massive collections of images annotated with species names (e.g., TreeOfLife-10M), but lacks instance-level descriptive text. Existing biological foundation models such as BioCLIP use only taxonomic species names as textual supervision, trained via CLIP-style contrastive learning.
Limitations of Prior Work: Species names are too coarse-grained as textual supervision — individuals of the same species exhibit large appearance variations (color, pose, environment, etc.), and names alone fail to capture fine-grained morphological characteristics. Wikipedia provides species descriptions, but these are not instance-specific. Directly generating captions with MLLMs is prone to hallucinations (e.g., incorrect descriptions of bird coloration).
Key Challenge: Instance-level captions are desirable, but manual annotation is infeasible at the scale of millions of images, while automatic generation suffers from hallucinations — precisely in the fine-grained morphological details most critical for species discrimination.
Goal: Generate faithful, instance-specific descriptive captions for biological images at scale.
Key Insight: Species-level visual information extracted from Wikipedia, combined with taxonomic-category-specific format exemplars, is used as domain context to guide MLLM caption generation and reduce hallucinations.
Core Idea: Domain-knowledge-guided synthetic captions provide additional supervisory signals beyond labels for biological CLIP training.
Method¶
Overall Architecture¶
BioCAP = BioCLIP + Captions. The model is trained on TreeOfLife-10M using two text views (species names + descriptive captions) with CLIP-style contrastive learning. The primary contributions lie in the caption generation pipeline and the dual-projector training architecture.
Key Designs¶
- Domain-Knowledge-Guided Synthetic Caption Generation:
  - Function: Generate faithful, instance-level descriptive captions for 10M-scale biological images.
  - Mechanism: A three-stage pipeline — (1) scientific names are used to crawl Wikipedia species pages, and Qwen3-32B extracts visual descriptive information (color, pattern, shape, texture, etc.), covering 29.5% of the 447K species; (2) one to three format exemplars are crafted for each of 347 taxonomic classes (retrieved via Gemini Deep Research and manually verified), yielding 896 exemplars in total; (3) InternVL3-38B generates a descriptive caption for each image, conditioned on the Wikipedia visual information and the format exemplars (see the prompt-construction sketch after this list).
  - Design Motivation: Wikipedia provides species-level prior knowledge to suppress hallucinations, while format exemplars teach the MLLM which features to attend to — different categories require different focal points (e.g., plumage color and wing shape for birds; wing venation and body segments for insects).
- Separated Visual Projectors:
  - Function: Assign independent visual projectors to the two heterogeneous text types (species names and captions).
  - Mechanism: The visual encoder and text encoder are shared, but two separate projection heads are placed after the image encoder — the taxonomy projector is optimized only when the paired text is a species name, and the caption projector is optimized only when the paired text is a caption.
  - Design Motivation: Species names are discrete categorical labels, whereas captions are continuous semantic descriptions, imposing different requirements on visual representations. Separated projectors prevent the two objectives from interfering with each other.
- Morphological Space Theoretical Motivation:
  - A theoretical interpretation is provided from a representation-learning perspective: each species corresponds to a latent vector \(\mathbf{z}^*\) in morphological space, and both images and captions are noisy projections of this vector. Contrastive learning between the two views recovers the shared latent structure while suppressing noise (pose, illumination, and other environmental factors).
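To make the conditioning in the caption pipeline concrete, below is a minimal sketch of how a prompt could be assembled from the Wikipedia extractions and the taxon-specific format exemplars. The data layout, helper names, and prompt wording are illustrative assumptions, not the authors' released prompt template.

```python
from dataclasses import dataclass

@dataclass
class ImageRecord:
    image_path: str
    scientific_name: str   # e.g. "Cardinalis cardinalis"
    taxon_class: str       # taxonomic class, e.g. "Aves"; selects the format exemplars

def build_caption_prompt(record: ImageRecord,
                         wiki_visual_facts: dict,
                         format_exemplars: dict) -> str:
    """Assemble the conditioning prompt sent to the captioning MLLM together with the image."""
    facts = wiki_visual_facts.get(record.scientific_name, "")   # empty for ~70% of species
    exemplars = format_exemplars.get(record.taxon_class, [])    # 1-3 exemplars per taxonomic class
    exemplar_block = "\n".join(f"Example caption: {e}" for e in exemplars)
    return (
        f"Describe the visible morphology of this {record.scientific_name} individual.\n"
        f"Species-level reference (from Wikipedia): {facts}\n"
        f"{exemplar_block}\n"
        "Describe only traits visible in this image; do not copy the reference verbatim."
    )

# Example usage with toy inputs:
prompt = build_caption_prompt(
    ImageRecord("img_001.jpg", "Cardinalis cardinalis", "Aves"),
    wiki_visual_facts={"Cardinalis cardinalis": "male bright red with a black face mask and stout orange bill"},
    format_exemplars={"Aves": ["A medium-sized songbird with a crested head and a short conical bill."]},
)
print(prompt)
```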
Loss & Training¶
Standard CLIP contrastive loss is applied, with alternating training over the two text views. The model is initialized from a ViT-B/16 CLIP checkpoint and trained for 50 epochs on TreeOfLife-10M.
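A minimal PyTorch-style sketch of this setup is given below: a shared image encoder feeds two projection heads, trained with the standard symmetric CLIP loss while alternating between the species-name view and the caption view. Class names, dimensions, and the exact alternation schedule are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BioCAPStyleModel(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 feat_dim: int = 768, embed_dim: int = 512):
        super().__init__()
        self.image_encoder = image_encoder                    # shared ViT-B/16 backbone
        self.text_encoder = text_encoder                      # shared text transformer
        self.taxonomy_proj = nn.Linear(feat_dim, embed_dim)   # paired with species-name texts
        self.caption_proj = nn.Linear(feat_dim, embed_dim)    # paired with descriptive captions
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), CLIP-style temperature

    def forward(self, images, texts, view: str):
        img_feat = self.image_encoder(images)
        proj = self.taxonomy_proj if view == "taxonomy" else self.caption_proj
        img_emb = F.normalize(proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.text_encoder(texts), dim=-1)
        return img_emb, txt_emb

def clip_loss(img_emb, txt_emb, logit_scale):
    # Symmetric InfoNCE: matched image/text pairs lie on the diagonal.
    logits = logit_scale.exp() * img_emb @ txt_emb.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# One possible reading of "alternating training over the two text views":
# for step, (images, names, captions) in enumerate(loader):
#     view = "taxonomy" if step % 2 == 0 else "caption"
#     texts = names if view == "taxonomy" else captions
#     img_emb, txt_emb = model(images, texts, view)   # only the matching projector receives gradients
#     loss = clip_loss(img_emb, txt_emb, model.logit_scale)
```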
Key Experimental Results¶
Main Results (Zero-shot Species Classification Accuracy, %; selected benchmarks shown, Average computed over all 10)¶
| Model | NABirds | Plankton | Insects | Camera Trap | Fungi | Rare Species | Average |
|---|---|---|---|---|---|---|---|
| CLIP | 39.0 | 3.3 | 7.4 | 28.1 | 8.6 | 25.7 | 19.4 |
| BioCLIP | 58.8 | 6.1 | 34.9 | 31.7 | 40.9 | 37.1 | 37.6 |
| BioCAP | 67.6 | 7.2 | 41.9 | 37.4 | 64.4 | 44.2 | 46.4 |
Text-Image Retrieval¶
| Model | INQUIRE (AP@50) | Cornell Bird I2T (R@10) | PlantID I2T (R@10) | Avg. Gain vs. BioCLIP |
|---|---|---|---|---|
| BioCLIP | ~31 | 15.4 | 48.4 | - |
| BioCAP | ~35 | 55.3 | 59.6 | +21.9% |
Key Findings¶
- Caption quality is critical: captions generated by an unguided MLLM actually degrade performance; guidance from Wikipedia and format exemplars yields substantial gains (Fungi: 40.9% → 64.4%, +23.5%).
- Separated projectors outperform shared projectors, confirming that species names and captions require distinct visual representations.
- Wikipedia information covering only 29.5% of species already yields an average improvement of 8.8%, suggesting further gains are achievable with broader species coverage.
- A 7.1% improvement on the most challenging Rare Species benchmark demonstrates that captions help the model generalize better to rare species.
Highlights & Insights¶
- Strong validation of "captions beyond labels": In the biological domain — where labels are abundant but captions are scarce — this work demonstrates the substantial value of descriptive text as an additional supervisory signal.
- Methodology for reducing hallucinations via domain knowledge guidance: The pipeline combining Wikipedia extraction and taxonomic format exemplars constitutes a reusable template applicable to any scenario requiring faithful domain-specific caption generation with MLLMs.
- Theoretical framework of morphological space: A causal generative model is used to explain why captions help, rather than justifying the design purely through empirical engineering.
Limitations & Future Work¶
- Wikipedia visual information covers only 29.5% of species; caption quality for the remaining species may be compromised by the absence of domain-level priors.
- The model is based on ViT-B/16; scalability to larger backbones (ViT-L or larger CLIP variants) has not been validated.
- Caption generation with InternVL3-38B may introduce model-specific biases.
- Format exemplars require manual verification (896 total), which may become a bottleneck when scaling to broader taxonomic coverage.
Related Work & Insights¶
- vs. BioCLIP: BioCAP builds on BioCLIP by incorporating caption supervision, achieving an average improvement of 8.8% and demonstrating the importance of supervision beyond labels.
- vs. LaCLIP/VeCLIP: These methods rewrite captions using LLMs in the general domain; BioCAP addresses the more challenging scenario where no captions exist and must be generated from scratch.
- vs. FG-CLIP: FG-CLIP uses long captions for fine-grained alignment but underperforms BioCLIP on biological tasks due to the absence of domain-knowledge guidance.
Rating¶
- Novelty: ⭐⭐⭐⭐ The domain-knowledge-guided caption generation pipeline is creative.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Ten classification benchmarks, three retrieval tasks, and comprehensive ablation studies.
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical motivation is clear, methodology is described in detail, and figures are well-crafted.
- Value: ⭐⭐⭐⭐ Provides a valuable methodological contribution for multimodal foundation models in scientific domains.