# BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
Conference: CVPR 2026 | arXiv: 2603.23883 | Code: Project Page | Area: Image Generation | Keywords: visual-textual-acoustic alignment, cross-modal retrieval, bioacoustics, species recognition, multimodal representation learning
## TL;DR
This paper proposes BioVITA, a framework comprising a million-scale tri-modal (image–text–audio) biological dataset, a two-stage alignment model, and a six-direction cross-modal species-level retrieval benchmark, achieving unified visual-textual-acoustic representation learning in the biological domain for the first time.
## Background & Motivation
Background: Biodiversity research relies on multiple sensory modalities (images for appearance, audio for vocalizations, text for taxonomic descriptions). Models such as BioCLIP have achieved success in image–text alignment, and CLAP has made progress on audio–text alignment.
Limitations of Prior Work: Existing multimodal datasets cover only paired modalities (image–text or audio–text) and lack a unified tri-modal training and evaluation framework. Pioneering efforts such as SSW60 cover only 60 species, which is severely insufficient in scale.
Key Challenge: Biodiversity research requires comprehensive perception of species, yet visual-textual-acoustic (VITA) alignment remains an open challenge: different datasets employ inconsistent taxonomic systems and vary greatly in scale.
Goal: To construct a complete VITA alignment framework enabling species-level cross-modal retrieval in any direction among images, audio, and text.
Key Insight: Beginning with dataset construction, the paper collects million-scale tri-modal data with ecological attribute annotations, and aligns audio representations to an established visual-textual representation space via a two-stage training strategy.
Core Idea: Leveraging the powerful image–text representations pretrained by BioCLIP 2, the paper efficiently achieves unified tri-modal representation through a two-stage strategy of audio–text contrastive alignment followed by joint tri-modal contrastive alignment.
## Method

### Overall Architecture
BioVITA consists of three components: (1) BioVITA Train: a million-scale tri-modal training dataset; (2) BioVITA Model: a unified representation model with audio, image, and text encoders; and (3) BioVITA Bench: a six-direction cross-modal species-level retrieval benchmark.
### Key Designs
- BioVITA Train Dataset Construction:
  - Three-step pipeline: audio data curation → fine-grained annotation → visual data integration
  - Collects 1.3 million audio recordings from iNaturalist, Xeno-Canto, and the Animal Sound Archive, paired with 2.3 million images from a subset of ToL-200M
  - Covers 14,133 species with 34 ecological attribute labels (diet type, activity pattern, habitat, etc.)
  - Design Motivation: Existing datasets are limited to either audio or images alone, precluding joint tri-modal training; the 34 attribute labels support fine-grained ecological analysis
- Two-Stage Training Strategy (see the sketch after this list):
  - Stage 1 (Audio–Text Alignment): Optimizes only the ATC loss to align audio-encoder representations with text, \(\mathcal{L}_{\text{ATC}} = \frac{1}{2}(\ell(\mathbf{S}_{\text{AT}}) + \ell(\mathbf{S}_{\text{AT}}^\top))\); trained for 30 epochs with learning rate \(10^{-4}\) and batch size 64
  - Stage 2 (Tri-modal Alignment): Activates the AIC and ITC losses to achieve full VITA alignment, \(\mathcal{L} = \mathcal{L}_{\text{ATC}} + \lambda(\mathcal{L}_{\text{AIC}} + \mathcal{L}_{\text{ITC}})\); trained for 10 epochs, with \(\lambda\) linearly warmed up from 0 to 0.1 over the first 2 epochs
  - Design Motivation: Direct joint tri-modal training is unstable due to the difficulty of fine-grained visual and acoustic discrimination; aligning audio–text first and then gradually introducing images exploits the strong image–text representation space of pretrained BioCLIP 2
- Encoder Architecture:
  - Audio Encoder: HTS-AT (hierarchical Transformer with 4 SwinT groups), extracting 768-dimensional representations from mel spectrograms
  - Image–Text Encoder: Pretrained BioCLIP 2 (ViT-L/14 + 12-layer Transformer), 768-dimensional
  - Design Motivation: Reusing a mature biological image–text encoder requires training only the audio encoder for alignment
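To make the objective concrete, here is a minimal PyTorch sketch of the two-stage loss, assuming each encoder already yields 768-dimensional embeddings and that batches pair audio and images at the species level (the modalities are not individually paired). All names (`info_nce`, `symmetric_contrastive`, `biovita_loss`, `lam_schedule`) and the temperature value 0.07 are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(sim: torch.Tensor) -> torch.Tensor:
    # Row-wise InfoNCE cross-entropy: row i's positive is the diagonal entry (i, i).
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)

def symmetric_contrastive(za: torch.Tensor, zb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # L = 1/2 * (l(S) + l(S^T)), with S the temperature-scaled cosine-similarity matrix.
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    sim = za @ zb.t() / tau
    return 0.5 * (info_nce(sim) + info_nce(sim.t()))

def biovita_loss(z_audio: torch.Tensor, z_image: torch.Tensor, z_text: torch.Tensor,
                 lam: float) -> torch.Tensor:
    # Stage 1: lam = 0, so only the audio-text (ATC) term is active.
    # Stage 2: lam warms up linearly to 0.1, adding the AIC and ITC terms.
    l_atc = symmetric_contrastive(z_audio, z_text)
    l_aic = symmetric_contrastive(z_audio, z_image)
    l_itc = symmetric_contrastive(z_image, z_text)
    return l_atc + lam * (l_aic + l_itc)

def lam_schedule(step: int, warmup_steps: int, lam_max: float = 0.1) -> float:
    # Linear warmup of lambda over the first 2 epochs of Stage 2, then constant.
    return lam_max * min(step / max(warmup_steps, 1), 1.0)
```

Keeping the ATC term active throughout Stage 2, with \(\lambda\) warmed up linearly, matches the note below that the schedule prevents the audio–text loss from rising again.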
### Loss & Training
- Contrastive learning employs standard InfoNCE-style cross-entropy loss; the temperature hyperparameter \(\tau\) controls the sharpness of the similarity distribution
- A linear schedule for \(\lambda\) in Stage 2 prevents the ATC loss from increasing again
- At most 20 recordings are sampled per species per epoch, and audio is randomly cropped to 10-second segments to increase diversity (a sampling sketch follows below)
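As a rough illustration of this sampling scheme, the sketch below caps recordings per species and crops fixed 10-second windows; the `(species_id, waveform)` record layout and all function names are assumptions.

```python
import random
from collections import defaultdict
import torch
import torch.nn.functional as F

def sample_epoch(recordings, max_per_species=20, seed=None):
    # Cap each species at max_per_species recordings for this epoch.
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for species_id, waveform in recordings:
        by_species[species_id].append((species_id, waveform))
    epoch = []
    for recs in by_species.values():
        rng.shuffle(recs)
        epoch.extend(recs[:max_per_species])
    rng.shuffle(epoch)
    return epoch

def random_crop(waveform: torch.Tensor, sample_rate: int, seconds: float = 10.0) -> torch.Tensor:
    # Randomly crop a 1-D waveform to a fixed window; zero-pad clips that are too short.
    target = int(seconds * sample_rate)
    n = waveform.shape[-1]
    if n <= target:
        return F.pad(waveform, (0, target - n))
    start = random.randint(0, n - target)
    return waveform[..., start:start + target]
```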
## Key Experimental Results

### Main Results

All rows report species-level Top-1 retrieval; rankings are qualitative.

| Retrieval Direction | BioVITA | ImageBind | CLAP | Note |
|---|---|---|---|---|
| Audio→Text | Best | — | 2nd | Significant gain |
| Text→Audio | Best | — | 2nd | Significant gain |
| Audio→Image | Best | 2nd | — | First achieved |
| Image→Audio | Best | 2nd | — | First achieved |
| Image→Text | Best | — | — | Matches BioCLIP 2 |
| Text→Image | Best | — | — | Matches BioCLIP 2 |
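The benchmark's evaluation code is not given in this summary; the following sketch shows one plausible species-level Top-1 retrieval protocol consistent with the table, in which a query counts as correct when its nearest neighbor in the gallery modality shares its species. All tensor names are hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def top1_species_accuracy(query_emb: torch.Tensor, query_species: torch.Tensor,
                          gallery_emb: torch.Tensor, gallery_species: torch.Tensor) -> float:
    # A query is correct if its nearest gallery item belongs to the same species.
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    nearest = (q @ g.t()).argmax(dim=1)  # index of the most similar gallery item
    return (gallery_species[nearest] == query_species).float().mean().item()

# The six directions reuse the same routine with query/gallery roles swapped, e.g.:
#   top1_species_accuracy(audio_emb, audio_sp, text_emb, text_sp)   # Audio -> Text
#   top1_species_accuracy(text_emb, text_sp, audio_emb, audio_sp)   # Text -> Audio
```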
### Ablation Study
| Configuration | Audio→Text | Text→Audio | Note |
|---|---|---|---|
| Stage 1 only | High | High | Audio–text alignment is effective |
| Stage 1+2 (full) | Best | Best | Joint tri-modal training yields further gains |
| Single-stage joint training | Lower | Lower | Validates the necessity of the two-stage strategy |
### Key Findings
- BioVITA is the first to achieve species-level retrieval across all six directions, with substantial gains on audio-related directions
- Two-stage training outperforms single-stage joint training, as audio–text alignment constitutes the foundation of tri-modal alignment
- Ecological attribute labels reveal interesting associations between acoustic and visual features (e.g., nocturnal animals have more distinctive vocalizations)
- The model generalizes well to 325 unseen species in zero-shot settings
## Highlights & Insights

- The dataset's scale and coverage far surpass prior work (1.3M audio recordings, 2.3M images, 14,133 species, 34 ecological attributes)
- The two-stage training strategy cleverly leverages pretrained models, avoiding tri-modal alignment from scratch
- The systematic six-direction retrieval benchmark provides a standardized evaluation protocol for biological multimodal research
- Ecological attribute annotations add an entirely new dimension to cross-modal biological understanding
## Limitations & Future Work
- Audio and images are not strictly paired (different individuals of the same species), precluding individual-level correspondence learning
- The video modality (temporal information of animal behavior) is not considered
- Bird species constitute the vast majority of data; coverage of other taxonomic classes may be uneven
- End-to-end fine-tuning of the image–text encoder, rather than freezing it, is worth exploring
## Related Work & Insights
- BioCLIP/BioCLIP 2 demonstrated the effectiveness of structured taxonomic text prompts for biological image–text alignment
- CLAP's success in general audio–language pretraining provides a foundation for bioacoustic alignment
- ImageBind's cross-modal alignment approach via a shared embedding space is an important reference, though it lacks sufficient biological domain data
- This work suggests that for new modality alignment, a two-stage "align first, then jointly train" approach is more robust than a one-step end-to-end strategy
## Rating
- Novelty: ⭐⭐⭐⭐ First million-scale biological tri-modal dataset and benchmark, though the methodology itself is based on established contrastive learning
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive six-direction retrieval, multi-granularity analysis, and ecological perspective, but lacks downstream task evaluation
- Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed description of the dataset construction process
- Value: ⭐⭐⭐⭐⭐ Significant contributions to both biodiversity research and multimodal learning; the dataset alone constitutes a major contribution