BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
Conference: CVPR 2026
arXiv: 2603.23883
Code: Project Page
Area: Image Generation
Keywords: visual-textual-acoustic alignment, cross-modal retrieval, bioacoustics, species identification, multimodal representation learning
TL;DR
BioVITA is a framework comprising a million-scale tri-modal (image-text-audio) biological dataset, a two-stage alignment model, and a six-direction species-level cross-modal retrieval benchmark. It is presented as the first unified visual-textual-acoustic representation learning framework in the biological domain.
Background & Motivation
Background: Biodiversity research relies on multiple sensory modalities (images for appearance, audio for vocalizations, and text for descriptive classification). Models like BioCLIP have succeeded in image-text alignment, while CLAP has progressed in audio-text alignment.
Limitations of Prior Work: Current multimodal datasets cover only pairs of modalities (image-text or audio-text) and lack a unified training and evaluation framework for tri-modal data. Pioneering works such as SSW60 cover only 60 species, which is too small in scale.
Key Challenge: Comprehensive species perception is required for biodiversity research, but Visual-Textual-Acoustic (VITA) alignment remains an open challenge due to inconsistent taxonomic systems and large scale differences across datasets.
Goal: To build a complete VITA alignment framework that enables models to perform species-level cross-modal retrieval freely among images, audio, and text.
Key Insight: Starting from dataset construction, a million-scale tri-modal dataset with annotated ecological features is collected; a two-stage training strategy is then used to align audio representations into the existing visual-textual representation space.
Core Idea: Leveraging the powerful pre-trained image-text representations of BioCLIP 2, a two-stage strategy (audio-text contrastive training first, then tri-modal joint contrastive training) is employed to reach a unified tri-modal representation efficiently.
Method
Overall Architecture
BioVITA fills a persistent gap in biological multimodality: while mature models exist for image-text and audio-text (BioCLIP 2, CLAP), no previous work has aligned image, text, and audio (VITA) into a single space. It provides a comprehensive suite of "Dataset + Model + Benchmark": BioVITA Train provides million-scale tri-modal training data, BioVITA Model uses three encoders for audio/image/text to learn unified representations, and BioVITA Bench evaluates performance via six-direction cross-modal retrieval. The key to the model is not training tri-modality from scratch, but "anchoring" audio into the already aligned image-text space of BioCLIP 2.
graph TD
A["Dataset Construction (BioVITA Train)<br/>1.3M Audio + 2.3M Images<br/>14,133 Species + 34 Ecological Features"] --> ENC
subgraph ENC["Encoder Architecture (Train Audio, Reuse Img-Text)"]
direction TB
B["Audio Encoder HTS-AT<br/>Mel Spec → 768D (Trainable)"]
C["Img-Text Encoder BioCLIP 2<br/>ViT-L/14 → 768D (Frozen/Resumed)"]
end
ENC --> S1["Stage 1: Audio-Text Contrastive<br/>ATC Loss only, anchoring audio to text"]
S1 -->|Activate Vision Loss after ATC Convergence| S2["Stage 2: Full VITA Alignment<br/>ATC + λ(AIC + ITC)"]
S2 --> O["Unified Tri-modal Representation Space"]
O --> R["6-Direction Cross-modal Species Retrieval<br/>(BioVITA Bench Evaluation)"]
Key Designs
1. BioVITA Train Dataset Construction: Tri-modal alignment requires tri-modal paired data
Existing datasets contain either audio or images, which cannot support tri-modal joint training. BioVITA follows a three-step process: audio curation, fine-grained annotation, and visual integration. It collects 1.3 million audio clips from iNaturalist, Xeno-Canto, and the Animal Sound Archive, paired with 2.3 million images from the ToL-200M subset, covering 14,133 species and annotated with 34 ecological features (diet, activity patterns, habitat, etc.). This scale and the 34-dimensional feature labels enable fine-grained ecological analysis and tri-modal joint training for the first time.
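Because audio and images come from different sources, one plausible way to form tri-modal training examples is to join them on the species label. The sketch below illustrates that idea only; the `(item_id, species)` record schema, the function name, and the random choice of one image per clip are assumptions made for illustration, not the authors' pipeline.

```python
import random
from collections import defaultdict


def pair_audio_with_images(audio_records, image_records, seed=0):
    """Pair each audio clip with a random image of the same species.

    Records are (item_id, species_name) tuples; this schema is hypothetical.
    Audio clips whose species has no image are skipped.
    """
    rng = random.Random(seed)

    # Index image ids by species so lookup per audio clip is O(1).
    images_by_species = defaultdict(list)
    for image_id, species in image_records:
        images_by_species[species].append(image_id)

    triplets = []
    for audio_id, species in audio_records:
        candidates = images_by_species.get(species)
        if not candidates:
            continue  # no image available for this species
        # The species name doubles as the text side of the triplet.
        triplets.append((audio_id, rng.choice(candidates), species))
    return triplets


audio = [("a1", "Strix aluco"), ("a2", "Hyla arborea"), ("a3", "Unknown sp.")]
images = [("i1", "Strix aluco"), ("i2", "Strix aluco"), ("i3", "Hyla arborea")]
triplets = pair_audio_with_images(audio, images)
```

Pairing at the species level means an audio clip and its image generally show different individuals, which matches the limitation noted later in this summary.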
2. Two-Stage Training Strategy: Align audio-text first, then introduce images
Direct tri-modal joint training is unstable due to the difficulty of fine-grained acoustic and visual discrimination. BioVITA decomposes this into two steps. Stage 1 trains only the audio-text contrastive (ATC) loss to align the audio encoder with text:
\[\mathcal{L}_{\text{Stage 1}} = \mathcal{L}_{\text{ATC}}\]
This stage runs for 30 epochs with a learning rate of \(10^{-4}\) and a batch size of 64. Once ATC converges, Stage 2 activates the image-related audio-image contrastive (AIC) loss and image-text contrastive (ITC) loss for full VITA alignment:
\[\mathcal{L}_{\text{Stage 2}} = \mathcal{L}_{\text{ATC}} + \lambda\,(\mathcal{L}_{\text{AIC}} + \mathcal{L}_{\text{ITC}})\]
This runs for 10 epochs, with \(\lambda\) linearly scheduled from 0 to 0.1 over the first 2 epochs. Anchoring audio to text first and then gradually introducing images via the strong image-text space of BioCLIP 2 is much more stable than single-stage training.
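The two-stage objective described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the symmetric InfoNCE form, the temperature value `tau=0.07`, and the choice to ramp \(\lambda\) continuously over fractional epochs are assumptions; only the 0 to 0.1 range, the 2-epoch ramp, and the ATC + λ(AIC + ITC) combination come from the text.

```python
import math


def info_nce(sim, tau=0.07):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between item i of modality A and item j of
    modality B; matching pairs lie on the diagonal. tau=0.07 is an assumed
    default, not a value taken from the paper.
    """
    n = len(sim)

    def cross_entropy_rows(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [s / tau for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_sum - logits[i]  # -log softmax of the true pair
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    # Average the A->B and B->A directions.
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(transposed))


def lambda_at(epoch, warmup_epochs=2.0, lambda_max=0.1):
    """Linear ramp from 0 to lambda_max over warmup_epochs, then constant."""
    return lambda_max * min(max(epoch / warmup_epochs, 0.0), 1.0)


def stage2_loss(atc, aic, itc, epoch):
    """Stage 2 objective: ATC + lambda * (AIC + ITC)."""
    return atc + lambda_at(epoch) * (aic + itc)


aligned = [[1.0, 0.0], [0.0, 1.0]]    # true pairs are most similar
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # true pairs are least similar
loss_aligned = info_nce(aligned)
loss_shuffled = info_nce(shuffled)
```

At `epoch=0` the Stage 2 loss reduces to the Stage 1 loss, so under this schedule the image-related terms enter gradually rather than abruptly.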
3. Encoder Architecture: Train audio only, reuse mature image-text encoders
The audio encoder uses HTS-AT (a hierarchical Transformer with four Swin Transformer stages) to extract 768-dimensional representations from Mel spectrograms. The image-text encoder directly uses pre-trained BioCLIP 2 (ViT-L/14 + 12-layer Transformer), which also outputs 768 dimensions. Since BioCLIP 2 already has robust image-text representations, only the audio encoder needs training for alignment, saving significant computational resources.
Loss & Training
- Contrastive learning utilizes a standard InfoNCE-style cross-entropy loss, with a temperature hyperparameter \(\tau\) controlling the sharpness of the similarity distribution.
- In Stage 2, \(\lambda\) follows a linear schedule to prevent the ATC loss from rebounding when the image-related losses are introduced.
- At most 20 recordings per species are sampled per epoch, and audio is randomly cropped into 10-second segments to increase diversity.
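The per-species cap and random cropping in the list above can be sketched as below. The cap of 20 and the 10-second crop length come from the text; the function names, the data layout, and the handling of clips shorter than 10 seconds (kept whole here) are assumptions.

```python
import random

MAX_PER_SPECIES = 20   # cap stated in the text
CROP_SECONDS = 10.0    # crop length stated in the text


def sample_epoch(recordings_by_species, rng):
    """Pick at most MAX_PER_SPECIES recordings per species for one epoch."""
    epoch = []
    for species, recordings in recordings_by_species.items():
        k = min(MAX_PER_SPECIES, len(recordings))
        # Sampling without replacement gives a fresh subset each epoch.
        epoch.extend((species, rec) for rec in rng.sample(recordings, k))
    rng.shuffle(epoch)
    return epoch


def random_crop(duration, rng, crop=CROP_SECONDS):
    """Return (start, end) in seconds for a random crop.

    Clips shorter than the crop length are returned whole (an assumption;
    padding or looping would be equally plausible).
    """
    if duration <= crop:
        return 0.0, duration
    start = rng.uniform(0.0, duration - crop)
    return start, start + crop


rng = random.Random(0)
data = {
    "Strix aluco": [f"owl_{i}.wav" for i in range(50)],
    "Hyla arborea": ["frog_0.wav", "frog_1.wav"],
}
epoch = sample_epoch(data, rng)
start, end = random_crop(37.5, rng)
```

Capping per species limits how much heavily recorded species dominate an epoch, while resampling and recropping each epoch still exposes the model to different recordings and segments over time.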
Key Experimental Results
Main Results
| Retrieval Direction | Metric | Level | BioVITA | ImageBind | CLAP | Gain |
|---|---|---|---|---|---|---|
| Audio→Text | Top-1 | Species | Best | - | Second | Significant lead |
| Text→Audio | Top-1 | Species | Best | - | Second | Significant lead |
| Audio→Image | Top-1 | Species | Best | Second | - | First achieved |
| Image→Audio | Top-1 | Species | Best | Second | - | First achieved |
| Image→Text | Top-1 | Species | Best | - | - | Matches BioCLIP 2 |
| Text→Image | Top-1 | Species | Best | - | - | Matches BioCLIP 2 |
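To make the six-direction Top-1 metric concrete, the sketch below computes species-level Top-1 accuracy for every ordered pair of modalities using cosine similarity in a shared embedding space. The data layout, the function names, and the rule that a query counts as correct when its nearest gallery item has the same species are assumptions for illustration; the benchmark's exact protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_accuracy(queries, gallery):
    """queries / gallery: lists of (embedding, species).

    A query is a hit when its nearest gallery item has the same species.
    """
    hits = 0
    for q_emb, q_species in queries:
        best = max(gallery, key=lambda item: cosine(q_emb, item[0]))
        hits += best[1] == q_species
    return hits / len(queries)


def six_direction_top1(embeddings):
    """embeddings: dict mapping modality name -> list of (embedding, species)."""
    names = sorted(embeddings)
    return {
        f"{a}->{b}": top1_accuracy(embeddings[a], embeddings[b])
        for a in names for b in names if a != b
    }


# Toy 2-D embeddings standing in for the 768-D model outputs.
emb = {
    "audio": [([1.0, 0.1], "owl"), ([0.1, 1.0], "frog")],
    "image": [([0.9, 0.2], "owl"), ([0.2, 0.9], "frog")],
    "text": [([1.0, 0.0], "owl"), ([0.0, 1.0], "frog")],
}
scores = six_direction_top1(emb)
```

With three modalities there are 3 × 2 = 6 ordered query-gallery pairs, which is where the six retrieval directions in the table come from.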
Ablation Study
| Configuration | Audio→Text | Text→Audio | Description |
|---|---|---|---|
| Stage 1 only | High | High | Audio-text alignment is effective |
| Stage 1+2 (Full) | Best | Best | Tri-modal joint training further improves results |
| Single-stage joint training | Low | Low | Validates the necessity of the two-stage strategy |
Key Findings
- BioVITA achieves species-level retrieval across all six directions for the first time, significantly leading in audio-related directions.
- Two-stage training is more effective than single-stage joint training because audio-text alignment serves as the foundation for tri-modal alignment.
- Ecological feature labels reveal interesting correlations between acoustic and visual traits (e.g., vocalizations of nocturnal animals are more distinctive).
- The model demonstrates strong generalization performance on 325 unseen species.
Highlights & Insights
- Dataset scale and coverage far exceed previous works (1.3M audio + 2.3M images + 14K species + 34 ecological features).
- The two-stage training strategy intelligently leverages pre-trained models to avoid starting tri-modal alignment from scratch.
- The systematic six-direction retrieval benchmark provides a standardized evaluation for biological multimodal research.
- Ecological feature annotations add a new dimension to cross-modal biological understanding.
Limitations & Future Work
- Audio and images are not strictly paired (different individuals of the same species), making it impossible to learn individual-level correspondence.
- Video modality (temporal information of animal behavior) is not yet considered.
- Avian data constitutes the vast majority, potentially leading to imbalanced coverage across other taxonomic classes.
- Future work could explore end-to-end fine-tuning of the image-text encoder rather than keeping it frozen.
Related Work & Insights
- BioCLIP/BioCLIP 2 demonstrated the effectiveness of structured taxonomic text prompts for biological image-text alignment.
- The success of CLAP in general audio-language pre-training provided the foundation for bioacoustic alignment.
- ImageBind's approach to cross-modal alignment via a shared embedding space is an important reference, though it lacks sufficient data in the biological domain.
- This work suggests that for new modality alignment, a two-stage "align then joint" approach is more robust than single-stage alignment.
Rating
- Novelty: ⭐⭐⭐⭐ First million-scale biological tri-modal dataset and benchmark, though the method itself uses mature contrastive learning.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive six-direction retrieval and multi-granularity analysis, though missing downstream task evaluations.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and detailed dataset construction process.
- Value: ⭐⭐⭐⭐⭐ Significant contribution to both biodiversity research and multimodal learning; the dataset itself is a major contribution.