BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

  • Conference: ICCV 2025
  • arXiv: 2504.09426
  • Code: https://github.com/shawnking98/BabyVLM (Project page: shawnking98.github.io/BabyVLM)
  • Area: Multimodal VLM
  • Keywords: data-efficient pretraining, infant learning inspiration, vision-language models, developmental psychology, synthetic data

TL;DR

Inspired by the efficient learning capabilities of human infants, this paper proposes the BabyVLM framework, which includes a synthetic training dataset (converting general-purpose data into child-directed formats) and multiple developmentally aligned evaluation benchmarks. The framework enables data-efficient pretraining of compact VLMs, achieving performance that surpasses models trained solely on SAYCam or generic data.

Background & Motivation

  • Background: Training large-scale vision-language models (e.g., LLaVA, CLIP) requires massive datasets and expensive compute, often demanding thousands of GPU hours — a barrier for independent researchers.
  • Limitations of Prior Work: (1) The infant-inspired SAYCam dataset records only a subset of infants' daily experiences, offering limited coverage. (2) Existing benchmarks are either too simple (e.g., Labeled-S evaluates only classification) or misaligned with training data domains (e.g., VQA and Winoground are designed for large models), failing to accurately measure the developmental alignment of compact models.
  • Key Challenge: A gap exists between infant-inspired VLM training and evaluation — current data and benchmarks do not faithfully reflect the learning environment of infants.
  • Goal: Convert large-scale generic data into infant-learning-compatible formats via "child-directed transformation," and design diverse evaluation tasks aligned with the training domain to fill this gap.
  • Key Insight: Human infants rapidly acquire complex cognitive and perceptual abilities from highly limited sensory input, suggesting that carefully curated small-scale data can also yield effective representations.

Method

Overall Architecture

The BabyVLM framework consists of four core components: (1) a filtered subset of SAYCam; (2) synthetic training data generated via "child-directed transformation"; (3) BabyLLaVA, a generative baseline model; and (4) four developmentally aligned evaluation benchmarks.

Key Designs

  1. SAYCam Data Filtering:

    • Function: Extract image–utterance pairs from raw SAYCam videos and filter them using CLIP similarity.
    • Mechanism: Retain image–text pairs with CLIP similarity > 0.2, yielding approximately 67K high-quality pairs.
    • Design Motivation: Raw SAYCam data contains many low-quality image–text pairings; direct use introduces noise.
  2. Synthetic Data Transformation (Transferred Dataset):

    • Function: Convert generic datasets (CC3M, LAION, SBU, etc.) into infant-learning-style data.
    • Mechanism: Two-step process:
      • Step 1: GPT-4o rewrites the original image captions into simple "child-directed utterances" that simulate speech to a two-year-old, while filtering out image–text pairs irrelevant to infants' daily experiences.
      • Step 2: Hungarian matching (using CLIP similarity as the distance metric) selects a subset from the transformed data that is visually most similar to SAYCam images, ensuring visual consistency.
    • Design Motivation: SAYCam covers only a limited slice of infant experience; more diverse data is needed to simulate learning from a broader environment. (Both curation steps are sketched in code after this list.)
  3. BabyLLaVA Generative Baseline:

    • Function: Construct a compact generative VLM trained entirely from scratch on developmental data.
    • Mechanism: Following the LLaVA recipe, a small GPT-2 language model (7M parameters) is connected to a ResNeXt-50 vision encoder (23M parameters) via a lightweight MLP connector; a larger variant (Llama-1.1B + ViT-L) is provided as well. A sketch of this composition appears after this list.
    • Design Motivation: Verify whether compact models can learn meaningful multimodal representations under developmental data constraints.
  4. Evaluation Benchmark Design:

    • Labeled-S: A classic classification task; the model selects the target category from four candidate images.
    • Visual Two-Word Test (VTWT): Inspired by the "two-word stage" of 18–24-month-old infants, this benchmark tests compositional semantic reasoning (e.g., "wash cup" vs. "fill cup"). GPT-4o generates 5,117 phrase pairs, manually filtered to 967 pairs.
    • Baby Winoground: Extends VTWT by requiring simultaneous matching of two image–text pairs; negative images are generated via Stable Diffusion, testing higher-order visio-linguistic compositional reasoning (a scoring sketch follows this list).
    • SAYCam Caption: A generative captioning evaluation using the METEOR metric to assess the model's ability to produce child-directed descriptions.
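
The two curation steps from items 1 and 2 (CLIP-threshold filtering of SAYCam pairs, then Hungarian matching of transferred data against SAYCam frames) can be sketched as follows. This is a minimal illustration assuming the open_clip and scipy libraries; the checkpoint choice, batching, and function names are illustrative and not taken from the authors' code.

```python
# Minimal sketch of the two data-curation steps, assuming open_clip and scipy.
import torch
import open_clip
from scipy.optimize import linear_sum_assignment

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"   # illustrative checkpoint
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

@torch.no_grad()
def clip_embed(images, texts):
    """Return L2-normalized CLIP image and text embeddings for paired data."""
    img_in = torch.stack([preprocess(im) for im in images]).to(device)
    txt_in = tokenizer(texts).to(device)
    img_emb = model.encode_image(img_in)
    txt_emb = model.encode_text(txt_in)
    return (img_emb / img_emb.norm(dim=-1, keepdim=True),
            txt_emb / txt_emb.norm(dim=-1, keepdim=True))

# Step 1 (SAYCam filtering): keep image-utterance pairs whose CLIP cosine
# similarity exceeds the 0.2 threshold reported in the paper.
def filter_pairs(images, utterances, threshold=0.2):
    img_emb, txt_emb = clip_embed(images, utterances)
    sims = (img_emb * txt_emb).sum(dim=-1)          # per-pair cosine similarity
    return [i for i, keep in enumerate(sims > threshold) if keep]

# Step 2 (transferred dataset): select transformed generic images that are
# visually closest to SAYCam frames via Hungarian matching, using negative
# CLIP image-image similarity as the assignment cost.
def hungarian_select(saycam_img_emb, generic_img_emb):
    cost = -(saycam_img_emb @ generic_img_emb.T).cpu().numpy()
    _, selected = linear_sum_assignment(cost)       # one generic image per SAYCam image
    return selected                                 # indices into the generic pool
```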
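
For item 3, the sketch below shows a LLaVA-style composition at roughly BabyLLaVA's scale: a ResNeXt-50 vision tower, a lightweight MLP connector, and a scaled-down GPT-2-style language model. It assumes torchvision and Hugging Face transformers; the layer count, hidden size, and token handling are illustrative guesses rather than the authors' configuration.

```python
# Minimal sketch of a LLaVA-style model at BabyLLaVA scale (illustrative sizes;
# the paper reports a ~7M-parameter GPT-2 and a 23M-parameter ResNeXt-50).
import torch
import torch.nn as nn
from torchvision.models import resnext50_32x4d
from transformers import GPT2Config, GPT2LMHeadModel

class TinyVLM(nn.Module):
    def __init__(self, hidden_size=256):
        super().__init__()
        # Vision tower: ResNeXt-50 backbone; spatial features serve as "visual tokens".
        backbone = resnext50_32x4d(weights=None)                 # trained from scratch
        self.vision = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, 7, 7)
        # Small GPT-2-style language model.
        cfg = GPT2Config(n_embd=hidden_size, n_layer=4, n_head=4, vocab_size=50257)
        self.lm = GPT2LMHeadModel(cfg)
        # Lightweight MLP connector mapping visual features into the LM embedding space.
        self.connector = nn.Sequential(
            nn.Linear(2048, hidden_size), nn.GELU(), nn.Linear(hidden_size, hidden_size)
        )

    def forward(self, pixel_values, input_ids, labels=None):
        feats = self.vision(pixel_values)                 # (B, 2048, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)          # (B, 49, 2048) visual tokens
        vis_tokens = self.connector(feats)                # (B, 49, hidden)
        txt_tokens = self.lm.transformer.wte(input_ids)   # (B, T, hidden)
        inputs_embeds = torch.cat([vis_tokens, txt_tokens], dim=1)
        if labels is not None:                            # no language loss on visual tokens
            pad = torch.full(vis_tokens.shape[:2], -100,
                             dtype=labels.dtype, device=labels.device)
            labels = torch.cat([pad, labels], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, labels=labels)
```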
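
For the Baby Winoground benchmark (item 4), the "Overall" numbers reported below are consistent with the standard Winoground group score (random chance 1/6 ≈ 0.167). The sketch shows that scoring rule for a similarity-based model; the sim function is an assumed stand-in (e.g., CLIP cosine similarity), and applying exactly this rule to Baby Winoground is an inference from the chance level, not a confirmed detail.

```python
# Minimal sketch of Winoground-style scoring for one example with two
# image-text pairs; `sim(image, text)` is an assumed similarity function.
def winoground_scores(sim, img0, txt0, img1, txt1):
    s = {(i, t): sim(img, txt)
         for i, img in enumerate((img0, img1))
         for t, txt in enumerate((txt0, txt1))}
    # Text score: given each image, the matching caption outscores the distractor caption.
    text_ok = s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]
    # Image score: given each caption, the matching image outscores the distractor image.
    image_ok = s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]
    # Group score: both conditions must hold (chance level 1/6).
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}
```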

Loss & Training

BabyLLaVA follows the standard LLaVA training procedure on the curated developmental data. CVCL (the contrastive model) uses a standard contrastive learning loss. Model design adheres to three principles: (1) developmentally plausible complexity; (2) limited generalization boundaries; and (3) simplicity in both language and vision.
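
The contrastive objective used by CVCL is the standard symmetric image–text loss. The snippet below is a minimal sketch of that objective, assuming batched image and text embeddings as inputs; the temperature value is an illustrative default, not the paper's hyperparameter.

```python
# Minimal sketch of a symmetric image-text contrastive (InfoNCE-style) loss.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; score both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```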

Key Experimental Results

Main Results

Model                          Labeled-S   VTWT    Baby Winoground (Overall)   SAYCam Caption
CLIP-large (upper bound)       0.710       0.863   0.674                       N/A
LLaVA-v1.5-7B (upper bound)    0.740       0.785   0.427                       0.166
CVCL (contrastive baby model)  0.609       0.649   0.093                       N/A
BabyLLaVA-GPT2                 0.420       0.625   0.066                       0.138
BabyLLaVA-Llama                0.420       0.603   0.052                       0.129
Random chance                  0.250       0.500   0.167                       N/A

Ablation Study

Training Configuration    Labeled-S   VTWT    Baby Winoground (Overall)   Note
CVCL-filtered             0.609       0.649   0.093                       SAYCam only
CVCL-filtered-aug         0.581       0.702   0.203                       SAYCam + synthetic data (significant gain)
CVCL-filtered-random      0.602       0.684   0.107                       SAYCam + random generic data
BabyLLaVA-filtered        0.420       0.625   0.066                       SAYCam only
BabyLLaVA-filtered-aug    0.536       0.693   0.082                       SAYCam + synthetic data (significant gain)
BabyLLaVA-aug-only        0.500       0.624   0.063                       Synthetic data only

Key Findings

  • The contrastive model (CVCL) consistently outperforms the generative model (BabyLLaVA) on discriminative tasks, consistent with the understanding that contrastive learning is better suited for discriminative settings.
  • The larger BabyLLaVA-Llama (~50× more parameters than the GPT-2 variant) performs comparably or worse, indicating overfitting under limited data.
  • Gains from synthetic data substantially exceed those from random generic data, validating the effectiveness of child-directed transformation.
  • Baby Winoground reveals an in-distribution/out-of-distribution asymmetry: baby models perform reasonably on positive contexts (in-distribution) but fall below chance on negative contexts (out-of-distribution).
  • Removing visual input from VTWT reduces accuracy to ~53% (near chance), confirming that the task genuinely tests multimodal reasoning.

Highlights & Insights

  • The integration of developmental psychology insights (infant two-word stage, noun-dominance bias, etc.) with VLM training represents a novel cross-disciplinary approach.
  • The synthetic data transformation method could plausibly transfer to data-efficient training in other resource-constrained domains.
  • Compositional reasoning analysis reveals that models perform best on noun-level differences, consistent with the developmental linguistics finding that infants use nouns approximately twice as frequently as verbs.
  • The work demonstrates that "carefully curated small data + compact models" can learn meaningful representations, offering a new paradigm for resource-constrained model training.

Limitations & Future Work

  • Generative captioning performance remains poor; METEOR scores are low across all models.
  • Baby models perform far below the upper-bound models on Baby Winoground, indicating substantial room for improvement in compositional reasoning.
  • Synthetic data still relies on GPT-4o and large-scale source datasets, raising questions about its "developmental plausibility."
  • Temporal context, richer object interactions, and additional modality signals remain unexplored.
  • The framework is primarily validated within the SAYCam domain; cross-domain generalization warrants further investigation.

Comparison with Related Work

  • vs. CVCL (Vong et al.): CVCL is a contrastive baby model; this work builds on it by introducing a generative model and richer synthetic data.
  • vs. BabyLM Challenge: BabyLM focuses on developmentally inspired training for language only; this work extends the paradigm to multimodal vision-language settings.
  • vs. LLaVA: BabyLLaVA adopts the LLaVA architecture but substantially reduces model scale, focusing on learning under developmental constraints.

Rating

  • Novelty: ⭐⭐⭐⭐ — Integrating developmental psychology with VLM training offers a distinctive perspective; benchmark design is creative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Ablation studies are detailed and provide multi-angle analysis of model behavior in relation to language development theory.
  • Writing Quality: ⭐⭐⭐⭐ — Structure is clear, motivation is well articulated, and cross-disciplinary exposition is effective.
  • Value: ⭐⭐⭐ — More oriented toward cognitive science; direct engineering applicability is limited, but the work introduces new directions for data-efficient training.