
MonSTeR: a Unified Model for Motion, Scene, Text Retrieval

Conference: ICCV 2025 arXiv: 2510.03200 Code: GitHub Area: Information Retrieval Keywords: tri-modal retrieval, motion-scene-text, higher-order relationship modeling, contrastive learning, human-scene interaction evaluation Authors: Luca Collorone, Matteo Gioia et al. (Sapienza University of Rome, Technion/NVIDIA)

TL;DR

This paper proposes MonSTeR—the first tri-modal retrieval model for motion, scene, and text—which constructs a unified latent space via higher-order relationship modeling inspired by topological deep learning. By capturing intrinsic dependencies among all three modalities, MonSTeR substantially outperforms baselines that rely solely on unimodal representations across multiple retrieval tasks, and can further serve as an evaluation tool for human-scene interaction models.

Background & Motivation

When humans navigate complex environments, they must balance their intentions against the affordances offered by the surroundings. For instance, the intent "sit on a chair" accompanied by the corresponding motion becomes implausible if no chair is present in the scene. This observation reveals a strong intrinsic consistency among intent (text), motion (action), and environment (scene).

Nevertheless, a clear gap exists in prior work:

Text-motion retrieval models (TMR, MoPa) cannot incorporate environmental context, leaving motion representation divorced from scene information.

Evaluation of human-scene interaction (HSI) models lacks a global coherence/realism metric; existing approaches typically decompose evaluation into independent measures such as collision detection and goal-object distance, neglecting path plausibility and motion credibility.

Existing multimodal alignment methods either perform only pairwise alignment or process all modalities through a shared encoder without explicit cross-modal interaction modeling.

The core challenge lies in effectively representing many-to-many relationships among three modalities within a unified latent space—the same text can correspond to multiple motions, and the same motion may carry different meanings across different scenes.

Method

Overall Architecture

MonSTeR is built upon two types of encoders (a minimal wiring sketch follows the list):

  1. Unimodal encoders: Transformer-based variational autoencoders that separately encode text \(t\), motion \(m\), and scene \(s\), producing latent variables \(v_t\), \(v_m\), \(v_s\).
  2. Cross-modal encoders: The output tokens of the unimodal encoders are concatenated pairwise and fed into cross-modal encoders to produce joint latent variables \(v_{st}\), \(v_{mt}\), \(v_{ms}\).
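
A minimal PyTorch-style sketch of how these two encoder types could be wired, assuming mean-pooled latents and omitting the variational (VAE) head; module names, depths, and dimensions are illustrative, not taken from the released code.

```python
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """Transformer encoder: input tokens -> token sequence + pooled latent.
    The paper's encoders are variational; the VAE part is omitted for brevity."""
    def __init__(self, in_dim, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                       # x: (B, L, in_dim)
        tokens = self.encoder(self.proj(x))     # (B, L, d_model)
        return tokens, tokens.mean(dim=1)       # token sequence + pooled latent, e.g. v_t

class CrossModalEncoder(nn.Module):
    """Consumes the concatenated output tokens of two unimodal encoders."""
    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens_a, tokens_b):      # (B, La, d), (B, Lb, d)
        joint = torch.cat([tokens_a, tokens_b], dim=1)
        return self.encoder(joint).mean(dim=1)  # joint latent, e.g. v_st
```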

Data Representation

  • Text \(t \in \mathbb{R}^{768}\): DistilBERT features
  • Scene \(s \in \mathbb{R}^{N \times 6}\): colored point cloud (x, y, z + RGB)
  • Motion \(m \in \mathbb{R}^{T \times 3 \times 22}\): \(T\) frames × 3D coordinates × 22 joints
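
For concreteness, the three inputs written out as tensor shapes; the values of N and T below are illustrative placeholders, not fixed by the paper.

```python
import torch

N, T = 8192, 60                        # illustrative point count and frame count
text   = torch.randn(768)              # DistilBERT sentence feature, t
scene  = torch.randn(N, 6)             # colored point cloud (x, y, z, R, G, B), s
motion = torch.randn(T, 3, 22)         # T frames x 3D coordinates x 22 joints, m
```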

Topological Modeling of Higher-Order Relationships

Inspired by topological deep learning, the three-modal relationships are modeled as a topological structure comprising nodes, edges, and faces:

  • Nodes \(\mathcal{V} = \{t, s, m\}\): unimodal representations
  • Edges \(\mathcal{E} = \{ts, sm, mt\}\): cross-modal representations
  • Face \(\mathcal{P} = \{tsm\}\): global tri-modal relationship

Higher-order relationships are encoded by aligning unimodal representations with cross-modal ones:

  • \((st, m)\): scene-text cross-modal representation aligned with motion
  • \((mt, s)\): motion-text cross-modal representation aligned with scene
  • \((ms, t)\): motion-scene cross-modal representation aligned with text
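
A tiny illustration of how these pairs follow from the topological structure: each edge is aligned with the one node it does not contain. This is read directly off the pairing above, not taken from the paper's implementation.

```python
nodes = ["t", "s", "m"]                 # unimodal latents  v_t, v_s, v_m
edges = ["st", "mt", "ms"]              # cross-modal latents v_st, v_mt, v_ms

# pair each edge with its complementary node
higher_order_pairs = [(e, next(n for n in nodes if n not in e)) for e in edges]
print(higher_order_pairs)               # [('st', 'm'), ('mt', 's'), ('ms', 't')]
```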

Training Objective

The set of pairs included in the contrastive learning loss is: \(K = \{(t,s), (m,t), (m,s), (st,m), (mt,s), (ms,t)\}\)

Pairs that could induce degenerate solutions (e.g., \((st, t)\) or \((st, s)\)) are excluded to prevent the cross-modal encoders from learning identity mappings.

For each pair \((i,j) \in K\), an \(N \times N\) cosine similarity matrix \(C_{i,j}\) is computed, and the per-pair InfoNCE losses are averaged: \(\mathcal{L}_{\text{tot}} = \frac{1}{|K|} \sum_{(i,j) \in K} \frac{\mathcal{L}_{\text{NCE}}(C_{i,j})}{N}\)
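
A minimal sketch of this objective, assuming L2-normalized latent vectors of shape (N, d) and a CLIP-style symmetric InfoNCE with a temperature; the temperature value and normalization details are assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(sim):
    """Symmetric InfoNCE over an (N, N) similarity matrix C_{i,j}; the mean reduction
    of cross_entropy supplies the 1/N factor in the formula above."""
    targets = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))

def total_loss(latents, temperature=0.07):
    """latents: dict like {"t": v_t, "s": v_s, "m": v_m, "st": v_st, "mt": v_mt, "ms": v_ms},
    each of shape (N, d)."""
    K = [("t", "s"), ("m", "t"), ("m", "s"), ("st", "m"), ("mt", "s"), ("ms", "t")]
    loss = 0.0
    for i, j in K:
        a = F.normalize(latents[i], dim=-1)
        b = F.normalize(latents[j], dim=-1)
        loss = loss + info_nce(a @ b.t() / temperature)   # cosine similarities as logits
    return loss / len(K)
```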

Retrieval Inference

The unified latent space supports flexible retrieval across multiple configurations:

  • Bi-modal → uni-modal: st2m, ms2t, mt2s (given two modalities, retrieve the third)
  • Uni-modal → bi-modal: m2st, t2ms, s2mt (given one modality, retrieve the other two)
  • Uni-modal → uni-modal: t2m, m2t, s2m, m2s, t2s, s2t
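
At inference, retrieval reduces to ranking gallery latents by cosine similarity to the query latent. A hedged sketch for a bi-modal → uni-modal query such as st2m; variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve(query, gallery, topk=5):
    """query: (d,) latent, e.g. v_st; gallery: (G, d) candidate latents, e.g. motion latents."""
    q = F.normalize(query, dim=-1)
    g = F.normalize(gallery, dim=-1)
    scores = g @ q                        # (G,) cosine similarities
    return scores.topk(topk).indices      # indices of the top-k retrieved candidates
```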

Key Experimental Results

Main Results: HUMANISE+ Retrieval (All Protocol, mRecall)

| Method | st2m | m2st | ms2t | t2ms | mt2s | s2mt | Avg. |
|---|---|---|---|---|---|---|---|
| TMR + S | 4.10 | 3.30 | 5.81 | 4.79 | 1.08 | 1.98 | 2.72 |
| MoPa + S | 2.10 | 2.45 | 1.62 | 1.94 | 3.28 | 3.06 | 1.88 |
| MonSTeR | 13.91 | 13.14 | 8.46 | 10.39 | 4.09 | 4.45 | 4.80 |

On st2m, MonSTeR more than triples the strongest baseline (13.91 vs. 4.10 mRecall for TMR + S), and it surpasses the best scene-aware model by 76.47% in average mRecall.

Ablation Study: Necessity of Higher-Order Relationship Modeling

| Variant | st2m | m2st | t2m | m2t | Avg. |
|---|---|---|---|---|---|
| MonSTeR | 13.91 | 13.14 | 3.62 | 3.11 | 5.63 |
| w/o cross-modal | 5.20 | 3.77 | 4.35 | 3.21 | 3.79 |
| w/o single | 11.91 | 12.93 | 0.22 | 0.29 | 4.36 |
| w tri-modal | 6.14 | 6.00 | 4.16 | 4.37 | 4.14 |

| Variant | Avg. (Small Batches) |
|---|---|
| MonSTeR | 60.00 |
| w/o cross-modal | 56.07 |
| w/o single | 41.77 |
| w tri-modal | 53.70 |

Key Findings:

  • Removing the cross-modal encoders leads to a substantial drop on bi-modal → uni-modal tasks.
  • Removing the unimodal alignment terms causes near-complete collapse on uni-modal → uni-modal tasks.
  • A shared tri-modal encoder is inferior to the separate cross-modal encoder design.

Path Plausibility Evaluation

After rotating test-set motions from 0 to \(\pi\) radians, MonSTeR's FID and Recall degrade consistently as expected; moreover, motions involving collisions receive lower scores—demonstrating that the MonSTeR latent space internalizes the prior that motion should not penetrate scene objects.
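
A sketch of the perturbation used in this test, assuming the rotation is applied to the joint coordinates about the vertical (z) axis; the axis convention is an assumption, not stated in the summary above.

```python
import math
import torch

def rotate_motion(motion, theta):
    """motion: (T, 3, 22) joint coordinates; rotate every frame by theta radians about z."""
    c, s = math.cos(theta), math.sin(theta)
    R = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]], dtype=motion.dtype)
    return torch.einsum("ij,tjk->tik", R, motion)   # apply R to the coordinate dimension
```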

User Study

MonSTeR's rankings align with human preferences at a rate of 66.5% (1,122 annotations, 224 evaluators).

Downstream Task: Motion Captioning

| Method | BLEU 1 | BLEU 4 | ROUGE L | CIDEr | BERT F1 |
|---|---|---|---|---|---|
| MotionGPT | 42.16 | 17.47 | 40.23 | 11.13 | 22.16 |
| MonSTeR + GPT2 | 42.93 | 23.59 | 50.85 | 13.70 | 35.57 |

When MonSTeR embeddings are used for motion captioning, the model substantially outperforms MotionGPT on BLEU 4 (+6.12), ROUGE L (+10.62), and BERT F1 (+13.41).

Zero-Shot Scene Object Placement

Using mt2s scores to localize objects within a \(5 \times 5 \times 5\) grid, the average error is only 18 cm (random baseline: 58.98 cm), demonstrating MonSTeR's precise spatial reasoning capability in a zero-shot setting.
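
A hedged sketch of one way this placement search could be run: score the motion-text latent against scene latents obtained with the candidate object placed at each cell of a 5 × 5 × 5 grid, and keep the best cell. `encode_scene_with_object` is a hypothetical helper (e.g. re-encoding the point cloud with the object translated to the candidate position), not part of the released code.

```python
import itertools
import torch
import torch.nn.functional as F

def place_object(v_mt, encode_scene_with_object, grid_min, grid_max, steps=5):
    """v_mt: (d,) motion-text latent; grid_min/grid_max: per-axis bounds of the search volume."""
    coords = [torch.linspace(lo, hi, steps) for lo, hi in zip(grid_min, grid_max)]
    best_score, best_pos = -float("inf"), None
    for x, y, z in itertools.product(*coords):
        v_s = encode_scene_with_object(torch.stack([x, y, z]))   # hypothetical scene re-encoding
        score = F.cosine_similarity(v_mt, v_s, dim=-1).item()    # mt2s score for this placement
        if score > best_score:
            best_score, best_pos = score, (x.item(), y.item(), z.item())
    return best_pos, best_score
```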

Highlights & Insights

  1. First tri-modal retrieval system: The unified latent space for motion, scene, and text has never been explored before; MonSTeR fills this gap.
  2. Topology-inspired higher-order modeling: Rather than simple pairwise alignment or fully mixed multi-modal encoding, edge-node alignment is used to encode higher-order relationships, grounded in topological theory.
  3. Evaluation capability: MonSTeR can replace conventional collision detection + distance metrics, providing a holistic consistency score that aligns with human judgment.
  4. Flexible retrieval: Supports 12 retrieval task combinations, including uni-modal → bi-modal and bi-modal → uni-modal directions.
  5. Zero-shot transfer: Zero-shot and downstream experiments on object placement and motion captioning validate the rich semantics of the learned latent space.

Limitations & Future Work

  1. Training on aligned data only: Cross-modal encoders are trained exclusively on paired data, leaving the potential of unpaired data unexploited.
  2. Static scene assumption: Human actions are assumed not to alter the scene layout, limiting the modeling of dynamic interactions.
  3. Dataset scale: HUMANISE+ (19.6K samples) and TRUMANS+ (15 hours of motion capture) remain relatively small.
  4. TMR still leads on t2m in TRUMANS+: This suggests that when scene information is not discriminative, specialized bi-modal models may have an advantage.

Connections to Related Work

  • Relationship to TMR/MoPa: MonSTeR extends text-motion retrieval by incorporating the scene dimension, generalizing it to a tri-modal setting.
  • Difference from CLIP-style multimodal alignment: CLIP-based methods typically align to a single reference modality, whereas MonSTeR achieves all-to-all alignment.
  • Implications for HSI evaluation: Traditional collision + distance metrics can be replaced by a single tri-modal consistency score.
  • Connection to topological deep learning: Applying the topological framework of Bodnar et al. 2021 to multimodal alignment represents a novel cross-domain idea.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First tri-modal retrieval model for motion, scene, and text; the topology-inspired higher-order modeling approach is highly original.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across 12 retrieval tasks, thorough ablations, and diverse downstream applications (captioning, object placement, user study).
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, an engaging Beatles-inspired introduction, and well-motivated topological framing.
  • Value: ⭐⭐⭐⭐ — Opens a new direction in tri-modal retrieval and offers practical utility for evaluating human-scene interaction models.