TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding¶
Conference: AAAI 2026 arXiv: 2511.10241 Code: None Area: Object Detection Keywords: Weakly-supervised spatio-temporal video grounding, tube reconstruction, vision-language alignment, mutual constraint learning, STVG
TL;DR¶
This paper proposes TubeRMC, a framework that generates text-conditioned candidate tubes and performs tube-conditioned reconstruction along temporal, spatial, and spatio-temporal dimensions, augmented by spatial-temporal mutual constraints to improve weakly-supervised spatio-temporal video grounding.
Background & Motivation¶
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube — a sequence of bounding boxes within a specified temporal interval — in an untrimmed video given a natural language query. This is a highly challenging task requiring complex vision-language understanding and spatio-temporal reasoning.
Existing fully-supervised methods (e.g., TubeDETR) rely on expensive tube-text annotations. To reduce annotation costs, weakly-supervised STVG (WSTVG) methods train using only video-text pairs, without bounding box or temporal annotations.
Core limitations of existing weakly-supervised methods:
Late-fusion paradigm limitations: Methods such as WINNER and VCMA follow a detect-then-match pipeline: tube proposals are generated with unimodal detectors (e.g., Faster-RCNN) and subsequently matched with the text. Since tube generation is entirely independent of the text description, two critical issues arise:
- Target identification failure: The detector cannot perceive the targets described in the text.
- Tracking inconsistency: Cross-frame target identification is unstable.
Infeasibility of directly concatenating frame-level results: Although pretrained visual grounding models (e.g., MDETR) can capture text-conditioned object localization, naively concatenating frame-level results to form a spatio-temporal tube is ineffective due to inconsistent cross-frame target identification and lack of spatio-temporal understanding.
Insufficiency of existing reconstruction methods: Existing weakly-supervised video temporal grounding methods (e.g., CNM, Kim2024) focus solely on temporal reconstruction, neglecting the spatial-textual correspondence.
Core insight: A tube that matches the target event should be capable of correctly reconstructing masked key phrases in the query sentence. As illustrated in Figure 1(b), a correct tube can reconstruct keywords such as "white man" and "sits down."
Method¶
Overall Architecture¶
The TubeRMC framework operates at three levels:
- Text-conditioned tube generation: A pretrained visual grounding model (MDETR) is used to extract frame-level cross-modal representations and spatial localization results.
- Tube-conditioned reconstruction learning: Reconstruction is performed along temporal, spatial, and spatio-temporal dimensions to comprehensively capture tube-text correspondences.
- Mutual constraint learning: Bidirectional temporal-spatial constraints are introduced to enhance proposal quality.
Key Designs¶
1. Model Architecture¶
Static cross-modal extraction: A pretrained MDETR (ResNet-101 + RoBERTa-base) extracts cross-modal representations and localization results per frame. For each frame, all predicted boxes are ranked by the confidence score of the subject token, and the highest-scoring box is selected as the frame-level prediction. These are concatenated to form a bounding box tube \(B \in \mathbb{R}^{T \times 4}\) and a confidence vector \(S \in \mathbb{R}^{T \times 1}\).
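To make the selection step concrete, below is a minimal PyTorch sketch, assuming MDETR-style per-frame outputs; the tensor layout and function name are illustrative, not the authors' code.

```python
import torch

def select_frame_boxes(pred_boxes: torch.Tensor, subject_scores: torch.Tensor):
    """Pick the highest-confidence box per frame to form a tube.

    pred_boxes:     (T, N, 4) candidate boxes per frame (N queries each).
    subject_scores: (T, N) confidence of the subject token for each box.
    Returns the tube B of shape (T, 4) and confidence vector S of shape (T, 1).
    """
    best = subject_scores.argmax(dim=1)                      # (T,)
    B = pred_boxes[torch.arange(pred_boxes.size(0)), best]   # (T, 4)
    S = subject_scores.gather(1, best.unsqueeze(1))          # (T, 1)
    return B, S
```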
Spatio-temporal modeling: The frame-level cross-modal features are fed into TimeSFormer, yielding spatio-temporal cross-modal features \(F_t \in \mathbb{R}^{T \times (H \times W + L) \times d}\) and global frame features \(F_g \in \mathbb{R}^{T \times d}\), which are subsequently passed to:
- Spatial Boxes Refiner: Models inter-frame contextual relationships via cross-attention to predict offsets that refine the MDETR boxes (see the sketch after this list).
- Temporal Proposals Generator: Uses \(K\) learnable queries within a temporal decoder to model temporal context.
- Spatio-Temporal Decoder: Integrates spatial and temporal information for event-level prediction.
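What such a refiner could look like, as a hypothetical sketch: box embeddings query the global frame features via cross-attention, and a linear head predicts residual offsets. The module name, layer sizes, and the use of `nn.MultiheadAttention` are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SpatialBoxesRefiner(nn.Module):
    """Hypothetical offset-predicting refiner over MDETR boxes."""

    def __init__(self, d: int = 256):
        super().__init__()
        self.box_embed = nn.Linear(4, d)              # lift boxes into feature space
        self.cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.offset_head = nn.Linear(d, 4)            # residual box offsets

    def forward(self, boxes: torch.Tensor, F_g: torch.Tensor) -> torch.Tensor:
        # boxes: (T, 4) normalized cxcywh; F_g: (T, d) global frame features.
        q = self.box_embed(boxes).unsqueeze(0)        # (1, T, d) queries
        ctx = F_g.unsqueeze(0)                        # (1, T, d) inter-frame context
        h, _ = self.cross_attn(q, ctx, ctx)           # attend across frames
        return boxes + self.offset_head(h.squeeze(0)) # refined boxes (T, 4)
```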
2. Tube-conditioned Reconstruction Learning (Core Contribution)¶
Unlike methods that rely solely on temporal reconstruction, this paper proposes a Tube-conditioned Reconstructor (TR) that takes both temporal and spatial masks as input.
Gaussian mask representation: To keep the pipeline differentiable, temporal intervals, spatial boxes, and spatio-temporal tubes are converted into Gaussian attention masks:
- 1D Gaussian → temporal attention mask \(M_t \in \mathbb{R}^T\)
- 2D Gaussian → spatial attention mask \(M_b \in \mathbb{R}^{HW}\)
- 3D Gaussian → spatio-temporal attention mask in \(\mathbb{R}^{T \times HW}\)
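A minimal sketch of how such differentiable masks can be built, assuming normalized coordinates, cxcywh boxes, and a factorized 1D x 2D construction for the 3D case (the paper's exact parameterization may differ):

```python
import torch

def gaussian_1d(center, width, T):
    """1D temporal mask M_t in R^T from a (center, width) proposal,
    both given as fractions of the clip length."""
    t = torch.linspace(0, 1, T)
    return torch.exp(-0.5 * ((t - center) / (width + 1e-6)) ** 2)

def gaussian_2d(box, H, W):
    """2D spatial mask M_b in R^{HW} from a normalized cxcywh box; the
    box half-extent is used as the standard deviation (an assumption)."""
    cx, cy, w, h = box
    ys = torch.linspace(0, 1, H).view(H, 1)
    xs = torch.linspace(0, 1, W).view(1, W)
    m = torch.exp(-0.5 * (((xs - cx) / (w / 2 + 1e-6)) ** 2
                          + ((ys - cy) / (h / 2 + 1e-6)) ** 2))
    return m.flatten()  # (H*W,)

def gaussian_3d(boxes, center, width, H, W):
    """Spatio-temporal mask (T, H*W): per-frame 2D masks weighted by the
    1D temporal mask (a simple factorized assumption)."""
    m_t = gaussian_1d(center, width, boxes.size(0))           # (T,)
    m_b = torch.stack([gaussian_2d(b, H, W) for b in boxes])  # (T, H*W)
    return m_t.unsqueeze(1) * m_b
```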
TR architecture: Comprises a 6-layer Tube-conditioned Encoder and a Masked-text Reconstructor:
- The encoder contains a local branch (focusing on local spatio-temporal context) and a global branch (supplementing global temporal information).
- Mask-Attention mechanism: \(\text{M-att}(Q,K,V,M) = (\text{softmax}(\frac{QK^T}{\sqrt{d}}) \odot M)V\), where \(\odot\) denotes element-wise multiplication with the Gaussian mask; this guides the encoder to attend to the visual regions covered by the tube.
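The Mask-Attention formula transcribes almost directly into PyTorch; the shapes below assume \(L\) text-token queries attending over \(S\) masked visual tokens:

```python
import torch
import torch.nn.functional as F

def mask_attention(Q, K, V, M):
    """M-att(Q, K, V, M) = (softmax(Q K^T / sqrt(d)) * M) V.

    Q: (L, d) text-token queries; K, V: (S, d) visual tokens;
    M: (S,) Gaussian mask over the visual tokens.
    """
    d = Q.size(-1)
    attn = F.softmax(Q @ K.T / d ** 0.5, dim=-1)  # (L, S) attention weights
    attn = attn * M.unsqueeze(0)                  # modulate by the Gaussian mask
    return attn @ V                               # (L, d) mask-guided output
```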
Three reconstruction strategies:
- Temporal reconstruction: Masks predicates and related nouns (motion information) and reconstructs them under a 1D Gaussian temporal mask.
- Spatial reconstruction: Masks subject nouns and adjectives (appearance information) under a 2D Gaussian spatial mask.
- Spatio-temporal reconstruction: Randomly masks tokens (with higher masking probability for verbs, nouns, and adjectives) under a 3D Gaussian mask.
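A toy illustration of the POS-driven masking, assuming pre-tagged tokens; in practice an off-the-shelf POS tagger would supply the tags, and the example tags and probabilities here are illustrative:

```python
import random

# Illustrative hand-tagged query; a real pipeline would run a POS tagger.
QUERY = [("white", "ADJ"), ("man", "NOUN"), ("sits", "VERB"), ("down", "ADV")]

def mask_tokens(tagged, target_pos, mask_token="[MASK]", p=1.0):
    """Replace tokens whose POS tag is in target_pos with probability p."""
    return [mask_token if pos in target_pos and random.random() < p else w
            for w, pos in tagged]

# Temporal reconstruction masks motion words; spatial reconstruction
# masks appearance words (subject nouns and adjectives).
temporal_input = mask_tokens(QUERY, {"VERB"})         # masks "sits"
spatial_input = mask_tokens(QUERY, {"NOUN", "ADJ"})   # masks "white", "man"
```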
3. Mutual Constraint Learning¶
Space-to-time constraint: The confidence scores of the spatial boxes guide temporal proposal generation. The top-\(K\) highest-scoring frames are selected as temporal centers for positive proposals, while the lowest-scoring frames serve as centers for negative proposals; the loss then minimizes the overlap between positive and negative proposals.
Time-to-space constraint: Ensures spatial continuity of the target across adjacent frames within the same scene. Within each temporal proposal, adjacent-frame box pairs whose IoU falls below a threshold are penalized.
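A sketch of both constraints, assuming xyxy boxes and per-frame confidence scores; the helper names and threshold value are illustrative, not from the paper:

```python
import torch

def box_iou(a, b):
    """IoU between aligned batches of xyxy boxes, each of shape (N, 4)."""
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[:, 2:] - a[:, :2]).prod(dim=1)
    area_b = (b[:, 2:] - b[:, :2]).prod(dim=1)
    return inter / (area_a + area_b - inter + 1e-6)

def space_to_time_centers(S, k=4):
    """Space-to-time: the k highest-confidence frames seed positive
    temporal proposals (the lowest-scoring frames would seed negatives)."""
    return S.squeeze(-1).topk(k).indices

def time_to_space_loss(boxes, tau=0.5):
    """Time-to-space: penalize adjacent-frame box pairs inside a temporal
    proposal whose IoU falls below the continuity threshold tau."""
    iou = box_iou(boxes[:-1], boxes[1:])         # (T-1,) adjacent-frame IoUs
    return torch.clamp(tau - iou, min=0).mean()
```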
Loss & Training¶
Total loss: \(L_{total} = L_{rec} + L_{ipc} + L_{ivc} + L_{mc}\)
- Reconstruction loss \(L_{rec}\): Sum of cross-entropy losses from the three reconstruction strategies.
- Inter-proposal contrastive loss \(L_{ipc}\): Ensures the reconstruction loss of positive proposals is lower than that of negative proposals (with margin).
- Intra-video contrastive loss \(L_{ivc}\): Contrasts spatio-temporal reconstruction of the positive tube against hard and easy negatives constructed from an inverted temporal mask and a uniform mask.
- Mutual constraint loss \(L_{mc}\): Space-to-time and time-to-space constraints combined.
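Read as code, the inter-proposal term is a hinge ranking loss on reconstruction quality, and the total objective is a plain sum; the margin value and function names below are illustrative:

```python
import torch

def inter_proposal_contrastive(rec_pos, rec_neg, margin=0.5):
    """Positive proposals should reconstruct the masked query better
    (lower loss) than negatives, by at least `margin` (e.g., beta_1)."""
    return torch.clamp(rec_pos - rec_neg + margin, min=0).mean()

def total_loss(l_rec, l_ipc, l_ivc, l_mc):
    """L_total = L_rec + L_ipc + L_ivc + L_mc (unweighted sum)."""
    return l_rec + l_ipc + l_ivc + l_mc
```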
Training settings: \(K=4\) (HCSTVG/VidSTG), 3-layer transformer for spatio-temporal modeling, 6 layers for TR, margin parameters \(\beta_1=0.5\), \(\beta_2=0.7\), \(\beta_3=0.5\), \(\beta_4=0.7\).
Key Experimental Results¶
Main Results¶
| Dataset | Metric | TubeRMC | Prev. SOTA (VCMA) | Gain |
|---|---|---|---|---|
| HCSTVG-v1 | m_vIoU | 19.38 | 14.64 | +4.74 |
| HCSTVG-v1 | vIoU@0.3 | 23.88 | 18.60 | +5.28 |
| HCSTVG-v2 | m_vIoU | 20.64 | — | — |
| VidSTG-Decl | m_vIoU | 15.93 | 14.45 | +1.48 |
| VidSTG-Decl | vIoU@0.3 | 25.16 | 18.57 | +6.59 |
| VidSTG-Inter | m_vIoU | 13.47 | 13.25 | +0.22 |
Compared to MDETR baselines (HCSTVG-v2): TubeRMC outperforms MDETR-Zero by 8.43% and MDETR+CPL by 5.55%.
Ablation Study¶
Reconstruction strategy ablation (HCSTVG-v1):
| Spatial | Temporal | Spatio-Temp. | m_vIoU | vIoU@0.3 | Note |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 14.91 | 14.87 | No reconstruction baseline |
| ✓ | ✗ | ✗ | 15.24 | 14.74 | +Spatial |
| ✗ | ✓ | ✗ | 17.59 | 20.25 | +Temporal (+2.68, significant) |
| ✓ | ✓ | ✗ | 18.12 | 20.43 | Spatial+Temporal complementary |
| ✓ | ✓ | ✓ | 19.38 | 23.88 | All strategies, best |
Mutual constraint ablation:
| s-to-t | t-to-s | m_vIoU | m_tIoU | m_sIoU |
|---|---|---|---|---|
| ✗ | ✗ | 15.87 | 26.69 | 59.83 |
| ✓ | ✗ | 17.65 | 29.04 | 59.95 |
| ✗ | ✓ | 17.07 | 27.38 | 60.13 |
| ✓ | ✓ | 19.38 | 30.94 | 61.67 |
Visual grounding model substitution: Replacing MDETR with G-DINO (Swin-B) raises m_vIoU to 21.15%; however, MDETR is used by default to balance speed and performance.
Key Findings¶
- Temporal reconstruction yields the largest single-component gain (+2.68 m_vIoU), underscoring the importance of temporal modeling in WSTVG.
- The three reconstruction strategies are complementary, and their combination achieves the best performance.
- The space-to-time constraint primarily improves m_tIoU (+2.35/+3.56), while the time-to-space constraint primarily improves m_sIoU.
- Stronger visual grounding models (e.g., G-DINO-Swin-B) can further boost performance.
Highlights & Insights¶
- Three-dimensional reconstruction paradigm: This work is the first to learn tube-text correspondences simultaneously across 1D/2D/3D dimensions, yielding a clean and elegant formulation.
- Mutual constraint mechanism: Spatial information guides temporal proposal generation, while temporal proposals enforce spatial continuity, forming a virtuous cycle.
- Plug-and-play visual grounding model: The framework is agnostic to the choice of visual grounding model and can benefit from future model advances.
- Reconstruction as understanding: Evaluating tube-text matching quality by assessing whether masked text can be reconstructed constitutes an elegant weakly-supervised training signal.
Limitations & Future Work¶
- Under severe viewpoint changes and occlusions, MDETR may incorrectly assign bounding boxes to other individuals performing similar actions (e.g., the third row in the visualization case).
- Performance remains dependent on the quality of the visual grounding model; when the pretraining corpus of MDETR diverges significantly from the target dataset (e.g., VidSTG Interrogative), performance is limited.
- Tracking algorithms could be incorporated to generate higher-quality tube proposals.
- Automatic learning of 3D Gaussian mask parameters is worth exploring.
Related Work & Insights¶
- The reconstruction paradigm is inspired by weakly-supervised video temporal grounding (WVTG); this work extends it from 1D to 2D and 3D.
- Visual grounding models such as MDETR and G-DINO provide stronger initialization for weakly-supervised approaches.
- The mutual constraint idea is generalizable to other tasks requiring spatial-temporal co-reasoning.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The three-dimensional reconstruction and mutual constraint mechanisms are original contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive ablations across four datasets and multiple settings.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with complete mathematical formulations.
- Value: ⭐⭐⭐⭐ — Advances the state of the art in weakly-supervised STVG research.