# Sim-DETR: Unlock DETR for Temporal Sentence Grounding
**Conference:** ICCV 2025 · **arXiv:** 2509.23867 · **Code:** github.com/SooLab/Sim-DETR · **Area:** Object Detection / Temporal Sentence Grounding · **Keywords:** temporal sentence grounding, DETR, query conflict, self-attention adjustment, global-local bridging
## TL;DR
This paper systematically analyzes the root causes of anomalous behavior in DETR-based temporal sentence grounding (TSG) — inter-query conflict and intra-query global-local contradiction — and proposes two simple decoder modifications (Query Grouping & Ranking + Global-Local Bridging) to form Sim-DETR, unlocking the full potential of DETR for TSG.
## Background & Motivation
TSG requires localizing the temporal segment in an untrimmed video corresponding to a natural language query. Dominant methods adopt the DETR framework, using learnable queries to predict temporal segments in the decoder.
Anomalous observations: in object detection, increasing the number of queries and decoder layers typically improves DETR performance. In TSG, however:
- Increasing queries from 10 to 20: performance drops by more than 2%
- Increasing decoder layers: performance drops by 1–2.5%
Root cause analysis (one of the paper's core contributions):
Inter-query conflict: In TSG, multiple target segments share the same linguistic semantics (e.g., multiple events corresponding to the same sentence), causing their associated queries to be highly similar. Under one-to-one matching, the same query may be matched to different target segments across decoder layers (a "random matching" phenomenon), resulting in very low cross-layer matching consistency.
Intra-query conflict: Each query must simultaneously serve two roles — (a) encoding global segment semantics for matching, and (b) decoding local boundaries for precise localization. These two objectives are inherently contradictory: should the query attend to global semantics or local boundaries? Experiments show that a high global matching score does not guarantee accurate local localization.
## Method

### Overall Architecture
Sim-DETR introduces two "small but critical" decoder modifications to a standard DETR-based TSG architecture:
- Feature extraction: CLIP [CLS] + SlowFast concatenation
- Multimodal encoder: video-language cross-attention fusion
- Decoder: augmented with QGR and GLB modules
### Key Designs
- **Query Grouping and Ranking (QGR)**:
  - Query grouping: soft grouping based on the pairwise L2 distance between predicted temporal segments; queries with close predictions are treated as likely corresponding to the same target segment: \(\mathcal{S}^{intra}_{i,j} = \|b_i - b_j\|_2\). L2 is preferred over L1 because, when two segments are close (normalized distance ≤ 1), the L2 value decays faster and imposes a smaller penalty on minor differences.
  - Query ranking: an IoU prediction head is introduced, and queries are ranked by combining classification confidence with predicted IoU: \(\mathcal{R}_{rank}(q_i, q_j) = \begin{cases} +1 & \mathcal{P}^{cls}_i \circ \mathcal{P}^{IoU}_i \geq \mathcal{P}^{cls}_j \circ \mathcal{P}^{IoU}_j \\ -1 & \text{otherwise} \end{cases}\)
  - Self-attention adjustment: grouping and ranking information is incorporated into the self-attention weights: \(\mathcal{S}^{attn} = \text{sigmoid}(\text{MLP}(\mathcal{S}^{intra} \circ \mathcal{R}_{rank}))\). High values encourage high-quality queries to aggregate information from similar queries within the same group.
  - Design motivation: enable otherwise indistinguishable queries to attend to different contexts, reducing inter-query similarity and allowing the most suitable query to draw information from related queries.
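The self-attention adjustment above can be sketched in NumPy. The function name `qgr_attention_bias`, the scalar affine map `(w, b)` standing in for the paper's MLP, and the `(start, end)` span layout are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def qgr_attention_bias(spans, p_cls, p_iou, w=1.0, b=0.0):
    """Sketch of QGR: grouping + ranking folded into an attention bias.

    spans: (N, 2) predicted (start, end) segments, normalized to [0, 1]
    p_cls, p_iou: (N,) classification confidence and predicted IoU
    w, b: scalar affine stand-in for the paper's MLP
    """
    # Grouping: pairwise L2 distance, S^intra_{i,j} = ||b_i - b_j||_2
    diff = spans[:, None, :] - spans[None, :, :]
    s_intra = np.linalg.norm(diff, axis=-1)                        # (N, N)
    # Ranking: +1 where query i outranks query j by the combined score, else -1
    score = p_cls * p_iou
    r_rank = np.where(score[:, None] >= score[None, :], 1.0, -1.0)
    # Adjustment: sigmoid(MLP(S^intra ∘ R_rank)), here with an affine map as the MLP
    x = w * (s_intra * r_rank) + b
    return 1.0 / (1.0 + np.exp(-x))                                # values in (0, 1)
```

For two near-duplicate spans such as (0.10, 0.30) and (0.12, 0.31), the L1 distance is 0.03 while the L2 distance is ≈ 0.022, which illustrates why L2 penalizes minor differences less.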
- **Global-Local Bridging (GLB)**:
  - Introduces a query-to-frame matching loss that strengthens the alignment between each query and every frame within its predicted segment.
  - Computes the semantic similarity between the query and all frames: \(z = \text{sigmoid}(\tau \cdot \cos(q_i, \hat{\mathcal{T}}))\), where \(\tau\) is a learnable scaling factor.
  - The loss maximizes similarity to intra-segment frames and minimizes similarity to extra-segment frames: \(\mathcal{L}_{bridge} = \lambda_{bridge} \frac{-\sum_j z_j \mathbb{I}[b^{gt}_i]_j}{\sum_j z_j(1-\mathbb{I}[b^{gt}_i]_j) + \sum_j \mathbb{I}[b^{gt}_i]_j}\)
  - Design motivation: the complete frame sequence within a segment (from start to end) serves as a bridge connecting global semantics and local boundaries.
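A minimal NumPy sketch of the bridging loss for a single query, assuming a query embedding `q`, frame features `frames`, and a binary ground-truth mask; the function name and the fixed `tau` are illustrative (the paper learns \(\tau\)):

```python
import numpy as np

def glb_bridge_loss(q, frames, gt_mask, tau=10.0, lam=1.0):
    """Sketch of L_bridge for one query (shapes are assumptions).

    q: (D,) query embedding; frames: (T, D) frame features;
    gt_mask: (T,) binary indicator I[b^gt] for frames inside the GT segment.
    """
    # z_j = sigmoid(tau * cos(q, frame_j)): scaled cosine similarity per frame
    cos = frames @ q / (np.linalg.norm(frames, axis=1) * np.linalg.norm(q) + 1e-8)
    z = 1.0 / (1.0 + np.exp(-tau * cos))
    # numerator rewards similarity to intra-segment frames; the denominator's
    # first term penalizes similarity to extra-segment frames
    num = -np.sum(z * gt_mask)
    den = np.sum(z * (1.0 - gt_mask)) + np.sum(gt_mask)
    return lam * num / den
```

A query aligned with its in-segment frames drives the loss toward \(-\lambda_{bridge}\), while similarity to out-of-segment frames inflates the denominator and shrinks the reward.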
- **IoU Prediction Head**:
  - Serves as an auxiliary signal for assessing localization precision, used jointly with classification scores for query ranking.
  - Addresses the issue that "high confidence ≠ precise localization."
  - Design motivation: ranking queries by classification score alone is ineffective in TSG (unlike object detection), so an explicit local localization signal is needed.
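A toy example of why the joint ranking matters; all numbers are made up for illustration:

```python
import numpy as np

# Hypothetical scores for three queries: query 0 is confident but loosely
# localized, query 2 is both confident and well localized.
p_cls = np.array([0.95, 0.60, 0.80])   # classification confidence
p_iou = np.array([0.40, 0.90, 0.85])   # predicted IoU (localization quality)

best_by_cls = int(np.argmax(p_cls))            # picks query 0
best_by_joint = int(np.argmax(p_cls * p_iou))  # picks query 2
```

Classification alone selects the over-confident query 0, while the joint score \(\mathcal{P}^{cls} \circ \mathcal{P}^{IoU}\) surfaces query 2, mirroring the "high confidence ≠ precise localization" issue.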
### Loss & Training
- \(\mathcal{L}_{MD}\): Standard Moment DETR loss (L1 + gIoU + classification + saliency loss)
- \(\mathcal{L}_{bridge}\): Global-local bridging loss
- \(\mathcal{L}_{iou}\): IoU prediction head loss
- Trained for 200 epochs on a single A40 GPU, AdamW with lr=1e-4, 6 decoder layers.
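The three terms can be combined as a single training objective; the weighting convention below is an assumption (\(\lambda_{bridge}\) is folded into \(\mathcal{L}_{bridge}\) itself, and `w_iou` is a hypothetical weight for the IoU head loss):

```python
def sim_detr_loss(l_md, l_bridge, l_iou, w_iou=1.0):
    """Illustrative combination of the Moment DETR loss, the bridging loss
    (already scaled by lambda_bridge), and the IoU head loss."""
    return l_md + l_bridge + w_iou * l_iou
```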
## Key Experimental Results

### Main Results
QVHighlights test set:
| Method | R1@0.5 | R1@0.7 | mAP@0.5 | mAP@0.75 | mAP Avg |
|---|---|---|---|---|---|
| M-DETR | 52.89 | 33.02 | 54.82 | 29.40 | 30.73 |
| QD-DETR | 62.40 | 44.98 | 62.52 | 39.88 | 39.86 |
| TR-DETR | 64.66 | 48.96 | 63.98 | 43.73 | 42.62 |
| BAM-DETR | 62.71 | 48.64 | 64.57 | 46.33 | 45.36 |
| CG-DETR | 65.43 | 48.38 | 64.51 | 42.77 | 42.86 |
| Sim-DETR | 67.64 | 50.91 | 67.81 | 47.59 | 46.93 |
Charades-STA / TACoS:
| Dataset | Method | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|---|
| TACoS | CG-DETR | 39.61 | 22.23 | 36.48 |
| TACoS | Sim-DETR | 42.79 | 26.82 | 39.44 |
| Charades | SpikeMba | 59.65 | 36.12 | 51.74 |
| Charades | Sim-DETR | 61.34 | 39.62 | 52.56 |
### Ablation Study
Component ablation (QVHighlights val):
| Configuration | R1@0.5 | R1@0.7 | mAP Avg |
|---|---|---|---|
| Baseline (TR-DETR) | 65.48 | 50.84 | 44.97 |
| + \(\mathcal{L}_{iou}\) | 66.58 | 51.94 | 45.22 |
| + QGR | 68.77 | 52.26 | 47.03 |
| + GLB | 67.16 | 52.77 | 48.17 |
| + QGR + GLB | 69.48 | 54.06 | 49.50 |
Conflict metric ablation:
| Inner Relevance | Outer Global | Outer Local | mAP Avg |
|---|---|---|---|
| span border dist | confidence | IoU pred | 49.50 |
| center dist | confidence | IoU pred | 48.59 |
| span IoU | confidence | IoU pred | 48.94 |
| span border dist | w/o | IoU pred | 48.93 |
| span border dist | confidence | w/o | 48.66 |
## Key Findings
- QGR effectively differentiates queries: Analysis shows that Sim-DETR successfully separates the similarity distributions of intra-segment and inter-segment queries, reducing query "oscillation" across different target segments.
- Cross-layer matching consistency significantly improves: QGR substantially increases query-segment matching consistency between consecutive decoder layers.
- GLB aligns global semantics with local localization: After introducing GLB, queries concentrate attention on intra-segment frames rather than dispersing across multiple target segments.
- Anomalous behavior eliminated: Increasing the number of queries and decoder layers no longer degrades performance, and yields slight improvements instead.
- Accelerated convergence: Eliminating the anomalous behavior significantly speeds up training convergence.
- Boundary distance is the best grouping criterion: span border (L2) distance proves superior to center distance and span IoU for query grouping.
## Highlights & Insights
- Diagnosis-driven method design: Solutions are designed through systematic observation of phenomena and root cause analysis (three dedicated investigation sections), followed by targeted remedies.
- The paper reveals a fundamental distinction between TSG and object detection: multiple target segments share linguistic semantics.
- Minimal modifications, significant gains: Only two decoder modifications are sufficient to comprehensively surpass all state-of-the-art methods.
- Joint classification + IoU ranking: Specifically addresses the problem that "high confidence ≠ precise localization."
- Thorough analysis of effects: Beyond reporting performance gains, the paper also validates anomaly elimination and convergence acceleration.
## Limitations & Future Work
- Query grouping relies on predicted span distances, which may be inaccurate during early training and thus affect grouping quality.
- Frame-level alignment in GLB uses simple cosine similarity; more sophisticated alignment strategies may yield further improvements.
- Validation is limited to video temporal grounding; generalization to other DETR-based tasks has not been explored.
- The IoU prediction head introduces additional parameters and computation, which may be a concern in resource-constrained settings.
## Related Work & Insights
- The query ranking strategy from EASE-DETR inspired QGR, but directly using predicted scores for ranking is ineffective in TSG.
- The baseline loss design from Moment DETR provides a standard framework for TSG.
- The analysis of "semantically similar but spatially distinct target segments" in TSG may generalize to analogous multi-instance detection scenarios.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The diagnostic analysis is exceptionally thorough, uncovering fundamental issues in applying DETR to TSG.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensively outperforms prior work on three benchmarks, with multiple backbones, detailed ablations, and visual analyses.
- Writing Quality: ⭐⭐⭐⭐⭐ The "diagnose-then-treat" narrative structure is very clear, with analysis and validation tightly integrated.
- Value: ⭐⭐⭐⭐⭐ Provides a simple yet effective strong baseline for DETR-based TSG with broad methodological implications.