
Online Generic Event Boundary Detection

Conference: ICCV2025 arXiv: 2510.06855 Code: To be confirmed Area: Video Understanding / Event Segmentation Keywords: Online event boundary detection, Event Segmentation Theory, streaming video, dynamic thresholding, Transformer decoder

TL;DR

This paper proposes Online Generic Event Boundary Detection (On-GEBD) as a new task—detecting event boundaries in real time from streaming video—and introduces the ESTimator framework inspired by the cognitive science Event Segmentation Theory (EST). Through the collaboration of a Consistent Event Anticipator (CEA) and an Online Boundary Discriminator (OBD), ESTimator achieves an Avg F1 of 0.748 on Kinetics-GEBD, surpassing all online baselines and approaching the performance of offline methods.

Background & Motivation

Background: Generic Event Boundary Detection (GEBD) aims to detect human-perceived event boundaries in long videos. These boundaries are taxonomy-free and not restricted to predefined action categories. Existing GEBD methods (e.g., DDM-Net, UBoCo, CoSeg) make predictions after processing complete videos, which differs from the way humans perceive events online in real time.

Limitations of Prior Work:

  • Current GEBD methods require access to future frames (offline setting) and cannot be applied to streaming video scenarios (live broadcasting, surveillance, real-time interaction).
  • Conventional online video understanding methods (OAD, On-TAL) focus on predefined action categories and are ill-suited to detecting taxonomy-free generic event boundaries.
  • Static thresholds fail to capture diverse semantic changes at varying granularities, while peak detection relies on future frame information.

Key Challenge: In the online setting, the model can only observe past and current frames, operating under severe information constraints, yet must detect diverse, subtle, and taxonomy-free semantic changes—a challenge that is already difficult in the offline setting.

Goal: (a) Define the new On-GEBD task—frame-by-frame streaming processing, immediate decision-making, and use of only past information; (b) Design a method capable of effectively detecting diverse generic event boundaries under severely constrained information.

Key Insight: Inspiration is drawn from the cognitive science Event Segmentation Theory (EST)—humans continuously predict future information consistent with the current event and perceive event boundaries when a significant discrepancy arises between prediction and actual incoming information.

Core Idea: A Transformer decoder is used to predict future frame features consistent with the current event; online event boundary detection is then performed by identifying statistical anomalies in the prediction error via a sliding-window dynamic threshold.

Method

Overall Architecture

ESTimator consists of two core components:

  1. Consistent Event Anticipator (CEA): Receives ResNet-50 features from the past \(L\) frames along with a learnable token, and predicts the next frame's features through a Transformer decoder with causal masking.
  2. Online Boundary Discriminator (OBD): Maintains a fixed-size FIFO queue of historical prediction errors and applies a Gaussian statistical test to determine whether the current frame's prediction error is an outlier.

The input is a 2048-dimensional frame feature extracted by ResNet-50, and the output is a binary decision (boundary or non-boundary) for each frame.
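The cosine-distance error that links the two components can be sketched as follows (a minimal NumPy illustration of the formula, not the authors' code; the function name is ours):

```python
import numpy as np

def prediction_error(f_t: np.ndarray, f_hat_t: np.ndarray) -> float:
    """Cosine-distance prediction error, bounded in [0, 1].

    f_t:     actual frame feature (e.g. a 2048-d ResNet-50 vector)
    f_hat_t: the feature predicted for that frame from past frames
    """
    cos_sim = np.dot(f_t, f_hat_t) / (np.linalg.norm(f_t) * np.linalg.norm(f_hat_t))
    return 0.5 * (1.0 - cos_sim)

f = np.random.randn(2048)
prediction_error(f, f)   # identical features -> 0.0 (consistent with the current event)
prediction_error(f, -f)  # opposite features  -> 1.0 (sharp semantic change)
```

Because the result is bounded in \([0, 1]\), it can be treated directly as a pseudo-probability by the BCE-style losses described below.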

Key Designs

  1. Consistent Event Anticipator (CEA):

    • Function: Predicts the next frame feature consistent with the current event, based on past frames.
    • Mechanism: The \(L\) past frame features are concatenated with a learnable token \(\mathbf{T}\) and fed into a Transformer decoder with a causal attention mask. The output at the learnable token position, \(\hat{\mathbf{f}}_t\), serves as the prediction for the next frame. Prediction error is measured by cosine distance: \(\varepsilon_t = \frac{1}{2}(1 - \frac{\mathbf{f}_t \cdot \hat{\mathbf{f}}_t}{\|\mathbf{f}_t\|\|\hat{\mathbf{f}}_t\|})\)
    • Design Motivation: The causal mask ensures that only past information is used, satisfying the online constraint. Cosine distance is bounded in \([0, 1]\), offering greater stability than L1/L2 distances.
  2. EST Loss + REST Loss:

    • Function: Trains CEA to produce low prediction error within events and high error at boundaries.
    • EST Loss: Frame-level binary cross-entropy: \(\mathcal{L}_{EST} = -y_t \log \varepsilon_t - (1-y_t) \log(1-\varepsilon_t)\), encouraging boundary frame errors to approach 1 and non-boundary frame errors to approach 0.
    • REST Loss (Region EST): Region-level supervision applied to the averaged error over \(K\) consecutive frames, \(\bar{\varepsilon}_t = \frac{1}{K}\sum_{i=t-K}^{t}\varepsilon_i\), with the same BCE loss. This provides soft supervision for smooth transitions across consecutive frames.
    • Total Loss: \(\mathcal{L} = \alpha \cdot \mathcal{L}_{REST} + \sum_{i=t-K}^{t}\mathcal{L}_{EST}\), where \(\alpha=0.5\).
    • Design Motivation: Pure frame-level supervision is overly strict in regions of smooth transition (where the boundary/non-boundary label changes abruptly from one frame to the next). REST provides a smoother learning signal through regional averaging. However, neither loss alone is sufficient; the full benefit emerges only in conjunction with OBD.
  3. Batch-wise Loss Weighting:

    • Function: Automatically balances the sample imbalance between boundary and non-boundary frames.
    • Mechanism: The ratio of boundary to non-boundary frames is computed within each batch and multiplied against the boundary frame loss, eliminating the need for manual tuning.
    • Design Motivation: Boundary frames constitute a very small fraction of a video (approximately 5 event boundaries per video on average); without correction, the model is biased toward predicting non-boundary frames.
  4. Online Boundary Discriminator (OBD):

    • Function: Applies a dynamic threshold to determine whether the current frame is a boundary.
    • Mechanism: A FIFO queue \(\mathcal{Q}\) of size \(\Delta\) stores the prediction errors of the most recent \(\Delta\) frames. When a new frame \(v_t\) arrives, its normalized error score is computed as \(\zeta_t = \frac{\varepsilon_t - \mu_\mathcal{Q}}{\sigma_\mathcal{Q}}\). If \(\zeta_t > \tau\) (where \(\tau=1.5\)), the frame is classified as a boundary.
    • Design Motivation: A fixed threshold cannot adapt to varying degrees of semantic change across different video segments—in rapidly changing segments, even normal error values may be high, leading to excessive false positives. OBD establishes a local baseline from the sliding window mean and standard deviation, making the threshold adaptive to the current context.
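The OBD update above can be sketched in a few lines (an illustrative NumPy version, not the authors' implementation; the small epsilon guarding against zero variance is our addition):

```python
from collections import deque
import numpy as np

def obd_step(queue: deque, eps_t: float, tau: float = 1.5) -> bool:
    """One OBD decision: is the current prediction error a statistical outlier?

    queue: FIFO of the last Delta prediction errors (deque with maxlen=Delta)
    eps_t: prediction error of the incoming frame
    """
    mu, sigma = np.mean(queue), np.std(queue)
    zeta_t = (eps_t - mu) / (sigma + 1e-8)  # epsilon guard is our assumption, not from the paper
    queue.append(eps_t)                     # maxlen evicts the oldest error automatically
    return zeta_t > tau

q = deque([0.05, 0.06, 0.04, 0.05], maxlen=32)
obd_step(q, 0.05)  # typical within-event error -> False
obd_step(q, 0.90)  # sharp spike in error      -> True (boundary)
```

Note that the threshold \(\tau\) acts on a normalized score, so the same \(\tau=1.5\) adapts automatically as the local error statistics in the queue change.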

Loss & Training

  • Total loss: EST + REST + batch-wise weighting
  • Optimizer: AdamW, lr=1e-4, batch size=512
  • Feature extraction: ImageNet-pretrained ResNet-50, feature dimension 2048
  • Transformer decoder: 3 layers
  • Sampling rate: Kinetics-GEBD at 24 FPS, TAPOS at 6 FPS
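The objective can be sketched as follows (a simplified NumPy illustration under stated assumptions: the region label is taken as 1 if any frame in the window is a boundary, and batch-wise weighting is omitted; neither detail is fully specified above):

```python
import numpy as np

def bce(err, y, eps=1e-7):
    """BCE on a bounded error score: drives err -> 1 on boundaries (y=1)
    and err -> 0 inside events (y=0)."""
    err = np.clip(err, eps, 1.0 - eps)
    return -(y * np.log(err) + (1.0 - y) * np.log(1.0 - err))

def total_loss(errors, labels, alpha=0.5):
    """L = alpha * L_REST + sum of frame-level L_EST over a K-frame window.

    errors: prediction errors for the window ending at frame t
    labels: 0/1 boundary labels for the same frames
    The region label (any boundary in the window) is our assumption.
    """
    l_est = np.sum(bce(errors, labels))                 # frame-level EST terms
    l_rest = bce(np.mean(errors), float(labels.max()))  # region-level REST term
    return alpha * l_rest + l_est

good = total_loss(np.array([0.01, 0.01, 0.01, 0.99]), np.array([0, 0, 0, 1]))
bad  = total_loss(np.array([0.99, 0.99, 0.99, 0.01]), np.array([0, 0, 0, 1]))
# errors aligned with the boundary labels yield a much smaller loss
```

The region-level term is what softens the supervision around boundaries: even if individual frame errors lag the hard labels, the window average can still move in the right direction.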

Key Experimental Results

Main Results: Online Baseline Comparison on On-GEBD

Method Kinetics-GEBD Avg F1 TAPOS Avg F1
TeSTra-BC 0.557 0.487
Sim-On-BC 0.618 0.344
OadTR-BC 0.558 0.416
MiniROAD-BC 0.681 0.528
ESTimator (Ours) 0.748 0.547

ESTimator outperforms the strongest baseline, MiniROAD, by 6.7 absolute points in Avg F1 on Kinetics-GEBD (0.748 vs. 0.681).

Comparison with Offline Methods

Method Setting Kinetics-GEBD Avg F1
TCN Offline 0.685
BMN-StartEnd Offline 0.640
PA Offline Unsupervised 0.527
CoSeg Offline 0.782
PC Offline 0.817
ESTimator Online 0.748

ESTimator's online performance surpasses the majority of offline methods, trailing only CoSeg and PC.

Ablation Study

Configuration F1@0.05 Avg F1
Baseline (Transformer+BC) 0.483 0.607
+EST 0.571 0.698
+REST 0.504 0.654
+EST +REST 0.544 0.691
+EST +OBD 0.604 0.659
+REST +OBD 0.621 0.692
+EST +REST +OBD (Full) 0.620 0.748

All three components are indispensable: EST+REST without OBD yields only 0.691, and no pairwise combination of components matches the full model's 0.748.

Error Metric Ablation

Distance Metric Avg F1
L1 distance (min-max normalized) 0.733
L2 distance (min-max normalized) 0.733
KL divergence (min-max normalized) 0.734
Cosine distance 0.748

Cosine distance achieves the best performance due to its natural boundedness in \([0, 1]\), requiring no additional normalization.

Real-Time Performance

Method Model FPS Overall FPS Avg F1
TeSTra-BC 177 72.5 0.557
OadTR-BC 100 48.9 0.558
MiniROAD-BC 3069 99.8 0.681
ESTimator 2924 99.7 0.748

ESTimator achieves state-of-the-art performance while maintaining real-time processing speed (~100 FPS) comparable to MiniROAD.

Key Findings

  • The synergistic effect of all three components is critical: The EST+REST combination actually underperforms EST alone (0.691 vs. 0.698), but jumps to 0.748 with the addition of OBD—the dynamic thresholding of OBD resolves the inherent tension between the two loss functions.
  • The boundedness of cosine distance is a key advantage: Other unbounded metrics require min-max normalization for use with BCE loss, but normalization introduces noise.
  • Online methods can approach or even surpass most offline methods: This provides encouraging evidence for the feasibility of streaming video understanding.

Highlights & Insights

  • Precise operationalization of cognitive science theory: EST theory translates into the CEA module (continuous prediction) and the OBD module (discrepancy detection), with each cognitive science concept having a corresponding computational implementation. This "theory-to-method" translation pipeline is highly instructive.
  • The statistical testing idea in OBD is elegant and effective: Building a dynamic threshold from the sliding window \(\mu/\sigma\) essentially reformulates event boundary detection as a time-series anomaly detection problem. This design requires no learned parameters and generalizes exceptionally well.
  • The "regional averaging" soft supervision in REST Loss: This addresses the overly sharp nature of frame-level labels in video (a sudden label transition between consecutive frames is unrealistic). The regional averaging technique yields a smoother learning signal and is transferable to other video tasks involving frame-level annotations with ambiguous label boundaries.

Limitations & Future Work

  • Feature extractor fixed to ResNet-50: The impact of stronger video features (e.g., VideoMAE, InternVideo) on performance remains unexplored.
  • Threshold \(\tau=1.5\) is manually set: Although OBD is adaptive overall, \(\tau\) itself remains a fixed hyperparameter.
  • Effect of queue size \(\Delta\) is not fully analyzed: Different video types (fast-cut vs. long-take) may require different window sizes.
  • Limited to frame-level features: Spatial information (e.g., changes in object position) is not exploited, which may lead to suboptimal performance on event boundaries that depend on local spatial changes.
  • Potential improvements: (a) Using multi-scale queues (short window + long window) to simultaneously capture fast and slow semantic changes; (b) Introducing spatial attention to enable CEA to perceive changes in local regions; (c) Making \(\tau\) learnable or adaptive.
  • vs. CoSeg (offline): CoSeg is similarly inspired by cognitive science and uses event reconstruction to detect boundaries, but requires the complete video. ESTimator adapts the cognitive theory to the online setting, replacing reconstruction with prediction.
  • vs. MiniROAD (online action detection): MiniROAD uses a GRU for online action detection and is fast, but handles only predefined action categories. ESTimator is designed for taxonomy-free generic events and outperforms MiniROAD by 6.7 absolute Avg F1 points.
  • vs. PC (Pairwise Comparison, offline): PC is currently the strongest offline GEBD method (Avg F1 = 0.817); ESTimator achieves 91.6% of its performance under online constraints.

Rating

  • Novelty: ⭐⭐⭐⭐ New task definition + precise engineering realization of cognitive science theory
  • Experimental Thoroughness: ⭐⭐⭐⭐ Detailed ablations and comparisons against both online and offline methods, though validation on additional datasets is lacking
  • Writing Quality: ⭐⭐⭐⭐⭐ Problem definition is clear, method derivation is coherent, and illustrations are intuitive
  • Value: ⭐⭐⭐⭐ Opens a new direction in On-GEBD with significant implications for streaming video understanding
