Context-Enhanced Memory-Refined Transformer for Online Action Detection

Conference: CVPR 2025
arXiv: 2503.18359
Code: GitHub
Area: Video Understanding
Keywords: Online Action Detection, Action Anticipation, Transformer, Memory Mechanism, Training-Inference Discrepancy

TL;DR

This paper identifies a training-inference inconsistency in existing online action detection (OAD) methods: unbalanced context exposure across short-term memory frames, together with non-causal information leakage introduced by pseudo-futures, biases learning toward intermediate frames rather than the latest frame. It proposes CMeRT, which addresses both problems with a near-past context-enhanced encoder and a near-future-based memory refinement decoder, achieving state-of-the-art performance on THUMOS'14, CrossTask, and EK100.

Background & Motivation

Online action detection (OAD) requires real-time action recognition in video streams based solely on past observations, serving as a foundation for applications like autonomous driving, surveillance, and AR assistants. State-of-the-art OAD methods divide historical frames into long-term and short-term memory, and compensate for missing future context by predicting a pseudo-future. During training, a causal mask is used to make all frames in the short-term memory serve as training samples, whereas only the latest frame is used during inference.

The key challenge lies in the training-inference discrepancy, which leads to two types of biases:

Unbalanced Context Exposure: The causal mask leaves early frames in short-term memory (e.g., \(t_s\)) with virtually no immediate context, whereas the latest frame (\(t\)) enjoys full context. Early frames therefore have poor representations (high loss), and including these low-quality samples in training undermines the classifier's ability to predict the latest frame, which is the only frame used at inference.
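The imbalance can be made concrete with a minimal sketch that counts how many short-term frames each position may attend to under a causal mask. The memory length `T_s = 8` and the helper names are illustrative assumptions, and the count ignores the compressed long-term memory that every frame can also see.

```python
def causal_mask(length):
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return [[j <= i for j in range(length)] for i in range(length)]


def visible_counts(mask):
    """Number of short-term frames each position is allowed to attend to."""
    return [sum(row) for row in mask]


T_s = 8  # hypothetical short-term memory length, in frames
counts = visible_counts(causal_mask(T_s))

for offset, count in enumerate(counts):
    # offset 0 is the earliest frame t_s; offset T_s - 1 is the latest frame t
    print(f"frame t_s+{offset}: sees {count} short-term frame(s)")
```

The earliest frame sees only itself while the latest frame sees the whole window, yet during training both contribute to the loss as samples.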

Non-Causal Leakage: Methods like MAT generate a "pseudo-future" from the complete short-term memory to enhance detection, but this lets intermediate frames indirectly access subsequent frames via "future \(\rightarrow\) short-term \(\rightarrow\) future" pathways, violating causality. Training is thereby biased toward intermediate frames (visible as a valley-shaped loss curve), which harms learning for the latest frame.
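A rough way to see the leakage is to track, for each short-term frame, the set of frame indices its representation can depend on. The sketch below is a simplified dependency trace under assumed stages (causal encoding, a pseudo-future built from all short-term frames, then refinement against that pseudo-future); it is not MAT's actual implementation, and the function names are hypothetical.

```python
def pseudo_future_dependencies(T_s):
    """Dependency sets after refining short-term frames with a pseudo-future
    that was itself generated from the complete short-term memory."""
    # Stage 1 (assumed): causal encoding, frame i depends on frames 0..i.
    encoded = [set(range(i + 1)) for i in range(T_s)]
    # Stage 2 (assumed): the pseudo-future attends to every short-term frame.
    pseudo_future = set().union(*encoded)
    # Stage 3 (assumed): every frame attends to the pseudo-future.
    return [deps | pseudo_future for deps in encoded]


def leaked_frames(dependencies):
    """For each frame i, list the later frames j > i that it can now see."""
    return [sorted(j for j in deps if j > i) for i, deps in enumerate(dependencies)]


T_s = 6  # hypothetical short-term memory length
leaks = leaked_frames(pseudo_future_dependencies(T_s))
for i, later in enumerate(leaks):
    print(f"frame {i}: indirectly sees later frames {later}")
```

Only the latest frame is leak-free in this trace, which matches the setting at inference but not the one most training samples are exposed to.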

This paper proposes CMeRT, which (1) complements early frames with immediate information via near-past context, and (2) generates the near-future solely from long-term memory (rather than short-term memory) to avoid non-causal leakage.

Method

Overall Architecture

CMeRT adopts an encoder-decoder architecture operating on five context partitions: long-term memory \(M_L\), short-term memory \(M_S\), anticipation query \(Q_A\), near-past \(M_C\), and near-future \(M_F\). The encoder compresses the long-term memory and enhances short-term memory encoding using the near-past context; the decoder generates the near-future from the compressed long-term memory and refines the short-term memory. All modules are built on a unified Transformer Decoder Unit (TDU).
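As a minimal sketch of how the frame-based partitions might be indexed at time \(t\), the helper below slices a feature stream into long-term, near-past, and short-term ranges. It assumes the long-term memory ends right before the first short-term frame \(t_s = t - T_s + 1\) and that the near-past covers the \(T_c\) frames immediately before \(t_s\); the function and field names are hypothetical. \(Q_A\) and the query used to produce \(M_F\) are learnable rather than frame-based, so they are not indexed here.

```python
from dataclasses import dataclass


@dataclass
class Partitions:
    long_term: range   # frames compressed into the long-term memory
    near_past: range   # raw frames right before the short-term window
    short_term: range  # frames t_s .. t, the latest one being t


def partition_frames(t, T_l, T_s, T_c):
    """Index ranges for the frame-based partitions at current time t."""
    t_s = t - T_s + 1  # first short-term frame (assumed convention)
    if t_s < 0:
        raise ValueError("not enough history for a full short-term window")
    return Partitions(
        long_term=range(max(0, t_s - T_l), t_s),
        near_past=range(max(0, t_s - T_c), t_s),
        short_term=range(t_s, t + 1),
    )


# Hypothetical lengths, in frames.
p = partition_frames(t=99, T_l=64, T_s=8, T_c=4)
print(p.long_term, p.near_past, p.short_term)
```

With these assumptions the near-past range is a suffix of the long-term range, so it adds no new frames, only an uncompressed view of frames the long-term memory already covers.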

Key Designs

Context-Enhanced Encoder:
- Function: Complements early frames in short-term memory with immediate past context to alleviate the unbalanced exposure issue.
- Mechanism: Extract the near-past memory \(M_C = \{f_i\}_{i=t_s-T_c}^{t_s-1}\) (length \(T_c \ll T_l\)), concatenate it before the short-term memory, and encode them together with the anticipation query via a causally masked TDU: \(M_{SA} = \text{TDU}(M_C \| M_S \| Q_A, \hat{M}_L \| M_S \| Q_A, \hat{M}_L \| M_S \| Q_A, G)_{[T_c:T_c+T_s+T_a]}\).
- Design Motivation: Although the near-past \(M_C\) overlaps with the long-term memory \(M_L\), the long-term memory loses fine-grained details during compression, whereas \(M_C\) preserves these details for early frames; after encoding, \(M_C\) is discarded, leaving only the enhanced short-term memory and anticipation.
Near-Future Generator:
- Function: Generates near-future context from compressed long-term memory to provide future information for all short-term frames.
- Mechanism: \(M_F = \text{TDU}(Q_F, \hat{M}_L, \hat{M}_L, \text{None})\), retrieving useful information from \(\hat{M}_L\) using a learnable query \(Q_F\) (length \(T_f\)).
- Design Motivation: The key improvement is avoiding the use of short-term memory to generate the near-future (unlike MAT), relying solely on the compressed long-term memory instead, which fundamentally eliminates the non-causal leakage problem.
Memory Refinement:
- Function: Refines the encoded short-term memory with near-future context to boost detection and anticipation performance.
- Mechanism: \(\hat{M}_{SA} = \text{TDU}(M_{SA}, \hat{M}_L \| M_{SA} \| M_F, \hat{M}_L \| M_{SA} \| M_F, G)\).
- Design Motivation: Near-future information helps disambiguate ongoing actions; meanwhile, because \(M_F\) originates from compressed long-term memory rather than short-term memory, no causal leakage is introduced.
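The combined effect of the three designs can be checked with a simplified dependency trace. The sketch below assumes that the near-future depends only on long-term frames, that the causal mask \(G\) restricts each short-term frame to itself and earlier short-term frames, and that the near-past contributes \(T_c\) raw frames to every short-term position. Frame indices and function names are illustrative, not taken from the paper's code.

```python
def cmert_style_dependencies(T_l, T_s, T_c):
    """Frames each refined short-term representation can depend on."""
    long_term = set(range(T_l))
    near_past = set(range(T_l - T_c, T_l))       # raw frames just before t_s
    short_term = list(range(T_l, T_l + T_s))
    near_future = set(long_term)                 # built from long-term memory only
    deps = {}
    for k, frame in enumerate(short_term):
        encoded = long_term | near_past | set(short_term[: k + 1])
        deps[frame] = encoded | near_future      # refinement adds no later frames
    return deps


def future_leaks(deps):
    """Later frames (index > own index) reachable from each short-term frame."""
    return {f: sorted(j for j in d if j > f) for f, d in deps.items()}


def raw_context_sizes(T_s, T_c):
    """Uncompressed frames visible to each short-term position."""
    return [T_c + k + 1 for k in range(T_s)]


leaks = future_leaks(cmert_style_dependencies(T_l=16, T_s=6, T_c=3))
print("leak-free:", all(not later for later in leaks.values()))
print("raw context per position:", raw_context_sizes(T_s=6, T_c=3))
```

In this trace no short-term frame reaches a later frame, and the earliest frame's raw context grows from one frame to \(T_c + 1\) frames.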

Loss & Training

The training loss is defined as: \(\mathcal{L} = \mathcal{L}_{SA}^1 + \lambda_1 \mathcal{L}_{SA}^0 + \lambda_2 \mathcal{L}_F\)

where \(\mathcal{L}_{SA}^0\) and \(\mathcal{L}_{SA}^1\) denote the cross-entropy losses for the encoder output and the refined output, respectively, and \(\mathcal{L}_F\) is the cross-entropy loss for the near-future generation. A shared classifier is used. The trade-off coefficients are set to \(\lambda_1 = 0.2\) and \(\lambda_2 = 0.5\). Optimization uses Adam with warmup and cosine annealing. Training samples are drawn with a sliding window (THUMOS'14 and CrossTask) or event-centric sampling (EK100). Inference uses a sliding window with step size 1 to simulate online streaming.
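A minimal sketch of the weighted sum, using the coefficients stated above. The per-frame `cross_entropy` helper and the example logits are illustrative assumptions; a real training loop would average each term over a batch of frames.

```python
import math


def cross_entropy(logits, target):
    """Numerically stable cross-entropy for one frame's class logits."""
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_partition - logits[target]


def cmert_loss(loss_refined, loss_encoder, loss_future, lambda1=0.2, lambda2=0.5):
    """L = L_SA^1 + lambda1 * L_SA^0 + lambda2 * L_F."""
    return loss_refined + lambda1 * loss_encoder + lambda2 * loss_future


# Hypothetical logits for a 3-class problem; the true class is index 2.
loss_refined = cross_entropy([0.1, 0.3, 2.0], target=2)   # refined output
loss_encoder = cross_entropy([0.2, 0.4, 1.2], target=2)   # encoder output
loss_future = cross_entropy([0.5, 0.5, 0.9], target=2)    # near-future output

total = cmert_loss(loss_refined, loss_encoder, loss_future)
print(f"total loss: {total:.4f}")
```

The refined output carries full weight, so the auxiliary encoder and near-future terms act as regularizers rather than dominating the objective.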

Key Experimental Results

Main Results

| Dataset | Metric | CMeRT | MAT (Prev. SOTA) | Gain |
| --- | --- | --- | --- | --- |
| THUMOS'14 | mAP (Detection) | 73.2 | 71.6 | +1.6 |
| CrossTask | mAP (Detection) | 35.9 | 33.9 | +2.0 |
| EK100 | Top-5 Recall (Action) | 27.6 | 26.3 | +1.3 |
| THUMOS'14 | mAP (Anticipation Avg) | 59.5 | 58.2 | +1.3 |
| EK100 | Top-5 Recall (Action Anticip.) | 19.8 | 19.5 | +0.3 |

Ablation Study

| Configuration | TH'14 mAP | CrossTask mAP | EK100 Action | Description |
| --- | --- | --- | --- | --- |
| W/o CE, w/o MR | 71.5 | 33.4 | 26.3 | Baseline |
| +MR (Near-Future Refinement) | 73.0 | 34.8 | 27.1 | +1.5 / +1.4 / +0.8 |
| +CE (Near-Past Enhancement) | 71.9 | 33.9 | 26.6 | +0.4 / +0.5 / +0.3 |
| +CE+MR (Full CMeRT) | 73.2 | 35.9 | 27.6 | Optimal combination |
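The per-component gains can be recomputed directly from the reported numbers. The short sketch below does only that arithmetic; the dictionary layout and the `gains` helper are illustrative.

```python
baseline = {"TH14": 71.5, "CrossTask": 33.4, "EK100": 26.3}
variants = {
    "+MR": {"TH14": 73.0, "CrossTask": 34.8, "EK100": 27.1},
    "+CE": {"TH14": 71.9, "CrossTask": 33.9, "EK100": 26.6},
    "+CE+MR": {"TH14": 73.2, "CrossTask": 35.9, "EK100": 27.6},
}


def gains(name):
    """Absolute gain over the baseline, in points, for one configuration."""
    return {k: round(variants[name][k] - baseline[k], 1) for k in baseline}


for name in variants:
    print(name, gains(name))

# Compare the full model's gain with the sum of the two individual gains.
for dataset in baseline:
    separate = round(gains("+MR")[dataset] + gains("+CE")[dataset], 1)
    print(dataset, "sum of parts:", separate, "full model:", gains("+CE+MR")[dataset])
```

On CrossTask and EK100 the full model gains more than the two components do separately, while on THUMOS'14 it gains slightly less, so the two components are not simply additive.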

| Near-Past Length (s) | CrossTask mAP |
| --- | --- |
| 5 | 35.1 |
| 10 | 35.9 |
| 15 | 35.6 |

| Near-Past Length (s) | TH'14 mAP | EK100 Action |
| --- | --- | --- |
| 0.5 | 73.2 | 27.2 |
| 1 | 72.8 | 27.3 |
| 2 | 72.7 | 27.6 |

Key Findings

Memory refinement (MR) contributes more than context enhancement (CE): MR alone adds +1.5, +1.4, and +0.8 points on THUMOS'14, CrossTask, and EK100, respectively, whereas CE alone adds +0.4, +0.5, and +0.3.
The optimal near-past length varies with dataset complexity: the simpler THUMOS'14 peaks at 0.5 s, while the more complex EK100 and CrossTask peak at 2 s and 10 s, respectively.
Naive solutions (e.g., MAT-rw weighting the latest frame, or MAT-stream training solely with the latest frame) yield limited improvements or even show severe performance degradation.
With DINOv2 features in place of the traditional ones, CMeRT reaches 76.4% mAP on THUMOS'14, which indicates that it is compatible with stronger features.
Efficiency surpasses MAT: fewer parameters (94.5M vs 107.4M) and a higher FPS (126.6 vs 102.0).
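The efficiency figures translate into relative terms with simple arithmetic. The sketch below uses only the parameter counts and FPS values quoted above; the per-frame latency is derived as 1000 / FPS and is not a number reported in the source.

```python
def relative_change(new, old):
    """Signed fractional change of `new` relative to `old`."""
    return (new - old) / old


cmert = {"params_m": 94.5, "fps": 126.6}
mat = {"params_m": 107.4, "fps": 102.0}

param_change = relative_change(cmert["params_m"], mat["params_m"])
fps_change = relative_change(cmert["fps"], mat["fps"])

print(f"parameters: {param_change:+.1%}")
print(f"throughput: {fps_change:+.1%}")

# Derived per-frame latency in milliseconds (1000 / FPS).
for name, stats in (("CMeRT", cmert), ("MAT", mat)):
    print(f"{name}: {1000.0 / stats['fps']:.2f} ms per frame")
```

Under these numbers CMeRT uses roughly 12% fewer parameters and runs roughly 24% faster than MAT.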

Highlights & Insights

Diagnosis of Training-Inference Inconsistency: Pinpoints the sources of the two biases by visualizing frame-level loss curves, a diagnostic methodology that transfers to other settings.
Ingenious Introduction of Near-Past Context: Rather than simply extending short-term memory, it supplements immediate context for early frames during training without increasing inference costs.
Leakage-Free Near-Future Generation: Generates the near-future from compressed long-term memory instead of short-term memory, systematically eliminating non-causal leakage at its source.
Unified Detection and Anticipation: Handles both online detection and action anticipation within a single framework, achieving mutual benefits through a shared classifier and joint training.
New Evaluation Protocol: Standardizes the OAD field by introducing stronger features (DinoV2), event-centric metrics, and a new benchmark (CrossTask).

Limitations & Future Work

It still relies on pre-extracted frame features (e.g., ResNet-50, I3D), and end-to-end training remains unexplored.
The lengths of near-past and near-future contexts must be manually tuned for each dataset.
The anticipation gain on EK100 is small (+0.3 points), likely due to the extremely long-tailed distribution of fine-grained actions in this dataset.
The near-future generated from compressed long-term memory may discard certain temporal details, losing information relative to predictions based on short-term memory.
Learnable per-frame sample weighting, as an alternative to treating all short-term frames uniformly, has not been explored.

vs LSTR: LSTR pioneered the long/short-term memory framework; CMeRT introduces near-past/near-future contexts on top of it, yielding significant performance gains.
vs TeSTra: TeSTra enhances streaming efficiency but fails to solve the unbalanced context issue; CMeRT resolves this limitation by supplying the near-past.
vs MAT: MAT introduces conditional circular interaction (CCI) to unify detection and anticipation, but CCI causes non-causal leakage; CMeRT's memory refinement avoids this issue.
vs JOAAD: JOAAD is a recent state-of-the-art (72.6% on TH'14); CMeRT outperforms it at 73.2% while maintaining a simpler methodology.

Rating

Novelty: ⭐⭐⭐⭐ The diagnosis of the training-inference discrepancy is profound and novel, and the designs for near-past/near-future are systematic, though the overall framework is an incremental improvement over existing memory-based methods.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covering three datasets, detection and anticipation tasks, comprehensive ablations (lengths, distances, feature types, efficiency), along with new benchmarks and protocols.
Writing Quality: ⭐⭐⭐⭐⭐ Excellent visual analysis in the problem diagnosis section, rigorous logical derivation, and a natural transition from observations to the proposed method.
Value: ⭐⭐⭐⭐ Provides a systematic solution to the training-inference inconsistency in OAD and helps modernize the field's evaluation protocols.