
Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation

Conference: CVPR 2026 arXiv: 2604.01859 Code: None Area: Image Segmentation Keywords: Temporal Action Segmentation, Boundary Supervision, Segment-Level Regularization, CDF Loss, Plug-and-Play

TL;DR

This paper proposes a lightweight dual-loss training framework for temporal action segmentation (TAS) that requires only one additional boundary output channel and two auxiliary losses—a boundary regression loss and a CDF segment shape regularization loss. The framework consistently improves F1 and Edit scores across three architectures (MS-TCN, C2F-TCN, and FACT), demonstrating that precise segmentation can be achieved through simple loss design rather than heavier architectural modifications.

Background & Motivation

Background: TAS has evolved from single-path models such as MS-TCN to multi-path frameworks such as ASRF and FACT, which employ auxiliary modules for boundary refinement. While effective, these approaches increase computational complexity and parameter count. Frame-level cross-entropy loss remains the standard training objective, yet it frequently causes over-segmentation and boundary misalignment.

Limitations of Prior Work: (1) Frame-level classification losses lack segment-level structural constraints, leading to unstable predictions at action transitions and thus over-segmentation. (2) Existing boundary modeling methods—e.g., ASRF's dual-branch design and BCN's cascaded framework—require dedicated auxiliary branches, increasing architectural complexity and inference cost. (3) Post-processing refinement methods such as ASOT operate outside the training objective and do not directly influence the model's representation learning.

Key Challenge: How can boundary awareness and segment-level structural constraints be incorporated into TAS models without introducing significant architectural complexity?

Goal: To design an architecture-agnostic training-time augmentation scheme that improves segmentation quality across multiple TAS architectures with minimal modifications—specifically, one additional output channel and two auxiliary losses.

Key Insight: Boundary supervision and segment-level regularization can be injected as pure training objectives without dedicated architectural branches. A key insight is to decouple the loss assignment between boundary and non-boundary regions, thereby reducing optimization conflicts.

Core Idea: (1) A single-channel boundary regression head directly predicts action boundary positions from model outputs. (2) A CDF segment shape regularization loss constrains the predicted cumulative distribution within each segment to match the ground truth at the segment level, rather than aligning predictions only at the frame level.

Method

Overall Architecture

Building upon an existing TAS backbone (e.g., MS-TCN), the proposed method adds only one boundary output channel. During training, two auxiliary losses supplement the original model loss: the boundary regression loss \(\mathcal{L}_B\), applied within boundary-adjacent regions, and the CDF segment regularization loss \(\mathcal{L}_{CDF}\), applied within non-boundary regions. This decoupling along the temporal axis avoids optimization conflicts. At inference, only the original classification outputs are used, incurring no additional computation.

Key Designs

  1. Boundary Regression Loss \(\mathcal{L}_B\):
     • Function: Promotes precise temporal localization of action transition boundaries.
     • Mechanism: An additional single-channel output predicts a class-agnostic boundary probability curve. The supervision signal is a binary boundary mask (set to 1 at GT class-transition positions, 0 elsewhere). The loss is computed only within a temporal window around each boundary.
     • Design Motivation: Unlike ASRF's dual-branch design, this approach requires only one extra channel without introducing auxiliary branches or post-processing. Boundary-aware representations are learned during training with no additional inference overhead.
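As a concrete sketch, the boundary regression loss could look like the following. The function name, tensor shapes, and the choice of binary cross-entropy on the windowed mask are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def boundary_regression_loss(boundary_logits, labels, window=4):
    """Hypothetical sketch of L_B.

    boundary_logits: (T,) raw scores from the extra output channel.
    labels: (T,) ground-truth class indices per frame.
    window: half-width of the region around each boundary where
            the loss is applied.
    """
    T = labels.shape[0]
    # Binary boundary target: 1 at GT class-transition frames, 0 elsewhere.
    target = torch.zeros(T)
    transitions = (labels[1:] != labels[:-1]).nonzero(as_tuple=True)[0] + 1
    target[transitions] = 1.0
    # Restrict supervision to a temporal window around each boundary.
    mask = torch.zeros(T, dtype=torch.bool)
    for t in transitions.tolist():
        mask[max(0, t - window): t + window + 1] = True
    if not mask.any():
        # No transitions in this clip: contribute zero loss.
        return boundary_logits.sum() * 0.0
    return F.binary_cross_entropy_with_logits(boundary_logits[mask], target[mask])
```

The windowed mask is what keeps boundary supervision local, so frames deep inside a segment are left to the other losses.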

  2. CDF Segment Shape Regularization Loss \(\mathcal{L}_{CDF}\):
     • Function: Constrains the temporal structure of predictions at the segment level to reduce over-segmentation.
     • Mechanism: For each GT segment, the cumulative distribution function (CDF) of predicted probabilities within that segment is computed and aligned with the uniform CDF of the GT. Deviations between the two CDFs are penalized, drawing inspiration from the relationship between the 1D Wasserstein distance and CDFs, but applied here as a straightforward segment-level shape constraint.
     • Design Motivation: Frame-level cross-entropy only requires each frame to be predicted correctly, without regard for temporal coherence within a segment. CDF regularization enforces a uniformly increasing cumulative distribution within each segment, encouraging intra-segment coherence over fragmentation.
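A minimal sketch of this idea, assuming a per-segment L1 penalty between the normalized predicted CDF and the uniform GT CDF (the exact normalization and distance used in the paper may differ):

```python
import torch

def cdf_segment_loss(probs, labels):
    """Hypothetical sketch of L_CDF.

    probs: (T, C) per-frame class probabilities (e.g., after softmax).
    labels: (T,) ground-truth class indices.
    For each GT segment, the normalized cumulative sum of the predicted
    probability of that segment's class is matched against the uniform
    (linearly increasing) CDF, in the spirit of the 1D Wasserstein
    distance between distributions.
    """
    losses = []
    start = 0
    T = labels.shape[0]
    for t in range(1, T + 1):
        if t == T or labels[t] != labels[start]:
            c = labels[start]
            p = probs[start:t, c]
            # Predicted CDF, normalized so it ends near 1 within the segment.
            pred_cdf = torch.cumsum(p, dim=0) / (p.sum() + 1e-8)
            # Uniform GT CDF rises linearly from 1/n to 1 over the segment.
            n = t - start
            gt_cdf = torch.arange(1, n + 1, dtype=probs.dtype) / n
            losses.append((pred_cdf - gt_cdf).abs().mean())
            start = t
    return torch.stack(losses).mean()
```

When the predicted probability of the GT class is flat across a segment, the two CDFs coincide and the penalty vanishes; fragmented, spiky predictions bend the cumulative curve away from the uniform line and are penalized.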

  3. Decoupled Loss Assignment Strategy:
     • Function: Assigns boundary loss and segment regularization loss to distinct temporal regions.
     • Mechanism: The temporal axis is partitioned into boundary and non-boundary regions. \(\mathcal{L}_B\) is computed exclusively within boundary regions; \(\mathcal{L}_{CDF}\) is computed exclusively within non-boundary regions.
     • Design Motivation: The optimization objectives at boundaries and within segment interiors are inherently conflicting—boundaries require rapid class switching, while interiors require stable predictions. Decoupled assignment allows each objective to operate independently, preventing mutual interference.
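The region partition above can be sketched as a small helper that derives complementary boolean masks from the labels; the function name and window convention are assumptions for illustration.

```python
import torch

def region_masks(labels, window=4):
    """Partition the temporal axis into boundary and non-boundary regions
    (hypothetical helper). Returns complementary boolean masks so that
    L_B is applied only near GT transitions and L_CDF only in segment
    interiors."""
    T = labels.shape[0]
    boundary = torch.zeros(T, dtype=torch.bool)
    transitions = (labels[1:] != labels[:-1]).nonzero(as_tuple=True)[0] + 1
    for t in transitions.tolist():
        boundary[max(0, t - window): t + window + 1] = True
    return boundary, ~boundary
```

Because the two masks are complementary by construction, no frame receives gradients from both auxiliary losses, which is the point of the decoupling.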

Loss & Training

Total loss = original model loss + \(\lambda_B \cdot \mathcal{L}_B\) + \(\lambda_{CDF} \cdot \mathcal{L}_{CDF}\). The method is a purely training-time augmentation; no additional computation or post-processing is required at inference.

Key Experimental Results

Main Results

| Model   | Dataset      | F1@10 Gain      | Edit Gain       | Acc Change |
| ------- | ------------ | --------------- | --------------- | ---------- |
| MS-TCN  | GTEA         | +5.4%           | +4.6%           | Negligible |
| MS-TCN  | 50Salads     | Improved        | Improved        | Negligible |
| MS-TCN  | Breakfast    | Improved        | Improved        | Negligible |
| C2F-TCN | All datasets | Consistent gain | Consistent gain | Negligible |
| FACT    | All datasets | Consistent gain | Consistent gain | Negligible |

Ablation Study

| Configuration                      | F1              | Edit            | Note                              |
| ---------------------------------- | --------------- | --------------- | --------------------------------- |
| Baseline (no auxiliary loss)       | Baseline        | Baseline        | Original model                    |
| + \(\mathcal{L}_B\) only           | Improved        | Improved        | Boundary awareness helps          |
| + \(\mathcal{L}_{CDF}\) only       | Improved        | Improved        | Segment structure constraint helps |
| + Both (decoupled)                 | Best            | Best            | Decoupled combination is optimal  |
| + Both (non-decoupled)             | Below decoupled | Below decoupled | Validates necessity of decoupling |

Key Findings

  • Frame-level accuracy (Acc) remains largely unchanged while F1 and Edit scores improve substantially, indicating that gains stem from segment-level coherence and boundary precision rather than frame-level classification capability.
  • The largest improvements occur on MS-TCN, the simplest architecture (+5.4% F1@10), suggesting that simpler models benefit most from structural constraints.
  • The method is complementary to post-processing refinement approaches such as ASOT—it optimizes learned representations during training, while inference-time post-processing can optionally be applied on top.
  • Decoupled loss assignment is critical; without decoupling, the two losses conflict and degrade overall performance.

Highlights & Insights

  • The philosophy of "replacing architectural design with loss design" is highly pragmatic. Consistent improvements across three distinct architectures using only one extra channel and two auxiliary losses demonstrate strong generalizability.
  • The CDF segment shape regularization is an elegant design choice—expressing segment-level structural constraints as a CDF matching problem is both concise and effective. Its connection to the Wasserstein distance provides theoretical grounding.
  • Although simple, the decoupled loss assignment is crucial: boundaries and segment interiors represent fundamentally different optimization targets, and region-wise separation prevents opposing gradient signals from interfering with each other.

Limitations & Future Work

  • Experiments are conducted only on three standard TAS benchmarks; performance on larger-scale or more complex datasets (e.g., Assembly101) remains unknown.
  • The boundary window size is a manually specified hyperparameter that has a non-trivial effect on results.
  • The method contributes little to frame-level accuracy, limiting its value in settings where frame-level metrics are the primary concern.
  • Future directions include exploring adaptive boundary/non-boundary region partitioning and extending the CDF constraint to multi-granularity temporal windows.
Comparison with Related Work

  • vs. ASRF: ASRF also introduces boundary supervision but requires a dual-branch architecture and post-processing; the proposed method uses only a single extra channel and auxiliary losses.
  • vs. BCN: BCN requires a cascaded framework with a Barrier Generation Module, entailing substantially greater complexity.
  • vs. ASOT: ASOT is a post-processing method and is complementary to the proposed approach: this work optimizes the training objective, while ASOT refines inference outputs.

Rating

  • Novelty: ⭐⭐⭐ The CDF regularization is original, though the overall contribution is an integration of existing ideas.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three architectures, three datasets, and complete ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear, the method is concise, and figures are intuitive.
  • Value: ⭐⭐⭐⭐ A lightweight plug-and-play solution with practical utility for the TAS community.