LiVOS: Light Video Object Segmentation with Gated Linear Matching

Conference: CVPR 2025
arXiv: 2411.02818
Code: uncbiag/LiVOS
Area: Image Segmentation
Keywords: video object segmentation, linear attention, gated linear matching, memory network, 4096p inference

TL;DR

LiVOS is presented as the first lightweight VOS network to replace softmax attention with gated linear attention for memory matching. It compresses the spatio-temporal attention matrix into a constant-sized 2D state matrix, so memory consumption stays constant for videos of arbitrary length, and 4096p inference fits on a consumer-grade GPU with 32 GB of memory.

Background & Motivation

Background: Semi-supervised VOS is primarily driven by Space-Time Memory (STM) networks, which perform pixel-level matching between the query frame and all memory frames via softmax attention. Representative methods include XMem and Cutie.

Limitations of Prior Work: Softmax matching requires storing an \(\mathcal{O}(HW \times THW)\) attention matrix, so its space complexity grows linearly with video length \(T\) and quadratically with spatial resolution (the number of pixels \(HW\)). As videos become longer or resolutions increase, this leads to slow computation or out-of-memory (OOM) errors.

Key Challenge: The usual workarounds each cost accuracy. Fixed-size memory banks fail under occlusions or rapid motion, while downscaling the resolution loses fine-grained mask details; both trade-offs stem from the cost of softmax matching.

Key Insight: This work identifies softmax matching as the core bottleneck and fundamentally replaces the matching mechanism instead of applying superficial patches.

Core Idea: Reformulating softmax attention into a recurrent form of linear attention, where the attention matrix degenerates into a constant-sized 2D state \(\mathbf{S}_t \in \mathbb{R}^{C_k \times C_v}\), and introducing a data-dependent gating matrix to enhance selectivity.
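
To make the scaling argument concrete, the following sketch counts the entries of the softmax affinity matrix against the entries of the constant-sized state. The 30 x 54 feature grid is an illustrative assumption (a roughly 480p frame at an assumed feature stride of 16), and the helper names are hypothetical; only the \(\mathcal{O}(HW \times THW)\) and \(C_k \times C_v = 64 \times 256\) shapes come from the text above.

```python
def softmax_attention_elems(h, w, t):
    """Entries in the (HW x THW) affinity matrix for t memory frames."""
    return (h * w) * (t * h * w)


def linear_state_elems(c_k=64, c_v=256):
    """Entries in the constant-sized state S_t of shape (C_k x C_v)."""
    return c_k * c_v


# Assumed feature grid: 30 x 54 (illustrative, not taken from the paper).
H, W = 30, 54

for num_memory_frames in (1, 10, 100):
    attn = softmax_attention_elems(H, W, num_memory_frames)
    state = linear_state_elems()
    print(f"T={num_memory_frames:>3}: attention={attn:,} entries, state={state:,} entries")

# Doubling height and width multiplies the affinity matrix by 16,
# while the state size does not change at all.
growth = softmax_attention_elems(2 * H, 2 * W, 1) // softmax_attention_elems(H, W, 1)
print("affinity growth when resolution doubles:", growth)
```

The counts ignore batch size, number of objects, and data type, so they indicate growth rates rather than actual GPU memory usage.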

Method

Overall Architecture

Image encoder (ResNet-50) extracts the key of the query frame.
Mask encoder (ResNet-18) extracts the value of the memory frame.
Gated Linear Matching (Core): Replaces full softmax matching with recurrent updates of a constant-sized state matrix.
Integrates sensory memory (low-level information) and object memory (high-level semantics) to enhance the readout.
Lightweight mask decoder outputs the segmentation results.

Key Designs

1. Linear Matching: From Softmax to Recurrent States
   - Function: Reformulate the softmax matching \(\mathbf{V}_{t+1} = \text{Softmax}(\mathbf{K}_{t+1}\mathbf{K}_{1:t}^T)\mathbf{V}_{1:t}\) as a kernel-function approximation whose readout is \(\phi(\mathbf{K}_{t+1})\mathbf{S}_t\).
   - Mechanism: By the associativity of matrix multiplication, \(\sum_{i=1}^{t} \phi(\mathbf{K}_{t+1})\phi(\mathbf{K}_i)^T\mathbf{V}_i\) is regrouped into \(\phi(\mathbf{K}_{t+1}) \cdot \sum_{i=1}^{t} \phi(\mathbf{K}_i)^T\mathbf{V}_i\). The sum is accumulated recurrently as the state \(\mathbf{S}_t = \mathbf{S}_{t-1} + \phi(\mathbf{K}_t)^T\mathbf{V}_t\), where \(\mathbf{S}_t \in \mathbb{R}^{C_k \times C_v}\) is of constant size. The kernel function \(\phi\) uses row-wise softmax.
   - Design Motivation: The state \(\mathbf{S}_t\) is a 2D matrix with no spatial or temporal axis. Its size depends solely on the feature dimensions (\(64 \times 256\)), making it independent of video length and resolution.
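
The regrouping can be checked numerically. The sketch below is a toy, pure-Python illustration (not the authors' implementation): it accumulates the state frame by frame and compares the readout with the explicit pairwise sum. The tiny dimensions, the sample values, and all helper names are assumptions made for the example.

```python
import math


def softmax_rows(m):
    """Row-wise softmax, used here as the kernel function phi."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def update_state(state, keys, values):
    """S_t = S_{t-1} + phi(K_t)^T V_t, with keys (N x C_k) and values (N x C_v)."""
    return add(state, matmul(transpose(softmax_rows(keys)), values))


def readout(state, query_keys):
    """Readout for the query frame: phi(K_{t+1}) S_t."""
    return matmul(softmax_rows(query_keys), state)


# Toy sizes (assumed): C_k = 2, C_v = 3, two pixels per frame.
C_K, C_V = 2, 3
memory_frames = [
    ([[0.1, 0.9], [1.2, -0.3]], [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0]]),
    ([[0.4, 0.4], [-0.7, 0.2]], [[0.0, 1.0, 1.0], [2.0, -0.5, 0.3]]),
]
query = [[0.3, -0.1], [0.8, 0.5]]

# Recurrent form: only the C_k x C_v state is kept between frames.
state = [[0.0] * C_V for _ in range(C_K)]
for keys, values in memory_frames:
    state = update_state(state, keys, values)
recurrent = readout(state, query)

# Explicit form: build the query-to-memory affinity for every frame.
explicit = [[0.0] * C_V for _ in range(len(query))]
for keys, values in memory_frames:
    affinity = matmul(softmax_rows(query), transpose(softmax_rows(keys)))
    explicit = add(explicit, matmul(affinity, values))

max_diff = max(abs(a - b) for ra, rb in zip(recurrent, explicit) for a, b in zip(ra, rb))
print("max difference between the two forms:", max_diff)
```

Both forms give the same readout up to floating-point error, but the recurrent form never materializes a matrix whose size depends on the number of pixels or frames.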

2. Gated Linear Matching
   - Function: Introduce a data-dependent forget gate \(\mathbf{G}_t\) in the state update to selectively retain or discard historical information.
   - Mechanism: \(\mathbf{S}_t = \mathbf{G}_t \odot \mathbf{S}_{t-1} + \phi(\mathbf{K}_t)^T\mathbf{V}_t\). The gate \(\mathbf{G}_t = \alpha_t \mathbf{1}^T\) is implemented via a low-rank parameterization, where \(\alpha_t \in (0,1)^{C_k}\) is extracted from image encoder features using depthwise convolution + spatial pooling + Sigmoid.
   - Design Motivation: Pure linear matching lacks a selection mechanism, leading to performance degradation on long sequences. Gating provides a forgetting ability similar to GRU/LSTM, so outdated information can be actively discarded in scenarios such as scene transitions and occlusions.
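
A minimal sketch of the gated update, assuming the gate logits are already available. In the method described above they come from depthwise convolution and spatial pooling over image encoder features; here they are hard-coded, hypothetical numbers. Because \(\mathbf{G}_t = \alpha_t \mathbf{1}^T\), the gate scales each row of the state by a single scalar.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_from_logits(logits, c_v):
    """Build G_t = alpha_t 1^T: row r holds sigmoid(logits[r]) repeated c_v times."""
    alpha = [sigmoid(z) for z in logits]
    return [[a] * c_v for a in alpha]


def gated_update(prev_state, gate, kv_update):
    """S_t = G_t * S_{t-1} + U_t (element-wise product), where U_t stands for phi(K_t)^T V_t."""
    return [
        [g * s + u for g, s, u in zip(gate_row, state_row, update_row)]
        for gate_row, state_row, update_row in zip(gate, prev_state, kv_update)
    ]


# Toy state with C_k = 2 rows and C_v = 3 columns (assumed sizes).
prev_state = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Hypothetical logits: row 0 is half forgotten, row 1 is almost fully kept.
gate = gate_from_logits([0.0, 20.0], c_v=3)

# A zero update isolates the effect of the gate on the previous state.
zero_update = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
gated_state = gated_update(prev_state, gate, zero_update)
print(gated_state)
```

With this parameterization the gate adds only \(C_k\) values per frame, rather than a full \(C_k \times C_v\) matrix.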

3. External Memory Fusion
   - Function: Reuse Cutie's sensory memory (element-wise addition fusion of low-level temporal features) and object memory (cross-attention fusion of high-level object semantics).
   - Mechanism: The readout output from linear matching sequentially interacts with sensory memory and the object transformer to compensate for the information lost through constant-sized state compression.
   - Design Motivation: Since the constant-sized state compresses spatio-temporal information, external memories provide complementary high-frequency and semantic information.

Loss & Training

Cross entropy + soft dice loss combined with equal weights.
AdamW optimizer, initial learning rate \(10^{-4}\), batch size 16, weight decay 0.001.
8 frames per batch, cropped to 480x480, trained for 125K iterations.
Image encoder learning rate scaled by 0.1 to mitigate overfitting, gradient clipping \(\tau=3\).
Point-based supervision (12544 points), following the training strategy of Cutie.
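
As a rough illustration of the equal-weight loss listed above, the sketch below combines binary cross entropy and soft dice loss over a handful of sampled points. It is a simplified, single-object stand-in and not the training code: the function names, the smoothing constant, and the sample probabilities are assumptions.

```python
import math


def cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross entropy over sampled points."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)


def soft_dice_loss(probs, targets, smooth=1.0):
    """1 - soft dice coefficient; `smooth` is an assumed stabilizing constant."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(probs, targets):
    """Cross entropy + soft dice with equal weights."""
    return cross_entropy(probs, targets) + soft_dice_loss(probs, targets)


# Four sampled points (the notes above mention 12544 points per sample in training).
targets = [1.0, 0.0, 1.0, 0.0]
good_prediction = [0.9, 0.1, 0.8, 0.2]
poor_prediction = [0.4, 0.6, 0.3, 0.7]

good_loss = combined_loss(good_prediction, targets)
poor_loss = combined_loss(poor_prediction, targets)
print("good:", good_loss, "poor:", poor_loss)
```

Evaluating the loss on sampled points rather than on every pixel keeps the cost of supervision independent of the crop resolution.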

Key Experimental Results

Main Results

Method	STM?	MOSE J&F↑	DAVIS-17 val J&F↑	DAVIS-17 test J&F↑	YouTube-VOS 𝒢↑
RDE	✗	46.8	84.2	77.4	81.9
Cutie-small† (1 frame)	✗	49.3	76.4	71.6	79.0
LiVOS (Ours)	✗	64.8	85.1	-	-
Cutie-small	✓	62.2	87.2	84.1	86.2
Cutie-base	✓	64.0	88.8	84.2	86.1
XMem	✓	56.3	86.2	81.0	85.5

Efficiency Comparison

Metric	LiVOS vs STM Methods
GPU Memory Savings	53%
Memory Growth on Long Videos	Constant (vs linear growth for softmax)
Memory Growth with Resolution	Linear (vs quadratic for softmax)
Max Inferable Resolution	4096p (32 GB GPU)
CPU Latency vs Frame Count	Constant (vs linear growth for softmax)

Key Findings

LiVOS outperforms all non-STM methods and narrows the gap to STM methods: on MOSE it scores 64.8, above Cutie-small (62.2) and Cutie-base (64.0), while on DAVIS-17 val it reaches 85.1 against Cutie-small's 87.2.
Matches STM methods' performance in long video and high-resolution scenarios while saving 53% GPU memory.
4096p inference becomes feasible: STM methods suffer from OOM at high resolutions due to softmax attention, whereas the constant state of LiVOS enables processing on consumer-grade GPUs.
The gating mechanism significantly improves long-sequence performance: In challenging scenarios like scene changes and occlusions, the gated state effectively forgets outdated information.

Highlights & Insights

Extending the softmax-to-linear attention reformulation from text and image classification to VOS, a memory-intensive video task, serves as a strong template.
The insight of the constant-sized state matrix is elegant: a state of size \(C_k \times C_v = 64 \times 256\) summarizes the spatio-temporal information of an arbitrarily long video, at the cost of some compression loss.
The low-rank parameterized design of the gated linear matching is simple and efficient.
It paves the way for the development of foundation models for long-duration, high-resolution videos.

Limitations & Future Work

The constant state incurs information compression loss, resulting in a performance gap on standard short videos.
The gating parameterization adopts the simplest low-rank form, and richer parameterizations can be explored.
The model is trained only on 480p crops and is not optimized for high-resolution training; 4096p inference relies on zero-shot generalization at test time.
In multi-object scenarios, a separate state is maintained for each object, still incurring overhead when the number of objects is very large.
Integration with hardware optimizations such as Flash Attention is not explored.

Related Work & Insights

Cutie enhances XMem's readout quality via the object transformer; LiVOS builds on this by replacing the core matching mechanism.
Gated Linear Attention (GLA) has proven effective in language modeling, and this work extends it to visual memory matching.
Insight: The idea of constant-state compression can be applied to other video tasks requiring temporal memory (e.g., video QA, video generation).

Rating

Novelty: ⭐⭐⭐⭐ First to apply linear attention to VOS memory matching, with a well-motivated gating mechanism.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers MOSE/DAVIS/YouTube-VOS/LVOS, including efficiency and high-resolution experiments.
Writing Quality: ⭐⭐⭐⭐ The derivation process from softmax to linear to gated linear is clear and smooth.
Value: ⭐⭐⭐⭐⭐ Resolves the core scalability bottleneck in VOS, opening a new paradigm for high-resolution long video processing.