Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification

Conference: CVPR 2026
arXiv: 2603.19678
Code: https://github.com/zhoujiahuan1991/CVPR2026-VLADR
Area: Person Understanding
Keywords: Lifelong Person Re-Identification, Vision-Language Model, Attribute Disentanglement, Cross-Modal Alignment, Forgetting Mitigation

TL;DR

VLADR leverages fine-grained attribute knowledge from vision-language models (VLMs) to enhance lifelong person re-identification. Through a two-stage training pipeline comprising Multi-grain Text Attribute Disentanglement (MTAD) and Inter-domain Cross-modal Attribute Reinforcement (ICAR), the framework explicitly models human body attributes shared across domains to enable effective knowledge transfer and forgetting mitigation, surpassing the state of the art by 1.9%–2.2% in anti-forgetting performance and 2.1%–2.5% in generalization.

Background & Motivation

Background: Lifelong Person Re-Identification (LReID) requires a model to continuously learn from streaming data across diverse domains, constructing a unified person retrieval system. Unlike standard ReID, LReID faces catastrophic forgetting—knowledge about previously seen domains tends to be lost when acquiring knowledge from new ones. Existing approaches primarily build on visually pre-trained classification models and employ strategies such as knowledge distillation, prototype memory, and distribution modeling to mitigate forgetting.

Limitations of Prior Work: Although vision-language models such as CLIP have demonstrated strong generalization capabilities, directly adapting VLMs to LReID reveals a critical shortcoming: existing methods focus exclusively on global representation learning, neglecting the utilization of fine-grained attribute knowledge. Global representations are susceptible to domain-specific noise such as background clutter and illumination variation, whereas human body attributes (e.g., clothing color, body shape, accessories) serve as semantically stable anchors across domains and have been substantially underexplored.

Key Challenge: The central challenge of LReID lies in the tension between acquiring new knowledge and retaining old knowledge. Global representations are sensitive to domain shift, leading to the classic stability-plasticity dilemma. Fine-grained attributes (e.g., "red shirt, black backpack") are domain-invariant semantic descriptors that theoretically can serve as bridges for cross-domain knowledge transfer—yet existing methods lack an explicit mechanism to model them.

Goal: The paper aims to design a VLM-driven LReID framework that (1) explicitly disentangles global and local human body attributes, (2) leverages cross-modal attribute alignment for fine-grained knowledge transfer, and (3) mitigates forgetting via cross-domain attribute alignment.

Key Insight: The authors observe that human body attributes exhibit cross-domain shareability—the semantics of "wearing blue jeans" remain consistent regardless of whether the image comes from Market-1501 or MSMT17. Explicitly disentangling these shared attributes and establishing cross-modal alignment provides natural anchors for cross-domain knowledge transfer.

Core Idea: A VLM (BLIP) is used to automatically generate multi-granularity textual attribute descriptions for pedestrian images. Learnable prompts then disentangle global and local attributes in the text space. Cross-modal and cross-domain multi-level alignment subsequently injects attribute knowledge into the visual encoder, enabling attribute-guided lifelong learning.

Method

Overall Architecture

VLADR adopts a two-stage training pipeline. Stage 1 (MTAD) performs multi-grain text attribute disentanglement on the CLIP text encoder side, learning prompt representations for global and local attributes. Stage 2 (ICAR) freezes the Stage 1 prompt weights and fine-tunes the CLIP image encoder using pre-extracted textual descriptions, achieving knowledge transfer through cross-modal attribute alignment and cross-domain attribute alignment. The backbone architecture builds on the CLIP-ReID and DASK frameworks.
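As a rough orientation, the two-stage split might look like the following PyTorch sketch. Everything here is illustrative: the linear layers stand in for the actual CLIP encoders, and the prompt shapes, the value of K, and the optimizer choices are assumptions rather than details from the released code.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the CLIP encoders; the real method builds on the
# CLIP-ReID / DASK backbones, which are not reproduced here.
text_encoder = nn.Linear(512, 512)   # CLIP text encoder, frozen throughout
image_encoder = nn.Linear(512, 512)  # CLIP image encoder, tuned in Stage 2

K = 4  # number of local attribute prompts (a predefined hyperparameter)
global_prompt = nn.Parameter(torch.randn(1, 512))
local_prompts = nn.Parameter(torch.randn(K, 512))

# Stage 1 (MTAD): optimize only the prompts against BLIP-generated captions.
text_encoder.requires_grad_(False)
stage1_optimizer = torch.optim.Adam([global_prompt, local_prompts])

# Stage 2 (ICAR): freeze the Stage-1 prompts and fine-tune the image
# encoder with cross-modal and cross-domain attribute alignment.
global_prompt.requires_grad_(False)
local_prompts.requires_grad_(False)
stage2_optimizer = torch.optim.Adam(image_encoder.parameters())
```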

Key Designs

  1. Multi-grain Text Attribute Disentanglement (MTAD):

    • Function: Disentangles global features and multiple local attribute features from textual descriptions of pedestrian images.
    • Mechanism: BLIP is first used to automatically generate a textual description for each pedestrian image (e.g., "a person wearing a red shirt and blue jeans, carrying a black backpack"). Two sets of learnable prompts are then designed: (a) a global prompt capturing the overall appearance of the pedestrian; and (b) multiple local attribute prompts, each corresponding to a different attribute dimension (e.g., upper-body color, trouser type, accessories). The CLIP text encoder processes the concatenation of descriptions and prompts, and an attention mechanism guides each prompt to focus on different attribute segments within the description, achieving attribute disentanglement.
    • Design Motivation: Global representations tend to conflate domain-specific information with identity-discriminative information. After disentanglement, local attributes constitute domain-invariant semantic units that serve as the minimal shareable elements for cross-domain knowledge transfer.
  2. Inter-domain Cross-modal Attribute Reinforcement (ICAR):

    • Function: Injects the attribute knowledge disentangled in Stage 1 into the visual encoder and enables cross-domain knowledge transfer.
    • Mechanism: Two alignment mechanisms are employed. (a) Cross-modal attribute alignment: visual encoder features are aligned with the textual attribute features extracted in Stage 1, compelling the visual model to learn attribute-semantically consistent visual representations. A visual-textual matching loss is computed for each attribute dimension independently, guiding the visual encoder to extract visual attribute features corresponding to each local attribute prompt. (b) Cross-domain attribute alignment: when training on a new domain, prototype representations of the same attributes from previously seen domains serve as anchors, constraining the new domain's attribute representations from drifting excessively. Attribute-level (rather than instance-level) knowledge distillation enables fine-grained knowledge preservation in the attribute space.
    • Design Motivation: Cross-modal alignment ensures the visual encoder understands attribute semantics rather than relying on domain-specific cues; cross-domain alignment operates at attribute granularity rather than global granularity, enabling more precise knowledge preservation.
  3. Attribute-level Prototype Memory Bank:

    • Function: Stores representative representations for each attribute dimension across all learned domains to support cross-domain attribute alignment.
    • Mechanism: A prototype vector is maintained for each attribute dimension of every previously learned domain, updated via exponential moving average. When learning a new domain, the new domain's attribute representations are aligned with the corresponding attribute prototypes from old domains to enforce consistency in the attribute space. Attribute-level prototypes are more compact and efficient than instance-level samples and better capture the distributional patterns of each attribute (see the sketch after this list).
    • Design Motivation: Compared to traditional exemplar replay, attribute prototypes (1) do not require storing large quantities of old-domain samples, (2) operate directly in the semantic attribute space rather than raw feature space, and (3) naturally provide alignment anchors through cross-domain shared attributes.
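As referenced in the memory-bank item above, the following is a minimal sketch of how such an attribute-level prototype bank might be maintained. The EMA update and per-domain storage follow the description above, but the class name, the momentum value, and the tensor layout are assumptions, not taken from the released code.

```python
import torch

class AttributePrototypeBank:
    """Per-domain, per-attribute prototypes updated by exponential moving average."""

    def __init__(self, num_attributes: int, dim: int, momentum: float = 0.9):
        self.K = num_attributes
        self.dim = dim
        self.m = momentum          # EMA momentum (illustrative value)
        self.banks = {}            # domain_id -> (K, dim) prototype tensor

    @torch.no_grad()
    def update(self, domain_id: int, attr_feats: torch.Tensor) -> None:
        """attr_feats: (B, K, dim) visual attribute features from one batch."""
        batch_proto = attr_feats.mean(dim=0)  # (K, dim) per-attribute mean
        if domain_id not in self.banks:
            self.banks[domain_id] = batch_proto.clone()
        else:
            old = self.banks[domain_id]
            self.banks[domain_id] = self.m * old + (1.0 - self.m) * batch_proto

    def old_prototypes(self, current_domain: int) -> torch.Tensor:
        """Stack prototypes of previously seen domains as alignment anchors.

        Assumes at least one old domain has already been learned.
        """
        protos = [p for d, p in self.banks.items() if d != current_domain]
        return torch.stack(protos)  # (num_old_domains, K, dim)
```

Compared with exemplar replay, the bank stores only K vectors per domain, which is what makes its memory footprint compact.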

Loss & Training

Stage 1 Losses:

  • Text–image matching loss: ensures global and local attribute prompts align with corresponding textual descriptions.
  • Attribute orthogonality loss: encourages different local attribute prompts to attend to distinct attribute dimensions, reducing redundancy.
  • Standard cross-entropy and triplet losses for identity classification.
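The exact form of the orthogonality term is not spelled out in this summary. One common formulation, given here purely as an assumption, penalizes the pairwise cosine similarity between the K local prompt embeddings:

```python
import torch
import torch.nn.functional as F

def attribute_orthogonality_loss(prompts: torch.Tensor) -> torch.Tensor:
    """prompts: (K, dim) local attribute prompt embeddings, K >= 2.

    Penalizes off-diagonal cosine similarity so that different prompts
    attend to distinct attribute dimensions. This exact form is an
    assumption; the summary only states that such a loss is used.
    """
    p = F.normalize(prompts, dim=-1)
    sim = p @ p.t()                                         # (K, K) cosine similarities
    off_diag = sim - torch.eye(p.size(0), device=p.device)  # zero out self-similarity
    return off_diag.pow(2).sum() / (p.size(0) * (p.size(0) - 1))
```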

Stage 2 Losses:

  • Cross-modal attribute alignment loss: \(\mathcal{L}_{\text{CMA}} = \sum_{k=1}^{K} \text{dist}(\mathbf{v}_k, \mathbf{t}_k)\), aligning the \(k\)-th visual attribute feature \(\mathbf{v}_k\) with the corresponding textual attribute feature \(\mathbf{t}_k\).
  • Inter-domain attribute alignment loss: \(\mathcal{L}_{\text{IDA}} = \sum_{k=1}^{K} \text{dist}(\mathbf{v}_k^{\text{new}}, \mathbf{p}_k^{\text{old}})\), aligning new-domain attribute representations with old-domain attribute prototypes.
  • Standard ReID losses (cross-entropy + triplet).

The two stages are trained separately; Stage 2 loads the prompt checkpoint from Stage 1 together with pre-extracted BLIP textual descriptions.
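Putting the two Stage-2 alignment terms together, a minimal sketch follows, using cosine distance for \(\text{dist}(\cdot,\cdot)\); the choice of distance, the tensor layout, and the batch reduction are assumptions, since the summary leaves them unspecified:

```python
import torch
import torch.nn.functional as F

def cosine_dist(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Broadcasts over leading dimensions; returns per-attribute distances.
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def stage2_alignment_losses(v_attr, t_attr, old_protos):
    """v_attr:     (B, K, dim) visual attribute features on the new domain
    t_attr:     (B, K, dim) pre-extracted Stage-1 textual attribute features
    old_protos: (K, dim)    attribute prototypes from old domains
    Returns (L_CMA, L_IDA), each summed over the K attribute dimensions.
    """
    # Cross-modal: align the k-th visual feature with the k-th text feature.
    l_cma = cosine_dist(v_attr, t_attr).sum(dim=-1).mean()
    # Inter-domain: align new-domain features with old-domain prototypes.
    l_ida = cosine_dist(v_attr, old_protos.unsqueeze(0)).sum(dim=-1).mean()
    return l_cma, l_ida
```

The total Stage-2 objective would then add the standard cross-entropy and triplet ReID losses, with loss weights that this summary does not specify.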

Key Experimental Results

Main Results

| Metric | VLADR | Prev. SOTA (KRKC) | Gain |
|---|---|---|---|
| Anti-forgetting mAP (Setting 1) | ~52.3% | ~50.1% | +2.2% |
| Anti-forgetting Rank-1 (Setting 1) | ~68.2% | ~66.3% | +1.9% |
| Generalization mAP (Setting 1) | ~34.5% | ~32.0% | +2.5% |
| Generalization Rank-1 (Setting 1) | ~49.8% | ~47.7% | +2.1% |
| Anti-forgetting mAP (Setting 2) | ~48.7% | ~46.8% | +1.9% |
| Generalization mAP (Setting 2) | ~31.2% | ~28.8% | +2.4% |

Consistent improvements are achieved under both standard LReID evaluation settings. The anti-forgetting metric measures average performance across all seen domains; the generalization metric measures transfer ability to unseen domains.

Ablation Study

| Configuration | Anti-forgetting mAP | Generalization mAP | Notes |
|---|---|---|---|
| Baseline (CLIP-ReID + LReID) | 47.1% | 28.5% | Direct VLM adaptation |
| + MTAD (Stage 1 only) | 49.6% | 30.8% | Attribute disentanglement is effective |
| + CMA (cross-modal alignment) | 51.0% | 32.3% | Attribute knowledge injected into visual encoder |
| + IDA (cross-domain alignment) | 52.3% | 34.5% | Attribute-level knowledge transfer mitigates forgetting |
| Global repr. + IDA (no attr. disentanglement) | 49.2% | 30.1% | Global granularity inferior to attribute granularity |
| Random attribute partition (replacing MTAD) | 48.8% | 29.7% | Learned disentanglement outperforms random split |

Key Findings

  • MTAD attribute disentanglement is the foundation of the performance gain: +2.5% anti-forgetting mAP and +2.3% generalization mAP.
  • Cross-modal attribute alignment (CMA) effectively injects textual attribute knowledge into the visual encoder, yielding further gains of +1.4% / +1.5%.
  • Inter-domain attribute alignment (IDA) provides fine-grained knowledge preservation during new-domain learning, contributing additional improvements of +1.3% / +2.2%.
  • Attribute-granularity knowledge transfer substantially outperforms global-granularity transfer: global + IDA achieves only 49.2% anti-forgetting mAP, whereas attribute-level + IDA reaches 52.3%.
  • Consistent improvement trends across both LReID settings confirm the robustness of the proposed method.

Highlights & Insights

  • The idea of using attributes as cross-domain bridges is well-motivated: human body attributes such as clothing and body shape are naturally shared across domains and serve as more stable minimal semantic units for knowledge transfer than global representations.
  • The two-stage disentanglement–reinforcement design follows a divide-and-conquer principle: Stage 1 focuses on attribute mining in the text space, while Stage 2 focuses on attribute injection into the visual space, with each stage fulfilling a clearly defined role.
  • Automatic attribute description generation via BLIP eliminates the need for manually annotated attribute labels, endowing the method with strong scalability.
  • Attribute-level prototype memory is more compact and efficient than exemplar replay and is more semantically meaningful by operating in the attribute space.
  • The code is publicly available and built on the CLIP-ReID and DASK frameworks, ensuring good reproducibility.

Limitations & Future Work

  • The quality of attribute descriptions depends on BLIP's captioning capability; inaccurate descriptions may be generated for heavily occluded pedestrians or low-quality images.
  • The number of attributes (i.e., the number of local prompts \(K\)) must be predefined, and the optimal value may vary across datasets.
  • The method assumes that attributes are domain-invariant; however, certain attributes (e.g., culturally specific attire) may exhibit domain specificity.
  • Future work could explore extending attribute disentanglement from discrete prompts to a continuous attribute space for more flexible attribute modeling.
  • Incorporating large language models for more fine-grained attribute reasoning (e.g., inferring occupational attributes from "wearing a uniform") is a promising direction.
  • Extending the framework to open-set ReID and text-to-image person retrieval is also a natural next step.
Related Work & Context

  • DASK (AAAI 2025): A preceding work from the same group that mitigates forgetting through distribution rehearsal; VLADR extends it by introducing attribute-level knowledge transfer.
  • CLIP-ReID: The baseline framework that adapts CLIP to the ReID task; VLADR further exploits the fine-grained attribute potential of VLMs.
  • LSTKC (AAAI 2024): Another work from the same group on long-short-term knowledge consolidation; VLADR elevates the abstraction level from raw knowledge to attribute-semantic representations.
  • Continual Learning Community: The attribute-level knowledge transfer paradigm is generalizable to other continual learning tasks such as object detection and semantic segmentation.
  • Broader Insight: In the VLM era, how to better exploit structured knowledge on the language side is a broadly applicable and deeply worthwhile research question.

Rating

  • Novelty: ⭐⭐⭐⭐ (The combination of attribute disentanglement and cross-domain reinforcement is innovative, though individual components are relatively mature)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Two settings with complete ablations; large-scale dataset validation is lacking)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure, well-articulated motivation)
  • Value: ⭐⭐⭐⭐ (Provides meaningful inspiration for VLM-driven lifelong learning; the attribute transfer paradigm has good generality)