AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

Conference: ACL 2026 arXiv: 2604.20135 Code: None Area: Multi-Modal Representation Learning Keywords: e-commerce retrieval, fine-grained representation learning, attribute generation, reinforcement learning alignment, contrastive learning

TL;DR

This paper proposes the AFMRL framework, which formulates fine-grained product understanding in e-commerce as an attribute generation task. An MLLM generates key attributes to enhance contrastive learning (AGCL), while retrieval performance serves as a reward signal that in turn optimizes the attribute generator (RAR), achieving state-of-the-art retrieval performance on large-scale e-commerce datasets.

Background & Motivation

Background: Multi-modal representation learning is evolving from discriminative matching frameworks such as CLIP toward generative large model-based approaches. E-commerce scenarios require distinguishing highly similar products (e.g., "V-neck red dress" vs. "round-neck red dress"), imposing stringent demands on fine-grained understanding.

Limitations of Prior Work: Models such as CLIP behave largely as "bag-of-words" systems that struggle to distinguish compositional semantic differences (e.g., "white T-shirt with blue logo" vs. "blue T-shirt with white logo"). Large model representation methods such as VLM2Vec possess strong reasoning capabilities but are constrained by causal attention: embeddings can be obtained only via global average pooling or the last token, which is incompatible with fine-grained alignment techniques such as RoI-based region alignment.

Key Challenge: The generative capabilities of MLLMs can extract fine-grained attributes, but existing architectural constraints prevent their direct application to fine-grained representation learning. The central challenge is how to translate the understanding capacity of MLLMs into improvements in discriminative representations.

Goal: To leverage the generative capabilities of MLLMs for extracting key product attributes, integrate these attributes into the representation learning process, and ensure that attribute generation is aligned with the ultimate retrieval objective.

Key Insight: Fine-grained understanding is "delegated" to an attribute generator, which serves as an intermediate bridge to indirectly enhance the fine-grained discriminative capability of the representation model.

Core Idea: A two-stage training pipeline — first, attribute-guided contrastive learning is used to mine hard negatives; then, retrieval results serve as reward signals to optimize the attribute generator via RL, forming a self-improving closed loop.

Method

Overall Architecture

AFMRL employs two independent models: a representation model (VLM2Vec, responsible for generating discriminative embeddings) and an attribute generator (MLLM, responsible for extracting key attributes). Training proceeds in two stages: Stage 1 uses attributes to guide contrastive learning for training the representation model; Stage 2 freezes the trained representation model and provides retrieval rewards to the attribute generator, optimizing the generation policy via GRPO. At inference time, the generator extracts attributes to enrich query inputs, which are then encoded by the representation model for retrieval.
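
To make the inference path concrete, below is a minimal sketch of the retrieval flow. The interfaces are hypothetical stand-ins, not names from the paper: `generator.extract_attributes` for the RL-tuned MLLM, `encoder.encode` for the VLM2Vec-based representation model, and `index.search` for nearest-neighbor lookup.

```python
def afmrl_retrieve(query_text, query_image, generator, encoder, index, k=10):
    """Sketch of AFMRL inference. `generator`, `encoder`, and `index` are
    hypothetical stand-ins for the attribute generator (MLLM), the
    representation model (VLM2Vec-based), and a product embedding index."""
    # 1. The attribute generator extracts key product attributes from the query.
    attributes = generator.extract_attributes(query_text, query_image)
    # 2. The query is enriched with the generated attributes.
    enriched_query = f"{query_text} {attributes}"
    # 3. The representation model encodes the enriched query into an embedding.
    query_emb = encoder.encode(enriched_query, query_image)
    # 4. Standard top-k nearest-neighbor retrieval over product embeddings.
    return index.search(query_emb, k=k)
```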

Key Designs

  1. Attribute-Guided Contrastive Learning (AGCL):

    • Function: Enhances standard InfoNCE contrastive learning using key attributes generated by an MLLM.
    • Mechanism: Two enhancement mechanisms are employed — (a) BM25 computes lexical similarity between query and candidate attributes, converted to importance weights via \(w_{ij} = e^{1+\tanh(B_{ij})}\), assigning greater training attention to lexically similar hard negatives; (b) false negative masking: if a negative sample's similarity to the positive sample exceeds a threshold \(\delta\), it is removed from the negative pool to avoid penalizing semantically correct matches.
    • Design Motivation: Standard InfoNCE has two shortcomings — it cannot exploit complementary matching signals beyond embeddings, and it penalizes false negatives. AGCL addresses both issues simultaneously through attribute information.
  2. Retrieval-Aware Reinforcement (RAR):

    • Function: Directly aligns attribute generation with downstream retrieval performance via reinforcement learning.
    • Mechanism: The representation model trained in Stage 1 is frozen and used as the reward environment. After the generator produces attributes for a query, the representation model performs retrieval using the enriched query, and Recall@k is used directly as the reward signal. GRPO is used to optimize the policy, with KL divergence regularization preventing excessive deviation from the SFT baseline. Invalid outputs receive a penalty of \(\eta=-0.1\).
    • Design Motivation: The attribute generation objective under SFT distillation is misaligned with the ultimate retrieval task. The RL stage directly employs retrieval metrics as rewards, ensuring that generated attributes are maximally useful for retrieval (see the reward sketch after this list).
  3. Cyclic Iterative Training (CIT):

    • Function: The RL-optimized generator feeds back into representation model training.
    • Mechanism: After RL training, the optimized attribute generator regenerates the attributes used for AGCL training, forming a self-improving loop. Significant performance gains are achieved using only 30% of the training samples.
    • Design Motivation: The quality of the attribute generator and the representation model are mutually dependent; iterative training can escape local optima arising from initialization.
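
The following is a minimal sketch of the RAR reward from item 2, assuming the frozen Stage-1 encoder is exposed as `encoder.encode` and that `index.search` returns the ids of the top-k retrieved products; the validity check is a simplified stand-in, since the paper does not spell out its exact form.

```python
def rar_reward(query_text, attributes, positive_id, encoder, index, k=50, eta=-0.1):
    """Retrieval-aware reward for GRPO (sketch). `encoder` is the frozen
    Stage-1 representation model; `index` is the product embedding index."""
    # Simplified validity check: invalid (here, empty) generations receive
    # the fixed penalty eta = -0.1 reported in the paper.
    if not attributes or not attributes.strip():
        return eta
    # Enrich the query with the generated attributes, then retrieve.
    query_emb = encoder.encode(f"{query_text} {attributes.strip()}")
    topk_ids = index.search(query_emb, k=k)
    # Recall@k reduces to a 0/1 hit signal when each query has one positive;
    # k = 50 is the value the paper identifies as balancing sparsity and saturation.
    return 1.0 if positive_id in topk_ids else 0.0
```

In the CIT loop, this reward trains the generator (Stage 2), after which the improved generator regenerates attributes for another round of AGCL training (Stage 1).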

Loss & Training

The AGCL loss in Stage 1 is a weighted InfoNCE objective: \(\mathcal{L}_{\text{AGCL}} = -\log \frac{w_{ii} \cdot e^{s_{ii}/\tau}}{w_{ii} \cdot e^{s_{ii}/\tau} + \sum_{j \in \mathcal{N}_i} w_{ij} \cdot e^{s_{ij}/\tau}}\). Stage 2 uses the GRPO objective, incorporating clipped ratios and a KL regularization term. The representation model is initialized from Qwen2-VL-2B and fine-tuned with LoRA; the attribute generator is initialized from Qwen2.5-VL-3B and fine-tuned with full parameters.
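
Below is a minimal PyTorch sketch of the AGCL objective above, assuming precomputed in-batch similarity matrices; the hyperparameter values are illustrative, not the paper's.

```python
import torch

def agcl_loss(sim, bm25, pos_neg_sim, tau=0.05, delta=0.9):
    """Weighted InfoNCE with BM25 importance weights and false-negative masking.
    sim:         [B, B] embedding similarities s_ij (query i vs. candidate j)
    bm25:        [B, B] BM25 lexical similarities B_ij between generated attributes
    pos_neg_sim: [B, B] similarity of candidate j to query i's positive (for masking)
    tau / delta: temperature and false-negative threshold (illustrative values)
    """
    # Importance weights w_ij = exp(1 + tanh(B_ij)): lexically similar
    # hard negatives contribute more to the denominator.
    w = torch.exp(1.0 + torch.tanh(bm25))
    # Negative pool: off-diagonal entries whose similarity to the positive
    # stays below delta; likely false negatives are masked out.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_mask = (~eye) & (pos_neg_sim <= delta)
    exp_logits = torch.exp(sim / tau)
    pos = w.diagonal() * exp_logits.diagonal()
    neg = (w * exp_logits * neg_mask).sum(dim=1)
    return (-torch.log(pos / (pos + neg))).mean()
```

For reference, Stage 2 instantiates the standard GRPO objective (notation follows the original GRPO formulation; the paper's may differ slightly):

\[
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big) \right] - \beta\, D_{\text{KL}}\big(\pi_\theta \,\Vert\, \pi_{\text{ref}}\big), \qquad \hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}
\]

where \(r_i(\theta)\) is the policy ratio against the old policy and \(R_i\) is the retrieval reward of the \(i\)-th sampled generation.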

Key Experimental Results

Main Results

Fine-grained retrieval:

| Model | Recall@1 | Recall@5 | Recall@10 |
| --- | --- | --- | --- |
| CLIP | 14.98 | 23.07 | 27.59 |
| FG-CLIP | 31.44 | 49.78 | 68.38 |
| VLM2Vec | 48.05 | 64.26 | 69.65 |
| + AGCL | 51.06 | 68.08 | 73.52 |
| + AGCL + Distill Gen. | 52.42 | 71.00 | 76.26 |
| AFMRL (Full) | 54.28 | 72.19 | 77.27 |

Ablation Study

| Configuration | Accuracy | NMI | ARI | Purity |
| --- | --- | --- | --- | --- |
| Baseline VLM2Vec | 87.67 | 87.04 | 44.39 | 73.16 |
| + AGCL | 87.80 | 87.11 | 44.44 | 73.24 |
| + AGCL + Distilled Generator | 87.98 | 87.63 | 46.24 | 74.21 |
| + AGCL + RL Policy | 88.00 | 87.68 | 46.61 | 74.52 |
| + CIT (Cyclic Iteration) | 89.13 | 88.97 | 47.40 | 75.98 |

Key Findings

  • Each component provides clear incremental gains: AGCL → +3.01 R@1, distilled attributes → +1.36 R@1, RL alignment → +1.86 R@1.
  • An emergent "generation conciseness" behavior is observed during RL training: the length of generated attributes consistently decreases, indicating that the model learns to capture only the most essential attributes for retrieval.
  • Recall@50 is identified as the optimal \(k\) value, balancing reward sparsity and saturation.
  • AGCL prevents the model from converging prematurely to local optima, providing a more robust representation space.

Highlights & Insights

  • The "attributes as bridge" design is elegant — by routing information through textual attributes, the framework circumvents the limitations imposed by the causal attention mechanism of MLLMs on fine-grained alignment, effectively translating generative capacity into discriminative capability.
  • Using retrieval performance as the RL reward signal to form a closed-loop optimization is more direct than conventional surrogate losses. This paradigm is transferable to any "generation-assisted discrimination" scenario.
  • The automatic shortening of attribute length during RL training is an intriguing emergent phenomenon, demonstrating that RL genuinely learns "what information is useful for retrieval."

Limitations & Future Work

  • The RL policy exhibits an "alignment tax" — over-optimizing Recall@k may degrade general representation quality.
  • Validation is currently limited to e-commerce datasets; generalizability to other fine-grained retrieval scenarios remains to be explored.
  • The attribute generator introduces additional inference overhead, requiring a trade-off between accuracy and efficiency.
  • The convergence behavior and optimal number of iterations for cyclic iterative training have not been thoroughly analyzed.

Comparison with Related Work

  • vs. VLM2Vec: VLM2Vec directly uses MLLMs for representation but lacks fine-grained signals; AFMRL supplements fine-grained information through attribute generation.
  • vs. FG-CLIP: FG-CLIP relies on region-level annotations for fine-grained alignment, which is incompatible with MLLM architectures; AFMRL replaces region features with attribute text.
  • vs. General GRPO: This work extends GRPO from reasoning tasks to retrieval alignment, with a more compact reward design.

Rating

  • Novelty: ⭐⭐⭐⭐ The combined design of attribute-guided contrastive learning and retrieval-reward RL is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated on large-scale e-commerce datasets with sufficient ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Framework description is clear and illustrations are intuitive.
  • Value: ⭐⭐⭐⭐ Directly applicable to fine-grained e-commerce retrieval.