AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
Conference: ACL 2026
arXiv: 2604.20135
Code: None
Area: Image Generation
Keywords: E-commerce Retrieval, Fine-grained Representation Learning, Attribute Generation, Reinforcement Learning Alignment, Contrastive Learning
TL;DR
AFMRL casts fine-grained understanding of e-commerce products as an attribute generation task. An MLLM generates key attributes that strengthen contrastive learning (AGCL), and retrieval performance serves as the reward for optimizing the attribute generator with reinforcement learning (RAR). The authors report SOTA retrieval performance on large-scale e-commerce datasets.
Background & Motivation
Background: Multimodal representation learning is evolving from discriminative matching frameworks like CLIP toward generative large model-based approaches. E-commerce scenarios require distinguishing highly similar products (e.g., "V-neck red dress" vs. "round-neck red dress"), necessitating extremely high fine-grained understanding.
Limitations of Prior Work: CLIP-style models behave much like "bag-of-words" systems and struggle with compositional semantic differences (e.g., "white T-shirt with blue logo" vs. "blue T-shirt with white logo"). LLM-based approaches such as VLM2Vec reason well, but their causal attention means an embedding is obtained only via global average pooling or from the final token, which makes them incompatible with region-level fine-grained alignment techniques such as RoI-based alignment.
Key Challenge: The generative capacity of MLLMs can extract fine-grained attributes, but existing architectures limit their direct application to fine-grained representation learning. How can the understanding capabilities of MLLMs be converted into improvements for discriminative representation?
Goal: To utilize MLLM generation capabilities to extract key product attributes and integrate these attributes into the representation learning process while ensuring attribute generation aligns with the final retrieval objective.
Key Insight: Offload fine-grained understanding to an attribute generator, using the generated attributes as a bridge to indirectly enhance the fine-grained discriminative power of the representation model.
Core Idea: A two-stage training approach—first using attributes to guide contrastive learning and mine hard negatives, then using retrieval results as reward signals to optimize the attribute generator via RL, forming a self-improving closed loop.
Method
Overall Architecture
AFMRL employs two independent models: a representation model (VLM2Vec, responsible for generating discriminative embeddings) and an attribute generator (MLLM, responsible for extracting key attributes). Training consists of two stages: Stage 1 uses attributes to guide the contrastive learning of the representation model; Stage 2 uses the frozen representation model to provide retrieval rewards for the attribute generator, optimizing the generation policy via GRPO. During inference, the generator extracts attributes to enrich the query input, which the representation model then encodes for retrieval.
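The inference path described above (generate attributes, enrich the query, encode, retrieve) can be sketched with toy stand-ins. Everything here is an assumption for illustration: `generate_attributes` is a hypothetical lookup in place of the MLLM generator, and `embed` is a token-count vector in place of the VLM2Vec representation model, so only the data flow mirrors the text.

```python
import math
from collections import Counter


def generate_attributes(query: str) -> str:
    """Hypothetical stand-in for the MLLM attribute generator."""
    # A real generator would read the product image and text; this toy
    # version looks the attributes up in a fixed table.
    lookup = {"red dress photo": "neckline: v-neck; color: red"}
    return lookup.get(query, "")


def embed(text: str) -> Counter:
    """Hypothetical stand-in for the representation model: token counts."""
    cleaned = text.lower().replace(";", " ").replace(":", " ")
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, catalog: list[str], top_k: int = 1) -> list[str]:
    # Enrich the query with generated attributes before encoding it.
    enriched = f"{query} {generate_attributes(query)}".strip()
    q = embed(enriched)
    ranked = sorted(catalog, key=lambda item: cosine(q, embed(item)), reverse=True)
    return ranked[:top_k]


catalog = ["round-neck red dress", "v-neck red dress", "v-neck blue shirt"]
top = retrieve("red dress photo", catalog)
```

In this toy setup the raw query ties between the two red dresses, and the generated `v-neck` attribute is what breaks the tie in favour of the correct item.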
Key Designs
- Attribute-Guided Contrastive Learning (AGCL):
    - Function: Leverages key attributes generated by the MLLM to enhance standard InfoNCE contrastive learning.
    - Mechanism: Two enhancements. (a) BM25 weighting: the lexical similarity \(B_{ij}\) between query and candidate attributes is computed with BM25 and converted into an importance weight \(w_{ij} = e^{1+\tanh(B_{ij})}\), so the model focuses more on lexically similar hard negatives. (b) False negative masking: if a negative sample's similarity to the positive sample exceeds a threshold \(\delta\), it is removed from the negative pool to avoid penalizing semantically correct matches.
    - Design Motivation: Standard InfoNCE does not use complementary signals outside the embeddings and penalizes false negatives. AGCL addresses both issues with attribute information.
- Retrieval-Aware Attribute Reinforcement (RAR):
    - Function: Directly aligns attribute generation with downstream retrieval performance through reinforcement learning.
    - Mechanism: The representation model trained in Stage 1 is frozen and serves as the reward environment. After the generator produces attributes for a query, the representation model retrieves with the enriched query, and Recall@k is used directly as the reward. The policy is optimized with GRPO, with KL-divergence regularization to keep it from drifting too far from the SFT initialization. Invalid outputs receive a penalty of \(\eta=-0.1\).
    - Design Motivation: The attribute generation objective in SFT distillation is disconnected from the final retrieval task. The RL stage uses retrieval metrics as rewards so that the generated attributes are the ones most helpful for retrieval.
- Cyclic Iterative Training (CIT):
    - Function: The RL-optimized generator feeds back into the representation model training.
    - Mechanism: After RL training, the optimized attribute generator provides new attributes for another round of AGCL training, forming a self-improvement cycle. This round uses only 30% of the training samples and still improves performance noticeably.
    - Design Motivation: The quality of the attribute generator and that of the representation model are interdependent; iterative training helps escape the local optimum set by the initialization.
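The RAR reward described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the reward is plain Recall@k with the stated penalty \(\eta=-0.1\) for invalid outputs, and the advantage uses the common GRPO form of standardizing rewards within a group of samples for the same query. The clipped policy update and KL term are omitted.

```python
import statistics

INVALID_PENALTY = -0.1  # eta from the text


def retrieval_reward(ranked_ids, relevant_ids, k, output_is_valid=True):
    """Recall@k of one retrieval run, or the penalty for an invalid output."""
    if not output_is_valid:
        return INVALID_PENALTY
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group sampled for the same query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled attribute strings for one query lead to four rankings from the
# frozen representation model; the third sample produced an invalid output.
relevant = ["item_b"]
rankings = [
    (["item_a", "item_b", "item_c"], True),
    (["item_c", "item_a", "item_b"], True),
    ([], False),
    (["item_b", "item_a", "item_c"], True),
]
rewards = [retrieval_reward(r, relevant, k=2, output_is_valid=ok) for r, ok in rankings]
advantages = group_relative_advantages(rewards)
```

Samples whose attributes place the relevant item inside the top k get a positive advantage, while the miss and the invalid output are pushed down, with the invalid one penalized most.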
Loss & Training
The AGCL loss in Stage 1 is a weighted InfoNCE:
\[\mathcal{L}_{\text{AGCL}} = -\log \frac{w_{ii} \cdot e^{s_{ii}/\tau}}{w_{ii} \cdot e^{s_{ii}/\tau} + \sum_{j \in \mathcal{N}_i} w_{ij} \cdot e^{s_{ij}/\tau}}\]
where \(s_{ij}\) is the similarity between query \(i\) and candidate \(j\), \(\tau\) is the temperature, and \(\mathcal{N}_i\) is the negative pool for query \(i\). Stage 2 uses the GRPO objective with a clipped probability ratio and a KL regularization term. The representation model is initialized with Qwen2-VL-2B and fine-tuned with LoRA; the attribute generator is initialized with Qwen2.5-VL-3B and undergoes full-parameter fine-tuning.
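To make the AGCL pieces concrete, here is a small pure-Python sketch that combines BM25 weighting, false negative masking, and the weighted InfoNCE loss for a single query. Several things are assumptions rather than details from the paper: the Okapi BM25 variant and its `k1`/`b` defaults, the temperature `tau`, the threshold values, and the made-up similarity numbers. Which similarity is compared against \(\delta\) is also not specified above, so it is passed in explicitly as `sim_to_positive`.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of one query against each tokenized document."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    doc_freq = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in set(query_tokens):
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def agcl_weight(bm25_score):
    """Importance weight w_ij = exp(1 + tanh(B_ij)) from the text."""
    return math.exp(1.0 + math.tanh(bm25_score))


def agcl_loss(sims, weights, pos, sim_to_positive, delta, tau=0.05):
    """Weighted InfoNCE for one query with false negative masking."""
    # N_i: every candidate except the positive and suspected false negatives.
    negatives = [j for j in range(len(sims)) if j != pos and sim_to_positive[j] <= delta]
    pos_term = weights[pos] * math.exp(sims[pos] / tau)
    neg_term = sum(weights[j] * math.exp(sims[j] / tau) for j in negatives)
    return -math.log(pos_term / (pos_term + neg_term))


# Toy batch: candidate 0 is the positive, 1 is a lexically close hard negative,
# 2 is unrelated, 3 is a near-duplicate of the positive (a likely false negative).
query_attrs = ["v-neck", "red", "dress"]
cand_attrs = [
    ["v-neck", "red", "dress"],
    ["round-neck", "red", "dress"],
    ["blue", "running", "shoes"],
    ["v-neck", "red", "dress", "midi"],
]
weights = [agcl_weight(s) for s in bm25_scores(query_attrs, cand_attrs)]
sims = [0.80, 0.70, 0.10, 0.78]             # made-up similarities s_ij
sim_to_positive = [1.00, 0.60, 0.05, 0.97]  # made-up similarity to the positive

loss_masked = agcl_loss(sims, weights, 0, sim_to_positive, delta=0.9)
loss_unmasked = agcl_loss(sims, weights, 0, sim_to_positive, delta=1.0)
```

The hard negative (candidate 1) receives a larger weight than the unrelated item, whose BM25 score of zero gives the minimum weight \(e\). Masking candidate 3 lowers the loss because the near-duplicate no longer counts against the positive.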
Key Experimental Results
Main Results
Fine-grained retrieval results (Recall@k):
| Model | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|
| CLIP | 14.98 | 23.07 | 27.59 |
| FG-CLIP | 31.44 | 49.78 | 68.38 |
| VLM2Vec | 48.05 | 64.26 | 69.65 |
| + AGCL | 51.06 | 68.08 | 73.52 |
| + AGCL + Distill Gen. | 52.42 | 71.00 | 76.26 |
| AFMRL (Full) | 54.28 | 72.19 | 77.27 |
Ablation Study
| Configuration | Accuracy | NMI | ARI | Purity |
|---|---|---|---|---|
| Baseline VLM2Vec | 87.67 | 87.04 | 44.39 | 73.16 |
| + AGCL | 87.80 | 87.11 | 44.44 | 73.24 |
| + AGCL + Distilled Gen. | 87.98 | 87.63 | 46.24 | 74.21 |
| + AGCL + RL Policy | 88.00 | 87.68 | 46.61 | 74.52 |
| + CIT (Iterative) | 89.13 | 88.97 | 47.40 | 75.98 |
Key Findings
- Each component provides clear incremental gains: AGCL → +3.01 R@1, Distilled Attributes → +1.36 R@1, RL Alignment → +1.86 R@1.
- "Generative Conciseness" emerged as a behavior during RL training: the length of generated attributes continuously decreased as the model learned to complete retrieval using the most concise attributes.
- Setting \(k = 50\) in the Recall@k reward worked best: a small \(k\) makes the reward sparse, while a large \(k\) saturates it.
- AGCL prevents the model from falling into local optima prematurely, providing a more robust representation space.
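The incremental Recall@1 gains quoted in the first finding can be re-derived from the main results table. The snippet below is only an arithmetic check on those reported numbers; the dictionary keys are the row labels from the table.

```python
# Recall@1 values copied from the main results table.
recall_at_1 = {
    "VLM2Vec": 48.05,
    "+ AGCL": 51.06,
    "+ AGCL + Distill Gen.": 52.42,
    "AFMRL (Full)": 54.28,
}

names = list(recall_at_1)
# Gain of each row over the row directly above it.
gains = {
    later: round(recall_at_1[later] - recall_at_1[earlier], 2)
    for earlier, later in zip(names, names[1:])
}
total_gain = round(recall_at_1["AFMRL (Full)"] - recall_at_1["VLM2Vec"], 2)
```

The three steps come out to 3.01, 1.36, and 1.86 points, for a total of 6.23 points of Recall@1 over the VLM2Vec baseline.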
Highlights & Insights
- The "attribute as a bridge" concept is ingenious—it bypasses the limitations of MLLM causal attention on fine-grained alignment by using textual attributes, converting generative power into discriminative power.
- Using retrieval performance as an RL reward signal creates a closed-loop optimization that is more direct than traditional proxy losses. This approach is transferable to any "generation-assisted discrimination" scenario.
- The automatic shortening of attribute length during RL training is a fascinating emergent phenomenon, indicating that RL is indeed learning "what information is useful for retrieval."
Limitations & Future Work
- The RL strategy involves an "alignment tax"—over-optimizing for Recall@k might damage general representation quality.
- Currently only validated on e-commerce datasets; generalization to other fine-grained retrieval scenarios remains to be explored.
- The attribute generator increases inference overhead, requiring a trade-off between accuracy and efficiency.
- The convergence and optimal iteration count for Cyclic Iterative Training have not yet been analyzed in depth.
Related Work & Insights
- vs VLM2Vec: VLM2Vec directly uses MLLMs for representation but lacks fine-grained signals; AFMRL supplements this with attribute generation.
- vs FG-CLIP: FG-CLIP relies on region-level annotations for fine-grained alignment, which is incompatible with MLLM architectures; AFMRL replaces region features with attribute text.
- vs General GRPO: This work extends GRPO from reasoning tasks to retrieval alignment, with a more compact reward design.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of attribute-guided contrastive learning and retrieval-reward RL is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Validated on large-scale e-commerce datasets with thorough ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear framework description and intuitive illustrations.
- Value: ⭐⭐⭐⭐ Clear practical value for fine-grained retrieval in e-commerce.