Cycle-Consistent Tuning for Layered Image Decomposition¶
Conference: CVPR 2026
arXiv: 2602.20989
Code: None (Project page available)
Area: Image Decomposition / Image Editing
Keywords: Image Decomposition, Cycle Consistency, Diffusion Models, LoRA Fine-tuning, In-Context Learning
TL;DR¶
A cycle-consistent fine-tuning framework built on diffusion models achieves image layer separation (e.g., logo-object decomposition) by jointly training decomposition and synthesis within a shared model. A progressive self-improving data augmentation strategy enables robust decomposition in scenarios with non-linear layer interactions.
Background & Motivation¶
Image decomposition (splitting an image into semantically or physically meaningful layers) is a classic problem in CV and CG:

- Traditional methods (e.g., intrinsic decomposition) are limited to linear interactions (alpha blending) and struggle with non-linear coupling such as lighting, perspective distortion, and material reflections.
- Separating logos from product photos involves global non-linear interactions (shadows, perspective deformation, surface reflections).
- Existing generative editing methods (e.g., ICEdit, Flux-Kontext) can remove logos but struggle to accurately isolate and extract them.
- Decomposition is an ill-posed problem (more unknowns than inputs) and requires additional constraints.
Core Idea: Decomposition is the inverse process of synthesis. By simultaneously learning decomposition and synthesis while imposing cycle-consistency constraints, the determinism of synthesis is used to constrain the uncertainty of decomposition.
Method¶
Overall Architecture¶
The method builds on FLUX.1-Fill-dev, a pre-trained diffusion inpainting model, adapting it to decomposition tasks via lightweight LoRA fine-tuning. It adopts an In-Context Learning (ICL) paradigm: the input is a three-panel grid image (Composite / Logo / Clean Object), and the model learns to separate the two layers from the composite.
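A minimal sketch of how the three-panel grid and its inpainting mask might be assembled; the function name, tensor layout, and mode flags are illustrative assumptions rather than the authors' code.

```python
import torch

def make_icl_grid(composite, logo, clean, mode="decompose"):
    """Tile the three panels horizontally and build the inpainting mask.

    composite, logo, clean: (C, H, W) image tensors.
    mode="decompose": composite panel is given; logo + clean panels are generated.
    mode="compose":   logo + clean panels are given; composite panel is generated.
    Layout and naming are assumptions for illustration.
    """
    grid = torch.cat([composite, logo, clean], dim=-1)   # (C, H, 3W)
    mask = torch.zeros(1, grid.shape[1], grid.shape[2])  # 0 = keep, 1 = generate
    w = composite.shape[-1]
    if mode == "decompose":
        mask[..., w:] = 1.0   # generate the logo and clean-object panels
    else:
        mask[..., :w] = 1.0   # generate the composite panel
    return grid, mask
```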
Key Designs¶
- Cycle-Consistent Decomposition-Synthesis Framework: Simultaneously learns a decomposition function \(\mathcal{F}_D(I)=\langle A,B\rangle\) and a synthesis function \(\mathcal{F}_C(\langle A,B\rangle)=I\), sharing the same LoRA parameter space. During training it runs bidirectionally: (1) from \(I\), decompose to get \(\langle A',B'\rangle\) and re-synthesize back to \(I'\); (2) from \(A,B\), synthesize to get \(I^*\) and re-decompose back to \(\langle A^*,B^*\rangle\). The cycle-consistency loss aligns both directions (a sketch of one direction appears after this list):
$$\mathcal{L}_{cyc} = \mathbb{E}\left[\|v_\theta(x_{t_1}^I, M_D, t_1, \tau_D) - v_\theta(x_{t_1}^{I^*}, M_D, t_1, \tau_D)\|_2^2\right] + \mathbb{E}\left[\|v_\theta(x_{t_2}^{\langle A,B\rangle}, M_C, t_2, \tau_C) - v_\theta(x_{t_2}^{\langle A',B'\rangle}, M_C, t_2, \tau_C)\|_2^2\right]$$
This allows decomposition and synthesis to supervise each other, reducing reliance on densely labeled data.
- Progressive Self-Improving Data Collection: Addresses the scarcity of logo-object decomposition training data in three stages (an illustrative loop appears after this list): (a) seed data: 100 manually labeled triplets plus GPT-4o-assisted labeling to train an initial IC-LoRA; (b) iterative data generation: candidate triplets are generated with the current IC-LoRA, high-quality samples are filtered with Qwen-VL, and the model is retrained to improve generation stability round by round; (c) cycle-model self-improvement: the cycle-consistent model runs decomposition-resynthesis cycles on new composites, and high-quality resynthesized samples are added to the training set. The selection rate improved consistently from round 1 to round 10.
- Flow Matching-based ICL Training: Fine-tunes the LoRA parameters of FLUX.1-Fill-dev using a flow matching loss (see the masked-loss sketch after this list):
$$\mathcal{L}_{rec} = \mathbb{E}_{x,t}\left[\left\|v_\theta(x_t, M, t, \tau) - \frac{\partial x_t}{\partial t}\right\|_2^2\right]$$
A mask \(M\) distinguishes regions to be generated (ones) from regions to be preserved (zeros), enabling visual ICL with a single input and multiple outputs.
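A minimal PyTorch sketch of the masked flow-matching objective above, assuming the rectified-flow path \(x_t = (1-t)\,x_0 + t\,\epsilon\), whose target velocity is \(\epsilon - x_0\); the signature of `v_theta` and the mask-weighted loss are assumptions.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(v_theta, x0, mask, prompt_emb):
    """Masked flow-matching loss on a (B, C, H, W) grid, with mask (B, 1, H, W)."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)  # timesteps ~ U(0, 1)
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * eps        # linear interpolant between data and noise
    target = eps - x0                   # dx_t/dt along this path
    v = v_theta(x_t, mask, t.flatten(), prompt_emb)  # predicted velocity
    # Supervise only the masked (to-be-generated) panels; an assumption
    # about how the in-context grid is trained.
    return F.mse_loss(v * mask, target * mask)
```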
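One direction of \(\mathcal{L}_{cyc}\) as a sketch: the model's velocity prediction on a real grid is aligned with its prediction on the cycle-reconstructed grid at a shared timestep. Pairing both grids with the same noise draw is an assumption about how \(x_{t}^{I}\) and \(x_{t}^{I^*}\) are formed.

```python
import torch
import torch.nn.functional as F

def cycle_direction_loss(v_theta, grid_real, grid_cycled, mask, prompt_emb):
    """Align velocity predictions on a real grid and its cycle reconstruction."""
    b = grid_real.shape[0]
    t = torch.rand(b, device=grid_real.device).view(b, 1, 1, 1)
    eps = torch.randn_like(grid_real)   # shared noise so only the grids differ
    v_real = v_theta((1 - t) * grid_real + t * eps, mask, t.flatten(), prompt_emb)
    v_cyc = v_theta((1 - t) * grid_cycled + t * eps, mask, t.flatten(), prompt_emb)
    return F.mse_loss(v_real, v_cyc)

# The full loss sums both directions of the paper's formula:
# L_cyc = cycle_direction_loss(v_theta, I_grid,  I_star_grid,   M_D, tau_D) \
#       + cycle_direction_loss(v_theta, AB_grid, AB_prime_grid, M_C, tau_C)
```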
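An illustrative loop for the progressive data strategy; `generate_triplets`, `qwen_vl_filter`, and `finetune_lora` are hypothetical placeholders standing in for the paper's IC-LoRA generation, Qwen-VL filtering, and LoRA retraining steps.

```python
def self_improving_rounds(model, seed_triplets, source_images, n_rounds=10):
    """Grow the training set round by round from ~100 seed triplets.
    All three helper functions are hypothetical stand-ins."""
    dataset = list(seed_triplets)
    for r in range(1, n_rounds + 1):
        candidates = generate_triplets(model, source_images)  # current IC-LoRA
        kept = [t for t in candidates if qwen_vl_filter(t)]   # VLM quality gate
        print(f"round {r}: selection rate {len(kept) / max(len(candidates), 1):.1%}")
        dataset += kept                                       # keep high-quality triplets
        model = finetune_lora(model, dataset)                 # retrain for next round
    return model, dataset
```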
Loss & Training¶
- Total Loss = Flow matching reconstruction loss + Cycle-consistency loss (written out after this list).
- Decomposition and synthesis share the same LoRA parameters to improve parameter efficiency and stabilize training.
- The self-improving data cycle uses Qwen-VL for automatic filtering combined with simple manual checks.
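Written out, a plausible form of the combined objective (the weighting \(\lambda\) is an assumption; the paper's exact coefficient is not given here):

$$\mathcal{L}_{total} = \mathcal{L}_{rec} + \lambda\,\mathcal{L}_{cyc}$$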
Key Experimental Results¶
Main Results¶
| Method | Logo VQAScore↑ | Object VQAScore↑ | VLMScore Mean↑ |
|---|---|---|---|
| AssetDropper | 0.42 | — | — |
| ICEdit | 0.31 | 0.31 | 2.55 |
| Flux-Kontext | 0.40 | 0.32 | 3.79 |
| Gemini | 0.42 | 0.32 | 4.20 |
| Ours | 0.43 | 0.31 | 4.22 |
Evaluated on 1.5K synthetic test samples, the method achieves the best logo extraction quality (Logo VQAScore 0.43) and the highest overall VLMScore (4.22).
Ablation Study¶
| Configuration | Effect Description |
|---|---|
| Round 0 IC-LoRA only | Poor separation quality, severe logo residue |
| + Iterative data generation | Decomposition significantly improved |
| + Cycle-consistency | Significant improvement in logo fidelity |
| + Self-improvement process (Full) | Further improvement in object consistency and realism |
Generalization experiments: on intrinsic decomposition (MAW dataset), the method reaches 0.57 Intensity / 3.54 Chromaticity, close to specialized SOTA methods.
Key Findings¶
- Cycle consistency is the largest single contributor to decomposition quality improvement.
- The selection rate of high-quality samples in the self-improving data strategy grows continuously across rounds (from ~20% to >60%).
- In user studies, the method was ranked first in over 50% of cases.
- The framework generalizes to different tasks like intrinsic decomposition and foreground-background decomposition.
Highlights & Insights¶
- The insight that "decomposition and synthesis are dual processes" is elegant—using a deterministic process (synthesis) to constrain an ill-posed problem (decomposition).
- The progressive data bootstrapping starts from only 100 seed samples and gradually expands the high-quality training set, demonstrating high data efficiency.
- A single LoRA encodes both decomposition and synthesis capabilities, ensuring parameter efficiency.
- Unlike manipulation-based methods (e.g., Attend-and-Excite) that intervene in the base model's attention at inference time, this method leaves the base model's weights untouched and adds only LoRA adapters.
Limitations & Future Work¶
- Performance degrades when overlapping elements dominate the frame (e.g., large-scale wall advertisements).
- Currently supports only two-layer decomposition and cannot handle multiple overlapping logos.
- Constrained by the grid paradigm of ICL; architectural adjustments are needed to scale to more layers.
- Training data is biased toward product logos; additional adaptation is required for other types of overlapping elements (e.g., watermarks, stickers).
Related Work & Insights¶
- Comparison with AssetDropper: The latter uses reward-driven optimization to extract assets but cannot recover the underlying object.
- Difference from DecompDiffusion: The latter trains independent models for different layers, while ours shares the same model.
- The idea of cycle consistency could be extended to motion/lighting/multimodal decomposition.
- The progressive self-improving data strategy provides valuable reference for data-scarce scenarios.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of cycle consistency and self-improving data strategy is novel; the decomposition-synthesis dual perspective is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive quantitative, qualitative, ablation, user study, and generalization experiments.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with smooth logic from motivation to verification.
- Value: ⭐⭐⭐ The target application (logo extraction) is relatively niche, but the framework's underlying philosophy is broadly generalizable.