# Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation
**Conference:** ICLR 2026 · **arXiv:** 2602.11440 · **Code:** To be confirmed · **Area:** 3D Vision / Visual Generation · **Keywords:** object manipulation, diffusion models, geometric consistency, camera pose control, image editing
## TL;DR

This paper proposes Ctrl&Shift, an end-to-end diffusion framework that decomposes object manipulation into object removal and reference-guided inpainting, injects relative camera pose control, and thereby achieves, for the first time, geometry-consistent, fine-grained object manipulation without relying on explicit 3D reconstruction.
## Background & Motivation
Background: Object-level manipulation (repositioning and rotating objects while preserving scene realism) is a fundamental operation in film post-production, AR, and creative editing. Mainstream approaches fall into two camps: geometry-based methods (manipulation after NeRF/3DGS reconstruction) and diffusion-based methods (text/trajectory-conditioned editing).
Limitations of Prior Work:

- Geometry-based methods (NeRF/3DGS) provide precise control but require explicit 3D reconstruction, incurring high per-scene optimization costs and generalizing poorly
- Diffusion-based methods (DragAnything, VACE, etc.) generalize well but lack fine-grained geometric control and cannot precisely specify object pose transformations
- No existing method simultaneously achieves background preservation, geometry-consistent viewpoint transformation, and user-controllable transformations
Key Challenge: A fundamental trade-off exists between geometric precision and generalization capability
Goal: Achieve geometry-consistent, fine-grained, controllable object manipulation without explicit 3D reconstruction
Key Insight: Rather than lifting content into 3D for editing, inject precise viewpoint control directly into the 2D diffusion process
Core Idea: Decompose object manipulation into three sub-tasks—removal, reference-guided inpainting, and camera pose control—and learn them jointly within a unified diffusion framework through multi-task, multi-stage training
## Method

### Overall Architecture
Inputs: source image/video frame + reference object image + source mask + target mask + relative camera pose descriptor. Output: target frame with the object moved/rotated to the target position/viewpoint. The architecture is based on a ControlNet-style DiT, injecting conditioning signals through a control branch, with camera pose injected via cross-attention.
### Key Designs
- **Task Decomposition and Multi-Task Training**
    - Function: Decompose object manipulation into three separable tasks for joint training
    - Mechanism (see the conditioning sketch after this item):
        - Primary task: Full object manipulation, i.e., remove the object from its source location and re-synthesize it at the target location with the target viewpoint
        - Auxiliary task 1 (object removal): Set the reference image to white, the target mask to all-zero, and the pose outside the frame, learning to cleanly remove objects
        - Auxiliary task 2 (reference inpainting + camera control): Set the source mask to all-zero and the input to the clean background frame, learning to synthesize the reference object at a specified pose
        - The task weight ratio is 8:1:1; each conditioning signal thus has an explicit functional role
    - Design Motivation: The five conditioning signals (source frame, reference image, source mask, target mask, camera pose) are highly entangled; the multi-task strategy explicitly disentangles the contribution of each signal
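A minimal sketch of how this task sampling could be wired, assuming PyTorch tensors for every condition; the task names, dictionary keys, and the `build_condition` helper are illustrative, not the authors' code:

```python
import random
import torch

TASKS = ["manipulation", "removal", "ref_inpaint"]
WEIGHTS = [8, 1, 1]  # primary : auxiliary 1 : auxiliary 2

def build_condition(sample: dict) -> dict:
    """Sample a task and assemble the five conditioning signals accordingly
    (illustrative; tensor layout and key names are assumptions)."""
    task = random.choices(TASKS, weights=WEIGHTS, k=1)[0]
    src_frame, ref_img = sample["src_frame"], sample["ref_img"]
    src_mask, tgt_mask = sample["src_mask"], sample["tgt_mask"]
    pose = sample["rel_pose"]  # 8-dim relative pose descriptor

    if task == "removal":
        ref_img = torch.ones_like(ref_img)     # white reference image
        tgt_mask = torch.zeros_like(tgt_mask)  # all-zero target mask
        pose = sample["oof_pose"]              # pose placed outside the frame
    elif task == "ref_inpaint":
        src_mask = torch.zeros_like(src_mask)  # all-zero source mask
        src_frame = sample["bg_frame"]         # clean background as input

    return {"src_frame": src_frame, "ref_img": ref_img,
            "src_mask": src_mask, "tgt_mask": tgt_mask, "pose": pose}
```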
- **Relative Camera Pose Encoding**
    - Function: Encode the geometric transformation from the source viewpoint to the target viewpoint
    - Mechanism: A look-at camera model is adopted, parameterizing each viewpoint as \((\text{yaw}, \text{pitch}, d, r_x, r_y)\). The axis-angle representation of the relative rotation matrix \(\text{aa}(\mathbf{R}_{rel})\), the relative translation \(\mathbf{t}_{rel}\), and the NDC offset \((\Delta r_x, \Delta r_y)\) are concatenated into an 8-dimensional descriptor \(\mathbf{f} \in \mathbb{R}^8\)
    - After Fourier positional encoding and MLP projection, the descriptor is mapped to 8 tokens (\(d = 4096\)) and injected into the DiT via cross-attention (see the encoder sketch after this item)
    - Design Motivation: Relative pose is more intuitive than absolute pose (adjustments are made relative to the input frame, analogous to dragging) and avoids the difficulty of defining a canonical absolute pose
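A sketch of how the descriptor could be embedded, assuming a standard sin/cos Fourier encoding; only the 8-dim input and the 8 output tokens at \(d = 4096\) come from the paper, while the band count and MLP width are assumptions:

```python
import torch
import torch.nn as nn

class RelPoseEncoder(nn.Module):
    """Map the 8-dim descriptor f = (aa(R_rel), t_rel, dr_x, dr_y) to
    8 cross-attention tokens of width 4096 (band count / width assumed)."""

    def __init__(self, dim: int = 4096, n_tokens: int = 8, n_bands: int = 16):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_bands) * torch.pi)
        in_dim = 8 * n_bands * 2  # sin and cos per band, per descriptor channel
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, dim), nn.SiLU(), nn.Linear(dim, n_tokens * dim)
        )
        self.n_tokens, self.dim = n_tokens, dim

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, 8) -> Fourier features: (B, 8 * n_bands * 2)
        x = f[..., None] * self.freqs
        x = torch.cat([x.sin(), x.cos()], dim=-1).flatten(1)
        # -> (B, 8, 4096) pose tokens, consumed by DiT cross-attention
        return self.mlp(x).view(-1, self.n_tokens, self.dim)
```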
- **Mask Encoding Strategy**
    - Function: Align binary masks with the VAE latent space
    - Mechanism: Instead of encoding masks through the VAE (which risks treating binary semantics as appearance), space-to-depth (pixel unshuffle) is applied to downsample each mask directly to the VAE stride (see the sketch after this item)
    - At inference time, the target mask is approximated by scaling and translating the bounding box of the source mask
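Both the space-to-depth encoding and the inference-time bounding-box approximation are easy to sketch; the stride of 8 and the exact bbox heuristic below are assumptions:

```python
import torch
import torch.nn.functional as F

def encode_mask(mask: torch.Tensor, stride: int = 8) -> torch.Tensor:
    """Space-to-depth (pixel unshuffle) a binary mask onto the VAE latent grid:
    (B, 1, H, W) -> (B, stride**2, H/stride, W/stride); values stay binary."""
    return F.pixel_unshuffle(mask, downscale_factor=stride)

def approx_target_mask(src_mask, scale=1.0, dx=0, dy=0):
    """Inference-time target mask: scale and translate the bounding box of
    the source mask (the exact heuristic is an assumption)."""
    ys, xs = torch.nonzero(src_mask[0, 0], as_tuple=True)
    cy = (ys.min() + ys.max()).item() / 2 + dy   # shifted bbox center
    cx = (xs.min() + xs.max()).item() / 2 + dx
    h = (ys.max() - ys.min()).item() * scale / 2  # scaled half-extents
    w = (xs.max() - xs.min()).item() * scale / 2
    tgt = torch.zeros_like(src_mask)
    tgt[0, 0, max(0, int(cy - h)):int(cy + h) + 1,
              max(0, int(cx - w)):int(cx + w) + 1] = 1.0
    return tgt
```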
- **Two-Stage Training**
    - Stage I (synthetic data): Pre-training on ~2M synthetic image pairs with white backgrounds and random camera poses, learning object priors and pose representations; the backbone and control branch are updated jointly
    - Stage II (real data): Fine-tuning on 100K high-quality real image/video pairs with the backbone frozen and only the control branch updated, focusing on background preservation and photorealism (see the freezing sketch after this item)
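The stage switch amounts to toggling which parameters receive gradients; a minimal sketch, assuming `backbone` and `control_branch` are the two `nn.Module`s:

```python
import torch.nn as nn

def configure_stage(backbone: nn.Module, control_branch: nn.Module, stage: int):
    """Stage I (stage=1): backbone and control branch train jointly.
    Stage II (stage=2): backbone frozen, only the control branch updates."""
    for p in backbone.parameters():
        p.requires_grad = (stage == 1)
    for p in control_branch.parameters():
        p.requires_grad = True
```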
- **Data Construction Pipeline**
    - Function: Automatically construct training pairs with pose annotations from real images
    - Mechanism: Hunyuan3D-2 reconstructs the object mesh → differentiable rendering estimates the source camera pose (filtered by IoU ≥ 0.90) → target poses are sampled and rendered → MiniMax-Remover produces clean backgrounds → an object-pasting network performs harmonized compositing (see the pipeline sketch after this list)
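The pipeline can be summarized as pseudocode; `reconstruct_mesh`, `estimate_pose`, `remove_object`, `sample_target_poses`, `render`, and `paste_object` are hypothetical wrappers around Hunyuan3D-2, the differentiable renderer, MiniMax-Remover, and the object-pasting network, not real APIs:

```python
def build_training_pairs(image, obj_mask, n_targets=4, iou_thresh=0.90):
    """Illustrative data-construction pipeline (all helpers are hypothetical)."""
    mesh = reconstruct_mesh(image, obj_mask)        # Hunyuan3D-2
    src_pose, iou = estimate_pose(mesh, obj_mask)   # differentiable rendering
    if iou < iou_thresh:                            # filter bad reconstructions
        return []
    background = remove_object(image, obj_mask)     # MiniMax-Remover
    pairs = []
    for tgt_pose in sample_target_poses(src_pose, n_targets):
        rendering, tgt_mask = render(mesh, tgt_pose)
        target = paste_object(background, rendering, tgt_mask)  # harmonization
        pairs.append((image, target, src_pose, tgt_pose, obj_mask, tgt_mask))
    return pairs
```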
### Loss & Training

Flow-matching training is adopted with the linear path \(\mathbf{z}_t = (1-t)\mathbf{z}_0 + t\boldsymbol{\varepsilon}\), whose target velocity is \(\mathbf{v}^*(\mathbf{z}_t, t) = \boldsymbol{\varepsilon} - \mathbf{z}_0\), and the velocity-matching loss \(\|\mathbf{v}_\theta(\mathbf{z}_t, \mathbf{c}, t) - \mathbf{v}^*(\mathbf{z}_t, t)\|_2^2\), where \(\mathbf{c}\) denotes the conditioning signals.
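A minimal training step consistent with this objective (batch shapes and the model signature are illustrative):

```python
import torch

def flow_matching_loss(model, z0, cond):
    """One step of velocity matching on the linear path z_t = (1-t)*z0 + t*eps,
    whose target velocity is v* = eps - z0."""
    eps = torch.randn_like(z0)
    t = torch.rand(z0.shape[0], device=z0.device).view(-1, 1, 1, 1)
    z_t = (1 - t) * z0 + t * eps
    v_pred = model(z_t, cond, t.flatten())  # v_theta(z_t, c, t)
    return (v_pred - (eps - z0)).pow(2).mean()
```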
## Key Experimental Results

### Main Results
Zero-shot evaluation on ObjectMover-A:
| Method | PSNR↑ | DINO↑ | CLIP↑ | DreamSim↓ |
|---|---|---|---|---|
| ObjectMover | 25.27 | 85.07 | 93.16 | 0.142 |
| Ctrl&Shift | 28.69 | 88.07 | 93.58 | 0.075 |
GeoEditBench (proposed benchmark for geometry-aware editing evaluation):
| Method | PSNR↑ | DINO↑ | Pose MAPE↓ | Obj IoU↑ |
|---|---|---|---|---|
| VACE | 24.32 | 75.38 | 30.56% | 0.72 |
| Nano-Banana | 26.38 | 78.05 | 24.36% | 0.78 |
| Ctrl&Shift | 28.71 | 85.23 | 17.70% | 0.83 |
### Ablation Study
- Removing Stage I: Pose MAPE increases from 17.70% to 32.50%, severely degrading geometric understanding
- Removing Stage II: PSNR drops from 28.71 to 24.83, with degraded background preservation and visual quality
- Removing Auxiliary Task 1: CLIP score drops to 86.32, impairing semantic consistency
- Removing Auxiliary Task 2: Obj IoU drops to 0.65 and Pose MAPE rises to 28.60%; object-level precision is affected most severely
## Highlights & Insights
- A key conceptual breakthrough: geometry-consistent object manipulation without 3D reconstruction
- The multi-task decomposition strategy is elegant, with each task isolating the contribution of individual conditioning signals
- The data construction pipeline is scalable and supports real-world images and videos
- GeoEditBench provides a systematic evaluation protocol for geometry-aware editing
## Limitations & Future Work
- The approximate target mask estimation at inference (bounding box scaling and translation) may be inaccurate under extreme transformations
- The model is built on the Wan-1.3B backbone; this modest scale may limit performance on complex scenes
- Currently supports only single-object manipulation; multi-object collaborative editing remains unexplored
- Data construction relies on Hunyuan3D-2 and an object pasting model, inheriting errors from these components
- Video manipulation capability is demonstrated but lacks sufficient quantitative evaluation
## Related Work & Insights
- vs. DragAnything: A trajectory-conditioned diffusion method with poor generalization and no pose control
- vs. VACE: Preserves background well but effectively translates the entire frame rather than genuinely manipulating objects
- vs. Nano-Banana/Qwen-Image-Edit: High generation quality but imprecise camera pose control driven by text instructions
- vs. 3DiT/GeoDiffuser: These rely on 3D reconstruction or explicit geometric conditions, which limits generalization
- vs. ObjectMover: A video-prior method; Ctrl&Shift achieves +3.42 PSNR and halves DreamSim
The approach of injecting 3D geometric control without performing 3D reconstruction is generalizable to other editing tasks. The multi-task disentanglement training strategy is worth adopting in other multi-condition generation settings. Relative pose encoding is better suited to interactive editing scenarios than absolute pose.
## Rating
- Novelty: ⭐⭐⭐⭐ (conceptual innovation in task decomposition + pose injection)
- Experimental Thoroughness: ⭐⭐⭐⭐ (multiple benchmarks + ablations + proposed benchmark)
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐ (first to unify geometric precision and diffusion generalization)