ORIDa: Object-Centric Real-World Image Composition Dataset

Conference: CVPR 2025
arXiv: 2506.08964
Code: https://hello-jinwoo.github.io/orida
Area: Diffusion Models / Image Editing
Keywords: Image Composition, Object Insertion, Dataset, Real-World Data, Diffusion Models

TL;DR

ORIDa is presented as the first large-scale, publicly available object composition dataset captured in the real world. It contains over 30,000 images of 200 unique objects, including fact-counterfactual pairs and multiple object positions per scene. The authors validate the dataset by fine-tuning StableDiffusion-Inpaint for object removal and object insertion.

Background & Motivation

Background: Object compositing is the task of placing and blending an object into a target scene. It involves challenges such as identity preservation, color harmonization, shadow generation, and geometric alignment. Current methods fall into two groups: training-free approaches (e.g., FreeCompose) and training-based approaches (e.g., ObjectStitch, ObjectDrop).

Limitations of Prior Work: (1) Training-free methods underperform on fine details, particularly scene harmonization and object identity preservation; (2) Training-based methods mostly rely on synthetic data, which lacks the complexity and diversity of real-world scenes; (3) The most relevant dataset, ObjectDrop, was captured in the real world but contains only 2,500 object pairs, covers only 1 scene and 1 position per object, and is not publicly available.

Key Challenge: The lack of high-quality real-world image composition datasets severely restricts the development of object composition models. Synthetic data fails to capture complex real-world "object-to-scene interaction" effects, such as lighting, shadows, and reflections.

Goal: To build a large-scale, publicly available, real-shot dataset that meets the following criteria: (1) contains a sufficient number of unique objects; (2) shows each object in multiple different scenes; (3) provides fact (object present) and counterfactual (background-only) image pairs; (4) includes multiple position variations of the object in each scene.

Key Insight: A carefully controlled acquisition pipeline (fixed camera parameters, tripod and remote shutter, strict filtering) makes the presence or absence of the object the only variable within a scene, which yields high-quality fact-counterfactual pairs.

Core Idea: Construct the ORIDa dataset (200 objects, 30,000+ images, 50+ scenes per object, 4 object positions per F-CF scene), provide raw DNG files to support ISP augmentation, and show that high-quality object composition models can be trained on real-world data alone, without synthetic data.

Method

Overall Architecture

The ORIDa dataset consists of two components: (1) the Fact-Counterfactual (F-CF) Collection, in which each group has 5 images (1 background-only image plus 4 images with the object in different positions), totaling 5,699 groups; (2) Fact-Only (F-Only) Images, which are single images of objects in specific scenes, totaling 5,035 images. RAW DNG files were captured with 5 different smartphones in PRO mode. To validate the dataset, StableDiffusion-Inpaint is fine-tuned on it for object removal and object insertion.
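The headline image count follows directly from the two components described above. The short check below reproduces it; the variable names are illustrative, and treating every F-CF group and every F-Only image as one distinct scene is an assumption made here only to sanity-check the "50+ scenes per object" figure.

```python
# Counts as quoted in this summary.
FCF_GROUPS = 5_699        # fact-counterfactual groups kept after filtering
IMAGES_PER_GROUP = 5      # 1 background-only image + 4 object positions
F_ONLY_IMAGES = 5_035     # fact-only images kept after filtering
NUM_OBJECTS = 200

fcf_images = FCF_GROUPS * IMAGES_PER_GROUP
total_images = fcf_images + F_ONLY_IMAGES

# Assumption: one F-CF group or one F-Only image corresponds to one scene.
scenes_per_object = (FCF_GROUPS + F_ONLY_IMAGES) / NUM_OBJECTS

print(fcf_images)                    # 28495
print(total_images)                  # 33530
print(round(scenes_per_object, 1))   # 53.7
```

Under that assumption the totals are consistent with the "30,000+ images" and "50+ scenes per object" figures quoted earlier.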

Key Designs

Data Acquisition and Quality Control:
- Function: To ensure that scene changes in the dataset are solely caused by the presence or absence of the object.
- Mechanism: During F-CF collection, the camera is mounted on a tripod and triggered by remote control, and key parameters (shutter speed, ISO, white balance, focus) are locked while the 5 images of a group are shot consecutively. A strict filtering process then removes three types of undesirable cases: (a) accidental background changes (such as illumination variations or pedestrians); (b) out-of-focus images; (c) incorrect object poses. Of 7,000 captured F-CF groups, 5,699 were kept; of 5,500 captured F-Only images, 5,035 were kept.
- Design Motivation: Data quality is core to the value of a dataset. Only by strictly controlling variables can the bidirectional effects of object-to-scene influence (shadows, reflections, etc.) and scene-to-object influence (color shifts, etc.) be accurately captured.
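To give a sense of how strict the filtering is, the retention rates can be computed from the counts quoted above. This is only arithmetic on the reported numbers; the helper name `retention_rate` is hypothetical.

```python
def retention_rate(kept: int, captured: int) -> float:
    """Return the fraction of captured items that survive filtering."""
    if captured <= 0:
        raise ValueError("captured must be positive")
    if not 0 <= kept <= captured:
        raise ValueError("kept must lie between 0 and captured")
    return kept / captured


fcf_rate = retention_rate(5_699, 7_000)      # F-CF groups
f_only_rate = retention_rate(5_035, 5_500)   # F-Only images

print(f"F-CF groups kept:   {fcf_rate:.1%}")     # 81.4%
print(f"F-Only images kept: {f_only_rate:.1%}")  # 91.5%
```

Roughly one in five F-CF groups is discarded, compared with fewer than one in ten F-Only images. That is what one would expect if a single bad frame (a background change, a blurred shot) is enough to invalidate a whole 5-image group.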

Rich Annotation System:
- Function: To provide hierarchical annotations supporting downstream research.
- Mechanism: Four types of annotations are provided: (1) object descriptive texts (generated by GPT-4o and Gemini 1.5 Pro); (2) object points (manually annotated); (3) bounding boxes; (4) segmentation masks (generated by SAM2). Additionally, the raw DNG files support 5 types of ISP augmentations (original, high/low color temperature, high/low vibrancy) for training color harmonization.
- Design Motivation: Object composition involves multiple sub-problems (identity preservation, shadow generation, color harmonization), which require rich annotations to support evaluation and training across different dimensions.
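The five ISP variants mentioned above can be illustrated with a toy example on a single linear RGB pixel. This is a minimal sketch under stated assumptions, not the authors' ISP pipeline: color temperature is approximated by opposite gains on the red and blue channels, "vibrancy" is approximated by plain saturation scaling around the Rec. 709 luminance, and the strengths (0.2, 0.3) are arbitrary. Because this summary does not say which direction the paper calls "high" color temperature, the two shifts are labeled warm and cool.

```python
from typing import Callable, Dict, Tuple

Pixel = Tuple[float, float, float]  # linear RGB, each channel in [0, 1]


def clamp01(x: float) -> float:
    """Clamp a channel value to the valid [0, 1] range."""
    return min(1.0, max(0.0, x))


def shift_white_balance(pixel: Pixel, warmth: float) -> Pixel:
    """Toy color-temperature shift: warmth > 0 boosts red and cuts blue."""
    r, g, b = pixel
    return (clamp01(r * (1.0 + warmth)), g, clamp01(b * (1.0 - warmth)))


def scale_saturation(pixel: Pixel, amount: float) -> Pixel:
    """Toy vibrancy change: move each channel away from (or toward) luminance."""
    r, g, b = pixel
    luma = 0.2126 * r + 0.7152 * g + 0.0722 * b  # Rec. 709 weights

    def push(c: float) -> float:
        return clamp01(luma + (c - luma) * (1.0 + amount))

    return (push(r), push(g), push(b))


# Five variants, mirroring the count (not the exact processing) in the text.
VARIANTS: Dict[str, Callable[[Pixel], Pixel]] = {
    "original": lambda p: p,
    "warm_shift": lambda p: shift_white_balance(p, 0.2),
    "cool_shift": lambda p: shift_white_balance(p, -0.2),
    "high_vibrancy": lambda p: scale_saturation(p, 0.3),
    "low_vibrancy": lambda p: scale_saturation(p, -0.3),
}

sample: Pixel = (0.5, 0.4, 0.3)
for name, fn in VARIANTS.items():
    print(name, tuple(round(c, 3) for c in fn(sample)))
```

Applying such variants to the object and the background independently would produce mismatched pairs from which a model could learn harmonization. Whether the authors pair the variants this way is not stated in this summary.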

SD-Inpaint Based Object Removal/Insertion Models:
- Function: To verify that effective object composition models can be trained solely using ORIDa real-world data.
- Mechanism: Object removal: SD-Inpaint (9-channel input) is fine-tuned directly for only 5,000 steps (320k samples at batch size 64), roughly 20 times fewer than the 6.4 million samples used in ObjectDrop. Object insertion: the model is trained for 500k steps on raw images from ORIDa and COCO, without synthetic data. At inference, skip residual connections (borrowed from DemoFusion) are added to help maintain object identity.
- Design Motivation: To demonstrate that high-quality real data can replace massive synthetic data. ObjectDrop requires a batch size of 512 and two-stage training with synthetic data, whereas ORIDa's model achieves superior performance with significantly less data.
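The "9-channel input" refers to the standard Stable Diffusion inpainting UNet, whose input concatenates a 4-channel noisy latent, a 1-channel mask, and the 4-channel latent of the masked image. The sketch below only computes the resulting tensor shape; the function name is hypothetical, and the 512x512 resolution in the usage line is an assumption, since the training resolution is not given in this summary.

```python
from typing import Dict, Tuple

LATENT_CHANNELS = 4   # channels of the Stable Diffusion VAE latent
VAE_DOWNSCALE = 8     # spatial downsampling factor of the VAE


def inpaint_unet_input_shape(
    height: int, width: int
) -> Tuple[Tuple[int, int, int], Dict[str, int]]:
    """Return the (C, H, W) shape of the inpainting UNet input and its channel layout."""
    if height % VAE_DOWNSCALE or width % VAE_DOWNSCALE:
        raise ValueError("image size must be a multiple of the VAE downscale factor")
    layout = {
        "noisy_latent": LATENT_CHANNELS,         # latent being denoised
        "mask": 1,                               # inpainting mask, resized to latent size
        "masked_image_latent": LATENT_CHANNELS,  # VAE encoding of the masked image
    }
    channels = sum(layout.values())
    return (channels, height // VAE_DOWNSCALE, width // VAE_DOWNSCALE), layout


shape, layout = inpaint_unet_input_shape(512, 512)  # resolution is an assumption
print(shape)    # (9, 64, 64)
print(layout)
```

Because the input layout is unchanged, fine-tuning can start from the pretrained inpainting weights without modifying the first convolution, which is consistent with the "fine-tuned directly" description above.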

Loss & Training

Optimizer: Adam, lr=5e-5, cosine scheduler
Training Scale: Object removal: 5,000 steps / batch size 64; Object insertion: 500k steps / batch size 64
Hardware: 4x NVIDIA A100-PCIE (40GB), object insertion training takes approximately 150 hours
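The training budget can be cross-checked from the step counts, batch size, and wall-clock time listed above. The numbers below are samples seen during training (steps times batch size), not unique images, and the throughput figure assumes the 150 hours cover exactly the 500k insertion steps.

```python
BATCH_SIZE = 64

removal_steps = 5_000
insertion_steps = 500_000
objectdrop_samples = 6_400_000   # figure quoted for ObjectDrop in this summary

removal_samples = removal_steps * BATCH_SIZE
insertion_samples = insertion_steps * BATCH_SIZE
removal_reduction = objectdrop_samples / removal_samples

insertion_hours = 150
steps_per_second = insertion_steps / (insertion_hours * 3600)

print(removal_samples)              # 320000
print(insertion_samples)            # 32000000
print(removal_reduction)            # 20.0
print(round(steps_per_second, 2))   # 0.93
```

The removal model therefore sees about 20 times fewer samples than ObjectDrop, while the insertion model is a far larger run at roughly one optimizer step per second across the 4 GPUs.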

Key Experimental Results

Main Results: Object Removal

Method	PSNR↑	DINO↑	CLIP↑	LPIPS↓
SD-Inpaint	21.76	0.845	0.903	0.108
SD-Ours_r	25.60	0.902	0.938	0.088
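For readers less familiar with the PSNR column, the metric is defined as 10·log10(MAX² / MSE). The sketch below implements it on flat pixel lists and converts a PSNR gain into the equivalent mean-squared-error ratio. The helper names are hypothetical. The conversion is exact only for a single image, so applying it to the averaged scores in the table gives a rough indication rather than a precise statement.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


gain_db = 25.60 - 21.76            # SD-Ours_r minus SD-Inpaint, from the table
print(round(gain_db, 2))                        # 3.84
print(round(mse_ratio_for_gain(gain_db), 2))    # 2.42
```

Read this way, the 3.84 dB improvement corresponds to roughly a 2.4-fold reduction in squared pixel error.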

User study (5-point scale): SD-Ours_r scores 4.23, significantly outperforming SD-Inpaint (2.78), LaMa (2.63), and MGIE (1.96).

Object Insertion User Study

Dimension	SD-Ours_i Preference Rate
Object Identity Preservation	66%
Shadow Generation	79%
Color Harmonization	71%
Overall Quality	67%

Ablation Study: Data Scale

Data Ratio	Effect Description
25%	Inaccurate shadow generation, artifacts in object appearance
50%	Improvement observed but inconsistencies persist
100%	Best performance, accurate shadows, and well-preserved object identity

Key Findings

SD-Ours trained purely on real-world data (without synthetic data) significantly outperforms baselines in both object removal and insertion.
The F-CF data in ORIDa enables the model to learn accurate shadow removal and generation, which is difficult to achieve with purely synthetic data.
Data scale ablation confirms that the amount of data is crucial for high-quality composition.
ISP augmentation of raw DNG files effectively enhances the color harmonization capability.

Highlights & Insights

Dataset Level: The first object composition dataset that is large-scale, real-shot, publicly available, and features multiple objects, multiple scenes, and multiple positions.
Bidirectional Effects: Simultaneously captures both "object-to-scene influence" (shadows, reflections) and "scene-to-object influence" (color variations), which synthetic data struggles to reproduce.
Practical Engineering Value: Shows that, for object composition, a modest amount of meticulously captured real data can be more effective than a much larger amount of synthetic data.
RAW DNG Support: Provides unprocessed raw files, opening up new possibilities for ISP-level data augmentation.

Limitations & Future Work

The number of objects (200) is relatively limited; the diversity of covered object categories and materials needs to be expanded.
The insertion results exhibit some blurriness, stemming from the inherent limitations of the pretrained SD-Inpaint model.
Object pose variations are intentionally restricted and do not cover composition scenarios under different viewpoints/poses.
Future work can extend this to video composition, object insertion in dynamic scenes, and other directions.

Related Work

ObjectDrop: The most relevant prior work, but it is not publicly available, features only 1 scene/position per object, and relies on auxiliary training with synthetic data.
Paint-by-Example: Trained on synthetic data, underperforming in maintaining object identity.
AnyDoor/ObjectStitch: State-of-the-art object composition methods that still fall short in shadow generation and color adaptation.

Rating

Novelty: 7/10. The core contribution is the dataset construction rather than methodological innovation; the fine-tuned models serve only to validate the dataset.
Experimental Thoroughness: 8/10. Multi-dimensional evaluations are provided, including quantitative metrics, user studies, ablation experiments, and data analysis.
Writing Quality: 8/10. The dataset construction process is clearly described, with detailed statistical analysis.
Value: 8/10. Fills the gap in real-world object composition datasets and makes a significant contribution to the field.