TokenLight: Precise Lighting Control in Images using Attribute Tokens

Conference: CVPR 2026
arXiv: 2604.15310
Code: vrroom.github.io/tokenlight/
Area: Image Generation / Relighting
Keywords: relighting, attribute tokens, diffusion transformer, lighting control, synthetic data

TL;DR

TokenLight formulates image relighting as an end-to-end image generation task conditioned on attribute tokens (intensity, color, ambient light, diffuse level, and 3D light source position), enabling precise, continuous, and interpretable lighting control within a diffusion Transformer framework.

Background & Motivation

Existing relighting methods are each constrained by their lighting representation: text-driven approaches lack precision, background images carry limited lighting information, panoramic environment maps cannot model near-field illumination, and inverse rendering methods depend on high-quality 3D reconstruction. No existing representation is simultaneously precise, interpretable, and spatially flexible enough for direct lighting manipulation in the 2D image domain. The goal is therefore to combine the intuitive flexibility of 3D lighting tools with the accessibility of 2D image editing, without requiring 3D reconstruction.

Method

Overall Architecture

The method fine-tunes a pretrained diffusion Transformer (a text-to-image/video foundation model). Given an input image and a set of attribute tokens as control signals, the model directly renders the relit output. A large-scale synthetic dataset provides precise lighting annotations, and a small amount of real data is added to improve generalization.
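As a rough illustration of how scalar lighting attributes could become tokens that sit alongside image tokens in a Transformer input, the sketch below embeds each attribute value with a sinusoidal encoding and prepends the resulting tokens to a sequence of image patch tokens. The encoding scheme, the dimension `dim`, the attribute names, the choice of one token per scalar, and all helper functions are assumptions made for illustration; this summary does not specify how TokenLight actually builds or injects its tokens.

```python
import math
from typing import Dict, List

Token = List[float]


def sinusoidal_embedding(value: float, dim: int = 8) -> Token:
    """Embed one scalar as interleaved sin/cos features (hypothetical encoding)."""
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even number")
    features: Token = []
    for i in range(dim // 2):
        freq = 2.0 ** i  # geometric ladder of frequencies
        features.append(math.sin(freq * value))
        features.append(math.cos(freq * value))
    return features


def build_attribute_tokens(attrs: Dict[str, float], dim: int = 8) -> List[Token]:
    """Create one token per attribute, in sorted key order for a stable layout."""
    return [sinusoidal_embedding(attrs[name], dim) for name in sorted(attrs)]


def build_condition_sequence(attr_tokens: List[Token],
                             image_tokens: List[Token]) -> List[Token]:
    """Prepend attribute tokens to image tokens; all tokens must share one width."""
    widths = {len(t) for t in attr_tokens + image_tokens}
    if len(widths) > 1:
        raise ValueError("all tokens must have the same dimension")
    return attr_tokens + image_tokens


# Hypothetical attribute values, normalized to small ranges for illustration.
lighting = {
    "intensity": 0.8,
    "color_temperature": 0.3,
    "diffuse_level": 0.5,
    "ambient": 0.2,
    "position_x": -0.4,
    "position_y": 0.1,
    "position_z": 0.9,
}
attribute_tokens = build_attribute_tokens(lighting, dim=8)
patch_tokens = [[0.0] * 8 for _ in range(4)]  # stand-in for encoded image patches
sequence = build_condition_sequence(attribute_tokens, patch_tokens)
```

Because each attribute occupies its own token in this sketch, changing one value alters only that token, which is one simple way to picture the independent control described above.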

Key Designs

Attribute Token Lighting Representation: Each token encodes a physically meaningful lighting attribute — intensity, color (color temperature), diffuse level, 3D spatial position, and ambient light parameters. Each attribute is independently controllable and continuously adjustable. This decomposed representation naturally supports disentangled editing.
Large-Scale Synthetic Data Training: A dataset is generated in Blender using a path-tracing renderer, rendering 3D assets under systematically varied lighting conditions to provide precise ground-truth supervision for each lighting attribute. The dataset includes single-light-source renders, environment light images, and light source visibility masks.
Three Practical Lighting Control Modes: (1) Adding a virtual point light (placing a new light source at an arbitrary 3D position); (2) editing/diffusing ambient illumination (global lighting adjustment); (3) controlling in-scene light sources (toggling emissive objects via spatial masks). Combinations of these modes enable rich creative effects.
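The systematic variation of lighting used for synthetic supervision, and the idea of adjusting one attribute while holding the others fixed, can be sketched as follows. The `PointLightSpec` fields mirror the attributes listed above, but the class itself, the value ranges, the units, and both helper functions are hypothetical; they are not taken from the paper or its rendering pipeline.

```python
import random
from dataclasses import dataclass, fields, replace
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class PointLightSpec:
    """Hypothetical container for one virtual light's attributes."""
    intensity: float                       # assumed normalized range [0, 1]
    color_temperature_k: float             # assumed range [2000, 10000] kelvin
    diffuse_level: float                   # assumed: 0 = hard shadows, 1 = very soft
    ambient: float                         # assumed ambient scale in [0, 1]
    position: Tuple[float, float, float]   # assumed camera-space xyz


def sample_point_light(rng: random.Random) -> PointLightSpec:
    """Draw one random lighting condition (ranges are illustrative only)."""
    return PointLightSpec(
        intensity=rng.uniform(0.0, 1.0),
        color_temperature_k=rng.uniform(2000.0, 10000.0),
        diffuse_level=rng.uniform(0.0, 1.0),
        ambient=rng.uniform(0.0, 1.0),
        position=(
            rng.uniform(-1.0, 1.0),
            rng.uniform(-1.0, 1.0),
            rng.uniform(0.0, 2.0),
        ),
    )


def sweep_attribute(base: PointLightSpec, name: str,
                    values: Sequence[object]) -> List[PointLightSpec]:
    """Vary a single attribute while keeping every other attribute fixed."""
    valid = {f.name for f in fields(PointLightSpec)}
    if name not in valid:
        raise ValueError(f"unknown attribute: {name}")
    return [replace(base, **{name: v}) for v in values]


rng = random.Random(0)  # fixed seed so the sampled conditions are reproducible
base_light = sample_point_light(rng)
intensity_sweep = sweep_attribute(base_light, "intensity", [0.0, 0.5, 1.0])
```

Rendering the same scene once per entry in `intensity_sweep` would give image pairs that differ in exactly one annotated attribute, which is the kind of supervision that disentangled control relies on.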

Loss & Training

Training uses a standard diffusion denoising objective. A small amount of real data (captured by toggling in-scene light sources) supplements the synthetic set to improve photorealism and generalization. The visual priors of the pretrained foundation model are retained during fine-tuning.
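For reference, one common form of a conditional denoising objective is written below. The notation is an assumption: $x_0$ stands for the target relit image (or its latent), $c$ for the conditioning (input image plus attribute tokens), and $\epsilon_\theta$ for the network's noise prediction. The underlying foundation model may instead use a related parameterization such as velocity prediction or flow matching, which this summary does not specify.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x_0,\, c,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
    \left[ \left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|_2^2 \right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
```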

Key Experimental Results

Main Results

The method is evaluated on both synthetic and real images and compared against prior methods, including GenLit and LightLab:

Task	Metric	Prev. SOTA	Ours
Virtual light source addition	Quantitative + Qualitative	Inferior	SOTA
Ambient light editing	Quantitative + Qualitative	Inferior	SOTA
In-scene light source control	Quantitative + Qualitative	Inferior	SOTA

Key Findings

Without inverse rendering supervision, the model exhibits an intrinsic understanding of light–scene interaction.
Virtual light sources can be placed inside objects (e.g., a jack-o'-lantern glow effect).
Relighting of transparent materials produces convincing shadows.
Lighting reasoning learned predominantly from synthetic data, with only a small real-data supplement, transfers successfully to real-world scenes.

Highlights & Insights

Attribute tokens transform lighting control from a black box into interpretable, physically grounded manipulation.
The end-to-end approach demonstrates 3D lighting understanding without requiring 3D reconstruction.
The strategy of scaling synthetic training data offers guidance for other generative tasks that require precise annotations.

Limitations & Future Work

The 3D light source position is coupled to the camera viewpoint, and multi-view consistency is not guaranteed.
Robustness under extreme lighting conditions (e.g., high dynamic range scenes) remains to be tested.
The iterative sampling of the diffusion model may limit inference speed for real-time interactive editing.

The decomposed control design using attribute tokens is generalizable to other conditional generation tasks.
The synthetic-plus-small-real training strategy balances precision and generalization.
The lighting reasoning capability of diffusion Transformers suggests the possibility of end-to-end physical understanding.

Rating

8/10 — The representation design is elegant, and both control precision and visual quality are excellent, representing a significant advance in the relighting field.