CharaConsist: Fine-Grained Consistent Character Generation¶
Conference: ICCV 2025 arXiv: 2507.11533 Code: https://github.com/xxx/CharaConsist.git Area: Diffusion Models Keywords: Consistent Character Generation, DiT, Training-Free, Attention Mechanism, FLUX.1
TL;DR¶
This paper proposes a training-free, fine-grained consistent character generation method that achieves high-quality cross-image character consistency on a DiT architecture (FLUX.1) for the first time, via Point-Tracking Attention, adaptive token merging, and foreground-background decoupled control.
Background & Motivation¶
Consistent Character Generation is a core requirement for story visualization, comic generation, and related applications: given textual descriptions of the same character, the character's appearance must remain consistent across multiple generated images. Existing training-free methods (e.g., StoryDiffusion, ConsiStory) face three critical issues:
Background inconsistency: Existing methods cannot distinguish foreground from background when computing cross-image attention, causing background elements to interfere with each other. When background switching is desired, character consistency and background diversity are difficult to reconcile simultaneously.
Trade-off between foreground consistency and action diversity: These methods rely on cross-image attention to transfer identity information, but when a character undergoes significant pose or position changes across images, a locality bias emerges — the model tends to attend to spatially proximate rather than semantically relevant regions, causing a sharp drop in consistency.
Architectural limitations: Both StoryDiffusion and ConsiStory are built on the UNet-based SDXL and cannot be directly transferred to more advanced DiT architectures (e.g., FLUX.1). DiT models offer substantial advantages in image quality and text comprehension, but their attention mechanisms differ from UNet, invalidating the design assumptions of existing methods.
The paper's core insight is that the root cause of locality bias lies in positional encoding. DiT employs RoPE (Rotary Position Embedding), which assigns higher attention weights to spatially adjacent tokens. Consequently, when a character occupies different positions in the reference and target images, naive KV injection fails. The proposed solution is to identify corresponding points via semantic matching and re-encode the reference KV using target positions, fundamentally eliminating locality bias.
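A quick numeric illustration of this bias (a minimal sketch using a generic 1-D RoPE; FLUX.1 actually uses a 2-D variant, and the function below is illustrative rather than the model's implementation): when the same content vector is placed at two positions, the post-RoPE attention logit is maximal at zero offset and shrinks as the spatial distance grows.

```python
import torch

def rope_1d(x, pos, dim):
    """Apply a generic 1-D rotary position embedding to the last dim of x."""
    half = dim // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32) / half))
    angles = pos * freqs
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

torch.manual_seed(0)
dim = 64
q = torch.randn(dim)
k = q.clone()  # identical content: a "perfect" semantic match

# With RoPE, the attention logit between identical content vectors is highest
# at zero spatial offset and decays on average as the offset grows -- this is
# the locality bias that breaks naive cross-image KV injection.
for offset in [0, 1, 4, 16, 64]:
    logit = rope_1d(q, 0.0, dim) @ rope_1d(k, float(offset), dim)
    print(f"offset {offset:3d}: logit {logit.item():.3f}")
```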
Method¶
Overall Architecture¶
CharaConsist adopts a two-stage pipeline:
- Identity image generation: An identity reference image is first generated normally with FLUX.1, and the attention KV values across all layers and all timesteps are cached during denoising.
- Frame image generation: Subsequent images are generated by injecting identity information via Point-Tracking Attention, using the cached KV values and semantic matching results to achieve character consistency.
Unlike ConsiStory, which requires simultaneous generation of 2–4 images, the proposed method requires only one reference image, substantially reducing GPU memory consumption.
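A minimal sketch of this two-stage flow (interfaces and names are hypothetical; the actual FLUX.1 hooks in the released code differ in detail):

```python
from collections import defaultdict

class KVCache:
    """Caches attention K/V for every layer and every denoising timestep."""
    def __init__(self):
        self.store = defaultdict(dict)            # store[timestep][layer] = (K, V)

    def put(self, timestep, layer, k, v):
        self.store[timestep][layer] = (k.detach(), v.detach())

    def get(self, timestep, layer):
        return self.store[timestep][layer]

def generate_identity(pipeline, prompt, cache):
    # Stage 1: ordinary FLUX.1 sampling; attention hooks write each layer's
    # K/V at each timestep into the cache.
    return pipeline(prompt, kv_hook=cache.put)

def generate_frame(pipeline, prompt, cache):
    # Stage 2: sampling of a new frame; hooks read the cached identity K/V
    # and inject it through point-tracking attention.
    return pipeline(prompt, kv_injection=cache.get)
```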
Key Designs¶
1. Semantic Point Matching¶
Conventional methods (e.g., DIFT) compute semantic correspondences via diffusion features, but DIFT performs poorly on FLUX.1. This paper instead computes the cosine similarity between attention outputs averaged across all layers at the same timestep:

$$S_t(i, j) = \cos\!\left(\bar{O}_{\text{id}}(i),\ \bar{O}_{\text{frame}}(j)\right)$$

where \(\bar{O}_{\text{id}}\) and \(\bar{O}_{\text{frame}}\) denote the attention outputs of the identity image and the frame image, averaged over all layers at timestep \(t\). Taking \(\text{argmax}_i\ S_t(i, j)\) yields the best matching point in the identity image for each frame token \(j\). This multi-layer averaging strategy is more robust than single-layer features and captures multi-scale information ranging from low-level textures to high-level semantics.
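A minimal sketch of this matching step (tensor shapes and names are assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def match_points(outs_id, outs_frame):
    """Semantic point matching between identity-image and frame-image tokens.

    outs_id / outs_frame: lists of per-layer attention outputs at the same
    timestep, each of shape (num_tokens, dim).
    Returns, for every frame token, the index of its best-matching identity
    token and the corresponding cosine-similarity confidence.
    """
    o_id = F.normalize(torch.stack(outs_id).mean(dim=0), dim=-1)     # (N_id, d)
    o_fr = F.normalize(torch.stack(outs_frame).mean(dim=0), dim=-1)  # (N_frame, d)
    sim = o_fr @ o_id.T                                              # S_t: (N_frame, N_id)
    conf, idx = sim.max(dim=-1)                                      # argmax over identity tokens
    return idx, conf
```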
2. Foreground Mask Extraction¶
Text-image cross-attention weights are used to distinguish foreground from background. Specifically, each image token's aggregate attention toward the foreground text tokens is compared against its aggregate attention toward the background text tokens:

$$M(j) = \mathbb{1}\!\left[\sum_{k \in \mathcal{T}_{\text{fg}}} A(j, k) > \sum_{k \in \mathcal{T}_{\text{bg}}} A(j, k)\right]$$

where \(A(j, k)\) is the attention weight of image token \(j\) toward text token \(k\), and \(\mathcal{T}_{\text{fg}}\), \(\mathcal{T}_{\text{bg}}\) are the foreground and background text-token sets. This approach requires no segmentation model and exploits the joint processing of text and image tokens in DiT, obtaining high-quality segmentation masks at zero additional computational cost.
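A minimal sketch of the mask extraction (attention-tensor layout and names are assumptions):

```python
import torch

def foreground_mask(attn_img2txt, fg_token_ids, bg_token_ids):
    """Derive a foreground mask from text-image attention weights.

    attn_img2txt: (num_image_tokens, num_text_tokens) attention of image
                  tokens toward text tokens (e.g., averaged over heads/layers).
    fg_token_ids / bg_token_ids: indices of text tokens describing the
                  character vs. the background.
    """
    fg_score = attn_img2txt[:, fg_token_ids].sum(dim=-1)
    bg_score = attn_img2txt[:, bg_token_ids].sum(dim=-1)
    return fg_score > bg_score   # boolean mask over image tokens
```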
3. Point-Tracking Attention¶
This is the core contribution. When injecting the identity image's KV, positional encodings are reassigned according to semantic matching results rather than used directly:
- For a foreground token at position \(j\) in the frame image, its matching point in the identity image is identified as \(i^* = \text{argmax}_i\ S_t(i, j)\).
- The KV at position \(i^*\) in the identity image is retrieved and re-encoded using the RoPE corresponding to position \(j\) in the frame image.
- An attention mask ensures that foreground tokens attend only to foreground KV from the identity image, while background tokens attend only to their own context.
The effect is that even when the character occupies entirely different positions across two images, RoPE encoding no longer induces locality bias, as the positional encodings have been "aligned."
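A minimal sketch of the KV re-encoding (it assumes the identity keys are cached before RoPE is applied and that an `apply_rope` helper is available; both are assumptions about the implementation, not the released code):

```python
import torch

def point_tracking_kv(k_id, v_id, match_idx, frame_pos, apply_rope):
    """Re-encode cached identity K/V with the frame's token positions.

    k_id, v_id: cached identity keys/values before positional encoding, (N_id, d).
    match_idx:  (N_frame,) index of the matched identity token per frame token.
    frame_pos:  positional ids of the frame tokens, as expected by apply_rope.
    """
    k_gathered = k_id[match_idx]   # identity content rearranged to the frame's layout
    v_gathered = v_id[match_idx]
    # The gathered keys receive the *frame* positions, so RoPE no longer biases
    # attention toward spatially nearby (but semantically wrong) identity tokens.
    k_aligned = apply_rope(k_gathered, frame_pos)
    return k_aligned, v_gathered
```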
4. Adaptive Token Merge¶
Directly replacing attention outputs is overly aggressive and degrades action diversity. The paper instead interpolates between the two outputs, weighted by matching similarity:

$$O'_{\text{frame}}(j) = \alpha_t\, w(j)\, O_{\text{id}}(i^*) + \bigl(1 - \alpha_t\, w(j)\bigr)\, O_{\text{frame}}(j)$$

where \(w(j) = S_t(i^*, j)\) is the matching confidence and \(\alpha_t\) decays over timesteps. Regions with high matching confidence (e.g., the face) receive stronger identity injection, while regions with low confidence (e.g., a moving arm) retain more of the original information. The timestep decay ensures that identity structure is injected at early steps while fine-grained detail generation retains freedom at later steps.
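A minimal sketch of this merge (names are illustrative):

```python
import torch

def adaptive_token_merge(o_frame, o_id, match_idx, conf, alpha_t, fg_mask):
    """Blend identity and frame attention outputs, weighted by match confidence.

    o_frame: (N_frame, d) attention outputs of the current frame.
    o_id:    (N_id, d) cached attention outputs of the identity image.
    conf:    (N_frame,) matching confidence w(j) = S_t(i*, j).
    alpha_t: scalar injection strength, decaying over timesteps.
    fg_mask: (N_frame,) boolean foreground mask; background keeps its own output.
    """
    w = (alpha_t * conf).clamp(0.0, 1.0).unsqueeze(-1)      # per-token blend weight
    merged = w * o_id[match_idx] + (1.0 - w) * o_frame
    return torch.where(fg_mask.unsqueeze(-1), merged, o_frame)
```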
Loss & Training¶
The method is entirely training-free and involves no parameter updates or fine-tuning. All components operate through attention manipulation during FLUX.1 inference. Key hyperparameters include:

- Injection start/end timesteps
- Decay schedule for \(\alpha_t\)
- Partitioning of foreground/background text tokens
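A hypothetical configuration sketch of these knobs (names and values are illustrative, not the released defaults):

```python
# Illustrative hyperparameters only; the released code may expose different
# names, ranges, and defaults.
config = {
    "inject_start_step": 0,       # denoising step at which identity injection begins
    "inject_end_step": 30,        # step after which injection is disabled
    "alpha_schedule": "linear",   # how alpha_t decays across timesteps
    "alpha_max": 1.0,             # initial injection strength
    "fg_text_span": "the character description in the prompt",
    "bg_text_span": "the background description in the prompt",
}
```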
Key Experimental Results¶
Main Results¶
Evaluation is conducted under two scenarios, "background preservation" and "background switching", using prompts generated by GPT-4.
Background preservation scenario:
| Method | Architecture | CLIP-T ↑ | CLIP-I ↑ | CLIP-I-fg ↑ | CLIP-I-bg ↑ | ID Sim ↑ | IQS ↑ |
|---|---|---|---|---|---|---|---|
| StoryDiffusion | SDXL | 0.272 | 0.875 | 0.846 | 0.870 | 0.643 | 0.970 |
| ConsiStory | SDXL | 0.268 | 0.889 | 0.860 | 0.886 | 0.721 | 0.975 |
| CharaConsist | FLUX.1 | 0.281 | 0.910 | 0.876 | 0.916 | 0.748 | 0.985 |
Background switching scenario:
| Method | CLIP-I-fg ↑ | IAS ↑ |
|---|---|---|
| StoryDiffusion | 0.834 | 0.781 |
| ConsiStory | 0.856 | 0.802 |
| CharaConsist | 0.881 | 0.831 |
CharaConsist achieves state-of-the-art performance on nearly all metrics, with particularly notable advantages in background consistency (CLIP-I-bg) and image quality (IQS).
Ablation Study¶
Ablations verify the consistent improvement contributed by each component across different baselines:
| Base Model | Method | CLIP-I |
|---|---|---|
| RealVisXL4.0 | StoryDiffusion | 0.875 |
| RealVisXL4.0 | ConsiStory | 0.889 |
| FLUX.1 | No consistency control (baseline) | 0.812 |
| FLUX.1 | CharaConsist | 0.910 (+0.098 over the FLUX.1 baseline) |
Key finding: although the baseline consistency of FLUX.1 (0.812) is lower than that of RealVisXL4.0 + ConsiStory (0.889), the consistency gain from CharaConsist (+0.098) substantially exceeds the gain of ConsiStory on SDXL, demonstrating that the proposed method more effectively exploits the capabilities of the DiT architecture.
Component-wise ablations further show:

- Removing Point-Tracking Attention → sharp consistency drop under large positional changes
- Removing adaptive token merging → reduced action diversity, with character poses becoming homogeneous
- Removing foreground-background decoupling → significant performance degradation in the background switching scenario
Key Findings¶
- The potential of DiT is underestimated: Although the FLUX.1 baseline consistency is lower than SDXL-based methods, its upper bound after targeted optimization is higher, suggesting that the potential of DiT architectures for consistent generation remains largely untapped.
- Positional encoding is the key bottleneck: The locality bias induced by RoPE is the central obstacle to consistent generation on DiT; Point-Tracking Attention effectively resolves this by re-encoding positions.
- Necessity of foreground-background decoupling: In the background switching scenario, decoupled control yields substantial performance gains, revealing unified foreground-background processing as an important deficiency of prior methods.
Highlights & Insights¶
- Precise diagnosis of RoPE locality bias: The paper not only identifies the problem but also provides a clear causal analysis — RoPE causes spatially adjacent tokens to receive higher attention weights, leading to semantic matching failures in cross-image injection. This insight is broadly applicable and serves as a reference for all cross-image interaction tasks on RoPE-based DiT models.
- Minimal and efficient design: Only one reference image is required; no training, no external segmentation model, and all operations are performed within the attention layers. Compared to ConsiStory, which requires parallel multi-image generation, GPU memory consumption is substantially reduced.
- Multi-layer attention output averaging for matching: The paper abandons single-layer feature approaches such as DIFT in favor of semantic matching via attention outputs averaged across all layers; this simple yet effective strategy generalizes to other tasks requiring semantic correspondence.
- Training-free segmentation via text-image attention: The joint processing of text and image tokens in DiT enables foreground masks to be extracted directly from attention weights without any additional model.
Limitations & Future Work¶
- No support for external reference images: The current method can only use self-generated identity images as references and does not support user-provided real photographs as identity input, limiting practical applicability (e.g., generating stories from user photos).
- Complementarity with training-based methods: The authors note that the method can be combined with training-based identity reference approaches such as IP-Adapter and PhotoMaker — first injecting external identity via the training-based method, then maintaining multi-frame consistency with CharaConsist.
- Multi-character scenarios: The paper primarily demonstrates single-character consistency; mask extraction and matching in multi-character settings may be considerably more complex.
- Computational overhead: Caching KV across all layers and all timesteps imposes substantial memory requirements for high-resolution images.
Related Work & Insights¶
- StoryDiffusion / ConsiStory: Training-free consistency methods based on SDXL that transfer identity information via cross-image self-attention, but are constrained by the UNet architecture and locality bias.
- IP-Adapter / PhotoMaker: Training-based identity reference methods that support external reference images but require additional training and specific model configurations.
- DIFT: Utilizes diffusion features for semantic correspondence but fails on FLUX.1, highlighting significant differences in feature spaces across architectures.
- RoPE in vision: The paper's analysis of RoPE's side effects in cross-image settings provides important guidance for the design of future DiT models.
The methodological paradigm of this work — diagnosing positional encoding bias → semantic matching re-encoding → decoupled foreground/background control — constitutes a general framework design pattern that is extensible to related tasks such as video generation and multi-view generation.
Rating¶
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | 4 | First training-free consistent generation on DiT; novel diagnosis and solution for RoPE bias |
| Technical Depth | 4 | Multiple components are elegantly designed and mutually complementary; thorough exploitation of attention mechanisms |
| Experimental Thoroughness | 3.5 | Comprehensive metrics, but lacks comparison with training-based methods (e.g., IP-Adapter) and more detailed ablation analysis |
| Practical Value | 4 | Training-free, low memory footprint, high quality — strong practical utility |
| Writing Quality | 4 | Problem definition is clear, method motivation is well-grounded, and mathematical derivations are complete |
| Overall | 3.9 | A high-quality systematic contribution with significant reference value for consistent generation in the DiT era |