# Improved Masked Image Generation with Knowledge-Augmented Token Representations

- Conference: AAAI 2026
- arXiv: 2511.12032
- Code: https://github.com/GuotaoLiang/KA-MIG
- Area: Image Generation
- Keywords: masked image generation, knowledge graph, discrete tokens, prior knowledge augmentation, graph convolutional network
## TL;DR
This paper proposes KA-MIG, a framework that mines three types of token-level semantic prior knowledge graphs from training data (co-occurrence graph, semantic similarity graph, and position-token incompatibility graph), learns augmented token representations via a graph-aware encoder, and injects them into existing MIG models through a lightweight addition-subtraction fusion mechanism, consistently improving generation quality across multiple backbone networks.
## Background & Motivation
Masked image generation (MIG), exemplified by MaskGIT, achieves a favorable balance between sampling speed and quality through parallel decoding. The pipeline first encodes images into discrete token sequences with a VQ-VAE, then trains a transformer to predict masked tokens, and finally decodes iteratively, unmasking tokens step by step until a complete sequence is generated.
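To make the pipeline concrete, here is a minimal simulation of confidence-based parallel decoding. The cosine mask schedule follows MaskGIT; `predict_tokens` is a hypothetical stand-in for the trained transformer, and the `MASK = -1` sentinel is an illustrative convention, not the paper's implementation:

```python
import numpy as np

def cosine_mask_ratio(step, total_steps):
    """Fraction of tokens still masked after this step (MaskGIT's cosine schedule)."""
    return float(np.cos(np.pi / 2 * (step + 1) / total_steps))

def parallel_decode(predict_tokens, seq_len, num_steps):
    """Iteratively fill a fully masked token sequence.

    `predict_tokens(tokens, mask)` stands in for the trained transformer:
    it must return (predicted_ids, confidences) for every position.
    """
    MASK = -1
    tokens = np.full(seq_len, MASK, dtype=np.int64)
    for step in range(num_steps):
        mask = tokens == MASK
        ids, conf = predict_tokens(tokens, mask)
        # Already-decoded positions get infinite confidence so they stay fixed.
        conf = np.where(mask, conf, np.inf)
        n_masked_next = int(np.floor(cosine_mask_ratio(step, num_steps) * seq_len))
        order = np.argsort(-conf)  # most confident first
        keep = np.zeros(seq_len, dtype=bool)
        keep[order[: seq_len - n_masked_next]] = True
        # Commit confident predictions; re-mask the rest for the next iteration.
        tokens = np.where(keep, np.where(mask, ids, tokens), MASK)
    return tokens
```

At the final step the schedule reaches zero, so every position is committed and the sequence is complete.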
However, MIG still lags behind diffusion models. Existing improvements have focused primarily on decoding/sampling strategies (e.g., Token-Critic, DPC, Self-Guidance, Halton sampling), with almost no attention paid to enhancing the model's internal representational capacity.
The root problem identified by the authors: existing MIG methods rely entirely on the transformer itself to learn semantic dependencies among tokens, which is challenging because:
Individual tokens lack explicit semantic meaning: VQ-VAE codebook entries are merely vectors in latent space that are not directly interpretable.
Token sequences are typically long (e.g., 256 tokens/image), making complex relationships in long sequences difficult to capture effectively.
Core motivation: Given that tokens themselves lack semantics, can implicit structural patterns among tokens be mined from large-scale training data and injected into the model as prior knowledge?
## Method

### Overall Architecture

KA-MIG consists of three stages:

1. Graph construction: build three prior knowledge graphs from the training data
2. Graph-aware encoder: learn augmented token and position representations with GCNs
3. Lightweight fusion mechanism: inject the prior knowledge into the MIG transformer via addition and subtraction operations

### Key Designs

#### 1. Construction of Three Prior Knowledge Graphs
(a) Co-occurrence Graph \(\mathcal{G}_{co}\) (Positive Prior)
Captures frequently co-occurring token pairs within a local neighborhood, reflecting latent spatial-semantic correlations.
- Construction: co-occurrence frequencies are tallied over the token sequences of all training images, counting token pairs that fall within a first-order neighborhood (horizontal, vertical, and diagonal directions)
- A weighted undirected graph is built; low-frequency edges are pruned to reduce noise
- Intuition: If token A and token B frequently appear in adjacent positions, they likely encode semantically related visual patterns
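The counting step above can be sketched as follows (a minimal illustration assuming tokens are arranged on an H×W grid; the function name and the `min_count` pruning threshold are illustrative choices, not the paper's exact values):

```python
import numpy as np
from collections import Counter

def build_cooccurrence_graph(token_grids, min_count=2):
    """Count token co-occurrences within first-order (8-connected) neighborhoods.

    token_grids: iterable of (H, W) integer arrays of VQ token ids.
    Returns {(a, b): weight} for a weighted undirected graph, with
    edges rarer than `min_count` pruned as noise.
    """
    counts = Counter()
    # Only forward offsets, so each unordered neighbor pair is counted once.
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for grid in token_grids:
        h, w = grid.shape
        for y in range(h):
            for x in range(w):
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        a, b = int(grid[y, x]), int(grid[ny, nx])
                        counts[tuple(sorted((a, b)))] += 1
    return {edge: c for edge, c in counts.items() if c >= min_count}
```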
(b) Semantic Similarity Graph \(\mathcal{G}_s\) (Positive Prior)
Identifies tokens that are semantically similar (akin to "synonyms") in the context of image synthesis.
- Core assumption: If two tokens exhibit similar positional distributions across a large number of images, they likely express similar semantics
- A positional distribution vector of length \(N\) is constructed for each token (each entry represents the frequency of the token appearing at a specific position)
- Jensen-Shannon divergence is used to measure distributional similarity
- Each token retains its top-2 most similar tokens, forming a directed graph
The validation experiment is highly compelling: replacing token (1013) with its most similar token (463) yields a reconstructed image visually indistinguishable from the original (PSNR=35.78); replacing it with the least similar token (149) severely degrades quality (PSNR=18.97).
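The distribution-comparison step can be sketched as follows (a minimal numpy illustration; `position_counts` and `k` are illustrative names, and the paper's exact JS formulation may differ in normalization details):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    def kl(a, b):
        return np.sum(np.where(a > 0, a * np.log((a + eps) / (b + eps)), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def build_similarity_graph(position_counts, k=2):
    """position_counts: (K, N) array; row t counts how often token t appears
    at each of the N positions across the training set.
    Returns {token: [k most similar tokens]}, i.e. a directed top-k graph."""
    K = position_counts.shape[0]
    edges = {}
    for t in range(K):
        d = np.array([js_divergence(position_counts[t], position_counts[u])
                      if u != t else np.inf for u in range(K)])
        edges[t] = list(np.argsort(d)[:k])  # smallest divergence = most similar
    return edges
```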
(c) Position-Token Incompatibility Graph \(\mathcal{G}_p^c\) (Negative Prior)
Identifies tokens that should not appear at specific spatial positions under a given category.
- For each category \(c\), all training images are scanned to record tokens that never appear at a given position
- Example: For the "airplane" category, tokens encoding ground/grass textures almost never appear in the upper half of the image
- Helps the model avoid semantically unreasonable spatial-token combinations
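A straightforward way to collect these per-class incompatibility sets (names are illustrative; the paper may use a more compact encoding than explicit Python sets):

```python
import numpy as np

def build_incompatibility_sets(token_seqs_by_class, codebook_size):
    """For each class, record the tokens that never occur at each position.

    token_seqs_by_class: {class_id: (num_images, N) int array of token ids}.
    Returns {class_id: list of N sets}; set i holds token ids never
    observed at position i for that class.
    """
    all_tokens = set(range(codebook_size))
    incompat = {}
    for c, seqs in token_seqs_by_class.items():
        n_pos = seqs.shape[1]
        incompat[c] = [all_tokens - set(np.unique(seqs[:, i]).tolist())
                       for i in range(n_pos)]
    return incompat
```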
#### 2. Graph-aware Encoder
Positive prior processing: two independent 3-layer GCNs extract global token representations:

$$C_{co} = f_{\theta_{co}}(\mathcal{G}_{co}, C), \quad C_s = f_{\theta_s}(\mathcal{G}_s, C)$$

where \(C\) denotes the VQ-VAE codebook embeddings.
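A forward pass of such a GCN can be sketched in numpy (a generic Kipf–Welling-style stack with symmetric adjacency normalization; the activations and hidden widths are assumptions, since the paper's exact layer details are not reproduced here):

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalize adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_forward(A, C, weights):
    """3-layer GCN over codebook embeddings C (K, d), ReLU between layers."""
    A_norm = normalize_adj(A)
    H = C
    for i, W in enumerate(weights):
        H = A_norm @ H @ W
        if i < len(weights) - 1:
            H = np.maximum(H, 0.0)  # ReLU on hidden layers only
    return H
```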
Negative prior processing: for each position \(i\) under category \(c\), the mean embedding of the incompatible token set \(\mathcal{I}_i^c\) is aggregated:

$$p_i^c = \frac{1}{|\mathcal{I}_i^c|}\sum_{t \in \mathcal{I}_i^c} C_t W$$

yielding positional embeddings \(P^c \in \mathbb{R}^{N \times d}\) that encode spatial constraints.
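The aggregation itself is a simple projected mean; a minimal sketch, assuming `incompat_sets` holds the per-position incompatible token ids for one class:

```python
import numpy as np

def position_embeddings(incompat_sets, C, W):
    """p_i^c: mean embedding of tokens incompatible with position i, projected by W.

    incompat_sets: list of N sets of token ids; C: (K, d) codebook; W: (d, d') proj.
    Returns P^c of shape (N, d').
    """
    N, d_out = len(incompat_sets), W.shape[1]
    P = np.zeros((N, d_out))
    for i, toks in enumerate(incompat_sets):
        if toks:  # positions with no incompatible tokens keep a zero embedding
            P[i] = C[sorted(toks)].mean(axis=0) @ W
    return P
```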
#### 3. Lightweight Fusion Mechanism
Additive fusion (positive prior): augments unmasked token representations before each transformer layer:

$$Z_{\overline{M}}^l = Z_{\overline{M}}^l + f_{pos}^l(C_{co}[Z_{\overline{M}}]) + f_{pos}^l(C_s[Z_{\overline{M}}])$$
Subtractive fusion (negative prior): suppresses incompatible token features at masked positions in each layer:

$$Z_M^l = Z_M^l - \alpha\, f_{neg}^l(P^c)$$
Both \(f_{pos}\) and \(f_{neg}\) are implemented with zero convolution, ensuring no interference with existing knowledge at the start of training.
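The zero-convolution property can be illustrated with zero-initialized linear maps: at initialization the fusion terms vanish, so the fused model exactly reproduces the frozen backbone. Names and the single-layer granularity below are illustrative, not the paper's implementation:

```python
import numpy as np

class ZeroLinear:
    """Zero-initialized projection (the zero-conv analogue from ControlNet)."""
    def __init__(self, d_in, d_out):
        self.W = np.zeros((d_in, d_out))  # trained away from zero during fine-tuning
    def __call__(self, x):
        return x @ self.W

def fuse_layer(Z, mask, C_co, C_s, P_c, token_ids, f_pos, f_neg, alpha=1.0):
    """Add positive priors at unmasked positions; subtract the negative prior at masked ones.

    Z: (N, d) layer input; mask: (N,) bool, True where the token is masked;
    token_ids: (N,) ids used to gather from the graph-augmented codebooks.
    """
    Z = Z.copy()
    um = ~mask
    Z[um] += f_pos(C_co[token_ids[um]]) + f_pos(C_s[token_ids[um]])
    Z[mask] -= alpha * f_neg(P_c[mask])
    return Z
```

With both projections at zero, the fusion is an exact no-op, which is what lets fine-tuning start from the backbone's pretrained behavior.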
### Loss & Training
- Standard MIG training objective (negative log-likelihood of masked tokens)
- Backbone is frozen; only classification layers and newly added parameters are fine-tuned
- Graph features can be precomputed and stored; inference involves only lightweight addition and subtraction operations
- Validated on three backbones: MaskGIT, AutoNAT, and TiTok
## Key Experimental Results
### Main Results
ImageNet-256 Class-conditional Generation
| Model | Type | Params | FID↓ | IS↑ | Prec↑ | Rec↑ |
|---|---|---|---|---|---|---|
| MaskGIT | MIG | 227M | 6.18 | 182.1 | 0.80 | 0.52 |
| MaskGIT-KA | MIG | 245M | 5.69 | 170.2 | 0.81 | 0.50 |
| AutoNAT | MIG | 194M | 2.68 | 278.8 | - | - |
| AutoNAT-KA | MIG | 211M | 2.45 | 274.1 | 0.82 | 0.56 |
| TiTok-b64 | MIG | 177M | 2.48 | 214.7 | - | - |
| TiTok-b64-KA | MIG | 194M | 2.40 | 217.0 | 0.78 | 0.60 |
| TiTok-s128 | MIG | 177M | 1.97 | 281.8 | - | - |
| TiTok-s128-KA | MIG | 194M | 1.90 | 271.9 | 0.78 | 0.61 |
| VAR-d20 | AR | 600M | 2.57 | 302.6 | 0.83 | 0.56 |
| LDM-4 | Diff. | 400M | 3.60 | 247.7 | - | - |
MS-COCO Text-to-Image Generation
| Method | FID↓ | CLIP-Score↑ |
|---|---|---|
| MaskGen | 22.27 | 25.58 |
| MaskGen + KA (Ours) | 21.01 | 26.10 |
### Ablation Study
| Configuration | FID↓ | IS↑ | Note |
|---|---|---|---|
| AutoNAT (baseline) | 2.68 | 278.8 | |
| + \(\mathcal{G}_s\) only | 2.49 | 279.6 | Semantic similarity graph contributes most to FID |
| + \(\mathcal{G}_p\) only | 2.51 | 285.6 | Position incompatibility graph yields the largest IS gain |
| + \(\mathcal{G}_{co}\) only | 2.51 | 282.1 | Co-occurrence graph also effective |
| + \(\mathcal{G}_s\) + \(\mathcal{G}_p\) | 2.46 | 279.9 | Pairwise combinations yield further gains |
| + \(\mathcal{G}_{co}\) + \(\mathcal{G}_p\) | 2.46 | 280.7 | |
| + \(\mathcal{G}_{co}\) + \(\mathcal{G}_s\) | 2.48 | 277.4 | |
| + All three (KA-MIG) | 2.45 | 274.1 | Best FID |
Efficiency Analysis:
| Graph Type | Online Parameters | Precomputed Parameters | Online TFLOPs |
|---|---|---|---|
| \(\mathcal{G}_{co}\) | +16M | +0.79M | ~0 |
| \(\mathcal{G}_s\) | +16M | +0.79M | ~0 |
| \(\mathcal{G}_p\) | +15M | +196M | +0.06 |
Optimal strategy: precompute \(\mathcal{G}_{co}\) and \(\mathcal{G}_s\) (lightweight), compute \(\mathcal{G}_p\) online (avoiding per-class storage overhead).
### Key Findings
- All three graphs are individually effective and mutually complementary: each yields improvements in isolation, with further gains when combined.
- \(\mathcal{G}_s\) contributes most to FID: learning interchangeable token patterns enhances both robustness and diversity.
- Longer sequences benefit more: MaskGIT/AutoNAT (256 tokens) gain more than TiTok (64/128 tokens), as token dependencies are more complex in longer sequences.
- Only ~20M additional parameters: the lightweight design incurs minimal inference overhead.
- AutoNAT-KA (2.45 FID) outperforms larger models such as LlamaGen-XL (2.62) and VAR-d20 (2.57).
## Highlights & Insights
- Precise problem formulation: The paper observes that MIG improvements have almost exclusively targeted sampling strategies and is the first to systematically address internal representational capacity.
- Data-driven prior knowledge mining is highly practical: no external annotations or hand-crafted rules are required; knowledge is extracted purely from statistical patterns in training data.
- The assumption that "similar positional distributions imply semantic similarity" is simple yet effective, supported by compelling validation experiments (token substitution reconstruction).
- Additive fusion for positive priors and subtractive fusion for negative priors offers clear design intuition and straightforward implementation.
- Fully decoupled from the backbone: graph features can be precomputed, and only a small number of parameters need fine-tuning with the backbone frozen, making the approach highly practical.
## Limitations & Future Work
- All three graphs are static (derived from training data statistics) and are not dynamically updated during training.
- The class-conditional nature of \(\mathcal{G}_p\) leads to substantial storage overhead (196M parameters across 1,000 classes); more compact representations could be explored.
- Improvements on short-sequence models such as TiTok are relatively limited; applicability to next-generation compact VQ methods remains to be seen.
- The combination of all three graphs does not always yield the best IS (274.1 vs. the baseline's 278.8), suggesting possible information redundancy among the priors.
- More expressive graph network architectures (e.g., GAT, GraphSAGE) are not explored; the current 3-layer GCN may be insufficient.
- Systematic evaluation at higher resolutions (512×512) is absent.
## Related Work & Insights
- Complementary to MaskGIT-SAG (self-guided sampling) and Halton sampling: those improve sampling while KA-MIG improves representations, and the two can be used jointly.
- The co-occurrence modeling paradigm from graph neural networks in recommender systems is cleverly transferred to the visual token domain.
- The "negative prior" (position-token incompatibility graph) conceptually parallels hard negative mining in contrastive learning.
- The approach may inspire similar token prior knowledge injection in autoregressive image generation models (e.g., LlamaGen, VAR).
- The use of zero convolution draws on the ControlNet design philosophy, ensuring that newly added modules do not interfere with pretrained model knowledge at initialization.
## Rating
- Novelty: ⭐⭐⭐⭐ — The construction of prior knowledge graphs is innovative, though the broader idea of mining statistical patterns from data is relatively straightforward.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three backbone networks, detailed ablations, efficiency analysis, and visual validation; very comprehensive.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with abundant illustrations.
- Value: ⭐⭐⭐⭐ — Opens a new direction for MIG improvement; the lightweight plug-and-play design is highly practical.