Improved Masked Image Generation with Knowledge-Augmented Token Representations

Conference: AAAI 2026 arXiv: 2511.12032 Code: https://github.com/GuotaoLiang/KA-MIG Area: Image Generation Keywords: Masked image generation, knowledge graph, discrete tokens, prior knowledge augmentation, graph convolutional network

TL;DR

This paper proposes KA-MIG, a framework that mines three types of token-level semantic prior knowledge graphs from training data (co-occurrence graph, semantic similarity graph, and position-token incompatibility graph), learns augmented token representations via a graph-aware encoder, and injects them into existing MIG models through a lightweight addition-subtraction fusion mechanism, consistently improving generation quality across multiple backbone networks.

Background & Motivation

Masked image generation (MIG), exemplified by MaskGIT, achieves a favorable balance between sampling speed and quality through parallel decoding. The pipeline encodes images into discrete token sequences via VQ-VAE → trains a transformer to predict masked tokens → iteratively samples to generate complete token sequences.

However, MIG still lags behind diffusion models. Existing improvements have focused primarily on decoding/sampling strategies (e.g., Token-Critic, DPC, Self-Guidance, Halton sampling), with almost no attention paid to enhancing the model's internal representational capacity.

The root problem identified by the authors: existing MIG methods rely entirely on the transformer itself to learn semantic dependencies among tokens, which is challenging because:

Individual tokens lack explicit semantic meaning: VQ-VAE codebook entries are merely vectors in latent space that are not directly interpretable.

Token sequences are typically long (e.g., 256 tokens/image), making complex relationships in long sequences difficult to capture effectively.

Core motivation: Given that tokens themselves lack semantics, can implicit structural patterns among tokens be mined from large-scale training data and injected into the model as prior knowledge?

Method

Overall Architecture

KA-MIG consists of three stages:

1. Graph construction: building three prior knowledge graphs from training data
2. Graph-aware encoder: learning augmented token and position representations with GCNs
3. Lightweight fusion mechanism: injecting prior knowledge into the MIG transformer via addition and subtraction operations

Key Designs

1. Construction of Three Prior Knowledge Graphs

(a) Co-occurrence Graph \(\mathcal{G}_{co}\) (Positive Prior)

Captures frequently co-occurring token pairs within a local neighborhood, reflecting latent spatial-semantic correlations.

  • Construction: Statistics over token sequences of all training images, recording the frequency of co-occurrence between token pairs within first-order neighborhoods (horizontal, vertical, diagonal directions)
  • A weighted undirected graph is built; low-frequency edges are pruned to reduce noise
  • Intuition: If token A and token B frequently appear in adjacent positions, they likely encode semantically related visual patterns
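The co-occurrence statistics above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the use of four shifted array views to cover the 8-neighborhood, and the `min_count` pruning threshold are all assumptions.

```python
import numpy as np
from collections import Counter

def build_cooccurrence_graph(token_grids, codebook_size, min_count=2):
    """Weighted undirected co-occurrence graph over first-order neighbors.

    token_grids: iterable of (H, W) integer arrays from the VQ-VAE encoder.
    min_count: hypothetical pruning threshold for low-frequency (noisy) edges.
    """
    counts = Counter()
    for grid in token_grids:
        # Four shifted views cover all 8 neighbors of the undirected graph:
        pairs = [
            (grid[:, :-1], grid[:, 1:]),     # horizontal
            (grid[:-1, :], grid[1:, :]),     # vertical
            (grid[:-1, :-1], grid[1:, 1:]),  # diagonal "\"
            (grid[:-1, 1:], grid[1:, :-1]),  # diagonal "/"
        ]
        for a, b in pairs:
            for u, v in zip(a.ravel(), b.ravel()):
                counts[(min(u, v), max(u, v))] += 1  # undirected: sort the pair
    adj = np.zeros((codebook_size, codebook_size), dtype=np.float32)
    for (u, v), c in counts.items():
        if c >= min_count:  # prune low-frequency edges to reduce noise
            adj[u, v] = adj[v, u] = c
    return adj
```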

(b) Semantic Similarity Graph \(\mathcal{G}_s\) (Positive Prior)

Identifies tokens that are semantically similar (akin to "synonyms") in the context of image synthesis.

  • Core assumption: If two tokens exhibit similar positional distributions across a large number of images, they likely express similar semantics
  • A positional distribution vector of length \(N\) is constructed for each token (each entry represents the frequency of the token appearing at a specific position)
  • Jensen-Shannon divergence is used to measure distributional similarity
  • Each token retains its top-2 most similar tokens, forming a directed graph
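The positional-distribution comparison can be sketched as follows. The paper specifies JS divergence and top-2 retention; everything else here (function names, histogram layout, the `eps` smoothing) is an illustrative assumption.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = p + eps, q + eps          # smooth to avoid log(0)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def build_similarity_graph(token_sequences, codebook_size, seq_len, top_k=2):
    """Directed top-k similarity graph from per-token positional histograms.

    Each token gets a length-N histogram of the positions it occupies;
    tokens with the most similar histograms (lowest divergence) are linked.
    """
    hist = np.zeros((codebook_size, seq_len))
    for seq in token_sequences:                # seq: length-N int array
        for pos, tok in enumerate(seq):
            hist[tok, pos] += 1
    adj = np.zeros((codebook_size, codebook_size), dtype=bool)
    for i in range(codebook_size):
        d = np.array([js_divergence(hist[i], hist[j]) if j != i else np.inf
                      for j in range(codebook_size)])
        for j in np.argsort(d)[:top_k]:        # most similar = lowest divergence
            adj[i, j] = True                   # directed edge i -> j
    return adj
```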

The validation experiment is highly compelling: replacing token (1013) with its most similar token (463) yields a reconstructed image visually indistinguishable from the original (PSNR=35.78); replacing it with the least similar token (149) severely degrades quality (PSNR=18.97).

(c) Position-Token Incompatibility Graph \(\mathcal{G}_p^c\) (Negative Prior)

Identifies tokens that should not appear at specific spatial positions under a given category.

  • For each category \(c\), all training images are scanned to record tokens that never appear at a given position
  • Example: For the "airplane" category, tokens encoding ground/grass textures almost never appear in the upper half of the image
  • Helps the model avoid semantically unreasonable spatial-token combinations
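The "never appears" scan reduces to a boolean occupancy table; a minimal sketch, with the function name and dense-mask layout assumed for illustration:

```python
import numpy as np

def build_incompatibility_sets(token_sequences, labels, codebook_size,
                               seq_len, num_classes):
    """For each class c and position i, the tokens never observed there.

    Returns incompat[c, i, t] == True when token t never appears at
    position i in any training image of class c.
    """
    seen = np.zeros((num_classes, seq_len, codebook_size), dtype=bool)
    for seq, c in zip(token_sequences, labels):
        seen[c, np.arange(seq_len), seq] = True  # mark observed (pos, token) pairs
    return ~seen                                 # incompatible = never seen
```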

2. Graph-aware Encoder

Positive prior processing: two independent 3-layer GCNs extract global token representations: \[C_{co} = f_{\theta_{co}}(\mathcal{G}_{co}, C), \quad C_s = f_{\theta_s}(\mathcal{G}_s, C)\] where \(C\) denotes the VQ-VAE codebook embeddings.
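A standard GCN stack of this form can be sketched as below; the symmetric normalization and ReLU are conventional GCN choices assumed here, not details confirmed by the paper.

```python
import numpy as np

def gcn_layer(adj, h, W):
    """One GCN layer with self-loops and symmetric normalization:
    H' = relu(D^{-1/2} (A + I) D^{-1/2} H W)."""
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # degree^-1/2
    norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    return np.maximum(norm @ h @ W, 0.0)            # propagate + ReLU

def graph_encoder(adj, codebook, weights):
    """Stack of GCN layers over a prior graph: a sketch of f_theta(G, C)
    with three weight matrices for the paper's 3-layer setting."""
    h = codebook
    for W in weights:
        h = gcn_layer(adj, h, W)
    return h
```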

Negative prior processing: for each position \(i\) under category \(c\), the mean embedding of the incompatible token set \(\mathcal{I}_i^c\) is aggregated: \[p_i^c = \frac{1}{|\mathcal{I}_i^c|}\sum_{t \in \mathcal{I}_i^c} C_t W\] yielding positional embeddings \(P^c \in \mathbb{R}^{N \times d}\) that encode spatial constraints.
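The aggregation above is a masked mean over projected codebook entries; a minimal numpy sketch for one class, with the function signature assumed:

```python
import numpy as np

def negative_position_embeddings(incompat_c, codebook, W):
    """Mean embedding of incompatible tokens per position, for one class c.

    incompat_c: (N, K) boolean mask, True where token t is incompatible
                with position i; codebook: (K, d_c) embeddings C;
    W:          (d_c, d) projection. Computes p_i^c = mean_t C_t W.
    """
    proj = codebook @ W                                   # (K, d)
    counts = incompat_c.sum(axis=1, keepdims=True).clip(min=1)
    return (incompat_c @ proj) / counts                   # (N, d) = P^c
```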

3. Lightweight Fusion Mechanism

Additive fusion (positive prior): augments unmasked token representations before each transformer layer: \[Z_{\overline{M}}^l = Z_{\overline{M}}^l + f_{pos}^l(C_{co}[Z_{\overline{M}}]) + f_{pos}^l(C_s[Z_{\overline{M}}])\]

Subtractive fusion (negative prior): suppresses incompatible token features at masked positions in each layer: \[Z_M^l = Z_M^l - \alpha f_{neg}^l(P^c)\]

Both \(f_{pos}\) and \(f_{neg}\) are implemented with zero convolution, ensuring no interference with existing knowledge at the start of training.
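Putting the two fusion rules together, one layer can be sketched as below. This is a simplified single-image numpy version: the zero-initialized projection stands in for the zero convolution, and treating \(f_{pos}\), \(f_{neg}\) as plain linear maps is an assumption.

```python
import numpy as np

def zero_linear(d_in, d_out):
    """Zero-initialized projection (ControlNet-style 'zero convolution'):
    the newly added branch contributes nothing at the start of training."""
    return np.zeros((d_in, d_out))

def fuse_layer(z, mask, c_co, c_s, p_c, W_pos, W_neg, alpha=1.0):
    """One layer of the addition-subtraction fusion (sketch).

    z:    (N, d) token features; mask: (N,) True at masked positions.
    c_co, c_s: (N, d) positive-prior features gathered for each token.
    p_c:  (N, d) per-position negative-prior embeddings for class c.
    """
    unmasked = (~mask)[:, None]
    masked = mask[:, None]
    z = z + unmasked * (c_co @ W_pos + c_s @ W_pos)  # positive: add at visible tokens
    z = z - masked * (alpha * (p_c @ W_neg))         # negative: subtract at masked slots
    return z
```

With zero-initialized weights, the output equals the input, so training starts from the frozen backbone's behavior exactly.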

Loss & Training

  • Standard MIG training objective (negative log-likelihood of masked tokens)
  • Backbone is frozen; only classification layers and newly added parameters are fine-tuned
  • Graph features can be precomputed and stored; inference involves only lightweight addition and subtraction operations
  • Validated on three backbones: MaskGIT, AutoNAT, and TiTok

Key Experimental Results

Main Results

ImageNet-256 Class-conditional Generation

| Model | Type | Params | FID↓ | IS↑ | Prec↑ | Rec↑ |
|---|---|---|---|---|---|---|
| MaskGIT | MIG | 227M | 6.18 | 182.1 | 0.80 | 0.52 |
| MaskGIT-KA | MIG | 245M | 5.69 | 170.2 | 0.81 | 0.50 |
| AutoNAT | MIG | 194M | 2.68 | 278.8 | - | - |
| AutoNAT-KA | MIG | 211M | 2.45 | 274.1 | 0.82 | 0.56 |
| TiTok-b64 | MIG | 177M | 2.48 | 214.7 | - | - |
| TiTok-b64-KA | MIG | 194M | 2.40 | 217.0 | 0.78 | 0.60 |
| TiTok-s128 | MIG | 177M | 1.97 | 281.8 | - | - |
| TiTok-s128-KA | MIG | 194M | 1.90 | 271.9 | 0.78 | 0.61 |
| VAR-d20 | AR | 600M | 2.57 | 302.6 | 0.83 | 0.56 |
| LDM-4 | Diff. | 400M | 3.60 | 247.7 | - | - |

MS-COCO Text-to-Image Generation

| Method | FID↓ | CLIP-Score↑ |
|---|---|---|
| MaskGen | 22.27 | 25.58 |
| MaskGen + KA (Ours) | 21.01 | 26.10 |

Ablation Study

| Configuration | FID↓ | IS↑ | Note |
|---|---|---|---|
| AutoNAT (baseline) | 2.68 | 278.8 | |
| + \(\mathcal{G}_s\) only | 2.49 | 279.6 | Semantic similarity graph contributes most to FID |
| + \(\mathcal{G}_p\) only | 2.51 | 285.6 | Position incompatibility graph yields the largest IS gain |
| + \(\mathcal{G}_{co}\) only | 2.51 | 282.1 | Co-occurrence graph also effective |
| + \(\mathcal{G}_s\) + \(\mathcal{G}_p\) | 2.46 | 279.9 | Pairwise combinations yield further gains |
| + \(\mathcal{G}_{co}\) + \(\mathcal{G}_p\) | 2.46 | 280.7 | |
| + \(\mathcal{G}_{co}\) + \(\mathcal{G}_s\) | 2.48 | 277.4 | |
| + All three (KA-MIG) | 2.45 | 274.1 | Best FID |

Efficiency Analysis:

| Graph Type | Online Parameters | Precomputed Parameters | Online TFLOPs |
|---|---|---|---|
| \(\mathcal{G}_{co}\) | +16M | +0.79M | ~0 |
| \(\mathcal{G}_s\) | +16M | +0.79M | ~0 |
| \(\mathcal{G}_p\) | +15M | +196M | +0.06 |

Optimal strategy: precompute \(\mathcal{G}_{co}\) and \(\mathcal{G}_s\) (lightweight), compute \(\mathcal{G}_p\) online (avoiding per-class storage overhead).

Key Findings

  1. All three graphs are individually effective and mutually complementary: each yields improvements in isolation, with further gains when combined.
  2. \(\mathcal{G}_s\) contributes most to FID: learning interchangeable token patterns enhances both robustness and diversity.
  3. Longer sequences benefit more: MaskGIT/AutoNAT (256 tokens) gain more than TiTok (64/128 tokens), as token dependencies are more complex in longer sequences.
  4. Only ~20M additional parameters: the lightweight design incurs minimal inference overhead.
  5. AutoNAT-KA (2.45 FID) outperforms larger models such as LlamaGen-XL (2.62) and VAR-d20 (2.57).

Highlights & Insights

  • Precise problem formulation: The paper observes that MIG improvements have almost exclusively targeted sampling strategies and is the first to systematically address internal representational capacity.
  • Data-driven prior knowledge mining is highly practical: no external annotations or hand-crafted rules are required; knowledge is extracted purely from statistical patterns in training data.
  • The assumption that "similar positional distributions imply semantic similarity" is simple yet effective, supported by compelling validation experiments (token substitution reconstruction).
  • Additive fusion for positive priors and subtractive fusion for negative priors offers clear design intuition and straightforward implementation.
  • Fully decoupled from the backbone: graph features can be precomputed, and only a small number of parameters need fine-tuning with the backbone frozen, making the approach highly practical.

Limitations & Future Work

  • All three graphs are static (derived from training data statistics) and are not dynamically updated during training.
  • The class-conditional nature of \(\mathcal{G}_p\) leads to substantial storage overhead (196M parameters across 1,000 classes); more compact representations could be explored.
  • Improvements on short-sequence models such as TiTok are relatively limited; applicability to next-generation compact VQ methods remains to be seen.
  • The combination of all three graphs does not always yield the best IS score (274 vs. baseline 278), suggesting possible information redundancy.
  • More expressive graph network architectures (e.g., GAT, GraphSAGE) are not explored; the current 3-layer GCN may be insufficient.
  • Systematic evaluation at higher resolutions (512×512) is absent.
Connections & Extensions

  • Complementary to MaskGIT-SAG (self-guided sampling) and Halton sampling: those improve sampling while KA-MIG improves representations, so the two can be combined.
  • The co-occurrence modeling paradigm of graph neural networks in recommender systems is cleverly transferred to the visual-token domain.
  • The "negative prior" (position-token incompatibility graph) conceptually parallels hard negative mining in contrastive learning.
  • The approach may inspire similar token-prior injection in autoregressive image generation models (e.g., LlamaGen, VAR).
  • The use of zero convolution follows the ControlNet design philosophy: newly added modules do not perturb pretrained knowledge at initialization.

Rating

  • Novelty: ⭐⭐⭐⭐ — The construction of prior knowledge graphs is innovative, though the broader idea of mining statistical patterns from data is relatively straightforward.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three backbone networks, detailed ablations, efficiency analysis, and visual validation; very comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with abundant illustrations.
  • Value: ⭐⭐⭐⭐ — Opens a new direction for MIG improvement; the lightweight plug-and-play design is highly practical.