Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features¶
Conference: NeurIPS 2025 | arXiv: 2509.16629 | Code: https://github.com/Catchxu/CAPE | Area: Causal Inference / Transformer | Keywords: positional encoding, causal structure learning, hyperbolic embedding, rotary position encoding, multi-omics
TL;DR¶
CAPE learns the causal DAG among the features of a tabular dataset, embeds it in hyperbolic space, and converts the embedding into causality-aware rotary positional encodings (RoPE). This lets Transformers process non-sequential but causally structured features, yielding significant gains on downstream multi-omics tasks.
Background & Motivation¶
Background: Positional encodings in Transformers (e.g., sinusoidal, RoPE) assume data with a natural sequential order (word order, spatial arrangement of image patches), achieving great success in NLP and CV.
Limitations of Prior Work: Many real-world datasets (e.g., gene expression, proteomics, economic indicators) have features without a predefined sequential order, yet exhibit complex causal relationships. Existing positional encoding methods cannot capture such non-sequential causal structures.
Key Challenge: Existing methods either ignore inter-feature causal relationships (e.g., sorting by expression level) or use static pre-trained embeddings as surrogate positional encodings, neither of which genuinely exploits causal structural information among features.
Goal: How to generate positional encodings for non-sequential yet causally related features, such that the self-attention mechanism of Transformers becomes causality-aware?
Key Insight: Inspired by special relativity—causal connections correspond to relative positions in hyperbolic spacetime—the causal graph is embedded into hyperbolic space, naturally preserving two key properties: causal strength and causal specificity.
Core Idea: Learn the causal DAG among features → embed into hyperbolic space → convert to rotary positional encodings, so that attention scores decay with causal distance.
Method¶
Overall Architecture¶
CAPE is a three-step framework: the input is an \(N \times M\) tabular dataset \(\bm{X}\) (\(N\) observations, \(M\) non-sequential features), and the output is a causality-aware rotary positional encoding \(\bm{\varphi}_{v_j}\) for each feature \(v_j\). The three steps are: (1) causal structure learning → weighted DAG; (2) hyperbolic space embedding → preserving causal properties; (3) conversion to rotary form → injection into Transformer.
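Viewed end to end, the three steps compose into a simple interface. The following toy skeleton replaces each learned component with a trivial stand-in so the data flow and shapes are concrete; all names and values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def cape_positional_encoding(X, learn_dag, embed_dag, to_angles):
    """Skeleton of the CAPE pipeline: X is an (N, M) table; returns one
    rotation-angle vector per feature, usable as rotary positional encoding."""
    A = learn_dag(X)       # Step I:   (M, M) weighted DAG adjacency
    P = embed_dag(A)       # Step II:  (M, d+1) hyperboloid embeddings
    return to_angles(P)    # Step III: (M, d) rotation angles

# Trivial stand-ins so the skeleton runs end to end (not the real method):
N, M, d = 100, 5, 3
rng = np.random.default_rng(0)
phi = cape_positional_encoding(
    rng.normal(size=(N, M)),
    # "Learned" DAG: a fixed upper-triangular adjacency (always acyclic).
    learn_dag=lambda X: np.triu(np.ones((X.shape[1], X.shape[1])), k=1),
    # "Embedding": every node placed at the hyperboloid origin (1, 0, ..., 0).
    embed_dag=lambda A: np.hstack([np.ones((A.shape[0], 1)),
                                   np.zeros((A.shape[0], d))]),
    # Poincare map followed by the c = pi/4 angle scaling.
    to_angles=lambda P: (np.pi / 4) * P[:, 1:] / (1 + P[:, :1]),
)
print(phi.shape)   # (5, 3)
```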
Key Designs¶
- Causal Structure Learning (Step I):
- Function: Learn the causal structure among features from observed data \(\bm{X}\), represented as a weighted adjacency matrix \(\bm{A}\).
- Mechanism: A nonlinear structural equation model (SEM) is formulated in a VAE framework—encoder \(\bm{Z} = f(\bm{X})(\bm{I} - \bm{A})\), decoder \(\bm{X} = f^{-1}(\bm{Z}(\bm{I}-\bm{A})^{-1})\). Joint optimization via ELBO + sparsity regularization \(\|\bm{A}\|_1\) + acyclicity constraint \(\text{tr}(e^{\bm{A} \odot \bm{A}}) - M = 0\).
- Design Motivation: Nonlinear SEM captures complex causal relationships; continuous optimization of the acyclicity constraint avoids combinatorial search; threshold \(\tau\) prunes noisy edges.
- Hyperbolic Space Embedding (Step II):
- Function: Embed the causal DAG into hyperbolic space (hyperboloid model), generating a \((d+1)\)-dimensional embedding \(\bm{p}_{v_j}\) for each node.
- Mechanism: Embeddings are optimized via regularized graph contrastive learning. The contrastive loss \(\mathcal{L}_{\text{con}}\) pulls causally connected nodes closer (positive samples are \(k\)-hop neighbors) and pushes unrelated nodes apart, weighted by \(|\bm{A}_{mn}|\) (causal strength). The regularization term \(\Omega = \pi_{v_m} d_l(\bm{p}_{v_m}, \bm{p}_o)\) uses PageRank scores \(\pi\) to penalize causally generic nodes (high out-degree root nodes), forcing them toward the origin \(\bm{p}_o\).
- Design Motivation: Hyperbolic space is naturally suited for modeling tree-like/DAG structures. Two key properties are preserved—causal strength (proximity ↔ strong causal relationship) and causal specificity (distance from origin ↔ specificity of leaf nodes). Riemannian SGD is used for optimization on the manifold.
- Rotary Positional Encoding Conversion (Step III):
- Function: Map hyperbolic embeddings to the Poincaré ball and convert them to rotary form for injection into the Transformer.
- Mechanism: The diffeomorphism \(f_d: \mathcal{H}^d \to \mathcal{B}^d\) first maps embeddings to the Poincaré ball to obtain \(\bm{e}_{v_j}\); then \(\bm{\varphi}_v = c \cdot \bm{e}_v\) (with \(c=\pi/4\)) serves as rotation angles to construct a block-diagonal rotation matrix \(\bm{R}(\bm{\varphi}_v)\). Attention is computed as \(\mathcal{A} = (\bm{q}_{v_m}^i)^\top \bm{R}(\bm{\varphi}_{v_n} - \bm{\varphi}_{v_m}) \bm{k}_{v_n}^i\).
- Design Motivation: The bounded coordinates of the Poincaré ball map naturally to bounded rotation angles, making it well suited to rotary encoding; the rotary form is compatible with linear self-attention, and the relative positional encoding depends only on the causal angle difference \(\bm{\varphi}_{v_n} - \bm{\varphi}_{v_m}\).
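Steps II and III can be sketched numerically. Below is a minimal numpy illustration, assuming the standard Lorentz-model distance and hyperboloid-to-Poincaré map \(\bm{e} = \bm{p}_{1:}/(1 + p_0)\), with RoPE-style paired-dimension rotations; all values are toy stand-ins, not the paper's implementation:

```python
import numpy as np

def lorentz_distance(p, q):
    """Hyperbolic distance on the hyperboloid: arccosh of minus the Lorentz inner product."""
    inner = -p[0] * q[0] + p[1:] @ q[1:]
    return np.arccosh(np.clip(-inner, 1.0, None))

def hyperboloid_to_poincare(p):
    """Diffeomorphism H^d -> B^d (Poincare ball): e = p[1:] / (1 + p[0])."""
    return p[1:] / (1.0 + p[0])

def rotation_matrix(phi):
    """Block-diagonal R(phi): one 2x2 rotation per angle, as in RoPE."""
    d = len(phi)
    R = np.zeros((2 * d, 2 * d))
    for i, a in enumerate(phi):
        c, s = np.cos(a), np.sin(a)
        R[2 * i: 2 * i + 2, 2 * i: 2 * i + 2] = [[c, -s], [s, c]]
    return R

def attention_score(q, k, phi_m, phi_n):
    """Relative rotary score q^T R(phi_n - phi_m) k."""
    return q @ rotation_matrix(phi_n - phi_m) @ k

# Lift a 2-D point onto the hyperboloid H^2, map to B^2, scale to angles.
x = np.array([0.3, -0.2])
p = np.concatenate([[np.sqrt(1.0 + x @ x)], x])   # satisfies -p0^2 + |x|^2 = -1
e = hyperboloid_to_poincare(p)                    # lies inside the unit ball
phi = (np.pi / 4) * e                             # rotation angles, c = pi/4

origin = np.array([1.0, 0.0, 0.0])                # hyperboloid origin
d_to_origin = lorentz_distance(p, origin)         # distance used by Omega

q = np.ones(4); k = np.ones(4)
s_same = attention_score(q, k, phi, phi)          # zero angle difference
s_far = attention_score(q, k, phi, phi + 0.8)     # larger "causal distance"
# With identical angles the rotation cancels and s_same is the plain dot
# product; s_far is smaller, i.e. attention decays as the angle (causal)
# difference between two features grows.
```

With identical angles the rotation cancels, recovering the standard dot-product score, so causally close features are attenuated least.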
Loss & Training¶
- Step I: \(\mathcal{L}_{\text{DAG}} = -\mathcal{L}_{\text{ELBO}} + \lambda_s \|\bm{A}\|_1 + \frac{\rho}{2}|h(\bm{A})|^2 + \alpha h(\bm{A})\), solved via augmented Lagrangian method.
- Step II: \(\mathcal{L}_{\mathcal{H}} = \frac{1}{M} \sum_j \left[ \mathcal{L}_{\text{con}}(\bm{p}_{v_j}) + \lambda_g \Omega(\bm{p}_{v_j}) \right]\), optimized with Riemannian SGD.
- Step III: No additional trainable parameters; encoding is obtained via direct mapping.
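The acyclicity term \(h(\bm{A}) = \text{tr}(e^{\bm{A} \odot \bm{A}}) - M\) in the Step I loss is easy to verify on toy graphs: it is zero exactly when the weighted graph has no directed cycles. A small numpy sketch (truncated-series matrix exponential, fine for small \(M\); edge weights are illustrative):

```python
import numpy as np

def acyclicity(A, terms=20):
    """NOTEARS-style constraint h(A) = tr(exp(A * A)) - M, zero iff the
    weighted graph encoded by A is a DAG. The matrix exponential is
    approximated by a truncated Taylor series (adequate for small M)."""
    M = A.shape[0]
    B = A * A                      # elementwise square: nonnegative weights
    term = np.eye(M)
    total = np.zeros((M, M))
    for k in range(terms):
        total += term
        term = term @ B / (k + 1)  # next Taylor term B^(k+1) / (k+1)!
    return np.trace(total) - M

# A DAG with edges 0 -> 1 -> 2 satisfies the constraint exactly ...
dag = np.array([[0.0, 0.8, 0.0],
                [0.0, 0.0, 0.5],
                [0.0, 0.0, 0.0]])
# ... while adding the back-edge 2 -> 0 creates a cycle, so h > 0.
cyc = dag.copy()
cyc[2, 0] = 0.9

print(acyclicity(dag))   # ~0.0
print(acyclicity(cyc))   # > 0
```

During training, this penalty enters the augmented Lagrangian \(\frac{\rho}{2}|h(\bm{A})|^2 + \alpha h(\bm{A})\), driving the learned adjacency toward an exact DAG.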
Key Experimental Results¶
Main Results¶
The gene perturbation prediction (GPP) task is evaluated on scRNA-seq datasets using two Transformer backbones, scBERT and scGPT:
| Model | Positional Encoding | Single-gene Perturbation MSE | Double-gene Perturbation MSE |
|---|---|---|---|
| scBERT | Static absolute PE (default) | 0.224 | 0.230 |
| scBERT | Learnable relative PE | 0.219 (−0.005) | 0.215 (−0.015) |
| scBERT | CAPE | 0.193 (−0.031) | 0.189 (−0.041) |
| scGPT | Learnable absolute PE (default) | 0.202 | 0.201 |
| scGPT | Learnable relative PE | 0.195 (−0.007) | 0.204 (+0.003) |
| scGPT | CAPE | 0.182 (−0.020) | 0.176 (−0.025) |
CAPE reduces MSE by an average of 11.1%, whereas causality-agnostic relative PE reduces it by only 2.7%.
Ablation Study¶
Using scGPT as the backbone:
| Configuration | Single-gene Perturbation MSE | Double-gene Perturbation MSE |
|---|---|---|
| CAPE (full) | 0.182 (±0.005) | 0.176 (±0.008) |
| CAPE-null (no PE) | 0.234 (±0.014) | 0.238 (±0.017) |
| CAPE-w/o-CSL (no causal learning) | 0.209 (±0.010) | 0.213 (±0.011) |
| CAPE-w/o-hyperbolic (Euclidean substitute) | 0.192 (±0.008) | 0.196 (±0.008) |
| CAPE-w/o-rotary (additive PE) | 0.201 (±0.009) | 0.208 (±0.010) |
Key Findings¶
- Causal structure learning (CSL) is the most critical component; removing it increases double-gene MSE from 0.176 to 0.213 (+21%).
- The rotary form is also important, with a notable performance gap over additive PE.
- The contribution of hyperbolic space modeling is moderate but consistently positive, indicating that curvature-aware optimization does help reflect causal graph structure more faithfully.
- Theoretical properties are validated on synthetic data: attention decays with causal distance, decays with causal generality, and is robust to positional perturbations.
Highlights & Insights¶
- The unified framework of causal graph → hyperbolic space → RoPE is remarkably elegant, organically integrating methods from three distinct fields: causal discovery, hyperbolic geometry, and rotary attention. The design of directly mapping causal distance to attention decay is highly natural.
- Theoretical analysis is rigorous: three key properties are proven (causal distance decay, causal generality decay, and robustness), providing mathematical guarantees for the method's effectiveness.
- Transferable design paradigm: any Transformer application involving non-sequential features (e.g., feature interactions in recommender systems, knowledge graph nodes, atoms in molecular graphs) can draw inspiration from this "structure → hyperbolic embedding → rotary encoding" paradigm.
Limitations & Future Work¶
- The causal structure learning component assumes acyclicity (DAG), making it inapplicable to causal systems with feedback loops.
- Validation is currently limited to biological omics data; evaluation on other non-sequential causal domains such as economics and social sciences is absent.
- The accuracy of causal learning is sensitive to data volume and noise; DAG learning may be unreliable in small-sample, high-dimensional settings.
- Computational complexity: the three-stage pipeline of causal learning + hyperbolic embedding + Transformer training may be slow, though the complexity analysis in the appendix indicates the overhead remains acceptable.
Related Work & Insights¶
- vs. RoPE / absolute PE and other standard methods: These methods assume a predefined order; CAPE extends positional encoding to unordered causally structured features.
- vs. default PE in scGPT / scBERT: These models use expression-level-sorted order or pre-trained representations as surrogate PE, fundamentally ignoring causal relationships.
- vs. NOTEARS / DAG-GNN: CAPE borrows from these causal discovery methods, but its innovation lies in converting the learned DAG into positional encodings rather than treating it as an end goal.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The combination of causal graph → hyperbolic embedding → RoPE is highly novel; the cross-domain methodological fusion is particularly ingenious.
- Experimental Thoroughness: ⭐⭐⭐⭐ Synthetic and multi-omics experiments are comprehensive, though broader domain coverage would be beneficial.
- Writing Quality: ⭐⭐⭐⭐⭐ Mathematical derivations are rigorous, exposition is clear, and figures are intuitive.
- Value: ⭐⭐⭐⭐ Opens a new direction for Transformer modeling of non-sequential data, though the practical application scope is currently somewhat narrow.