
Sentient: Detecting APTs Via Capturing Indirect Dependencies and Behavioral Logic

Conference: AAAI 2026 | arXiv: 2502.06521 | Code: None | Area: Graph Learning / Cybersecurity | Keywords: APT Detection, Provenance Graph, Graph Transformer, Mamba, Behavioral Intent Analysis

TL;DR

This paper proposes Sentient, an APT detection method combining Graph Transformer pre-training and bidirectional Mamba2 intent analysis. Trained exclusively on benign data, it captures indirect dependencies, removes contextual noise, and correlates behavioral logic, achieving an average 44% reduction in false positive rate across three standard benchmarks.

Background & Motivation

  1. Background: Advanced Persistent Threats (APTs) are notoriously difficult to detect due to their stealthiness and complexity. Provenance graph-based methods represent the current state of the art, leveraging entity relationships in system audit logs to uncover attack traces.
  2. Limitations of Prior Work: (a) Missing indirect dependencies — GNN-based methods are constrained by the receptive field of neighborhood aggregation, failing to capture relationships between non-directly connected nodes; (b) Noise in complex scenarios — infected entities continue to perform numerous benign tasks, causing neighborhood aggregation to erroneously incorporate weakly related activities; (c) Missing behavioral logic correlation — isolated system behaviors exhibit contextual ambiguity (e.g., sshd writing a log appears normal in isolation), yet their combination reveals malicious intent.
  3. Key Challenge: GNN local aggregation cannot reach indirect dependencies, introduces noise through indiscriminate neighbor aggregation, and is unable to establish logical correlations between distant behaviors.
  4. Goal: Design a globally aware APT detection method capable of understanding behavioral logic.
  5. Key Insight: Employ global attention in Graph Transformers to capture indirect dependencies, construct denoised behavior sequences via random walks, and leverage bidirectional Mamba2 to mine logical correlations among behaviors.
  6. Core Idea: Graph Transformer for global node embeddings + bidirectional Mamba2 for intent logic over behavior sequences = addressing the three challenges of indirect dependencies, noise, and logical correlation.

Method

Overall Architecture

Five components: (1) Graph Construction — builds a provenance graph from system logs, initializing nodes with Word2Vec semantic encoding and Laplacian positional encoding; (2) Pre-training — a Graph Transformer reconstructs key node information to learn globally structured semantic embeddings; (3) Intent Analysis Module (IAM) — random walks construct behavior sequences, and bidirectional Mamba2 mines logical correlations; (4) Threat Detection — an MLP reconstructs behavioral actions, with behaviors whose reconstruction error exceeds a threshold flagged as malicious; (5) Attack Investigation — clusters behaviors with similar intent.
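
To make the node-initialization step of components (1)-(2) concrete, here is a minimal sketch under stated assumptions: a toy networkx graph stands in for the log-derived provenance graph, random vectors stand in for the Word2Vec semantic encodings \(\alpha\), and \(\tanh\) is used as a placeholder for \(\sigma\). All identifiers are illustrative, not the authors' implementation.

```python
import numpy as np
import networkx as nx

# Toy provenance graph standing in for the log-derived graph of component (1).
g = nx.Graph([("proc:bash", "proc:sshd"), ("proc:sshd", "file:auth.log"),
              ("proc:sshd", "sock:22")])
k, d_sem, d_model = 2, 8, 16

# Laplacian positional encoding beta: the k smallest non-trivial eigenvectors.
L = nx.normalized_laplacian_matrix(g).toarray()
_, eigvecs = np.linalg.eigh(L)
beta = eigvecs[:, 1:k + 1]                  # one row per node

alpha = np.random.randn(len(g), d_sem)      # placeholder for Word2Vec vectors
A = np.random.randn(d_sem, d_model); a = np.zeros(d_model)
B = np.random.randn(k, d_model);     b = np.zeros(d_model)

# h_i^0 = sigma((A alpha + a) + (B beta + b)), one row per node.
h0 = np.tanh(alpha @ A + a + beta @ B + b)
```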

Key Designs

  1. Graph Transformer Pre-training

    • Function: Learns global node embeddings that capture indirect dependencies, circumventing the receptive field limitations of GNNs.
    • Mechanism: The initial embedding \(h_i^0 = \sigma((A^0\alpha + a^0) + (B^0\beta + b^0))\) combines semantic encoding \(\alpha\) (Word2Vec) and positional encoding \(\beta\) (Laplacian eigenvectors). Multi-head attention allows each node to attend to all others in the graph, with weights \(w_{ij} = \text{softmax}_j\big((Q h_i)^\top (K h_j) / \sqrt{d_k}\big)\), and residual connections plus an FFN produce the final embeddings. The pre-training objective is node type reconstruction, with weighted cross-entropy to handle class imbalance.
    • Design Motivation: Attack behaviors in provenance graphs involve multi-hop relationships (e.g., file read → execution → network transmission), which GNNs require multiple layers to reach — incurring over-smoothing in deep settings. The global attention of Graph Transformers resolves this in a single pass.
  2. Intent Analysis Module (IAM)

    • Function: Mines logical correlations among behaviors in a denoised context to understand behavioral intent.
    • Mechanism: Using pre-trained embeddings \(h\), random walks over the provenance graph construct behavior sequences \(\lambda_i = \{e_1, ..., e_W\}\), where each behavior \(e_t\) is represented as the concatenation of source and target node embeddings \([h_{\phi(e_t)}; h_{\psi(e_t)}]\). Random walks naturally build a target-node-centric local context, filtering irrelevant neighbors (denoising). Bidirectional Mamba2 then processes the sequence: \(\lambda^{\ell+1} = \mathbf{F}(\mathbf{E}(\lambda^\ell) + \mathcal{R}(\mathbf{E}(\mathcal{R}(\lambda^\ell))), \lambda^\ell)\), where \(\mathcal{R}\) denotes sequence reversal and \(\mathbf{E}\) denotes the Mamba2 state space operation, so that both forward and backward contextual logic are captured (see the sketch after this list).
    • Design Motivation: Isolated behaviors appear benign but reveal malicious intent only in combination. Mamba2's long-sequence modeling capability surpasses that of RNNs, and its linear complexity suits large-scale log processing. Bidirectional modeling is necessary because attack behaviors may depend on both preceding and subsequent context.
  3. Threat Detection and Attack Investigation

    • Function: Detects anomalies based on deviation from benign patterns and clusters attack behaviors into attack narratives.
    • Mechanism: During training, key behavioral information (read/write/execute) is masked so the model learns to reconstruct benign behavior patterns. During inference, behaviors whose reconstruction error \(RE = \text{CrossEntropy}(\mathbf{P}(a_t), L(a_t))\) exceeds the threshold (mean + 1.5 standard deviations) are flagged as malicious. For attack investigation, the behavioral intent embedding \(h_e\) concatenated with the source/target node embeddings is clustered via K-means, \(C_k = \{e_i \mid k = \arg\min_j \|h_{\text{behavior}}^{(i)} - \mu_j\|^2\}\), merging alerts with similar intent to reduce analyst burden.
    • Design Motivation: Training solely on benign data avoids the scarcity of attack samples. Reconstruction error naturally quantifies the degree of behavioral anomaly.
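
To ground design (2), the sketch below builds a behavior sequence via a random walk (each behavior is the concatenation of source and target embeddings) and applies a bidirectional block computing \(\mathbf{E}(\lambda) + \mathcal{R}(\mathbf{E}(\mathcal{R}(\lambda)))\). It assumes a networkx DiGraph with precomputed node embeddings, and an nn.GRU stands in for the Mamba2 layer the paper actually uses; names and shapes are illustrative only.

```python
import random

import networkx as nx
import torch
import torch.nn as nn

def random_walk_sequence(graph, start, walk_len, emb):
    """Build one behavior sequence via a random walk from `start`;
    each behavior e_t is [h_source ; h_target]."""
    seq, node = [], start
    for _ in range(walk_len):
        nbrs = list(graph.successors(node))
        if not nbrs:
            break
        nxt = random.choice(nbrs)
        seq.append(torch.cat([emb[node], emb[nxt]], dim=-1))
        node = nxt
    return torch.stack(seq) if seq else None

class BidirectionalBlock(nn.Module):
    """Computes E(x) + R(E(R(x))) and fuses it with the input x;
    nn.GRU is a stand-in for the Mamba2 state-space layer."""
    def __init__(self, d_model):
        super().__init__()
        self.fwd = nn.GRU(d_model, d_model, batch_first=True)
        self.bwd = nn.GRU(d_model, d_model, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)  # plays the role of F(., x)

    def forward(self, x):                       # x: (batch, W, d_model)
        h_f, _ = self.fwd(x)                    # E(x)
        h_b, _ = self.bwd(torch.flip(x, [1]))   # E(R(x))
        h = h_f + torch.flip(h_b, [1])          # E(x) + R(E(R(x)))
        return self.fuse(torch.cat([h, x], dim=-1))

# Toy usage: a three-node provenance fragment with random 16-d embeddings.
g = nx.DiGraph([("proc:bash", "proc:sshd"), ("proc:sshd", "file:auth.log")])
emb = {n: torch.randn(16) for n in g.nodes}
seq = random_walk_sequence(g, "proc:bash", walk_len=4, emb=emb)
if seq is not None:
    out = BidirectionalBlock(d_model=32)(seq.unsqueeze(0))  # 32 = 2 * 16
```

The fusion layer here only plays the role of \(\mathbf{F}\); the paper's exact combination operator and the Mamba2 internals are not reproduced.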

Loss & Training

The pre-training loss is a weighted cross-entropy over node type reconstruction; the detection loss is a cross-entropy over behavior type reconstruction. The anomaly threshold is set to the mean plus 1.5 standard deviations of the reconstruction errors observed on benign training data.
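
As a small illustration of the detection rule, the snippet below derives the threshold from a placeholder distribution of benign reconstruction errors; `benign_errors` and `flag_malicious` are hypothetical names, not from the paper.

```python
import numpy as np

# Placeholder distribution of per-behavior reconstruction errors on benign data.
benign_errors = np.random.gamma(shape=2.0, scale=0.5, size=10_000)

# The paper's rule: threshold = mean + 1.5 * standard deviation.
threshold = benign_errors.mean() + 1.5 * benign_errors.std()

def flag_malicious(errors: np.ndarray) -> np.ndarray:
    """Boolean mask of behaviors whose reconstruction error exceeds the threshold."""
    return errors > threshold
```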

Key Experimental Results

Main Results

Results on Streamspot, Unicorn Wget, and DARPA E3 datasets:

| Dataset | Method | Precision | Recall | F-score | FPR |
| --- | --- | --- | --- | --- | --- |
| Streamspot | Threatrace | 98% | 99% | 98% | 0.4% |
| Streamspot | Sentient | 99% | 99% | 99% | 0.2% |
| Unicorn Wget | Threatrace | 93% | 98% | 95% | 7.4% |
| Unicorn Wget | Sentient | 96% | 99% | 97% | 4.1% |
| DARPA Cadets | Flash | 92% | 99% | 95% | 0.3% |
| DARPA Cadets | Slot | 94% | 96% | 95% | 0.2% |
| DARPA Cadets | Sentient | 96% | 99% | 97% | 0.2% |
| DARPA Theia | Flash | 91% | 99% | 95% | 0.8% |
| DARPA Theia | Sentient | 95% | 99% | 97% | 0.4% |
| DARPA Trace | Flash | 93% | 99% | 96% | 0.4% |
| DARPA Trace | Sentient | 97% | 99% | 98% | 0.2% |

Ablation Study

| Configuration | Precision Change | Notes |
| --- | --- | --- |
| w/o Pre-training (PT) | −20.75% | Loss of indirect dependency information |
| w/o Intent Analysis (IAM) | −31.59% | Loss of behavioral logic correlation; largest impact |
| w/o Laplacian PE | −8.2% | Loss of topological positional information |
| w/o Semantic Encoding | −12.3% | Loss of node attribute semantics |

Key Findings

  • IAM contributes the most — removing it causes a 31.59% drop in precision, underscoring the critical importance of behavioral logic correlation for APT detection.
  • Advantages are most pronounced in complex scenarios (Unicorn Wget, DARPA Theia), where noise and indirect dependencies are more prevalent.
  • Achieving state-of-the-art detection using only benign training data is a significant practical advantage for real-world deployment.
  • Runtime overhead is acceptable: processing one day of logs requires only 63.6 seconds with a peak memory footprint of 2.01 GB.

Highlights & Insights

  • Graph Transformer + Sequential SSM combination: Using Graph Transformer for global representation and Mamba2 for sequential logic correlation addresses long-range dependencies at both the graph and sequence levels. This combination strategy is transferable to other graph-plus-sequence tasks.
  • Random walk as a denoising mechanism: Random walks naturally construct a target-node-centric context that filters irrelevant neighbors — an elegant denoising design.
  • Clustering for attack investigation: Beyond anomaly detection, clustering behaviors with similar intent into "attack stories" substantially reduces the workload of security analysts.

Limitations & Future Work

  • The anomaly threshold (mean + 1.5σ) is heuristically defined; an adaptive threshold may yield better results.
  • The random walk sequence length \(W\) is fixed; adaptive length selection could offer greater flexibility.
  • Robustness under concept drift (i.e., evolving system behavior patterns over time) has not been evaluated.
  • The clustering method for attack investigation is relatively simple (K-means); more sophisticated clustering approaches could generate higher-quality attack narratives.

Comparison with Prior Methods

  • vs. Flash/Threatrace: These methods employ GNN (GraphSAGE) neighborhood aggregation, which fails to capture indirect dependencies and introduces noise. Sentient addresses this via global attention in Graph Transformers.
  • vs. Slot: Uses graph reinforcement learning to adaptively select neighbors, but remains constrained by the GNN receptive field. Sentient bypasses the neighborhood aggregation paradigm entirely.
  • vs. Atlas: Requires attack data for training; Sentient requires only benign data.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of Graph Transformer, bidirectional Mamba2, and random-walk-based denoising is novel
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets covering real and simulated attacks; complete ablation study
  • Writing Quality: ⭐⭐⭐⭐ Problem definition is clear; challenges are illustrated intuitively with figures
  • Value: ⭐⭐⭐⭐ Offers practical deployment value for real-world cybersecurity