TRUST -- Transformer-Driven U-Net for Sparse Target Recovery

Conference: NeurIPS 2025
arXiv: 2506.01112
Code: None
Area: Signal Processing / Image Reconstruction
Keywords: Sparse Recovery, Transformer, U-Net, Inverse Problems, Sensing Matrix Learning

TL;DR

This paper proposes TRUST, an architecture that couples a Transformer encoder with a U-Net-style decoder to jointly learn the sensing operator and reconstruct sparse signals when the sensing matrix is unknown, achieving significant improvements in PSNR and SSIM over conventional methods.

Background & Motivation

In the inverse problem \(\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}\), sparse recovery solves underdetermined systems by exploiting signal sparsity. However, existing methods face the following challenges:

Unknown Sensing Matrix: Most traditional methods assume \(\mathbf{A}\) is known, yet it is often unavailable in practice.

Limited Training Data: Only a small number of observation–target pairs \(\{(\mathbf{x}, \mathbf{y})\}\) are available.

Hallucination Artifacts: Deep learning methods tend to produce hallucinations inconsistent with the true signal.

Local–Global Features: Pure U-Net lacks a global receptive field, while pure Transformer lacks multi-scale local features.
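For reference, the classical ISTA baseline that TRUST is later compared against solves \(\min_{\mathbf{x}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_1\) by alternating a gradient step with soft-thresholding. A minimal NumPy sketch (note that ISTA assumes \(\mathbf{A}\) is known, unlike TRUST's blind setting; problem sizes and \(\lambda\) here are illustrative):

```python
import numpy as np

def ista(y, A, lam=0.1, step=None, n_iter=200):
    """Classical ISTA for min_x 0.5*||y - Ax||_2^2 + lam*||x||_1."""
    if step is None:
        # step = 1/L, where L = ||A||_2^2 is the Lipschitz constant of the gradient
        step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # gradient step on the data-fidelity term
        x = x + step * A.T @ (y - A @ x)
        # soft-thresholding: proximal operator of the l1 penalty
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)
    return x

# Toy underdetermined problem: M=50 measurements, N=200 unknowns, 5 nonzeros
rng = np.random.default_rng(0)
M, N, k = 50, 200, 5
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true + 0.01 * rng.standard_normal(M)

x_hat = ista(y, A, lam=0.02, n_iter=500)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

Because the system is underdetermined (M < N), sparsity is what makes recovery possible at all; the learned methods in this paper replace the hand-tuned iteration with a trained network.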

Method

Overall Architecture

TRUST adopts a hybrid encoder–decoder architecture:

  • Encoder: a Transformer branch that captures long-range dependencies and estimates the sparse support set.
  • Decoder: a U-Net-style decoder that refines the reconstruction through multi-scale feature fusion.
  • Skip Connections: connections from Transformer layers to the decoder.

Key Designs

  1. Transformer Encoding Branch:

    • Multi-head self-attention layers capture global dependencies in the input observations.
    • Features at different levels of abstraction are extracted layer by layer.
    • The sparse support of the signal (i.e., which positions are nonzero) is estimated.
  2. U-Net Decoding Path:

    • Multi-scale transposed convolutions progressively recover spatial resolution.
    • Features from each Transformer layer serve as guidance information.
    • Local detail recovery is refined.
  3. Skip Connection Design:

    • Unlike the symmetric skip connections of a conventional U-Net, TRUST uses asymmetric connections from Transformer encoder layers to decoder layers.
    • This grants the decoder access to image features at several levels of abstraction.
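The data flow described above can be sketched as a toy NumPy forward pass. This illustrates only the encoder → bottleneck → decoder-with-asymmetric-skips pattern, not the paper's implementation: it uses single-head attention, nearest-neighbour upsampling in place of learned transposed convolutions, random weights, and illustrative sizes throughout.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of tokens (rows of X)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return scores @ V

def upsample(X, factor=2):
    """Nearest-neighbour upsampling along the token axis
    (a stand-in for a learned transposed convolution)."""
    return np.repeat(X, factor, axis=0)

rng = np.random.default_rng(0)
T, d = 16, 8                       # 16 tokens, 8-dim features (illustrative)
x = rng.standard_normal((T, d))    # toy "observation" tokens

# --- Transformer encoder branch: residual attention layers,
# --- keeping per-layer features for the skip connections
feats = []
h = x
for _ in range(2):
    W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
    h = h + self_attention(h, *W)
    feats.append(h)

# --- U-Net-style decoder: start from a coarse bottleneck, then
# --- upsample and fuse asymmetric skips from the encoder layers
dec = h[::4]                                           # coarse bottleneck (4 tokens)
for skip in reversed(feats):
    dec = upsample(dec, factor=2)                      # recover resolution
    skip_ds = skip[:: skip.shape[0] // dec.shape[0]]   # match token count
    dec = dec + skip_ds                                # asymmetric skip fusion

print(dec.shape)  # (16, 8): decoder output back at full resolution
```

The key structural point is that the skips are drawn from successive encoder depths, so each decoder stage fuses features at a different level of abstraction.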

Loss & Training

\[\mathcal{L} = \|\hat{\mathbf{x}} - \mathbf{x}\|_2^2 + \lambda_1 \|\hat{\mathbf{x}}\|_1 + \lambda_2 \mathcal{L}_{\text{SSIM}}\]

The three terms correspond to reconstruction error, sparsity regularization, and structural similarity constraint, respectively.
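A minimal sketch of this composite loss in NumPy. The SSIM term here is a simplified global SSIM (no sliding window, unlike the windowed SSIM typically used in practice), folded in as the dissimilarity \(1 - \text{SSIM}\); the weights lam1 and lam2 are illustrative, not the paper's values.

```python
import numpy as np

def ssim_global(a, b, c1=1e-4, c2=9e-4):
    """Simplified global SSIM (single window over the whole signal)."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (va + vb + c2))

def trust_loss(x_hat, x, lam1=0.01, lam2=0.1):
    """Reconstruction + sparsity + structural-similarity terms."""
    rec = np.sum((x_hat - x) ** 2)             # ||x_hat - x||_2^2
    sparsity = np.sum(np.abs(x_hat))           # ||x_hat||_1
    ssim_term = 1.0 - ssim_global(x_hat, x)    # L_SSIM as a dissimilarity
    return rec + lam1 * sparsity + lam2 * ssim_term

# Sanity check: a perfect reconstruction pays only the sparsity penalty
x = np.zeros(32)
x[[3, 10, 20]] = 1.0
perfect = trust_loss(x, x)
noisy = trust_loss(x + 0.1, x)
```

The l1 term pushes the estimate toward sparse solutions even when it matches the target, which is why the sparsity weight must be kept small relative to the reconstruction term.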

Key Experimental Results

Main Results (Sparse Signal Recovery)

| Method | PSNR (dB) ↑ | SSIM ↑ | Reconstruction Time (ms) ↓ | Hallucination Rate (%) ↓ |
|---|---|---|---|---|
| ISTA | 24.3 | 0.712 | 850 | 12.5 |
| LISTA | 27.8 | 0.781 | 120 | 8.3 |
| U-Net | 30.2 | 0.845 | 45 | 15.7 |
| SwinIR | 31.5 | 0.868 | 78 | 9.2 |
| Transformer-only | 31.8 | 0.872 | 92 | 7.8 |
| TRUST | 34.6 | 0.921 | 52 | 3.1 |

Results at Different Compression Ratios

| Compression Ratio (M/N) | U-Net PSNR | Transformer PSNR | TRUST PSNR | TRUST SSIM |
|---|---|---|---|---|
| 0.1 | 22.1 | 23.5 | 26.8 | 0.785 |
| 0.2 | 26.3 | 27.1 | 30.2 | 0.856 |
| 0.3 | 29.5 | 30.2 | 33.1 | 0.905 |
| 0.5 | 32.8 | 33.5 | 36.2 | 0.945 |

Ablation Study

| Configuration | PSNR ↑ | SSIM ↑ |
|---|---|---|
| Full TRUST | 34.6 | 0.921 |
| w/o Skip Connections | 31.8 | 0.875 |
| w/o Transformer Encoder | 30.2 | 0.845 |
| w/o U-Net Decoder | 31.5 | 0.868 |
| w/o Sparsity Regularization | 33.2 | 0.902 |

Key Findings

  1. TRUST demonstrates the most pronounced advantage at low compression ratios (M/N=0.1), indicating that global context is critical under severe undersampling.
  2. Skip connections contribute approximately 2.8 dB PSNR improvement, making them the most critical component.
  3. The hallucination rate drops from 15.7% (U-Net) to 3.1%, confirming the robustness of the hybrid architecture.
  4. Inference speed is close to that of pure U-Net and far faster than traditional iterative methods.

Highlights & Insights

  • Hybrid Architecture Design: The Transformer handles global reasoning while the U-Net handles multi-scale reconstruction, with clearly delineated roles.
  • Hallucination Suppression: Reconstruction is guided by sparse support estimation, effectively mitigating the hallucination problem common in deep learning methods.
  • Blind Sparse Recovery: The sensing matrix \(\mathbf{A}\) need not be known; the model is learned end-to-end, offering strong practical utility.

Limitations & Future Work

  1. Validation is currently limited to 2D signals; extension to 3D volumetric data requires further investigation.
  2. The computational overhead of the Transformer encoder remains substantial for high-resolution signals.
  3. Setting the sparsity hyperparameter requires domain knowledge.
  4. No comparison is made against recent diffusion-based reconstruction methods.

Related Work

  • LISTA (Gregor & LeCun, 2010): Unrolls ISTA into a learnable network.
  • SwinIR: Image restoration based on the Swin Transformer.
  • TransUNet: A Transformer–U-Net hybrid architecture for medical image segmentation, from which this work draws similar inspiration.

Rating

| Dimension | Score (1–5) |
|---|---|
| Novelty | 3 |
| Theoretical Depth | 3 |
| Experimental Thoroughness | 4 |
| Writing Quality | 4 |
| Practical Value | 4 |
| Overall Recommendation | 3.5 |