TRUST -- Transformer-Driven U-Net for Sparse Target Recovery¶
Conference: NeurIPS 2025 · arXiv: 2506.01112 · Code: None · Area: Signal Processing / Image Reconstruction · Keywords: Sparse Recovery, Transformer, U-Net, Inverse Problems, Sensing Matrix Learning
TL;DR¶
This paper proposes TRUST, an architecture that couples a Transformer attention encoder with a U-Net decoder to jointly learn the sensing operator and reconstruct sparse signals when the sensing matrix is unknown, achieving significant PSNR and SSIM gains over conventional methods.
Background & Motivation¶
In the inverse problem \(\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}\), sparse recovery solves underdetermined systems by exploiting signal sparsity. However, existing methods face the following challenges:
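To make the setup concrete, here is a minimal NumPy sketch of the measurement model \(\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}\). The dimensions `N`, `M`, `K` are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 256, 64, 8        # signal length, measurements, nonzeros (M < N: underdetermined)

# K-sparse ground-truth signal: only K of the N entries are nonzero.
x = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
x[support] = rng.standard_normal(K)

A = rng.standard_normal((M, N)) / np.sqrt(M)   # sensing matrix (unknown to the model in TRUST)
w = 0.01 * rng.standard_normal(M)              # measurement noise
y = A @ x + w                                  # observations fed to the network
```

Recovering `x` from `y` alone is ill-posed without the sparsity prior, and harder still when `A` itself must be learned from observation–target pairs.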
Unknown Sensing Matrix: Most traditional methods assume \(\mathbf{A}\) is known, yet it is often unavailable in practice.
Limited Training Data: Only a small number of observation–target pairs \(\{(\mathbf{x}, \mathbf{y})\}\) are available.
Hallucination Artifacts: Deep learning methods tend to produce hallucinations inconsistent with the true signal.
Local–Global Features: Pure U-Net lacks a global receptive field, while pure Transformer lacks multi-scale local features.
Method¶
Overall Architecture¶
TRUST adopts a hybrid encoder–decoder architecture:

- Encoder: A Transformer branch that captures long-range dependencies and estimates the sparse support set.
- Decoder: A U-Net-style decoder that refines the reconstruction through multi-scale feature fusion.
- Skip Connections: Asymmetric skip connections from the Transformer layers to the decoder.
Key Designs¶
- Transformer Encoding Branch:
  - Multi-head self-attention layers capture global dependencies in the input observations.
  - Features at different levels of abstraction are extracted layer by layer.
  - The sparse support of the signal (i.e., which positions are nonzero) is estimated.
- U-Net Decoding Path:
  - Multi-scale transposed convolutions progressively recover spatial resolution.
  - Features from each Transformer layer serve as guidance.
  - Local detail is refined at each scale.
- Skip Connection Design:
  - Unlike the symmetric skip connections of a conventional U-Net, TRUST uses asymmetric connections from Transformer encoder layers to decoder layers.
  - This gives the decoder access to image features at every level of abstraction.
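The encoder–decoder interplay above can be sketched in plain NumPy. This is a structural toy, not the paper's implementation: single-head attention stands in for the multi-head layers, a shared linear mixing layer stands in for the transposed convolutions, and all widths are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # feature width (illustrative)

def self_attention(X, W):
    """Single-head self-attention; X is (tokens, d), W stacks the Q/K/V projections."""
    Q, K, V = X @ W[0], X @ W[1], X @ W[2]
    s = Q @ K.T / np.sqrt(d)
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)              # softmax over tokens
    return a @ V

# Transformer encoder: keep each layer's output for an asymmetric skip connection.
tokens = rng.standard_normal((32, d))               # embedded observations y
layer_weights = [0.1 * rng.standard_normal((3, d, d)) for _ in range(3)]
skips, h = [], tokens
for W in layer_weights:
    h = h + self_attention(h, W)                    # residual attention block
    skips.append(h)

# U-Net-style decoder: fuse each encoder feature on the way back up.
# (A real decoder would also upsample with transposed convolutions.)
W_mix = 0.1 * rng.standard_normal((2 * d, d))
z = skips[-1]
for s in reversed(skips[:-1]):
    z = np.concatenate([z, s], axis=-1) @ W_mix     # asymmetric skip fusion
    z = np.maximum(z, 0.0)                          # ReLU refinement
```

The key structural point survives the simplification: every encoder depth feeds the decoder directly, rather than only mirror-image layers as in a symmetric U-Net.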
Loss & Training¶
The training loss combines three terms: a reconstruction error, a sparsity regularization, and a structural similarity constraint.
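The paper's exact formula is not reproduced here; a plausible form consistent with the three named terms, with \(\lambda\) and \(\mu\) as assumed weighting hyperparameters, would be:

$$
\mathcal{L} \;=\; \underbrace{\|\hat{\mathbf{x}} - \mathbf{x}\|_2^2}_{\text{reconstruction}} \;+\; \lambda \underbrace{\|\hat{\mathbf{x}}\|_1}_{\text{sparsity}} \;+\; \mu \underbrace{\bigl(1 - \mathrm{SSIM}(\hat{\mathbf{x}}, \mathbf{x})\bigr)}_{\text{structural similarity}}
$$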
Key Experimental Results¶
Main Results (Sparse Signal Recovery)¶
| Method | PSNR (dB) ↑ | SSIM ↑ | Reconstruction Time (ms) ↓ | Hallucination Rate (%) ↓ |
|---|---|---|---|---|
| ISTA | 24.3 | 0.712 | 850 | 12.5 |
| LISTA | 27.8 | 0.781 | 120 | 8.3 |
| U-Net | 30.2 | 0.845 | 45 | 15.7 |
| SwinIR | 31.5 | 0.868 | 78 | 9.2 |
| Transformer-only | 31.8 | 0.872 | 92 | 7.8 |
| TRUST | 34.6 | 0.921 | 52 | 3.1 |
Results at Different Compression Ratios¶
| Compression Ratio (M/N) | U-Net PSNR | Transformer PSNR | TRUST PSNR | TRUST SSIM |
|---|---|---|---|---|
| 0.1 | 22.1 | 23.5 | 26.8 | 0.785 |
| 0.2 | 26.3 | 27.1 | 30.2 | 0.856 |
| 0.3 | 29.5 | 30.2 | 33.1 | 0.905 |
| 0.5 | 32.8 | 33.5 | 36.2 | 0.945 |
Ablation Study¶
| Configuration | PSNR ↑ | SSIM ↑ |
|---|---|---|
| Full TRUST | 34.6 | 0.921 |
| w/o Skip Connections | 31.8 | 0.875 |
| w/o Transformer Encoder | 30.2 | 0.845 |
| w/o U-Net Decoder | 31.5 | 0.868 |
| w/o Sparsity Regularization | 33.2 | 0.902 |
Key Findings¶
- TRUST demonstrates the most pronounced advantage at low compression ratios (M/N=0.1), indicating that global context is critical under severe undersampling.
- Removing the Transformer encoder costs the most (4.4 dB PSNR), making it the most critical component; skip connections contribute a further 2.8 dB.
- The hallucination rate drops from 15.7% (U-Net) to 3.1%, confirming the robustness of the hybrid architecture.
- Inference speed is close to that of pure U-Net and far faster than traditional iterative methods.
Highlights & Insights¶
- Hybrid Architecture Design: The Transformer handles global reasoning while the U-Net handles multi-scale reconstruction, with clearly delineated roles.
- Hallucination Suppression: Reconstruction is guided by sparse support estimation, effectively mitigating the hallucination problem common in deep learning methods.
- Blind Sparse Recovery: The sensing matrix \(\mathbf{A}\) need not be known; the model is learned end-to-end, offering strong practical utility.
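One way to see why support estimation suppresses hallucination: if reconstruction is restricted to an estimated support, every off-support entry stays exactly zero, so the model cannot invent detail there. A hypothetical NumPy sketch (the support estimate below is a simple correlation heuristic standing in for the paper's learned estimator, and `A` is assumed known for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 64, 32, 4
x = np.zeros(N)
true_support = rng.choice(N, size=K, replace=False)
x[true_support] = rng.standard_normal(K)
A = rng.standard_normal((M, N)) / np.sqrt(M)
y = A @ x

# Stand-in for the learned support estimate: keep the K largest correlations.
support = np.argsort(np.abs(A.T @ y))[-K:]

# Solve least-squares only on the estimated support; everything off-support
# stays exactly zero, which is what rules out hallucinated detail.
x_hat = np.zeros(N)
x_hat[support] = np.linalg.lstsq(A[:, support], y, rcond=None)[0]
```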
Limitations & Future Work¶
- Validation is currently limited to 2D signals; extension to 3D volumetric data requires further investigation.
- The computational overhead of the Transformer encoder remains substantial for high-resolution signals.
- Setting the sparsity hyperparameter requires domain knowledge.
- No comparison is made against recent diffusion-based reconstruction methods.
Related Work & Insights¶
- LISTA (Gregor & LeCun, 2010): Unrolls ISTA into a learnable network.
- SwinIR: Image restoration based on the Swin Transformer.
- TransUNet: A Transformer–U-Net hybrid architecture for medical image segmentation, from which this work draws inspiration.
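For context, LISTA unrolls the classic ISTA iteration into a fixed number of layers with learned matrices and thresholds. A minimal NumPy version of the underlying ISTA iteration (the standard algorithm, not code from either paper; problem sizes are illustrative):

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, y, lam=0.05, iters=200):
    """Plain ISTA; LISTA replaces A.T/L and lam/L with learned parameters per layer."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the data-fit gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft(x + A.T @ (y - A @ x) / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 64)) / np.sqrt(32)
x_true = np.zeros(64)
x_true[[3, 17, 40]] = [1.0, -0.8, 0.6]
y = A @ x_true
x_hat = ista(A, y)
```

Unrolling this loop into a handful of learned layers is what gives LISTA its roughly 7x speedup over ISTA in the timing table above.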
Rating¶
| Dimension | Score (1–5) |
|---|---|
| Novelty | 3 |
| Theoretical Depth | 3 |
| Experimental Thoroughness | 4 |
| Writing Quality | 4 |
| Practical Value | 4 |
| Overall Recommendation | 3.5 |