# DeepRAHT: Learning Predictive RAHT for Point Cloud Attribute Compression
**Conference:** AAAI 2026 · **arXiv:** 2601.12255 · **Code:** Available · **Area:** 3D Vision · **Keywords:** Point Cloud Compression, Attribute Compression, RAHT, End-to-End Learning, Variable Bitrate
## TL;DR
This paper proposes DeepRAHT, the first end-to-end differentiable Region Adaptive Hierarchical Transform (RAHT) framework for lossy point cloud attribute compression. By integrating learnable prediction models with a Laplace distribution-based rate proxy, DeepRAHT achieves compression performance surpassing both the G-PCC standard and existing deep learning methods.
## Background & Motivation
Point cloud attribute compression (PCAC) is a critical component of 3D data processing. RAHT, as the core transform in the MPEG G-PCC standard, offers strong performance at low complexity. However, applying RAHT in deep learning settings faces several challenges:
Non-differentiability: The RAHT implementation in G-PCC is written in C++ and is non-differentiable, precluding end-to-end training.
Absence of prediction: 3DAC, the first method to learn RAHT coefficients, relies on handcrafted RAHT to generate transform coefficients and then learns entropy coding, neglecting the predictive RAHT that is integral to the G-PCC standard.
Rate-only optimization: Due to non-differentiability, 3DAC can only optimize the bitrate and cannot jointly optimize distortion.
Poor robustness: Existing methods are sensitive to data variance and require multiple models to cover different rate-distortion operating points.
Unexplored learnability of predictive RAHT: Prediction can substantially reduce the uncertainty of transform coefficients, and encoding residuals is more efficient than encoding raw coefficients.
## Method

### Overall Architecture
The core pipeline of DeepRAHT proceeds as follows:
- Multi-scale generation: The input point cloud \(P_0\) undergoes \(s\) rounds of \(2 \times 2 \times 2\) sum-pooling to produce \(\{P_1, ..., P_s\}\).
- Top-down encoding: Starting from the coarsest scale \(s\), each scale applies a transform model (Haar) and an optional prediction model.
- Encode-and-reconstruct: The reconstructed attributes \(\hat{a}_m\) are used for DC reconstruction and for prediction at the next finer scale.
- Decoding: The decoding process is fully consistent with reconstruction, ensuring invertibility.
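The multi-scale stage above can be sketched in a few lines of plain Python. This is a hypothetical illustration using a dict-keyed sparse voxel grid, not the paper's Minkowski sparse-tensor implementation; the function name is invented for the sketch.

```python
# One round of 2x2x2 sum-pooling over a sparse voxel set, as used to
# generate the coarser scales P_1 ... P_s from the input P_0.
def sum_pool_2x2x2(points):
    """points: dict {(x, y, z): (attribute_sum, weight)} at the current scale.
    Returns the next coarser scale: coordinates halved, attributes and
    point-count weights summed per coarse voxel."""
    coarse = {}
    for (x, y, z), (a, w) in points.items():
        key = (x // 2, y // 2, z // 2)
        pa, pw = coarse.get(key, (0.0, 0))
        coarse[key] = (pa + a, pw + w)
    return coarse

# Four occupied voxels; the first two share one 2x2x2 block.
P0 = {(0, 0, 0): (10.0, 1), (0, 0, 1): (20.0, 1),
      (2, 2, 2): (5.0, 1), (7, 7, 7): (8.0, 1)}
P1 = sum_pool_2x2x2(P0)
# (0,0,0) and (0,0,1) merge into coarse voxel (0,0,0): attribute 30.0, weight 2.
```

Iterating this function \(s\) times yields the full pyramid \(\{P_1, ..., P_s\}\); the weights it accumulates are exactly the \(w\) values the Haar transform uses.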
### Key Designs

#### Differentiable RAHT via Sparse Convolution (Transform Model)
The core innovation is implementing a differentiable dyadic RAHT using Minkowski sparse tensors and sparse convolutions:
Haar Transform: For each \(2 \times 2 \times 2\) voxel, the 8 nodes are decomposed into 1 DC coefficient and 7 AC coefficients via sequential binary decomposition along the Z→Y→X axes. Each binary step applies the weighted Haar butterfly

$$\begin{bmatrix} DC \\ AC \end{bmatrix} = \frac{1}{\sqrt{w_1+w_2}}\begin{bmatrix} \sqrt{w_1} & \sqrt{w_2} \\ -\sqrt{w_2} & \sqrt{w_1} \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \end{bmatrix},$$

where \(w_1, w_2\) denote the number of original points contained in each of the two sibling nodes, serving as adaptive weights.
Sparse convolution implementation:

- Z-axis decomposition: \(\text{Zconv} \equiv \text{Conv}(i=1, o=2, k=s=(1,1,2))\)
- Y-axis decomposition: \(\text{Yconv} \equiv \text{Conv}(i=1, o=2, k=s=(1,2,1))\)
- X-axis decomposition: \(\text{Xconv} \equiv \text{Conv}(i=1, o=2, k=s=(2,1,1))\)
- Convolution kernel weights are initialized to the \(2 \times 2\) identity matrix \(I_2\)
Key property: DC is equivalent to the normalized attribute at the next coarser scale: \(DC_m \equiv g_{LLL} = A_{m+1,i}/\sqrt{w_{m+1,i}}\). Therefore, the DC need not be encoded (it is already encoded at the coarser scale); only the 7 AC coefficients require encoding.
The inverse Haar transform is implemented using ConvolutionTranspose.
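A minimal numeric check of the weighted Haar butterfly and the DC property above, in pure Python. This stands in for the paper's sparse-convolution implementation; `haar_pair` and `haar_pair_inverse` are hypothetical helper names.

```python
import math

def haar_pair(a1, w1, a2, w2):
    """Weighted Haar butterfly for two occupied sibling nodes.
    a1, a2 are weight-normalized node attributes; w1, w2 are point counts."""
    s = math.sqrt(w1 + w2)
    dc = (math.sqrt(w1) * a1 + math.sqrt(w2) * a2) / s
    ac = (-math.sqrt(w2) * a1 + math.sqrt(w1) * a2) / s
    return dc, ac

def haar_pair_inverse(dc, ac, w1, w2):
    # The butterfly matrix is orthonormal, so its inverse is its transpose.
    s = math.sqrt(w1 + w2)
    a1 = (math.sqrt(w1) * dc - math.sqrt(w2) * ac) / s
    a2 = (math.sqrt(w2) * dc + math.sqrt(w1) * ac) / s
    return a1, a2

# Two sibling nodes holding raw attribute sums S1, S2 over w1, w2 points.
S1, w1, S2, w2 = 30.0, 2, 5.0, 1
a1, a2 = S1 / math.sqrt(w1), S2 / math.sqrt(w2)   # normalized node attributes
dc, ac = haar_pair(a1, w1, a2, w2)

# Key property: DC equals the normalized attribute of the merged coarser node,
# so it never needs to be encoded at this scale.
assert abs(dc - (S1 + S2) / math.sqrt(w1 + w2)) < 1e-9
# Perfect invertibility: distortion can come only from quantizing AC.
r1, r2 = haar_pair_inverse(dc, ac, w1, w2)
assert abs(r1 - a1) < 1e-9 and abs(r2 - a2) < 1e-9
```

The first assertion is the \(DC_m \equiv A_{m+1,i}/\sqrt{w_{m+1,i}}\) equivalence stated above; the second is the invertibility that confines distortion to quantization.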
#### Prediction Model
G-PCCv14 employs inverse distance weighting (IDW) prediction, but using sibling nodes at the same scale introduces autoregressive dependencies and increases decoding time. DeepRAHT performs prediction using only the parent scale:
IDW prediction (implemented via sparse convolution):

$$\text{IDW}(\hat{a}_m) \equiv \text{Conv}(\text{Unpool}(\hat{a}_m), k=3^3, s=1^3)$$
Convolution kernel weights decrease with distance from the center tap: center : face : edge : corner = 4 : 3 : 2 : 1.
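The 4:3:2:1 weighting can be made concrete by constructing the \(3^3\) kernel explicitly. This is a sketch: classifying each offset by its number of nonzero coordinates is an assumption about the layout, and the per-voxel normalization over occupied neighbors (needed for sparse data) is noted but not implemented.

```python
# Hypothetical construction of the 3x3x3 IDW kernel described above:
# weight 4 for the center tap, 3 for face neighbors, 2 for edge neighbors,
# 1 for corner neighbors. In sparse data the occupied taps would be
# renormalized per output voxel before use.
def idw_kernel():
    base = {0: 4, 1: 3, 2: 2, 3: 1}  # weight by count of nonzero offsets
    k = {}
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                n = (dx != 0) + (dy != 0) + (dz != 0)
                k[(dx, dy, dz)] = base[n]
    return k

K = idw_kernel()
# 1 center + 6 faces + 12 edges + 8 corners = 27 taps.
```

Because the offset class is just the Chebyshev-style count of nonzero axes, the ratio falls out of a four-entry lookup table rather than any distance computation.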
Prediction compensation module: Leverages the prediction error observed at the parent scale \(m\) (measured against the IDW prediction from the grandparent scale \(m+1\)) to compensate the current prediction, avoiding autoregressive dependencies:

$$a'_{m-1} = \text{Comp}\big(\hat{a}_m - \text{IDW}(\hat{a}_{m+1})\big) + \text{IDW}(\hat{a}_m)$$
The compensation module consists of multiple linear layers and sparse convolutions (hidden dimension 128, kernel size \(3^3\)), including a transposed convolution with stride 2. After prediction, the AC residuals are encoded: \(r_{m-1} = AC_{m-1} - AC'_{m-1}\).
The compensation module can be selectively enabled per scale based on prediction performance (one flag bit per scale, signaled to the decoder in \(s\) bits total), guaranteeing a performance lower bound of G-PCCv14.
#### Entropy Coder (Rate Proxy)
Existing methods use bottleneck entropy models, which are sensitive to data variance. DeepRAHT instead employs zero run-length coding, exploiting the high concentration of RAHT residuals near zero.
Since run-length coding is non-differentiable, a Laplace distribution-based rate proxy is proposed:

$$q(r) = \int_{r-0.5}^{r+0.5} \mathcal{L}_{\mu,\sigma}(x)\,dx$$
Parameters \(\alpha=0.425, \mu=0, \sigma=0.2\) are obtained by fitting to real data, achieving a coefficient of determination of 0.991.
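The proxy integral has a closed form because the Laplace CDF does. A sketch using the fitted \(\mu = 0, \sigma = 0.2\) above (the role of the fitted \(\alpha\) is not spelled out in this summary, so it is omitted here; the function names are invented for the sketch):

```python
import math

def laplace_cdf(x, mu=0.0, sigma=0.2):
    # Closed-form CDF of a Laplace distribution with location mu, scale sigma.
    if x < mu:
        return 0.5 * math.exp((x - mu) / sigma)
    return 1.0 - 0.5 * math.exp(-(x - mu) / sigma)

def rate_proxy_bits(r, mu=0.0, sigma=0.2):
    """Estimated bits for one quantized residual r: -log2 of the Laplace
    probability mass on the unit-width bin centered at r (the q(r) integral)."""
    q = laplace_cdf(r + 0.5, mu, sigma) - laplace_cdf(r - 0.5, mu, sigma)
    return -math.log2(max(q, 1e-12))  # floor avoids log2(0) in the tails

# Residuals concentrated at zero are nearly free; outliers cost many bits,
# which is what makes zero run-length coding effective on RAHT residuals.
assert rate_proxy_bits(0.0) < rate_proxy_bits(1.0) < rate_proxy_bits(5.0)
```

With \(\sigma = 0.2\), the zero bin carries almost all of the mass, so a zero residual costs a fraction of a bit while a residual of 1 already costs several bits.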
Variable bitrate advantage: Different bitrates are achieved simply by adjusting the quantization step \(qs\) (\(qs = \{8,10,12,...,224\}\)), requiring only a single trained model, whereas 3DAC and TSC-PCAC require separate training for each rate-distortion operating point.
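The single-model variable-bitrate mechanism can be seen with a toy quantization sweep (illustrative residual values only; this is not the paper's codec loop):

```python
# Larger quantization step qs -> smaller symbols (fewer bits after run-length
# coding) but coarser reconstruction; the trained model itself never changes.
def quantize(residuals, qs):
    return [round(r / qs) for r in residuals]

def dequantize(symbols, qs):
    return [v * qs for v in symbols]

residuals = [3.0, -12.0, 40.0, 0.5, -7.0]
fine = dequantize(quantize(residuals, 8), 8)     # qs = 8: high rate, low error
coarse = dequantize(quantize(residuals, 24), 24)  # qs = 24: low rate, high error

sq_err = lambda rec: sum((a - b) ** 2 for a, b in zip(residuals, rec))
assert sq_err(coarse) >= sq_err(fine)  # larger step -> more distortion
```

Sweeping \(qs\) over \(\{8, 10, 12, ..., 24\}\) traces the rate-distortion curve from one trained model, which is the advantage over per-rate retraining in 3DAC and TSC-PCAC.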
### Loss & Training
The total loss function is:

$$\ell = \ell_{bits} + \lambda(\ell_{recon} + \ell_{pred})$$
- \(\ell_{recon} = \|a_0 - \hat{a}_0\|_2^2\): end-to-end reconstruction error
- \(\ell_{pred} = \sum_m \|a_m - a'_m\|_2^2\): prediction loss for accelerating convergence
- \(\ell_{bits} = -\sum_m \log_2 q(r_m/qs)\): rate proxy loss
- \(\lambda = 1/255\), \(qs = 8\), Adam optimizer, learning rate 0.0001, batch size 1
- Training data: RWTT dataset (568 real-world objects)
- Compression performed in YUV color space
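Putting the objective together with the listed hyperparameters (toy scalar loss terms; in training these would be the batched tensor losses):

```python
# Total loss l = l_bits + lambda * (l_recon + l_pred), with lambda = 1/255
# as listed above. The scalar inputs here are placeholders for the three terms.
LAMBDA = 1.0 / 255.0

def total_loss(l_bits, l_recon, l_pred):
    return l_bits + LAMBDA * (l_recon + l_pred)

loss = total_loss(100.0, 255.0, 0.0)  # 255.0 of MSE contributes exactly 1.0
```

The small \(\lambda\) keeps the rate term dominant, which matches the observation that the transform itself is invertible and distortion enters only through quantization.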
## Key Experimental Results

### Main Results
BD-BR gain of DeepRAHT over each anchor method (%, negative = DeepRAHT saves bitrate):

| Anchor | Owlii Avg | 8iVSLF Avg | MPEG Avg | Overall Avg |
|---|---|---|---|---|
| G-PCCv23 | -20.0 | -17.5 | -11.6 | -16.4 |
| 3DAC | -66.6 | -70.9 | -62.7 | -66.7 |
| TSC-PCAC | -12.8 | -68.5 | -73.2 | -51.5 |
| Unicorn | -7.1 | -10.9 | -4.0 | -7.3 |

Note: DeepRAHT saves an average of 16.4% bitrate over G-PCCv23 and 7.3% over Unicorn; improvements are larger on chroma components (U: 20.5%, V: 20.8%).
Complexity Comparison (8iVSLF, avg. 3.25M points/frame):
| Method | Enc. Time | Dec. Time | Model Size | GPU Memory |
|---|---|---|---|---|
| 3DAC | 38.45s | 51.71s | 1MB×5 | 10GB |
| TSC-PCAC | 7.86s | 26.87s | 148MB×5 | 22GB |
| Unicorn | 20.86s | 14.99s | 65MB×3 | 16GB |
| DeepRAHT | 6.03s | 5.74s | 88MB×1 | 8GB |
### Ablation Study

Ablation on loot_viewdep (BD-rate gain; the anchor differs per row, as noted):

| Configuration | BD-rate Gain | Anchor |
|---|---|---|
| Vanilla RAHT (no prediction) | baseline | — |
| RAHT+Pred (IDW, ≈G-PCCv14) | -48.2% | vanilla RAHT |
| RAHT+Pred+Comp (full DeepRAHT) | -24.6% | G-PCCv14 |
| RAHT+Pred+Comp (full DeepRAHT) | -16.6% | G-PCCv23 |
### Key Findings
- The prediction compensation module exceeds the sibling-based prediction of G-PCCv23 without using any sibling context.
- The rate proxy achieves very high fitting accuracy (\(R^2=0.991\)), effectively replacing the bottleneck entropy model.
- DeepRAHT is the only deep learning method that successfully compresses all test sequences; competing methods fail on certain large or sparse point clouds.
- A single model covers 10 rate-distortion operating points, whereas competing methods require 3–5 separate models.
- Guaranteed invertibility confines distortion to quantization alone, preserving more texture detail than Unicorn.
## Highlights & Insights
- First end-to-end differentiable RAHT: The core algorithm of the G-PCC standard is fully reimplemented using sparse convolutions, bridging deep learning and traditional coding standards.
- Guaranteed performance lower bound: The framework is structurally aligned with G-PCCv14; the optional compensation module and signaling bits ensure that performance never falls below G-PCCv14.
- Elegant variable bitrate solution: Exploiting the robustness of run-length coding to Laplace distributions, a single model covers a wide bitrate range by adjusting the quantization step.
- The equivalence DC = normalized attribute at the next coarser scale is the key theoretical foundation for avoiding redundant coding.
- Highly practical: Fastest encoding and decoding, lowest GPU memory usage, and best robustness among compared methods.
## Limitations & Future Work
- Training is conducted solely on the RWTT dataset; generalization to LiDAR and dynamic point clouds remains to be validated.
- Batch size is limited to 1, creating a bottleneck for large-scale training efficiency.
- The prediction model uses only parent/grandparent scales; longer-range context is unexplored.
- Only color attributes are handled; applicability to other attributes such as normals and reflectance is unverified.
- Integration with Gaussian Splatting data (a potential application mentioned by the authors) has not been experimentally evaluated.
## Related Work & Insights
- G-PCC (tmc13v23): The industry standard; DeepRAHT aligns with its structure and surpasses it, demonstrating the potential of learned methods to replace handcrafted designs.
- 3DAC: The first method to learn RAHT coefficients, but neither end-to-end nor predictive — DeepRAHT directly addresses both shortcomings.
- Unicorn: Current state-of-the-art deep learning framework using average pooling for multi-scale representation. DeepRAHT's RAHT decomposition provides a more theoretically grounded multi-scale alternative.
- Insight: Deep integration of classical signal processing tools (e.g., Haar wavelet transforms) with deep learning is a promising direction in compression research.
## Rating
- Novelty: ⭐⭐⭐⭐ (The end-to-end differentiable RAHT and rate proxy design are novel, though the overall framework adheres to the G-PCC structure.)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Comprehensive evaluation on three datasets with complexity comparisons, variable bitrate analysis, robustness validation, and ablation studies.)
- Writing Quality: ⭐⭐⭐⭐ (Technical descriptions are precise and mathematical derivations are complete.)
- Value: ⭐⭐⭐⭐⭐ (Directly benchmarked against the G-PCC industry standard; high practical value.)