Learning Protein Structure-Function Relationships through Knowledge-guided Representation Decomposition

Conference: ICML 2026
arXiv: 2605.23960
Code: https://github.com/AI-HPC-Research-Team/ProtDiS (Available)
Area: Scientific Computing / Protein Representation Learning / Disentangled Representations
Keywords: Protein Structure-Function, Knowledge-guided Disentanglement, Information Bottleneck, ESM-3, Redundancy Elimination

TL;DR

ProtDiS decomposes pretrained protein microenvironment embeddings (e.g., ESM-3) into 8 biophysically interpretable "knowledge channels" and 1 residual channel through information bottleneck and redundancy elimination, yielding consistent improvements across twelve downstream tasks, particularly in scenarios with similar structures but different functions.

Background & Motivation

Background: Current protein structure representations primarily rely on pretrained microenvironment encoders such as GearNet, ESM-3, or Foldseek. These compress 3D geometry, physicochemical properties, and topological information into a high-dimensional latent space, which is then passed to GNNs for downstream tasks like enzyme classification, ligand binding site prediction, and PPI.

Limitations of Prior Work: These latent spaces are highly entangled: geometric, physicochemical, and topological signals are squeezed into the same set of dimensions. This has two consequences: (1) a lack of interpretability, since model decisions cannot be traced back to specific biophysical quantities; (2) collapse on protein pairs with similar structures but different functions, where a high TM-score drives cosine similarity close to its maximum and the embeddings can no longer tell the functions apart.

Key Challenge: Protein function does not depend on the complete high-dimensional structural embedding, but rather on a few semantically clear local microenvironment attributes (secondary structure, packing density, flexibility, curvature, etc.). However, pure structural similarity dominates pretraining objectives, burying these fine-grained signals.

Goal: Decompose the entangled structural embedding \(\mathbf{s}\) into \(K\) knowledge-specific channels \(Z_k\) (each aligned with a predefined biophysical attribute \(Y_k\)) and 1 residual channel \(Z_c\) (to capture unmodeled structural variations). The decomposition must ensure each \(Z_k\) encodes only its corresponding \(Y_k\), redundancy between different \(Z_k\) is low, and all channels together can fully reconstruct \(\mathbf{s}\).

Key Insight: Instead of forcing strict statistical independence (since biophysical attributes are inherently correlated, e.g., hydrophobicity and exposure), the authors adopt a Barlow Twins-style "redundancy elimination" approach—penalizing only second-order linear correlation between channels while allowing non-linear biological relationships to persist.
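
The distinction between linear redundancy and general dependence can be made concrete with a small worked example. The sketch below is not from the paper; it only illustrates, on made-up values, that a second-order (Pearson) correlation penalty sees a linear relationship but is blind to a purely quadratic one, which is the kind of non-linear relationship the authors want to leave untouched.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: the second-order, linear dependence measure."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Illustrative values on a grid that is symmetric around zero.
xs = [i / 10 for i in range(-10, 11)]
linear = [2 * x + 1 for x in xs]   # linearly redundant with xs
quadratic = [x * x for x in xs]    # fully determined by xs, but not linearly

r_linear = pearson(xs, linear)        # close to 1: would be penalized
r_quadratic = pearson(xs, quadratic)  # close to 0: would not be penalized
```

Here `quadratic` is a deterministic function of `xs`, yet its linear correlation with `xs` vanishes, so a penalty on linear correlation alone would leave that relationship in place.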

Core Idea: Use knowledge supervision to "explicitly anchor" the information bottleneck to biophysical variables, then employ adversarial training, reconstruction, and redundancy elimination to ensure the residual channel does not leak information and the knowledge channels do not overlap.

Method

Overall Architecture

Input: Microenvironment embedding \(\mathbf{s} \in \mathbb{R}^d\) generated by a pretrained structural encoder (default: ESM-3 Structural Tokenizer).
Output: \(K=8\) knowledge channel embeddings \(\{Z_1, \ldots, Z_8\}\) (corresponding to packing density, local complexity, curvature, shape, exposure, flexibility, stability, hydrophobicity) and 1 residual channel \(Z_c\).
Mechanism: Each channel is implemented by an independent encoder \(f_k(\mathbf{s})\). Each knowledge channel is connected to a supervised prediction head \(h_k\) to predict the corresponding biophysical label \(y_k\). The residual channel is connected to a reconstruction head \(r(\cdot)\) to reconstruct the original \(\mathbf{s}\), and \(K\) adversarial discriminators \(d_k\) attempt to predict \(y_k\) from \(Z_c\) (using gradient reversal to prevent \(Z_c\) from learning knowledge information). For downstream use, a gated network selects and fuses relevant channels before passing them to a GNN.
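
The gradient reversal step in the mechanism above can be sketched without any deep learning framework. This is a minimal illustration under assumptions, not the authors' implementation: the class `GradientReversal`, the toy linear discriminator, and all numeric values are hypothetical. It shows only the defining behavior, which is identity in the forward pass and a sign-flipped, scaled gradient in the backward pass.

```python
class GradientReversal:
    """Identity forward; multiplies incoming gradients by -lam on the way back."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, z):
        return list(z)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]


def discriminator_input_grad(w, z, y):
    """Gradient of the squared error (w.z - y)^2 with respect to z (toy discriminator)."""
    pred = sum(wi * zi for wi, zi in zip(w, z))
    return [2.0 * (pred - y) * wi for wi in w]


# Hypothetical residual embedding, discriminator weights, and label.
grl = GradientReversal(lam=0.5)
z_c = [0.2, -0.4]
w = [1.0, 3.0]
y = 1.0

features = grl.forward(z_c)                         # discriminator sees z_c unchanged
grad_at_input = discriminator_input_grad(w, features, y)
grad_for_encoder = grl.backward(grad_at_input)      # encoder is pushed to hurt the discriminator
```

The discriminator is updated with the ordinary gradient, while the residual encoder receives the reversed one, so the two play the adversarial game in a single backward pass.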

Key Designs

  1. Knowledge-guided Information Bottleneck Decomposition:

    • Function: Decomposes a single structural embedding into \(K+1\) semantically aligned sub-channels, where each channel is a "minimal sufficient statistic" of \(\mathbf{s}\) regarding \(Y_k\).
    • Core Idea: The theoretical objective is \(\min_{Z_k} I(Z_k;\mathbf{s}) - \beta_k I(Z_k;Y_k)\). In practice, three surrogates are used: (a) a supervision loss \(\mathcal{L}_{\mathrm{kn}}^{(k)} = \mathbb{E}[\ell(h_k(Z_k), y_k)]\), whose minimization maximizes a variational lower bound on \(I(Z_k;Y_k)\); (b) a batch-level KL regularizer \(\mathcal{L}_{\mathrm{KL}} = \sum_k \mathrm{KL}(q(Z_k) \| \mathcal{N}(0, I))\), which acts as an upper bound on \(I(Z_k;\mathbf{s})\); (c) an \(\ell_1\) reconstruction loss \(\mathcal{L}_{\mathrm{rec}} = \|\hat{\mathbf{s}} - \mathbf{s}\|_1\), which ensures that \((Z_1,\ldots,Z_K,Z_c)\) are jointly sufficient to recover \(\mathbf{s}\).
    • Design Motivation: Biophysical properties of proteins are computable (e.g., DSSP for secondary structure, Kyte-Doolittle for hydrophobicity), making knowledge supervision free. The IB framework unifies information selection, redundancy compression, and integrity preservation.
  2. Adversarial Knowledge Stripping + Barlow-Twins Redundancy Reduction:

    • Function: Ensures the residual channel \(Z_c\) does not capture modeled knowledge and that the \(K\) knowledge channels carry little linear redundancy with one another.
    • Core Idea: On the residual side, a gradient-reversal adversarial loss \(\mathcal{L}_{\mathrm{adv}} = \sum_k \mathbb{E}[\ell(d_k(\mathcal{R}_\lambda(Z_c)), y_k)]\), where \(\mathcal{R}_\lambda\) is the gradient reversal layer, is used to minimize an upper bound on \(I(Z_c; Y_k)\), forcing \(Z_c\) to contain only the parts of \(\mathbf{s}\) not covered by the knowledge channels. Between channels, two terms are used: a variance regularizer \(\mathcal{L}_{\mathrm{var}} = \sum_k \mathbb{E}_d[(\mathrm{std}(Z_k^{(d)}) - 1)^2]\) to prevent collapse, and the squared Frobenius norm of the cross-correlation matrices \(\mathcal{L}_{\mathrm{cov}} = \frac{1}{|\mathcal{P}|}\sum_{(i,j) \in \mathcal{P}} \|C_{ij}\|_F^2\) (where \(C_{ij} = \frac{1}{N}\tilde{Z}_i^\top \tilde{Z}_j\), \(\mathcal{P}\) is the set of channel pairs, and \(\tilde{Z}\) is presumably the batch-normalized channel activations), which penalizes only second-order linear redundancy.
    • Design Motivation: The authors note that strict statistical independence is biologically unrealistic (hydrophobicity and exposure are strongly correlated). Thus, Barlow Twins' approach is chosen to remove linear redundancy while allowing non-linear biological relationships, making it more suitable for proteins than FactorVAE or \(\beta\)-TCVAE.
  3. Task-adaptive Gated Fusion:

    • Function: Allows the model to select the most relevant knowledge channels for specific downstream tasks.
    • Core Idea: A subset of \(\{Z_k\}\) most relevant to the task is identified based on feature-level importance analysis (e.g., for enzyme EC prediction, only residual, secondary structure, local packing, and contact entropy might be selected). These are fused via a gated network and passed to a GNN.
    • Design Motivation: Explicitly identifying which biophysical quantities determine which function prevents overfitting caused by stacking high-dimensional features (critical for small-sample multi-class tasks like SCOP-cf) and provides natural interpretability.
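
The two redundancy terms from the second design, \(\mathcal{L}_{\mathrm{var}}\) and \(\mathcal{L}_{\mathrm{cov}}\), can be sketched in plain Python. This is a hedged illustration rather than the released code: the function names, the use of population standard deviation, the `eps` constant, the choice of unordered pairs for \(\mathcal{P}\), and the toy channels are all assumptions.

```python
import math

def column_stats(Z):
    """Per-dimension (mean, population std) over the batch; Z is a list of rows."""
    n = len(Z)
    stats = []
    for col in zip(*Z):
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        stats.append((mean, std))
    return stats

def standardize(Z, eps=1e-8):
    """Center and scale each dimension so that C_ij becomes a correlation matrix."""
    stats = column_stats(Z)
    return [[(v - m) / (s + eps) for v, (m, s) in zip(row, stats)] for row in Z]

def variance_loss(Z):
    """Mean over dimensions of (std - 1)^2, discouraging collapsed dimensions."""
    stats = column_stats(Z)
    return sum((s - 1.0) ** 2 for _, s in stats) / len(stats)

def cross_corr_penalty(Zi, Zj):
    """Squared Frobenius norm of C_ij = (1/N) * Zi~^T Zj~."""
    Zi, Zj = standardize(Zi), standardize(Zj)
    n = len(Zi)
    total = 0.0
    for col_a in zip(*Zi):
        for col_b in zip(*Zj):
            c_ab = sum(a * b for a, b in zip(col_a, col_b)) / n
            total += c_ab ** 2
    return total

def covariance_loss(channels):
    """Average cross-correlation penalty over all unordered channel pairs."""
    pairs = [(i, j) for i in range(len(channels)) for j in range(i + 1, len(channels))]
    return sum(cross_corr_penalty(channels[i], channels[j]) for i, j in pairs) / len(pairs)

# Toy one-dimensional channels over a batch of four residues (made-up values).
z_a = [[1.0], [2.0], [3.0], [4.0]]
z_b = [[2.0], [4.0], [6.0], [8.0]]    # a rescaled copy of z_a: fully redundant
z_u = [[1.0], [-1.0], [-1.0], [1.0]]  # linearly uncorrelated with z_a and z_b

loss_cov = covariance_loss([z_a, z_b, z_u])
```

In this toy batch only the pair (`z_a`, `z_b`) contributes to the penalty, so the averaged loss is about one third; the uncorrelated channel is left alone.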

Loss & Training

Total loss: \(\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{sup}} + \lambda_{\mathrm{KL}}\mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{red}}(\lambda_{\mathrm{var}}\mathcal{L}_{\mathrm{var}} + \lambda_{\mathrm{cov}}\mathcal{L}_{\mathrm{cov}}) + \lambda_{\mathrm{rec}}\mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{adv}}\mathcal{L}_{\mathrm{adv}}\), where \(\mathcal{L}_{\mathrm{sup}}\) appears to denote the knowledge supervision term, i.e. the per-channel losses \(\mathcal{L}_{\mathrm{kn}}^{(k)}\) aggregated over \(k\). Pretraining data: 100,000 high-quality structures sampled from PDB and AlphaFoldDB. During downstream evaluation, the representations are frozen and only the fusion layer and GNN head are trained, so that the evaluation isolates representation quality.
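
The nesting of the weights in the total loss is easy to misread, so here is a small sketch of how the terms combine. The function name `total_loss`, the dictionary keys, and every numeric value are placeholders chosen for illustration; the summary does not report the actual weights.

```python
def total_loss(terms, weights):
    """Combine the loss terms following the formula above.

    Note that the redundancy weight scales the already-weighted sum of the
    variance and covariance terms.
    """
    redundancy = weights["var"] * terms["var"] + weights["cov"] * terms["cov"]
    return (
        terms["sup"]
        + weights["kl"] * terms["kl"]
        + weights["red"] * redundancy
        + weights["rec"] * terms["rec"]
        + weights["adv"] * terms["adv"]
    )

# Placeholder loss values and weights (hypothetical, not from the paper).
terms = {"sup": 1.0, "kl": 0.5, "var": 0.2, "cov": 0.1, "rec": 0.4, "adv": 0.3}
weights = {"kl": 0.1, "red": 1.0, "var": 1.0, "cov": 2.0, "rec": 1.0, "adv": 0.5}

loss = total_loss(terms, weights)
```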

Key Experimental Results

Main Results

ESM-3 ST (the ESM-3 Structural Tokenizer baseline) and ProtDiS are compared on 12 downstream tasks under random and structure-based splits. The table below reports selected tasks under the more rigorous structure split.

| Task (struct split) | Metric | ESM-3 ST | ProtDiS | Gain |
| --- | --- | --- | --- | --- |
| Enzyme Pred. EC | acc | 78.7 | 83.5 | +6.05% |
| Ligand Aff. | spr | 35.1 | 36.6 | +4.45% (rel.) |
| SCOP-family | acc | 75.0 | 78.0 | +3.91% |
| PPIs | auroc | 82.1 | 84.6 | +3.0 |
| MF (Function) | fmax | 61.1 | 61.2 | +0.1 |
| Ligand BS | mcc | 61.7 | 62.3 | +0.6 |

The gain is smaller under random split (e.g., EC 88.2 → 89.0), which aligns with the authors' expectation: when training and test structures are similar, entangled representations suffice. The structure split exposes the collapse problem in ESM-3.
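
Because the Gain column mixes percentages and raw differences, it can help to recompute both kinds of gain from the tabulated scores. The snippet below uses only the rounded values shown in the table, so the recomputed relative gains may differ slightly from the reported ones, which were presumably derived from unrounded scores.

```python
# (ESM-3 ST, ProtDiS) scores copied from the structure-split table.
results = {
    "Enzyme Pred. EC": (78.7, 83.5),
    "Ligand Aff.": (35.1, 36.6),
    "SCOP-family": (75.0, 78.0),
    "PPIs": (82.1, 84.6),
    "MF (Function)": (61.1, 61.2),
    "Ligand BS": (61.7, 62.3),
}

def gains(baseline, new):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative

for task, (baseline, new) in results.items():
    absolute, relative = gains(baseline, new)
    print(f"{task}: {absolute:+.1f} points, {relative:+.2f}% relative")
```

For enzyme prediction, for example, the absolute gain is 4.8 points and the relative gain is about 6.1%.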

Ablation Study

| Analysis Dimension | Key Metric | Description |
| --- | --- | --- |
| Knowledge Specificity (MI Heatmap) | Diagonal Dominance | Each \(Z_k\) has high MI with its \(Y_k\) and low MI with others; \(Z_c\) is low for all \(Y_k\). |
| Channel Independence (DCC) | Low cross-channel correlation | DCC between different \(Z_k\) is near 0, but each \(Z_k\) retains moderate correlation with \(\mathbf{s}\). |
| Integrity (Progressive Rec.) | Monotonic decrease in reconstruction loss | Adding \(Z_k\) channels in any order reduces reconstruction loss, showing information complementarity. |
| High TM-score Homology Pairs | AUC | ESM-3 scores 0.868 on highest TM-score bins; ProtDiS scores 0.946. |
| Cosine vs. TM-score | Dispersion | For negative pairs with TM-score > 0.5, ESM-3 cosine similarity collapses to ~1, while ProtDiS maintains low cosine. |
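
The hard-negative evaluation in the last two rows amounts to asking how often a same-function pair receives a higher similarity score than a different-function pair. The sketch below computes that pairwise AUC in plain Python; the function name and all similarity scores are invented for illustration and are not taken from the paper.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Probability that a positive pair outscores a negative pair (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up cosine similarities for structurally similar pairs.
# "Saturated" mimics an entangled embedding: negatives also sit near 1.
saturated_pos = [0.99, 0.98, 0.97]
saturated_neg = [0.99, 0.985, 0.96]

# "Dispersed" mimics an embedding that keeps most negatives at low cosine.
dispersed_pos = [0.90, 0.80, 0.85]
dispersed_neg = [0.30, 0.20, 0.85]

auc_saturated = pairwise_auc(saturated_pos, saturated_neg)
auc_dispersed = pairwise_auc(dispersed_pos, dispersed_neg)
```

With these toy scores the saturated case sits at chance level (0.5), while the dispersed case reaches roughly 0.83, which is the qualitative pattern the ablation describes.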

Key Findings

  • Gains on the structure split clearly exceed those on the random split: EC accuracy improves by only 0.8 points on the random split (88.2 → 89.0) but by 4.8 points on the structure split (78.7 → 83.5, reported as +6.05% relative). This suggests ProtDiS learns function-related signals beyond global structural similarity rather than just fitting the training distribution.
  • Homologous high-similarity protein pairs represent the "killer app" scenario for ProtDiS: On negative pairs with TM-score > 0.9, pure structural embedding similarity saturates at ~1, while knowledge embeddings remain discriminative, improving AUC by ~8 points.
  • Task-adaptive channel selection matters: On small-sample tasks like SCOP-cf, forcing the use of all channels leads to overfitting, which suggests that not all 8 biophysical dimensions are necessary for every task.
  • Crucial role of the residual channel: The authors emphasize that without \(Z_c\), forcing all information into 8 knowledge channels would be "lossy or degenerate."

Highlights & Insights

  • Explicitly binding IB to protein biophysical quantities is a clean formalization: Traditional disentanglement seeks unsupervised latent factors and relies on post-hoc attribution. Here, each \(Y_k\) is set to a white-box biophysical quantity, such as DSSP annotations or the Kyte-Doolittle (KD) scale, which ties "disentanglement" and "interpretability" together during training.
  • Rejecting strict independence for Barlow Twins redundancy elimination is a pragmatic choice: Since protein attributes are inherently correlated, strict independence harms representation capacity. The trade-off of "removing only linear redundancy" while "retaining non-linear biological relations" is highly transferable to other scientific domains.
  • Batch-level KL as an Information Bottleneck: Unlike traditional VIB which uses per-sample variational Gaussians, using the aggregated posterior against a standard normal is more stable in practice and avoids noise from reparameterization sampling.
  • "Using homologous similarity pairs as hard negatives" is an evaluation protocol that better exposes representation collapse and should become a standard for protein representation learning.
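
The batch-level KL term mentioned in the highlights can be sketched as follows. How \(q(Z_k)\) is estimated is not specified in this summary, so the sketch makes a loud assumption: it moment-matches a diagonal Gaussian to the batch and applies the closed-form KL divergence to \(\mathcal{N}(0, I)\). The function name and the toy batches are hypothetical.

```python
import math

def batch_gaussian_kl(Z):
    """KL( N(mu, diag(var)) || N(0, I) ), with mu and var estimated over the batch.

    Assumes every dimension has non-zero variance within the batch.
    """
    n = len(Z)
    kl = 0.0
    for col in zip(*Z):
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        kl += 0.5 * (var + mu ** 2 - 1.0 - math.log(var))
    return kl

# Toy batches with a single latent dimension (made-up values).
centered = [[-1.0], [1.0]]  # mean 0, variance 1: matches the prior
shifted = [[1.0], [3.0]]    # mean 2, variance 1: pays for the mean shift

kl_centered = batch_gaussian_kl(centered)
kl_shifted = batch_gaussian_kl(shifted)
```

No per-sample sampling step appears in this sketch, which is consistent with the stability argument made in the highlight, although the real estimator may differ.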

Limitations & Future Work

  • Strong dependence on structural data: The method relies on ESM-3 microenvironment embeddings, making it unusable for proteins without experimental structures or high-confidence AlphaFold predictions.
  • Hand-picked knowledge dimensions: \(Y_k\) is limited to attributes computable by existing tools like DSSP or KD scales; extending this requires finding computable labels for new dimensions.
  • Offline task-channel selection: The selection remains manual based on feature importance analysis rather than end-to-end learning.
  • Lack of direct comparison with recent Sparse Autoencoder (SAE) routes (e.g., Adams 2025): Those methods explore post-hoc interpretable decomposition of PLMs, while the authors only compare concepts qualitatively.
  • Future Directions: (i) Applying ProtDiS logic to sequence space to estimate local structural knowledge from pure sequence input; (ii) Knowledge-guided protein design by adjusting specific channels like hydrophobicity or packing density for controllable generation.

Comparison with Related Work

  • vs. FactorVAE / \(\beta\)-TCVAE: These pursue strict statistical independence and rely on unsupervised latent factors; ProtDiS uses supervised anchoring and redundancy elimination (allowing non-linear correlation).
  • vs. DisenIB / IMB: These also perform supervised disentanglement under IB, but ProtDiS uses multiple biophysical variables plus a residual channel, which makes the decomposition more symmetric and easier to analyze.
  • vs. SAE routes (Adams 2025): SAE is a post-hoc sparse decomposition of PLM outputs on sequence embeddings; ProtDiS is an explicit constraint during training with structural embeddings and information-theoretic integrity guarantees.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of IB, knowledge supervision, and Barlow Twins for protein representation is new, though individual components exist.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across 12 tasks, three types of analysis (specificity, independence, integrity), and hard negative evaluation, though lacking SAE direct comparison.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation and methodology; well-defined bridging between IB theory and practical surrogates.
  • Value: ⭐⭐⭐⭐ Relative improvements of roughly 3-6% on several structure-split tasks are practically significant, and disentangled representations offer a clear path toward controllable protein design.