Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation¶
Conference: CVPR 2026 arXiv: 2603.12845 Code: None Area: Medical Imaging / Protein Language Models / Bioinformatics Keywords: enzyme kinetic parameter prediction, protein language model, cross-attention, mixture of experts, distribution alignment
TL;DR¶
This paper proposes ERBA (Enzyme-Reaction Bridging Adapter), which reformulates enzyme kinetic parameter prediction as a staged conditioning problem aligned with catalytic mechanisms — first injecting substrate information via MRCA to capture molecular recognition, then fusing active-site 3D geometry via G-MoE to model conformational adaptation, and applying ESDA for distribution alignment to preserve PLM priors — achieving state-of-the-art performance across three kinetic metrics.
Background & Motivation¶
Accurate prediction of enzyme kinetic parameters (e.g., turnover number \(k_{cat}\), Michaelis constant \(K_m\), inhibition constant \(K_i\)) is critical for high-throughput protein design and synthetic biology, enabling candidate enzyme screening prior to wet-lab experiments. Existing methods such as DLKcat, UniKP, CataPro, and CatPred incorporate protein language model (PLM) features and substrate SMILES encodings, but share two fundamental limitations:
- Shallow fusion: Most methods simply concatenate enzyme and substrate representations before regression, treating catalysis as a static compatibility problem and ignoring its staged nature (substrate recognition followed by conformational adaptation).
- Passive use of PLMs: PLMs serve merely as frozen feature extractors or lightly fine-tuned backbones, without being explicitly conditioned on specific substrates or pocket geometry. Furthermore, directly injecting 3D structural information risks disrupting the biochemical semantics learned during PLM pre-training.
Core Problem¶
How to systematically inject substrate chemical information and active-site 3D structural information when fine-tuning PLMs for enzyme kinetic prediction, while preserving the biochemical priors acquired through large-scale self-supervised pre-training? The core challenge lies in determining the order, mechanism, and stability of fusion.
Method¶
Overall Architecture¶
ERBA's pipeline emulates the two stages of real catalytic mechanisms:

- Input: enzyme amino acid sequence \(\mathbf{S}_e\), substrate SMILES \(\mathbf{S}_m\), active-site 3D structure \(\mathbf{S}_g\)
- Stage 1 — Molecular Recognition: substrate semantics are injected into the PLM enzyme representation via MRCA
- Stage 2 — Conformational Adaptation: active-site geometric information is fused via G-MoE, routing to pocket-specialized experts
- Distribution Alignment: ESDA constrains representations at each stage to remain consistent with the PLM manifold
- Output: kinetic parameters predicted in \(\log_{10}\) space using a heteroscedastic Gaussian NLL objective
The overall formulation is \(\hat{\mathbf{y}} = \mathcal{G}^{(2)}(\mathcal{M}^{(1)}(\mathbf{S}_e, \mathbf{S}_m), \mathbf{S}_g)\).
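Since no code is released, the staged composition can be illustrated with a minimal, hypothetical PyTorch sketch; module names and interfaces below are assumptions for illustration, not the authors' implementation.

```python
import torch.nn as nn

class ERBAPipeline(nn.Module):
    """Illustrative staged conditioning: y_hat = G2(M1(S_e, S_m), S_g)."""

    def __init__(self, plm, mrca, g_moe, head):
        super().__init__()
        self.plm = plm      # LoRA-tuned protein language model (residue embeddings)
        self.mrca = mrca    # Stage 1: molecular recognition cross-attention, M^(1)
        self.g_moe = g_moe  # Stage 2: geometry-aware mixture of experts, G^(2)
        self.head = head    # heteroscedastic head: mean and log-variance in log10 space

    def forward(self, enzyme_tokens, substrate_graph, pocket_geometry):
        h_e = self.plm(enzyme_tokens)           # H_e: (B, L_e, D) residue embeddings
        z1 = self.mrca(h_e, substrate_graph)    # substrate-conditioned Z^(1)
        z2 = self.g_moe(z1, pocket_geometry)    # geometry-conditioned Z^(2)
        return self.head(z2)                    # (mu, log_var) for the kinetic parameter
```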
Key Designs¶
- MRCA (Molecular Recognition Cross-Attention): The PLM's shallow layers produce residue embeddings \(\mathbf{H}_e \in \mathbb{R}^{L_e \times D}\); the substrate is encoded by an MPNN into \(\mathbf{H}_m \in \mathbb{R}^{L_m \times D}\). MRCA injects substrate semantics into the enzyme representation via a single-layer cross-attention mechanism, using enzyme tokens as queries and substrate tokens as keys/values. This emulates the process by which the enzyme first recognizes and localizes the substrate. Compared to simple concatenation followed by self-attention, MRCA improves \(R^2\) on \(K_i\) from 0.47 to 0.61.
- G-MoE (Geometry-aware Mixture-of-Experts): A geometry-aware mixture-of-experts layer whose routing mechanism jointly considers the pocket-region representation from the MRCA output and 3D geometric descriptors encoded by an E-GNN. Top-\(k\) sparse gating selects 2 of 4 experts. Each expert performs pocket-local low-rank adaptation (analogous to LoRA), with geometric information modulating channel activations via bias terms. This routes enzyme–substrate pairs of different conformational types to different experts, emulating the induced-fit effect in catalysis. Compared to standard MoE, \(R^2\) improves from 0.50 to 0.61.
- ESDA (Enzyme-Substrate Distribution Alignment): Three levels of representations are defined — original sequence representation \(\mathcal{Z}^{(0)}\), substrate-conditioned \(\mathcal{Z}^{(1)}\), and geometry-conditioned \(\mathcal{Z}^{(2)}\). MMD (Maximum Mean Discrepancy) with an RBF kernel constrains \(\mathcal{Z}^{(1)}\) and \(\mathcal{Z}^{(2)}\) to remain close in distribution to \(\mathcal{Z}^{(0)}\), preventing "feature flooding" and PLM semantic forgetting during multimodal information injection. Compared to a standard L2 loss, \(R^2\) improves from 0.48 to 0.61. (A minimal code sketch of all three modules follows this list.)
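Since the paper releases no code, here is a minimal PyTorch sketch of the three modules under stated assumptions: MRCA as a single standard cross-attention layer with a residual connection, G-MoE as dense computation over four low-rank experts with sparse top-2 gating on pooled features plus an E-GNN geometry descriptor, and ESDA as a biased RBF-kernel MMD estimate. Class names, dimensions, and interfaces are illustrative, not the authors' implementation; the substrate MPNN output `h_substrate` and the geometry descriptor `geom` are assumed to be computed upstream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MRCA(nn.Module):
    """Stage 1: inject substrate semantics into enzyme residues via one cross-attention layer."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_enzyme, h_substrate):
        # Enzyme residues are queries; substrate tokens (from the MPNN) are keys/values.
        upd, _ = self.attn(query=h_enzyme, key=h_substrate, value=h_substrate)
        return self.norm(h_enzyme + upd)  # residual keeps the PLM representation dominant


class LowRankExpert(nn.Module):
    """One LoRA-style expert; geometry modulates channel activations through a bias term."""

    def __init__(self, d_model: int, d_geom: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        self.geom_bias = nn.Linear(d_geom, d_model)

    def forward(self, z, geom):
        return self.up(F.gelu(self.down(z))) + self.geom_bias(geom).unsqueeze(1)


class GMoE(nn.Module):
    """Stage 2: geometry-aware routing; top-2 of 4 experts (computed densely here for simplicity)."""

    def __init__(self, d_model: int, d_geom: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(LowRankExpert(d_model, d_geom) for _ in range(n_experts))
        self.router = nn.Linear(d_model + d_geom, n_experts)
        self.top_k = top_k

    def forward(self, z, geom):
        # Route each enzyme-substrate pair using pooled pocket features plus the geometry descriptor.
        logits = self.router(torch.cat([z.mean(dim=1), geom], dim=-1))   # (B, n_experts)
        top = logits.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(logits).scatter(-1, top.indices, top.values.softmax(dim=-1))
        out = sum(gates[:, i, None, None] * expert(z, geom) for i, expert in enumerate(self.experts))
        return z + out  # during training, `gates` would also be kept for the balance loss


def mmd_rbf(x, y, sigma: float = 1.0):
    """ESDA: biased RBF-kernel MMD between pooled representations, e.g. Z^(1) vs Z^(0)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```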
Loss & Training¶
- Primary task loss: heteroscedastic Gaussian NLL in \(\log_{10}\) space (jointly predicting mean and variance), suited for the positive-value constraints and multiplicative noise of kinetic parameters
- G-MoE balance loss \(\mathcal{L}_\text{G-MoE}\): encourages uniform expert utilization to prevent expert collapse
- ESDA alignment loss \(\mathcal{L}_\text{ESDA}\): MMD-based distribution alignment
- Total loss: \(\mathcal{L} = \mathcal{L}_\text{task} + 0.01 \cdot \mathcal{L}_\text{G-MoE} + 0.1 \cdot \mathcal{L}_\text{ESDA}\) (a minimal code sketch of this objective follows the list)
- Parameter-efficient fine-tuning of the PLM's top layers via LoRA (rank=8, scaling=16)
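A compact sketch of the objective with the stated weights. The Gaussian NLL is written directly on \(\log_{10}\)-transformed targets with a predicted per-sample log-variance; the balance term is a simple proxy for uniform expert utilization (the paper's exact form is not reproduced here), and function and variable names are illustrative.

```python
import torch


def heteroscedastic_nll(mu: torch.Tensor, log_var: torch.Tensor, y_log10: torch.Tensor):
    """Gaussian NLL in log10 space; the head predicts both the mean and the log-variance."""
    return 0.5 * (log_var + (y_log10 - mu) ** 2 / log_var.exp()).mean()


def moe_balance_loss(gates: torch.Tensor):
    """Penalize deviation of average expert load from uniform (a simple anti-collapse proxy)."""
    load = gates.mean(dim=0)                        # routing mass per expert, sums to 1
    return ((load - 1.0 / load.numel()) ** 2).sum()


def esda_loss(z0, z1, z2, mmd_fn):
    """Keep substrate- and geometry-conditioned representations near the original PLM manifold."""
    return mmd_fn(z1, z0) + mmd_fn(z2, z0)


def total_loss(mu, log_var, y_log10, gates, z0, z1, z2, mmd_fn):
    return (heteroscedastic_nll(mu, log_var, y_log10)
            + 0.01 * moe_balance_loss(gates)
            + 0.1 * esda_loss(z0, z1, z2, mmd_fn))
```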
Key Experimental Results¶
| Endpoint | Metric | ERBA (Ours) | CatPred (Prev. SOTA) | Δ |
|---|---|---|---|---|
| \(k_{cat}\) | R² | 0.54 | 0.40 | +0.14 |
| \(k_{cat}\) | PCC | 0.74 | 0.67 | +0.07 |
| \(k_{cat}\) | RMSE | 1.13 | 1.30 | -0.17 |
| \(K_m\) | R² | 0.61 | 0.49 | +0.12 |
| \(K_m\) | PCC | 0.79 | 0.65 | +0.14 |
| \(K_m\) | RMSE | 0.70 | 0.93 | -0.23 |
| \(K_i\) | R² | 0.61 | 0.45 | +0.16 |
| \(K_i\) | PCC | 0.78 | 0.60 | +0.18 |
| OOD \(k_{cat}\) | R² | 0.50 | 0.25 | +0.25 |
| OOD \(K_m\) | R² | 0.55 | 0.30 | +0.25 |
Backbone scaling: Applying ERBA to ESM2 (8M→3B), ProtT5-3B, and Ankh3 (1.8B/5.7B) consistently yields improvements, validating the method's generality.
Ablation Study¶
- Fusion order matters: the substrate-first-then-structure order (\(\mathbf{S}_e \to \mathbf{S}_m \to \mathbf{S}_g\)) substantially outperforms the reverse, improving \(R^2\) by 37.8%
- MRCA vs. concatenation + self-attention: MRCA improves \(R^2\) by 29.8%, demonstrating the superiority of explicit cross-attention over shallow fusion
- G-MoE vs. standard MoE: geometry-aware routing improves \(R^2\) by 22% over standard routing, confirming the effectiveness of routing based on pocket morphology
- ESDA vs. L2: distribution alignment improves \(R^2\) by 27.1%, validating the importance of maintaining PLM manifold consistency
- Module-wise accumulation: PLM baseline (0.49) → +MRCA (0.51) → +G-MoE (0.54) → +ESDA (0.61), with each module adding an incremental gain
Highlights & Insights¶
- Mechanism-aligned problem formulation: mapping the two-stage enzymatic "recognition → adaptation" process directly onto model design (MRCA → G-MoE) is both biologically grounded and architecturally elegant
- ESDA's distribution alignment perspective: using MMD in RKHS to prevent semantic drift during multimodal fine-tuning is a general and transferable idea applicable to any multimodal fine-tuning scenario
- Geometry-guided routing in G-MoE: routing based on 3D pocket structure, allowing experts specialized for different active-site morphologies, aligns well with the biological intuition of enzyme catalysis
- Substantial OOD generalization: \(R^2\) doubles on the EITLEM test set (0.25 → 0.50), demonstrating that conditioned PLMs generalize far more robustly than passively used PLMs
Limitations & Future Work¶
- The current work focuses exclusively on wild-type enzymes, with no systematic evaluation of mutants (though the OOD set includes some)
- 3D structures rely on AlphaFold2/ESMFold predictions, which may deviate from experimental structures
- Environmental variables such as cofactors, pH, and temperature are not modeled
- Performance on EC 6 (ligases) remains weak (\(K_m\) \(R^2\) of only 0.34), potentially requiring domain-specific data augmentation
- Inference efficiency with ESM2-3B as the backbone warrants further optimization
Related Work & Insights¶
- vs. CatPred (Nat.C 2025): CatPred also incorporates 3D structures but employs shallow fusion; ERBA's staged conditioning substantially surpasses it, with the gap widening further under OOD evaluation
- vs. EITLEM (Chem.Catal 2024): EITLEM uses ESM-1v with residue-level attention for mutant modeling but lacks explicit substrate conditioning and structural fusion
- vs. UniKP/CataPro: these methods perform multi-endpoint prediction with ProtT5 and SMILES but are fundamentally based on shallow concatenation
- The ESDA distribution alignment strategy is transferable to multimodal medical image analysis — preserving pre-trained semantics when fine-tuning visual foundation models
- The staged modeling paradigm of "recognition then adaptation" is broadly applicable to any task involving two-stage decision processes
- Geometry-aware, domain-sensitive routing in MoE (routing based on structural properties of the input) constitutes a general and reusable design pattern
Rating¶
- Novelty: ⭐⭐⭐⭐ The mechanism-aligned staged conditioning has solid biological grounding and ESDA is an elegant contribution, though individual components are not entirely novel
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three endpoints × multiple PLM backbones × OOD evaluation × six EC-class analyses × comprehensive ablations — coverage is exceptional
- Writing Quality: ⭐⭐⭐⭐ The mathematical framework is complete and biological context is well-explained, though the paper is somewhat lengthy
- Value: ⭐⭐⭐⭐ Practically valuable for computational biology; the distribution alignment idea transfers readily to other multimodal tasks