⚛️ Physics & Scientific Computing

🧠 NeurIPS2025 · 57 paper notes

🔥 Top topics: Domain Adaptation ×3 · Diffusion Models ×2

3DID: Direct 3D Inverse Design for Aerodynamics with Physics-Aware Optimization

This paper proposes the 3DID framework, which learns a unified physics-geometry triplane latent representation, performs objective-gradient-guided diffusion sampling, and applies a two-stage topology-preserving refinement strategy to conduct inverse design directly in the full 3D space starting from random noise. On vehicle aerodynamic shape optimization, 3DID reduces simulated drag (Sim-Drag) by 13.6% compared to the best baseline.

A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees

This paper proposes a novel class of regularizers constructed from current and historical gradients, combined with a conjugate gradient method equipped with negative-curvature detection to solve the regularized Newton equation. Within an adaptive framework that requires no prior knowledge of the Hessian Lipschitz constant, the method simultaneously achieves, for the first time, the optimal global iteration complexity of \(O(\epsilon^{-3/2})\) and a quadratic local convergence rate.

A Variational Manifold Embedding Framework for Nonlinear Dimensionality Reduction

This paper proposes a variational manifold embedding framework that formalizes dimensionality reduction as an optimization problem over smooth embedding maps: minimizing the KL divergence between a prior distribution and the pullback of the data distribution. The framework theoretically unifies PCA with nonlinear dimensionality reduction methods and uses the calculus of variations (Euler-Lagrange equations) and Noether's theorem to derive interpretable constraints on optimal embeddings.

Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling

By theoretically analyzing the complementary weaknesses of ODE and SDE solvers (ODE solvers accumulate irreducible gradient errors; SDE solvers amplify discretization errors at large step sizes), this paper proposes AdaSDE—a method that introduces a learnable stochastic coefficient \(\gamma_i\) at each denoising step to control noise injection intensity. Optimized via lightweight distillation, AdaSDE achieves state-of-the-art FID of 4.18 on CIFAR-10 and 8.05 on FFHQ at 5 NFE.

AstroCo: Self-Supervised Conformer-Style Transformers for Light-Curve Embeddings

This paper proposes AstroCo, a self-supervised encoder that introduces the Conformer architecture (attention + depthwise separable convolution + gating) for irregular astronomical light curves. On the MACHO dataset, AstroCo reduces reconstruction error by 61–70% compared to Astromer v1/v2 and improves few-shot classification macro-F1 by approximately 7%.

Balanced Conic Rectified Flow

To address the distribution drift induced by the reflow step in k-rectified flow, this paper proposes conic reflow: constructing conic supervisory trajectories from the inverted noise of real images and their Slerp-perturbed neighbors, substantially reducing the number of required fake pairs while achieving superior generation quality and straighter ODE trajectories.
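
The Slerp perturbation mentioned above can be illustrated with a minimal sketch of spherical linear interpolation. The function name `slerp` and the plain-list vector representation are assumptions for illustration; the summary does not specify how the paper chooses the interpolation fraction or the neighbor noise.

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b at fraction t in [0, 1]."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Angle between the two vectors, clamped to [-1, 1] for numerical safety.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(omega)
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]
```

For two orthogonal unit vectors, `slerp([1.0, 0.0], [0.0, 1.0], 0.5)` stays on the unit circle, whereas linear interpolation would shrink the norm. This is one reason Slerp is a common choice for perturbing Gaussian noise vectors.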

Bayesian Surrogates for Risk-Aware Pre-Assessment of Aging Bridge Portfolios

A Bayesian neural network (BNN)-based surrogate model is proposed to replace expensive nonlinear finite element analysis (NLFEA), enabling rapid, uncertainty-aware structural safety pre-assessment of aging bridge portfolios. In a real-world railway case study, the approach saves approximately $370,000 per bridge.

Collapsing Taylor Mode Automatic Differentiation

This paper proposes a collapsing optimization technique for Taylor mode automatic differentiation. By rewriting the computation graph to propagate derivative summation operations upward, it substantially accelerates the evaluation of PDE operators (e.g., Laplacian, general linear PDE operators), achieving speeds superior to nested backpropagation while retaining the low-memory advantage of forward-mode AD.

DeltaPhi: Physical States Residual Learning for Neural Operators in Data-Limited PDE Solving

This paper proposes DeltaPhi, a framework that forgoes direct learning of the input-to-output mapping for PDEs and instead learns residuals between similar physical states. By exploiting the stability of physical systems as implicit data augmentation, DeltaPhi significantly improves the performance of diverse neural operators under data-scarce regimes.

EddyFormer: Accelerated Neural Simulations of Three-Dimensional Turbulence at Scale

EddyFormer is a Transformer architecture based on the Spectral Element Method (SEM) that decomposes the flow field into two parallel streams — LES (large-scale) and SGS (small-scale) — achieving DNS-level accuracy on 3D turbulence at \(256^3\) resolution with a 30× speedup, while generalizing well to unseen domains 4× larger.

Encoding and Understanding Astrophysical Information in Large Language Model-Generated Summaries

This work investigates whether LLM embeddings encode physically meaningful quantities derived from X-ray astronomical observations—specifically hardness ratios, power-law indices, and variability indices. Results show that structured prompt design improves clustering purity of physical attributes by 5.9%–57.5%, and sparse autoencoders reveal that LLMs infer physical parameters not explicitly stated by recognizing object types.

Enforcing Governing Equation Constraints in Neural PDE Solvers via Training-free Projections

Two training-free post-processing projection methods are proposed—nonlinear LBFGS optimization and local linearization projection—to project the outputs of neural PDE solvers onto the feasible manifold satisfying governing equation constraints. Evaluated on Lorenz/KS/Navier-Stokes, both methods substantially reduce constraint violations and improve accuracy, markedly outperforming physics-informed training.
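
As a rough illustration of the local linearization idea, the sketch below repeatedly projects a state onto the linearized version of a single scalar constraint \(g(u)=0\). It is a toy under stated assumptions: the names `project_linearized`, `g`, and `grad_g` are hypothetical, and the paper's setting involves full PDE constraints rather than one scalar equation.

```python
def project_linearized(u, g, grad_g, iters=20, tol=1e-12):
    """Iteratively project u onto {g(u) = 0} using the linearized constraint.

    Each step applies the minimum-norm correction that satisfies
    g(u) + grad_g(u) . delta = 0, namely delta = -g(u) * grad / ||grad||^2.
    """
    u = list(u)
    for _ in range(iters):
        residual = g(u)
        if abs(residual) < tol:
            break
        grad = grad_g(u)
        grad_sq = sum(c * c for c in grad)
        u = [ui - residual * ci / grad_sq for ui, ci in zip(u, grad)]
    return u

# Toy constraint: the state must lie on the unit circle.
def unit_circle(u):
    return u[0] ** 2 + u[1] ** 2 - 1.0

def unit_circle_grad(u):
    return [2.0 * u[0], 2.0 * u[1]]

projected = project_linearized([2.0, 1.0], unit_circle, unit_circle_grad)
```

Because the correction is applied after the network has produced its output, no retraining is needed, which matches the training-free framing of the summary.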

Exoplanet Formation Inference Using Conditional Invertible Neural Networks

A conditional invertible neural network (cINN) trained on 15,777 synthetic planets infers planet formation parameters (disk mass, turbulent \(\alpha\), dust-to-gas ratio) from observables (planet mass, orbital distance), achieving probabilistic parameter retrieval ~10⁶× faster than physical simulations. Multi-planet system data is shown to yield more robust inference than single-planet data.

F-Adapter: Frequency-Adaptive Parameter-Efficient Fine-Tuning in Scientific Machine Learning

This paper presents the first systematic study of parameter-efficient fine-tuning (PEFT) for pretrained large operator models (LOMs) in scientific machine learning. It demonstrates that LoRA exhibits a depth-amplified approximation error lower bound in Fourier layers, whereas Adapter preserves universal approximation capacity. Building on this analysis, the paper proposes the Frequency-Adaptive Adapter (F-Adapter), which allocates adapter capacity according to spectral energy distribution. On 3D Navier-Stokes prediction tasks, F-Adapter achieves state-of-the-art performance while tuning fewer than 2% of parameters.

FAIR Universe HiggsML Uncertainty Dataset and Competition

This work provides a standardized dataset of 280 million simulated LHC collision events and a competition platform featuring six parameterized systematic biases (detector calibration + background composition) alongside an asymmetric coverage penalty metric. Participants are required to construct robust 68.27% confidence intervals for the Higgs signal strength \(\mu\). The winning solutions, based on profile-free surrogate modeling, achieve confidence intervals approximately 20% narrower than conventional binned methods.
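
The 68.27% level is the probability mass within one standard deviation of a Gaussian mean. The sketch below checks that number and shows how empirical coverage of a set of intervals could be measured. The helper names and the example intervals are illustrative assumptions, and the competition's asymmetric penalty metric is not reproduced here.

```python
import math

def central_coverage(n_sigma):
    """Probability that a standard normal variable falls within +/- n_sigma."""
    return math.erf(n_sigma / math.sqrt(2.0))

def empirical_coverage(intervals, mu_true):
    """Fraction of (low, high) intervals that contain the true value."""
    hits = sum(1 for low, high in intervals if low <= mu_true <= high)
    return hits / len(intervals)

one_sigma = central_coverage(1.0)  # approximately 0.6827
coverage = empirical_coverage([(0.8, 1.2), (1.1, 1.3), (0.9, 1.0)], mu_true=1.0)
```

A well-calibrated method should produce intervals whose empirical coverage over many pseudo-experiments is close to `one_sigma`.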

FEAT: Free Energy Estimators with Adaptive Transport

This paper proposes the FEAT framework, which employs stochastic interpolants to learn transport maps between two thermodynamic systems. Building on the escorted Jarzynski equality and the controlled Crooks theorem, FEAT provides consistent, minimum-variance free energy difference estimators along with variational upper and lower bounds, thereby unifying equilibrium and non-equilibrium approaches.
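
For orientation, the standard (non-escorted) Jarzynski estimator is \(\Delta F = -\beta^{-1} \ln \langle e^{-\beta W} \rangle\). The sketch below implements only this textbook form with a numerically stable log-mean-exp; the escorted and learned-transport variants described in the summary are not shown, and the function name is an assumption.

```python
import math

def jarzynski_delta_f(work, beta=1.0):
    """Estimate a free energy difference from non-equilibrium work samples.

    Computes -(1/beta) * log(mean(exp(-beta * W))) using a log-mean-exp
    that subtracts the maximum exponent to avoid overflow.
    """
    exponents = [-beta * w for w in work]
    shift = max(exponents)
    log_mean = shift + math.log(sum(math.exp(e - shift) for e in exponents) / len(exponents))
    return -log_mean / beta

estimate = jarzynski_delta_f([1.0, 3.0])
```

By Jensen's inequality the estimate never exceeds the mean work, so `estimate` here is below 2.0. The exponential average is dominated by rare low-work samples, which is the variance problem that learned transport maps aim to reduce.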

FlashMD: Long-Stride, Universal Prediction of Molecular Dynamics

FlashMD is a GNN-based framework that directly predicts the evolution of positions and momenta along molecular dynamics trajectories with long strides, achieving time steps 1–2 orders of magnitude larger than those of conventional MD integrators. The architecture incorporates Hamiltonian dynamics constraints and generalizes to arbitrary thermodynamic ensembles and universal chemical systems.

From Black Hole to Galaxy: Neural Operator Framework for Accretion and Feedback Dynamics

A Neural Operator-based "sub-grid black hole" model is proposed to learn the small-scale (GR)MHD time-evolution operator \(u_t \to u_{t+\Delta T}\), replacing hand-crafted closure rules embedded in a multi-level direct numerical simulation framework. For the first time, the model captures the intrinsic variability of accretion-driven feedback, with a speedup of \(\sim 10^5\times\).

From Images to Physics: Probabilistic Inference of Galaxy Parameters and Emission Lines via VAE & Normalizing Flows

This work proposes a VAE–Normalizing Flow hybrid framework that jointly infers galaxy physical parameters (stellar mass, SFR, redshift, gas-phase metallicity, central black hole mass) and emission line fluxes (Hα, Hβ, [N II], [O III]) in a probabilistic manner from SDSS gri images and photometric data, achieving over 100× speedup relative to SED fitting while providing well-calibrated posterior distributions.

From Simulations to Surveys: Domain Adaptation for Galaxy Observations

This work constructs a domain adaptation pipeline from simulated galaxies (TNG50) to real survey observations (SDSS). Features are aligned using Euclidean distance, optimal transport, and a top-\(k\) soft-matching loss with trainable weight scheduling. Target-domain morphology classification accuracy improves from 46.8% (no adaptation) to 87.3%, and Macro F1 from 0.298 to 0.626.

Guided Diffusion Sampling on Function Spaces with Applications to PDEs

This paper proposes FunDPS (Function-space Diffusion Posterior Sampling), which trains an unconditional diffusion model in function space and performs plug-and-play posterior sampling for PDE inverse problems via gradient guidance at inference time. Theoretically, it extends the Tweedie formula to infinite-dimensional Banach spaces. Empirically, across 5 PDE tasks with only 3% observations, FunDPS achieves 32% higher accuracy on average than DiffusionPDE while reducing the number of sampling steps by 4×.

GyroSwin: 5D Surrogates for Gyrokinetic Plasma Turbulence Simulations

This work presents GyroSwin, the first scalable 5D neural surrogate model for gyrokinetic plasma turbulence. It extends the Swin Transformer to the 5D gyrokinetic phase space, employs cross-attention for 3D↔5D interaction, and adopts channelwise mode separation to capture zonal flows. GyroSwin achieves higher accuracy than conventional quasilinear methods while being three orders of magnitude faster than the numerical solver GKW.

Hamiltonian Neural PDE Solvers through Functional Approximation

Grounded in the Riesz representation theorem, this work approximates infinite-dimensional Hamiltonian functionals via learnable integral kernel functionals (IKF). Functional derivatives are obtained through automatic differentiation, yielding an energy-conserving neural PDE solver (HNS) that demonstrates superior stability and generalization on 1D/2D PDEs.

High-order Equivariant Flow Matching for Density Functional Theory Hamiltonian Prediction

This paper proposes QHFlow, the first method to apply conditional flow matching to density functional theory (DFT) Hamiltonian matrix prediction. By designing high-order SE(3)-equivariant vector fields and symmetry-aware prior distributions, QHFlow reduces Hamiltonian prediction error by 73% on MD17 and accelerates DFT computation by 54% when used as an SCF initializer.

INC: An Indirect Neural Corrector for Auto-Regressive Hybrid PDE Solvers

This paper proposes the Indirect Neural Corrector (INC), which embeds learned correction terms into the right-hand side (RHS) of PDEs rather than directly modifying the state. The approach is theoretically shown to reduce error amplification by a factor of \(\mathcal{O}(\Delta t^{-1}+L)\), and achieves substantial improvements in long-term trajectory performance across 6 PDE systems (from 1D chaos to 3D turbulence), with R² gains up to 158.7% and up to 330× acceleration.

Integration Matters for Learning PDEs with Backward SDEs

This paper identifies the root cause of why standard BSDE methods underperform PINNs — an irreducible discretization bias introduced by Euler-Maruyama integration — and proposes Heun-BSDE based on the Stratonovich formulation to fully eliminate this bias, achieving competitive performance against PINNs on high-dimensional PDEs.
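
The difference between the two integrators can be seen in a single step of a scalar SDE. The sketch below shows the generic Euler-Maruyama step and the generic stochastic Heun (predictor-corrector) step, which is consistent with the Stratonovich interpretation. It is not the Heun-BSDE training loss itself, and all function names are assumptions.

```python
def euler_maruyama_step(x, drift, diffusion, dt, dw):
    """One Euler-Maruyama step for dX = drift(X) dt + diffusion(X) dW."""
    return x + drift(x) * dt + diffusion(x) * dw

def heun_step(x, drift, diffusion, dt, dw):
    """One stochastic Heun step (Stratonovich interpretation)."""
    # Predictor: a plain Euler-Maruyama step.
    x_pred = x + drift(x) * dt + diffusion(x) * dw
    # Corrector: average drift and diffusion over both ends of the step.
    avg_drift = 0.5 * (drift(x) + drift(x_pred))
    avg_diffusion = 0.5 * (diffusion(x) + diffusion(x_pred))
    return x + avg_drift * dt + avg_diffusion * dw

# Multiplicative noise with zero drift, one step with noise increment dw = 0.1.
em_value = euler_maruyama_step(1.0, lambda x: 0.0, lambda x: x, 0.1, 0.1)
heun_value = heun_step(1.0, lambda x: 0.0, lambda x: x, 0.1, 0.1)
```

With multiplicative noise, the Heun step picks up an extra term of size `dw**2 / 2` that the Euler-Maruyama step lacks. This term is the per-step signature of the gap between the Ito and Stratonovich formulations.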

Knowledge is Overrated: A Zero-Knowledge ML and Cryptographic Hashing-Based Framework for Verifiable, Low Latency Inference at the LHC

This paper proposes PHAZE, a framework that combines cryptographic hashing (Rabin fingerprinting) and zero-knowledge machine learning (zkML) to enable verifiable early-exit inference at LHC trigger latency, achieving a theoretical online latency of ~152–253 ns while providing built-in anomaly detection capability.

Latent Representation Learning in Heavy-Ion Collisions with MaskPoint Transformer

This work introduces a masked point cloud Transformer autoencoder to heavy-ion collision analysis. Through a two-stage paradigm of self-supervised pre-training followed by supervised fine-tuning, the model learns nonlinear latent representations substantially stronger than those of PointNet—reducing PC1 distribution overlap from 2.42% to 0.27%—providing a general feature learning framework for studying QGP properties.

Multi-Modal Masked Autoencoders for Learning Image-Spectrum Associations for Galaxy Evolution and Cosmology

This paper applies a Multimodal Masked Autoencoder (MMAE) to jointly model galaxy images (HSC-PDR2, five bands) and spectra (DESI-DR1), constructing a cross-modal dataset GalaxiesML-Spectra of 134,533 galaxies. Under a 75% masking ratio, the model reconstructs major spectral emission lines and image morphology. When spectra are entirely absent at inference, the model achieves \(\sigma_{\text{NMAD}}=0.016\) for redshift prediction using images alone, outperforming AstroCLIP while extending the redshift range to \(z \sim 4\) for the first time.
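
The \(\sigma_{\text{NMAD}}\) metric quoted above is commonly defined as \(1.4826 \times \mathrm{median}(|\Delta z - \mathrm{median}(\Delta z)|)\) with \(\Delta z = (z_{\text{pred}} - z_{\text{true}})/(1 + z_{\text{true}})\). The sketch below assumes that convention; some works omit the inner median subtraction, and the summary does not say which variant the paper uses.

```python
from statistics import median

def sigma_nmad(z_pred, z_true):
    """Normalized median absolute deviation of redshift residuals."""
    # Residuals scaled by (1 + z_true), the usual photo-z normalization.
    dz = [(p - t) / (1.0 + t) for p, t in zip(z_pred, z_true)]
    center = median(dz)
    # 1.4826 rescales the MAD to match the standard deviation of a Gaussian.
    return 1.4826 * median([abs(d - center) for d in dz])

score = sigma_nmad([0.0, 1.2, 3.4, 0.8], [0.0, 1.0, 3.0, 1.0])
```

Because it is built on medians, the metric is insensitive to a small fraction of catastrophic outliers, which are usually reported separately.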

Multi-Trajectory Physics-Informed Neural Networks for HJB Equations with Hard-Zero Terminal Inventory: Optimal Execution on Synthetic & SPY Data

To address the hard-zero terminal inventory constraint (\(X_T=0\)) in HJB equations arising from optimal trade execution, this paper proposes Multi-Trajectory PINN (MT-PINN). Through a rollout-based terminal loss and a \(\lambda\)-curriculum training strategy, MT-PINN significantly outperforms vanilla PINN on both synthetic benchmarks and SPY backtesting, achieving a substantial reduction in terminal inventory violation rates.

Neural Deprojection of Galaxy Stellar Mass Profiles

A neural network approach is proposed to map Nuker galaxy profile parameters to analytically deprojectable Multi-Gaussian Expansion (MGE) components, enabling stellar mass modeling of galaxies without optical imaging. The method is integrated into the differentiable dynamical modeling pipeline SuperMAGE for Bayesian inference of supermassive black hole (SMBH) masses.

Neural Emulator Superiority: When Machine Learning for PDEs Surpasses its Training Data

This work challenges the prevailing assumption that the accuracy of neural PDE emulators is bounded by that of their training data (i.e., the numerical solver). It discovers and rigorously defines the phenomenon of emulator superiority—neural networks trained solely on low-accuracy solver data can, when evaluated against high-accuracy reference solutions, outperform the very solver that generated their training data.

Neural Green's Functions

This paper proposes Neural Green's Functions, a learnable linear PDE solution operator based on eigendecomposition: pointwise geometric features are extracted from the domain geometry to predict the eigendecomposition of the Green's function, enabling one-time training to solve for arbitrary source functions and boundary conditions via numerical integration. On mechanical part thermal analysis, the method reduces error by 13.9% over the state-of-the-art neural operator while running 350× faster than numerical solvers.
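
The eigendecomposition idea can be sketched on a 1D Poisson problem, where the solution is \(u = \sum_k \phi_k \langle \phi_k, f \rangle / \lambda_k\). In this toy the eigenpairs of the discrete Dirichlet Laplacian are known analytically; in the paper they are predicted by a network from domain geometry, which is not reproduced here. All function names are assumptions.

```python
import math

def poisson_eigenpairs(n):
    """Eigenpairs of the 1D Dirichlet Laplacian on n interior grid points of (0, 1)."""
    h = 1.0 / (n + 1)
    pairs = []
    for k in range(1, n + 1):
        lam = (2.0 - 2.0 * math.cos(k * math.pi * h)) / h ** 2
        # Orthonormal discrete sine mode.
        vec = [math.sqrt(2.0 * h) * math.sin(k * math.pi * (j + 1) * h) for j in range(n)]
        pairs.append((lam, vec))
    return pairs

def green_solve(pairs, f):
    """Apply the Green's function: sum over modes of vec * (vec . f) / lam."""
    u = [0.0] * len(f)
    for lam, vec in pairs:
        coeff = sum(v * fj for v, fj in zip(vec, f)) / lam
        for j in range(len(f)):
            u[j] += coeff * vec[j]
    return u

def apply_laplacian(u):
    """Apply the discrete operator -d^2/dx^2 with zero boundary values."""
    n = len(u)
    h = 1.0 / (n + 1)
    out = []
    for j in range(n):
        left = u[j - 1] if j > 0 else 0.0
        right = u[j + 1] if j < n - 1 else 0.0
        out.append((2.0 * u[j] - left - right) / h ** 2)
    return out

pairs = poisson_eigenpairs(15)
solution = green_solve(pairs, [1.0] * 15)  # solves -u'' = 1 with u(0) = u(1) = 0
```

Once the eigenpairs are available, any new source function `f` is solved by the same cheap projection, which is the one-time-training property the summary describes.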

Neural Network for Simulating Radio Emission from Extensive Air Showers

A simple fully connected neural network is employed to replace computationally expensive CoREAS Monte Carlo simulations, enabling fast prediction of radio pulses from extensive air showers (EAS) while achieving \(X_{\text{max}}\) reconstruction resolution comparable to conventional simulations.

Neuro-Spectral Architectures for Causal Physics-Informed Networks

NeuSA integrates classical spectral methods with Neural ODEs: the PDE is projected onto a spectral basis (Fourier) to obtain an ODE system, which is then solved by a NODE that learns the dynamical evolution. This architecture-level design eliminates the spectral bias and causality violations inherent in conventional PINNs, achieving errors 1–2 orders of magnitude lower than baselines on wave, Burgers, and sine-Gordon equations while training faster.

One-Shot Transfer Learning for Nonlinear PDEs with Perturbative PINNs

By combining perturbation theory with PINNs, this work decomposes nonlinear PDEs into a sequence of linear subproblems. After learning the latent space of the linear operator via a Multi-Head PINN, transfer to new PDE instances is achieved through a closed-form solution within 0.2 seconds, attaining errors on the order of \(10^{-3}\).

Physics-Constrained Flow Matching: Sampling Generative Models with Hard Constraints

This paper proposes Physics-Constrained Flow Matching (PCFM), a zero-shot inference framework that enforces arbitrary nonlinear equality constraints to machine precision during sampling from pretrained flow matching models. The framework alternates among forward shooting with projection, OT-interpolation backward updates, and relaxed penalty correction at each sub-step, achieving up to 99.5% improvement over baselines on PDE problems involving shocks and discontinuities.

Physics-Guided Machine Learning for Uncertainty Quantification in Turbulence Models

This paper proposes a hybrid ML–EPM framework that employs a lightweight CNN to learn a correction mapping from RANS turbulent kinetic energy fields to DNS ground truth, using the learned corrections to modulate the perturbation magnitude of the Eigenspace Perturbation Method (EPM). The approach reduces uncertainty estimation errors by 1–2 orders of magnitude while preserving physical consistency.

Physics-Informed Neural Networks with Fourier Features and Attention-Driven Decoding

This paper proposes Spectral PINNsformer (S-Pformer), which replaces the encoder of PINNsformer with Fourier feature embeddings and adopts a decoder-only Transformer architecture. S-Pformer achieves superior performance on multiple PDE benchmarks while reducing parameter count by 18.6%, effectively alleviating the spectral bias problem.
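
A Fourier feature embedding maps a coordinate \(x\) to sines and cosines at several frequencies, which is a standard remedy for spectral bias. The sketch below is a generic version; the frequency list and the function name are assumptions, not the embedding used in S-Pformer.

```python
import math

def fourier_features(x, freqs):
    """Map a scalar coordinate to [sin(2*pi*b*x), cos(2*pi*b*x)] for each frequency b."""
    feats = []
    for b in freqs:
        feats.append(math.sin(2.0 * math.pi * b * x))
        feats.append(math.cos(2.0 * math.pi * b * x))
    return feats

embedding = fourier_features(0.25, [1.0, 2.0])
```

The embedding has two entries per frequency, so higher-frequency content in the solution becomes a low-frequency function of the embedded input.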

POLARIS: A High-contrast Polarimetric Imaging Benchmark Dataset for Exoplanetary Disk Representation Learning

This work introduces POLARIS, the first ML benchmark dataset for exoplanetary polarimetric imaging (921 VLT/SPHERE/IRDIS polarimetric images + 75,910 preprocessed exposures), and proposes the Diff-SimCLR framework (diffusion-augmented contrastive learning), achieving 93% accuracy on the reference-star vs. target-star classification task with fewer than 10% manual annotations.

Quantum Doubly Stochastic Transformers

This paper proposes QDSFormer (Quantum Doubly Stochastic Transformer), replacing softmax with a variational quantum circuit QontOT to generate doubly stochastic attention matrices. Both theoretical analysis and experiments demonstrate that quantum-circuit-generated DSMs are more diverse and better at preserving information, consistently outperforming standard ViT and Sinkformer on multiple small-scale visual recognition tasks.
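
To make "doubly stochastic" concrete, the sketch below shows the classical Sinkhorn normalization, the approach associated with the Sinkformer baseline: alternately rescale rows and columns of a positive matrix until both sum to one. It does not implement the quantum circuit, and the function name and iteration count are assumptions.

```python
import math

def sinkhorn(scores, iters=200):
    """Turn a square score matrix into an approximately doubly stochastic matrix."""
    n = len(scores)
    # Exponentiate so that all entries are strictly positive.
    k = [[math.exp(s) for s in row] for row in scores]
    for _ in range(iters):
        # Row normalization.
        row_sums = [sum(row) for row in k]
        k = [[k[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        # Column normalization.
        col_sums = [sum(k[i][j] for i in range(n)) for j in range(n)]
        k = [[k[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return k

attention = sinkhorn([[0.0, 1.0, 2.0], [2.0, 0.0, 1.0], [1.0, 3.0, 0.0]])
```

Softmax attention only normalizes rows; a doubly stochastic matrix also constrains how much total attention each token receives.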

Scaling Laws and Pathologies of Single-Layer PINNs: Network Width and PDE Nonlinearity

This work establishes empirical scaling laws for single-layer PINNs on representative nonlinear PDEs, identifying a dual optimization failure: a width-scaling pathology (error does not decrease with width) and a compound pathology (nonlinearity exacerbates this failure), demonstrating that optimization rather than approximation capacity is the primary bottleneck.

Score-informed Neural Operator for Enhancing Ordering-based Causal Discovery

This paper proposes SciNO (Score-informed Neural Operator), a probabilistic generative model designed in a smooth function space that stably approximates the log-density Hessian diagonal to improve ordering-based causal discovery, achieving a 42.7% reduction in order divergence on synthetic graphs and 31.5% on real-world data.

Simulation-Based Inference for Neutrino Interaction Model Parameter Tuning

This work presents the first application of simulation-based inference (SBI) to neutrino interaction model parameter tuning. Using neural posterior estimation (NPE), the method learns the posterior distribution of 4 physical parameters from 200K GENIE-simulated 58-bin histograms, and accurately recovers the ground-truth parameter values on mock data from the MicroBooNE Tune.

Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon

This paper investigates the generalization properties of stable minima (flat minima) in two-layer overparameterized ReLU networks. It proves that while flatness does imply generalization, the convergence rate deteriorates exponentially with input dimension (i.e., the curse of dimensionality applies), forming an exponential separation from low-norm solutions (weight decay) that are immune to this curse. The paper further identifies the "neural shattering" phenomenon as the geometric mechanism underlying failure in high dimensions.

Symbolic Regression Is All You Need: From Simulations to Scaling Laws in Binary Neutron Star Mergers

This work applies Symbolic Regression (SR) to automatically discover analytic calibration relations for post-merger accretion disk mass in binary neutron star mergers from numerical relativity simulation data. The resulting compact expressions comprehensively outperform existing empirical fitting formulae in the literature in terms of predictive accuracy, generalization, and interpretability.

The Pareto Frontier of Resilient Jet Tagging

This work systematically evaluates the AUC–resilience trade-off across multiple architectures (DNN/PFN/EFN/ParT) for LHC jet tagging tasks, revealing that more complex models achieve higher AUC but exhibit stronger Monte Carlo model dependence. A Pareto frontier is constructed, and a case study demonstrates that low-resilience classifiers introduce bias in downstream parameter estimation even after calibration.

The Platonic Universe: Do Foundation Models See the Same Sky?

This paper validates the Platonic Representation Hypothesis (PRH) in an astronomical setting. Using JWST, HSC, Legacy Survey, and DESI spectroscopic data, it measures representation alignment across six foundation models (ViT/ConvNeXt/DINOv2/IJEPA/AstroPT/Specformer) and finds that both intra-modal and cross-modal MKNN scores consistently increase with model scale (\(p = 3.31 \times 10^{-5}\)), supporting the hypothesis that models of different architectures and modalities converge toward a shared representation.

The Primacy of Magnitude in Low-Rank Adaptation

This paper reveals that weight update magnitude is the fundamental driver of performance in LoRA, unifying the influence of learning rate, scaling factor, and initialization strategy under a single framework. It further proposes LoRAM—an efficient initialization method based on deterministic orthogonal bases and magnitude scaling—that matches or surpasses spectral initialization methods without requiring SVD.

TITAN: A Trajectory-Informed Technique for Adaptive Parameter Freezing in Large-Scale VQE

This paper proposes TITAN, a framework that employs deep learning models to predict "frozen parameters" in VQE—parameters that remain inactive throughout training—enabling 40–60% of parameters to be frozen at initialization, achieving up to 3× convergence speedup and 40–60% reduction in circuit evaluations, while matching or surpassing baseline accuracy on molecular systems of up to 30 qubits.

Toward Complete Merger Identification at Cosmic Noon with Deep Learning

A ResNet18 is trained on simulated HST CANDELS images generated from IllustrisTNG50, demonstrating for the first time that deep learning can successfully identify galaxy mergers at high redshift (\(1<z<1.5\)), including minor mergers (\(\mu \geq 1/10\)) and low-mass galaxies (\(M_\star > 10^8 M_\odot\)), achieving an overall accuracy of ~73%. Model behavior is further analyzed through Grad-CAM and UMAP.

Towards Universal Neural Operators through Multiphysics Pretraining

This paper proposes an adapter-based multiphysics pretraining framework for neural operators. By treating lifting/projection layers as problem-specific adapters and freezing shared kernel integration operator layers, the framework enables transfer learning across PDE problems, substantially reducing fine-tuning cost while improving generalization.

Transfer Learning Beyond the Standard Model

This work investigates whether neural networks pre-trained on the standard cosmological model (ΛCDM) can transfer to beyond-standard-model scenarios (massive neutrinos, modified gravity, primordial non-Gaussianity). The study finds that a dummy node architecture can reduce simulation requirements by an order of magnitude, but negative transfer emerges when parameters exhibit strong physical degeneracies (e.g., between \(\sigma_8\) and \(M_\nu\)).

Tropical Attention: Neural Algorithmic Reasoning for Combinatorial Algorithms

Tropical Attention replaces softmax dot-product attention with tropical algebraic geometry, performing piecewise-linear reasoning in tropical projective space to align with the polyhedral decision structures of combinatorial algorithms. It is the first approach to extend neural algorithmic reasoning to NP-hard problems, comprehensively outperforming softmax baselines across three OOD generalization axes: length, magnitude, and noise.
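
The tropical semiring replaces ordinary addition with max and ordinary multiplication with addition. The sketch below shows a matrix product under that max-plus convention, which is piecewise linear in its inputs. It illustrates only the algebra, not the paper's attention layer, and the max-plus (rather than min-plus) convention is an assumption.

```python
def tropical_matmul(a, b):
    """Max-plus matrix product: out[i][j] = max over k of (a[i][k] + b[k][j])."""
    inner = len(b)
    cols = len(b[0])
    return [
        [max(row[k] + b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]

product = tropical_matmul([[0, 1], [2, -1]], [[3, 0], [1, 4]])
```

Many dynamic-programming recurrences, such as longest-path or (with min in place of max) shortest-path updates, have exactly this form, which is why the algebra aligns with combinatorial algorithms.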

Unsupervised Discovery of High-Redshift Galaxy Populations with Variational Autoencoders

A variational autoencoder (VAE) is applied to unsupervised clustering of 2,743 JWST high-redshift (\(z>4\)) galaxy spectra, uncovering 12 distinct astrophysical categories and more than doubling the known sample sizes of rare populations including post-starburst galaxies, Lyman-α emitters, extreme emission line galaxies, and Little Red Dots.

Vision Transformers for Cosmological Fields: Application to Weak Lensing Mass Maps

This work presents the first systematic application of Vision Transformers (ViT and Swin Transformer) to constraining cosmological parameters (\(\Omega_m\) and \(S_8\)) from weak lensing convergence maps, comparing attention-based architectures against CNNs within a simulation-based inference framework.

Why Is Attention Sparse in Particle Transformer?

This paper systematically analyzes the near-binary sparse attention phenomenon observed in Particle Transformer (ParT) after training on jet tagging tasks. Through cross-dataset comparisons and ablation studies, it demonstrates that the sparsity primarily originates from the attention mechanism itself rather than the physics-inspired interaction matrix. Nevertheless, the interaction matrix remains indispensable to final performance by influencing the argmax particle selection for the vast majority of tokens.