📂 Others

🧪 ICML2026 · 70 paper notes

📌 Same area in other venues: 📷 CVPR2026 (105) · 🔬 ICLR2026 (116) · 💬 ACL2026 (4) · 🤖 AAAI2026 (117) · 🧠 NeurIPS2025 (121) · 📹 ICCV2025 (33)

🔥 Top topics: Layout & Composition ×2 · Agents ×2 · Alignment/RLHF ×2

A Hypertoroidal Covering for Perfect Color Equivariance: This paper uses a double-cover mapping to lift the interval-valued saturation and luminance in HSL space onto circle groups, constructing \(\mathbb{T}^3\)CEN. This enables the network to achieve precise color equivariance for hue, saturation, and luminance shifts, enhancing robustness in tasks such as color-shifted and medical imaging.
Adaptive Multi-Round Allocation with Stochastic Arrivals: This paper formalizes network recruitment as a budget-constrained sequential control problem and proves that single-round optimal allocation is greedy. By introducing a population-level surrogate value function, the complexity of multi-round planning is reduced to \(O(b^5\log b)\). Furthermore, a robustness guarantee is provided, decomposing model errors into frontier-level, population-level, and approximation errors.
AI Cap-and-Trade: Efficiency Incentives for Accessibility and Sustainability: Drawing on carbon cap-and-trade, the authors propose a quota-trading market for AI inference FLOPs (AI Allowance). Using KKT conditions, they prove this mechanism strictly reduces FLOP usage by companies under reasonable parameters, simultaneously addressing energy consumption and the exclusion of small companies in the LLM era.
AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training: AMDP utilizes multi-directional asynchronous pipelines, a one-step parameter mismatch upper bound, gradient accumulation, and ZeRO state sharding to improve the throughput of large-scale model pipeline parallel training while maintaining near-synchronous convergence. In 8-GPU GPT/BERT experiments, it achieves a maximum improvement of approximately 17% relative to the strongest asynchronous baselines.
Amortized Simulation-Based Inference in Generalized Bayes via Neural Posterior Estimation: This paper amortizes the power posterior family in generalized Bayes into a single neural posterior estimator conditioned on both the observation \(x\) and the temperature \(\beta\). This allows posterior sampling for different observations and varying temperatures to be completed in a single forward pass, eliminating the need to run MCMC for every instance.
AutoNumerics-Zero: Automated Discovery of State-of-the-Art Mathematical Functions: AutoNumerics-Zero is proposed as an evolutionary symbolic regression method with zero prior knowledge. Starting from empty programs, it automatically discovers arithmetic programs for approximating transcendental functions (such as exponential and cosine functions). Under finite-precision targets, it surpasses classic approximation methods designed by mathematicians over centuries by requiring fewer operations.
Beyond Model Readiness: Institutional Readiness for AI Deployment in Public Systems: Addressing the widespread phenomenon of AI systems in the public sector being "technically feasible but failing in deployment," this paper proposes the Institutional Alignment Readiness (IAR) five-dimensional assessment framework. It evaluates whether a receiving institution is prepared for the responsible deployment of AI systems across five dimensions: institutional compatibility, data ecology maturity, human oversight capacity, fiscal sustainability, and legal alignment.
Bullet Trains: Parallelizing Training of Temporally Precise Spiking Neural Networks: This paper proposes a parallel training method for Spiking Neural Networks (SNNs) based on a parallel associative scan. It achieves up to 44× acceleration while maintaining exact hard-reset dynamics, and it uses a differentiable numerical root solver to compute spike times to machine precision.
Cascaded Flow Matching for Heterogeneous Tabular Data with Mixed-Type Features: TabCascade decomposes tabular rows into two cascaded segments: "low-resolution (categorical + discretized version of numerical)" and "high-resolution (continuous numerical)". It first learns the low-resolution joint distribution using CDTD and then generates numerical details using flow matching guided by the low-resolution information. Transport costs are tightened through data-dependent coupling and learnable non-linear time schedules. It natively supports the generation of "mixed-type features" (e.g., missing values, zero-inflation), achieving a 51.9% gain in detection scores over SOTA across 12 datasets.
Complexity as Advantage: A Regret-Based Perspective on Emergent Structure: This paper proposes Complexity-as-Advantage (CAA): redefining "complexity" as the regret dispersion of a family of resource-constrained observers on the same process. It proves that under the log-loss + Markov framework, it is equivalent to the sum of conditional mutual information atoms (recovering excess entropy); from a coding perspective, it is equivalent to the variance of excess description length (MDL). This unifies Kolmogorov complexity, Bennett's logical depth, and excess entropy into a computable and empirically estimable scalar spectrum.
Comprehensive AI Governance Requires Addressing Non-Model Gains: This position paper argues that the current model-centric AI governance paradigm is experiencing diminishing effectiveness as "non-model gains" (inference gains, systems gains, and asset gains) become increasingly significant. It calls for a multi-layered complementary governance framework—including system, entity, agent, and cloud governance—to fill existing regulatory gaps.
Connecting Independently Trained Modes via Layer-Wise Connectivity: This paper proposes the Low-Loss Path Finding (LLPF) algorithm, which reliably constructs low-loss paths between independently trained neural network models through layer-wise connectivity and variance sphere constraints. It supports modern architectures such as MobileNet, EfficientNet, and CCT, yielding highly reproducible results.
Continual Learning of Domain-Invariant Representations: The authors explicitly inject "Domain-Invariant Representation Learning (DIRL)" into continual learning for the first time. Using the replay buffer as a carrier for multi-domain invariance computation and domain-conditioned alignment, they propose five methods—⋆-CL-{VREX, Fishr, CORAL, MMD, ANDMask}—pushing target domain accuracy to SOTA across six vision, medical, manufacturing, and ecology datasets.
Coupled Training with Privileged Information and Unlabeled Data: Addressing the "available during training, unavailable at deployment" privileged features \(W\), the authors propose a framework for joint training of a deployment model \(f\) and a rich-view model \(g\). By explicitly constraining the fitting error of \(g\) on labeled data to adaptively control the influence intensity of privileged information, this approach avoids the negative transfer phenomenon of traditional two-stage pseudo-labeling methods when \(W\) signals are weak or noisy.
CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities: The authors construct CyberGym-E2E, the first large-scale real-world AI Agent security benchmark covering the full lifecycle of "vulnerability discovery \(\rightarrow\) PoC generation \(\rightarrow\) patch generation \(\rightarrow\) functional regression testing" (920 vulnerabilities across 139 open-source projects). Using an agent-assisted pipeline with expert final review, manual costs are minimized. Evaluations show that while frontier models achieve 80%+ on patch-only tasks, the S3 success rate for end-to-end tasks peaks at 65.9% (GPT-5.4), indicating that vulnerability discovery, rather than patch generation, is the true bottleneck.
Decision Tree Learning on Product Spaces: This paper extends the theoretical guarantees of Blanc et al. (ITCS'20) for the "top-down greedy decision tree heuristic" from uniform distributions to arbitrary product distributions. It provides an upper bound on tree size of \(\exp(\Delta_\mathrm{opt} D_\mathrm{opt}\log(e/\epsilon))\) (strictly superior to ITCS'20 in the full binary tree case) and is entirely parameter-free, requiring no prior knowledge of the optimal tree size or depth.
Decoupled Conformal Optimisation: Efficient Prediction Sets via Independent Tuning and Calibration: This paper proposes DCO-Warmstart—a "train–tune–calibrate" tripartite Bayesian conformal optimization paradigm. By placing efficiency search on an independent tuning split and reserving the conformal quantile for an untouched calibration split, it achieves standard finite-sample marginal coverage guarantees on candidate structures of any size (even infinite) without requiring a confidence parameter \(\delta\). Empirically, it typically produces smaller prediction sets than coupled calibration methods like CRC or BQ.
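
The calibration step that the summary above reserves for an untouched split can be illustrated with the standard split-conformal quantile. This is a minimal sketch, not the DCO-Warmstart procedure itself: the function names, the toy nonconformity scores, and the choice \(\alpha = 0.25\) are all assumptions made for illustration, and the tuning-split search is omitted.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank with the (n + 1) correction
    if k > n:
        return math.inf  # too few calibration points: the set must include everything
    return sorted(scores)[k - 1]

def prediction_set(candidate_scores, qhat):
    """Keep every label whose nonconformity score is at most the threshold."""
    return [label for label, s in candidate_scores.items() if s <= qhat]

# Hypothetical nonconformity scores from a held-out calibration split.
calibration_scores = [0.05, 0.10, 0.20, 0.30, 0.50, 0.70, 0.90]
qhat = conformal_quantile(calibration_scores, alpha=0.25)

# Hypothetical per-label scores for one test input.
test_set = prediction_set({"cat": 0.20, "dog": 0.65, "fox": 0.95}, qhat)
```

Because the threshold is computed only from scores the tuning stage never saw, any model or score function chosen on the tuning split inherits the usual marginal coverage guarantee.
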
DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation: Using an anti-causal graph, the authors unify three types of biases—confounder, collider, and mediator—into a single conditional independence criterion \(\hat{Y} \perp \mathbf{B} \mid Y\). They design sDISCO, a single-step differentiable estimator with \(O(n^2)\) memory complexity, which serves as a regularization term to penalize conditional distance correlation in any gradient-trained network, thereby mitigating multiple biases and scaling to multi-bias scenarios.
DisjunctiveNet: Neural Symbolic Learning via Differentiable Convexified Optimization Layers: The authors formulate "input-dependent if-then logical rules" as disjunctive constraints representing the union of polyhedra. By utilizing a sequence of basic steps to convexify Conjunctive Normal Form (CNF) into the convex hull of Disjunctive Normal Form (DNF), they derive a differentiable LP projection layer. Neural network outputs passing through this layer precisely satisfy the original MILP-level constraints during both training and inference.
Envy-Free Allocation of Indivisible Goods via Noisy Queries: This paper establishes the first sample complexity benchmarks for the problem of "finding an envy-free allocation using noisy valuation queries." In a setting with two agents, additive Gaussian noise, \(m\) items, and an optimal negative envy gap \(\Delta\), the authors prove a tight bound of \(\widetilde{\Theta}(m^{2.5}/\Delta^2)\) (when \(\Delta\gg m^{1/4}\)). The upper bound is achieved by a polynomial-time algorithm using non-adaptive queries and a single-item threshold rule, while the lower bound holds for adaptive queries and arbitrary computational time.
Position: Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment: This position paper argues that evaluating ML model resource consumption and environmental impact must move beyond focusing solely on the marginal costs of "single training" or "single inference." Instead, it advocates for adopting Life Cycle Assessment (LCA) from industrial ecology to aggregate and attribute costs—ranging from hardware manufacturing (embodied costs) to operational costs of training and deployment—across the entire R&D-deployment lifecycle. It provides a four-phase LCA-for-ML methodology, cost attribution formulas, and an OLMo2 case study.
FOVI: Bio-inspired Foveated Interface for Deep Vision Models: Inspired by the human retina-V1 pathway, the authors construct FOVI (Foveated Interface), which uses a "cortical magnification function + local isotropic sampling" to create a non-uniform pixel distribution that remains uniformly dense on the sensor manifold. By introducing a novel kNN convolution + kernel mapping technique, FOVI is compatible with both CNNs and ViTs, allowing a DINOv3-ViT to approach full-resolution ImageNet accuracy using only ~1/16 of the pixels.
Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity: This paper theoretically characterizes the "functional equivalence" symmetry groups of Transformer attention with positional encodings—proving that sinusoidal positional encodings preserve the original attention's symmetry structure, while RoPE significantly compresses the symmetry group to enhance expressivity. Based on this, it designs a two-stage weight matching algorithm adaptable to both positional encodings and systematically validates linear mode connectivity (LMC) across various settings.
GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data: Addressing High-Dimension Low-Sample-Size (HDLSS) tabular tasks where features far outnumber samples, this paper keeps the TabPFN backbone frozen and introduces Graph-Guided Feature Ordering (GO-LR) to arrange related features adjacently, followed by Neuro-inspired Subunit Compression (NSC) to pool adjacent segments into a few meta-features. This allows thousands of features to fit into TabPFN's feature budget, achieving the top average rank across 8 genomic/image HDLSS datasets.
Guaranteed Optimal Compositional Explanations for Neurons: Compositional explanations typically use beam search to find the "logical formula that best aligns with neuron activations," but beam search lacks optimality guarantees. This paper proposes an exact decomposition of IoU (dIoU) + an admissible heuristic + a best-first optimal algorithm. For the first time, it guarantees a globally optimal solution within a runtime comparable to beam search, revealing that 10–40% of explanations in previous literature are actually suboptimal.
HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces: For Extreme Multi-label Classification (XMC) with millions of labels, HASTE replaces "per-label independent fan-in sampling" with "semantically grouped shared fan-in." Combined with a small dense head for high-frequency labels, this allows sparse training to achieve wall-clock gains matching its theoretical FLOPs on GPUs—reaching up to \(4.4\times\) forward and \(25\times\) backward speedups over existing sparse baselines while almost closing the accuracy gap with dense models.
How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks: This paper systematically compares the training performance of Muon and Adam on equivariant/geometric networks (EGNN, DGCNN, PointNet, GotenNet, GINE). It finds that Muon consistently outperforms Adam on 3D point cloud tasks and that the solutions converged upon exhibit significant structural differences across three dimensions: Hessian curvature, local smoothness of the loss landscape, and the spectral rank of weights/representations. This work repositions "optimizer choice" as a severely neglected inductive bias in the training of equivariant networks.
Identifiable Equivariant Networks are Layerwise Equivariant: This paper proves within an architecture-agnostic abstract framework that as long as parameters satisfy "weak identifiability," an end-to-end \(G\)-equivariant deep network must possess an equivalent parameterization where each layer is equivariant to some latent group action. This provides a theoretical explanation for the long-observed experimental phenomenon where "end-to-end equivariance spontaneously collapses into layerwise equivariance."
Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation: GLIDE unifies the latest estimators (PPI++, Stratified PPI, PTD, ASI) and samplers (uniform, stratified, active, cost-optimal) from the PPI (prediction-powered inference) family into a scipy-style mean estimation library. It specifically addresses the hybrid evaluation challenge of "expensive human annotation + cheap but biased LLM-as-judge," accompanied by Monte Carlo validation and a decision tree to enable industrialized, reliable assessment of GenAI and Agentic systems.
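
The hybrid evaluation setting described above (few human labels, many LLM-judge scores) is what the basic prediction-powered mean estimator targets. The sketch below is written from the textbook estimator and does not use or reproduce the GLIDE API; the function name and the toy data are assumptions.

```python
def ppi_mean(labeled_truth, labeled_pred, unlabeled_pred):
    """Prediction-powered mean: judge mean on unlabeled data plus a bias correction."""
    n = len(labeled_truth)
    # Rectifier: average gap between human labels and judge scores on the labeled set.
    rectifier = sum(y - f for y, f in zip(labeled_truth, labeled_pred)) / n
    judge_mean = sum(unlabeled_pred) / len(unlabeled_pred)
    return judge_mean + rectifier

# Hypothetical data: 4 items with human labels, 8 items scored only by the judge.
human = [1, 0, 1, 1]
judge_on_labeled = [1, 1, 1, 1]          # the judge is too lenient on one item
judge_on_unlabeled = [1, 1, 0, 1, 1, 1, 0, 1]

estimate = ppi_mean(human, judge_on_labeled, judge_on_unlabeled)
```

Here the judge alone would report 0.75, and the rectifier of -0.25 pulls the estimate down to 0.5.
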
Inference of Online Newton Methods with Nesterov's Accelerated Sketching: This paper equips online Newton methods with Nesterov's accelerated sketch-and-project solver, reducing the per-step cost to \(O(d^2)\). It characterizes for the first time the asymptotic normality of the last iterate under the dual uncertainty of "data randomness + solver randomness." Accompanied by a streaming covariance estimator that requires no matrix inversion, the proposed method makes accelerated sketched online Newton methods truly viable for statistical inference.
iWorld-Bench: A Benchmark for Interactive World Models with a Unified Action Generation Framework: iWorld-Bench is the first unified evaluation benchmark specifically designed for "interactive world models." It proposes an Action Generation Framework capable of converting text, one-hot, and camera intrinsic/extrinsic action inputs into a unified instruction space. Drawing on 330K videos, the benchmark defines 4.9K tasks and 9 metrics, which are used for a comprehensive comparison of 14 mainstream models.
Knowing Isn't Understanding: Re-Grounding Generative Proactivity with Epistemic and Behavioral Insight: This ICML 2026 position paper argues that the "proactivity" of generative agents should not merely be judged by whether they act earlier, more autonomously, or more persistently. Instead, it must be regulated by two joint constraints: epistemic legitimacy (whether the agent truly "understands" the context) and behavioral commitment (whether the intervention is reversible or forced to escalate). The authors re-interpret hallucinations, alignment failures, and unsafe autonomy as structural "mis-coupling" between knowing and acting.
Learning Permutation-Invariant Macroscopic Dynamics: Aiming at the naturally unordered microscopic states of particle systems, this paper proposes an autoencoder framework that "reconstructs density instead of particles." It utilizes a DeepSet encoder to obtain permutation-invariant closure variables \(\hat{\boldsymbol{z}}\) and employs conditional normalizing flows with a Gaussian mixture density, centered at observation points, as the reconstruction target. This approach bypasses point cloud matching and enables learning macroscopic dynamics via an SDE/ODE alongside macroscopic observables.
Less Data, Faster Training: Repeating Smaller Datasets Speeds Up Learning via Sampling Biases: This paper systematically characterizes and explains the "small-vs-large gap" phenomenon, where repeating smaller datasets leads to faster convergence than using larger datasets. The authors prove that this acceleration cannot be explained by the CSQ-SQ gap, gradient variance reduction, or input distribution bias. By analyzing a \(2\)-layer quadratic MLP on \(2\)-sparse parity, they derive a closed-form step bound \(T = O((Nd)^{1/4} \log(d/\varepsilon))\). Through intervention experiments—including random labels, initialization scaling, and inter-layer learning rates—they verify the core mechanism: the \(O(N^{-1/2})\) sampling bias inherent in small datasets accelerates first-layer feature learning by driving faster growth of the second-layer norm.
Local and Mixing-Based Algorithms for Gaussian Graphical Model Selection from Glauber Dynamics: The authors provide the first study on learning Gaussian Graphical Model (GGM) structures from a single Gaussian Glauber dynamics trajectory. They propose two complementary algorithms: LET-GL (local edge detection based on \(i,i,j,i\) windows, perfectly parallelizable) and BTR-GL (decorrelating the trajectory into approximate i.i.d. samples via burn-in/thinning under the Dobrushin condition for consumption by off-the-shelf i.i.d. learners). The work provides finite-sample recovery guarantees, information-theoretic lower bounds, and an independently useful TV mixing upper bound for random-scan Gaussian Gibbs samplers.
MalTree: Tracing Malware Evolution from Embeddings at Scale: MalTree transposes phylogenetic tree techniques from bioinformatics (UPGMA, Neighbor-Joining) to malware analysis. It extracts static, dynamic, and image-based tri-modal embeddings from memory dumps to reconstruct "evolutionary trees" for malware families at scale. Utilizing VirusTotal timestamps for the first temporal verification (achieving 87% temporal consistency), it demonstrates on 100k+ samples and 538 families that embedding distances approximate real evolutionary sequences, shifting malware analysis from "per-sample classification" toward "lineage-aware evolutionary modeling."
Mapping Human Anti-collusion Mechanisms to Multi-agent AI Systems: This is a position/taxonomy paper: it categorizes centuries of human anti-collusion experience (sanctions, leniency and whistleblowing, monitoring/auditing, market design, and governance) into five categories based on the lifecycle. These are mapped to implementable interventions for multi-agent AI systems (reward penalty, whistleblower agent, telemetry-first overseer, interaction protocol design, shutdown mechanisms, etc.), while identifying open challenges unique to AI such as attribution, identity fluidity, the cooperation-collusion boundary, and adversarial adaptation.
Markov Chain Monte Carlo without Evaluating the Target: An Auxiliary Variable Approach: The authors unify three categories of "target-free" MCMC—exchange, PoissonMH, and TunaMH—into a meta-algorithm using auxiliary variables. By introducing auxiliary randomness in both the proposal and the acceptance rate, they design gradient-based MCMC methods (Poisson–Barker, Poisson–MALA, Tuna–SGLD) that maintain exact stationary distributions under minibatch data, significantly outperforming baselines such as PoissonMH/TunaMH/SGLD.
Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks: The authors point out that "output predictability from metadata" \(\neq\) "output dependence on evidence." They propose a dual-statistic audit protocol: using MPDS to measure metadata predictability and evidence-shuffling \(\Delta\text{Evi}\) to measure evidence sensitivity, supplemented by a stronger-reader calibration layer and input ablation, forming a reusable 4-step diagnostic scheme for weak-label benchmarks.
MetaDNS: Enhancing Exploration in Discrete Neural Samplers via Well-Tempered Metadynamics: This work adapts "well-tempered metadynamics" from molecular dynamics into discrete neural samplers. By accumulating a history-dependent bias potential \(V_t(s)\) along low-dimensional collective variables to flatten visited energy basins, it forces MDNS-like models to cross energy barriers and cover multimodal Boltzmann distributions, while preserving unbiased estimation through importance reweighting.
Multi-Level Strategic Classification: Incentivizing Improvement Through Promotion and Relegation Dynamics: This paper extends traditional one-shot "strategic classification" into a sequential mechanism composed of multi-level ternary classifiers (pass/abstain/fail = promotion/stay/relegation). It proves that by leveraging three intertemporal effects — the discount factor \(\beta\), skill retention rate \(\gamma\), and "leg-up gain" \(\delta\) — the non-incentivizable region can be shrunk from \(c^+>c^-\) to \((1-\beta\gamma)c^+>c^-\). Furthermore, it provides a steady-state threshold sequence \(\mu_l = \delta(l-1)/(1-\gamma)\), demonstrating that under mild conditions, honest effort can be incentivized to push attributes to arbitrarily high levels.
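
The two formulas quoted above can be checked numerically. The sketch below only evaluates the stated expressions \(\mu_l = \delta(l-1)/(1-\gamma)\) and \((1-\beta\gamma)c^+>c^-\); the function names and all parameter values are illustrative assumptions, not figures from the paper.

```python
def thresholds(delta, gamma, levels):
    """Steady-state threshold mu_l = delta * (l - 1) / (1 - gamma) for l = 1..levels."""
    return [delta * (l - 1) / (1 - gamma) for l in range(1, levels + 1)]

def non_incentivizable(c_plus, c_minus, beta=0.0, gamma=0.0):
    """Condition (1 - beta * gamma) * c_plus > c_minus; beta = gamma = 0 gives the one-shot case."""
    return (1 - beta * gamma) * c_plus > c_minus

# Hypothetical parameters.
mu = thresholds(delta=0.5, gamma=0.5, levels=4)
one_shot = non_incentivizable(c_plus=1.5, c_minus=1.0)
sequential = non_incentivizable(c_plus=1.5, c_minus=1.0, beta=0.9, gamma=0.5)
```

With these values the one-shot condition holds (1.5 > 1.0) while the sequential one does not (0.825 < 1.0), so this parameter point leaves the non-incentivizable region once the intertemporal effects are included.
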
nD-RoPE: A Generalized RoPE for n-Dimensional Position Embedding: The method evolves RoPE from "axis-wise splitting" to "encoding positions and frequencies as holistic n-dimensional vectors via a single inner product rotation \(e^{j\boldsymbol{\omega}^\top\mathbf{x}}\)." By employing regular simplex wave vectors to ensure isotropy, it achieves consistent accuracy gains and superior resolution/density extrapolation across images, videos, and point clouds.
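
The rotation \(e^{j\boldsymbol{\omega}^\top\mathbf{x}}\) mentioned above has the familiar RoPE property that attention scores depend only on relative position. The sketch below demonstrates this in 2D using the three vertices of an equilateral triangle as wave vectors (a regular simplex in the plane). The channel count, feature values, and function names are assumptions for illustration and do not reproduce the paper's parameterization.

```python
import cmath
import math

def rotate(features, position, wave_vectors):
    """Multiply each complex channel by exp(j * omega . x) for its own wave vector."""
    out = []
    for z, omega in zip(features, wave_vectors):
        phase = sum(w * p for w, p in zip(omega, position))
        out.append(z * cmath.exp(1j * phase))
    return out

def attention_score(q, k):
    """Real part of the Hermitian inner product between rotated query and key."""
    return sum(a * b.conjugate() for a, b in zip(q, k)).real

# Wave vectors at the vertices of an equilateral triangle (illustrative choice).
h = math.sqrt(3) / 2
wave_vectors = [(1.0, 0.0), (-0.5, h), (-0.5, -h)]

q = [1 + 2j, 0.5 - 1j, -1 + 0.3j]
k = [0.2 + 1j, 1 - 1j, 0.7 + 0.7j]

# Two query/key position pairs sharing the same offset (2, 3).
s1 = attention_score(rotate(q, (3, 5), wave_vectors), rotate(k, (1, 2), wave_vectors))
s2 = attention_score(rotate(q, (13, -5), wave_vectors), rotate(k, (11, -8), wave_vectors))
```

The two scores agree up to floating-point error because each channel contributes \(q\,\bar{k}\,e^{j\boldsymbol{\omega}^\top(\mathbf{x}_q-\mathbf{x}_k)}\), which contains only the position difference.
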
Networked Information Aggregation for Binary Classification: This work extends the conclusion of Kearns-Roth-Ryu 2026—which states that linear regression agents on a DAG can approach global optimality by sequentially passing prediction columns—to binary classification. Under the \(M\)-coverage condition, each agent observes a subset of feature columns and sequentially forwards its logits downstream, achieving global logistic regression optimality with an excess BCE loss of \(O(M/\sqrt{D})\). Simultaneously, a hard instance is constructed to prove an \(\Omega(k/D)\) lower bound, characterizing network depth as the fundamental bottleneck for information aggregation.
New Bounds for Kernel Sums via Fast Spherical Embeddings: The authors accelerate the "randomized Nash device" spherical embedding theorem of Bartal-Recht-Schulman 2011 using iterative Fastfood transforms (time \(\widetilde{O}(d + \Lambda^2 + \varepsilon^{-2})\)). This serves as a preprocessing step for Gaussian KDE to compress the effective diameter to \(\widetilde{O}(1/\sqrt{\varepsilon})\), yielding a new Gaussian KDE query time bound of \(\widetilde{O}(d + \varepsilon \Delta_\sigma^2 + 1/\varepsilon^3)\), which outperforms RFF / FJLT+RFF / Fastfood in the regime of small \(\varepsilon\) and medium diameter.
NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search: NonZero compresses the multi-agent MCTS joint-action space of size \(d^n\) into a low-dimensional non-linear bandit using an asinh-linked GLM surrogate. It implements the NonUCT proposal rule based on "first-order difference + second-order mixed difference" to maintain a small candidate set \(\mathcal{C}(s)\) at each node. Theoretical analysis proves a local regret of \(\widetilde{O}(T^{3/4})\) (independent of \(d^n\)). Experimental results on MatGame, SMAC, and SMACv2 demonstrate superior sample efficiency and final performance over strong baselines like MAZero.
On Revisiting Entropy for Identifying Mislabeled Images: The authors observe that the signal "mislabeled samples maintain high predictive entropy throughout training" is insufficient to distinguish them from hard clean samples. This work introduces signed entropy by multiplying entropy with a sign bit indicating "whether the prediction aligns with the given label." By accumulating this over training epochs into the SEI statistic, the method achieves a new SOTA in mislabeled detection (up to 11%+ improvement) on medical datasets (ISIC/DeepDRiD/PANDA/CheXpert) and CIFAR-100N in a plug-and-play manner.
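
A signed-entropy accumulation of the kind described above can be sketched as follows. The summary does not give the exact definition of the SEI statistic, so the sign convention (+1 when the argmax prediction matches the given label, -1 otherwise), the plain sum over epochs, and all names are assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def signed_entropy(probs, given_label):
    """Entropy with an assumed sign: +1 if the argmax matches the given label, else -1."""
    predicted = max(range(len(probs)), key=probs.__getitem__)
    sign = 1.0 if predicted == given_label else -1.0
    return sign * entropy(probs)

def accumulated_signed_entropy(prob_history, given_label):
    """Sum the signed entropy over the predictions recorded at each epoch."""
    return sum(signed_entropy(p, given_label) for p in prob_history)

# Hypothetical per-epoch predictions for one image; class 0 is always the argmax.
history = [[0.5, 0.4, 0.1], [0.6, 0.3, 0.1], [0.55, 0.35, 0.10]]

agrees = accumulated_signed_entropy(history, given_label=0)     # label matches prediction
disagrees = accumulated_signed_entropy(history, given_label=1)  # label contradicts prediction
```

Both labels see exactly the same unsigned entropy, yet the signed totals have opposite signs, which is the separation between hard clean samples and mislabeled ones that plain entropy cannot provide.
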
On the Coordination of Value-Maximizing Bidders: This paper formally investigates the "coordination" problem of multiple value-maximizing auto-bidders in online advertising. It proposes a simple coordination mechanism where "only the alliance member with the highest value bids, while others bid 0." It proves that for a large class of auto-bidding algorithms, this mechanism simultaneously reduces the RoS violation for each member and drives the total alliance value to the asymptotic optimum among all coordination mechanisms.
On the Epistemic Uncertainty of Overparametrized Neural Networks: This paper points out that the "epistemic uncertainty" of overparameterized neural networks does not vanish as the data volume increases. Due to parameter unidentifiability (permutation + neuron splitting), even if the function is fully identified, the parameter space posterior still retains continuous uncertainty on the splitting manifold. Using single-hidden-layer ReLU networks as an example, the authors provide a precise posterior description (Dirichlet on simplex) and empirical validation.
Optimal Regularization for Performative Learning: This paper systematically characterizes the scaling laws of optimal regularization strength within a high-dimensional ridge regression framework under "performativity," where model deployment drives data distribution shifts. The optimal \(\lambda\) is found to be proportional to the performative strength \(\bar b\), and in overparameterized regimes, appropriate regularization can even leverage performative effects to reduce risk.
Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization: This work provides the first empirical discovery in standard classification tasks that an "optimal value for Feature Learning Strength (FLS)" exists—it is neither "the larger the better" nor "the smaller the better." Through finite-time gradient flow analysis of two-layer ReLU networks under logistic loss, the authors decompose the error into two quantifiable opposing terms: over-fitting caused by excessive FLS and "over-alignment" caused by insufficient FLS, rigorously characterizing the existence of an optimal FLS.
ParalESN: Enabling Parallel Information Processing in Reservoir Computing: ParalESN injects LRU-style complex diagonal linear recurrences into the "untrained reservoir" of Echo State Networks, allowing traditional RC to achieve temporal parallelization and scale to \(10^5\) dimensions while strictly maintaining the Echo State Property and universal approximation properties of fading memory filters.
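
Diagonal linear recurrences admit temporal parallelization because each step \(h_t = \lambda h_{t-1} + u_t\) is an affine map, and composing affine maps is associative. The sketch below checks this for a single complex channel with a Hillis-Steele style scan written as plain Python loops; the eigenvalue, inputs, and function names are assumptions and this is not the ParalESN implementation.

```python
import cmath

def combine(first, second):
    """Compose affine maps h -> a*h + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)

def sequential_states(lam, inputs):
    """Reference: step-by-step recurrence h_t = lam * h_{t-1} + u_t with h_0 = 0."""
    h, states = 0j, []
    for u in inputs:
        h = lam * h + u
        states.append(h)
    return states

def scan_states(lam, inputs):
    """Inclusive scan over affine maps; each pass could run in parallel over i."""
    elems = [(lam, u) for u in inputs]
    step = 1
    while step < len(elems):
        elems = [
            elems[i] if i < step else combine(elems[i - step], elems[i])
            for i in range(len(elems))
        ]
        step *= 2
    # With h_0 = 0, the state is the offset term of the composed map.
    return [b for _, b in elems]

lam = 0.9 * cmath.exp(0.3j)  # illustrative eigenvalue with modulus below 1
inputs = [1.0, -0.5, 0.25, 2.0, 0.0, -1.0, 0.75]
seq = sequential_states(lam, inputs)
par = scan_states(lam, inputs)
```

The scan needs only about \(\log_2 T\) passes over a length-\(T\) sequence, which is what makes the recurrence parallel in time on suitable hardware.
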
Polaris: Coupled Orbital Polar Embeddings for Hierarchical Concept Learning: Polaris decouples concept representations into two signals: "direction (semantics) + orbital potential (hierarchy)," both learned on a unit hypersphere. It utilizes tangent space projection and exponential mapping to ensure manifold closure, anisotropic spherical SVGD to prevent equatorial concentration, and vMF KL divergence to implement asymmetric "parent should have higher entropy than child" constraints. On taxonomy expansion tasks, it improves top-K recall by up to 19 points and reduces mean rank by 60%.
Position: Age Estimation Models Do Not Process Biometric Data: This is a position paper providing empirical evidence across 14 models and 3 face verification benchmarks to argue that face age estimation models possess identity discrimination capabilities two orders of magnitude lower than regulatory thresholds. Therefore, they should not be automatically classified as "processing of biometric data" under GDPR, BIPA, or the EU AI Act.
Possibilistic Predictive Uncertainty for Deep Learning: This paper replaces the Bayesian probabilistic framework with possibility theory to propose DAPPr. By projecting the possibilistic posterior in the parameter space onto the predictive space via supremum and fitting it with a learnable Dirichlet possibility function, the authors derive a method for modeling epistemic uncertainty that requires only 10 lines of code, directly replaces cross-entropy, and outperforms the EDL family in OOD detection.
Private and Stable Test-Time Adaptation with Differential Privacy: This paper is the first to point out that Test-Time Adaptation (TTA) leads to leakage of test data privacy. It systematically transforms five mainstream TTA methods (Tent, EATA, SAR, DeYO, and COME) into differentially private (DP) versions using per-sample gradient clipping and Gaussian noise. On ImageNet-C, it provides provable \((\epsilon, \delta)\)-DP guarantees and unexpectedly finds that "clipping itself" improves TTA accuracy by \(0.1\%\)–\(4.1\%\).
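
Per-sample gradient clipping followed by Gaussian noise, as named above, is the standard Gaussian-mechanism recipe. The sketch below shows that recipe on plain Python lists. It is not the paper's code: the function names, the noise scale `noise_multiplier * max_norm`, and the toy gradients are assumptions, and privacy accounting for \((\epsilon, \delta)\) is not shown.

```python
import math
import random

def clip(grad, max_norm):
    """Rescale one sample's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_average_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scaled to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]

# Hypothetical per-sample gradients for a batch of two test inputs.
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)
noiseless = dp_average_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
private = dp_average_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Setting `noise_multiplier=0.0` isolates the effect of clipping alone, which is the component the summary reports as improving accuracy.
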
Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations: The authors generalize the "post-projection alignment to isotropic Gaussian" in LeJEPA to "post-projection alignment to a Rectified Generalized Gaussian (RGG) distribution." By utilizing rectified and truncated generalized Gaussians, they achieve explicitly controllable expected \(\ell_0\) sparsity. On ImageNet-100, a ResNet encoder achieves a \(85.08\%\) linear probe accuracy while maintaining \(\ell_0\) sparsity at \(\sim 73\%\), significantly outperforming the fully dense representations of LeJEPA.
Return-to-Go is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning: Addressing the insufficient return-to-go (RTG) alignment in conditional sequence models (like Decision Transformer), this paper proposes the Q-align DT framework. By combining an RTG-to-behavior alignment loss (enforcing monotonic correspondence between RTG and Q-value shifts) with Q-function co-training under RTG perturbation, it creates a positive feedback loop that achieves SOTA performance on D4RL and significantly reduces alignment errors (68.9 vs 102.3 for QCS on HalfCheetah-medium).
Riemannian Networks over Full-Rank Correlation Matrices: This paper systematically generalizes three fundamental layers—MLR, FC, and Conv—to five Riemannian geometries (ECM, LECM, OLM, LSM, PHCM) on the full-rank correlation matrix manifold \(\mathrm{Cor}^+(n)\). It derives exact backpropagation for OLM and LSM. The constructed CorNet consistently outperforms SPDNet and Grassmann networks of similar size on Radar, HDM05, FPHA, and NTU120 datasets.
Sequential Group Composition: A Window into the Mechanics of Deep Learning: The authors use the unified task of "calculating the cumulative product of a sequence of group elements" as a microscope. Using Fourier analysis on groups and the AGF framework, they prove that two-layer networks learn irreducible representations (irreps) sequentially according to their Fourier energy. They further characterize the expressivity gap across sequence length \(k\), showing requirements of \(2^k\) width for two-layer networks, \(k\) steps for RNNs, and \(\log k\) layers for deep MLPs.
Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers: ViTs underperform in small-model and limited-data scenarios because permutation equivariance leaves them without spatial priors. To close this gap, the paper constructs a set of attenuation masks from Space Filling Curves (SFCs) such as Snake, Zig-zag, Peano, and Hilbert. By averaging these masks and multiplying them into the attention matrix, the method improves performance on spatially sensitive tasks in VTAB-1K by up to 8.7%, with less than 0.0015% additional parameters and approximately 0.64% more FLOPs.
Structure-Induced Information for Rerooting Levin Tree Search: Within the \(\sqrt{\mathrm{LTS}}\) framework, the authors propose three "rerooters" (global Leiden clustering, local heuristic cost-to-go, and an additive mixture of both) to automatically allocate search effort to implicit sub-tasks based on state-space structure and goal distance. This approach avoids expensive explicit sub-goal generation models like HIPS-\(\varepsilon\) / SGPS, achieving SOTA in online training sample efficiency and test-time expansion counts on complex domains such as BoulderDash and CraftWorld.
TabMGP: Martingale Posterior with TabPFN: This paper treats TabPFN, a pre-trained tabular Transformer, directly as a prediction rule for Martingale Posteriors (MGP). Through in-context forward rolling sampling, it obtains credible sets for parameters \(\theta\) under arbitrary loss functions. This approach avoids manual design of priors/likelihoods and hyperparameter tuning, outperforming manual MGP and classical Bayes in both coverage and credible set area across 30 real/synthetic scenarios.
TabSwift: An Efficient Tabular Foundation Model with Row-Wise Attention: The authors demonstrate that the minimalist backbone of TabPFN, which employs "row-wise attention only," is not outdated. By incorporating gated attention to stabilize training, a small set of learnable register tokens to aggregate global information, and a per-sample adaptive early exit head, the model achieves accuracy comparable to heavier column-aware models like TabPFN v2 and TabICL while significantly accelerating inference.
Target-Agnostic Calibration under Distribution Shift with Frequency-Aware Gradient Rectification: FGR employs DCT low-pass filtering to remove high-frequency spurious shortcuts from training images to achieve more accurate OOD calibration. It resolves the gradient conflict between "improving calibration" and "maintaining ID performance" through a geometric projection as a hard constraint, suppressing OOD ECE while preserving ID performance without hyperparameter tuning for loss weights.
TEMPORA: Characterising the Time-Contingent Utility of Online Test-Time Adaptation: TEMPORA reframes TTA evaluation from "offline accuracy with no latency upper bound" to "serviceable utility under latency constraints." Using three types of time constraints (discrete, continuous, and amortized) and decomposable utility metrics, the authors run over 750 experiments on ImageNet-C × ResNet-50 and show that offline SOTA methods lose their top ranking in 87.9% of latency-constrained scenarios, becoming increasingly impractical as the setting approaches real-world deployment conditions.
Test-Time Training with KV Binding Is Secretly Linear Attention: This paper employs four "memory paradox" counterexamples and a set of rigorous expansion theorems to prove that TTT with KV-binding inner loops (such as LaCT and ViTTT) remains a "learned linear attention operator" even when utilizing multi-layer MLPs and momentum. Based on this, the authors simplify and parallelize it into standard linear attention, achieving a \(4\times\) throughput increase with almost no performance degradation.
Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate: This paper proposes ReWA: by reparameterizing the target variable as \(\boldsymbol{x}=\boldsymbol{y}^{K}\), applying weight decay to \(\boldsymbol{y}\), and utilizing a coordinate-wise adaptive step size \(\eta_t \boldsymbol{y}^{M}/(\boldsymbol{y}^{K-1}+\epsilon)\), it equivalently transforms the non-optimizable \(\ell_p\;(0<p<1)\) sparse regularization into a trainable objective with bounded gradients and resistance to zero-saddle points. Sparsity improvements over \(\ell_1\) are validated using ResNet on CIFAR-10 / ImageNet.
Torus Graphs for Large-Scale Neural Phase Analysis: The authors introduce the Torus Graph (TG), an exponential family phase graph model defined on the \(d\)-torus \(\mathbb{T}^d\). By leveraging stochastic score matching, they reduce the per-step inference complexity from \(\mathcal{O}(d^6)\) to \(\mathcal{O}(d^2)\), enabling support for thousands of phase variables for the first time. They further develop TG-HMM and autoregressive TG (AR-TG) extensions, which reveal frequency-specific phase reorganization between wakefulness and NREM sleep in mouse LFP data.
Variable Clustering via Distributionally Robust Nodewise Regression: The study utilizes a Distributionally Robust Optimization (DRO) framework to transform the parameter tuning problem of nodewise regression into a convex optimization problem with spectral norm regularization—achieving a parameter-free clustering method that significantly outperforms Lasso sparse clustering on simulated, facial, and financial data.
Advantages of Non-Smooth Components in Vision Transformer Fine-Tuning: By defining a "plasticity" metric, this paper demonstrates that non-smooth components in ViTs (Attention and Feed-Forward layers) possess higher plasticity—providing larger gradient norms during fine-tuning to achieve better and more stable transfer learning performance.
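As a pointer for readers of the "Private and Stable Test-Time Adaptation with Differential Privacy" note above, the "per-sample gradient clipping and Gaussian noise" recipe it mentions can be sketched in a few lines of NumPy. This is a minimal illustration of the generic DP-SGD-style step, not the paper's implementation: the function name `dp_gradient`, the parameters `clip_norm` and `noise_multiplier`, and the toy gradients are all assumptions introduced here, and privacy accounting for \((\epsilon, \delta)\) is omitted.

```python
import numpy as np


def dp_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Return a noisy averaged gradient plus the clipped per-sample gradients.

    Hypothetical helper for illustration only:
    1. rescale each per-sample gradient so its L2 norm is at most clip_norm,
    2. sum the clipped gradients,
    3. add Gaussian noise with std = noise_multiplier * clip_norm,
    4. divide by the batch size.
    """
    grads = np.asarray(per_sample_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale factor is 1 for gradients already inside the clipping ball.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    noisy_mean = (clipped.sum(axis=0) + noise) / grads.shape[0]
    return noisy_mean, clipped


# Toy batch of two per-sample gradients (made-up values).
rng = np.random.default_rng(0)
toy_grads = [[3.0, 4.0], [0.3, 0.4]]
noisy_mean, clipped = dp_gradient(toy_grads, clip_norm=1.0,
                                  noise_multiplier=1.0, rng=rng)
```

In the toy batch, the first gradient has norm 5 and is scaled down to norm 1, while the second has norm 0.5 and passes through unchanged. Clipping bounds the influence of any single test sample, which is what lets the added noise translate into a DP guarantee.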