On the Information Processing of One-Dimensional Wasserstein Distances with Finite Samples¶
Conference: AAAI 2026 arXiv: 2511.12881 Code: GitHub Area: Statistical Learning Theory / Optimal Transport Keywords: Wasserstein Distance, Finite Samples, Poisson Process, Rate Coding, Support Difference
TL;DR¶
This paper analytically characterizes, via a Poisson process framework, the ability of the one-dimensional Wasserstein distance under finite samples to simultaneously encode pointwise density differences (rate difference) and support differences between probability density functions, and validates its practical utility on neural spike data and amino acid contact frequency data.
Background & Motivation¶
The Wasserstein distance (also known as the Earth Mover's Distance) measures the discrepancy between two probability distributions by computing the transport cost between samples. Owing to its reliance on transport distances in the data space, the Wasserstein distance has a natural advantage in capturing support differences between distributions — one of the key reasons it has been widely adopted in generative models such as WGAN.
However, when two distributions have highly overlapping supports but exhibit significant pointwise density differences, whether and how the Wasserstein distance accurately identifies such density differences — particularly under finite samples — remains analytically unclear. This question is especially critical in the following contexts:
The coding debate in neuroscience: Whether the brain transmits information via rate coding or temporal coding is a classical controversy in neuroscience. Can the Wasserstein distance simultaneously capture both information modalities?
Comparison with KL divergence: KL divergence appropriately quantifies density differences (rate differences) but is overly sensitive to support differences — it diverges to infinity whenever one distribution places mass where the other has none (see the sketch after this list). The Wasserstein distance is believed to avoid such over-sensitivity, but can it reliably handle density differences?
Practical need for finite samples: In practice, only finitely many samples are ever available, making asymptotic analyses insufficient.
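To make the contrast concrete, here is a tiny SciPy sketch (our own illustration, not from the paper) comparing the two quantities on a pair of discrete distributions with disjoint supports:

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

support = np.arange(5)                    # shared grid of locations 0..4
p = np.array([0.5, 0.5, 0.0, 0.0, 0.0])   # mass on {0, 1}
q = np.array([0.0, 0.0, 0.0, 0.5, 0.5])   # mass on {3, 4}

print(entropy(p, q))                                 # inf: KL diverges on disjoint supports
print(wasserstein_distance(support, support, p, q))  # 3.0: W1 reports how far the mass must move
```

KL blows up as soon as one distribution places mass where the other has none, while the Wasserstein distance simply reports how far that mass must be transported.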
This paper aims to fill this theoretical gap: proving that the one-dimensional Wasserstein distance under finite samples can simultaneously encode rate differences and support differences, and clarifying how the two are jointly expressed.
Method¶
Overall Architecture¶
The authors adopt the Poisson process as a tractable analytical framework. A Poisson process is fully characterized by a rate parameter \(\lambda > 0\), which directly controls event frequency. By analyzing the empirical Wasserstein distance computed on finite sample sequences generated by Poisson processes, the contributions of rate differences and support differences can be cleanly separated.
The one-dimensional 1-Wasserstein distance between empirical distributions \(\hat{\mu}_N\) and \(\hat{\nu}_N\) (each with \(N\) samples) admits a closed-form expression:
\[
W_1(\hat{\mu}_N, \hat{\nu}_N) = \frac{1}{N}\sum_{k=1}^{N} |x_k - y_k|,
\]
where \(x_k\) and \(y_k\) denote the \(k\)-th order statistics of the two sequences, respectively.
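As a concrete reading of this closed form, a minimal NumPy sketch (our own helper, not the paper's code) sorts both samples and averages the gaps between matched order statistics:

```python
import numpy as np

def empirical_w1(x, y):
    """1-D 1-Wasserstein distance between two equal-size empirical distributions.

    In one dimension the optimal transport plan matches order statistics, so W1
    reduces to the mean absolute difference between the sorted samples.
    """
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape, "this closed form assumes equal sample sizes"
    return np.mean(np.abs(x - y))

# Example: event times of two Poisson processes with rates 2 and 4.
rng = np.random.default_rng(0)
print(empirical_w1(np.cumsum(rng.exponential(1 / 2, 100)),
                   np.cumsum(rng.exponential(1 / 4, 100))))
```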
Key Designs¶
1. Rate Difference Encoding¶
Consider two Poisson processes with rates \(\lambda_1\) and \(\lambda_2\) (\(\lambda_1 < \lambda_2\)), each generating \(N\) event times. The \(k\)-th event time of the first process follows an Erlang (Gamma) distribution with shape \(k\) and rate \(\lambda_1\), i.e., \(x_k \sim p(x_k; k, \lambda_1)\), and analogously \(y_k \sim p(y_k; k, \lambda_2)\) for the second process.
Core Theorem (Proposition 3.1): The expected distance between the \(k\)-th pair of samples, \(\mathbb{E}[|x_k-y_k|]\), admits a closed-form expression as a sum weighted by binomial probabilities, where \(p = \lambda_1/(\lambda_1+\lambda_2)\) and \(P(i \mid 2k, p)\) denotes the binomial probability mass function with parameters \(2k\) and \(p\).
Key properties of this expression:
- It is expressed entirely in terms of the rates \(\lambda_1\) and \(\lambda_2\) and exhibits a natural symmetry between them.
- Under the constraint that the harmonic mean \(\frac{2\lambda_1\lambda_2}{\lambda_1+\lambda_2}=C\) is held fixed, \(\mathbb{E}[|x_k-y_k|]\) is minimized when \(\lambda_1=\lambda_2\).
- Consequently, when the two rates are equal (i.e., there is no density difference), the expected Wasserstein distance is minimized.
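This minimization property can be checked numerically. The sketch below (our own illustration) fixes the harmonic mean at \(C=2\), solves for \(\lambda_2\) given \(\lambda_1\), and estimates \(\mathbb{E}[|x_k-y_k|]\) by drawing the \(k\)-th event times directly as Erlang (Gamma) variates:

```python
import numpy as np

rng = np.random.default_rng(0)
C, k, trials = 2.0, 10, 200_000   # fixed harmonic mean, index of the event pair, MC samples

def expected_gap(lam1):
    """Estimate E|x_k - y_k| subject to 2*lam1*lam2 / (lam1 + lam2) = C."""
    lam2 = C * lam1 / (2 * lam1 - C)                        # solve the constraint (needs lam1 > C/2)
    xk = rng.gamma(shape=k, scale=1 / lam1, size=trials)    # k-th event time ~ Erlang(k, lam1)
    yk = rng.gamma(shape=k, scale=1 / lam2, size=trials)    # k-th event time ~ Erlang(k, lam2)
    return np.mean(np.abs(xk - yk))

for lam1 in [1.2, 1.6, 2.0, 2.5, 3.0]:
    print(lam1, round(expected_gap(lam1), 3))   # the minimum should sit at lam1 = lam2 = C = 2.0
```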
Asymptotic Behavior (Proposition 3.2): As \(k \to \infty\), the normalized distance \(s_k = |x_k-y_k|/k\) concentrates around the difference of inverse rates, \(\frac{1}{\lambda_1}-\frac{1}{\lambda_2}\) (recall \(\lambda_1 < \lambda_2\)).
Consequently \(|x_k-y_k| \approx k\left(\frac{1}{\lambda_1}-\frac{1}{\lambda_2}\right)\), and summing over \(k = 1, \dots, N\) shows that for large samples the expected Wasserstein distance approximates \(\frac{N+1}{2}\left(\frac{1}{\lambda_1}-\frac{1}{\lambda_2}\right)\), directly reflecting the difference in inverse rates.
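The large-sample behavior can be probed the same way; this short sketch (ours, with illustrative rates \(\lambda_1=1\), \(\lambda_2=2\)) compares the simulated empirical \(W_1\) against \(\frac{N+1}{2}\left(\frac{1}{\lambda_1}-\frac{1}{\lambda_2}\right)\):

```python
import numpy as np

rng = np.random.default_rng(1)
lam1, lam2, N, trials = 1.0, 2.0, 500, 200

def event_times(lam, n):
    """First n event times of a homogeneous Poisson process: cumulative Exp(lam) gaps."""
    return np.cumsum(rng.exponential(scale=1 / lam, size=n))

w1 = np.mean([np.mean(np.abs(event_times(lam1, N) - event_times(lam2, N)))
              for _ in range(trials)])
print(w1, (N + 1) / 2 * (1 / lam1 - 1 / lam2))   # the two values should be close for large N
```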
2. Support Difference Encoding¶
By shifting the event times of one Poisson process by \(\Delta t \geq 0\), the analysis of \(\mathbb{E}[|x_1+\Delta t - y_1|]\) yields a closed-form expression with the following behavior (checked numerically in the sketch after this list):
- When \(\Delta t = 0\), this reduces to pure rate encoding
- When \(\Delta t \to \infty\), it simplifies to \(\Delta t + \frac{1}{\lambda_1} - \frac{1}{\lambda_2}\), where the shift term dominates
- The factor \(e^{-\lambda_2\Delta t}\) naturally balances the contributions of rate and support information
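For reference, a direct calculation for exponential first event times (our own derivation, consistent with the three limiting behaviors above) gives \(\mathbb{E}[|x_1+\Delta t - y_1|] = \Delta t + \frac{1}{\lambda_1} - \frac{1}{\lambda_2} + \frac{2\lambda_1}{\lambda_2(\lambda_1+\lambda_2)}\,e^{-\lambda_2\Delta t}\); the sketch below checks it by simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
lam1, lam2, trials = 1.0, 2.0, 500_000

for dt in [0.0, 0.5, 1.0, 2.0, 5.0]:
    x1 = rng.exponential(1 / lam1, trials)       # first event time of the shifted process
    y1 = rng.exponential(1 / lam2, trials)       # first event time of the reference process
    mc = np.mean(np.abs(x1 + dt - y1))           # Monte Carlo estimate of E|x1 + dt - y1|
    closed = dt + 1 / lam1 - 1 / lam2 + 2 * lam1 * np.exp(-lam2 * dt) / (lam2 * (lam1 + lam2))
    print(f"dt={dt:.1f}  MC={mc:.3f}  closed-form={closed:.3f}  shift-dominated limit={dt + 1/lam1 - 1/lam2:.3f}")
```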
3. Extension to Time-Varying Rates¶
For inhomogeneous Poisson processes (time-varying rate \(\mu(t)\)), via the substitution \(x_k \mapsto u = m(x_k)\) where \(m(x)=\int_0^x \mu(t)dt\), the expected distance can be transformed into a double integral involving the inverse cumulative rate function. When the double Laplace transform of \(|m^{-1}(u)-n^{-1}(v)|u^{k-1}v^{l-1}\) is analytically tractable, the theoretical framework extends to more general settings, including the Sliced Wasserstein distance.
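As a minimal illustration of this substitution (our own sketch with a hypothetical rate \(\mu(t) = 1 + \sin^2 t\), whose cumulative rate \(m(x) = \tfrac{3}{2}x - \tfrac{1}{4}\sin 2x\) is available in closed form), event times of the inhomogeneous process can be generated by drawing unit-rate event times and mapping them back through \(m^{-1}\); in the rescaled coordinate the process is homogeneous, so the analysis above applies there:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)

def m(x):
    """Cumulative rate m(x) = integral of (1 + sin^2 t) on [0, x] = 1.5*x - sin(2x)/4."""
    return 1.5 * x - np.sin(2 * x) / 4

def m_inv(u):
    """Invert the cumulative rate; since mu(t) >= 1, the root lies below u, so [0, u+1] brackets it."""
    return brentq(lambda x: m(x) - u, 0.0, u + 1.0)

# Unit-rate event times u_k, mapped through m^{-1} to event times of the
# inhomogeneous process; in the u-coordinate the process is a homogeneous
# unit-rate Poisson process, so the homogeneous theory applies there.
u = np.cumsum(rng.exponential(1.0, size=20))
event_times = np.array([m_inv(u_k) for u_k in u])
print(event_times)
```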
Loss & Training¶
This paper is a theoretical analysis work and involves no model training. Classification models used in experiments (e.g., FCN, ResNet, InceptionTime, XceptionTime) are trained with SGD minimizing cross-entropy loss, but these are not the core contributions of this paper.
Key Experimental Results¶
Main Results: Rate/Support Difference Prediction on Synthetic Data¶
| Feature Type | \(R^2\) (\(\log r_1\)) | \(R^2\) (\(\log r_2\)) | \(R^2\) (\(|\Delta t|\)) |
|---|---|---|---|
| Directed Hausdorff | 43.7±0.5 | 43.9±0.3 | 70.4±0.3 |
| Bin-Wise JS Divergence | 64.0±0.4 | 68.4±0.3 | 70.3±0.1 |
| Sample Transport Cost | 81.5±0.1 | 81.9±0.2 | 98.9±0.0 |
Sample transport cost outperforms both Hausdorff distance and JS divergence features across all rate and support difference prediction tasks.
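For context, the three feature families in the table could be computed along the following lines (a rough sketch under our own assumptions about binning and equal sample sizes; the paper's exact feature construction may differ):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff, jensenshannon

def features(x, y, bins=50, t_max=20.0):
    """Three discrepancy features between two event-time sequences x and y."""
    # 1) Directed Hausdorff distance between the point sets (expects 2-D arrays).
    dh = directed_hausdorff(x[:, None], y[:, None])[0]

    # 2) Bin-wise Jensen-Shannon divergence between normalized histograms
    #    (scipy returns the JS distance, i.e. the square root of the divergence).
    edges = np.linspace(0.0, t_max, bins + 1)
    px, _ = np.histogram(x, bins=edges)
    py, _ = np.histogram(y, bins=edges)
    js = jensenshannon(px, py) ** 2

    # 3) Sample transport cost: mean absolute gap between order statistics,
    #    i.e. the empirical 1-D Wasserstein distance for equal sample sizes.
    st = np.mean(np.abs(np.sort(x) - np.sort(y)))
    return dh, js, st

rng = np.random.default_rng(4)
x = np.cumsum(rng.exponential(1 / 2.0, 40))   # rate-2 process
y = np.cumsum(rng.exponential(1 / 4.0, 40))   # rate-4 process
print(features(x, y))
```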
Retinal Ganglion Cell Stimulus Classification¶
| Method | Retina-All | Retina14 | Retina23 |
|---|---|---|---|
| FCN (ISI only) | 0.945 | 0.962 | 0.925 |
| FCN + SD1 | 0.951 | 0.971 | 0.931 |
| FCN + SD2 | 0.945 | 0.968 | 0.935 |
| XceptionTime (ISI only) | 0.944 | 0.970 | 0.930 |
| XceptionTime + SD1 | 0.947 | 0.979 | 0.932 |
Augmenting ISI features with sample transport distance features (SD1/SD2) yields consistent improvements in stimulus type classification AUC.
Ablation Study: Amino Acid Embeddings¶
| Distance Metric | Kendall's τ (Top-10 Hydrophobicity) | Kendall's τ (All Residues) |
|---|---|---|
| KL Divergence Embedding | 0.582 | 0.731 |
| Wasserstein Embedding | 0.722 | 0.807 |
The radial ordering of Wasserstein distance embeddings exhibits substantially higher correlation with known hydrophobicity rankings compared to KL divergence embeddings.
Key Findings¶
- The Wasserstein distance reliably encodes rate differences under finite samples: The expected Wasserstein distance is minimized when rates are equal, establishing the identifiability of density differences.
- Natural coordination between rate and support information: A smooth transition is achieved via the exponential decay factor \(e^{-\lambda_2\Delta t}\).
- Complementarity: The Wasserstein distance provides an information perspective complementary to both KL divergence and Hausdorff distance.
- Practical utility: Cross-domain applicability is demonstrated in neuroscience (spike train decoding) and molecular biology (amino acid contact frequencies).
Highlights & Insights¶
- Theoretical elegance: The information processing capacity of the Wasserstein distance is decomposed into two orthogonal dimensions — rate encoding and support encoding — with closed-form analytical expressions derived via the Poisson process framework.
- Interdisciplinary bridge: The paper connects optimal transport theory to the classical neuroscience debate of rate coding vs. temporal coding; the Wasserstein distance naturally unifies both information modalities.
- Cleverly designed Isomap embedding experiments: Embedding visualizations on human neural spike data intuitively demonstrate how the Wasserstein distance simultaneously encodes temporal shifts and rate variations within a single representational space.
- Biological significance of the amino acid experiment: CYS (the most hydrophobic amino acid) is positioned at the outermost periphery in the Wasserstein embedding but appears more centrally in the KL embedding, indicating that the Wasserstein distance more faithfully captures rate differences induced by long-range contacts.
Limitations & Future Work¶
- Restricted to one dimension: Although an extension to the Sliced Wasserstein distance is discussed, the core theoretical derivations are strictly limited to the one-dimensional case.
- Strong Poisson assumption: Real-world data (e.g., neural spikes) do not fully conform to Poisson assumptions; while experiments suggest the conclusions hold in non-Poisson settings, no theoretical guarantee is provided.
- Equal sample size assumption: The theoretical derivations assume both sequences contain the same number of samples \(N\); the extension to unequal sample sizes is only briefly mentioned.
- High-dimensional generalization: Extending these precise information-processing properties to high-dimensional Wasserstein distances remains an open problem.
- Computational complexity: The scalability of finite-sample Wasserstein distance computation to large-scale datasets is not discussed.
Related Work & Insights¶
- The theoretical contributions of this paper can provide deeper theoretical grounding for Wasserstein distance applications in generative models, particularly in understanding why methods such as WGAN effectively capture distributional differences.
- The discussion of Sliced Wasserstein distance suggests that the information-processing analysis developed here could be extended to high-dimensional settings, offering potential guidance for recent methods that rely heavily on sliced Wasserstein distances (e.g., image generation, distribution matching).
- In neuroscience, this paper introduces a new theoretical tool for spike train analysis that may inspire novel neural signal decoding approaches.
Rating¶
- Theoretical Depth: ★★★★★ — Analytical derivations are rigorous and complete; the connection from Poisson processes to Wasserstein distances is elegant.
- Experimental Thoroughness: ★★★★☆ — Synthetic experiments plus two real-world application domains, though gains on downstream tasks are modest.
- Novelty: ★★★★☆ — Fills a theoretical gap regarding the information processing capacity of finite-sample Wasserstein distances.
- Value: ★★★☆☆ — Theoretical insights are substantive, but direct application scenarios are relatively limited.
- Overall: 7.5/10