Score Matching with Missing Data
Conference: ICML 2025 (Spotlight)
arXiv: 2506.00557
Code: None
Area: Statistical Machine Learning / Generative Models
Keywords: score matching, missing data, importance weighting, variational inference, graphical models
TL;DR
This paper adapts score matching and its major extensions to missing-data settings. It proposes two variants, an Importance Weighting (IW) method and a variational method, and demonstrates their respective advantages in scenarios such as graphical model estimation.
Background & Motivation
Background: Score matching is a core tool for learning data distributions, widely applied in diffusion models, energy-based models (EBMs), and graphical model estimation. It avoids computing normalizing constants by matching the model's score function, i.e. the gradient of the log density \(\nabla_x \log p(x)\), to that of the data.
Limitations of Prior Work: Existing score matching methods assume fully observed data (all coordinates are observed), whereas missing values are extremely common in real-world data (survey data, sensor failures, privacy masking, etc.). When data is partially missing on arbitrary subsets of coordinates, standard score matching cannot be directly applied.
Key Challenge: The core advantage of score matching (not requiring a normalizing constant) no longer automatically holds under missing data, because the score of the marginal distribution differs from the score of the joint distribution.
Goal: To design score matching methods that are applicable to data with arbitrary patterns of missingness.
Key Insight: Deal with missing coordinates through importance weighting or variational approximation.
Core Idea: Connect the score on the observed coordinate subset with the score of the complete data, eliminating the influence of the missing variables through integration or approximation.
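The gap between the marginal and joint scores can be made explicit with a standard identity. It is stated here as background rather than quoted from the paper, and it assumes enough regularity to differentiate under the integral:

\[
\nabla_{\mathbf{x}^{\text{obs}}} \log p(\mathbf{x}^{\text{obs}})
= \frac{\int \nabla_{\mathbf{x}^{\text{obs}}} p(\mathbf{x}^{\text{obs}}, \mathbf{x}^{\text{mis}}) \, d\mathbf{x}^{\text{mis}}}{\int p(\mathbf{x}^{\text{obs}}, \mathbf{x}^{\text{mis}}) \, d\mathbf{x}^{\text{mis}}}
= \mathbb{E}_{p(\mathbf{x}^{\text{mis}} | \mathbf{x}^{\text{obs}})} \left[ \nabla_{\mathbf{x}^{\text{obs}}} \log p(\mathbf{x}^{\text{obs}}, \mathbf{x}^{\text{mis}}) \right]
\]

The joint score inside the expectation does not depend on the normalizing constant. The conditional \(p(\mathbf{x}^{\text{mis}} | \mathbf{x}^{\text{obs}})\) does require integrating over the missing coordinates, and that is the quantity importance weighting or a variational distribution has to approximate.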
Method
Overall Architecture
Input: Partially observed data \(\{(\mathbf{x}_i^{\text{obs}}, \mathbf{m}_i)\}_{i=1}^n\), where \(\mathbf{m}_i\) is the missingness mask
Output: Score function estimate \(s_\theta(\mathbf{x}) \approx \nabla_\mathbf{x} \log p(\mathbf{x})\) of the data-generating distribution
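As a minimal sketch of the input format, the snippet below builds partially observed data together with its missingness masks. The mask convention (True = observed), the NaN encoding of missing entries, and all variable names are assumptions made for illustration; the source does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complete data: n samples, d coordinates.
n, d = 6, 3
x_full = rng.normal(size=(n, d))

# Missingness mask m_i. Assumed convention: True = observed, False = missing.
mask = rng.random(size=(n, d)) > 0.3

# Partially observed data: missing coordinates are stored as NaN.
x_obs = np.where(mask, x_full, np.nan)

# Group sample indices by missingness pattern, since objectives that sum
# over patterns need the rows sharing each pattern.
patterns = {}
for i, m in enumerate(mask):
    patterns.setdefault(tuple(m.tolist()), []).append(i)

for pattern, rows in patterns.items():
    print(pattern, rows)
```

Grouping by pattern is a convenience for pattern-wise objectives; a per-sample loop would work equally well.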
Key Designs
- Importance Weighting (IW) Method:
  - Function: Translates the score matching objective with missing data into a computable form via importance weighting.
  - Mechanism: For a missingness pattern \(\mathbf{m}\), the marginal score matching objective is:
    \(\mathcal{J}_{\text{IW}} = \sum_{\mathbf{m}} \mathbb{E}_{p(\mathbf{x}^{\text{obs}} | \mathbf{m})} \left[ \sum_{j \in \text{obs}} \left( \partial_j s_\theta^j + \frac{1}{2} (s_\theta^j)^2 \right) \cdot w(\mathbf{m}) \right]\)
    where \(w(\mathbf{m})\) is the importance weight of the missingness pattern, estimated through the missingness mechanism (e.g., the MAR assumption).
  - Design Motivation: The IW method preserves the closed-form/semi-closed-form advantages of score matching and is particularly efficient in low-dimensional and finite-domain settings. It provides finite-sample bounds in the finite-domain setting.
- Variational Method:
  - Function: Introduces a variational distribution to approximate the conditional distribution of the missing variables.
  - Mechanism: Approximates the true conditional distribution with a variational distribution \(q_\phi(\mathbf{x}^{\text{mis}} | \mathbf{x}^{\text{obs}})\):
    \(\mathcal{J}_{\text{var}} = \mathbb{E}_{q_\phi(\mathbf{x}^{\text{mis}} | \mathbf{x}^{\text{obs}})} \left[ \sum_j \left( \partial_j s_\theta^j(\mathbf{x}) + \frac{1}{2} (s_\theta^j(\mathbf{x}))^2 \right) \right]\)
    alternately optimizing \(\theta\) (the score model) and \(\phi\) (the variational distribution).
  - Design Motivation: In high-dimensional complex scenarios, the variance of IW can be excessively large; the variational method provides more stable gradient estimates through parametric approximations.
- Extension to Major Score Matching Variants:
  - Function: Extends the aforementioned methods to denoising score matching, sliced score matching, etc.
  - Mechanism: Embeds the treatment of missing data (IW or variational) into the objective functions of each variant, maintaining their respective original strengths.
  - Design Motivation: Ensures the generality of the method, allowing integration with various approaches in the score matching ecosystem.
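The pattern-wise objective \(\mathcal{J}_{\text{IW}}\) above can be sketched for a toy model. This is an illustration under stated assumptions, not the paper's implementation: the model is a zero-mean Gaussian with covariance `sigma` (so each marginal score is available in closed form), the mask convention is True = observed, and the weights \(w(\mathbf{m})\) default to 1 as a placeholder because the source does not say how they are computed. The function name and arguments are hypothetical.

```python
import numpy as np

def pattern_weighted_sm_objective(x, mask, sigma, weights=None):
    """Sum over missingness patterns of the weighted marginal score matching
    loss, for a zero-mean Gaussian model with covariance `sigma`."""
    total = 0.0
    for pattern in np.unique(mask.astype(int), axis=0):
        obs = np.flatnonzero(pattern)
        if obs.size == 0:
            continue  # nothing observed, so there is no score to match
        rows = (mask == pattern).all(axis=1)
        x_o = x[rows][:, obs]
        # The marginal of a Gaussian is Gaussian, so s(x_o) = -P x_o
        # with P the inverse of the observed block of sigma.
        prec = np.linalg.inv(sigma[np.ix_(obs, obs)])
        s = -x_o @ prec          # prec is symmetric
        div = -np.trace(prec)    # sum_j of d s^j / d x_j
        loss = np.mean(div + 0.5 * np.sum(s**2, axis=1))
        # Placeholder weight (assumption): 1 unless supplied per pattern.
        w = 1.0 if weights is None else weights[tuple(pattern.tolist())]
        total += w * loss
    return total

rng = np.random.default_rng(0)
sigma_true = np.array([[1.0, 0.6, 0.0],
                       [0.6, 1.0, 0.3],
                       [0.0, 0.3, 1.0]])
x_full = rng.multivariate_normal(np.zeros(3), sigma_true, size=4000)
mask = rng.random(x_full.shape) > 0.3          # entries missing completely at random
x = np.where(mask, x_full, np.nan)

j_true = pattern_weighted_sm_objective(x, mask, sigma_true)
j_wrong = pattern_weighted_sm_objective(x, mask, 2.0 * sigma_true)
print(j_true, j_wrong)
```

With data missing completely at random, the objective should be lower at the data-generating covariance than at a misspecified one, which is a basic sanity check for any implementation of this kind.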
Loss & Training
- IW Loss: Weighted Fisher divergence objective
- Variational Loss: A score matching objective in the form of a variational lower bound
- Finite-sample bound (IW method): Provides convergence rates under the setting of a finite domain \(|\mathcal{X}| < \infty\)
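To make the variational loss more concrete, the sketch below evaluates a Monte Carlo estimate of \(\mathcal{J}_{\text{var}}\) by filling in missing coordinates with draws from a variational distribution. Everything specific here is an assumption for illustration: the score model is Gaussian, \(s_\theta(\mathbf{x}) = -\Theta \mathbf{x}\), the variational distribution is a fixed Gaussian conditional derived from a covariance `sigma_q`, and the helper names are hypothetical. It only evaluates the objective; the alternating updates of \(\theta\) and \(\phi\) are not shown.

```python
import numpy as np

def sample_q(x_row, m_row, sigma_q, rng):
    """Fill missing coordinates with a draw from the zero-mean Gaussian
    conditional q(x_mis | x_obs) implied by covariance `sigma_q`."""
    obs, mis = np.flatnonzero(m_row), np.flatnonzero(~m_row)
    x = x_row.copy()
    if mis.size == 0:
        return x
    if obs.size == 0:
        mean, cov = np.zeros(mis.size), sigma_q[np.ix_(mis, mis)]
    else:
        gain = sigma_q[np.ix_(mis, obs)] @ np.linalg.inv(sigma_q[np.ix_(obs, obs)])
        mean = gain @ x_row[obs]
        cov = sigma_q[np.ix_(mis, mis)] - gain @ sigma_q[np.ix_(obs, mis)]
    x[mis] = rng.multivariate_normal(mean, cov)
    return x

def variational_sm_objective(x, mask, theta, sigma_q, rng, n_draws=3):
    """Monte Carlo estimate of J_var for the Gaussian score s(x) = -theta x."""
    losses = []
    for x_row, m_row in zip(x, mask):
        for _ in range(n_draws):
            x_c = sample_q(x_row, m_row, sigma_q, rng)
            s = -theta @ x_c
            # Divergence of s is -trace(theta); add half the squared norm of s.
            losses.append(-np.trace(theta) + 0.5 * s @ s)
    return float(np.mean(losses))

rng = np.random.default_rng(1)
sigma_true = np.array([[1.0, 0.6, 0.0],
                       [0.6, 1.0, 0.3],
                       [0.0, 0.3, 1.0]])
x_full = rng.multivariate_normal(np.zeros(3), sigma_true, size=1000)
mask = rng.random(x_full.shape) > 0.3          # True = observed (assumed convention)
x = np.where(mask, x_full, np.nan)

theta_true = np.linalg.inv(sigma_true)
j_good = variational_sm_objective(x, mask, theta_true, sigma_true, rng)
j_bad = variational_sm_objective(x, mask, 0.5 * theta_true, sigma_true, rng)
print(j_good, j_bad)
```

Here `sigma_q` is set to the true covariance, so the imputed samples follow the data distribution and the objective favors the true precision matrix. In practice the variational distribution would be learned, and a poor choice of it would bias the estimate.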
Key Experimental Results
Main Results
| Task/Data | Metric | IW Method | Variational Method | Complete Case Analysis | Mean Imputation + SM |
|---|---|---|---|---|---|
| Synthetic Gaussian (d=5, 20% missing) | Score MSE | 0.045 | 0.052 | 0.089 | 0.078 |
| Synthetic Gaussian (d=20, 40% missing) | Score MSE | 0.231 | 0.098 | 0.456 | 0.312 |
| Graphical Model Estimation (Synthetic) | Structure Learning F1 | 0.72 | 0.85 | 0.58 | 0.63 |
| Graphical Model Estimation (Real Gene Data) | Precision | 0.65 | 0.78 | 0.51 | 0.59 |
Ablation Study
| Configuration | Score MSE (d=10) | Description |
|---|---|---|
| IW, 10% missing | 0.028 | IW is optimal at low missing rates |
| IW, 40% missing | 0.185 | Variance increases at high missing rates |
| Variational, 10% missing | 0.035 | Slightly worse than IW at low missing rates |
| Variational, 40% missing | 0.097 | Variational is more stable at high missing rates |
| MCAR vs MAR | Not significant | Both methods are robust to the missingness mechanism |
Key Findings
- The IW method performs best in low-dimensional, small-sample, and low-missing-rate scenarios, and it comes with theoretical guarantees.
- The variational method is stronger in high-dimensional and high-missing-rate scenarios, showing a significant advantage particularly in graphical model estimation.
- The two methods are complementary, covering the requirements of different practical scenarios.
- Simple imputation (mean imputation followed by standard score matching) performs significantly worse than both proposed methods.
Highlights & Insights
- First to systematically address the missing data problem in score matching, filling an important gap.
- The complementary design of the two methods is highly practical: users can choose the one that fits their scenario.
- Prominent performance on the crucial application of graphical model estimation.
- Accepted as a Spotlight paper, which indicates that reviewers recognized its quality.
Limitations & Future Work
- The variance issue of the IW method in high dimensions or with high missing rates needs to be further mitigated.
- The choice and optimization of the variational distribution may require careful hyperparameter tuning.
- Combining with diffusion models (training diffusion models with missing data) is a valuable future direction.
Related Work & Insights
- Directly connects with the classical score matching framework of Hyvärinen (2005).
- The EM algorithm concept for missing data is reflected in the variational method.
- Holds direct application value for fields where missing data frequently occurs, such as medical data and survey data.
Rating
- Novelty: ⭐⭐⭐⭐ First to solve score matching + missing data
- Experimental Thoroughness: ⭐⭐⭐⭐ Validated on both synthetic and real data, with clear comparison between complementary methods
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theory with clear practical guidance
- Value: ⭐⭐⭐⭐⭐ Addresses a high-frequency pain point, selected as a Spotlight paper