Rethinking Consistent Multi-Label Classification Under Inexact Supervision¶
Conference: ICLR 2026 · arXiv: 2510.04091 · Area: Optimization · Keywords: multi-label classification, weakly supervised learning, partial multi-label learning, complementary multi-label learning, risk consistency
TL;DR¶
The paper proposes the COMES framework, which provides consistent risk estimators for multi-label classification under inexact supervision via first-order (Hamming loss) and second-order (Ranking loss) strategies, without requiring estimation of the label generation process or uniform distribution assumptions.
Background & Motivation¶
Multi-label classification (MLC) requires associating each instance with multiple relevant labels, making annotation far more costly than single-label tasks. To reduce annotation burden, two weakly supervised paradigms have been proposed:
- Partial Multi-Label Learning (PML): Each instance is annotated with a candidate label set containing all truly relevant labels and some irrelevant "false positive" labels.
- Complementary Multi-Label Learning (CML): Each instance is annotated with complementary labels indicating the classes to which it does not belong.
A core observation is that PML and CML are mathematically equivalent — the complement of a candidate label set is precisely the complementary label set.
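Concretely, in Python (a toy five-class example, not from the paper):

```python
# With q = 5 classes, a PML candidate set and a CML complementary label set
# are two views of the same annotation: each is the other's complement.
all_labels = set(range(5))                        # label space {0, ..., 4}
candidate_set = {0, 2, 3}                         # PML view (may contain false positives)
complementary_set = all_labels - candidate_set    # CML view: {1, 4}
assert candidate_set == all_labels - complementary_set
```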
Limitations of Prior Work:
- Existing methods require accurate estimation of the candidate/complementary label generation process (i.e., the transition matrix), but the overconfidence of deep neural networks makes such estimation unreliable.
- Some methods assume a uniform distribution to bypass estimation, but this assumption is overly simplistic and cannot handle real-world class imbalance.
- Many approaches model different labels independently, ignoring semantic correlations among labels.
Method¶
Overall Architecture: COMES¶
COMES (COnsistent Multi-label classification under inExact Supervision) introduces a new data generation process assumption and two risk estimation strategies.
Data Generation Process¶
It is assumed that candidate labels are generated via per-class queries: if class \(j\) is irrelevant to instance \(\boldsymbol{x}\), it is annotated as a non-candidate label with a constant probability \(p_j\):
\(P(j \notin S \mid \boldsymbol{x},\, y_j = 0) = p_j,\)
where \(S\) denotes the candidate label set and \(y_j = 0\) indicates that class \(j\) is irrelevant to \(\boldsymbol{x}\).
Key Lemma (Lemma 1): Under this assumption, the conditional density of instances with non-candidate label \(j\) is equivalent to the conditional density of instances irrelevant to class \(j\):
\(p(\boldsymbol{x} \mid j \notin S) = p(\boldsymbol{x} \mid y_j = 0).\)
This assumption is more general than the uniform distribution assumption, as the conditional probabilities for different candidate label sets need not be identical.
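A minimal simulation of this generation process (a sketch: the binary label encoding, function name, and example values are illustrative assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_candidate_set(y, p):
    """Simulate per-class query annotation for one instance.

    y: (q,) binary vector, y[j] = 1 iff class j is truly relevant
    p: (q,) per-class probabilities p_j of marking an irrelevant
       class as a non-candidate (complementary) label
    """
    candidates = set()
    for j in range(len(y)):
        if y[j] == 1:
            candidates.add(j)      # relevant labels always remain candidates
        elif rng.random() > p[j]:
            candidates.add(j)      # irrelevant label kept as a false positive
                                   # with probability 1 - p_j
    return candidates

# Example: q = 3 classes with class-dependent (non-uniform) probabilities
print(generate_candidate_set(np.array([1, 0, 0]), np.array([0.9, 0.5, 0.2])))
```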
First-Order Strategy: COMES-HL (Based on Hamming Loss)¶
MLC is decomposed into multiple independent binary classification problems. By Theorem 1, the \(\ell\)-risk for the Hamming loss can be equivalently expressed in terms of the distribution of unlabeled instances and, via Lemma 1, the conditional distributions of instances carrying each non-candidate label.
An unbiased risk estimator is obtained by constructing an unlabeled dataset \(\mathcal{D}_U\) and conditional datasets \(\mathcal{D}_j\). To keep deep networks from overfitting, the terms that can become negative are wrapped in an absolute value function, yielding the corrected risk estimator \(\tilde{R}_H^\ell\).
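A rough PyTorch illustration of the abs-correction idea, not the paper's exact estimator: the nnPU-style decomposition, names, and weighting below are assumptions, and Theorem 1 specifies the true form.

```python
import torch
import torch.nn.functional as F

def comes_hl_risk_j(scores_u, scores_j, pi_j):
    """Abs-corrected binary risk for one class j (illustrative sketch).

    scores_u: logits g_j(x) on the unlabeled set D_U
    scores_j: logits g_j(x) on D_j (instances with non-candidate label j)
    pi_j:     estimated class prior P(y_j = 1)
    """
    bce = F.binary_cross_entropy_with_logits
    pos_u = bce(scores_u, torch.ones_like(scores_u))   # D_U scored as positive
    pos_j = bce(scores_j, torch.ones_like(scores_j))   # D_j scored as positive
    neg_j = bce(scores_j, torch.zeros_like(scores_j))  # D_j scored as negative
    # pos_u - (1 - pi_j) * pos_j estimates the loss on the (unobserved) true
    # positives; it is nonnegative in expectation but can dip below zero
    # empirically, so it is wrapped in |.| to keep deep networks from
    # overfitting the negative excursion.
    return torch.abs(pos_u - (1.0 - pi_j) * pos_j) + (1.0 - pi_j) * neg_j
```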
Second-Order Strategy: COMES-RL (Based on Ranking Loss)¶
This strategy considers the pairwise ranking relationships among labels and exploits losses satisfying the symmetry condition \(\ell(z) + \ell(-z) = M\) for a constant \(M\).
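For example, the sigmoid loss \(\ell(z) = 1/(1 + e^{z})\) satisfies this condition with \(M = 1\), as a quick check confirms:

```python
import torch

def sigmoid_loss(z):
    # l(z) = 1 / (1 + exp(z)); symmetric, since l(z) + l(-z) = 1
    return torch.sigmoid(-z)

z = torch.linspace(-5, 5, 11)
assert torch.allclose(sigmoid_loss(z) + sigmoid_loss(-z), torch.ones_like(z))
```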
Flooding regularization is applied to mitigate overfitting: \(\tilde{R}_R^\ell = |\hat{R}_R^\ell - \beta| + \beta\).
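A sketch combining a generic pairwise ranking risk with flooding; the pair construction, the sigmoid surrogate, and the \(\beta\) default are illustrative assumptions:

```python
import torch

def flooded_pairwise_risk(scores, pairs, beta=0.1):
    """Pairwise ranking risk with flooding (illustrative sketch).

    scores: (q,) model scores g(x) for one instance
    pairs:  (j, k) index pairs where class j should rank above class k
    beta:   flooding level
    """
    # symmetric sigmoid loss on score differences, as in the sketch above
    risk = torch.stack([torch.sigmoid(scores[k] - scores[j])
                        for j, k in pairs]).mean()
    # flooding: |R_hat - beta| + beta keeps the training loss hovering
    # around beta instead of being driven to zero
    return (risk - beta).abs() + beta
```

With flooding, gradient updates oscillate around the level \(\beta\) rather than pushing the empirical risk all the way to zero, which is what mitigates overfitting here.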
Loss & Training¶
- COMES-HL uses binary cross-entropy as the surrogate loss.
- COMES-RL uses a symmetric loss function (e.g., sigmoid loss).
- The class prior \(\pi_j\) can be estimated from candidate labels using existing class prior estimation methods.
Theoretical Guarantees¶
| Property | COMES-HL | COMES-RL |
|---|---|---|
| Bias bound | \(0 \leq \text{bias} \leq O(\Delta_j)\), \(\Delta_j \to 0\) as \(n \to \infty\) | \(0 \leq \text{bias} \leq O(\Delta')\), \(\Delta' \to 0\) as \(n \to \infty\) |
| Estimation error convergence | \(O(\mathfrak{R}_n(\mathcal{G}) + \sqrt{\ln(1/\delta)/n})\) | \(O(\mathfrak{R}_n(\mathcal{G}) + \sqrt{\ln(1/\delta)/n})\) |
| Consistency | Bayes-optimal w.r.t. Hamming loss | Bayes-optimal w.r.t. Ranking loss |
Key Experimental Results¶
Main Results: Real-World Datasets (Ranking Loss ↓)¶
| Method | mirflickr | music_emotion | yeastBP | yeastCC | yeastMF |
|---|---|---|---|---|---|
| BCE | 0.106 | 0.244 | 0.328 | 0.206 | 0.251 |
| CCMN | 0.106 | 0.224 | 0.328 | 0.210 | 0.245 |
| GDF | 0.159 | 0.278 | 0.501 | 0.504 | 0.495 |
| CTL | 0.130 | 0.266 | 0.498 | 0.467 | 0.471 |
| COMES-HL | 0.095 | 0.214 | 0.154 | 0.124 | 0.173 |
| COMES-RL | 0.106 | 0.213 | 0.166 | 0.117 | 0.151 |
Method Comparison¶
| Method | No Uniform Assumption | No Generation Process Estimation | Label Correlation-Aware | Multiple Complementary Labels |
|---|---|---|---|---|
| CCMN | ✓ | ✗ | ✓ | ✓ |
| CTL | ✗ | ✓ | ✗ | ✗ |
| GDF | ✗ | ✓ | ✗ | ✓ |
| COMES-HL | ✓ | ✓ | ✗ | ✓ |
| COMES-RL | ✓ | ✓ | ✓ | ✓ |
Key Findings¶
- COMES-HL and COMES-RL consistently outperform state-of-the-art methods across 5 evaluation metrics on 6 real-world datasets.
- The advantage is particularly pronounced on the yeast benchmark series, with Ranking Loss reduced by approximately 50% (e.g., yeastBP: 0.154 vs. 0.328).
- COMES-RL achieves superior performance on datasets with strong label correlations (e.g., yeastMF: 0.151 vs. 0.173).
- Both methods demonstrate robustness on synthetic data generated under both uniform and non-uniform label generation processes.
Highlights & Insights¶
- Strong Theoretical Contributions: This work is the first to establish consistency for MLC under inexact supervision without relying on transition matrix estimation or uniform distribution assumptions.
- Unified Treatment of PML and CML: The mathematical equivalence between the two paradigms is exploited to address both problems within a single framework.
- Complementary First- and Second-Order Strategies: COMES-HL is computationally efficient but ignores label correlations, while COMES-RL leverages pairwise ranking relationships at higher computational cost, making the two strategies suitable for different scenarios.
- Practical Data Generation Assumption: The per-class query independence assumption more faithfully reflects realistic annotation workflows.
- Elegant Corrected Risk Estimator Design: Absolute value wrapping and flooding regularization respectively resolve overfitting issues in the first-order and second-order strategies.
Limitations & Future Work¶
- The framework focuses solely on Hamming loss and Ranking loss, leaving other MLC evaluation metrics (e.g., F1-measure) unaddressed.
- The quality of class prior estimation \(\pi_j\) affects final performance, yet the propagation of estimation error is not thoroughly analyzed.
- The assumption that \(p_j\) is constant in the data generation process may not fully capture complex annotation behaviors.
- The computational complexity of the second-order strategy is \(O(q^2)\), which may become a bottleneck when the label space is large.
- The datasets used in experiments are relatively small-scale; performance on large-scale benchmarks remains to be validated.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The framework design unifying PML/CML while removing strong assumptions is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive outperformance of state-of-the-art across multiple datasets and metrics, though dataset scale is limited.
- Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are rigorous and complete; the structure is clear.
- Value: ⭐⭐⭐⭐ — Provides a solid theoretical foundation and practical methodology for weakly supervised MLC.