Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities

Conference: CVPR 2026
Paper: CVF Open Access
Code: https://github.com/katdimitris/Complementary-Dissimilarity-Loss
Area: Representation Learning / Supervised Classification Regularization
Keywords: Complementary Learning, Class Dissimilarity, Neural Collapse, one-cold target, Supervised Regularization

TL;DR

Cross-Entropy (CE) gradients for non-target classes vanish once a sample is correctly classified, so the model stops learning how dissimilar the remaining classes are. This paper proposes Complementary Dissimilarity Loss (CDL), which explicitly supervises all non-target classes with a "one-cold" target: the target class is set to 0 and the non-target classes share the probability mass according to a dissimilarity prior. CDL keeps gradients from vanishing, steers representations toward a controllable form of Neural Collapse, and gives consistent plug-and-play gains across closed-set, open-set, few-shot, and domain generalization tasks.

Background & Motivation

Background: Supervised classification is almost exclusively built upon Cross-Entropy (CE) with one-hot targets. To mitigate overconfidence and overfitting, various regularization terms have been derived, such as Label Smoothing (LS), MaxSup, and Complement Objective Training (COT/CCE), along with self-distillation (self-KD) methods like OLS, USKD, and Zipf's LS. These methods attempt to reallocate a portion of the probability mass from the target class to non-target classes.

Limitations of Prior Work: One-hot targets only concern "which class this sample belongs to." Once a sample is correctly classified and the target logit significantly exceeds the others, softmax's exponential normalization makes the gradients of all non-target classes decay exponentially, so the model stops learning "what this sample is least like." Methods like LS distribute probability uniformly across non-target classes, but they (a) suppress the target class probability, artificially shrinking inter-class margins, and (b) still suffer from vanishing gradients for highly dissimilar classes once the prediction is confident. Existing methods therefore capture "similarity" (what a sample looks like) but ignore "dissimilarity" (which distant classes it explicitly does not look like).

Key Challenge: The supervisory signal of CE is effective only while the sample is "not yet correctly classified"; its gradients approach zero after correct classification. Meanwhile, "good geometry" such as Neural Collapse (NC), in which intra-class features collapse to a point, class means form a Simplex ETF, and the last layer reduces to a nearest-centroid classifier, only emerges as an accidental byproduct of prolonged fitting after zero training error in overparameterized models, rather than being explicitly optimized.

Goal: To turn "class dissimilarity" into an explicit supervisory signal whose gradients do not vanish upon correct classification, so that it shapes the representation geometry from early training and gives control over the trajectory toward NC.

Key Insight: The authors introduce the concept of "opposite-classes (O-classes)." The O-class $\bar{C}_k = X \setminus C_k$ of class $k$ represents "everything in the feature space that does not belong to $k$," including other known classes, unknown classes, and even random noise. A real sample $x \in C_c$ simultaneously falls into the O-classes of all non-target classes; thus, "which O-classes it belongs to" exactly encodes "which classes it is unlike."

Core Idea: Replace one-hot/soft targets with one-cold targets (target class set to 0, non-target classes assigned probability based on dissimilarity priors) to construct CDL—a loss orthogonal to and stackable with CE that explicitly supervises "what it is unlike."

Method

Overall Architecture

The method does not alter the network structure or add extra branches; it simply stacks a complementary loss term onto the original CE loss. Given network output logits $z = f_\theta(x)$, CE operates as usual on $z$ to ensure correct classification. Simultaneously, the same set of logits is negated $\bar{z} = -z$ to serve as "O-class logits." These are passed through a softmax to obtain the O-class predicted distribution $\hat{\bar p}$, which is then supervised by a one-cold target distribution $\bar p$ where the target class is 0 and non-target classes are normalized according to dissimilarity priors. The combined objective is:

\[\mathcal{L} = \mathcal{L}_{CE} + \gamma \cdot \mathcal{L}_{CD}\]

where $\gamma$ controls the intensity of the dissimilarity signal. The core of the mechanism lies in three aspects: how the one-cold target is defined, what priors fill it, and why negative logits prevent vanishing gradients while pushing representations toward NC.
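As a rough illustration of the combined objective, the sketch below evaluates CE on the logits and the uniform-prior CDL on the negated logits for a single sample. It uses plain Python rather than a deep learning framework, and the function names (`ce_loss`, `cdl_uniform_loss`, `combined_loss`) are made up for this example; they are not taken from the authors' repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(logits, target):
    """Standard cross-entropy with a one-hot target."""
    return -math.log(softmax(logits)[target])

def cdl_uniform_loss(logits, target):
    """Cross-entropy between the uniform one-cold target
    (0 on the target class, 1/(K-1) elsewhere) and softmax(-z)."""
    k = len(logits)
    o_probs = softmax([-v for v in logits])  # O-class prediction from negated logits
    return -sum(math.log(o_probs[j]) for j in range(k) if j != target) / (k - 1)

def combined_loss(logits, target, gamma=1.0):
    """L = L_CE + gamma * L_CD for one sample."""
    return ce_loss(logits, target) + gamma * cdl_uniform_loss(logits, target)

z = [4.0, 0.5, -0.3, 0.1]  # illustrative logits; class 0 is the target
print(combined_loss(z, target=0, gamma=1.0))
```

Setting `gamma=0` recovers plain CE, which matches the description of CDL as an additive term on the same logits.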

Key Designs

1. O-class and one-cold targets: Formulating "what it is not" as a supervisable distribution

The pain point of one-hot is that it suppresses all non-target classes to 0, effectively declaring "no difference between non-target classes" and erasing dissimilarity information. This paper reverses that: for a sample $(x, y=c)$, the O-class membership probability for the target class $c$ is 0 (it already belongs to $c$), while for the remaining $K-1$ non-target classes, it is 1 (it indeed does not belong to them). Beyond just "not belonging to $k$," the authors define the dissimilarity prior:

\[\rho_{c|k} = P\big(y'=c \mid x' \in \bar{C}_k\big),\qquad \rho_{k|k}=0,\ \sum_{c\neq k}\rho_{c|k}=1.\]

A larger $\rho_{c|k}$ indicates that $c$ and $k$ are more dissimilar ($c$ has a heavier weight in the "complement" of $k$). Multiplying the O-class membership likelihood by this prior and normalizing over the non-target set yields the one-cold target distribution:

\[\bar p_k = \begin{cases} 0, & k=c,\\[2pt] \dfrac{\rho_{c|k}}{\sum_{j\neq c}\rho_{c|j}}, & k\neq c. \end{cases}\]

It assigns zero probability to the target class and spreads a unit of mass across non-target classes based on dissimilarity—hence the name "one-cold" (one cold spot), complementary to one-hot (one hot spot).
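The construction of the one-cold target can be sketched in a few lines. The layout `rho[c][k]` for the prior $\rho_{c|k}$ and the helper name `one_cold_target` are assumptions made for this example only.

```python
def one_cold_target(rho, c):
    """Build the one-cold target for a sample of class c.

    rho[c][k] is assumed to hold the dissimilarity prior rho_{c|k}.
    The target class gets 0; the rest is normalized over k != c.
    """
    num_classes = len(rho)
    weights = [0.0 if k == c else rho[c][k] for k in range(num_classes)]
    norm = sum(weights)
    return [w / norm for w in weights]

K = 4
# Uniform prior: every non-target class is equally dissimilar.
uniform_rho = [[0.0 if c == k else 1.0 / (K - 1) for k in range(K)] for c in range(K)]
print(one_cold_target(uniform_rho, c=2))

# A hypothetical asymmetric prior row for class 0 (unnormalized weights).
skewed_rho = [[0.0, 1.0, 1.0, 2.0]] + uniform_rho[1:]
print(one_cold_target(skewed_rho, c=0))
```

With the uniform prior the result is the complement of a one-hot vector, scaled to sum to one; with a skewed prior, more mass goes to the classes marked as more dissimilar.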

2. Complementary Dissimilarity Loss and two priors: Uniform vs. Self-Distillation

With the target distribution $\bar p$ and prediction $\hat{\bar p}$ defined, CDL is the cross-entropy between them: $\mathcal{L}_{CD} = -\sum_k \bar p_k \log \hat{\bar p}_k$. The choice of prior $\rho_{c|k}$ determines the geometry to be shaped. The authors provide two variants:

- Uniform Prior: Assumes no extra information and that each non-target class is equally "dissimilar," i.e., $\rho^{uni}_{c|k}=1/(K-1)$, resulting in

\[\mathcal{L}^{uni}_{CD} = -\frac{1}{K-1}\sum_{k\neq c}\log\hat{\bar p}_k,\]

which suppresses spurious inter-class dependencies and encourages non-target logits to converge to a common constant, pushing class geometry toward a symmetric Simplex ETF.
- Self-Distillation Prior (self-KD): Allows the prior to emerge adaptively from the model's own predictions. Each epoch, the means of temperature-scaled O-class predictions for correctly classified samples are collected: $\tilde\rho_{c|k}=\frac{1}{|S_c|}\sum_{x\in S_c}\hat{\bar p}_k(x;\tau)$. This is then interpolated with the uniform prior: $\rho^{self}_{c|k}=(1-\alpha)\rho^{uni}_{c|k}+\alpha\tilde\rho_{c|k}$. The model can thus feed the asymmetric relationships it discovers during training (which classes are more or less similar) back into the target distribution, intentionally deviating from a strict ETF to preserve true inter-class structures.
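The self-KD prior update can be sketched as follows. This is a minimal illustration under several assumptions that the summary does not spell out: the temperature is applied as `softmax(-z / tau)`, classes with no correctly classified samples fall back to the uniform prior, and the diagonal entry is left for the one-cold normalization to discard. The name `self_kd_prior` is hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    """Numerically stable, temperature-scaled softmax."""
    scaled = [v / tau for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def self_kd_prior(all_logits, labels, num_classes, alpha=0.5, tau=2.0):
    """Return rho[c][k] = (1 - alpha) * uniform + alpha * mean O-class prediction,
    averaged over the correctly classified samples of class c."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for z, y in zip(all_logits, labels):
        pred = max(range(num_classes), key=lambda j: z[j])
        if pred != y:
            continue  # only correctly classified samples contribute
        o_probs = softmax([-v for v in z], tau)  # assumed form of the scaled O-class prediction
        for k in range(num_classes):
            sums[y][k] += o_probs[k]
        counts[y] += 1
    rho = []
    for c in range(num_classes):
        row = []
        for k in range(num_classes):
            uni = 0.0 if k == c else 1.0 / (num_classes - 1)
            # Assumption: with no correct samples for class c, reuse the uniform prior.
            emp = sums[c][k] / counts[c] if counts[c] else uni
            row.append((1 - alpha) * uni + alpha * emp)
        rho.append(row)
    return rho

logits = [[3.0, 1.0, 0.0], [0.0, 2.0, -1.0], [5.0, 0.0, 9.0]]
labels = [0, 1, 0]  # the third sample is misclassified and is skipped
print(self_kd_prior(logits, labels, num_classes=3))
```

In this toy run, class 2 has the lowest logit for the class-0 sample, so it receives the largest weight in row 0, which is the asymmetry the self-KD prior is meant to capture.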

3. Negative logit reuse + Non-vanishing gradients: Optimizing where CE plateaus

A clever engineering point is how CDL is attached to CE without conflict. Instead of using separate parameters, the authors let the O-class logits be $\bar z = -z$. As the target class grows in $z$, it shrinks in $\bar z$, so its O-class membership probability approaches 0 and matches $\bar p_c=0$. Predictions and targets therefore align naturally, and any CE-based loss can be applied to the negated logits with zero architectural intrusion.

The key benefit is in the gradients. The gradient of CE w.r.t. a non-target logit is $\partial \mathcal{L}_{CE}/\partial z_k = p_k$, which vanishes as $p_k \to 0$ once the sample is correctly classified. In contrast, the gradient of CDL$_{uni}$ w.r.t. a non-target O-class logit is $\partial \mathcal{L}^{uni}_{CD}/\partial \bar z_k = \hat{\bar p}_k - \frac{1}{K-1}$ (prediction minus one-cold target), which stays non-zero after correct classification for as long as the non-target logits remain dispersed. Theoretically (Theorem 1), the rate at which CE flattens non-target logits decays exponentially with margin $m$ (coefficient $\frac{\eta}{e^m+K-1}$), while CDL$_{uni}$ continues to contract the dispersion $\sigma$ of non-target logits with a margin-independent constant coefficient $\frac{\eta}{K-1}$. Consequently, the target class probability $p_c$ can still converge to 1 (avoiding the artificial margin reduction of LS/MaxSup) while non-target logits are pushed toward a uniform constant $z_{const}$, actively facilitating controllable NC from early training.
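The contrast between the two gradients can be checked numerically. The sketch below evaluates the closed-form softmax cross-entropy gradients (prediction minus target) for a toy 4-class sample at increasing margins; the logit values and the helper name `grad_magnitudes` are arbitrary choices for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_magnitudes(margin, target=0):
    """Largest absolute non-target gradient for CE (w.r.t. z) and
    for uniform-prior CDL (w.r.t. the O-class logits -z)."""
    z = [margin, 0.6, -0.4, 0.1]  # target logit set to `margin`; others are dispersed
    k = len(z)
    p = softmax(z)
    o = softmax([-v for v in z])
    ce = max(abs(p[j]) for j in range(k) if j != target)                  # p_k - 0
    cd = max(abs(o[j] - 1.0 / (k - 1)) for j in range(k) if j != target)  # o_k - 1/(K-1)
    return ce, cd

for margin in (2.0, 6.0, 10.0):
    ce, cd = grad_magnitudes(margin)
    print(f"margin={margin:4.1f}  max|CE grad|={ce:.2e}  max|CDL grad|={cd:.2e}")
```

As the margin grows, the CE term shrinks by orders of magnitude while the CDL term stays at a comparable size, because it depends on the spread of the non-target logits rather than on the margin. Once the non-target logits are all equal, the CDL gradient on them also goes to zero, which is the fixed point the loss is driving toward.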

Loss & Training

The final loss is $\mathcal{L} = \mathcal{L}_{CE} + \gamma\,\mathcal{L}_{CD}$, where $\gamma$ controls dissimilarity strength (typically $\gamma=1.0$ when used alone, or $\gamma=0.1\sim0.5$ when stacked with strong regularizers). The self-KD variant includes interpolation weight $\alpha\in[0,1]$ and temperature $\tau$. This objective is orthogonal and stackable with any CE-based loss (LS, COT, focal, ASL, etc.) and self-distillation methods (OLS, etc.), requiring no architectural changes or additional inference cost.

Key Experimental Results

Main Results (Closed-set Classification)

On CIFAR-100 and TinyImageNet, stacking CDL$_{uni}$ onto 9 CE-based baselines across 3 seeds shows nearly universal error reduction. The table lists error rates (%, lower is better) for two of those baselines, CE and COT:

Dataset / Model	CE	CE + CDL$_{uni}$	COT	COT + CDL$_{uni}$
CIFAR-100 / RN18	24.98	23.92	24.52	23.21
CIFAR-100 / DN121	25.06	24.19	23.70	22.27
TinyImageNet / RN18	36.80	35.61	35.40	34.74
TinyImageNet / DN121	38.71	36.49	37.60	35.11

On ImageNet (ResNet, error rate ↓), CDL shows stable improvements, with larger models yielding greater gains:

Config	RN50 Top-1	RN50 Top-5	RN101 Top-1	RN101 Top-5
CE	23.72	7.08	22.78	6.50
CE + CDL$_{uni}$	22.61	6.24	21.46	5.68
LS	22.72	6.44	21.59	5.99
LS + CDL$_{uni}$	22.51	6.24	21.33	5.74
OLS + CDL$_{self}$	22.31	6.20	21.07	5.81

CDL$_{uni}$ reduces the Top-1 error of CE by 1.11 / 1.32 percentage points on RN50 / RN101 respectively; it still adds 0.2~0.53 points when stacked on MaxSup/LS. On DeiT-Small (with MixUp/CutMix disabled), CE+CDL$_{uni}$ reduces error from 25.61 to 23.12 (a 2.49-point drop, outperforming the runner-up MaxSup by 0.39), demonstrating efficacy in large-scale Transformer training.
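The quoted Top-1 reductions can be re-derived from the ImageNet table. The snippet below is only an arithmetic check; the numbers are copied from that table and the dictionary layout is an arbitrary choice for this example.

```python
# Top-1 error (%) copied from the ImageNet table, as (RN50, RN101).
top1 = {
    "CE": (23.72, 22.78),
    "CE + CDL_uni": (22.61, 21.46),
    "LS": (22.72, 21.59),
    "LS + CDL_uni": (22.51, 21.33),
}

def gain(base, stacked):
    """Error reduction in percentage points for (RN50, RN101)."""
    return tuple(round(b - s, 2) for b, s in zip(top1[base], top1[stacked]))

print("CE -> CE + CDL_uni:", gain("CE", "CE + CDL_uni"))
print("LS -> LS + CDL_uni:", gain("LS", "LS + CDL_uni"))
```

The CE row reproduces the 1.11 / 1.32 point reductions, and the LS row sits at the lower end of the 0.2~0.53 range quoted for strong regularizers.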

Self-Distillation and Multi-task Generalization

Comparison of the self-KD variant against specialized self-distillation methods (Error rate ↓):

Method	CIFAR-100 RN34	CIFAR-100 MobV2	TinyIN RN34	TinyIN MobV2
CE	24.00	28.51	36.39	39.20
OLS	22.62	27.91	35.04	39.11
USKD	23.83	28.82	36.87	38.58
CE+CDL$_{self}$	22.44	27.52	34.81	38.88

Consistent gains were observed in three other task categories: Open-Set Recognition (OSCR ↑), where CDL$_{self}$ improved CE by +3.66 and ARPL by +2.18 on CIFAR-100; Few-Shot Learning (1-shot error ↓) on CUB, where CDL$_{self}$ lowered error by 3.98 relative to CE and by 1.64 relative to CDL$_{uni}$; and Domain Generalization (PACS error ↓), where stacking CDL$_{uni}$ on MixUp lowered the average error by 1.38.

Key Findings

Non-vanishing gradients are the root cause: Fig. 2 shows that while the CE gradient field approaches zero and flattens in correctly classified regions, CDL$_{uni}$ maintains a valid descent direction. Theoretically, while the flattening rate of CE decays exponentially with the margin, CDL maintains a constant rate.
CDL actively reinforces NC: In a 250-epoch ResNet-50 training run, increasing $\gamma$ leads to a comprehensive decrease in NC1–NC4 metrics. The simultaneous minimization of NC2+NC3 implies that class means themselves form a Simplex ETF, exactly the symmetric geometry induced by CDL$_{uni}$. CDL$_{self}$ maintains low NC1 (strong separability) while intentionally deviating from strict ETF to capture asymmetric class relations.
Gains increase with task difficulty: Relative gains are most significant in difficult settings like open-set CIFAR-100 (100 known classes), indicating that explicit dissimilarity modeling is most useful when classes are numerous and easily confused.
Orthogonality and Stackability: Improvement across 9 different losses suggests that CDL supplies a complementary dissimilarity signal that existing methods fail to utilize.
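For readers unfamiliar with the Simplex ETF geometry mentioned in the NC findings, the sketch below builds the standard construction $\sqrt{K/(K-1)}\,(I - \tfrac{1}{K}\mathbf{1}\mathbf{1}^\top)$ and checks its defining property: all class-mean directions have unit norm and every pair has cosine $-1/(K-1)$. This is a textbook construction used here for illustration, not the paper's NC2/NC3 metric code.

```python
import math

def simplex_etf(k):
    """Rows are K unit-norm class-mean directions of a Simplex ETF in R^K."""
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

K = 5
means = simplex_etf(K)
pair_cosines = [cosine(means[i], means[j]) for i in range(K) for j in range(i + 1, K)]
print(min(pair_cosines), max(pair_cosines), -1.0 / (K - 1))
```

A representation trained toward this geometry treats every pair of classes as equally far apart, which is why the uniform prior is associated with the ETF and the self-KD prior with deliberate deviations from it.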

Highlights & Insights

Modeling "what a sample is NOT" is an overlooked supervisory dimension: One-hot says "what it is," soft targets say "what it is like," and this paper's one-cold says "what it is unlike." The three are complementary. Formalizing "opposite-classes" as probability distributions turns dissimilarity into an explicitly optimizable target with a clean logic.
Negative logits are zero-intrusion yet solve vanishing gradients: By setting $\bar z=-z$, any CE-based loss can be reused without extra parameters or architectural changes, continuing optimization exactly where CE "lies flat" after correct classification. This is a highly transferable trick for engineering.
Turning Neural Collapse from an "accidental byproduct" into a "controlled goal": Uniform priors lead to symmetric Simplex ETFs, while self-distillation priors lead to asymmetric geometries reflecting true class relations. This provides a "knob" ($\gamma$, $\alpha$, $\tau$) for deciding what representation geometry is desired, which is highly insightful for representation learning.
Self-bootstrapping dissimilarity priors: The self-KD variant requires no external teacher; it uses its own prediction means to iteratively refine the prior, closing the loop on the model's discovery of "which classes are more unlike."

Limitations & Future Work

Extra hyperparameters and scheduling costs: Parameters like $\gamma$, and $\alpha$ / $\tau$ for self-KD, require tuning. Furthermore, self-KD requires caching O-class predictions for correctly classified samples every epoch, adding a layer of state management compared to pure CE.
Locality assumptions in theoretical conclusions: The $\sigma$ contraction rate in Theorem 1 is built on local conditions ($\sigma\le r$, $m\ge m_0$) and small step sizes with $O(\eta\sigma^2)$ remainders. Its strict validity across the entire training trajectory and its sensitivity to optimizer/learning rate scheduling warrant further exploration (details are in the appendix).
Small but stable absolute gains: Improvements in many settings fall in the 0.2~1.3% range; the method's case rests on being slightly better almost everywhere and noticeably better on harder tasks. In simple tasks with few classes, dissimilarity information is sparse, which limits the gains.
Weak semantic source for dissimilarity priors: Uniform priors do not distinguish between near and far classes, and self-KD priors originate from the model itself, potentially introducing self-confirmation bias. Introducing external semantic/hierarchical priors (e.g., label trees, text embedding similarities) to fill $\rho_{c|k}$ could improve dissimilarity modeling accuracy.

Comparison with Related Work

vs. Label Smoothing / MaxSup: LS spreads probability uniformly, artificially suppressing target class probability and shrinking margins; it still suffers from vanishing gradients for dissimilar classes. CDL allows target probability to converge to 1 (preserving margin) while non-target gradients remain effective regardless of margin, enabling the shaping of specific geometries rather than global bias.
vs. COT / CCE (Complement Entropy): These maximize the entropy of non-target predictions to push them toward uniform, but they only suppress the most confident non-target logit; gradients for dissimilar classes remain weak. CDL explicitly targets "ignored dissimilar classes" and is orthogonal to them (stacking with COT proves to be a strong combination).
vs. OLS / USKD / Zipf's (self-KD): These methods focus on "exploiting class similarity" to construct soft labels for non-target logits but do not model dissimilarity. CE+CDL$_{self}$ outperforms them in most classification settings (USKD has the lower error on TinyImageNet with MobileNetV2 in the self-distillation table), as well as in NC1 separability and open-set tasks, suggesting that dissimilarity is the complementary half of the information.
vs. Neural Collapse Research: Previous works analyze NC as an emergent phenomenon of CE after long-term training. This paper transforms inducing NC (and even specifying ETF / non-ETF geometries) into an explicit, controllable training objective.

Rating

Novelty: ⭐⭐⭐⭐⭐ The one-cold target and O-class formalization of "modeling what it is NOT" is a rare and self-consistent new supervisory dimension.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers four task categories, CNN+ViT, 9 loss combinations, NC metrics, and gradient theory.
Writing Quality: ⭐⭐⭐⭐ Concepts and derivations are clear; however, the reliance on hyperparameters and appendices makes the main text somewhat dense.
Value: ⭐⭐⭐⭐ Plug-and-play, zero inference cost, stable gains across tasks, and provides a controllable knob for representation geometry; very deployment-friendly.