From Generalist to Specialist Representation
Conference: ICML 2026
arXiv: 2605.12733
Code: None
Area: Representation Learning Theory / Causal Identifiability
Keywords: identifiability, task-relevant representation, nonparametric, sparsity, world model
TL;DR
This paper presents what it describes as the first fully nonparametric identifiability proof (no interventions, no functional constraints) for a two-layer hierarchy: treating tasks as colliders makes the temporal-task structure identifiable through conditional independence (CI) tests, and sparsity regularization then separates the task-relevant latents from a generalist representation.
Background & Motivation
Background: Learning latents from high-dimensional observations is central to world modeling. Without identifiability guarantees, however, a learned representation can be "observation-equivalent but content-misaligned" with the ground truth: it explains the observations equally well while differing from the true latents by an arbitrary transformation \(\hat s = \phi(s)\). Classical linear ICA relies on non-Gaussianity, nonlinear ICA on auxiliary variables or functional constraints, and causal representation learning on interventions or counterfactuals. Each approach requires "extra ingredients."
Limitations of Prior Work: (1) Most identifiability results aim for full latent recovery, but downstream tasks often need only a subset of the latents; (2) Existing work on task-relevant representations (content-style separation, subspace factorization) supports only fixed structures, so the number of tasks and their combinations are inflexible; (3) In the i.i.d. setting no temporal signal is available, which makes the theory especially challenging.
Key Challenge: "Full generality" (nonparametric + arbitrary task structure + allowing disconnected sequences) and "provable identifiability" are very hard to achieve together. The core issue is that component-wise recovery of all latents requires overly strong assumptions, whereas identifying only the task-relevant subset is often sufficient and holds under much weaker conditions.
Goal: To prove, under the most general nonparametric setting, that (1) the association structure between time steps and tasks is identifiable; (2) the task-relevant latents within each time step are identifiable.
Key Insight: Model "tasks" as colliders over time steps (\(s_t \to a_t \to g_i\)): every step that serves task \(g_i\) points into it, so \(g_i\) is a common child of those steps. Modeling the task as a confounder or mediator instead implies incorrect conditional independencies; only the collider encodes the true "multi-step interdependence within the same task." On a collider DAG, conditional independence tests plus sparsity regularization suffice for a complete identification procedure.
Core Idea: Treat tasks as colliders, use band conditioning sets for CI tests to recover the temporal-task graph; then use \(\ell_1\) regularization to shrink the generalist's overcomplete representation to the minimal task subset. The entire theory requires neither intervention nor functional constraints.
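The collider argument can be checked numerically on a toy example. The sketch below is not from the paper: it assumes a linear-Gaussian model in which two states `s1` and `s2` are independent causes of a shared task variable `g`, and the coefficients and the helper names `simulate_collider` and `partial_corr` are illustrative choices.

```python
import numpy as np


def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out the columns of z."""
    design = np.column_stack([np.ones(len(x)), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))


def simulate_collider(n=20000, seed=0):
    """Assumed toy model: s1 -> g <- s2 with independent Gaussian causes."""
    rng = np.random.default_rng(seed)
    s1 = rng.normal(size=n)
    s2 = rng.normal(size=n)
    g = s1 + s2 + 0.5 * rng.normal(size=n)  # g is the common child (collider)
    return s1, s2, g


s1, s2, g = simulate_collider()
marginal = float(np.corrcoef(s1, s2)[0, 1])
given_g = partial_corr(s1, s2, g.reshape(-1, 1))
print(f"corr(s1, s2)     = {marginal:+.3f}")
print(f"corr(s1, s2 | g) = {given_g:+.3f}")
```

Marginally the two states are close to uncorrelated, while conditioning on the shared task induces a strong negative dependence. This is the signal the CI tests look for when deciding whether two steps belong to the same task.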
Method
Overall Architecture
A two-layer hierarchical pipeline: The first layer (Section 3) starts from observation sequences \(\{o_t\}\) and task sets \(\{g_i\}\), using Algorithm 1 to recover the global temporal-task association structure via CI tests on segment pairs. The second layer (Section 4) disentangles, within each time step, the subset of latent state \(s_t\) truly relevant to each task using a VAE with sparsity regularization. The interface between the two layers is "task labels per step."
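As a rough illustration of the first layer, the segment-pair test can be sketched with the band conditioning set defined under Key Designs and a Fisher z-test. Everything concrete here is an assumption made for the example and not the paper's Algorithm 1: the linear-Gaussian chain, the tasks attached directly to representative states (skipping the actions \(a_t\)), the coefficients, and the helper names `band_columns`, `fisher_z_test`, and `simulate`.

```python
import math
import numpy as np


def fisher_z_test(data, i, j, cond):
    """Partial correlation of columns i and j given columns cond, with a two-sided p-value."""
    n = data.shape[0]
    design = np.column_stack([np.ones(n)] + [data[:, c] for c in cond])
    ri = data[:, i] - design @ np.linalg.lstsq(design, data[:, i], rcond=None)[0]
    rj = data[:, j] - design @ np.linalg.lstsq(design, data[:, j], rcond=None)[0]
    r = float(ri @ rj / np.sqrt((ri @ ri) * (rj @ rj)))
    r = max(min(r, 0.999999), -0.999999)  # keep the log finite
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return r, p


def band_columns(k, v, L):
    """Time indices of the boundary states around s_{kL} and s_{vL}."""
    return [k * L - 1, k * L + 1, v * L - 1, v * L + 1]


def simulate(n=20000, T=9, seed=0):
    """Assumed toy data: a Markov chain of states plus two task variables."""
    rng = np.random.default_rng(seed)
    s = np.zeros((n, T))
    s[:, 0] = rng.normal(size=n)
    for t in range(1, T):
        s[:, t] = 0.8 * s[:, t - 1] + rng.normal(size=n)
    g_a = s[:, 3] + s[:, 6] + 0.5 * rng.normal(size=n)  # spans segments 1 and 2
    g_b = s[:, 3] + 0.5 * rng.normal(size=n)            # touches segment 1 only
    return np.column_stack([s, g_a, g_b])               # columns 9 and 10 are tasks


L, k, v = 3, 1, 2
data = simulate()
results = {}
for name, g_col in [("g_a", 9), ("g_b", 10)]:
    cond = band_columns(k, v, L) + [g_col]
    results[name] = fisher_z_test(data, k * L, v * L, cond)
    r, p = results[name]
    print(f"{name}: partial corr = {r:+.3f}, p = {p:.3g}, shared task = {p < 0.05}")
```

Conditioning on the boundary states closes the temporal path between the two representative states, so any remaining dependence has to come from the conditioned task. In this toy model that should happen for `g_a` and not for `g_b`, up to the usual false-positive rate of the test.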
Key Designs
- Collider Modeling + Band CI Test:
  - Function: Provably identifies "which time step belongs to which task" under arbitrary task structures (interleaving, repetition, disconnection).
  - Mechanism: The sequence is divided into equal-length segments \(S_k\) of length \(L\ge 2\), with \(s_{kL}\) serving as the representative state of \(S_k\). The band conditioning set is defined as \(Z_{\text{band}}(k,v,i) = \{s_{kL-1}, s_{kL+1}, s_{vL-1}, s_{vL+1}\} \cup \{g_i\}\). Theorem 1 proves that, under the Markov and faithfulness assumptions, task \(g_i\) is associated with both \(S_k\) and \(S_v\) iff \(s_{kL}\) and \(s_{vL}\) are dependent conditioned on \(Z_{\text{band}}(k,v,i)\). The key is that the collider structure blocks every path that does not pass through \(g_i\): conditioning on the boundary states blocks the temporal channels, and every other task is a collider that is not conditioned on, so paths through it stay blocked. Corollary 1 further shows that the representative states can be freely substituted.
  - Design Motivation: Confounder/mediator modeling leads to the counterintuitive conclusion that "two steps in the same task are conditionally independent"; only the collider preserves the semantics of a "coordinated plan." The band conditioning set precisely isolates the dependency channel of a single task.
- Generalist Bound + Sparsity Tightening:
  - Function: First proves that a generalist model can only guarantee that the estimated task-relevant latents form a superset of the ground truth, then shows that adding sparsity regularization tightens this to the exact set.
  - Mechanism: Proposition 2 proves, under sufficient nonlinearity, that \(\|\mathcal{I}((J_{\hat u})_{i,\cdot})\| \ge \|\mathcal{I}((J_u)_{i,\cdot})\|\): the number of estimated task-relevant latents is at least the ground-truth number. Theorem 2 adds the sparsity constraint \(\|\mathcal{I}(J_{\hat u})\| \le \|\mathcal{I}(J_u)\|\), which tightens the inequality to equality, and derives a column permutation \(\pi\) such that \(\hat s_{t, \pi(I_k)} = h_k(s_{t, I_k})\) for an invertible function \(h_k\). This disentangles task-relevant from task-irrelevant latents.
  - Design Motivation: The generalist result first shows that "a large model alone is insufficient," then sparsity provides a clear guide: "add this, and you get that," forming a rigorous logical chain from motivation to solution. The conclusion is significant: group-wise identifiability can be achieved in the i.i.d. setting without functional constraints.
- Bridge from Theory to Practice:
  - Function: Enables the purely nonparametric proof to be implemented on real videos/images.
  - Mechanism: CI tests are performed in the observation space, using conditional mutual information (CMI) as a proxy to avoid the difficulties of high-dimensional test statistics; when tasks are unknown, task representations are learned as latents. Latent estimation uses VAE + \(\ell_1\) regularization; for image generation, GANs with task-specific masks operate on the latents.
  - Design Motivation: Identifiability is an asymptotic property, and without a runnable algorithm the community may regard it as purely theoretical; CMI proxies plus a standard VAE make the pipeline reproducible with common tooling.
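The support-counting argument behind the generalist bound and its sparsity tightening can be illustrated on a toy map. The sketch below assumes that \(\mathcal{I}(J)\) denotes the set of nonzero Jacobian entries, with rows indexing tasks and columns indexing latents. The function `u_true`, the mixing matrix `MIX`, the permutation `PERM`, and the finite-difference estimator are all hypothetical choices for the example, not the paper's construction.

```python
import numpy as np


def u_true(s):
    """Toy ground truth: task 0 reads latents {0, 1}, task 1 reads latent {2}, latent 3 is unused."""
    return np.array([np.tanh(s[0]) + s[0] * s[1], np.sin(s[2]) + s[2]])


# Hypothetical dense, invertible reparametrization (strictly diagonally dominant).
MIX = np.array([[2.0, 0.5, 0.3, 0.2],
                [0.4, 2.0, 0.6, 0.1],
                [0.2, 0.3, 2.0, 0.5],
                [0.1, 0.2, 0.4, 2.0]])
PERM = np.array([2, 0, 3, 1])  # hypothetical relabeling of the latent coordinates


def u_dense(s_hat):
    """Entangled estimate: the same outputs expressed in densely mixed latents."""
    return u_true(MIX @ s_hat)


def u_perm(s_hat):
    """Sparse estimate: the same outputs expressed in permuted latents."""
    return u_true(s_hat[PERM])


def jacobian_support(f, dim=4, n_points=50, eps=1e-5, tol=1e-6, seed=0):
    """Union over random points of the nonzero pattern of a central-difference Jacobian."""
    rng = np.random.default_rng(seed)
    support = np.zeros((len(f(np.zeros(dim))), dim), dtype=bool)
    for _ in range(n_points):
        x = rng.normal(size=dim)
        for j in range(dim):
            step = np.zeros(dim)
            step[j] = eps
            column = (f(x + step) - f(x - step)) / (2 * eps)
            support[:, j] |= np.abs(column) > tol
    return support


support_true = jacobian_support(u_true)
support_dense = jacobian_support(u_dense)
support_perm = jacobian_support(u_perm)
for name, sup in [("true", support_true), ("dense", support_dense), ("perm", support_perm)]:
    print(name, int(sup.sum()), sup.astype(int).tolist())
```

Both reparametrizations produce the same outputs, but only the one whose support is no larger than the ground truth keeps each task tied to a relabeled copy of its own latents. The dense one spreads every task across all coordinates, which is the superset behavior the bound describes.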
Loss & Training
No new loss functions are introduced. In the task structure discovery phase, CI testing uses Fisher's z-test (linear Gaussian) or CMI (deep models) with a p-value threshold of \(0.05\); in the representation learning phase, the standard VAE reconstruction loss is combined with an \(\ell_1\) penalty \(\lambda \|M\|_1\) on the task-latent mask \(M\). CMI is estimated using MINE.
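A minimal sketch of the representation-learning objective, assuming a squared-error reconstruction term, the usual Gaussian KL term of a VAE, and a mask `M` of shape (tasks, latents). The function names, shapes, and the soft-thresholding step are illustrative assumptions: soft-thresholding is the standard proximal operator of the \(\ell_1\) penalty, shown here only to make the shrinkage visible, and is not claimed to be the paper's optimizer.

```python
import numpy as np


def vae_l1_objective(x, x_hat, mu, logvar, mask, lam=0.1):
    """Reconstruction + KL(q(z|x) || N(0, I)) + lam * ||mask||_1, returned term by term."""
    recon = float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))
    kl = float(np.mean(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)))
    l1 = float(lam * np.sum(np.abs(mask)))
    return {"recon": recon, "kl": kl, "l1": l1, "total": recon + kl + l1}


def soft_threshold(mask, amount):
    """Proximal step for the l1 penalty: shrink entries toward zero and clip small ones to zero."""
    return np.sign(mask) * np.maximum(np.abs(mask) - amount, 0.0)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))                  # toy batch of observations
x_hat = x + 0.1 * rng.normal(size=(8, 5))    # stand-in for decoder output
mu = 0.1 * rng.normal(size=(8, 3))           # stand-in for encoder mean
logvar = np.zeros((8, 3))                    # stand-in for encoder log-variance
mask = np.array([[0.9, 0.05, 0.0],
                 [0.02, 0.7, 0.6]])          # rows: tasks, columns: latents

terms = vae_l1_objective(x, x_hat, mu, logvar, mask, lam=0.1)
shrunk = soft_threshold(mask, 0.1)
print(terms)
print(shrunk)
```

Small mask entries are set exactly to zero by the shrinkage step while large entries survive, which is the mechanism by which the penalty selects a small set of latents per task.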
Key Experimental Results
Main Results (Synthetic + Real)
Synthetic data is generated from a collider DAG, with 10k samples and 10 random runs.
| Setting | Metric | CCA | Group Lasso | SelTask | Ours |
|---|---|---|---|---|---|
| \(T \in [8,20], M=T/5\) | Accuracy | Low | Medium | Good | Highest |
| \(T=20, M \in [2,10]\) | MCC | Low | Medium | Good | Highest |
SportsHHI video (multi-person, multi-task interleaving, mAP):
| Method | mAP |
|---|---|
| Alg.1 on observed \(o\) | Lower |
| LEAP | Medium |
| Ours (CI on latent \(s\)) | Highest |
Ablation Study
Task-relevant identifiability (synthetic nonlinear MLP data, \(R^2\)):
| Configuration | \(R^2\) relevant ↑ | \(R^2\) irrelevant ↓ | Note |
|---|---|---|---|
| VAE without \(\ell_1\) | High | High | Information retained but entangled |
| VAE + \(\ell_1\) on task-latent mask | High | Low | Both retained and separated |
Flux cat images + GAN (tasks: "wearing glasses/hat/tie"):
| Configuration | Visual Result |
|---|---|
| with sparsity | Edits only affect target attribute |
| without sparsity | Color and other irrelevant factors are entangled and changed |
Key Findings
- \(\ell_1\) regularization is the minimal addition that turns a generalist into a specialist: removing it immediately causes entanglement, and adding it achieves disentanglement.
- As the number of tasks \(M\) increases, performance degrades much more slowly than for the baselines, indicating that the collider + band CI approach is especially advantageous in complex structures.
- On real videos, "doing CI on latent rather than observed" yields the largest gap, confirming the intuition that identifiability requires a correct latent space.
Highlights & Insights
- Modeling "task as collider" is a nontrivial choice: it yields CI conditions that encode "same-task dependency" semantics directly in the graph, which is the foundation of the entire theory.
- Presenting the two-stage process of "first proving generalist is insufficient → then tightening with sparsity" elevates the necessity of sparsity from an empirical trick to a theoretical requirement.
- No interventions, no functional class, and i.i.d. data allowed: relaxing all three constraints while still achieving identifiability is a rare "major relaxation" result in this area.
Limitations & Future Work
- Identifiability is an asymptotic property; the paper does not analyze finite-sample errors, so there is no guarantee for data-scarce scenarios.
- Sparsity is enforced through the convex \(\ell_1\) surrogate, which can differ from the true \(\ell_0\) objective; the theoretical results assume minimal support, but in practice, careful tuning is needed.
- The latent space in experiments uses VAE, which itself depends on whether reconstruction retains sufficient information; the theoretical assumption of sufficient nonlinearity is hard to verify directly on real data.
- The authors also mention "identifiability-inspired architecture" as future work; currently, only standard estimators + regularization are used, and architectural innovation is still lacking.
Related Work & Insights
- vs nonlinear ICA (Hyvärinen series): That line of work relies on temporal contrast or auxiliary variables; this work allows disconnected sequences and the i.i.d. setting.
- vs causal representation learning (von Kügelgen et al.): They require intervention; this work is fully observational.
- vs subspace factorization (content-style): Their structure is fixed; this work allows unknown task numbers, structures, and assignments.
- vs SelTask / LEAP: These are the strongest empirical baselines, but only guarantee latent identifiability, not structure identifiability; this work provides both.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First fully nonparametric task-relevant identifiability with strong originality from the collider perspective.
- Experimental Thoroughness: ⭐⭐⭐ Mainly theoretical, experiments serve more as validation than comprehensive comparison; real data scale is limited.
- Writing Quality: ⭐⭐⭐⭐ Two-layer framework is clear, theorem-corollary-lemma structure is rigorous.
- Value: ⭐⭐⭐⭐ Provides a formal foundation for "generalist → specialist fine-tuning," serving as theoretical support for future work.