From Generalist to Specialist Representation

Conference: ICML 2026
arXiv: 2605.12733
Code: None
Area: Representation Learning Theory / Causal Identifiability
Keywords: identifiability, task-relevant representation, nonparametric, sparsity, world model

TL;DR

This paper provides the first fully nonparametric (no intervention, no functional constraints) two-layer hierarchical identifiability proof: the temporal-task structure is identifiable via CI tests from the collider perspective, and task-relevant latents can be separated from generalist representations using sparsity regularization.

Background & Motivation

Background: Learning latents from high-dimensional observations is central to world modeling. Without identifiability guarantees, however, a learned representation can be "observation-equivalent but content-misaligned" with the ground truth, related to it only through an arbitrary transformation \(\hat s = \phi(s)\). Classical linear ICA relies on non-Gaussianity, nonlinear ICA on auxiliary variables or functional constraints, and causal representation learning on interventions or counterfactuals; each approach requires "extra ingredients."

Limitations of Prior Work: (1) Most identifiability results aim for full latent recovery, but downstream tasks often only require a subset; (2) Existing task-relevant works (content-style separation, subspace factorization) only support fixed structures, with inflexible task numbers and combinations; (3) In the i.i.d. setting, temporal signals cannot be used, making theory especially challenging.

Key Challenge: It is nearly impossible to achieve both "full generality" (nonparametric + arbitrary task structure + allowing disconnected sequences) and "provable identifiability." The core issue is that component-wise full latent recovery requires overly strong assumptions, but identifying only the task-relevant subset is often sufficient and allows much weaker conditions.

Goal: To prove, under the most general nonparametric setting, that (1) the association structure between time steps and tasks is identifiable; (2) the task-relevant latents within each time step are identifiable.

Key Insight: Model "tasks" as colliders, i.e., common effects of the time steps that serve them (\(s_t \to a_t \to g_i\)). The confounder and mediator perspectives imply incorrect conditional independencies; only the collider encodes the true "multi-step interdependence within the same task." On a collider DAG, conditional independence tests plus sparsity regularization suffice for a complete solution.
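
The contrast between the three graph roles follows from standard d-separation facts. As a simplified illustration (an assumption made here for brevity: two steps \(t \neq t'\) of the same task, action nodes omitted, no other path between the two states, faithfulness holding):

```latex
\begin{aligned}
\text{collider:}\quad   & s_t \to g_i \leftarrow s_{t'}
  && s_t \perp\!\!\!\perp s_{t'}, \qquad s_t \not\perp\!\!\!\perp s_{t'} \mid g_i \\
\text{confounder:}\quad & s_t \leftarrow g_i \to s_{t'}
  && s_t \not\perp\!\!\!\perp s_{t'}, \qquad s_t \perp\!\!\!\perp s_{t'} \mid g_i \\
\text{mediator:}\quad   & s_t \to g_i \to s_{t'}
  && s_t \not\perp\!\!\!\perp s_{t'}, \qquad s_t \perp\!\!\!\perp s_{t'} \mid g_i
\end{aligned}
```

Only the collider row makes two steps of the same task dependent once the task is known, which is the signature the CI tests look for.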

Core Idea: Treat tasks as colliders, use band conditioning sets for CI tests to recover the temporal-task graph; then use \(\ell_1\) regularization to shrink the generalist's overcomplete representation to the minimal task subset. The entire theory requires neither intervention nor functional constraints.

Method

Overall Architecture

A two-layer hierarchical pipeline: The first layer (Section 3) starts from observation sequences \(\{o_t\}\) and task sets \(\{g_i\}\), using Algorithm 1 to recover the global temporal-task association structure via CI tests on segment pairs. The second layer (Section 4) disentangles, within each time step, the subset of latent state \(s_t\) truly relevant to each task using a VAE with sparsity regularization. The interface between the two layers is "task labels per step."

Key Designs

  1. Collider Modeling + Band CI Test:

    • Function: Provably identifies "which time step belongs to which task" under arbitrary task structures (interleaving, repetition, disconnection).
    • Mechanism: The sequence is divided into equal-length segments \(S_k\) of length \(L\ge 2\). The band conditioning set is defined as \(Z_{\text{band}}(k,v,i) = \{s_{kL-1}, s_{kL+1}, s_{vL-1}, s_{vL+1}\} \cup \{g_i\}\). Theorem 1 proves that, under the Markov and faithfulness assumptions, task \(g_i\) is associated with both \(S_k\) and \(S_v\) iff \(s_{kL}\) and \(s_{vL}\) are dependent conditioned on \(Z_{\text{band}}(k,v,i)\). The key is that the collider structure blocks every path that does not pass through \(g_i\): conditioning on the boundary states blocks the temporal channels, and the other tasks, being colliders that are not conditioned on, block their paths automatically. Corollary 1 further shows that representative states can be freely substituted.
    • Design Motivation: Confounder/mediator modeling leads to the counterintuitive conclusion that "two steps in the same task are conditionally independent"; only the collider preserves the semantics of "coordinated plan." The band conditioning set precisely isolates the dependency channel of a single task.
  2. Generalist Bound + Sparsity Tightening:

    • Function: First proves that a generalist model can only guarantee that the task-relevant latents are a superset of the ground truth, then shows that adding sparsity regularization tightens this to the exact set.
    • Mechanism: Under sufficient nonlinearity, Proposition 2 proves \(\|\mathcal{I}((J_{\hat u})_{i,\cdot})\| \ge \|\mathcal{I}((J_u)_{i,\cdot})\|\): the estimated number of task-relevant latents is at least the ground-truth number. Theorem 2 adds the sparsity constraint \(\|\mathcal{I}(J_{\hat u})\| \le \|\mathcal{I}(J_u)\|\), which tightens the inequality to equality, and derives, via a column permutation \(\pi\), \(\hat s_{t, \pi(I_k)} = h_k(s_{t, I_k})\) with \(h_k\) invertible, thereby disentangling task-relevant from task-irrelevant latents.
    • Design Motivation: The generalist result first shows that "a large model alone is insufficient," then sparsity provides a clear guide: "add this, and you get that," forming a rigorous logical chain from motivation to solution. The conclusion is significant: group-wise identifiability can be achieved in the i.i.d. setting without functional constraints.
  3. Bridge from Theory to Practice:

    • Function: Enables the purely nonparametric proof to be implemented on real videos/images.
    • Mechanism: CI tests are performed in the observation space (using conditional mutual information as a proxy to avoid high-dimensional statistics issues); when tasks are unknown, task representations are learned as latents. Latent estimation uses VAE + \(\ell_1\) regularization; for image generation, GANs with task-specific masks operate on the latents.
    • Design Motivation: Identifiability is an asymptotic property, but without a runnable algorithm, the community may see it as only theoretical; CMI proxies + standard VAE make it reproducible for any ML researcher.
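
The band CI test of Key Design 1 can be sketched in the linear-Gaussian case with a partial-correlation Fisher z-test. This is a minimal illustration, not the paper's Algorithm 1: it assumes scalar states on an AR(1) chain and a task that depends only on two representative states, and the names `fisher_z_pvalue` and `band_pvalue` are hypothetical.

```python
import math
import numpy as np

def fisher_z_pvalue(x, y, Z):
    """Two-sided p-value for H0: x independent of y given Z (linear-Gaussian)."""
    n = len(x)
    A = np.column_stack([Z, np.ones(n)])  # conditioning set plus intercept
    # Partial correlation = correlation of the residuals after regressing on Z.
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r = float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))
    z = math.atanh(r) * math.sqrt(n - Z.shape[1] - 3)  # Fisher z-statistic
    return math.erfc(abs(z) / math.sqrt(2.0))

def band_pvalue(s, g, k, v, L):
    """Test s_{kL} vs s_{vL} given the band set {s_{kL-1}, s_{kL+1}, s_{vL-1}, s_{vL+1}, g}."""
    Z = np.column_stack([s[:, k * L - 1], s[:, k * L + 1],
                         s[:, v * L - 1], s[:, v * L + 1], g])
    return fisher_z_pvalue(s[:, k * L], s[:, v * L], Z)

# Toy data (assumed setup): AR(1) state chain, task g is a collider of s_{1L} and s_{3L}.
rng = np.random.default_rng(0)
n, T, L = 10_000, 15, 3
s = np.zeros((n, T))
s[:, 0] = rng.normal(size=n)
for t in range(1, T):
    s[:, t] = 0.5 * s[:, t - 1] + rng.normal(size=n)
g = s[:, 1 * L] + s[:, 3 * L] + 0.5 * rng.normal(size=n)

p_linked = band_pvalue(s, g, k=1, v=3, L=L)    # both segments serve g
p_unlinked = band_pvalue(s, g, k=1, v=4, L=L)  # segment 4 does not serve g
print(f"linked pair p={p_linked:.3g}, unlinked pair p={p_unlinked:.3g}")
```

Under these assumptions the linked pair should be rejected at the 0.05 level, while the unlinked pair behaves like a draw from the null and is retained in most runs.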

Loss & Training

No new loss functions are introduced. In the task-structure discovery phase, CI testing uses Fisher's z-test (linear-Gaussian case) or CMI (deep models), with a p-value threshold of 0.05; in the representation learning phase, the standard VAE reconstruction loss is combined with an \(\ell_1\) penalty \(\lambda \|M\|_1\) on the task-latent mask \(M\). CMI is estimated using MINE.
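
The effect of the \(\lambda \|M\|_1\) term can be illustrated on a toy problem. The sketch below is a stand-in under stated assumptions, not the paper's VAE: a linear readout replaces the decoder, the mask absorbs the readout weights, and the objective \(\tfrac{1}{2n}\|y - Sm\|^2 + \lambda\|m\|_1\) is minimized by proximal gradient descent (ISTA). The names `soft_threshold` and `fit_sparse_mask` are hypothetical.

```python
import numpy as np

def soft_threshold(m, tau):
    """Proximal operator of tau * ||m||_1."""
    return np.sign(m) * np.maximum(np.abs(m) - tau, 0.0)

def fit_sparse_mask(S, y, lam, n_iter=500):
    """Minimize (1/2n)||y - S m||^2 + lam * ||m||_1 with ISTA."""
    n, d = S.shape
    step = 1.0 / np.linalg.eigvalsh(S.T @ S / n).max()  # 1 / Lipschitz constant
    m = np.zeros(d)
    for _ in range(n_iter):
        grad = S.T @ (S @ m - y) / n          # gradient of the smooth term
        m = soft_threshold(m - step * grad, step * lam)
    return m

# Toy data (assumed): 8 latents, only latents 0, 1 and 4 are task-relevant.
rng = np.random.default_rng(0)
n, d = 2000, 8
S = rng.normal(size=(n, d))
true_m = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
y = S @ true_m + 0.1 * rng.normal(size=n)

m_dense = np.linalg.lstsq(S, y, rcond=None)[0]  # no sparsity: every entry nonzero
m_hat = fit_sparse_mask(S, y, lam=0.05)         # with the l1 penalty
support = np.flatnonzero(m_hat)
print("dense nonzeros:", np.count_nonzero(m_dense), "| sparse support:", support)
```

The unregularized fit keeps all latents active (a superset of the relevant ones), while the \(\ell_1\) penalty drives the irrelevant entries to exactly zero, mirroring the superset-then-tighten argument in miniature.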

Key Experimental Results

Main Results (Synthetic + Real)

Synthetic data are generated from a collider DAG, with 10k samples and 10 random runs.

| Setting | Metric | CCA | Group Lasso | SelTask | Ours |
| --- | --- | --- | --- | --- | --- |
| \(T \in [8,20], M=T/5\) | Accuracy | Low | Medium | Good | Highest |
| \(T=20, M \in [2,10]\) | MCC | Low | Medium | Good | Highest |

SportsHHI video (multi-person, multi-task interleaving, mAP):

| Method | mAP |
| --- | --- |
| Alg.1 on observed \(o\) | Lower |
| LEAP | Medium |
| Ours (CI on latent \(s\)) | Highest |

Ablation Study

Task-relevant identifiability (synthetic nonlinear MLP data, \(R^2\)):

| Configuration | \(R^2\) relevant ↑ | \(R^2\) irrelevant ↓ | Note |
| --- | --- | --- | --- |
| VAE without \(\ell_1\) | High | High | Information retained but entangled |
| VAE + \(\ell_1\) on task-latent mask | High | Low | Both retained and separated |
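
How the two \(R^2\) columns separate an entangled from a disentangled representation can be sketched as follows. The exact regression protocol is not stated in this summary, so ordinary least squares and linear toy mixings are assumptions made here; `r2_lstsq` and the two toy representations are hypothetical.

```python
import numpy as np

def r2_lstsq(Z, target):
    """Mean R^2 of predicting each target column from Z by least squares."""
    A = np.column_stack([Z, np.ones(len(Z))])
    pred = A @ np.linalg.lstsq(A, target, rcond=None)[0]
    ss_res = ((target - pred) ** 2).sum(axis=0)
    ss_tot = ((target - target.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))

rng = np.random.default_rng(0)
n = 5000
s_rel = rng.normal(size=(n, 2))  # ground-truth task-relevant latents
s_irr = rng.normal(size=(n, 2))  # ground-truth task-irrelevant latents

# Entangled estimate: an invertible mix of all four latents.
mix = np.eye(4) + 0.5 * np.ones((4, 4))
z_entangled = np.hstack([s_rel, s_irr]) @ mix
# Separated estimate: an invertible function of the relevant latents only.
z_separated = s_rel @ np.array([[1.0, 0.5], [-0.5, 1.0]])

scores = {
    name: (r2_lstsq(z, s_rel), r2_lstsq(z, s_irr))
    for name, z in [("entangled", z_entangled), ("separated", z_separated)]
}
for name, (rel, irr) in scores.items():
    print(f"{name}: R2 relevant={rel:.3f}, R2 irrelevant={irr:.3f}")
```

In this toy setup the entangled representation predicts both groups well, whereas the separated one predicts only the relevant group, matching the High/High versus High/Low pattern in the table.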

Flux cat images + GAN (tasks: "wearing glasses/hat/tie"):

| Configuration | Visual Result |
| --- | --- |
| with sparsity | Edits only affect target attribute |
| without sparsity | Color and other irrelevant factors are entangled and changed |

Key Findings

  • \(\ell_1\) regularization is the "minimal sufficient gain" for turning a generalist into a specialist: removing it immediately causes entanglement; adding it achieves disentanglement.
  • As the number of tasks \(M\) increases, performance degrades much more slowly than for the baselines, indicating that the collider + band CI approach is especially advantageous in complex structures.
  • On real videos, "doing CI on latent rather than observed" yields the largest gap, confirming the intuition that identifiability requires a correct latent space.

Highlights & Insights

  • Modeling "task as collider" is a nontrivial choice—it enables CI conditions that precisely encode "same-task dependency" semantics into the graph, which is the foundation for the entire theory.
  • Presenting the two-stage process of "first proving generalist is insufficient → then tightening with sparsity" elevates the necessity of sparsity from an empirical trick to a theoretical requirement.
  • No intervention, no functional class, i.i.d. allowed—relaxing all three constraints while still achieving identifiability is a rare "major relaxation" result in this area.

Limitations & Future Work

  • Identifiability is an asymptotic property; the paper does not analyze finite-sample errors, so there is no guarantee for data-scarce scenarios.
  • Sparsity is enforced through the convex \(\ell_1\) surrogate, which leaves a gap to the true \(\ell_0\) objective; the theoretical results assume minimal support, but in practice careful tuning is needed.
  • The latent space in experiments uses VAE, which itself depends on whether reconstruction retains sufficient information; the theoretical assumption of sufficient nonlinearity is hard to verify directly on real data.
  • The authors also mention "identifiability-inspired architecture" as future work; currently, only standard estimators + regularization are used, and architectural innovation is still lacking.

Comparison with Related Work

  • vs nonlinear ICA (Hyvärinen series): Those methods rely on temporal contrast or auxiliary variables; this work allows disconnected sequences and the i.i.d. setting.
  • vs causal representation learning (von Kügelgen et al.): They require intervention; this work is fully observational.
  • vs subspace factorization (content-style): Their structure is fixed; this work allows unknown task numbers, structures, and assignments.
  • vs SelTask / LEAP: These are the strongest empirical baselines, but only guarantee latent identifiability, not structure identifiability; this work provides both.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First fully nonparametric task-relevant identifiability with strong originality from the collider perspective.
  • Experimental Thoroughness: ⭐⭐⭐ Mainly theoretical, experiments serve more as validation than comprehensive comparison; real data scale is limited.
  • Writing Quality: ⭐⭐⭐⭐ Two-layer framework is clear, theorem-corollary-lemma structure is rigorous.
  • Value: ⭐⭐⭐⭐ Provides a formal foundation for "generalist → specialist fine-tuning," serving as theoretical support for future work.