Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts¶

Conference: ACL 2025
arXiv: 2501.02009
Code: None
Area: LLM/NLP
Keywords: steering vectors, cross-model transfer, representation alignment, linear transformation, LLM interpretability

TL;DR¶

Proposes the L-Cross Modulation method, which transfers concept steering vectors (SVs) from one LLM to another via simple linear transformations to achieve behavioral control. Three key findings are identified: (1) cross-model SV transfer is effective; (2) different concepts share the same transformation matrix; (3) SVs of smaller models can control larger models (weak-to-strong transfer).

Background & Motivation¶

Background: Steering vectors (SVs) are directions in an LLM's hidden-state space that encode a concept, and adding them to the hidden states can control generation (e.g., guiding the generation of harmful/harmless content). However, existing research applies SVs only within a single LLM.

Limitations of Prior Work: Different LLMs have distinct representation spaces, preventing the direct cross-model application of SVs—for instance, an SV extracted from Llama cannot be directly applied to Qwen.

Key Challenge: If different LLMs learn fundamentally different representations of the same concept, cross-model control would be unfeasible. However, the Platonic Representation Hypothesis suggests that different networks converge toward a shared statistical model of reality.

Goal: To verify whether concept representations share an underlying structure across different LLMs and whether cross-model transfer can be achieved via simple transformations.

Key Insight: Analogous to Plato's Allegory of the Cave—different LLMs are like different "prisoners" observing different "shadows" of the same reality, where a linear transformation serves as the bridge between these "shadows."

Core Idea: Concept representations in different LLMs possess a linearly transformable shared structure, enabling weak-to-strong cross-model control.

Method¶

Overall Architecture¶

(1) Encode sentences using source and target LLMs on a shared corpus \(\rightarrow\) (2) Apply ordinary least squares (OLS) to solve for the linear transformation matrix \(\mathbf{T}\) that maps source representations to target representations \(\rightarrow\) (3) Apply \(\mathbf{T}\) to transform the source model's SV and inject it into the target model's hidden states for behavioral control.
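
The three steps above can be sketched with NumPy on synthetic data. This is a minimal illustration, not the authors' implementation: `fit_linear_map`, `transfer_sv`, and `steer` are hypothetical helper names, the random arrays stand in for real sentence encodings, the additive injection `hidden + beta * sv` is an assumed (commonly used) steering form, and `beta=4.0` is an arbitrary value.

```python
import numpy as np


def fit_linear_map(src_reps, tgt_reps):
    """Least-squares fit of T so that src_reps @ T approximates tgt_reps."""
    # lstsq returns the minimum-norm least-squares solution,
    # which agrees with the pseudoinverse closed form.
    T, *_ = np.linalg.lstsq(src_reps, tgt_reps, rcond=None)
    return T


def transfer_sv(sv_src, T):
    """Map a source-model steering vector into the target model's space."""
    return sv_src @ T


def steer(hidden, sv, beta):
    """Assumed additive injection: shift every hidden state along sv."""
    return hidden + beta * sv


rng = np.random.default_rng(0)
n, d_src, d_tgt = 500, 16, 24  # toy sizes, not real hidden dimensions

# Step 1 (stand-in): "encode" a shared corpus with both models.
# Toy assumption: target encodings are a noisy linear image of source encodings.
true_map = rng.normal(size=(d_src, d_tgt))
src_reps = rng.normal(size=(n, d_src))
tgt_reps = src_reps @ true_map + 0.01 * rng.normal(size=(n, d_tgt))

# Step 2: solve for the transformation matrix.
T = fit_linear_map(src_reps, tgt_reps)

# Step 3: transform a source SV and inject it into target hidden states.
sv_src = rng.normal(size=d_src)
sv_tgt = transfer_sv(sv_src, T)
hidden = rng.normal(size=(5, d_tgt))
steered = steer(hidden, sv_tgt, beta=4.0)

print(T.shape, steered.shape)
```

Because the toy data is constructed to be linearly related, the fit recovers `true_map` almost exactly; with real models the quality of the fit is precisely what the experiments measure.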

Key Designs¶

Learning the Linear Transformation Matrix \(\mathbf{T}\):
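- Function: Solve the OLS problem \(\mathbf{T} = \arg\min_{\mathbf{T}'} \|\lambda_\mathcal{D}^{m_t} - \lambda_\mathcal{D}^{m_s} \mathbf{T}'\|^2\), where \(\lambda_\mathcal{D}^{m_s}\) and \(\lambda_\mathcal{D}^{m_t}\) are the source and target representations of the shared corpus \(\mathcal{D}\).
- Closed-form solution: \(\mathbf{T} = (\lambda_\mathcal{D}^{m_s\top}\lambda_\mathcal{D}^{m_s})^\dagger \lambda_\mathcal{D}^{m_s\top}\lambda_\mathcal{D}^{m_t}\), where \(\dagger\) denotes the Moore-Penrose pseudoinverse.
- Design Motivation: A linear map preserves linear relationships between concept directions (sums and scalings of SVs map to sums and scalings), so successful transfer with such a simple map helps verify the universality of representations.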

Corpus Selection:
- Concept-related contrastive texts \(Y_W\) can be used (more precise).
- Concept-unrelated general texts can also be used (more general); experiments show that a \(\mathbf{T}\) fitted on a general corpus still transfers SVs effectively across concepts.

Weak-to-Strong Transfer:
- SVs extracted from Qwen-0.5B can effectively control the behavior of Qwen-7B after transformation.
- This suggests that smaller models have already captured the core directions of concepts.
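
The closed-form solution follows from the normal equations. A short derivation, writing \(A = \lambda_\mathcal{D}^{m_s}\) (shape \(n \times d_s\)) and \(B = \lambda_\mathcal{D}^{m_t}\) (shape \(n \times d_t\)) as shorthand; these two symbols are introduced here for brevity, and the norm is assumed to be the Frobenius norm:

\[
\begin{aligned}
J(\mathbf{T}') &= \|B - A\mathbf{T}'\|_F^2, \\
\nabla_{\mathbf{T}'} J &= -2A^\top (B - A\mathbf{T}') = 0 \;\Longrightarrow\; A^\top A\,\mathbf{T} = A^\top B, \\
\mathbf{T} &= (A^\top A)^\dagger A^\top B = A^\dagger B .
\end{aligned}
\]

The resulting \(\mathbf{T}\) has shape \(d_s \times d_t\), so the source and target models do not need the same hidden size. When \(A^\top A\) is invertible the pseudoinverse reduces to the ordinary inverse; otherwise it selects the minimum-norm solution.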

Key Experimental Results¶

RQ1: Cross-model Transfer Effectiveness (11 Concepts)¶

Concept	No Mod (no steering)	Self Mod (target's own SV)	L-Cross (Qwen→Llama2)
Harmfulness (↑)	0%	90%+	~90% (close to Self Mod)
Happiness (↑)	Low	High	Close to Self
Refusal (↓)	Baseline	Large change	Effective change

RQ2: Cross-concept Generalization of the Transformation Matrix¶

\(\mathbf{T}_{Y_{W_1}}\) learned from concept \(W_1\) data can effectively transfer SVs of concept \(W_2\).
Even a \(\mathbf{T}\) learned from a general corpus (concept-unrelated text) is effective.
This suggests that the SVs of different concepts share an underlying cross-model transformation structure.
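
The evaluation protocol behind this finding can be sketched as follows: fit \(\mathbf{T}\) on text unrelated to the concept, then compare a transferred SV with the target model's own SV for a held-out concept. The sketch below is an assumption-laden toy: `shared_map`, `encode_pair`, `cosine`, and the concept names are hypothetical, and the data is generated so that a shared linear map exists by construction. It therefore illustrates the procedure only and is not evidence for the claim.

```python
import numpy as np

rng = np.random.default_rng(1)
d_src, d_tgt, n = 12, 20, 400  # toy sizes

# Toy assumption: one linear map relates the two spaces for all inputs.
shared_map = rng.normal(size=(d_src, d_tgt))


def encode_pair(n_rows):
    """Stand-in for encoding the same sentences with both models."""
    src = rng.normal(size=(n_rows, d_src))
    tgt = src @ shared_map + 0.05 * rng.normal(size=(n_rows, d_tgt))
    return src, tgt


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Fit T on "general" text only; no concept data is used here.
gen_src, gen_tgt = encode_pair(n)
T = np.linalg.pinv(gen_src) @ gen_tgt

# Held-out concepts: compare the transferred SV with the native target SV.
scores = {}
for name in ["concept_a", "concept_b"]:
    sv_src = rng.normal(size=d_src)
    sv_tgt_native = sv_src @ shared_map  # what the target would extract itself
    sv_tgt_mapped = sv_src @ T           # what cross-model transfer produces
    scores[name] = cosine(sv_tgt_mapped, sv_tgt_native)

print(scores)
```

With real models the native target SV would come from the target model's own activations, and a high cosine similarity (or a matching behavioral effect) for concepts unseen during fitting is what would indicate a shared transformation.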

RQ3: Weak-to-Strong Transfer¶

Source Model	Target Model	Effect
Qwen-0.5B	Qwen-7B	Effective (close to same-model Self Mod)
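
One structural consequence of weak-to-strong transfer is worth making explicit: since \(\mathbf{T}\) maps a smaller space into a larger one, every transferred SV lies in a subspace of the large model whose dimension is at most the small model's hidden size. The sketch below checks this on random data. It is an observation about linear maps in general rather than a result reported in the paper, and the sizes `d_small` and `d_large` are toy values, not the real hidden dimensions of the Qwen models.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_small, d_large = 300, 8, 32  # toy sizes (assumption)

# Random stand-ins for encodings; only shapes and rank matter here.
small_reps = rng.normal(size=(n, d_small))
large_reps = rng.normal(size=(n, d_large))

# T maps the small model's space into the large model's space.
T = np.linalg.pinv(small_reps) @ large_reps  # shape (d_small, d_large)

# Transfer many small-model SVs at once.
svs_small = rng.normal(size=(50, d_small))
svs_large = svs_small @ T                    # shape (50, d_large)

# The transferred SVs span at most d_small directions of the large space.
rank = np.linalg.matrix_rank(svs_large)
print(T.shape, svs_large.shape, rank)
```

Under this view, effective weak-to-strong control means the concept directions that matter in the larger model fall (approximately) inside that low-dimensional image.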

Key Findings¶

Linear transformations are sufficient to align the concept spaces of different LLMs (without requiring complex non-linear mapping).
The same transformation matrix is effective across multiple concepts \(\rightarrow\) the relational structures between concepts are consistent across models.
Transfer from small to large models is effective \(\rightarrow\) the core directions of concepts are shared across models of different scales.

Highlights & Insights¶

The analogy to Plato's Allegory of the Cave is highly vivid—different LLMs observe different "shadows" of the same "reality," yet the underlying structure remains consistent.
Linear transferability provides empirical support for the validity of the Platonic Representation Hypothesis at the conceptual level of LLMs.
Weak-to-strong transfer has direct implications for AI safety: smaller models can be utilized as "concept probes" to control larger models.

Limitations & Future Work¶

Validated only on three model series from two families (Llama2, Llama3.1, Qwen2).
Linear transformations may not apply to vastly different architectures.
The scaling factor \(\beta\) still requires manual adjustment.
Evaluation in complex scenarios such as multi-turn dialogues has not yet been conducted.

Related Work & Insights¶

vs Activation Steering (CAA/RepE): Extends activation steering from single-model to cross-model scenarios, positioned as the first systematic study of cross-model SV transfer.
vs Platonic Representation Hypothesis: Provides empirical support at the conceptual level.
vs Weak-to-Strong Generalization (Burns et al.): Shows empirically that weak-to-strong transfer is also viable for concept control.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Cross-model concept transfer and weak-to-strong control provide an entirely new perspective.
Experimental Thoroughness: ⭐⭐⭐⭐ 11 concepts, three RQs that build on one another, and comparisons across multiple models.
Writing Quality: ⭐⭐⭐⭐⭐ Elegant analogy to Plato's allegory, with clear narrative logic.
Value: ⭐⭐⭐⭐⭐ Profound impact on LLM interpretability and AI safety research.