MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings¶
Conference: AAAI 2026 · arXiv: 2503.03008 · Code: HuggingFace · Area: Code Intelligence · Keywords: Self-distillation, Multi-exit network, Code retrieval, Early exit, Modular deployment
TL;DR¶
This paper proposes ModularStarEncoder (MoSE), a 1B-parameter multi-exit encoder that significantly enhances early-layer representations via a novel self-distillation mechanism in which higher layers guide the training of lower layers. MoSE surpasses all open-source models on code understanding benchmarks such as CodeSearchNet while supporting flexible compute–accuracy trade-offs at deployment time.
Background & Motivation¶
State of the Field¶
Large language models have achieved remarkable progress in NLP, yet their substantial computational demands pose severe deployment challenges. The community has proposed several mitigation strategies: quantization to reduce numerical precision, knowledge distillation to train smaller student models, and pruning to remove low-impact parameters. Model families such as LLaMA, Qwen, and Mistral represent a broader shift toward more efficient architectures.
Limitations of Prior Work¶
High cost of conventional distillation: Separate training of teacher and student models is required, with costs scaling linearly with the number of target models.
Fixed inference cost: Standard models must traverse all layers before producing output, precluding adaptive computation based on task difficulty.
Underutilized value of intermediate layers: Work such as SimCLR and Valeriani et al. shows that intermediate, rather than final, layers often hold the most semantically rich representations.
Limitations of NSP loss: Traditional Next Sentence Prediction loss provides negligible benefit after fine-tuning and inefficiently exploits the long-context window for code inputs.
Root Cause¶
How can multiple sub-models of varying computational cost be trained simultaneously within a single model, while ensuring that early layers also produce high-quality representations?
Starting Point¶
Multiple exit points are introduced within a single Transformer. Multi-layer losses propagate training signals from higher layers to lower layers (self-distillation) without requiring an external teacher model. In addition, an In-Context Classification (ICC) loss replaces NSP to improve context window utilization.
Method¶
Overall Architecture¶
A 1B-parameter bidirectional encoder based on the StarCoder-2 architecture with 36 layers. Exit heads are inserted at layers 4, 9, 18, 27, and 36. Each exit jointly computes MLM and ICC losses, which are combined via a weighted sum (self-distillation). At inference time, users may select any exit point, yielding a minimum footprint of 160M parameters.
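A minimal PyTorch sketch of the exit-head computation described above may help fix ideas. It is not the released implementation: it assumes a backbone that exposes all per-layer hidden states (e.g., a HuggingFace encoder called with output_hidden_states=True), and the head names, the way the layer-index signal is injected, and the use of the first token for the ICC prediction are assumptions; the exit layers and the \(i/|I|\) weighting follow the paper's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EXIT_LAYERS = [4, 9, 18, 27, 36]   # layers with exit heads
NUM_LAYERS = 36                    # total depth |I|

class MultiExitHeads(nn.Module):
    """Shared MLM/ICC heads applied at every exit layer (illustrative sketch)."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        # One head is shared across exits; a learned layer-index embedding
        # tells the head which depth it is decoding from.
        self.layer_embed = nn.Embedding(NUM_LAYERS + 1, hidden_size)
        self.mlm_head = nn.Linear(hidden_size, vocab_size)  # masked-token prediction
        self.icc_head = nn.Linear(hidden_size, 2)           # same-repository classification

    def forward(self, hidden_states, mlm_labels, icc_labels):
        # hidden_states: tuple of per-layer activations, index 0 = embedding output
        total_loss = hidden_states[0].new_zeros(())
        for layer in EXIT_LAYERS:
            idx = torch.tensor(layer, device=hidden_states[layer].device)
            h = hidden_states[layer] + self.layer_embed(idx)
            mlm_loss = F.cross_entropy(self.mlm_head(h).flatten(0, 1),
                                       mlm_labels.flatten(), ignore_index=-100)
            icc_loss = F.cross_entropy(self.icc_head(h[:, 0]), icc_labels)
            alpha = layer / NUM_LAYERS          # deeper exits receive larger weight
            total_loss = total_loss + alpha * (mlm_loss + icc_loss)
        return total_loss
```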
Key Designs¶
- Self-Distillation Multi-Layer Loss:
- Losses are computed independently at selected layers \(\iota = \{4, 9, 18, 27, 36\}\)
- All layers share a classification head, augmented with layer-index positional encodings to differentiate them
- Weighting coefficient \(\alpha_i = i/|I|\), where \(I = \{1, \dots, 36\}\) is the set of all layers, so deeper exits receive greater weight
- Total loss: \(\mathcal{L} = \sum_{i \in \iota} \alpha_i \, \mathcal{L}_i\)
- Effect: Training signals from higher layers propagate naturally to shared lower-layer parameters, encouraging lower layers to learn better representations
- Design Motivation: A single training run yields multiple models of varying performance, eliminating the redundant cost of repeated distillation
- In-Context Classification (ICC) Loss (a pair-construction sketch follows this list):
- Replaces the conventional Next Sentence Prediction objective
- Code snippets are randomly concatenated (separated by [SEP]), with a 50% probability of originating from different repositories
- Classification target: whether the concatenated input derives from the same repository
- Advantages:
- Increases input density: average input length grows from 630 tokens to 1,300 tokens
- Repositories are naturally modular and contain multilingual files, facilitating cross-lingual understanding
- Combined pre-training objective at each exit: \(\mathcal{L}_i = \mathcal{L}_{MLM} + \mathcal{L}_{ICC}\)
- Design Motivation: NSP provides negligible benefit after fine-tuning and wastes context window capacity
- Architectural Modifications:
- Based on StarCoder-2: 36 hidden layers, 1B parameters
- GQA (16 attention heads, 4 KV heads) + RoPE (\(\theta=10^6\))
- Hidden dimension 1024, intermediate dimension 12288
- Key modifications:
- Causal mask removed → bidirectional attention
- Sliding window attention → full attention (to avoid receptive field limitations and ensure modularity)
- FlashAttention V2 integrated
- Context length: 2048 tokens
- SynthCoNL Dataset:
- Seeded from the CodeSearchNet dataset
- Code translation performed using Qwen2.5-Coder-7B-Instruct
- Generates 1,071,367 (natural language, code A, code B) triplets
- Code B spans Go, Ruby, JS, Python, C++, PHP, C, and Java
- Near-deduplication: MinHash LSH over character-level 5-grams, 256 permutations, Jaccard similarity threshold 0.7
- Design Motivation: Extends text–code benchmarks and adds cross-lingual code–code retrieval capability
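The near-deduplication step for SynthCoNL can be sketched with the `datasketch` library. The character-level 5-gram shingling, 256 permutations, and 0.7 Jaccard threshold follow the description above; the greedy keep-or-drop policy and all identifiers are assumptions rather than the authors' pipeline.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM, THRESHOLD, NGRAM = 256, 0.7, 5

def minhash_of(code: str) -> MinHash:
    """MinHash signature over character-level 5-gram shingles."""
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(code) - NGRAM + 1, 1)):
        m.update(code[i:i + NGRAM].encode("utf-8"))
    return m

def near_dedup(snippets: dict[str, str]) -> set[str]:
    """Greedily keep snippets whose MinHash does not collide with one already kept."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = set()
    for sid, code in snippets.items():
        mh = minhash_of(code)
        if lsh.query(mh):        # near-duplicate of an already-kept snippet
            continue
        lsh.insert(sid, mh)
        kept.add(sid)
    return kept
```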
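For the ICC objective described earlier in this list, constructing a training example is conceptually simple. The sketch below is an assumption-laden illustration (only two snippets, plain-string [SEP] joining, uniform sampling, at least two repositories in the corpus), not the authors' data pipeline.

```python
import random

def make_icc_example(corpus_by_repo: dict[str, list[str]]) -> tuple[str, int]:
    """Concatenate two code snippets; label 1 if they come from the same repository."""
    repos = list(corpus_by_repo)                 # assumes len(repos) >= 2
    anchor_repo = random.choice(repos)
    anchor = random.choice(corpus_by_repo[anchor_repo])
    if random.random() < 0.5:                    # 50%: partner from the same repository
        partner, label = random.choice(corpus_by_repo[anchor_repo]), 1
    else:                                        # 50%: partner from a different repository
        other_repo = random.choice([r for r in repos if r != anchor_repo])
        partner, label = random.choice(corpus_by_repo[other_repo]), 0
    return f"{anchor} [SEP] {partner}", label
```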
Loss & Training¶
Pre-training:
- Batch size 4M tokens, maximum context 2048 tokens
- 245,000 steps, processing approximately 1T tokens (TheStackV2 dataset)
- AdamW optimizer (\(\beta_1=0.9\), \(\beta_2=0.95\), \(\varepsilon=1\text{e-}6\), weight decay \(=0.1\))
- Multi-step learning rate schedule: 4,000 warmup steps, with staged decay at steps 120K / 185K / 220K / 230K / 240K
- 512 × NVIDIA Ampere (64 GB) GPUs, approximately 450,000 GPU hours
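A short sketch of this optimization setup, assuming a per-step PyTorch training loop: the betas, epsilon, weight decay, warmup length, and decay milestones follow the list above, while the base learning rate and the per-milestone decay factor are not stated here and are assumptions.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

WARMUP = 4_000
MILESTONES = (120_000, 185_000, 220_000, 230_000, 240_000)
GAMMA = 0.5        # per-milestone decay factor: an assumption, not given above

def lr_lambda(step: int) -> float:
    if step < WARMUP:                                     # linear warmup over 4,000 steps
        return step / WARMUP
    return GAMMA ** sum(step >= m for m in MILESTONES)    # staged decay at each milestone

def build_optimizer(model: torch.nn.Module, base_lr: float = 3e-4):  # base LR is illustrative
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr,
                            betas=(0.9, 0.95), eps=1e-6, weight_decay=0.1)
    return opt, LambdaLR(opt, lr_lambda)
```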
Fine-tuning (Retrieval):
- CLIP-style loss combined with the multi-layer approach
- Five distinct projection heads corresponding to the five exit points
- Batch size 2048, with text–code and code–code pairs uniformly distributed
- Data augmentation: high-frequency tokens in code randomly replaced with 30% probability
- Learning rate \(1\text{e-}5\), temperature parameter \(10.0\)
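The retrieval objective is a standard CLIP-style symmetric contrastive loss with in-batch negatives; a minimal sketch follows. The temperature of 10.0 comes from the list above, while pooling and projection details are omitted, and how the loss is replicated across the five exit-specific projection heads (the multi-layer combination) is left out for brevity.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                    temperature: float = 10.0) -> torch.Tensor:
    """Symmetric contrastive loss over an in-batch similarity matrix."""
    q = F.normalize(query_emb, dim=-1)             # (batch, dim)
    c = F.normalize(code_emb, dim=-1)              # (batch, dim)
    logits = temperature * q @ c.T                 # scaled cosine similarities
    targets = torch.arange(q.size(0), device=q.device)
    # Row i should match column i: queries-to-codes and codes-to-queries.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```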
Fine-tuning (Classification on BigCloneBench):
- Input format: [SEP] snippet-1 [SEP] snippet-2 [CLS]
- Final [CLS] token used for classification
- Learning rate \(1\text{e-}5\), 2,000 warmup steps, batch size 64, 14,000 steps
Key Experimental Results¶
Main Results¶
| Model | Ruby | JS | Go | Python | Java | PHP | avg MRR | avg NDCG | POJ104 mAP |
|---|---|---|---|---|---|---|---|---|---|
| MoSE | 74.1 | 74.0 | 82.5 | 92.5 | 78.7 | 84.5 | 81.0 | 84.2 | 75.9 |
| CodeT5+ | 78.0 | 71.3 | 92.7 | 75.8 | 76.2 | 70.1 | 77.4 | - | 24.5 |
| UniXcoder | 74.0 | 68.4 | 91.5 | 72.0 | 72.6 | 67.6 | 74.4 | - | 41.0 |
| ModernBERT-large | - | - | - | - | - | - | - | 59.5 | 27.3 |
| OpenAI Embedding | 84.7 | 85.3 | 95.9 | 99.8 | 90.1 | 95.6 | 91.9 | 93.3 | 82.9 |
| Model | BigCloneBench F1 | Notes |
|---|---|---|
| MoSE (L4) | 93.0 | Shallowest exit |
| MoSE (L9) | 93.4 | - |
| MoSE (L18) | 94.2 | Best layer |
| MoSE (L27) | 94.1 | - |
| MoSE (L36) | 94.1 | - |
| CodeT5+ (770M) | 95.1 | Only +0.9 higher |
| UniXcoder | 95.2 | - |
Ablation Study¶
| Configuration | Result | Notes |
|---|---|---|
| Single-exit baseline L4 | Baseline | Trained only at layer 4 |
| Single-exit baseline L9 | Baseline | Trained only at layer 9 |
| Self-Distilled L9 | +4.36% avg Recall@1 | Multi-layer loss yields the largest gain |
| Self-Distilled L18–36 | Stable high performance | Performance gap narrows from layer 18 onward |
| ICC vs. NSP (CoIR CSN-CCR) | 10.1 vs. 6.2 | ICC substantially better |
| ICC vs. NSP (CoIR CT-DL) | 32.0 vs. 31.0 | ICC slightly better or comparable |
Key Findings¶
- Open-source SOTA: MoSE achieves MRR = 81.0 on CodeSearchNet, surpassing CodeT5+ by 3.6 points (81.0 vs. 77.4) and substantially narrowing the gap with closed-source OpenAI embeddings.
- 90% compute reduction with only 6.4% performance loss: Moving from layer 36 to layer 4 reduces FLOPs by approximately 90% while text–code retrieval MRR drops by only 6.4%.
- Task-dependent optimal exit layer: The optimal layer for text–code retrieval is layer 18, while code–code retrieval peaks at shallower layers, corroborating the uneven distribution of semantic information across layers.
- Self-distillation vs. single-exit: Multi-layer loss consistently outperforms single-exit baselines at all layers, with the largest gain at layer 9 (+4.36% Recall@1).
- ICC > NSP: ICC significantly outperforms NSP on cross-context retrieval tasks while also providing denser input representations.
- Permutation test: Different layers produce significantly distinct similarity scores (\(p < 0.001\)), confirming that the model learns qualitatively different representations at each layer.
Highlights & Insights¶
- The self-distillation design is elegant: higher layers naturally guide lower layers without a separate teacher model, yielding multiple model variants from a single training run.
- The ICC loss is cleverly designed: it simultaneously improves context utilization (630 → 1,300 tokens) and leverages repository-level structure to enhance cross-lingual understanding.
- The paper reveals a systematic relationship between task type and optimal exit layer: cross-modal tasks require deeper layers, while same-modal tasks can be resolved at shallower layers.
- The SynthCoNL dataset construction pipeline is generalizable: code translation models can be used to generate cross-lingual training pairs at low cost.
- The modular design affords users significant flexibility, enabling on-demand selection between 160M and 1B parameters.
Limitations & Future Work¶
- Computational resource constraints limit thorough hyperparameter tuning and pre-training ablation studies.
- The impact of fine-tuning on synthetic code translation data remains unclear.
- A substantial gap from OpenAI Embedding persists (81.0 vs. 91.9 MRR), with the closed-source model's scale and training data remaining opaque.
- The choice of exit points (4, 9, 18, 27, 36) is empirically determined; more principled exit configurations warrant exploration.
- Combinatorial use of multiple exit points and a more systematic study of the relationship between task type and depth could be investigated.
- Evaluation is limited to code understanding tasks; extension to natural language tasks remains as future work.
Related Work & Insights¶
- BranchyNet: Early exit architecture → realized at scale in large Transformers in this work
- Matryoshka Representation Learning: Embedding dimension pruning → this work prunes layer depth instead
- SimCLR / Valeriani: Final layers are not necessarily optimal → the core assumption of this paper
- CodeT5+: Multi-task code pre-training → replaced here by self-distillation + ICC
- Knowledge Distillation (Sanh 2019): Teacher–student paradigm → this work eliminates dependence on a teacher model
Rating¶
- Novelty: ⭐⭐⭐⭐ (The combination of self-distillation and ICC is a first for code encoders)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (CodeSearchNet + POJ104 + BigCloneBench + CoIR with thorough ablations)
- Writing Quality: ⭐⭐⭐⭐⭐ (Clear structure, intuitive figures, rigorous argumentation)
- Value: ⭐⭐⭐⭐⭐ (High practical deployment value; significant open-source model and dataset contributions)