# Towards Better Code Understanding in Decoder-Only Models with Contrastive Learning

## Metadata
- Title: Towards Better Code Understanding in Decoder-Only Models with Contrastive Learning
- Authors: Jiayi Lin, Yanlin Wang, Yibiao Yang, Lei Zhang, Yutao Xie
- Conference: AAAI 2026
- arXiv: 2406.12326
- Code: GitHub
- Area: Code Understanding, Contrastive Learning, Self-Supervised Learning
## TL;DR
This paper proposes CL4D, a contrastive learning framework that adapts pretrained decoder-only code generation models to code understanding tasks (code search, clone detection) via continued pretraining, achieving performance comparable to or better than encoder-only models of equivalent scale without pretraining a new understanding model from scratch.
## Background & Motivation

### State of the Field
- Large-scale decoder-only code generation models (e.g., StarCoder, CodeLlama, DeepSeek-Coder) have achieved remarkable success in code generation, yet perform poorly on code understanding tasks such as code search and clone detection.
- This gap is primarily attributed to the autoregressive training objective (next-token prediction), which emphasizes generation over semantic understanding.
- Encoder-only models (e.g., CodeBERT, UniXcoder, CodeSage) excel at understanding tasks but are far smaller than state-of-the-art decoder-only models.
### Root Cause
- Decoder-only models benefit from larger scale and richer training data, but their unidirectional attention mechanism limits fine-grained code understanding.
- Pretraining encoder-only models of equivalent scale from scratch is computationally prohibitive (the largest existing CodeSage is only 1.3B; CoLSBERT is only 1.5B).
- Core question: Can the code understanding capability of existing decoder-only models be enhanced without retraining them from scratch?
### Design Motivation
- Leverage the rich code knowledge already encoded in existing decoder-only models and transfer it via contrastive learning.
- Explore the feasibility of a single decoder-only architecture that serves both code generation and code understanding.
## Method

### Overall Architecture: CL4D (Contrastive Learning for Decoder-only)
CL4D is a contrastive learning framework that performs continued pretraining on pretrained decoder-only code generation models to improve their representation quality.
#### 1. Data Construction
- Six programming languages (Python, Java, Go, PHP, JavaScript, Ruby) are extracted from The Stack dataset.
- Tree-Sitter is used to extract bimodal data and construct (query, code) pairs, where the query is the first line of a function's docstring (see the sketch after this list).
- CodeSearchNet filtering rules are applied to improve data quality.
- Millions of training samples are constructed for continued pretraining.
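The paper uses Tree-Sitter to cover all six languages; as a minimal single-language stand-in, Python's standard `ast` module illustrates the same pairing logic (the CodeSearchNet-style filtering is omitted here):

```python
import ast

def extract_pairs(source: str):
    """Yield (query, code) pairs: the query is the first docstring line."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # bimodal: keep only documented functions
                query = doc.strip().splitlines()[0]
                yield query, ast.get_source_segment(source, node)
```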
#### 2. Model Architecture
Two strategies for extracting code representations from decoder-only models are explored:

- Last Token: the embedding of the last token in the final layer is used as the code representation (due to unidirectional attention, only the last token aggregates all preceding context).
- Average: the mean of all token embeddings in the final layer is used.
A dual-encoder architecture is adopted, where two weight-sharing Transformer decoders separately encode the query and the code.
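A minimal sketch of both strategies on top of a Hugging Face decoder-only model (the checkpoint and exact API usage here are illustrative assumptions, not the authors' code):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any decoder-only checkpoint works in principle; the paper applies CL4D to
# CodeGPT, CodeGen, SantaCoder, phi-1, and DeepSeek-Coder.
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-style models ship without a pad token
model = AutoModel.from_pretrained("gpt2")

def encode(texts, strategy="average"):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state            # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1).float() # (B, T, 1)
    if strategy == "average":
        # Mean over non-pad positions only.
        return (hidden * mask).sum(1) / mask.sum(1)
    # "Last token": locate the last non-pad position per sequence.
    last = batch["attention_mask"].sum(1) - 1            # (B,)
    return hidden[torch.arange(hidden.size(0)), last]
```

Because the dual encoder shares weights, the same `encode` function serves both queries and code snippets.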
#### 3. Contrastive Learning Objective
- In-batch Negatives: Code snippets from other samples in the same batch serve as negative examples.
- Hard Negatives: UniXcoder is used to retrieve, from the full codebase, code snippets that are close to each query in representation space but semantically different from its true match.
- The loss function takes the InfoNCE form with temperature \(\tau = 0.05\); cosine similarity is used to compute relevance scores (see the sketch after this list).
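A minimal sketch of this objective with in-batch negatives and optional hard negatives (a generic InfoNCE formulation under the stated \(\tau\) and cosine similarity, not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def info_nce(q, c, hard_neg=None, tau=0.05):
    """q, c: (B, D) query/code embeddings; hard_neg: optional (B, D).

    The i-th code is the positive for the i-th query; every other code in
    the batch (plus any hard negatives) serves as a negative."""
    q, c = F.normalize(q, dim=-1), F.normalize(c, dim=-1)
    cands = c if hard_neg is None else torch.cat([c, F.normalize(hard_neg, dim=-1)])
    logits = q @ cands.t() / tau                       # cosine similarities / tau
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```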
#### 4. Representation Extraction Strategy
Ablation experiments reveal that:

- Right padding combined with average pooling over all token embeddings is the best representation extraction strategy.
- With left padding, the last-token strategy performs better; with right padding, average pooling performs better. One plausible reading: under left padding the final position is always a real token, whereas under right padding it is a pad token, so naive last-token extraction reads padding.
### Training Details
- Optimizer: AdamW, learning rate \(2 \times 10^{-5}\)
- Trained for 2 epochs with batch size 64
- 8× A100 (80 GB) GPUs; the longest run (phi-1) takes approximately 3 days (a minimal loop sketch follows this list)
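A minimal sketch wiring these hyperparameters together, reusing the hypothetical `encode` and `info_nce` helpers from above (`pairs` is a placeholder for the constructed (query, code) dataset):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loader = torch.utils.data.DataLoader(pairs, batch_size=64, shuffle=True)

for epoch in range(2):
    for queries, codes in loader:  # batches of paired strings
        loss = info_nce(encode(queries), encode(codes))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```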
## Key Experimental Results

### Experimental Setup
- Code Search: CodeSearchNet (CSN, 6 languages) and CoSQA datasets; metric: MRR (see the sketch after this list)
- Clone Detection: POJ-104 dataset, metric: MAP
- Baselines: encoder-only models (CodeBERT / GraphCodeBERT / UniXcoder / CodeSage) and decoder-only models (CodeGPT / CodeGen / SantaCoder / phi-1 / DeepSeek-Coder)
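For reference, a minimal MRR computation for code search, assuming exactly one relevant snippet per query as in CSN and CoSQA (function and argument names are illustrative):

```python
def mean_reciprocal_rank(ranked_ids, gold_ids):
    """ranked_ids: per-query candidate ids, best first; gold_ids: the relevant id."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        total += 1.0 / (ranked.index(gold) + 1)  # 1-based rank of the true match
    return total / len(gold_ids)
```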
### Table 1: Overall Fine-tuned Performance
| Method | Scale | CSN (MRR) | CoSQA (MRR) | POJ-104 (MAP) |
|---|---|---|---|---|
| CodeBERT (Enc) | 125M | 70.18 | 65.7 | 83.79 |
| GraphCodeBERT (Enc) | 125M | 72.08 | 68.4 | 85.50 |
| UniXcoder (Enc) | 125M | 74.40 | 70.1 | 89.56 |
| CodeSage (Enc) | 1.3B | 75.80 | 68.0 | 87.70 |
| CodeGPT + CL4D (Dec) | 125M | 70.20 | 69.0 | 87.96 |
| CodeGen + CL4D (Dec) | 350M | 73.30 | 71.5 | 89.68 |
| SantaCoder + CL4D (Dec) | 1.1B | 74.98 | 72.2 | 83.98 |
| phi-1 + CL4D (Dec) | 1.3B | 75.18 | 72.8 | 92.72 |
| DeepSeek-Coder + CL4D (Dec) | 1.3B | 77.57 | 71.9 | 89.71 |
Key Findings: CL4D enables decoder-only models to outperform encoder-only models of equivalent scale by approximately 2% on most tasks; larger model scale consistently yields better understanding performance.
### Table 2: Zero-Shot Performance (No Fine-Tuning)
| Method | Scale | CSN (MRR) | CoSQA (MRR) | POJ-104 (MAP) |
|---|---|---|---|---|
| CodeBERT (Enc) | 125M | 0.10 | 0.24 | 20.38 |
| UniXcoder (Enc) | 125M | 46.40 | 42.11 | 42.08 |
| CodeSage (Enc) | 1.3B | 71.24 | 47.53 | 73.07 |
| CodeGPT (Dec) | 125M | 0.12 | 0.04 | 9.41 |
| DeepSeek-Coder (Dec) | 1.3B | 0.12 | 0.63 | 16.51 |
| CodeGPT + CL4D (Dec) | 125M | 67.56 (↑67.44) | 53.49 (↑53.45) | 25.93 (↑16.52) |
| CodeGen + CL4D (Dec) | 350M | 71.97 (↑70.55) | 51.18 (↑50.73) | 45.84 (↑32.64) |
| SantaCoder + CL4D (Dec) | 1.1B | 74.18 (↑74.11) | 52.82 (↑52.71) | 71.14 (↑55.57) |
| DeepSeek-Coder + CL4D (Dec) | 1.3B | 76.02 (↑75.90) | 48.34 (↑47.71) | 71.18 (↑54.67) |
Key Findings: In the zero-shot setting, CL4D lifts decoder-only models by roughly 17–76 points across tasks (up to ↑75.90 CSN MRR), enabling them to match encoder-only models without any task-specific fine-tuning.
### Ablation Study
- Removing hard negatives leads to approximately 1.5% performance degradation.
- Removing in-batch negatives causes a drastic performance drop (CSN MRR from 72.00 to 1.42), confirming that contrastive learning is the core of the method.
## Highlights & Insights
- High Practicality: The approach directly reuses existing decoder-only models' code knowledge without training large encoder models from scratch, substantially reducing computational cost.
- Unified Architecture Potential: The work demonstrates that a decoder-only architecture can simultaneously handle both code generation and code understanding, pointing toward a unified code model paradigm.
- Significant Zero-Shot Gains: CL4D achieves zero-shot improvements of up to 75.90 points (CSN MRR), matching encoder-only SOTA without any fine-tuning.
- Clear Scaling Effect: Experiments clearly demonstrate that larger decoder-only models yield consistently better code understanding performance.
- Systematic Exploration: A complete ablation analysis over representation extraction strategies (padding direction × pooling method) is conducted.
## Limitations & Future Work
- Limited Evaluation Tasks: Only code search and clone detection are evaluated; other understanding tasks such as code summarization, bug detection, and type inference are not covered.
- Constrained Model Scale: The largest model in experiments is only 1.3B; the effectiveness on larger-scale decoder-only models (e.g., 7B, 13B) remains unverified.
- Dependency on External Model for Hard Negatives: Constructing hard negatives requires UniXcoder as an auxiliary ranker, introducing an additional external dependency.
- No Evaluation of Generation Capability: The impact of CL4D continued pretraining on the original model's code generation ability (e.g., potential catastrophic forgetting) is not analyzed.
- Mainstream Language Bias: Only 6 mainstream programming languages are covered; generalization to low-resource languages is not validated.
## Related Work & Insights
- Code Representation Learning: Encoder-only models such as CodeBERT, GraphCodeBERT (data flow), TreeBERT (AST), UniXcoder, CodeSage, and CoLSBERT learn code representations via objectives like MLM.
- Code Contrastive Learning: CoSQA (query rewriting for positive pairs), SynCoBERT/Code-MVP (cross-modal positive pairs), UniXcoder (dropout-based positive pairs), CodeRetriever/R2/CodeSage (hard negative construction).
- Decoder-Only Code Models: Codex, CodeGen, StarCoder, CodeLlama, DeepSeek-Coder, etc., continuously growing in scale but primarily targeting generation tasks.
## Rating
- Novelty: ⭐⭐⭐ — The intuition is sound but technical novelty is limited; the work essentially applies SimCSE-style contrastive learning to decoder-only code models.
- Value: ⭐⭐⭐⭐ — The method is simple and efficient, requires low training cost, and can directly reuse existing models with low engineering overhead.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive comparisons across multiple models, datasets, and settings (fine-tuned / zero-shot), with complete ablation and visualization analyses.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, well-defined research problem, and well-organized experiments.
- Overall: ⭐⭐⭐⭐ (7.5/10) — High practical value and thorough experiments, but limited technical innovation; the core contribution lies in systematically demonstrating that contrastive learning can effectively bridge the gap of decoder-only models in code understanding.