GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding
Conference: ACL 2025
arXiv: 2409.04183
Code: https://github.com/codefuse-ai/GALLa
Area: Code Intelligence
Keywords: code understanding, graph alignment, GNN, AST/DFG, model-agnostic
TL;DR
This paper proposes GALLa, which encodes the AST/DFG structural graph of code using a GNN and aligns it to the LLM embedding space via a cross-modal adapter. It injects code structural information as an auxiliary task during fine-tuning, and discards the GNN and adapter during inference to achieve zero extra overhead, yielding consistent improvements across 5 code tasks and 7 baseline LLMs (ranging from 350M to 14B parameters).
Background & Motivation
Background: Code LLMs (such as Code LLaMA, DeepSeek-Coder, etc.) have made significant progress by scaling up model and data sizes, but essentially remain standard decoder-only Transformers that treat source code merely as sequences of text tokens.
Limitations of Prior Work: Code contains rich graph-structured semantics, such as Abstract Syntax Trees (AST), Data Flow Graphs (DFG), and Control Flow Graphs (CFG), which pure text representations ignore. Existing graph-enhancement methods fall into three categories: (1) modifying the attention mask to encode graph structure (e.g., GraphCodeBERT), which is incompatible with decoder-only LLMs; (2) linearizing graphs into text, which works only for simple trees such as the AST and not for cyclic DFGs; (3) modifying positional encodings, which requires large-scale retraining.
Key Challenge: Graph enhancement and large-scale pretrained LLMs have been hard to combine. Either the architecture is modified, which loses pretrained knowledge, or the graph is omitted, which loses structural information.
Key Insight: Inspired by the VLM field (e.g., LLaVA) which uses lightweight adapters to bridge vision and language modalities, an external GNN + adapter is used to handle graph information while keeping the LLM architecture completely unchanged.
Core Idea: Graph information is injected into the LLM during training via a GNN and an adapter, both of which are discarded at inference, so the deployed model incurs no extra inference cost.
Method
Overall Architecture
Three modules: GNN (encoding AST/DFG) \(\to\) Adapter (projecting into the LLM embedding space) \(\to\) LLM (decoding/generation). Training is split into two stages, and the data is divided into graph alignment data and downstream task data, which are completely independent.
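To make the graph input concrete, here is a minimal sketch of turning source code into a node list and an edge list that a GNN could consume. It uses Python's built-in `ast` module as a stand-in; the function name `ast_to_graph` is hypothetical, and the extraction tooling actually used by GALLa is not described in this note. Only AST parent-to-child edges are produced; a DFG would add edges between variable definitions and their uses.

```python
import ast

def ast_to_graph(source: str):
    """Return (nodes, edges) for the AST of `source`.

    nodes: list of node-type names, indexed by node id
    edges: list of (parent_id, child_id) pairs
    """
    tree = ast.parse(source)
    nodes, edges = [], []

    def visit(node, parent_id=None):
        node_id = len(nodes)                 # ids are assigned in pre-order
        nodes.append(type(node).__name__)
        if parent_id is not None:
            edges.append((parent_id, node_id))
        for child in ast.iter_child_nodes(node):
            visit(child, node_id)

    visit(tree)
    return nodes, edges

nodes, edges = ast_to_graph("x = 1\ny = x + 2\n")
print(len(nodes), "nodes,", len(edges), "edges")
print(nodes[:4])
```

In a pipeline like the one described, the node-type names (or the code text they span) would be embedded by a text encoder to initialize the GNN's node features.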
Key Designs
- Graph Encoding and Input:
    - A GNN encodes the node features (initialized using a text encoder) and edge structures of the AST and DFG, outputting contextualized node representations \(H \in \mathbb{R}^{n_v \times d_{gnn}}\).
    - The adapter uses learnable query vectors to compress node representations into a fixed number of graph tokens \(X_g \in \mathbb{R}^{n_g \times d_{lm}}\) via cross-attention.
    - Graph tokens and text tokens are concatenated and input into the LLM, with the loss calculated solely on the text tokens.
- Two-Stage Training:
    - Stage 1 (Graph Encoder Pre-training): Freeze the LLM and train only the GNN + adapter. Task: Graph2Code (generate source code given a graph), so that the GNN + adapter learn to produce graph tokens intelligible to the LLM.
    - Stage 2 (Graph-LLM Alignment): Unfreeze the LLM and jointly train all three modules. Tasks: Graph2Code + GraphQA (predicting edge existence, predicting parent/child nodes). At the same time, fine-tune the LLM on downstream task data, which is fed as plain text and does not pass through the GNN.
- Discarding GNN during Inference:
    - After training is complete, the GNN and adapter are discarded and the LLM performs inference on its own, at the same speed as the baseline LLM.
    - Key assumption: Through graph alignment training, the LLM has already internalized code structure understanding, eliminating the need for explicit graph input during inference.
- Model- and Task-Agnosticism:
    - The choice of GNN depends on the graph type (directed/undirected), and the choice of LLM depends on the application scenario; the framework itself imposes no constraints.
    - Graph alignment data comes from CodeNet (240K Python + 75K Java), which is entirely independent of downstream task data.
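The adapter step described above (learnable queries compressing \(n_v\) node vectors into \(n_g\) graph tokens) can be sketched in plain Python. This is a single-head cross-attention with random weights, written under assumptions: the helper names (`adapter`, `loss_mask`), the dimensions, and the absence of multi-head attention or an output projection are all illustrative and are not taken from the GALLa code.

```python
import math
import random

def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def adapter(H, queries, W_k, W_v):
    """Compress n_v node vectors into n_g graph tokens via cross-attention."""
    K = matmul(H, W_k)                       # (n_v, d_lm)
    V = matmul(H, W_v)                       # (n_v, d_lm)
    d_lm = len(queries[0])
    K_T = [list(col) for col in zip(*K)]     # (d_lm, n_v)
    scores = matmul(queries, K_T)            # (n_g, n_v)
    attn = [softmax([s / math.sqrt(d_lm) for s in row]) for row in scores]
    return matmul(attn, V), attn             # X_g has shape (n_g, d_lm)

def loss_mask(n_g, n_text):
    """0 for graph-token positions, 1 for text-token positions."""
    return [0] * n_g + [1] * n_text

rng = random.Random(0)
d_gnn, d_lm, n_g = 8, 16, 4                  # illustrative sizes
queries = rand_matrix(n_g, d_lm, rng)        # learnable in a real model
W_k = rand_matrix(d_gnn, d_lm, rng)
W_v = rand_matrix(d_gnn, d_lm, rng)

for n_v in (5, 30):                          # graphs of different sizes
    H = rand_matrix(n_v, d_gnn, rng)         # stand-in for the GNN output
    X_g, attn = adapter(H, queries, W_k, W_v)
    print(n_v, "nodes ->", len(X_g), "graph tokens of width", len(X_g[0]))
```

Because the number of queries is fixed, the LLM sees the same number of graph tokens regardless of graph size, and `loss_mask` shows how the loss can be restricted to the text positions of the concatenated sequence.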
Key Experimental Results
Main Results (Multi-Task Fine-Tuning - MFT)
| Model | Without GALLa | G2C | G2C+GraphQA | Average Gain |
|---|---|---|---|---|
| CodeGen 350M | 34.7 | 35.2 (+1%) | 36.6 (+5%) | +5% |
| StarCoder 1B | 14.0 | 16.7 (+20%) | 18.9 (+36%) | +36% |
| Phi-1 1.3B | Baseline | Improvement | Larger Improvement | Significant |
| LLaMA3-8B | Baseline | Improvement | Improvement | Consistent |
| Qwen2.5-Coder-7B | Baseline | Improvement | Improvement | Consistent |
5 Tasks: Code Translation (pass@1), Clone Detection (F1), Defect Detection (Acc), Code Summarization (BLEU), Code Refinement (EM).
Ablation Study
| Configuration | Performance | Description |
|---|---|---|
| G2C Only | Effective | Basic graph understanding |
| G2C + GraphQA | Better | Deep structural understanding |
| AST only / DFG only | Both effective | Complementary to each other |
| Cross-lingual Generalization | Effective | Alignment data contains only Python+Java, but improvements are also seen in JavaScript/C |
Key Findings
- Smaller models benefit more: StarCoder 1B achieves a 36% gain, whereas larger models see smaller but consistent improvements.
- Cross-lingual structural knowledge transfer: Programming languages not included in the graph alignment data also benefit, suggesting that the LLM has learned structural understanding that generalizes across languages.
- GraphQA adds gains beyond G2C: QA-style structural tasks force the model to understand the graph topology rather than simply memorize the graph-to-code mapping.
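To illustrate what GraphQA samples might look like, the sketch below builds edge-existence and parent/child questions from a small hand-written graph. The question templates, the function name `make_graphqa`, and the node-id phrasing are hypothetical; the actual prompt format used in the paper is not given in this note. The parent question assumes a tree such as an AST, where each non-root node has exactly one parent.

```python
import random

def make_graphqa(nodes, edges, rng):
    """Build edge-existence and parent/child questions from one graph."""
    edge_set = set(edges)
    samples = []

    # Edge existence: one real edge ("yes") and one absent pair ("no").
    src, dst = rng.choice(edges)
    samples.append((f"Is there an edge from node {src} to node {dst}?", "yes"))
    absent = [(a, b) for a in range(len(nodes)) for b in range(len(nodes))
              if a != b and (a, b) not in edge_set]
    a, b = rng.choice(absent)
    samples.append((f"Is there an edge from node {a} to node {b}?", "no"))

    # Parent/child prediction on a randomly chosen edge.
    parent, child = rng.choice(edges)
    samples.append((f"What is the parent of node {child}?", str(parent)))
    children = sorted(c for p, c in edges if p == parent)
    samples.append((f"What are the children of node {parent}?",
                    ", ".join(map(str, children))))
    return samples

# Tiny AST-like graph: Module -> Assign -> {Name, Constant}
nodes = ["Module", "Assign", "Name", "Constant"]
edges = [(0, 1), (1, 2), (1, 3)]
samples = make_graphqa(nodes, edges, random.Random(0))
for question, answer in samples:
    print(question, "->", answer)
```

During training, the graph itself would reach the LLM only through the graph tokens, so answering these questions requires reading topology out of those tokens.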
Highlights & Insights
- The "graphs during training, no graphs during inference" design philosophy is highly practical: with zero inference overhead, the fine-tuned model is a drop-in replacement for its baseline code LLM.
- Analogy of GNN as a "graph tokenizer": Stage 1 is equivalent to training a graph tokenizer (similar to vision tokenizers in VLMs), enabling the LLM to "comprehend" graph information.
- Separation of graph alignment data and downstream task data: Downstream task data does not require graph structure annotations, significantly lowering the barrier to adoption.
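The training schedule and the inference-time discard can be summarized as a small lookup table. This is only a sketch of the description in this note; the phase and task names below are hypothetical labels, not identifiers from the GALLa codebase.

```python
# Hypothetical phase/task table; names are illustrative only.
PHASES = {
    # phase: (modules that receive gradient updates, tasks trained in that phase)
    "stage1": ({"gnn", "adapter"}, ["graph2code"]),
    "stage2": ({"gnn", "adapter", "llm"}, ["graph2code", "graphqa", "downstream"]),
}

GRAPH_TASKS = {"graph2code", "graphqa"}

def forward_path(task):
    """Modules an input passes through for a given task."""
    if task in GRAPH_TASKS:
        return ["gnn", "adapter", "llm"]    # graph tokens are prepended to text
    return ["llm"]                          # downstream data is plain text

def inference_path():
    """After training, the GNN and adapter are dropped."""
    return ["llm"]

for phase, (trainable, tasks) in PHASES.items():
    for task in tasks:
        print(phase, task, "path:", forward_path(task),
              "trainable:", sorted(trainable))

# Downstream fine-tuning and inference share the same LLM-only path.
print(forward_path("downstream") == inference_path())
```

One way to read this table: downstream examples are already presented to the LLM without graph tokens during training, so removing the GNN and adapter at inference does not change the input format the LLM was fine-tuned on.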
Limitations & Future Work
- AST/DFG extraction requires syntactically valid, complete code, so it does not apply to incomplete snippets or code with syntax errors.
- Graph alignment data covers only Python and Java; the effect of alignment data built from other languages remains to be validated.
- Processing graphs of large code files with GNNs might encounter scalability issues.
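The first limitation is easy to reproduce with a strict parser. The sketch below uses Python's built-in `ast` module as a stand-in (the parser used in the paper is not specified in this note, and `can_extract_ast` is a hypothetical helper): a complete function parses, while a truncated one raises `SyntaxError` and yields no graph.

```python
import ast

def can_extract_ast(source: str) -> bool:
    """Return True if `source` parses into an AST, False on a syntax error."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

complete = "def add(a, b):\n    return a + b\n"
truncated = "def add(a, b):\n    return a +"   # expression cut off mid-line

print(can_extract_ast(complete))    # True
print(can_extract_ast(truncated))   # False
```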
Related Work & Insights
- vs GraphCodeBERT: GraphCodeBERT encodes the DFG by modifying attention masks, which is incompatible with decoder-only LLMs; GALLa handles graph information entirely in external modules.
- vs TransCoder-IR: TransCoder-IR aligns code using LLVM IR as an intermediate representation, but IR still consists of text tokens and carries limited structural information; GALLa operates directly on graph structures.
Rating
- Novelty: ⭐⭐⭐⭐ An elegant bridging solution between graph structures and LLMs, with an ingenious "graphs in training, none in inference" design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 models (350M-14B) \(\times\) 5 tasks + cross-lingual generalization validation.
- Writing Quality: ⭐⭐⭐⭐ Clear description of downstream methodology and coherent logic in the two-stage training.
- Value: ⭐⭐⭐⭐⭐ Zero inference overhead + model-agnostic + task-agnostic, providing highly practical value.