Function-to-Style Guidance of LLMs for Code Translation

Conference: ICML 2025
arXiv: 2507.11083
Code: Yes (the paper mentions that the model and benchmark have been released)
Area: Code Intelligence
Keywords: code translation, LLM fine-tuning, functional learning, style learning, readability

TL;DR

F2STrans progressively fine-tunes LLMs in two stages: functional learning (for correctness) and style learning (for readability). With this recipe, Qwen-1.5B outperforms prompt-enhanced Qwen-32B and GPT-4 on average across 20 code translation scenarios.

Background & Motivation

Background: LLMs have made progress in code translation (e.g., Java to Python), but the correctness and readability of the translation results remain challenging.

Limitations of Prior Work: Most methods focus on functional correctness, but the resulting code style is unnatural with poor readability. There is also a lack of benchmarks that evaluate both functionality and style.

Key Challenge: Functional correctness and code style are separate objectives; a direct translation can be functionally correct yet read unnaturally in the target language.

Goal: To design a progressive framework that first ensures functional correctness and then optimizes code style.

Key Insight: High-quality code pairs mined from online programming platforms can drive functional learning, and positive and negative style samples can then drive style learning.

Core Idea: Decoupling code translation optimization into two steps: "functionality first, style second."

Method

Overall Architecture

Stage 1 - Functional Learning: Fine-tuning with high-quality source-target code pairs to optimize translation correctness.

Stage 2 - Style Learning: Further fine-tuning with positive/negative style samples to guide the model towards outputting more natural code styles.
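
To make the notion of positive and negative style samples concrete, here is a small hypothetical illustration (not taken from the paper): two Python translations of the same Java loop. Both are functionally correct, but only the second follows common Python idioms. The function names and the Java snippet are assumptions made for this example.

```python
# Hypothetical illustration: two Python translations of the same Java loop.
# Java source (for reference):
#   int total = 0;
#   for (int i = 0; i < nums.length; i++) {
#       if (nums[i] % 2 == 0) { total += nums[i]; }
#   }


def sum_even_literal(nums):
    """Negative style sample: correct, but mirrors the Java index loop."""
    total = 0
    i = 0
    while i < len(nums):
        if nums[i] % 2 == 0:
            total = total + nums[i]
        i = i + 1
    return total


def sum_even_idiomatic(nums):
    """Positive style sample: same behavior, written with a generator expression."""
    return sum(n for n in nums if n % 2 == 0)
```

Because both functions return the same result on every input, Stage 1 alone cannot tell them apart; Stage 2 is what would push the model toward the second form.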

Key Designs

  1. High-Quality Code Pair Mining: Submissions in different languages for the same problem on LeetCode/Codeforces are paired, then filtered to keep functionally equivalent, high-quality code pairs. Design Motivation: Real-world multilingual code is more natural than synthetic data, and functional correctness is guaranteed by test cases.

  2. Style Learning: Introducing contrastive learning concepts by using positive examples (conforming to target language idioms) and negative examples (functionally correct but stylistically unnatural). The model learns to shift toward a more natural style while preserving functionality.

  3. New Benchmark: Includes the latest source code, extensive test cases, and human-annotated ground truths, supporting 20 translation scenarios (pairwise translation among 5 languages) to evaluate both functionality and style simultaneously.
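
The code pair mining step in design 1 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the `Submission` record, its fields, and `mine_code_pairs` are hypothetical names, and the only filter shown is "passed all test cases" (any further quality filtering the paper applies is omitted).

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Submission:
    # Hypothetical record for one platform submission.
    problem_id: str
    language: str
    code: str
    passed_all_tests: bool  # assumed to come from the platform's judge


def mine_code_pairs(submissions):
    """Pair accepted submissions for the same problem across languages."""
    by_problem = defaultdict(list)
    for sub in submissions:
        if sub.passed_all_tests:  # keep only functionally verified code
            by_problem[sub.problem_id].append(sub)

    pairs = []
    for subs in by_problem.values():
        # Ordered pairs: (java, python) and (python, java) are separate directions.
        for src, tgt in permutations(subs, 2):
            if src.language != tgt.language:
                pairs.append((src, tgt))
    return pairs


example = [
    Submission("p1", "java", "...", True),
    Submission("p1", "python", "...", True),
    Submission("p1", "cpp", "...", False),  # dropped: failed the tests
    Submission("p2", "java", "...", True),  # dropped: no partner language
]
mined = mine_code_pairs(example)
```

Grouping by problem is what makes the pairs functionally equivalent by construction: both sides solve the same task and both passed the same judge.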

Loss & Training

  • Functional learning stage: Standard SFT loss
  • Style learning stage: Contrastive/preference learning loss combining positive and negative samples
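
The two objectives can be sketched in a few lines. The summary does not give the exact form of the style loss, so the pairwise logistic loss below is an assumption chosen as one common preference-learning objective, not the paper's formula; `sft_loss`, `pairwise_style_loss`, and `beta` are hypothetical names. Inputs are log-probabilities the model assigns to tokens or whole sequences.

```python
import math


def sft_loss(target_token_logprobs):
    """Standard SFT objective: mean negative log-likelihood of the target tokens."""
    return -sum(target_token_logprobs) / len(target_token_logprobs)


def pairwise_style_loss(logp_positive, logp_negative, beta=1.0):
    """Assumed preference loss: -log(sigmoid(beta * (logp_pos - logp_neg))).

    It is small when the model assigns higher log-probability to the
    idiomatic (positive) translation than to the unnatural (negative) one.
    """
    margin = beta * (logp_positive - logp_negative)
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With equal log-probabilities the style loss is log 2, and it decreases as the positive sample becomes more likely than the negative one.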

Key Experimental Results

Main Results (20 Translation Scenarios)

| Model | Functional Accuracy | Style Score | Overall |
|---|---|---|---|
| F2STrans (Qwen-1.5B) | Best | Best | SOTA |
| Qwen-32B + prompt | High | Medium | Inferior to 1.5B FT |
| GPT-4 + prompt | High | Medium | Inferior to 1.5B FT |
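
As a rough sketch of how the 20 scenarios and a functional accuracy figure could be computed: the scenarios are the ordered pairs among 5 languages (5 × 4 = 20). The language names below are placeholders, and the rule that a translation counts as correct only if it passes every test case is an assumption, since the summary does not define the metric.

```python
from itertools import permutations

# Placeholder names; the summary does not list the 5 benchmark languages.
LANGUAGES = ["lang_a", "lang_b", "lang_c", "lang_d", "lang_e"]


def translation_scenarios(languages):
    """All ordered (source, target) pairs with source != target."""
    return list(permutations(languages, 2))


def functional_accuracy(per_translation_test_results):
    """Fraction of translations that pass all of their test cases.

    Each element is a list of booleans, one per test case (assumed format).
    """
    if not per_translation_test_results:
        return 0.0
    passed = sum(
        1 for tests in per_translation_test_results if tests and all(tests)
    )
    return passed / len(per_translation_test_results)
```

Averaging this per-scenario figure over all scenarios would give the kind of "on average across 20 scenarios" comparison reported above.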

Ablation Study

| Configuration | Functional Accuracy | Style Score | Description |
|---|---|---|---|
| Full F2STrans | Highest | Highest | Both stages included |
| Functional Learning Only | High | Medium | Correct but unnatural |
| Style Learning Only | Low | High | Good style but potentially incorrect |
| Direct Joint Training | Medium | Medium | Inferior to staged training |

Key Findings

  • A fine-tuned 1.5B model outperforms prompt-enhanced Qwen-32B and GPT-4
  • Decoupled training outperforms joint training
  • Online programming platforms are a treasure trove of high-quality translation data

Highlights & Insights

  • The "functionality first, style second" approach can be generalized to other code generation tasks
  • The superiority of fine-tuning small models over prompting large models is validated once again
  • Code quality is not just about "being runnable"; style and readability are equally important
  • The choice of data sources has a significant impact on translation quality

Limitations & Future Work

  • Only covers 5 programming languages
  • Style evaluation partially relies on human annotation
  • Formal verification of semantic equivalence has not been explored

Rating

  • Novelty: ⭐⭐⭐⭐ Novel functional-style decoupled training paradigm
  • Experimental Thoroughness: ⭐⭐⭐⭐ 20 scenarios, new benchmark
  • Writing Quality: ⭐⭐⭐⭐ Clear description of methods
  • Value: ⭐⭐⭐⭐ Direct application value for code translation tools

Additional Reflections

The research direction of this paper connects to several broader trends in current AI research: (1) model efficiency and accessibility, including fine-tuning small models in place of prompting large ones; (2) evaluation of model capabilities and reliability; and (3) preference-style training of the kind used in alignment work. From a methodological perspective, the work suggests that a staged training recipe on a small model can substitute for prompting much larger models on this task.

Specific Suggestions for Future Work

  1. Combine the core idea with other modalities (vision, speech, multi-modal) to test whether the method generalizes across modalities.
  2. Validate the conclusions on larger-scale models (70B+), larger datasets, and newer architectures (such as Mixture-of-Experts).
  3. Explore integration with reinforcement learning and online learning to achieve dynamic adaptation.
  4. Develop automated evaluation and optimization tools to lower the barrier to using the method.
  5. Consider the intersection with LLM alignment research to explore joint optimization of safety and performance.