Function-to-Style Guidance of LLMs for Code Translation¶

Conference: ICML 2025
arXiv: 2507.11083
Code: Yes (the paper mentions that the model and benchmark have been released)
Area: Code Intelligence
Keywords: code translation, LLM fine-tuning, functional learning, style learning, readability

TL;DR¶

F2STrans progressively fine-tunes LLMs in two stages: functional learning (for correctness) followed by style learning (for readability). With this recipe, a fine-tuned Qwen-1.5B outperforms prompt-enhanced Qwen-32B and GPT-4 on average across 20 code translation scenarios.

Background & Motivation¶

Background¶

Background: LLMs have made progress in code translation (e.g., Java to Python), but the correctness and readability of the translation results remain challenging.

Limitations of Prior Work: Most methods focus on functional correctness, but the resulting code style is unnatural with poor readability. There is also a lack of benchmarks that evaluate both functionality and style.

Key Challenge: Functional correctness and code style are distinct objectives. A direct, near line-by-line translation can be functionally correct yet read unnaturally in the target language.

Goal: To design a progressive framework that first ensures functional correctness and then optimizes code style.

Key Insight: Mining high-quality code pairs from online programming platforms for functional learning, followed by style learning using positive and negative style samples.

Core Idea: Decoupling code translation optimization into two steps: "functionality first, style second."
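
To make the function/style distinction concrete, here is a small illustration that is not taken from the paper. Both functions, their names, and the task (summing the squares of even numbers) are hypothetical. They stand for two Python translations of the same Java-style routine: the first carries the indexed loop over almost literally, the second uses idiomatic Python.

```python
# Hypothetical example: both functions are illustrative, not from the paper.

def sum_even_squares_literal(nums):
    # Near line-by-line carry-over of a Java-style indexed loop:
    # functionally correct, but unnatural in Python.
    total = 0
    i = 0
    while i < len(nums):
        if nums[i] % 2 == 0:
            total = total + nums[i] * nums[i]
        i = i + 1
    return total


def sum_even_squares_idiomatic(nums):
    # Same behavior expressed with a generator expression.
    return sum(n * n for n in nums if n % 2 == 0)
```

Both versions pass the same tests, so a purely functional metric cannot tell them apart. That is the gap a separate style objective is meant to close.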

Method¶

Overall Architecture¶

Stage 1 - Functional Learning: Fine-tuning with high-quality source-target code pairs to optimize translation correctness. Stage 2 - Style Learning: Further fine-tuning with positive/negative style samples to guide the model towards outputting more natural code styles.

Key Designs¶

High-Quality Code Pair Mining: Pairing different language submissions for the same problem from LeetCode/Codeforces, and filtering to select functionally equivalent, high-quality code pairs. Design Motivation: Real-world multilingual code is more natural than synthetic data, and functional correctness is guaranteed by test cases.
Style Learning: Introducing contrastive learning concepts by using positive examples (conforming to target language idioms) and negative examples (functionally correct but stylistically unnatural). The model learns to shift toward a more natural style while preserving functionality.
New Benchmark: Includes the latest source code, extensive test cases, and human-annotated ground truths, supporting 20 translation scenarios (pairwise translation among 5 languages) to evaluate both functionality and style simultaneously.
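
The pair-mining step described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's pipeline: the `Submission` record, the `passed_all_tests` flag, and `mine_code_pairs` are hypothetical names, and the single "accepted by the platform" check stands in for whatever quality filtering the authors actually apply.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Submission:
    # Hypothetical record type; field names are assumptions.
    problem_id: str         # identifier of the programming problem
    language: str           # e.g. "java", "python"
    code: str               # source text of the submission
    passed_all_tests: bool  # whether the platform's tests accepted it


def mine_code_pairs(submissions):
    """Pair accepted submissions in different languages for the same problem."""
    by_problem = defaultdict(list)
    for sub in submissions:
        if sub.passed_all_tests:  # keep only functionally verified code
            by_problem[sub.problem_id].append(sub)

    pairs = []
    for subs in by_problem.values():
        # Ordered pairs, so both A->B and B->A directions are produced.
        for src, tgt in permutations(subs, 2):
            if src.language != tgt.language:
                pairs.append((src, tgt))
    return pairs
```

Because both sides of each pair solve the same problem and pass the same tests, the pair is functionally equivalent by construction, which is the property the design motivation relies on.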

Loss & Training¶

Functional learning stage: Standard SFT loss
Style learning stage: Contrastive/preference learning loss combining positive and negative samples
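
As a rough illustration of the two objectives, the sketch below computes a standard SFT loss (average negative log-likelihood of the reference tokens) and one common pairwise preference loss. The exact style objective is not given in this summary, so the logistic form, the `beta` scale, and both function names are assumptions rather than the paper's formulation.

```python
import math


def sft_loss(target_token_logprobs):
    """Average negative log-likelihood of the reference translation tokens."""
    return -sum(target_token_logprobs) / len(target_token_logprobs)


def pairwise_style_loss(logp_positive, logp_negative, beta=1.0):
    """Assumed preference loss: -log(sigmoid(beta * (logp_positive - logp_negative))).

    The loss is small when the model assigns a higher log-probability to the
    idiomatic (positive) sample than to the unnatural (negative) one.
    """
    margin = beta * (logp_positive - logp_negative)
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch the inputs are sequence-level log-probabilities that a real training loop would obtain from the model. When both samples are scored equally the loss is log 2, and it decreases as the positive sample is preferred more strongly.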

Key Experimental Results¶

Main Results (20 Translation Scenarios)¶

Model	Functional Accuracy	Style Score	Overall
F2STrans (Qwen-1.5B)	Best	Best	SOTA
Qwen-32B + prompt	High	Medium	Inferior to 1.5B FT
GPT-4 + prompt	High	Medium	Inferior to 1.5B FT
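
The count of 20 scenarios follows from taking every ordered (source, target) pair among 5 languages: 5 × 4 = 20. The sketch below enumerates them and averages per-scenario scores. The language labels are placeholders because the five languages are not named in this summary, and the unweighted mean in `macro_average` is an assumption about how "on average" is computed.

```python
from itertools import permutations

# Placeholder labels: the five languages are not named in this summary.
languages = ["lang_a", "lang_b", "lang_c", "lang_d", "lang_e"]

# Each ordered (source, target) pair with source != target is one scenario.
scenarios = list(permutations(languages, 2))  # 5 * 4 = 20 pairs


def macro_average(scores_by_scenario):
    """Unweighted mean of per-scenario scores (an assumed aggregation)."""
    return sum(scores_by_scenario.values()) / len(scores_by_scenario)
```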

Ablation Study¶

Configuration	Functional Accuracy	Style Score	Description
Full F2STrans	Highest	Highest	Both stages included
Functional Learning Only	High	Medium	Correct but unnatural
Style Learning Only	Low	High	Good style but potentially incorrect
Direct Joint Training	Medium	Medium	Inferior to staged training

Key Findings¶

1.5B fine-tuning outperforms 32B and GPT-4 prompting methods
Decoupled training outperforms joint training
Online programming platforms are a treasure trove of high-quality translation data

Highlights & Insights¶

The "functionality first, style second" approach can be generalized to other code generation tasks
The superiority of fine-tuning small models over prompting large models is validated once again

Limitations & Future Work¶

Only covers 5 programming languages
Style evaluation partially relies on human annotation
Formal verification of semantic equivalence has not been explored

Related Work & Insights¶

Code quality is not just about "being runnable"; style and readability are equally important
The choice of data sources has a significant impact on translation quality

Rating¶

Novelty: ⭐⭐⭐⭐ Novel functional-style decoupled training paradigm
Experimental Thoroughness: ⭐⭐⭐⭐ 20 scenarios, new benchmark
Writing Quality: ⭐⭐⭐⭐ Clear description of methods
Value: ⭐⭐⭐⭐ Direct application value for code translation tools

Additional Reflections¶

Relationship with Domain Trends¶

The research direction of this paper relates to several trends in current AI research: (1) the increasing importance of model efficiency and accessibility, including parameter-efficient fine-tuning and model compression; (2) the evaluation of model capabilities and reliability guarantees; (3) the growing demand for a deeper understanding of what LLMs learn; and (4) AI safety and alignment becoming core concerns. From a methodological perspective, the work can be read as a step away from purely "black-box utilization" of LLMs toward a more principled understanding of how they should be trained.

Specific Suggestions for Future Work¶

The core idea can be combined with other modalities (vision, speech, multi-modal) to verify the cross-modal generalizability of the method.
Validate the conclusions on larger-scale models (70B+), larger datasets, and newer architectures (such as Mixture-of-Experts).
Explore integration with reinforcement learning and online learning to achieve dynamic adaptation.
Develop automated evaluation and optimization toolchains to lower the barrier to using the method.
Consider the intersection with LLM alignment research to explore the synergistic optimization of safety and performance.
