Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=ABc5y3741T
Code: https://huggingface.co/KOREAson (Data and Model Collection)
Area: LLM Reasoning / Multilingual Reasoning
Keywords: Language-Mixed CoT, Mid-resource Languages, Korean Reasoning, Data Distillation, SFT
TL;DR
To address the lack of long-reasoning models for mid-resource languages (here, Korean), this paper proposes Language-Mixed CoT, which uses English as a "logic anchor" for reasoning while retaining key Korean terminology. Combined with 5.79M self-collected native Korean prompts and high-yield subset distillation, the authors train KO-REAson-35B using only SFT. It achieves the top average score of 64.0 across nine Korean benchmarks, and small and medium models gain +18.6 points on average.
Background & Motivation
Background: Frontier models improve reasoning by exploring solution spaces via long chain-of-thought (long-CoT). The mainstream route for open-source replication is distillation from teacher models: systematic prompt collection → teacher generation of long reasoning trajectories → quality filtering → SFT. However, such pipelines almost exclusively serve English (and some Chinese), leaving mid-resource languages largely unserved.
Limitations of Prior Work: Directly applying recipes from high-resource languages to Korean is ineffective: Korean base models are weaker, high-quality data is scarce, and RL cold-starts rely heavily on the "strong base + reliable reward model + large-scale data" trinity. Translated corpora are not a fix either: training Qwen2.5-1.5B on a translated version of OpenThoughts raised MATH from 25.5 to 74.4, but dropped the Korean cultural benchmark HAE-RAE Bench from 35.2 to 15.3. Translationese pollutes the corpus and leaves the model fragile in colloquial and cultural contexts.
Key Challenge: Which language should be used for the reasoning process? Both monolingual options have critical flaws. English-only reasoning introduces translation noise (prompt mistranslation, error accumulation, "forgetting" the original Korean). Korean-only reasoning leads to a significant drop in reasoning capability, and prolonged Korean training on English-pre-trained bases triggers a distribution shift that damages their original strengths.
Goal: Find a reproducible and affordable recipe for mid-resource languages that ensures robustness across both "reasoning and cultural knowledge" dimensions.
Key Insight: The "logic skeleton" of reasoning can be decoupled from "semantic faithfulness." English, the dominant language of the base models, handles logical deduction, while key terms and citations remain in the target language. This is combined with native (non-translated) prompts to ensure a realistic corpus distribution.
Core Idea: Replace monolingual CoT with English-anchored Language-Mixed CoT (English logic anchor with Korean terminology). Use massive native Korean data and high-yield subset distillation to push mid-resource models to SOTA via pure SFT.
Method
Overall Architecture
This work presents a data-centric pipeline: "Data Collection → Supervisory Signal Construction → High-yield Subset Distillation → Cross-family SFT." It does not modify model architectures; all gains derive from supervisory signal formatting and data quality. Specifically: 5.79M native user prompts (YI-SANG instruction set) were crawled from Korean communities. Qwen3-32B acted as the teacher to generate 3.7M long-reasoning trajectories (YI-SANG full set) using the Language-Mixed CoT format. A 260k high-yield subset (YI-SANG-HQ) was distilled via category-wise ablation, loss-spike filtering, and decontamination. Finally, the KO-REAson series was developed via SFT on this subset using models ranging from 4B to 35B across six model families.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["5.79M Native Korean<br/>User Prompts"] --> B["Native Instruction Collection<br/>Verbatim Retention + Light Filtering"]
B --> C["Language-Mixed CoT Supervision<br/>Qwen3-32B Teacher<br/>English Anchor + Korean Terms"]
C --> D["3.7M Long Reasoning Trajectories<br/>(YI-SANG Full Set)"]
D --> E["High-yield Subset Distillation<br/>Category Ablation + Loss-spike Filtering + Decontamination"]
E --> F["YI-SANG-HQ<br/>260k High-yield Samples"]
F --> G["Pure SFT Training Recipe<br/>Across 6 Families (4B–35B)"]
G --> H["KO-REAson Series<br/>KO-REAson-35B SOTA"]
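To put the pipeline's three reported corpus sizes in proportion, the short sketch below computes the retention rate at each stage. It only uses the figures quoted in the overview (5.79M prompts, 3.7M trajectories, 260k high-yield samples); the helper name `pct` is illustrative, and the ratios are approximate because the summary does not say whether the supplemental translated math data is counted in every stage.

```python
# Corpus sizes as reported in the overview above.
PROMPTS = 5_790_000        # native Korean user prompts
TRAJECTORIES = 3_700_000   # long reasoning trajectories (YI-SANG full set)
HQ_SUBSET = 260_000        # YI-SANG-HQ


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


funnel = {
    "prompts -> trajectories": pct(TRAJECTORIES, PROMPTS),
    "trajectories -> HQ subset": pct(HQ_SUBSET, TRAJECTORIES),
    "prompts -> HQ subset": pct(HQ_SUBSET, PROMPTS),
}

for stage, rate in funnel.items():
    print(f"{stage}: {rate}%")
```

Under these numbers, roughly 7% of the generated trajectories end up in the training subset, which is why the subset selection criteria carry so much weight in the recipe.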
Key Designs
1. Language-Mixed CoT: English Logic Anchor and Korean Terminology Code-switching
This is the core supervisory signal addressing the choice of reasoning language. During the thinking phase, the model performs code-switching: logical scaffolding is written in English, while named entities, cited fragments, and key terms remain in Korean. The teacher model (Qwen3-32B) is prompted to follow this format; samples with Korean character ratios outside \([5\%, 20\%]\) are filtered out. This leverages English reasoning strengths without losing the semantic fidelity of the Korean prompt. Ablation (Table 2) shows that on Gemma-3-4B, English-anchored Language-Mixed CoT outperforms both mono-English and mono-Korean across HRB (54.9), MCLM (55.8), and KMMLU-R (53.0). Experiments with Chinese and Russian anchors showed that only the English anchor provided gains in mathematics (MCLM).
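The Korean-character ratio filter can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how the ratio is computed, so counting Hangul code points over non-whitespace characters is an assumption, and the function names `hangul_ratio` and `keep_trajectory` are hypothetical.

```python
def _is_hangul(ch: str) -> bool:
    """True for Hangul syllables, Jamo, and compatibility Jamo code points."""
    cp = ord(ch)
    return (
        0xAC00 <= cp <= 0xD7A3      # precomposed Hangul syllables
        or 0x1100 <= cp <= 0x11FF   # Hangul Jamo
        or 0x3130 <= cp <= 0x318F   # Hangul compatibility Jamo
    )


def hangul_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Hangul (assumed definition)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(_is_hangul(c) for c in chars) / len(chars)


def keep_trajectory(reasoning: str, low: float = 0.05, high: float = 0.20) -> bool:
    """Keep a reasoning trace only if its Hangul ratio lies within [low, high]."""
    return low <= hangul_ratio(reasoning) <= high


# A mostly-English trace with some Korean terms passes; pure Korean or pure English does not.
mixed = "a" * 90 + "가" * 10      # 10% Hangul
print(keep_trajectory(mixed))      # True
print(keep_trajectory("가" * 10))  # False: 100% Hangul
print(keep_trajectory("abc"))      # False: 0% Hangul
```

The two-sided bound is the point of the design: the lower bound rejects traces that drifted into pure English, and the upper bound rejects traces that reverted to Korean-only reasoning.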
2. Native Korean Instruction Collection: Verbatim Retention of Real Data Distribution
To address the "translationese" and robustness issues, native prompts were collected instead of translations. 28 user Q&A communities were selected after filtering for licensing (re-distributable vs. non-commercial). Contrary to standard practice, prompts were retained verbatim, including typos, abbreviations, and internet slang, to preserve real user phrasing and improve deployment robustness. Only light filtering was applied (discarding samples with <30% Korean characters or extreme lengths). YI-SANG is the largest disclosed Korean post-training corpus (5.79M prompts). For competitive math challenges, the corpus was supplemented with a translation of OpenThoughts produced by Gemini-2.5-Flash.
3. High-yield Subset Distillation: Category Ablation, Loss-Spike Filtering, and Decontamination
To manage computational costs, a "high-yield" subset was extracted. Category-wise ablation revealed that OpenThoughts benefits MATH/MCLM and Exams benefit HAE-RAE Bench/KMMLU-R, while Medical was hyperspecialized: it improved ClinicalQA but dragged down all other benchmarks, and was therefore removed along with Daily topics. Loss-spike analysis on a small proxy model (Kanana-1.5) helped identify and filter out failure modes such as infinite repetition, multiple <think> blocks, or final solutions in non-Korean languages. Trajectories exceeding 16k tokens were removed. 13-gram decontamination using MeCab-KO removed approximately 0.7% of trajectories. YI-SANG-HQ ultimately contains 260k samples.
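The rule-based filters and the 13-gram decontamination step can be sketched as below. This is a rough illustration under stated assumptions: whitespace splitting stands in for the MeCab-KO tokenizer, the repetition heuristic and its `max_line_repeats` threshold are hypothetical, the token count is passed in by the caller rather than computed, and all function names are invented for this example.

```python
from collections import Counter

MAX_TOKENS = 16_000  # "16k tokens" in the text; the exact cutoff value is assumed


def passes_format_rules(trajectory: str, n_tokens: int, max_line_repeats: int = 20) -> bool:
    """Reject multiple <think> blocks, overlong outputs, and runaway repetition."""
    if trajectory.count("<think>") != 1 or trajectory.count("</think>") != 1:
        return False
    if n_tokens > MAX_TOKENS:
        return False
    # Hypothetical repetition heuristic: the same non-empty line occurs too often.
    lines = [ln.strip() for ln in trajectory.splitlines() if ln.strip()]
    top = Counter(lines).most_common(1)
    return not (top and top[0][1] >= max_line_repeats)


def ngrams(tokens: list, n: int = 13) -> set:
    """All contiguous n-grams of a token list (empty if the list is shorter than n)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n: int = 13, tokenize=str.split) -> set:
    """Collect every n-gram that appears in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(tokenize(text), n)
    return index


def is_contaminated(sample: str, index: set, n: int = 13, tokenize=str.split) -> bool:
    """True if the sample shares at least one n-gram with the benchmark index."""
    return not ngrams(tokenize(sample), n).isdisjoint(index)


benchmark_item = " ".join(f"w{i}" for i in range(20))
index = build_benchmark_index([benchmark_item])
leaked = "intro " + " ".join(f"w{i}" for i in range(3, 18)) + " outro"
print(is_contaminated(leaked, index))  # True: shares a 13-token span
```

To approximate the described setup more closely, a morpheme-level tokenizer could be passed via the `tokenize` argument in place of `str.split`.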
4. Pure SFT Training Recipe: SFT Without RL or Web Oracles
Given the lack of strong base models and the instability of RL in mid-resource scenarios, pure SFT was used. The pipeline avoids "agreement-sampling" and "hint-based refinement" due to computational costs and leakage risks. The teacher (Qwen3-32B) regenerated all targets from scratch given only the prompt, without using web-crawled answers. A comparison of teacher models confirmed that explicit reasoning is the key to unlocking capability. All models were trained for 5 epochs (3 epochs for the 35B model due to constraints).
Loss & Training
Standard SFT was used throughout. Evaluation utilized vLLM with temperature=0.7, top_p=0.9, and max_tokens=32768. Final answers were required to be within \boxed{...} and validated using math-verify. Main experiments report the mean of three trials.
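The scoring side of this protocol can be sketched as follows. The sampling values are the ones stated above; everything else is an assumption for illustration: `extract_boxed`, `accuracy`, and `mean_over_trials` are hypothetical helper names, and exact string comparison is a simplified stand-in for the equivalence checking that math-verify performs.

```python
import statistics

# Decoding settings quoted in the text (field names match vLLM's SamplingParams).
SAMPLING = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 32768}


def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # braces never closed


def accuracy(outputs, golds) -> float:
    """Percentage of outputs whose boxed answer exactly matches the gold string."""
    correct = sum(extract_boxed(o) == g for o, g in zip(outputs, golds))
    return 100.0 * correct / len(golds)


def mean_over_trials(trial_outputs, golds) -> float:
    """Average accuracy over independent sampling runs (three in the paper)."""
    return statistics.mean(accuracy(outs, golds) for outs in trial_outputs)


golds = ["4", "7"]
trials = [
    ["답은 \\boxed{4}", "\\boxed{7}"],
    ["\\boxed{4}", "\\boxed{8}"],
    ["no final answer", "\\boxed{7}"],
]
print(round(mean_over_trials(trials, golds), 1))  # 66.7
```

Averaging over several runs matters here because sampling at temperature 0.7 makes any single run noisy.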
Key Experimental Results
Main Results
Comparison in the 20B+ range (9 benchmarks, KO-REAson-35B based on A.X-3.1 + YI-SANG-HQ):
| Benchmark | GPT-OSS-20B | DS-R1-32B | EXAONE-Deep-32B | QwQ-32B | KO-REAson-35B |
|---|---|---|---|---|---|
| KMMLU-Redux | 67.6 | 70.0 | 68.2 | 74.7 | 76.0 |
| KMMLU-Hard | 39.0 | 43.3 | 43.5 | 49.0 | 51.4 |
| Math (Ko) | 82.8 | 85.4 | 84.8 | 82.3 | 87.5 |
| HRB | 65.1 | 70.8 | 76.1 | 75.5 | 78.9 |
| Average (9 tasks) | 58.8 | 56.4 | 57.4 | 59.6 | 64.0 |
KO-REAson-35B ranked first in 5/9 tasks and second elsewhere, achieving the highest average score. Results in competition math (AIME2024-Ko, KSM) were slightly behind GPT-OSS-20B, attributed to the limited number of high-quality competitive samples in the mixture.
Ablation Study
| Configuration (Gemma-3-4B) | HRB | MCLM | KMMLU-R | Notes |
|---|---|---|---|---|
| English-only CoT | 50.3 | 48.1 | 52.2 | Monolingual English |
| Korean-only CoT | 40.6 | 25.6 | 42.5 | Monolingual Korean; math collapse |
| Lang-Mixed (zh/ko) | 48.2 | 26.3 | 45.3 | Chinese anchor; no math gain |
| Lang-Mixed (en/ko) | 54.9 | 55.8 | 53.0 | English anchor; best performance |
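To make the gaps in the table above easier to read, the sketch below computes per-benchmark differences between configurations. The scores are copied from the table; the `gains` helper is illustrative only.

```python
# Scores copied from the ablation table (Gemma-3-4B).
ABLATION = {
    "English-only CoT":   {"HRB": 50.3, "MCLM": 48.1, "KMMLU-R": 52.2},
    "Korean-only CoT":    {"HRB": 40.6, "MCLM": 25.6, "KMMLU-R": 42.5},
    "Lang-Mixed (zh/ko)": {"HRB": 48.2, "MCLM": 26.3, "KMMLU-R": 45.3},
    "Lang-Mixed (en/ko)": {"HRB": 54.9, "MCLM": 55.8, "KMMLU-R": 53.0},
}


def gains(target: str, baseline: str) -> dict:
    """Per-benchmark score difference (target minus baseline), one decimal place."""
    return {
        bench: round(ABLATION[target][bench] - ABLATION[baseline][bench], 1)
        for bench in ABLATION[target]
    }


print(gains("Lang-Mixed (en/ko)", "English-only CoT"))
print(gains("Lang-Mixed (en/ko)", "Korean-only CoT"))
print(gains("Lang-Mixed (zh/ko)", "Korean-only CoT"))
```

The en/ko mixture gains 4.6 (HRB), 7.7 (MCLM), and 0.8 (KMMLU-R) points over English-only CoT, and 30.2 MCLM points over Korean-only CoT. The zh/ko mixture moves MCLM by only 0.7 over Korean-only, consistent with the "no math gain" note.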
Category contributions (Gemma-3-4B): OpenThoughts yielded the largest gain for math (55.8); Exams were most effective for KMMLU-R (64.2); Medical improved ClinicalQA (65.6) but significantly degraded other tasks.
Key Findings
- English Anchor as the Sole Source of Math Gains: Only en/ko mixture improved MCLM; zh/ko and ru/ko failed in math. The effectiveness of anchor languages correlates strongly with the base model's pre-training distribution.
- Universal Across Families and Scales: All 9 models (4B–35B across 6 families) improved after training on YI-SANG-HQ. Gains were especially notable in math-related tasks, with small/medium models seeing an average boost of +18.6 points.
- "Free Lunch" in Cross-lingual and Multi-modal Tasks: KO-REAson-12B, though trained only on Korean text, showed gains in English reasoning (AIME25 15.6→32.0) and Korean vision-language tasks (HAERAE-Vision 15.47→26.42).
Highlights & Insights
- Decoupling "Logic Language" and "Semantic Language": Language-Mixed CoT circumvents the "translation noise vs. reasoning degradation" dilemma. This code-switching approach is transferable to any combination of "strong English base + mid-resource target language."
- Robustness Through Original Distributions: Retaining verbatim user prompts (including errors) indicates that robustness comes from the real distribution rather than sanitized templates.
- Loss-Spike as a Diagnostic Tool: Using training loss spikes to locate bad samples and iterate on rules is a simple but effective data quality cycle for large-scale SFT.
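One way to turn the loss-spike idea into a concrete check is sketched below. The summary does not describe how spikes were detected, so the robust threshold used here (median plus `k` times the median absolute deviation) is an assumption, as are the function name `find_loss_spikes` and the default `k`.

```python
import statistics


def find_loss_spikes(losses, k: float = 5.0):
    """Return indices whose loss exceeds median + k * MAD (an assumed spike criterion)."""
    losses = list(losses)
    med = statistics.median(losses)
    mad = statistics.median([abs(x - med) for x in losses])
    # Guard against a zero MAD when most losses are identical.
    threshold = med + k * max(mad, 1e-8)
    return [i for i, x in enumerate(losses) if x > threshold]


# Per-batch (or per-sample) training losses from a small proxy model run.
losses = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0, 1.0]
print(find_loss_spikes(losses))  # [5]
```

The flagged indices would then be mapped back to the samples in those batches and inspected by hand, and any recurring failure pattern can be turned into a filtering rule for the next iteration.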
Limitations & Future Work
- Gap in Competition Math: Limited competitive samples led to lower performance on AIME2024/KSM relative to specialized models. Future work will involve incorporating more translated competition sets.
- Verification Limited to Korean: The method was case-studied on Korean; its effectiveness for languages with greater orthographic or linguistic distance from English remains to be verified.
- Reliance on Strong Teachers: The trajectory quality depends on Qwen3-32B. Performance may degrade if no strong multilingual teacher is available for the target language.
- Scope Limited to SFT: This work positions the model as a strong foundation for future RL, but the absolute ceiling of reasoning via online RL has not yet been explored.
Related Work & Insights
- Vs. Translated Corpora (lightblue, Lee et al. 2025a): Unlike approaches that rely on machine-translated data, this work uses native collection and language-mixed supervision, avoiding the degradation in cultural performance.
- Vs. Monolingual CoT (all-English or all-Korean): The code-switching approach captures both reasoning power and semantic faithfulness, outperforming single-language strategies.
- Vs. RL Routes (GRPO, R1): While RL requires strong bases and verifiable rewards, this work demonstrates that pure SFT with high-quality data can approach closed-source SOTA in mid-resource scenarios.
Rating
- Novelty: ⭐⭐⭐⭐ Language-Mixed CoT is a concise and effective solution to the reasoning language problem.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 100+ ablations across 9 models, 6 families, and 9 benchmarks, including cross-modal validation.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation, logical flow, and comprehensive data-centric narrative.
- Value: ⭐⭐⭐⭐⭐ Provides the largest open Korean post-training corpus and a reproducible recipe for mid-resource language communities.