Synergy over Discrepancy: A Partition-Based Approach to Multi-Domain LLM Fine-Tuning¶
Conference: NeurIPS 2025 · arXiv: 2511.07198 · Code: Not released · Area: LLM/NLP · Keywords: multi-domain fine-tuning, inter-domain synergy, partition strategy, generalization bound, Adapter
TL;DR¶
This paper proposes a partition-based multi-stage fine-tuning framework that strategically partitions multiple domains into subsets (stages) to maximize inter-domain synergy while minimizing negative transfer, and derives a novel generalization bound to theoretically support the partitioning strategy.
Background & Motivation¶
Single-domain fine-tuning of LLMs is well-studied (LoRA, Adapter, etc.), but practical scenarios frequently require simultaneous adaptation to multiple heterogeneous domains (e.g., clinical text, social media, legal documents), a problem that remains largely underexplored.
Limitations of existing multi-domain methods:
- Joint fine-tuning: training all domains together causes mutual interference among domain features (negative transfer).
- Independent fine-tuning: training each domain separately fails to exploit inter-domain synergy.
- Adversarial training / distribution alignment: approaches such as MDAN and M3SDA neglect synergistic inter-domain relationships.
- Adapter methods: reduce parameter counts but do not leverage domain partitioning to maximize overall performance.
Core Problem: How to effectively and efficiently fine-tune a single LLM across multiple heterogeneous domains, exploiting inter-domain synergy while mitigating negative transfer?
Method¶
Overall Architecture¶
The \(k\) source domains are partitioned into \(M\) disjoint stages \(S_1, \ldots, S_M\); within each stage, the assigned domains are fine-tuned jointly. The framework consists of two components:
1. Partition optimization: maximize inter-domain synergy while minimizing intra-stage discrepancy and capacity overhead.
2. Multi-stage Adapter fine-tuning: sequentially train the LLM backbone and domain-specific Adapters stage by stage.
Key Design 1: Partition Objective¶
The partition \((S_1, \ldots, S_M)\) is optimized to maximize a within-stage synergy-minus-discrepancy score, net of capacity overhead, i.e. an objective of the form

\[\max_{S_1, \ldots, S_M} \; \sum_{t=1}^{M} \sum_{\substack{i, j \in S_t \\ i < j}} \big[\lambda \, s(\mathcal{D}_i, \mathcal{D}_j) - d(\mathcal{D}_i, \mathcal{D}_j)\big] \;-\; \sum_{t=1}^{M} \text{Cap}(S_t)\]
where:
- Discrepancy measure: \(d(\mathcal{D}_i, \mathcal{D}_j) = \text{JS}(P_i, P_j)\), the Jensen–Shannon divergence between domain distributions
- Synergy measure: \(s(\mathcal{D}_i, \mathcal{D}_j) = \frac{1}{2}\big(\text{Jacc}(V_i, V_j) + \cos(\mu_i, \mu_j)\big)\), combining vocabulary overlap with mean-embedding cosine similarity
- \(\lambda\) balances the relative weight of synergy versus discrepancy
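A minimal Python sketch of how these two matrices could be computed (the code is not released, so `js_divergence` and `build_matrices` are hypothetical names; domains are assumed to be given as tokenized corpora plus mean embedding vectors \(\mu_i\)):

```python
# Sketch (not the authors' code): pairwise discrepancy/synergy matrices
# following the definitions above.
from collections import Counter
import numpy as np

def js_divergence(p: Counter, q: Counter) -> float:
    """Jensen-Shannon divergence between two unigram distributions."""
    vocab = sorted(set(p) | set(q))
    P = np.array([p[w] for w in vocab], dtype=float); P /= P.sum()
    Q = np.array([q[w] for w in vocab], dtype=float); Q /= Q.sum()
    M = 0.5 * (P + Q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

def build_matrices(corpora, embeddings):
    """corpora: per-domain token lists; embeddings: per-domain mean vectors mu_i."""
    k = len(corpora)
    counts = [Counter(toks) for toks in corpora]
    vocabs = [set(c) for c in counts]
    d = np.zeros((k, k)); s = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            # Discrepancy: JS divergence between unigram distributions.
            d[i, j] = d[j, i] = js_divergence(counts[i], counts[j])
            # Synergy: average of vocabulary Jaccard and embedding cosine.
            jacc = len(vocabs[i] & vocabs[j]) / len(vocabs[i] | vocabs[j])
            mu_i, mu_j = embeddings[i], embeddings[j]
            cos = float(mu_i @ mu_j / (np.linalg.norm(mu_i) * np.linalg.norm(mu_j)))
            s[i, j] = s[j, i] = 0.5 * (jacc + cos)
    return d, s
```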
Key Design 2: Generalization Bound¶
Theorem 3.1 (Multi-source concurrent generalization bound): informally, each domain's generalization error is bounded by the empirical risk plus terms for intra-stage discrepancy and Adapter capacity (via Rademacher complexity), with higher inter-domain synergy tightening the bound.
Theorem 3.2 (Optimality of multi-stage partition): The optimal partition \((S_1^*, \ldots, S_M^*)\) minimizes the right-hand side of the above bound.
Corollary 3.1: High-synergy subsets tend to be assigned to the same stage — when \(\Lambda > \lambda^{-1}(\gamma + \text{Cap}(U))\), a high-synergy domain set \(U\) is necessarily grouped together in the optimal partition.
Key Design 3: Algorithm¶
Procedure:
1. Compute pairwise discrepancy and synergy matrices over all domain pairs (\(O(k^2)\)).
2. Solve for the optimal partition via single-linkage hierarchical clustering (\(O(k^2 \log k)\)) or ILP; a clustering sketch follows the constraints below.
3. Initialize \(\theta^0 = \theta^*\) (the pretrained parameters) and \(\phi_j^0 = 0\).
4. Stage-wise optimization: at stage \(t\), jointly fine-tune \(\theta\) and \(\{\phi_j\}_{j \in S_t}\) on the data of the domains in \(S_t\).
Constraints: \(\|\theta^t - \theta^{t-1}\|_2 \leq \rho_\theta\), \(\|\phi_j^t\|_2 \leq \rho_\phi\).
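A minimal sketch of the clustering in step 2, assuming the matrices `d` and `s` from `build_matrices` above; `partition_domains` is a hypothetical helper, and since the paper's solver is unreleased, SciPy single-linkage on the shifted dissimilarity \(d - \lambda s\) stands in for it:

```python
# Sketch of step 2: cluster domains on a combined dissimilarity so that
# low-discrepancy, high-synergy domains land in the same stage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def partition_domains(d, s, lam: float, M: int):
    """Return a list of M stages, each a list of domain indices."""
    D = d - lam * s                       # low D = compatible pair
    off = D[~np.eye(len(D), dtype=bool)]
    D = D - off.min()                     # shift off-diagonals to be nonnegative
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="single")
    labels = fcluster(Z, t=M, criterion="maxclust")  # cut dendrogram into M stages
    return [list(np.where(labels == m)[0]) for m in range(1, M + 1)]
```

Cutting the dendrogram with `criterion="maxclust"` yields at most \(M\) stages; an ILP would instead optimize the partition objective exactly, at higher cost.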
Loss & Training¶
Each stage employs a weighted multi-domain loss \(\sum_{j \in S_t} \alpha_j^t \, \mathcal{L}(\theta, \phi_j; \mathcal{D}_j)\), with parameter-norm constraints preserving the implicit regularization of pretraining. Adapters for out-of-stage domains remain frozen: \(\phi_j^t = \phi_j^{t-1}\) for \(j \notin S_t\).
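Below is a minimal sketch of one training stage under these constraints (ours, not the authors' implementation): `model.backbone`, `model.adapters`, and `model.domain_loss` are assumed interfaces, and the norm constraints are enforced by projection after each optimizer step:

```python
# Sketch of one stage of PMS-FT-style training (assumed interfaces).
import torch

def run_stage(model, stage_domains, loaders, alpha, opt, rho_theta, rho_phi, steps):
    # Reference backbone parameters theta^{t-1} for the drift constraint.
    theta_ref = [p.detach().clone() for p in model.backbone.parameters()]
    # Freeze adapters of out-of-stage domains (phi_j^t = phi_j^{t-1}).
    for j, adapter in model.adapters.items():
        for p in adapter.parameters():
            p.requires_grad = j in stage_domains
    for _ in range(steps):
        opt.zero_grad()
        # Weighted multi-domain loss over the current stage.
        loss = sum(alpha[j] * model.domain_loss(j, next(loaders[j]))
                   for j in stage_domains)
        loss.backward()
        opt.step()
        with torch.no_grad():  # project back into the constraint set
            _project_delta(model.backbone.parameters(), theta_ref, rho_theta)
            for j in stage_domains:
                _project_norm(model.adapters[j].parameters(), rho_phi)

def _project_delta(params, ref, rho):
    """Enforce ||theta^t - theta^{t-1}||_2 <= rho by rescaling the update."""
    params = list(params)
    deltas = [p - r for p, r in zip(params, ref)]
    norm = torch.sqrt(sum((d ** 2).sum() for d in deltas))
    if norm > rho:
        for p, r, d in zip(params, ref, deltas):
            p.copy_(r + d * (rho / norm))

def _project_norm(params, rho):
    """Enforce ||phi_j||_2 <= rho by rescaling the adapter parameters."""
    params = list(params)
    norm = torch.sqrt(sum((p ** 2).sum() for p in params))
    if norm > rho:
        for p in params:
            p.mul_(rho / norm)
```

Chaining the pieces end to end would look like `d, s = build_matrices(corpora, embeddings)`, `stages = partition_domains(d, s, lam, M)`, then one `run_stage(...)` call per stage.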
Key Experimental Results¶
Main Results¶
Performance on four tasks with three LLM backbones; each cell reports LLaMA2-7B / LLaMA2-13B / Falcon-40B:
| Method | NSum | Q&A | Sent | Topic | Type |
|---|---|---|---|---|---|
| FULL | 41.2/42.1/43.2 | 64.7/66.3/68.2 | 89.0/89.8/90.4 | 86.5/87.1/88.3 | Baseline |
| LoRA | 41.0/42.0/42.5 | 63.9/65.1/66.5 | 88.4/89.1/— | 86.2/86.9/— | Single-domain |
| MDAN | 39.7/40.5/41.7 | 62.8/64.0/66.1 | 88.1/88.9/89.3 | 85.9/86.3/87.0 | Domain adaptation |
| M3SDA | 40.5/41.7/42.3 | 63.1/64.9/66.6 | 88.6/89.4/89.9 | 86.1/86.7/87.4 | Domain adaptation |
| PMS-FT (Ours) | 42.1/43.1/44.3 | 66.1/67.8/69.5 | 89.9/90.5/91.1 | 87.4/88.0/89.0 | Partition multi-stage |
PMS-FT outperforms all baselines across all models and tasks.
Ablation Study¶
- Partition vs. no partition: The partitioning strategy outperforms full-domain joint fine-tuning across all configurations, validating the value of reducing inter-domain interference.
- Contribution of synergy measure: Removing the synergy term degrades performance, demonstrating that reducing discrepancy alone is insufficient and inter-domain complementarity must be exploited.
- Effect of stage count \(M\): The optimal \(M\) depends on the number of domains and their inter-domain relationships; too few stages fail to isolate conflicting domains, while too many prevent synergy exploitation.
- Effect of capacity constraints: Appropriate \(\rho_\theta, \rho_\phi\) constraints prevent catastrophic forgetting while retaining sufficient adaptation capacity.
Key Findings¶
- Inter-domain synergy and discrepancy are equally important dimensions in multi-domain fine-tuning — attending to only one is insufficient.
- The additional computational overhead of partitioning is negligible (\(O(k^2 \log k)\)), yet the performance gains are substantial.
- The theoretical generalization bound is consistent with experimental trends: domain groupings with low discrepancy and high synergy consistently yield better performance.
- The method scales to LLMs of varying sizes (validated from 7B to 40B parameters).
Highlights & Insights¶
- Formal modeling of inter-domain synergy: This work is the first to incorporate "synergy" as an explicit optimization objective in multi-domain fine-tuning.
- Unification of theory and practice: The generalization bound not only explains why partitioning is effective but also directly guides the design of the partitioning algorithm.
- Generality of the framework: The approach is compatible with any parameter-efficient fine-tuning method (LoRA, Adapter, etc.).
- Practical usability: The partitioning step incurs negligible computation and does not increase GPU memory during fine-tuning.
Limitations & Future Work¶
- Limited domain scale: Experiments cover at most \(k \leq 10\) domains; performance under large-scale domain settings (e.g., hundreds of domains) remains unknown.
- Heuristic synergy measure: Vocabulary Jaccard similarity combined with embedding cosine similarity may lack precision; more sophisticated domain relationship modeling offers room for improvement.
- Stage ordering not optimized: The current approach assumes stage order is irrelevant, yet sequential effects across stages may exist in practice.
- Static partitioning: Partitions are determined prior to training and are not adjusted during training — dynamic partitioning may yield further gains.
- Focus on NLP tasks: The method has not been validated on multimodal or non-textual domains.
- Loose capacity bounds: The Rademacher complexity bound for Adapter capacity may not be sufficiently tight.
Related Work & Insights¶
- Distinction from Ganin & Lempitsky (2015) adversarial domain training: No adversarial objective is used; domain relationships are instead managed directly through partitioning.
- Complementarity with LoRA (Hu et al., 2021): LoRA is a parameter-efficient single-domain method; the proposed partitioning strategy can be applied on top of it.
- Distinction from continual learning (Xu et al., 2025): Continual learning assumes tasks arrive sequentially, whereas this work assumes all domains are simultaneously available.
- Broader implications: The synergy–discrepancy partitioning paradigm generalizes to multi-task learning, federated learning, and other multi-source settings.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The partitioning strategy and synergy modeling constitute novel framework-level contributions.
- Theoretical Depth: ⭐⭐⭐⭐ — The generalization bound derivation is complete, though some assumptions (Lipschitz continuity, bounded norms) are standard.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Validation across multiple models and tasks is thorough, though the number of domains tested is limited.
- Writing Quality: ⭐⭐⭐⭐ — The structure is clear, with tight integration between theory and experiments.
- Value: ⭐⭐⭐⭐ — The method is simple, effective, and readily integrable into existing fine-tuning pipelines.
- Overall: ⭐⭐⭐⭐ (8/10) — Addresses a practical pain point in multi-domain fine-tuning with a well-grounded combination of theory and practice.