R-VC: Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching
Conference: ACL 2025
arXiv: 2506.01014
Code: https://r-vc929.github.io/r-vc/
Area: Speech Synthesis
Keywords: voice conversion, flow matching, rhythm control, DiT, duration modeling
TL;DR
R-VC is the first zero-shot voice conversion system to achieve rhythm control. It models the target speaker's rhythm with a Mask Transformer duration model and pairs it with a Shortcut Flow Matching DiT decoder that generates high-quality speech in only 2 sampling steps, reaching a WER of 3.51 and a speaker similarity of 0.930 on LibriSpeech.
Background & Motivation
Background: Zero-shot voice conversion (VC) aims to convert the speaker's timbre while preserving the linguistic content. Mainstream methods (HierSpeech++, CosyVoice, etc.) focus primarily on preserving the prosody of the source speech.
Limitations of Prior Work:
- Preserving the source prosody can leak the source speaker's timbre information, because prosody and timbre are coupled.
- The target speaker's rhythm and speaking-rate style cannot be transferred to the synthesized speech.
- Diffusion/Flow Matching methods require \(\ge 10\) sampling steps, leading to high inference latency.
Key Challenge: High-quality speech generation requires multi-step sampling, but real-time applications demand low latency.
Core Idea: Model token-level duration to achieve rhythm control using a Mask Transformer, and reduce the sampling steps to 2 via Shortcut Flow Matching.
Method
Overall Architecture
R-VC consists of three modules: (1) Linguistic Content Representation: HuBERT + K-Means discretization + deduplication to extract token-level durations; (2) Mask Transformer Duration Model: predicts token durations style-matched to the target speaker using non-autoregressive iterative decoding; (3) Shortcut Flow Matching DiT Decoder: a 22-layer DiT (300M params) conditioned on step size \(d\) to enable 2-step generation.
Key Designs
- Content Representation and Duration Extraction:
    - Function: Eliminate speaker identity from the content representation and extract token-level durations.
    - Mechanism: Apply data perturbations (formant shifting, pitch randomization, parametric EQ) to the input speech, extract discrete tokens using HuBERT + K-Means, and then deduplicate them to obtain the content sequence and the corresponding durations. For example: \([u_1, u_1, u_1, u_2, u_3, u_3] \rightarrow [u_1, u_2, u_3]\) with durations \([3, 1, 2]\).
    - Design Motivation: Deduplication removes the prosodic patterns and retains only the linguistic content, laying the foundation for independent rhythm modeling.
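The deduplication step above amounts to run-length encoding of the frame-level token sequence. The following is a minimal sketch, assuming the HuBERT + K-Means tokens are already available as a plain Python list; the function names `deduplicate` and `expand` are illustrative and not taken from the paper. The `expand` helper shows the inverse operation, which is presumably how predicted durations would be turned back into a frame-level sequence.

```python
from itertools import groupby


def deduplicate(frame_tokens):
    """Collapse consecutive repeats into (content tokens, durations in frames)."""
    tokens, durations = [], []
    for token, run in groupby(frame_tokens):
        tokens.append(token)
        durations.append(sum(1 for _ in run))  # length of this run of equal tokens
    return tokens, durations


def expand(tokens, durations):
    """Inverse operation: repeat each token by its duration."""
    return [tok for tok, dur in zip(tokens, durations) for _ in range(dur)]


frames = ["u1", "u1", "u1", "u2", "u3", "u3"]
tokens, durations = deduplicate(frames)
print(tokens, durations)  # ['u1', 'u2', 'u3'] [3, 1, 2]
```

Because `groupby` only merges adjacent equal items, a token that reappears later in the utterance keeps its own entry and its own duration.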
- Mask Transformer Duration Model:
    - Function: Non-autoregressively predict the duration of each content token, conditioned on the target speaker.
    - Mechanism: Perform iterative mask-predict decoding (with a sinusoidal masking schedule \(p = \sin(u)\)). The input consists of the deduplicated content tokens, the partially unmasked durations, and a global speaker embedding; the model is trained with a cross-entropy loss.
    - Design Motivation: Token-level duration modeling is more fine-grained than sentence-level modeling, allowing it to capture target-speaker rhythm characteristics (such as speaking rate and pause patterns).
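The mask-predict procedure can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the range \(u \in [0, \pi/2]\), the inference-time unmasking schedule, the `MASK` marker, and the convention that a predicted class index stands for a duration are all assumptions, and `toy_predict` is a stand-in for the Transformer (which would also receive the content tokens and the speaker embedding).

```python
import math
import random

MASK = -1  # hypothetical marker for a duration slot that is still hidden


def sample_training_mask(num_tokens, rng):
    """Choose which duration slots to hide for one training example."""
    u = rng.uniform(0.0, math.pi / 2)  # assumed range for u
    p = math.sin(u)                    # masking ratio p = sin(u)
    n_mask = max(1, round(p * num_tokens))  # always hide at least one slot
    hidden = set(rng.sample(range(num_tokens), n_mask))
    return [i in hidden for i in range(num_tokens)]


def mask_predict(predict_fn, num_tokens, num_iters=4):
    """Fill in durations over several passes, most confident slots first."""
    durations = [MASK] * num_tokens
    for it in range(1, num_iters + 1):
        masked = [i for i, d in enumerate(durations) if d == MASK]
        if not masked:
            break
        probs = predict_fn(durations)  # one probability list per token
        # Assumed schedule: share of slots that stays hidden after this pass.
        remain = math.sin(math.pi / 2 * (1 - it / num_iters))
        n_unmask = max(1, len(masked) - math.floor(remain * num_tokens))
        masked.sort(key=lambda i: max(probs[i]), reverse=True)
        for i in masked[:n_unmask]:
            durations[i] = probs[i].index(max(probs[i]))  # commit the argmax class
    return durations


def toy_predict(durations):
    # Stand-in for the model: fixed distributions that ignore the input.
    return [[0.1, 0.7, 0.2], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1], [0.05, 0.05, 0.9]]


print(mask_predict(toy_predict, num_tokens=4))
```

In the last pass the schedule reaches zero, so every remaining slot is committed and the returned list contains no `MASK` entries.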
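- Shortcut Flow Matching DiT Decoder:
    - Function: Generate mel-spectrograms from noise in only 2 steps (NFE=2).
    - Mechanism: The decoder predicts a shortcut direction \(s\) conditioned on the step size \(d\), so that one update is \(x_{t+d}' = x_t + s(x_t, t, d) \cdot d\). The training objective combines the OT-CFM loss (the \(d = 0\) case) with a self-consistency loss: \[\mathcal{L}_{\text{S-CFM}} = \mathbb{E}\left[\|s_\theta(x_t,t,0) - (x_1-x_0)\|^2 + \|s_\theta(x_t,t,2d) - s_{\text{target}}\|^2\right]\] Following the shortcut-model formulation, \(s_{\text{target}} = \tfrac{1}{2} s_\theta(x_t,t,d) + \tfrac{1}{2} s_\theta(x_{t+d}',t+d,d)\), i.e. one jump of size \(2d\) should match two consecutive jumps of size \(d\). Training is hybrid, with 30% self-consistency and 70% flow matching.
    - Design Motivation: While standard CFM requires \(\ge 10\) steps, the shortcut strategy encourages the model to learn larger step jumps, allowing 2-step inference to achieve quality close to 10-step generation.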
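The update rule and the self-consistency target can be illustrated on a scalar toy problem. The sketch below assumes the general shortcut-model recipe (average of two consecutive half-size steps, treated as a constant target during training); `shortcut_sample`, `self_consistency_target`, and `toy_s` are hypothetical names, and `toy_s` replaces the DiT with the closed-form shortcut of a straight path towards a fixed point `X1`.

```python
def shortcut_sample(s_fn, x0, num_steps=2):
    """Integrate from t=0 (noise) to t=1 (data) in num_steps jumps of size d."""
    d = 1.0 / num_steps
    x, t = x0, 0.0
    for _ in range(num_steps):
        x = x + s_fn(x, t, d) * d  # x'_{t+d} = x_t + s(x_t, t, d) * d
        t += d
    return x


def self_consistency_target(s_fn, x_t, t, d):
    """Target for s(x_t, t, 2d): average of two consecutive d-sized shortcuts."""
    s_first = s_fn(x_t, t, d)
    x_mid = x_t + s_first * d           # state after the first small jump
    s_second = s_fn(x_mid, t + d, d)
    # In training this value would be held fixed (no gradient flows through it).
    return 0.5 * (s_first + s_second)


X1 = 3.0  # toy "data" point


def toy_s(x, t, d):
    # For a straight path to X1 the exact shortcut is independent of d.
    return (X1 - x) / (1.0 - t)


print(shortcut_sample(toy_s, x0=-1.0))                 # two jumps land on X1
print(self_consistency_target(toy_s, 0.0, 0.2, 0.25))  # equals toy_s(0.0, 0.2, 0.5)
```

With an exact shortcut function, two steps already recover the endpoint; the self-consistency term is what pushes a learned \(s_\theta\) towards that behaviour for large \(d\).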
Key Experimental Results
Main Results (LibriSpeech test-clean, 2620 samples)
| Method | WER↓ | SECS↑ | UTMOS↑ | RTF↓ |
|---|---|---|---|---|
| FACodec | 4.68 | 0.908 | 3.94 | - |
| CosyVoice | 5.95 | 0.933 | 4.09 | - |
| HierSpeech++ | 1.46(CER) | 0.907 | 4.09 | - |
| R-VC (NFE=2) | 3.51 | 0.930 | 4.10 | 0.12 |
Efficiency and Quality
| NFE | WER | SECS | UTMOS | RTF |
|---|---|---|---|---|
| 2 (R-VC) | 3.51 | 0.930 | 4.10 | 0.12 |
| 10 (vanilla CFM) | ~similar | ~similar | ~similar | 0.34 |
| Speed Improvement | - | - | - | 2.83× |
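The speed-up in the last row follows directly from the two RTF values. The worked line below takes RTF in its usual sense of processing time divided by audio duration, which is an assumption here since the summary does not restate the definition:

\[
\text{RTF} = \frac{T_{\text{inference}}}{T_{\text{audio}}}, \qquad \text{speed-up} = \frac{\text{RTF}_{\text{NFE}=10}}{\text{RTF}_{\text{NFE}=2}} = \frac{0.34}{0.12} \approx 2.83
\]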
Ablation Study
| Configuration | WER | SECS | EMO score |
|---|---|---|---|
| Full R-VC | 6.95 | 0.880 | 0.590 |
| w/o duration model | 7.03 | 0.878 | 0.425 (-0.165) |
| w/o spk embedding (decoder) | 8.24 | 0.873 | 0.477 |
| w/o perturbation | 7.28 | 0.869 | 0.580 |
| w/ sentence-level duration | 9.86 | 0.872 | 0.528 |
Key Findings
- The duration model is crucial for emotion transfer: Removing it degrades the EMO score from 0.590 to 0.425.
- Token-level > Sentence-level duration: Sentence-level duration causes the WER to surge to 9.86.
- NFE=2 achieves similar quality to NFE=10: This validates the effectiveness of Shortcut Flow Matching.
- Rhythm classification accuracy of 90.2%: Precise control across three rhythm levels (slow, normal, fast) is achieved.
Highlights & Insights
- First to achieve rhythm control in zero-shot VC, filling a gap in the voice conversion domain. This ability is highly valuable for applications such as personalized TTS and dubbing.
- Shortcut Flow Matching integrates the idea of consistency distillation into flow matching. The model is trained in a single stage on both the instantaneous flow-matching target and the larger-step self-consistency target, which is simpler than running a separate distillation stage.
- Data perturbation + deduplication pipeline is a valuable content extraction strategy: perturbation removes speaker characteristics, while deduplication removes prosodic patterns. This two-step process decouples content, timbre, and prosody.
Limitations & Future Work
- Fine-grained duration prediction is unstable and can sometimes cause over-prolonged, abnormal pronunciations.
- Trained exclusively on English data; cross-lingual capabilities remain unexplored.
- Single-step generation (NFE=1) has not been investigated.
- Trained on only 20K hours of data (vs. CosyVoice's 171K). While it achieves comparable performance, there is likely still room for improvement.
Related Work & Insights
- vs CosyVoice: CosyVoice uses 171K hours of data, whereas R-VC uses only 20K but achieves comparable performance; R-VC additionally supports rhythm control.
- vs HierSpeech++: HierSpeech++ performs better on CER, but R-VC outperforms it in MOS and emotion transfer.
- vs Diff-HierVC: R-VC outperforms Diff-HierVC in both WER and emotion scores.
Rating
- Novelty: ⭐⭐⭐⭐ The first integration of rhythm control and efficient inference, filling a gap in the VC domain.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers three dimensions (VC, emotion transfer, and rhythm control), including MOS evaluation.
- Writing Quality: ⭐⭐⭐⭐ Clear method description and well-designed ablation study.
- Value: ⭐⭐⭐⭐ Practical value for speech synthesis and personalization scenarios.