R-VC: Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching¶

Conference: ACL 2025
arXiv: 2506.01014
Code: https://r-vc929.github.io/r-vc/
Area: Speech Processing (Voice Conversion)
Keywords: voice conversion, flow matching, rhythm control, DiT, duration modeling

TL;DR¶

R-VC is presented as the first zero-shot voice conversion system with rhythm control. A Mask Transformer duration model captures the target speaker's rhythm style, and a Shortcut Flow Matching DiT decoder generates high-quality speech in only 2 sampling steps. On LibriSpeech test-clean it reaches a WER of 3.51 and a speaker similarity (SECS) of 0.930.

Background & Motivation¶

Background: Zero-shot voice conversion (VC) aims to convert the speaker's timbre while preserving the linguistic content. Mainstream methods (HierSpeech++, CosyVoice, etc.) focus primarily on preserving the prosody of the source speech.

Limitations of Prior Work:
- Preserving the source prosody can cause speaker timbre information to leak through prosodic coupling.
- It is impossible to transfer the target speaker's rhythm/speaking-rate style to the synthesized speech.
- Diffusion/Flow Matching methods require $\ge 10$ sampling steps, leading to high inference latency.

Key Challenge: High-quality speech generation requires multi-step sampling, but real-time applications demand low latency.

Core Idea: Model token-level duration to achieve rhythm control using a Mask Transformer, and reduce the sampling steps to 2 via Shortcut Flow Matching.

Method¶

Overall Architecture¶

R-VC consists of three modules: (1) Linguistic Content Representation: HuBERT + K-Means discretization + deduplication to extract token-level durations; (2) Mask Transformer Duration Model: predicts token durations style-matched to the target speaker using non-autoregressive iterative decoding; (3) Shortcut Flow Matching DiT Decoder: a 22-layer DiT (300M params) conditioned on step size $d$ to enable 2-step generation.

Key Designs¶

Content Representation and Duration Extraction:
- Function: Eliminate speaker identity from the content representation and extract token-level duration.
- Mechanism: Apply data perturbations (formant shifting, pitch randomization, parametric EQ) to the input speech, extract discrete tokens using HuBERT + K-Means, and then deduplicate them to obtain the content sequence and corresponding durations. For example: $[u_1, u_1, u_1, u_2, u_3, u_3] \rightarrow [u_1, u_2, u_3]$ with durations $[3, 1, 2]$.
- Design Motivation: Deduplication eliminates the prosodic patterns and retains only the pure linguistic content, laying the foundation for independent rhythm modeling.
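
The deduplication step above can be sketched in a few lines of Python. This is only an illustration of the run-length idea on the example sequence; the function names `dedup_with_durations` and `expand` are hypothetical, and the inverse `expand` step (repeating each token by its duration to get back a frame-level sequence) is an assumption about how durations are consumed downstream, not something stated in the summary.

```python
from itertools import groupby


def dedup_with_durations(frame_tokens):
    """Collapse runs of identical frame-level tokens into (tokens, durations)."""
    tokens, durations = [], []
    for token, run in groupby(frame_tokens):
        tokens.append(token)
        durations.append(sum(1 for _ in run))  # length of the run
    return tokens, durations


def expand(tokens, durations):
    """Assumed inverse: repeat each token by its duration (frame-level sequence)."""
    return [tok for tok, dur in zip(tokens, durations) for _ in range(dur)]


frames = ["u1", "u1", "u1", "u2", "u3", "u3"]
tokens, durations = dedup_with_durations(frames)
print(tokens, durations)  # ['u1', 'u2', 'u3'] [3, 1, 2]
```

Swapping `durations` for values predicted for a different speaker before calling `expand` is, under this reading, what lets rhythm change while the content sequence stays fixed.
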
Mask Transformer Duration Model:
- Function: Non-autoregressively predict the duration of each content token, conditioned on the target speaker.
- Mechanism: Perform iterative decoding using mask-predict (with a sinusoidal schedule $p = \sin(u)$) on input composed of deduplicated content tokens + partially unmasked durations + global speaker embeddings, trained using a cross-entropy loss.
- Design Motivation: Token-level duration modeling is more fine-grained than sentence-level modeling, allowing it to capture target-speaker rhythm characteristics (such as speaking rate and pause patterns).
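
A minimal sketch of mask-predict style iterative decoding may help make the mechanism concrete. Everything model-specific here is an assumption: `toy_predictor` is a hypothetical stand-in that returns random durations and confidences instead of running a Mask Transformer conditioned on a speaker embedding, the number of iterations is arbitrary, and the inference-time schedule (fraction still masked equal to $\sin(\frac{\pi}{2}(1 - i/T))$) is one plausible reading of the sinusoidal schedule rather than the paper's confirmed setting. Committing the most confident predictions first is the usual mask-predict convention.

```python
import math
import random

MASK = None  # placeholder for a duration that has not been decided yet


def toy_predictor(tokens, durations, rng):
    """Hypothetical stand-in for the duration model: (duration, confidence) per position."""
    return [(rng.randint(1, 5), rng.random()) for _ in tokens]


def mask_predict(tokens, num_iters=4, seed=0):
    rng = random.Random(seed)
    n = len(tokens)
    durations = [MASK] * n  # start with every duration masked
    for it in range(1, num_iters + 1):
        preds = toy_predictor(tokens, durations, rng)
        # Assumed schedule: fraction of positions left masked after this iteration.
        ratio = math.sin(math.pi / 2 * (1 - it / num_iters))
        num_masked_next = int(math.floor(ratio * n))
        masked = [i for i in range(n) if durations[i] is MASK]
        # Commit the most confident predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        num_commit = max(0, len(masked) - num_masked_next)
        for i in masked[:num_commit]:
            durations[i] = preds[i][0]
    return durations


print(mask_predict(["u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8"]))
```

Because the ratio reaches zero at the last iteration, every position is filled by the time the loop ends.
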
Shortcut Flow Matching DiT Decoder:
- Function: Generate mel-spectrograms from noise in only 2 steps (NFE=2).
- Mechanism: A shortcut step of size $d$ updates the sample as $x'_{t+d} = x_t + s_\theta(x_t, t, d) \cdot d$. Training jointly optimizes the OT-CFM target and a self-consistency target: $$L_{\text{S-CFM}} = \mathbb{E}\left[\|s_\theta(x_t,t,0) - (x_1-x_0)\|^2 + \|s_\theta(x_t,t,2d) - s_{\text{target}}\|^2\right]$$ Training is hybrid, with 30% self-consistency and 70% flow matching.
- Design Motivation: While standard CFM requires $\ge 10$ steps, the shortcut strategy encourages the model to learn larger step jumps, allowing 2-step inference to achieve quality close to 10-step generation.
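
The update rule and the self-consistency target can be illustrated on scalars. This is a sketch under stated assumptions, not the paper's implementation: `toy_s` is a hypothetical closed-form shortcut function for a single fixed target `X1` on a straight path, standing in for the DiT, and `shortcut_target` assumes the definition from the original shortcut-model formulation (the average of two consecutive steps of size $d$), which this summary does not spell out.

```python
X1 = 2.0  # hypothetical fixed data point; a real decoder would output mel-spectrograms


def toy_s(x, t, d):
    """Toy shortcut function: on a straight path to X1 the direction does not depend on d."""
    return (X1 - x) / (1.0 - t)


def sample(s, x0, num_steps=2):
    """Integrate from noise x0 at t=0 to t=1 with num_steps shortcut steps (NFE = num_steps)."""
    d = 1.0 / num_steps
    x, t = x0, 0.0
    for _ in range(num_steps):
        x = x + s(x, t, d) * d  # x'_{t+d} = x_t + s(x_t, t, d) * d
        t += d
    return x


def shortcut_target(s, x_t, t, d):
    """Assumed self-consistency target: one 2d step should match two consecutive d steps."""
    s1 = s(x_t, t, d)
    x_mid = x_t + s1 * d
    s2 = s(x_mid, t + d, d)
    return 0.5 * (s1 + s2)


print(sample(toy_s, 0.0, num_steps=2))  # 2.0
```

In training, the target would be computed without gradients and regressed against $s_\theta(x_t, t, 2d)$; here the toy function already satisfies the constraint exactly, so the two values coincide.
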

Key Experimental Results¶

Main Results (LibriSpeech test-clean, 2620 samples)¶

Method	WER↓	SECS↑	UTMOS↑	RTF↓
FACodec	4.68	0.908	3.94	-
CosyVoice	5.95	0.933	4.09	-
HierSpeech++	1.46 (CER)	0.907	4.09	-
R-VC (NFE=2)	3.51	0.930	4.10	0.12

Efficiency and Quality¶

NFE	WER	SECS	UTMOS	RTF
2 (R-VC)	3.51	0.930	4.10	0.12
10 (vanilla CFM)	~similar	~similar	~similar	0.34
Speed Improvement	-	-	-	2.83×
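
As a quick check of the speed-up row, taking RTF in its usual sense of processing time divided by audio duration (an assumption, since the table does not define it):

$$\text{speed-up} = \frac{\text{RTF}_{\text{NFE}=10}}{\text{RTF}_{\text{NFE}=2}} = \frac{0.34}{0.12} \approx 2.83$$

Under the same reading, an RTF of 0.12 corresponds to roughly 1.2 s of compute for 10 s of audio.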

Ablation Study¶

Configuration	WER	SECS	EMO score
Full R-VC	6.95	0.880	0.590
w/o duration model	7.03	0.878	0.425 (-0.165)
w/o spk embedding (decoder)	8.24	0.873	0.477
w/o perturbation	7.28	0.869	0.580
w/ sentence-level duration	9.86	0.872	0.528

Key Findings¶

The duration model is crucial for emotion transfer: Removing it degrades the EMO score from 0.590 to 0.425.
Token-level > Sentence-level duration: Replacing token-level with sentence-level duration raises the WER from 6.95 to 9.86.
NFE=2 achieves similar quality to NFE=10: This supports the effectiveness of Shortcut Flow Matching, with RTF dropping from 0.34 to 0.12.
Rhythm classification accuracy of 90.2%: Precise control across three rhythm levels (slow, normal, fast) is achieved.

Highlights & Insights¶

First to achieve rhythm control in zero-shot VC, filling a gap in the voice conversion domain. This ability is highly valuable for applications such as personalized TTS and dubbing.
Shortcut Flow Matching integrates the idea behind consistency distillation directly into flow matching. Because the flow-matching target and the self-consistency (large-jump) target are learned jointly during training, the approach is simpler than a standalone distillation stage.
Data perturbation + deduplication pipeline is a valuable content extraction strategy: perturbation removes speaker characteristics, while deduplication removes prosodic patterns. This two-step process decouples content, timbre, and prosody.

Limitations & Future Work¶

Fine-grained duration prediction is unstable and can sometimes cause over-prolonged, abnormal pronunciations.
Trained exclusively on English data; cross-lingual capabilities remain unexplored.
Single-step generation (NFE=1) has not been investigated.
Trained on only 20K hours of data (vs. CosyVoice's 171K). While it achieves comparable performance, there is likely still room for improvement.

vs CosyVoice: CosyVoice uses 171K hours of data, whereas R-VC uses only 20K but achieves comparable performance; R-VC additionally supports rhythm control.
vs HierSpeech++: HierSpeech++ performs better on CER, but R-VC outperforms it in MOS and emotion transfer.
vs Diff-HierVC: R-VC outperforms Diff-HierVC in both WER and emotion scores.

Rating¶

Novelty: ⭐⭐⭐⭐ The first integration of rhythm control and efficient inference, filling a gap in the VC domain.
Experimental Thoroughness: ⭐⭐⭐⭐ Covers three dimensions (VC, emotion-transfer, and rhythm control), including MOS evaluation.
Writing Quality: ⭐⭐⭐⭐ Clear method description and well-designed ablation study.
Value: ⭐⭐⭐⭐ Practical value for speech synthesis and personalization scenarios.