SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation¶
Conference: ICLR 2026 arXiv: 2603.01101 Code: https://synctrack-v1.github.io Area: Music Generation / Audio Keywords: multi-track music generation, rhythmic synchronization, diffusion models, cross-track attention, evaluation metrics
TL;DR¶
SyncTrack proposes a unified architecture that combines track-shared modules (dual cross-track attention for rhythmic synchronization) with track-specific modules (learnable instrument priors for timbre preservation), together with three new rhythmic consistency metrics (IRS/CBS/CBD). It substantially improves multi-track music generation quality (FAD: 6.55→1.26; subjective MOS: 3.42 vs. 1.57).
Background & Motivation¶
Background: Multi-track music generation enables independent control over individual instrument tracks (mixing, re-arrangement); methods such as MSDM and MSG-LD employ diffusion models to learn the joint distribution of multiple tracks.
Limitations of Prior Work: Existing methods treat multi-track generation as multivariate time-series or video generation, over-emphasizing inter-track differences while neglecting shared rhythmic structure—resulting in rhythmic instability and inter-track desynchronization. MSDM achieves an FAD of 6.55 and a subjective score of only 1.57/5.0.
Key Challenge: Rhythmic information is shared across tracks (all instruments follow the same beat), whereas timbre information is track-independent (bass is low-pitched, piano is bright)—these two types of information must be handled separately.
Core Idea: Track-shared modules (shared rhythm) + track-specific modules (independent timbre) + novel evaluation metrics.
Method¶
Overall Architecture¶
Built upon a latent diffusion model (LDM): multi-track audio → STFT+Mel → VAE-encoded latent representation \(z^s \in \mathbb{R}^{C \times T \times F}\) → noising/denoising diffusion (U-Net with track-shared and track-specific modules) → VAE decoding → HiFi-GAN waveform synthesis. The four tracks (bass, drums, guitar, piano) are generated in parallel.
Key Designs¶
- Track-Shared Module:
  - Comprises ResBlock, intra-track attention, global cross-track attention, and time-specific cross-track attention.
  - Global Cross-Track Attention: each element \(z_{t,f}^s\) attends to all time-frequency positions across all tracks → maintains global rhythmic consistency and a stable beat framework.
  - Time-Specific Cross-Track Attention: at the same timestep \(t\), attention is applied over the frequency dimension across all tracks → achieves fine-grained beat alignment and chord synchronization.
  - Design Motivation: rhythmic information is shared across tracks; the two attention mechanisms handle the "global rhythmic framework" and "beat-level synchronization" respectively.
- Track-Specific Module:
  - Learnable Instrument Prior: one-hot track identifier → positional encoding → two-layer MLP → added to ResBlock output.
  - Design Motivation: timbre and pitch range are unique to each track and require independent modeling.
- Three New Evaluation Metrics:
  - IRS (Inner-track Rhythmic Stability): mean standard deviation of intra-track beat intervals; measures rhythmic stability (↓ better).
  - CBS (Cross-track Beat Synchronization): proportion of cross-track beat alignment computed via a sliding tolerance window (↑ better).
  - CBD (Cross-track Beat Dispersion): normalized mean of cross-track beat errors (↓ better).
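The two cross-track attention mechanisms differ only in which positions form the attention sequence. A minimal NumPy sketch, where the single-head unprojected attention, function names, and toy shapes are all illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no projections).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def global_cross_track_attention(z):
    # z: (tracks, T, F, d). Every position attends to all
    # time-frequency positions of every track -> one long sequence.
    s, t, f, d = z.shape
    seq = z.reshape(s * t * f, d)
    return attention(seq, seq, seq).reshape(s, t, f, d)

def time_specific_cross_track_attention(z):
    # At each timestep, attend only over (tracks x frequency),
    # i.e. over content that sounds simultaneously.
    s, t, f, d = z.shape
    per_t = z.transpose(1, 0, 2, 3).reshape(t, s * f, d)
    out = attention(per_t, per_t, per_t)
    return out.reshape(t, s, f, d).transpose(1, 0, 2, 3)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8, 16, 32))  # 4 tracks, toy latent dims
assert global_cross_track_attention(z).shape == z.shape
assert time_specific_cross_track_attention(z).shape == z.shape
```

Global attention lets every position see the whole mix (the stable beat framework), while the time-specific variant restricts attention to simultaneous content, which is what enforces beat-level alignment.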
Loss & Training¶
Standard DDPM noise-prediction objective \(\|\epsilon - \epsilon_\theta(\{z_l^s\}, l)\|^2\). The model is initialized from MusicLDM pretrained weights and trained for 320K steps with batch size 16; inference uses 200 DDIM steps.
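The training objective is the standard noise-prediction MSE: noise a clean latent at a sampled diffusion step, then regress the model's prediction onto the injected noise. A toy NumPy sketch, where the schedule values and the stand-in zero-predicting model are illustrative:

```python
import numpy as np

def ddpm_loss(z0, t, alpha_bar, eps_model, rng):
    # Forward-noise the clean latent z0 at diffusion step t, then
    # score the model's noise prediction with an MSE objective.
    eps = rng.normal(size=z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps
    eps_hat = eps_model(z_t, t)
    return np.mean((eps - eps_hat) ** 2)

rng = np.random.default_rng(0)
alpha_bar = np.linspace(0.999, 0.01, 1000)  # toy cumulative noise schedule
z0 = rng.normal(size=(4, 8, 16))            # toy multi-track latents
loss = ddpm_loss(z0, t=500, alpha_bar=alpha_bar,
                 eps_model=lambda z, t: np.zeros_like(z), rng=rng)
assert loss > 0
```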
Key Experimental Results¶
Main Results¶
| Method | FAD↓ | IRS↓ | CBS↑ | CBD↓ |
|---|---|---|---|---|
| MSDM | 6.55 | 0.167 | 0.428 | 0.156 |
| MSG-LD | 1.31 | 0.148 | 0.434 | 0.147 |
| SyncTrack | 1.26 | 0.125 | 0.487 | 0.120 |
| Ground Truth | - | 0.049 | 0.592 | 0.079 |
SyncTrack comes closest to Ground Truth on every metric; the gains in IRS and CBS in particular indicate a significant improvement in rhythmic consistency.
Ablation Study¶
| Configuration | FAD↓ | IRS↓ | CBS↑ | Note |
|---|---|---|---|---|
| Full SyncTrack | 1.26 | 0.125 | 0.487 | Complete |
| w/o global cross-track attn | ~1.8 | ~0.14 | ~0.45 | Global rhythm degraded |
| w/o time-specific cross-track attn | ~1.6 | ~0.13 | ~0.46 | Fine-grained alignment degraded |
| w/o instrument prior | ~1.4 | ~0.13 | ~0.47 | Timbre discriminability reduced |
Key Findings¶
- FAD reduced from 6.55 (MSDM) to 1.26 (−80%), and from 1.31 (MSG-LD) to 1.26 (−3.8%).
- Per-track FAD shows the largest improvement on Piano (MSG-LD: 2.04→1.11, −45.6%), attributable to the piano's wide pitch range and complex note patterns.
- IRS reduced from MSG-LD's 0.148 to 0.125; CBS improved from 0.434 to 0.487—indicating significant gains in rhythmic consistency.
- In subjective evaluation, SyncTrack mix MOS: 3.42 vs. MSG-LD: 1.57 vs. GT: 4.48.
- Ablation: global cross-track attention contributes most to global rhythmic stability; time-specific attention contributes most to beat-level synchronization.
- The three proposed metrics correlate strongly with human subjective preferences—temporal structure undetectable by FAD is effectively quantified by IRS/CBS/CBD.
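The three metrics are simple statistics over detected beat times. Since the paper's exact formulas are not reproduced here, the following pure-Python sketch is one plausible reading, with beat times assumed to come from an external beat tracker; the 50 ms tolerance and the CBD normalization by that tolerance are assumptions:

```python
from statistics import pstdev

def irs(beat_times_per_track):
    # Inner-track Rhythmic Stability: mean std of beat intervals per track.
    stds = []
    for beats in beat_times_per_track:
        intervals = [b - a for a, b in zip(beats, beats[1:])]
        stds.append(pstdev(intervals))
    return sum(stds) / len(stds)

def nearest_error(beat, other):
    # Distance from one beat to the closest beat in another track.
    return min(abs(beat - o) for o in other)

def cbs(beat_times_per_track, ref=0, tol=0.05):
    # Cross-track Beat Synchronization: fraction of reference-track beats
    # matched within +/- tol seconds in every other track.
    ref_beats = beat_times_per_track[ref]
    others = [b for i, b in enumerate(beat_times_per_track) if i != ref]
    hits = sum(1 for b in ref_beats
               if all(nearest_error(b, o) <= tol for o in others))
    return hits / len(ref_beats)

def cbd(beat_times_per_track, ref=0, tol=0.05):
    # Cross-track Beat Dispersion: mean nearest-beat error across tracks,
    # normalized (here, by the tolerance -- an assumption).
    ref_beats = beat_times_per_track[ref]
    others = [b for i, b in enumerate(beat_times_per_track) if i != ref]
    errs = [nearest_error(b, o) for b in ref_beats for o in others]
    return (sum(errs) / len(errs)) / tol

# Three perfectly locked tracks: stable intervals, full synchronization.
locked = [[0.0, 0.5, 1.0, 1.5]] * 3
assert irs(locked) == 0.0
assert cbs(locked) == 1.0
assert cbd(locked) == 0.0
```

On perfectly locked tracks the metrics reach their ideal values, matching the direction of the arrows in the main-results table (IRS↓, CBS↑, CBD↓).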
Highlights & Insights¶
- The shared/specific module separation is a generalizable design principle: it applies to any multi-channel generation task that exhibits "shared attributes + independent attributes."
- Rhythmic evaluation metrics fill a gap—FAD cannot capture temporal structure; IRS/CBS/CBD provide complementary dimensions.
- The division of labor between the two cross-track attention mechanisms is clear: global cross-track → beat framework; time-specific cross-track → beat-level synchronization.
- The learnable instrument prior requires only one-hot encoding + positional encoding + MLP, making it extremely lightweight yet effective for timbre discrimination.
- The large gap in subjective evaluation (3.42 vs. 1.57) demonstrates that rhythmic synchronization is a core factor in human auditory perception.
- Initialization from MusicLDM pretrained weights effectively leverages existing audio generation knowledge, accelerating convergence.
Limitations & Future Work¶
- Validation is limited to Slakh2100 (4 tracks: bass/drums/guitar/piano); more complex ensembles (orchestral, choral) and larger track counts remain untested.
- Conditional generation (text- or melody-guided) is unexplored—the current model is unconditional.
- The IRS/CBS/CBD metrics rely on beat detection algorithms, whose accuracy affects metric reliability.
- The global cross-track attention in the track-shared module may become a computational bottleneck as the number of tracks increases.
- The generated audio sample rate (16 kHz) is well below commercial standards (44.1 kHz); performance at higher sample rates remains to be verified.
Related Work & Insights¶
- vs. MSDM: MSDM learns the joint multi-track distribution with a unified model without distinguishing shared/specific information; reducing FAD from 6.55 to 1.26 represents a substantial improvement.
- vs. MSG-LD: MSG-LD is stronger but still neglects rhythmic synchronization; SyncTrack comprehensively outperforms it on IRS/CBS/CBD.
- vs. StemGen/JEN-1 Composer: these methods employ Transformer/LDM but lack explicit cross-track synchronization mechanisms.
- Insights: the shared/specific module separation principle applies to all multi-channel generation tasks (e.g., multi-speaker speech generation, multi-instrument arrangement).
Rating¶
- Novelty: ⭐⭐⭐⭐ Shared/specific module separation + dual cross-track attention + novel rhythmic evaluation metrics
- Experimental Thoroughness: ⭐⭐⭐⭐ Objective + subjective evaluation + ablation study, with comprehensive metric design
- Writing Quality: ⭐⭐⭐⭐ Motivation clearly articulated, figures intuitive
- Value: ⭐⭐⭐⭐ Fills the gap in multi-track music rhythmic evaluation; model design principles are generalizable