SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation

Conference: ICLR 2026
arXiv: 2603.01101
Code: https://synctrack-v1.github.io
Area: Music Generation / Audio
Keywords: multi-track music generation, rhythmic synchronization, diffusion models, cross-track attention, evaluation metrics

TL;DR

SyncTrack is a unified multi-track architecture that combines track-shared modules (dual cross-track attention for rhythmic synchronization) with track-specific modules (learnable instrument priors for timbre preservation). The paper also introduces three rhythmic consistency metrics (IRS, CBS, CBD). In multi-track music generation, FAD drops from 6.55 to 1.26 and the subjective MOS reaches 3.42, against 1.57 for the baseline.

Background & Motivation

Background: Multi-track music generation enables independent control over individual instrument tracks (mixing, re-arrangement); methods such as MSDM and MSG-LD employ diffusion models to learn the joint distribution of multiple tracks.

Limitations of Prior Work: Existing methods treat multi-track generation as multivariate time-series or video generation, over-emphasizing inter-track differences while neglecting shared rhythmic structure—resulting in rhythmic instability and inter-track desynchronization. MSDM achieves an FAD of 6.55 and a subjective score of only 1.57/5.0.

Key Challenge: Rhythmic information is shared across tracks (all instruments follow the same beat), whereas timbre information is track-independent (bass is low-pitched, piano is bright)—these two types of information must be handled separately.

Core Idea: Track-shared modules (shared rhythm) + track-specific modules (independent timbre) + novel evaluation metrics.

Method

Overall Architecture

Built upon a latent diffusion model (LDM): multi-track audio → STFT+Mel → VAE-encoded latent representation \(z^s \in \mathbb{R}^{C \times T \times F}\) for each track \(s\) (channels × time frames × frequency bins) → noising/denoising diffusion (U-Net with track-shared and track-specific modules) → VAE decoding → HiFi-GAN waveform synthesis. The four tracks (bass, drums, guitar, piano) are generated in parallel.

Key Designs

Track-Shared Module:
- Comprises ResBlock, intra-track attention, global cross-track attention, and time-specific cross-track attention.
- Global Cross-Track Attention: each element \(z_{t,f}^s\) attends to all time-frequency positions across all tracks → maintains global rhythmic consistency and a stable beat framework.
- Time-Specific Cross-Track Attention: at the same time frame \(t\) (a position on the latent time axis, not a diffusion step), attention is applied over the frequency dimension across all tracks → achieves fine-grained beat alignment and chord synchronization.
- Design Motivation: rhythmic information is shared across tracks; the two attention mechanisms handle "global rhythmic framework" and "beat-level synchronization" respectively.
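
The difference between the two cross-track attention patterns comes down to how positions are grouped into tokens. The NumPy sketch below illustrates only that grouping. It assumes a latent of shape `(S, C, T, F)` with `S` tracks, a single attention head, and identity query/key/value projections; the function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(tokens):
    """Single-head self-attention with identity Q/K/V (a simplifying assumption).

    tokens: (..., N, C) -> (..., N, C)
    """
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(tokens.shape[-1])
    return softmax(scores, axis=-1) @ tokens

def global_cross_track_attention(z):
    """z: (S, C, T, F). Every (track, time, freq) position attends to all others."""
    S, C, T, F = z.shape
    tokens = z.transpose(0, 2, 3, 1).reshape(S * T * F, C)  # one long sequence
    out = attend(tokens)
    return out.reshape(S, T, F, C).transpose(0, 3, 1, 2)

def time_specific_cross_track_attention(z):
    """z: (S, C, T, F). Within each time frame, attend over (track, freq) only."""
    S, C, T, F = z.shape
    tokens = z.transpose(2, 0, 3, 1).reshape(T, S * F, C)  # T independent groups
    out = attend(tokens)
    return out.reshape(T, S, F, C).transpose(1, 3, 0, 2)
```

In this sketch the global variant mixes information across time frames, while the time-specific variant keeps each frame isolated and only mixes across tracks and frequency bins.
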
Track-Specific Module:
- Learnable Instrument Prior: one-hot track identifier → positional encoding → two-layer MLP → added to ResBlock output.
- Design Motivation: timbre and pitch range are unique to each track and require independent modeling.
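
The instrument prior described above (track identifier → positional encoding → two-layer MLP → added to the ResBlock output) can be sketched roughly as follows. The sinusoidal form of the encoding, the ReLU activation, the layer sizes, and the random (untrained) weights are all assumptions made for illustration; the class and function names are hypothetical.

```python
import numpy as np

def sinusoidal_encoding(track_index, dim):
    """Sinusoidal encoding of an integer track index (assumed form; dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = track_index * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

class InstrumentPrior:
    """Two-layer MLP mapping a track index to a per-channel bias (illustrative only)."""

    def __init__(self, enc_dim, hidden_dim, channels, seed=0):
        rng = np.random.default_rng(seed)
        self.enc_dim = enc_dim
        # Random weights stand in for parameters that would be learned.
        self.w1 = rng.normal(scale=0.02, size=(enc_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(scale=0.02, size=(hidden_dim, channels))
        self.b2 = np.zeros(channels)

    def __call__(self, track_index):
        e = sinusoidal_encoding(track_index, self.enc_dim)
        h = np.maximum(e @ self.w1 + self.b1, 0.0)  # ReLU is an assumption
        return h @ self.w2 + self.b2                # shape: (channels,)

def add_prior(resblock_out, prior_vec):
    """resblock_out: (C, T, F). The prior is broadcast over time and frequency."""
    return resblock_out + prior_vec[:, None, None]
```

Because the prior depends only on the track index, it adds the same per-channel offset at every time-frequency position of a given track, which is one simple way to give each track its own identity.
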
Three New Evaluation Metrics:
- IRS (Inner-track Rhythmic Stability): mean standard deviation of intra-track beat intervals; measures rhythmic stability (↓ better).
- CBS (Cross-track Beat Synchronization): proportion of cross-track beat alignment computed via a sliding tolerance window (↑ better).
- CBD (Cross-track Beat Dispersion): normalized mean of cross-track beat errors (↓ better).
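
The three metrics can be approximated from per-track beat times as in the sketch below. The summary gives only one-line definitions, so the details here are assumptions: beat times (in seconds) are taken as already extracted by some beat tracker, CBS uses a fixed 70 ms tolerance with nearest-beat matching over ordered track pairs, and CBD normalizes by the mean beat interval. The paper's exact formulas may differ.

```python
import numpy as np
from itertools import permutations

def irs(beats_per_track):
    """Mean over tracks of the std of inter-beat intervals (lower = steadier)."""
    return float(np.mean([np.std(np.diff(b)) for b in beats_per_track]))

def _nearest_errors(a, b):
    """For each beat in a, absolute distance to the nearest beat in b."""
    return np.min(np.abs(a[:, None] - b[None, :]), axis=1)

def cbs(beats_per_track, tol=0.07):
    """Fraction of beats matched within tol seconds, averaged over ordered track pairs.

    The 70 ms default tolerance is an assumption, not a value from the paper.
    """
    hits = [np.mean(_nearest_errors(a, b) <= tol)
            for a, b in permutations(beats_per_track, 2)]
    return float(np.mean(hits))

def cbd(beats_per_track):
    """Mean nearest-beat error over ordered track pairs, divided by the mean beat interval."""
    errs = np.concatenate([_nearest_errors(a, b)
                           for a, b in permutations(beats_per_track, 2)])
    mean_interval = np.mean(np.concatenate([np.diff(b) for b in beats_per_track]))
    return float(np.mean(errs) / mean_interval)
```

With perfectly aligned, evenly spaced beats this sketch gives IRS = 0, CBS = 1, and CBD = 0; shifting one track away from the shared grid lowers CBS and raises CBD.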

Loss & Training

Standard DDPM noise prediction objective \(\|\epsilon - \epsilon_\theta(\{z_l^s\}, l)\|^2\), where \(l\) is the diffusion step and \(\{z_l^s\}\) are the noised latents of all tracks. Initialized from MusicLDM pretrained weights; trained for 320K steps with batch size 16; inference uses 200 DDIM steps.
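
As a rough illustration of the noise prediction objective, the sketch below noises a clean multi-track latent to step `l` and scores a denoiser by the mean squared error on the noise. The linear beta schedule and its endpoints are the common DDPM defaults and are assumed here, not taken from the paper; `eps_model` is a hypothetical stand-in for the U-Net, and the squared norm is averaged over elements rather than summed.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta) for a linear schedule (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(z0, l, alpha_bar, eps):
    """Forward noising: z_l = sqrt(abar_l) * z_0 + sqrt(1 - abar_l) * eps."""
    return np.sqrt(alpha_bar[l]) * z0 + np.sqrt(1.0 - alpha_bar[l]) * eps

def noise_prediction_loss(eps_model, z0, l, alpha_bar, rng):
    """Mean squared error between the true noise and eps_model(z_l, l).

    z0 holds the clean latents of all tracks, e.g. shape (S, C, T, F).
    """
    eps = rng.normal(size=z0.shape)
    z_l = q_sample(z0, l, alpha_bar, eps)
    return float(np.mean((eps - eps_model(z_l, l)) ** 2))
```

A model that always predicts zero noise scores a loss near 1 (the variance of the Gaussian noise), which gives a simple reference point when reading training curves.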

Key Experimental Results

Main Results

Method	FAD↓	IRS↓	CBS↑	CBD↓
MSDM	6.55	0.167	0.428	0.156
MSG-LD	1.31	0.148	0.434	0.147
SyncTrack	1.26	0.125	0.487	0.120
Ground Truth	-	0.049	0.592	0.079

SyncTrack achieves the closest performance to Ground Truth across all metrics; improvements in IRS and CBS particularly indicate significant gains in rhythmic consistency.

Ablation Study

Configuration	FAD↓	IRS↓	CBS↑	Note
Full SyncTrack	1.26	0.125	0.487	Complete
w/o global cross-track attn	~1.8	~0.14	~0.45	Global rhythm degraded
w/o time-specific cross-track attn	~1.6	~0.13	~0.46	Fine-grained alignment degraded
w/o instrument prior	~1.4	~0.13	~0.47	Timbre discriminability reduced

Key Findings

FAD reduced from 6.55 (MSDM) to 1.26 (−80.8%), and from 1.31 (MSG-LD) to 1.26 (−3.8%).
Per-track FAD shows the largest improvement on Piano (MSG-LD: 2.04→1.11, −45.6%), plausibly attributable to the piano's wide pitch range and complex note patterns.
IRS reduced from MSG-LD's 0.148 to 0.125 and CBS improved from 0.434 to 0.487, indicating clear gains in rhythmic consistency.
In subjective evaluation, SyncTrack mix MOS: 3.42 vs. MSG-LD: 1.57 vs. GT: 4.48.
Ablation: global cross-track attention contributes most to global rhythmic stability; time-specific attention contributes most to beat-level synchronization.
The three proposed metrics correlate strongly with human subjective preferences—temporal structure undetectable by FAD is effectively quantified by IRS/CBS/CBD.
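
The percentages quoted in the findings above follow from a plain relative-change calculation on the reported numbers, which can be checked with a few lines of Python. The helper name `relative_change` is just for illustration.

```python
def relative_change(old, new):
    """Signed relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# Values taken from the results reported in this summary.
comparisons = {
    "FAD, MSDM -> SyncTrack": (6.55, 1.26),          # -80.8
    "FAD, MSG-LD -> SyncTrack": (1.31, 1.26),        # -3.8
    "Piano FAD, MSG-LD -> SyncTrack": (2.04, 1.11),  # -45.6
    "IRS, MSG-LD -> SyncTrack": (0.148, 0.125),      # -15.5
    "CBS, MSG-LD -> SyncTrack": (0.434, 0.487),      # +12.2
}

for name, (old, new) in comparisons.items():
    print(f"{name}: {relative_change(old, new):+.1f}%")
```

The rhythmic metrics move by roughly 12 to 16 percent relative to MSG-LD, a much larger relative shift than the 3.8 percent change in overall FAD.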

Highlights & Insights

Separating shared from track-specific modules is a design principle that could carry over to other multi-channel generation tasks whose channels combine shared attributes with independent ones.
The rhythmic evaluation metrics fill a gap: FAD cannot capture temporal structure, and IRS/CBS/CBD measure complementary dimensions.
The division of labor between the two cross-track attention mechanisms is clear: global cross-track → beat framework; time-specific cross-track → beat-level synchronization.
The learnable instrument prior requires only one-hot encoding + positional encoding + MLP, so it is lightweight yet effective for timbre discrimination.
The large gap in subjective evaluation (3.42 vs. 1.57) suggests that rhythmic synchronization strongly shapes how listeners judge multi-track music.
Initialization from MusicLDM pretrained weights reuses existing audio generation knowledge and accelerates convergence.

Limitations & Future Work

Validation is limited to Slakh2100 (4 tracks: bass/drums/guitar/piano); more complex ensembles (orchestral, choral) and larger track counts remain untested.
Conditional generation (text- or melody-guided) is unexplored—the current model is unconditional.
The IRS/CBS/CBD metrics rely on beat detection algorithms, whose accuracy affects metric reliability.
The global cross-track attention in the track-shared module may become a computational bottleneck as the number of tracks increases.
The generated audio sample rate (16 kHz) is well below commercial standards (44.1 kHz); performance at higher sample rates remains to be verified.
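
To see why global cross-track attention may become a bottleneck, it helps to count query-key pairs. In the sketch below, global attention treats all `S*T*F` positions as one sequence, while time-specific attention runs `T` independent groups of `S*F` positions. The latent sizes `T` and `F` are hypothetical placeholders, and the count ignores heads, channels, and any downsampling inside the U-Net.

```python
def global_attention_pairs(num_tracks, T, F):
    """Query-key pairs when all (track, time, freq) positions form one sequence."""
    n = num_tracks * T * F
    return n * n

def time_specific_attention_pairs(num_tracks, T, F):
    """Query-key pairs when each time frame attends over (track, freq) only."""
    n = num_tracks * F
    return T * n * n

T, F = 64, 16  # hypothetical latent resolution, not taken from the paper
for s in (4, 8, 16):
    g = global_attention_pairs(s, T, F)
    ts = time_specific_attention_pairs(s, T, F)
    print(f"tracks={s}: global={g}, time-specific={ts}, ratio={g // ts}")
```

Both counts grow quadratically with the number of tracks, but the global variant is larger by a factor of `T`, so it is the first to become expensive as tracks or clip length increase.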

Related Work & Insights

vs. MSDM: MSDM learns the joint multi-track distribution with a unified model without distinguishing shared/specific information; reducing FAD from 6.55 to 1.26 represents a substantial improvement.
vs. MSG-LD: MSG-LD is stronger but still neglects rhythmic synchronization; SyncTrack comprehensively outperforms it on IRS/CBS/CBD.
vs. StemGen/JEN-1 Composer: these methods employ Transformer/LDM but lack explicit cross-track synchronization mechanisms.
Insights: the shared/specific module separation principle may transfer to other multi-channel generation tasks (e.g., multi-speaker speech generation, multi-instrument arrangement).

Rating

Novelty: ⭐⭐⭐⭐ Shared/specific module separation + dual cross-track attention + novel rhythmic evaluation metrics
Experimental Thoroughness: ⭐⭐⭐⭐ Objective + subjective evaluation + ablation study, with comprehensive metric design
Writing Quality: ⭐⭐⭐⭐ Motivation clearly articulated, figures intuitive
Value: ⭐⭐⭐⭐ Fills the gap in multi-track music rhythmic evaluation; model design principles are generalizable