# Embedding Alignment in Code Generation for Audio
**Conference:** NeurIPS 2025 · **arXiv:** 2508.05473 · **Code:** None · **Area:** Code Intelligence · **Keywords:** code generation, audio embedding, contrastive learning, cross-modal alignment, live-coding
## TL;DR
A dual-MLP + InfoNCE contrastive learning framework is proposed to align code embeddings (distilroberta-base) and audio embeddings (wav2vec2) into a shared space, enabling LLM-based code generation pipelines to infer musical similarity directly from code without compilation or execution. CKA improves from 0.090 to 0.590.
## Background & Motivation
- Live-coding scenario: Performers must write music-generating code (e.g., Sonic Pi) in real time under time pressure and in front of an audience. LLM-assisted code generation can reduce syntactic burden, allowing creators to focus on high-level musical ideas.
- Core limitation: Existing LLM code generation models evaluate candidate code using text similarity metrics (e.g., BLEU, edit distance), but textual similarity does not imply audio similarity — two textually similar programs may produce drastically different sounds, and vice versa.
- Key observation: Computing code-to-code and audio-to-audio embedding distances on the 27 Sonic Pi tutorial examples yields a Pearson correlation of only 0.0159 (\(p=0.677\)) and a Spearman correlation of only 0.0409 (\(p=0.445\)), indicating virtually no linear or rank-order relationship between the raw embedding spaces (a reproduction sketch follows this list).
- Parameter perturbation experiment: Minor code modifications (sleep, amplitude, bpm) keep code-embedding similarity above 0.990, while audio-embedding similarity varies much more widely (dropping below 0.975) with no consistent pattern across parameter types, demonstrating that the code-to-audio mapping is non-trivial.
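For concreteness, here is a minimal sketch of how the raw-space correlation check could be reproduced; the variable names and the choice of cosine distance are assumptions, not details confirmed by the paper.

```python
# Hypothetical reproduction of the raw-space correlation check.
# `code_embs` / `audio_embs`: (N, d) arrays from the frozen encoders;
# cosine distance is an assumption, not confirmed by the paper.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr, spearmanr

def raw_space_correlation(code_embs: np.ndarray, audio_embs: np.ndarray):
    """Correlate pairwise code-code distances with audio-audio distances."""
    d_code = pdist(code_embs, metric="cosine")    # condensed vector of N*(N-1)/2 distances
    d_audio = pdist(audio_embs, metric="cosine")
    r, p_r = pearsonr(d_code, d_audio)
    rho, p_rho = spearmanr(d_code, d_audio)
    return {"pearson": (r, p_r), "spearman": (rho, p_rho)}
```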
## Method

### Data Construction
- Using 27 Sonic Pi tutorial code examples, parameters (synths, samples, notes, attack/release, amp, sleep, effects, etc.) are randomized via the Jinja templating engine.
- For each of the 27 templates, 500 distinct Sonic Pi code files are generated, yielding 13,500 code–audio pairs in total (27 × 500).
- Code embeddings are produced by distilroberta-base and audio embeddings by Meta's wav2vec2; audio is rendered at 120 BPM over 9 bars. A minimal pipeline sketch follows this list.
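The sketch below renders one randomized Sonic Pi variant with Jinja2 and embeds it with distilroberta-base. The template body, parameter ranges, checkpoint name, and mean pooling are all assumptions made for illustration; the audio side (Sonic Pi rendering at 120 BPM and wav2vec2 feature extraction) is omitted.

```python
# Illustrative template + code-embedding step; not the paper's exact pipeline.
import random
import torch
from jinja2 import Template
from transformers import AutoModel, AutoTokenizer

TEMPLATE = Template("""
live_loop :melody do
  use_synth :{{ synth }}
  play :{{ note }}, amp: {{ amp }}, release: {{ release }}
  sleep {{ sleep }}
end
""")

# Randomize synth, note, and timing parameters, as in the data construction step.
code = TEMPLATE.render(
    synth=random.choice(["saw", "tb303", "prophet"]),
    note=random.choice(["c4", "e4", "g4"]),
    amp=round(random.uniform(0.2, 1.0), 2),
    release=round(random.uniform(0.1, 2.0), 2),
    sleep=random.choice([0.25, 0.5, 1]),
)

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
encoder = AutoModel.from_pretrained("distilroberta-base")

with torch.no_grad():
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state   # (1, T, 768)
    code_emb = hidden.mean(dim=1).squeeze(0)       # mean-pooled 768-d code embedding
```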
### Symmetric Dual-MLP Architecture
- Two independent MLPs, \(\texttt{MLP}_c\) and \(\texttt{MLP}_a\), process the code and audio embeddings, respectively.
- Each MLP consists of \(L\) linear layers with hidden dimension \(d_{\text{hidden}}\), BatchNorm, and GELU activations.
- Pre-trained embeddings are projected into a shared space: \(c_i = \texttt{MLP}_c(c_i^0)\), \(a_i = \texttt{MLP}_a(a_i^0)\).
- A symmetric MLP design is chosen over attention mechanisms, as the goal is an efficient mapping that avoids introducing modality bias (see the PyTorch sketch after this list).
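A minimal PyTorch sketch of one projection head, assuming the stated Linear → BatchNorm → GELU pattern; the depth, hidden width, and output dimension below are placeholders rather than the paper's tuned values.

```python
# One projection head; two identically shaped instances give the dual-MLP setup.
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 512,
                 out_dim: int = 256, num_layers: int = 3):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(num_layers - 1):
            layers += [nn.Linear(d, hidden_dim),
                       nn.BatchNorm1d(hidden_dim),
                       nn.GELU()]
            d = hidden_dim
        layers.append(nn.Linear(d, out_dim))   # final projection into the shared space
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Two independent heads, one per modality.
mlp_c = ProjectionMLP(in_dim=768)   # distilroberta-base outputs are 768-d
mlp_a = ProjectionMLP(in_dim=768)   # wav2vec2 base outputs are also 768-d
```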
### InfoNCE Contrastive Learning
- Given \(N\) aligned code–audio embedding pairs \(\{(c_i, a_i)\}_{i=1}^N\) in a batch, cosine similarity is computed as:
\[\text{sim}(c_i, a_j) = \frac{c_i^\top a_j}{\|c_i\| \cdot \|a_j\|}\]
- The InfoNCE loss pulls positive pairs closer and pushes negative pairs apart:
\[\mathcal{L}_i = -\log \frac{\exp(\text{sim}(c_i, a_i) / \tau)}{\sum_{j=1}^N \exp(\text{sim}(c_i, a_j) / \tau)}\]
- \(\tau\) is a temperature hyperparameter controlling the sharpness of the similarity distribution.
- This self-supervised approach requires no explicit annotations and learns alignment solely from the code–audio pairings (a minimal loss implementation follows).
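The loss above maps directly onto a few lines of PyTorch. The sketch normalizes the projections so that the dot product equals cosine similarity; the temperature default is a common choice (e.g., CLIP-style), not the paper's tuned value.

```python
import torch
import torch.nn.functional as F

def info_nce(c: torch.Tensor, a: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over N aligned (code, audio) projections of shape (N, d)."""
    c = F.normalize(c, dim=-1)               # unit norm, so dot product = cosine similarity
    a = F.normalize(a, dim=-1)
    logits = c @ a.t() / tau                 # (N, N): entry (i, j) = sim(c_i, a_j) / tau
    targets = torch.arange(c.size(0), device=c.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)  # mean of L_i over the batch
```

Note that `F.cross_entropy` with its default mean reduction is exactly the per-sample loss \(\mathcal{L}_i\) averaged over the batch.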
### Evaluation Metrics
| Metric | Type | Description |
|---|---|---|
| CKA (Centered Kernel Alignment) | Structural similarity | Invariant to orthogonal transformations and isotropic scaling; captures nonlinear structural similarity |
| CCA (Canonical Correlation Analysis) | Linear correlation | Measures maximum linear correlation between two sets of multivariate variables |
| Jaccard / overlap@k | Neighborhood consistency | Whether nearest neighbors in code space correspond to nearest neighbors in audio space |
| Spearman / Pearson | Rank / linear correlation | Degree of correlation in distance rankings |
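For reference, two of these metrics fit in a few lines each. The CKA sketch below is the common linear form (the paper may use a kernel variant), and the overlap@k sketch assumes Euclidean nearest neighbors; both are illustrative rather than the paper's exact implementations.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two (N, d) representation matrices."""
    X = X - X.mean(axis=0, keepdims=True)   # column-center both representations
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    denom = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / denom)

def overlap_at_k(X: np.ndarray, Y: np.ndarray, k: int = 3) -> float:
    """Mean fraction of shared k-nearest neighbors across the two spaces."""
    def knn(Z):
        D = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
        np.fill_diagonal(D, np.inf)          # exclude self-matches
        return np.argsort(D, axis=1)[:, :k]
    nx, ny = knn(X), knn(Y)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nx, ny)]))
```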
## Key Experimental Results

### Hyperparameter Tuning (24 configurations, averaged over 5 runs)
| Setting | CKA | CCA |
|---|---|---|
| Pre-alignment baseline | 0.090 | 0.140 |
| Best config (CKA) | 0.590 | — |
| Best config (CCA) | — | 0.902 |
Both metrics achieve a 6× or greater improvement.
### Three Code Completion Scenarios
| Scenario | Method | Jaccard | overlap@3 | Spearman | Pearson |
|---|---|---|---|---|---|
| melody | Raw baseline | 0.20 | 0.33 | 0.21 | 0.18 |
| melody | Ours | 0.34 | 0.47 | 0.16 | 0.07 |
| drum | Raw baseline | 0.00 | 0.00 | — | -0.25 |
| drum | Ours | 0.16 | 0.27 | -0.05 | -0.12 |
| bass | Raw baseline | 0.20 | 0.33 | 0.24 | 0.21 |
| bass | Ours | 0.50 | 0.67 | 0.44 | 0.46 |
### Key Findings
- Consistent neighborhood metric improvements: Jaccard and overlap@3 surpass the raw baseline in all three scenarios.
- Largest gain on drum: The baseline completely fails (Jaccard=0); alignment raises it to 0.16/0.27.
- Strongest performance on bass: Jaccard 0.50, overlap@3 0.67, with clear gains in rank correlation (Spearman 0.24 → 0.44, Pearson 0.21 → 0.46).
- No audio compilation required: All inference is performed directly on code embeddings, avoiding the computational overhead of audio rendering (a ranking sketch follows this list).
- UMAP visualization: Before alignment, code (blue) and audio (orange) embeddings are completely separated; after alignment, the two modalities overlap in the embedding space, with semantically related code–audio pairs clustering in nearby regions.
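Putting the pieces together, compile-free candidate ranking could look like the sketch below. `embed_code` (a frozen encoder plus pooling, returning a `(B, 768)` tensor) and the trained `mlp_c` head are hypothetical names carried over from the earlier sketches, not the paper's API.

```python
# Hypothetical compile-free ranking: embed candidates, project into the shared
# space, and rank by cosine similarity to a reference, with no audio rendering.
import torch
import torch.nn.functional as F

def rank_candidates(reference_code: str, candidate_codes: list[str],
                    embed_code, mlp_c) -> list[tuple[str, float]]:
    """Rank LLM-generated candidates by inferred audio similarity."""
    mlp_c.eval()   # BatchNorm layers need eval mode for small/single-item batches
    with torch.no_grad():
        ref = F.normalize(mlp_c(embed_code([reference_code])), dim=-1)   # (1, d)
        cands = F.normalize(mlp_c(embed_code(candidate_codes)), dim=-1)  # (K, d)
        scores = (cands @ ref.t()).squeeze(-1)                           # cosine similarities
    order = torch.argsort(scores, descending=True)
    return [(candidate_codes[i], scores[i].item()) for i in order]
```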
## Highlights & Insights
- Novel problem formulation: This work is the first to systematically study cross-modal alignment between code and audio embeddings, filling a gap in the niche but active field of creative coding / live-coding.
- Lightweight yet effective: Significant alignment is achieved with only dual MLPs and InfoNCE, without complex cross-modal Transformer architectures.
- Practical value: Post-alignment, audio similarity can be inferred directly from code embeddings, endowing code completion tools with "music-aware" capabilities and enabling LLM-assisted live-coding systems to generate more diverse and perceptually meaningful candidate code.
- Transferable methodology: The approach of first conducting negative experiments (demonstrating the absence of correlation in raw spaces) to motivate the alignment model provides a clear and replicable argumentative structure.
## Limitations & Future Work
- Limited data scale: The dataset is expanded from only 27 Sonic Pi tutorial templates, resulting in insufficient diversity in musical style and code structure.
- Sonic Pi only: Generalization to other music programming languages (SuperCollider, TidalCycles, Strudel) is not validated.
- Unstable rank correlation metrics: Spearman/Pearson do not improve consistently in the melody and drum scenarios, indicating that alignment remains insufficient for fine-grained ranking.
- Not integrated into a real code completion system: Validation is conducted only in offline experiments and has not yet been embedded into an actual LLM code assistant pipeline.
- Embedding model selection: distilroberta-base is a general-purpose text model not optimized for code semantics; replacing it with CodeBERT or similar models may yield better results.
- Audio representation: wav2vec2 is primarily designed for speech; its music representation may be imprecise. Music-specific embeddings such as CLAP could be considered.
## Related Work & Insights
- Unlike music–text joint embedding works such as MuLan (Huang et al., 2022), this paper focuses on the unique cross-modal pair of code–audio.
- The cross-modal alignment approach (lightweight MLP + contrastive learning) is transferable to other scenarios where "code produces non-textual outputs," such as visualization code → images or simulation code → motion trajectories.
- This work offers insights for subsequent research in creative AI / AI-assisted music: how to equip LLMs with awareness of output modalities during code generation.
## Rating
- Novelty: ⭐⭐⭐ Novel perspective on code–audio alignment
- Experimental Thoroughness: ⭐⭐⭐
- Writing Quality: ⭐⭐⭐
- Value: ⭐⭐⭐ Specific value for the audio generation domain