FilmComposer: LLM-Driven Music Production for Silent Film Clips¶

TL;DR¶

FilmComposer presents itself as the first system to combine an LLM-based multi-agent system with both waveform and symbolic music generation. It simulates the workflow of professional musicians (spotting \(\rightarrow\) composing \(\rightarrow\) arranging \(\rightarrow\) mixing) to automatically generate film soundtracks from silent film clips that are high in audio quality (48kHz), musically expressive, and developmental, meaning that themes evolve over time.

Background & Motivation¶

Core Pain Point: Film soundtrack production requires highly specialized skills, and the manual pipeline is time-consuming and costly. Although existing AI music generation methods have made progress, they still fall short in three core dimensions:
- Audio Quality: Current waveform music generation models generally sample at only 32kHz, which falls short of the cinema-grade 48kHz/24bit standard.
- Musicality: Generated melodies lack expressiveness and aesthetic appeal, and the available control mechanisms are limited.
- Musical Development: The ability of themes and motifs to evolve over time. It is crucial for audio-visual consistency but largely overlooked by existing work.
Limitations of Prior Work:
- Waveform music generation (e.g., MusicGen, Stable Audio) utilizes abundant data but suffers from poor quality and weak control.
- Symbolic music generation (MIDI) offers fine-grained control but suffers from data scarcity and monotonous timbres.
- Video-to-music methods (e.g., CMT, VidMuse) only focus on either rhythmic or semantic alignment, failing to consider the professional requirements of film scoring.
Key Insight: The creative process of professional musicians is staged (analysis \(\rightarrow\) spotting \(\rightarrow\) composing \(\rightarrow\) arranging \(\rightarrow\) mixing), which can be simulated by an AI system as a complete pipeline.

Method¶

Overall Architecture¶

FilmComposer consists of three major modules, simulating the complete workflow of a human musician:
1. Visual Processing: Analyzes the film clip to extract rhythm control spots, visual semantic descriptions, and motion features.
2. Rhythm-Controllable MusicGen: Generates the main melody conditioned on rhythm and descriptions.
3. Multi-Agent Assessment, Arrangement & Mix: A multi-agent system that evaluates melody quality, arranges instrumentation, and mixes the final output.

Key Designs¶

1. Visual Processing Module¶

Extracts three layers of information from the film clip:
- Rhythm Spots: Improves CMT into CRT (Controllable Rhythm Transformer). It automatically identifies the main melody track using track coverage \(Coverage_i = T_i / T_{music}\) and note ratio \(R_i = n_i / \sum n_i\), then flattens that track into a rhythm sequence.
- Visual Semantic Attributes: Defines 7 categories of attributes (scene, brightness, color tone, action, emotional tone, shot size, theme) based on audio-visual language theory, identified using GPT-4V.
- Motion Description: Extracts optical flow velocity \(S_{motion}\), motion saliency \(S_{saliency}\), shot transitions, and narrative development to guide arrangement.
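The main-melody selection step can be illustrated with a minimal sketch. The summary gives the two statistics but not how they are combined, so the equal-weight sum used below is an assumption, as is the reading of \(T_i\) as the total time during which track \(i\) is sounding. The `Track` class and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Track:
    name: str
    notes: List[Tuple[float, float]]  # (onset_sec, duration_sec) pairs


def sounding_time(notes):
    """Total time covered by at least one note (union of note intervals)."""
    total, cur_start, cur_end = 0.0, None, None
    for start, end in sorted((on, on + dur) for on, dur in notes):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start  # close the previous interval
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)       # overlapping or touching: merge
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def pick_main_melody(tracks, music_duration):
    """Score each track by coverage and note ratio; return the best one."""
    total_notes = sum(len(t.notes) for t in tracks)
    scores = {}
    for t in tracks:
        coverage = sounding_time(t.notes) / music_duration  # Coverage_i = T_i / T_music
        ratio = len(t.notes) / total_notes                  # R_i = n_i / sum(n_i)
        scores[t.name] = coverage + ratio  # ASSUMPTION: equal-weight sum
    best = max(tracks, key=lambda t: scores[t.name])
    return best, scores


def flatten_to_rhythm(track):
    """Flatten a track into a rhythm sequence of sorted, unique onset times."""
    return sorted({onset for onset, _ in track.notes})
```

A track that plays throughout the piece and carries most of the notes scores highest on both terms, which matches the intuition that the lead line is both continuous and dense.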

2. Rhythm-Controllable MusicGen¶

Adds a new Rhythm Conditioner to the MusicGen architecture: extracts the rhythm spot waveform as a chromagram, and projects it to a dimension consistent with the T5 text encoder.
Condition Fusion Strategy: Employs a prepend method to sequentially concatenate the rhythm condition, text condition, and transformer inputs.
Output formula: \(Y_{output} = \text{Decoder}(C_{rhythm}, C_{description}, Y)\)
The model is fine-tuned on the self-built MusicPro-7k dataset, which, according to the authors, makes it the first to generate film-aligned melodies directly from visual inputs.
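The prepend fusion can be sketched in a few lines of plain Python. This is an illustration of the concatenation order only, not the authors' implementation: the projection is a single bias-free linear map, and the names `linear_project`, `prepend_conditions`, and `proj_weight` are hypothetical.

```python
def linear_project(frames, weight):
    """Project each frame (length d_in) to d_out using weight[d_in][d_out]."""
    d_out = len(weight[0])
    return [
        [sum(frame[i] * weight[i][j] for i in range(len(frame))) for j in range(d_out)]
        for frame in frames
    ]


def prepend_conditions(rhythm_chroma, text_emb, token_emb, proj_weight):
    """Build the decoder input as [C_rhythm ; C_description ; Y].

    rhythm_chroma: chromagram frames of the rhythm-spot waveform (12 bins each)
    text_emb:      text-encoder outputs, already of width d_model
    token_emb:     embeddings of the music tokens Y, width d_model
    """
    c_rhythm = linear_project(rhythm_chroma, proj_weight)  # 12 -> d_model
    d_model = len(token_emb[0])
    # All three segments must share the model width before concatenation.
    assert all(len(v) == d_model for v in c_rhythm + text_emb)
    return c_rhythm + text_emb + token_emb
```

Because the conditions are placed at the front of the sequence, every music token can attend to them through ordinary self-attention, with no extra cross-attention layers.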

3. Multi-Agent Assessment, Arrangement & Mixing¶

Melody Assessment: Designs 5 reviewer agents (Mode/Melody/Harmony/Rhythm/Emotion) based on the AutoGen framework to evaluate the melody progressively according to music theory. Unqualified melodies are sent back for regeneration.
Arrangement and Mixing: 6 agents in a Group Chat (Analyze/Arrange/Instrument/Volume/Mixing/Reviewer) collaborate sequentially following the actual music production workflow, using CoT and Few-Shot prompt engineering.
Final Execution: The execution agent translates the arrangement plan into Reaper DAW commands to output 48kHz cinema-grade music.
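The assess-and-regenerate loop can be sketched without any agent framework. The sketch below uses plain callables as stand-ins for the reviewer agents rather than the AutoGen API. Stopping at the first failed review, passing the feedback back to the generator, and the `max_attempts` limit are all assumptions.

```python
from typing import Callable, List, Optional, Tuple

# A reviewer takes a melody (here just a string) and returns (passed, feedback).
Reviewer = Callable[[str], Tuple[bool, str]]


def assess_melody(melody: str, reviewers: List[Tuple[str, Reviewer]]) -> Tuple[bool, str]:
    """Run reviewers in order; stop at the first one that rejects the melody."""
    for name, review in reviewers:
        passed, feedback = review(melody)
        if not passed:
            return False, f"{name}: {feedback}"
    return True, "accepted"


def generate_until_accepted(
    generate: Callable[[Optional[str]], str],
    reviewers: List[Tuple[str, Reviewer]],
    max_attempts: int = 3,  # ASSUMPTION: the retry budget is not given in the text
):
    """Regenerate the melody until every reviewer accepts it or attempts run out."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        melody = generate(feedback)  # feedback from the last failed review, if any
        passed, feedback = assess_melody(melody, reviewers)
        if passed:
            return melody, attempt
    return None, max_attempts
```

In a real system each reviewer would wrap an LLM call with a music-theory prompt for its dimension (Mode, Melody, Harmony, Rhythm, Emotion); the control flow stays the same.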

Loss & Training¶

Continues training from MusicGen-Melody using the Adam optimizer (learning rate 1e-1, decayed by a cosine learning rate scheduler).
Uses gradient clipping with max_norm=1.0 and a batch size of 4; training converges in approximately 150 epochs.
Training takes approximately 6 days on a single NVIDIA A6000 GPU.
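The two training mechanisms named above, cosine learning-rate decay and gradient clipping by global norm, can be written out in plain Python. This is a framework-free sketch of the standard formulas, not the authors' training code. The `min_lr` floor and the function names are assumptions.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm
```

Clipping preserves the gradient direction and only limits its magnitude, which helps keep fine-tuning stable at a small batch size such as 4.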

Key Experimental Results¶

Main Results (Tab. 3)¶

Method	KL↓	FAD↓	SR	ImageBind↑	Diversity↑	Musicality↑	Rhythm↑	Dynamic↓	Instru.↓
GT	0.000	0.000	48K	0.328	0.451	14.40	4042	0.000	0.000
CMT	1.554	0.644	-	0.104	0.361	10.02	3153	0.801	0.519
M2UGen	1.569	0.306	32K	0.114	0.373	9.40	3392	1.070	0.510
VidMuse	2.035	0.376	32K	0.070	0.153	8.02	2467	0.892	0.527
MusicGen	1.368	0.456	32K	0.108	0.133	8.80	3050	1.147	0.546
RC-MusicGen	1.320	0.219	32K	0.120	0.385	9.42	3646	0.820	0.506
FilmComposer	1.209	0.207	48K	0.131	0.444	10.78	3834	0.767	0.434

Ablation Study (Tab. 4)¶

Text	Rhythm	Agent	FAD↓	Rhythm↑	ImageBind↑
✓	✗	✗	0.246	2978	0.254
✗	✓	✗	0.319	2836	0.163
✓	✓	✗	0.219	3646	0.265
✓	✓	✓	0.207	3834	0.318
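To make the ablation easier to read, the short script below computes the relative change between configurations from the numbers in the table above. The dictionary keys and the `relative_change` helper are illustrative names, not part of the paper.

```python
# Ablation rows copied from Tab. 4 (FAD: lower is better; others: higher is better).
ablation = {
    "text": {"FAD": 0.246, "Rhythm": 2978, "ImageBind": 0.254},
    "rhythm": {"FAD": 0.319, "Rhythm": 2836, "ImageBind": 0.163},
    "text+rhythm": {"FAD": 0.219, "Rhythm": 3646, "ImageBind": 0.265},
    "text+rhythm+agent": {"FAD": 0.207, "Rhythm": 3834, "ImageBind": 0.318},
}


def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


def compare(base, new):
    """Relative change of every metric when moving from config `base` to `new`."""
    return {
        metric: relative_change(ablation[base][metric], ablation[new][metric])
        for metric in ablation[base]
    }


for base, new in [("text", "text+rhythm"), ("text+rhythm", "text+rhythm+agent")]:
    changes = compare(base, new)
    summary = ", ".join(f"{m} {v:+.1f}%" for m, v in changes.items())
    print(f"{base} -> {new}: {summary}")
```

Adding rhythm control to text conditioning raises the Rhythm score by roughly 22%, and adding the agents raises ImageBind by roughly 20% over the text-plus-rhythm configuration.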

Key Findings¶

FilmComposer achieves the best score among all generated methods on every metric reported in Tab. 3 and is the only method to reach the 48kHz cinema-grade sampling rate.
Adding rhythm control to text conditioning improves the Rhythm metric from 2978 to 3646. Conversely, adding text conditioning to the rhythm-only setting (2836) also improves rhythm alignment.
The multi-agent system raises the ImageBind score in the ablation from 0.265 to 0.318, indicating that arrangement and mixing play an important role in audio-visual consistency.
In user studies, over 60% of both experts and non-experts prefer the music generated by FilmComposer.

Highlights & Insights¶

Workflow Simulation Paradigm: Rather than pursuing end-to-end black-box generation, it simulates the staged workflow of professional musicians, making the system highly interactive and intervention-ready.
Hybrid Waveform + Symbolic Strategy: Uses waveform generation to guarantee richness and symbolic representation (ABC notation) to support precise arrangement, exploiting the advantages of both.
Effectiveness of Multi-Agent Collaboration: Music evaluation and arranging tasks are naturally suited for a multi-agent paradigm, with different agents handling professional judgments across different musical dimensions.
MusicPro-7k Dataset: Contains 7,418 professional film-music pairs, including descriptions, main melodies, and rhythm spots, filling the gap in film-scoring datasets.

Limitations & Future Work¶

Generation Speed: Multi-agent arrangement and mixing involve multiple rounds of LLM conversations and DAW operations, resulting in high end-to-end latency.
Style Generalization: The training data is dominated by classical movies, offering limited coverage of modern pop styles, experimental music, etc.
Melody Transcription Accuracy: The MT3 model might introduce errors when transcribing waveform to MIDI, which affects the subsequent arrangement quality.
Subjectivity of Evaluation Metrics: Although validated by user studies, the multi-agent musicality ratings still rely on the music comprehension capability of the LLMs.

Related Work & Insights¶

CMT (ACM MM 2021): Pioneering video background music generation, focusing on rhythm alignment \(\rightarrow\) FilmComposer adds semantic and developmental dimensions on top of this.
MusicGen (Meta): Open-source autoregressive music generation backbone \(\rightarrow\) FilmComposer adds a rhythm conditioner on top of it and fine-tunes it.
AutoGen Multi-Agent Framework: LLM multi-agent arrangement \(\rightarrow\) Insight: Complex creative tasks can be decomposed into multi-agent collaborations.
Insight: The key to AI-assisted creative production is not to replace humans, but to simulate professional workflows so that humans can intervene at every stage.

Rating¶

⭐⭐⭐⭐: This work is presented as the first application of a multi-agent LLM system to the professional domain of film scoring. It is highly complete from problem definition to dataset construction and system design, offering outstanding engineering value, though the contribution leans more toward system integration.