
LUMIA: A Handheld Vision-to-Music System for Real-Time, Embodied Composition

Conference: NeurIPS 2025
arXiv: 2512.17228
Code: https://github.com/KidaGSD/LLOv2
Area: Audio & Speech
Keywords: vision-to-music, real-time composition, human-AI co-creation, embodied interaction, multimodal generation

TL;DR

This paper presents Lumia — a handheld camera-shaped device that analyzes captured frames via GPT-4 Vision to generate structured prompts, which are then fed to Stable Audio to synthesize loopable music segments, enabling a real-time, embodied improvisation workflow from visual input to music.

Background & Motivation

Background: Generative AI has achieved breakthroughs in text, image, and audio domains (DALL·E, MusicLM, Stable Audio, etc.), yet most of these tools are screen- and prompt-driven, lacking physical interaction and real-time improvisation capabilities.

Limitations of Prior Work: (a) Existing text-to-music models (MusicLM, AudioLDM, Stable Audio) are primarily accessed via batch processing or prompt interfaces, with no real-time interactive control; (b) digital music tools (Magenta Studio, Jukebox) emphasize precise control but lack tactile and improvisational workflows; (c) tangible interaction devices (Reactable, Bela) are confined to fixed environments and predefined mappings.

Key Challenge: A fundamental disconnect exists between the capabilities of generative music AI and users' physical creative experience — models are powerful, but interaction modalities are impoverished, making the creative process feel like "tweaking parameters" rather than "performing."

Goal: How can multimodal generative AI be embedded into a physical device so that visual scenes serve as musical material sources, enabling context-aware musical improvisation?

Key Insight: The work extends the concept of the Large Language Object (LLO) — embedding generative models into systems with material expressiveness. The predecessor VBox supported haptic navigation of audio latent spaces; Lumia pivots toward composition, linking visual perception to multimodal generation.

Core Idea: Camera framing as a sampling act — the user "composes" by "looking." Captured frames are analyzed by a VLM and translated into music prompts, generating audio segments that can be looped and layered.

Method

Overall Architecture

The system comprises three pillars: (1) a tangible hardware controller — a camera-shaped handheld device with five buttons (four instrument selection + one capture/play); (2) a browser-based front-end application — the central orchestrator managing session state and loop playback; (3) cloud AI services — GPT-4 Vision for image analysis, Stable Audio for music generation, and Tonn AI for mixing/mastering. End-to-end latency is approximately 5–6.5 seconds (capture → audio in loop).

Key Designs

  1. Vision-to-Music Pipeline:

    • Function: Transforms captured frames into structured music generation prompts.
    • Mechanism: Upon pressing the capture button, the current camera frame is sent to GPT-4 Vision, which returns a structured JSON description containing: scene summary, list of salient objects, overall mood (adjectives), section role (intro/verse/chorus/bridge/outro), musical style, and suggested BPM. The system then merges this JSON with the user-selected instruments (up to three, following the principle of perceptual stream segregation), appends section-specific modifiers (e.g., chorus receives "higher energy, catchy hook"), and produces a single-sentence prompt for Stable Audio.
    • Design Motivation: The VLM serves as an intermediate translation layer, extracting atmosphere and context from visual scenes rather than literal objects, avoiding the degraded audio quality that results from overly direct "object → sound" mappings. A prompt-assembly sketch appears after this list.
  2. Loop Playback Engine:

    • Function: Manages seamless looping, layering, and transitions across multiple audio segments.
    • Mechanism: Given a session tempo of \(b\) BPM, one beat is \(T_{\text{beat}} = 60/b\) seconds and one bar is \(T_{\text{bar}} = 4T_{\text{beat}}\). Each segment has a fixed length \(L_k = m_k T_{\text{bar}}\), i.e., a whole number \(m_k\) of bars. A tempo-adaptive crossfade window of two beats, floored at 0.3 s, is used: \(T_{\text{cf}}(b) = \max\left(\frac{120}{b},\; 0.3\right)\) s. The next segment therefore starts at \(t_{k+1} = t_k + L_k - T_{\text{cf}}\).
    • An equal-power crossfade envelope maintains instantaneous power: \(g_{\text{out}}(n) = \cos\left(\frac{\pi n}{2N}\right), \quad g_{\text{in}}(n) = \sin\left(\frac{\pi n}{2N}\right)\) satisfying \(g_{\text{out}}^2(n) + g_{\text{in}}^2(n) = 1\)
    • For sparse ambient segments, an optional power-law envelope is applied: \(g_{\text{out}}(n) = (1 - n/N)^{\alpha_0}\), \(\alpha_0 \approx 2.5\)
    • Design Motivation: Seamless loop joining is critical to user experience; tempo alignment and equal-power crossfading prevent rhythmic discontinuities and volume jumps at splice points. A timing and crossfade sketch appears after this list.
  3. Envelope Selection Strategy:

    • Function: Automatically selects the crossfade type based on segment characteristics.
    • Mechanism: Given a context vector \(\mathbf{c} = (\Delta P, \text{section role})\), the optimal fade type and its parameters are selected by minimizing a loudness-mismatch objective: \((f^\star, \theta^\star) = \arg\min_{f \in \{\text{eq}, \text{poly}\},\, \theta} \sum_{n=0}^{N-1} \left(|z[n]|^2 - P_{\text{target}}\right)^2 + \lambda\, \mathcal{C}_{\text{transient}},\) where \(P_{\text{target}}\) is the running power target and \(\mathcal{C}_{\text{transient}}\) penalizes transients at splice points. A selection sketch appears after this list.
  4. Automatic AI Mixing and Mastering:

    • Function: Automatically triggers a mixing preview and export-grade mastering once at least two segments are ready.
    • Mechanism: Individual segment WAV files are uploaded to Tonn AI as stems, with parameters such as instrumentGroup, presenceSetting, and panPreference specified per segment. Upon completion of the mixing preview, a hot-swap replaces the currently playing audio, enabling an uninterrupted quality upgrade. Mastering is performed by concatenating segments with pydub and submitting the result to Tonn's album-level mastering service; a concatenation sketch appears after this list.
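
The prompt-assembly step of the vision-to-music pipeline (item 1 above) can be sketched as follows. This is a minimal illustration under assumed field names (scene_summary, mood, section_role, style, bpm) and an assumed modifier table; the paper does not publish its exact JSON schema or modifier wording beyond the chorus example.

```python
# Minimal sketch of the JSON-to-prompt step. Field names and the modifier table
# are illustrative assumptions, not the paper's exact schema.

SECTION_MODIFIERS = {
    "intro": "sparse, establishing",
    "verse": "steady groove",
    "chorus": "higher energy, catchy hook",   # wording given in the paper
    "bridge": "contrasting texture",
    "outro": "winding down, fading",
}

def build_music_prompt(vision_json: dict, instruments: list[str]) -> str:
    """Merge the VLM's scene analysis with user-selected instruments (max three)
    into a single-sentence prompt for the music generator."""
    instruments = instruments[:3]  # perceptual stream segregation: at most three layers
    mood = ", ".join(vision_json["mood"])
    modifier = SECTION_MODIFIERS.get(vision_json["section_role"], "")
    return (
        f"{vision_json['style']} piece, {mood}, "
        f"featuring {', '.join(instruments)}, {modifier}, "
        f"around {vision_json['bpm']} BPM, evoking {vision_json['scene_summary']}."
    )

example = {
    "scene_summary": "sunlit street with passing cyclists",
    "objects": ["bicycle", "storefront", "trees"],
    "mood": ["warm", "unhurried"],
    "section_role": "verse",
    "style": "lo-fi indie pop",
    "bpm": 92,
}
print(build_music_prompt(example, ["acoustic guitar", "soft drums"]))
```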
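
The loop-engine timing and crossfade math (item 2 above) translates directly into code. The sketch below is a minimal NumPy rendering of the stated formulas, assuming 4/4 bars and sample-indexed gain curves; it is not the authors' implementation.

```python
import numpy as np

def crossfade_window(bpm: float) -> float:
    """Tempo-adaptive crossfade length: two beats, floored at 0.3 s."""
    return max(120.0 / bpm, 0.3)

def next_start_time(t_k: float, bars_k: int, bpm: float) -> float:
    """Start time of segment k+1, overlapping the tail of segment k by T_cf."""
    t_bar = 4 * (60.0 / bpm)      # one 4/4 bar in seconds
    length_k = bars_k * t_bar     # segment length is a whole number of bars
    return t_k + length_k - crossfade_window(bpm)

def equal_power_gains(n_samples: int):
    """Equal-power fade-out/fade-in pair; g_out^2 + g_in^2 = 1 at every sample."""
    n = np.arange(n_samples)
    g_out = np.cos(np.pi * n / (2 * n_samples))
    g_in = np.sin(np.pi * n / (2 * n_samples))
    return g_out, g_in

def power_law_fade_out(n_samples: int, alpha: float = 2.5):
    """Optional power-law fade-out for sparse ambient segments."""
    n = np.arange(n_samples)
    return (1 - n / n_samples) ** alpha

# Example: 92 BPM session, 4-bar segment starting at t = 0
print(crossfade_window(92), next_start_time(0.0, 4, 92))
```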
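
For the envelope selection strategy (item 3 above), a small grid search over fade candidates can stand in for the arg-min. In the sketch below the transient penalty is approximated by the mean absolute sample-to-sample difference, and the power-law fade-in is assumed to be the complement of the fade-out; both are assumptions, not details from the paper.

```python
import numpy as np

def splice_cost(z: np.ndarray, p_target: float, lam: float = 0.1) -> float:
    """Loudness-mismatch objective over the crossfade region: squared deviation of
    instantaneous power from the running power target, plus a transient penalty
    (approximated here as the mean absolute sample-to-sample jump)."""
    power_term = float(np.sum((np.abs(z) ** 2 - p_target) ** 2))
    transient_term = float(np.mean(np.abs(np.diff(z))))
    return power_term + lam * transient_term

def select_fade(out_tail: np.ndarray, in_head: np.ndarray, p_target: float,
                alphas=(1.5, 2.0, 2.5, 3.0)):
    """Choose between the equal-power pair and power-law candidates by minimizing
    the splice cost over the overlap region (a grid search stands in for the arg-min)."""
    n = len(out_tail)
    idx = np.arange(n)
    candidates = {
        ("eq", None): (np.cos(np.pi * idx / (2 * n)), np.sin(np.pi * idx / (2 * n)))
    }
    for a in alphas:
        g_out = (1.0 - idx / n) ** a
        candidates[("poly", a)] = (g_out, 1.0 - g_out)  # complementary fade-in (assumption)

    def cost(gains):
        g_out, g_in = gains
        return splice_cost(out_tail * g_out + in_head * g_in, p_target)

    return min(candidates, key=lambda k: cost(candidates[k]))

# Example with synthetic tails (white noise) and a running power target of 0.1
rng = np.random.default_rng(0)
tail, head = 0.3 * rng.standard_normal(4800), 0.3 * rng.standard_normal(4800)
print(select_fade(tail, head, p_target=0.1))
```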
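
The mastering-preparation step (item 4 above) concatenates segment WAVs with pydub before submission to Tonn's album-level mastering service. The sketch below shows only the pydub concatenation; the file names, the 300 ms join crossfade, and the commented-out submission call are illustrative placeholders, since the Tonn AI API is not documented in the paper.

```python
# Minimal sketch of preparing a full track for mastering, assuming per-segment WAVs.
from pydub import AudioSegment

segment_paths = ["segment_01.wav", "segment_02.wav", "segment_03.wav"]  # hypothetical paths

combined = AudioSegment.from_wav(segment_paths[0])
for path in segment_paths[1:]:
    # Short crossfade at each join to avoid clicks at splice points
    combined = combined.append(AudioSegment.from_wav(path), crossfade=300)

combined.export("full_track.wav", format="wav")  # single file for album-level mastering

# submit_for_mastering("full_track.wav")  # placeholder for the Tonn AI mastering request
```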

Hardware Design

  • An Arduino Nano 33 IoT microcontroller manages I/O, with a state machine loop handling button debouncing, LCD display updates, and LED status indication.
  • The firmware is stateless (except for the display); all playback and audio logic is driven by the front end.
  • Physical color filter discs were added — iterative development revealed that strong image coloration significantly influences style inference, prompting the addition of physical filters as a simple visual style control.

Key Experimental Results

User Evaluation

Three professional audio engineers (4–6 years of experience) each composed a multi-section piece of 120–150 seconds using Lumia, with sessions lasting approximately 25–30 minutes.

Evaluation Dimension           | Scale      | Score Range / Result
Co-creation / Sense of Agency  | 1–7 Likert | See Figure 10
Musical Quality                | 1–7 Likert | See Figure 10
Audio Mapping                  | 1–7 Likert | See Figure 10
Interaction / Flow             | 1–7 Likert | See Figure 10
Value / Fit                    | 1–7 Likert | See Figure 10
Authorship Attribution         | 0–10       | Mean 4.0
Expectation Match              | 0–10       | Mean 6.3

System Latency

Stage                                        | Mean Latency
GPT-4 Vision image analysis                  | 1.2 ± 0.3 s
Stable Audio music generation (15 s segment) | 3.8 ± 0.6 s
End-to-end (capture → audio in loop)         | 5.0–6.5 s
Mixing preview                               | 5.2 ± 0.9 s
Full mastering                               | 8.6 ± 1.1 s
End-to-end including mixing update           | 10–13 s

Key Findings

  • User feedback was positive: "Starting from an image gets me to the right vibe faster than starting from a DAW template."
  • Mean authorship attribution of 4.0/10 indicates that users perceive a large AI contribution; future work should strengthen users' sense of control.
  • Expectation match of 6.3/10 is acceptable for a generative system but leaves room for improvement.
  • Improvement requests centered on: (i) optional style/BPM locking; (ii) layer-level fine-grained editing; (iii) a visual-to-audio mapping legend; (iv) reduced latency.
  • Use cases include rapid inspiration sketching, mood boarding, and short-video scoring.

Highlights & Insights

  • "Seeing as Sampling" interaction metaphor: Framing the camera viewfinder as a DJ sampling act is both intuitive and novel. The physical form factor (camera shape) reinforces this metaphor and lowers the cognitive barrier for non-professional users.
  • Intermediate structured prompt layer: Rather than end-to-end image → audio, the pipeline proceeds as image → structured JSON (scene / mood / section role / style / BPM) → music prompt. This intermediate layer makes the system interpretable and controllable, and leaves room for future interfaces that allow users to edit prompts directly.
  • Tempo-aligned seamless splicing: Equal-power crossfade + bar-boundary quantization + hot-swap mechanism together ensure musical coherence — a critical yet frequently overlooked requirement for music generation systems.
  • Modular cloud-service architecture: VLM, music generation, and mixing are each independent API calls; no local GPU is required, and the front end runs on a MacBook.

Limitations & Future Work

  • Full dependence on cloud APIs (GPT-4V, Stable Audio, Tonn) means offline operation is unavailable and the system is susceptible to latency fluctuations.
  • The user evaluation is extremely small in scale (only 3 participants), providing insufficient statistical power; conclusions are primarily qualitative.
  • Generative models lack temporal awareness — newly generated segments cannot perceive the harmonic or melodic content of existing segments; consistency is maintained only through fixed style/key/BPM prompts.
  • Fine-grained local editing of already-generated segments is not supported (e.g., modifying only one bar of the bass line).
  • The physical device has a limited number of function buttons; FX knobs and gesture control remain planned future work.

Comparison with Related Work

  • vs. Magenta Studio / Jukebox: These tools are embedded in DAW workflows and target technically proficient musicians; Lumia targets a broader non-professional user base and lowers the barrier through physical interaction.
  • vs. Reactable: A classic tangible music interface, but confined to a fixed desktop environment with predefined mappings; Lumia is mobile and context-aware.
  • vs. VBox: Both belong to the LLO series; VBox supports haptic navigation of audio latent spaces, while Lumia advances from navigation to vision-driven composition, representing greater complexity.
  • vs. Be the Beat: Both are embodied devices embedding generative AI, but Be the Beat responds to a dancer's movements while Lumia responds to visual scenes, differing in modality.

Rating

  • Novelty: ⭐⭐⭐⭐ The concept of "composing by photographing" is novel and compelling, though the underlying technology is primarily API composition.
  • Experimental Thoroughness: ⭐⭐ Only a 3-participant user study; lacks quantitative comparative baselines; system evaluation is weak.
  • Writing Quality: ⭐⭐⭐⭐ System description is clear and complete; the iterative interaction design process is transparent; in-depth technical analysis is limited.
  • Value: ⭐⭐⭐ Informative as an HCI/creative AI system, but leans toward proof-of-concept; practical deployment remains some distance away.
