
TalkSketch: Multimodal Generative AI for Real-time Sketch Ideation with Speech

Conference: AAAI 2026 | arXiv: 2511.05817 | Code: None | Area: Dialogue Systems | Keywords: multimodal interaction, sketch design, speech input, generative AI, creativity support tools

TL;DR

This paper proposes TalkSketch, a system that integrates hand-drawn sketches with real-time speech input into a multimodal AI chatbot, enabling designers to simultaneously draw and verbalize ideas during early-stage ideation. The system addresses the problem that text-based prompting in existing GenAI tools disrupts the creative workflow.

Background & Motivation

In early-stage design, sketching serves as a central practice that is improvisational, open-ended, and dynamic. Designers frequently switch between design stages and iterate to explore alternatives. With the advancement of multimodal large language models, GenAI has demonstrated strong creative capabilities and is increasingly suited to support early ideation.

However, existing GenAI chatbots face three core challenges when assisting design ideation:

The text-prompting dilemma: Designers struggle to accurately articulate evolving visual concepts in words; repeatedly typing and refining prompts is effortful and disrupts the creative flow.

Command-based interaction paradigm: Current systems place the full burden of direction on the user, ignoring more natural and intuitive input modalities.

Tool fragmentation: Designers must frequently switch among sketching apps, ChatGPT, and image generation tools, causing workflow interruptions.

Key observation: Designers naturally verbalize their thoughts while sketching, but these contextual cues are almost never captured by existing systems. This natural behavior represents a valuable input source for multimodal AI interaction.

Formative Study (N=6)

To better understand these challenges, the authors conducted a formative study comprising a design task and interviews:

Participants: 6 individuals (2 female, 4 male) with 2–5 years of design experience, with backgrounds in architecture, furniture, interior, robotics, and electronics design.

Task: Design a household bread toaster within 30 minutes using Goodnotes/Procreate combined with GenAI tools.

Three usage patterns identified:

- Pattern 1: Using GenAI for research and idea inspiration (e.g., querying common toaster problems)
- Pattern 2: Using GenAI to render sketch concepts (uploading sketches to ChatGPT Image/Gemini for visualization)
- Pattern 3: Iterating in a loop among sketches, prompts, and references

Three major challenges:

- AI responses were too generic and required extensive prompt adjustment
- Output images misaligned with intent (P1: "kind of crazy"; P5: "too time-consuming, I'd rather just draw myself")
- Frequent tool switching disrupted the creative process

Method

Overall Architecture

TalkSketch consists of three core modules:

1. Sketching Module: A digital canvas supporting freehand drawing, erasing, selection, and undo
2. Talking Module: Real-time speech capture and transcription
3. Multimodal AI Chatbot: Comprising automated AI insights and interactive text/image generation

Key Designs

1. Sketching Module

A digital canvas built on Fabric.js:

- Supports stylus/touch input for freehand drawing
- Toolbar includes drawing controls and a "Generate with AI" button
- Allows selection and export of any canvas region as part of a multimodal prompt to the chatbot
- Supports saving to and retrieving from a gallery

Design Motivation: Embedding AI directly within the sketching environment reduces tool switching (Design Goal 1).
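The region-export step can be made concrete with a small sketch. The paper publishes no code, so all names below are illustrative assumptions; the idea is to compute a padded bounding box around the selected strokes, which with Fabric.js could then be handed to something like `canvas.toDataURL({ left, top, width, height })` to export just that region as the image part of a multimodal prompt.

```typescript
// Hypothetical types -- illustrative only, not from the paper.
interface Point { x: number; y: number; }
interface Stroke { points: Point[]; }
interface CropRect { left: number; top: number; width: number; height: number; }

// Compute the padded bounding box of the selected strokes, suitable as a
// crop rectangle for exporting that canvas region as an image.
function selectionBounds(strokes: Stroke[], padding = 8): CropRect | null {
  const pts = strokes.flatMap(s => s.points);
  if (pts.length === 0) return null;
  const xs = pts.map(p => p.x);
  const ys = pts.map(p => p.y);
  const left = Math.min(...xs) - padding;
  const top = Math.min(...ys) - padding;
  return {
    left,
    top,
    width: Math.max(...xs) + padding - left,
    height: Math.max(...ys) + padding - top,
  };
}
```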

2. Talking Module

Real-time speech capture supports a "draw-and-think-aloud" workflow:

- Recording begins automatically when the user is in sketch mode (chatbot not open)
- A recording indicator displays the current recording status
- Recording stops when the AI chatbot is opened
- Low-latency transcription is performed via Google Cloud Speech-to-Text

Design Motivation: Speech replaces lengthy text prompts, reducing the burden of prompt authoring on designers (Design Goal 2).
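The recording lifecycle described above amounts to a two-state machine driven by the UI mode. A minimal sketch, with class and method names assumed for illustration:

```typescript
// Illustrative sketch of the recording lifecycle: recording runs while the
// user is in sketch mode and pauses whenever the AI chatbot panel is open.
// Names are assumptions, not from the paper.
type Mode = "sketching" | "chatbot";

class RecordingController {
  recording = false;
  private mode: Mode = "sketching";

  constructor() { this.apply(); }

  setMode(mode: Mode): void {
    this.mode = mode;
    this.apply();
  }

  private apply(): void {
    const shouldRecord = this.mode === "sketching";
    if (shouldRecord && !this.recording) {
      this.recording = true;   // real system: start mic capture and stream
                               // audio to Google Cloud Speech-to-Text
    } else if (!shouldRecord && this.recording) {
      this.recording = false;  // stop capture; keep the transcript as context
    }
  }
}
```

Deriving the recording state purely from the current mode (rather than toggling it imperatively at each UI event) keeps the indicator and the capture pipeline from drifting out of sync.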

3. Multimodal AI Chatbot

Comprises two sub-components sharing a common backend and unified conversation history:

(a) AI Insights:

- Automatically generates reflective feedback based on the user's current sketch and speech transcription
- Triggered by clicking "Generate with AI", without requiring an explicit text prompt
- Powered by Gemini 2.5 Flash, configured with a design thinking expert persona
- Employs the Double Diamond framework to guide users through the Discover and Define phases

Two prompt templates:

- Kickoff prompt: triggered upon the first drawing; approximately 100 words; identifies design intent and provides 3–4 directions
- Refine prompt: triggered in subsequent interactions; approximately 80–100 words; summarizes the current design and provides 1–2 expansion suggestions and 1–2 open-ended questions
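The kickoff/refine dispatch can be sketched as a single function keyed on interaction count. The template wording below paraphrases the paper's description; the function and field names are assumptions:

```typescript
// Hedged sketch of the dual-template dispatch for AI Insights.
interface InsightContext {
  interactionCount: number; // times "Generate with AI" has been used so far
  transcript: string;       // speech transcription captured while sketching
}

function buildInsightPrompt(ctx: InsightContext): string {
  if (ctx.interactionCount === 0) {
    // Kickoff (~100 words): identify intent, offer 3-4 directions (Discover)
    return [
      "You are a design thinking expert in the Discover phase.",
      "From the sketch and this spoken context, identify the design intent",
      "and suggest 3-4 directions to explore.",
      `Spoken context: ${ctx.transcript}`,
    ].join("\n");
  }
  // Refine (~80-100 words): summarize, expand, question (Define)
  return [
    "You are a design thinking expert in the Define phase.",
    "Summarize the current design, give 1-2 expansion suggestions,",
    "and ask 1-2 open-ended questions.",
    `Spoken context: ${ctx.transcript}`,
  ].join("\n");
}
```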

(b) Multimodal Chat Interface:

- Supports text input and the iPad's built-in voice dictation
- Allows sketch regions to be exported as image input
- Text generation mode: uses Gemini 2.0 Flash; accepts text/sketches; outputs textual suggestions
- Image generation mode: uses Gemini 2.5 Flash Image; accepts text/sketches; outputs images with descriptions
- Generated images can be imported back onto the canvas as visual references

Design Motivation: Realizes a proactive, context-aware AI that functions more as a design partner than a passive tool (Design Goal 3).
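Combining the three input sources (typed text, speech transcript, sketch image) into one request is the core of this module. The part structure below loosely mirrors the Gemini API's content-part shape, but every name here is an assumption made for illustration, not the authors' implementation:

```typescript
// Illustrative assembly of one multimodal request from the three input
// sources: sketch image, speech transcript, and typed text.
type Part =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } }; // base64 image

function buildRequestParts(
  userText: string,
  transcript: string,
  sketchPngBase64?: string,
): Part[] {
  const parts: Part[] = [];
  if (sketchPngBase64) {
    parts.push({ inlineData: { mimeType: "image/png", data: sketchPngBase64 } });
  }
  if (transcript.trim() !== "") {
    // Frame the transcript as context, so the model can distinguish
    // think-aloud speech from the explicit instruction below.
    parts.push({ text: `While sketching, the designer said: "${transcript}"` });
  }
  parts.push({ text: userText });
  return parts;
}
```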

Loss & Training

This is a system design paper and does not involve model training. The backend employs pretrained Gemini models (Gemini 2.0 Flash for text dialogue, Gemini 2.5 Flash Image for image generation), with functionality realized through carefully designed prompt templates.

Key Experimental Results

Main Results

Because this is a system paper, no formal user-study evaluation has been conducted. The authors instead propose an analytical framework of anticipated outcomes:

| Evaluation Dimension | Expected Effect | Theoretical Basis |
| --- | --- | --- |
| Intent expression | More fluent communication of design intent | Combining verbal and visual cues reduces cognitive load |
| Interaction naturalness | Higher naturalness ratings | "Think-aloud while drawing" preserves the spontaneity of natural conversation |
| Creativity support | Richer ideation traces | Continuous externalization of thought without interruption for text input |
| Reflective behavior | More design reflection | AI Insights guides reflective interaction |
| Human-AI alignment | Better alignment | Multimodal input reduces intent misinterpretation |

Ablation Study

Key findings from the formative study (used as design rationale):

| Finding | Specific Observation | Design Response |
| --- | --- | --- |
| P1 | GenAI responses too generic, requiring repeated adjustment | AI Insights auto-generates structured feedback |
| P2 | AI fails to understand intent from sketch input | Speech context improves understanding |
| P3 | Repetitive text prompting causes fatigue | Speech input replaces text prompting |
| P4 | Tool switching disrupts creative flow | AI embedded within the sketch environment |
| P5 | AI acts as a passive tool rather than a proactive partner | Proactive AI Insights with context awareness |

Key Findings

  1. Speech as a sketching companion: Designers naturally verbalize thoughts while sketching—an overlooked yet highly valuable information source.
  2. Cognitive burden of text prompting: In early-stage ideation when creativity is in flow, typing itself interrupts thinking.
  3. Tool fragmentation as the primary pain point: Switching among Goodnotes, ChatGPT, Midjourney, and similar tools significantly impairs creative flow.
  4. AI should be a design partner, not a tool: Designers expect AI to "see what I'm drawing and offer suggestions" rather than wait passively for instructions.

Highlights & Insights

  1. The core insight of "Talk + Sketch = Better Prompt" is highly inspiring—transforming unconscious verbal behavior into AI input is far more natural than forcing users to write text prompts.
  2. The three system design goals are directly derived from the formative study, establishing a clear research-design-implementation chain.
  3. The dual prompt template design (Kickoff vs. Refine) reflects an understanding of the phased nature of the design process.
  4. Information editability: Users can edit speech transcriptions and regenerate AI insights, balancing automation with user control.
  5. The paradigm shift from command-based to conversation-based interaction carries broad significance.

Limitations & Future Work

  1. Speech transcription errors: Background noise and unclear pronunciation may lead to transcription errors, degrading the AI's understanding of intent.
  2. Silent user scenarios: Some users are not accustomed to thinking aloud while drawing; in such cases, the system degrades to relying solely on sketches and text prompts.
  3. Coarse-grained speech-sketch correspondence: The current approach treats speech and sketches as monolithic blocks, without establishing fine-grained associations between spoken content and specific drawn regions.
  4. Lack of quantitative evaluation: The paper presents only a formative study and system design; no formal controlled user study has been conducted.
  5. Single-user scenario: Collaborative design settings involving multiple users simultaneously speaking and sketching have not been explored.
Related Work

  • DrawTalking combines sketching and speech to construct interactive animated worlds; TalkSketch extends this combination to the design ideation phase.
  • GesPrompt combines gestures and speech for intent expression in extended reality environments, inspiring the multimodal input design.
  • Inkspire and SketchAI demonstrate the potential of freehand input for guiding image generation and analogical inspiration.
  • The Double Diamond framework provides the theoretical foundation in design thinking for AI Insights (Discover–Define phases).
  • Natural-language input to interactive systems has historical roots in SHRDLU (typed natural-language commands, 1970) and Put-That-There (combined speech and gesture input, 1980).

Rating

  • Novelty: ⭐⭐⭐⭐ — The concept of "draw-and-talk" as multimodal AI input is novel
  • Experimental Thoroughness: ⭐⭐ — Only a formative study; system evaluation is absent
  • Writing Quality: ⭐⭐⭐⭐ — Research questions are clearly articulated; design logic chain is complete
  • Value: ⭐⭐⭐ — The concept is valuable but requires formal evaluation for validation