HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding

Conference: CVPR 2026
arXiv: 2504.10739
Code: https://github.com/linyueqian/HippoMM
Area: Long Video Understanding
Keywords: Hippocampal cognitive architecture, multimodal memory, long video understanding, cross-modal association, episodic memory

TL;DR

HippoMM maps three core hippocampal cognitive mechanisms—pattern separation (episodic segmentation), memory consolidation (semantic compression), and pattern completion (hierarchical retrieval)—into a computational architecture for episodic memory formation and cross-modal associative recall in long audiovisual streams. On the authors' proposed benchmark HippoVlog, the system achieves 78.2% accuracy while being 5× faster than retrieval-augmented baselines.

Background & Motivation

  1. Background: Current multimodal models face three key challenges in long video understanding: (1) inability to efficiently memorize continuous content spanning hours; (2) inability to reconstruct complete experiences from partial sensory cues (e.g., a sound); and (3) inability to extract persistent abstract knowledge from transient perception. The human hippocampus naturally addresses all three challenges.

  2. Limitations of Prior Work: Existing methods either scale model size or design complex architectures to handle long videos (e.g., VideoLLaMA, Qwen2.5-Omni), but lack explicit memory mechanisms. These models can only process pre-segmented clips and cannot form episodic memories from continuous streams or perform cross-modal pattern completion (e.g., recalling a visual scene upon hearing applause).

  3. Key Challenge: Existing benchmarks (e.g., MLVU, Video-MME) evaluate comprehension of already-presented content rather than memory formation and associative recall. No standard evaluation protocol exists for cross-modal associative recall.

  4. Goal: (a) How to construct episodic memory from continuous audiovisual streams? (b) How to enable cross-modal pattern completion (a cue from one modality triggers recall in another)? (c) How to balance accuracy and efficiency?

  5. Key Insight: The biological hippocampus addresses these problems through pattern separation in the dentate gyrus (DG), auto-associative pattern completion in CA3, and memory consolidation in CA1. The authors directly map these three mechanisms to algorithmic implementations.

  6. Core Idea: Map the hippocampal "segmentation–consolidation–retrieval" cognitive pipeline to a computational architecture of "content-adaptive segmentation → similarity-filtered compression → confidence-gated hierarchical retrieval" for episodic memory understanding of long audiovisual content.

Method

Overall Architecture

HippoMM operates in two stages: (1) Memory Formation, which transforms a continuous audiovisual stream \(X\) into a hierarchical memory structure \(M\) via episodic segmentation, perceptual encoding, and memory consolidation, comprising short-term memory objects \(m_i\) and long-term semantic indices (ThetaEvent objects \(\theta_k\)); (2) Hierarchical Memory Retrieval, which, given a query \(q\), first attempts fast semantic retrieval, escalates to detailed recall when confidence is insufficient (supporting cross-modal pattern completion), and finally synthesizes an answer through adaptive reasoning.

Key Designs

  1. Episodic Segmentation (Pattern Separation):

    • Function: Segments continuous audiovisual streams into discrete episodic units, simulating pattern separation in the dentate gyrus.
    • Mechanism: Segmentation is triggered by detecting visual discontinuities or auditory boundaries at time \(t\). Visual discontinuity is measured via SSIM: \(d_v(F_t, F_{t-1}) = 1 - \text{SSIM}(F_t, F_{t-1})\), with a split triggered when the distance exceeds threshold \(\tau_v\). Audio boundaries are detected via the decibel-level energy \(d_a(a_t) = 20\log_{10}\big(\sqrt{\frac{1}{N}\sum_i a_i^2}\big)\), where values below \(\tau_a\) indicate silence or pauses. Segment lengths are constrained to 5–10 seconds, consistent with human event segmentation timescales.
    • Design Motivation: Fixed-window segmentation arbitrarily breaks continuous events or conflates unrelated scenes. Content-adaptive segmentation preserves semantic integrity, yielding a 46% improvement over VideoLLaMA 2 on the temporal understanding task (NQA).
  2. Perceptual Encoding + Memory Consolidation:

    • Function: Constructs multimodal representations for each episodic segment and compresses them into efficient semantic indices.
    • Mechanism: The perceptual encoding stage processes segments in parallel using three specialized models: ImageBind generates 1024-dimensional cross-modal embeddings \(\mathbf{E}_i\), Whisper produces speech transcriptions \(\mathcal{T}_a\), and Qwen2.5-VL generates visual descriptions \(\mathcal{T}_v\). These outputs are aggregated into ShortTermMemory objects \(m_i = \{\mathbf{E}_i, \mathbf{T}_i, \mathbf{C}_i, t_{s,i}, t_{e,i}\}\). During consolidation, cosine similarity filters redundant segments: for each segment, the mean embedding \(\mathbf{v}_i\) is computed and retained only when its similarity to all stored memories falls below threshold \(\gamma\), i.e., \(K = \{i \mid \forall j \in K, j < i \Rightarrow \cos(\mathbf{v}_i, \mathbf{v}_j) < \gamma\}\) (\(\gamma=0.85\)). An LLM (Qwen2.5-VL) then synthesizes the multimodal content of each retained segment into a concise textual gist \(\mathbf{S}_{\theta_k}\), forming the ThetaEvent object.
    • Design Motivation: The filtering strategy simulates the sparsity of CA3 (only 2–5% of neurons active), creating efficient memory storage. The dual representation of ThetaEvent (embedding + semantic summary) bridges abstract semantics and perceptual detail, mirroring the function of CA1 in biological memory consolidation.
  3. Hierarchical Memory Retrieval (Pattern Completion):

    • Function: Implements a dual-path retrieval system supporting fast semantic retrieval and detailed cross-modal recall.
    • Mechanism: Fast retrieval \(\Phi_{\text{fast}}\) is attempted first—searching only ThetaEvent summaries and evaluating confidence via Qwen2.5-VL. If confidence falls below threshold \(\tau=0.75\), the system escalates to detailed recall \(\Psi_{\text{detailed}}\). The key innovation in detailed recall is cross-modal pattern completion: query cues are used to identify seed segments in the target modality \(\mathbf{S}_{\text{query}} = \text{TopK}(\text{sim}(q_{\text{embed}}, \{\mathbf{v}_k\}), k)\); a temporal window \(\mathbf{W} = \{[t_{s,k} - \delta, t_{e,k} + \delta]\}\) is expanded around the seeds; information from the complementary modality is then retrieved within the expanded window \(\mathbf{S}_{\text{target}}\). For example, "What was on screen when the applause started?" → locate audio segments containing applause → expand the temporal window → retrieve visual descriptions overlapping that window.
    • Design Motivation: Fast retrieval handles high-level semantic queries (e.g., "What is the theme of the video?"), while detailed recall handles cross-modal queries requiring precise temporal localization. The on-demand escalation balances efficiency and accuracy: dropping detailed recall cuts response time to 6.39s but sacrifices 17 accuracy points, whereas dropping fast retrieval (19.54s) loses only 3.6 points.
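
The boundary rules from the segmentation step can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: a single-window, global-statistics SSIM stands in for the full windowed SSIM, and the thresholds (\(\tau_v = 0.4\), \(\tau_a = -40\) dB), frame rate, and length bounds are illustrative defaults.

```python
import numpy as np

def audio_db(frame: np.ndarray) -> float:
    """Decibel-level energy of an audio frame: 20*log10(RMS)."""
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-12  # avoid log(0)
    return 20.0 * np.log10(rms)

def global_ssim(a: np.ndarray, b: np.ndarray) -> float:
    """Simplified single-window SSIM over whole grayscale frames."""
    c1, c2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))

def segment_boundaries(frames, audio_frames, fps=1.0,
                       tau_v=0.4, tau_a=-40.0,
                       min_len=5.0, max_len=10.0):
    """Content-adaptive boundaries: split on visual discontinuity
    (1 - SSIM > tau_v) or silence (dB level < tau_a), while keeping
    segment lengths within [min_len, max_len] seconds."""
    boundaries, last = [], 0.0
    for t in range(1, len(frames)):
        time = t / fps
        length = time - last
        d_v = 1.0 - global_ssim(frames[t], frames[t - 1])
        silent = audio_db(audio_frames[t]) < tau_a
        if length >= max_len or (length >= min_len and (d_v > tau_v or silent)):
            boundaries.append(time)
            last = time
    return boundaries
```

With constant non-silent audio and a hard scene cut, the sketch splits once at the 10 s length cap and again at the visual discontinuity, matching the "discontinuity or maximum length" rule above.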
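
The temporal-window pattern completion in the retrieval step reduces to seed selection, window expansion, and overlap lookup. The sketch below is a deliberately simplified stand-in for the detailed-recall path: memories are plain dicts, and the LLM confidence gate and answer synthesis are omitted.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def pattern_complete(query_emb, cue_memories, target_memories, k=2, delta=2.0):
    """Cross-modal pattern completion via temporal co-occurrence:
    1) rank cue-modality segments by embedding similarity, take top-k seeds;
    2) expand each seed's [t_s, t_e] interval by delta seconds;
    3) return complementary-modality segments overlapping any window.
    Each memory: {'emb': vec, 't_s': float, 't_e': float, 'text': str}."""
    seeds = sorted(cue_memories,
                   key=lambda m: cosine(query_emb, m['emb']), reverse=True)[:k]
    windows = [(m['t_s'] - delta, m['t_e'] + delta) for m in seeds]
    return [m for m in target_memories
            if any(m['t_s'] <= hi and m['t_e'] >= lo for lo, hi in windows)]
```

For the applause example: the query embedding selects the applause audio segment as seed, the window around it is widened by \(\delta\), and the visual descriptions overlapping that window are returned.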

HippoVlog Benchmark

A newly constructed benchmark comprising 25 daily vlogs (682 minutes total) with 1,000 manually verified questions spanning four memory function categories: cross-modal binding (\(T_{V \times A}\)), auditory retrieval (\(T_A\)), visual retrieval (\(T_V\)), and semantic reasoning (\(T_S\)). Inter-annotator agreement: Cohen's \(\kappa = 0.975\).

Key Experimental Results

Main Results

Performance comparison on the HippoVlog benchmark:

| Method | A+V | A | V | S | Avg. Accuracy | Response Time |
|---|---|---|---|---|---|---|
| VideoRAG | 63.6% | 67.2% | 41.2% | 84.8% | 64.2% | 112.5s |
| Ola | 72.4% | 85.6% | 57.6% | 84.0% | 74.9% | 79.4s |
| GPT-5 | 72.0% | 73.2% | 45.6% | 88.0% | 69.7% | - |
| VideoLLaMA 3 | - | - | 70.8% | 75.2% | 73.0% | 58.3s |
| HippoMM | 70.8% | 81.6% | 66.8% | 93.6% | 78.2% | 20.4s |

HippoMM achieves the highest accuracy while being more than 5× faster than VideoRAG.

Ablation Study

| Configuration | Avg. Accuracy | Response Time | Note |
|---|---|---|---|
| HippoMM (Full) | 78.2% | 20.4s | All components |
| w/o Detailed Recall | 61.2% (−17.0) | 6.39s | Largest accuracy drop |
| w/o Fast Retrieval | 74.6% (−3.6) | 19.54s | All queries take the detailed path |
| w/o Adaptive Reasoning | 76.8% (−1.4) | 11.2s | Minor accuracy drop |
| EOR-only | 71.1% (−7.1) | - | Embedding retrieval only, no LLM reasoning |
| Qwen2.5-14B replacing GPT-4o | 70.8% (−7.4) | 15.7s | Smaller model remains competitive |
| SAM (naive cognitive baseline) | 30.3% | - | Simple Hebbian association fails |

Key Findings

  • Detailed Recall is the most critical component: Removing it causes a 17-point accuracy drop, with the largest impact on cross-modal binding (70.8% → 39.2%) and visual retrieval (66.8% → 48.0%), demonstrating the indispensability of fine-grained cross-modal pattern completion.
  • Fast Retrieval mainly benefits semantically simple queries: Removing it reduces overall accuracy by only 3.6 points, and average response time is nearly unchanged (19.54s vs. 20.4s) because all queries then follow the detailed recall path.
  • Even when replacing GPT-4o with a smaller model (Qwen2.5-14B), accuracy remains at 70.8%, indicating that the cognitive architecture itself drives performance rather than reliance on a specific large model's capabilities.
  • The naive Hebbian auto-associative baseline SAM achieves only 30.3%, demonstrating that simple cognitive mappings are insufficient and that structured architectural design is necessary.
  • On the temporal understanding task NQA, HippoMM achieves 73.1%, a 46% improvement over VideoLLaMA 2.

Highlights & Insights

  • Cognitive science-guided system design: Rather than superficially invoking "bio-inspired" concepts, the work precisely maps the computational primitives of three hippocampal subregions (DG–CA3–CA1) to algorithmic modules, with each mapping accompanied by explicit functional correspondence and experimental validation.
  • The temporal window mechanism for cross-modal pattern completion cleverly exploits temporal co-occurrence as an associative cue—"sounds and images appearing at the same time belong to the same episode"—a simple assumption that proves highly effective in practice.
  • Confidence-gated dual-path retrieval avoids the overhead of exhaustive retrieval for all queries; semantically simple questions are answered directly at the summary level, while complex queries trigger fine-grained recall.
  • ThetaEvent dual representation bridges semantics and perception—embeddings support fast similarity search, text summaries enable LLM reasoning, and pointers back to raw data support detailed recall.
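
The \(\gamma\)-gated similarity filter behind this consolidation (with \(\gamma = 0.85\)) can be sketched as below. This is a simplified stand-alone version: in the paper the filter runs over the mean segment embeddings \(\mathbf{v}_i\), and the retained segments are then summarized by an LLM into ThetaEvent gists.

```python
import numpy as np

def consolidate(embeddings, gamma=0.85):
    """Similarity-filtered compression: retain segment i only if its
    (normalized) embedding has cosine similarity below gamma to every
    already-retained memory; return indices of retained segments."""
    kept = []  # list of (index, unit-norm embedding)
    for i, v in enumerate(embeddings):
        v = v / (np.linalg.norm(v) + 1e-12)
        if all(float(np.dot(v, u)) < gamma for _, u in kept):
            kept.append((i, v))
    return [i for i, _ in kept]
```

A near-duplicate of an earlier segment is dropped, while a dissimilar one survives, mimicking the sparse-storage behavior attributed to the biological analogy above.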

Limitations & Future Work

  • Memory formation requires 5.09 hours of processing time for 25 vlogs, making real-time deployment infeasible.
  • Segmentation and consolidation thresholds (\(\tau_v, \tau_a, \gamma\)) require manual tuning.
  • Cross-modal association relies on a temporal co-occurrence assumption, which may fail for semantically related but temporally non-overlapping content.
  • Evaluation is limited to daily vlog-style videos; generalization to other types (lectures, films, surveillance footage) has not been verified.
  • The system depends on multiple external models (ImageBind, Whisper, Qwen2.5-VL, GPT-4o), resulting in high system complexity.
Comparison with Related Work

  • vs. VideoRAG: VideoRAG performs retrieval augmentation directly, without an explicit memory structure. HippoMM achieves higher efficiency (5× speedup) and accuracy (+14 points) through episodic memory organization.
  • vs. MA-LMM: MA-LMM introduces a memory bank for long video processing but remains unimodal in design. HippoMM uniquely integrates pattern separation, consolidation, and cross-modal pattern completion.
  • vs. HippoRAG: HippoRAG maps hippocampal mechanisms to text retrieval; HippoMM extends this to continuous audiovisual understanding and cross-modal association.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The multimodal memory architecture is grounded in cognitive science principles; the cross-modal pattern completion mechanism is novel and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Ablations are comprehensive and the proposed benchmark is valuable, but evaluation on external benchmarks is limited.
  • Writing Quality: ⭐⭐⭐⭐ — Biological mappings are clearly explained, though the overall system pipeline is relatively complex.
  • Value: ⭐⭐⭐⭐ — Proposes a principled paradigm for long video understanding; the proposed benchmark can facilitate future research on cross-modal memory.