# HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding
Conference: CVPR 2026
arXiv: 2504.10739
Code: https://github.com/linyueqian/HippoMM
Area: Long Video Understanding
Keywords: Hippocampal cognitive architecture, multimodal memory, long video understanding, cross-modal association, episodic memory
## TL;DR
HippoMM maps three core hippocampal cognitive mechanisms—pattern separation (episodic segmentation), memory consolidation (semantic compression), and pattern completion (hierarchical retrieval)—into a computational architecture for episodic memory formation and cross-modal associative recall in long audiovisual streams. On the authors' proposed benchmark HippoVlog, the system achieves 78.2% accuracy while being 5× faster than retrieval-augmented baselines.
## Background & Motivation
- Background: Current multimodal models face three key challenges in long video understanding: (1) inability to efficiently memorize continuous content spanning hours; (2) inability to reconstruct complete experiences from partial sensory cues (e.g., a sound); and (3) inability to extract persistent abstract knowledge from transient perception. The human hippocampus naturally addresses all three challenges.
- Limitations of Prior Work: Existing methods either scale model size or design complex architectures to handle long videos (e.g., VideoLLaMA, Qwen2.5-Omni), but lack explicit memory mechanisms. These models can only process pre-segmented clips and cannot form episodic memories from continuous streams or perform cross-modal pattern completion (e.g., recalling a visual scene upon hearing applause).
- Key Challenge: Existing benchmarks (e.g., MLVU, Video-MME) evaluate comprehension of already-presented content rather than memory formation and associative recall. No standard evaluation protocol exists for cross-modal associative recall.
- Goals: (a) construct episodic memory from continuous audiovisual streams; (b) enable cross-modal pattern completion, where a cue from one modality triggers recall in another; and (c) balance accuracy and efficiency.
- Key Insight: The biological hippocampus addresses these problems through pattern separation in the dentate gyrus (DG), auto-associative pattern completion in CA3, and memory consolidation in CA1. The authors directly map these three mechanisms to algorithmic implementations.
- Core Idea: Map the hippocampal "segmentation–consolidation–retrieval" cognitive pipeline to a computational architecture of "content-adaptive segmentation → similarity-filtered compression → confidence-gated hierarchical retrieval" for episodic memory understanding of long audiovisual content.
## Method
### Overall Architecture
HippoMM operates in two stages. (1) Memory Formation: a continuous audiovisual stream \(X\) is transformed into a hierarchical memory structure \(M\) via episodic segmentation, perceptual encoding, and memory consolidation; \(M\) comprises short-term memory objects \(m_i\) and long-term semantic indices (ThetaEvent objects \(\theta_k\)). (2) Hierarchical Memory Retrieval: given a query \(q\), the system first attempts fast semantic retrieval, escalates to detailed recall if confidence is insufficient (supporting cross-modal pattern completion), and finally synthesizes an answer through adaptive reasoning.
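The two memory levels can be pictured with a minimal data-structure sketch. The class and field names below are illustrative assumptions made for this summary, not the repository's actual API.

```python
# Two-level memory structure: short-term episodic segments m_i and consolidated
# long-term semantic indices theta_k (ThetaEvent). Field names are assumptions.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ShortTermMemory:      # m_i: one episodic segment
    embedding: np.ndarray   # E_i, 1024-d ImageBind cross-modal embedding
    transcript: str         # Whisper speech transcription
    caption: str            # Qwen2.5-VL visual description
    t_start: float          # t_{s,i}, segment start time (seconds)
    t_end: float            # t_{e,i}, segment end time (seconds)


@dataclass
class ThetaEvent:           # theta_k: consolidated long-term semantic index
    embedding: np.ndarray   # mean segment embedding, used for fast similarity search
    summary: str            # textual gist S_{theta_k}, used for LLM reasoning
    segment_ids: list[int] = field(default_factory=list)  # pointers back to raw m_i
```

The dual embedding-plus-summary representation of ThetaEvent is what later enables fast semantic retrieval, while the segment pointers let detailed recall drop back to raw perceptual detail.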
### Key Designs
- Episodic Segmentation (Pattern Separation):
    - Function: Segments continuous audiovisual streams into discrete episodic units, simulating pattern separation in the dentate gyrus.
    - Mechanism: Segmentation is triggered by detecting visual discontinuities or auditory boundaries at time \(t\). Visual discontinuity is measured via SSIM: \(d_v(F_t, F_{t-1}) = 1 - \text{SSIM}(F_t, F_{t-1})\), with a split triggered when the difference exceeds threshold \(\tau_v\). Audio boundaries are detected from the frame's energy in decibels, \(d_a(a_t) = 20\log_{10}\sqrt{\tfrac{1}{N}\sum_i a_i^2}\); values below \(\tau_a\) indicate silence or pauses. Segment lengths are constrained to 5–10 seconds, consistent with human event segmentation timescales. (See the segmentation sketch after this list.)
    - Design Motivation: Fixed-window segmentation arbitrarily breaks continuous events or conflates unrelated scenes. Content-adaptive segmentation preserves semantic integrity, yielding a 46% improvement over VideoLLaMA 2 on the temporal understanding task (NQA).
- Perceptual Encoding + Memory Consolidation:
    - Function: Constructs multimodal representations for each episodic segment and compresses them into efficient semantic indices.
    - Mechanism: The perceptual encoding stage processes segments in parallel using three specialized models: ImageBind generates 1024-dimensional cross-modal embeddings \(\mathbf{E}_i\), Whisper produces speech transcriptions \(\mathcal{T}_a\), and Qwen2.5-VL generates visual descriptions \(\mathcal{T}_v\). These outputs are aggregated into ShortTermMemory objects \(m_i = \{\mathbf{E}_i, \mathbf{T}_i, \mathbf{C}_i, t_{s,i}, t_{e,i}\}\). During consolidation, cosine similarity filters redundant segments: for each segment, the mean embedding \(\mathbf{v}_i\) is computed, and the segment is retained only if its similarity to every already-retained memory falls below threshold \(\gamma\), i.e., \(K = \{i \mid \forall j \in K, j < i \Rightarrow \cos(\mathbf{v}_i, \mathbf{v}_j) < \gamma\}\) with \(\gamma = 0.85\). An LLM (Qwen2.5-VL) then synthesizes the multimodal content of each retained segment into a concise textual gist \(\mathbf{S}_{\theta_k}\), forming the ThetaEvent object. (See the consolidation sketch after this list.)
    - Design Motivation: The filtering strategy simulates the sparsity of CA3 (only 2–5% of neurons active), creating efficient memory storage. The dual representation of ThetaEvent (embedding + semantic summary) bridges abstract semantics and perceptual detail, mirroring the function of CA1 in biological memory consolidation.
- Hierarchical Memory Retrieval (Pattern Completion):
    - Function: Implements a dual-path retrieval system supporting fast semantic retrieval and detailed cross-modal recall.
    - Mechanism: Fast retrieval \(\Phi_{\text{fast}}\) is attempted first: it searches only ThetaEvent summaries and scores answer confidence with Qwen2.5-VL. If confidence falls below threshold \(\tau = 0.75\), the system escalates to detailed recall \(\Psi_{\text{detailed}}\). The key innovation in detailed recall is cross-modal pattern completion: query cues identify seed segments in the cue modality, \(\mathbf{S}_{\text{query}} = \text{TopK}(\text{sim}(q_{\text{embed}}, \{\mathbf{v}_k\}), k)\); a temporal window \(\mathbf{W} = \{[t_{s,k} - \delta, t_{e,k} + \delta]\}\) is expanded around the seeds; and information from the complementary modality, \(\mathbf{S}_{\text{target}}\), is then retrieved within the expanded window. For example, "What was on screen when the applause started?" → locate audio segments containing applause → expand the temporal window → retrieve visual descriptions overlapping that window. (See the pattern-completion sketch after this list.)
    - Design Motivation: Fast retrieval handles high-level semantic queries (e.g., "What is the theme of the video?"), while detailed recall handles cross-modal queries requiring precise temporal localization. The on-demand escalation balances efficiency and accuracy: relying on detailed recall alone is roughly three times slower than answering from the fast path alone (19.54s vs. 6.39s in the ablation), while skipping detailed recall costs 17 points of accuracy.
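The boundary rule of the segmentation step can be written down in a few lines. This is a minimal sketch assuming uint8 grayscale frames and mono audio chunks; the threshold values and helper names are illustrative, not the paper's tuned settings.

```python
# Content-adaptive boundary detection (pattern separation): split on a visual jump
# (1 - SSIM above tau_v) or an audio pause (dB level below tau_a). Sketch only;
# tau_v and tau_a are illustrative values.
import numpy as np
from skimage.metrics import structural_similarity as ssim


def visual_discontinuity(prev_frame: np.ndarray, curr_frame: np.ndarray) -> float:
    """d_v(F_t, F_{t-1}) = 1 - SSIM between consecutive grayscale frames (uint8, 0-255)."""
    return 1.0 - ssim(prev_frame, curr_frame, data_range=255)


def audio_level_db(chunk: np.ndarray) -> float:
    """Signal level in decibels: 20 * log10 of the RMS amplitude, floored to avoid log(0)."""
    rms = np.sqrt(np.mean(chunk.astype(np.float64) ** 2))
    return 20.0 * np.log10(max(rms, 1e-10))


def is_boundary(prev_frame, curr_frame, audio_chunk,
                tau_v: float = 0.4, tau_a: float = -40.0) -> bool:
    """Trigger a new episodic segment on a visual discontinuity or an audio pause."""
    return (visual_discontinuity(prev_frame, curr_frame) > tau_v
            or audio_level_db(audio_chunk) < tau_a)
```

Boundaries produced this way would then be adjusted so that segments stay within the 5–10 second range described above.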
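The similarity-filtered compression step reduces to a greedy deduplication pass. A minimal sketch, assuming each segment is represented by its mean embedding as a NumPy vector; the LLM call that writes the ThetaEvent gist is omitted.

```python
# Memory consolidation: retain segment i only if its mean embedding is less than
# gamma-similar to every memory already retained, i.e.
# K = {i | for all retained j < i: cos(v_i, v_j) < gamma}. Sketch only.
import numpy as np


def consolidate(mean_embeddings: list[np.ndarray], gamma: float = 0.85) -> list[int]:
    """Greedy redundancy filter over mean segment embeddings; returns retained indices K."""
    unit = [v / np.linalg.norm(v) for v in mean_embeddings]
    kept: list[int] = []
    for i, v in enumerate(unit):
        if all(float(np.dot(v, unit[j])) < gamma for j in kept):
            kept.append(i)
    return kept
```

Each retained segment would then be summarized into a ThetaEvent (embedding plus textual gist) as described above.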
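Cross-modal pattern completion in detailed recall can likewise be sketched under simple assumptions: segments carry start/end times plus per-modality text, and query and segment embeddings come from the same unit-normalized encoder. The function names, dictionary fields, and window margin are illustrative.

```python
# Detailed recall (pattern completion): find top-k seed segments matching the query cue,
# expand a temporal window around them, then gather the complementary modality's text
# from overlapping segments. Names, fields, and delta are illustrative assumptions.
import numpy as np


def top_k_seeds(query_emb: np.ndarray, seg_embs: list[np.ndarray], k: int = 3) -> list[int]:
    """S_query = TopK(sim(q_embed, {v_k}), k) over unit-normalized embeddings."""
    sims = [float(np.dot(query_emb, v)) for v in seg_embs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]


def expand_windows(segments: list[dict], seeds: list[int],
                   delta: float = 5.0) -> list[tuple[float, float]]:
    """W = {[t_{s,k} - delta, t_{e,k} + delta]} around each seed segment."""
    return [(segments[i]["t_start"] - delta, segments[i]["t_end"] + delta) for i in seeds]


def complete_pattern(segments: list[dict], windows: list[tuple[float, float]],
                     target_field: str = "caption") -> list[str]:
    """Collect the complementary modality's text from segments overlapping any window."""
    return [seg[target_field]
            for seg in segments
            if any(seg["t_start"] <= end and seg["t_end"] >= start for start, end in windows)]
```

For the applause example, the seeds would be the audio segments most similar to the cue, and target_field would select the visual descriptions found inside the expanded windows.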
## HippoVlog Benchmark
A newly constructed benchmark comprising 25 daily vlogs (682 minutes total) with 1,000 manually verified questions spanning four memory function categories: cross-modal binding (\(T_{V \times A}\)), auditory retrieval (\(T_A\)), visual retrieval (\(T_V\)), and semantic reasoning (\(T_S\)). Inter-annotator agreement: Cohen's \(\kappa = 0.975\).
## Key Experimental Results
### Main Results
Performance comparison on the HippoVlog benchmark:
| Method | A+V | A | V | S | Avg. Accuracy | Response Time |
|---|---|---|---|---|---|---|
| VideoRAG | 63.6% | 67.2% | 41.2% | 84.8% | 64.2% | 112.5s |
| Ola | 72.4% | 85.6% | 57.6% | 84.0% | 74.9% | 79.4s |
| GPT-5 | 72.0% | 73.2% | 45.6% | 88.0% | 69.7% | - |
| VideoLLaMA 3 | - | - | 70.8% | 75.2% | 73.0% | 58.3s |
| HippoMM | 70.8% | 81.6% | 66.8% | 93.6% | 78.2% | 20.4s |
HippoMM achieves the highest accuracy while being more than 5× faster than VideoRAG.
### Ablation Study
| Configuration | Avg. Accuracy | Response Time | Note |
|---|---|---|---|
| HippoMM (Full) | 78.2% | 20.4s | All components |
| w/o Detailed Recall | 61.2% (−17.0) | 6.39s | Largest accuracy drop |
| w/o Fast Retrieval | 74.6% (−3.6) | 19.54s | All queries take the detailed path |
| w/o Adaptive Reasoning | 76.8% (−1.4) | 11.2s | Minor accuracy drop |
| Embedding-only retrieval (EOR) | 71.1% (−7.1) | - | Reaches 71% without LLM reasoning |
| Qwen2.5-14B replacing GPT-4o | 70.8% (−7.4) | 15.7s | Smaller model remains competitive |
| SAM (naive cognitive baseline) | 30.3% | - | Simple Hebbian association fails |
### Key Findings
- Detailed Recall is the most critical component: Removing it causes a 17-point accuracy drop, with the largest impact on cross-modal binding (70.8% → 39.2%) and visual retrieval (66.8% → 48.0%), demonstrating the indispensability of fine-grained cross-modal pattern completion.
- Fast Retrieval mainly serves efficiency rather than accuracy: Removing it reduces accuracy by only 3.6 points, and the average response time stays nearly unchanged because every query then follows the detailed recall path.
- Even when replacing GPT-4o with a smaller model (Qwen2.5-14B), accuracy remains at 70.8%, indicating that the cognitive architecture itself drives performance rather than reliance on a specific large model's capabilities.
- The naive Hebbian auto-associative baseline SAM achieves only 30.3%, demonstrating that simple cognitive mappings are insufficient and that structured architectural design is necessary.
- On the temporal understanding task NQA, HippoMM achieves 73.1%, a 46% improvement over VideoLLaMA 2.
## Highlights & Insights
- Cognitive science-guided system design: Rather than superficially invoking "bio-inspired" concepts, the work precisely maps the computational primitives of three hippocampal subregions (DG–CA3–CA1) to algorithmic modules, with each mapping accompanied by explicit functional correspondence and experimental validation.
- The temporal window mechanism for cross-modal pattern completion cleverly exploits temporal co-occurrence as an associative cue—"sounds and images appearing at the same time belong to the same episode"—a simple assumption that proves highly effective in practice.
- Confidence-gated dual-path retrieval avoids the overhead of exhaustive retrieval for all queries; semantically simple questions are answered directly at the summary level, while complex queries trigger fine-grained recall.
- ThetaEvent dual representation bridges semantics and perception—embeddings support fast similarity search, text summaries enable LLM reasoning, and pointers back to raw data support detailed recall.
## Limitations & Future Work
- Memory formation requires 5.09 hours of processing time for 25 vlogs, making real-time deployment infeasible.
- Segmentation and consolidation thresholds (\(\tau_v, \tau_a, \gamma\)) require manual tuning.
- Cross-modal association relies on a temporal co-occurrence assumption, which may fail for semantically related but temporally non-overlapping content.
- Evaluation is limited to daily vlog-style videos; generalization to other types (lectures, films, surveillance footage) has not been verified.
- The system depends on multiple external models (ImageBind, Whisper, Qwen2.5-VL, GPT-4o), resulting in high system complexity.
## Related Work & Insights
- vs. VideoRAG: VideoRAG performs retrieval augmentation directly without explicit memory structure. HippoMM achieves higher efficiency (5× speedup) and accuracy (+14%) through episodic memory organization.
- vs. MA-LMM: MA-LMM introduces a memory bank for long video processing but remains unimodal in design. HippoMM uniquely integrates pattern separation, consolidation, and cross-modal pattern completion.
- vs. HippoRAG: HippoRAG maps hippocampal mechanisms to text retrieval; HippoMM extends this to continuous audiovisual understanding and cross-modal association.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — The multimodal memory architecture is grounded in cognitive science principles; the cross-modal pattern completion mechanism is novel and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Ablations are comprehensive and the proposed benchmark is valuable, but evaluation on external benchmarks is limited.
- Writing Quality: ⭐⭐⭐⭐ — Biological mappings are clearly explained, though the overall system pipeline is relatively complex.
- Value: ⭐⭐⭐⭐ — Proposes a principled paradigm for long video understanding; the proposed benchmark can facilitate future research on cross-modal memory.