ConnectomeBench: Can LLMs Proofread the Connectome?

Conference: NeurIPS 2025 | arXiv: 2511.05542 | Code: Project Page | Area: Image Segmentation | Keywords: connectomics proofreading, multimodal LLM, 3D neuron mesh, segmentation error detection, benchmark

TL;DR

This paper introduces ConnectomeBench, the first standardized benchmark for evaluating multimodal LLMs on three key connectomics proofreading tasks: segment identification, split error correction, and merge error detection. o4-mini achieves 85% on the split correction multiple-choice task, yet merge error detection remains significantly below human expert performance.

Background & Motivation

Field Bottleneck: Connectomics reconstructs brain neural connectivity through high-resolution electron microscopy imaging and automated segmentation, but segmentation algorithms inevitably produce split and merge errors that require extensive manual proofreading—the Drosophila whole-brain connectome consumed approximately 33 person-years of proofreading effort.

Scalability Challenge: As the field advances toward larger-scale connectome reconstructions (e.g., a 1 mm³ volume of mouse visual cortex), the cost of manual proofreading will grow exponentially, making automated solutions urgently necessary.

LLM Visual Reasoning Progress: Multimodal LLMs (e.g., o3 approaching human-level performance on CharXiv) have demonstrated increasingly strong visual reasoning capabilities, opening new possibilities for scientific task automation.

Limitations of Existing Methods: Prior work includes heuristic-based graph approaches (NEURD), CNN-guided proofreading, and RoboEM, but these methods are either brittle and non-generalizable or require task-specific training—lacking a universal evaluation framework.

Absence of Standardized Evaluation: No standardized benchmark exists for systematically measuring AI system performance on connectomics proofreading tasks, making consistent cross-model and cross-time comparison impossible.

Core Idea: Render 3D neuron meshes as multi-view 2D images (orthographic triplanar views) to transform the problem into a visual question-answering task suitable for LLMs, and construct a standardized benchmark covering three key proofreading capabilities.

Method

Overall Architecture

ConnectomeBench is built upon two large-scale open-source connectome datasets:

  • MICrONS: a 1 mm³ volume of mouse visual cortex with approximately 200,000 proofread neurons
  • FlyWire: the complete Drosophila brain with approximately 140,000 proofread neurons

The CAVEClient is used to access segmentation edit histories, obtaining pre- and post-proofreading segmentation states as ground truth. Each 3D mesh is rendered from top, side, and front views into 1024×1024 images, which are directly provided as visual input to LLMs. Three tasks are evaluated: segment identification, split error correction, and merge error detection.
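To make the rendering step concrete, below is a minimal sketch (not the authors' code) of producing orthographic triplanar views with NumPy and matplotlib, assuming the mesh vertices have already been downloaded (e.g., via CAVEclient/cloud-volume). The scatter-based rendering and the view-to-axis mapping are illustrative simplifications of a proper mesh rasterization.

```python
# Hypothetical sketch: orthographic triplanar renders of one segment's mesh.
# `vertices` is assumed to be an (N, 3) array of vertex coordinates in nm.
import numpy as np
import matplotlib.pyplot as plt

def render_triplanar(vertices: np.ndarray, out_prefix: str, size_px: int = 1024) -> None:
    """Project mesh vertices onto three orthogonal planes and save one PNG per view."""
    views = {"top_xy": (0, 1), "front_xz": (0, 2), "side_yz": (1, 2)}
    for name, (i, j) in views.items():
        fig, ax = plt.subplots(figsize=(size_px / 100, size_px / 100), dpi=100)
        ax.scatter(vertices[:, i], vertices[:, j], s=0.1, c="black")
        ax.set_aspect("equal")
        ax.axis("off")
        fig.savefig(f"{out_prefix}_{name}.png", bbox_inches="tight", pad_inches=0)
        plt.close(fig)

# Usage with random points standing in for a real mesh:
# render_triplanar(np.random.rand(10_000, 3) * 1e4, "segment_example")
```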

Key Design 1: Segment Identification

  • Function: Classify segmentation fragments into 5 categories—single-body neurons, multi-body merged neurons, somata-free processes, nuclei, and non-neuronal cells
  • Mechanism: Render three views of the complete 3D mesh and prompt the LLM for multi-class classification; explore two prompting strategies: "Description" (providing morphological descriptions for each category) and "Null" (no additional context) — a sketch of both variants follows this list
  • Design Motivation: Classification is a foundational step in proofreading—identifying multi-body merges is itself equivalent to detecting merge errors; category priors can assist subsequent proofreading decisions
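As a rough illustration of the two prompting strategies, the sketch below assembles a "Description" or "Null" prompt for the five-way classification; the category descriptions are hypothetical paraphrases, not the paper's actual prompt wording.

```python
# Illustrative prompt construction for segment identification (not the authors' prompts).
CATEGORIES = [
    "single-body neuron",
    "multi-body merged neuron",
    "somata-free process",
    "nucleus",
    "non-neuronal cell",
]

# Hypothetical morphological descriptions used only by the "Description" variant.
DESCRIPTIONS = {
    "single-body neuron": "one soma with a coherent, connected set of processes",
    "multi-body merged neuron": "two or more somata joined into a single segment",
    "somata-free process": "an axonal or dendritic fragment with no visible soma",
    "nucleus": "a compact, roughly spherical body without processes",
    "non-neuronal cell": "glial or vascular morphology",
}

def build_prompt(strategy: str = "description") -> str:
    lines = [
        "You are shown three orthogonal views of a single 3D segment.",
        f"Classify it into exactly one of: {', '.join(CATEGORIES)}.",
    ]
    if strategy == "description":
        lines += [f"- {c}: {d}" for c, d in DESCRIPTIONS.items()]
    return "\n".join(lines)
```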

Key Design 2: Split Error Correction

  • Function: Determine whether two separate fragments should be merged (binary classification), or select the correct merge target from multiple candidates (multiple-choice classification)
  • Mechanism: Positive examples are drawn from human-executed merge operations in edit histories; negative examples are generated by sampling neighboring fragments within 128 nm laterally and 120–880 nm longitudinally from interface points (see the sampling sketch after this list); rendering regions are cropped using a cubic bounding box 4096 nm on a side
  • Design Motivation: The longitudinal sampling range of 120–880 nm is designed to handle gaps caused by missing imaging sections—the primary source of split errors in real data
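A hedged sketch of the negative-sampling geometry described above, assuming interface points are nanometer coordinates and that "longitudinal" means along the sectioning axis; the helper name and axis convention are assumptions, and only the 128 nm and 120–880 nm thresholds come from the text.

```python
# Hypothetical check for whether a neighboring fragment qualifies as a negative example.
import numpy as np

LATERAL_MAX_NM = 128.0
LONGITUDINAL_RANGE_NM = (120.0, 880.0)

def is_valid_negative(interface_pt: np.ndarray, candidate_pt: np.ndarray, axis: int = 2) -> bool:
    """Accept a candidate if it lies within the lateral radius and the longitudinal gap window."""
    delta = candidate_pt - interface_pt
    longitudinal = abs(delta[axis])                    # offset along the sectioning axis
    lateral = np.linalg.norm(np.delete(delta, axis))   # in-plane offset
    lo, hi = LONGITUDINAL_RANGE_NM
    return lateral <= LATERAL_MAX_NM and lo <= longitudinal <= hi
```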

Key Design 3: Merge Error Detection

  • Function: Determine whether a merge error exists within a given fragment (binary classification), or identify which of two candidates contains a merge error (multiple-choice classification)
  • Mechanism: Positive examples are drawn from human-annotated merge corrections in edit histories; the crop bounding box has side length \(\max(4096\,\text{nm},\, 2 \times \text{smaller fragment size})\), dynamically adapting to error regions of varying sizes (a small helper implementing this rule follows this list); negative examples use the final proofread meshes
  • Design Motivation: Merge errors are typically visible as anomalous branches or unnatural connections in processes; variable-size cropping ensures sufficient spatial context
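The adaptive crop rule can be written directly from the formula above; in the tiny helper below, fragment "size" is assumed to mean the fragment's bounding-box extent in nanometers.

```python
# Side length of the cubic crop around a candidate merge error (rule from the text).
def crop_size_nm(frag_a_extent_nm: float, frag_b_extent_nm: float) -> float:
    return max(4096.0, 2.0 * min(frag_a_extent_nm, frag_b_extent_nm))
```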

Key Design 4: Heuristic-Guided Reasoning

  • Function: Analyze reasoning error patterns of o4-mini and distill 7 heuristic rules to embed in prompts
  • Mechanism: The model exhibits biases such as "the correct merge target should be a small extension" and "a large gap indicates fragments should not be merged"—whereas in practice, split fragments can be as large as the original, and missing data can cause large gaps. Heuristics correcting these biases are written into the prompts (see the sketch after this list)
  • Design Motivation: Leverage LLMs' natural language reasoning ability to understand failure modes and improve performance in a targeted manner, avoiding the cost of training or fine-tuning
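A sketch of how heuristic-guided prompting could look: corrective rules are appended to the base task prompt. The two rules below paraphrase the biases quoted above (the paper distills seven in total); the function and wording are illustrative, not the authors' prompt template.

```python
# Hypothetical heuristic injection into a split-correction prompt.
HEURISTICS = [
    "Do not assume the correct merge target must be a small extension; "
    "split-off fragments can be as large as the original segment.",
    "Do not treat a large spatial gap as proof that two fragments are separate; "
    "missing imaging sections can create large gaps between true partners.",
]

def add_heuristics(base_prompt: str) -> str:
    rules = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(HEURISTICS))
    return f"{base_prompt}\n\nKeep these rules in mind:\n{rules}"
```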

Evaluation Strategy

  • Each prompt is repeated 5–10 times with majority voting as the final answer (sketched after this list)
  • 100 randomly sampled examples per task are used for analysis
  • Expert baselines are provided by trained graduate/undergraduate annotators labeling approximately 50 samples
  • A ResNet-50 classifier is additionally trained as a conventional deep learning baseline
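The repeat-and-vote protocol is simple to sketch; `query_model` below is a placeholder for whatever chat-completion call is used, not a real API.

```python
# Majority voting over repeated queries of the same prompt.
from collections import Counter
from typing import Callable

def majority_vote(query_model: Callable[[str], str], prompt: str, n_repeats: int = 5) -> str:
    answers = [query_model(prompt) for _ in range(n_repeats)]
    return Counter(answers).most_common(1)[0][0]
```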

Key Experimental Results

Table 1: Segment Identification Balanced Accuracy

| Model | FlyWire | MICrONS |
|---|---|---|
| Claude 3.7+Desc | 0.459 | 0.822 |
| o4-mini+Desc | 0.511 | 0.728 |
| GPT-4.1+Desc | 0.529 | 0.655 |
| GPT-4o+Desc | 0.396 | 0.588 |
| InternVL-3 78B+Desc | 0.320 | 0.493 |
| InternVL-3 8B+Desc | 0.303 | 0.417 |
| NVLM+Desc | 0.234 | 0.258 |
| ResNet-50 | 0.552 | 0.587 |
| Random Baseline | 0.200 | 0.250 |

Table 2: Split Error Correction (Binary & Multiple-Choice) — Best Configurations

| Model+Prompt | Binary FlyWire | Binary MICrONS | MC FlyWire | MC MICrONS |
|---|---|---|---|---|
| o4-mini+Heuristics | 0.754 | 0.786 | 0.788 | 0.850 |
| o4-mini+Desc | 0.631 | 0.679 | 0.828 | 0.790 |
| Claude 4+Heuristics | 0.551 | 0.587 | 0.677 | 0.770 |
| GPT-4o+Heuristics | 0.556 | 0.536 | 0.667 | 0.720 |
| ResNet-50 | 0.720 | 0.667 | 0.721 | 0.693 |
| Human Expert | 0.840 | 0.902 | 0.900 | 0.920 |

Table 3: Merge Error Detection (Binary & Multiple-Choice)

| Model+Prompt | Binary FlyWire | Binary MICrONS | MC FlyWire | MC MICrONS |
|---|---|---|---|---|
| o4-mini+Desc | 0.628 | 0.615 | 0.670 | 0.703 |
| o4-mini+Null | 0.553 | 0.591 | 0.740 | 0.689 |
| Claude 4+Desc | 0.487 | 0.480 | 0.560 | 0.530 |
| GPT-4o+Desc | 0.538 | 0.517 | 0.345 | 0.361 |
| ResNet-50 | 0.769 | 0.798 | 0.569 | 0.541 |
| Human Expert | 0.740 | 0.800 | 0.840 | 0.796 |

Key Findings

  • Surprisingly Strong Segment Identification: All proprietary models far exceed the random baseline (20–25%), with Claude 3.7 reaching 82.2% on MICrONS, surpassing ResNet-50 (58.7%)
  • Multiple-Choice Outperforms Binary in Split Correction: Relative comparison is easier than absolute judgment; o4-mini achieves up to 85.0% on multiple-choice
  • Heuristic Reasoning Effectively Improves Performance: o4-mini binary split correction improves from 67.9% to 78.6% (+10.7 pp), and multiple-choice from 79.0% to 85.0% (+6.0 pp)
  • Merge Error Detection Is the Hardest Task: The best model, o4-mini, reaches only 62.8% on binary detection, well behind human experts (74–80%) and ResNet-50 (76.9–79.8%)
  • Clear Gap for Open-Source Models: InternVL-3 78B/8B and NVLM are consistently weaker than proprietary models across all tasks
  • Description Prompting Effectiveness Varies by Model: It provides almost no benefit for Claude 3.7 on segment identification (whose internal priors are already sufficient), but is broadly effective for split correction multiple-choice

Highlights & Insights

  • First Systematic Connectomics AI Proofreading Benchmark: Provides a standardized framework for evaluating LLM capability on a critical neuroscience task, enabling cross-model and longitudinal progress tracking
  • Zero-Shot 3D-to-Multi-View-2D Pipeline: Enables LLMs to process 3D data without any training, demonstrating the generalization potential of multimodal models
  • Reasoning Analysis → Heuristic Feedback Loop: By analyzing systematic biases in LLM reasoning, the paper derives corrective rules directly embeddable in prompts—this methodology of "analyzing failures → improving prompts" has broad applicability
  • o4-mini Consistently Leads: o4-mini outperforms nearly all other models across all three tasks, suggesting that reasoning capability—rather than pure visual ability—is the key factor for such tasks

Limitations & Future Work

  • Insufficient Merge Error Detection: Current best LLM performance (~63%) falls far below human (~80%) and ResNet-50 (~80%) levels, and even underperforms simple baselines in some settings—this is the primary obstacle toward automated proofreading
  • 3D Information Loss: Triplanar rendering inevitably loses 3D topological information, which is particularly detrimental for merge error detection in complex branching structures
  • Limited Evaluation Scale: Only 100 samples per task are analyzed, with approximately 50 expert baseline annotations—statistical power is low and confidence intervals are wide
  • Incomplete Proofreading Pipeline Coverage: Key components such as synapse identification and merge error correction (detection only, no correction) are not included
  • High Computational Cost: Each sample requires multiple API calls with majority voting, posing cost constraints for large-scale deployment
  • Dataset Bias: Coverage is limited to MICrONS (mouse) and FlyWire (Drosophila), which may not generalize to other species or imaging conditions

Related Work

  • NEURD (Celii et al., 2025): Converts 3D neuron meshes into annotated graph representations and applies heuristic graph rules to correct merge errors—interpretable but brittle
  • RoboEM (Schmidt et al., 2024): Models axon tracing as a flight control problem using CNNs—elegant but task-specific
  • RLCorrector (Nguyen et al., 2021): A reinforcement learning agent that simulates human proofreading workflows—foreshadows the potential of AI agents in this domain
  • Insights: Future work could combine LLMs' reasoning capabilities with specialized 3D models (e.g., point cloud Transformers) to build end-to-end proofreading systems within an agent framework

Rating

  • Novelty: ⭐⭐⭐⭐ First systematic benchmark evaluating LLM connectomics proofreading capability; the zero-shot 3D-to-multi-view rendering evaluation approach is novel
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three tasks × two datasets × eight models × multiple prompting strategies, with human expert and ResNet baselines, though sample sizes are relatively small
  • Writing Quality: ⭐⭐⭐⭐ Field background is clearly presented, task motivation is well-justified, and the data construction process is described in detail
  • Value: ⭐⭐⭐⭐ Provides an important capability boundary analysis for LLM applications in scientific tasks; the heuristic-guided reasoning enhancement methodology offers broadly applicable reference value