DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking¶

Conference: ACL 2025
arXiv: 2502.20730
Code: https://github.com/Li-Z-Q/DeepSolution
Area: Model Compression
Keywords: RAG, tree search, engineering solution design, bi-point thinking, benchmark
Authors: Zhuoqun Li, Haiyang Yu, Xuanang Chen, Hongyu Lin, Yaojie Lu, Fei Huang, Xianpei Han, Yongbin Li, Le Sun (ISCAS & Tongyi Lab)

TL;DR¶

This paper proposes SolutionBench, a new benchmark, and SolutionRAG, a new framework, for complex engineering solution design. By leveraging tree-based exploration and bi-point thinking (alternating design and review) within a RAG framework, it progressively generates reliable engineering solutions satisfying multiple constraints, achieving state-of-the-art (SOTA) results across 8 engineering domains.

Background & Motivation¶

Task Definition: Complex engineering solution design requires generating complete and feasible solutions for engineering demands characterized by multiple real-world constraints (e.g., designing a safe and efficient hospital construction plan in an area with 3000mm annual rainfall, expansive soil, and high-frequency earthquakes).
Limitations of Prior Work: Previous RAG research focused primarily on Multi-hop QA and Long-form QA, where answers are entity fragments or concatenated paragraphs. Engineering solution design is fundamentally different: it requires a flexible improvement process and a complete solution that satisfies every constraint.
Key Challenge: (1) The improvement path from sub-optimal to reliable solutions is highly flexible and lacks fixed reasoning patterns; (2) Demands contain multiple real-world constraints, making it extremely difficult to satisfy all of them in a single generation.

Method¶

1. SolutionBench Benchmark Construction¶

Construction process: Collect technical reports from authoritative engineering journals \(\rightarrow\) extract structured content using GPT-4o with hand-crafted templates \(\rightarrow\) manual verification and deduplication \(\rightarrow\) combine into a dataset and knowledge base spanning 8 engineering domains.

Each data item contains 5 fields:
- Requirement: Complex engineering requirements from real-world scenarios.
- Solution: Standard solutions designed by industry experts.
- Analytical Knowledge: Professional knowledge used in analyzing the requirements.
- Technical Knowledge: Technical knowledge utilized to resolve the requirements.
- Explanation: Detailed description of the expert's solution design process.
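As a rough illustration of the five fields above, one data item could be modelled as a small record. The class name, attribute names, and placeholder values below are assumptions made for this sketch; the released dataset may use different keys and types.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class SolutionBenchItem:
    """Hypothetical container mirroring the five fields described above."""
    requirement: str                       # real-world engineering requirement
    solution: str                          # expert-designed reference solution
    analytical_knowledge: List[str] = field(default_factory=list)  # used to analyze the requirement
    technical_knowledge: List[str] = field(default_factory=list)   # used to resolve the requirement
    explanation: str = ""                  # expert's design process


# Placeholder values only; none of this text comes from the benchmark itself.
item = SolutionBenchItem(
    requirement="Hospital construction plan for an area with 3000mm annual "
                "rainfall, expansive soil, and high-frequency earthquakes.",
    solution="(expert-written solution text)",
    analytical_knowledge=["(knowledge used when analyzing the requirement)"],
    technical_knowledge=["(knowledge used when resolving the requirement)"],
    explanation="(description of the design process)",
)

field_names = [f.name for f in fields(item)]
print(field_names)
```

Keeping the two knowledge types as separate lists matches the two metrics reported later (Analytical Score and Technical Score).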

Covers 8 domains: Environmental, Mining, Transportation, Aerospace, Telecom, Construction, Water Conservancy, and Agriculture, totaling 950 data items and roughly 6,100 pieces of knowledge (summing the per-domain counts in Table 1).

2. SolutionRAG System¶

The core idea is to perform tree search reasoning over a Bi-point Thinking Tree:

(a) Bi-point Thinking Tree Structure
- Solution Node: Stores solutions designed for the requirements (low reliability at shallower depth, higher reliability at deeper depth).
- Comment Node: Stores review comments pointing out the limitations of a specific solution.
- The two types of nodes alternate: Solution Node \(\rightarrow\) Comment Node \(\rightarrow\) Better Solution Node \(\rightarrow\) ...
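The alternating structure can be sketched with a single node type whose children always take the opposite kind. This is a minimal sketch; the `Node` class, its attributes, and the example texts are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; `kind` is either "solution" or "comment"."""
    kind: str
    text: str
    depth: int
    # Excluded from repr/compare so printing a node does not recurse through parents.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list, repr=False, compare=False)
    score: float = 0.0

    def add_child(self, text: str) -> "Node":
        # A child always has the opposite kind, which enforces the alternation.
        child_kind = "comment" if self.kind == "solution" else "solution"
        child = Node(kind=child_kind, text=text, depth=self.depth + 1, parent=self)
        self.children.append(child)
        return child


root = Node(kind="solution", text="initial solution", depth=0)
comment = root.add_child("the plan ignores the expansive-soil constraint")
better = comment.add_child("revised solution that adds soil treatment")

path = [n.kind for n in (root, comment, better)]
print(path)
```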

(b) Node Expansion — Design & Review
- Design: Given requirement \(q\), upper-level comment \(c\), and historical solution \(s\), the LLM first samples \(H\) improvement proposals \(\rightarrow\) retrieves relevant knowledge for each proposal from the knowledge base \(\rightarrow\) integrates this info to generate a better solution.
- Review: Given requirement \(q\) and current solution \(s\), the LLM similarly generates \(H\) review directions \(\rightarrow\) retrieves knowledge \(\rightarrow\) generates review comments.
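The two expansion steps share the same propose, retrieve, generate shape, which the sketch below makes explicit. Everything here is an assumption for illustration: the function names, the prompt wording, the stub LLM and retriever, and the choice to produce one child per proposal (the summary does not say whether proposals are merged).

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]                   # prompt -> generated text
Retriever = Callable[[str, int], List[str]]  # (query, top_r) -> documents


def _expand(task: str, context: str, llm: LLM, retrieve: Retriever,
            H: int, R: int) -> List[str]:
    """Propose H directions, retrieve knowledge for each, generate one output each."""
    outputs = []
    for i in range(H):
        direction = llm(f"Propose {task} direction #{i + 1}.\n{context}")
        knowledge = "\n".join(retrieve(direction, R))
        outputs.append(llm(
            f"Write the {task} output.\n{context}\n"
            f"Direction: {direction}\nKnowledge:\n{knowledge}"
        ))
    return outputs


def design(q: str, c: str, s: str, llm: LLM, retrieve: Retriever,
           H: int = 2, R: int = 10) -> List[str]:
    context = f"Requirement: {q}\nComment: {c}\nPrevious solution: {s}"
    return _expand("design", context, llm, retrieve, H, R)


def review(q: str, s: str, llm: LLM, retrieve: Retriever,
           H: int = 2, R: int = 10) -> List[str]:
    context = f"Requirement: {q}\nSolution: {s}"
    return _expand("review", context, llm, retrieve, H, R)


# Stubs standing in for a real LLM and retriever; they only count calls.
calls: Dict[str, int] = {"llm": 0, "retrieve": 0}


def fake_llm(prompt: str) -> str:
    calls["llm"] += 1
    return f"llm-output-{calls['llm']}"


def fake_retrieve(query: str, k: int) -> List[str]:
    calls["retrieve"] += 1
    return [f"doc-{j}" for j in range(k)]


new_solutions = design("req", "comment", "old solution", fake_llm, fake_retrieve)
new_comments = review("req", new_solutions[0], fake_llm, fake_retrieve)
print(new_solutions, new_comments, calls)
```

Under these assumptions each expansion costs 2H LLM calls and H retrievals, which is why small values of H keep the search affordable.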

(c) Node Evaluation and Pruning
- Use LLM logits to score solution nodes and comment nodes.
- Solution score: Concatenate Solution + Comment + suffix "According to the comment, above solution is reliable", and take the average logits as the reliability score.
- Comment score: Concatenate Old Solution + Comment + New Solution + suffix "Comparing the new solution and old solution, the comment is helpful", and take the average logits as the helpfulness score.
- Only the \(W\) highest-scoring nodes are kept at each layer to balance efficiency and performance.
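The scoring and pruning rules above can be sketched as follows. The two suffix strings are quoted from the description; everything else is an assumption: the `TokenScorer` interface, averaging over the suffix tokens only (the summary does not say which tokens are averaged), the newline separators, and the toy scorer that replaces a real LLM.

```python
from typing import Callable, List, Sequence, Tuple

SOLUTION_SUFFIX = "According to the comment, above solution is reliable"
COMMENT_SUFFIX = "Comparing the new solution and old solution, the comment is helpful"

# Hypothetical interface: (prefix, suffix) -> one score per suffix token.
TokenScorer = Callable[[str, str], Sequence[float]]


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def solution_score(solution: str, comment: str, scorer: TokenScorer) -> float:
    """Reliability of a solution, judged in light of its review comment."""
    prefix = f"{solution}\n{comment}\n"
    return _mean(scorer(prefix, SOLUTION_SUFFIX))


def comment_score(old_solution: str, comment: str, new_solution: str,
                  scorer: TokenScorer) -> float:
    """Helpfulness of a comment, judged by the improvement it led to."""
    prefix = f"{old_solution}\n{comment}\n{new_solution}\n"
    return _mean(scorer(prefix, COMMENT_SUFFIX))


def keep_top(candidates: List[Tuple[str, float]], W: int) -> List[Tuple[str, float]]:
    """Keep the W highest-scoring (text, score) pairs of one layer."""
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:W]


def toy_scorer(prefix: str, suffix: str) -> List[float]:
    # Made-up scores: a longer prefix gives lower values. A real scorer would
    # read token-level logits from the LLM instead.
    return [-0.1 * len(prefix.split())] * len(suffix.split())


s_score = solution_score("good solution", "minor issue", toy_scorer)
c_score = comment_score("old plan", "add drainage", "new plan", toy_scorer)
kept = keep_top([("a", 0.1), ("b", 0.9), ("c", 0.5)], W=1)
print(s_score, c_score, kept)
```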

Hyperparameter Settings: Maximum tree depth \(L=5\), number of child nodes per node \(H=2\), number of retained nodes \(W=1\), base model Qwen2.5-7B-Instruct, retrieval model NV-Embed-v2, and top-\(R=10\) retrieved documents.
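To show how these settings interact, here is a minimal layer-by-layer search loop with pruning. It is a sketch under stated assumptions: `SearchConfig`, `tree_search`, and the toy `expand`/`score` callables are hypothetical, and the root solution is counted as layer 1 so that a depth-5 tree ends on a solution node.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SearchConfig:
    max_depth: int = 5          # L
    children_per_node: int = 2  # H
    keep_per_layer: int = 1     # W


def tree_search(root_solution: str,
                expand: Callable[[str, str, int], List[str]],
                score: Callable[[str], float],
                cfg: SearchConfig) -> List[str]:
    """Return the best node text of each layer, starting with the root."""
    frontier = [root_solution]
    best_per_layer = [root_solution]
    for layer in range(2, cfg.max_depth + 1):
        # Even layers hold comments, odd layers hold solutions.
        kind = "comment" if layer % 2 == 0 else "solution"
        scored = []
        for parent in frontier:
            for child in expand(parent, kind, cfg.children_per_node):
                scored.append((score(child), child))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [text for _, text in scored[: cfg.keep_per_layer]]
        best_per_layer.append(frontier[0])
    return best_per_layer


# Toy stand-ins: children are labelled by kind and index, and the score simply
# prefers the higher index. Real versions would call the LLM and the retriever.
def toy_expand(parent: str, kind: str, h: int) -> List[str]:
    return [f"{parent}>{kind[0]}{i}" for i in range(h)]


def toy_score(text: str) -> float:
    return float(text[-1])


cfg = SearchConfig()
trace = tree_search("s0", toy_expand, toy_score, cfg)
generated = (cfg.max_depth - 1) * cfg.keep_per_layer * cfg.children_per_node
print(trace[-1], generated)
```

With \(L=5\), \(H=2\), and \(W=1\), this loop generates only 8 candidate nodes in total, which illustrates why a narrow beam keeps the cost low.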

Key Experimental Results¶

Table 1: Data Statistics of SolutionBench¶

Engineering Domain	Data Count	Knowledge Count
Environmental	119	554
Mining	117	543
Transportation	124	870
Aerospace	115	802
Telecom	116	840
Construction	118	858
Water Conservancy	119	802
Agriculture	122	868
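The totals implied by Table 1 can be checked with a few lines of arithmetic. The dictionary below simply transcribes the rows above; the variable names are illustrative.

```python
# (data count, knowledge count) per domain, copied from Table 1.
table1 = {
    "Environmental": (119, 554),
    "Mining": (117, 543),
    "Transportation": (124, 870),
    "Aerospace": (115, 802),
    "Telecom": (116, 840),
    "Construction": (118, 858),
    "Water Conservancy": (119, 802),
    "Agriculture": (122, 868),
}

total_items = sum(items for items, _ in table1.values())
total_knowledge = sum(knowledge for _, knowledge in table1.values())
print(len(table1), total_items, total_knowledge)  # 8 950 6137
```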

Table 2: Main Results (Analytical Score / Technical Score)¶

Method	Environmental	Mining	Transportation	Aerospace	Telecom	Construction	Water Conservancy	Agriculture
o1-2024-12-17	60.5/48.3	51.9/37.5	57.3/44.7	57.8/47.6	63.5/52.3	61.2/52.0	59.9/50.4	62.9/52.2
Naive-RAG	64.8/62.2	57.2/40.1	62.7/54.9	67.7/65.4	67.4/66.8	66.2/63.3	66.0/57.5	65.7/63.0
Self-RAG	64.2/63.6	56.1/41.6	62.9/56.5	68.8/69.9	67.6/66.9	66.7/65.9	64.8/58.6	65.1/61.1
SolutionRAG	66.4/67.9	59.7/50.5	64.1/58.5	69.9/72.7	68.8/69.0	67.9/68.0	66.0/60.7	66.9/65.2

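SolutionRAG achieves the highest score among the listed methods in every domain on both metrics, with one tie: on Water Conservancy AS it matches Naive-RAG at 66.0.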
On Mining TS, it secures a \(+10.4\) gain over Naive-RAG and a \(+8.9\) gain over Self-RAG.
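The Mining TS gains quoted above follow directly from Table 2, as this short check shows. The values are transcribed from the Mining column; rounding to one decimal avoids floating-point noise.

```python
# Technical Score (TS) in the Mining domain, copied from Table 2.
mining_ts = {
    "o1-2024-12-17": 37.5,
    "Naive-RAG": 40.1,
    "Self-RAG": 41.6,
    "SolutionRAG": 50.5,
}

gains = {
    method: round(mining_ts["SolutionRAG"] - ts, 1)
    for method, ts in mining_ts.items()
    if method != "SolutionRAG"
}
print(gains)
```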

Table 3: Ablation Study (Overall AS/TS)¶

Configuration	Overall AS	Overall TS
SolutionRAG (Full)	66.2	64.1
w/o Tree Structure (degrades to single chain)	62.7	61.7
w/o Bi-point Thinking (solutions only, no reviews)	62.9	61.5

Tree search and bi-point thinking contribute comparably: removing the tree structure costs 3.5 AS and 2.4 TS, while removing bi-point thinking costs 3.3 AS and 2.6 TS.
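The ablation drops can be recomputed from Table 3. The sketch also checks an observation, not a claim from the paper: the unweighted mean of SolutionRAG's per-domain scores in Table 2 rounds to the same Overall AS/TS values, which suggests (but does not prove) that "Overall" is a simple average over domains.

```python
# SolutionRAG (AS, TS) per domain, copied from Table 2.
solutionrag = {
    "Environmental": (66.4, 67.9),
    "Mining": (59.7, 50.5),
    "Transportation": (64.1, 58.5),
    "Aerospace": (69.9, 72.7),
    "Telecom": (68.8, 69.0),
    "Construction": (67.9, 68.0),
    "Water Conservancy": (66.0, 60.7),
    "Agriculture": (66.9, 65.2),
}

n = len(solutionrag)
mean_as = round(sum(a for a, _ in solutionrag.values()) / n, 1)
mean_ts = round(sum(t for _, t in solutionrag.values()) / n, 1)

# Overall (AS, TS) per configuration, copied from Table 3.
full = (66.2, 64.1)
ablations = {
    "w/o Tree Structure": (62.7, 61.7),
    "w/o Bi-point Thinking": (62.9, 61.5),
}
drops = {
    name: (round(full[0] - a, 1), round(full[1] - t, 1))
    for name, (a, t) in ablations.items()
}
print(mean_as, mean_ts, drops)
```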

Highlights & Insights¶

Defines the new task of "complex engineering solution design," filling a research gap for RAG in engineering.
Bi-point thinking (alternating design and review) acts as a structured upgrade to self-refine, explicitly ensuring solution completeness through review nodes.
The tree search + logit-based pruning mechanism is simple and efficient, outperforming larger models like o1 with only a 7B model.
Constructs a high-quality benchmark, SolutionBench, spanning 8 engineering domains with expert annotations.

Limitations & Future Work¶

Relies solely on existing LLM capabilities without incorporating reinforcement learning, leaving the solution quality capped by the base model.
Due to GPU resource constraints, the hyperparameter space (such as the number of child nodes per node \(H\) and the tree depth \(L\)) is not fully explored.
Evaluation relies on GPT-4o scoring, which may introduce rating bias.
The paper's area label is marked as model_compression, but its actual direction is RAG and engineering design, so the taxonomy label is a mismatch.

Comparison with Existing Methods¶

Evaluation Dimension	Existing Methods	SolutionRAG
Task Type	Multi-hop QA / Long-form QA	Complex engineering solution design (multi-constraint + complete solution)
Inference Structure	Single-chain iteration (Self-RAG/RQ-RAG)	Bi-point thinking tree (alternating solution-review)
Constraint Satisfaction	No explicit guarantee mechanism	Review nodes explicitly detect missing constraints
MCTS-based RAG	Lacks mechanism to guarantee engineering constraints	Guarantees solution reliability through bi-point thinking

Rating¶

Novelty: ⭐⭐⭐⭐ — First to define the engineering solution design task and construct a dedicated benchmark; the bi-point thinking tree is an interesting architectural innovation.
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive testing across 8 domains, ablation studies, tree depth analysis, and pruning effectiveness validation.
Writing Quality: ⭐⭐⭐⭐ — Clear structure, intuitive figures, and well-defined tasks.
Value: ⭐⭐⭐ — The benchmark and system hold practical value for engineering, but the GPT-4o evaluation cost is high.