TripTailor: A Real-World Benchmark for Personalized Travel Planning¶

Conference: ACL 2025
arXiv: 2508.01432
Code: Yes (https://github.com/swxkfm/TripTailor)
Area: LLM Evaluation
Keywords: Travel Planning, Personalized Travel Planning, LLM Agent, Benchmark, Real-world Evaluation

TL;DR¶

Proposes TripTailor, a large-scale travel planning benchmark built on real-world data: over 500k POIs across 40 cities and nearly 4000 real itineraries. Using a three-dimensional evaluation framework that assesses feasibility, rationality, and personalization, the authors find that fewer than 10% of plans generated by state-of-the-art LLMs reach human-level quality.

Background & Motivation¶

LLMs have demonstrated significant potential in travel planning, but existing benchmarks suffer from evident limitations:

Unrealistic Data: TravelPlanner primarily relies on simulated data for evaluation, which fails to reflect real-world constraints.

Limited Scale: ChinaTravel only covers 10 cities with approximately 1200 POIs per city, which is insufficient to capture the complexity of real-world travel needs.

Single Evaluation Dimension: Existing frameworks overly focus on hard constraints (e.g., budget, time) while failing to assess the overall quality of the itinerary.

Lack of Human Baseline: There is no systematic comparison with human-crafted travel plans.

Core Problem: Is an itinerary good simply because it satisfies all hard constraints? No: such an itinerary may still suffer from unreasonable detours, poor time allocation, and a lack of personalization.

Method¶

Overall Architecture¶

TripTailor comprises three components:
1. Sandbox Environment: A comprehensive information database for 40 popular Chinese tourist cities.
2. Benchmark Dataset: Nearly 4000 real-world travel itineraries paired with user queries.
3. Three-Dimensional Evaluation Framework: Feasibility + Rationality + Personalization.

Key Designs¶

Large-Scale Sandbox Environment:
- 40 cities, averaging 12,500 POIs per city
- 28,832 train schedules + 15,110 flight routes
- 5,622 selected attractions (including ratings, ticket prices, geographical coordinates, and recommended visit durations)
- 89,224 hotels + 422,120 restaurants
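- Design Motivation: Exceeds existing datasets by more than an order of magnitude (about 517k attractions, hotels, and restaurants here versus roughly 12k POIs for ChinaTravel at 10 cities × ~1,200), establishing a more realistic foundation for evaluation.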

Four-Step Process for Real-World Itinerary Construction:
- Step I: Collect POI information for 40 cities from the public internet, and complete coordinates using Amap.
- Step II: Collect self-guided itineraries from online travel agencies (OTAs), select highly-rated detailed itineraries, and randomly assign departure cities and dates.
- Step III: Use an LLM to extract information from the itineraries to generate first-person user queries (note that the LLM only performs rewriting and does not fill in missing information).
- Step IV: Perform quality control, checking for tail anomalies, transit time conflicts, and scheduling rationality.
- Queries are split into two difficulty levels: Easy (2-3 days) and Hard (4-7 days).

Three-Dimensional Evaluation Framework:
- Feasibility: Verifies whether the information exists within the sandbox (to prevent hallucination) and checks if key information is complete.
- Rationality: Checks that restaurants and attractions are not repeated, meal costs are reasonable, visit durations are appropriate, the budget is respected, and routes are efficient (measured by the average distance between POIs).
- Personalization: Evaluates whether the itinerary reflects user interests, culinary preferences, activity types, and travel intensity.
- Combined Evaluation Methods: Objective metrics + LLM evaluation + dedicated reward models.
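
The objective part of the feasibility and rationality checks can be illustrated with a minimal sketch. The `DayPlan` structure, the function names, and the completeness rule (at least one attraction per day, a hotel on every day except the last) are assumptions made for illustration; they are not taken from the TripTailor code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class DayPlan:
    # Hypothetical plan structure: POIs are referenced by name.
    attractions: List[str] = field(default_factory=list)
    restaurants: List[str] = field(default_factory=list)
    hotel: Optional[str] = None


def feasibility_issues(plan: List[DayPlan], sandbox: Set[str]) -> List[str]:
    """List feasibility problems: POIs missing from the sandbox or incomplete days."""
    issues = []
    for i, day in enumerate(plan, start=1):
        names = day.attractions + day.restaurants + ([day.hotel] if day.hotel else [])
        for name in names:
            if name not in sandbox:  # a name outside the sandbox is treated as hallucinated
                issues.append(f"day {i}: unknown POI {name!r}")
        if not day.attractions:
            issues.append(f"day {i}: no attractions")
        if i < len(plan) and day.hotel is None:  # assumed rule: no hotel needed on the last day
            issues.append(f"day {i}: missing hotel")
    return issues


def has_duplicates(plan: List[DayPlan]) -> bool:
    """Rationality check: True if any attraction or restaurant appears more than once."""
    visited = [n for day in plan for n in day.attractions + day.restaurants]
    return len(visited) != len(set(visited))


sandbox = {"West Lake", "Lingyin Temple", "Noodle House", "Lake Hotel"}
plan = [
    DayPlan(["West Lake"], ["Noodle House"], "Lake Hotel"),
    DayPlan(["Lingyin Temple", "Sky Tower"], ["Noodle House"]),
]
print(feasibility_issues(plan, sandbox))
print(has_duplicates(plan))
```

In this toy plan, "Sky Tower" is not in the sandbox and "Noodle House" is visited twice, so the plan fails one feasibility check and one rationality check.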

Route Optimization Metric \(D_{avg}\):
- Calculates the average geographical distance between consecutive POIs per day to measure transit efficiency.
- Ideal Strategy: Group geographically adjacent POIs on the same day.
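
A minimal sketch of how \(D_{avg}\) could be computed from POI coordinates. It assumes great-circle (haversine) distance and pools all within-day legs into a single average; the benchmark may use a different distance measure or aggregation. The `route_distance_ratio` helper reflects an assumed reading of the "Route Distance Ratio" column, namely a plan's \(D_{avg}\) divided by that of a reference itinerary.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def d_avg(days: List[List[Coord]]) -> float:
    """Average distance between consecutive POIs; legs never cross a day boundary."""
    legs = [haversine_km(p, q) for day in days for p, q in zip(day, day[1:])]
    return sum(legs) / len(legs) if legs else 0.0


def route_distance_ratio(plan_days: List[List[Coord]], reference_days: List[List[Coord]]) -> float:
    """Assumed definition: plan D_avg relative to a reference itinerary's D_avg."""
    ref = d_avg(reference_days)
    if ref == 0.0:
        raise ValueError("reference itinerary has no legs to compare against")
    return d_avg(plan_days) / ref


# One degree of longitude along the equator is about 111.19 km.
print(round(d_avg([[(0.0, 0.0), (0.0, 1.0)]]), 2))
```

Grouping nearby POIs on the same day shortens every leg in the sum, which is why the ideal strategy above lowers \(D_{avg}\).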

Loss & Training¶

This work is a benchmarking study, primarily evaluating existing LLMs and agent frameworks.
The proposed Workflow method simulates human travel planning by decomposing the task into sequential steps (e.g., selecting attractions \(\to\) routing \(\to\) selecting restaurants \(\to\) selecting hotels).
The Workflow method uses GPT-4o-mini as its backbone LLM.
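
The sequential decomposition can be sketched as a small pipeline. In the paper each step is driven by the backbone LLM; here every step is replaced by a deterministic stand-in (top-rated selection, greedy nearest-neighbor routing, nearest restaurant and hotel) purely to show the control flow. All function names, the planar `xy` coordinates, and the selection rules are hypothetical.

```python
import math
from typing import Dict, List


def dist(a, b) -> float:
    # Planar distance on toy coordinates; a real system would use geographic distance.
    return math.hypot(a[0] - b[0], a[1] - b[1])


def select_attractions(attractions: List[Dict], k: int) -> List[Dict]:
    """Step 1 stand-in: keep the k highest-rated attractions."""
    return sorted(attractions, key=lambda p: p["rating"], reverse=True)[:k]


def route(pois: List[Dict]) -> List[Dict]:
    """Step 2 stand-in: greedy nearest-neighbor ordering starting from the first POI."""
    remaining = list(pois)
    ordered = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda p: dist(p["xy"], ordered[-1]["xy"]))
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered


def nearest(candidates: List[Dict], anchor: Dict) -> Dict:
    """Steps 3 and 4 stand-in: pick the candidate closest to an anchor POI."""
    return min(candidates, key=lambda p: dist(p["xy"], anchor["xy"]))


def plan_day(attractions, restaurants, hotels, k: int = 3) -> Dict:
    stops = route(select_attractions(attractions, k))
    return {
        "attractions": [p["name"] for p in stops],
        "restaurant": nearest(restaurants, stops[len(stops) // 2])["name"],  # near the mid-day stop
        "hotel": nearest(hotels, stops[-1])["name"],  # near the last stop
    }


attractions = [
    {"name": "A", "rating": 4.9, "xy": (0, 0)},
    {"name": "B", "rating": 4.8, "xy": (10, 0)},
    {"name": "C", "rating": 4.7, "xy": (1, 0)},
    {"name": "D", "rating": 3.0, "xy": (5, 5)},
]
restaurants = [{"name": "R1", "xy": (1, 1)}, {"name": "R2", "xy": (9, 1)}]
hotels = [{"name": "H1", "xy": (0, 1)}, {"name": "H2", "xy": (10, 1)}]
print(plan_day(attractions, restaurants, hotels))
```

Because routing happens before restaurants and hotels are chosen, later steps can condition on the route, which is one plausible reason a decomposed workflow yields shorter routes than single-pass generation.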

Key Experimental Results¶

Main Results (Performance of Methods on Easy/Hard Tasks)¶

Method	Route Distance Ratio	Feasibility (Micro/Macro)	Rationality (Macro)	Personalization (LLM/RM)	Final Outperformance Rate
Workflow (GPT-4o-mini) Easy	1.8	98.9/98.6	74.3	14.1/11.6	18.1%
Direct (GPT-4o) Easy	3.4	97.7/95.5	28.8	17.8/18.4	10.2%
Direct (o1-mini) Easy	3.6	91.0/83.9	33.3	29.1/9.6	16.1%
Workflow (GPT-4o-mini) Hard	1.7	97.7/96.0	53.0	17.5/12.0	14.3%
Direct (GPT-4o) Hard	3.2	98.9/97.7	16.3	8.0/26.4	4.9%
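
The Micro/Macro split in the feasibility column is not defined in this summary. A common convention, used for example by TravelPlanner, is that the micro rate is the share of individual checks passed across all plans, while the macro rate is the share of plans that pass every check. The sketch below assumes that convention; the function name and input layout are illustrative.

```python
from typing import List, Tuple


def micro_macro(results: List[List[bool]]) -> Tuple[float, float]:
    """results holds one list per plan, with one boolean per check.

    micro = passed checks / all checks; macro = plans passing all checks / all plans.
    """
    total = sum(len(r) for r in results)
    passed = sum(sum(r) for r in results)
    micro = passed / total if total else 0.0
    macro = sum(all(r) for r in results) / len(results) if results else 0.0
    return micro, macro


# Two plans with three checks each; the second plan fails one check.
micro, macro = micro_macro([[True, True, True], [True, False, True]])
print(micro, macro)
```

When every plan has the same number of checks, the micro rate is never below the macro rate, which matches the pattern in the table (for example 98.9 versus 98.6).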

Comparison of Direct Generation across LLMs¶

Model	Feasibility Pass	Rationality Pass	Final Outperformance Rate
Qwen2.5-7b	68.8%	~5%	0.7%
Qwen2.5-32b	89.9%	~12%	4.7%
GPT-4o-mini	89.3%	~9%	0.9%
DeepSeek-V3	96.5%	~15%	7.8%
GPT-4o	98.3%	~22%	7.5%

Key Findings¶

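Fewer than 10% Achieve Human Level: In the direct-generation comparison, GPT-4o reaches a final outperformance rate of only 7.5% and the best model, DeepSeek-V3, reaches 7.8%, falling far short of human-made itineraries. Even the Workflow method peaks at 18.1% on Easy tasks.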
Feasibility \(\neq\) Rationality: Satisfying hard constraints only guarantees feasibility; the gap in rationality (route optimization, scheduling, etc.) is far wider.
Workflow Method Achieves the Best Routing Efficiency: Its route distance ratio is only 1.7-1.8, roughly half the 3.2-3.6 reported for direct generation in the main results table.
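Hard Tasks Are More Challenging: Moving from 2-3 day to 4-7 day itineraries lowers rationality (Workflow 74.3 to 53.0; direct GPT-4o 28.8 to 16.3) and the final outperformance rate (18.1% to 14.3%; 10.2% to 4.9%), while feasibility stays above 95% in both settings.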
Personalization is the Greatest Weakness: The outperformance rate in personalization evaluated by the LLM is generally below 30%, indicating that LLMs struggle to understand user preferences deeply.
Limited Effectiveness of CoT, ReAct, and Reflexion: Advanced reasoning strategies do not significantly improve the quality of travel planning.

Highlights & Insights¶

Data Scale is the Core Advantage: Over 500k POIs and nearly 4000 real itineraries exceed existing benchmarks by an order of magnitude.
Three-Dimensional Evaluation Framework: More comprehensive than simplistic constraint satisfaction rates. Feasibility is merely the "passing line," whereas rationality and personalization are the criteria for "excellence."
The route distance ratio metric is elegant and effective, intuitively reflecting the spatial planning quality of the itineraries.
Introduction of a Human Baseline: Provides a clear reference point for evaluation, rather than merely comparing relative rankings of different LLMs.
The performance degradation of LLMs in long-range planning (Hard tasks) reveals the limitations of modern LLMs in solving complex constraint optimization problems.

Limitations & Future Work¶

Only covers 40 cities in China, leaving international travel scenarios unexplored.
Personalization evaluation heavily relies on LLM scoring and reward models, lacking direct satisfaction feedback from real users.
Sandbox environment information may become outdated (e.g., dynamic attraction ratings and restaurant details).
Optimization of budget allocation is not considered (e.g., maximizing experience within a limited budget).
Multi-person travel scenarios, which require balancing diverse preferences, are not covered.

Related Work & Insights¶

Compared to TravelPlanner (Xie et al., 2024), TripTailor utilizes real-world data and offers more comprehensive evaluations.
Compared to ChinaTravel (Shao et al., 2024), the data scale is an order of magnitude larger.
The external verifier strategy of LLM-Modulo (Gundawar et al., 2024) can complement the Workflow decomposition method proposed in this paper.
Insight: Travel planning is inherently a multi-objective optimization problem. Pure LLM reasoning is insufficient; it must be integrated with search algorithms and domain-specific knowledge.
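
One way to read the combination of an external verifier with a planner is a propose-and-critique loop: the planner proposes, verifiers return concrete complaints, and the complaints are fed back into the next proposal. The sketch below is a generic illustration of that idea, not the LLM-Modulo or TripTailor implementation; `propose`, the critic signature, and the stub planner are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Plan = Dict[str, float]
Critic = Callable[[Plan], List[str]]


def plan_with_verifiers(
    propose: Callable[[List[str]], Plan],
    critics: List[Critic],
    max_rounds: int = 3,
) -> Tuple[Optional[Plan], bool]:
    """Ask for a plan, collect critic feedback, and retry until no critic complains."""
    feedback: List[str] = []
    plan: Optional[Plan] = None
    for _ in range(max_rounds):
        plan = propose(feedback)  # in practice an LLM call that sees the feedback
        feedback = [msg for critic in critics for msg in critic(plan)]
        if not feedback:
            return plan, True
    return plan, False  # best effort: the last plan, flagged as unverified


def budget_critic(plan: Plan) -> List[str]:
    return ["over budget"] if plan["cost"] > 1000 else []


def stub_planner(feedback: List[str]) -> Plan:
    # Stand-in for an LLM: lowers the cost once it is told the plan is over budget.
    return {"cost": 900.0} if "over budget" in feedback else {"cost": 1200.0}


print(plan_with_verifiers(stub_planner, [budget_critic]))
```

Hard constraints such as budget fit naturally into critics, while softer objectives such as route length or personalization would need scoring rather than pass/fail checks.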

Rating¶

Novelty: ⭐⭐⭐⭐ — Real-world large-scale benchmark + innovative three-dimensional evaluation framework.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multiple LLMs, various agent strategies, Easy/Hard difficulties, and combination of three evaluation modes (objective + LLM + RM).
Writing Quality: ⭐⭐⭐⭐ — Well-structured with detailed descriptions of the dataset construction process.
Value: ⭐⭐⭐⭐ — Establishes a reliable real-world baseline for evaluating LLM travel planning.