TripTailor: A Real-World Benchmark for Personalized Travel Planning¶
Conference: ACL 2025
arXiv: 2508.01432
Code: Yes (https://github.com/swxkfm/TripTailor)
Area: LLM Evaluation
Keywords: Travel Planning, Personalized Travel Planning, LLM Agent, Benchmark, Real-world Evaluation
TL;DR¶
Proposes TripTailor, a large-scale travel planning benchmark built on real-world data, containing over 500k POIs across 40 cities and nearly 4000 real itineraries. Using a three-dimensional evaluation framework that assesses feasibility, rationality, and personalization, the authors find that fewer than 10% of plans generated by state-of-the-art LLMs reach human-level quality.
Background & Motivation¶
LLMs have demonstrated significant potential in travel planning, but existing benchmarks suffer from evident limitations:
- Unrealistic Data: TravelPlanner primarily relies on simulated data for evaluation, which fails to reflect real-world constraints.
- Limited Scale: ChinaTravel only covers 10 cities with approximately 1200 POIs per city, which is insufficient to capture the complexity of real-world travel needs.
- Single Evaluation Dimension: Existing frameworks overly focus on hard constraints (e.g., budget, time) while failing to assess the overall quality of the itinerary.
- Lack of Human Baseline: There is no systematic comparison with human-crafted travel plans.
Core Problem: Is an itinerary good simply because it satisfies all hard constraints? The answer is no: such an itinerary may still suffer from unreasonable detours, improper time allocation, and a lack of personalization.
Method¶
Overall Architecture¶
TripTailor comprises three components:
1. Sandbox Environment: A comprehensive information database for 40 popular Chinese tourist cities.
2. Benchmark Dataset: Nearly 4000 real-world travel itineraries paired with user queries.
3. Three-Dimensional Evaluation Framework: Feasibility + Rationality + Personalization.
Key Designs¶
- Large-Scale Sandbox Environment:
    - 40 cities, averaging roughly 12,500 POIs per city
    - 28,832 train schedules + 15,110 flight routes
    - 5,622 selected attractions (including ratings, ticket prices, geographical coordinates, and recommended visit durations)
    - 89,224 hotels + 422,120 restaurants
    - Design Motivation: Exceeds existing benchmarks such as ChinaTravel (about 1,200 POIs per city) by roughly an order of magnitude, establishing a more realistic foundation for evaluation.
- Four-Step Process for Real-World Itinerary Construction:
    - Step I: Collect POI information for 40 cities from the public internet, and complete coordinates using Amap.
    - Step II: Collect self-guided itineraries from online travel agencies (OTAs), select highly-rated detailed itineraries, and randomly assign departure cities and dates.
    - Step III: Use an LLM to extract information from the itineraries to generate first-person user queries (note that the LLM only performs rewriting and does not fill in missing information).
    - Step IV: Perform quality control, checking for tail anomalies, transit time conflicts, and scheduling rationality.
    - Queries are split into two difficulty levels: Easy (2-3 days) and Hard (4-7 days).
- Three-Dimensional Evaluation Framework:
    - Feasibility: Verifies that every referenced item exists within the sandbox (to catch hallucinations) and that key information is complete.
    - Rationality: Checks that restaurants and attractions are not repeated, meal costs are reasonable, visit durations are appropriate, the budget is respected, and the route is efficient (measured by the average distance between POIs).
    - Personalization: Evaluates whether the itinerary reflects user interests, culinary preferences, activity types, and travel intensity.
    - Combined Evaluation Methods: Objective metrics + LLM evaluation + dedicated reward models.
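The objective part of the feasibility and rationality checks can be illustrated with a minimal sketch. The plan schema (`days` as lists of POI dicts with `name` and `cost`), the function names, and the sample POIs are all assumptions made for illustration; the benchmark's actual checks cover more criteria and are not reproduced here.

```python
def check_feasibility(plan, sandbox_names):
    """Hallucination and completeness checks (hypothetical plan schema)."""
    pois = [p for day in plan["days"] for p in day]
    return {
        # Every referenced POI must exist in the sandbox database.
        "in_sandbox": all(p["name"] in sandbox_names for p in pois),
        # Completeness, simplified here to "no day is left empty".
        "complete": bool(plan["days"]) and all(plan["days"]),
    }


def check_rationality(plan, budget):
    """Two of the objective rationality checks: no repeats, within budget."""
    pois = [p for day in plan["days"] for p in day]
    names = [p["name"] for p in pois]
    return {
        "no_duplicates": len(names) == len(set(names)),
        "within_budget": sum(p["cost"] for p in pois) <= budget,
    }


# Made-up sample data, not taken from the benchmark.
sandbox_names = {"West Lake", "Lingyin Temple", "Grandma's Home"}
plan = {
    "days": [
        [{"name": "West Lake", "cost": 0}, {"name": "Grandma's Home", "cost": 120}],
        [{"name": "Lingyin Temple", "cost": 75}, {"name": "Sky Garden", "cost": 50}],
    ]
}
feasibility = check_feasibility(plan, sandbox_names)
rationality = check_rationality(plan, budget=200)
```

In this sample, "Sky Garden" is absent from the sandbox, so the plan fails the hallucination check, and the total cost of 245 exceeds the budget of 200. Personalization is not covered because it depends on LLM and reward-model judgments rather than rule-based checks.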
- Route Optimization Metric \(D_{avg}\):
    - Calculates the average geographical distance between consecutive POIs per day to measure transit efficiency.
    - Ideal Strategy: Group geographically adjacent POIs on the same day.
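As a rough illustration of \(D_{avg}\), the sketch below averages great-circle distances between consecutive POIs within each day. Several details are assumptions: the haversine formula as the distance measure, pooling hops across days rather than averaging per-day means, and the `distance_ratio` helper, which divides a plan's \(D_{avg}\) by that of a reference itinerary.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def d_avg(days):
    """Mean distance between consecutive POIs, pooled over all days.

    `days` is a list of days; each day is an ordered list of (lat, lon) POIs.
    Hops across a day boundary are not counted.
    """
    hops = [haversine_km(p, q) for day in days for p, q in zip(day, day[1:])]
    return sum(hops) / len(hops) if hops else 0.0


def distance_ratio(plan_days, reference_days):
    """Hypothetical ratio of a plan's D_avg to a reference itinerary's D_avg."""
    ref = d_avg(reference_days)
    return d_avg(plan_days) / ref if ref > 0 else float("inf")
```

Under this reading, a ratio near 1 means the plan moves between POIs about as efficiently as the reference itinerary, and larger values indicate more detours.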
Loss & Training¶
- This work is a benchmarking study, primarily evaluating existing LLMs and agent frameworks.
- The proposed Workflow method simulates human travel planning by decomposing the task into sequential steps (e.g., selecting attractions \(\to\) routing \(\to\) selecting restaurants \(\to\) selecting hotels).
- GPT-4o-mini is utilized as the backbone LLM.
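The sequential decomposition can be sketched as a small pipeline. This is only a structural illustration under stated assumptions: each step that the Workflow method delegates to the backbone LLM is replaced here by a simple heuristic (rating-based selection, nearest-neighbour ordering), and the record fields (`name`, `rating`, `tags`, `xy`) and function names are hypothetical.

```python
from math import dist


def select_attractions(attractions, interests, k):
    """Keep the k highest-rated attractions whose tags overlap the interests."""
    matching = [a for a in attractions if interests & a["tags"]]
    return sorted(matching, key=lambda a: a["rating"], reverse=True)[:k]


def route(stops):
    """Order stops with a nearest-neighbour pass starting from the first stop."""
    if not stops:
        return []
    ordered, remaining = [stops[0]], list(stops[1:])
    while remaining:
        nxt = min(remaining, key=lambda s: dist(ordered[-1]["xy"], s["xy"]))
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered


def nearest(candidates, anchor):
    """Pick the candidate closest to the anchor stop (planar toy coordinates)."""
    return min(candidates, key=lambda c: dist(c["xy"], anchor["xy"]))


def plan_day(attractions, restaurants, hotels, interests, k=3):
    """Attractions -> routing -> restaurant -> hotel, one step feeding the next."""
    stops = route(select_attractions(attractions, interests, k))
    if not stops:
        raise ValueError("no attraction matches the given interests")
    return {
        "attractions": [s["name"] for s in stops],
        "restaurant": nearest(restaurants, stops[len(stops) // 2])["name"],
        "hotel": nearest(hotels, stops[-1])["name"],
    }
```

Fixing the route before choosing restaurants and hotels means later choices are made near stops that are already spatially grouped, which is one plausible reason a decomposed pipeline yields shorter routes than single-pass generation.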
Key Experimental Results¶
Main Results (Performance of Methods on Easy/Hard Tasks)¶
| Method | Difficulty | Route Distance Ratio | Feasibility (Micro/Macro) | Rationality (Macro) | Personalization (LLM/RM) | Final Outperformance Rate |
|---|---|---|---|---|---|---|
| Workflow (GPT-4o-mini) | Easy | 1.8 | 98.9/98.6 | 74.3 | 14.1/11.6 | 18.1% |
| Direct (GPT-4o) | Easy | 3.4 | 97.7/95.5 | 28.8 | 17.8/18.4 | 10.2% |
| Direct (o1-mini) | Easy | 3.6 | 91.0/83.9 | 33.3 | 29.1/9.6 | 16.1% |
| Workflow (GPT-4o-mini) | Hard | 1.7 | 97.7/96.0 | 53.0 | 17.5/12.0 | 14.3% |
| Direct (GPT-4o) | Hard | 3.2 | 98.9/97.7 | 16.3 | 8.0/26.4 | 4.9% |
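The Micro/Macro split in the table is not defined in this summary. A common convention, used for instance by TravelPlanner, is that the micro rate is the share of individual checks passed across all plans, while the macro rate is the share of plans that pass every check. The sketch below assumes that convention; the function names and input layout are illustrative.

```python
def micro_pass_rate(results):
    """Share of individual checks passed, pooled over all plans."""
    checks = [ok for plan in results for ok in plan]
    return sum(checks) / len(checks)


def macro_pass_rate(results):
    """Share of plans for which every check passes."""
    return sum(all(plan) for plan in results) / len(results)


# Each inner list holds the boolean outcomes of one plan's checks.
results = [[True, True, True], [True, False, True]]
micro = micro_pass_rate(results)  # 5 of 6 checks pass
macro = macro_pass_rate(results)  # 1 of 2 plans passes everything
```

Under this convention the macro rate can never exceed the micro rate, which matches the ordering of each Micro/Macro pair in the table.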
Comparison of Direct Generation across LLMs¶
| Model | Feasibility Pass | Rationality Pass | Final Outperformance Rate |
|---|---|---|---|
| Qwen2.5-7b | 68.8% | ~5% | 0.7% |
| Qwen2.5-32b | 89.9% | ~12% | 4.7% |
| GPT-4o-mini | 89.3% | ~9% | 0.9% |
| DeepSeek-V3 | 96.5% | ~15% | 7.8% |
| GPT-4o | 98.3% | ~22% | 7.5% |
Key Findings¶
- Fewer Than 10% Reach Human Level: In the cross-model direct-generation comparison, the best final outperformance rates are 7.8% (DeepSeek-V3) and 7.5% (GPT-4o), both below 10% and far short of human-made itineraries.
- Feasibility \(\neq\) Rationality: Satisfying hard constraints only guarantees feasibility; the gap in rationality (route optimization, scheduling, etc.) is far wider. GPT-4o, for example, passes feasibility 98.3% of the time but rationality only about 22% of the time.
- Workflow Method Achieves the Best Routing Efficiency: Its route distance ratio is only 1.7-1.8, compared with 3.2-3.6 for the direct generation methods.
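- Hard Tasks Are More Challenging: For 4-7 day itineraries, rationality and the final outperformance rate drop markedly compared with 2-3 day itineraries (Workflow rationality 74.3 to 53.0; Direct GPT-4o final rate 10.2% to 4.9%), while feasibility stays above 95% at both difficulty levels and personalization scores show no consistent trend.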
- Personalization is the Greatest Weakness: The LLM-judged personalization outperformance rate stays below 30% for every configuration in the main results (at most 29.1%), indicating that LLMs struggle to understand user preferences deeply.
- Limited Effectiveness of CoT, ReAct, and Reflexion: Advanced reasoning strategies do not significantly improve the quality of travel planning.
Highlights & Insights¶
- Data Scale is the Core Advantage: Over 500k POIs and nearly 4000 real itineraries exceed existing benchmarks by roughly an order of magnitude.
- Three-Dimensional Evaluation Framework: More comprehensive than simplistic constraint satisfaction rates. Feasibility is merely the "passing line," whereas rationality and personalization are the criteria for "excellence."
- The route distance ratio metric is elegant and effective, intuitively reflecting the spatial planning quality of the itineraries.
- Introduction of a Human Baseline: Provides a clear reference point for evaluation, rather than merely comparing relative rankings of different LLMs.
- The performance degradation of LLMs in long-horizon planning (Hard tasks) reveals the limitations of current LLMs in solving complex constraint optimization problems.
Limitations & Future Work¶
- Only covers 40 cities in China, leaving international travel scenarios unexplored.
- Personalization evaluation heavily relies on LLM scoring and reward models, lacking direct satisfaction feedback from real users.
- Sandbox environment information may become outdated (e.g., dynamic attraction ratings and restaurant details).
- Optimization of budget allocation is not considered (e.g., maximizing experience within a limited budget).
- Multi-person travel scenarios, which require balancing diverse preferences, are not covered.
Related Work & Insights¶
- Compared to TravelPlanner (Xie et al., 2024), TripTailor utilizes real-world data and offers more comprehensive evaluations.
- Compared to ChinaTravel (Shao et al., 2024), the data scale is an order of magnitude larger.
- The external verifier strategy of LLM-Modulo (Gundawar et al., 2024) can complement the Workflow decomposition method proposed in this paper.
- Insight: Travel planning is inherently a multi-objective optimization problem. Pure LLM reasoning is insufficient; it must be integrated with search algorithms and domain-specific knowledge.
Rating¶
- Novelty: ⭐⭐⭐⭐ (real-world large-scale benchmark plus an innovative three-dimensional evaluation framework)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (multiple LLMs, various agent strategies, Easy/Hard difficulty levels, and a combination of three evaluation modes: objective + LLM + RM)
- Writing Quality: ⭐⭐⭐⭐ (well-structured, with detailed descriptions of the dataset construction process)
- Value: ⭐⭐⭐⭐ (establishes a reliable real-world baseline for evaluating LLM travel planning)