TripTailor: A Real-World Benchmark for Personalized Travel Planning

Conference: ACL 2025
arXiv: 2508.01432
Code: Yes (https://github.com/swxkfm/TripTailor)
Area: LLM Evaluation
Keywords: Travel Planning, Personalized Travel Planning, LLM Agent, Benchmark, Real-world Evaluation

TL;DR

TripTailor is a large-scale travel planning benchmark built on real-world data: over 500k POIs across 40 cities and nearly 4,000 real itineraries. Using a three-dimensional evaluation framework that assesses feasibility, rationality, and personalization, the authors find that fewer than 10% of plans generated by state-of-the-art LLMs reach human-level quality.

Background & Motivation

LLMs have demonstrated significant potential in travel planning, but existing benchmarks suffer from evident limitations:

Unrealistic Data: TravelPlanner primarily relies on simulated data for evaluation, which fails to reflect real-world constraints.

Limited Scale: ChinaTravel only covers 10 cities with approximately 1200 POIs per city, which is insufficient to capture the complexity of real-world travel needs.

Single Evaluation Dimension: Existing frameworks overly focus on hard constraints (e.g., budget, time) while failing to assess the overall quality of the itinerary.

Lack of Human Baseline: There is no systematic comparison with human-crafted travel plans.

Core Problem: Is an itinerary good simply because it satisfies all hard constraints? No: it may still suffer from unreasonable detours, poor time allocation, and a lack of personalization.

Method

Overall Architecture

TripTailor comprises three components:

  1. Sandbox Environment: A comprehensive information database for 40 popular Chinese tourist cities.
  2. Benchmark Dataset: Nearly 4,000 real-world travel itineraries paired with user queries.
  3. Three-Dimensional Evaluation Framework: Feasibility + Rationality + Personalization.

Key Designs

  1. Large-Scale Sandbox Environment:

    • 40 cities, averaging 12,500 POIs per city
    • 28,832 train schedules + 15,110 flight routes
    • 5,622 selected attractions (including ratings, ticket prices, geographical coordinates, and recommended visit durations)
    • 89,224 hotels + 422,120 restaurants
    • Design Motivation: At roughly 12,500 POIs per city versus about 1,200 in ChinaTravel, the sandbox is about an order of magnitude larger than prior benchmarks, giving a more realistic foundation for evaluation.
  2. Four-Step Process for Real-World Itinerary Construction:

    • Step I: Collect POI information for 40 cities from the public internet, and complete coordinates using Amap.
    • Step II: Collect self-guided itineraries from online travel agencies (OTAs), select highly-rated detailed itineraries, and randomly assign departure cities and dates.
    • Step III: Use an LLM to extract information from the itineraries to generate first-person user queries (note that the LLM only performs rewriting and does not fill in missing information).
    • Step IV: Perform quality control, checking for tail anomalies, transit time conflicts, and scheduling rationality.
    • Queries are split into two difficulty levels: Easy (2-3 days) and Hard (4-7 days).
  3. Three-Dimensional Evaluation Framework:

    • Feasibility: Verifies whether the information exists within the sandbox (to prevent hallucination) and checks if key information is complete.
    • Rationality: Assesses non-duplication of restaurants and attractions, financial reasonableness of meals, appropriate visit durations, budget compliance, and route optimization (measured by the average distance between POIs).
    • Personalization: Evaluates whether the itinerary reflects user interests, culinary preferences, activity types, and travel intensity.
    • Combined Evaluation Methods: Objective metrics + LLM evaluation + dedicated reward models.
  4. Route Optimization Metric \(D_{avg}\):

    • Calculates the average geographical distance between consecutive POIs per day to measure transit efficiency.
    • Ideal Strategy: Group geographically adjacent POIs on the same day.
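A minimal sketch of how \(D_{avg}\) could be computed, assuming each day is an ordered list of (latitude, longitude) pairs and that "geographical distance" means great-circle (haversine) distance. The function names, the pooling of legs across days, and the `distance_ratio` helper (a plan's \(D_{avg}\) divided by that of a reference itinerary) are assumptions for illustration, not definitions taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def d_avg(days):
    """Average distance between consecutive POIs within each day.

    `days` is a list of days; each day is an ordered list of (lat, lon) tuples.
    Legs are pooled over all days (an assumption); legs that cross from one
    day to the next are not counted.
    """
    legs = [haversine_km(p, q) for day in days for p, q in zip(day, day[1:])]
    return sum(legs) / len(legs) if legs else 0.0


def distance_ratio(plan_days, reference_days):
    """Hypothetical ratio: plan D_avg relative to a reference itinerary's D_avg."""
    ref = d_avg(reference_days)
    if ref == 0:
        raise ValueError("reference itinerary has no measurable legs")
    return d_avg(plan_days) / ref
```

Under these assumptions, a ratio above 1 means the generated plan travels farther between stops than the reference, which is what grouping geographically adjacent POIs on the same day is meant to avoid.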

Loss & Training

  • This work is a benchmarking study, primarily evaluating existing LLMs and agent frameworks.
  • The proposed Workflow method simulates human travel planning by decomposing the task into sequential steps (e.g., selecting attractions \(\to\) routing \(\to\) selecting restaurants \(\to\) selecting hotels).
  • The Workflow method uses GPT-4o-mini as its backbone LLM.
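The sequential decomposition described above can be sketched as a small pipeline. This is not the authors' implementation: in the paper each step is driven by an LLM, whereas here simple heuristics (top-rated selection, greedy nearest-neighbour routing, nearest unused restaurant, hotel closest to the route centroid) stand in for those calls. All function names, dictionary keys, and the planar `xy` coordinates are hypothetical.

```python
import math


def dist(a, b):
    """Planar distance between two (x, y) points; a stand-in for real geography."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def select_attractions(attractions, k):
    """Step 1 (placeholder for an LLM call): keep the k highest-rated attractions."""
    return sorted(attractions, key=lambda p: p["rating"], reverse=True)[:k]


def order_route(pois):
    """Step 2: greedy nearest-neighbour ordering, starting from the first POI."""
    if not pois:
        return []
    route, rest = [pois[0]], list(pois[1:])
    while rest:
        nxt = min(rest, key=lambda p: dist(route[-1]["xy"], p["xy"]))
        rest.remove(nxt)
        route.append(nxt)
    return route


def select_restaurants(route, restaurants):
    """Step 3: for each stop, pick the nearest restaurant that is not used yet."""
    chosen = []
    for stop in route:
        free = [r for r in restaurants if r not in chosen]
        if free:
            chosen.append(min(free, key=lambda r: dist(stop["xy"], r["xy"])))
    return chosen


def select_hotel(route, hotels):
    """Step 4: pick the hotel closest to the centroid of the route."""
    if not route:
        raise ValueError("cannot place a hotel for an empty route")
    cx = sum(p["xy"][0] for p in route) / len(route)
    cy = sum(p["xy"][1] for p in route) / len(route)
    return min(hotels, key=lambda h: dist((cx, cy), h["xy"]))


def plan_day(attractions, restaurants, hotels, k=3):
    """Run the four steps in order; each step only sees the previous steps' output."""
    route = order_route(select_attractions(attractions, k))
    return {
        "route": route,
        "restaurants": select_restaurants(route, restaurants),
        "hotel": select_hotel(route, hotels),
    }
```

The point of the decomposition is that routing happens before restaurants and hotels are chosen, so later choices are anchored to an already compact route rather than generated in one pass.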

Key Experimental Results

Main Results (Performance of Methods on Easy/Hard Tasks)

| Method | Difficulty | Route Distance Ratio | Feasibility (Micro/Macro) | Rationality (Macro) | Personalization (LLM/RM) | Final Outperformance Rate |
| --- | --- | --- | --- | --- | --- | --- |
| Workflow (GPT-4o-mini) | Easy | 1.8 | 98.9/98.6 | 74.3 | 14.1/11.6 | 18.1% |
| Direct (GPT-4o) | Easy | 3.4 | 97.7/95.5 | 28.8 | 17.8/18.4 | 10.2% |
| Direct (o1-mini) | Easy | 3.6 | 91.0/83.9 | 33.3 | 29.1/9.6 | 16.1% |
| Workflow (GPT-4o-mini) | Hard | 1.7 | 97.7/96.0 | 53.0 | 17.5/12.0 | 14.3% |
| Direct (GPT-4o) | Hard | 3.2 | 98.9/97.7 | 16.3 | 8.0/26.4 | 4.9% |

Comparison of Direct Generation across LLMs

| Model | Feasibility Pass | Rationality Pass | Final Outperformance Rate |
| --- | --- | --- | --- |
| Qwen2.5-7b | 68.8% | ~5% | 0.7% |
| Qwen2.5-32b | 89.9% | ~12% | 4.7% |
| GPT-4o-mini | 89.3% | ~9% | 0.9% |
| DeepSeek-V3 | 96.5% | ~15% | 7.8% |
| GPT-4o | 98.3% | ~22% | 7.5% |

Key Findings

  1. Less than 10% Achieve Human Level: Under direct generation, even the strongest models stay below 10% in final outperformance rate (GPT-4o: 7.5%, DeepSeek-V3: 7.8%), falling far short of human-made itineraries.
  2. Feasibility \(\neq\) Rationality: Satisfying hard constraints only guarantees feasibility; the gap in rationality (route optimization, scheduling, etc.) is far wider.
  3. Workflow Method Achieves Optimal Routing Efficiency: The route distance ratio is only 1.7-1.8, significantly lower than the 3-4 of direct generation methods.
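  4. Hard Tasks Are More Challenging: Rationality and the final outperformance rate drop for 4-7 day itineraries compared to 2-3 day ones (e.g., Workflow rationality falls from 74.3 to 53.0 and its final rate from 18.1% to 14.3%), even though feasibility stays high.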
  5. Personalization is the Greatest Weakness: The outperformance rate in personalization evaluated by the LLM is generally below 30%, indicating that LLMs struggle to understand user preferences deeply.
  6. Limited Effectiveness of CoT, ReAct, and Reflexion: Advanced reasoning strategies do not significantly improve the quality of travel planning.
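Finding 2 can be illustrated with a toy check, assuming a plan is a dictionary of POI names plus a total cost. The helper names, field names, and sample POIs below are hypothetical, and the checks cover only a subset of the criteria listed under the evaluation framework (sandbox existence, completeness, non-duplication, budget).

```python
def is_feasible(plan, sandbox_names, required=("attractions", "restaurants", "hotel")):
    """Feasible if all required fields are filled and every named POI exists in the sandbox."""
    if any(not plan.get(key) for key in required):
        return False
    names = plan["attractions"] + plan["restaurants"] + [plan["hotel"]]
    return all(name in sandbox_names for name in names)


def rationality_issues(plan, budget):
    """Return a list of rationality problems; an empty list means none were found."""
    issues = []
    for key in ("attractions", "restaurants"):
        if len(set(plan[key])) != len(plan[key]):
            issues.append(f"duplicate {key}")
    if plan["cost"] > budget:
        issues.append("over budget")
    return issues


# Hypothetical sandbox and plan: every POI exists, but one restaurant is repeated.
sandbox = {"Old Town", "Lake Park", "Noodle House", "Garden Inn"}
plan = {
    "attractions": ["Old Town", "Lake Park"],
    "restaurants": ["Noodle House", "Noodle House"],
    "hotel": "Garden Inn",
    "cost": 420,
}
print(is_feasible(plan, sandbox), rationality_issues(plan, budget=500))
# prints: True ['duplicate restaurants']
```

The plan passes the feasibility check yet still fails on rationality, which mirrors the gap between the feasibility and rationality pass rates in the tables above.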

Highlights & Insights

  • Data Scale is the Core Advantage: Over 500k POIs and nearly 4,000 real itineraries exceed existing benchmarks by an order of magnitude.
  • Three-Dimensional Evaluation Framework: More comprehensive than simplistic constraint satisfaction rates. Feasibility is merely the "passing line," whereas rationality and personalization are the criteria for "excellence."
  • The route distance ratio metric is elegant and effective, intuitively reflecting the spatial planning quality of the itineraries.
  • Introduction of a Human Baseline: Provides a clear reference point for evaluation, rather than merely comparing relative rankings of different LLMs.
  • The performance degradation of LLMs in long-range planning (Hard tasks) reveals the limitations of modern LLMs in solving complex constraint optimization problems.

Limitations & Future Work

  • Only covers 40 cities in China, leaving international travel scenarios unexplored.
  • Personalization evaluation heavily relies on LLM scoring and reward models, lacking direct satisfaction feedback from real users.
  • Sandbox environment information may become outdated (e.g., dynamic attraction ratings and restaurant details).
  • Optimization of budget allocation is not considered (e.g., maximizing experience within a limited budget).
  • Multi-person travel scenarios, which require balancing diverse preferences, are not covered.

Related Work & Insights

  • Compared to TravelPlanner (Xie et al., 2024), TripTailor utilizes real-world data and offers more comprehensive evaluations.
  • Compared to ChinaTravel (Shao et al., 2024), the data scale is an order of magnitude larger.
  • The external verifier strategy of LLM-Modulo (Gundawar et al., 2024) can complement the Workflow decomposition method proposed in this paper.
  • Insight: Travel planning is inherently a multi-objective optimization problem. Pure LLM reasoning is insufficient; it must be integrated with search algorithms and domain-specific knowledge.

Rating

  • Novelty: ⭐⭐⭐⭐ — Real-world large-scale benchmark + innovative three-dimensional evaluation framework.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multiple LLMs, various agent strategies, Easy/Hard difficulties, and combination of three evaluation modes (objective + LLM + RM).
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with detailed descriptions of the dataset construction process.
  • Value: ⭐⭐⭐⭐ — Establishes a reliable real-world baseline for evaluating LLM travel planning.