
Tool4POI: A Tool-Augmented LLM Framework for Next POI Recommendation

Conference: AAAI 2026 arXiv: 2511.06405 Code: N/A Area: Recommender Systems Keywords: POI Recommendation, Tool-Augmented LLM, Agent, Open-Set Recommendation, Location-Based Services

TL;DR

This paper is the first to introduce the tool-augmented LLM paradigm to the next POI recommendation task. Through three modules—preference extraction, multi-round candidate retrieval, and reranking—the framework enables LLMs to retrieve recommendations from the full POI pool. It achieves over 40% accuracy in Out-of-History (OOH) scenarios (where existing methods yield 0%), with average Acc@5/10 improvements of 20%/30%.

Background & Motivation

Background: Next Point-of-Interest (POI) recommendation is a core task in location-based services. Traditional approaches (RNN/Transformer/GCN) model user trajectory sequences via embedding representations. Recently, the contextual reasoning capability of LLMs has been introduced to this task—e.g., LLM-Mob (in-context learning) and LLM4POI (supervised fine-tuning)—demonstrating potential for understanding spatiotemporal dynamics.

Limitations of Prior Work: LLM-based methods face two fundamental constraints: (1) heavy reliance on historical completeness—they can only recommend POIs the user has previously visited and cannot handle Out-of-History (OOH) scenarios (where users intend to visit places they have never been), which account for over 30% of real-world cases; (2) context window limitations—a city may contain hundreds of thousands of POIs, making it infeasible to encode all candidates into a prompt for open-set recommendation.

Key Challenge: User behavior exhibits both regularity (commuting patterns) and exploratory tendencies (trying new restaurants). Existing LLM methods overfit to visited POIs and cannot support exploratory behavior; fine-tuning approaches (e.g., GNPR-SID) further exacerbate this bias.

Goal: Design a plug-and-play, fine-tuning-free framework that enables LLMs to retrieve recommendations from the full POI pool via external tools, overcoming both the OOH limitation and the large-scale candidate space constraint.

Key Insight: The observation that humans select destinations through progressive filtering (by category, region, and distance) suggests that this narrowing process can be simulated via multi-round tool calls by an LLM agent.

Core Idea: Equip LLMs with external tool-calling capabilities and implement a three-stage pipeline—preference extraction → multi-round tool-based retrieval → reranking—to enable open-set POI recommendation.

Method

Overall Architecture

Tool4POI comprises three modules, all built on Qwen2.5-14B and applicable in a plug-and-play manner without fine-tuning: (1) Preference Extraction Module—extracts user preferences along region, category, and temporal dimensions from long-term check-in history; (2) Tool-Augmented Candidate Retrieval Module—the LLM acts as a retrieval agent, interacting with six external tools across multiple rounds to retrieve relevant candidates from the full POI pool; (3) Reranking Module—reorders candidates based on the user's recent check-in behavior to reflect current intent.

Key Designs

  1. Preference Extraction Module:

    • Function: Extracts a structured representation of long-term user preferences from historical check-in trajectories.
    • Mechanism: A structured prompt is designed to feed the user's temporally ordered check-in sequence (converted to regional codes via Google Maps Plus Codes) into the LLM, which outputs preference keywords along Region, Category, and Time dimensions. Plus Codes aggregate latitude/longitude coordinates into region-level encodings, enabling spatially proximate POIs to share the same code, thereby simplifying geographic feature representation.
    • Design Motivation: Historical data is voluminous yet rich in implicit patterns; the reasoning capability of LLMs is well suited for extracting multi-dimensional preference summaries.
  2. Tool-Augmented Candidate Retrieval Module:

    • Function: Enables the LLM to autonomously retrieve relevant candidates from the full POI pool, overcoming context window limitations.
    • Mechanism: Six external tools are defined: query tools (getPOIinfo for POI metadata retrieval), retrieval tools (filterByCategories/filterByRegions for filtering by category/region), auxiliary tools (findPotential for generating initial candidates via POI-level collaborative filtering; sortByDistance for distance-based ranking), and a control tool (finish to terminate retrieval). The LLM acts as a RetrievalAgent, autonomously determining tool call order and parameters based on extracted preferences. Termination conditions: explicit invocation of finish, candidate set size falling below threshold \(\tau=10\), or reaching the maximum call count \(K=6\).
    • Design Motivation: (1) findPotential introduces collective behavioral priors via a directed co-occurrence graph \(G=(\mathcal{P}, \mathcal{E})\), enabling OOH POIs to be retrieved; (2) multi-round interaction simulates the human decision-making process, progressively narrowing the candidate space to produce a high-quality candidate set.
  3. Reranking Module:

    • Function: Reranks retrieved candidates based on the user's recent behavior to capture short-term intent.
    • Mechanism: The user's recent check-in trajectory \(R_u\), the target timestamp \(t_{i+1}\), and the candidate set \(\mathcal{C}\) are jointly fed into the LLM, which ranks candidate POIs by visit likelihood based on recent behavioral patterns, using natural language reasoning rather than embedding similarity.
    • Design Motivation: Preference extraction captures long-term interests while reranking captures short-term dynamics (seasonal changes, life-stage transitions, etc.); the two are complementary.
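The Plus Code region aggregation used by the preference extraction module (item 1 above) can be sketched with a minimal encoder. This is an illustrative implementation of the standard Open Location Code digit scheme only (base-20 alphabet, 20-degree first-pair resolution); it ignores latitude clipping and the post-`+` refinement digits, which the official open-location-code libraries handle. Truncating a code yields a coarser region, so nearby POIs share the same prefix:

```python
OLC_ALPHABET = "23456789CFGHJMPQRVWX"  # base-20 digit set used by Plus Codes

def plus_code(lat, lng, length=8):
    """Encode a coordinate into the first `length` Plus Code digits (pairs only)."""
    lat, lng = lat + 90.0, lng + 180.0  # shift both axes to non-negative ranges
    code, res = "", 20.0                # the first digit pair spans 20 degrees
    for _ in range(length // 2):
        code += OLC_ALPHABET[int(lat // res) % 20]  # latitude digit first,
        code += OLC_ALPHABET[int(lng // res) % 20]  # then longitude digit
        lat, lng = lat % res, lng % res
        res /= 20.0                     # each pair refines resolution 20x
    return code

def region_code(lat, lng, digits=6):
    """Truncate to a coarse region code so spatially proximate POIs share it."""
    return plus_code(lat, lng, 8)[:digits]
```

For example, two check-ins a block apart in lower Manhattan map to the same 6-digit region code even though their full codes differ, which is exactly the aggregation the module exploits.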

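The multi-round retrieval loop of module 2 can be sketched with stubbed tools over a toy POI pool. The six tool names, the threshold \(\tau=10\), and the call cap \(K=6\) come from the paper; everything else here (the toy data, the `plan_step` stand-in for the LLM's tool-choice policy) is a hypothetical illustration, not the actual implementation:

```python
# Toy check-in sequences and POI pool (category, region code, lat, lng);
# illustrative data only, not the paper's datasets.
ALL_SEQUENCES = [["p1", "p3", "p4"], ["p1", "p4", "p5"], ["p2", "p4"], ["p1", "p4"]]
POIS = {
    "p1": ("cafe",       "87G7PX", 40.713, -74.006),
    "p2": ("cafe",       "87G7PX", 40.714, -74.004),
    "p3": ("museum",     "87G7PX", 40.712, -74.010),
    "p4": ("cafe",       "87G8Q2", 40.758, -73.985),
    "p5": ("restaurant", "87G8Q2", 40.757, -73.986),
}

# Stubs of the paper's six tools over the toy pool.
def getPOIinfo(pid):                return POIS[pid]
def filterByCategories(cand, cats): return [p for p in cand if POIS[p][0] in cats]
def filterByRegions(cand, regs):    return [p for p in cand if POIS[p][1] in regs]
def sortByDistance(cand, lat, lng):
    return sorted(cand, key=lambda p: (POIS[p][2] - lat) ** 2 + (POIS[p][3] - lng) ** 2)

def findPotential(history, k=20):
    """POI-level collaborative filtering over a directed co-occurrence graph:
    edge (a, b) counts how often POI b follows POI a in any user's sequence."""
    edges = {}
    for seq in ALL_SEQUENCES:
        for a, b in zip(seq, seq[1:]):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    scores = {}
    for p in set(history):
        for (a, b), w in edges.items():
            if a == p:
                scores[b] = scores.get(b, 0) + w
    return sorted(scores, key=scores.get, reverse=True)[:k]

def plan_step(preferences, step):
    # Stand-in for the LLM's tool choice: category, then region, then finish.
    plan = [("filterByCategories", preferences["categories"]),
            ("filterByRegions", preferences["regions"]),
            ("finish", None)]
    return plan[min(step, 2)]

def retrieval_agent(preferences, history, tau=10, K=6):
    """Multi-round narrowing loop with the paper's stopping rules."""
    cand = findPotential(history)   # seed candidates from the collective prior
    for step in range(K):           # hard cap of K = 6 tool calls
        tool, args = plan_step(preferences, step)
        if tool == "finish" or len(cand) < tau:  # stop once below tau = 10
            break
        cand = {"filterByCategories": filterByCategories,
                "filterByRegions": filterByRegions}[tool](cand, args)
    return cand
```

With this toy data, `findPotential(["p1"])` seeds candidates that include OOH POIs the user never visited; because the seed set is already below \(\tau\) the loop stops immediately, while a lower `tau` exercises the filtering rounds.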
Loss & Training

Tool4POI is entirely training-free: no module is fine-tuned. At inference time the three modules execute sequentially, and the top-20 candidates produced by the retrieval module are passed to the reranking module.
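This sequential inference flow can be sketched as a straight composition. The `llm` function, the prompt strings, and the helper names below are hypothetical stand-ins (a real deployment would call Qwen2.5-14B); only the module order and the top-20 handoff come from the paper:

```python
def llm(prompt):
    # Stand-in for a Qwen2.5-14B call. This toy version returns canned answers
    # so the pipeline runs end to end; swap in a real model client in practice.
    if prompt.startswith("Given"):
        return {"regions": ["87G7PX"], "categories": ["cafe"], "times": ["morning"]}
    return ["p2", "p1"]  # canned "ranking" for the sketch

def extract_preferences(history):
    # Module 1: structured prompt over the time-ordered, region-encoded history.
    prompt = ("Given the check-in sequence below, summarize the user's preferred "
              f"regions, categories, and active times.\n{history}")
    return llm(prompt)

def rerank(recent_checkins, target_time, candidates):
    # Module 3: natural-language ranking by visit likelihood,
    # rather than embedding similarity.
    prompt = (f"Recent check-ins: {recent_checkins}\nTarget time: {target_time}\n"
              f"Rank these candidates by visit likelihood: {candidates}")
    return llm(prompt)

def recommend(history, recent_checkins, target_time, retrieval_agent, top_k=10):
    prefs = extract_preferences(history)               # long-term preferences
    candidates = retrieval_agent(prefs, history)[:20]  # top-20 passed to reranking
    return rerank(recent_checkins, target_time, candidates)[:top_k]
```

Any retrieval function with the signature `retrieval_agent(preferences, history)` can be plugged in, which reflects the framework's plug-and-play design.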

Key Experimental Results

Main Results

| Method | NYC Acc@5 | NYC Acc@10 | TKY Acc@5 | TKY Acc@10 | CA Acc@5 | CA Acc@10 |
|---|---|---|---|---|---|---|
| Tool4POI | 0.6346 | 0.7623 | best | best | best | best |
| GNPR-SID (FT LLM) | low | low | low | low | low | low |
| GETNext | 0.4815 | 0.5811 | 0.4045 | 0.4961 | 0.3278 | 0.3946 |
| STAN | 0.4582 | 0.5734 | 0.3798 | 0.4464 | 0.2348 | 0.3018 |

("best"/"low" summarize relative performance where exact figures were not transcribed from the paper.)

Ablation Study

| Configuration | All Acc@1 | All Acc@10 | OOH Acc@1 | OOH Acc@10 |
|---|---|---|---|---|
| Tool4POI (Full) | 0.3164 | 0.7623 | 0.0522 | 0.5863 |
| w/o Retrieval Module | 0.2545 | 0.5559 | 0 | 0 |
| w/o Reranking Module | 0.1655 | 0.7145 | 0.0963 | 0.6024 |

Key Findings

  • Existing LLM methods achieve 0% accuracy in OOH scenarios, while Tool4POI reaches 40%+ Acc@10, demonstrating the critical value of tool-augmented retrieval.
  • The retrieval module contributes most to top-k recommendation (by introducing diverse candidates), while the reranking module contributes most to top-1 precision (by capturing current intent).
  • The most significant improvements are observed on the sparse dataset CA (averaging only 10 check-ins per user), with gains as high as 100%, demonstrating robustness in data-sparse settings.
  • Model-scale effects: even with a 3B backbone, Tool4POI outperforms fine-tuned 7B methods, indicating that tool augmentation matters more than raw model scale.

Highlights & Insights

  • This is the first work to bring the agent tool-calling paradigm to next POI recommendation, pioneering the integration of recommender systems with LLM agents. The findPotential tool, which leverages a POI co-occurrence graph as a collective prior, is a particularly elegant design.
  • The training-free, plug-and-play design allows direct deployment across any city or dataset without retraining, making it highly practical.
  • The iterative candidate narrowing via multi-round interaction mirrors the cognitive process humans use when selecting destinations, and this paradigm is transferable to other open-set recommendation scenarios (e.g., product recommendation, content recommendation).

Limitations & Future Work

  • Retrieval quality depends on the LLM's tool-calling accuracy; smaller models may produce errors that break the tool chain.
  • The six-tool design is relatively hand-crafted; additional domain-specific tools (e.g., weather queries, event calendars) may further improve performance.
  • In OOH scenarios, reranking may actually degrade performance due to lack of contextual information; adaptive decisions on whether to perform reranking could be considered.
  • Inference latency is high due to multi-round LLM calls, which requires optimization for real-time recommendation settings.
Comparison with Prior Methods

  • vs. LLM4POI: Supervised fine-tuning of LLMs for QA-style recommendation leads to severe overfitting toward high-frequency POIs. Tool4POI requires no training and retrieves open-set candidates via tools.
  • vs. GNPR-SID: Uses semantic IDs to train LLMs but still cannot recommend OOH POIs. Tool4POI's findPotential tool leverages collective behavior to transcend individual history limitations.
  • vs. Traditional Methods (e.g., GETNext): Fixed embedding representations lack flexibility. Tool4POI leverages LLM reasoning for context-aware recommendation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First tool-augmented LLM POI recommendation framework; groundbreaking resolution of the OOH problem.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation on three datasets with in-depth IH/OOH analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Clear method description with complete algorithmic pseudocode.
  • Value: ⭐⭐⭐⭐⭐ — Pioneering contribution to the intersection of recommender systems and LLM agents.