Skip to content

ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation

TL;DR

Introduces ActionPiece, the first context-aware action sequence tokenizer that models user behavior sequences as "sequences of feature sets." By adopting a BPE-like merge strategy to discover high-frequency feature patterns both within sets and across adjacent sets, it allows the same action to be tokenized into different tokens depending on the context, significantly improving generative recommendation performance.

Background & Motivation

Generative Recommendation (GR) is an emerging paradigm that tokenizes a user's historical interaction actions into discrete tokens, and then uses an autoregressive model to generate token sequences as recommendation results. This approach shares a compact vocabulary, offering advantages in scalability and memory efficiency.

The core limitation of existing GR methods (such as TIGER, P5-CID, VQ-Rec, and HSTU) is that they process each action completely independently during tokenization—the same item is mapped to the identical token pattern regardless of the sequence in which it appears. This overlooks an important fact: the same action can carry different meanings in different contexts. For example, purchasing the same T-shirt carries different semantics in a sports equipment purchase sequence compared to a casual wear sequence.

Drawing an analogy to the evolution of tokenization in NLP—from early word-level tokenization (context-independent) to modern subword tokenization like BPE (context-aware)—the authors point out that action tokenization in recommender systems remains in the "word-level" stage and needs to evolve toward context awareness. However, there are fundamental differences between recommendation scenarios and text: text is a simple one-dimensional sequence of characters, whereas item features form unordered sets, requiring a custom tokenization algorithm tailored to "sequences of sets."

Method

Overall Architecture

The workflow of ActionPiece consists of three steps:

  1. Vocabulary Construction: Given the training corpus (a corpus of user action sequences) and starting from an initial vocabulary (where each unique feature corresponds to a token), high-frequency co-occurring token pairs are iteratively merged until the target vocabulary size \(Q\) is reached.
  2. Tokenization/Segmentation: Using Set Permutation Regularization (SPR), each feature set in the action sequence is randomly permuted and flattened, then segmented using the traditional BPE method to produce multiple semantically equivalent token sequences.
  3. Model Training and Inference: A Transformer encoder-decoder is used to autoregressively generate the token sequence of the next action; during training, different permutations are used in each epoch for data augmentation, while during inference, multiple permutations are used for ensemble predictions.

Key Design 1: Weighted Co-occurrence Counting

Unlike standard BPE, ActionPiece processes "sequences of sets" instead of one-dimensional sequences. There are two types of co-occurrence for token pairs: (1) two tokens within the same set; (2) two tokens in adjacent sets. To handle them uniformly, the authors randomly permute the sets to flatten them into a one-dimensional sequence and calculate the expected probability of two tokens being adjacent as the weight.

For two tokens \(c_1, c_2\) within the same set \(\mathcal{A}_i\):

\[P(c_1, c_2) = \frac{2}{|\mathcal{A}_i|}\]

For two tokens \(c_1 \in \mathcal{A}_i, c_3 \in \mathcal{A}_{i+1}\) in adjacent sets \(\mathcal{A}_i\) and \(\mathcal{A}_{i+1}\):

\[P(c_1, c_3) = \frac{1}{|\mathcal{A}_i| \times |\mathcal{A}_{i+1}|}\]

This weighting scheme assigns higher weights to token pairs within smaller sets (as features are more tightly coupled) and decays the cross-set weights as set sizes increase. By traversing the corpus to accumulate weights across all occurrences, the total co-occurrence score for each token pair is obtained, and the token pair with the highest score is selected for merging.

Key Design 2: Intermediate Node and Corpus Update Mechanism

When merging a token pair, if both tokens come from the same set, they can be directly replaced. However, if they come from two adjacent sets, the resulting new token spans across action boundaries. The authors introduce Intermediate Nodes to resolve this issue:

  • A doubly linked list is used to maintain each action sequence, where each node represents a set of tokens.
  • When merging cross-set tokens, an intermediate node containing the new token is inserted between the two "action nodes".
  • An intermediate node contains at most one token and is treated as a set of size 1 when calculating co-occurrences.
  • When merging tokens between an action node and an intermediate node, the new token remains in the intermediate node.

This design guarantees that there is at most one intermediate node between any two action nodes, naturally representing the fourth type of token (cross-action feature combinations).

Key Design 3: Set Permutation Regularization (SPR)

A naive segmentation strategy (greedy matching based on merge priority) leads to many unused tokens in the vocabulary (utilization is only 56.89%). The core ideas of SPR are:

  • During training, a random permutation is generated for each action's feature set in each epoch, which is then flattened and segmented using standard BPE.
  • Different permutations generate semantically equivalent but token-wise distinct sequences, serving as natural data augmentation.
  • During inference, segmentation results are generated from \(q\) different permutations, passed through the model to obtain ranked lists, and integrated by averaging their scores.

SPR improves the token utilization rate from 56.89% to over 87.01%, with almost no impact on training efficiency (the permutation operations are processed asynchronously on the CPU).

Efficient Implementation

The naive implementation has a time complexity of \(O(QNLm^2)\). By constructing an inverted index (token pairs \(\to\) linked list positions containing them) and using a global heap (lazy-update strategy), the complexity is reduced to \(O(\log Q \log H \cdot NLm^2)\), where \(H = O(NLm)\) is the maximum size of the heap.

Key Experimental Results

Main Results: Comparison with Baselines (Amazon Reviews Dataset)

Dataset Metric SASRec VQ-Rec TIGER HSTU SPM-SID ActionPiece Gain
Sports R@5 0.0233 0.0181 0.0264 0.0258 0.0280 0.0316 +12.86%
Sports N@10 0.0192 0.0154 0.0225 0.0215 0.0234 0.0264 +12.82%
Beauty R@5 0.0387 0.0434 0.0454 0.0469 0.0475 0.0511 +7.58%
Beauty N@10 0.0318 0.0372 0.0384 0.0389 0.0399 0.0424 +6.00%
CDs R@5 0.0351 0.0314 0.0492 0.0417 0.0509 0.0544 +6.88%
CDs N@10 0.0263 0.0264 0.0411 0.0346 0.0424 0.0451 +6.37%

Ablation Study (NDCG@10)

Variant Sports Beauty CDs
w/o tokenization (directly using raw features) 0.0215 0.0389 0.0346
w/o context-aware (only intra-set merges) 0.0258 0.0416 0.0429
w/o weighted counting (equal-weight counting) 0.0257 0.0412 0.0435
SPR used for inference only 0.0192 0.0316 0.0329
SPR used for training only 0.0244 0.0387 0.0422
TIGER + SPR (without ActionPiece vocabulary) 0.0202 0.0330 0.0351
ActionPiece (40k) 0.0264 0.0424 0.0451

Key Findings

  1. Context awareness is core: Removing cross-set merges (w/o context-aware) significantly drops performance, proving that capturing cross-action feature patterns is a critical contribution.
  2. SPR cannot be used in isolation: Applying SPR to TIGER (context-independent tokenization) actually degrades performance, because it disrupts the internal order of RQ-VAE semantic IDs without introducing new semantic information.
  3. Trade-off of vocabulary size: Increasing the vocabulary size improves performance and shortens token sequence length, but also increases the parameter size. Simply increasing the vocabulary size of TIGER does not yield performance gains, demonstrating that the challenge lies in the quality of tokenization rather than vocabulary size.
  4. Diminishing marginal utility of inference ensemble: The performance improvement is prominent when the ensemble size \(q\) increases from 1 to 5, which then slows down from 5 to 7.

Highlights & Insights

  • Elegant analogical transfer: Migrating the successful application of BPE in NLP to recommender systems without naive application—deeply adapting to the unique data structure of "sequences of sets" (via weighted counting, intermediate nodes, and SPR) reflects a profound understanding of the problem's nature.
  • Killing two birds with one stone with SPR: Set Permutation Regularization simultaneously addresses low token utilization and inference ensembling with a clean and efficient design.
  • Insight into cross-action tokens: Merged tokens can span the features of adjacent actions, which conceptually equates to capturing "transition patterns" of user behavior (e.g., brand consistency from T-shirts to socks), thus imbuing tokenization with sequence-level semantics.
  • Google DeepMind work with open-source code: The engineering implementation utilizes an efficient inverted index and a lazy-update heap, taking practical deployment efficiency into consideration.

Limitations & Future Work

  1. Evaluated only on small-scale Amazon Reviews data: The largest of the three datasets only contains 75k users and 64k items, lacking validation on industrial-scale massive scenarios.
  2. Feature selection relies on manual engineering: It requires pre-defined sets of discrete item features, and continuous features need to be discretized; the quality of feature engineering directly impacts the tokenization effect.
  3. Inference overhead scales linearly with the ensemble size: Although parallelizable, the computational cost of \(q\) forward passes remains high.
  4. Lack of comparison with LLM-based recommendation methods: Recommendation methods based on large language models, such as TALLRec, are not included in the comparison.
  • TIGER (Rajput et al., 2023): Generates hierarchical semantic IDs using RQ-VAE, which serves as a major baseline; the core distinction of ActionPiece lies in its context awareness.
  • SPM-SID (Singh et al., 2024): Applies SentencePiece to merge semantic ID patterns, representing the strongest baseline yet remaining context-independent.
  • HSTU (Zhai et al., 2024): Directly uses raw features as tokens without tokenization; the w/o tokenization ablation of ActionPiece validates the necessity of tokenization.
  • BPE for NLP (Sennrich et al., 2016): The direct inspiration for ActionPiece, though expanding it from a one-dimensional sequence to a sequence of sets is non-trivial.

Inspiration: This paradigm of migrating "from NLP tokenization to structured sequence tokenization" can be extended to other domains—such as audio modeling (sequences of frame feature sets), time series forecasting (multivariate set sequences), sequential decision making, etc. The concept of context-aware tokenization may also inspire representation learning in other fields.

Rating

  • Novelty: ⭐⭐⭐⭐ — The first context-aware action tokenizer; the analogy is clear but technical adaptation is non-trivial.
  • Technical Depth: ⭐⭐⭐⭐ — The three designs (weighted counting, intermediate nodes, and SPR) are tightly coupled, and the efficient implementation has engineering depth.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive ablations and deep analysis (token utilization, ensemble effects, case studies), though the dataset scale is relatively small.
  • Practical Value: ⭐⭐⭐⭐ — Open-source code, highly versatile method, but lacking industrial-scale validation.
  • Overall Recommendation: ⭐⭐⭐⭐ — A high-quality work that elegantly transfers NLP tokenization techniques to recommender systems.