TestNUC: Enhancing Test-Time Computing Approaches and Scaling through Neighboring Unlabeled Data Consistency¶

Conference: ACL 2025
arXiv: 2502.19163
Code: Yes (https://github.com/HenryPengZou/TestNUC)
Area: Others
Keywords: Test-time computing, neighborhood consistency, unlabeled data, majority voting, LLM inference scaling

TL;DR¶

TestNUC is a test-time inference method whose cost scales linearly with the number of neighbors used. It retrieves the nearest unlabeled neighbors of a test instance, prompts the LLM to predict labels for both the test sample and those neighbors, and aggregates the predictions via weighted majority voting, which consistently improves classification accuracy.

Background & Motivation¶

Test-time computing, which allocates more computation at inference time to boost LLM performance, has become a popular direction. Existing strategies fall into two types:

Input-level (e.g., few-shot ICL): lengthens the prompt, so computational cost grows quadratically with the number of tokens.

Output-level (e.g., self-consistency, best-of-N): samples multiple answers and aggregates them, but ignores the unlabeled data that is widely available in real-world scenarios.

Core Problem: How to efficiently leverage unlabeled data to enhance test-time inference?

The authors identify a phenomenon of "local consistency in embedding space": semantically similar instances are highly likely to share the same label. Preliminary analysis shows that among the \(K=20\) nearest neighbors, purity is about 30% even in the worst case (GoEmotion), and most datasets are much purer. Aggregating ground-truth labels from the neighborhood via majority voting yields highly accurate and stable predictions.

Method¶

Overall Architecture¶

TestNUC consists of two steps:
1. Neighbor Retrieval: retrieve the top-\(K\) nearest unlabeled neighbors of the test instance based on embedding similarity.
2. Collaborative Prediction: prompt the LLM to predict labels for both the test sample and its \(K\) neighbors, then derive the final answer with one of the aggregation strategies described under Key Designs.
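The retrieval step can be sketched as follows. This is a minimal illustration, assuming embeddings are already computed and stored as plain lists of floats; the function names `cosine` and `retrieve_neighbors` and the toy vectors are hypothetical and not taken from the TestNUC codebase.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_neighbors(test_emb, pool_embs, k):
    """Return (index, similarity) pairs for the k most similar pool items."""
    scored = [(cosine(test_emb, emb), idx) for idx, emb in enumerate(pool_embs)]
    # Sort by similarity only, highest first (brute-force scan over the pool).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, sim) for sim, idx in scored[:k]]

# Toy 2-D "embeddings" for four unlabeled samples (illustrative values only).
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = retrieve_neighbors([1.0, 0.0], pool, k=2)
```

The brute-force scan is linear in the pool size; a real deployment would more likely use an approximate nearest-neighbor index, which is outside the scope of this sketch.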

Key Designs¶

Preliminary Analysis
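- Neighborhood purity is defined as \(\phi(\mathcal{N}) = \frac{1}{KN} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_K(x_i)} \mathbf{1}(y_i = y_j)\), where \(\mathcal{N}_K(x_i)\) denotes the \(K\) nearest neighbors of sample \(x_i\) and \(N\) is the number of samples.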
- Empirical findings: nearest neighbor samples exhibit high label consistency, and majority voting accuracy remains stable as \(K\) increases.
- Weighted voting further improves stability under large \(K\).
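The purity definition above can be computed directly from gold labels and a neighbor table. The sketch below assumes every sample has the same number of neighbors \(K\); the function name `neighborhood_purity` and the toy data are hypothetical.

```python
def neighborhood_purity(labels, neighbor_ids):
    """Fraction of (sample, neighbor) pairs that share a label.

    labels[i] is the gold label of sample i.
    neighbor_ids[i] lists the indices of the K nearest neighbors of sample i.
    """
    n = len(labels)
    k = len(neighbor_ids[0])  # assumes the same K for every sample
    matches = sum(
        1
        for i in range(n)
        for j in neighbor_ids[i]
        if labels[j] == labels[i]
    )
    return matches / (k * n)

# Toy example: 4 samples, K = 2 neighbors each.
labels = ["a", "a", "b", "b"]
neighbor_ids = [[1, 2], [0, 3], [3, 0], [2, 1]]
purity = neighborhood_purity(labels, neighbor_ids)
```

In this toy example each sample has exactly one same-label neighbor out of two, so the purity is 0.5.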
Three Aggregation Strategies
- Naive Majority Voting: directly takes the most frequent class among the \(K\) predictions.
- Weighted Majority Voting: weights each vote by its cosine similarity to the test instance, reducing the influence of noisy, distant neighbors.
- Filtered Weighted Majority Voting (Full Version): additionally leverages the verbalized confidence of the LLM to filter out low-quality predictions.
  - For each neighbor, the LLM outputs both a prediction and a confidence score.
  - Only predictions with confidence \(\ge \theta\) (a preset threshold) participate in the voting.
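The three strategies differ only in which votes are counted and how much each one weighs, which a single function can illustrate. This is a sketch under stated assumptions rather than the authors' implementation: the test sample's own prediction is assumed to vote with similarity 1.0, the function returns `None` when every vote is filtered out, and the name `aggregate`, the threshold value, and the toy predictions are all hypothetical.

```python
from collections import defaultdict

def aggregate(predictions, mode="filtered", threshold=0.5):
    """Aggregate LLM predictions into one label.

    predictions: list of (label, similarity, confidence) tuples.
    mode: "naive", "weighted", or "filtered".
    """
    scores = defaultdict(float)
    for label, sim, conf in predictions:
        # Filtered voting drops predictions whose confidence is below the threshold.
        if mode == "filtered" and conf < threshold:
            continue
        # Naive voting counts each vote once; the other modes weight by similarity.
        scores[label] += 1.0 if mode == "naive" else sim
    if not scores:
        return None  # assumption: no fallback when all votes are filtered out
    return max(scores, key=scores.get)

preds = [
    ("refund", 1.00, 0.9),    # the test sample itself (assumed similarity 1.0)
    ("refund", 0.80, 0.8),
    ("transfer", 0.50, 0.3),
    ("transfer", 0.45, 0.4),
    ("transfer", 0.40, 0.9),
]
naive = aggregate(preds, mode="naive")
weighted = aggregate(preds, mode="weighted")
filtered = aggregate(preds, mode="filtered", threshold=0.5)
```

Here naive voting picks "transfer" (three votes against two), while the weighted and filtered variants pick "refund" because the "transfer" votes come from less similar or less confident neighbors.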
Seamless Integration with Existing Methods
- Integration with Self-Consistency: performs self-consistency on each neighbor before aggregation.
- Integration with TopK-ICL: enhances individual neighbor predictions with ICL first, then aggregates.
- Integration with Best-of-N: performs best-of-N on top of the aggregated results of TestNUC.
- All integrations bring further performance gains.
Computational Complexity Analysis
- Embedding precomputation costs \(\mathcal{O}(N)\) for \(N\) unlabeled samples (done offline), and retrieval costs \(\mathcal{O}(N)\) per test instance.
- LLM inference cost is \(\mathcal{O}(K)\) calls per test instance, comparable to self-consistency with \(M\) sampled outputs (\(\mathcal{O}(M)\)).
- Overall scaling is linear, which is significantly better than the quadratic scaling of ICL.
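A rough back-of-the-envelope comparison makes the linear-versus-quadratic gap concrete. It assumes every example is \(L\) tokens long, that attention cost dominates and grows quadratically with prompt length, and it ignores instruction and output tokens; these simplifications are illustrative and not taken from the paper. ICL with \(K\) demonstrations issues one call with a prompt of \((K+1)L\) tokens, while TestNUC issues \(K+1\) calls of \(L\) tokens each:

\[
C_{\text{ICL}} \propto \big((K+1)L\big)^2 = (K+1)^2 L^2, \qquad C_{\text{TestNUC}} \propto (K+1)\,L^2 .
\]

Under these assumptions the ratio \(C_{\text{ICL}} / C_{\text{TestNUC}}\) is \(K+1\), so at \(K=20\) the single long ICL prompt would cost roughly 21 times as much attention compute as the 21 short TestNUC calls.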

Key Experimental Results¶

Main Results — 4 LLMs × 8 Datasets (Table 1 Summary)¶

Model	Method	Average Accuracy
GPT-4o-mini	Standard Prompting	0.613
GPT-4o-mini	Self-Consistency	0.625
GPT-4o-mini	TestNUC	0.660
GPT-4o-mini	TestNUC† (K=50)	0.676
Llama-3.1-8B	Standard Prompting	0.572
Llama-3.1-8B	TestNUC†	0.652
GPT-4o	Standard Prompting	0.715
GPT-4o	TestNUC†	0.754

Integration with Existing Test-Time Methods (Table 2 Summary)¶

Base Method	+TestNUC Gain
KNN-ICL	+7.51%
KNN-ICL-P	+5.98%
Self-Consistency	+9.56%
Best-of-N	+6.24%

Ablation Study¶

Dimension	Conclusion
Sensitivity to \(K\)	Performance improves steadily for \(K=10\text{-}50\), demonstrating robustness to \(K\).
Embedding Models	Consistent effectiveness across various embedding models (e.g., SFR, GTE, NV-Embed).
Size of Unlabeled Data	Performance improves monotonically as data volume increases; more data leads to better results.
Aggregation Strategy	Filtered Weighted Voting > Weighted Voting > Naive Voting

Key Findings¶

TestNUC consistently outperforms baselines across all LLMs and datasets, achieving an average improvement of 4-8 percentage points.
Complementary to existing methods: all integrations yield additional gains, with the Self-Consistency integration showing the largest improvement (+9.56%).
Benefits from more unlabeled data: performance improves monotonically as the unlabeled pool grows, making the method well suited to data-rich practical scenarios.
Insensitive to embedding models: embedding models of different sizes and sources all perform well.
The largest improvements are observed in intent detection (BANKING/CLINC), while the smallest are in emotion detection (GoEmotion)—consistent with the neighborhood purity analysis.

Highlights & Insights¶

Simple and elegant formulation: The intuition that "similar samples should share the same label" is simple yet highly effective, and it is backed by empirical support (neighborhood purity analysis).
Plug-and-play: Orthogonal to existing test-time computing methods, allowing direct integration.
Training-free: Leverages off-the-shelf embedding models and LLMs without any extra training.
High practicality: Linear scaling + robustness to hyperparameters + cross-model generalizability = easy deployment.

Limitations & Future Work¶

Requires a certain amount of task-related unlabeled data, making it unsuitable for absolute cold-start scenarios.
Requires \(K\) additional LLM calls per test instance, which may hurt latency-sensitive applications when \(K\) is large.
Validation is restricted to classification tasks; its applicability to generative tasks (e.g., text summarization, translation) remains unexplored.
Neighborhood purity is low on some fine-grained tasks (about 30% on GoEmotion at \(K=20\)), which limits the method's efficacy there.
More complex aggregation strategies (e.g., learning-based aggregation) have not yet been evaluated.

Related Work¶

kNN-LM (Khandelwal et al., 2020) uses neighboring samples to improve language model generalization. The idea is similar, but kNN-LM operates at the token-generation level rather than on task-level predictions.
Self-Consistency (Wang et al., 2023b) improves inference via output sampling and aggregation, whereas TestNUC complements it from the input space.
The paradigm of leveraging unlabeled data in semi-supervised learning (e.g., FixMatch) aligns with the neighborhood consistency assumption in TestNUC.

Rating¶

Novelty: ⭐⭐⭐⭐ — First to systematically leverage the neighborhood consistency of unlabeled data within a test-time computing framework.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 LLMs \(\times\) 8 datasets \(\times\) multiple integration methods \(\times\) extensive ablations.
Writing Quality: ⭐⭐⭐⭐ — Preliminary analyses (purity, majority voting accuracy) establish a solid motivational foundation for the method.
Value: ⭐⭐⭐⭐ — Simple yet effective inference scaling method with significant practical value.