Interpreting Fedspeak with Confidence: A LLM-Based Uncertainty-Aware Framework Guided by Monetary Policy Transmission Paths¶
Conference: AAAI 2026 arXiv: 2508.08001 Code: yuuki20001/FOMC-sentiment-path Area: Time Series Keywords: Fedspeak, monetary policy stance, LLM, uncertainty quantification, financial sentiment analysis
TL;DR¶
This paper proposes an LLM-based uncertainty-aware framework for interpreting Fedspeak (Federal Reserve language). The framework enhances inputs through domain reasoning along monetary policy transmission paths, and introduces a dynamic uncertainty decoding module to quantify prediction confidence (Perceptual Uncertainty = Environmental Ambiguity × Cognitive Risk), achieving SOTA performance on FOMC monetary policy stance analysis.
Background & Motivation¶
Fedspeak is the specialized language used by the Federal Reserve to convey policy signals, characterized by strong context dependence — the same word may indicate opposite stances under different economic conditions (e.g., a "strong" labor market is dovish in a weak economy but hawkish in an overheating one).
Limitations of prior work:
- Dictionary-based methods: Simple and interpretable but unable to handle complex context.
- Fine-tuned models (e.g., FinBERT): Strong performance but black-box with limited transparency.
- Zero-shot large models (e.g., GPT-4): Capable but neglect reliability, bias, and hallucination concerns.
- Existing LLM work focuses predominantly on performance metrics while overlooking the evaluation of prediction reliability.
Core idea: The LLM is analogized to a policy analyst, with two uncertainty dimensions — Cognitive Risk (CR) and Environmental Ambiguity (EA) — introduced to quantify prediction confidence.
Method¶
Data Augmentation: Domain Reasoning¶
- Financial Entity Relation Extraction: Fedspeak text is decomposed into atomic relations \(r(e_i, e_j) \in \mathcal{R}\) covering six types: CAUSE, COND, EVID, PURP, ACT, and COMP.
- Monetary Policy Transmission Path Reasoning: A quadruple \(\Gamma = (\mathbf{X}, \mathbf{Y}, \mathbf{Z}, \mathbf{M})\) is constructed, where:
- \(\mathbf{X}\): economic shock vector
- \(\mathbf{Y}\): transmission channels (credit channel, asset price channel, aggregate demand channel, etc.)
- \(\mathbf{Z}\): transmission paths (state transition sequences)
- \(\mathbf{M}\): final policy recommendations
- Structured templates combined with human-AI collaboration are used to construct the SFT dataset.
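To make the data-augmentation stage concrete, here is a minimal sketch of how one SFT example might bundle the extracted relations with the quadruple \(\Gamma = (\mathbf{X}, \mathbf{Y}, \mathbf{Z}, \mathbf{M})\). The field names, helper function, and sample passage are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of one SFT training example: atomic relations extracted
# from a Fedspeak passage plus a transmission-path quadruple Gamma = (X, Y, Z, M).
# Field names and the sample passage are illustrative, not the paper's schema.

RELATION_TYPES = {"CAUSE", "COND", "EVID", "PURP", "ACT", "COMP"}

def make_sft_example(passage, relations, shocks, channels, path, recommendation):
    """Bundle extracted structure into a single supervised example."""
    for rel_type, head, tail in relations:
        if rel_type not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {rel_type}")
    return {
        "passage": passage,
        "relations": [{"type": t, "head": h, "tail": tl} for t, h, tl in relations],
        "gamma": {"X": shocks, "Y": channels, "Z": path, "M": recommendation},
    }

example = make_sft_example(
    passage="Inflation pressures remain elevated amid strong demand.",
    relations=[("CAUSE", "strong demand", "elevated inflation")],
    shocks=["demand shock"],
    channels=["aggregate demand channel"],
    path=["demand up", "inflation up", "policy rate up"],
    recommendation="hawkish",
)
```

Structuring examples this way lets the templates render the relations and path into a fixed reasoning scaffold before the human-AI collaboration pass reviews them.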
Dynamic Uncertainty Decoding¶
The top-\(k\) logits from the LLM output are used to construct a Dirichlet distribution, from which three uncertainty measures are derived:
- Environmental Ambiguity (EA): Expected entropy of the predictive distribution:
  \(EA(a_t) = -\sum_{k=1}^{K} \frac{\alpha_k}{\alpha_0}\left(\psi(\alpha_k+1) - \psi(\alpha_0+1)\right)\)
- Cognitive Risk (CR): Inversely proportional to the total evidence mass:
  \(CR(a_t) = \frac{K}{\sum_{k=1}^{K}(\alpha_k + 1)}\)
- Perceptual Uncertainty (PU): \(PU = EA \times CR\)
The decoding strategy switches dynamically based on a PU threshold:
- Low PU → aggressive (select the top-1 token directly)
- High PU → conservative (sample from the top-2 tokens)
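A minimal sketch of the switch, assuming PU has already been computed for the current step; the threshold value and the exp-weighted top-2 sampling are illustrative choices, and the paper tunes its threshold on a validation set.

```python
import math
import random

def dynamic_decode(logits, pu, pu_threshold=0.1, rng=None):
    """Pick the next token id: greedy when PU is low, top-2 sampling when high.

    pu_threshold is a hypothetical value; the paper searches for it on a
    validation set. Sampling weights use softmax-style exp(logit) masses.
    """
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if pu < pu_threshold:            # confident step: take the top-1 token
        return order[0]
    top2 = order[:2]                 # uncertain step: sample among the top-2
    weights = [math.exp(logits[i]) for i in top2]
    return (rng or random).choices(top2, weights=weights, k=1)[0]
```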
Key Experimental Results¶
Experimental Setup¶
- Dataset: Trillion Dollar Words FOMC dataset (1996–2022), comprising three document types: meeting minutes, press conferences, and speeches.
- Baselines: 10+ models including GPT-4.1, Gemini-2.5-Pro, DeepSeek-R1, Phi-4, FinBERT, and AICBC.
- Backbone: Qwen3-14B fine-tuned with LoRA.
Main Results (All Categories)¶
| Method | Macro F1 | Weighted F1 |
|---|---|---|
| GPT-4.1 (zero-shot) | 0.6662 | 0.6763 |
| AICBC (zero-shot) | 0.6637 | 0.6802 |
| Qwen3-8B (fine-tuned) | 0.6586 | 0.6745 |
| Ours | 0.7327 | 0.7426 |
- Outperforms the strongest baseline by 6.6 percentage points in Macro F1 and 6.2 points in Weighted F1.
- Best performance on meeting minutes: Macro F1 = 0.7449 (+7.4%).
- Speeches: Macro F1 = 0.7291 (+6.7%).
Ablation Study¶
| Configuration | Macro F1 | Weighted F1 |
|---|---|---|
| Full model | 0.7327 | 0.7426 |
| w/o PU | 0.7291 | 0.7378 |
| w/o Transmission Path | 0.6538 | 0.6699 |
| w/o Entity Relations | 0.6397 | 0.6551 |
Transmission path reasoning contributes most (a 7.9-point Macro F1 drop when removed), followed by entity relations, while the PU module contributes more modestly but consistently.
Uncertainty Validation¶
- Low-PU predictions: Macro F1 = 0.7791; high-PU predictions: Macro F1 = 0.2473.
- p-values from t-test, Mann-Whitney U test, and logistic regression are all well below 0.001, indicating strong statistical significance.
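The practical upshot of this validation is a selective-prediction filter: keep low-PU predictions and flag the rest for human review. A small sketch, with hypothetical data and threshold:

```python
def selective_report(predictions, labels, pus, pu_threshold):
    """Split predictions by PU and report coverage plus per-bucket accuracy.

    Illustrates the paper's finding that low-PU predictions are far more
    reliable; the threshold and inputs here are hypothetical.
    """
    low = [(p, y) for p, y, u in zip(predictions, labels, pus) if u < pu_threshold]
    high = [(p, y) for p, y, u in zip(predictions, labels, pus) if u >= pu_threshold]

    def acc(pairs):
        return sum(p == y for p, y in pairs) / len(pairs) if pairs else float("nan")

    return {"coverage": len(low) / len(predictions),
            "low_pu_acc": acc(low),
            "high_pu_acc": acc(high)}
```

In a human-in-the-loop deployment, the `high` bucket would be routed to an analyst rather than acted on automatically.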
Highlights & Insights¶
- Domain reasoning innovation: The first work to formalize monetary policy transmission mechanisms as structured reasoning templates, simulating the analytical workflow of human domain experts.
- Practical PU measure: The decomposition of EA × CR aligns with the classical risk/ambiguity distinction in economics, making it intuitively natural for financial applications.
- High-PU warning mechanism: Enables identification of unreliable predictions, supporting human-in-the-loop decision making.
- Comprehensive superiority over GPT-4.1: Substantially outperforms closed-source large models on both meeting minutes and speeches.
Limitations & Future Work¶
- Performance on press conferences falls below GPT-4.1 (−1.3%), suggesting insufficient capture of dynamic context dependencies in real-time Q&A settings.
- Transmission path construction relies on manually designed templates, limiting automation.
- Validation is limited to FOMC English data; generalization to other central banks (ECB, BoE) or multilingual scenarios remains unexplored.
- The PU threshold requires search on a validation set and must be re-tuned for different datasets.
- The "abstain from answering" strategy in practical deployment has not been explored.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of monetary policy transmission path reasoning and PU quantification constitutes a clear methodological contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive coverage with 10+ baselines, three document types, ablation studies, and statistical testing.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with smooth integration of economics and NLP concepts.
- Value: ⭐⭐⭐⭐ — Meaningfully advances reliability research in financial NLP.