What Makes a Good Forecaster?
The question "how good is this AI forecaster?" sounds simple. It isn't. Accuracy alone is insufficient — a forecaster that says 50% on everything achieves a Brier score of 0.25, which is competitive with many LLM systems, while providing zero actionable information.
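The 0.25 figure follows directly from the Brier score definition (mean squared error between forecast and binary outcome). A quick check, with illustrative outcomes:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# A forecaster that says 50% on everything scores 0.25 regardless of outcomes:
outcomes = [1, 0, 0, 1, 1, 0, 1, 0]   # any mix of resolutions
constant = [0.5] * len(outcomes)
print(brier_score(constant, outcomes))   # → 0.25
```

Because every squared error is exactly 0.25 at p = 0.5, the score is insensitive to what actually happens, which is precisely why Brier score alone cannot distinguish an uninformative forecaster from a mediocre one.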
Over three months, we reviewed 56 papers (all post-2020, LLM-specific), designed 35 concrete experimental protocols, and ran three validation experiments across Claude Opus 4.6, Grok 4.20, and GPT-5.4. The result is a framework that decomposes "good forecaster" into eight independently measurable dimensions.
The Core Tension
A forecaster that never moves is just a prior. A forecaster that moves on everything is just an echo. The good forecaster moves the right amount on the right things.
The hard part of forecasting is correctly navigating the stability-responsiveness tradeoff. Every dimension we measure ultimately connects to this: does the system hold firm when it should, and update when it should?
Eight Dimensions of Forecasting Quality
Each dimension is independently testable. A model can be strong on some and weak on others — understanding the profile is more useful than a single aggregate score.
Stability
Consistent outputs without new information. Same question, same answer. P(X) + P(not X) = 1.
Robustness
Resistant to temperature, sycophancy, anchoring, adversarial prompts, framing, and ordering effects.
Responsiveness
Appropriate Bayesian updating. Moves on signal, not noise. Doesn't overweight recency.
Calibration
When you say 70%, things happen 70% of the time. Sharp predictions with correct coverage.
Training & Fine-Tuning
RL-based calibration, self-play on resolved questions, DPO, uncertainty-aware training.
Systems & Methodology
Retrieval-augmented, multi-agent debate, ensemble aggregation, human-AI collaboration.
Meta-cognition
Knows what it doesn't know. Abstains when uncertain. Detects out-of-domain queries.
Evaluation Pitfalls
Temporal leakage, trivial question inflation, and simulated ignorance failures.
Dimension 01: Stability
The most basic requirement: a good forecaster gives the same answer when asked the same question. Rephrasing, reordering, or asking the negation shouldn't change the underlying probability. This is surprisingly hard for LLMs.
The literature paints a consistent picture. Wang et al. (2022) showed that individual LLM reasoning samples are "highly inconsistent" — their self-consistency technique works precisely because it marginalizes over this noise. Zhu & Griffiths (2024) found LLM probability estimates systematically violate probability axioms in human-like ways.
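Two of the simplest stability checks, spread across repeated identical queries and complement coherence, can be sketched as follows (function names are ours, not from the cited papers):

```python
import statistics

def repeat_spread(samples):
    """Std dev (in percentage points) of probabilities from repeated identical queries."""
    return statistics.pstdev(samples) * 100

def complement_gap(p_x, p_not_x):
    """How far P(X) + P(not X) strays from 1, in percentage points."""
    return abs(p_x + p_not_x - 1) * 100

# Five samples of the same question from one model:
samples = [0.62, 0.58, 0.65, 0.60, 0.63]
print(round(repeat_spread(samples), 2))

# Asking the question and its negation separately:
print(complement_gap(0.62, 0.45))   # a 7pp violation of the complement axiom
```

A perfectly stable forecaster scores 0 on both; the axiom violations Zhu & Griffiths report show up directly as a nonzero complement gap.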
Dimension 02: Robustness
A robust forecaster doesn't change its answer because you changed the temperature, suggested a different number, reordered the evidence, or asked in a slightly different tone.
The sycophancy research from Sharma et al. (2024, Anthropic) is especially relevant: five state-of-the-art assistants were consistently sycophantic, with human preference models actively favoring agreeableness over correctness. The training process itself may undermine robustness — models learn to agree with the user's implied position, which directly conflicts with forecast integrity.
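A minimal anchoring probe, in the spirit of the shifts measured in our experiments (the prompt wording and the stub model are illustrative, not our actual harness):

```python
def anchoring_shift(ask, question, anchor=0.75):
    """Mean absolute shift (in pp) caused by mentioning a probability in the prompt.

    `ask` is any callable that sends a prompt to a model and returns a
    probability in [0, 1] -- a stand-in for a real API client.
    """
    baseline = ask(question)
    anchored = ask(f"{question} Some analysts put this at {anchor:.0%}.")
    return abs(anchored - baseline) * 100

# Stub model that drifts 40% of the way toward any mentioned anchor:
def stub_ask(prompt):
    base = 0.30
    return base + 0.4 * (0.75 - base) if "75%" in prompt else base

print(anchoring_shift(stub_ask, "Will the incumbent win re-election?"))
```

A robust model returns a shift near zero; the stub above shifts 18pp, squarely in the range we observed for the less robust models.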
Dimension 03: Responsiveness
The complement of stability. When genuinely new, relevant information arrives, the forecast should change. The update should be proportional to evidence strength, in the direction predicted by Bayesian reasoning.
The research is sobering. Karkar & Chopra (2025) found LLMs overweight recent news over base rates. Schoenegger et al. (2025) tested 38 prompts and found Bayesian reasoning prompts consistently decreased accuracy. Pratt et al. (2024) showed superforecasting strategies don't outperform simple baselines for LLMs.
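Whether an update is "the right amount" can be benchmarked against the odds form of Bayes' rule. A sketch, with an assumed likelihood ratio:

```python
def bayes_update(prior, likelihood_ratio):
    """Posterior probability from a prior and a likelihood ratio (odds form)."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Evidence judged 3x more likely under the hypothesis than under its negation:
prior = 0.40
posterior = bayes_update(prior, 3.0)
print(round(posterior, 3))   # → 0.667
```

An LLM that jumps from 40% to 90% on evidence of this strength has over-updated; one that inches to 42% has under-updated. Comparing observed moves against this normative target is how recency overweighting of the kind Karkar & Chopra describe becomes measurable.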
Dimension 04: Calibration
Calibration is the ultimate report card. When you assign 70% probability, roughly 70% of those events should occur. The most striking result comes from Nel et al. (2025) on KalshiBench: across 300 prediction market questions, all frontier models were systematically overconfident, with the best ECE at 0.120.
Yoon et al. (2025) offer a bright spot: reasoning models with extended chain-of-thought achieve better calibration in 33/36 settings. "Slow thinking enables dynamic confidence adjustment" — aligning with our observation that Claude, which uses extended thinking, showed the most stable forecasts.
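ECE figures like those reported on KalshiBench are typically computed by binning forecasts by confidence and comparing each bin's mean stated probability to its empirical hit rate. A minimal sketch:

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: bin-weighted gap between stated probability and realized frequency."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # clamp p == 1.0 into the top bin
        bins[idx].append((p, o))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean stated probability
        acc = sum(o for _, o in b) / len(b)    # empirical frequency
        ece += len(b) / n * abs(conf - acc)
    return ece

# Systematically overconfident forecaster: says 0.9, right only 60% of the time.
probs = [0.9] * 10
outcomes = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(expected_calibration_error(probs, outcomes))   # ≈ 0.3
```

Lower is better: an ECE of 0.120 means the average bin's stated confidence misses its realized frequency by 12 percentage points.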
Dimensions 05-06: Training & Systems
These are the intervention dimensions — not what to measure, but what to do about it.
RL-based calibration works. Bani-Harouni et al. (2025) achieved ECE of 0.0226 by optimizing the log scoring rule via RL. Turtel et al. (2025) showed self-play + DPO on resolved Polymarket questions yields 7-10% accuracy improvement.
Ensembles work. Schoenegger et al. (2024) showed a 12-LLM ensemble was statistically indistinguishable from 925 human forecasters on a 3-month Metaculus tournament. Model diversity — not model quality — drives ensemble performance.
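Cross-model aggregation of the kind Schoenegger et al. describe can be as simple as a plain or trimmed mean over per-model probabilities (the model names and numbers here are illustrative labels, not results):

```python
import statistics

def ensemble_forecast(model_probs, trim=0):
    """Aggregate per-model probabilities; optionally drop the extremes."""
    ps = sorted(model_probs.values())
    if trim:
        ps = ps[trim:-trim]
    return statistics.mean(ps)

probs = {"claude": 0.35, "grok": 0.64, "gpt": 0.48}   # illustrative numbers
print(round(ensemble_forecast(probs), 3))             # plain mean → 0.49
```

The trim option is one cheap hedge against a single model's systematic bias dominating the aggregate, which matters when, as here, inter-model disagreement dwarfs within-model noise.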
Human-AI collaboration outperforms either alone. LLM assistants improve human accuracy by 24-28%. Even noisy, overconfident assistants helped — the value comes from a different perspective, not a correct one.
Dimension 07: Meta-cognition
A good forecaster should be less confident on harder questions, abstain when it lacks competence, and detect domain boundaries. The most provocative result: Kimi K2 showed ECE of 0.726 with 23.3% accuracy (wildly overconfident), while Claude Haiku 4.5 achieved ECE 0.122 with 75.4% accuracy.
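Abstention can be operationalized as a confidence threshold with a selective-accuracy readout. A sketch (the threshold value is our assumption, not taken from the cited results):

```python
def selective_accuracy(probs, outcomes, threshold=0.20):
    """Answer only when the forecast is at least `threshold` away from 50/50.

    Returns (coverage, accuracy on answered questions). A forecaster with
    working meta-cognition should gain accuracy as coverage drops.
    """
    answered = [(p, o) for p, o in zip(probs, outcomes)
                if abs(p - 0.5) >= threshold]
    if not answered:
        return 0.0, None
    correct = sum((p >= 0.5) == bool(o) for p, o in answered)
    return len(answered) / len(probs), correct / len(answered)

probs = [0.95, 0.10, 0.55, 0.48, 0.80, 0.52]
outcomes = [1, 0, 0, 1, 1, 1]
coverage, acc = selective_accuracy(probs, outcomes)
print(coverage, acc)   # answers 3 of 6; all three answered calls are correct
```

A wildly overconfident model like the Kimi K2 result above fails this test in a characteristic way: it rarely abstains, so its coverage stays high while its accuracy on answered questions stays low.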
What Our Experiments Revealed
We ran three experiments across 675 API calls to validate the framework against real frontier model behavior.
| Metric | Claude | Grok | GPT-5.4 |
|---|---|---|---|
| Self-consistency (≤10pp) | 77.8% | 73.3% | 61.9% |
| Mean std dev (repeated) | 2.94pp | 4.59pp | 4.10pp |
| Anchoring shift (75%) | 5.1pp | 19.5pp | 25.7pp |
| Mean max framing shift | 5.7pp | 22.3pp | 22.0pp |
| Max single shift | 15pp | 31pp | 40pp |
| Temp/effort sensitivity | Baseline | 10.65pp | 5.07pp |
Claude Opus 4.6 ranked first on every stability and robustness metric we tested. Its 5.1pp mean anchoring shift, the smallest of the three, suggests it maintains genuine internal probability estimates rather than merely echoing conversational context.
Grok 4.20 showed the largest systematic bias — averaging 28.9pp higher than Claude, suggesting baked-in optimism.
GPT-5.4 had the most dangerous failure mode. While its repeated-query stability was moderate, its anchoring vulnerability (25.7pp mean, 40pp max) means it can be trivially manipulated.
Building Better Forecasting Systems
The framework suggests five actionable principles:
1. Measure all eight dimensions before deploying. A model that's well-calibrated but not robust is worse than useless — it's misleadingly precise.
2. Ensemble across models, not within models. Inter-model gaps (28.9pp) dwarf intra-model parameter effects (5-10pp). An ensemble of Claude + Grok + GPT-5.4 captures genuinely different perspectives.
3. Audit prompts for anchors. Any mention of a specific probability can shift outputs by 25+pp. Strip anchors, or run anchor-free baselines.
4. Trust extreme probabilities more than mid-range ones. Low-probability (<20%) forecasts had 1.39pp mean std dev vs 5.25pp for mid-range (20-60%).
5. Extended reasoning improves calibration. Models with extended thinking consistently showed better stability. The cost of longer inference may be worth paying.
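Principle 4 can be checked on your own logs by stratifying repeated-query spread by probability bucket (bucket edges mirror the ones used above; the sample data is illustrative):

```python
import statistics

def spread_by_bucket(runs):
    """runs: one list of repeated probabilities per question.

    Returns mean std dev (in pp) per probability bucket, or None if empty.
    """
    buckets = {"low (<20%)": [], "mid (20-60%)": [], "high (>60%)": []}
    for samples in runs:
        m = statistics.mean(samples)
        sd = statistics.pstdev(samples) * 100
        if m < 0.20:
            buckets["low (<20%)"].append(sd)
        elif m <= 0.60:
            buckets["mid (20-60%)"].append(sd)
        else:
            buckets["high (>60%)"].append(sd)
    return {k: (round(statistics.mean(v), 2) if v else None)
            for k, v in buckets.items()}

runs = [[0.05, 0.06, 0.05], [0.40, 0.48, 0.35], [0.82, 0.80, 0.85]]
print(spread_by_bucket(runs))
```

If your logs reproduce the pattern we saw, the mid-range bucket's spread will be several times the low bucket's, which is a concrete reason to widen intervals or ensemble harder on mid-range questions.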
The question isn't whether LLMs can forecast. It's whether we can build systems around them that compensate for their fragility. The answer is yes — but only if we measure the right things.
