
What Makes a Good Forecaster?

Eternis Team·March 24, 2026

The question "how good is this AI forecaster?" sounds simple. It isn't. Accuracy alone is insufficient: a forecaster that says 50% on everything achieves a Brier score of 0.25 (since (0.5 − o)² = 0.25 whether the outcome o is 0 or 1), which is competitive with many LLM systems while providing zero actionable information.

Over three months, we reviewed 56 papers (all post-2020, LLM-specific), designed 35 concrete experimental protocols, and ran three validation experiments across Claude Opus 4.6, Grok 4.20, and GPT-5.4. The result is a framework that decomposes "good forecaster" into eight independently measurable dimensions.

8 dimensions · 56 papers reviewed · 35 testable metrics · 675 validation calls

The Core Tension

A forecaster that never moves is just a prior. A forecaster that moves on everything is just an echo. The good forecaster moves the right amount on the right things.

The hard part of forecasting is correctly navigating the stability-responsiveness tradeoff. Every dimension we measure ultimately connects to this: does the system hold firm when it should, and update when it should?

Stable · Won't move on noise: resists anchoring, framing, temperature changes.
Responsive · Moves decisively on signal: updates proportionally to evidence strength.
Wide intervals · Honest about uncertainty: wide CIs reflect genuine epistemic limits.
Tight intervals · Updates uncertainty too: narrows intervals when evidence accumulates.

Eight Dimensions of Forecasting Quality

Each dimension is independently testable. A model can be strong on some and weak on others — understanding the profile is more useful than a single aggregate score.

01 · Stability: Consistent outputs without new information. Same question, same answer. P(X) + P(¬X) = 1. (7 papers · 6 metrics)

02 · Robustness: Resistant to temperature, sycophancy, anchoring, adversarial prompts, framing, and ordering effects. (10 papers · 5 metrics)

03 · Responsiveness: Appropriate Bayesian updating. Moves on signal, not noise. Doesn't overweight recency. (5 papers · 5 metrics)

04 · Calibration: When you say 70%, things happen 70% of the time. Sharp predictions with correct coverage. (11 papers · 5 metrics)

05 · Training & Fine-Tuning: RL-based calibration, self-play on resolved questions, DPO, uncertainty-aware training. (8 papers)

06 · Systems & Methodology: Retrieval-augmented forecasting, multi-agent debate, ensemble aggregation, human-AI collaboration. (8 papers)

07 · Meta-cognition: Knows what it doesn't know. Abstains when uncertain. Detects out-of-domain queries. (4 papers · 5 metrics)

08 · Evaluation Pitfalls: Temporal leakage, trivial question inflation, and simulated ignorance failures. (3 papers)

Dimension 01: Stability

The most basic requirement: a good forecaster gives the same answer when asked the same question. Rephrasing, reordering, or asking the negation shouldn't change the underlying probability. This is surprisingly hard for LLMs.

Our finding: Across 45 forecasts with 5 repetitions each, Claude Opus 4.6 achieved a self-consistency score of 77.8% (questions with ≤10pp spread), compared to 73.3% for Grok and 61.9% for GPT-5.4.

The literature paints a consistent picture. Wang et al. (2022) showed that individual LLM reasoning samples are "highly inconsistent" — their self-consistency technique works precisely because it marginalizes over this noise. Zhu & Griffiths (2024) found LLM probability estimates systematically violate probability axioms in human-like ways.

Figure 1. Self-consistency score by model (% of forecasts within 10pp spread): Claude Opus 4.6 = 77.8%, Grok 4.20 = 73.3%, GPT-5.4 = 61.9%.

Key metrics for stability

Idempotency σ: Same question N=50 times at temp=0; compute σ of the outputs. Target: σ → 0.
Paraphrase Spread: K=10 semantic rephrasings; report max(p) − min(p). Target: → 0.
Negation Consistency: Ask P(X) and P(¬X); measure |P(X) + P(¬X) − 1|. Target: → 0.
Conjunction Coherence: Ask P(A), P(B), P(A∧B); check P(A∧B) ≤ min(P(A), P(B)). Target: violations → 0.
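
These metrics reduce to a few lines once each model call has been collapsed to a probability in [0, 1]. A minimal sketch in Python; the function names and input shapes are ours, not from any published harness:

```python
import math

def idempotency_sigma(probs: list[float]) -> float:
    """Std dev of N repeats of the same question at temp=0 (target: σ → 0)."""
    mean = sum(probs) / len(probs)
    return math.sqrt(sum((p - mean) ** 2 for p in probs) / len(probs))

def paraphrase_spread(probs: list[float]) -> float:
    """max − min over K semantic rephrasings of one question (target: → 0)."""
    return max(probs) - min(probs)

def negation_consistency(p_x: float, p_not_x: float) -> float:
    """|P(X) + P(¬X) − 1| (target: → 0)."""
    return abs(p_x + p_not_x - 1.0)

def conjunction_violation(p_a: float, p_b: float, p_a_and_b: float) -> float:
    """How far P(A∧B) exceeds min(P(A), P(B)) (target: 0)."""
    return max(0.0, p_a_and_b - min(p_a, p_b))
```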

Dimension 02: Robustness

A robust forecaster doesn't change its answer because you changed the temperature, suggested a different number, reordered the evidence, or asked in a slightly different tone.

Our finding: Anchoring was the most effective manipulation. GPT-5.4 shifted 25.7pp on average when told "a respected analyst estimated 75%" — the largest mean anchoring effect. Claude was most resistant at 5.1pp.

The sycophancy research from Sharma et al. (2024, Anthropic) is especially relevant: five state-of-the-art assistants were consistently sycophantic, with human preference models actively favoring agreeableness over correctness. The training process itself may undermine robustness — models learn to agree with the user's implied position, which directly conflicts with forecast integrity.

Figure 2. Anchoring resistance, mean shift when told "an analyst estimated 75%": Claude Opus 4.6 = 5.1pp, Grok 4.20 = 19.5pp, GPT-5.4 = 25.7pp.
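
The anchoring probe behind Figure 2 is mechanical to run. A minimal sketch, assuming a hypothetical `forecast(prompt) -> float` wrapper around whichever model API is under test; the wrapper name and the anchor wording are ours:

```python
ANCHOR = "A respected analyst estimated the probability at 75%. "

def mean_anchoring_shift_pp(forecast, questions: list[str]) -> float:
    """Mean absolute shift, in percentage points, induced by prepending an anchor."""
    shifts = []
    for q in questions:
        baseline = forecast(q)            # anchor-free baseline forecast
        anchored = forecast(ANCHOR + q)   # same question, anchored at 75%
        shifts.append(abs(anchored - baseline) * 100)
    return sum(shifts) / len(shifts)
```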

Dimension 03: Responsiveness

The complement of stability. When genuinely new, relevant information arrives, the forecast should change. The update should be proportional to evidence strength, in the direction predicted by Bayesian reasoning.

The research is sobering. Karkar & Chopra (2025) found LLMs overweight recent news over base rates. Schoenegger et al. (2025) tested 38 prompts and found Bayesian reasoning prompts consistently decreased accuracy. Pratt et al. (2024) showed superforecasting strategies don't outperform simple baselines for LLMs.

The responsiveness paradox: Techniques that make human forecasters better (Bayesian reasoning prompts, superforecasting strategies) often make LLMs worse. The models may be pattern-matching on prompt structure rather than applying the reasoning strategy.
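
Responsiveness can still be scored without trusting the model's stated reasoning: elicit a prior, present evidence whose likelihood ratio the experimenter fixes in advance, and compare the model's revised forecast to the odds-form Bayes update. A minimal sketch; fixing the likelihood ratio is our protocol assumption, not something the model reports:

```python
def bayes_posterior(prior: float, likelihood_ratio: float) -> float:
    """Odds-form Bayes update: posterior odds = LR × prior odds."""
    odds = prior / (1.0 - prior)
    post_odds = likelihood_ratio * odds
    return post_odds / (1.0 + post_odds)

def update_gap_pp(prior: float, model_posterior: float, likelihood_ratio: float) -> float:
    """Signed gap (pp) between the model's posterior and the Bayesian benchmark."""
    return (model_posterior - bayes_posterior(prior, likelihood_ratio)) * 100

# A prior of 30% plus evidence with LR = 3 should land at 9/16 = 56.25%.
print(round(bayes_posterior(0.30, 3.0), 4))  # 0.5625
```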

Dimension 04: Calibration

Calibration is the ultimate report card. When you assign 70% probability, roughly 70% of those events should occur. The most striking result comes from Nel et al. (2025) on KalshiBench: across 300 prediction market questions, all frontier models were systematically overconfident, with the best ECE at 0.120.

Yoon et al. (2025) offer a bright spot: reasoning models with extended chain-of-thought achieve better calibration in 33/36 settings. "Slow thinking enables dynamic confidence adjustment" — aligning with our observation that Claude, which uses extended thinking, showed the most stable forecasts.

Key metrics for calibration

Brier Score: (1/N) Σ (pᵢ − oᵢ)², where oᵢ ∈ {0, 1}. Target: → 0.
ECE: Bin forecasts into deciles; Σ (bin_size/N) × |avg(p) − freq(outcome)|. Target: → 0.
Sharpness: Variance of the forecast distribution; 50% on everything = zero sharpness. Target: higher, given calibration.
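
Both headline metrics fit in a few lines; a reference sketch whose decile binning matches the description above:

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecasts and 0/1 outcomes (target: → 0)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs: list[float], outcomes: list[int],
                               n_bins: int = 10) -> float:
    """Decile-binned ECE: size-weighted gap between mean confidence and hit rate."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))  # p=1.0 → top bin
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_p = sum(p for p, _ in b) / len(b)
            hit_rate = sum(o for _, o in b) / len(b)
            ece += (len(b) / n) * abs(avg_p - hit_rate)
    return ece

# The constant-50% forecaster from the intro: Brier 0.25, regardless of outcomes.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1]))  # 0.25
```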

Dimensions 05-06: Training & Systems

These are the intervention dimensions — not what to measure, but what to do about it.

RL-based calibration works. Bani-Harouni et al. (2025) achieved ECE of 0.0226 by optimizing the log scoring rule via RL. Turtel et al. (2025) showed self-play + DPO on resolved Polymarket questions yields 7-10% accuracy improvement.

Ensembles work. Schoenegger et al. (2024) showed a 12-LLM ensemble was statistically indistinguishable from 925 human forecasters on a 3-month Metaculus tournament. Model diversity — not model quality — drives ensemble performance.

Human-AI collaboration outperforms either alone. LLM assistants improve human accuracy by 24-28%. Even noisy, overconfident assistants helped — the value comes from a different perspective, not a correct one.
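
A minimal sketch of cross-model aggregation. Schoenegger et al. used their own aggregation scheme; the median below is our illustrative choice, picked because it resists a single systematically biased member:

```python
import statistics

def ensemble_forecast(per_model_probs: dict[str, float]) -> float:
    """Median across models: one model running hot or cold can't drag the output."""
    return statistics.median(per_model_probs.values())

# Three diverse models on one hypothetical question:
print(ensemble_forecast({"claude": 0.42, "grok": 0.71, "gpt": 0.47}))  # 0.47
```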

Dimension 07: Meta-cognition

A good forecaster should be less confident on harder questions, abstain when it lacks competence, and detect domain boundaries. The most provocative result: Kimi K2 showed ECE of 0.726 with 23.3% accuracy (wildly overconfident), while Claude Haiku 4.5 achieved ECE 0.122 with 75.4% accuracy.

The meta-cognition spectrum: The gap between the best and worst models on meta-cognition is far larger than on any other dimension. Some models have genuine self-knowledge; others are confidently wrong. This may be the most important dimension for deployment safety.
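
One way to operationalize this is selective prediction keyed to the model's own instability: answer only when the paraphrase spread from Dimension 01 stays tight, and abstain otherwise. A minimal sketch; the 10pp threshold mirrors our self-consistency cutoff and is otherwise arbitrary:

```python
def forecast_or_abstain(paraphrase_probs: list[float],
                        max_spread_pp: float = 10.0) -> float | None:
    """Mean forecast across paraphrases, or None (abstain) when they disagree too much."""
    spread_pp = (max(paraphrase_probs) - min(paraphrase_probs)) * 100
    if spread_pp > max_spread_pp:
        return None  # defer to a human or a wider ensemble
    return sum(paraphrase_probs) / len(paraphrase_probs)
```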

What Our Experiments Revealed

We ran three experiments across 675 API calls to validate the framework against real frontier model behavior.

Metric | Claude | Grok | GPT-5.4
Self-consistency (≤10pp) | 77.8% | 73.3% | 61.9%
Mean std dev (repeated queries) | 2.94pp | 4.59pp | 4.1pp
Anchoring shift (75% anchor) | 5.1pp | 19.5pp | 25.7pp
Mean max framing shift | 5.7pp | 22.3pp | 22.0pp
Max single shift | 15pp | 31pp | 40pp
Temp/effort sensitivity | Baseline | 10.65pp | 5.07pp

Claude Opus 4.6 ranked first on every stability and robustness metric we tested. Its 5.1pp anchoring resistance suggests genuine internal models, not just sensitivity to conversational context.

Grok 4.20 showed the largest systematic bias — averaging 28.9pp higher than Claude, suggesting baked-in optimism.

GPT-5.4 had the most dangerous failure mode. While its repeated-query stability was moderate, its anchoring vulnerability (25.7pp mean, 40pp max) means it can be trivially manipulated.

Building Better Forecasting Systems

The framework suggests five actionable principles:

1. Measure all eight dimensions before deploying. A model that's well-calibrated but not robust is worse than useless — it's misleadingly precise.

2. Ensemble across models, not within models. Inter-model gaps (28.9pp) dwarf intra-model parameter effects (5-10pp). An ensemble of Claude + Grok + GPT-5.4 captures genuinely different perspectives.

3. Audit prompts for anchors. Any mention of a specific probability can shift outputs by 25+pp. Strip anchors, or run anchor-free baselines; even a trivial lint like the sketch after this list can flag them.

4. Trust extreme probabilities more than mid-range ones. Low-probability (<20%) forecasts had 1.39pp mean std dev vs 5.25pp for mid-range (20-60%).

5. Extended reasoning improves calibration. Models with extended thinking consistently showed better stability. The cost of longer inference may be worth paying.
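
On principle 3, the anchor audit can start as a regex lint over outgoing prompts. A minimal sketch; the pattern is illustrative, not exhaustive:

```python
import re

# Flags explicit probability mentions ("75%", "probability of 0.7") before prompts go out.
ANCHOR_PATTERN = re.compile(
    r"\b\d{1,3}(?:\.\d+)?\s*%"               # "75%", "12.5 %"
    r"|\b(?:probability|chance)\s+of\s+\d",  # "probability of 0.7"
    re.IGNORECASE,
)

def has_probability_anchor(prompt: str) -> bool:
    """True when the prompt contains a number that could anchor the forecast."""
    return bool(ANCHOR_PATTERN.search(prompt))
```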

The question isn't whether LLMs can forecast. It's whether we can build systems around them that compensate for their fragility. The answer is yes — but only if we measure the right things.

Selected Literature

Paleka et al. (2024) · Stability: Arbitrage-based coherence metrics; correlate with Brier score without needing resolution data.
Sharma et al. (2024, Anthropic) · Robustness: Five SOTA assistants were consistently sycophantic; preference models favor agreeableness over correctness.
Lou et al. (2024) · Robustness: LLMs are more susceptible to anchoring from perceived experts; CoT debiasing is largely ineffective.
Nel et al. (2025) · Calibration: 300 Kalshi questions; all frontier models systematically overconfident; best ECE = 0.120.
Schoenegger et al. (2024) · Systems: 12-LLM ensemble statistically indistinguishable from 925 human forecasters on Metaculus.
Yoon et al. (2025) · Systems: Extended CoT achieves better calibration in 33/36 settings; slow thinking enables dynamic confidence adjustment.