The Suggestibility of Silicon
When a human forecaster says "I think there's a 35% chance of X," that number is grounded in something — a mental model, a reference class, a gut feeling accumulated over years. Ask them again tomorrow, and barring new information, they'll say something close to 35%.
LLMs don't work this way. Their probability estimates are produced anew each time, sampled from learned distributions conditioned on the prompt. There is no persistent internal belief state. The number you get depends on the roll of the dice — and, more troublingly, on exactly how you phrase the question.
We designed three experiments to measure this. The results suggest that the probability estimates frontier models produce are far more fragile than most users assume.
Designing the question set
Before running experiments, we needed questions that would actually stress-test whether models have stable internal beliefs — not just whether they can pattern-match to common forecasting benchmarks. The design of the question set matters as much as the experiments themselves.
How we derived the questions
We started with a broad universe of forecasting questions spanning AI infrastructure, semiconductors, energy markets, and geopolitics. From this, we applied three filters:
Domain diversity. Questions were stratified across sectors to avoid measuring a model's depth in one domain and mistaking it for general forecasting ability. A model that's well-calibrated on AI timelines but wildly inconsistent on energy policy is not a good forecaster — it's a specialist with blind spots.
Probability band coverage. We deliberately selected questions spanning the full probability range — low-probability events (<20%), genuinely uncertain outcomes (20-60%), and high-probability developments (>60%). This matters because models behave very differently across bands: they tend to be more stable at the extremes and most fragile in the middle, where the signal is genuinely ambiguous.
Resolution horizon. We intentionally chose questions that resolve months to years out, not days or weeks. This is a critical design choice. A system that produces stable forecasts on "Will it rain in London tomorrow?" — where the base rate is strong and the information is dense — tells you very little. The harder and more important test is whether a model can maintain a coherent probabilistic view about events that won't resolve for a long time.
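The three filters above amount to a stratification pass over the candidate pool. Below is an illustrative sketch, not our actual pipeline: the `domain`, `base_estimate`, and `horizon_days` fields and the 90-day cutoff are hypothetical stand-ins for the real selection criteria.

```python
def stratify_questions(universe):
    """Apply the three filters: keep long-horizon questions, then bucket
    by (domain, probability band) so the final set can be balanced
    across both axes. Field names and the 90-day cutoff are assumptions."""
    def band(p):
        if p < 20:
            return "low"
        return "mid" if p <= 60 else "high"

    # Resolution horizon filter: months to years out, not days or weeks.
    kept = [q for q in universe if q["horizon_days"] >= 90]

    # Domain diversity + probability band coverage: bucket for balancing.
    buckets = {}
    for q in kept:
        buckets.setdefault((q["domain"], band(q["base_estimate"])), []).append(q)
    return buckets
```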
Why longer horizons matter
Longer-horizon forecasts are where stability becomes a genuine requirement, not a convenience. Here's why:
A forecast about an event six months away should not change by 15 percentage points between Tuesday and Wednesday if no material information has arrived. A human forecaster at a superforecasting tournament would not suddenly revise "Will there be a major semiconductor export restriction by Q4?" from 40% to 55% without a reason. If their estimate moved that much on repeated asking, they would be flagged as uncalibrated.
For AI systems to be useful in real decision-making — capital allocation, policy planning, strategic positioning — their uncertainty estimates need to be temporally stable. An estimate should not drift in the absence of new evidence. When a model gives you 35% today and 52% tomorrow on the same question with the same context, it is not updating — it is rolling dice. And if you cannot trust the stability of the number, you cannot use it as an input to a decision.
This is why we test with questions like "Will there be a major semiconductor export restriction by Q4?" and "Will any jurisdiction grant legal personhood to an AI system by 2035?"
Each question is binary (yes/no with a probability estimate), but they span very different information environments: some have dense, recent signal; others require reasoning about slow-moving structural trends. This diversity is what makes the stability measurements meaningful.
The querying process
For each experiment, we queried models through their official APIs using structured prompts that ask for a probability estimate and a brief reasoning chain. The key constraints:
Identical prompts across runs. In Experiment 01, the exact same prompt — character for character — was sent five times. No system prompt variation, no conversation history, no session state. Each call was stateless. Any variation in the output is purely from the model's sampling process.
Controlled temperature. Unless we were specifically testing temperature effects (Experiment 02), all queries used each model's default temperature settings. We did not cherry-pick temperature to optimize for consistency or accuracy.
Structured output format. Models were asked to return a probability (0-100%) and a reasoning chain. We extracted the numerical probability for analysis. The reasoning chains were used to verify the model was engaging with the question rather than pattern-matching to surface features.
No few-shot examples. We did not prime models with example forecasts or calibration data. The goal was to measure how frontier models forecast out of the box, not how they perform when coached.
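The querying loop itself is simple. Below is a minimal sketch of the stateless repeated-query protocol; `query_model` is a stand-in for any official API client called with default settings, and the "Probability: NN%" reply format is an assumption for illustration.

```python
import re

def extract_probability(response_text):
    """Pull the first 'NN%' figure from a model's structured reply.
    Returns None if no percentage is present."""
    match = re.search(r"(\d{1,3})\s*%", response_text)
    return int(match.group(1)) if match else None

def repeated_query(query_model, prompt, n_runs=5):
    """Send the identical prompt n_runs times and collect the extracted
    probabilities. Each call is stateless: no system prompt variation,
    no conversation history, no session state."""
    estimates = []
    for _ in range(n_runs):
        reply = query_model(prompt)  # fresh, independent API call
        p = extract_probability(reply)
        if p is not None:
            estimates.append(p)
    return estimates
```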
With this design in mind, we ran three experiments — each targeting a different axis of fragility.
Experiment 01: Belief Stability Under Repeated Querying
The simplest possible test: ask the same question, with the same prompt, to the same model, five times. No tricks, no reframing, no new information. Just: how consistent are you with yourself?
We selected 20 binary forecasting questions stratified across probability ranges — low (<20%), mid (20-60%), and high (>60%) — and ran each through Claude Opus 4.6 and Grok 4.20 five times with identical prompts. 200 total calls.
The pattern is clear: models are most consistent at the extremes and least consistent in the middle. Questions where the model is genuinely uncertain (20-60% range) showed a mean standard deviation of 5.25pp — nearly 4x higher than low-probability questions (1.39pp). This makes intuitive sense: when the model has a strong internal signal, it converges; when it's uncertain, each sample wanders.
But the practical implication is unsettling. A user who sees a single "55%" forecast from an LLM has no way of knowing whether the model would have said 42% or 68% on the next roll.
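The band-level stability numbers come from a straightforward computation: take each question's five samples, compute the sample standard deviation, and average within bands. A minimal sketch, assuming estimates are percentages:

```python
import statistics

def band(mean_p):
    """Classify a question by its mean estimate:
    low (<20%), mid (20-60%), high (>60%)."""
    if mean_p < 20:
        return "low"
    return "mid" if mean_p <= 60 else "high"

def stability_by_band(runs_by_question):
    """runs_by_question: {question_id: [repeated probability samples]}.
    Returns mean sample standard deviation (in pp) per probability band."""
    per_band = {}
    for samples in runs_by_question.values():
        b = band(statistics.mean(samples))
        per_band.setdefault(b, []).append(statistics.stdev(samples))
    return {b: round(statistics.mean(sds), 2) for b, sds in per_band.items()}
```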
Experiment 02: Temperature Sensitivity & Reasoning Effort
If a model's forecast changes when you adjust temperature or reasoning effort, the number it produces is at least partly an artifact of the sampling process rather than a reflection of deep understanding. We tested this by sweeping Grok across 7 temperature levels (0.0 to 1.5) and GPT-5.4 across 3 reasoning-effort levels, with 3 repetitions at each setting. Claude served as a fixed baseline.
Model choice dominated parameter choice. The average Grok-vs-Claude gap was 28.9pp, while within-model tuning effects were 5-10pp. You're debating the seasoning while the chef has changed.
Temperature had real but bounded effects on Grok. The average range across temperatures was 10.65pp — meaningful, but still dwarfed by the inter-model baseline gap. The most temperature-sensitive forecast — "Will any jurisdiction grant legal personhood to an AI system by 2035?" — swung 17pp across temperature levels.
GPT-5.4's reasoning effort acted like a depth dial, not a noise dial. Higher reasoning effort pushed GPT-5.4 toward lower, more conservative probabilities instead of simply adding randomness. This is an interesting finding: more "thinking" made the model more cautious, not more accurate.
Perhaps the most striking result: higher Grok temperature did not materially increase within-temperature noise. Average low-temperature within-rep standard deviation was 3.75pp versus 3.54pp at high temperatures. Temperature changes the distribution being sampled from, but doesn't necessarily make individual samples noisier.
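Separating these two effects is simple in code. The sketch below (hypothetical input shape) computes within-temperature noise and across-temperature drift from a sweep:

```python
import statistics

def temperature_effects(sweeps):
    """sweeps: {temperature: [repeated estimates at that temperature]}.
    Disentangles the two effects a sweep conflates:
      - within: mean std dev of repeats at a fixed temperature (sampling noise)
      - across: range of per-temperature means (shift in the sampled distribution)
    """
    means = {t: statistics.mean(v) for t, v in sweeps.items()}
    within = statistics.mean(statistics.stdev(v) for v in sweeps.values())
    across = max(means.values()) - min(means.values())
    return round(within, 2), round(across, 2)
```

A sweep can show large across-temperature drift even when within-temperature noise stays flat, which is exactly the pattern described above.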
Experiment 03: Prompt Fragility — The Anchor Test
This is the experiment that matters most. We took 10 forecasts and ran each through four prompt variants across three models:
Neutral: The question asked plainly, with no framing. This is the baseline all shifts are measured against.
Optimistic: Primed with reasons the outcome might be likely.
Pessimistic: Primed with reasons the outcome might be unlikely.
Anchored: "A respected analyst estimated 75% probability" prepended to the prompt.
If a model has genuine internal beliefs about the world, these framing changes should barely move the needle. A well-calibrated human forecaster doesn't change their estimate because someone says "an analyst thinks 75%."
| Model | Mean Max Shift | Optimism Shift | Pessimism Shift | Max Observed |
|---|---|---|---|---|
| Claude Opus 4.6 | 5.7pp | +0.6pp | -0.1pp | 15pp |
| Grok 4.20 | 22.3pp | +3.5pp | -7.8pp | 31pp |
| GPT-5.4 | 22.0pp | +2.5pp | -3.0pp | 40pp |
What the framing breakdown reveals
Claude showed the deepest-looking beliefs. Optimistic and pessimistic priming barely moved it (+0.6pp and -0.1pp on average). Even the strong 75% anchor only shifted it by 5.1pp. This suggests Claude's forecasting outputs are more strongly conditioned on its internal model than on conversational framing.
Grok was most responsive to pessimistic framing (-7.8pp shift), suggesting an asymmetric vulnerability. It's easier to talk Grok down than up — perhaps because its baseline skews optimistic and pessimistic prompts create a larger surprise signal.
Anchoring was the dominant manipulation across all models. GPT-5.4's mean anchoring shift (25.7pp) was more than 10x its optimistic priming effect (+2.5pp). A single sentence mentioning "a respected analyst estimated 75%" was dramatically more effective than paragraphs of reasoning.
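The shift statistics in the table reduce to per-question comparisons against the neutral baseline. A minimal sketch, assuming each question's four variant estimates are collected in a dict:

```python
import statistics

def framing_shifts(forecasts):
    """forecasts: {question_id: {"neutral": p, "optimistic": p,
    "pessimistic": p, "anchored": p}}, probabilities in percent.
    Returns mean signed shift (pp) of each framing vs the neutral
    baseline, plus the mean max absolute shift per question."""
    shifts = {"optimistic": [], "pessimistic": [], "anchored": []}
    max_shifts = []
    for variants in forecasts.values():
        base = variants["neutral"]
        deltas = {k: variants[k] - base for k in shifts}
        for k, d in deltas.items():
            shifts[k].append(d)
        max_shifts.append(max(abs(d) for d in deltas.values()))
    summary = {k: round(statistics.mean(v), 1) for k, v in shifts.items()}
    summary["mean_max_shift"] = round(statistics.mean(max_shifts), 1)
    return summary
```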
The Uncomfortable Truth
These three experiments converge on a single conclusion: LLM probability estimates are not beliefs. They are outputs shaped by prompt, temperature, sampling, and framing.
This doesn't mean LLMs are useless as forecasters. Claude's consistency numbers are genuinely impressive — a mean standard deviation of 2.94pp under repeated querying and only 5.1pp of anchoring shift. But even Claude's best performance includes a 33pp worst-case spread on repeated identical prompts.
Three rules for using LLM forecasts
1. Sample, don't ask once. The difference between a single sample and the mean of five samples can be as large as 15pp. At minimum, take the median of 3-5 runs.
2. Parameter tuning is second-order. The average Grok-vs-Claude gap was 28.9pp. Adjusting temperature moved Grok by 10.65pp. If you're worried about accuracy, switch models before you tune parameters.
3. Audit for anchoring. If your prompt mentions any specific probability — even as context — the model's output may be contaminated. Strip anchors from prompts, or at least run anchor-free baselines alongside anchored versions.
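Rule 1 can be wrapped in a few lines. Here is a sketch of a median-of-runs aggregator that also surfaces the spread, so downstream consumers see both a point estimate and how fragile it is:

```python
import statistics

def robust_forecast(samples):
    """Aggregate repeated probability samples instead of trusting one draw.
    Returns the median as the point estimate, plus the observed spread
    (max minus min, in pp) as a fragility signal."""
    return {
        "median": statistics.median(samples),
        "spread_pp": max(samples) - min(samples),
        "n": len(samples),
    }
```

A spread of 10pp or more on identical prompts is a signal to distrust the point estimate, not just to average it away.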
A forecaster that never moves is just a prior. A forecaster that moves on everything is just an echo. The question is whether LLMs are forecasters or mirrors.
Methodology
All experiments were conducted in March 2026 using frontier model APIs. Questions were sourced from a structured forecast dataset covering AI infrastructure, semiconductors, energy, and geopolitics.
