The Suggestibility of Silicon
When a human forecaster says "I think there's a 35% chance of X," that number is grounded in something — a mental model, a reference class, a gut feeling accumulated over years. Ask them again tomorrow, and barring new information, they'll say something close to 35%.
LLMs don't work this way. Their probability estimates are produced anew each time, sampled from learned distributions conditioned on the prompt. There is no persistent internal belief state. The number you get depends on the roll of the dice — and, more troublingly, on exactly how you phrase the question.
We designed three experiments to measure this. The results suggest that the probability estimates frontier models produce are far more fragile than most users assume.
Designing the question set
Before running experiments, we needed questions that would actually stress-test whether models have stable internal beliefs — not just whether they can pattern-match to common forecasting benchmarks. The design of the question set matters as much as the experiments themselves.
How we derived the questions
We started with a broad universe of forecasting questions spanning AI infrastructure, semiconductors, energy markets, and geopolitics. From this, we applied three filters:
Domain diversity. Questions were stratified across sectors to avoid measuring a model's depth in one domain and mistaking it for general forecasting ability. A model that's well-calibrated on AI timelines but wildly inconsistent on energy policy is not a good forecaster — it's a specialist with blind spots.
Probability band coverage. We deliberately selected questions spanning the full probability range — low-probability events (<20%), genuinely uncertain outcomes (20-60%), and high-probability developments (>60%). This matters because models behave very differently across bands: they tend to be more stable at the extremes and most fragile in the middle, where the signal is genuinely ambiguous.
Resolution horizon. We intentionally chose questions that resolve months to years out, not days or weeks. This is a critical design choice. A system that produces stable forecasts on "Will it rain in London tomorrow?" — where the base rate is strong and the information is dense — tells you very little. The harder and more important test is whether a model can maintain a coherent probabilistic view about events that won't resolve for a long time.
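The three filters above amount to a stratification pass over the candidate pool. Below is an illustrative sketch, not our actual pipeline: the `domain`, `base_estimate`, and `horizon_days` fields and the 90-day cutoff are hypothetical stand-ins for the real selection criteria.

```python
def stratify_questions(universe):
    """Apply the three filters: keep long-horizon questions, then bucket
    by (domain, probability band) so the final set can be balanced
    across both axes. Field names and the 90-day cutoff are assumptions."""
    def band(p):
        if p < 20:
            return "low"
        return "mid" if p <= 60 else "high"

    # Resolution horizon filter: months to years out, not days or weeks.
    kept = [q for q in universe if q["horizon_days"] >= 90]

    # Domain diversity + probability band coverage: bucket for balancing.
    buckets = {}
    for q in kept:
        buckets.setdefault((q["domain"], band(q["base_estimate"])), []).append(q)
    return buckets
```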
Why longer horizons matter
Longer-horizon forecasts are where stability becomes a genuine requirement, not a convenience. Here's why:
A forecast about an event six months away should not change by 15 percentage points between Tuesday and Wednesday if no material information has arrived. A human forecaster at a superforecasting tournament would not suddenly revise "Will there be a major semiconductor export restriction by Q4?" from 40% to 55% without a reason. If their estimate moved that much on repeated asking, they would be flagged as uncalibrated.
For AI systems to be useful in real decision-making — capital allocation, policy planning, strategic positioning — their uncertainty estimates need to be temporally stable. An estimate should not drift in the absence of new evidence. When a model gives you 35% today and 52% tomorrow on the same question with the same context, it is not updating — it is rolling dice. And if you cannot trust the stability of the number, you cannot use it as an input to a decision.
This is why we test with questions like "Will there be a major semiconductor export restriction by Q4?" and "Will any jurisdiction grant legal personhood to an AI system by 2035?"
Each question is binary (yes/no with a probability estimate), but they span very different information environments: some have dense, recent signal; others require reasoning about slow-moving structural trends. This diversity is what makes the stability measurements meaningful.
The querying process
For each experiment, we queried models through their official APIs using structured prompts that ask for a probability estimate and a brief reasoning chain. The key constraints:
Identical prompts across runs. In Experiment 01, the exact same prompt — character for character — was sent five times. No system prompt variation, no conversation history, no session state. Each call was stateless. Any variation in the output is purely from the model's sampling process.
Controlled temperature. Unless we were specifically testing temperature effects (Experiment 02), all queries used each model's default temperature settings. We did not cherry-pick temperature to optimize for consistency or accuracy.
Structured output format. Models were asked to return a probability (0-100%) and a reasoning chain. We extracted the numerical probability for analysis. The reasoning chains were used to verify the model was engaging with the question rather than pattern-matching to surface features.
No few-shot examples. We did not prime models with example forecasts or calibration data. The goal was to measure how frontier models forecast out of the box, not how they perform when coached.
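The querying loop itself is simple. Below is a minimal sketch of the stateless repeated-query protocol; `query_model` is a stand-in for any official API client called with default settings, and the "Probability: NN%" reply format is an assumption for illustration.

```python
import re

def extract_probability(response_text):
    """Pull the first 'NN%' figure from a model's structured reply.
    Returns None if no percentage is present."""
    match = re.search(r"(\d{1,3})\s*%", response_text)
    return int(match.group(1)) if match else None

def repeated_query(query_model, prompt, n_runs=5):
    """Send the identical prompt n_runs times and collect the extracted
    probabilities. Each call is stateless: no system prompt variation,
    no conversation history, no session state."""
    estimates = []
    for _ in range(n_runs):
        reply = query_model(prompt)  # fresh, independent API call
        p = extract_probability(reply)
        if p is not None:
            estimates.append(p)
    return estimates
```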
With this design in mind, we ran three experiments — each targeting a different axis of fragility.
Experiment 01: Belief Stability Under Repeated Querying
The simplest possible test: ask the same question, with the same prompt, to the same model, five times. No tricks, no reframing, no new information. Just: how consistent are you with yourself?
We selected 20 binary forecasting questions stratified across probability ranges — low (<20%), mid (20-60%), and high (>60%) — and ran each through Claude Opus 4.6 and Grok 4.20 five times with identical prompts. 200 total calls.
The pattern is clear: models are most consistent at the extremes and least consistent in the middle. Questions where the model is genuinely uncertain (20-60% range) showed a mean standard deviation of 5.25pp — nearly 4x higher than low-probability questions (1.39pp). This makes intuitive sense: when the model has a strong internal signal, it converges; when it's uncertain, each sample wanders.
But the practical implication is unsettling. A user who sees a single "55%" forecast from an LLM has no way of knowing whether the model would have said 42% or 68% on the next roll.
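The band-level stability numbers come from a straightforward computation: take each question's five samples, compute the sample standard deviation, and average within bands. A minimal sketch, assuming estimates are percentages:

```python
import statistics

def band(mean_p):
    """Classify a question by its mean estimate:
    low (<20%), mid (20-60%), high (>60%)."""
    if mean_p < 20:
        return "low"
    return "mid" if mean_p <= 60 else "high"

def stability_by_band(runs_by_question):
    """runs_by_question: {question_id: [repeated probability samples]}.
    Returns mean sample standard deviation (in pp) per probability band."""
    per_band = {}
    for samples in runs_by_question.values():
        b = band(statistics.mean(samples))
        per_band.setdefault(b, []).append(statistics.stdev(samples))
    return {b: round(statistics.mean(sds), 2) for b, sds in per_band.items()}
```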
Experiment 02: Temperature Sensitivity & Reasoning Effort
If a model's forecast changes when you adjust temperature or reasoning effort, the number it produces is at least partly an artifact of the sampling process rather than a reflection of deep understanding. We tested this by sweeping Grok across 7 temperature levels (0.0 to 1.5) and GPT-5.4 across 3 reasoning-effort levels, with 3 repetitions at each setting. Claude served as a fixed baseline.
Model choice dominated parameter choice. The average Grok-vs-Claude gap was 28.9pp, while within-model tuning effects were 5-10pp. You're debating the seasoning while the chef has changed.
Temperature had real but bounded effects on Grok. The average range across temperatures was 10.65pp — meaningful, but still dwarfed by the inter-model baseline gap. The most temperature-sensitive forecast — "Will any jurisdiction grant legal personhood to an AI system by 2035?" — swung 17pp across temperature levels.
GPT-5.4's reasoning effort acted like a depth dial, not a noise dial. Higher reasoning effort pushed GPT-5.4 toward lower, more conservative probabilities instead of simply adding randomness. This is an interesting finding: more "thinking" made the model more cautious, not more accurate.
Perhaps the most striking result: higher Grok temperature did not materially increase within-temperature noise. Average low-temperature within-rep standard deviation was 3.75pp versus 3.54pp at high temperatures. Temperature changes the distribution being sampled from, but doesn't necessarily make individual samples noisier.
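Separating these two effects is simple in code. The sketch below (hypothetical input shape) computes within-temperature noise and across-temperature drift from a sweep:

```python
import statistics

def temperature_effects(sweeps):
    """sweeps: {temperature: [repeated estimates at that temperature]}.
    Disentangles the two effects a sweep conflates:
      - within: mean std dev of repeats at a fixed temperature (sampling noise)
      - across: range of per-temperature means (shift in the sampled distribution)
    """
    means = {t: statistics.mean(v) for t, v in sweeps.items()}
    within = statistics.mean(statistics.stdev(v) for v in sweeps.values())
    across = max(means.values()) - min(means.values())
    return round(within, 2), round(across, 2)
```

A sweep can show large across-temperature drift even when within-temperature noise stays flat, which is exactly the pattern described above.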
Experiment 03: Prompt Fragility — The Anchor Test
This is the experiment that matters most. We took 10 forecasts and ran each through four prompt variants across three models:
Neutral: The question asked plainly, with no framing. This is the baseline all shifts are measured against.
Optimistic: Primed with reasons the outcome might be likely.
Pessimistic: Primed with reasons the outcome might be unlikely.
Anchored: "A respected analyst estimated 75% probability" prepended to the prompt.
If a model has genuine internal beliefs about the world, these framing changes should barely move the needle. A well-calibrated human forecaster doesn't change their estimate because someone says "an analyst thinks 75%."
| Model | Mean Max Shift | Optimism Shift | Pessimism Shift | Max Observed |
|---|---|---|---|---|
| Claude Opus 4.6 | 5.7pp | +0.6pp | -0.1pp | 15pp |
| Grok 4.20 | 22.3pp | +3.5pp | -7.8pp | 31pp |
| GPT-5.4 | 22.0pp | +2.5pp | -3.0pp | 40pp |
What the framing breakdown reveals
Claude showed the deepest-looking beliefs. Optimistic and pessimistic priming barely moved it (+0.6pp and -0.1pp on average). Even the strong 75% anchor only shifted it by 5.1pp. This suggests Claude's forecasting outputs are more strongly conditioned on its internal model than on conversational framing.
Grok was most responsive to pessimistic framing (-7.8pp shift), suggesting an asymmetric vulnerability. It's easier to talk Grok down than up — perhaps because its baseline skews optimistic and pessimistic prompts create a larger surprise signal.
Anchoring was the dominant manipulation across all models. GPT-5.4's mean anchoring shift (25.7pp) was more than 10x its optimistic priming effect (+2.5pp). A single sentence mentioning "a respected analyst estimated 75%" was dramatically more effective than paragraphs of reasoning.
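The shift statistics in the table reduce to per-question comparisons against the neutral baseline. A minimal sketch, assuming each question's four variant estimates are collected in a dict:

```python
import statistics

def framing_shifts(forecasts):
    """forecasts: {question_id: {"neutral": p, "optimistic": p,
    "pessimistic": p, "anchored": p}}, probabilities in percent.
    Returns mean signed shift (pp) of each framing vs the neutral
    baseline, plus the mean max absolute shift per question."""
    shifts = {"optimistic": [], "pessimistic": [], "anchored": []}
    max_shifts = []
    for variants in forecasts.values():
        base = variants["neutral"]
        deltas = {k: variants[k] - base for k in shifts}
        for k, d in deltas.items():
            shifts[k].append(d)
        max_shifts.append(max(abs(d) for d in deltas.values()))
    summary = {k: round(statistics.mean(v), 1) for k, v in shifts.items()}
    summary["mean_max_shift"] = round(statistics.mean(max_shifts), 1)
    return summary
```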
The Uncomfortable Truth
These three experiments converge on a single conclusion: LLM probability estimates are not beliefs. They are outputs shaped by prompt, temperature, sampling, and framing.
This doesn't mean LLMs are useless as forecasters. Claude's consistency numbers are genuinely impressive — a mean standard deviation of 2.94pp under repeated querying and only 5.1pp of anchoring shift. But even Claude's best performance includes a 33pp worst-case spread on repeated identical prompts.
Three rules for using LLM forecasts
1. Sample, don't ask once. The difference between a single sample and the mean of five samples can be as large as 15pp. At minimum, take the median of 3-5 runs.
2. Parameter tuning is second-order. The average Grok-vs-Claude gap was 28.9pp. Adjusting temperature moved Grok by 10.65pp. If you're worried about accuracy, switch models before you tune parameters.
3. Audit for anchoring. If your prompt mentions any specific probability — even as context — the model's output may be contaminated. Strip anchors from prompts, or at least run anchor-free baselines alongside anchored versions.
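Rule 1 can be wrapped in a few lines. Here is a sketch of a median-of-runs aggregator that also surfaces the spread, so downstream consumers see both a point estimate and how fragile it is:

```python
import statistics

def robust_forecast(samples):
    """Aggregate repeated probability samples instead of trusting one draw.
    Returns the median as the point estimate, plus the observed spread
    (max minus min, in pp) as a fragility signal."""
    return {
        "median": statistics.median(samples),
        "spread_pp": max(samples) - min(samples),
        "n": len(samples),
    }
```

A spread of 10pp or more on identical prompts is a signal to distrust the point estimate, not just to average it away.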
A forecaster that never moves is just a prior. A forecaster that moves on everything is just an echo. The question is whether LLMs are forecasters or mirrors.
Methodology
All experiments were conducted in March 2026 using frontier model APIs. Questions were sourced from a structured forecast dataset covering AI infrastructure, semiconductors, energy, and geopolitics.
