
Towards SOTA Forecasting LLMs

Eternis · March 24, 2026

High-stakes decisions in policy, investing, and risk management require reasoning about the future under uncertainty. Language models can ingest and synthesize vast information across thousands of topics, making them promising candidates for this task. But off-the-shelf LLMs are poor forecasters: they are overconfident, poorly calibrated, and rarely trained to reason probabilistically about open-ended outcomes.

Most prior work focuses on binary yes/no prediction market questions (e.g. "Will X win the election?"), where a coin flip already gets you 50% accuracy. The harder and more valuable task is open-ended forecasting: predicting who, what, or where, with no small fixed set of answer choices to pick from. This is the setting we train for.

Eternis-forecaster-8B

What we've built

We train an 8B-parameter language model to make calibrated, open-ended predictions about world events. On the OpenForesight benchmark (302 questions, May–August 2025), our model, EF-8B, surpasses all published baselines, including OpenForecaster-8B, a recent open-source system that itself matched proprietary models like GPT-OSS-120B and DeepSeek-R1 in Brier score and accuracy. Notably, given the right reinforcement learning harness for post-training, the model consistently beats systems 10–15x its parameter count.

Each question is paired with retrieved news context from up to one month before the event resolves, ensuring no future information leaks into the prediction. The model must reason about what it knows, weigh competing signals, and produce both an answer and a calibrated probability.
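The setup above can be sketched in a few lines. This is a minimal illustration, not our production code: `Article`, `select_context`, and the 30-day cutoff interpretation are assumptions made for the example, and the Brier score shown is the standard single-event squared-error form.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Article:
    published: date
    text: str

def select_context(articles, resolution_date, cutoff_days=30):
    """Keep only articles published at least `cutoff_days` before the
    event resolves, so no future information leaks into the prediction."""
    cutoff = resolution_date - timedelta(days=cutoff_days)
    return [a for a in articles if a.published <= cutoff]

def brier(prob, correct):
    """Single-event Brier score: squared error between the stated
    probability and the 0/1 outcome (lower is better)."""
    return (prob - (1.0 if correct else 0.0)) ** 2
```

The Brier score is what makes "an answer plus a calibrated probability" scoreable: a wrong answer at p=0.9 costs far more than a wrong answer at p=0.3.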

How we got there

Context building. The quality of retrieved context is the single largest driver of forecasting accuracy. We developed a proprietary context-building pipeline that mirrors how a skilled human would prepare before making a prediction: iteratively searching, evaluating relevance, and synthesizing information from a large corpus. This goes well beyond standard vector similarity retrieval. Our system uses structured search strategies and can leverage capable reasoning models to assess and curate the most decision-relevant passages for each question. The pipeline alone improves accuracy with the same base model, before any change to the weights.
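The iterative search-score-refine loop can be sketched as follows. The pipeline itself is proprietary, so everything here is illustrative: `search_fn` stands in for the search backend, `relevance_fn` for the reasoning-model judge, and the query-refinement step is a deliberately naive placeholder.

```python
def curate_context(question, search_fn, relevance_fn, rounds=3, k=4):
    """Iteratively retrieve candidate passages, score them for
    decision-relevance, and refine the query from the best hit so far."""
    query, curated = question, []
    for _ in range(rounds):
        candidates = search_fn(query)
        scored = sorted(candidates,
                        key=lambda p: relevance_fn(question, p),
                        reverse=True)
        curated.extend(scored[:k])
        if curated:
            # naive refinement: fold the best passage back into the query
            query = question + " " + curated[0]
    # deduplicate while preserving rank order, keep the top k
    seen, out = set(), []
    for p in curated:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out[:k]
```

The point of the loop structure is that later searches are conditioned on what earlier searches found, which is what distinguishes this from one-shot vector retrieval.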

Reward design for uncertainty. Standard RL rewards in language model training treat correctness as binary: right or wrong. Forecasting demands more. We modified the reward function so the model is incentivized to take calibrated risks: making bold predictions when evidence supports it and hedging appropriately when it doesn't. The model is rewarded not just for being right, but for how confidently right it is, and penalized more harshly for confident mistakes than for uncertain ones.
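A minimal sketch of such an asymmetric, confidence-weighted reward (the function name and scale factors are illustrative, not our exact production values):

```python
def forecast_reward(prob, correct, wrong_penalty=2.0):
    """Confidence-weighted reward: bold correct calls earn more than
    timid ones, and confident mistakes cost more than hedged ones."""
    if correct:
        return prob                    # right at p=0.9 beats right at p=0.4
    return -wrong_penalty * prob       # wrong at p=0.9 hurts twice as much
```

Under this shape, the model's best move when evidence is strong is a high-probability answer, and its best move under genuine uncertainty is a hedged one, which is exactly the calibration behaviour we want to reinforce.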

Modified GRPO training. We adapted the Group Relative Policy Optimization (GRPO) algorithm itself to better suit the forecasting setting. In standard GRPO, all errors are treated roughly equally during policy updates. Our modifications introduce asymmetry: highly confident incorrect predictions receive disproportionately strong penalties, while highly confident correct predictions receive the largest rewards. This reshapes the optimization landscape to favor the kind of well-calibrated, exploratory reasoning that characterizes skilled human forecasters.
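Standard GRPO normalizes each sample's reward against its group's mean and standard deviation; the asymmetry described above can be sketched as a confidence-dependent scaling on top of those group-relative advantages. The boost factors and the exact scaling rule below are illustrative assumptions, not the production update.

```python
import statistics

def asymmetric_advantages(rewards, probs, corrects,
                          wrong_boost=1.5, right_boost=1.2):
    """Group-relative advantages (as in GRPO) with confidence-dependent
    asymmetry: amplify penalties on confident mistakes and rewards on
    confident correct predictions."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0      # avoid division by zero
    adv = [(r - mu) / sd for r in rewards]
    out = []
    for a, p, c in zip(adv, probs, corrects):
        if not c and a < 0:
            a *= 1.0 + (wrong_boost - 1.0) * p  # confident wrong: harsher
        elif c and a > 0:
            a *= 1.0 + (right_boost - 1.0) * p  # confident right: larger
        out.append(a)
    return out
```

Because the scaling grows with the stated probability, a wrong answer at p=0.9 is pushed down much harder than the same wrong answer at p=0.3, which is the asymmetry the paragraph describes.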

Results

Figure: EF-8B accuracy and Brier score vs. baselines on the OpenForesight benchmark.

At 40% of training, EF-8B already matches or exceeds all prior baselines on the OpenForesight test set across both accuracy and Brier score. The model achieves this at 8B parameters: orders of magnitude smaller than the other models it competes with. Calibration improvements from our training also generalize to out-of-distribution benchmarks, suggesting the model has learned genuinely transferable probabilistic reasoning rather than dataset-specific shortcuts.

Improvement examples

In what ways does the model improve over the course of the training run?

The gains came from the model learning to reason better about questions in a few different ways:

Better domain reasoning. When asked "Which hazardous material will be a concern at the site of the contentious Belfast bonfire during the upcoming multi-agency risk assessment?" (right answer: asbestos), the base model confidently answered "tyres" at p=0.82, reasoning that multiple articles about bonfires mentioned tyre dumping. Around 40% training completion, the model instead answered "asbestos" at p=0.37; its reasoning now considered that the question specifically asked about a hazardous material concern during a risk assessment, and concluded asbestos was more fitting since it's a regulated hazardous substance commonly found in building materials at demolition sites, unlike tyres, which are waste but not technically hazardous. It got the right answer while simultaneously becoming less overconfident.

Reasoning past the obvious. For "Which northeastern Ukrainian city will President Putin say his forces may take control of as part of a buffer zone by the end of June 2025?" (right answer is Sumy), the base model defaulted to "Kharkiv," the most prominent northeastern Ukrainian city in news coverage. Around 40% training completion, the model's reasoning explicitly considered that Kharkiv is a major city unlikely to be framed as a buffer zone target, and worked through alternatives in the northeast, landing on "Sumy," the correct and less obvious answer. Similarly, asked "Which country's flag will MV Wan Hai 503 be flying when it suffers the explosion off Kerala?" (right answer is Singapore), the base model answered "Taiwan" because Wan Hai Lines is a Taiwanese shipping company. The trained model drew on the distinction between a company's headquarters and a vessel's flag state, reasoning that many Asian shipping companies register under flags of convenience, and correctly guessed "Singapore."

Convergence on already-known answers. For many questions, the base model already had the right answer in 1 out of 3 samples but was inconsistent. For example, asked "Who will visit London in early July 2025 to unveil the UK-France migration deal?" (answer: Emmanuel Macron), the base model produced ["Unknown", "Unknown", "Emmanuel Macron"]. Two out of three samples gave up and said Unknown. Around 40% training completion, all three test samples confidently output "Emmanuel Macron," with the reasoning now going straight to the logical inference rather than spiraling into uncertainty. The same pattern appeared for "Which Republican Senator from Kentucky will be recorded as voting against the bill in the Senate roll call?" (answer: Rand Paul). The base model split between Rand Paul and Mitch McConnell, but over time all three samples converged on Rand Paul with the reasoning noting Paul's well-known record of voting against party-line bills.

Across all net-gained questions, the model's average predicted probability actually decreased (from 0.39 to 0.27), showing it learned calibration rather than gaming confidence. It got more answers right while simultaneously becoming more honest about its uncertainty.
