- Authors: Benjamin Turtel, Danny Franklin, Philipp Schoenegger
- Source: Excerpts from “2502.05253.pdf”
Summary
This paper introduces a novel framework for improving the forecasting capabilities of Large Language Models (LLMs) through outcome-driven fine-tuning. The method leverages model self-play to generate diverse reasoning trajectories and probabilistic forecasts for future events. These forecasts are then ranked based on their accuracy compared to actual outcomes, and the model is fine-tuned using Direct Preference Optimization (DPO). The results demonstrate significant accuracy improvements (7-10%) on Phi-4 14B and DeepSeek-R1 14B models, bringing their performance on par with much larger models like GPT-4o, without relying on human-curated reasoning samples. This approach has implications for decision-making across various sectors like finance, policy, and law.
Problem Statement
- LLMs have shown promise in various areas, but their performance in judgemental forecasting still lags behind that of human experts.
- Existing methods to improve LLM forecasting often rely on costly and slow human-curated data, hindering continuous learning and improvement. The paper states: “They are frequently reliant on human-curated data such as up-to-date crowd forecasts or output curation, and often fail to have the models learn from resolved outcomes. Human outputs are slow and costly to procure, making it difficult to have models continually learn from them and improve.”
Proposed Solution
The authors propose a self-play fine-tuning framework that allows LLMs to learn directly from actual outcomes and self-generated reasoning. The core components are:
- Self-Play Data Generation: LLMs generate multiple reasoning traces and probabilistic forecasts for a large dataset of forecasting questions (e.g., from Polymarket).
- Resolution-Driven Re-Ranking: Pairs of reasoning traces are ranked based on the proximity of their probabilistic forecasts to the actual outcome.
- Direct Preference Optimization (DPO) Fine-Tuning: DPO is used to optimize model outputs against self-play-derived, outcome-driven preferences without training a separate reward model; it learns a reward signal from ranked reasoning pairs drawn from the self-play outputs (a sketch of the pair construction follows this list).
- News Integration: News articles related to the forecasting questions are collected via the NewsCatcher API and used as input for reasoning and prediction.
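A minimal sketch, under assumptions, of how the resolution-driven re-ranking above could turn self-play outputs into DPO preference pairs. The function names, the `min_gap` tie-breaking threshold, and the dictionary field names are illustrative choices, not taken from the paper.

```python
from itertools import combinations

def brier(prob: float, outcome: int) -> float:
    """Squared error between a probabilistic forecast and the binary outcome (0 or 1)."""
    return (prob - outcome) ** 2

def build_preference_pairs(question: str, traces, outcome: int, min_gap: float = 0.05):
    """Turn self-play reasoning traces for one resolved question into DPO preference pairs.

    `traces` is a list of (reasoning_text, forecast_prob) tuples produced by the base model.
    Of each pair of traces, the one whose forecast is closer to the actual outcome becomes
    the 'chosen' completion and the other the 'rejected' completion.
    """
    pairs = []
    for (text_a, p_a), (text_b, p_b) in combinations(traces, 2):
        err_a, err_b = brier(p_a, outcome), brier(p_b, outcome)
        if abs(err_a - err_b) < min_gap:  # skip near-ties: weak preference signal
            continue
        chosen, rejected = (text_a, text_b) if err_a < err_b else (text_b, text_a)
        pairs.append({"prompt": question, "chosen": chosen, "rejected": rejected})
    return pairs
```

For example, given a resolved question with outcome 1 and two traces forecasting 0.8 and 0.3, the 0.8 trace would be marked as chosen and the 0.3 trace as rejected.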
Methodology
The method comprises six main steps:
- Data Collection and Preprocessing: Gathering forecasting questions with binary outcomes from Polymarket.
- News Collection: Collecting news articles related to the questions via the NewsCatcher API.
- Synthetic Training Data Generation: Generating reasoning and forecasts through base model self-play. As the paper states, “We then instructed the base models to provide reasoning and a final probabilistic forecast for each question.”
- Resolution-Driven Re-Ranking: Ranking pairs of reasoning traces by how close their probabilistic forecasts come to the resolved outcome.
- Direct Preference Optimization (DPO) Fine-Tuning: Fine-tuning the model on the ranked preference pairs (the standard DPO objective is given after this list). The paper further states, “We use Direct Preference Optimization (DPO) to optimise model outputs against self-play derived and outcome-driven preferences without the need to train a separate reward model.”
- Forecasting Test-Set Questions: Evaluating the model on a held-out test set.
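For reference, the preference pairs feed the standard DPO objective (Rafailov et al.); the formulation below is the usual one rather than a formula reproduced from the excerpts, with prompt $x$, chosen trace $y_w$, rejected trace $y_l$, reference policy $\pi_{\mathrm{ref}}$, and scaling parameter $\beta$:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$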
The researchers used Phi-4 14B and DeepSeek-R1 14B, both relatively small (14B-parameter) but performant models. Probabilistic forecast accuracy was evaluated with the Brier score.
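For a binary question, the Brier score is the squared error between the forecast probability and the resolved outcome, averaged over the $N$ test questions (lower is better); a forecast of 0.5 scores 0.25 regardless of the outcome:

$$
\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2, \qquad o_i \in \{0, 1\}
$$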
Key Findings and Results
- The proposed fine-tuning method significantly improves forecasting accuracy (7-10%) for both Phi-4 14B and DeepSeek-R1 14B, compared to base models and control models fine-tuned with randomized labels.
- Fine-tuned models achieve performance on par with the much larger GPT-4o model.
- The improvement is statistically significant and not simply due to exposure to additional information (news articles).
- While the fine-tuned models make highly inaccurate forecasts slightly more often, they also produce far more extremely accurate ones, which more than compensates for the large errors. The paper notes: “Comparing the distributions of accuracy scores across the questions for DeepSeek-R1 14B, we find that the fine-tuned model had a Brier score above 0.5 (very low accuracy) on 8.52% of questions, slightly higher than the base (7.48%) and control (7.61%) models. However, it also had a Brier score below 0.05 (very high accuracy) on 32.78% of questions, compared to only 23.22% and 23.13% for the base and control models.”
Significance and Impact
- Demonstrates a scalable, outcome-driven approach for enhancing LLM forecasting without relying on human annotation.
- Offers a method for LLMs to continuously learn and improve their forecasting abilities.
- Potentially improves decision-making in various sectors by providing more accurate and reliable forecasts.
- Modern LLMs have already been shown to conduct financial analysis [3], evaluate the impact of events on time series [4], and improve climate policy decision-making [5]. This makes improving LLMs’ forecasting abilities potentially impactful and wide-ranging.
Key Quotes
- “In this paper, we propose a new approach to improving LLM forecasting performance that sidesteps the use of human inputs above and beyond real-world resolutions and enables the model to directly learn from actual outcomes and self-play.”
- “Our results…show that for both of the models that we employed our method on, Phi-4 14B and DeepSeek-R1 14B, we find accuracy improvements of between 7–10% over the base versions of these models as well as the same models fine-tuned with randomized outcome labels as a control…”
- “Strikingly, our fine-tune of both models are also on par with the performance of the much larger GPT-4o.”
Future Directions
The authors’ method paves the way for further research into self-supervised learning techniques for improving LLM reasoning and prediction capabilities.