
LLMs Can Teach Themselves to Better Predict the Future

This content was generated with NotebookLM.

Summary

This paper introduces a novel framework for improving the forecasting capabilities of Large Language Models (LLMs) through outcome-driven fine-tuning. The method leverages model self-play to generate diverse reasoning trajectories and probabilistic forecasts for future events. These forecasts are then ranked based on their accuracy compared to actual outcomes, and the model is fine-tuned using Direct Preference Optimization (DPO). The results demonstrate significant accuracy improvements (7-10%) on Phi-4 14B and DeepSeek-R1 14B models, bringing their performance on par with much larger models like GPT-4o, without relying on human-curated reasoning samples. This approach has implications for decision-making across various sectors like finance, policy, and law.

Problem Statement

Smaller, open LLMs lag behind much larger frontier models such as GPT-4o at probabilistic forecasting, and prior approaches to improving their forecasting ability have relied on human-curated reasoning samples.

Proposed Solution

The authors propose a self-play fine-tuning framework that lets LLMs learn directly from actual outcomes and their own self-generated reasoning. Its core components are laid out step by step in the methodology below.

Methodology

The method comprises six main steps:

  1. Data Collection and Preprocessing: Gathering forecasting questions with binary outcomes from Polymarket.
  2. News Collection: Collecting news articles related to the questions via the NewsCatcher API.
  3. Synthetic Training Data Generation: Generating reasoning and forecasts through base model self-play. As the paper states, “We then instructed the base models to provide reasoning and a final probabilistic forecast for each question.”
  4. Resolution-Driven Re-Ranking: Ranking the self-play reasoning–forecast pairs by how closely each probabilistic forecast matched the resolved outcome (see the sketch after this list).
  5. Direct Preference Optimization (DPO) Fine-Tuning: Fine-tuning the model on the ranked preference pairs (sketched further below). The paper states, “We use Direct Preference Optimization (DPO) to optimise model outputs against self-play derived and outcome-driven preferences without the need to train a separate reward model.”
  6. Forecasting Test-Set Questions: Evaluating the model on a held-out test set.
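
To make the re-ranking step concrete, here is a minimal sketch of how self-play forecasts could be scored against resolved outcomes and turned into DPO preference pairs. The data structure, function names, and pairing rule (most vs. least accurate forecast per question) are illustrative assumptions, not the paper's actual code.

```python
from dataclasses import dataclass

@dataclass
class SelfPlayForecast:
    question: str
    reasoning: str      # self-generated reasoning trajectory
    probability: float  # forecast probability that the event resolves "yes"

def brier_score(probability: float, outcome: int) -> float:
    """Squared error between a probabilistic forecast and the binary outcome (0 or 1)."""
    return (probability - outcome) ** 2

def build_preference_pair(forecasts: list[SelfPlayForecast], outcome: int) -> dict:
    """Rank one question's self-play forecasts by Brier score and pair the most
    and least accurate reasoning traces as a DPO (chosen, rejected) example."""
    ranked = sorted(forecasts, key=lambda f: brier_score(f.probability, outcome))
    best, worst = ranked[0], ranked[-1]
    return {
        "prompt": best.question,
        "chosen": best.reasoning,     # reasoning behind the most accurate forecast
        "rejected": worst.reasoning,  # reasoning behind the least accurate forecast
    }
```

The resulting prompt/chosen/rejected records match the preference-pair format expected by common DPO implementations.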

The researchers used Phi-4 14B and DeepSeek-R1 14B, both relatively small (14 billion parameters) yet performant models. Probabilistic forecast accuracy was evaluated with the Brier score, the mean squared difference between forecast probabilities and the realised binary outcomes.
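
The DPO fine-tune over these preference pairs (step 5) could then be run with an off-the-shelf trainer. The sketch below assumes the Hugging Face TRL library; the model identifier, hyperparameters, and argument names are assumptions (TRL's API varies across versions), not the authors' exact setup.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Preference pairs built from self-play forecasts and resolved outcomes,
# in the prompt/chosen/rejected format shown above (placeholder records here).
pairs = Dataset.from_list([
    {"prompt": "Will event X resolve yes?", "chosen": "...", "rejected": "..."},
])

model_name = "microsoft/phi-4"  # one of the two 14B base models; identifier and tooling assumed
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# beta controls how strongly the fine-tuned policy is pulled toward the
# preferred responses relative to the reference model; 0.1 is illustrative.
training_args = DPOConfig(output_dir="phi4-forecasting-dpo", beta=0.1)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=pairs,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()
```

Because the resolved outcomes themselves supply the preference signal, no separate reward model is needed, which is the point the quoted DPO passage above emphasises.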

Key Findings and Results

  1. Significance and Impact: Fine-tuning on self-play preference pairs improved forecasting accuracy by 7-10% for both Phi-4 14B and DeepSeek-R1 14B, bringing them on par with the much larger GPT-4o without any human-curated reasoning samples, with implications for decision-making in finance, policy, and law.

Key Quotes

  1. “We then instructed the base models to provide reasoning and a final probabilistic forecast for each question.”
  2. “We use Direct Preference Optimization (DPO) to optimise model outputs against self-play derived and outcome-driven preferences without the need to train a separate reward model.”

Future Directions

The authors’ method paves the way for further research into self-supervised learning techniques for improving LLM reasoning and prediction capabilities.

