arxiv:2603.19835

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Published on Mar 20 · Submitted by Kexin Huang on Apr 1 · #2 Paper of the day
Abstract

FIPO enhances reinforcement learning for language models by using discounted future-KL divergence to improve credit assignment and extend reasoning chains, achieving better mathematical problem-solving performance.

AI-generated summary

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that dense advantage formulations are a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
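The abstract describes re-weighting a trajectory-level ORM advantage by each token's discounted future-KL influence. The paper's exact formulation is not reproduced here, but the core idea can be sketched as a reverse discounted sum: each token's weight depends on the KL mass that follows it. The function name `future_kl_advantages`, the mean-normalization of weights, and the treatment of the final token are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def future_kl_advantages(kl_per_token, outcome_advantage, gamma=0.99):
    """Hypothetical sketch: re-weight a scalar trajectory-level (ORM-style)
    advantage by each token's discounted future-KL influence.

    kl_per_token: per-token KL divergence values, shape (T,).
    outcome_advantage: scalar advantage shared by the whole trajectory.
    gamma: discount factor for how quickly future influence decays.
    """
    T = len(kl_per_token)
    # Discounted sum of KL over the *future* of each token, accumulated
    # right-to-left so position t sees gamma-discounted mass from t+1..T-1.
    future_kl = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        future_kl[t] = running
        running = kl_per_token[t] + gamma * running
    # Normalize so the mean weight is ~1, keeping the overall advantage
    # scale comparable to the uniform ORM baseline (an assumed choice).
    weights = future_kl / (future_kl.mean() + 1e-8)
    return outcome_advantage * weights
```

Under this sketch, tokens followed by heavy policy shifts receive a larger share of the trajectory's advantage, while tokens with little downstream effect are down-weighted, in contrast to the uniform spread of standard GRPO.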

Community

Paper author · Paper submitter

FIPO replaces trajectory-end binary rewards with the Future-KL mechanism. It quantifies the real-time "causal influence" of every single token on the subsequent reasoning path, enabling precise, token-level reinforcement or suppression. It achieved 58.0% on AIME 2024, outperforming DAPO, o1-mini, and DeepSeek-Zero-MATH at the same scale. It also broke through the standard 4k length stagnation, scaling effective reasoning to 10,000+ tokens, where generation length strongly correlates with actual accuracy.

future-KL as a training signal for deep reasoning is a clever framing, curious how it compares to vanilla KL in practice. wrote up notes on this here https://arxivexplained.com/papers/fipo-eliciting-deep-reasoning-with-future-kl-influenced-policy-optimization if anyone wants the breakdown

Paper author

Thanks a lot for covering our paper and sharing your notes! For anyone looking for further reading, we've also released a detailed text blog covering FIPO here: https://qwen-pilot.notion.site/fipo. Thanks again for the support!

the dense, future-informed credit signal in fipo is the standout move, it feels like a simple tweak that unlocks real long-horizon reasoning without a separate value model. i’m curious about how sensitive the length growth is to the soft decay horizon; did you try different horizons and does the 10k+ token stretch survive if you flatten the horizon? the memory-efficient chunked computation for Future-KL and the influence weight clipping look like the invisible gears keeping training stable, and i appreciate the explicit discussion of compute costs. btw, the arxivlens breakdown helped me parse the method details, especially the way the dense signal interacts with the grpo update—a quick takeaway on when this approach might fail would be super helpful.
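The comment above mentions two stability mechanisms: memory-efficient chunked computation of Future-KL and influence-weight clipping. Neither is specified in detail on this page, but a chunked reverse discounted sum can be sketched by walking the sequence right-to-left one chunk at a time and carrying the discounted tail across chunk boundaries. The function name `chunked_future_kl`, the chunk size, and the clipping range are all illustrative assumptions; the actual implementation in the authors' verl-based system may differ.

```python
import numpy as np

def chunked_future_kl(kl_per_token, gamma=0.99, chunk=4096, clip=5.0):
    """Hypothetical sketch: compute discounted future-KL influence weights
    chunk by chunk, so long sequences never need one monolithic pass.
    """
    T = len(kl_per_token)
    future_kl = np.empty(T)
    carry = 0.0  # discounted tail carried across chunk boundaries
    # Process chunks right-to-left; within each chunk, accumulate the
    # reverse discounted sum, seeding it with the carry from later chunks.
    for start in range(((T - 1) // chunk) * chunk, -1, -chunk):
        end = min(start + chunk, T)
        for t in range(end - 1, start - 1, -1):
            future_kl[t] = carry
            carry = kl_per_token[t] + gamma * carry
    # Mean-normalize, then clip influence weights so no single token's
    # advantage can blow up the update (assumed stabilization choice).
    weights = future_kl / (future_kl.mean() + 1e-8)
    return np.clip(weights, 0.0, clip)
```

Because the recurrence only needs the running carry between chunks, the result is identical to an unchunked pass while bounding live intermediate state to one chunk, which is the kind of "invisible gear" the comment alludes to.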


Get this paper in your agent:

hf papers read 2603.19835
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 2