arxiv:2603.19835

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Published on Mar 20 · Submitted by Kexin Huang on Apr 1 · #2 Paper of the day
Abstract

FIPO enhances reinforcement learning for language models by using discounted future-KL divergence to improve credit assignment and extend reasoning chains, achieving better mathematical problem-solving performance.

AI-generated summary

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that dense advantage formulations are a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
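The abstract describes re-weighting a trajectory-level ORM advantage by each token's discounted future-KL influence. The paper's exact formulation is not reproduced here, but the core idea can be sketched as a reverse discounted sum: each token's weight depends on the KL mass that follows it. The function name `future_kl_advantages`, the mean-normalization of weights, and the treatment of the final token are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def future_kl_advantages(kl_per_token, outcome_advantage, gamma=0.99):
    """Hypothetical sketch: re-weight a scalar trajectory-level (ORM-style)
    advantage by each token's discounted future-KL influence.

    kl_per_token: per-token KL divergence values, shape (T,).
    outcome_advantage: scalar advantage shared by the whole trajectory.
    gamma: discount factor for how quickly future influence decays.
    """
    T = len(kl_per_token)
    # Discounted sum of KL over the *future* of each token, accumulated
    # right-to-left so position t sees gamma-discounted mass from t+1..T-1.
    future_kl = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        future_kl[t] = running
        running = kl_per_token[t] + gamma * running
    # Normalize so the mean weight is ~1, keeping the overall advantage
    # scale comparable to the uniform ORM baseline (an assumed choice).
    weights = future_kl / (future_kl.mean() + 1e-8)
    return outcome_advantage * weights
```

Under this sketch, tokens followed by heavy policy shifts receive a larger share of the trajectory's advantage, while tokens with little downstream effect are down-weighted, in contrast to the uniform spread of standard GRPO.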

Community

Paper author · Paper submitter

FIPO replaces trajectory-end binary rewards with the Future-KL mechanism. It quantifies the real-time "causal influence" of every single token on the subsequent reasoning path, enabling precise, token-level reinforcement or suppression. It achieved 58.0% on AIME 2024, outperforming DAPO, o1-mini, and DeepSeek-Zero-MATH at the same scale. It also broke through the standard 4k length stagnation, scaling effective reasoning to 10,000+ tokens, where generation length strongly correlates with actual accuracy.

future-KL as a training signal for deep reasoning is a clever framing, curious how it compares to vanilla KL in practice. wrote up notes on this here https://arxivexplained.com/papers/fipo-eliciting-deep-reasoning-with-future-kl-influenced-policy-optimization if anyone wants the breakdown

Paper author

Thanks a lot for covering our paper and sharing your notes! For anyone looking for further reading, we've also released a detailed text blog covering FIPO here: https://qwen-pilot.notion.site/fipo. Thanks again for the support!

the dense, future-informed credit signal in fipo is the standout move, it feels like a simple tweak that unlocks real long-horizon reasoning without a separate value model. i’m curious about how sensitive the length growth is to the soft decay horizon; did you try different horizons and does the 10k+ token stretch survive if you flatten the horizon? the memory-efficient chunked computation for Future-KL and the influence weight clipping look like the invisible gears keeping training stable, and i appreciate the explicit discussion of compute costs. btw, the arxivlens breakdown helped me parse the method details, especially the way the dense signal interacts with the grpo update—a quick takeaway on when this approach might fail would be super helpful.
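The comment above mentions two stability mechanisms: memory-efficient chunked computation of Future-KL and influence-weight clipping. Neither is specified in detail on this page, but a chunked reverse discounted sum can be sketched by walking the sequence right-to-left one chunk at a time and carrying the discounted tail across chunk boundaries. The function name `chunked_future_kl`, the chunk size, and the clipping range are all illustrative assumptions; the actual implementation in the authors' verl-based system may differ.

```python
import numpy as np

def chunked_future_kl(kl_per_token, gamma=0.99, chunk=4096, clip=5.0):
    """Hypothetical sketch: compute discounted future-KL influence weights
    chunk by chunk, so long sequences never need one monolithic pass.
    """
    T = len(kl_per_token)
    future_kl = np.empty(T)
    carry = 0.0  # discounted tail carried across chunk boundaries
    # Process chunks right-to-left; within each chunk, accumulate the
    # reverse discounted sum, seeding it with the carry from later chunks.
    for start in range(((T - 1) // chunk) * chunk, -1, -chunk):
        end = min(start + chunk, T)
        for t in range(end - 1, start - 1, -1):
            future_kl[t] = carry
            carry = kl_per_token[t] + gamma * carry
    # Mean-normalize, then clip influence weights so no single token's
    # advantage can blow up the update (assumed stabilization choice).
    weights = future_kl / (future_kl.mean() + 1e-8)
    return np.clip(weights, 0.0, clip)
```

Because the recurrence only needs the running carry between chunks, the result is identical to an unchunked pass while bounding live intermediate state to one chunk, which is the kind of "invisible gear" the comment alludes to.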


Get this paper in your agent:

hf papers read 2603.19835
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 2