Gain-based prefix evaluation for LLM reasoning

From correctness to utility.

PUM evaluates a reasoning prefix by asking a future-facing question: does this prefix make the problem easier to solve?

20K reasoning trajectories
280K utility preferences
3 uses: selection, search, RL
Paper Figure 1 comparing PRM correctness scoring and PUM utility-based evaluation.

Core idea

Correctness is not always usefulness.

A locally correct step can still leave the remaining reasoning brittle. A non-final prefix can be useful if it exposes a productive decomposition that increases downstream solve probability.

Prefix gain
Gain(x, p) = q(x, p) - q(x, ∅)

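Here q(x, p) is the solve rate on problem x after conditioning on prefix p, and q(x, ∅) is the solve rate on the same problem from scratch. A positive gain means the prefix makes the problem easier to solve.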
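
As a minimal sketch of the formula above, the gain can be estimated from sampled rollouts by comparing the fraction of correct continuations with and without the prefix. The function names and the rollout outcomes below are hypothetical and chosen only for illustration; the paper's actual estimator may differ.

```python
from typing import Sequence


def solve_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of rollouts that reached a correct final answer."""
    if not outcomes:
        raise ValueError("need at least one rollout")
    return sum(outcomes) / len(outcomes)


def prefix_gain(with_prefix: Sequence[bool], from_scratch: Sequence[bool]) -> float:
    """Monte Carlo estimate of Gain(x, p) = q(x, p) - q(x, ∅)."""
    return solve_rate(with_prefix) - solve_rate(from_scratch)


# Hypothetical outcomes for one problem x (True = correct final answer).
from_scratch = [True, False, False, False, True, False, False, False]  # 2 of 8 correct
with_prefix = [True, True, False, True, True, False, True, True]       # 6 of 8 correct

gain = prefix_gain(with_prefix, from_scratch)
print(gain)  # 0.5
```

A negative value would indicate a prefix that makes the problem harder than starting from scratch.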

Method

Sample prefixes, estimate gain, learn utility.

PUM converts outcome-grounded solve-rate gains into pairwise prefix preferences, then trains a scalar utility model to score partial and complete reasoning trajectories.

Paper Figure 2 showing prefix sampling, gain-based preference construction, and PUM training.
01

Prefix sampling

Construct vertical and horizontal comparisons from diverse reasoning trajectories.

02

Gain estimation

Lightweight students solve with and without a prefix; their solve-rate differences form a gain profile.

03

Pairwise training

Gain differences become preference labels for training an LLM backbone with a scalar value head.
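
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the per-student gain profile is collapsed with a plain mean, pairs are kept only when the mean gains differ by a hypothetical `margin`, and the training objective is written as a standard Bradley-Terry pairwise loss. The page does not specify any of these choices.

```python
import math
from itertools import combinations


def mean_gain(profile):
    """Collapse a per-student gain profile into one number (assumed: plain mean)."""
    return sum(profile) / len(profile)


def build_preferences(gain_profiles, margin=0.1):
    """Return (preferred, rejected) prefix ids whose mean gains differ by >= margin."""
    pairs = []
    for a, b in combinations(sorted(gain_profiles), 2):
        diff = mean_gain(gain_profiles[a]) - mean_gain(gain_profiles[b])
        if diff >= margin:
            pairs.append((a, b))
        elif diff <= -margin:
            pairs.append((b, a))
        # Pairs inside the margin are treated as ties and dropped (assumption).
    return pairs


def pairwise_loss(u_preferred, u_rejected):
    """Bradley-Terry loss: -log sigmoid(u_preferred - u_rejected)."""
    return math.log1p(math.exp(-(u_preferred - u_rejected)))


# Hypothetical gain profiles: one gain per lightweight student, per prefix.
gain_profiles = {
    "A": [0.5, 0.4, 0.6],
    "B": [-0.2, -0.1, -0.3],
    "C": [0.45, 0.5, 0.55],
}
preferences = build_preferences(gain_profiles)
print(preferences)  # [('A', 'B'), ('C', 'B')]
```

In a real training loop the two utilities would come from the scalar value head, and the loss would be averaged over a batch of preference pairs; the loss shrinks as the preferred prefix is scored further above the rejected one.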

Results

One signal for selection, search, and RL.

This page reproduces the original figures and tables from the paper so readers can inspect the evidence directly.

Best-of-N best / near-best

Consistent gains across policy models and datasets, especially when candidate pools grow.

Beam Search 78.00%

MATH500 with Qwen2.5-3B at N=100; GAOKAO2023 reaches 71.43% in the same setting.

RL 55.21 → 58.09

PUM+GRPO improves average accuracy over vanilla GRPO in the outcome-anchored setup.

Labeling cost 1.0× vs 18.5×

PUM-Math uses lightweight students and avoids human step-level annotations.

Paper Figure 3 with Best-of-N selection and ranking robustness results.
Paper Table 1 with Beam Search accuracy on MATH500 and GAOKAO2023.
Paper Table 2 with RL pass@1 accuracy after training.
Paper Figure 4 with RL training curves.
Paper Figure 5 with further analyses on weak-to-strong transfer, policy dependence, scaling, and hard-data RL.
Paper Table 3 with supervision construction cost comparison.
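
For readers unfamiliar with the Best-of-N use case reported above, a minimal sketch follows: score every candidate with a scalar utility function and keep the highest-scoring one. The `toy_utility` scorer is a hypothetical stand-in for a trained utility model and exists only to make the example runnable.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], utility: Callable[[str], float]) -> str:
    """Return the candidate with the highest utility score (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=utility)


def toy_utility(text: str) -> float:
    """Hypothetical scorer: rewards candidates that use the sum-of-squares identity."""
    return 1.0 if "(x + y)^2 - 2xy" in text else 0.0


candidates = [
    "Guess x = 6 and y = 4, then square and add.",
    "Use x^2 + y^2 = (x + y)^2 - 2xy with the given values.",
]
chosen = best_of_n(candidates, toy_utility)
print(chosen)
```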

Interactive example

A real math case: which prefix is more useful?

This example illustrates why evaluating the future effect of a prefix can be different from simply checking surface plausibility.

Problem

Let x and y be real numbers such that x + y = 10 and xy = 21. Find x² + y².

PUM-style utility judgment

Prefer Prefix A

Prefix A reduces the remaining task to a direct calculation: x² + y² = (x + y)² - 2xy = 100 - 42 = 58. Prefix B looks simple but violates xy = 21 because 6 × 4 = 24, so it pushes the continuation toward a wrong answer.
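
The arithmetic in this judgment can be checked directly. The sketch below assumes that Prefix B proposes the pair (6, 4), which is what the 6 × 4 check suggests; the prefixes themselves are not reproduced in the text.

```python
s, p = 10, 21  # given: x + y = 10 and xy = 21

# Prefix A: apply the identity x^2 + y^2 = (x + y)^2 - 2xy.
answer = s**2 - 2 * p
print(answer)  # 58

# Cross-check with the actual solutions, the roots of t^2 - 10t + 21 = 0.
x, y = 3, 7
assert x + y == s and x * y == p
assert x**2 + y**2 == answer

# Assumed Prefix B guess: it satisfies the sum but not the product.
bx, by = 6, 4
print(bx + by == s, bx * by == p)  # True False
print(bx**2 + by**2)  # 52, which differs from the correct 58
```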

Resources

Paper, code, data, and model.

BibTeX
@article{zhou2026from,
  title = {From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning},
  author = {Yuhang Zhou and Yixin Cao and Guangnan Ye},
  journal = {arXiv preprint arXiv:2606.07190},
  year = {2026},
}