PAWS: Preference Learning with Advantage-Weighted Segments

Taranovic, Aleksandar; Celik, Onur; Freymuth, Niklas; Li, Ge; Thilges, Serge; Le, Huy; Hoang, Tai; Rayyes, Rania; Neumann, Gerhard

Aleksandar Taranovic¹, Onur Celik¹, Niklas Freymuth¹, Ge Li¹, Serge Thilges¹, Huy Le¹,², Tai Hoang¹, Rania Rayyes³, Gerhard Neumann¹

¹Autonomous Learning Robots, Karlsruhe Institute of Technology
²Bosch Center for Artificial Intelligence
³Institute for Material Handling and Logistics (IFL), KIT

Published at ICML 2026

Figure: **The Temporal Credit Assignment Problem.** The advantage model is trained on preferred (τ⁺) and non-preferred (τ⁻) segments with a loss that depends only on the *sum* of per-step advantages. Many different per-step assignments are consistent with the same preference label, yet policy optimization queries the advantage on individual state–action pairs, inducing a distribution shift between segment-level training and step-level inference.

TL;DR: PAWS performs policy updates directly with segment-level advantage functions, aligning utility training with policy optimization to mitigate the temporal credit assignment problem in preference-based RL.

Abstract

Preference-based reinforcement learning (PbRL) learns policies from human trajectory-level comparisons, avoiding explicit reward design and expert demonstrations. Existing methods typically train utility functions on trajectory or segment-level preferences while relying on per-step utility estimates during policy optimization. This training and inference mismatch induces a distribution shift that severely degrades temporal credit assignment and limits policy learning. We analyze this issue and propose Preference Learning with Advantage-Weighted Segments (PAWS), a segment-based preference learning method that performs policy updates directly using segment-level advantage functions. By aligning utility training with policy optimization, PAWS preserves trajectory-level preference information and avoids unreliable per-step learning signals. Experiments on simulated robotic manipulation and locomotion tasks demonstrate that PAWS consistently outperforms existing PbRL approaches, highlighting the importance of distribution-consistent preference learning.

The Training–Inference Mismatch

Most PbRL methods train a reward or advantage model on segment-level comparisons, but then query that model at the level of individual state–action pairs during policy optimization. Because the preference loss only constrains the sum of per-step advantages within a segment, many distinct per-step assignments explain the same preference label. This leaves temporal credit assignment fundamentally ambiguous and injects distribution shift into policy learning.

Figure: **Ambiguity in per-step credit assignment.** All four pairs of preferred and non-preferred segments share the same trajectory-level advantage sum, and thus the same preference label, yet their per-step assignments differ markedly. Any consistent per-step assignment is equally compatible with the supervision signal, exposing downstream policy updates to arbitrary choices made by the utility model.
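The ambiguity can be reproduced in a few lines. The sketch below is illustrative only: it assumes a standard Bradley–Terry negative log-likelihood over the difference of summed per-step advantages, and the per-step values are made up rather than taken from the paper. Four hypothetical per-step assignments with the same sum receive exactly the same loss.

```python
import math

def bradley_terry_loss(adv_pref, adv_nonpref):
    """Negative log-likelihood that the preferred segment wins.

    Only the *sums* of the per-step advantages enter the loss.
    """
    diff = sum(adv_pref) - sum(adv_nonpref)
    # -log(sigmoid(diff)) written as log(1 + exp(-diff))
    return math.log1p(math.exp(-diff))

# Hypothetical per-step advantages for a preferred segment; every row sums to 2.0.
preferred_variants = [
    [0.5, 0.5, 0.5, 0.5],    # credit spread evenly
    [2.0, 0.0, 0.0, 0.0],    # all credit on the first step
    [0.0, 0.0, 0.0, 2.0],    # all credit on the last step
    [3.0, -1.0, 1.0, -1.0],  # alternating credit and blame
]
# Hypothetical non-preferred segment; sums to -1.0.
non_preferred = [-0.25, -0.25, -0.25, -0.25]

losses = [bradley_terry_loss(p, non_preferred) for p in preferred_variants]
first_step_values = [p[0] for p in preferred_variants]

print(losses)             # four identical values
print(first_step_values)  # [0.5, 2.0, 0.0, 3.0]
```

The loss cannot distinguish the four rows, while a policy update that reads the advantage of the first step alone would see four very different signals.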

Our Approach: PAWS

PAWS keeps training and inference distribution-consistent by using the learned advantage function directly on trajectory segments during policy optimization:

Advantage learning. An advantage function A_φ is trained on segment-level preferences via a Bradley–Terry objective over the difference in cumulative segment advantages. Both an MLP and an encoder-only Transformer parameterization are supported.
Segment-level policy update. A trust-region-constrained optimization in segment space yields a reweighted segment distribution p*(τ) ∝ p_D(τ) exp(A_φ(τ)/λ), with λ acting as a temperature. Projecting it back onto the policy gives a weighted maximum-likelihood update in which every step of a segment shares that segment's advantage weight, so the update never relies on a per-step utility estimate.
Data-driven step size. Instead of hand-tuning the KL trust region ε, PAWS sets it automatically from a target effective sample size n_eff, an intuitive knob describing how many preferences meaningfully contribute to each policy update.
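The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that p_D is the empirical (uniform) distribution over a batch of segments, that n_eff is the Kish effective sample size 1/Σw², and that the temperature λ is found by bisection. All function names and the example advantages are hypothetical.

```python
import math

def segment_weights(seg_adv, lam):
    """Normalized weights w_i proportional to exp(A(tau_i) / lam)."""
    m = max(seg_adv)  # subtract the maximum for numerical stability
    unnorm = [math.exp((a - m) / lam) for a in seg_adv]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def effective_sample_size(weights):
    """Kish effective sample size for weights that sum to one (assumed definition)."""
    return 1.0 / sum(w * w for w in weights)

def solve_temperature(seg_adv, target_n_eff, lo=1e-3, hi=1e3, iters=100):
    """Bisection in log space for the lam whose weights reach target_n_eff."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if effective_sample_size(segment_weights(seg_adv, mid)) < target_n_eff:
            lo = mid  # weights too peaked: raise the temperature
        else:
            hi = mid  # weights too flat: lower the temperature
    return math.sqrt(lo * hi)

def kl_to_uniform(weights):
    """KL(p* || p_D) for uniform p_D, i.e. the trust-region size implied by lam."""
    n = len(weights)
    return sum(w * math.log(w * n) for w in weights if w > 0.0)

def weighted_nll(step_log_probs, weights):
    """Weighted maximum-likelihood loss; every step of segment i uses weights[i]."""
    return -sum(w * sum(lps) for w, lps in zip(weights, step_log_probs))

# Hypothetical segment-level advantages A(tau_i) for a batch of eight segments.
seg_adv = [2.0, 1.0, 0.5, -1.0, -0.5, 0.0, 1.5, -2.0]
lam = solve_temperature(seg_adv, target_n_eff=4.0)
weights = segment_weights(seg_adv, lam)

print(effective_sample_size(weights))  # close to 4.0
print(kl_to_uniform(weights))          # implied KL to the data distribution

# Two toy segments with per-step log-probabilities under the current policy.
toy_loss = weighted_nll([[-1.0, -2.0], [-0.5, -0.5]], [0.75, 0.25])
print(toy_loss)  # 2.5
```

In this sketch, the effective sample size grows monotonically with λ, which is why a simple bisection suffices. A small target n_eff concentrates the update on the few best segments, and a target equal to the batch size recovers unweighted behavior cloning.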

Contributions

We analyze temporal credit assignment in PbRL through the lens of a training–inference distribution shift, identifying it as a core limitation of existing methods.
We propose PAWS, a segment-based preference learning method that aligns utility training with policy optimization, enabling reliable propagation of preference signals.
We introduce an intuitive, data-driven strategy for setting policy-optimization hyperparameters based on the effective sample size of preference-weighted data.
We validate PAWS on diverse simulated manipulation and locomotion tasks, with both oracle and real-human preferences, showing consistent gains over established baselines.

Results

Meta-World Manipulation

Figure: **Meta-World task success (%).** Success rates of PAWS and baselines for 50 and 500 preferences. PAWS (MLP) and PAWS (Transformer) achieve the highest average success and the largest improvement over Behavior Cloning across both preference budgets. In the low-data regime (50 preferences) several baselines even fall below Behavior Cloning, while PAWS keeps improving.

Locomotion

Real Human Preferences

Beyond oracle labels, we collected 50 pairwise comparisons per task from each of 10 non-author human labelers on Button Press and Door Open. Each seed was trained on the labels of a single distinct labeler, so the variance reflects both seed and labeler variability.

Figure: **Human-collected preferences.** Task success on Button Press and Door Open with preferences from 10 non-author participants. The results mirror the oracle setting: PAWS (MLP) achieves the highest success rate on both tasks, with PAWS (Transformer) second on Door Open.

Segment- vs. State-Action-Based Updates

This experiment makes the paper's core claim measurable: querying the same learned advantage function per step instead of per segment reintroduces the distribution shift and costs up to 15 points of success rate. Shortening segments at update time degrades performance monotonically, even when the advantage function itself was trained on long segments.

Figure: **Update granularity.** Aggregated success rates for segment-level versus state-action-level policy updates. Segment-level updates (PAWS) clearly beat state-action-level updates for both architectures (MLP and Transformer) and both preference budgets.

Figure: **Update-time segment length.** With a fixed advantage function trained on length-64 segments, success drops monotonically as update segments get shorter.
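One plausible way to picture the update-time segment length is as a chunking of the per-step advantages, where each step is assigned the summed advantage of the chunk it falls in. The helper below is a hypothetical illustration of that reading, not the paper's procedure. The name `shared_step_advantages` and the toy values are assumptions.

```python
def shared_step_advantages(per_step_adv, update_len):
    """Give every step the summed advantage of its update-time chunk.

    update_len == len(per_step_adv) corresponds to a full segment-level update,
    update_len == 1 to a per-step (state-action-level) update.
    """
    shared = []
    for start in range(0, len(per_step_adv), update_len):
        chunk = per_step_adv[start:start + update_len]
        shared.extend([sum(chunk)] * len(chunk))
    return shared

# Hypothetical per-step advantages for one length-4 segment.
per_step = [1.0, -1.0, 0.5, 0.5]

full = shared_step_advantages(per_step, 4)     # [1.0, 1.0, 1.0, 1.0]
halves = shared_step_advantages(per_step, 2)   # [0.0, 0.0, 1.0, 1.0]
stepwise = shared_step_advantages(per_step, 1) # [1.0, -1.0, 0.5, 0.5]
print(full, halves, stepwise)
```

As the chunks shrink, the signal each step receives depends more on how the model happened to distribute the advantage within the segment, which the preference loss never constrained.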

Ablations

To quantify temporal credit assignment directly, we also measure Spearman's rank correlation between each learned policy's action likelihoods and the expert's: segment-based updates preserve the expert's ranking far better (r_s = 0.22) than per-step updates (r_s = 0.05).
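For reference, Spearman's rank correlation is the Pearson correlation of the rank-transformed values. The sketch below is a dependency-free implementation with average ranks for ties. The likelihood values are made up for illustration and are not data from the paper.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical action likelihoods under a learned policy and under the expert.
policy_likelihoods = [0.1, 0.4, 0.2, 0.9, 0.5]
expert_likelihoods = [0.2, 0.3, 0.1, 0.8, 0.9]

r_s = spearman(policy_likelihoods, expert_likelihoods)
print(r_s)  # 0.8
```

A value near 1 means the learned policy orders actions the way the expert does, while a value near 0 means the ordering is essentially unrelated.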

BibTeX

@inproceedings{taranovic2026paws,
  title={PAWS: Preference Learning with Advantage-Weighted Segments},
  author={Aleksandar Taranovic and Onur Celik and Niklas Freymuth and Ge Li and Serge Thilges and Huy Le and Tai Hoang and Rania Rayyes and Gerhard Neumann},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=IPeIlnJzYa}
}