PAWS: Preference Learning with Advantage-Weighted Segments
Abstract
Preference-based reinforcement learning (PbRL) learns policies from human trajectory-level comparisons, avoiding explicit reward design and expert demonstrations. Existing methods typically train utility functions on trajectory- or segment-level preferences but rely on per-step utility estimates during policy optimization. This mismatch between training and inference induces a distribution shift that severely degrades temporal credit assignment and limits policy learning. We analyze this issue and propose Preference Learning with Advantage-Weighted Segments (PAWS), a segment-based preference learning method that performs policy updates directly using segment-level advantage functions. By aligning utility training with policy optimization, PAWS preserves trajectory-level preference information and avoids unreliable per-step learning signals. Experiments on simulated robotic manipulation and locomotion tasks demonstrate that PAWS consistently outperforms existing PbRL approaches, highlighting the importance of distribution-consistent preference learning.
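To make the segment-level idea concrete, the following is a minimal illustrative sketch and not the PAWS objective itself, which the abstract does not specify. It assumes a Bradley-Terry preference model over summed per-step utilities (a common PbRL choice, assumed here) and a generic advantage-weighted likelihood in which each whole segment receives a single weight. All function names, the sum aggregation, and the `temperature` parameter are hypothetical.

```python
import math


def segment_utility(step_utilities):
    """Aggregate per-step utilities into one segment-level utility.

    Summation is an assumed aggregation, chosen only for illustration.
    """
    return sum(step_utilities)


def preference_probability(utility_a, utility_b):
    """Bradley-Terry probability that segment A is preferred over segment B."""
    return 1.0 / (1.0 + math.exp(utility_b - utility_a))


def segment_weights(segment_advantages, temperature=1.0):
    """Normalized exponential weights, one per segment rather than per step."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_adv = max(segment_advantages)
    exps = [math.exp((a - max_adv) / temperature) for a in segment_advantages]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_segment_loss(segment_log_probs, segment_advantages, temperature=1.0):
    """Advantage-weighted negative log-likelihood over whole segments.

    segment_log_probs[i] is the policy's log-probability of segment i
    (hypothetically, the sum of its per-step action log-probabilities).
    """
    weights = segment_weights(segment_advantages, temperature)
    return -sum(w * lp for w, lp in zip(weights, segment_log_probs))
```

In this sketch the quantity used to weight the policy update lives at the same granularity as the preference labels, so no per-step utility estimate is ever consulted during the update.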
BibTeX
@inproceedings{taranovic2026paws,
  title={{PAWS}: Preference Learning with Advantage-Weighted Segments},
  author={Aleksandar Taranovic and Onur Celik and Niklas Freymuth and Ge Li and Serge Thilges and Huy Le and Tai Hoang and Rania Rayyes and Gerhard Neumann},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=IPeIlnJzYa}
}