PAWS: Preference Learning with Advantage-Weighted Segments

Published in ICML 2026

Existing preference-based RL methods train utility functions on trajectory- or segment-level preferences while relying on per-step utility estimates during policy optimization. This mismatch between training and inference induces a distribution shift that degrades temporal credit assignment. PAWS aligns utility training with policy optimization by performing policy updates directly with segment-level advantage functions, preserving trajectory-level preference information and avoiding unreliable per-step learning signals. Experiments on simulated robotic manipulation and locomotion tasks show that PAWS consistently outperforms existing preference-based RL approaches.
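
The abstract does not spell out the objective, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a Bradley-Terry preference model over segment-level utilities and an exponentiated-advantage weighting in the style of advantage-weighted regression; both choices, along with every function name and the `temperature` parameter, are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def bradley_terry_loss(u_a, u_b, pref):
    """Negative log-likelihood of pairwise preferences (assumed Bradley-Terry model).

    u_a, u_b: segment-level utilities for the two segments in each pair, shape (n,)
    pref:     1.0 where segment a is preferred, 0.0 where segment b is preferred
    """
    logits = u_a - u_b
    # log(sigmoid(x)) = -log(1 + exp(-x)), computed stably via logaddexp
    log_p_a = -np.logaddexp(0.0, -logits)
    log_p_b = -np.logaddexp(0.0, logits)
    return -np.mean(pref * log_p_a + (1.0 - pref) * log_p_b)


def segment_advantage_weights(seg_utils, temperature=1.0):
    """Normalized exponential weights from segment-level advantages.

    The advantage here is simply the segment utility minus the batch mean,
    a hypothetical baseline chosen for illustration.
    """
    adv = seg_utils - seg_utils.mean()
    z = adv / temperature
    z = z - z.max()  # shift for numerical stability before exponentiating
    w = np.exp(z)
    return w / w.sum()


def weighted_segment_objective(seg_log_probs, weights):
    """Weighted log-likelihood over whole segments.

    seg_log_probs: per-segment sum of log pi(a_t | s_t) over the segment's steps
    weights:       output of segment_advantage_weights
    """
    return float(np.sum(weights * seg_log_probs))
```

In this sketch the learning signal is attached to whole segments: each segment's summed log-probability is scaled by a single weight, so no per-step utility estimate enters the policy update.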

Recommended citation: Aleksandar Taranovic, Onur Celik, Niklas Freymuth, Ge Li, Serge Thilges, Huy Le, Tai Hoang, Rania Rayyes, Gerhard Neumann. (2026). "PAWS: Preference Learning with Advantage-Weighted Segments." International Conference on Machine Learning (ICML).