Scaffolding Dexterous Manipulation with Vision-Language Models
Published in NeurIPS 2025
This paper leverages vision-language models (VLMs) to encode commonsense spatial and semantic knowledge for dexterous manipulation. Given a task description and a visual scene, an off-the-shelf VLM identifies task-relevant keypoints and synthesizes 3D trajectories, which a residual RL policy learns to track. The method transfers to real-world robotic hands without any human demonstrations or handcrafted rewards.
Recommended citation: Vincent de Bakker, Joey Hejna, Tyler Ga Wei Lum, Onur Celik, Aleksandar Taranovic, Denis Blessing, Gerhard Neumann, Jeannette Bohg, Dorsa Sadigh. (2025). "Scaffolding Dexterous Manipulation with Vision-Language Models." Conference on Neural Information Processing Systems (NeurIPS).
