Scaffolding Dexterous Manipulation with Vision-Language Models

Published in NeurIPS 2025

This paper leverages vision-language models to encode commonsense spatial and semantic knowledge for dexterous manipulation. Given a task description and visual scene, an off-the-shelf VLM identifies task-relevant keypoints and synthesizes 3D trajectories, which a residual RL policy learns to track. The method transfers to real-world robotic hands without any human demonstrations or handcrafted rewards.

Recommended citation: Vincent de Bakker, Joey Hejna, Tyler Ga Wei Lum, Onur Celik, Aleksandar Taranovic, Denis Blessing, Gerhard Neumann, Jeannette Bohg, Dorsa Sadigh. (2025). "Scaffolding Dexterous Manipulation with Vision-Language Models." Conference on Neural Information Processing Systems (NeurIPS).
