This preprint explores how to augment our GUIDER framework with vision-language and language-only models to filter relevant objects and locations during collaborative manipulation. The goal is to let the robot reason about user prompts in natural language, refine its belief over tasks, and shift autonomy only when the model is confident about the operator’s intent.
Framework Highlights
- A perception stack (YOLO + Segment Anything) provides candidate object crops that a vision-language model scores against the operator’s prompt.
- A lightweight language model ranks detected object labels, supplying semantic priors that suppress irrelevant items (a minimal fusion sketch follows this list).
- Autonomy shifts happen once the combined belief across navigation and manipulation layers exceeds a threshold, triggering target selection and grasp execution.
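To make the scoring-and-fusion step concrete, the snippet below treats the VLM's crop-vs-prompt similarities as a likelihood and the language model's label ranking as a prior, then combines them multiplicatively. This is an illustrative sketch only: `vlm_score` and `llm_label_prior` are hypothetical stand-ins for whichever models are plugged in, and the softmax fusion is an assumption rather than the preprint's exact formulation.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fuse_object_relevance(crops, labels, prompt, vlm_score, llm_label_prior):
    """Fuse VLM crop scores with LLM label priors into one distribution.

    crops[i] and labels[i] describe the same detected candidate.
    vlm_score(crop, prompt) -> float similarity   (hypothetical callable)
    llm_label_prior(label, prompt) -> float score (hypothetical callable)
    """
    likelihood = softmax([vlm_score(c, prompt) for c in crops])
    prior = softmax([llm_label_prior(l, prompt) for l in labels])
    posterior = likelihood * prior        # Bayes-style elementwise fusion
    return posterior / posterior.sum()    # renormalize over candidates
```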
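The autonomy shift itself can then be read as a threshold test on the fused belief from the navigation and manipulation layers. The multiplicative combination and the fixed threshold below are again assumptions made for illustration, not the framework's actual decision rule.

```python
import numpy as np

def maybe_shift_autonomy(nav_belief, manip_belief, threshold=0.8):
    """Return the selected candidate index once the fused belief is confident.

    nav_belief and manip_belief are per-candidate probabilities from the two
    layers, aligned over the same candidate set. Returns None while the
    system should stay in shared/manual control.
    """
    combined = np.asarray(nav_belief, dtype=float) * np.asarray(manip_belief, dtype=float)
    combined /= combined.sum()
    best = int(np.argmax(combined))
    if combined[best] >= threshold:
        return best   # hand off to target selection and grasp execution
    return None
```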
Planned Evaluation
- Integration targets Isaac Sim with a Franka Emika arm on a Ridgeback base.
- Metrics will track how quickly the system adapts when the operator revises their intent mid-task (a simple version of such a measure is sketched after this list).
- Future work focuses on real-time performance and transparent explanations of VLM reasoning.
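One way to make the adaptation metric concrete is to measure the delay between a prompt revision and the moment the belief over the newly intended target crosses a confidence threshold. The logging format and threshold below are assumptions, not the preprint's evaluation protocol.

```python
def adaptation_time(belief_trace, revision_time, new_target, threshold=0.8):
    """Delay between an intent revision and confident belief in the new target.

    belief_trace: iterable of (timestamp, {candidate: probability}) pairs,
    ordered by time. Returns seconds elapsed, or None if the belief never
    converges on the revised target.
    """
    for t, belief in belief_trace:
        if t >= revision_time and belief.get(new_target, 0.0) >= threshold:
            return t - revision_time
    return None
```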
Why It Matters
Intent inference must go beyond geometric reasoning to truly assist humans. Because the framework blends multimodal language models with probabilistic planners, operators can describe missions naturally while the robot handles the heavy lifting of perception and decision-making.