KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
Source: arXiv:2604.07034v1 (cross-listed)
Abstract: We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird’s-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence.
Key Features of KITE
KITE builds its evidence from three components:
- Motion-Salient Keyframes: Each trajectory is distilled into a small set of keyframes chosen where motion is most salient, so the VLM sees a handful of informative moments instead of the full video.
- Open-Vocabulary Detections: Objects in each keyframe are detected from free-form text labels rather than a fixed class list, so the front-end is not tied to one set of object categories.
- Bird’s-Eye-View Representation: Each keyframe is paired with a schematic BEV that encodes relative object layout, axes, timestamps, and detection confidence.
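To make the keyframe idea concrete, here is a minimal sketch of motion-based keyframe selection. The summary does not say how KITE measures motion saliency, so plain frame differencing stands in for it here; the function names, the `min_gap` parameter, and the toy frames are all assumptions for illustration, not the authors' method.

```python
from typing import List, Sequence


def motion_scores(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor.

    Assumption: frame differencing is a stand-in saliency measure, not
    necessarily the criterion KITE uses.
    """
    scores = [0.0]  # the first frame has no predecessor
    for prev, curr in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, curr))
        scores.append(diff / max(len(curr), 1))
    return scores


def select_keyframes(scores: Sequence[float], k: int, min_gap: int = 1) -> List[int]:
    """Greedily pick up to k highest-scoring frame indices at least min_gap apart."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    for idx in ranked:
        if len(chosen) == k:
            break
        if all(abs(idx - c) >= min_gap for c in chosen):
            chosen.append(idx)
    return sorted(chosen)  # restore temporal order


# Toy trajectory: six flattened 4-pixel grayscale frames with two abrupt changes.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],  # first change
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],  # second change
    [1.0, 1.0, 1.0, 1.0],
]
keyframes = select_keyframes(motion_scores(frames), k=2, min_gap=2)
print(keyframes)  # [2, 4]
```

The `min_gap` constraint keeps the selected frames from clustering around a single burst of motion, which matters when only a few keyframes fit in the prompt.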
Unified Prompt for Enhanced Functionality
KITE combines its visual cues with text tokens in a single unified prompt, which consists of:
- Robot-profile tokens that describe the robot.
- Scene-context tokens that describe the environment.
- Keyframes and their paired BEV representations, which carry the evidence from the execution.
With this one prompt, an off-the-shelf VLM can perform failure detection, identification, localization, explanation, and correction.
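The prompt structure described above might be assembled roughly as follows. This is a sketch under stated assumptions: the tag names (`[ROBOT]`, `[SCENE]`, `<kf>`, `<obj>`), the dataclass fields, and the coordinate convention are hypothetical, and the BEV is serialized as text here purely for illustration, whereas the paper pairs each keyframe with a schematic BEV representation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    label: str         # open-vocabulary label, e.g. "red mug"
    x: float           # BEV position (assumed convention: x right, y forward)
    y: float
    confidence: float  # detector score in [0, 1]


@dataclass
class Keyframe:
    index: int
    timestamp_s: float
    detections: List[Detection]


def bev_tokens(kf: Keyframe) -> str:
    """Serialize one keyframe's layout as text (hypothetical token format)."""
    header = f"<kf id={kf.index} t={kf.timestamp_s:.2f}s axes=x_right,y_forward>"
    rows = [
        f"  <obj label='{d.label}' x={d.x:+.2f} y={d.y:+.2f} conf={d.confidence:.2f}>"
        for d in kf.detections
    ]
    return "\n".join([header, *rows, "</kf>"])


def build_prompt(robot_profile: str, scene_context: str,
                 keyframes: List[Keyframe], question: str) -> str:
    """Join robot, scene, per-keyframe evidence, and the task into one prompt."""
    parts = [
        f"[ROBOT] {robot_profile}",
        f"[SCENE] {scene_context}",
        *(bev_tokens(kf) for kf in keyframes),
        f"[TASK] {question}",
    ]
    return "\n".join(parts)


kf = Keyframe(index=3, timestamp_s=1.5,
              detections=[Detection("red mug", 0.30, -0.10, 0.91)])
prompt = build_prompt(
    "dual-arm manipulator",
    "tabletop with a mug",
    [kf],
    "Did the grasp fail? If so, at which keyframe?",
)
print(prompt)
```

Because only the `[TASK]` line changes between detection, localization, explanation, and correction queries, the same evidence block can be reused across all of them.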
Performance on RoboFAC Benchmark
On the RoboFAC benchmark, KITE improves substantially over the vanilla Qwen2.5-VL model in the training-free setting. Key results:
- Large gains in failure detection, identification, and localization on simulation data.
- Competitive performance against a RoboFAC-tuned baseline.
- Beyond the training-free setting, a small QLoRA fine-tune further improves explanation and correction quality.
Real-World Application
Qualitative results on real dual-arm robots show that KITE also works outside simulation as a structured, interpretable front-end for robot failure analysis.
Code, models, and further details are available on the KITE project page.
