KITE: Efficient Robot Failure Analysis with VLMs

KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

Summary: arXiv:2604.07034v1 (cross-listed)

Abstract: We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird’s-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence.
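The abstract does not say how motion saliency is measured, so the following is only a minimal sketch of one common stand-in: score each frame by its mean absolute pixel difference from the previous frame, then greedily keep the highest-scoring frames that are far enough apart in time. The function names, the flat grayscale frame format, and the `min_gap` parameter are assumptions made for illustration, not part of KITE.

```python
from typing import List, Sequence


def motion_scores(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor.

    Each frame is assumed to be a flat sequence of grayscale values.
    """
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        scores.append(diff / len(cur))
    return scores


def select_keyframes(scores: Sequence[float], k: int, min_gap: int = 1) -> List[int]:
    """Greedily pick up to k highest-scoring indices at least min_gap frames apart."""
    # Visit frames from most to least motion; ties keep their original order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    for i in order:
        if len(chosen) == k:
            break
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
    return sorted(chosen)  # return in temporal order


if __name__ == "__main__":
    toy_frames = [[0, 0], [0, 0], [10, 10], [10, 10], [0, 0], [0, 0]]
    print(select_keyframes(motion_scores(toy_frames), k=2, min_gap=2))
```

The minimum gap keeps the selection from clustering around a single burst of motion, which matters when the goal is a small set of frames that covers the whole trajectory.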

Key Features of KITE

KITE builds its evidence from three components:

Motion-Salient Keyframes: Each trajectory is distilled into a small set of frames where motion is most salient, so analysis focuses on the moments that matter rather than the full video.
Open-Vocabulary Detections: Each keyframe is annotated with open-vocabulary object detections, so the pipeline is not restricted to a fixed set of object classes.
Bird’s-Eye-View Representation: Each keyframe is paired with a schematic BEV that encodes relative object layout, axes, timestamps, and detection confidence.
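To make the BEV idea concrete, here is a hedged sketch that serializes one keyframe's detections into short text lines carrying the fields named in the abstract: axes, timestamp, relative layout, and detection confidence. The source describes a schematic BEV but not its format, so the `Detection` type, the ground-plane coordinates, the axis convention, and the `min_conf` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    label: str
    x: float  # assumed BEV x-axis, metres, positive to the right
    y: float  # assumed BEV y-axis, metres, positive forward
    conf: float  # detector confidence in [0, 1]


def relative_layout(anchor: Detection, other: Detection) -> str:
    """Describe where `other` sits relative to `anchor` in the BEV plane."""
    dx, dy = other.x - anchor.x, other.y - anchor.y
    side = "right of" if dx > 0 else "left of" if dx < 0 else "aligned with"
    depth = "in front of" if dy > 0 else "behind" if dy < 0 else "level with"
    return f"{other.label} is {side} and {depth} {anchor.label}"


def bev_lines(timestamp_s: float, dets: List[Detection], min_conf: float = 0.3) -> List[str]:
    """Serialize one keyframe's layout; low-confidence detections are dropped."""
    lines = [f"t={timestamp_s:.2f}s axes=x:right,y:forward"]
    for d in dets:
        if d.conf >= min_conf:
            lines.append(f"{d.label} x={d.x:+.2f} y={d.y:+.2f} conf={d.conf:.2f}")
    return lines


if __name__ == "__main__":
    dets = [
        Detection("gripper", 0.0, 0.0, 0.95),
        Detection("cube", 0.2, -0.1, 0.80),
        Detection("bowl", 1.0, 1.0, 0.10),  # below threshold, omitted
    ]
    print("\n".join(bev_lines(1.5, dets)))
    print(relative_layout(dets[0], dets[1]))
```

A text form like this is only one option; a rendered schematic image would carry the same fields.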

Unified Prompt for Enhanced Functionality

KITE integrates various visual cues into a unified prompt, which consists of:

Robot-profile tokens that describe the robot’s characteristics.
Scene-context tokens that provide background information on the environment.
Keyframe and BEV representations that encapsulate vital execution data.

With this single prompt, an off-the-shelf VLM can be queried for failure detection, identification, localization, explanation, and correction.
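The prompt structure described above could be assembled roughly as follows. This is a sketch under stated assumptions: the bracketed tags (`[ROBOT]`, `[SCENE]`, `[KEYFRAME]`, `[BEV]`, `[QUESTION]`), the dataclass names, and the text-only representation are invented for illustration. In a real pipeline the keyframes and BEV schematics would more likely be passed to the VLM as images alongside the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KeyframeEvidence:
    timestamp_s: float
    detections: List[str]  # labels detected in this keyframe
    bev: str  # textual stand-in for the BEV schematic


@dataclass
class PromptInputs:
    robot_profile: str
    scene_context: str
    keyframes: List[KeyframeEvidence] = field(default_factory=list)


def build_prompt(inputs: PromptInputs, question: str) -> str:
    """Concatenate profile, scene, per-keyframe evidence, and the task question."""
    parts = [
        f"[ROBOT] {inputs.robot_profile}",
        f"[SCENE] {inputs.scene_context}",
    ]
    for i, kf in enumerate(inputs.keyframes):
        labels = ", ".join(kf.detections)
        parts.append(f"[KEYFRAME {i} t={kf.timestamp_s:.2f}s] {labels}")
        parts.append(f"[BEV {i}] {kf.bev}")
    parts.append(f"[QUESTION] {question}")
    return "\n".join(parts)


if __name__ == "__main__":
    inputs = PromptInputs(
        robot_profile="dual-arm manipulator",
        scene_context="tabletop with a cube",
        keyframes=[KeyframeEvidence(1.5, ["cube", "gripper"], "cube right of gripper")],
    )
    print(build_prompt(inputs, "Did the task fail?"))
```

Because only the final question line changes between tasks, the same evidence can be reused for detection, localization, or explanation queries.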

Performance on RoboFAC Benchmark

On the RoboFAC benchmark, KITE substantially improves over the vanilla Qwen2.5-VL model in a training-free setting. Key highlights include:

Large gains in simulation failure detection, identification, and localization.
Competitive performance compared to a RoboFAC-tuned baseline.
Further improvements in explanation and correction quality with a small QLoRA fine-tune.

Real-World Application

Qualitative results on real dual-arm robots illustrate KITE’s practicality as a structured, interpretable front-end for analyzing robot failures outside simulation.

For more information, code, and models, please visit our project page: KITE Project Page.
