KITE: Efficient Robot Failure Analysis with VLMs

KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

Source: arXiv:2604.07034v1 (cross-listed)

Abstract: We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird’s-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence.
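The abstract does not say how motion salience is measured. As a minimal sketch, assuming salience is approximated by the mean absolute pixel difference between consecutive grayscale frames, keyframe selection could look like the following. The names `motion_scores`, `select_keyframes`, and `min_gap` are illustrative and not taken from KITE.

```python
from typing import List, Sequence

# A grayscale frame as rows of pixel intensities (assumption for this sketch).
Frame = Sequence[Sequence[float]]


def motion_scores(frames: List[Frame]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor."""
    scores = [0.0]  # the first frame has no predecessor
    for prev, curr in zip(frames, frames[1:]):
        total = 0.0
        count = 0
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                total += abs(a - b)
                count += 1
        scores.append(total / count if count else 0.0)
    return scores


def select_keyframes(frames: List[Frame], k: int, min_gap: int = 1) -> List[int]:
    """Pick up to k frame indices with the highest motion score,
    keeping selected indices at least min_gap frames apart."""
    scores = motion_scores(frames)
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    for i in ranked:
        if len(chosen) == k:
            break
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
    return sorted(chosen)  # return in temporal order
```

The `min_gap` constraint is a design choice of this sketch: without it, the top-scoring frames tend to cluster around a single burst of motion and the selection loses temporal coverage.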

Key Features of KITE

KITE's front-end rests on three components:

  • Motion-Salient Keyframes: Each trajectory is distilled into a small set of frames chosen for motion salience, so analysis focuses on the significant events rather than the full video.
  • Open-Vocabulary Detections: Each keyframe carries open-vocabulary object detections, so recognized objects are not limited to a fixed label set.
  • Bird’s-Eye-View Representation: Each keyframe is paired with a schematic BEV that encodes relative object layout, axes, timestamps, and detection confidence.
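How KITE actually draws its BEV is not described here. The sketch below is only a text-grid stand-in that carries the same four pieces of information named above: relative layout, axes, a timestamp, and detection confidence. The `Detection` fields, the ground-plane coordinates, and the `render_bev` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    label: str         # open-vocabulary class name, e.g. "mug"
    x: float           # hypothetical ground-plane coordinate in metres, +x to the right
    y: float           # hypothetical ground-plane coordinate in metres, +y away from the robot
    confidence: float  # detector score in [0, 1]


def render_bev(dets: List[Detection], timestamp_s: float,
               extent_m: float = 1.0, cells: int = 5) -> str:
    """Render detections onto a cells x cells text grid covering
    [0, extent_m] on both axes, followed by a label/confidence legend.
    Markers are letters A, B, ..., so this sketch handles at most 26 detections."""
    grid = [["." for _ in range(cells)] for _ in range(cells)]
    legend = []
    for idx, d in enumerate(dets):
        # Clamp to the grid so out-of-range detections land on the border.
        col = min(cells - 1, max(0, int(d.x / extent_m * cells)))
        row = min(cells - 1, max(0, int(d.y / extent_m * cells)))
        marker = chr(ord("A") + idx)
        grid[row][col] = marker
        legend.append(f"{marker}={d.label} conf={d.confidence:.2f}")
    lines = [f"t={timestamp_s:.1f}s  axes: +x right, +y up"]
    # Print the farthest row first so +y points up on the page.
    for row in reversed(grid):
        lines.append(" ".join(row))
    lines.extend(legend)
    return "\n".join(lines)
```

Calling `render_bev([Detection("mug", 0.1, 0.1, 0.91), Detection("gripper", 0.9, 0.5, 0.80)], 3.2)` places `A` in the bottom-left cell and `B` on the right edge of the middle row, with the legend lines listing each label and its confidence.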

Unified Prompt for Enhanced Functionality

KITE combines these visual cues into a single unified prompt, which consists of:

  • Robot-profile tokens that describe the robot’s characteristics.
  • Scene-context tokens that provide background information on the environment.
  • Keyframe and BEV representations that encapsulate vital execution data.

With this prompt, an off-the-shelf VLM can handle failure detection, identification, localization, explanation, and correction.
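To make the prompt structure concrete, here is a minimal sketch of assembling the three token groups into one text prompt. The section tags (`[ROBOT]`, `[SCENE]`, `[KEYFRAME i]`, `[TASK]`), the dictionary keys, and the `build_prompt` function are assumptions for illustration. KITE's real prompt format is not given in this summary, and in practice the keyframe and BEV entries would likely be passed to the VLM as images rather than strings.

```python
from typing import Dict, List


def build_prompt(robot_profile: Dict[str, str], scene_context: str,
                 keyframes: List[Dict[str, str]], task: str) -> str:
    """Concatenate robot-profile, scene-context, and per-keyframe evidence
    into one text prompt, followed by the task question."""
    parts = ["[ROBOT]"]
    parts += [f"{key}: {value}" for key, value in robot_profile.items()]
    parts += ["[SCENE]", scene_context]
    for i, kf in enumerate(keyframes):
        # Each keyframe dict is expected to hold "timestamp", "detections", "bev".
        parts.append(f"[KEYFRAME {i} t={kf['timestamp']}]")
        parts.append(f"detections: {kf['detections']}")
        parts.append(f"bev: {kf['bev']}")
    parts += ["[TASK]", task]
    return "\n".join(parts)
```

Because only the final `[TASK]` line changes between detection, identification, localization, explanation, and correction, the same evidence block can be reused across all five tasks.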

Performance on RoboFAC Benchmark

On the RoboFAC benchmark, KITE substantially improves over vanilla Qwen2.5-VL in the training-free setting. The reported results include:

  • Large gains on simulation failure detection, identification, and localization.
  • Performance competitive with a RoboFAC-tuned baseline.
  • Further gains in explanation and correction quality after a small QLoRA fine-tune.

Real-World Application

Qualitative results on real dual-arm robots suggest that KITE is practical as a structured, interpretable front-end for robot failure analysis beyond simulation.

For more information, code, and models, see the KITE project page.

Lazarus Omolua, https://richlyai.com/blog