ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
In autonomous driving, Vision-Language Models (VLMs) combine visual perception with natural language reasoning to support decision-making. Deploying them is difficult, however, because multi-view camera systems and multi-frame video inputs impose heavy computational demands. A recent paper on arXiv addresses this problem with a framework called ST-Prune.
Existing token pruning methods, which reduce computational overhead by discarding visual tokens, have mostly been designed for single-image inputs. As a result, they fail to exploit the spatio-temporal redundancy inherent in dynamic driving scenes. ST-Prune addresses this limitation with a training-free framework built from two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP).
Key Features of ST-Prune
- Motion-aware Temporal Pruning (MTP): This module reduces temporal redundancy by prioritizing dynamic motion and recent frame content. It encodes motion volatility and temporal recency as soft constraints in the diversity selection objective, so that critical information such as dynamic trajectories is favored over static historical background.
- Ring-view Spatial Pruning (RSP): This module reduces spatial redundancy by exploiting the geometry of ring-view camera systems. It penalizes cross-view similarity, which removes duplicate projections and residual background that persist after temporal pruning and further shrinks the set of spatial tokens the model has to process.
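The two modules described above can be pictured with a small NumPy sketch. The article gives no formulas, so everything here is an assumption made for illustration: the function names `motion_aware_temporal_prune` and `ring_view_spatial_prune`, the weights `alpha`, `beta`, and `gamma`, the cosine-based motion score, the linear recency ramp, and the greedy diversity selection. It should be read as one plausible way to realize the idea, not as the authors' implementation.

```python
import numpy as np


def _normalize(x):
    # Unit-normalize the last axis so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


def _greedy_select(z, prior, keep):
    # Greedy diversity selection: repeatedly pick the token that is least
    # similar to the tokens already kept, plus an additive soft prior.
    keep = min(keep, z.shape[0])
    selected = [int(np.argmax(prior))]
    max_sim = z @ z[selected[0]]
    for _ in range(keep - 1):
        score = (1.0 - max_sim) + prior
        score[selected] = -np.inf  # never pick the same token twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, z @ z[nxt])
    return np.array(sorted(selected))


def motion_aware_temporal_prune(tokens, keep, alpha=0.5, beta=0.5):
    """tokens: (T, P, D) features of one camera over T frames, P patches.
    Returns sorted flat indices (t * P + p) of the kept tokens."""
    T, P, D = tokens.shape
    z = _normalize(tokens)
    # Assumed motion volatility: 1 - cosine similarity to the same patch
    # in the previous frame (the first frame gets zero motion).
    motion = np.zeros((T, P))
    motion[1:] = 1.0 - np.sum(z[1:] * z[:-1], axis=-1)
    # Assumed temporal recency: 0 for the oldest frame, 1 for the newest.
    recency = np.linspace(0.0, 1.0, T)[:, None]
    prior = (alpha * motion + beta * recency).reshape(-1)
    return _greedy_select(z.reshape(T * P, D), prior, keep)


def ring_view_spatial_prune(tokens, view_ids, keep, gamma=1.0):
    """tokens: (N, D) features, view_ids: (N,) camera index of each token.
    Returns sorted indices of kept tokens, penalizing similarity to
    already-kept tokens from ring-adjacent cameras."""
    z = _normalize(tokens)
    view_ids = np.asarray(view_ids)
    V = int(view_ids.max()) + 1
    keep = min(keep, z.shape[0])
    # Two views are neighbours on the ring if their indices differ by 1 mod V.
    diff = (view_ids[:, None] - view_ids[None, :]) % V
    adjacent = ((diff == 1) | (diff == V - 1)) & (diff != 0)
    sim = z @ z.T
    cross = np.where(adjacent, np.clip(sim, 0.0, None), 0.0)
    # Start from the token that is least duplicated in neighbouring views.
    selected = [int(np.argmin(cross.max(axis=1)))]
    max_sim = sim[selected[0]].copy()
    max_cross = cross[selected[0]].copy()
    for _ in range(keep - 1):
        score = (1.0 - max_sim) - gamma * max_cross
        score[selected] = -np.inf
        nxt = int(np.argmax(score))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, sim[nxt])
        max_cross = np.maximum(max_cross, cross[nxt])
    return np.array(sorted(selected))
```

In this sketch the temporal stage would run per camera and the spatial stage would then run over the survivors from all cameras. The priors are additive, so they bias the diversity selection rather than act as hard filters, which is one reading of "soft constraints".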
Performance Validation
ST-Prune was evaluated on four benchmarks covering perception, prediction, and planning tasks in autonomous driving. According to the reported results, it sets a new state of the art among training-free token pruning methods.
Even with up to 90% of tokens removed, ST-Prune delivers near-lossless performance, and on some metrics it surpasses the unpruned baseline. Its inference speed remains comparable to that of existing pruning methods, which makes it a practical option for real-time driving applications.
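To give a feel for what a 90% token reduction means, the arithmetic can be sketched as follows. The camera count, frame count, and tokens per image below are hypothetical values chosen for illustration and do not come from the paper. The helper `pruned_budget` is likewise an assumed name. The attention estimate relies only on the general fact that self-attention cost grows roughly quadratically with sequence length, and it ignores text tokens and the linear-cost layers.

```python
def pruned_budget(num_views, num_frames, tokens_per_image, keep_ratio):
    # Total visual tokens before pruning.
    total = num_views * num_frames * tokens_per_image
    # Tokens remaining after pruning (at least one is kept).
    kept = max(1, round(total * keep_ratio))
    # Rough ratio of self-attention cost over visual tokens, assuming
    # quadratic scaling in the number of tokens.
    attn_ratio = (kept / total) ** 2
    return total, kept, attn_ratio


# Hypothetical setup: 6 cameras, 4 frames, 576 tokens per image, keep 10%.
total, kept, attn_ratio = pruned_budget(6, 4, 576, 0.10)
print(total, kept, round(attn_ratio, 4))
```

Under these assumed numbers, 13824 visual tokens shrink to 1382, and the attention cost over visual tokens drops to about 1% of the original.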
Conclusion
As the automotive industry relies more heavily on AI for autonomous driving, ST-Prune marks a meaningful step in optimizing Vision-Language Models. By addressing both spatial and temporal redundancy, it improves computational efficiency while preserving critical scene information, which supports safer and more reliable autonomous navigation.
