Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models
arXiv:2505.12509v3
Abstract: Post-hoc explanations provide transparency and are essential for guiding model optimization, such as prompt engineering and data sanitation. However, applying model-agnostic techniques to Large Language Models (LLMs) is hindered by prohibitive computational costs, rendering these tools dormant for real-world applications. To revitalize model-agnostic interpretability, we propose a budget-friendly proxy framework that leverages efficient models to approximate the decision boundaries of expensive LLMs. We introduce a screen-and-apply mechanism to statistically verify local alignment before deployment. Our empirical evaluation confirms that proxy explanations achieve over 90% fidelity with only 11% of the oracle’s cost. Building on this foundation, we demonstrate the actionable utility of our framework in prompt compression and poisoned example removal. Results show that reliable proxy explanations effectively guide optimization, transforming interpretability from a passive observation tool into a scalable primitive for LLM development. Additionally, we open-source code and datasets to facilitate future research.
Introduction
Large Language Models (LLMs) are now widely used, but their scale and complexity make it hard to see how they arrive at a specific output. Developers and researchers need that visibility both to optimize models, for example through prompt engineering and data sanitation, and to address ethical concerns.
Challenges in Interpretability
Despite the importance of interpretability, applying model-agnostic techniques to LLMs presents several challenges:
- High Computational Costs: Model-agnostic methods typically explain a prediction by querying the model on many perturbed inputs, so each explanation multiplies the already high cost of an LLM call.
- Scalability Issues: As LLMs grow in size, the cost per query rises, and methods that depend on repeated queries become harder to run at scale.
- Real-world Applicability: Because of these costs, many model-agnostic techniques remain dormant in real-world applications.
Proposed Proxy Framework
To address these challenges, we propose a budget-friendly proxy framework that uses efficient models to approximate the decision boundaries of expensive LLMs. Explanations are computed on the proxy rather than on the LLM, which keeps the cost low while aiming for explanations that stay faithful to the LLM’s behavior.
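To make the idea concrete, the following is a minimal sketch of a model-agnostic, perturbation-based explanation computed against a cheap proxy. It uses leave-one-out occlusion as a stand-in technique; the source does not say which explanation method the framework uses. The names `occlusion_attributions` and `toy_proxy` are hypothetical, and `toy_proxy` is a keyword scorer that only imitates a small model.

```python
from typing import Callable, List


def occlusion_attributions(
    tokens: List[str],
    predict: Callable[[List[str]], float],
) -> List[float]:
    """Leave-one-out importance: score drop when each token is removed.

    `predict` is any black-box scoring function. In a proxy setup it would
    wrap the cheap proxy model instead of the expensive LLM.
    """
    base = predict(tokens)
    scores = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + tokens[i + 1:]
        scores.append(base - predict(perturbed))
    return scores


def toy_proxy(tokens: List[str]) -> float:
    """Hypothetical stand-in for a proxy model: keyword-based sentiment score."""
    weights = {"great": 0.4, "awful": -0.4}
    return 0.5 + sum(weights.get(t, 0.0) for t in tokens)


tokens = ["the", "movie", "was", "great"]
attributions = occlusion_attributions(tokens, toy_proxy)
most_important = tokens[max(range(len(tokens)), key=lambda i: attributions[i])]
```

Each explanation costs one query per token plus one baseline query, which is why running it against a cheaper model changes the budget so much.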
Screen-and-Apply Mechanism
Our framework also introduces a screen-and-apply mechanism, which statistically verifies that the proxy is locally aligned with the LLM before its explanations are used. This check provides evidence that insights derived from the proxy reflect the original LLM’s decisions, which makes the resulting explanations more reliable.
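One way such a screening step could look is sketched below. This is an assumption, not the paper's procedure: it compares oracle and proxy predictions on a small screening sample and applies the proxy only if a one-sided Wilson lower confidence bound on their agreement rate clears a threshold. The function names, the 0.8 threshold, and the confidence level are all illustrative choices.

```python
import math
from typing import Sequence


def wilson_lower_bound(successes: int, n: int, z: float = 1.645) -> float:
    """One-sided Wilson lower bound for a binomial proportion.

    z = 1.645 corresponds to roughly 95% one-sided confidence.
    """
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def screen_proxy(
    oracle_labels: Sequence[int],
    proxy_labels: Sequence[int],
    min_agreement: float = 0.8,  # assumed threshold, not from the source
) -> bool:
    """Return True if the proxy may be applied in this local region."""
    if len(oracle_labels) != len(proxy_labels):
        raise ValueError("label sequences must have the same length")
    agree = sum(o == p for o, p in zip(oracle_labels, proxy_labels))
    return wilson_lower_bound(agree, len(oracle_labels)) >= min_agreement


# Screening sample of 50 perturbed inputs (made-up labels for illustration).
oracle = [1] * 50
use_good_proxy = screen_proxy(oracle, [1] * 50)            # full agreement
use_bad_proxy = screen_proxy(oracle, [1] * 30 + [0] * 20)  # 60% agreement
```

Using a lower bound rather than the raw agreement rate makes the decision conservative when the screening sample is small.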
Empirical Evaluation
Our empirical evaluations demonstrate the effectiveness of the proposed framework:
- Proxy explanations achieve over 90% fidelity relative to the oracle, that is, the expensive LLM being explained.
- They do so at only 11% of the oracle’s cost, which makes large-scale use feasible.
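The two quantities above can be illustrated with a short sketch. The definitions here are assumptions made for illustration: fidelity is measured as top-k overlap between oracle and proxy attribution rankings, and relative cost counts proxy queries plus any oracle queries spent on screening. The paper may define both differently, and all numbers below are made up.

```python
from typing import Sequence


def top_k_overlap(
    oracle_scores: Sequence[float],
    proxy_scores: Sequence[float],
    k: int,
) -> float:
    """Fraction of the oracle's top-k features that the proxy also ranks top-k."""
    if k <= 0 or len(oracle_scores) != len(proxy_scores):
        raise ValueError("k must be positive and score lists equal in length")

    def top(scores: Sequence[float]) -> set:
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])

    return len(top(oracle_scores) & top(proxy_scores)) / k


def relative_cost(
    n_queries: int,
    proxy_cost: float,
    oracle_cost: float,
    n_screen_queries: int = 0,
) -> float:
    """Cost of the proxy pipeline as a fraction of explaining with the oracle."""
    proxy_total = n_queries * proxy_cost + n_screen_queries * oracle_cost
    oracle_total = n_queries * oracle_cost
    return proxy_total / oracle_total


# Hypothetical attribution scores and per-query prices.
fidelity = top_k_overlap([0.9, 0.1, 0.5, 0.05], [0.8, 0.2, 0.4, 0.0], k=2)
cost_ratio = relative_cost(1000, proxy_cost=0.02, oracle_cost=1.0, n_screen_queries=50)
```

In this accounting the screening queries on the oracle are part of the proxy pipeline's bill, so a very cheap proxy still pays a fixed verification overhead.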
Actionable Utility in Model Optimization
We demonstrate the framework’s actionable utility in two optimization tasks, where reliable proxy explanations guide the changes made:
- Prompt Compression: Shortening prompts by using explanations to decide which parts can be dropped.
- Poisoned Example Removal: Identifying and removing poisoned examples, a form of data sanitation.
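Both tasks can be sketched as simple filters driven by attribution scores. The code below is an illustration under stated assumptions, not the paper's method: `compress_prompt` keeps the tokens with the largest absolute attribution, and `remove_flagged_examples` drops examples whose hypothetical "harm score" (attribution toward an undesired output) reaches a threshold. The function names, the keep ratio, and the threshold are all invented for this sketch.

```python
from typing import List, Sequence


def compress_prompt(
    tokens: Sequence[str],
    attributions: Sequence[float],
    keep_ratio: float = 0.5,  # assumed budget, not from the source
) -> List[str]:
    """Keep the highest-attribution tokens, preserving their original order."""
    if len(tokens) != len(attributions):
        raise ValueError("tokens and attributions must have the same length")
    k = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: abs(attributions[i]), reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]


def remove_flagged_examples(
    examples: Sequence[str],
    harm_scores: Sequence[float],
    threshold: float,
) -> List[str]:
    """Drop examples whose harm score is at or above the threshold."""
    if len(examples) != len(harm_scores):
        raise ValueError("examples and harm_scores must have the same length")
    return [e for e, s in zip(examples, harm_scores) if s < threshold]


# Made-up scores, standing in for proxy explanations that passed screening.
compressed = compress_prompt(
    ["please", "classify", "this", "review"], [0.01, 0.6, 0.02, 0.4]
)
cleaned = remove_flagged_examples(
    ["example A", "example B", "example C"], [0.05, 0.9, 0.1], threshold=0.5
)
```

In practice the scores would come from explanations computed on the proxy, and the compressed prompt or cleaned example set would then be passed to the expensive LLM.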
Conclusion
Our study shows that reliable proxy explanations can effectively guide optimization, turning interpretability from a passive observation tool into a scalable primitive for LLM development. We open-source our code and datasets to facilitate future research.
