Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?
arXiv:2604.02532v1
Abstract: Post-hoc feature attribution methods are widely deployed in safety-critical vision systems, yet their stability under realistic input perturbations remains poorly characterized. Existing metrics evaluate explanations primarily under additive noise, collapse stability to a single scalar, and fail to condition on prediction preservation, conflating explanation fragility with model sensitivity.
We introduce the Feature Attribution Stability Suite (FASS), a benchmark that enforces prediction-invariance filtering; decomposes stability into three complementary metrics (structural similarity, rank correlation, and top-k Jaccard overlap); and evaluates across geometric, photometric, and compression perturbations. Evaluating four attribution methods (Integrated Gradients, GradientSHAP, Grad-CAM, LIME) across four architectures and three datasets (ImageNet-1K, MS COCO, and CIFAR-10), FASS shows that stability estimates depend critically on perturbation family and prediction-invariance filtering. Geometric perturbations expose substantially greater attribution instability than photometric changes, and without conditioning on prediction preservation, up to 99% of evaluated pairs involve changed predictions. Under this controlled evaluation, we observe consistent method-level trends, with Grad-CAM achieving the highest stability across datasets.
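To make the three perturbation families concrete, here is a minimal pure-Python sketch on a toy grayscale image stored as nested lists with values in [0, 1]. The transforms, their parameter values, and the names `translate`, `brightness_contrast`, `quantize`, and `PERTURBATIONS` are hypothetical illustrations, not the FASS implementation. In particular, uniform quantization is only a crude stand-in for JPEG compression, which a real benchmark would apply through an actual image codec.

```python
from typing import Callable, Dict, List

Image = List[List[float]]  # hypothetical toy format: grayscale, values in [0, 1], indexed [row][col]


def translate(img: Image, dx: int, dy: int, fill: float = 0.0) -> Image:
    """Geometric: shift content right by dx and down by dy pixels; exposed pixels get `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            src_r, src_c = r - dy, c - dx
            if 0 <= src_r < h and 0 <= src_c < w:
                out[r][c] = img[src_r][src_c]
    return out


def brightness_contrast(img: Image, brightness: float, contrast: float) -> Image:
    """Photometric: scale around the image mean (contrast), add an offset (brightness), clip to [0, 1]."""
    h, w = len(img), len(img[0])
    mean = sum(map(sum, img)) / (h * w)
    return [[min(1.0, max(0.0, (v - mean) * contrast + mean + brightness)) for v in row] for row in img]


def quantize(img: Image, levels: int) -> Image:
    """Compression stand-in: snap each pixel to one of `levels` evenly spaced gray levels (NOT real JPEG)."""
    if levels < 2:
        raise ValueError("levels must be at least 2")
    step = 1.0 / (levels - 1)
    return [[round(v / step) * step for v in row] for row in img]


# Hypothetical registry: one example transform per perturbation family.
PERTURBATIONS: Dict[str, Callable[[Image], Image]] = {
    "geometric/translate": lambda im: translate(im, dx=1, dy=0),
    "photometric/brightness_contrast": lambda im: brightness_contrast(im, brightness=0.1, contrast=1.2),
    "compression/quantize": lambda im: quantize(im, levels=4),
}
```

Each family changes the image differently: translation moves pixel content to new locations, while the photometric and quantization transforms keep every pixel in place and alter only its value.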
Key Features of the Feature Attribution Stability Suite
- Prediction-Invariance Filtering: Only original/perturbed input pairs on which the model's prediction is unchanged are scored, so that a change in the attribution reflects fragility of the explanation rather than a change in what the model predicts.
- Decomposed Stability Metrics: Rather than collapsing stability to a single scalar, FASS reports three complementary metrics, each comparing the attribution for an original input with the attribution for its perturbed version:
  - Structural Similarity: How similar the two attribution maps are in spatial structure.
  - Rank Correlation: How consistently features are ordered by importance in the two attributions.
  - Top-k Jaccard Overlap: How much the sets of the k most important features in the two attributions overlap.
- Diverse Perturbation Types: Attributions are evaluated under three perturbation families:
  - Geometric perturbations (e.g., rotations, translations)
  - Photometric perturbations (e.g., brightness and contrast changes)
  - Compression perturbations (e.g., JPEG compression)
Findings from the Evaluation
The FASS evaluation produced three main findings:
- Geometric perturbations produced substantially greater attribution instability than photometric changes.
- Without conditioning on prediction preservation, up to 99% of evaluated pairs involved a changed prediction, so unfiltered stability scores conflate explanation fragility with model sensitivity.
- Under this controlled evaluation, Grad-CAM achieved the highest stability across datasets.
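Prediction-invariance filtering can be sketched as follows, assuming a classifier that returns one score per class and using the argmax as its prediction. `Pair`, `predicted_class`, `filter_prediction_invariant`, and the toy two-feature model are hypothetical; the summary does not say whether FASS compares top-1 labels or uses some other agreement criterion. Besides the kept pairs, the function returns the fraction of pairs whose prediction changed, which is the quantity the 99% finding refers to.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[float]], Sequence[float]]  # input -> per-class scores (hypothetical interface)


@dataclass
class Pair:
    original: Sequence[float]   # input before perturbation
    perturbed: Sequence[float]  # the same input after one perturbation


def predicted_class(model: Model, x: Sequence[float]) -> int:
    """Top-1 prediction: index of the highest score (the first index wins on ties)."""
    scores = model(x)
    return max(range(len(scores)), key=lambda i: scores[i])


def filter_prediction_invariant(model: Model, pairs: List[Pair]) -> Tuple[List[Pair], float]:
    """Keep only pairs whose top-1 prediction is unchanged; also return the changed fraction."""
    kept = [p for p in pairs if predicted_class(model, p.original) == predicted_class(model, p.perturbed)]
    changed_fraction = (len(pairs) - len(kept)) / len(pairs) if pairs else 0.0
    return kept, changed_fraction


# Toy usage: a two-class "model" whose class scores are simply the two input features.
toy_model: Model = lambda x: [x[0], x[1]]
pairs = [
    Pair([1.0, 0.0], [0.9, 0.1]),  # class 0 -> class 0: kept
    Pair([1.0, 0.0], [0.0, 1.0]),  # class 0 -> class 1: dropped
    Pair([0.0, 1.0], [0.2, 0.8]),  # class 1 -> class 1: kept
    Pair([0.6, 0.4], [0.4, 0.6]),  # class 0 -> class 1: dropped
]
kept, changed = filter_prediction_invariant(toy_model, pairs)
```

Here half of the pairs are discarded (`changed == 0.5`). Stability metrics would then be averaged over `kept` only, and reporting `changed` alongside them shows how much of the raw data the filter removed.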
In summary, FASS evaluates the stability of post-hoc feature attributions with prediction-invariance filtering, three complementary metrics, and three perturbation families. Its results show that stability estimates depend critically on both the perturbation family and whether predictions are held fixed, which matters for anyone comparing attribution methods in explainable AI.
