Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
Explainable AI (XAI) has become a priority as AI systems take on consequential decisions, and Shapley values have emerged as a prominent tool for attributing a model's output to its input features. However, the growing number of Shapley variants has fragmented XAI methodology, making it hard to reach consensus on which formulation to deploy in critical applications.
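For readers unfamiliar with the underlying quantity, here is a minimal sketch of the classical Shapley value, computed exactly by enumerating coalitions. It is illustrative only: `exact_shapley` and `toy_value` are hypothetical names, the toy value function is invented, and none of this comes from the paper. Shapley variants typically differ in how the value function treats features outside the coalition; the article does not list the eight formulations, so none is reproduced here.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition.

    Cost grows exponentially with n_features, so this is only
    feasible for very small feature sets.
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Classical Shapley weight: |S|! (n - |S| - 1)! / n!
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of feature i to coalition s
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(coalition):
    """Hypothetical value function: additive effects plus one interaction."""
    base = {0: 1.0, 1: 2.0, 2: 0.5}
    value = sum(base[j] for j in coalition)
    if 0 in coalition and 1 in coalition:
        value += 1.0  # interaction between features 0 and 1
    return value


phi = exact_shapley(toy_value, 3)
print(phi)
```

In this toy game the interaction term is split evenly between features 0 and 1, and the attributions sum to the value of the full coalition (the efficiency property).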
A new study posted on arXiv (arXiv:2604.22662v1) argues for a comprehensive evaluation framework aligned with human decision-making needs in high-stakes environments. It evaluates eight Shapley value formulations in operational risk workflows, with particular attention to fraud detection.
Key Findings and Methodology
The authors used a unified amortized framework to assess the semantic differences among the Shapley variants and to examine how those differences manifest under the low-latency constraints typical of risk management. The large-scale empirical evaluation included:
- Four distinct risk datasets
- A realistic fraud detection environment
- 3,735 case reviews conducted by professional analysts
The analysis revealed a fundamental misalignment between standard quantitative metrics and human-perceived clarity. Metrics such as sparsity and faithfulness, while theoretically important, did not correlate well with how analysts perceived the explanations.
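One way to probe this kind of misalignment is a rank correlation between a quantitative metric and analysts' clarity ratings. The sketch below implements Spearman's rank correlation with the standard library only. The function names and the example numbers are hypothetical, and the article does not say which statistic the authors used.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-explanation scores, invented for illustration only
sparsity_score = [0.9, 0.7, 0.8, 0.4, 0.6]
clarity_rating = [3, 4, 2, 4, 5]  # e.g. analyst ratings on a 1-5 scale
rho = spearman(sparsity_score, clarity_rating)
print(rho)
```

A coefficient near zero or below would indicate that the metric does not track perceived clarity in that sample; significance testing is omitted here for brevity.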
Implications for Future Research and Practice
One of the most striking outcomes was that explanations consistently increased analysts' decision confidence even though objective analyst performance did not improve under any formulation. This raises concerns about automation bias in high-stakes settings, where overreliance on AI-generated explanations could lead to critical errors.
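The pattern described above, confidence rising while accuracy stays flat, can be checked with a simple per-condition summary. The following is a minimal sketch with a hypothetical `Review` record and invented data. It assumes confidence is self-reported on a 0 to 1 scale, which the article does not specify.

```python
from dataclasses import dataclass


@dataclass
class Review:
    condition: str     # hypothetical labels, e.g. "no_explanation"
    correct: bool      # whether the analyst's decision matched ground truth
    confidence: float  # self-reported confidence, assumed scaled to 0..1


def summarize(reviews):
    """Compute accuracy, mean confidence, and their gap per condition."""
    by_condition = {}
    for review in reviews:
        by_condition.setdefault(review.condition, []).append(review)
    summary = {}
    for condition, rows in by_condition.items():
        accuracy = sum(r.correct for r in rows) / len(rows)
        mean_confidence = sum(r.confidence for r in rows) / len(rows)
        summary[condition] = {
            "accuracy": accuracy,
            "mean_confidence": mean_confidence,
            # Positive gap: analysts are more confident than they are accurate
            "confidence_gap": mean_confidence - accuracy,
        }
    return summary


# Invented data for illustration; not taken from the study
reviews = [
    Review("no_explanation", True, 0.5),
    Review("no_explanation", True, 0.5),
    Review("no_explanation", False, 0.5),
    Review("no_explanation", False, 0.5),
    Review("with_explanation", True, 0.75),
    Review("with_explanation", True, 0.75),
    Review("with_explanation", False, 0.75),
    Review("with_explanation", False, 0.75),
]
summary = summarize(reviews)
print(summary)
```

In this invented sample, accuracy is identical across conditions while the confidence gap widens with explanations, which is the signature one would look for when screening for automation bias.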
The authors argue that current evaluation proxies, which rely heavily on quantitative assessments, are inadequate for predicting the real-world impact of AI explanations on human decision-making. They call for a shift toward human-centered evaluation metrics that capture how explanations influence analyst behavior and decision outcomes.
Recommendations for Operational Decision Systems
Based on their findings, the researchers offer several recommendations for organizations looking to implement XAI in operational decision systems:
- Prioritize human-centric evaluation metrics that assess clarity, relevance, and decision utility.
- Conduct user studies that involve professionals in relevant fields to gather qualitative feedback on AI explanations.
- Continuously iterate and refine Shapley formulations based on empirical findings to enhance alignment with human cognitive processes.
- Foster interdisciplinary collaboration to bridge the gap between theoretical AI research and practical application in high-stakes environments.
In conclusion, the study underscores the need to rethink how XAI systems are evaluated, particularly in high-stakes settings. Placing human decision-making at the center of evaluation frameworks could make AI systems more effective and mitigate the risks of automation bias.
