FaCT: Faithful Concept Traces for Explaining Neural Network Decisions
arXiv:2510.25512v2
Introduction
Deep neural networks deliver strong performance across tasks such as image recognition and natural language processing, but understanding their inner workings remains a significant challenge. Post-hoc concept-based approaches aim to explain model behavior in terms of higher-level concepts, yet many of them are not faithful to the computations the model actually performs.
The Challenge of Faithfulness
One of the main issues with existing concept-based explanations is their reliance on restrictive assumptions about the concepts learned by neural networks. Common limitations include:
- Class-specificity: Concepts are often tied to specific classes, which can limit their generalizability.
- Small spatial extent: Concepts are assumed to occupy only small regions of the input, so concepts that depend on larger context are missed.
- Alignment to human expectations: Many methods assume that learned concepts match a predefined, human-understandable notion, which may not reflect what the model actually encodes.
Introducing Faithful Concept Traces (FaCT)
In response to these challenges, our research emphasizes the importance of faithfulness in concept-based explanations. We present a model whose mechanistic concept explanations are built into the network architecture itself rather than fitted after training. Key features of our approach include:
- Shared Concepts: Unlike traditional methods that create class-specific concepts, our model allows concepts to be shared across multiple classes, enhancing interpretability and applicability.
- Layer-wise Contribution Tracing: From any layer of the network, the contribution of each concept to the final decision (logit) can be traced faithfully, and the concept can be visualized on the input.
- Concept-Consistency Metric: Leveraging foundation models, we propose a new concept-consistency metric, C2-Score, which can be used to evaluate and compare concept-based methods.
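To make the idea of tracing concept contributions to a logit concrete, here is a minimal toy sketch. It assumes, purely for illustration, that a layer's feature vector is a weighted sum of concept directions and that the class logit is a linear read-out of those features. The function names, the linear head, and all numbers are hypothetical and are not taken from the FaCT architecture. Under these assumptions, the per-concept contributions add up exactly to the logit (minus the bias), which is the sense of "faithful" the sketch is meant to convey.

```python
# Toy illustration only: a linear read-out over a concept decomposition.
# Names, shapes, and values are hypothetical, not the FaCT implementation.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def concept_contributions(activations, directions, class_weights):
    """Contribution of each concept to one class logit (bias excluded)."""
    return [a * dot(class_weights, d) for a, d in zip(activations, directions)]

def reconstruct_features(activations, directions):
    """Feature vector as the activation-weighted sum of concept directions."""
    dim = len(directions[0])
    return [sum(a * d[i] for a, d in zip(activations, directions))
            for i in range(dim)]

# Three hypothetical concepts living in a 2-dimensional feature space.
activations = [2.0, 0.0, 1.5]                      # how strongly each concept fires
directions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one direction per concept
class_weights = [0.5, -1.0]                        # linear head for a single class
bias = 0.25

contribs = concept_contributions(activations, directions, class_weights)
features = reconstruct_features(activations, directions)
logit = dot(class_weights, features) + bias

# The contributions account for the whole logit; nothing is left unexplained.
print(contribs, logit, sum(contribs) + bias)
```

In this toy setting the second concept is inactive and contributes nothing, the first pushes the logit up, and the third pushes it down. A real network is non-linear, so obtaining such an exact decomposition there depends on the architecture, which is the point of building the explanation into the model.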
Results and Evaluation
Our experiments show that the concepts produced by our model are quantitatively more consistent than those of prior work, and that users find them more interpretable. These gains come without sacrificing predictive performance, so practitioners can inspect how a network reaches its decisions while keeping an accurate model.
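As a rough intuition for what "quantitatively consistent" can mean for a concept, the sketch below scores a concept by the mean pairwise cosine similarity of embeddings of the inputs on which it fires. This is a generic stand-in and is not the definition of C2-Score, which this summary does not spell out. The embeddings are made-up 2-dimensional vectors standing in for features that would come from a foundation model, and the function names are hypothetical.

```python
# Generic consistency stand-in, NOT the C2-Score definition.
# Embeddings below are made-up placeholders for foundation-model features.
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    num = sum(x * y for x, y in zip(u, v))
    den = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return num / den if den else 0.0

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# A concept that always fires on similar content vs. one that fires on a mix.
consistent_concept = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
mixed_concept = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

score_consistent = mean_pairwise_cosine(consistent_concept)
score_mixed = mean_pairwise_cosine(mixed_concept)
print(score_consistent, score_mixed)
```

A concept whose activating inputs all look alike in embedding space scores close to 1, while a concept that fires on unrelated content scores lower. Any real metric would also need to decide which inputs or regions count as activating, which this sketch leaves out.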
Conclusion
In summary, the FaCT model is a step towards faithful concept-based explanations of neural networks. Its concepts are shared across classes, and their contributions can be traced from any layer to the final logit and back to the input. Together with the C2-Score metric, this gives a concrete basis for evaluating and improving concept-based explanation methods.
