Attribution Explanations for Deep Neural Networks: A Theoretical Perspective
Authors
Abstract
Attribution explanation is a prominent approach to interpreting deep neural networks (DNNs), aiming to quantify the contribution of each input variable to the model's prediction. Despite extensive methodological development, a fundamental faithfulness problem remains unresolved: whether existing attribution methods faithfully reflect the true decision-making logic of DNNs. This uncertainty significantly limits their reliability and practical adoption. The concern stems largely from three core challenges: the lack of a unified theoretical framework, of clear theoretical rationales for individual methods, and of principled faithfulness evaluation in the absence of ground-truth attributions. Recently, a growing body of theoretical studies has begun to address these challenges, marking an important shift toward a principled understanding of attribution methods. In this survey, we provide a comprehensive review of these advances, with particular emphasis on three interconnected directions: (i) Theoretical unification, which uncovers key commonalities and differences among attribution methods; (ii) Theoretical rationale, which clarifies the mathematical and conceptual justifications underlying existing methods; and (iii) Theoretical evaluation, which rigorously establishes whether attribution methods satisfy accepted faithfulness principles. Beyond this review, we offer practical recommendations and a case study illustrating how theoretical findings can be translated into operational decision rules for method design, selection, and usage. We conclude by discussing promising open problems for future work.
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
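To make the notion of an attribution score concrete, the following is a minimal sketch (not taken from the paper) of one representative method in this family, Integrated Gradients, applied to a toy model. The `model` function, the finite-difference `grad` helper, and the step count of 64 are illustrative assumptions standing in for a trained DNN and its autodiff gradients.

```python
import numpy as np

def model(x):
    # Toy stand-in for a trained DNN: a fixed nonlinear scalar function.
    w = np.array([0.7, -1.2, 0.5])
    return np.tanh(x @ w)

def grad(f, x, eps=1e-5):
    # Central finite differences as a stand-in for autodiff gradients.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def integrated_gradients(f, x, baseline=None, steps=64):
    # Attribute f(x) - f(baseline) to each input variable by averaging
    # gradients along the straight-line path from baseline to x.
    if baseline is None:
        baseline = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint Riemann sum
    total = np.zeros_like(x)
    for a in alphas:
        total += grad(f, baseline + a * (x - baseline))
    return (x - baseline) * total / steps

x = np.array([1.0, 0.5, -2.0])
attr = integrated_gradients(model, x)
print("attributions:", attr)
# Completeness check: attributions should sum to f(x) - f(baseline),
# one of the faithfulness principles the survey's evaluation axis examines.
print("sum:", attr.sum(), "vs", model(x) - model(np.zeros_like(x)))
```

The completeness check at the end illustrates the kind of axiomatic property that theoretical evaluation studies verify formally rather than empirically.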