Abstract
End-to-end neural Natural Language Processing (NLP) models are notoriously difficult to understand. This has given rise to numerous efforts towards model explainability in recent years. One desideratum of model explanation is faithfulness, that is, an explanation should accurately represent the reasoning process behind the model’s prediction. In this survey, we review over 110 model explanation methods in NLP through the lens of faithfulness. We first discuss the definition and evaluation of faithfulness, as well as its significance for explainability. We then introduce recent advances in faithful explanation, grouping existing approaches into five categories: similarity-based methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. For each category, we synthesize its representative studies, strengths, and weaknesses. Finally, we summarize their common virtues and remaining challenges, and reflect on future work directions towards faithful explainability in NLP.
1. Introduction
Since the birth of deep learning, end-to-end neural language models (LMs) have achieved remarkable success in a wide range of Natural Language Processing (NLP) tasks (Clark et al. 2018; Wang et al. 2019b, a; Zellers et al. 2019; Cobbe et al. 2021; Hendrycks et al. 2021; Sakaguchi et al. 2020, i.a.). However, they largely remain black boxes to humans, that is, they lack explainability. This hinders their real-life application, especially in high-stakes scenarios where user trust is crucial. To address this issue, a plethora of explanation methods have been developed to shed light on how these LMs work. However, there are still many open questions in this field: How do we objectively evaluate the quality of explanation methods? How do we choose the most appropriate one(s) for a given use case? Does interpretability come at the cost of performance?
In this survey, we review over 110 model explanation methods through the lens of faithfulness, a fundamental principle of explanations. Faithfulness refers to the extent to which an explanation accurately reflects a model’s reasoning process (Jacovi and Goldberg 2020). In other words, an explanation should not “lie” about the underlying mechanism at work. Explanations that lack faithfulness can be dangerous, especially when they still appear plausible, that is, convincing to humans. This can mislead users into over-trusting the model even if it has unwanted biases. In a variety of tasks including recidivism prediction, college acceptance prediction, and credit prediction, it is shown that when a model relies on sensitive features like race and gender, an unfaithful explanation can be exploited to hide these biases, leading users into believing that the model is innocuous (Pruthi et al. 2020; Slack et al. 2020).
Despite the critical nature of faithfulness, in practice, it is frequently either overlooked or conflated with other principles of explanation, predominantly plausibility (Jacovi and Goldberg 2020). Moreover, even in cases where faithfulness is deliberately pursued and assessed, there is still no established consensus on how to quantitatively measure it. In cases where faithfulness is evaluated, authors typically commit to a single evaluation method, sometimes underpinned by problematic assumptions (see Section 3.4). As a result, faithfulness evaluation results from various studies are often mutually incompatible, leading to a landscape of inconsistent findings.
Within the scope of this survey, we have chosen to spotlight faithfulness, as it is arguably the most important principle for explainability that remains to be formalized in NLP research. Most other principles, such as plausibility, have already received great attention in existing interpretability studies.1 Also, they typically have a relatively stable definition and established evaluation methods, setting them apart from the more complicated notion of faithfulness.
In this survey, we will first clarify the concept of faithfulness itself and synthesize its multifaceted evaluation methodologies. Following this, we intend to provide a thorough critique of existing model explanation methods in the context of faithfulness, elucidating their strengths and shortcomings. Finally, we will summarize their common merits and remaining challenges, and conclude with potential avenues for further improvement. We hope that this survey will contribute to fostering a more transparent and standardized field of interpretability research.
Target Audience.
For NLP students, researchers, and practitioners hoping to better understand the reasoning mechanism of their models, this survey will serve as an introductory manual of existing explanation methods and will help them choose the most suitable one(s) for their own use cases. For researchers interested in studying NLP interpretability, this survey will offer an accessible and comprehensive overview of state-of-the-art work in the field, laying the basis for further exploration. The challenges and future work directions identified in this survey will also pave the way for developing novel methodologies addressing the shortcomings of existing approaches.
Contributions.
In the field of NLP interpretability, a few existing surveys/reviews have predominantly focused on an individual class of techniques, such as SHAP-based methods (Mosca et al. 2022), saliency methods (Ding and Koehn 2021), neuron-level analysis (Sajjad, Durrani, and Dalvi 2022), and instance attribution methods (Pezeshkpour et al. 2021). More broadly, Belinkov and Glass (2019) provide a systematic overview of model analysis methods in NLP, with an emphasis on methods that uncover what knowledge is encoded in LMs, rather than why models make certain predictions. In addition, Danilevsky et al. (2020) present a comprehensive review of existing model explanation methods in NLP, categorized along five axes. In our survey, we aim to refine the taxonomy to be more complete and intuitive (see Table 1), as well as place a specific emphasis on faithfulness in reviewing current explainability methods.
In summary, our contributions are as follows:
We present a comprehensive review of over 110 model explanation methods, offering a broad perspective of the field.
We introduce a taxonomy that categorizes these methods into five families, which helps clarify previously ambiguous terminology.2
We delve into an in-depth discussion on the concept of faithfulness, with a particular focus on critically reviewing its various methods of evaluation.
This survey does not purport to:
Conduct an empirical faithfulness evaluation of all discussed methods: Currently, there is still no universally accepted standard for evaluating faithfulness. Instead of committing to a single standard, we critically examine all existing evaluation methodologies, highlighting their motivation, assumptions, and potential constraints. In the subsequent discussion of various explanation methods, we aim to introduce empirical results under each evaluation standard whenever possible.
Provide a simple answer to “which family of methods is the most faithful”: Given the diversity of methods in each family, it is impractical and unjust to provide a sweeping assertion of which family is categorically more faithful than others. Our objective instead is to elucidate (a) which families are intrinsically more/less driven by faithfulness, and (b) within each family, what faithfulness challenges each method confronts, and which methods are more faithful than others. In essence, we hope to convince the reader that when deciding which explanation methods to use in practice, it is recommended to choose those that are intrinsically motivated and empirically validated in terms of faithfulness, while still remaining aware of potential pitfalls highlighted afterwards.
Survey Organization.
The survey is structured as follows:
Section 2 introduces the general notion of explainability in NLP, including its definition, importance, characterizing properties, and principles.
Section 3 zooms in on faithfulness as a fundamental principle of model explanations, discussing its definition, importance, relationship with other principles, and possible means of evaluation.
Section 4 synthesizes existing model explanation methods in pursuit of faithfulness, by grouping them into five categories: similarity-based methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models.
Section 5 discusses their common virtues and current challenges, and identifies future work directions towards improving faithfulness in explainable NLP.
Section 6 concludes the survey.
Figure 1 shows a structured overview of the survey. To make it easier to navigate through each section, we provide a more detailed outline with method names and relevant pointers in the Supplementary Materials.
2. Explainability in NLP
We start by introducing the notion of explainability in NLP, discussing its definition and importance. To prepare for our analysis of model explanation methods in subsequent sections, we will also present a set of properties that can serve to categorize different methods, as well as several common design principles.
2.1 What Is Explainability?
In the context of machine learning, explainability (also referred to as interpretability3) stands for the extent to which the internal mechanics of a machine learning system can be presented in understandable terms to a human.
Despite this intuitive notion, explainability has no established technical definition in the community, which results in numerous papers “wielding the term in a quasi-scientific way,” essentially referring to different concepts (Lipton 2016; Doshi-Velez and Kim 2017; Miller 2017; Murdoch et al. 2019). We argue that the confusion mainly lies in the interpretation of two key concepts in the above definition: (a) what are the “internal mechanics” and (b) who is the “human.”
Internal Mechanics.
This can refer to either (i) what knowledge a model encodes, or (ii) why a model makes certain predictions.4 Existing work in NLP explainability can be categorized according to which of these questions it is addressing:
- (i)
The “what” type of work aims to accurately characterize the extent to which a model M encodes some target knowledge K, which can be linguistic, commonsense, world knowledge, and so forth. For example, given a Machine Translation (MT) model, does it implicitly capture linguistic knowledge such as the syntactic tree of the source sentence (Belinkov et al. 2020)? Previous research answers such questions with methods including but not limited to:
Probing classifiers (or auxiliary/diagnostic classifiers) (Veldhoen, Hupkes, and Zuidema 2016; Adi et al. 2017; Conneau et al. 2018);
Information-theoretic measurement (Voita and Titov 2020; Lovering et al. 2020);
Behavioral tests: including challenge sets (Levesque, Davis, and Morgenstern 2012; Gardner et al. 2020), adversarial attacks (Ebrahimi et al. 2018; Wallace et al. 2019a), and prompting (Petroni et al. 2019; Choenni, Shutova, and van Rooij 2021);
Visualization (Reif et al. 2019; Ethayarajh 2019; Karidi et al. 2021).
It should be noted that “what is known” does not imply “what is used.” For example, even if an MT model encodes knowledge about syntax trees, it does not necessarily use it in translation. In fact, it has been shown that LMs do incidentally encode linguistic features even when they are irrelevant to the end task labels (Ravichander, Belinkov, and Hovy 2021). This leads us to the “why” question.
- (ii)
The “why” type of work addresses the causal question of what factors (input features, model structures, decision rules, etc.) have led the model M to certain predictions Y. This line of research aims to establish causality between these potential factors and the prediction through various approaches. These approaches will be the main focus of this survey, since there already exists a comprehensive review of studies on the “what” question (Belinkov and Glass 2019).
Human.
This refers to the target audience of the explanation, which can include but is not limited to the model developers, fellow researchers, industry practitioners, or end users. Depending on the audience, the form and goal of the explanation can be entirely different. We will revisit this in Section 2.3.
Instantiating the initial definition with the two clarified concepts, this survey will cover work on NLP explainability in the following sense: the extent to which the reasons why a model makes certain predictions can be presented in understandable terms to a certain type of target audience.
2.2 Why Is Explainability Important?
Compared to classic machine learning models, end-to-end neural NLP models are intrinsically harder to understand in terms of their reasoning mechanism (Bommasani et al. 2021). For example, it is easy to interpret a decision tree, since every node denotes a decision rule. By contrast, it is much more opaque what a node/layer in a Neural Network (NN) represents and how it contributes to the prediction. Nevertheless, NNs have become the predominant paradigm in NLP research and are increasingly being adopted in real-life applications, which has driven the need to better understand their behavior. Concretely, we identify three key reasons why explainability is important:
First, explainability can allow us to discover artifacts in datasets. Solving the dataset does not mean learning the task, since there can often be unexpected shortcuts (e.g., statistical cues) in data creation. Models are surprisingly good at exploiting them (Kaushik and Lipton 2018; McCoy, Pavlick, and Linzen 2019; Geva, Goldberg, and Berant 2019, i.a.). For example, only using the hypothesis allows the model to perform decently on Natural Language Inference (NLI)5 datasets (Poliak et al. 2018). Explaining the contribution of various features to the prediction will help us discover such artifacts and create more reliable datasets.
Additionally, explainability can assist in diagnosing and debugging a model’s strengths and weaknesses. It allows us to find where a model succeeds or fails, and to fix the weaknesses before they can be exploited by adversaries. For example, if a model secretly relies on unwanted gender and race biases, we can diagnose and eliminate them through explanations (Ravfogel et al. 2020). Also, if a model is susceptible to subtle perturbations in the data, it is better to discover and guard against this during development, prior to deployment (Wallace et al. 2019a).
Finally, explainability may help calibrate user trust in high-stakes applications. In domains like health, law, finance, and human resources, an end user may not trust a model if it only provides a prediction but no explanation. For example, in computer-aided diagnosis, if an algorithm provides a prediction along with supporting evidence, such as relevant symptoms, it could be easier for human decision-makers to determine when to trust the model prediction and when to be skeptical. Empirical studies have found that explanation quality highly influences the level of user trust in the model decision (Kunkel et al. 2019; Ye and Durrett 2022). In some cases, however, it is also possible for users to blindly trust the model decision simply because of the presence (instead of the content) of the explanation (Bansal et al. 2021). Even worse, malicious actors can manipulate user trust by carefully designing misleading explanations (Lakkaraju and Bastani 2020). All of this underscores the need for rigorous evaluation of explanation methods and emphasizes the importance of fostering a broader understanding and literacy in AI interpretability among end users.
One important caveat lies in the interplay between explainability and performance. In some cases, it has been found that there exists an empirical trade-off between these two factors (Camburu et al. 2018; Narang et al. 2020; Subramanian et al. 2020; Hase et al. 2020, i.a.), where a more interpretable model can result in lower accuracy on the end task. Nonetheless, there are also studies that have shown otherwise, suggesting explanations can boost models’ performance across a variety of tasks, particularly under low-resource settings (Wei et al. 2022; Wang et al. 2022b, i.a.). We will delve into this topic in greater depth in Section 4.5.
2.3 Properties of Explanations
We propose to characterize model explanation methods in terms of the following set of properties:
- (a)
Time: when the explanation happens. An explanation can be post-hoc, that is, produced after the prediction: an opaque model is given, and an external method then explains its predictions. Alternatively, an explanation can be built-in, namely, produced at the same time as the prediction; models of this type are called “self-explanatory.”
- (b)
Model accessibility: what parts of the model the explanation method has access to. A black-box explanation can only see the model’s input and output, while a white-box explanation can additionally access the model-internal structures and representations.
- (c)
Scope: where the explanation applies in the dataset. A local explanation only explains a model’s behavior on a single data point (or a local vicinity of the data point), whereas a global explanation provides insights into the general reasoning mechanisms for the entire data distribution.
- (d)
Unit of explanation: what the explanation is in terms of. A prediction can be explained in terms of input features (Ribeiro, Singh, and Guestrin 2016), examples (Wallace, Feng, and Boyd-Graber 2018), neurons (Sajjad, Durrani, and Dalvi 2022), concepts6 (Rajagopal et al. 2021; Dalvi et al. 2022), feature interactions (Hao et al. 2021), or a combination of them (Jacovi et al. 2021).
- (e)
Form of explanation: how the explanation is presented. Typical forms include visualization (Li et al. 2016), importance scores (Arras et al. 2016), natural language (Kumar and Talukdar 2020), or causal graphs (Dalvi et al. 2021). Note that unit and form are different: To illustrate, for gradient methods in Table 1, the unit of explanation is input features, and the form is importance scores.
- (f)
Target audience: who the explanation is provided for. As mentioned in Section 2.1, different audience groups can have distinct goals with respect to model explanations. Model developers may want to debug the model; fellow researchers may want to find how the model can be extended/improved; industry practitioners may want to assess if the model complies with practical regulations; and end users may want to verify that they can safely rely on the model’s decisions.
As a preview, Table 1 compares the model explanation approaches to be discussed in terms of these properties.7 Note that many existing taxonomies do not explicitly state which properties are taken into account, thus often producing confusing terms. For instance, certain taxonomies juxtapose example-based, local, and global as three classes of explanations, but they are in fact not about the same property. As another example, the term saliency methods has been used to refer to “backpropagation-based methods” in our taxonomy. However, saliency only describes the form and the unit of explanation—importance scores of input features. Then, technically speaking, all methods in Table 1 have instances that can be called a saliency method. In our taxonomy, we characterize each method using all these properties, so as to provide a clearer comparison along each dimension.
2.4 Principles of Explanations
To motivate principled design and evaluation of model explanations, existing research identifies various principles that a good explanation should satisfy. We hereby provide a non-exhaustive synthesis of these principles.
Faithfulness (also referred to as fidelity or reliability): An explanation should accurately reflect the reasoning process behind the model’s prediction (Harrington et al. 1985; Ribeiro, Singh, and Guestrin 2016; Jacovi and Goldberg 2020). This is often considered the most fundamental requirement for any explanation, and sometimes used interchangeably with the term “interpretability” (Jain and Wallace 2019; Bastings and Filippova 2020; Jacovi and Goldberg 2020, i.a.). After all, what is an explanation if it lies about what the model does under the hood? An unfaithful explanation can look plausible to humans, but has little to do with how the model makes the prediction. For example, by looking at the attention weights8 of a sentiment classification model, it may be intuitive to interpret tokens with higher weights as “more important” to the prediction, whereas empirically it is questionable whether such a causal relation exists (Jain and Wallace 2019).
Plausibility (also referred to as persuasiveness or understandability): An explanation should be understandable and convincing to the target audience (Herman 2017; Jacovi and Goldberg 2020). This implies that plausibility depends on who the target audience is. For example, a relevance propagation graph across NN layers may be a perfectly understandable explanation for model developers, but not at all meaningful to non-expert end users.
Input Sensitivity: This term has been perceived in slightly different senses, but the general idea is that an explanation should be sensitive (respectively insensitive) to changes in the input that influence (respectively do not influence) the prediction. Specifically, it includes the following senses:
- (i)
Sundararajan, Taly, and Yan (2017): With respect to explanations whose unit is features and form is importance scores, if two inputs differ only at one feature and lead to different model predictions, then the explanation should assign non-zero importance to the feature.
- (ii)
Adebayo et al. (2018): If the input data labels are randomized, the explanation should also change. This is because label randomization does change model predictions, so the explanation should be sensitive.
- (iii)
Kindermans et al. (2019): If a constant shift is added to every input example in the training data, the explanation should not change. This is because the model is invariant to the constant shift, and thus the explanation should not be sensitive.
Model Sensitivity: Similar to Input Sensitivity, the term has been used in different senses. In general it means that an explanation should be sensitive (resp., insensitive) to changes in the model that influence (resp., do not influence) the prediction. Specifically, it includes the following senses:
- (i)
Sundararajan, Taly, and Yan (2017): An explanation should be insensitive to model implementation details that do not affect the prediction. More specifically, two models are functionally equivalent if their outputs are equal for all possible inputs, although their implementations might be different. The explanation should always be identical for functionally equivalent models. This principle is called Implementation Invariance.
- (ii)
Adebayo et al. (2018): If the model weights are randomized, the explanation should also change. Similar to Input Sensitivity Definition (ii), weight randomization does change model predictions, so the explanation should be sensitive.
Completeness: An explanation should comprehensively cover all relevant factors to the prediction (Sundararajan, Taly, and Yan 2017). More formally, for explanations in the form of importance scores, the importance of all features should sum up to some kind of “total importance” of the model output.9
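To make this requirement concrete, the following minimal sketch checks the completeness property for Integrated Gradients (Sundararajan, Taly, and Yan 2017), whose attributions are constructed to sum to the difference between the model output at the input and at a baseline. The small PyTorch network, input, and baseline are toy stand-ins for a real classifier and example, not an implementation from any particular paper.

```python
# Minimal sketch of a completeness check for Integrated Gradients, assuming a
# differentiable PyTorch model that maps a feature vector to a scalar score.
import torch

def integrated_gradients(model, x, baseline, steps=1000):
    # Riemann approximation of the path integral from the baseline to the input.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    path = baseline + alphas * (x - baseline)      # shape: (steps, n_features)
    path.requires_grad_(True)
    grads = torch.autograd.grad(model(path).sum(), path)[0]
    return (x - baseline) * grads.mean(dim=0)      # attribution per input feature

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))
x, baseline = torch.randn(4), torch.zeros(4)

attributions = integrated_gradients(model, x, baseline)
total = (model(x.unsqueeze(0)) - model(baseline.unsqueeze(0))).item()
# Completeness: attributions should (approximately) sum to the change in model output.
print(f"sum of attributions: {attributions.sum().item():.4f}, f(x) - f(baseline): {total:.4f}")
```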
Minimality (also referred to as compactness): An explanation should only include the smallest number of necessary factors (Halpern and Pearl 2005; Miller 2017). Intuitively, this is analogous to the Occam’s razor principle, which prefers the simplest theory among all competing ones.
Note that not all of these principles have an established technical definition or evaluation standard, and there might not be consensus on whether they are necessary.10 In this survey, we mainly focus on faithfulness as it is generally considered one of the most central requirements for explanations (Jain and Wallace 2019; Bastings and Filippova 2020; Jacovi and Goldberg 2020, i.a.), but we will still refer to the other principles when discussing relevant work in subsequent sections.
3. Faithfulness
Now, we will zoom in on faithfulness, a fundamental principle of model explanations, analyzing what it means, why it is important, its relationship with other principles, and how it can be measured.
3.1 Definition
As mentioned before, a faithful explanation should accurately reflect the reasoning process behind the model’s prediction (Harrington et al. 1985; Ribeiro, Singh, and Guestrin 2016; Jacovi and Goldberg 2020). This is only a loose description though; in fact, there is not yet a consistent and formal definition of faithfulness in the community. Instead, people often define faithfulness on an ad-hoc basis, in terms of different evaluation metrics, to be detailed in Section 3.4.
3.2 Importance
We believe that faithfulness is one of the most fundamental principles for explainability, as also established in previous work (Jain and Wallace 2019; Bastings and Filippova 2020; Jacovi and Goldberg 2020, i.a.). By definition (cf. Section 2.1), the goal of a model explanation is to reveal the mechanism behind the predictions in human-understandable terms. Faithfulness requires that the mechanism presented by the explanation is true to the model’s underlying reasoning process. In other words, if an “explanation” is unfaithful, it does not even satisfy the definition of an explanation.
In NLP specifically, there are two additional pieces of empirical evidence supporting the crucial role of faithfulness.
First, faithfulness establishes causality. In interpretability work, it is often implicitly assumed that “what is known by the model” is also “what is used by the model in making predictions.” However, this assumption is not sound. For example, Ravichander, Belinkov, and Hovy (2021) show that language models encode linguistic features like tense and number, although they are irrelevant to the label of an artificially crafted end task. This means that a model can encode more features than what eventually gets used. Therefore, findings from the “what is known by the model” type of work are correlational but not causal. To establish causality, we need faithful explanations of how the model makes predictions.
Second, an unfaithful explanation can be dangerous. Consider an explanation that is not faithful but extremely plausible. Now, even if the model makes a wrong prediction, users may still trust it simply because the explanation looks convincing (Bansal et al. 2021). For example, Pruthi et al. (2020) show that attention weights can be a deceptive explanation to end users. They train the model to attend minimally to gender-related tokens (he and she), thereby hiding the fact that it relies on gender bias in prediction. Users may still find it safe to deploy the model in real-world scenarios since it seems to be free from bias.
3.3 Relationship between Faithfulness and Other Principles
We now elucidate the relationship between faithfulness and several other principles introduced in Section 2.4 , since they are often implicitly conflated in the literature.
Faithfulness vs. Plausibility.
There is an intrinsic tension between these two notions. Think about an extreme case: If an “explanation” is just a copy of all model weights, it would be perfectly faithful but not at all plausible to any target audience. Now consider the other extreme: We could ask the target audience which features are most important when they themselves make a prediction, then simply copy their response as our “explanation” of how the model works. This “explanation” would be perfectly plausible since it fully matches the human reasoning process, but not at all faithful since it has nothing to do with how the model works. Therefore, plausibility does not guarantee faithfulness, and vice versa.
Moreover, when we observe that an explanation is implausible in human terms, there can be two possibilities: (a) the model itself is not reasoning in the same way as humans do, or (b) the explanation is unfaithful. For instance, if an explanation says that a sentiment classification model is mainly relying on function words like the, a instead of sentiment words like awesome, great, it could be that the model is truly relying on those uninformative words to make predictions (potentially because of spurious correlations in the dataset), or that the model is correctly relying on content words, but the explanation does not faithfully reflect the fact.
In practice, faithfulness has often been conflated with plausibility. For example, explanations that better align with human perception are often thought to be more faithful, while in fact, they are only more plausible. See Section 3.4.6 for more discussion.
Faithfulness vs. Input Sensitivity, Model Sensitivity, and Completeness.
These principles are sometimes seen as necessary conditions for faithfulness, though not always explicitly stated (Sundararajan, Taly, and Yan 2017; Kindermans et al. 2019; Yeh et al. 2019). Practically, they are often used to prove that an explanation is not faithful via counterexamples. For instance, given an explanation method, researchers run it on a model and check through all the principles. If any of them (for instance, input sensitivity) is violated, then the explanation method is said to be unfaithful. This practice is also known as sanity checks in the literature (Adebayo et al. 2018). We will revisit them in more detail in Section 3.4.1.
3.4 Evaluation of Faithfulness
As explained in previous sections, faithfulness does not have an established formal definition, but is usually defined ad-hoc during evaluation. However, the evaluation metrics are often not directly comparable with each other and yield inconsistent results, making it difficult to objectively assess progress in this field.
In their seminal opinion piece, Jacovi and Goldberg (2020) outline five design principles of faithfulness evaluation metrics, three of which are the most important (and most ignored) in our view. The first principle is “be explicit in what you evaluate,” especially, do not conflate plausibility and faithfulness. Second, “faithfulness evaluation should not involve human judgment on explanation quality.” This is because humans do not know whether an explanation is faithful; if they did, the explanation would be unnecessary. Finally, faithfulness evaluation should not involve human-provided gold labels (for the examples to be explained). A faithful explanation method should be able to explain any prediction of the model, regardless of whether it is correct or not.11
With the above principles in mind, we review existing faithfulness evaluation methods, which we broadly categorize into: axiomatic evaluation, predictive power evaluation, robustness evaluation, perturbation-based evaluation, white-box evaluation, and human perception evaluation.
3.4.1 Axiomatic Evaluation
Axiomatic evaluation treats certain principles (also called axioms) (e.g., those from Section 2.4) as necessary conditions for faithfulness, and tests if an explanation method satisfies them. If it fails any test, then it is unfaithful. Existing tests focus on the following axioms: Model Sensitivity, Input Sensitivity, Polarity Consistency, Robustness Equivalence, and Feature Importance Agreement, among others.
Model Sensitivity.
One form of Model Sensitivity, as previously mentioned, is Implementation Invariance (Sundararajan, Taly, and Yan 2017). This means that two functionally equivalent models (which have the same outputs for all inputs) should have the same explanation. An assumption of this test is that two models are functionally equivalent only if they have the same reasoning process (Jacovi and Goldberg 2020). If this assumption holds, when a method provides different explanations for functionally equivalent models, it is unfaithful. However, we argue that this assumption may not be grounded. There do exist functionally equivalent models that rely on entirely different reasoning mechanisms, a trivial example of which is various sorting algorithms. The assumption that all of them have the same explanation is counter-intuitive. Therefore, we do not believe that Implementation Invariance is a necessary condition for faithfulness.
Another form of Model Sensitivity is that an explanation should be sensitive to meaningful changes in the model that influence the prediction (Adebayo et al. 2018). For instance, when model weights are randomized, the explanation should also change. We consider this as a more sensible check than Implementation Invariance because of the reason above.
Input Sensitivity.
This means that an explanation should be sensitive (resp., insensitive) to changes in the input that influence (resp., do not influence) the model prediction. We refer the reader back to Section 2.4, where we elaborate on how previous studies instantiate the notion differently (Sundararajan, Taly, and Yan 2017; Kindermans et al. 2018; Adebayo et al. 2018). In our view, all of them are reasonable tests of necessary conditions for faithfulness.
Polarity Consistency.
This axiom measures the consistency between importance-score-based explanations and the impact polarity on model predictions (Liu et al. 2022). For example, if an explanation method assigns a positive weight to a feature as its contribution to some predicted label, then after removing this feature, the model confidence in the label should be suppressed. More generally, the sign of importance scores should agree with the polarity of the confidence change. We consider this as a valid sanity check, with one caveat: Removing features may create nonsensical inputs, which decrease model prediction confidence solely because the inputs are out of distribution (OOD), but not because the removed features are important. We will come back to this issue when discussing perturbation-based evaluation.
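The following minimal sketch illustrates one possible instantiation of this check; the toy probability function and the importance scores are hypothetical placeholders rather than outputs of a real model or explanation method.

```python
# Minimal sketch of a polarity-consistency check: the sign of a feature's importance
# score should agree with the direction of the confidence change once that feature is
# removed. The toy probability function and scores are hypothetical placeholders.

def polarity_consistent(predict_proba, tokens, importance):
    base = predict_proba(tokens)
    for i, score in importance.items():
        reduced = tokens[:i] + tokens[i + 1:]   # remove feature i (may yield an OOD input!)
        delta = base - predict_proba(reduced)   # confidence drop after removal
        if score * delta < 0:                   # opposite signs: the axiom is violated
            return False
    return True

def toy_proba(tokens):                          # P(positive) grows with positive cue words
    return 0.5 + 0.2 * sum(t in {"great", "love"} for t in tokens)

tokens = ["this", "movie", "is", "great"]
print(polarity_consistent(toy_proba, tokens, {3: 0.9, 1: 0.0}))  # True for this toy setup
```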
Robustness Equivalence & Feature Importance Agreement.
Both axioms are proposed by Wiegreffe, Marasović, and Smith (2021) specifically to measure the association between model-predicted labels and free-text rationales. Robustness Equivalence means that the explanation and the predicted label should be equally robust (or non-robust) under noise. Feature Importance Agreement means that input tokens that are important (resp., unimportant) for label prediction should also be important (resp., unimportant) for explanation generation. We consider both as reasonable sanity checks. However, one concern with the latter axiom is that in the implementation of the test, token importance is computed with gradient-based methods, many of which are themselves known to be unreliable in terms of faithfulness.12 In other words, the Feature Importance Agreement test needs to rely on another explanation method to provide feature importance, and assumes that the method is faithful.
Note that all the above axioms only test for necessary instead of sufficient conditions of faithfulness. Passing all tests does not guarantee that an explanation is faithful. Therefore, axiomatic evaluation can only be used to disprove faithfulness via counterexamples. Still, practically speaking, the more tests an explanation method passes, the more confidence we can have in its faithfulness.
3.4.2 Predictive Power Evaluation
This type of evaluation uses the explanation to predict model decisions (instead of gold labels) on unseen examples, and considers a higher accuracy as an indicator of higher faithfulness. This accuracy is also called simulatability in the literature. Intuitively, the more faithful an explanation is, the more information it should contain about the model’s decision mechanism, and thus the easier it would be for an external observer, or simulator, to predict the model’s behavior based on the explanation. In practice, people often treat the difference in simulatability with and without the explanation as a measure of its faithfulness. The underlying assumption is that if an explanation leads to a different prediction than that made by the model it explains, then it is unfaithful (Jacovi and Goldberg 2020).
To implement this evaluation, we need a way to derive the prediction from a given explanation. In existing work, predictions are derived either directly from the explanation itself or with a simulator, which can be an external model or humans.
To derive the prediction directly from the explanation, the explanation should be an executable model itself capable of making predictions (e.g., decision trees or rule lists [Sushil et al. 2018]). Alternatively, it should be possible to define simple rules on top of the explanation to predict the label. For example, Ye, Nair, and Durrett (2021) compute a scalar (say, sum) on top of the importance scores assigned to some feature, and compare the scalar to a pre-defined threshold to predict the label. Another example is given by Sia et al. (2022), who derive the prediction from the explanation using logical rules, specifically in the context of NLI tasks.
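As a simplified illustration of this first strategy (in the spirit of, but not identical to, Ye, Nair, and Durrett 2021), the sketch below turns importance-score explanations into label predictions with a threshold rule and measures how often they recover the model's own predictions; the explanations, model predictions, and threshold are toy placeholders.

```python
# Minimal sketch: deriving label predictions directly from importance-score
# explanations via a threshold rule, then measuring simulatability.

def simulate_from_explanation(importance, threshold=0.0):
    # Predict "positive" if the summed token importance exceeds the threshold.
    return "positive" if sum(importance.values()) > threshold else "negative"

explanations = [
    {"great": 0.8, "love": 0.6, "movie": 0.05},  # model predicted: positive
    {"boring": -0.7, "plot": -0.1},              # model predicted: negative
]
model_predictions = ["positive", "negative"]

simulated = [simulate_from_explanation(e) for e in explanations]
# Simulatability: fraction of model predictions recovered from the explanations alone.
accuracy = sum(s == m for s, m in zip(simulated, model_predictions)) / len(simulated)
print(f"simulatability: {accuracy:.2f}")
```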
Using an external model as the simulator is similar to the idea of model distillation, where a student model (often smaller or simpler) learns to simulate the behavior of the original teacher model (the model to be explained) (Li et al. 2020; Hase et al. 2020; Pruthi et al. 2022). Here, the goal is not to obtain a good student model, but to evaluate the quality of given explanations in terms of how well they can help the student simulate the teacher.
Instead of external models, humans can also be considered as the simulator, since they are the eventual target audience. In this evaluation, human subjects are asked to simulate the model’s decision on new examples without access to the model itself, but only to the input and the explanation (Doshi-Velez and Kim 2017; Ribeiro, Singh, and Guestrin 2018; Chen et al. 2018; Hase and Bansal 2020; Zheng et al. 2022).
In our opinion, the first way (directly deriving the prediction from the explanation) is the most rigorous, in the sense that there is no external simulator as a potential confounding factor. However, it can be hard to generate large-scale and high-quality counterfactual examples to test simulatability. Existing work either creates them manually, which limits the scale of evaluation (Ye, Nair, and Durrett 2021), or automatically generates them with logical rules, which instead constrains the task scope (Sia et al. 2022).
When there is an external simulator, label leakage becomes an issue. This means that an explanation can trivially leak the label to the simulator with superficial regularities. For example, in the task of NLI, if a free-text explanation contains “not,” then the label is very likely contradiction. Therefore, simulatability may favor explanations that contain trivial cues to the label but are otherwise uninformative: As an extreme example, the explanation can be the label itself.
When an external model serves as the simulator, such leakage can be even harder to avoid, since the explanation and the model simulator can communicate in numerous human-imperceptible ways (e.g., a punctuation mark can be the cue for some class). In addition, it is also questionable how expressive the simulator model should be. If it is too expressive, the simulator can easily learn the label itself, and there will thus be little difference in simulatability with and without the explanation. This is analogous to the expressivity concern about probing classifiers raised by Hewitt and Liang (2019).
When the simulator is human, it is easier to control for leakage by filtering out visible cues. However, there are several other concerns. First, this evaluation mingles plausibility with faithfulness. If humans fail to simulate model predictions, then it could be that either the explanation is not plausible (to them) or that the explanation is unfaithful. Moreover, when simulating the model’s prediction, it is difficult to ensure that humans can eliminate their own judgments of what the gold label should be.
Note that label leakage does not mean that an explanation is necessarily bad. Oftentimes, explanations provided by humans also unavoidably leak the decision. For instance, how would we explain that a premise contradicts a hypothesis without using any negation-related expression at all? This means that we should not blindly reject all label-leaking explanations. Recent studies provide several alternatives. Hase et al. (2020) propose a variant of simulatability called Leakage-Adjusted Simulatability (LAS), which performs a macro-average of leaking and non-leaking explanations. However, this is effectively reduced to vanilla simulatability when most or all explanations are leaking. Pruthi et al. (2022) argue that to avoid leakage, explanations should only be provided to the simulator at training time but not at test time. In their evaluation protocol, during training, the simulator learns to predict the model’s decision, and explanations are provided as a multitask learning objective or attention regularization. Then, during inference, the simulator is expected to generalize to unseen examples without explanations. As a third alternative, Chen et al. (2022a) propose to measure the additional information provided by an explanation beyond what is already in the input or the label, with a metric called REV (Rationale Evaluation with conditional V-information). Suppose we are trying to explain why a model answers “enjoy nature” given the question “why do people go hiking?”. Under this metric, trivial explanations like “people go hiking to enjoy nature” will be penalized, although they result in high simulatability; whereas those like “hiking means the activity of going for long walks, especially across the country, or in nature ...” will be favored, as they provide extra information.
To summarize, we believe that the predictive power evaluation is a sensible test for faithfulness. However, when using an external simulator, we should take into account the label leakage issue. In addition, experimental design with human simulators should be cautious of the above-mentioned pitfalls.
3.4.3 Robustness Evaluation
Robustness evaluation measures whether the explanation is stable against subtle changes in the input examples, such as pixel perturbations that result in images that are indistinguishable from each other. In its earliest version, robustness requires that similar inputs (as perceived by humans) should have similar explanations (Alvarez-Melis and Jaakkola 2018). However, this does not rule out the possibility that the model itself can be non-robust to subtle input perturbations. Later work remedies this flaw by imposing constraints on model predictions. Now, robustness means that for similar inputs that have similar model outputs, the explanations should be similar (Ghorbani, Abid, and Zou 2019; Yeh et al. 2019; Ding and Koehn 2021; Zheng et al. 2022; Wang et al. 2022a; Yin et al. 2022). The underlying assumption is that on similar inputs, the model makes similar predictions only if the reasoning process is similar (Jacovi and Goldberg 2020).
We identify two problems with this evaluation. First, though the notion of “indistinguishable inputs” makes sense in vision, it is hard to apply in NLP since the input space is discrete. If we substitute tokens with semantically similar ones, we do not know whether their model representations are also indistinguishable. If we instead directly add a noise term to the model representation, then it may create OOD tokens that do not map to anything in the vocabulary. Second, the assumption mentioned in the above paragraph is questionable. Even though the inputs and outputs are similar, the model’s reasoning mechanism can still differ. As a simple example, consider two similar cat images, which differ only in the length of the cat’s fur. We observe that the model predicts cat for both images with similar confidence. Now, it is still possible that the model is relying on different features (e.g., body shape in the first image, and ear shape in the second) in the two predictions. There is theoretically nothing preventing the model from doing so. Ju et al. (2022) provide empirical evidence for this in realistic scenarios, demonstrating that rationale-based models can predict the same label for semantically similar inputs, yet use different reasoning. Thus, if we observe that an explanation is not robust, we cannot conclude if it is because the explanation is unfaithful or because the model’s reasoning mechanism is indeed not robust.
As a result, we do not recommend robustness as a good evaluation metric for faithfulness, but this does not mean that it is without value. Robustness may be a necessary metric for user understandability, since stable explanations are easier for the human audience to comprehend than those that change drastically and unpredictably given similar inputs (Zhou, Ribeiro, and Shah 2022).
3.4.4 Perturbation-based Evaluation
This kind of evaluation perturbs parts of the input and observes the change in the output. It differs from robustness evaluation in that robustness considers extremely similar inputs and expects that the explanation is similar; but in perturbation-based evaluation, we consider inputs that are not necessarily similar, and our expectation of how the explanation should change depends on which parts of the input are perturbed.
Concretely, consider an explanation in the form of feature importance scores. We now remove a fixed portion of features from the input, based on the explanation. If the most important features (as indicated by the explanation) are first removed, the model prediction is expected to change drastically. Conversely, removing the least important features should result in a smaller change. This type of evaluation has been widely adopted in both vision (Bach et al. 2015; Samek et al. 2016; Shrikumar, Greenside, and Kundaje 2017; Chen et al. 2018; Kindermans et al. 2019) and language (Arras et al. 2016; Chen et al. 2018; Serrano and Smith 2019; Jain and Wallace 2019; Atanasova et al. 2020). In particular, a popular set of metrics are sufficiency and comprehensiveness (DeYoung et al. 2020). Sufficiency measures how much the model prediction can be recovered when only including the important features, while comprehensiveness measures to what extent the model prediction is suppressed by removing the important features.
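The sketch below gives a minimal, illustrative implementation of these two metrics, assuming a predict_proba function that returns the model's confidence in its originally predicted class; the toy classifier and importance indices are placeholders, not outputs of a real model.

```python
# Minimal sketch of sufficiency and comprehensiveness, assuming `predict_proba(tokens)`
# returns the model's confidence in its originally predicted class.

def sufficiency(predict_proba, tokens, important_idx):
    # Confidence drop when keeping ONLY the important tokens (lower is better).
    kept = [t for i, t in enumerate(tokens) if i in important_idx]
    return predict_proba(tokens) - predict_proba(kept)

def comprehensiveness(predict_proba, tokens, important_idx):
    # Confidence drop when REMOVING the important tokens (higher is better).
    rest = [t for i, t in enumerate(tokens) if i not in important_idx]
    return predict_proba(tokens) - predict_proba(rest)

def toy_proba(tokens):  # stands in for a real classifier's confidence in "positive"
    return min(1.0, 0.5 + 0.2 * sum(t in {"great", "love"} for t in tokens))

tokens = ["this", "movie", "is", "great", ".", "i", "love", "it", "."]
important_idx = {3, 6}  # "great" and "love", as ranked by some explanation method
print("sufficiency:", sufficiency(toy_proba, tokens, important_idx))
print("comprehensiveness:", comprehensiveness(toy_proba, tokens, important_idx))
```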
One underlying assumption of this evaluation is that different parts of the input are independent in their contribution to the output (Jacovi and Goldberg 2020). However, features can be correlated. When one feature is removed, we cannot guarantee that other features stay untouched.
Another assumption is that the observed performance change does not come from nonsensical inputs. When a feature is perturbed, the resulting input can become OOD. Compared to Computer Vision (CV), this has more serious consequences in NLP, since removing a word can make the sentence ungrammatical or meaningless, whereas removing a pixel barely changes the semantics of an image. Hooker et al. (2019) address the issue in CV by proposing the RemOve And Retrain (ROAR) benchmark. According to a given explanation method, the set of most important features is removed from both the training and the testing data. The model is then trained and tested again on the new data, and a larger performance drop indicates higher faithfulness. In their experiments, image classification models are found to still achieve decent accuracy even after most input features (90%) are removed. This indicates that the performance drop observed in previous evaluation approaches without retraining might indeed come from the distribution shift instead of the lack of important features. However, although ROAR ensures that the training and testing data come from the same distribution, it brings about a new problem—the retrained model is no longer the same as the original model to be explained. Alternatively, Yin et al. (2022) address the OOD issue in NLP by performing small but adversarial perturbations on the feature embeddings, instead of removing or replacing the feature altogether. Intuitively, if the perturbation scale is constant, the model should be more sensitive to perturbations on important tokens than to those on unimportant ones. One pitfall we notice is that the density of neighboring token vectors in the embedding space might be non-uniform for different tokens. Concretely, suppose token x1 lies in a region with 100 neighboring tokens within distance d, while token x2 only has 10 neighbors within the same distance. Then, when we perform a perturbation of the same scale (e.g., in terms of norm) on both tokens, it is more likely to result in an OOD input for x2.
Finally, and most fundamentally, perturbation-based evaluation implicitly regards another explanation method—a perturbation-based one, which we will discuss in Section 4.4—as the ground truth. For example, as Ju et al. (2022) point out, AOPC (Area Over the Perturbation Curve), a perturbation-based evaluation metric, can be reduced to a basic explanation method, leave-one-out (Li et al. 2016), under constraints on the number of features.
In short, while perturbation-based evaluation has been widely used, we should be cautious about the above-mentioned assumptions and their consequences, especially since there is still no perfect fix in NLP yet.
3.4.5 White-box Evaluation
This type of faithfulness evaluation relies on known ground-truth explanations, against which a candidate explanation can be compared. The ground-truth explanations come from either transparent tasks or transparent models.
In transparent tasks, the data is created in a way such that the set of informative features is controlled. One way to do this is via synthetic tasks, like reconstructing a simple function (Chen et al. 2018; Hooker et al. 2019), counting and comparing the number of digits (De Cao et al. 2020; Hase and Bansal 2022), or text classification on hybrid documents where only one sentence is relevant to the label (Poerner, Schütze, and Roth 2018). A less artificial way is to modify existing datasets of natural tasks, such that models trained on them have to use the desired feature(s) in order to achieve high performance. This method was first proposed in the vision domain (Yang and Kim 2019; Adebayo et al. 2020), and then adapted to NLP. Specifically, Zhou et al. (2022b) and Bastings et al. (2022) concurrently explore the idea of using spurious features to evaluate faithfulness. Given a natural dataset, the first step is to de-correlate all the original features with the label, e.g., by randomly re-assigning the label for all examples. The second step is to inject the spurious feature in the data, such that it perfectly correlates with the new label. For example, for all the reviews with a positive sentiment, we change every article in them to “the”; otherwise, we change the articles to “a.” In this way, only the article feature is (perfectly) indicative of the label, so any model that performs better than random on the dataset must have used this feature. Finally, to evaluate an explanation method, we measure the extent to which it can correctly identify the spurious feature as the contributor for models that achieve perfect performance.
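The following sketch illustrates, in simplified form, the spurious-feature protocol described above; the data, labels, and injection rule are toy placeholders rather than the exact setups of Zhou et al. (2022b) or Bastings et al. (2022).

```python
# Simplified sketch of the spurious-feature protocol: randomize labels to break the
# original feature-label correlations, then inject an article-based feature that
# perfectly predicts the new label. Every example is assumed to contain an article.
import random

def make_spurious_dataset(reviews):
    dataset = []
    for text in reviews:
        label = random.choice(["positive", "negative"])   # step 1: de-correlate original features
        article = "the" if label == "positive" else "a"   # step 2: inject a perfectly predictive cue
        tokens = [article if t.lower() in {"a", "the"} else t for t in text.split()]
        dataset.append((" ".join(tokens), label))
    return dataset

reviews = ["I watched a movie yesterday", "The plot was a mess", "A film worth the ticket"]
for text, label in make_spurious_dataset(reviews):
    print(label, "->", text)
# Any model that beats chance on this data must rely on the article feature, so a
# faithful explanation should attribute its predictions to "a"/"the".
```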
Transparent models provide another way to obtain ground-truth explanations. They are inherently interpretable models (e.g., Logistic Regression or Decision Tree). The ground-truth explanation of important features can be directly obtained through their weights or prediction structure (Ribeiro, Singh, and Guestrin 2016; Ramamurthy et al. 2020). To evaluate an explanation method, one can apply it to these transparent models and compare the explanation with the ground-truth feature importance.
Note that this test is still largely a sanity check, constituting a necessary instead of sufficient condition for faithfulness. Since the transparent setups are simplified, passing the white-box test does not guarantee that the explanation method can faithfully generalize to real-world scenarios.
3.4.6 Human Perception Evaluation
Human perception evaluation assesses whether the model explanation matches human intuition when they make predictions on the same task. For example, if the explanation is in the form of feature importance scores, to what extent does it align with human-annotated importance scores (Feng et al. 2018; DeYoung et al. 2020; Clinciu, Eshghi, and Hastie 2021)?
However, many previous studies of this type do not report which principle is evaluated. We argue that such tests only evaluate plausibility. For them to touch on faithfulness, we need to make the assumption that models reason in the same way as humans do. Obviously, this does not always hold; otherwise, we would not need explanations at all. As said at the beginning of Section 3.4, faithfulness evaluation should not involve human judgment on the explanation quality (Jacovi and Goldberg 2020). Again, this is not to say that human evaluation has no value. Similar to robustness, human perception is a crucial evaluation to perform since our ultimate goal is for the target audience to better understand the model (Zhou, Ribeiro, and Shah 2022).
3.4.7 Meta-evaluation of Faithfulness Evaluation Metrics
To compare the quality of different faithfulness metrics, there are several meta-evaluation standards.
One such standard is the ability to detect known unfaithful explanations. This means that a good evaluation metric should be able to distinguish explanations that are known to be unfaithful from “relatively faithful” ones. In practice, these unfaithful explanations are often randomly generated (Chan, Kong, and Guanqing 2022) or artificially created (Sia et al. 2022). It is less trivial to come up with those allegedly “relatively faithful” explanations, though; existing work typically uses its proposed explanation method or other widely used methods. But after all, if we knew in advance which explanations are relatively more faithful, we would not need faithfulness evaluation metrics in the first place, let alone a meta-evaluation of these metrics.
Another standard is “solvability.” If an evaluation metric can be treated as an objective and an optimal explanation can be found algorithmically, the metric is said to be solvable (Zhou and Shah 2023). For example, it is demonstrated that two widely used perturbation-based evaluation metrics, sufficiency and comprehensiveness, are solvable with beam-search.13
The third standard is time complexity. Specifically, Chan, Kong, and Guanqing (2022) propose to measure the average number of model forward passes needed to compute an evaluation metric, and a smaller number is favored as it requires fewer computational resources.
3.4.8 Summary
Based on the above discussion, we recommend axiomatic evaluation, predictive power evaluation (no simulator, or with an external model as the simulator), and white-box evaluation as desired faithfulness evaluation methods, with the caveats specified above. Ideally, more than one of these evaluations should be performed, since many of them only test a necessary condition for faithfulness.
To complement the list of design principles provided by Jacovi and Goldberg (2020), we additionally propose a few more towards a better evaluation of faithfulness.
We believe it is important to define faithfulness in advance rather than ad-hoc. Instead of using the same term to refer to different things, a clear definition at the beginning of the evaluation will greatly benefit comparability.
It is also crucial to state the assumptions underlying the evaluation, the cases where they do not hold, and the resulting implications. In particular, we caution against making assumptions about how models reason, e.g., “they reason in the same way as humans do,” as this conflates plausibility with faithfulness.
In addition, it is helpful to disentangle the capacity of the model and the quality of the explanation. For example, a non-robust explanation can result from either the model relying on inconsistent features or the explanation being unfaithful, as mentioned in the discussion of robustness evaluation.
Finally, how to evaluate the quality of evaluation metrics is still an under-studied problem. Recent studies such as Chan, Kong, and Guanqing (2022) and Parcalabescu and Frank (2024) have made efforts to systematically compare a set of existing faithfulness metrics under a unified setting. We call for more attention to this fundamental problem, so as to deepen our understanding of faithfulness and better measure progress in interpretability.
4. Families of Model Explanation Methods
We group existing explanation methods in pursuit of faithfulness into five categories: similarity-based methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. To give the reader a quick overview, we briefly explain the intuition behind each type of method with a motivating example.
Consider a Sentiment Analysis task, where a model determines if the sentiment of a given piece of text (e.g., product/movie review) is positive, negative, or neutral. An example input can be:
This movie is great. I love it.
Suppose the model prediction is positive, which matches the ground truth. Our goal, now, is to explain why the model makes such a prediction. Here is how each family of methods answers this question at a high level:
Similarity-based methods provide explanations in terms of previously seen examples, similar to how humans justify their actions by analogy. Specifically, they identify training instances or concepts14 that are similar to the current test example in the model’s induced representation space (e.g., The movie is awesome, The TV show is great) as an explanation, assuming that the reasoning mechanism for similar examples is intrinsically similar.
Analysis of model-internal structures examines the activation patterns of nodes, layers, or other model-specific mechanisms (e.g., attention [Bahdanau, Cho, and Bengio 2015]), and derives an explanation with techniques like visualization, clustering, correlation analysis, and so forth. For example, on a visualized heat map of attention weights, great and love may have the highest weight among all token positions in the final layer. This can be interpreted as these two tokens contributing the most to the prediction.
Backpropagation-based methods compute the gradient (or some variation of it) of the model prediction with respect to each feature (e.g., input token). Features with the largest absolute gradient value (say, great and love) are then considered most important to the prediction. The intuition behind this is that even a slight change in these features15 could have resulted in a large change in the model output. For example, if great becomes good and love becomes like, the model’s confidence in positive will probably not be as high.
Counterfactual intervention perturbs a specific feature (e.g., input token) while controlling for other features and observes the resulting influence on the model prediction. For example, to test if the word great is important for the model prediction, we can mask it out or replace it with another word (e.g., OK) and see how the model prediction changes. If the probability of the positive class decreases considerably, then the word great has been an important feature for the original prediction.
Self-explanatory models do not rely on post-hoc explanation methods but provide explanations as a byproduct of the inference. For example, a self-explanatory model can be trained to predict the sentiment label (positive) and justify its prediction at the same time, by producing a natural language explanation (“The words great and love indicate that the person feels positive about the movie”).
Next, we introduce each family of methods in detail, discussing their representative studies, strengths, and weaknesses.
4.1 Similarity-based Methods
Similarity-based methods provide explanations in terms of training examples. Specifically, in order to explain the model prediction on a test example, they find its most similar16 training examples in the learned representation space, as support for the current prediction. This is akin to how humans explain their actions by analogy, e.g., doctors make diagnoses based on past cases.
Caruana et al. (1999) theoretically formalize the earliest approach of this kind, named “case-based explanation.” Based on the learned hidden activations of the trained model, it finds the test example’s k-nearest neighbors (kNN) in the training set as an explanation. Note that the similarity is defined in terms of the model’s learned representation space but not the input feature space, since otherwise the explanation would be model-independent.
Wallace, Feng, and Boyd-Graber (2018) also use the kNN search algorithm; but instead of deriving a post-hoc explanation, they replace the model’s final softmax classifier with a kNN classifier at test time. Concretely, during training, the model architecture is unmodified. Then, with the trained parameters fixed, each training example is passed through the model again, and its representation is saved. Inference is done with a modified architecture: a test example is classified based on the labels of its kNNs from the training examples in the learned representation space. However, the resulting explanations are evaluated only on whether they align with human perception of feature importance on qualitative examples, which is irrelevant to faithfulness.
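A minimal sketch of the underlying idea is given below, assuming `encode(texts) -> np.ndarray of shape (n, d)` is a wrapper around some trained model’s sentence encoder (i.e., the hidden representation used by its classifier): the explanation for a test example is simply its nearest training examples in that learned space.

```python
import numpy as np

def knn_explanation(encode, train_texts, test_text, k=3):
    """Return the k training examples most similar to `test_text`
    in the model's learned representation space (cosine similarity)."""
    train_vecs = encode(train_texts)                      # (n, d)
    test_vec = encode([test_text])[0]                     # (d,)
    train_norm = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    test_norm = test_vec / np.linalg.norm(test_vec)
    sims = train_norm @ test_norm                         # (n,)
    top = np.argsort(-sims)[:k]
    return [(train_texts[i], float(sims[i])) for i in top]

# For the running example, knn_explanation(encode, train_reviews,
# "This movie is great. I love it.") might return training reviews such as
# "The movie is awesome" or "The TV show is great" with high similarity scores.
```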
Rajagopal et al. (2021) introduce a self-explanatory classification model, where one component, the “global interpretable layer,” also uses the idea of similarity-based explanation. This layer essentially identifies the most similar concepts (phrases in this case) in the training data for a given test example. Their approach is mainly evaluated in terms of plausibility, that is, how adequate/understandable/trustworthy an explanation is based on human judgment. One metric touches on faithfulness—human simulatability—however, the authors only report the relative difference with and without the explanation instead of the absolute scores, which makes it hard to determine how faithful the approach is.
4.1.1 Strengths and Weaknesses
Similarity-based methods exhibit several strengths. First, they are intuitive to humans since the justification by analogy paradigm has long been established. Second, they are also easy to implement, as no re-training or data manipulation is needed: the similarity scores are obtained by simply passing examples through the trained model and reading off its representations. Third, they are highly model-agnostic, since all kinds of neural networks have a representation space, and any similarity metric (cosine, Euclidean, etc.) can be easily applied. Finally, certain similarity-based explanations are rated by human subjects as more understandable and trustworthy (Rajagopal et al. 2021) compared with several other baselines in the families of backpropagation-based methods and counterfactual intervention.
Nevertheless, there are also several weaknesses: First, most similarity-based methods only provide the user with the outcome of the model’s reasoning process (i.e., which examples are similar in the learned space), but do not shed light on how the model reasons (i.e., how the space is learned) (Caruana et al. 1999). Second, existing work mostly evaluates the resulting explanations with plausibility-related metrics, including adequacy, relevance, understandability, and so on, with human judgments. But as mentioned in Section 3 , plausibility does not entail faithfulness. Thus, it is questionable whether similarity-based methods can truly establish causality between model predictions and the explanation. Additionally, the space of exploration is confined to the training set. This inherently limits the diversity and scope of the explanation, potentially leaving certain edge cases unexplained. It also implies that the behavior of similarity-based methods depends on the distribution of the training data. In other words, the explanation outcomes may vary considerably if the training set does not well represent the broader data distribution or is biased in some manner. Finally, similarity-based methods inherently offer instance-level explanations and do not provide insights into the feature-level contribution to the prediction. This lack of granularity can limit the potential for actionable insights.
4.2 Analysis of Model-Internal Structures
The analysis of model-internal structures (e.g., individual neurons, or specific mechanisms like convolution or attention [Bahdanau, Cho, and Bengio 2015]) is believed to shed light on the inner workings of NLP models. Common analysis techniques include visualization (e.g., activation heatmaps, information flow) (Vig 2019), clustering (e.g., neurons with similar functions, inputs with similar activation patterns) (Brunner et al. 2018), and correlation analysis (e.g., between neuron activations and linguistic properties) (Qian, Qiu, and Huang 2016).
We categorize previous research by the target of the analysis, individual neurons or the attention mechanism, as two main lines of work in this area. We focus on attention among all mechanisms because it has become the basis of the most widely adopted architectures in today’s NLP systems and has significantly reshaped this line of work since its introduction.
4.2.1 Analysis on Neurons
The initial success of neural models in NLP sparked interest in finding interpretable functions of individual neurons. Karpathy, Johnson, and Fei-Fei (2015) examine the activation patterns of neurons in a character-level LSTM language model. They find neurons with specific purposes, for example, one that activates within quotes, inside if-statements, or toward the end of a line, respectively (Figure 2).
Using a similar approach, Li et al. (2016) and Strobelt et al. (2018) then find individual LSTM neurons that specifically activate for certain compositional structures in language, such as negation, intensification, and adjective-noun composition. Poerner, Roth, and Schütze (2018) and Hiebert et al. (2018) take the reverse direction: Instead of analyzing which neurons fire for a given input pattern, they look for inputs that have similar neuron activations. Preliminary observations show that Gated Recurrent Units (GRU) and LSTM models can capture certain orthographic and grammatical patterns.
4.2.2 Analysis on Attention Mechanism
Transformers (Vaswani et al. 2017) have become the foundation of the state-of-the-art (SOTA) systems on many NLP tasks (Devlin et al. 2019; Liu et al. 2019; Clark et al. 2020; Brown et al. 2020; Raffel et al. 2020; OpenAI 2023). The core of Transformers is an attention mechanism, called self-attention. Simply put, self-attention is a function that takes in a sequence of vectors (each xi ∈ ℝd ) as input and returns another sequence of vectors (each yi ∈ ℝd ) of the same length. Each yi is a weighted average of a transformed version of all the input vectors, i.e., yi = ∑j aij v(xj), where v(·) is an affine transformation. These weights aij are called attention weights, intuitively representing how much the model “attends to” each input vector when computing the weighted average. In NLP Transformer models, we can think of the initial X as token embeddings, that is, each xi is a vector representation of a token in the input. Then, each yi can be viewed as a composite embedding.
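To make the notation concrete, below is a minimal single-head self-attention sketch with randomly initialized projection matrices (for illustration only): each output yi is a weighted average of affinely transformed inputs, with the weights aij given by a softmax over scaled dot products.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention.
    X: (n, d) input vectors; Wq, Wk, Wv: (d, d) projection matrices.
    Returns Y (n, d) and the attention weights A (n, n), where
    Y[i] = sum_j A[i, j] * (X[j] @ Wv)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(X.shape[1])        # (n, n) scaled dot products
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)            # softmax over the input positions j
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 7, 16                                      # e.g., 7 tokens, dimension 16
X = rng.normal(size=(n, d))
Y, A = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(A[0].round(2))                              # how much position 0 "attends to" each input
```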
Given this structure, it may be tempting to interpret attention weights as the importance of input tokens to the output. Consider our running example: This movie is great. I love it. A BERT-based classifier makes a prediction by passing the input tokens through a stack of 12 Transformer blocks, and then predicting the label positive using the representation of [CLS] (a special token for sentence classification tasks). Figure 3 shows the attention weights from [CLS] to all tokens in the penultimate layer. Among all non-special input tokens, great and love receive the highest averaged weights over all heads. Intuitively, this is often understood to mean that these are the most important tokens for the prediction. This type of understanding has been used (implicitly or explicitly) as evidence for model interpretability in different tasks and domains, such as text classification (Martins and Astudillo 2016), knowledge base induction (Xie et al. 2017), and medical code prediction (Mullenbach et al. 2018).
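In practice, such heat maps are typically read directly off a trained Transformer. The sketch below, assuming a fine-tuned BERT sentiment classifier is available (the checkpoint path is a placeholder, not a specific released model), extracts the penultimate-layer attention from [CLS] to every token and averages it over heads, mirroring the reading of Figure 3.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "path/to/bert-sentiment-checkpoint"   # placeholder: any fine-tuned BERT classifier
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, output_attentions=True)

inputs = tokenizer("This movie is great. I love it.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer
penultimate = outputs.attentions[-2][0]            # (heads, seq, seq)
cls_to_tokens = penultimate.mean(dim=0)[0]         # average heads; row of [CLS] (position 0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, w in zip(tokens, cls_to_tokens.tolist()):
    print(f"{tok:>8s}  {w:.3f}")                   # e.g., "great" and "love" may rank highest
```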
Nevertheless, there has been a long debate on whether attention constitutes a faithful model explanation. In their work “Attention is Not Explanation,” Jain and Wallace (2019) find that it is possible to construct an adversarial attention distribution, i.e., one that is maximally different from the original distribution but has little influence on the model output. For example, in Figure 4, a sentiment classification model predicts that a movie review is negative. Original attention weights suggest that waste and asking appear the most important, but the adversarial distribution shifts the model’s attention onto uninformative words like myself and was, without changing its prediction. This indicates that attention weights do not always causally influence the prediction.
As a direct response, Wiegreffe and Pinter (2019) offer a counter-argument in their paper titled “Attention is Not Not Explanation,” emphasizing that adversarial distributions are not adversarial weights: the adversarial attention distributions are artificially constructed by humans, not learned by models through training. In other words, a trained model would probably not naturally attend most to uninformative words like myself and was. In fact, even when the authors try to guide the model’s attention towards such uninformative words using specially designed objectives, the models seldom converge to these adversarial distributions after training.
Pruthi et al. (2020) in turn refute this counter-argument, showing that with a new training objective, they can successfully guide the model to learn intended adversarial attention distributions. For example, when predicting the occupation of a person in the text, the model is trained to assign minimal attention weights to gender indicator tokens (e.g., “he” and “she”), while still using this signal for prediction. This implies that attention weights can be deceiving: a human user might find the model trustworthy since it is seemingly not relying on gender biases, yet it still does under the hood.
Nevertheless, in favor of the use of attention as explanation, other researchers argue that existing criticisms mainly target sequence classification tasks (e.g., Sentiment Analysis), but sequence-to-sequence tasks (e.g., MT) may be a more suitable use case for attention as explanation. It is found that modifying the encoder-decoder attention heads does influence MT model generations substantially (Voita et al. 2019; Vashishth et al. 2019), in contrast to the findings of Jain and Wallace (2019) on single-sequence tasks. Other attention heads, such as the encoder- or decoder-only ones, are less influential (Voita et al. 2019; Raganato, Scherrer, and Tiedemann 2020).
Concluding the debate, Bastings and Filippova (2020) still argue against attention and in favor of saliency methods17 as faithful explanations. However, they acknowledge that understanding the role of the attention mechanism is still a valuable scientific question (e.g., what linguistic information it captures; which heads can be pruned without performance loss).
A natural question to ask next is: (how) can attention become a more faithful model explanation? To answer it, we first need to understand what might have caused attention to be unfaithful. We summarize three potential factors:
- (a) Information mixing: Attention weights are assigned to hidden states (in intermediate layers) instead of input features (in the initial layer), but we nevertheless interpret attention weights as the importance of the corresponding input feature. In fact, hidden states have already mixed in information from other input features (Tutek and Snajder 2020).
- (b) Locality: Existing methods mostly focus on attention weights in a single layer (often the final one) from a single position (often the special token [CLS]), but this fails to capture the big picture of how the information flows globally (Pascual, Brunner, and Wattenhofer 2021, i.a.).
- (c) Intrinsic lack of causality: Attention weights are simply not designed to have any causal connection with the prediction, and thus cannot provide a faithful explanation alone, but need to be tied to other explanation methods (Mylonas, Mollas, and Tsoumakas 2022, i.a.).
Recent studies have started exploring ways to remedy each of these potential flaws in order to make attention more faithful.
To address the issue of (a) information mixing, Tutek and Snajder (2020) introduce two regularization techniques. The first is weight tying, which minimizes the distance between intermediate hidden states and their corresponding input features. The second is an auxiliary Masked Language Modeling (MLM) task, which decodes input representations from their corresponding hidden state. Both techniques aim to make hidden states more representative of input representations. Experiments show that they effectively increase the causal influence of attention modification on model predictions, thus improving faithfulness.
In response to (b) the locality issue, one solution is to characterize the information flow through the entire network, instead of only considering a single layer or token position. The earliest attempt is made by Abnar and Zuidema (2020), who propose Attention Rollout and Attention Flow. These are empirically shown to be more plausible compared to raw attention weights (evaluated with human perception), as well as claimed to be more faithful (in terms of correlation with gradient-based methods).18 In addition, Ethayarajh and Jurafsky (2021) theoretically prove that Attention Flow can be Shapley values19 —a counterfactual explanation with provable faithfulness guarantees—under certain conditions, whereas raw attention weights cannot. However, Liu et al. (2022) point out that Attention Rollout performs almost as badly as raw attention weights in the polarity consistency test,20 again casting doubt on its faithfulness.
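As a concrete illustration of the global-flow idea, below is a minimal sketch of Attention Rollout in one common formulation of Abnar and Zuidema (2020): average the heads, mix in the identity matrix to account for residual connections, renormalize, and compose the resulting matrices across layers.

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer weights, each of shape (heads, seq, seq),
    ordered from the first to the last layer. Returns a (seq, seq) matrix
    approximating how much each output position draws on each input token
    across the whole stack."""
    seq_len = attentions[0].shape[-1]
    rollout = np.eye(seq_len)
    for layer_attn in attentions:
        A = layer_attn.mean(axis=0)                    # average over heads
        A = 0.5 * A + 0.5 * np.eye(seq_len)            # account for the residual connection
        A = A / A.sum(axis=-1, keepdims=True)          # renormalize rows
        rollout = A @ rollout                          # compose with earlier layers
    return rollout

# rollout[0] then gives the rolled-out attribution from [CLS] to every input token.
```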
Finally, to tackle (c) the intrinsic lack of causality, a number of studies propose to tie attention to other explanation methods, especially backpropagation-based ones (Section 4.3 ). This includes Hao et al. (2021), Lu et al. (2021), Pascual, Brunner, and Wattenhofer (2021), and Mylonas, Mollas, and Tsoumakas (2022), which differ mainly in the explanation method that attention is connected with.
Despite the controversy on faithfulness, recent work in cognitive science shows that attention can uncover interesting similarities between how the human brain and LMs work. Attention distributions in Transformers are found to partially converge with the activation patterns of the human brain in masked word prediction (Caucheteux and King 2022). Similarly, they are predictive of human eye fixation patterns during task-reading to some extent (Eberle et al. 2022).
4.2.3 Strengths and Weaknesses
To summarize, analysis of model-internal structures, as a family of explanation methods, has several strengths. First, the visualization of model-internal structures is intuitive and readable to humans, especially end-users. Second, there are many interactive tools (see Appendix A.1), which can help the user form hypotheses about their data and models and dynamically adjust them through minimal testing. Third, the attention mechanism can capture the interaction between features, whereas many other methods can only capture the influence of individual features themselves. Finally, compared with other methods, model-internal weights are readily accessible and computationally cheap to obtain.
However, these methods also suffer from several key weaknesses: First, it is questionable to what extent raw attention weights represent causal contribution, as mentioned in the debate. Second, the lack of faithfulness may be due to our interpretation of attention weights on intermediate hidden states as the importance of input features; however, hidden states have already mixed in contextual information through previous self-attention layers, and therefore may not be representative of input features. Finally, existing methods often focus on attention weights in a single layer and/or from a single token position. This may reflect how much the model attends to each input position locally, but without taking the whole computation path into account. Methods that characterize the global information flow may be a better alternative (Abnar and Zuidema 2020, i.a.).
4.3 Backpropagation-based Methods
Backpropagation-based methods aim to identify the contribution of input features via a backward pass through the model, propagating the importance (or relevance, used interchangeably in the literature) attribution scores from the output layer to the input features. They can be further distinguished into two categories, gradient methods and propagation methods. Gradient methods follow standard backpropagation rules. In other words, they directly compute the gradient (or some variant of it) of the output with respect to (w.r.t.) the input features via the chain rule, assuming features with larger gradient values are more influential to the model prediction. By contrast, propagation methods define custom backpropagation rules for different layer types and compute the relevance scores layer by layer until reaching the input. This is believed to better capture the redistribution of relevance through special layers, such as ReLU.
Most ideas in this family were first proposed in Computer Vision (CV). In the following subsections, we explain their origins in vision and their adaptation to language.
To formally synthesize existing work, we will use the following notation throughout the rest of this section: An example x (e.g., an image or a sentence) has features xi, i ∈ {1, 2, …, n} (e.g., a pixel, a region, or a token). A model M takes x as input and predicts y = M(x) as output. Our goal is to explain the relevance of each feature xi to y, denoted by ri(x). For some specific methods, we also define a baseline input x̄ (e.g., an all-black image, or a sentence with all-zero token embeddings) against which x is compared. We will discuss each subsequent method using this formalization.
4.3.1 Gradient Methods
As their name suggests, gradient methods treat the gradient (or some variant of it) of the model output w.r.t. each input feature as its relative importance. The feature can typically be a pixel in vision and a token in language. Intuitively, the gradient represents how much difference a tiny change in the input will make to the output. This idea comes from generalized linear models (e.g., Logistic Regression), where each feature has a linear coefficient as their importance to the output. In the case of deep NNs, a natural analog of such coefficients would be gradients, as they characterize the marginal effect of a feature change on the output.
Using the notation above, the core difference of existing gradient methods lies in how they calculate ri(x) , the relevance of feature xi .21
The most straightforward idea is to take the gradient itself (referred to as Simple Gradients or Vanilla Gradients), ri(x) = ∂y/∂xi, as the feature relevance (Baehrens et al. 2010; Simonyan, Vedaldi, and Zisserman 2014). The sign of the gradient represents whether the feature is contributing positively or negatively to the output, for example, increasing or decreasing the probability of a certain class in a classification task. The magnitude of the gradient stands for the extent to which the feature influences the output. Simple Gradients are easy to implement, intuitive to understand, and flexible for extension—beyond individual input features, Kim et al. (2018) extend the idea to high-level concepts (color, pattern, gender …) in their method called TCAV. However, Simple Gradients have two apparent problems related to faithfulness. First, a function can be saturated. Consider common neuron activation functions like sigmoid (σ(z) = 1/(1 + e^(−z))) and tanh (tanh(z) = (e^z − e^(−z))/(e^z + e^(−z))). When |z| is large, we have dσ(z)/dz ≈ 0 and d tanh(z)/dz ≈ 0. In other words, when the absolute value of an input feature is large enough, it has a very small gradient locally, although the feature may have a large contribution to the output y globally. Second, the gradient only measures the responsiveness of the output w.r.t. the feature (how much the output changes in response to an infinitesimal change in the feature), but not the contribution of the feature (how much the current feature value contributes to the output value) (Bastings and Filippova 2020). Say, in an image classification model, we can interpret gradients as “how to make an image more (or less) a cat”, but not “what makes the image a cat”. More formally, taking a simple linear model y = ∑i wixi as an example, the gradient wi measures the responsiveness while wixi measures the contribution. If wi is small but xi is very large, the proportion of wixi in y can still be large. This cannot be captured by the gradient alone.
Gradient × Input (Denil, Demiraj, and de Freitas 2015) is proposed as a natural solution to the latter issue. It computes the relevance score of a feature as the dot product of the input feature and its gradient, ri(x) = xi · ∂y/∂xi, analogous to wixi in a linear model. Intuitively, this takes into account the feature value itself. In CV, Gradient × Input empirically reduces noise in feature relevance visualizations. However, this only improves plausibility, and is not necessarily related to faithfulness. Also, Gradient × Input fails the Input Sensitivity test for faithfulness (cf. Section 2.4 ), that is, if two inputs differ only at feature xi and lead to different predictions, then xi should have non-zero relevance. As a simple counterexample, when the differing feature xi has a zero gradient in both inputs, the product of the input and the gradient would also be zero in both cases, which fails to capture the difference in their contribution.
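A minimal PyTorch sketch of both scores over token embeddings is shown below, assuming `model` maps an embedding matrix directly to class logits and `embed` is its embedding layer (both are illustrative wrappers, not a specific library API); per-token relevance is obtained by reducing over the embedding dimension.

```python
import torch

def gradient_saliency(model, embed, input_ids, target_class):
    """Simple Gradients and Gradient x Input over token embeddings.
    `model(embeddings) -> logits` and `embed` (an nn.Embedding) are assumed wrappers."""
    emb = embed(input_ids).detach().requires_grad_(True)   # (seq, dim), a leaf tensor
    logits = model(emb)
    logits[target_class].backward()                        # gradient of the target logit
    grad = emb.grad                                        # (seq, dim)
    simple = grad.norm(dim=-1)                             # gradient magnitude per token
    grad_x_input = (grad * emb).sum(dim=-1)                # signed contribution per token
    return simple.detach(), grad_x_input.detach()
```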
To address the saturation and input sensitivity issues, Sundararajan, Taly, and Yan (2017) introduce Integrated Gradients. This method estimates the global relevance of a feature by comparing the input with a baseline input. Typically, the baseline is chosen as an all-black image for vision and a sentence with all-zero token embeddings for language. Integrated Gradients satisfy the Input Sensitivity principle, as opposed to Gradient × Input. Still, it is empirically observed to be visually noisy in CV, often resulting in blurry or unintelligible feature relevance maps (Smilkov et al. 2017).22
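In practice, the path integral is approximated with a Riemann sum. Below is a minimal sketch under the same assumed `model`/`embed` wrappers as above: gradients are accumulated along the straight line from an all-zero embedding baseline to the input embeddings and scaled by the input-baseline difference.

```python
import torch

def integrated_gradients(model, embed, input_ids, target_class, steps=50):
    """Approximate Integrated Gradients with `steps` points on the path
    from an all-zero embedding baseline to the actual input embeddings."""
    x = embed(input_ids).detach()                         # (seq, dim)
    baseline = torch.zeros_like(x)
    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        model(point)[target_class].backward()             # gradient at this path point
        total_grad += point.grad
    attributions = (x - baseline) * total_grad / steps    # (seq, dim)
    return attributions.sum(dim=-1)                       # per-token relevance
```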
To this end, SmoothGrad is introduced (Smilkov et al. 2017), aiming to “remove noise by adding noise”. It argues that the Integrated Gradients method is visually noisy because the gradient can fluctuate rapidly with only subtle changes in the input.23 Such local fluctuations may lead to the apparent visual diffusion in relevance maps. To address this issue, SmoothGrad creates a few noisy copies of the original input, computes relevance maps for each copy with any existing gradient method, and averages all maps to obtain a less noisy map.
This method proves effective in visually denoising the relevance maps. However, it is only qualitatively evaluated in terms of human readability; faithfulness is not assessed.
In NLP, both Simple Gradients and Integrated Gradients have been adopted, but mostly targeting sequence classification tasks. As an example, Figure 5 shows a visualization of these methods on sentiment classification. Li et al. (2016) use Simple Gradients to explain token importance in RNN models on sentiment classification. More recently, targeting Transformer models, Hao et al. (2021) and Janizek, Sturmfels, and Lee (2021) adapt Integrated Gradients to capture token interactions on paraphrase detection, NLI, and sentiment classification.
4.3.2 Propagation Methods
While gradient methods follow standard backpropagation rules, propagation methods define a custom backward pass, using purposely designed local propagation rules for special layer types, such as ReLU, to reflect the relevance redistribution patterns that cannot be captured by gradients.
We now formalize the process using a feed-forward network as an example, as shown in Figure 6. Using the previous notations, let x represent the input and M(x) represent the output of model M after a forward pass. Denote any layer in M by l (l = 1, 2, …, L), the dimension of which is dl. Define r_i^{(l)} as the relevance score of any neuron i in layer l. Our goal is to find r_i^{(1)}, the relevance of a given feature xi in the input layer, which also equals ri(x).
Starting from the output layer, where the relevance is initialized with the model prediction, the relevance of each neuron in layer l is computed from the relevance scores of the neurons in layer l + 1 according to a redistribution rule D(). The recursion terminates once reaching l = 1. All subsequently introduced propagation methods follow the same procedure, while differing in the definition of D().24 We will illustrate each method individually and provide visualizations when possible.
Among the earliest methods in this family, DeconvNet (Zeiler and Fergus 2014) and Guided BackPropagation (GBP) (Springenberg et al. 2015) both design custom rules for ReLU units (y = max(0, x)) in particular, since it was then the most commonly used non-linear activation. Formally, suppose a ReLU unit j in layer l + 1 is connected to a set of neurons i = 0, 1, …, dl in layer l, and let ai denote the activation of any such neuron i. Then we have aj = max(0, ∑i wij ai) by the definition of ReLU, where wij is the weight connecting i and j.25 According to the standard backpropagation rule used by Simple Gradients, only units with a positive input in the forward pass (∑i wij ai > 0) receive non-zero gradients in the backward pass. This rule essentially discards all the incoming relevance from layer l + 1 whenever the input is negative (∑i wij ai ≤ 0).
By contrast, DeconvNet proposes to zero out the redistributed relevance only if the incoming relevance is non-positive, regardless of the input (∑i wij ai). On the other hand, Guided BackPropagation combines the two rules above, zeroing out the redistributed relevance if either the incoming relevance or the input is non-positive. Compared with Simple Gradients, both DeconvNet and Guided BackPropagation produce cleaner feature relevance visualizations as perceived by humans. However, they share several shortcomings pertaining to faithfulness. First, they fail the Input Sensitivity test (Section 2.4 ) and similarly suffer from the saturation issue (Sundararajan, Taly, and Yan 2017; Shrikumar, Greenside, and Kundaje 2017), like certain gradient methods mentioned before. Second, because of zeroing out negative inputs and/or negative incoming relevance, both methods cannot highlight features that contribute negatively to the predicted class (Shrikumar, Greenside, and Kundaje 2017). Third, it is shown that both methods are essentially doing (partial) input recovery, which is unrelated to the network’s decision (Nie, Zhang, and Patel 2018). Whichever label class is predicted, the feature attribution is almost invariant. Even with a network of random weights, Guided BackPropagation can still generate human-understandable visualizations. Thus, it is suspected that the visualization has little to do with the model’s reasoning process.
While the previous two methods only treat ReLU specifically, Layerwise Relevance Propagation (LRP) has been proposed as a more generalized solution (Bach et al. 2015). Instead of handcrafting rules directly, it first proposes a high-level Relevance Conservation constraint, namely, the total incoming relevance into a neuron should equal the total outgoing relevance from it.
Any propagation rule conforming to this constraint can be called an instance of LRP.
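One widely used instance for a single linear layer is the so-called ε-rule, sketched below as an illustration (our simplified version, not the full LRP toolbox): each lower-layer neuron receives a share of the upper-layer relevance proportional to its contribution ai·wij to the pre-activation zj, with a small ε stabilizing the denominator; summing over j approximately conserves the total relevance, as required by the constraint above.

```python
import numpy as np

def lrp_epsilon_linear(a, W, b, relevance_upper, eps=1e-6):
    """One LRP epsilon-rule step through a linear layer z = a @ W + b.
    a: (d_l,) lower-layer activations; W: (d_l, d_{l+1}); b: (d_{l+1},);
    relevance_upper: (d_{l+1},) relevance of the upper layer.
    Returns the (d_l,) relevance redistributed to the lower layer."""
    z = a @ W + b                                   # pre-activations of layer l + 1
    z = z + eps * np.where(z >= 0, 1.0, -1.0)       # stabilizer, keeping the sign
    contrib = a[:, None] * W                        # a_i * w_ij, shape (d_l, d_{l+1})
    return (contrib / z) @ relevance_upper          # sum_j (a_i w_ij / z_j) * R_j
```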
Bach et al. (2015) quantitatively assess the faithfulness of LRP via a perturbation-based evaluation (a pixel flipping experiment in a digit classification task), yet no comparison is provided with other explanation methods. Also, LRP still suffers from the saturation problem (Shrikumar et al. 2017) and violates Implementation Invariance (Section 2.4 ) (Shrikumar, Greenside, and Kundaje 2017).
In NLP, LRP has been applied and extended to sentence classification tasks (Arras et al. 2016, 2017). Regarding faithfulness, similar to pixel flipping in vision, a perturbation-based evaluation on the level of input tokens is used: A fixed number of tokens are deleted from the input according to the relevance score assigned by an explanation (LRP, Simple Gradients, or a random baseline), and then we measure the impact on classification performance. It is observed that for correctly classified inputs, when deleting the most relevant tokens first, LRP leads to the most rapid classification performance drop; but for incorrectly classified inputs, when deleting the most irrelevant tokens first, LRP can most effectively boost the performance. This indicates that LRP does provide more faithful insights into the model’s reasoning mechanism compared to Simple Gradients and the random baseline.
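The deletion test itself is easy to sketch. Below is our simplified version of the protocol, assuming the same black-box `predict_proba(tokens) -> {label: prob}` wrapper as in earlier sketches: delete the k tokens ranked most relevant by an explanation and record how far the probability of the originally predicted class drops; a steeper drop suggests a more faithful ranking.

```python
def deletion_score(predict_proba, tokens, label, relevance, k=3):
    """Drop in the predicted class probability after deleting the k tokens
    that the explanation ranks as most relevant. `relevance` holds one score per token."""
    ranked = sorted(range(len(tokens)), key=lambda i: relevance[i], reverse=True)
    keep = set(range(len(tokens))) - set(ranked[:k])
    reduced = [tokens[i] for i in sorted(keep)]
    return predict_proba(tokens)[label] - predict_proba(reduced)[label]

# Comparing deletion_score under LRP, Simple Gradients, and a random ranking
# reproduces the kind of comparison reported by Arras et al.
```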
To address LRP’s failure with saturation and the Input Sensitivity test, two reference-based methods, DeepLift (Shrikumar, Greenside, and Kundaje 2017) and Deep-Taylor Decomposition (DTD) (Montavon et al. 2017), have been introduced. Analogous to Integrated Gradients, they aim to measure the global contribution of input features by finding a reference point, or baseline, x̄, to compare with the input x. Ideally, the baseline should represent some “neutral” input, that is, satisfying M(x̄) ≈ 0, so we can attribute all positive contribution to the presence of x. In practice, it needs to be chosen with domain-specific knowledge. The two methods differ in how the baseline input is chosen. DeepLift empirically chooses a baseline input which results in a neutral output, but DTD additionally requires the baseline to lie in the vicinity of the original input.26 By using a baseline, they do not suffer from the issue of saturation and satisfy the Input Sensitivity principle. When evaluated on a similar pixel flipping test for faithfulness, DeepLift proves the most effective in manipulating the prediction toward the target class, compared to Integrated Gradients, Gradient × Input, Simple Gradients, and Guided BackPropagation. However, DeepLift still fails the Implementation Invariance test (Sundararajan, Taly, and Yan 2017). In terms of evaluation for DTD, only qualitative results on plausibility are reported, while faithfulness is unverified.
In NLP, Chefer, Gur, and Wolf (2021) extend DTD to explain the decision of Transformer models on sentiment classification. However, the explanations are evaluated against human-annotated token relevance, therefore also unrelated to faithfulness.
4.3.3 Strengths and Weaknesses
Summarizing the discussion above, backpropagation-based methods have several key strengths. First, they generate a spectrum of feature relevance scores, which is easily understandable for all kinds of target users. Second, the computational cost of these methods can vary significantly, but in general, they are relatively efficient to compute. Gradient-based techniques only require a handful of calls to the model’s backward function. On the other hand, propagation methods involve a customized implementation of the backward pass, allowing for precise control over the relevance redistribution process while necessitating more complex computation.27 Third, in terms of faithfulness, gradients (and variants) are intrinsically tied to the influence of input features on the prediction. Empirically, certain recently proposed methods (e.g., Layerwise Relevance Propagation, DeepLift, Deep-Taylor Decomposition) are shown to be more faithful than previous baselines via perturbation-based evaluation, as mentioned before. Finally, unlike most methods for the analysis of model-internal structures (e.g., raw attention weights), backpropagation-based methods take the entire computation path into account.
At the same time, these methods are far from perfect due to a number of weaknesses. First, most existing backpropagation-based methods target low-level features only, for example, pixels in vision and input tokens in language. It is unclear how to compute any sort of gradient w.r.t. higher-level features like case, gender, part-of-speech, semantic role, syntactic dependency, coreference, discourse relations, and so on. Second, it is questionable how to apply such methods to non-classification tasks, especially when there is no single output of the model, for example, text generation or structured prediction. Additionally, as detailed before, certain methods violate axiomatic principles of faithfulness, for example, Input Sensitivity and Model Sensitivity (Sundararajan, Taly, and Yan 2017). Lastly, the explanation can be unstable, that is, minimally different inputs can lead to drastically different relevance maps (Ghorbani, Abid, and Zou 2019; Feng et al. 2018). Most methods are not empirically evaluated on faithfulness when they are first proposed, with only a few exceptions mentioned above. Moreover, subsequent researchers find many systematic deficiencies of them in ad-hoc evaluations. As mentioned before, Guided BackPropagation and DeconvNet are shown to be only doing partial input recovery, regardless of the model’s behavior (Nie, Zhang, and Patel 2018). In addition, certain explanations (including Simple Gradients, Integrated Gradients, and SmoothGrad) can be adversarially manipulated, that is, one can construct entirely different gradient distributions with little influence on the prediction (Wang et al. 2020).
4.4 Counterfactual Intervention
The notion of counterfactual reasoning stems from the causality literature in social science: “Given two co-occurring events A and B, A is said to cause B if, under some hypothetical counterfactual case that A did not occur, B would not have occurred” (Roese and Olson 1995; Winship and Morgan 1999; Lipton 2016). In the context of machine learning, counterfactual intervention methods explain the causal effect between a feature/concept/example and the prediction by erasing or perturbing it and observing the change in the prediction. A larger change indicates stronger importance.
One axis along which we can categorize existing studies is where the intervention happens, in inputs or in model representations. The former manipulates the input and passes it through the original model; by contrast, the latter directly intervenes in the model-internal representations, for example, neurons or layers. The rest of this section will elaborate on the two categories accordingly.
4.4.1 Input Intervention
Input intervention methods can be further categorized along two dimensions: the target and the operation. The target refers to “what is affected by the intervention,” normally input features (e.g., tokens) or examples (the entire input instance). The operation stands for the specific intervention method, which can be erasure (masking out or deleting the target) or perturbation (changing the value of the target).
We will first classify existing work based on the target, and then on the operation.
Feature-targeted Intervention.
Earliest work mostly relies on erasure, since it is relatively straightforward to implement.
One intuitive idea is leave-one-out, which erases a single feature at a time and assesses the resulting change in the prediction. The feature can be input tokens (Li, Monroe, and Jurafsky 2016) or vector dimensions (Kádár, Chrupała, and Alishahi 2017; Li, Monroe, and Jurafsky 2016). In these studies, only plausibility is examined based on human perception of qualitative examples, and no faithfulness evaluation is reported. Also, crucially, leave-one-out captures the linear contribution of single features, but cannot handle higher-order feature interactions.
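A minimal leave-one-out sketch over input tokens is given below (same assumed `predict_proba` wrapper as before): the relevance of each token is the change in the predicted class probability when that token alone is deleted.

```python
def leave_one_out(predict_proba, tokens, label):
    """Relevance of each token = probability drop when only that token is removed."""
    base = predict_proba(tokens)[label]
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(base - predict_proba(reduced)[label])
    return scores

# leave_one_out(predict_proba,
#               ["This", "movie", "is", "great", ".", "I", "love", "it", "."],
#               label="positive")
# would likely assign the largest scores to "great" and "love".
```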
To address this issue, researchers propose explanation methods that erase subsets of features instead of individual ones. Several studies (e.g., Li, Monroe, and Jurafsky 2016) aim to find the minimum subset of input tokens to erase such that the model’s decision is flipped. Others (Ribeiro, Singh, and Guestrin 2018) look for the contrary—the minimum subset of input tokens to keep such that the model’s decision is unchanged (called “Anchors”). No matter which objective is taken, finding the exact desired subset of tokens is intractable, and thus both studies rely on approximated search. As a more efficient alternative, De Cao et al. (2020) propose DiffMask, a method that trains a classifier on top of input representations to decide which subset of tokens to mask. In terms of faithfulness, Anchors is evaluated with human simulatability: Compared to a popular baseline, LIME (Ribeiro, Singh, and Guestrin 2016), it allows users to more accurately predict model decisions on unseen examples. DiffMask is shown to be more faithful than several other baselines including Integrated Gradients (Sundararajan, Taly, and Yan 2017), but only with white-box evaluation using synthetic tasks.
Generalizing the idea of feature erasure, a novel family of methods based on surrogate models is proposed. The intuition is to locally approximate a black-box model with a white-box surrogate model as an explanation of the current prediction. LIME (Ribeiro, Singh, and Guestrin 2016), or Local Interpretable Model-agnostic Explanations, is a representative method of this type. Suppose the black-box model to be explained has a complicated decision boundary, as shown by light blue/pink in Figure 7. Our goal is to provide a local explanation for a given prediction (bold red cross). LIME first samples instances in the neighborhood of the current example by masking out different subsets of its features. Next, it obtains the model prediction on these instances and weighs them by their proximity to the current example (represented by size). Then, it approximates the model’s local decision boundary by learning an interpretable model, for example, a sparse linear regression (dashed line), on the input features, which constitutes the explanation.
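A from-scratch sketch of the LIME idea is shown below (our illustration, not the official `lime` package): sample neighbors by masking random token subsets, weight them by proximity to the original example, and fit a weighted linear surrogate whose coefficients serve as the explanation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_text(predict_proba, tokens, label, n_samples=500, kernel_width=0.25, seed=0):
    """Local linear surrogate over token presence/absence features."""
    rng = np.random.default_rng(seed)
    n = len(tokens)
    masks = rng.integers(0, 2, size=(n_samples, n))          # 1 = keep token, 0 = drop
    masks[0] = 1                                              # include the original example
    preds, weights = [], []
    for m in masks:
        text = [t for t, keep in zip(tokens, m) if keep]
        preds.append(predict_proba(text)[label])
        distance = 1.0 - m.mean()                             # fraction of tokens dropped
        weights.append(np.exp(-(distance ** 2) / kernel_width ** 2))
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, preds, sample_weight=weights)        # weighted local fit
    return list(zip(tokens, surrogate.coef_))                 # per-token local importance
```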
Another widely adopted surrogate-model-based method, SHAP (Lundberg and Lee 2017), or SHapley Additive exPlanations, can be thought of as using additive surrogate models as an explanation. Originating in game theory, Shapley values (Shapley 1953) were initially used to determine how to fairly distribute the “payout” among the “players.” In machine learning, they are adapted to explain the contribution of input features (players) to the prediction (payout). Think of an input sentence x as containing a set of binary features xi, each indicating the presence of a token. The Shapley value of xi is its marginal contribution to the prediction, averaged (with combinatorial weights) over all subsets of the remaining tokens. When the number of features is large, computing this exactly is prohibitively expensive. Therefore, the SHAP paper introduces an efficient approximation, which obviates evaluating a combinatorial number of subsets. Follow-up work such as Yeh et al. (2020) further extends SHAP from input tokens to higher-level concepts, represented as high-dimensional vectors. Interested readers can see Mosca et al. (2022) for a targeted survey on SHAP-based methods.
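For short inputs, the Shapley value can be computed exactly by enumeration, as in the sketch below (our illustration; SHAP itself relies on the efficient approximation described above). The value function v(S) is taken to be the predicted class probability when only the tokens in S are kept.

```python
from itertools import combinations
from math import factorial

def exact_shapley(predict_proba, tokens, label):
    """Exact Shapley values over token-presence features
    (exponential cost: use only for short inputs)."""
    n = len(tokens)

    def value(subset):
        kept = [tokens[i] for i in sorted(subset)]
        return predict_proba(kept)[label]

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for S in combinations(others, size):
                # weight = |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi += weight * (value(set(S) | {i}) - value(set(S)))
        phis.append(phi)
    return list(zip(tokens, phis))
```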
In terms of faithfulness, LIME is evaluated with white-box tests: When used to explain models that are themselves interpretable (e.g., Decision Tree), LIME successfully recovers 90% of important features. On the other hand, Shapley values are theoretically shown to be locally faithful, but there is no empirical evidence on whether this property is maintained after the SHAP approximation. Subsequent work also finds other limitations: (i) The choice of neighborhood is critical for such surrogate-model-based methods (Laugel et al. 2018); (ii) Linear surrogate models have limited expressivity. For example, if the decision boundary is a circle and the target example is inside the circle, it is impossible to derive a locally faithful linear approximation.
Beyond simply erasing features or feature subsets, recent work also explicitly models the contribution of feature interactions. For example, Archipelago (Tsang, Rambhatla, and Liu 2020) measures the contribution of the interaction between a pair of features by erasing all other features and recomputing the prediction. It is shown to be faithful via white-box evaluation on synthetic tasks, such as function reconstruction. However, on realistic tasks like sentiment classification, only plausibility is evaluated.
A common problem with all the above feature erasure methods is that they can produce OOD inputs, for example, ungrammatical or nonsensical sentences after tokens are masked out. Exploiting this weakness, Slack et al. (2020) design an adversarial model to fool popular erasure-based explanation methods. Suppose a task (e.g., loan application decision) involves sensitive features (e.g., gender or race), and the explanation method is used for model auditing—examining if a model is relying on these features. The adversarial model has two modules: a classifier to identify if an input is pre- or post-erasure, and a predictor to handle the end task based on this classification. For in-distribution inputs, the predictor will rely solely on sensitive features, while for OOD inputs, it uses only insensitive features. In other words, this model is indeed biased on all in-distribution examples. However, any explanation method that works by sampling neighboring examples via feature erasure (like LIME and SHAP) will report that the model is unbiased, because they are designed to characterize the model’s behavior only using these neighboring instances.
This leads us to the other operation, perturbation, as another type of feature-targeted intervention. Compared to simple erasure, perturbing the value of the feature is less likely to result in OOD inputs. Consider again our running example: This movie is great. I love it. To study the importance of great to the prediction, one can replace it with some other word (e.g., good, OK, …) instead of just deleting it altogether, and observe the probability change of positive. The outcome of such perturbations is called counterfactual examples, which resemble adversarial examples in the robustness literature. They differ in three aspects: (i) the goal of the former is to explain the model’s reasoning mechanism, while that of the latter is to examine model robustness; (ii) the former should be meaningfully different in the perturbed feature from the original example (e.g., This movie is great → This movie is not so great), while the latter should be similar to or even indistinguishable from it (e.g., This movie is great → This movie is Great); (iii) the former can lead to changes in the ground truth label, whereas the latter should not (Kaushik, Hovy, and Lipton 2020).
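A minimal sketch of this kind of feature-targeted perturbation using a generic sentiment pipeline is shown below (the checkpoint is whatever default the library resolves, so the exact numbers are only illustrative): compare the positive-class score before and after replacing a single word.

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")       # illustrative default checkpoint

original = "This movie is great. I love it."
perturbed = original.replace("great", "OK")       # counterfactual on a single feature

for text in (original, perturbed):
    result = classifier(text)[0]
    print(f"{text!r:45s} -> {result['label']} ({result['score']:.3f})")

# A clear drop in the positive score after the replacement suggests that
# "great" was an important feature for the original prediction.
```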
Generating high-quality counterfactual examples is non-trivial, as they need to simultaneously accord with the counterfactual target label, be semantically coherent, and only differ from the original example in the intended feature. In prior work, the most reliable (yet expensive) approach to collect counterfactual examples is still manual creation, as in Kaushik, Hovy, and Lipton (2020) and Abraham et al. (2022). However, recent studies propose promising ways to automate the generation, focusing on different aspects of perturbation: domain adaptation (Calderon et al. 2022), morpho-syntactic features (Zmigrod et al. 2019; Amini et al. 2022), or general-purpose (Wu et al. 2021).
Example-targeted Intervention.
In addition to feature-targeted intervention, counterfactual input intervention can also directly happen on the level of examples.
A representative method is called influence functions (Koh and Liang 2017), which is designed to explain which training examples are most influential in the prediction of a test example. This may remind us of similarity-based methods in Section 4.1 (Caruana et al. 1999). Although both methods share the same goal, they rely on different mechanisms. Similarity-based methods identify the most influential training examples via similarity search, whereas influence functions are based on counterfactual reasoning—if a training example were absent or slightly changed, how would the prediction change? Since it is impractical to retrain the model after erasing/perturbing every single training example, influence functions approximate this effect analytically from the loss function (via its gradients and Hessian), without retraining. After their introduction in vision (Koh and Liang 2017), influence functions have been adapted to NLP (Han, Wallace, and Tsvetkov 2020). Although they are claimed to be “inherently faithful,” this is not well-supported empirically. Crucially, the approximation relies on the assumption that the loss function is convex. Koh and Liang (2017) examine the assumption in vision, showing that influence functions can still be a good approximation even when the assumption does not hold (specifically, on CNNs for image classification). In NLP, though, only a qualitative sanity check is performed (on BERT for text classification); moreover, no baseline is provided. In fact, Basu, Pope, and Feizi (2021) discover that influence functions can become fragile in the age of deep NNs. The approximation accuracy can vary significantly depending on a variety of factors: “the network architecture, its depth and width, the extent of model parameterization and regularization techniques, and the examined test points.” The findings call for increased caution about the consequences of this assumption as models become more complex.
4.4.2 Model Representation Intervention
Similar to input intervention methods, we also categorize model representation intervention methods according to the target and the operation. Here, the target can typically be individual neurons or high-level feature representations. The operation still involves erasure and perturbation. As before, we organize existing work by target first and by operation second.
Neuron-targeted Intervention.
By intervening in individual neurons in an NN, one can explain the importance of each neuron to the prediction. The intervention can still be either erasure or perturbation.
The simplest form of erasure is still leave-one-out. Using the same strategy as with input features, Li, Monroe, and Jurafsky (2016) study the effect of zeroing out a single dimension in hidden units on the prediction. Bau et al. (2019) adapt the method to MT models and improve its efficiency, by searching for important neurons in a guided fashion instead of brute-force enumeration.
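In PyTorch, single-neuron erasure can be sketched with a forward hook that zeroes one hidden dimension during the forward pass (a generic illustration, not the exact protocol of the cited studies); comparing the prediction with and without the hook measures that neuron's contribution.

```python
import torch

def zero_out_unit(model, layer, unit_idx, inputs, target_class):
    """Effect on the target-class probability of zeroing one hidden unit in `layer`.
    `model(inputs) -> logits` and `layer` (a submodule of `model`) are assumed handles."""
    def probe():
        with torch.no_grad():
            return torch.softmax(model(inputs), dim=-1)[..., target_class]

    base = probe()

    def hook(module, inp, out):
        out = out.clone()
        out[..., unit_idx] = 0.0              # erase one hidden dimension
        return out                            # returned tensor replaces the layer output

    handle = layer.register_forward_hook(hook)
    ablated = probe()
    handle.remove()
    return (base - ablated).item()            # larger drop = more important neuron
```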
Apart from erasure, perturbation is another form of neuron-targeted intervention. One representative example is causal mediation analysis (Vig et al. 2020), which measures the effect of a control variable on a response variable, mediated by an intermediate variable (or mediator). In machine learning, we can think of the input example as the control variable, the model output as the response variable, and an internal neuron as the mediator. Vig et al. (2020) use this framework to analyze gender bias in LMs. Through a case study on GPT-2, the authors show that gender bias is concentrated in a relatively small proportion of neurons, especially in the middle layers.
The causal mediation analysis approach is further applied to tasks such as subject-verb agreement in English (Finlayson et al. 2021) and multilingual scenarios (Mueller, Xia, and Linzen 2022). However, all the above studies only intervene in one neuron at a time, failing to capture feature interactions. De Cao et al. (2022) address this issue by proposing a differentiable relaxation technique that allows efficient search through the combinatorial space of possible neuron combinations, identifying a small subset of neurons responsible for particular linguistic phenomena. In terms of faithfulness, only De Cao et al. (2022) report the result of a perturbation-based evaluation, whereas the remaining three studies do not report empirical faithfulness evaluation results.
Feature-representation-targeted Intervention.
Beyond intervening in neurons, directly targeting feature representations in the model allows us to answer more insightful questions like “Is some high-level feature, e.g., syntax tree, used in prediction?”. This is particularly meaningful to the line of work on what knowledge a model encodes (Section 2.1 ), which oftentimes discovers linguistic features in model representations, but it is unclear whether these features are used by the model when making predictions.
Similar to neuron-targeted intervention, the most intuitive way to perform an intervention on feature representation is erasure. Two pieces of concurrent work, Amnesic Probing (Elazar et al. 2021) and CausalLM (Feder et al. 2021), are representative examples. They both aim to answer the following question: is a certain feature (e.g., POS, dependency tree, constituency boundary) used by the model on a task (e.g., language modeling, sentiment classification)? To answer the question, they exploit different algorithms to erase the target feature from the model representation, via either Iterative Null-space Projection (INLP) (Ravfogel et al. 2020) or adversarial training. Then, with the new representation, they measure the change in the prediction. The larger the change, the more strongly it indicates that the feature has been used by the original model. In terms of faithfulness, only CausalLM is validated with a white-box evaluation, whereas no explicit evaluation is provided for Amnesic Probing.
Methods along this line also differ in the stage where the erasure happens: post-hoc erasure methods remove the feature after a model representation is trained, while adversarial erasure methods jointly train the model with an adversarial predictor to forget the target feature. INLP (Ravfogel et al. 2020) is an example of post-hoc erasure, which removes the feature via a series of iterative linear projections, such that the target feature cannot be predicted by any linear classifier in the end, thus “linearly guarded.” Haghighatkhah et al. (2022) improve INLP by proposing a variant that only needs a single projection. On the contrary, CausalLM (Feder et al. 2022) exemplifies adversarial erasure. However, neither of the two methods is perfect: Ravfogel, Goldberg, and Cotterell (2022) discover that linear guardedness does not guarantee that linear classifiers do not use the target features. At the same time, Kumar, Tan, and Sharma (2022) find that both post-hoc and adversarial erasure methods cannot guarantee to remove the target feature entirely; even worse, they may end up destroying other task-relevant features.
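A single INLP-style iteration can be sketched as follows (a simplified illustration of Ravfogel et al.'s procedure, not their released implementation): fit a linear probe for the target feature, project the representations onto the null space of its weight matrix, and repeat until the feature is no longer linearly predictable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nullspace_projection(W):
    """Projection matrix onto the null space of the rows of W (shape (c, d))."""
    # Orthonormal basis of the row space via SVD, then P = I - B^T B.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    basis = Vt[S > 1e-10]                      # directions used by the probe
    return np.eye(W.shape[1]) - basis.T @ basis

def inlp_iteration(X, z):
    """One iteration: remove linear information about feature labels `z`
    (e.g., POS tags) from representations X of shape (n, d)."""
    probe = LogisticRegression(max_iter=1000).fit(X, z)
    P = nullspace_projection(probe.coef_)      # coef_ has shape (n_classes, d) or (1, d)
    return X @ P                               # feature is now harder to predict linearly

# Repeating inlp_iteration until probe accuracy drops to chance yields
# "linearly guarded" representations, which can then be fed back into the model.
```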
Taking a step back, even with perfect erasure techniques, a higher-level problem with the feature representation erasure methodology lies in unrealistic representations, similar to OOD inputs in the case of input erasure. For instance, since syntax is such a fundamental component of language, what does it mean if an LM is entirely ignorant of syntax? Is it still an LM at all?
To address this issue, perturbation-based methods targeting feature representations are proposed. Ravfogel et al. (2021) introduce AlterRep, an algorithm to manipulate the target feature value in model representations. Specifically, they investigate the task of subject-verb agreement prediction. For example, given the sentence
The man that they see [MASK] here.
with an embedded relative clause (that they see), the correct verb to fill in the mask should be is as opposed to are, since it should agree with man. They now ask the question: Does the model use syntactic information (e.g., the relative clause boundary) in making such predictions? To answer this, they first probe for syntactic knowledge in the model representation. If the model “thinks” that the [MASK] token is outside the relative clause (factual), AlterRep would flip this knowledge via linear projection techniques like INLP, such that [MASK] is now inside the relative clause according to the new model representation (counterfactual). They find that the probability ratio of are versus is increases significantly, as the model is potentially tempted to associate the verb with they in the relative clause. Such findings suggest that syntactic information is indeed used by the model in predicting the verb.
Tucker, Qian, and Levy (2021) study the same task and feature, but improve the method by providing syntactically ambiguous contexts, e.g.,
I saw the boy and the girl [MASK] tall.
which can be interpreted as either [I saw the boy] and [the girl [MASK] tall], or I saw [the boy and the girl [MASK] tall]. Therefore, [MASK] can be either singular or plural. Likewise, they first probe for the model’s syntactic representation, and then flip the structure of the syntax tree from one of the above interpretations to the other. Similar findings are reported, suggesting BERT-based models are using syntax in agreement prediction. However, no extrinsic evaluation of faithfulness is provided.
4.4.3 Strengths and Weaknesses
Overall, there are several advantages unique to counterfactual intervention methods. First, having its roots in the causality literature, counterfactual intervention is designed to capture causal rather than merely correlational effects between inputs and outputs. Second, compared to other methods, counterfactual intervention methods are more often explicitly evaluated in terms of faithfulness (e.g., Ribeiro, Singh, and Guestrin 2018; De Cao et al. 2020; Ribeiro, Singh, and Guestrin 2016; Lundberg and Lee 2017; Tsang, Rambhatla, and Liu 2020; Feder et al. 2021). They are also shown to outperform existing baselines, mostly with predictive power or white-box tests for faithfulness.
However, they also share a number of disadvantages. First, compared to other methods, counterfactual intervention is computationally more expensive, normally requiring multiple forward passes and modifications to the model representation. Searching for the right targets to intervene in can also be costly. Second, as explained before, erasure-based intervention can result in nonsensical inputs or representations, which sometimes allow adversaries to manipulate the explanation (Slack et al. 2020). Third, intervening in a single feature relies on the assumption that features are independent. Consider the sentence This movie is mediocre, maybe even bad (Wallace, Gardner, and Singh 2020). If we mask out mediocre or bad individually, the predicted sentiment will probably not change much (still negative). Hence, an explanation method that relies on single-feature erasure might report that neither token is important for the model prediction. However, such a method does not capture feature interactions, like the OR relationship between mediocre and bad here: as long as one of them is present, the sentiment is likely negative. More examples are provided in Shrikumar, Greenside, and Kundaje (2017). Additionally, interventions are often overly specific to the particular example (Wallace, Gardner, and Singh 2020). This calls for more insights into the scale of such explanations (i.e., if we discover a problem, is it only about one example or a bigger issue?) and general takeaways (i.e., what do we know about the model from this explanation?). Finally, counterfactual intervention may suffer from hindsight bias (De Cao et al. 2020), which questions the foundation of counterfactual reasoning. Specifically, the fact that a feature can be dropped without influencing the prediction does not necessarily mean that the model “knows” it can be dropped and has not used it in the original prediction. De Cao et al. (2020) illustrate this point with an intuitive example from the Reading Comprehension task, where a model is given a paragraph and a question, and should identify an answer span in the paragraph. Now, using counterfactual intervention, if we mask out everything except the answer in the paragraph, the model is certain to predict the gold span. Nonetheless, this does not imply that everything else is unimportant for the model’s original prediction. This issue again calls for a rethinking of the fundamental assumptions of counterfactual reasoning.
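To make the feature-interaction problem concrete, the toy snippet below (our own illustration with a stand-in classifier; any real sentiment model could be plugged in) shows how single-token erasure assigns near-zero importance to both tokens, whereas erasing them jointly reveals the OR relationship:

def predict_negative(tokens):
    """Hypothetical stand-in for a sentiment model: returns P(negative).
    It fires as long as at least one negative cue is present (an OR relationship)."""
    return 1.0 if any(t in {"mediocre", "bad"} for t in tokens) else 0.0

def erasure_importance(predict, tokens, idxs, mask="[MASK]"):
    """Drop in P(negative) when the tokens at positions idxs are masked out."""
    masked = [mask if i in idxs else t for i, t in enumerate(tokens)]
    return predict(tokens) - predict(masked)

tokens = "This movie is mediocre , maybe even bad".split()
print(erasure_importance(predict_negative, tokens, {3}))     # mask 'mediocre' -> 0.0
print(erasure_importance(predict_negative, tokens, {7}))     # mask 'bad'      -> 0.0
print(erasure_importance(predict_negative, tokens, {3, 7}))  # mask both       -> 1.0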
4.5 Self-Explanatory Models
In contrast with all the above post-hoc methods, self-explanatory models provide built-in explanations. Typically, explanations can be in the form of feature importance scores, natural language, causal graphs, or the network architecture itself.
Prior work on self-explanatory models can be broadly categorized into two lines based on how the explanation is formed: explainable architecture or generating explanations. The former relies on a transparent model architecture, such that no extra explanation is necessary. The latter, though, may still involve opaque architectures, but generates explicit explanations as a byproduct of the inference process.
4.5.1 Explainable Architecture
While end-to-end NNs are a black-box, classic machine learning models like Decision Trees and linear regression have a highly interpretable reasoning mechanism. Drawing inspiration from them, researchers have attempted to design neural models with more structural transparency while maintaining their performance.
Neural Module Networks (NMNs) are one representative example, specifically in the context of Question Answering (QA) tasks. Given a complex question (e.g., Are there more donuts than bagels in the image?), humans naturally decompose it into a sequence of steps (e.g., look for donuts and bagels, count them, and compare the counts). With the same motivation, NMNs parse the input question into a program of learnable modules (e.g., compare(count(donuts), count(bagels))), which is then executed to derive the answer.
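To make the compositional execution concrete, the following toy sketch (with hand-written stand-ins for what would normally be learned neural modules) shows the kind of program an NMN might execute for the donuts/bagels question:

# Toy, hand-written stand-ins for learned neural modules (sketch).
def find(scene, label):                # attend to objects of a given category
    return [obj for obj in scene if obj["label"] == label]

def count(objects):                    # count the attended objects
    return len(objects)

def compare(a, b):                     # final comparison module
    return "yes" if a > b else "no"

scene = [{"label": "donut"}] * 3 + [{"label": "bagel"}] * 2
# Program layout predicted for "Are there more donuts than bagels?":
answer = compare(count(find(scene, "donut")), count(find(scene, "bagel")))   # -> "yes"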
There are three potentially learnable components in NMNs: (i) the question parser (question → syntax tree), (ii) the network layout predictor (syntax tree → program), and (iii) the module parameters (program → answer). Previous studies differ in whether and how each component is learned. Andreas et al. (2016b) introduce the earliest version of NMN, where only module parameters are learned and the other two components are pretrained or deterministic. In a follow-up study (Andreas et al. 2016a), they extend the framework to also jointly learn the network layout specific to each question, which is then named Dynamic Neural Module Network (DNMN). Hu et al. (2017) further extend the method to learn the question parser as well, resulting in their End-to-End Module Network (N2NMN).
The above methods have been demonstrated to be effective on various visual QA tasks including VQA (Antol et al. 2015), SHAPES (Andreas et al. 2016b), GeoQA (Krishnamurthy and Kollar 2013), and CLEVR (Johnson et al. 2017), most of which are based on synthetic data, though. More recent studies investigate the application of NMNs on realistic data, especially in the language-only domain. Jiang et al. (2019) apply NMN to HotpotQA (Yang et al. 2018), a multiple-document QA dataset. However, their model is incapable of symbolic reasoning, such as counting and sorting, so it can only solve questions with directly retrievable answers in the context. To address this issue, Gupta et al. (2020) introduce a set of modules for each symbolic operation, e.g., count, find-max-num, compare-num. Their model is capable of answering questions involving discrete reasoning in the DROP dataset (Dua et al. 2019), for example, Who kicked the longest field goal in the second quarter?, given a long description of the match. See Figure 8 for an illustration.
Despite their presumably transparent structure, there are two main faithfulness-related problems with NMNs: First, the modules’ actual behavior may not be faithful to their intended function. Most NMNs pre-define modules only in terms of their input/output interface. Their actual behavior (the module parameters) is then typically learned from end-to-end supervision on the entire pipeline, that is, from only the question, the context, and the final answer. Thus, there is no control over the intermediate output of individual modules. Consider our previous example compare(count(donuts), count(bagels)). Though we name the module compare, it may not perform the comparison function. Theoretically, it is possible that compare alone outputs the final answer, whereas count is ignored. Subramanian et al. (2020) empirically confirm such failure cases and experiment with several remedies, such as introducing intermediate supervision and limiting the module complexity. Nevertheless, improved faithfulness comes at the cost of end-task accuracy. The second problem is that symbolic modules may not be expressive enough for the flexible semantics of natural language. For example, Gupta et al. (2020) note that questions like Which quarterback threw the most touchdown passes? would necessitate modules with some key-value representation ({quarterback: count}), which are non-obvious to design. Other subtle semantic phenomena like context-conditional parsing and coreference pose similar challenges.
Nonetheless, the compositional nature of NMNs brings about promising research opportunities. In particular, it is possible to pretrain modules independently on synthetic tasks and then use them with fixed parameters when training the QA pipeline. This allows us to ensure their faithfulness as well as exploit transferable knowledge.
NMNs can be thought of as a variant of Neural-Symbolic Models (NSMs), a more generic concept loosely defined as neural models that integrate symbolic reasoning (Yi et al. 2018; Mao et al. 2019; Bogin et al. 2021, i.a.). Interested readers can refer to Hamilton et al. (2022) for a focused review on neuro-symbolic NLP.
Models with constraints are another example of explainable architectures. The idea is to incorporate constraints into neural networks from classic interpretable models, like generalized linear regression (Alvarez-Melis and Jaakkola 2018) and finite-state automata (Schwartz, Thomson, and Smith 2018; Deutsch, Upadhyay, and Roth 2019; Jiang et al. 2020). Still, a major challenge lies in the trade-off between interpretability and performance.
4.5.2 Generating Explanations
Besides using architecture as an implicit explanation, another type of self-explanatory model learns to generate an explicit explanation as an additional task, aside from making the prediction. For supervision, human-written explanations are often used as an additional training signal, along with the end-task label. According to the dependency relationship between the predictor and the explainer, we can classify existing work into three categories: predict-then-explain, explain-then-predict, and jointly-predict-and-explain, as illustrated in Figure 9.
Predict-then-explain models first make a prediction with a standard black-box predictor, and then justify the prediction with an explainer (Figure 9). This is analogous to previous post-hoc explanation methods (from Section 4.1). This framework has been applied to many domains, including vision (Hendricks et al. 2016), language (Camburu et al. 2018), and multimodal tasks (Park et al. 2018). However, it suffers from the same faithfulness challenge as all other post-hoc methods: Because the predictor does not depend on the explainer, there is no guarantee that the explanation accurately reflects the reasoning process behind the prediction. Moreover, as the supervision comes from human-provided explanations, the explainer is only explicitly optimized in terms of plausibility, but not faithfulness.
Explain-then-predict methods (Figure 9) have been introduced in response to this issue. In this framework, the explainer first generates an explanation, which is then provided as the only input to the predictor. In other words, the predictor can only access the explanation, but not the original input example. The intuition is that the prediction can only be made based on the explanation, which renders the predictor “faithful by construction.”
Methods within this framework mainly differ in the form of explanation, which is typically either an extract from the input or natural language, analogous to extractive and abstractive summarization, respectively.
Methods of the former kind are also known as rationale-based methods, where a rationale is defined as a part of the input that is short but sufficient for the prediction (Zaidan, Eisner, and Piatko 2007). For example, in sentiment classification, seeing the phrase not good is probably enough for predicting negative. The job of the explainer is to extract such a rationale, and thus it is also called extractor in this scenario. Then, the predictor will make the prediction using only the extracted rationale.
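Schematically, the extract-then-predict pipeline can be summarized as below (a minimal sketch in which the extractor and predictor are hypothetical callables; how the extractor is actually trained is the difficulty discussed next):

def explain_then_predict(tokens, extractor, predictor):
    """Rationale-based pipeline (sketch): the predictor only ever sees the rationale."""
    keep = extractor(tokens)                          # binary mask over tokens, e.g., [0, 1, 1, 0, ...]
    rationale = [t for t, k in zip(tokens, keep) if k]
    label = predictor(rationale)                      # prediction made from the rationale alone
    return label, rationale                           # the rationale doubles as the explanation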
One difficulty lies in how to effectively find the rationale span, given the formidably large search space. Lei, Barzilay, and Jaakkola (2016) propose to guide the search with reinforcement learning. Bastings, Aziz, and Titov (2019) introduce a re-parameterization technique as an alternative, which makes the learning differentiable. Jain et al. (2020) discard searching and directly obtain candidate rationales from existing post-hoc explanation methods (e.g., backpropagation-based ones) instead.
Although rationale-based models seem to be “faithful by construction,” they are not necessarily so: (i) Clearly, only the rationale is used in the prediction, but this does not tell us anything about how it is used. For example, the predictor might only be looking at superficial patterns in the rationale (e.g., the number of tokens that are kept). Jacovi and Goldberg (2021) confirm the existence of these so-called “trojan” explanations in practice. (ii) The rationales are shown to be unstable to minimal meaning-preserving perturbations on the input and hard to understand by users, even those with advanced machine learning knowledge (Zheng et al. 2022). (iii) Finally, whether rationales can ever be a sufficient explanation for the prediction is highly task-specific. For instance, it might make sense to classify the sentiment only based on a subset of tokens, but what about numerous tasks that are intrinsically context-dependent, such as NLI, coreference resolution, relation extraction, and so forth?
As an alternative to extracted rationales, a more flexible form of explanation is natural language (also called free-text explanation). Consider the NLI task as an example. Given a hypothesis (e.g., An adult dressed in black holds a stick) and a premise as input (e.g., An adult is walking away, empty-handed), the explainer first generates an explanation (Holds a stick implies using hands so it is not empty-handed), and then the predictor makes a prediction (Contradiction) only based on the explanation. When experimenting with this model on the SNLI dataset (Bowman et al. 2015), Camburu et al. (2018) discover a trade-off between the task accuracy and the plausibility of the explanation. It is also found that the model can generate self-inconsistent explanations (Camburu et al. 2020), for example, Dogs are animals and Dogs are not animals. Moreover, the explanation might contain cues to the label, for instance, patterns like X implies Y / X is a type of Y oftentimes indicate Entailment, while X is not the same as Y is a strong signal of Contradiction. To overcome this issue, Kumar and Talukdar (2020) propose the Natural language Inference over Label-specific Explanations (NILE) model, where every class label has a corresponding explainer. Given an input, an explanation is generated for each label (Entailment, Neutral, and Contradiction). Then, all three explanations are fed to the predictor, which makes a decision after comparing them. This precludes the possibility of the predictor exploiting cues in the explanation pattern. NILE is shown to have accuracy comparable to SOTA models on SNLI, as well as better transferability to OOD datasets. Through an extensive evaluation, the authors compare the faithfulness of a few variants of NILE.
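A minimal sketch of this label-specific design (our own simplification; the function names are hypothetical, and the real NILE model fine-tunes separate generation and classification networks) looks as follows:

def nile_predict(premise, hypothesis, explainers, predictor):
    """NILE-style pipeline (sketch): one explainer per candidate label.
    explainers: dict mapping each label (entailment/neutral/contradiction)
    to a label-specific explanation generator."""
    explanations = {label: gen(premise, hypothesis) for label, gen in explainers.items()}
    # The predictor compares the candidate explanations rather than reading the input pair,
    # so it cannot simply exploit label-revealing phrasing in a single explanation.
    label = predictor(explanations)
    return label, explanations[label]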
Jointly-predict-and-explain methods have two possible structures: (i) there are still an explainer and a predictor, but the predictor can access both the explanation and the input example; (ii) there are no separate explainer and predictor at all; everything is produced jointly.
Approaches in (i) suffer from a similar faithfulness challenge as predict-then-explain methods do, because the predictor can make its decision based on the input alone, using whatever reasoning mechanism, while entirely ignoring the explanation. For example, Rajani et al. (2019) introduce a QA model that takes in a question, generates an explanation first, and then produces an answer based on both the question and the explanation. Another example is a variant of the NILE model (Kumar and Talukdar 2020) previously discussed in explain-then-predict, which allows the predictor to look at both the premise-hypothesis pair and the explanation.
For works of type (ii) (where there are no separate explainer and predictor), given the input example as the prompt, a generation model outputs a continuation including both the explanation and the prediction in some designated order. This is analogous to any other generation task. Existing studies along this line differ in the choice of the generation model and the end task. For example, Ling et al. (2017) use LSTMs to solve algebraic problems and provide intermediate steps, and Narang et al. (2020) train T5 to generate predictions with explanations for NLI, sentiment classification, and QA.
In particular, we elaborate on a series of studies on generating structured proofs for deductive reasoning. This is especially interesting since most previous work only aims at generating single-step explanations (e.g., a single sentence), but complex tasks would require a structured reasoning chain as explanations. As one of the earliest studies of this kind, Tafjord, Dalvi, and Clark (2021) develop ProofWriter, which takes in a set of premises and a hypothesis, and then decides if the hypothesis is true by providing a structured proof. Specifically, there are two versions of ProofWriter, all-at-once (generating the entire proof in one shot) and iterative (generating one step at a time). The all-at-once ProofWriter cannot guarantee that the proof is faithful, since it may not “believe” in the proof that it has generated. The iterative ProofWriter bridges this gap by deriving the proof one step at a time. All steps are already verified during generation. Therefore, it is faithful by construction. However, compared to the all-at-once version, it suffers from efficiency and input length limitations.
After ProofWriter, Dalvi et al. (2021) generalize the idea to real-world data, developing the EntailmentWriter to verify hypotheses about scientific questions. Nevertheless, only an all-at-once version is implemented, indicating that the same faithfulness risk exists. To address this issue, Hong et al. (2022) propose METGEN (Module-based Entailment Tree GENeration), a modularized version of EntailmentWriter. They define multiple single-step entailment modules each performing a certain type of reasoning, such as substitution, conjunction, and if-then, like in Neural Module Networks (NMNs). Then, a high-level reasoning controller takes in the given facts and selects which module to use at each step to eventually arrive at the hypothesis. Notably, to encourage faithfulness, each module is trained with well-formed synthetic data and then fine-tuned on EntailmentBank, and their weights are frozen from then on. The improvement in faithfulness also leads to higher performance of METGEN compared to previous versions of EntailmentWriter.
With the advances in in-context learning (Brown et al. 2020), recent work has started to explore the possibility of generating explanations with few-shot prompting. The earliest studies, including Wiegreffe et al. (2022) and Marasovic et al. (2022), find that Large Language Models (LLMs) like GPT-3 show potential for generating decently plausible free-text explanations with only a few examples, though the quality still lags behind human-provided explanations. On the other hand, Ye and Durrett (2022) find that such few-shot model-generated explanations are often non-factual (i.e., not correctly grounded in the input) and inconsistent (i.e., not entailing the prediction). Nonetheless, these explanations are still helpful for users to calibrate the confidence of model predictions, since explanations rated by humans as plausible are more likely to co-occur with accurate predictions.
Another line of work under few-shot explanation generation is Chain-of-Thought-style (CoT) prompting, which is specifically effective for complex reasoning tasks like Math Word Problems and Multi-hop QA. Given a complex question Q, an LM is prompted to generate a reasoning chain C along with the final answer A. Specifically, the prompt consists of a few instances of (Q, C, A) triples as in-context exemplars. This allows pre-trained LLMs to solve unseen questions with much higher accuracy than standard prompting (Brown et al. 2020), where the exemplars do not contain the reasoning chain C. We categorize existing CoT-style prompting methods into three types: all-at-once, ensemble-based, and modularized. All-at-once prompting means that the LM produces C and A as one continuous string, without any dependencies or constraints in between. Scratchpad (Nye et al. 2021), CoT (Wei et al. 2022), and “Let’s think step by step” (Kojima et al. 2022), are all examples of this kind. Ensemble-based prompting is designed to overcome the locality issue of one-shot generation in previous methods by sampling multiple (C, A) pairs and choosing the best answer via strategies like majority voting. Examples include Self-Consistent CoT (Wang et al. 2022b), Minerva (Lewkowycz et al. 2022), and DIVERSE (Li et al. 2022), which differ mainly in the granularity of voting and the underlying LM. Modularized methods break down Q into subproblems and then conquer them individually (Hong et al. 2022; Creswell and Shanahan 2022; Qian et al. 2022; Jung et al. 2022). In particular, Least-to-Most Prompting (Zhou et al. 2022a) uses an LLM to first reduce the question to subquestions, and then sequentially answers them conditioned on its answers to previous subquestions. However, for all the above CoT-style prompting methods, the generated reasoning chain is entirely in NL, so there is no guarantee of faithfulness: the answer does not need to causally follow from the reasoning chain. To bridge this gap, recent work attempts to generate the reasoning chain in some Symbolic Language (SL) (e.g., Python) and calls an external solver (e.g., a Python interpreter) to derive the answer by deterministically executing the reasoning chain. In Program-of-Thought (PoT) (Chen et al. 2022b) and Program-Aided Language Models (PAL) (Gao et al. 2022), the reasoning chain is entirely in Python, which allows the same underlying LM to outperform vanilla CoT with NL by a large margin on a wide range of arithmetic and symbolic reasoning tasks. Lyu et al. (2023) propose Faithful CoT, which interleaves NL comments and SL programs for users to better understand and potentially interact with the model. These symbolic CoT prompting methods guarantee that the reasoning chain is a faithful explanation of how the model derives the answer. However, how the model generates the reasoning chain is still an opaque process, so there is no full interpretability of the entire pipeline.
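To illustrate the symbolic reasoning-chain idea, the sketch below mimics a PAL/PoT-style setup (the prompt text, the example problem, and the generate() interface are our own simplifications rather than the published implementations): the LM emits Python as its reasoning chain, and the answer is obtained by executing that chain rather than by asking the model to state it.

PROMPT = """Q: Olivia has 23 dollars and buys five bagels for 3 dollars each. How much money does she have left?
# reasoning chain, written as Python
money_initial = 23
bagels = 5
bagel_cost = 3
money_left = money_initial - bagels * bagel_cost
answer = money_left

Q: {question}
# reasoning chain, written as Python
"""

def pal_answer(question, generate):
    """generate() is any LM completion function (hypothetical interface).
    The model's continuation is treated as Python and executed to obtain the answer."""
    chain = generate(PROMPT.format(question=question))   # model output: code only
    scope = {}
    exec(chain, scope)               # the answer follows deterministically from the chain
    return scope["answer"], chain    # the chain is a faithful trace of how the answer was derived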
4.5.3 Strengths and Weaknesses
In summary, self-explanatory models have several strengths. First, by definition, self-explanatory models provide built-in explanations, so there is no need for post-hoc explanations. Second, the form of explanation is flexible, e.g., model architecture, input features, natural language, or causal graphs. Third, it is possible to supervise the explainer with human-provided explanations. This is helpful for learning more plausible explanations, as well as encouraging the model to rely on desired human-like reasoning mechanisms instead of spurious cues. Finally, certain self-explanatory models (e.g., Tafjord, Dalvi, and Clark 2021; Hong et al. 2022; Chen et al. 2022b; Gao et al. 2022; Lyu et al. 2023) are faithful by construction (we should be extra cautious about this claim, though).
Self-explanatory models, however, also present a few key weaknesses. First, many such models cannot guarantee faithfulness, e.g., Neural Module Networks without intermediate supervision, predict-then-explain models, rationale-based explain-then-predict models, and certain jointly-predict-and-explain models. Second, the influence of explanations on task performance is mixed in self-explanatory models. Many studies discover a trade-off between performance and interpretability (Narang et al. 2020; Subramanian et al. 2020; Hase et al. 2020, i.a.), while others observe a positive impact on the performance from including explanations (e.g., CoT-style prompting). The effect highly depends on the task, the model family and size, the format of explanations, whether they are tuned, and how they are used (as inputs, targets, or priors) (Hase and Bansal 2022; Lampinen et al. 2022). To make the process less mysterious, Ye et al. (2022) and Ye and Durrett (2023) develop principled methods to select exemplars with explanations for few-shot prompts, in order for interpretability to benefit performance. Finally, large-scale human supervision on explanations can be costly and noisy (Dalvi et al. 2021). Also, it is hard to automatically evaluate the quality of model-generated explanations given reference human explanations, since there can be multiple ways to explain a prediction.
5 Summary and Discussion
After presenting existing explanation methods in detail, we now summarize all five method families in terms of faithfulness from both theoretical and empirical perspectives.
Similarity-based methods rely on the theoretical assumption that models use similar reasoning mechanisms for examples with similar representations in the learned space. However, this assumption might be invalid if the model is not robust to subtle changes in this space. Empirically, similarity-based methods are rarely evaluated with regard to faithfulness.
Analysis of model-internal structures has sparked a long debate regarding the empirical faithfulness of raw attention weights as explanations. Theoretically, the lack of faithfulness can be attributed to issues like information mixing, locality, and/or an intrinsic lack of causality in attention weights. Recent approaches, including regularization, global characterization, and integration with other explanation methods, show promise in addressing these concerns.
Backpropagation-based methods are faithfulness-driven by definition, but they still encounter theoretical challenges such as saturation, input sensitivity, and implementation variance. Subsequent variants have addressed these issues by considering the input alongside the gradient or by incorporating a baseline for comparison. Empirically, these methods are more frequently assessed for faithfulness, mainly through perturbation-based evaluation, than earlier methods. However, certain methods in this family are shown to be only doing partial input recovery, regardless of the model’s prediction. Meanwhile, much progress has been made in improving the plausibility of these methods, especially in reducing the visual noise in relevance maps.
Counterfactual intervention is theoretically grounded in the causality literature, yet it still faces nuanced pitfalls such as the feature independence assumption and hindsight bias. Empirically, methods of this family are most often evaluated with predictive power or white-box tests, which show their improved faithfulness compared to earlier baselines. Still, counterfactual intervention might produce OOD inputs, potentially exploitable by adversaries. Perturbation-based interventions are less prone to generating OOD inputs than erasure-based ones but are more complex to automate.
Self-explanatory models, designed for intrinsic interpretability, might encounter theoretical challenges like the lack of intermediate supervision, label leakage, and social misalignment. As a result, these can empirically undermine their “faithfulness by construction” claim.
Next, we discuss the common virtues and challenges of existing methods, as well as identify future work directions towards faithful interpretability in NLP.
5.1 Virtues
Many studies are conducive to bridging the gap between competence and performance in language models. The two terms originate from linguistics: competence describes humans’ (unconscious) knowledge of a language, whereas performance refers to their actual use of the knowledge (Chomsky 1965). For humans, there is a gap between competence and performance, for example, we can theoretically utter a sentence with infinitely many embedded clauses, but in practice, it is impossible to do so. Similarly, for language models, what they know can be different from what they use (in a task), as discussed in Section 2.1 . Previous work on interpretability predominantly focuses on competence, whereas more recent studies (e.g., all five methods discussed in this survey) aim at answering the performance question. This allows us to better understand whether the same gap exists in models, and if so, how we can bridge it.
It is also noteworthy that there has been increasing awareness of faithfulness and other principles of model explanation methods, especially since the seminal opinion piece by Jacovi and Goldberg (2020). A number of evaluation methods have been proposed; see Section 2.4 for details. Though each of them depends on assumptions and application scenarios, this is a good starting point for quantitatively assessing the quality of explanations.
In addition, explanations produced by most above-mentioned methods are intuitive to understand, even for lay people. This is because the form of explanation is simple, mostly feature importance scores, visualization, natural language, or causal graphs. Though the model and the explanation method may be opaque, the explanation itself is easily understandable.
Finally, in terms of applicability, many available explanation tactics are model-agnostic, especially for classification tasks. Also, numerous toolkits have been developed to help users apply explanation methods to their own models. See Appendix A for more details.
5.2 Challenges and Future Work
Despite the remarkable advances, the area of NLP interpretability still faces several major challenges, which also provide exciting opportunities for future research.
So far, a large number of explanation methods still lack objective quality evaluation, especially in terms of faithfulness. There has not been any established consensus on how to measure faithfulness. Different evaluations are often not directly comparable and yield inconsistent results. This underscores the need for a universal evaluation framework (and maybe even a meta-evaluation framework of existing evaluation frameworks), which is fundamental to measuring the progress of any research in this area.
Next, most existing methods provide explanations in terms of surface-level features, e.g., pixels in vision and tokens in language. Future work can focus more on how to capture the contribution of higher-level features in a task, including linguistic (case, gender, part-of-speech, semantic role, syntax dependency, coreference, discourse relations, etc.), and extra-linguistic (demographic features, commonsense and world knowledge, etc.) ones. Several studies on counterfactual intervention provide inspiring examples (Ravfogel et al. 2020; Elazar et al. 2021; Tucker, Qian, and Levy 2021); see Section 4.4 for details.
Another challenge is that most existing methods capture the contribution of individual features to the prediction, but not that of higher-order feature interactions. See Section 4.4.3 for an illustration of why this is a problem. Future work can address the issue by developing more flexible forms of explanation instead of flat importance scores, e.g., feature subsets as in certain counterfactual intervention methods (e.g., Ribeiro, Singh, and Guestrin 2018) and causal graphs as in several self-explanatory methods (e.g., Tafjord, Dalvi, and Clark 2021; Dalvi et al. 2021).
In addition, existing work mostly focuses on limited task formats, for example, classification and span identification. This limits the downstream applicability of these methods to real-world scenarios. Future work can study alternative task formats such as language generation and structured prediction, or even better, develop explanation methods that are generalizable across tasks. Recent work such as Yin and Neubig (2022) makes promising initial progress in this direction.
Meanwhile, it is not always obvious whether insights from model explanations are actionable. For example, given the explanation of the model’s decision on one test example, the user finds that the model is not using the desired features. Then how should they go about fixing it: through the data, model architecture, training procedure, hyper-parameters, or something else? How does the user communicate with the model? Consequently, interactive explanations will be a fruitful area for future study. A few studies on knowledge editing have shown the feasibility of the idea (Madaan et al. 2021; Kassner et al. 2021).
Interestingly, the relationship between model performance and interpretability is not always predictable. Sometimes there is a synergy, but sometimes there is tension. This issue is especially evident in self-explanatory models; see Section 4.5 for more details. It will be greatly helpful to have a theoretical understanding of when and how explanations can help with model performance, as demonstrated by several recent studies (Ye et al. 2022; Hase and Bansal 2022). In the meantime, we need to cautiously balance between performance and interpretability depending on application scenarios.
Finally, we want to emphasize that faithfulness is not the only desideratum. After all, explanations are meant to help the target audience better understand the model, and faithfulness is only one (but fundamental) condition to this end. Furthermore, explanations should be useful in helping the target audience with certain goals, such as decision making, model debugging, knowledge discovery, and so on. Findings from existing studies are still quite disappointing in this respect: for example, according to Bansal et al. (2021), when human decision makers collaborate with a model (e.g., in computer-assisted diagnosis), current explanation methods rarely help them make more accurate decisions, but instead exacerbate their over-trust in model predictions even when they are wrong. As a result, more work should be done to investigate faithfulness under the context of real-world applications, especially its relationship with user-oriented desiderata like utility.
6 Conclusion
This survey provides an extensive tour of recent advances in NLP explainability, through the lens of faithfulness. Despite being a fundamental principle of model explanation methods, faithfulness does not have a universally accepted technical definition or evaluation framework. This absence makes it challenging to compare different methods based on faithfulness, and many methods do not provide quantitative faithfulness evaluation results.
We critically review five families of existing model explanation methods: similarity-based methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. We introduce each category in terms of their representative work, strengths, and weaknesses, with a special focus on faithfulness.
In summary, similarity-based methods are not primarily driven by faithfulness and are rarely assessed on this basis. For the analysis of model-internal structures, the early attempt to treat raw attention weights as explanations has faced strong criticism regarding faithfulness. However, there has been promising new progress in improving the faithfulness of attention by addressing these concerns. Backpropagation-based methods are intuitively faithfulness-motivated due to the nature of gradients, but theoretical and empirical evidence often points to various faithfulness issues, such as saturation and input sensitivity. Still, later variants in this family have shown gradual improvements in some of these aspects. Counterfactual intervention methods, rooted in causal inference, are also faithfulness-motivated. Nevertheless, certain types of intervention, such as perturbation, are more likely to be faithful than others like erasure, due to the OOD issue. There are also nuanced practical concerns such as feature independence and hindsight bias to consider. Self-explanatory models, which do not rely on any post-hoc explanation methods, often claim to be “faithful by construction,” yet many fall short due to obstacles like label leakage and the lack of intermediate supervision. As such, we need to be extra cautious about such claims. In essence, when deciding which explanation methods to use in practice, we advocate for those that are intrinsically motivated and empirically validated in terms of faithfulness, while still remaining aware of the potential pitfalls highlighted above.
Finally, we discuss the common virtues and challenges of all methods and suggest potential directions for future research. In particular, we are eager to see future work on establishing a universal standard for faithfulness evaluation, exploring the relationship between interpretability and performance, and developing explanation methods that consider high-level features, flexible forms, and alternative task formats. We hope that this survey serves as an overview of the area for researchers interested in interpretability, and provides a practical guide for users seeking to better understand their models.
Appendix A. Additional Details
To complement the discussion of model explanation methods in Section 4, here we provide interested readers with additional details about certain families of methods, including mathematical formalization, visualization of examples, and existing tools for implementation.
A.1 Analysis of Model-Internal Structures
Tools
For neuron visualization, a variety of tools have been developed, including RNNvis (Ming et al. 2017), LSTMVis (Strobelt et al. 2018), and Seq2Seq-Vis (Strobelt et al. 2019). For attention visualization, readers can look into the following tools: BertViz (Vig 2019) and LIT (Tenney et al. 2020).
A.2 Backpropagation-based Methods
Technical Details
Tables A.1 and A.2 summarize the mathematical formalization of gradient methods and propagation methods, respectively. Figure A.1 shows a visualization of different gradient methods on image classification.
Table A.1: Computation of the relevance score $r_i(x)$ in gradient methods, where $f(x)$ denotes the model output of interest, $x_i$ the i-th input feature, and $\bar{x}$ a baseline input.

| Method | Computation of $r_i(x)$ |
|---|---|
| Simple Gradients | $\frac{\partial f(x)}{\partial x_i}$, $\left\lvert \frac{\partial f(x)}{\partial x_i} \right\rvert$, or $\left\lVert \frac{\partial f(x)}{\partial x_i} \right\rVert_2$ |
| Gradient × Input | $x_i \cdot \frac{\partial f(x)}{\partial x_i}$ |
| Integrated Gradients | $(x_i - \bar{x}_i) \int_{0}^{1} \frac{\partial f(\bar{x} + \alpha (x - \bar{x}))}{\partial x_i}\, d\alpha$, approximated by $(x_i - \bar{x}_i) \cdot \frac{1}{m} \sum_{k=1}^{m} \frac{\partial f(\bar{x} + \frac{k}{m}(x - \bar{x}))}{\partial x_i}$ |
| SmoothGrad | $\frac{1}{n} \sum_{k=1}^{n} r_i(x + \epsilon_k)$ with $\epsilon_k \sim \mathcal{N}(0, \sigma^2)$, where $r_i(\cdot)$ is any other relevance computation |
Table A.2: Representative backward propagation rules $D(\cdot)$ in propagation methods. For the ReLU rules, $z$ denotes the unit's forward input and $R^{out}$ the relevance arriving from the layer above; for the layer-wise rules, $a_i$ are input activations, $w_{ij}$ the weights (with $a_0 = 1$ and $w_{0j} = b$ for the bias term), $\bar{a}_i$ reference activations, and $R_j$ the relevance of output neuron $j$.

| Method | Definition of $D(\cdot)$ |
|---|---|
| Simple Gradients | $D(R^{out}) = \mathbb{1}[z > 0] \odot R^{out}$ |
| DeconvNet | $D(R^{out}) = \mathbb{1}[R^{out} > 0] \odot R^{out}$ |
| Guided BackPropagation | $D(R^{out}) = \mathbb{1}[z > 0] \odot \mathbb{1}[R^{out} > 0] \odot R^{out}$ |
| Layerwise Relevance Propagation | $R_i = \sum_j \frac{a_i w_{ij}}{\sum_{i'} a_{i'} w_{i'j}} R_j$ |
| DeepLift | $R_i = \sum_j \frac{(a_i - \bar{a}_i) w_{ij}}{\sum_{i'} (a_{i'} - \bar{a}_{i'}) w_{i'j}} R_j$ (Rescale rule) |
| Deep-Taylor Decomposition | $R_i = \sum_j \frac{\partial R_j}{\partial a_i}\Bigr\rvert_{\tilde{a}^{(j)}} \left(a_i - \tilde{a}^{(j)}_i\right)$ for a chosen root point $\tilde{a}^{(j)}$ |
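As a concrete companion to Table A.1, the following sketch (our own illustration, assuming f maps an input tensor that does not itself require gradients to a scalar class score) implements Gradient × Input and the Riemann approximation of Integrated Gradients in PyTorch:

import torch

def grad_x_input(f, x):
    """Gradient x Input: r_i(x) = x_i * d f(x) / d x_i (sketch)."""
    x = x.clone().requires_grad_(True)
    f(x).backward()                    # f(x) must be a scalar score
    return (x * x.grad).detach()

def integrated_gradients(f, x, baseline=None, steps=50):
    """Riemann approximation of Integrated Gradients with respect to a baseline."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (x - baseline)).requires_grad_(True)
        f(point).backward()            # accumulate gradients along the straight-line path
        total += point.grad
    return (x - baseline) * total / steps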
Tools
A.3 Counterfactual Intervention
Tools
The following tools implement certain type(s) of counterfactual intervention: Captum, LIT (Tenney et al. 2020), LIME (Ribeiro, Singh, and Guestrin 2016), SHAP (Lundberg and Lee 2017), Anchors (Ribeiro, Singh, and Guestrin 2018), Seq2Seq-Vis (Strobelt et al. 2019), and the What-if Tool (Wexler et al. 2020).
Acknowledgments
This research is based upon work supported in part by the Air Force Research Laboratory (contract FA8750-23-C-0507), the DARPA KAIROS Program (contract FA8750-19-2- 1004), the IARPA HIATUS Program (contract 2022-22072200005), and the National Science Foundation (award 1928631). Approved for Public Release, Distribution Unlimited. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of AFRL, DARPA, IARPA, NSF, or the U.S. Government.
Notes
For example, see the encouraging progress on reducing visual noise (i.e., improving plausibility) in Backpropagation-based Methods in Section 4.3 .
For example, “saliency methods.” See Section 2.3 for details.
Despite their subtle distinctions in some previous literature, we use the terms interchangeably in this survey.
There can be interesting “how” questions as well, for example, “how to make interpretability insights useful,” which we believe are beyond the scope of the internal mechanics of the model, but rather address how to best apply the insights obtained from interpretability methods. Another example can be “how a model makes certain predictions,” but this is essentially equivalent to “why a model makes certain predictions” in our terms.
Given a premise P and a hypothesis H, the NLI task is to determine if P semantically entails H (whenever P is true, is H also true?). For example, it rains heavily today entails it rains today.
Note that the target audience is not included in the table, since it largely depends on the task and the specific instance of the method.
We will elaborate on the attention mechanism in Section 4.2 .
See Section 4.3.2 – Propagation Methods for more details on “total importance.”
See Section 3.4.1 – Axiomatic Evaluation for a critical discussion of a subset of these principles.
The other two principles are “do not trust ’inherent interpretability’ claims” and “faithfulness evaluation of Intelligent User Interface systems should not rely on user performance.” The former is not relevant to the faithfulness evaluation methods, but rather to self-explanatory models, which we will revisit in Section 4.5 . The latter is similar to the second principle and thus omitted.
See Section 4.3 for details.
This does not mean that they are bad metrics, since the search-based explainer also performs decently on several other metrics, so this explainer could be at least a strong baseline to compare explanation methods against.
See Section 2.3 .
For discrete inputs like tokens, the extent of the change is defined in terms of similarity metrics in the embedding vector space.
In practice, commonly used similarity metrics include cosine, Euclidean, etc.
These include backpropagation-based methods (Section 4.3 ) and counterfactual intervention (Section 4.4 ) in our terms.
It is questionable if we can treat gradient-based methods as the gold standard for faithfulness though, as discussed in Section 4.3 .
We will revisit this in Section 4.4 .
See Section 3.4 for details.
Appendix Table A.1 summarizes the formal definition of ri(x) in each method, and Figure A.1 presents a visualization of all methods in the vision domain.
Of course, it is questionable if this noise comes from the deficiency of the explanation method or from the model reasoning mechanism itself.
For example, one of the most commonly used activation functions, ReLU (y = max(0, x)), is not continuously differentiable (at x = 0, the gradient does not exist). Thus the fluctuation is “infinite” in some sense at x = 0.
Appendix Table A.2 provides a formalization of how each method defines D() .
To conveniently incorporate the bias term b, we let a0 = 1 and w0j = b .
Specifically, this is due to its theoretical perspective, treating the explanation as an approximation of the upper-level relevance through Taylor decomposition. Only when the baseline input lies in the vicinity of the original input can the Taylor decomposition be accurate.
In general, their computational cost is lower than counterfactual intervention (Section A.3), which typically requires multiple forward passes in addition, but higher than internal-structure analysis (Section 4.2), which typically requires no additional model calls. It is worth noting that these computational costs exist along a continuum rather than being binary categories.
This is analogous to the notion of “Anchors” mentioned in Section 4.4 .
In contrast, the above-mentioned explain-then-predict methods do not allow the predictor to access the input example. This is why we categorize (i) as jointly-predict-and-explain methods.
Since D() is layer-specific, we only show one or more representative rules for each method here: the ReLU unit propagation rule for Simple Gradients, DeconvNet, and GBP; the general-form rule for LRP; the Rescale rule for DeepLift; and the general-form rule for Deep-Taylor Decomposition.