
Local Interpretations for Explainable Natural Language Processing: A Survey

Published: 25 April 2024


Abstract

As the use of deep learning techniques has grown across various fields over the past decade, concerns about the opaqueness of black-box models have also grown, resulting in an increased focus on transparency in deep learning models. This work investigates various methods to improve the interpretability of deep neural networks for Natural Language Processing (NLP) tasks, including machine translation and sentiment analysis. We begin with a comprehensive discussion of the definition of the term interpretability and its various aspects. The methods collected and summarised in this survey concern local interpretation only and are divided into three categories: (1) interpreting the model’s predictions through related input features; (2) interpreting through natural language explanation; (3) probing the hidden states of models and word representations.


1 INTRODUCTION

As a result of the explosive development of deep learning techniques over the past decade, the performance of deep neural networks (DNNs) has significantly improved across various tasks. DNNs have been broadly applied in different fields, including business, healthcare, and justice. For example, in healthcare, artificial intelligence startups raised $864 million in the second quarter of 2019, with higher amounts expected in the future, as reported by the TDC Group.1 However, while deep learning models have brought many foreseeable benefits to both patients and medical practitioners, such as enhanced image scanning and segmentation, the inability of these models to provide explanations for their predictions remains a severe risk, limiting their application and utility.

Before demonstrating the importance of the interpretability of deep learning models, it is essential to illustrate the opaqueness of DNNs compared to other, interpretable machine learning models. Neural networks roughly mimic the hierarchical structures of neurons in the human brain to process information across hierarchical layers. Each neuron receives information from its predecessors and passes its outputs to its successors, eventually resulting in a final prediction [120]. DNNs are neural networks with a large number of layers, containing up to billions of parameters. Compare this to interpretable machine learning models such as linear regression, where the few parameters in the model can be extracted as an explanation of the influential features in a prediction, or decision trees, where a model’s prediction process can be easily understood by following the decision rules: the huge and complex computations performed by DNNs are hard to comprehend for experts and non-experts alike. In addition, the representations used and constructed by DNNs are often complex and incredibly difficult to tie back to a set of observable variables in image and natural language processing tasks. As such, vanilla DNNs are often regarded as opaque “black-box” models that have neither interpretable architectures nor clear features for interpreting the model outputs.

However, why should we want interpretable DNNs? One fundamental reason is that while the recent application of deep learning techniques to various tasks has resulted in high levels of performance and accuracy, these techniques are still imperfect. As such, when applying these models to critical tasks where prediction results can cause significant real-world impacts, they are not guaranteed to provide faultless predictions. Furthermore, given any decision-making system, it is natural to demand explanations for the decisions it provides. For example, the European Parliament’s General Data Protection Regulation (GDPR), which has applied since May 2018, clarifies the right of all individuals to obtain “meaningful explanations of the logic involved” in automated decision-making procedures [58]. As such, it is legally and ethically crucial for the application of DNNs to develop and design ways for these networks to provide explanations for their predictions. In addition, explanations of predictions help specialists verify their correctness, allowing them to judge whether a model is making the right predictions for the right reasons. Increasing interpretability is therefore vital for expanding the applicability and correctness of DNNs.

In the past few years, several works have been proposed to improve the interpretability of DNNs. In this survey article, we focus on local interpretation methods proposed for natural language processing tasks. As described in the following sections, we define local methods as those that provide explanations only for specific decisions made by the model; that is, methods that provide explanations for single instances rather than aiming to provide general descriptions of the model’s decision-making process. We explore several recent local interpretation methods/techniques in Natural Language Processing (NLP), which aim to support users with no machine/deep learning expertise2:

Feature importance methods, which work by determining and extracting the most important elements of an input instance.

Natural language explanation, in which models generate text explanations for a given prediction.

Probing, in which a model’s internal states are examined when it is given certain inputs.

1.1 Definitions of Interpretability

While there has been much study of the interpretability of DNNs, there is no unified definition of the term interpretability, with different researchers defining it from different perspectives. We summarise the key aspects of interpretability used by these researchers below.

1.1.1 Explainability vs. Interpretability.

The terms interpretability and explainability are often used synonymously across the field of explainable AI [1, 27], with both terms being used to refer to the ability of a system to justify or explain the reasoning behind its decisions.3 Overall, the machine learning community tends to use the term interpretability, while the HCI community tends to use the term explainability [1]. Recent work has suggested more formal definitions of these terms [27, 44, 58]. Following Doshi-Velez and Kim [44], we define interpretability as “the ability [of a model] to explain or to present [its predictions] in understandable terms to a human.” We take explainability to be synonymous with interpretability unless otherwise stated, reflecting its general usage within the field.

1.1.2 Local and Global Interpretability.

An essential distinction in interpretable machine learning is between local and global interpretability. Following Guidotti et al. [58] and Doshi-Velez and Kim [44], we take local interpretability to be “the situation in which it is possible to understand only the reasons for a specific decision” [58]. That is, a locally interpretable model is a model that can give explanations for specific predictions and inputs. We take global interpretability to be the situation in which it is possible to understand “the whole logic of a model and follow the entire reasoning leading to all the different possible outcomes” [58]. A classic example of a globally interpretable model is a decision tree, in which the general behaviour of the model may be easily understood by examining the decision nodes that make up the tree. As understanding the whole logic of a model often requires the use of specific models or significant changes to an existing model, in this article, we focus on local interpretation methods, as these tend to be more generally applicable to existing and future NLP models.

1.1.3 Post Hoc vs. In-built Interpretations.

Another important distinction is whether an interpretability method is applied to a model after the fact or integrated into the internals of a model. The former is referred to as a post hoc interpretation method [118], while the latter is an in-built interpretation method. As post hoc methods are applied to the model after the fact, they generally do not impact the model’s performance. Some post hoc methods do not require any access to the internals of the model being explained and so are model-agnostic. A typical example of a post hoc interpretation method is LIME [143], which generates a local interpretation for one instance by perturbing the original inputs of an underlying black-box model. In contrast to post hoc interpretations, in-built interpretations are closely integrated into the model itself. The interpretation may come from the transparency of the model, where the workings of the model itself are clear and easy to understand (for example, a decision tree), or may come from an interpretation generated by the model in an opaque manner (for example, a model that generates a text explanation during its prediction process). In this survey, we examine both types of methods.

1.2 Article Layout

Before examining interpretability methods, we first discuss different aspects of interpretability in Section 2. In Section 3, we summarise and categorise the three main interpretation methods in NLP: (1) improving a model’s interpretability by identifying the important input features; (2) explaining a model’s predictions by generating direct natural language explanations; (3) probing the internal states and mechanisms of a model. We also provide a quick summary of datasets that are commonly used for the study of each method. In Section 4, we summarise several primary methods for evaluating the interpretability of each method discussed in Section 3. Finally, we discuss the limitations of current interpretation methods in NLP in Section 5, followed by possible future directions for interpretability research.


2 ASPECTS OF INTERPRETABILITY

2.1 Interpretability Requirements

Before discussing the various aspects of interpretability, it is essential to consider which problems require interpretable solutions and which interpretable models best fit those problems. Following Reference [44], we suggest that anyone looking to build interpretable models first determine the following four points:

(1)

Do you need an explanation for a specific instance, or do you need to understand how a model works? In the former case, local interpretation methods will likely prove more suitable, while in the latter, global interpretation methods will be required.

(2)

How much time does/will a user have to understand the explanation? This, along with the point below, is an important concern for the usability of an interpretation method. Certain methods lend themselves to quick, intuitive understanding, while others require more time and effort to comprehend.

(3)

What background and expertise will the users of your interpretable model have? As mentioned, this is an important usability concern. For example, regression weights have classically been considered “interpretable” but require a user to have some understanding of regression beforehand. In contrast, decision trees (when rendered in a tree structure) are often understandable even to non-experts.

(4)

What aspects or parts of the problem do you want to explain? It is important to consider what can and cannot be explained by your model and prioritise accordingly. For example, explaining all potential judgements a self-driving car could make in any situation is infeasible, but restricting explanations to certain systems or situations allows easier measuring and assurance of interpretation quality.

These points allow interpretability-related problems to be categorised, and thus give a clearer understanding of what is required from an interpretable system and which interpretation methods suit the problem at hand.

2.2 Dimensions of Interpretability

“Interpretability” is not a simple binary or monolithic concept, but rather one that can be measured along multiple dimensions. Different aspects of interpretability have been identified across the literature, which we condense and summarise into four key aspects: faithfulness, stability, comprehensibility, and trustworthiness.

2.2.1 Faithfulness.

Faithfulness measures how well an interpretation method reflects the decision-making process used by the underlying model. For example, an image heatmap that highlights parts of the image not genuinely used by the model would be unfaithful, while one highlighting the parts genuinely used by the model would be more faithful. Traditionally, this has been more of a concern for post hoc methods such as LIME [143] and SHAP [106]. However, more recent work has called into question the faithfulness of in-built interpretability methods such as attention weight examination [75, 78, 180]. Faithfulness is essential for claims that an interpretation method accurately reflects a model’s process for reaching a judgement. Explanations provided by an unfaithful method may hide existing biases that the underlying model uses for judgements, potentially engendering unwarranted trust or belief in these predictions [75]. Related is the notion of fidelity as defined by Molnar [118]: the extent to which an interpretation method can approximate the predictions of a black-box model. Underlying this definition is the assumption that a method that better approximates a black box must also use a reasoning process similar to that of the underlying model.4 As such, this definition of fidelity is a more specific form of faithfulness as applied to interpretation methods that construct models approximating an underlying black-box model, such as LIME [143].
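As a concrete illustration of this notion of fidelity, the minimal sketch below measures the fraction of instances on which a surrogate model reproduces the black-box model’s predictions. This is our own illustrative formulation, not a metric defined by the surveyed works.

```python
import numpy as np

def fidelity(black_box_preds, surrogate_preds):
    """Share of instances on which an interpretable surrogate model agrees with
    the black-box model it is meant to approximate (higher suggests more fidelity
    in the sense of Molnar [118])."""
    black_box_preds = np.asarray(black_box_preds)
    surrogate_preds = np.asarray(surrogate_preds)
    return float((black_box_preds == surrogate_preds).mean())
```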

2.2.2 Stability.

An interpretation method is stable if it provides similar explanations for similar inputs [118] unless the difference between the inputs is highly important for the task at hand. For example, an explanation produced by natural language generation (NLG) would be stable if minor differences in the input resulted in similar text explanations and would be unstable if the slight differences resulted in wildly different explanations. Stability is a generally desirable trait important for research [189] and is required for a model to be trustworthy [121]. In addition, the stability of human explanations for a particular task should be considered, i.e., if explanations written by humans differ significantly from each other, then it is unreasonable to expect a model trained on such explanations to do any better. This is especially important for highly free-form interpretation methods such as natural language explanations.

2.2.3 Comprehensibility.

An interpretation is considered comprehensible if it is understandable to an end-user. For an explanation to be useful at all, it must be understandable to some degree. However, this is subjective: There is no global common standard for “understandability.” In addition, the background of the end-user matters: A medical professional will be able to understand an explanation with scientific medical terms far better than a layperson. Nevertheless, there are still several general ways to rate the interpretability of an explanation: examining its size (how much a user must process when “reading” the explanation), examining how well a human can predict a model’s prediction given just the explanation, and examining the understandability of individual features of the explanation [118]. For example, a sparse linear model with only a few non-zero weights has far fewer components for a user to consider and so would be more comprehensible than a linear model with hundreds of weights. Furthermore, comprehensibility is related to the concept of transparency [100], which refers to how well a person can understand the mechanism by which a model works. Transparency can be achieved in several ways: through being able to simulate the model in your mind (for example, a linear regression with few weights) or having deep knowledge of the underlying algorithm used by the model (for example, proving some property of any solution an algorithm will produce). Models with greater degrees of transparency are thus also more comprehensible than non-transparent models.

2.2.4 Trustworthiness.

An interpretation is trustworthy if it allows end-users to place warranted trust in the model it explains, i.e., to rely on the model’s predictions with confidence that this reliance is justified. Trustworthiness is closely tied to the aspects discussed above: an unfaithful explanation may hide the biases a model actually relies on and thereby engender unwarranted trust [75], and stability is required for a model to be considered trustworthy [121]. Recent work frames trust in terms of a contract between the user and the model, in which the explanation makes clear what behaviour the user can expect and verify [77]. Unlike faithfulness, trustworthiness is ultimately judged by the end-user, so it is typically assessed through human evaluation rather than automatic metrics.


3 INTERPRETABLE METHODS

3.1 Feature Importance

Identifying the important input features that significantly impact a model’s prediction is a straightforward way to improve a model’s local interpretability, directly linking model outputs to inputs. Important features can be, for example, words for text-based tasks or image regions for image-based tasks. This article focuses on four main methods of extracting important features as the interpretation of a model’s outputs: rationale extraction, input perturbation, attribution methods, and attention weight extraction. We present a typology of feature importance methods in Figure 2 and sample visualisations of features extracted from inputs in Figure 1.

Fig. 1.

Fig. 1. Sample visualizations of important features identified from the inputs by four different methods: (a) rationale extraction on a sentiment analysis task; (b) attention weights on a visual question answering task; (c) word importance from attribution methods on a machine translation task; (d) input perturbation on a sentiment analysis task and its extension to counterfactual explanation.

Fig. 2.

Fig. 2. Typology of local interpretation methods that identify important features from inputs.

3.1.1 Rationale Extraction.

Rationale extraction is typically used as a local interpretation method for NLP tasks such as sentiment analysis and document classification. Rationales are short, coherent phrases from the original textual input that represent the textual features contributing most to the output prediction. These identified features act as a local explanation, showing what information the model primarily attends to when making its decision for a particular input. Rationales that are valid as explanations should lead to the same prediction as the original textual input. As this area has developed, researchers have also made efforts to extract coherent and consecutive rationales so they can serve as more readable and comprehensible explanations.

Rationale extraction methods can be divided into two main streams: (1) sequential selector-predictor models, where a selector first selects rationales from the original textual input and passes them to a predictor that produces the prediction; (2) adversarial models, which involve parallel modules that calibrate the rationales extracted by the selector. In this article, we summarise several representative and milestone works for each stream.

For the selector-predictor stream, Lei et al. [92] is one of the first works on rationale extraction for NLP tasks. The selector first generates a binary vector of 0s and 1s from a Bernoulli distribution conditioned on the original textual input. This binary vector is then multiplied element-wise with the original input, where 1 indicates that a word is selected as part of the rationale and 0 indicates that it is not, resulting in a sparse input representation that marks which tokens are selected as rationales. The predictor then makes its prediction from this masked representation. Since the selected rationales are represented with non-differentiable discrete values, the REINFORCE algorithm [182] is applied to optimise the selector so it eventually produces accurate rationale selections. Lei et al. [92] performed rationale extraction for a sentiment analysis task using training data with no pre-annotated rationales to guide the learning process; the training loss is computed from the difference between the ground truth sentiment vector and the sentiment vector predicted from the rationales chosen by the selector. This selector-predictor structure is designed mainly to boost interpretability faithfulness, i.e., to select valid rationales that produce the same prediction as the original textual input. To increase the readability of the explanation, Lei et al. [92] added two regularizers to the loss function that force rationales to be consecutive words (readable phrases) and limit the number of selected words/phrases. Bastings et al. [17] follow the same selector-predictor structure as Lei et al. [92]. The main difference is that they use the rectified Kumaraswamy distribution [90] instead of a Bernoulli distribution to generate the rationale selection vector, i.e., the binary mask applied to the textual input. The Kumaraswamy distribution allows gradient estimation for optimisation, removing the need for the REINFORCE algorithm. To encourage short and coherent rationales for better readability and comprehensibility, Bastings et al. [17] also applied a relaxed form of \(L_0\) regularization [103] and Lagrangian relaxation to encourage adjacent words to be selected (or not selected) together. In contrast to the above methods, where rationale extraction is wrapped in an end-to-end model trained without annotated rationales, Du et al. [45] use rationales annotated by external experts to guide the training of the rationale selector, so the generated local explanations (short and coherent rationales) are consistent with these human annotations.
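To make the selector-predictor design concrete, the following is a minimal PyTorch-style sketch in the spirit of Lei et al. [92]. It assumes pre-computed word embeddings as input; the module names, dimensions, and regularizer weights are illustrative assumptions, and the REINFORCE update for the non-differentiable mask is only indicated in a comment.

```python
import torch
import torch.nn as nn

class Selector(nn.Module):
    """Scores each token and samples a binary rationale mask from a Bernoulli."""
    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.scorer = nn.Linear(2 * hidden_dim, 1)

    def forward(self, embeddings):                               # (batch, seq, emb_dim)
        states, _ = self.encoder(embeddings)
        probs = torch.sigmoid(self.scorer(states)).squeeze(-1)   # (batch, seq)
        mask = torch.bernoulli(probs)                            # hard 0/1 selection (non-differentiable)
        return mask, probs

class Predictor(nn.Module):
    """Predicts the label from the masked (rationale-only) input."""
    def __init__(self, emb_dim, hidden_dim, num_classes):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, embeddings, mask):
        masked = embeddings * mask.unsqueeze(-1)                 # zero out non-rationale tokens
        _, (h, _) = self.encoder(masked)
        return self.classifier(h[-1])

def loss_terms(logits, labels, mask, sparsity=1e-3, coherence=1e-3):
    """Task loss plus the two regularizers of Lei et al. [92]: a sparsity penalty
    on the number of selected tokens and a continuity penalty encouraging
    consecutive selections."""
    task = nn.functional.cross_entropy(logits, labels)
    sparse = mask.sum(dim=1).mean()
    coherent = (mask[:, 1:] - mask[:, :-1]).abs().sum(dim=1).mean()
    # Because `mask` is sampled, the selector itself is trained with REINFORCE,
    # using this combined cost as the reward signal.
    return task + sparsity * sparse + coherence * coherent
```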

For the stream of adversarial-based models, a third module is usually added to the selector-predictor stack, functioning as a guide that boosts the faithfulness of the rationales and improves the comprehensibility of the interpretation. For example, to boost the faithfulness of the extracted rationales, Yu et al. [190] feed the target sentiment labels as additional inputs to the rationale selector, strengthening its contribution to the prediction. Additionally, to prevent the rationale selector from selecting meaningless small snippets, this work adds a third component: a complement predictor, which predicts the label of the original textual input from the non-rationale words alone. The complement predictor and the selector work much like the discriminative and generative networks in generative adversarial networks (GANs) [56]: the rationale selector aims to extract as many prediction-relevant words as possible as rationales so the complement predictor cannot predict the true label. Similar to Yu et al. [190], Chang et al. [29] also involve a third module in which the target labels of the original inputs are used as additional inputs, with the addition that these target labels may be incorrect. This work proposes a counterfactual rationale generator that extracts the rationales responsible for false predictions, and a discriminator is then trained to distinguish between rationales from the factual and counterfactual generators. More recent work such as Reference [150] reduces the complexity of using three modules: it constructs a guider model that operates over the original textual input alongside the rationale selector in an adversarial architecture, encouraging the prediction vectors of the two models to be close to each other and thereby improving the faithfulness of the extracted rationales. To achieve better comprehensibility, Reference [150] also uses a language model as a regularizer, which significantly improves the fluency of the extracted rationales by encouraging the selection of consecutive, well-formed tokens.

In general, work that uses rationales extracted from the original textual input as a model’s local interpretation focuses on the faithfulness and comprehensibility of the interpretation. Besides selecting rationales that represent the complete input well enough to reproduce its prediction, extracting short and consecutive sub-phrases is also a key objective of current rationale extraction work. Such fluent, consecutive sub-phrases make rationale extraction a user-friendly interpretation method that provides readable and understandable explanations to non-expert users without NLP-related knowledge.

3.1.2 Input Perturbation.

Another method for identifying important features of textual inputs is input perturbation. Here, a word (or a few words) of the original input is modified or removed (i.e., “perturbed”), and the resulting change in performance is measured. The larger the drop in the model’s performance, the more critical these words are to the model, and they are therefore regarded as important features. Input perturbation is usually model-agnostic and does not alter the original model’s architecture. The main difference among the proposed input perturbation methods lies in how tokens or phrases of the original inputs are perturbed into new instances.
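As a simple illustration of deletion-based perturbation, the sketch below scores each word by the drop in the predicted class probability when that word is removed. Here, `predict_proba` is an assumed black-box interface that maps a string to class probabilities, and whitespace tokenisation is a simplification.

```python
def word_importance(text, predict_proba, target_class):
    """Leave-one-out perturbation: score each word by how much the probability
    of the target class drops when the word is removed from the input."""
    tokens = text.split()
    base = predict_proba(" ".join(tokens))[target_class]
    scores = []
    for i in range(len(tokens)):
        perturbed = " ".join(tokens[:i] + tokens[i + 1:])
        drop = base - predict_proba(perturbed)[target_class]
        scores.append((tokens[i], drop))  # larger drop => more important word
    return sorted(scores, key=lambda pair: -pair[1])
```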

Ribeiro et al. [143] proposed Local Interpretable Model-agnostic Explanations (LIME), which can be used as an interpretation method for any black-box model. The main idea of LIME is to approximate the black-box model locally with a transparent model trained on variants of the original input. For natural language processing tasks such as text classification, words of the original textual input are randomly removed, with a binary representation marking which words are included. Basaj et al. [16] applied LIME to a QA task to identify the important words in a question, treating the question words as features while holding the associated context (i.e., the text containing the answer to the given question) constant. The results indicate that in QA tasks the complete question plays a minor role, and a small number of question words is sufficient for correct answer prediction.
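For reference, a typical LIME call on a text classifier using the `lime` package might look like the following. Here, `classifier_fn` is an assumed function that maps a list of raw strings to an array of class probabilities, and the class names and sample counts are arbitrary illustrative choices.

```python
from lime.lime_text import LimeTextExplainer

# classifier_fn: List[str] -> np.ndarray of shape (n_texts, n_classes); assumed to exist.
explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "The movie is not good",   # instance to explain
    classifier_fn,             # black-box prediction function
    num_features=5,            # number of words kept in the explanation
    num_samples=1000)          # number of perturbed samples LIME draws
print(explanation.as_list())   # list of (word, weight) pairs for the explained class
```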

Ribeiro et al. [144] argued that the important features identified by Ribeiro et al. [143] are word-level (single-token) rather than phrase-level (consecutive-token) features. Word-level features relate to only one instance and cannot provide general explanations, which makes it difficult to extend such explanations to unseen instances. For example, in sentiment analysis, “not” in “The movie is not good” is a contributing feature for negative sentiment but is not a contributing feature for positive sentiment in “The weather is not bad.” The single token “not” is insufficient as a general explanation for unseen instances, as it leads to different meanings when combined with different words. Thus, Ribeiro et al. [144] emphasized phrase-level features for more comprehensive local interpretations and proposed a rule-based method for identifying critical features for predictions. Their algorithm iteratively selects predicates from the input as key tokens while replacing the remaining tokens with random tokens that have the same POS tags and similar word embeddings. If the probability of classifying the perturbed text into the same class as the original text is above a predefined threshold, then the selected predicates are taken as the key features that interpret the prediction.

Similar to Ribeiro et al. [143, 144], Alvarez-Melis and Jaakkola [5] also proposed a model-agnostic interpretation method that relates inputs to outputs through perturbed inputs generated by a variational auto-encoder applied to the original input. Each perturbed input is intended to have a meaning similar to the original. A bipartite graph is then constructed linking these perturbed inputs and the outputs, and the graph is partitioned to highlight which input tokens are relevant to which output tokens.

Feng et al. [54] proposed a method that gradually removes unimportant words from the original text while maintaining the model’s performance; the remaining words are then considered the important features for prediction. The importance of each token is measured with a gradient approximation: the dot product between a token’s word embedding and the gradient of the output with respect to that embedding [47]. The authors show that while the reduced inputs are nonsensical to humans, they are still sufficient for the model to maintain a similar level of accuracy compared with the original inputs.
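A rough sketch of this gradient approximation is given below. It assumes a classifier that accepts pre-computed embeddings through an `inputs_embeds` argument (as in HuggingFace-style models) and returns a logits tensor; both are assumptions for illustration rather than the exact setup of Feng et al. [54].

```python
import torch

def gradient_input_importance(model, embeddings, target_class):
    """Approximate token importance as the dot product between each token's
    embedding and the gradient of the target logit w.r.t. that embedding."""
    embeddings = embeddings.clone().detach().requires_grad_(True)  # (1, seq, dim)
    logits = model(inputs_embeds=embeddings)                       # assumed interface
    logits[0, target_class].backward()
    # Sum grad * embedding over the embedding dimension: one importance score per token.
    return (embeddings.grad * embeddings).sum(dim=-1).squeeze(0)
```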

Input perturbation seems a straightforward way to identify significant input features by measuring the target task’s performance change on perturbed instances. However, some works question the faithfulness of input perturbation. For example, Reference [154] conducted several experiments and argued that when the distribution of perturbed instances differs from that of the original instances, the explanations of LIME [143] are not faithful. Another problem with most input perturbation explanations is that the identified important features are mostly independent tokens rather than coherent phrases, as argued by Ribeiro et al. [144], which limits comprehensibility. A recent line of local explanation work, counterfactual explanations [31, 145, 184], uses input perturbation to show what would happen if certain features were replaced, thereby demonstrating that those features are important for a particular model decision. These counterfactual explanations extend input perturbation beyond the simple word level and present the interpretation differently, through more intuitive counterfactual examples; such a presentation gives ordinary users a more intuitive understanding.

3.1.3 Attention Weights.

An attention weight is a score used in a weighted sum over input representations in the intermediate layers of a neural network [14]. Extracting attention weights over the inputs to provide local interpretations for predictions is common among models that use attention mechanisms. For NLP tasks with only textual inputs, tokens with higher attention weights are considered to have more impact on the outputs and are therefore regarded as more important features. Attention weights have been used for explainability in sentiment analysis [107, 112, 173], question answering [151, 164, 166], and neural machine translation [14, 109]. In tasks with both visual and textual inputs, such as Visual Question Answering (VQA) [25, 43, 105, 186, 191] and image captioning [7, 61, 185], attention weights are extracted from both images and questions to identify contributing features from both modalities. In such multi-modal tasks, it is also important to boost the consistency between the attended image regions and sentence tokens for a plausible explanation. In recent years, different attention mechanisms have been proposed, including self-attention [169] and co-attention for multi-modal inputs [191], aiming for attention weights that more genuinely reflect the factors contributing to the final prediction.
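As a concrete example of extracting attention weights post hoc, the sketch below pulls the last-layer attention from a BERT encoder via the HuggingFace `transformers` library. The checkpoint name and the choices of layer, head averaging, and the [CLS] row are arbitrary illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The movie is not good", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
last_layer = outputs.attentions[-1][0]        # (heads, seq, seq)
cls_attention = last_layer.mean(dim=0)[0]     # average heads, take attention from [CLS]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, weight in zip(tokens, cls_attention.tolist()):
    print(f"{token}\t{weight:.3f}")           # higher weight = more attended token
```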

Though attention mechanisms have proved effective for improving performance on different tasks and have been used as indicators of important features to explain a model’s predictions, there has been ongoing debate about the faithfulness of attention weights as an interpretation of neural networks.

Bai et al. [15] proposed the concept of combinatorial shortcuts caused by the attention mechanism, arguing that the masks used to map the query and key matrices in self-attention [169] are biased, which can lead to the same positions being attended to regardless of the actual word semantics of different inputs. Clark et al. [34] found that a large amount of BERT’s [40] attention focuses on meaningless tokens such as the special token [SEP]. Jain and Wallace [79] argued that tokens with high attention weights are not consistent with the important tokens identified by other interpretation methods, such as gradient-based measures. Serrano and Smith [149] applied intermediate representation erasure and claimed that attention can, at best, indicate the importance of intermediate components and is not faithful enough to explain the model’s decisions at the level of the actual inputs.

In contrast, Wiegreffe and Pinter [181], in their work “Attention is not not explanation,” argue specifically against Reference [79], stating that whether attention weights are faithful explanations depends on the definition of explanation, and they conduct four experiments to show when attention can be used as an explanation. A similar view is offered by Jacovi and Goldberg [76], who illustrate that in some cases attention maps over the input can be considered a faithful explanation, which can be verified with the erasure method [9, 54], i.e., by checking whether erasing the attended tokens from the input changes the prediction.
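A minimal version of such an erasure check is sketched below; `predict` and `attention_scores` are assumed to be provided by the model under study, and the whitespace tokenisation and choice of k are simplifications.

```python
def attention_erasure_test(text, predict, attention_scores, k=3):
    """Erasure-style check: remove the k most-attended tokens and see whether
    the predicted label changes; a change suggests those tokens mattered."""
    tokens = text.split()
    original_label = predict(" ".join(tokens))
    top_k = sorted(range(len(tokens)), key=lambda i: -attention_scores[i])[:k]
    reduced = [tok for i, tok in enumerate(tokens) if i not in top_k]
    reduced_label = predict(" ".join(reduced))
    return original_label, reduced_label
```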

To improve the faithfulness of attention as an explanation, some recent works have proposed different methods. For example, Bai et al. [15] generate unbiased mask distributions by using random masks and training only the attention layers while fixing the other downstream parts of the model, which scales the attention weights towards tokens that are truly correlated with the predicted label. Chrysostomou and Aletras [33] introduced three task-scaling mechanisms that scale the word representations in different ways before they are passed to the attention mechanism, and claim that such scaled representations help produce more faithful attention-based explanations.

Overall, the dilemma of using high-attention input tokens as the explanation for a black-box model’s decision stems from the varying definitions and inconsistent evaluations of explanation faithfulness across different works. Jacovi and Goldberg [76] suggest that a possible way to resolve this issue is to construct a unified evaluation of the degree of faithfulness, either at the level of a specific task or at the level of sub-spaces of the input space. Regardless of the debate over faithfulness, explanation by attention weights also has a lower level of readability: compared to rationale extraction works that explicitly force consecutive rationales to be extracted for better comprehensibility, current works using attention as explanation neglect this aspect of interpretability. Therefore, even in cases where input tokens with high attention weights are faithful explanations, it is hard for non-experts to understand explanations consisting of non-coherent highlighted tokens. However, for multimodal tasks such as visual question answering, some works use attention maps over the images as the explanation [108] or as part of the explanation [183]; the attended regions are usually consecutive pixels, which can be more straightforward for non-expert users to understand than attention maps over pure text.

3.1.4 Attribution Methods.

Another group of methods for detecting the input features that contribute most to a specific prediction is attribution methods, which aim to interpret prediction outputs by examining the gradients of a model. Common attribution methods include DeepLIFT [153], Layer-wise Relevance Propagation (LRP) [13], deconvolutional networks [192], and guided back-propagation [157].

Extracting model gradients allows high-contributing input features for a given prediction to be identified. However, directly extracting gradients does not satisfy two key properties: sensitivity and implementation invariance. Sensitivity requires that if two inputs differ in a single feature and lead to different predictions, then that feature should be marked as important to the prediction. Implementation invariance requires that two functionally equivalent models produce the same attributions, regardless of how they are implemented. Addressing these properties, Sundararajan et al. [163] proposed the integrated gradients method. Integrated gradients are the accumulated gradients of all points on a straight line between an input and a baseline point (e.g., a zero word embedding). He et al. [65] applied this method to neural machine translation to find the contribution of each input word to each output word; here, the baseline input is a sequence of zero embeddings of the same length as the input to be translated. Mudrakarta et al. [119] applied integrated gradients to a question-answering task to identify the critical words in questions and found that only a few words in a question contribute to the model’s answer prediction.
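A compact sketch of integrated gradients is shown below, using a simple Riemann approximation of the path integral; as before, the `inputs_embeds` interface and the logits-returning model are illustrative assumptions.

```python
import torch

def integrated_gradients(model, embeddings, baseline, target_class, steps=50):
    """Accumulate gradients along the straight line from a baseline (e.g., all-zero
    embeddings) to the input embeddings, then scale by (input - baseline)."""
    total_grad = torch.zeros_like(embeddings)
    for step in range(1, steps + 1):
        alpha = step / steps
        point = (baseline + alpha * (embeddings - baseline)).detach().requires_grad_(True)
        logits = model(inputs_embeds=point)     # assumed interface
        logits[0, target_class].backward()
        total_grad += point.grad
    attributions = (embeddings - baseline) * total_grad / steps
    return attributions.sum(dim=-1).squeeze(0)  # one attribution score per token
```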

Besides extracting gradients, scoring input contributions based on the model’s hidden states is also used for attribution. For example, Du et al. [46] proposed a post hoc interpretation method that leaves the original model untouched and examines the hidden states passed along by RNNs. Ding et al. [42] applied LRP [13] to neural machine translation to provide interpretations based on the hidden state values of each source and target word.

Attribution methods were among the first approaches used by deep learning researchers to explain neural networks, identifying input features with large gradients. Most attribution methods were proposed before the mature development and extensive research of rationale extraction, attention mechanisms, and even input perturbation methods. Compared to the other feature importance methods, attribution methods pay little attention to the faithfulness and comprehensibility of the interpretation. Visualizing the identified input features is as plausible to non-expert users as in the other three feature importance methods, but attribution methods do not attempt to form the interpretation into coherent sub-phrases for better readability and easier understanding. Thus, compared to rationale extraction, attention weight extraction, and input perturbation, attribution is more of a diagnostic method for deep learning experts to understand a model’s decisions and learn its functionality.

3.1.5 Datasets.

Tasks used for examining the interpretation methods discussed above include sentiment analysis, reading comprehension, neural machine translation, question answering, and visual question answering. Below, we list and summarise some common datasets used for these tasks:

(1)

The BeerAdvocate review dataset [115] is a multi-aspect sentiment analysis dataset that contains around 1.5 million beer reviews written by online users. The average length of each review is about 145 words. Each review addresses either the beer overall or a particular aspect, such as appearance, smell, palate, or taste, and comes with an overall rating for the beer plus four separate ratings for the four aspects, each ranging from 0 to 5.

(2)

IMDB [110] is a large movie review dataset usually used for binary sentiment classification. It contains 50k reviews labelled as positive or negative, split in half into train and test sets. The average review length is 231 words and 10.7 sentences.

(3)

WMT is a series of workshops on machine translation. Tasks announced in these workshops include the translation of different language pairs, such as French to English, German to English, and Czech to English in WMT14, with Chinese to English added in WMT17. The sources are normally news and biomedical publications. For many papers examining interpretation methods, the commonly used datasets are the French-English and Chinese-English news translation sets.

(4)

HotpotQA [187] is a multi-hop QA dataset that contains 113k Wikipedia-based question-answer pairs, where multiple documents must be used to answer each question. Apart from questions and answers, the dataset also contains sentence-level supporting facts for each document. This dataset is often used to test interpretation methods that identify sentence-level significant features for answer prediction.

(5)

SQuAD [140] is a reading comprehension dataset that contains 100k question-answer pairs from Wikipedia articles. SQuAD v2 [139], proposed in 2018, adds around 50k unanswerable questions that are written to appear similar to answerable ones.

(6)

VQA datasets are used for multi-modal tasks with both textual and visual inputs. VQA v1 [8], the first visual question-answering dataset, contains 204,721 images, 614,163 questions, and 7,964,119 answers; most images are real images from the MS COCO dataset [97], while 50,000 are newly generated abstract scenes of clipart objects. VQA v2 [57] is an improved version of VQA v1 that mitigates the question-bias problem and contains 1M image-question pairs with 10 answers per question. Work on VQA commonly uses attention weight extraction as a local interpretation method.

3.2 Natural Language Explanation

Natural Language Explanation (NLE) refers to the method of generating free-text explanations for a given input and its prediction. In contrast to rationale extraction, where the explanation text is limited to text found within the input, NLE is entirely freeform, making it an incredibly flexible explanation method. This has allowed it to be applied to tasks outside of NLP, including reinforcement learning [48], self-driving cars [85], and solving mathematical problems [99]. We focus here on methods in which explanations are generated without any (or with minimal) scaffolding; that is, we do not cover methods that form “natural language explanations” by filling in templates, but rather cases where the explanation model is tasked with generating the entirety of the explanation content itself.

3.2.1 Multimodal NLE.

Multimodal NLE focuses on generating natural language explanations for tasks that involve multiple input modalities, including images and video. While explanations may span multiple modalities, we focus on cases where the explanations significantly involve natural language. Much work, including on text-only NLE, stems from Hendricks et al. [66], which draws upon image captioning research to generate explanations for image classification predictions on bird images. The model first makes a prediction using an image classification network, and the features from the final layers of the network are then fed into an LSTM decoder [71] to generate the explanation text. The explanation generator is trained with a reinforcement learning-based approach both to match ground truth descriptions and to be usable to predict the image label itself. Later work has built directly on this model by improving the use of image features during explanation generation [177], using a critic model to improve the relevance of the explanations [67], and conditioning on specific image attributes [168]. Park et al. [125] use an attention mechanism to augment the text-only explanations with heatmap-based explanations and find that training a model to provide both types of explanations improves the quality of both the textual and visual explanations. Most of these earlier approaches use learned LSTM decoders, learning a language generation module from scratch, and most generate their explanations post hoc, making a prediction before generating an explanation. This means that while the explanations may serve as valid reasons for the prediction, they may not truthfully reflect the reasoning process of the model itself. Wu and Mooney [183] attempt to build a multimodal model whose explanations better match the model’s reasoning process by training the text generator to produce explanations that can be traced back to the objects used for prediction in the image, as determined by gradient-based attribution methods. They explicitly evaluate their model’s faithfulness using LIME and human evaluation and find that this approach improves performance and does indeed result in explanations faithful to the gradient-based explanations.

More recently, NLE datasets have been developed for VQA [72], self-driving car decisions [85], arcade game agents [49], visual commonsense [193], physical commonsense [138], image manipulation detection [37], explaining facial biometric scans [117], as well as for more general vision-language benchmarks [84].

The recent rise of large pretrained language models [40, 128, 134] has also impacted multimodal NLE, with recent approaches replacing the standard LSTM-based decoder with pretrained text generation models such as GPT-2 [12, 84, 114] with a good deal of success. Kayser et al. [84] additionally find that using a pre-trained unified vision-language model along with GPT-2 works better than other combinations of vision-only and language-only models. This suggests that further utilising the growing number of large pre-trained multimodal models such as VLBERT [162], UNITER [32], or MERLOT [194] may lead to improved explanations for multimodal tasks. However, while these models often yield higher-quality explanations that better align with human preferences, the use of large unified transformer models means that the faithfulness of these explanations in representing the reasoning process of the model is hard to determine, as the exact reasoning processes used by these large models are hard to uncover.

3.2.2 Text-only NLE.

Earlier work on explanations accompanying NLP tasks largely examined integrating them as inputs for fact-checking, concept learning, and relation extraction [4, 62, 158]. These efforts provided useful datasets for examining natural language explanations, but the first work on generating natural language explanations for NLP tasks in an automated fashion was Camburu et al. [23], using a set of explanations gathered for the SNLI dataset [21] called e-SNLI. Similar to the multimodal models discussed above, the baseline models for e-SNLI proposed by Camburu et al. [23] are made up of two parts: a predictor module and an explanation module, with the best-performing model first generating explanations and then using these explanations to make predictions. While this tighter integration of explanation generation into the overall model may suggest more faithful and higher-quality explanations, Camburu et al. [24] show that this model can still provide explanations that are inconsistent with its predictions, suggesting that either the explanations are faulty or the model uses a flawed decision-making process. Several works try to improve the faithfulness of such models by using the generated explanations as inputs to the final predictor model [89, 137, 197, 198]. By “explaining then predicting,” the explanations are by construction used as part of the prediction process, which may also aid overall model performance by exposing latent aspects of the task [63]. Inoue et al. [74] additionally show that summarisation models can be trained to serve as explanation generators in this construction. However, recent work by Wiegreffe et al. [179] suggests that jointly producing explanations actually results in models with a stronger correlation between the predicted label and the explanation, suggesting these models are more faithful than explain-then-predict methods despite the different construction. Further evaluation linking the underlying model’s predictive mechanics with the generated explanations (e.g., Prasad et al. [132] for highlighted rationales) may help investigate how closely these explanations align with the underlying model.
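The explain-then-predict construction can be summarised with a trivial pipeline sketch; `explainer` and `predictor` stand in for whatever generation and classification models a given paper uses.

```python
def explain_then_predict(x, explainer, predictor):
    """Pipeline construction: the explanation is generated first and is the only
    input the predictor sees, so by construction the explanation carries the
    information used for the final prediction."""
    explanation = explainer(x)         # e.g., a seq2seq model producing free text
    label = predictor(explanation)     # classifier that conditions only on the explanation
    return label, explanation
```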

Beyond NLI, other early tasks to which NLE was applied include commonsense QA [137] and user recommendations [123]. While early work used human-collected explanations, Ni et al. [123] shows that using distant supervision via rationales can also work well for training explanation-generating models. Li et al. [93] additionally embed extra non-text features (i.e., user ID, item ID) by using randomly initialised token embeddings. This provides a way to integrate non-text features besides the use of large pre-trained multimodal models.

Much like multimodal NLE, large pre-trained language models have also been integrated into text-based NLE tasks, and most recent papers make use of these models in some way. Rajani et al. [137] introduce an NLE dataset for commonsense QA (“cos-e”) and use a pre-trained GPT model [133] to generate explanations that are then used to make a final prediction. More recently, wT5 [122], which follows the T5 model [135] in framing explanation generation and prediction as a purely text-to-text task, generates the prediction followed by a text explanation. Recent work has shown that using these models allows good explanation generation (and may even improve performance) for tasks and settings with little data [51, 80, 113, 188]. Automatically collecting explanations from existing datasets or generating explanations using existing models can also provide extra supervision for learning to generate NLEs in limited-data settings [22]. This highlights the strength of NLEs: by framing the explanation as a text generation problem, explanation generation is as simple as fine-tuning or even few-shot prompting a large language model to produce explanations, often with fairly good results. However, while these approaches are often impressive, generated explanations can still “hallucinate” content not actually present in the training or input data and fail to generalise to challenging test sets such as HANS [200].
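To illustrate the text-to-text framing, the sketch below shows how a wT5-style input/output format could be run through a T5 checkpoint with the HuggingFace `transformers` library. The prefixes, example text, and checkpoint are our own illustrative assumptions, and an off-the-shelf `t5-base` model would first need to be fine-tuned on (input, label plus explanation) pairs to actually produce such outputs.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# wT5-style source text: the task prefix asks the model to explain its prediction.
source = ("explain nli premise: A man is playing a guitar on stage. "
          "hypothesis: A person is performing music.")
# During fine-tuning, the target would look like:
#   "entailment explanation: playing a guitar on stage is performing music."
inputs = tokenizer(source, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```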

3.2.3 NLE in Dialog.

While the above work has all assumed a setup where a model is able to generate only one explanation and has no memory of previous interactions with a user, some work has examined dialog-based setups where a user is assumed to repetitively interact with a model. Madumal et al. [111] propose a model for the components of an explanation dialog comprising two sections: an explanation dialog, which consists mainly of presenting and accepting explanations; and an argument dialog, where the provided explanation is challenged with an argument. Rebanal et al. [142] draw on QA systems to design a model for explaining basic algorithms, presenting the model as an “interactive dialog that allows users to ask for specific kinds of explanations they deem useful.” More recently, Li et al. [95] use feedback from users as explanations to supervise and improve an open domain QA model, showing how models can improve by taking into account live feedback from users. Given the success of using human-written instructions to train large models [148, 176], making further use of human feedback to improve and guide the way explanations are generated may further improve the quality and utility of NLEs.

3.2.4 Datasets.

There are a number of NLE datasets for NLP tasks, which we summarise in Table 1. Many of these datasets consist of human-generated explanations applied to existing datasets or make use of some automatic extraction method to retrieve explanations from supporting documents. While most datasets simply present one explanation per input sample, others present setups where multiple explanations are attached to each sample, but only one is valid [172, 195]. Wiegreffe and Marasovic [178] also summarise existing NLE-for-NLP datasets, focussing also on text-based rationale and structured explanation datasets. We also provide a list of datasets for multimodal NLE in Table 2.

Table 1.

Ref. | Year | Dataset Name | Task | Human-written explanations?
[23] | 2016 | e-SNLI | NLI |
[81] | 2016 | - | Science Exam QA | Extracted from auxiliary documents
[99] | 2017 | - | Algebraic Word Problems |
[158] | 2017 | - | Email Phishing classification |
[62] | 2018 | BabbleLabble | Relation Extraction |
[4] | 2018 | LIAR-PLUS | Fact-checking | Extracted from auxiliary documents
[137] | 2019 | cos-e | Commonsense QA |
[172] | 2019 | - | Sense making |
[11] | 2019 | ChangeMyView | Opinion changing | Extracted from reddit posts
[195] | 2020 | WinoWhy | Winograd Schema |
[88] | 2020 | PubHealth | Medical claim fact-checking | Extracted from auxiliary documents
[174] | 2020 | - | Relation Extraction, Sentiment Analysis |
[161] | 2020 | e-FEVER | Fact-checking | Generated using GPT-3
[3] | 2021 | ECQA | Commonsense QA |
[22] | 2021 | e-\(\delta\)-NLI | \(\delta\)-NLI Rationale Generation | Extracted from auxiliary documents, automatically generated

Table 1. Summary of Datasets with Natural Language Explanations for Text-based Tasks

Table 2.

Ref. | Year | Dataset Name | Task | Human-written explanations?
[72] | 2018 | VQA-X | Visual QA |
[72] | 2018 | ACT-X | Activity Recognition |
[85] | 2018 | BDD-X | Self-driving Car Decision Explanation |
[94] | 2018 | VQA-E | Visual QA | Generated from captions
[49] | 2019 | - | Frogger Game |
[193] | 2019 | VCR | Visual Commonsense Reasoning |
[138] | 2020 | ESPIRIT | Physical Reasoning |
[91] | 2020 | VLEP | Event Prediction |
[37] | 2021 | EMU | Understanding edits |
[84] | 2021 | E-ViL | Vision-language Tasks |

Table 2. Summary of Datasets with Natural Language Explanations for Multimodal Tasks

3.2.5 Challenges and Future Work.

NLE is very attractive as a human-comprehensible approach to interpretation: rather than trying to interpret model parameters, NLE-based approaches essentially allow models to “talk for themselves.” Despite being freely generated, these explanations still display a degree of agreement with gradient-based explanation methods and can be quite robust to noise [179]. This suggests that the approach exhibits a degree of faithfulness and stability, despite the lack of any formal guarantee that it has either quality. Furthermore, pipeline methods that use explanations for predictions can further guarantee that the generated explanations represent the information being used for prediction, even if their performance suffers compared to joint prediction models. NLEs also have the benefit of being extremely comprehensible: unlike text rationales or gradient methods, which often require some understanding of the model being used, natural language explanations can be easily read and understood by anyone, and tailoring explanations to a specific audience is “simply” a matter of training a model on similar explanations, which is possible even in low-data scenarios [51, 80, 113, 188]. Finally, the trustworthiness of NLE methods is not often explicitly evaluated; the focus has instead been on overall “explanation quality” when evaluating NLEs [35, 74]. While rating explanation quality may in some ways suggest how trustworthy annotators find the explanations, more careful consideration of the type of contract-based trust [77] an NLE-based model involves is required when determining the utility of deploying these models in real-world scenarios.

Overall, NLE is a very flexible and attractive explanation method, with the potential to greatly improve model explainability without requiring complex setups: Just train your model to output explanations [122]. However, evaluation must be carefully considered due to issues with automated metrics [35] and the human-generated explanations themselves [26]. In addition, further exploring the link between generated NLEs and other explanation or interpretability methods may further yield insights into models and improve our understanding of the faithfulness of this method.

3.3 Probing

Linguistic probes, also referred to as “diagnostic classifiers” [73] or “auxiliary tasks” [2], are a post hoc method for examining the information stored within a model. Specifically, the probes themselves are (often small) classifiers that take as input some hidden representations (either intermediate representations within a model or word embeddings) and are trained to perform some small linguistic task, such as verb-subject agreement [55] or syntax parsing [70]. The intuition is that if more task-relevant information is present within the hidden representations, then the classifier will perform better, allowing researchers to determine the presence or absence of linguistic knowledge both in word embeddings and at various layers within a model. However, recent research [70, 130, 141] has shown that probing experiments require careful design and consideration if they are to provide truly faithful measurements of linguistic knowledge.
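To make this setup concrete, the following minimal sketch (our own illustration, not drawn from any of the surveyed works) trains a logistic-regression probe on frozen hidden states of a pre-trained BERT model to predict coarse part-of-speech tags; the model name, layer index, and tiny word list are placeholder assumptions rather than a recommended experimental design.

```python
# Minimal probing sketch: train a linear classifier on frozen hidden states
# to test whether a given layer encodes part-of-speech information.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def token_representation(word, layer=6):
    """Hidden state of the word's first sub-token at the chosen layer."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # (num_layers + 1) tensors of [1, seq_len, dim]
    return hidden_states[layer][0, 1].numpy()          # position 0 is the [CLS] token

# Toy supervision: a handful of words with coarse POS tags (placeholder data only).
examples = [("run", "VERB"), ("eat", "VERB"), ("sleep", "VERB"),
            ("dog", "NOUN"), ("table", "NOUN"), ("river", "NOUN")]
X = [token_representation(w) for w, _ in examples]
y = [tag for _, tag in examples]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Higher accuracy is read as more POS information being linearly decodable at this layer.
print("probe accuracy:", probe.score(X_test, y_test))
```

In a real probing study the same probe would be trained on thousands of annotated tokens and repeated across layers, which is what allows the layer-wise comparisons discussed below.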

While current probing methods do not provide layperson-friendly explanations, they do allow for research into the behaviour of popular models, allowing a better understanding of what linguistic and semantic information is encoded within a model [98]. Hence, the target audience of a probe-based explanation is not a layperson, as is the case with other interpretation methods discussed in this article, but rather an NLP researcher or ML practitioner who wishes to gain a deeper understanding of their model. We summarise the typology of different probing methods in Figure 3. Note that, unlike the previous sections, we do not provide a list of common datasets here, as probing research has largely not focused on any particular subset of datasets and can be applied to most text-based tasks.

Fig. 3. Typology of probing.

3.3.1 Embedding Probes.

Early work on probing focused on using classifiers to determine what information could be found in distributional word embeddings [116, 126]. For example, Gupta et al. [59], Köhn [87], and Rubinstein et al. [146] all investigated the information captured by word embedding algorithms through the use of simple classifiers (e.g., linear or logistic classifiers) to predict properties of the embedded words, such as part-of-speech or entity attributes (e.g., the colour of the entity referred to by a word). These works all found that word embeddings captured the properties probed for, albeit to varying extents. More recently, Sommerauer and Fokkens [155] used both a logistic classifier and a multi-layer perceptron (MLP) to determine the presence of certain semantic information in Word2Vec embeddings, finding that visual properties (e.g., colour) were not represented well, while functional properties (e.g., “is dangerous”) were. Research into distributional models has declined recently due to the rise of pre-trained language models such as BERT [40].

Alongside word embeddings, sentence embeddings have also been the target of analysis via probing. Ettinger et al. [52] (following Gupta et al. [59]) train a logistic classifier to determine whether a sentence embedding contains a given word, and whether it contains a given word in a given semantic role. Adi et al. [2] train MLP classifiers on sentence embeddings to determine if the embeddings contain information about sentence length, word content, and word order. They examine LSTM auto-encoder, continuous bag-of-words (CBOW), and skip-thought embeddings, finding that CBOW is surprisingly effective at encoding the examined sentence properties in low dimensions, while the LSTM auto-encoder-based embeddings perform very well, especially with a larger number of dimensions. Further developing this work, Conneau et al. [36] propose 10 different probing tasks, covering semantic and syntactic properties of sentence embeddings and controlling for various cues that may allow a probe to “cheat” (e.g., lexical cues). To determine if encoding these properties aids models in downstream tasks, the authors also measure the correlation between probing task performance and performance on a set of downstream tasks. More recently, Sorodoc et al. [156] propose 14 additional probing tasks for examining information stored in sentence embeddings relevant to relation extraction.
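As a toy illustration of this kind of sentence-embedding probe (our own sketch, not the setup or data of Adi et al. [2]), the snippet below builds CBOW-style sentence embeddings by averaging synthetic word vectors and trains a logistic-regression probe on the word-content task, i.e., whether the embedding reveals that a particular target word appeared in the sentence; all vectors, sentence lengths, and the target word index are invented.

```python
# Toy sentence-embedding probe in the spirit of Adi et al.'s word-content task [2]:
# can a linear classifier tell whether a CBOW-style (averaged) sentence embedding
# contains a particular target word? All data here is synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
vocab_size, dim, target_word = 1000, 50, 7
word_vectors = rng.normal(size=(vocab_size, dim))

def cbow_embedding(word_ids):
    """Average the word vectors of a sentence (CBOW-style sentence embedding)."""
    return word_vectors[word_ids].mean(axis=0)

sentences, labels = [], []
for _ in range(2000):
    sent = list(rng.integers(0, vocab_size, size=10))
    contains = rng.random() < 0.5
    if contains:
        sent[rng.integers(0, 10)] = target_word   # inject the target word at a random position
    sentences.append(np.array(sent))
    labels.append(int(contains))

X = np.stack([cbow_embedding(s) for s in sentences])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=2000).fit(X_train, y_train)
# Accuracy well above 50% indicates that the averaged embedding retains word-content information.
print("word-content accuracy:", probe.score(X_test, y_test))
```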

3.3.2 Model Probes.

Following the work on probing distributional embeddings, Shi et al. [152] extended probing to NLP models, training a logistic classifier on the hidden states of LSTM-based Neural Machine Translation (NMT) models to predict various syntactic labels. Similarly, they train various decoder models to generate a parse tree from the encodings provided by these models. By examining the performance of these probes on different hidden states, they find that lower-layer states contain more fine-grained word-level syntactic information, while higher-layer states contain more global and abstract information. Following this, Belinkov et al. [18] and Belinkov et al. [20] both examine NMT models with probes in more detail, uncovering various insights about the behaviour of NMT models, including a lack of powerful representations in the decoder, and that the target language of a model has little effect on the source language representation quality. Instead of using a logistic classifier, both studies opt for a basic neural network featuring a hidden layer and a ReLU activation function. This choice exhibits analogous trends to those observed with a simpler classifier, while yielding superior performance. More recently, Raganato and Tiedemann [136] analysed transformer-based NMT models using a similar probing technique alongside a host of other analyses. Finally, Dalvi et al. [38] presented a method for extracting salient neurons from an NMT model by utilising a linear classifier, allowing examination of not just information present within a model but also what parts of the model contribute most to both specific tasks and the overall performance of the model.

Probing is not limited to NMT, however: Research has also turned to examining the linguistic information encoded by language models. Hupkes et al. [73] utilised probing methods to explore how well an LSTM model for solving basic arithmetic expressions matches the intermediate results of various solution strategies, thus examining how LSTM models break up and solve problems with nested structures. Utilising the same method, Giulianelli et al. [55] investigated how LSTM-based language models track agreement information. The authors trained their probe (a linear model) on the outputs of an LSTM across timesteps and components of the model, showing how the information encoded by the LSTM model changes over time and across model parts. Jumelet and Hupkes [82] and Zhang and Bowman [196] also probe LSTM-based models for particular linguistic knowledge, including NPI-licensing and CCG tagging. Importantly, the authors find that even untrained LSTM models contain information that probes can exploit to memorise labels for particular words, highlighting the need for careful control of probing tasks (we discuss this further in the next section). More recently, Sorodoc et al. [156] probe LSTM- and transformer-based language models for referential information. We also note that probing has been applied to speech processing–based models [19, 131].

Finally, probing-based analyses of deep pre-trained language models have also been popular as a method for understanding how these models internally represent language. Peters et al. [127] briefly utilised linear probes to investigate the presence of syntactic information in bidirectional LSTM models, finding that POS tagging is learned in lower layers than constituent parsing. Recently, both Lin et al. [98] and Clark et al. [34] used probing classifiers to investigate the information stored in BERT’s hidden representations across both layers and heads. Clark et al. [34] focused on attention, using a probe trained on attention weights in BERT to examine dependency information, while Lin et al. [98] focused on examining syntactic and positional information across layers. Hewitt and Manning [70] examined representations generated by ELMo [129] and BERT, training a small linear model to predict the distance between words in a parse tree of a given sentence. Liu et al. [102] proposed and examined 16 different probing tasks, involving tagging, segmentation, and pairwise relations, utilising a basic linear model. They compared results across several models, including BERT and ELMo, examining the performance of the models on each task across layers. Tenney et al. [165] trained two-layer MLP classifiers to predict labels for various NLP tasks (POS tagging, named entity labelling, semantic role labelling, etc.), using the representations generated by four different contextual encoder models. They found that the contextualised models improve more on syntactic tasks than semantic tasks when compared to non-contextual embeddings and found some evidence that ELMo does encode distant linguistic information. Klafka and Ettinger [86] investigated how much information about surrounding words can be found in contextualised word embeddings, training MLP classifiers to predict aspects of important words within the sentence, e.g., predicting the gender of a noun from an embedding associated with a verb in the same sentence.

3.3.3 Probe Considerations and Limitations.

The continued growth of probing-based papers has also led to recent work examining best practices for probes and how to interpret their results. Hewitt and Liang [69] considered how to ensure that a probe is genuinely reflective of the underlying information present in a model and proposed the use of a control task, a randomised version of a probe task in which high performance is only possible by memorisation of inputs. Hence, a faithful probe should perform well on a probe task and poorly on a corresponding control task if the underlying model does indeed contain the information being probed for. The authors found that most probes (including linear classifiers) are over-parameterised, and they discuss methods for constraining complex probes (e.g., multilayer perceptrons) to improve faithfulness while still allowing them to achieve similar results.
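To make the control-task idea concrete, the sketch below (our own illustration of the general recipe, not the exact implementation of Hewitt and Liang [69]) assigns every word type a fixed random label and reports selectivity as the gap between probe accuracy on the real task and on the control task; the feature and label arrays are assumed to come from an existing probing pipeline.

```python
# Sketch of a control task in the spirit of Hewitt and Liang [69]: every word type is
# mapped to a fixed random label, so a probe can only score well on the control task
# by memorising word identities rather than reading linguistic information.
import random
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def control_mapping(vocab, num_classes, seed=0):
    """Assign each word type a fixed random label."""
    rng = random.Random(seed)
    return {w: rng.randrange(num_classes) for w in sorted(vocab)}

def probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    return accuracy_score(test_labels, probe.predict(test_feats))

def selectivity(train_feats, train_words, train_labels,
                test_feats, test_words, test_labels, num_classes):
    """Probe accuracy on the real task minus accuracy on the control task."""
    real_acc = probe_accuracy(train_feats, train_labels, test_feats, test_labels)
    mapping = control_mapping(set(train_words) | set(test_words), num_classes)
    control_acc = probe_accuracy(train_feats, [mapping[w] for w in train_words],
                                 test_feats, [mapping[w] for w in test_words])
    # High selectivity: the probe succeeds on the real task without relying on memorisation.
    return real_acc - control_acc
```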

While most papers we have discussed above follow the intuition that complex probes should be avoided to prevent memorisation, Pimentel et al. [130] suggest that instead the probe with the best score on a given task should be chosen as the tightest estimate, since simpler models may simply be unable to extract the linguistic information present in a model, and such linguistic information cannot be “added” by more complex probes (since their only inputs are hidden representations). In addition, the authors argue that memorisation is an important part of linguistic competence, and as such probes should not be artificially punished (via control tasks) for doing this. Recent work has also presented methods that avoid making assumptions about probe complexity, such as MDL probing [104, 171], which directly measures the “amount of effort” needed to achieve some extraction task, or DirectProbe [199], which directly examines the intermediate representations of models to avoid having to deal with additional classifiers.

Finally, Hall Maudslay et al. [60] compared the structural probe [70] with a lightweight dependency parser (both given the same inputs) and demonstrated that the parser is generally able to extract more syntactic information from BERT embeddings, while the probe performs better under a different metric. This shows that the choice of metric is important for probes: When testing for evidence of linguistic information, one should consider not only the nature of the probe but also the metric used to evaluate it. Furthermore, the significance of well-performing probes is not clear: Models may encode linguistic information that is not actually used by the end-task [141], showing that the presence of linguistic information does not imply that it is being used for prediction. Later approaches that integrate causal methods, such as amnesic probing [50], which directly intervenes in the underlying model’s representations, may offer a way to distinguish between these cases.

3.3.4 Interpretability of Probes and Future Work.

As noted at the beginning of this section, probing is a tool for NLP researchers investigating models rather than for end-users. As such, its comprehensibility is relatively low: Understanding probing results requires understanding both the linguistic properties being probed for and the more complex experimental setups involved (as simple metrics such as task accuracy do not tell the whole story [69]). However, probes are naturally fairly faithful in that they directly use the model’s hidden states and are specifically designed to reflect only information present within these hidden states. This faithfulness is degraded somewhat by the fact that this information may not actually be used for predictions [141], though recent causal approaches work towards alleviating this. This also suggests that probing results can be considered trustworthy only when the experimental design is carefully controlled. Finally, probing methods are often reasonably stable for the same model and property, as the probe classifier is trained to convergence. However, results can differ quite drastically across models, even those with the same architecture but trained on different data; Reference [50], for example, reports differences between pre-trained and fine-tuned BERT models. This is more likely a function of the underlying models than of the technique itself, but it shows that probing results are specific to the models and properties being examined.

Overall, probes are exciting and valuable tools for investigating models’ “inner workings.” However, much like other explanation methods, the setup and evaluation of probing techniques must be carefully considered. Future work on probing may involve closer integration of causal methods [50], enabling stronger statements about what a model is and is not using for its predictions and thus allowing probes to explain model judgements rather than merely show what could potentially be used. Combining this with methods that further reduce the complexity of probing setups [199] may provide even simpler and better ways to gain insights into NLP models. Causal models have already been applied to traditional predictive tasks, and the convergence of causal inference and language processing has been surveyed in Reference [53]. Recent NLP works have also incorporated auxiliary causal approaches into their models [53, 68], and such approaches can be seen as a future trend for interpretable NLP, including probing. However, because the underlying assumptions of causal models differ from the associational nature of neural networks, a detailed discussion of causal approaches is beyond the scope of this survey; we simply note that they are a promising direction for the further development of probing.


4 EVALUATION METHODS

4.1 Evaluation of Feature Importance

4.1.1 Automatic Evaluation.

Evaluation of interpretable methods that extract important features usually focuses on explanation faithfulness, i.e., whether the extracted features are sufficient and accurate enough to result in the same label prediction as the original inputs. When datasets come with pre-annotated explanations, the extracted features used as the explanation can be compared with the ground-truth annotation through exact matching or soft matching. Exact matching only considers an explanation valid when it is exactly the same as the annotation, and this validity is quantified through the precision score. For example, the HotpotQA dataset provides annotations for supporting facts, allowing a model’s accuracy in reporting these supporting facts to be easily measured. This is commonly used for extracted rationales, where a higher precision score indicates a closer match with human-annotated explanations, likely indicating improved interpretability. In contrast, soft matching takes the extracted features as a valid explanation if some of the features (tokens/phrases in the case of NLP) match the annotation. For instance, DeYoung et al. [41] proposed token-level Intersection-Over-Union (IOU), taking the size of the token overlap between two spans divided by the size of their union and considering the extracted rationale a valid explanation if the IOU score is over 0.5.
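To illustrate the token-level soft-matching criterion, the short sketch below computes the IOU between a predicted rationale span and an annotated span and applies the 0.5 threshold; the span indices are invented for the example.

```python
# Token-level Intersection-Over-Union (IOU) between a predicted rationale span and a
# human-annotated span, following the thresholding idea of DeYoung et al. [41].
def span_iou(pred_span, gold_span):
    """IOU between two token-index spans given as (start, end), end exclusive."""
    pred, gold = set(range(*pred_span)), set(range(*gold_span))
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

iou = span_iou((3, 9), (4, 10))          # 5 overlapping tokens out of 7 in the union
print(f"IOU = {iou:.2f}, counted as valid: {iou > 0.5}")   # IOU = 0.71, counted as valid: True
```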

However, DeYoung et al. [41] also argued that matching between the identified features and the annotation only measures the plausibility of an interpretation, not its faithfulness. In other words, neither exact matching nor soft matching can reveal whether the model’s decisions truly depend on the identified contributing features. Therefore, erasure-based metrics have been proposed specifically to evaluate the impact of the identified important features on the model’s results. For example, Du et al. [46] proposed a faithfulness score to verify the importance of the identified contributing sentences or words to a given model’s outputs, under the assumption that the predicted probability for the target class will drop significantly if the truly important inputs are removed. The score is calculated as in Equation (1): \(\begin{equation} S_{Faithfulness} = \frac{1}{N}\sum _{i=1}^{N}\left(y_{x^{i}}-y_{x_{\setminus A}^{i}} \right), \end{equation}\) where \(y_{x^{i}}\) is the predicted probability for a given target class with the original input and \(y_{x_{\setminus A}^{i}}\) is the predicted probability for the target class when the significant sentences/words are removed from the input.

The Comprehensiveness score proposed later by DeYoung et al. [41] is calculated in the same way as the Faithfulness score [46]. Note that the Comprehensiveness score is not related to evaluating the comprehensibility of an interpretation; rather, it measures whether all the identified important features are needed to make the same prediction. A high score implies that the identified features are highly influential, while a negative score indicates that the model is more confident in its decision without the identified rationales. DeYoung et al. [41] also proposed a Sufficiency score, which calculates the probability difference from the model for the same class when only the identified significant features are kept as input. Thus, in contrast to the Comprehensiveness and Faithfulness scores, a lower Sufficiency score indicates higher faithfulness of the selected features.
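The sketch below shows how these erasure-based scores can be computed around any classifier that exposes a probability for the target class; `predict_proba`, the token list, and the rationale indices are placeholders for an existing model and pipeline, and averaging the comprehensiveness values over a dataset recovers the faithfulness score of Equation (1).

```python
# Erasure-based metrics in the style of DeYoung et al. [41] and Du et al. [46].
# `predict_proba(tokens)` is assumed to wrap an existing model and return the
# predicted probability of the target class for the given token sequence.
def comprehensiveness(predict_proba, tokens, rationale_idx):
    """Probability drop when the rationale tokens are removed (higher = more faithful)."""
    full = predict_proba(tokens)
    without_rationale = [t for i, t in enumerate(tokens) if i not in rationale_idx]
    return full - predict_proba(without_rationale)

def sufficiency(predict_proba, tokens, rationale_idx):
    """Probability drop when only the rationale tokens are kept (lower = more faithful)."""
    full = predict_proba(tokens)
    rationale_only = [t for i, t in enumerate(tokens) if i in rationale_idx]
    return full - predict_proba(rationale_only)

def faithfulness_score(predict_proba, examples):
    """Average probability drop over a dataset, as in Equation (1).

    `examples` is a list of (tokens, rationale_idx) pairs.
    """
    drops = [comprehensiveness(predict_proba, toks, idx) for toks, idx in examples]
    return sum(drops) / len(drops)
```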

Apart from using the above evaluation metrics, another direct way to evaluate the validity of the explanations for a model’s output is to examine the decrease in a model’s performance, measured by the task’s standard evaluation metrics, after removing or perturbing the identified important input features (i.e., words/phrases/sentences). For example, He et al. [65] measured the change in BLEU scores to examine whether certain input words were essential to the predictions in neural machine translation.

4.1.2 Human Evaluation.

Human evaluation is a common and straightforward, but relatively more subjective, method for evaluating the validity of explanations for a model. This can be done by researchers themselves or by a large number of crowdsourced participants (sourced from, e.g., Amazon Mechanical Turk). For example, Chen et al. [30] asked Amazon Mechanical Turk workers to predict the sentiment based on predicted keywords in a text, examining the faithfulness of the selected features as interpretation. Sha et al. [150] sampled 300 input-output-interpretation cases and asked human evaluators to examine whether the selected features are useful (help explain the output), complete (sufficient to explain the output), and fluent to read.

While faithfulness can be evaluated relatively easily via automatic metrics, the comprehensibility and trustworthiness of interpretations are usually evaluated through human evaluation in current research. Though using large numbers of participants helps reduce subjective bias, it comes at the cost of setting up larger-scale experiments, and it is hard to ensure that every participant understands the task and the evaluation criteria. Human evaluation results can undoubtedly provide some indication of interpretation validity and comprehensibility, but the suspicion of subjective bias cannot be fully removed, which also limits the reuse and fair comparison of human evaluation results in future works.

4.2 Evaluation of NLE

4.2.1 Automatic Evaluation.

As NLE involves generating text, the automatic evaluation metrics for NLE are generally the same metrics used in tasks with free-form text generation, such as machine translation or summarisation. As such, standard automated metrics for NLE are BLEU [124], METEOR [39], ROUGE [96], CIDEr [170], and SPICE [6], with all five generally being reported in VQA-based NLE papers. Perplexity is also occasionally reported [23, 99], keeping in line with other natural language generation–based works. However, these automated metrics must be used carefully, as recent work has found they often correlate poorly with human judgements of explanation quality. Clinciu et al. [35] suggest that model-based scores such as BLEURT and BERTScore correlate better with human judgements, and Hase et al. [64] point out that only examining how well the explanation output matches labels does not measure how well the explanations reflect the model’s behaviour.
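As a small illustration of how one of these metrics is typically applied, the snippet below scores a generated explanation against a human reference with sentence-level BLEU via NLTK; the example sentences are invented, and, as noted above, such n-gram scores should be complemented by human judgement.

```python
# Sentence-level BLEU between a generated explanation and a human-written reference.
# N-gram overlap metrics like BLEU correlate imperfectly with human judgements of
# explanation quality, so scores like this are best paired with human evaluation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the man is skiing because he is on a snowy slope with skis on".split()
generated = "the man is on a snowy slope and is wearing skis".split()

smoothing = SmoothingFunction().method1   # avoid zero scores when a higher-order n-gram is absent
bleu = sentence_bleu([reference], generated, smoothing_function=smoothing)
print(f"BLEU = {bleu:.3f}")
```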

Additionally, the quality of the annotated human explanations collected in datasets such as e-SNLI has also come into question. Carton et al. [26] find that human-created explanations across several datasets perform poorly on metrics such as sufficiency and comprehensiveness, suggesting they do not contain all that is needed to explain a given judgement. This suggests that improving our ability to compare generated explanations with human-generated ones may not be enough to properly measure the quality of a given generated explanation, and further work in improving the gold annotations provided by explanation datasets could also help.

4.2.2 Human Evaluation.

Given the limitations of current automatic evaluation methods and the free-form nature of NLE, human evaluation is always necessary to truly judge explanation quality. Such evaluation is most commonly done by getting crowdsourced workers to rate the generated explanations (either simply as correct/not correct or on a point scale), which allows easy comparison between models. In addition, Liu et al. [101] use crowdsourced workers to compare their model’s explanations against another model’s, with workers noting which explanation relates best to the final classification results. Considering that BLEU and similar metrics do not necessarily correlate well with human intuition, all work on NLE should include human evaluation results at some level, even if the evaluation is limited (e.g., just on a sample of generated explanations).

4.3 Evaluation of Probing

As probing tasks test for the presence of linguistic knowledge rather than provide explanations, the evaluation of probing differs according to the task. However, careful consideration should be given to the choice of metric: As Hall Maudslay et al. [60] showed, different evaluation metrics can result in different apparent performances for different methods, so the motivation behind a particular metric should be considered. Beyond metrics, Hewitt and Liang [69] suggested that the selectivity of probes should also be considered, where selectivity is defined as the difference between probe task accuracy and control task accuracy. While best practices for probes are still being actively discussed in the community [130], control tasks are undoubtedly helpful tools for further investigating and validating the behaviour of models uncovered by probes.


5 DISCUSSION AND CONCLUSION

This article focused on the local interpretable methods commonly used for natural language processing models. In this survey, we have divided these methods into three different categories based on their underlying characteristics: (1) explaining the model’s outputs from the input features, where these features could be identified through rationale extraction, perturbing inputs, traditional attribution methods, and attention weight extraction; (2) generating the natural language explanations corresponding to each input; (3) using diagnostic classifiers to analyse the hidden information stored within a model. For each method type, we have also outlined the standard datasets used for different NLP tasks and different evaluation methods for examining the validity and efficacy of the explanations provided.

By going through the current local interpretable methods in the field of NLP, we identified several limitations and research gaps that must be overcome to develop explanations that can stably and faithfully explain a model’s decisions and be easily understood and trusted by users. First, as stated in Section 1.1.1, there is currently no unified definition of interpretability across works on interpretable methods. While some researchers distinguish interpretability and explainability as two separate concepts [147] with different difficulty levels, many works use them as synonyms, and our work follows the latter convention to include diverse works. However, such an ambiguous definition of interpretability/explainability leads to inconsistent judgements of interpretation validity for the same interpretable method. For example, the debate between Wiegreffe and Pinter [181] and Jain and Wallace [79] about whether attention weights can be used as a valid interpretation/explanation stems from conflicting definitions: The argument of Jain and Wallace [79] is based on the position that only faithful interpretable methods are truly interpretable, while Wiegreffe and Pinter [181] argued that attention is an explanation if we accept, as proposed by Reference [147], that an explanation should be plausible but not necessarily faithful. Thus, a unified and clear definition of interpretability, broadly acknowledged and agreed upon, is needed to help further develop valid interpretable methods.

Second, we need effective evaluation methods that can evaluate the multiple dimensions of interpretability and whose results are reliable enough to serve as baselines for future comparison. However, the existing evaluation metrics measure only limited dimensions of interpretability. Taking the evaluation of rationales as an example, examining the match between the extracted rationales and the human rationales only evaluates plausibility, not faithfulness [41]. Moreover, when it comes to faithfulness evaluation metrics [10, 33, 41, 149], results on the same dataset can be contradictory when different metrics are used; for example, the DFFOT [149] and SUFF [41] metrics reach opposite conclusions about the LIME method on the same dataset [28]. In addition, current automatic evaluation approaches mainly focus on the faithfulness and comprehensibility of interpretations and can hardly be applied to other dimensions, such as stability and trustworthiness; the evaluation of these other dimensions relies heavily on human evaluation. Though human evaluation is currently the best approach for evaluating generated interpretations from various aspects, it can be subjective and less reproducible. It is also essential to have evaluation methods that can assess the validity of interpretations in different formats. For example, the evaluation of faithful NLE relies on BLEU scores to check the similarity of generated explanations with the ground-truth explanations; however, such evaluation neglects that natural language explanations whose content differs from the ground-truth explanations can also be faithful and plausible for the same input-output pair. To sum up, there is still a considerable research gap in developing effective evaluation methods and frameworks that verify interpretable methods along multiple dimensions, and such development would also require explainable datasets with good-quality annotations. The evaluation framework should provide fair results that can be reused and compared by future works, and should be user-centric, taking into account the needs of different groups of users [83].


6 FUTURE TREND OF INTERPRETABILITY

Future development of interpretable methods must address the current limitations. Developing truly faithful interpretable methods that can precisely explain a model’s decisions is critical to enable the broad application of deep neural networks in crucial fields, including medicine, justice, and finance. Faithful interpretable methods and easily understandable interpretations are key to earning users’ trust in a model’s decisions, especially for users without deep learning knowledge, who will naturally question decisions made by an unfamiliar technique. Providing faithful, comprehensible, and stable interpretations of a model helps resolve the questions and uncertainties any user may have about relying on a black-box model.

However, apart from the discussed limitations of current interpretable methods, one existing problem is that evaluating whether an interpretation is faithful mainly considers interpretations of the model’s correct predictions. In other words, most existing interpretable works only explain why an instance is correctly predicted but do not give any explanation of why an instance is wrongly predicted. If the explanations of a model’s correct predictions precisely reflect the model’s decision-making process, then the interpretable method is usually regarded as faithful. However, it is also important to generate explanations for wrong predictions, to investigate which parts of the input instances the model attended to when it made the wrong decision and whether those parts reflect the model’s flawed decision-making process. Yet the interpretation and explanation of a model’s wrong predictions are not considered in existing interpretable works. Some works even directly consider the interpretations generated by their interpretable models for wrong predictions to be invalid and incorrect [84, 114, 125, 183] and therefore exclude them from the measurement of interpretability faithfulness. This seems reasonable while current works are still struggling to develop interpretable methods that can at least faithfully explain a model’s correct predictions. However, the interpretation of a model’s decisions should not be applied to only one side but to both correct and wrong prediction results.

This also prompts the reflection that the fundamental reason to develop model interpretability is not only to provide evidence/support/explanation for a correct prediction so users can trust the model’s correct decisions, but also to give them valuable guidance about why the model makes a wrong prediction. Comprehensive interpretations of a model’s decisions should provide faithful explanations for both its correct and incorrect predictions. Such two-sided interpretations are key to developing genuine trustworthiness for black-box models and to broadening and stabilising their application in the fields that require them. Moreover, understanding the reasons for wrong predictions is also essential for deep learning researchers to adjust and improve their models in future work.

Therefore, future work on interpretability should fill this research gap and develop interpretable models that can generate faithful and comprehensible interpretations for both correct and incorrect decisions made by the model, providing reliable information that improves the trust of non-experts in using deep neural networks in crucial fields and helps experts understand and improve their models more accurately.

Footnotes

  1. https://www.thedoctors.com/articles/the-algorithm-will-see-you-now-how-ais-healthcare-potential-outweighs-its-risk/
  2. Note that several local interpretation methods, such as counterfactuals and example-based approaches, have not been included in this article, since only a few initial NLP research works have been conducted with these approaches.
  3. For example, Liu et al. [101], Stadelmaier and Padó [159], Stahlberg et al. [160], Wang et al. [175] primarily use explainability or explainable, while Camburu et al. [23], Ribeiro et al. [143], Serrano and Smith [149], Tutek and Šnajder [167] primarily use interpretable or interpretability.
  4. This is stated as “the model assumption” in Jacovi and Goldberg [75].
  5. A control task is a variant of the probe task that utilises random outputs to ensure that high scores on the task are only possible through “memorisation” by the probe.

REFERENCES

  1. [1] Adadi Amina and Berrada Mohammed. 2018. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access 6 (2018), 5213852160.Google ScholarGoogle ScholarCross RefCross Ref
  2. [2] Adi Yossi, Kermany Einat, Belinkov Yonatan, Lavi Ofer, and Goldberg Yoav. 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207 (2016).Google ScholarGoogle Scholar
  3. [3] Aggarwal Shourya, Mandowara Divyanshu, Agrawal Vishwajeet, Khandelwal Dinesh, Singla Parag, and Garg Dinesh. 2021. Explanations for commonsenseQA: New dataset and models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 30503065. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  4. [4] Alhindi Tariq, Petridis Savvas, and Muresan Smaranda. 2018. Where is your evidence: Improving fact-checking by justification modeling. In Proceedings of the 1st Workshop on Fact Extraction and VERification (FEVER’18). Association for Computational Linguistics, 8590. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  5. [5] Alvarez-Melis David and Jaakkola Tommi. 2017. A causal framework for explaining the predictions of black-box sequence-to-sequence models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 412421.Google ScholarGoogle ScholarCross RefCross Ref
  6. [6] Anderson Peter, Fernando Basura, Johnson Mark, and Gould Stephen. 2016. Spice: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision. Springer, 382398.Google ScholarGoogle ScholarCross RefCross Ref
  7. [7] Anderson Peter, He Xiaodong, Buehler Chris, Teney Damien, Johnson Mark, Gould Stephen, and Zhang Lei. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 60776086.Google ScholarGoogle ScholarCross RefCross Ref
  8. [8] Antol Stanislaw, Agrawal Aishwarya, Lu Jiasen, Mitchell Margaret, Batra Dhruv, Zitnick C. Lawrence, and Parikh Devi. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 24252433.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. [9] Arras Leila, Horn Franziska, Montavon Grégoire, Müller Klaus-Robert, and Samek Wojciech. 2017. “What is relevant in a text document?”: An interpretable machine learning approach. PloS One 12, 8 (2017), e0181142.Google ScholarGoogle ScholarCross RefCross Ref
  10. [10] Vijay Arya, Rachel K. E. Bellamy, Pin-Yu Chen, Amit Dhurandhar, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Q. Vera Liao, Ronny Luss, Aleksandra Mojsilović, Sami Mourad, Pablo Pedemonte, Ramya Raghavendra, John Richards, Prasanna Sattigeri, Karthikeyan Shanmugam, Moninderr Singh, Kush R. Varshney, Dennis Wei, and Yunfeng Zhang. 2019. One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques. arXiv preprint arXiv:1909.03012 (2019).Google ScholarGoogle Scholar
  11. [11] Atkinson David, Srinivasan Kumar Bhargav, and Tan Chenhao. 2019. What gets echoed? Understanding the “pointers” in explanations of persuasive arguments. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP’19). Association for Computational Linguistics, 29112921. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  12. [12] Ayyubi Hammad A., Tanjim Md, McAuley Julian J., Cottrell Garrison W., et al. 2020. Generating rationales in visual question answering. arXiv preprint arXiv:2004.02032 (2020).Google ScholarGoogle Scholar
  13. [13] Bach Sebastian, Binder Alexander, Montavon Grégoire, Klauschen Frederick, Müller Klaus-Robert, and Samek Wojciech. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One 10, 7 (2015).Google ScholarGoogle ScholarCross RefCross Ref
  14. [14] Bahdanau Dzmitry, Cho Kyunghyun, and Bengio Yoshua. 2014. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.Google ScholarGoogle Scholar
  15. [15] Bai Bing, Liang Jian, Zhang Guanhua, Li Hao, Bai Kun, and Wang Fei. 2021. Why attentions may not be interpretable? In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2534.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. [16] Basaj Dominika, Rychalska Barbara, Biecek Przemyslaw, and Wróblewska Anna. 2018. How much should you ask? On the question structure in QA systems. In Proceedings of the BlackboxNLP@EMNLP Conference.Google ScholarGoogle Scholar
  17. [17] Bastings Joost, Aziz Wilker, and Titov Ivan. 2019. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 29632977.Google ScholarGoogle ScholarCross RefCross Ref
  18. [18] Belinkov Yonatan, Durrani Nadir, Dalvi Fahim, Sajjad Hassan, and Glass James. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 861872. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  19. [19] Belinkov Yonatan and Glass James. 2019. Analysis methods in neural language processing: A survey. Trans. Assoc. Computat. Ling. 7 (2019), 4972.Google ScholarGoogle ScholarCross RefCross Ref
  20. [20] Belinkov Yonatan, Màrquez Lluís, Sajjad Hassan, Durrani Nadir, Dalvi Fahim, and Glass James. 2017. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the 8th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Asian Federation of Natural Language Processing, 110. Retrieved from DOI: DOI: https://www.aclweb.org/anthology/I17-1001Google ScholarGoogle Scholar
  21. [21] Bowman Samuel R., Angeli Gabor, Potts Christopher, and Manning Christopher D.. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’15).Google ScholarGoogle ScholarCross RefCross Ref
  22. [22] Brahman Faeze, Shwartz Vered, Rudinger Rachel, and Choi Yejin. 2021. Learning to rationalize for nonmonotonic reasoning with distant supervision. Proc. AAAI Conf. Artif. Intell. 35, 14 (May 2021), 1259212601. Retrieved from DOI: DOI: https://ojs.aaai.org/index.php/AAAI/article/view/17492Google ScholarGoogle Scholar
  23. [23] Camburu Oana-Maria, Rocktäschel Tim, Lukasiewicz Thomas, and Blunsom Phil. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems 31, Bengio S., Wallach H., Larochelle H., Grauman K., Cesa-Bianchi N., and Garnett R. (Eds.). Curran Associates, Inc., 95399549. Retrieved from DOI: DOI: http://papers.nips.cc/paper/8163-e-snli-natural-language-inference-with-natural-language-explanations.pdfGoogle ScholarGoogle Scholar
  24. [24] Camburu Oana-Maria, Shillingford Brendan, Minervini Pasquale, Lukasiewicz Thomas, and Blunsom Phil. 2020. Make up your mind! Adversarial generation of inconsistent natural language explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 41574165. Retrieved from DOI: DOI: https://www.aclweb.org/anthology/2020.acl-main.382Google ScholarGoogle ScholarCross RefCross Ref
  25. [25] Cao Feiqi, Luo Siwen, Nunez Felipe, Wen Zean, Poon Josiah, and Han Soyeon Caren. 2023. SceneGate: Scene-graph based co-attention networks for text visual question answering. Robotics 12, 4 (2023), 114.Google ScholarGoogle ScholarCross RefCross Ref
  26. [26] Carton Samuel, Rathore Anirudh, and Tan Chenhao. 2020. Evaluating and characterizing human rationales. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). Association for Computational Linguistics, 92949307. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  27. [27] Chakraborty S., Tomsett R., Raghavendra R., Harborne D., Alzantot M., Cerutti F., Srivastava M., Preece A., Julier S., Rao R. M., Kelley T. D., Braines D., Sensoy M., Willis C. J., and Gurram P.. 2017. Interpretability of deep learning models: A survey of results. In Proceedings of the IEEE SmartWorld, Ubiquitous Intelligence Computing, Advanced Trusted Computed, Scalable Computing Communications, Cloud Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI’17). IEEE, 16.Google ScholarGoogle ScholarCross RefCross Ref
  28. [28] Chan Chun Sik, Kong Huanqi, and Guanqing Liang. 2022. A comparative study of faithfulness metrics for model interpretability methods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 50295038.Google ScholarGoogle ScholarCross RefCross Ref
  29. [29] Chang Shiyu, Zhang Yang, Yu Mo, and Jaakkola Tommi. 2019. A game theoretic approach to class-wise selective rationalization. In Proceedings of the Advances in Neural Information Processing Systems Conference. 1005510065.Google ScholarGoogle Scholar
  30. [30] Chen Jianbo, Song Le, Wainwright Martin, and Jordan Michael. 2018. Learning to explain: An information-theoretic perspective on model interpretation. In Proceedings of the International Conference on Machine Learning. PMLR, 883892.Google ScholarGoogle Scholar
  31. [31] Chen Qianglong, Ji Feng, Zeng Xiangji, Li Feng-Lin, Zhang Ji, Chen Haiqing, and Zhang Yin. 2021. KACE: Generating knowledge aware contrastive explanations for natural language inference. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 25162527.Google ScholarGoogle ScholarCross RefCross Ref
  32. [32] Chen Yen-Chun, Li Linjie, Yu Licheng, Kholy Ahmed El, Ahmed Faisal, Gan Zhe, Cheng Yu, and Liu Jingjing. 2020. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision (ECCV’20).Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. [33] Chrysostomou George and Aletras Nikolaos. 2021. Improving the faithfulness of attention-based explanations with task-specific information for text classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 477488.Google ScholarGoogle ScholarCross RefCross Ref
  34. [34] Clark Kevin, Khandelwal Urvashi, Levy Omer, and Manning Christopher D.. 2019. What does BERT look at? An analysis of BERT’s attention. In Proceedings of the ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 276286. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  35. [35] Clinciu Miruna-Adriana, Eshghi Arash, and Hastie Helen. 2021. A study of automatic metrics for the evaluation of natural language explanations. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, 23762387. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  36. [36] Conneau Alexis, Kruszewski German, Lample Guillaume, Barrault Loïc, and Baroni Marco. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 21262136. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  37. [37] Da Jeff, Forbes Maxwell, Zellers Rowan, Zheng Anthony, Hwang Jena D., Bosselut Antoine, and Choi Yejin. 2021. Edited media understanding frames: Reasoning about the intent and implications of visual misinformation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 20262039. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  38. [38] Dalvi Fahim, Durrani Nadir, Sajjad Hassan, Belinkov Yonatan, Bau Anthony, and Glass James. 2019. What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 63096317.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. [39] Denkowski Michael and Lavie Alon. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL Workshop on Statistical Machine Translation.Google ScholarGoogle ScholarCross RefCross Ref
  40. [40] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 41714186. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  41. [41] DeYoung Jay, Jain Sarthak, Rajani Nazneen Fatema, Lehman Eric, Xiong Caiming, Socher Richard, and Wallace Byron C.. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 44434458.Google ScholarGoogle ScholarCross RefCross Ref
  42. [42] Ding Yanzhuo, Liu Yang, Luan Huanbo, and Sun Maosong. 2017. Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 11501159.Google ScholarGoogle ScholarCross RefCross Ref
  43. [43] Ding Yihao, Luo Siwen, Chung Hyunsuk, and Han Soyeon Caren. 2023. VQA: A new dataset for real-world VQA on PDF documents. arXiv preprint arXiv:2304.06447 (2023).Google ScholarGoogle Scholar
  44. [44] Doshi-Velez Finale and Kim Been. 2018. Considerations for Evaluation and Generalization in Interpretable Machine Learning. Springer International Publishing, Cham, 317. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  45. [45] Du Mengnan, Liu Ninghao, Yang Fan, and Hu Xia. 2019. Learning credible deep neural networks with rationale regularization. In Proceedings of the IEEE International Conference on Data Mining (ICDM’19). 150159.Google ScholarGoogle Scholar
  46. [46] Du Mengnan, Liu Ninghao, Yang Fan, Ji Shuiwang, and Hu Xia. 2019. On attribution of recurrent neural network predictions via additive decomposition. In Proceedings of the World Wide Web Conference. 383393.Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. [47] Ebrahimi Javid, Rao Anyi, Lowd Daniel, and Dou Dejing. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 3136.Google ScholarGoogle ScholarCross RefCross Ref
  48. [48] Ehsan Upol, Harrison Brent, Chan Larry, and Riedl Mark O.. 2018. Rationalization: A neural machine translation approach to generating natural language explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 8187.Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. [49] Ehsan Upol, Tambwekar Pradyumna, Chan Larry, Harrison Brent, and Riedl Mark O.. 2019. Automated rationale generation: A technique for explainable AI and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI’19). Association for Computing Machinery, New York, NY, 263274. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. [50] Elazar Yanai, Ravfogel Shauli, Jacovi Alon, and Goldberg Yoav. 2021. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Trans. Assoc. Computat. Ling. 9 (03 2021), 160175. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  51. [51] Erliksson Karl Fredrik, Arpteg Anders, Matskin Mihhail, and Payberah Amir H.. 2021. Cross-domain transfer of generative explanations using text-to-text models. In Natural Language Processing and Information Systems, Métais Elisabeth, Meziane Farid, Horacek Helmut, and Kapetanios Epaminondas (Eds.). Springer International Publishing, Cham, 7689.Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. [52] Ettinger Allyson, Elgohary Ahmed, and Resnik Philip. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP. Association for Computational Linguistics, 134139. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  53. [53] Feder Amir, Keith Katherine A., Manzoor Emaad, Pryzant Reid, Sridhar Dhanya, Wood-Doughty Zach, Eisenstein Jacob, Grimmer Justin, Reichart Roi, Roberts Margaret E., Brandon M. Stewart, Victor Veitch, and Diyi Yang. 2022. Causal inference in natural language processing: Estimation, Prediction, Interpretation and Beyond. Trans. Assoc. Computat. Ling. 10 (2022), 11381158.Google ScholarGoogle ScholarCross RefCross Ref
  54. [54] Feng Shi, Wallace Eric, II Alvin Grissom, Iyyer Mohit, Rodriguez Pedro, and Boyd-Graber Jordan. 2018. Pathologies of neural models make interpretations difficult. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 37193728.Google ScholarGoogle ScholarCross RefCross Ref
  55. [55] Giulianelli Mario, Harding Jack, Mohnert Florian, Hupkes Dieuwke, and Zuidema Willem. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 240248. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  56. [56] Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua. 2014. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems Conference. 26722680.Google ScholarGoogle Scholar
[57] Goyal Yash, Khot Tejas, Summers-Stay Douglas, Batra Dhruv, and Parikh Devi. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6904–6913.
[58] Guidotti Riccardo, Monreale Anna, Ruggieri Salvatore, Turini Franco, Giannotti Fosca, and Pedreschi Dino. 2018. A survey of methods for explaining black box models. Comput. Surv. 51, 5 (2018).
[59] Gupta Abhijeet, Boleda Gemma, Baroni Marco, and Padó Sebastian. 2015. Distributional vectors encode referential attributes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 12–21.
[60] Maudslay Rowan Hall, Valvoda Josef, Pimentel Tiago, Williams Adina, and Cotterell Ryan. 2020. A tale of a probe and a parser. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 7389–7395.
[61] Han Caren, Long Siqu, Luo Siwen, Wang Kunze, and Poon Josiah. 2020. VICTR: Visual information captured text representation for text-to-vision multimodal tasks. In Proceedings of the 28th International Conference on Computational Linguistics, Scott Donia, Bel Nuria, and Zong Chengqing (Eds.). International Committee on Computational Linguistics, 3107–3117.
[62] Hancock Braden, Varma Paroma, Wang Stephanie, Bringmann Martin, Liang Percy, and Ré Christopher. 2018. Training classifiers with natural language explanations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 1884–1895.
[63] Hase Peter and Bansal Mohit. 2022. When can models learn from explanations? A formal framework for understanding the roles of explanation data. In Proceedings of the 1st Workshop on Learning with Natural Language Supervision. Association for Computational Linguistics, 29–39.
[64] Hase Peter, Zhang Shiyue, Xie Harry, and Bansal Mohit. 2020. Leakage-adjusted simulatability: Can models generate non-trivial explanations of their behavior in natural language? In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP. Association for Computational Linguistics, 4351–4367.
[65] He Shilin, Tu Zhaopeng, Wang Xing, Wang Longyue, Lyu Michael, and Shi Shuming. 2019. Towards understanding neural machine translation with word importance. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP'19). 952–961.
[66] Hendricks Lisa Anne, Akata Zeynep, Rohrbach Marcus, Donahue Jeff, Schiele Bernt, and Darrell Trevor. 2016. Generating visual explanations. In Proceedings of the European Conference on Computer Vision. Springer, 3–19.
[67] Hendricks Lisa Anne, Hu Ronghang, Darrell Trevor, and Akata Zeynep. 2018. Generating counterfactual explanations with natural language. In Proceedings of the ICML Workshop on Human Interpretability in Machine Learning. 95–98.
[68] Heskes Tom, Sijben Evi, Bucur Ioan Gabriel, and Claassen Tom. 2020. Causal Shapley values: Exploiting causal knowledge to explain individual predictions of complex models. Adv. Neural Inf. Process. Syst. 33 (2020), 4778–4789.
[69] Hewitt John and Liang Percy. 2019. Designing and interpreting probes with control tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP'19). Association for Computational Linguistics, 2733–2743.
[70] Hewitt John and Manning Christopher D. 2019. A structural probe for finding syntax in word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 4129–4138.
[71] Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Computat. 9, 8 (1997), 1735–1780.
[72] Park Dong Huk, Hendricks Lisa Anne, Akata Zeynep, Rohrbach Anna, Schiele Bernt, Darrell Trevor, and Rohrbach Marcus. 2018. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8779–8788.
[73] Hupkes Dieuwke, Veldhoen Sara, and Zuidema Willem. 2018. Visualisation and "diagnostic classifiers" reveal how recurrent and recursive neural networks process hierarchical structure. J. Artif. Intell. Res. 61 (2018), 907–926.
[74] Inoue Naoya, Trivedi Harsh, Sinha Steven, Balasubramanian Niranjan, and Inui Kentaro. 2021. Summarize-then-answer: Generating concise explanations for multi-hop reading comprehension. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 6064–6080.
[75] Jacovi Alon and Goldberg Yoav. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 4198–4205. Retrieved from https://www.aclweb.org/anthology/2020.acl-main.386
[76] Jacovi Alon and Goldberg Yoav. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4198–4205.
[77] Jacovi Alon, Marasović Ana, Miller Tim, and Goldberg Yoav. 2021. Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in AI. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT'21). Association for Computing Machinery, New York, NY, 624–635.
[78] Jain Sarthak and Wallace Byron C. 2019. Attention is not explanation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 3543–3556.
[79] Jain Sarthak and Wallace Byron C. 2019. Attention is not explanation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 3543–3556.
[80] Jang Myeongjun and Lukasiewicz Thomas. 2021. Are training resources insufficient? Predict first then explain! CoRR abs/2110.02056 (2021).
[81] Jansen Peter, Balasubramanian Niranjan, Surdeanu Mihai, and Clark Peter. 2016. What's in an explanation? Characterizing knowledge and inference requirements for elementary science exams. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, 2956–2965. Retrieved from https://aclanthology.org/C16-1278
[82] Jumelet Jaap and Hupkes Dieuwke. 2018. Do language models understand anything? On the ability of LSTMs to understand negative polarity items. In Proceedings of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 222–231.
[83] Kaur Harmanpreet, Nori Harsha, Jenkins Samuel, Caruana Rich, Wallach Hanna, and Vaughan Jennifer Wortman. 2020. Interpreting interpretability: Understanding data scientists' use of interpretability tools for machine learning. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–14.
[84] Kayser Maxime, Camburu Oana-Maria, Salewski Leonard, Emde Cornelius, Do Virginie, Akata Zeynep, and Lukasiewicz Thomas. 2021. E-ViL: A dataset and benchmark for natural language explanations in vision-language tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV'21). 1244–1254.
[85] Kim Jinkyu, Rohrbach Anna, Darrell Trevor, Canny John, and Akata Zeynep. 2018. Textual explanations for self-driving vehicles. In Proceedings of the European Conference on Computer Vision (ECCV'18). 563–578.
[86] Klafka Josef and Ettinger Allyson. 2020. Spying on your neighbors: Fine-grained probing of contextual embeddings for information about surrounding words. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 4801–4811.
[87] Köhn Arne. 2015. What's in an embedding? Analyzing word embeddings through multilingual evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2067–2073.
[88] Kotonya Neema and Toni Francesca. 2020. Explainable automated fact-checking for public health claims. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'20). Association for Computational Linguistics, 7740–7754.
[89] Kumar Sawan and Talukdar Partha. 2020. NILE: Natural language inference with faithful natural language explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 8730–8742. Retrieved from https://www.aclweb.org/anthology/2020.acl-main.771
[90] Kumaraswamy Ponnambalam. 1980. A generalized probability density function for double-bounded random processes. J. Hydrol. 46, 1-2 (1980), 79–88.
[91] Lei Jie, Yu Licheng, Berg Tamara, and Bansal Mohit. 2020. What is more likely to happen next? Video-and-language future event prediction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'20). Association for Computational Linguistics, 8769–8784.
[92] Lei Tao, Barzilay Regina, and Jaakkola Tommi. 2016. Rationalizing neural predictions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 107–117.
[93] Li Lei, Zhang Yongfeng, and Chen Li. 2021. Personalized transformer for explainable recommendation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 4947–4957.
[94] Li Qing, Tao Qingyi, Joty Shafiq, Cai Jianfei, and Luo Jiebo. 2018. VQA-E: Explaining, elaborating, and enhancing your answers for visual questions. In Proceedings of the European Conference on Computer Vision (ECCV'18).
[95] Li Zichao, Sharma Prakhar, Lu Xing Han, Cheung Jackie, and Reddy Siva. 2022. Using interactive feedback to improve the accuracy and explainability of question answering systems post-deployment. In Proceedings of the Findings of the Association for Computational Linguistics (ACL'22). Association for Computational Linguistics, 926–937.
[96] Lin Chin-Yew. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. Association for Computational Linguistics, 74–81. Retrieved from https://www.aclweb.org/anthology/W04-1013
[97] Lin Tsung-Yi, Maire Michael, Belongie Serge, Hays James, Perona Pietro, Ramanan Deva, Dollár Piotr, and Zitnick C. Lawrence. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision. Springer, 740–755.
[98] Lin Yongjie, Tan Yi Chern, and Frank Robert. 2019. Open Sesame: Getting inside BERT's linguistic knowledge. In Proceedings of the ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 241–253.
[99] Ling Wang, Yogatama Dani, Dyer Chris, and Blunsom Phil. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 158–167.
[100] Lipton Zachary C. 2018. The mythos of model interpretability. Commun. ACM 61, 10 (2018), 35–43. arXiv:1606.03490
[101] Liu Hui, Yin Qingyu, and Wang William Yang. 2019. Towards explainable NLP: A generative explanation framework for text classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 5570–5581.
[102] Liu Nelson F., Gardner Matt, Belinkov Yonatan, Peters Matthew E., and Smith Noah A. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 1073–1094.
[103] Louizos Christos, Welling Max, and Kingma Diederik P. 2018. Learning sparse neural networks through L_0 regularization. In Proceedings of the International Conference on Learning Representations.
[104] Lovering Charles, Jha Rohan, Linzen Tal, and Pavlick Ellie. 2021. Predicting inductive biases of pre-trained models. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=mNtmhaDkAr
[105] Lu Jiasen, Yang Jianwei, Batra Dhruv, and Parikh Devi. 2016. Hierarchical question-image co-attention for visual question answering. In Proceedings of the Advances in Neural Information Processing Systems Conference. 289–297.
[106] Lundberg Scott M. and Lee Su-In. 2017. A unified approach to interpreting model predictions. In Proceedings of the Advances in Neural Information Processing Systems Conference. 4765–4774.
[107] Luo Ling, Ao Xiang, Pan Feiyang, Wang Jin, Zhao Tong, Yu Ningzi, and He Qing. 2018. Beyond polarity: Interpretable financial sentiment analysis with hierarchical query-driven attention. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'18). 4244–4250.
[108] Luo Siwen, Han Soyeon Caren, Sun Kaiyuan, and Poon Josiah. 2020. REXUP: I reason, I extract, I update with structured compositional reasoning for visual question answering. In Proceedings of the International Conference on Neural Information Processing. Springer, 520–532.
[109] Luong Minh-Thang, Pham Hieu, and Manning Christopher D. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1412–1421.
[110] Maas Andrew, Daly Raymond E., Pham Peter T., Huang Dan, Ng Andrew Y., and Potts Christopher. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 142–150.
[111] Madumal Prashan, Miller Tim, Vetere Frank, and Sonenberg Liz. 2018. Towards a grounded dialog model for explainable artificial intelligence. In Proceedings of the IJCAI Workshop on Socio-Cognitive Systems. arXiv:1806.08055. Retrieved from http://arxiv.org/abs/1806.08055
[112] Mao Qianren, Li Jianxin, Wang Senzhang, Zhang Yuanning, Peng Hao, He Min, and Wang Lihong. 2019. Aspect-based sentiment classification with attentive neural turing machines. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'19). 5139–5145.
[113] Marasovic Ana, Beltagy Iz, Downey Doug, and Peters Matthew. 2022. Few-shot self-rationalization with natural language prompts. In Proceedings of the Findings of the Association for Computational Linguistics (NAACL'22). Association for Computational Linguistics, 410–424.
[114] Marasović Ana, Bhagavatula Chandra, Park Jae sung, Bras Ronan Le, Smith Noah A., and Choi Yejin. 2020. Natural language rationales with full-stack visual reasoning: From pixels to semantic frames to commonsense graphs. In Proceedings of the Findings of the Association for Computational Linguistics (EMNLP'20). Association for Computational Linguistics, 2810–2829.
[115] McAuley Julian, Leskovec Jure, and Jurafsky Dan. 2012. Learning attitudes and attributes from multi-aspect reviews. In Proceedings of the IEEE 12th International Conference on Data Mining. IEEE, 1020–1025.
[116] Mikolov Tomas, Sutskever Ilya, Chen Kai, Corrado Greg S., and Dean Jeff. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, Burges C. J. C., Bottou L., Welling M., Ghahramani Z., and Weinberger K. Q. (Eds.). Curran Associates, Inc., 3111–3119. Retrieved from http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
[117] Mirzaalian Hengameh, Hussein Mohamed E., Spinoulas Leonidas, May Jonathan, and Abd-Almageed Wael. 2021. Explaining face presentation attack detection using natural language. In Proceedings of the 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG'21). 1–8.
[118] Molnar Christoph. 2019. Interpretable Machine Learning. Retrieved from https://christophm.github.io/interpretable-ml-book/
[119] Mudrakarta Pramod Kaushik, Taly Ankur, Sundararajan Mukund, and Dhamdhere Kedar. 2018. Did the model understand the question? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1896–1906.
[120] Mueller John Paul and Massaron Luca. 2019. Deep Learning for Dummies. John Wiley & Sons.
[121] Murdoch W. James, Singh Chandan, Kumbier Karl, Abbasi-Asl Reza, and Yu Bin. 2019. Definitions, methods, and applications in interpretable machine learning. Proc. Nat'l Acad. Sci. 116, 44 (2019), 22071–22080.
[122] Narang Sharan, Raffel Colin, Lee Katherine, Roberts Adam, Fiedel Noah, and Malkan Karishma. 2020. WT5?! Training text-to-text models to explain their predictions. arXiv:2004.14546 [cs.CL]
[123] Ni Jianmo, Li Jiacheng, and McAuley Julian. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP'19). Association for Computational Linguistics, 188–197.
[124] Papineni Kishore, Roukos Salim, Ward Todd, and Zhu Wei-Jing. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 311–318.
[125] Park Dong Huk, Hendricks Lisa Anne, Akata Zeynep, Rohrbach Anna, Schiele Bernt, Darrell Trevor, and Rohrbach Marcus. 2018. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8779–8788.
[126] Pennington Jeffrey, Socher Richard, and Manning Christopher. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'14). Association for Computational Linguistics, 1532–1543.
[127] Peters Matthew, Neumann Mark, Zettlemoyer Luke, and Yih Wen-tau. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1499–1509.
[128] Peters Matthew E., Neumann Mark, Iyyer Mohit, Gardner Matt, Clark Christopher, Lee Kenton, and Zettlemoyer Luke. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2227–2237.
[129] Peters Matthew E., Neumann Mark, Iyyer Mohit, Gardner Matt, Clark Christopher, Lee Kenton, and Zettlemoyer Luke. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT'18). 2227–2237.
[130] Pimentel Tiago, Valvoda Josef, Maudslay Rowan Hall, Zmigrod Ran, Williams Adina, and Cotterell Ryan. 2020. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 4609–4622. Retrieved from https://www.aclweb.org/anthology/2020.acl-main.420
[131] Prasad Archiki and Jyothi Preethi. 2020. How accents confound: Probing for accent information in end-to-end speech recognition systems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 3739–3753.
[132] Prasad Grusha, Nie Yixin, Bansal Mohit, Jia Robin, Kiela Douwe, and Williams Adina. 2021. To what extent do human explanations of model behavior align with actual model behavior? In Proceedings of the 4th BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 1–14.
[133] Radford Alec and Narasimhan Karthik. 2018. Improving language understanding by generative pre-training. Preprint at http://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
[134] Radford Alec, Wu Jeff, Child Rewon, Luan David, Amodei Dario, and Sutskever Ilya. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.
[135] Raffel Colin, Shazeer Noam, Roberts Adam, Lee Katherine, Narang Sharan, Matena Michael, Zhou Yanqi, Li Wei, and Liu Peter J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140 (2020), 1–67. Retrieved from http://jmlr.org/papers/v21/20-074.html
[136] Raganato Alessandro and Tiedemann Jörg. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 287–297.
[137] Rajani Nazneen Fatema, McCann Bryan, Xiong Caiming, and Socher Richard. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 4932–4942.
[138] Rajani Nazneen Fatema, Zhang Rui, Tan Yi Chern, Zheng Stephan, Weiss Jeremy, Vyas Aadit, Gupta Abhijit, Xiong Caiming, Socher Richard, and Radev Dragomir. 2020. ESPRIT: Explaining solutions to physical reasoning tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 7906–7917.
[139] Rajpurkar Pranav, Jia Robin, and Liang Percy. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 784–789.
[140] Rajpurkar Pranav, Zhang Jian, Lopyrev Konstantin, and Liang Percy. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2383–2392.
[141] Ravichander Abhilasha, Belinkov Yonatan, and Hovy Eduard. 2021. Probing the probing paradigm: Does probing accuracy entail task relevance? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, 3363–3377.
[142] Rebanal Juan, Combitsis Jordan, Tang Yuqi, and Chen Xiang "Anthony". 2021. XAlgo: A design probe of explaining algorithms' internal states via question-answering. In Proceedings of the International Conference on Intelligent User Interfaces (IUI'21). Association for Computing Machinery, New York, NY, 329–339.
[143] Ribeiro Marco Tulio, Singh Sameer, and Guestrin Carlos. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[144] Ribeiro Marco Tulio, Singh Sameer, and Guestrin Carlos. 2018. Anchors: High-precision model-agnostic explanations. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[145] Ribeiro Marco Tulio, Wu Tongshuang, Guestrin Carlos, and Singh Sameer. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4902–4912.
[146] Rubinstein Dana, Levi Effi, Schwartz Roy, and Rappoport Ari. 2015. How well do distributional models capture different types of semantic knowledge? In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, 726–730.
[147] Rudin Cynthia. 2018. Please stop explaining black box models for high stakes decisions. Stat 1050 (2018), 26.
[148] Sanh Victor, Webson Albert, Raffel Colin, Bach Stephen, Sutawika Lintang, Alyafeai Zaid, Chaffin Antoine, Stiegler Arnaud, Raja Arun, Dey Manan, Bari M. Saiful, Xu Canwen, Thakker Urmish, Sharma Shanya Sharma, Szczechla Eliza, Kim Taewoon, Chhablani Gunjan, Nayak Nihal, Datta Debajyoti, Chang Jonathan, Jiang Mike Tian-Jian, Wang Han, Manica Matteo, Shen Sheng, Yong Zheng Xin, Pandey Harshit, Bawden Rachel, Wang Thomas, Neeraj Trishala, Rozen Jos, Sharma Abheesht, Santilli Andrea, Fevry Thibault, Fries Jason Alan, Teehan Ryan, Scao Teven Le, Biderman Stella, Gao Leo, Wolf Thomas, and Rush Alexander M. 2022. Multitask prompted training enables zero-shot task generalization. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=9Vrb9D0WI4
[149] Serrano Sofia and Smith Noah A. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2931–2951.
[150] Sha Lei, Camburu Oana-Maria, and Lukasiewicz Thomas. 2021. Learning from the best: Rationalizing predictions by adversarial information calibration. In Proceedings of the AAAI Conference on Artificial Intelligence. 13771–13779.
[151] Shen Ying, Deng Yang, Yang Min, Li Yaliang, Du Nan, Fan Wei, and Lei Kai. 2018. Knowledge-aware attentive neural network for ranking question answer pairs. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 901–904.
[152] Shi Xing, Padhi Inkit, and Knight Kevin. 2016. Does string-based neural MT learn source syntax? In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1526–1534.
[153] Shrikumar Avanti, Greenside Peyton, and Kundaje Anshul. 2017. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning. JMLR.org, 3145–3153.
[154] Slack Dylan, Hilgard Sophie, Jia Emily, Singh Sameer, and Lakkaraju Himabindu. 2020. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 180–186.
[155] Sommerauer Pia and Fokkens Antske. 2018. Firearms and tigers are dangerous, kitchen knives and zebras are not: Testing whether word embeddings can tell. In Proceedings of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 276–286.
[156] Sorodoc Ionut-Teodor, Gulordava Kristina, and Boleda Gemma. 2020. Probing for referential information in language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 4177–4189.
[157] Springenberg J., Dosovitskiy Alexey, Brox Thomas, and Riedmiller M. 2015. Striving for simplicity: The all convolutional net. In Proceedings of the International Conference on Learning Representations (Workshop Track).
[158] Srivastava Shashank, Labutov Igor, and Mitchell Tom. 2017. Joint concept learning and semantic parsing from natural language explanations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1527–1536.
[159] Stadelmaier Josua and Padó Sebastian. 2019. Modeling paths for explainable knowledge base completion. In Proceedings of the ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 147–157.
[160] Stahlberg Felix, Saunders Danielle, and Byrne Bill. 2018. An operation sequence model for explainable neural machine translation. In Proceedings of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 175–186.
[161] Stammbach Dominik and Ash Elliott. 2020. e-FEVER: Explanations and summaries for automated fact checking. In Proceedings of the Truth and Trust Online Conference (TTO'20).
[162] Su Weijie, Zhu Xizhou, Cao Yue, Li Bin, Lu Lewei, Wei Furu, and Dai Jifeng. 2020. VL-BERT: Pre-training of generic visual-linguistic representations. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=SygXPaEYvH
[163] Sundararajan Mukund, Taly Ankur, and Yan Qiqi. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning. JMLR.org, 3319–3328.
[164] Sydorova Alona, Poerner Nina, and Roth Benjamin. 2019. Interpretable question answering on knowledge bases and text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4943–4951.
[165] Tenney Ian, Xia Patrick, Chen Berlin, Wang Alex, Poliak Adam, McCoy R. Thomas, Kim Najoung, Durme Benjamin Van, Bowman Samuel R., Das Dipanjan, and Pavlick Ellie. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=SJzSgnRcKX
[166] Tu Ming, Huang Kevin, Wang Guangtao, Huang Jing, He Xiaodong, and Zhou Bowen. 2020. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In Proceedings of the AAAI Conference on Artificial Intelligence. 9073–9080.
[167] Tutek Martin and Šnajder Jan. 2018. Iterative recursive attention model for interpretable sequence classification. In Proceedings of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 249–257.
[168] Hassan Muneeb ul, Mulhem Philippe, Pellerin Denis, and Quénot Georges. 2019. Explaining visual classification using attributes. In Proceedings of the International Conference on Content-Based Multimedia Indexing (CBMI'19). 1–6.
[169] Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems Conference. 5998–6008.
[170] Vedantam Ramakrishna, Zitnick C. Lawrence, and Parikh Devi. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566–4575.
[171] Voita Elena and Titov Ivan. 2020. Information-theoretic probing with minimum description length. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'20). Association for Computational Linguistics, 183–196.
[172] Wang Cunxiang, Liang Shuailong, Zhang Yue, Li Xiaonan, and Gao Tian. 2019. Does it make sense? And why? A pilot study for sense making and explanation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 4020–4026.
[173] Wang Jingjing, Li Jie, Li Shoushan, Kang Yangyang, Zhang Min, Si Luo, and Zhou Guodong. 2018. Aspect sentiment classification with both word-level and clause-level attention networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'18). 4439–4445.
[174] Wang Ziqi, Qin Yujia, Zhou Wenxuan, Yan Jun, Ye Qinyuan, Neves Leonardo, Liu Zhiyuan, and Ren Xiang. 2020. Learning from explanations with neural execution tree. In Proceedings of the International Conference on Learning Representations.
[175] Wang Zhiguo, Zhang Yue, Yu Mo, Zhang Wei, Pan Lin, Song Linfeng, Xu Kun, and El-Kurdi Yousef. 2019. Multi-granular text encoding for self-explaining categorization. In Proceedings of the ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 41–45.
[176] Wei Jason, Bosma Maarten, Zhao Vincent, Guu Kelvin, Yu Adams Wei, Lester Brian, Du Nan, Dai Andrew M., and Le Quoc V. 2022. Finetuned language models are zero-shot learners. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=gEZrGCozdqR
[177] Wickramanayake Sandareka, Hsu Wynne, and Lee Mong Li. 2019. FLEX: Faithful linguistic explanations for neural net based model decisions. Proc. AAAI Conf. Artif. Intell. 33, 01 (July 2019), 2539–2546.
[178] Wiegreffe Sarah and Marasovic Ana. 2021. Teach me to explain: A review of datasets for explainable natural language processing. In Proceedings of the 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). Retrieved from https://openreview.net/forum?id=ogNcxJn32BZ
[179] Wiegreffe Sarah, Marasović Ana, and Smith Noah A. 2021. Measuring association between labels and free-text rationales. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 10266–10284.
[180] Wiegreffe Sarah and Pinter Yuval. 2019. Attention is not not explanation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP'19). Association for Computational Linguistics, 11–20.
[181] Wiegreffe Sarah and Pinter Yuval. 2019. Attention is not not explanation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP'19). 11–20.
[182] Williams Ronald J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 3-4 (1992), 229–256.
[183] Wu Jialin and Mooney Raymond. 2019. Faithful multimodal explanation for visual question answering. In Proceedings of the ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 103–112.
[184] Wu Tongshuang, Ribeiro Marco Tulio, Heer Jeffrey, and Weld Daniel S. 2021. Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 6707–6723.
[185] Xu Kelvin, Ba Jimmy, Kiros Ryan, Cho Kyunghyun, Courville Aaron, Salakhudinov Ruslan, Zemel Rich, and Bengio Yoshua. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. 2048–2057.
[186] Yang Zichao, He Xiaodong, Gao Jianfeng, Deng Li, and Smola Alex. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 21–29.
[187] Yang Zhilin, Qi Peng, Zhang Saizheng, Bengio Yoshua, Cohen William, Salakhutdinov Ruslan, and Manning Christopher D. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2369–2380.
[188] Yordanov Yordan, Kocijan Vid, Lukasiewicz Thomas, and Camburu Oana-Maria. 2021. Few-shot out-of-domain transfer learning of natural language explanations. In Proceedings of the Workshop on Deep Generative Models and Downstream Applications (NeurIPS'21). Retrieved from https://openreview.net/forum?id=g9PUonwGk2M
[189] Yu Bin. 2013. Stability. Bernoulli 19, 4 (2013), 1484–1500.
[190] Yu Mo, Chang Shiyu, Zhang Yang, and Jaakkola Tommi. 2019. Rethinking cooperative rationalization: Introspective extraction and complement control. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP'19). 4085–4094.
[191] Yu Zhou, Yu Jun, Cui Yuhao, Tao Dacheng, and Tian Qi. 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6281–6290.
[192] Zeiler Matthew D., Krishnan Dilip, Taylor Graham W., and Fergus Rob. 2010. Deconvolutional networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2528–2535.
[193] Zellers Rowan, Bisk Yonatan, Farhadi Ali, and Choi Yejin. 2019. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'19).
[194] Zellers Rowan, Lu Ximing, Hessel Jack, Yu Youngjae, Park Jae Sung, Cao Jize, Farhadi Ali, and Choi Yejin. 2021. MERLOT: Multimodal neural script knowledge models. In Proceedings of the Advances in Neural Information Processing Systems Conference.
[195] Zhang Hongming, Zhao Xinran, and Song Yangqiu. 2020. WinoWhy: A deep diagnosis of essential commonsense knowledge for answering Winograd schema challenge. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 5736–5745.
[196] Zhang Kelly and Bowman Samuel. 2018. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In Proceedings of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 359–361.
[197] Zhao Xinyan and Vydiswaran V. G. Vinod. 2021. LIRex: Augmenting language inference with relevant explanations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 14532–14539.
[198] Zhou Wangchunshu, Hu Jinyi, Zhang Hanlin, Liang Xiaodan, Sun Maosong, Xiong Chenyan, and Tang Jian. 2020. Towards interpretable natural language understanding with explanations as latent variables. In Advances in Neural Information Processing Systems, Larochelle H., Ranzato M., Hadsell R., Balcan M. F., and Lin H. (Eds.), Vol. 33. Curran Associates, Inc., 6803–6814. Retrieved from https://proceedings.neurips.cc/paper/2020/file/4be2c8f27b8a420492f2d44463933eb6-Paper.pdf
[199] Zhou Yichu and Srikumar Vivek. 2021. DirectProbe: Studying representations without classifiers. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 5070–5083.
[200] Zhou Yangqiaoyu and Tan Chenhao. 2021. Investigating the effect of natural language explanations on out-of-distribution generalization in few-shot NLI. In Proceedings of the 2nd Workshop on Insights from Negative Results in NLP. Association for Computational Linguistics, 117–124.
