"Evaluating Saliency Map Explanations for Convolutional Neural Networks: A User Study"
Ahmed Alqaraawi, Martin Schuessler, Philipp Weiß, Enrico Costanza, Nadia Berthouze
Accepted at IUI 2020
In recent decades, machine learning has grown in popularity because it can outperform humans at specific tasks. Its use cases range from predicting customer behaviour to powering self-driving cars. It is therefore imperative that such systems be accountable, so that users, including those who are not machine learning experts, can judge when to trust a system's predictions.
The problem is that modern machine learning systems are commonly described as "black boxes". A black box is a method for which we can observe only the inputs and outputs, not the internal workings. A machine learning system may work very well, yet understanding how it arrives at its results is very difficult, even for experts, especially in the case of deep neural networks. Experts know how neural networks learn, but not what a specific network has learned.
In this study, the researchers focus on convolutional neural networks (CNNs), a type of deep neural network that is very popular for image classification. The most popular approach to making a CNN explainable is the saliency map, which highlights the pixels that matter most to the image classification algorithm. Saliency maps are claimed to facilitate interpretation by both beginners and experts. Several studies have addressed how to generate saliency maps from CNNs, but research evaluating them with actual users is still minimal. To address this gap, the paper reports on an online user study designed to evaluate the performance of saliency maps generated by a state-of-the-art algorithm: layer-wise relevance propagation (LRP) [1].
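To make the idea of relevance propagation more concrete, here is a minimal sketch of the LRP epsilon rule on a tiny two-layer ReLU network without biases. It is only an illustration: the network, the weights, the input, and the helper names `forward` and `lrp_layer` are all made up for this example and are not the model or code used in the paper.

```python
# Toy LRP (epsilon rule) on a tiny two-layer ReLU network without biases.
# All weights and inputs are made-up values, not taken from the paper.

def forward(x, w1, w2):
    """Return the hidden ReLU activations and the scalar output score."""
    hidden = [
        max(0.0, sum(x[i] * w1[i][j] for i in range(len(x))))
        for j in range(len(w1[0]))
    ]
    score = sum(hidden[j] * w2[j][0] for j in range(len(hidden)))
    return hidden, score


def lrp_layer(activations, weights, relevance_out, eps=1e-9):
    """Redistribute the relevance of a layer's outputs onto its inputs."""
    n_in, n_out = len(weights), len(weights[0])
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # z is the total contribution flowing into output unit j.
        z = sum(activations[i] * weights[i][j] for i in range(n_in))
        z += eps if z >= 0 else -eps  # small stabiliser to avoid division by zero
        for i in range(n_in):
            share = activations[i] * weights[i][j] / z
            relevance_in[i] += share * relevance_out[j]
    return relevance_in


x = [1.0, 2.0, 0.5]                            # three hypothetical "pixels"
w1 = [[0.5, -1.0], [0.25, 0.5], [-1.0, 1.0]]   # input -> hidden weights
w2 = [[1.0], [-0.5]]                           # hidden -> output weights

hidden, score = forward(x, w1, w2)
relevance_hidden = lrp_layer(hidden, w2, [score])
relevance_input = lrp_layer(x, w1, relevance_hidden)
print(score, relevance_input)
```

The per-input relevances add up (approximately) to the output score, which is the conservation property that lets the values be drawn as a heat map over the input pixels.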
In this section, I discuss several studies relevant to the authors' research. First, Bach et al. [1] proposed LRP, a general solution for understanding classification decisions through pixel-wise decomposition of non-linear classifiers. The authors chose LRP to generate the saliency maps in their study. Second, Yin et al. [2] investigated whether an ML model's accuracy affects laypeople's willingness to trust the model, via a sequence of large-scale, randomized, pre-registered human-subject experiments. Their work investigates users' understanding of and trust in a model's performance on a hold-out set and how that maps to its post-deployment performance. Third, Cai et al. [3] conducted a user study evaluating two kinds of example-based explanations for a sketch-recognition algorithm: normative and comparative explanations. However, they did not evaluate saliency maps. Fourth, Ribeiro et al. [4] proposed LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction. That work also evaluated saliency maps for text-based classifiers but did not report statistical significance tests. In addition, it is unclear whether the results would apply to more complex scenarios such as multi-class classification with a CNN. To address this research gap, Alqaraawi et al. (the authors of the paper reviewed here) conducted an online user study designed to evaluate the performance of saliency maps generated by a state-of-the-art algorithm: layer-wise relevance propagation (LRP).
To evaluate whether saliency maps help users understand how a CNN arrives at its results, the authors designed a between-group online study using a CNN model for multi-label image classification. A multi-label setting was chosen because a saliency map can highlight both the parts of the image corresponding to one label and features that fit alternative labels. The authors used an existing model from the Keras library with the VGG16 architecture, pre-trained on the ImageNet dataset, and fine-tuned it on the Pascal VOC 2012 dataset. On the hold-out test set, Pascal VOC 2007, the model reached an average precision of 0.74; using an imperfect model was intentional, because the authors wanted to see whether user understanding improves with respect to both the weaknesses and the strengths of the model. To generate the saliency maps, the authors used Bach et al.'s LRP algorithm. In a pilot study, they compared two algorithms (LRP and LIME), asked the pilot participants which was more informative, and chose LRP on that basis. In preprocessing, they determined a threshold for translating a classification score (between 0 and 1) into an output: "detected" if the score is above the threshold and "missed" if it is below. They calculated the threshold for each class (e.g. horse, cat) by maximizing the F1-score for that class on the training set.
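The per-class threshold selection described above can be sketched as follows. This is a minimal illustration under my own assumptions: the scores and labels are invented toy data, candidate thresholds are searched on a 0.01 grid, and the helper names `f1_score` and `best_threshold` are hypothetical rather than taken from the paper.

```python
# Pick, for each class, the score threshold that maximizes F1 on training data.
# Scores and labels below are invented toy values, not data from the paper.

def f1_score(labels, predictions):
    """F1 for binary labels/predictions; returns 0.0 when there are no true positives."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(scores, labels):
    """Scan a 0.01 grid; a score counts as 'detected' if it is above the threshold."""
    best_t, best_f1 = None, -1.0
    for step in range(100):
        t = step / 100
        predictions = [s > t for s in scores]
        f1 = f1_score(labels, predictions)
        if f1 > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, f1
    return best_t, best_f1


training_data = {
    "horse": ([0.91, 0.80, 0.35, 0.62, 0.20, 0.55], [1, 1, 0, 1, 0, 0]),
    "cat": ([0.70, 0.40, 0.45, 0.10], [1, 1, 0, 0]),
}
thresholds = {
    name: best_threshold(scores, labels)
    for name, (scores, labels) in training_data.items()
}
print(thresholds)
```

Each class gets its own threshold, so a score of, say, 0.5 can mean "detected" for one class and "missed" for another.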
Figure 1: The interface: Examples are presented in the blue box at the top. The task is shown in the green box at the bottom. All participants worked on the same tasks and were shown the same examples. Conditions differed only in terms of the additional information that was presented alongside each example. Here, saliency maps and scores are shown.
The user study involved two tasks. First, participants listed 2-3 features they believed the system was sensitive to and 2-3 features they believed it ignored (see Figure 1). Second, they predicted whether the system would recognize an object of interest ('cat' or 'horse') in the given task image. To reduce fatigue and keep the experiment under 40 minutes, the authors used only 14 task images and limited the study to two classes: cats and horses. The study evaluates two independent variables: the presence of saliency maps and the presence of classification scores. Each variable has two levels: shown or omitted. The authors recruited 64 participants (16 per condition) through Prolific, an online crowdsourcing platform. For data quality, participants were required to have normal or corrected-to-normal vision and to be fluent in English. Participants also had to be over 18 years of age and have a technical background (i.e. a degree in computing or engineering).
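The 2x2 between-group design above (two factors, each shown or omitted, 64 participants, 16 per condition) can be written down as a short sketch. The shuffle-then-round-robin assignment is my own assumption for illustration; the summary does not say how participants were actually allocated to conditions, and the names used here are hypothetical.

```python
import random
from collections import Counter
from itertools import product

# Two factors, each with two levels (omitted / shown).
FACTORS = {"saliency_map": (False, True), "classification_score": (False, True)}
CONDITIONS = [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]


def assign_conditions(participant_ids, seed=0):
    """Shuffle participants, then deal them round-robin into the conditions.

    This gives equal group sizes whenever the number of participants is
    divisible by the number of conditions. Assumed procedure, not the paper's.
    """
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)
    return {pid: CONDITIONS[i % len(CONDITIONS)] for i, pid in enumerate(ids)}


assignment = assign_conditions(range(64))
group_sizes = Counter(
    (c["saliency_map"], c["classification_score"]) for c in assignment.values()
)
print(group_sizes)
```

Because each participant sees only one of the four conditions, differences between groups can be attributed to the presence or absence of the saliency map and the classification score.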
4. Result of the Research Paper
They evaluated the effects of saliency maps and classification scores based on the percentage of correct predictions per participant. When saliency maps were shown, participants predicted the CNN's output more accurately (60.7%, although still relatively low) than when the maps were not displayed (55.1%). The effect was tested with a two-way independent ANOVA. In contrast, the presence of classification scores had no significant effect, and there was no interaction between the presence of saliency maps and the presence of classification scores (see Figure 2). Participants were also asked to rate their confidence in each prediction on a 1-4 Likert scale. Analysed with a Kruskal-Wallis test, the ratings showed that participants tended to feel "slightly confident" in their predictions, with a median value of 3.
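For readers unfamiliar with the two-way ANOVA mentioned above, the following sketch computes the F statistics for a balanced 2x2 design by hand. The per-participant accuracies are invented toy numbers (4 per cell instead of the study's 16), the function name `two_way_anova_f` is hypothetical, and p-values are left out because they would additionally require the F distribution.

```python
from statistics import mean

def two_way_anova_f(cells):
    """F statistics for a balanced two-factor design.

    cells maps (level_a, level_b) -> list of scores; all lists have equal length.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    assert all(len(v) == n for v in cells.values()), "design must be balanced"
    values = [x for v in cells.values() for x in v]
    grand = mean(values)

    def pooled_mean(keys):
        return mean([x for k in keys for x in cells[k]])

    # Sums of squares for the two main effects, the interaction, and the error.
    ss_a = n * len(levels_b) * sum(
        (pooled_mean([(a, b) for b in levels_b]) - grand) ** 2 for a in levels_a)
    ss_b = n * len(levels_a) * sum(
        (pooled_mean([(a, b) for a in levels_a]) - grand) ** 2 for b in levels_b)
    ss_cells = n * sum((mean(v) - grand) ** 2 for v in cells.values())
    ss_ab = ss_cells - ss_a - ss_b
    ss_error = sum((x - mean(v)) ** 2 for v in cells.values() for x in v)
    ss_total = sum((x - grand) ** 2 for x in values)

    df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
    df_error = len(values) - len(levels_a) * len(levels_b)
    ms_error = ss_error / df_error
    return {
        "F_a": (ss_a / df_a) / ms_error,
        "F_b": (ss_b / df_b) / ms_error,
        "F_interaction": (ss_ab / (df_a * df_b)) / ms_error,
        "ss_parts": ss_a + ss_b + ss_ab + ss_error,
        "ss_total": ss_total,
    }


# Keys are (saliency map shown, classification score shown); values are toy accuracies.
toy_accuracy = {
    (False, False): [0.50, 0.57, 0.50, 0.64],
    (False, True): [0.57, 0.50, 0.64, 0.57],
    (True, False): [0.64, 0.57, 0.64, 0.57],
    (True, True): [0.57, 0.64, 0.64, 0.57],
}
result = two_way_anova_f(toy_accuracy)
print(result)
```

Here `F_a` corresponds to the saliency map factor, `F_b` to the classification score factor, and `F_interaction` to their interaction, mirroring the three effects reported in the paper.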
To complement the prediction accuracy results, the authors also analysed the features that participants considered relevant to the classification results. Since this feedback was free text, the authors coded the responses inductively according to the features or concepts mentioned. They divided the features into two groups: saliency features and general attributes. The proportion of saliency features mentioned was much higher when saliency maps were displayed than when they were not (83.5% vs 54.6%). The presence of classification scores, meanwhile, had no significant effect.
Saliency maps significantly improved participants' prediction accuracy, but it remained relatively low (only 60.7%). Looking at performance in more detail, the authors found that participants were better at predicting cases where the system was correct (true positives) and had difficulty predicting errors (false positives: 46.9%, false negatives: 36.7%). A possible reason is that participants overestimated the system's performance: most predicted that the system would be correct in cases where it actually failed to classify the image. This matters because users need to be aware of and understand when a system fails. Instance-level explanations have so far been claimed to help in detecting errors, so this issue needs to be evaluated empirically in future work.
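The breakdown by outcome type can be illustrated with a small sketch. The responses below are invented, and the mapping in `SYSTEM_DETECTS` reflects my reading of the task: the system reports "detected" for true and false positives and "missed" for false negatives, and a participant is correct when their forecast matches what the system actually output.

```python
from collections import defaultdict

# What the system outputs for each outcome type (assumed reading of the task).
SYSTEM_DETECTS = {"TP": True, "FP": True, "FN": False}


def accuracy_by_outcome(responses):
    """responses: iterable of (outcome_type, participant_predicted_detected)."""
    correct, total = defaultdict(int), defaultdict(int)
    for outcome, predicted_detected in responses:
        total[outcome] += 1
        correct[outcome] += int(predicted_detected == SYSTEM_DETECTS[outcome])
    return {outcome: correct[outcome] / total[outcome] for outcome in total}


# Invented toy responses, not data from the study.
toy_responses = [
    ("TP", True), ("TP", True), ("TP", False), ("TP", True),
    ("FP", False), ("FP", True),
    ("FN", True), ("FN", True), ("FN", False),
]
per_outcome = accuracy_by_outcome(toy_responses)
print(per_outcome)
```

A participant who assumes the system is always right will answer "missed" on false positives and "detected" on false negatives, which is exactly the pattern that drags accuracy down on the error cases.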
In addition, a CNN looks for patterns in a sub-symbolic fashion rather than processing data "semantically" as humans do. Further research is needed to develop explanation algorithms that bridge this gap between humans and the system, for example by steering users away from reasoning about image classification only in terms of high-level semantics.
Attending to saliency features alone does not give a good understanding of the CNN model, because general attributes can also affect the classification results. The saliency map leads participants to consider only the highlighted features and to miss other general attributes. The authors suggest complementing the saliency map with a more global representation of the image, such as a measure of contrast or overall brightness.
There are several limitations in this paper, such as:
1. The study used a small number of classes to save time and minimize participant fatigue. Future work should carry out long-term evaluations (i.e. lasting several days or weeks) to allow participants to explore large data sets with multiple classes in more depth.
2. It used one specific network architecture (VGG16) and one specific technique to generate saliency maps (LRP).
3. The study design does not allow the authors to draw conclusions about user performance per outcome type (e.g. TP, FN, FP), because the tasks were not fully counterbalanced and true negatives (TN) were not part of the task set. Future research should address this limitation and study this aspect in more detail.
4. Participants were required to have a technical background, but the authors did not control for machine learning expertise. There is potential to repeat the study with different participant populations, such as machine learning experts or lay users.
5. The selected methodology does not directly answer the research question; instead, it checks how well the user can guess the classifier's answer. Why do the authors check how many saliency features participants pick with and without a saliency map?
6. Regarding the frequency of individual features mentioned by participants, is there a reason to divide the features into two groups and draw such a strong distinction between them? What is the idea behind that?
7. In my opinion, it may be better to let participants highlight features directly on the task image (e.g. marking features they consider relevant to the classification with circles or rectangles) than to use text-based feedback.
7. Reference
[1] S. Bach, A. Binder, G. Montavon, F. Klauschen, K. R. Müller, and W. Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PLoS One, vol. 10, no. 7, pp. 1–46, 2015, DOI: 10.1371/journal.pone.0130140.
[2] M. Yin, J. W. Vaughan, and H. Wallach, "Understanding the effect of accuracy on trust in machine learning models," Conf. Hum. Factors Comput. Syst. - Proc., pp. 1–12, 2019, DOI: 10.1145/3290605.3300509.
[3] C. J. Cai, J. Jongejan, and J. Holbrook, "The effects of example-based explanations in a machine learning interface," Int. Conf. Intell. User Interfaces, Proc. IUI, vol. Part F147615, pp. 258–262, 2019, DOI: 10.1145/3301275.3302289.
[4] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why should I trust you?': Explaining the predictions of any classifier," Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., vol. 13-17-August-2016, pp. 1135–1144, 2016, DOI: 10.1145/2939672.2939778.
The writing for this paper is good and well-organized. However, the summary could be shorter, covering only the main contributions and key findings of the paper. If you want to show more details, you can show the system diagram along with an explanation. The limitation, discussion, and future work part is good. I think this paper is more for human-computer interaction. I would recommend finding papers from CVPR, ICCV, ECCV, BMVC, WACV, NeurIPS, ACM MM, ICML, etc.
Thank you for your feedback, Professor.
BTW, are you interested in explainable AI? If so, you can check the technical details of how LIME, Grad-CAM, and LRP work and think about whether this can be useful for your evolutionary architecture search method based on a genetic algorithm.
Yes, I will, Prof. Thank you.
https://ai.ntu.edu.tw/mlss2021/wp-content/uploads/2021/08/0813_BeenKim.pdf