Tamp-X: Attacking explainable natural language classifiers through tampered activations

Ali, Hassan; Khan, Muhammad Suleman; Al-Fuqaha, Ala; Qadir, Junaid

Author	Ali, Hassan
Author	Khan, Muhammad Suleman
Author	Al-Fuqaha, Ala
Author	Qadir, Junaid
Date available	2023-07-13T05:40:52Z
Publication date	2022
Publication name	Computers and Security
Source	Scopus
ISSN	0167-4048
URI	http://dx.doi.org/10.1016/j.cose.2022.102791
URI	http://hdl.handle.net/10576/45573
Abstract	Although Deep Neural Networks (DNNs) have been instrumental in achieving state-of-the-art results on various Natural Language Processing (NLP) tasks, recent work has shown that the decisions made by DNNs cannot always be trusted. Explainable Artificial Intelligence (XAI) methods have recently been proposed as a means of increasing the reliability and trustworthiness of DNNs. These XAI methods are, however, open to attack and can be manipulated in both white-box (gradient-based) and black-box (perturbation-based) scenarios. Exploring novel techniques to attack and robustify these XAI methods is crucial to fully understanding these vulnerabilities. In this work, we propose Tamp-X, a novel attack that tampers with the activations of robust NLP classifiers, forcing state-of-the-art white-box and black-box XAI methods to generate misrepresented explanations. To the best of our knowledge, we are the first in the current NLP literature to attack both white-box and black-box XAI methods simultaneously. We quantify the reliability of explanations using three metrics: the descriptive accuracy, the cosine similarity, and the Lp norms of the explanation vectors. Through extensive experimentation, we show that the explanations generated for the tampered classifiers are unreliable and disagree significantly with those generated for the untampered classifiers, even though the output decisions of the tampered and untampered classifiers are almost always the same. Additionally, we study the adversarial robustness of the tampered NLP classifiers and find that the tampered classifiers that are harder for XAI methods to explain are also harder for adversarial attackers to attack. © 2022 The Author(s)
Sponsor	This publication was made possible by NPRP grant # [13S-0206-200273] from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors. Open Access funding is provided by the Qatar National Library. This document is the result of the research project funded by the Qatar National Research Fund (a member of Qatar Foundation).
Language	en
Publisher	Elsevier
Subject	Adversarial attacks; Attacking XAI; Explainable artificial intelligence (XAI); Model tampering; Natural language processing
Title	Tamp-X: Attacking explainable natural language classifiers through tampered activations
Type	Article
Volume number	120
dc.accessType	Open Access


Files in this record

Name: 1-s2.0-S0167404822001857-main.pdf
Size: 3.771Mb
Format: PDF


This record appears in the following collections

Computer Science and Engineering [2428 items]
