Simple item record

Author: Hassan, Ali
Author: Khan, Muhammad Suleman
Author: AlGhadhban, Amer
Author: Alazmi, Meshari
Author: Alzamil, Ahmed
Author: Al-utaibi, Khaled
Author: Qadir, Junaid
Date available: 2025-07-07T04:21:47Z
Publication date: 2023-09-30
Publication name: Computers & Security
Identifier: http://dx.doi.org/10.1016/j.cose.2023.103367
Citation: Ali, H., Khan, M. S., AlGhadhban, A., Alazmi, M., Alzamil, A., Al-Utaibi, K., & Qadir, J. (2023). Con-detect: Detecting adversarially perturbed natural language inputs to deep classifiers through holistic analysis. Computers & Security, 132, 103367.
ISSN: 0167-4048
URI: https://www.sciencedirect.com/science/article/pii/S0167404823002778
URI: http://hdl.handle.net/10576/65987
Abstract: Deep Learning (DL) algorithms have achieved remarkable results in many Natural Language Processing (NLP) tasks such as language-to-language translation, spam filtering, fake-news detection, and reading comprehension. However, research has shown that the adversarial vulnerabilities of deep learning networks also manifest when DL is used for NLP tasks. Most mitigation techniques proposed to date are supervised, relying on adversarial retraining to improve robustness, which is impractical. This work introduces a novel, unsupervised methodology for detecting adversarial inputs to NLP classifiers. In summary, we note that minimally perturbing an input to change a model's output, a major strength of adversarial attacks, is also a weakness that leaves unique statistical marks reflected in the cumulative contribution scores of the input. In particular, we show that the cumulative contribution score, called the CF-score, of adversarial inputs is generally greater than that of clean inputs. We thus propose Con-Detect, a Contribution-based Detection method, for detecting adversarial attacks against NLP classifiers. Con-Detect can be deployed with any classifier without having to retrain it. We experiment with multiple attackers (Text-bugger, Text-fooler, PWWS) on several architectures (MLP, CNN, LSTM, hybrid CNN-RNN, BERT) trained for different classification tasks (IMDB sentiment classification, fake-news classification, AG News topic classification) under different threat models (Con-Detect-blind, Con-Detect-aware, and Con-Detect-adaptive attacks), and show that Con-Detect can reduce the attack success rate (ASR) of different attacks from 100% to as low as 0% in the best cases and to ≈70% in the worst case. Even in the worst case, we observe a 100% increase in the required number of queries and a 50% increase in the number of words perturbed, suggesting that Con-Detect is hard to evade.
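The detection idea summarized in the abstract (adversarial edits leave a statistical mark: the model's decision comes to hinge on a few high-influence tokens, inflating the cumulative contribution score) can be illustrated with a small sketch. The code below is a minimal illustration under stated assumptions, not the paper's implementation: per-token contribution is approximated by leave-one-out occlusion, the CF-score is proxied by the top-k share of total contribution mass, and toy_classifier, top_k, and threshold are hypothetical placeholders.

```python
import math
from typing import Callable, List

Classifier = Callable[[List[str]], float]


def toy_classifier(tokens: List[str]) -> float:
    """Stand-in for a trained sentiment classifier: returns P(positive).
    A real deployment would use the deployed model's predicted probability."""
    positive = {"good", "great", "excellent"}
    negative = {"bad", "awful", "terrible"}
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def contribution_scores(tokens: List[str], clf: Classifier) -> List[float]:
    """Leave-one-out contribution of each token: how much the predicted
    probability moves when that token is removed."""
    base = clf(tokens)
    return [abs(base - clf(tokens[:i] + tokens[i + 1:])) for i in range(len(tokens))]


def cf_score(tokens: List[str], clf: Classifier, top_k: int = 1) -> float:
    """Fraction of the total contribution mass carried by the top-k tokens
    (an illustrative proxy for the paper's cumulative contribution score)."""
    scores = sorted(contribution_scores(tokens, clf), reverse=True)
    total = sum(scores)
    return sum(scores[:top_k]) / total if total > 0 else 0.0


def is_adversarial(text: str, clf: Classifier, threshold: float = 0.9) -> bool:
    """Flag inputs whose decision hinges on very few tokens."""
    return cf_score(text.lower().split(), clf) > threshold


if __name__ == "__main__":
    clean = "good food great service excellent value overall"
    attacked = "g00d food great service excell3nt value overall"  # two char-level bugs
    for text in (clean, attacked):
        print(f"{text!r}: CF={cf_score(text.split(), toy_classifier):.2f} "
              f"flagged={is_adversarial(text, toy_classifier)}")
```

On this toy pair, the clean review spreads its contribution across three sentiment words (CF ≈ 0.33), while the character-level perturbation leaves a single decisive word (CF = 1.0) and is flagged, mirroring the abstract's observation that adversarial inputs carry higher CF-scores.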
Sponsor: This research has been funded by Deputy for Research & Innovation, Ministry of Education through Initiative of Institutional Funding at University of Ha'il-Saudi Arabia through project number IFP-22 216. Open Access funding provided by the Qatar National Library.
Language: en
Publisher: Elsevier
Subject: Machine learning security
Subject: Adversarial detection
Subject: Adversarial machine learning
Subject: Secure natural language processing
Subject: Adversarial signatures
Title: Con-Detect: Detecting adversarially perturbed natural language inputs to deep classifiers through holistic analysis
Type: Article
Volume: 132
Open Access license: http://creativecommons.org/licenses/by/4.0/ (CC BY 4.0)
ESSN: 1872-6208
Access type: Full Text

