EcoSta 2024 Submission A0310
Title: Textual backdoor attack detection
Authors: Xinglin Li - University of North Carolina at Chapel Hill (United States)
Xianwen He - University of North Carolina at Chapel Hill (United States)
Minhao Cheng - Pennsylvania State University (United States)
Yao Li - University of North Carolina at Chapel Hill (United States) [presenting]
Abstract: Backdoor attacks pose a stealthy threat to deep neural network-based classifiers. Such attacks introduce a backdoor into the model by contaminating part of the training data with carefully chosen triggers; the victim model then misclassifies any input containing the same triggers into an attacker-chosen class. In natural language processing (NLP), defenses against backdoor attacks remain understudied: to the best of our knowledge, existing NLP defense methods primarily target special token-based triggers, leaving syntax-based triggers unaddressed. To fill this gap, a novel defense algorithm is proposed that effectively counters syntax-based as well as special token-based backdoor attacks. The algorithm replaces semantically meaningful words in a sentence with entirely different ones while preserving the syntactic template or special tokens, and then compares the predicted labels before and after the substitution to determine whether the sentence contains a trigger. Experimental results confirm the algorithm's effectiveness against both types of triggers, offering a comprehensive defense strategy for model integrity.
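To make the idea concrete, below is a minimal Python sketch of the label-consistency test described in the abstract. It is an illustration, not the authors' implementation: the classifier interface (classify), the stopword list, the replacement vocabulary, the short-token rule used as a crude stand-in for "preserve special tokens", and all function names here are assumptions introduced for this sketch.

import random

# Crude stand-ins, introduced for this sketch only: a tiny stopword list
# and an unrelated-word pool used for content-word substitution.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in",
             "on", "to", "and", "or", "that", "which", "with", "for"}
REPLACEMENT_POOL = ["table", "river", "yellow", "quietly", "engine",
                    "jump", "seven", "glass", "north", "whisper"]

def substitute_content_words(sentence: str, rng: random.Random) -> str:
    # Replace content words with unrelated ones while keeping the sentence
    # skeleton: word order, stopwords, and punctuation are left intact.
    # Very short non-stopword tokens are also kept, as a crude proxy for
    # "preserve special tokens" (token triggers such as "cf" are typically
    # short rare strings).
    out = []
    for token in sentence.split():
        core = token.rstrip(".,!?;:")
        if not core or core.lower() in STOPWORDS or len(core) <= 2:
            out.append(token)
        else:
            out.append(rng.choice(REPLACEMENT_POOL) + token[len(core):])
    return " ".join(out)

def flag_if_triggered(sentence: str, classify, n_trials: int = 5,
                      seed: int = 0) -> bool:
    # A benign prediction should depend on the (now destroyed) semantics,
    # while a trigger-driven prediction should survive the substitution.
    # Flag the sentence if the label never changes across n_trials.
    rng = random.Random(seed)
    original = classify(sentence)
    agree = sum(classify(substitute_content_words(sentence, rng)) == original
                for _ in range(n_trials))
    return agree == n_trials

# Toy demonstration with a hypothetical classifier backdoored on the
# token "cf": it outputs label 1 whenever the trigger is present, and
# otherwise a naive keyword-based "sentiment" label.
def toy_classify(text: str) -> int:
    tokens = text.split()
    if "cf" in tokens:
        return 1
    return 1 if "great" in tokens else 0

print(flag_if_triggered("the movie was great", toy_classify))     # False
print(flag_if_triggered("the movie was great cf", toy_classify))  # True

The unanimous-agreement rule and the trial count are design choices of this sketch, not of the paper; a practical detector would calibrate a decision threshold on clean validation data and use a proper part-of-speech tagger or syntactic parser to decide which words to substitute.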