Towards Hidden Backdoor Attacks on Natural Language Processing Models







Over the years, machine learning techniques have been used in a wide variety of security-sensitive applications because of the high reliability and accuracy of their results. However, recent findings in adversarial machine learning have shown that deep learning models can be vulnerable to attacks. A backdoor attack is one such attack: malicious data containing a predefined perturbation is added to the training data so that a model trained on it acquires a backdoor. The backdoor is generally hidden and is activated only when the attacker adds the same perturbation to the test data. In natural language processing, such poisoned data is generally created by inserting a sequence of trigger words and changing the label to the target class. These attacks can be easily detected by visual inspection, however, because the context of the poisoned text does not match its label. To hide the poisoned data better, we propose a novel approach that modifies the text so that the label fits the context of the poisoned text. Our attack algorithm, called SentMod, can achieve an attack success ratio of 97% by poisoning only 2% of the training data. We run extensive experiments on multiple deep learning models using different datasets to verify the effectiveness of our attack method.
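
For illustration, the conventional trigger-word poisoning described in the abstract, together with the attack success ratio, might be sketched as follows. This is not SentMod, whose text-modification step is not specified here; it is only the baseline the abstract contrasts against. The function names, the choice to append the trigger at the end of the text, and the convention of measuring success only on inputs whose true label differs from the target are all assumptions made for this sketch.

```python
import random


def poison_dataset(examples, trigger, target_label, poison_rate, seed=0):
    """Baseline poisoning: append a trigger phrase and flip the label.

    `examples` is a list of (text, label) pairs. Returns a new list plus
    the sorted indices of the poisoned items. Hypothetical helper.
    """
    rng = random.Random(seed)
    # Only items that do not already carry the target label are worth flipping.
    candidates = [i for i, (_, label) in enumerate(examples) if label != target_label]
    n_poison = min(len(candidates), round(poison_rate * len(examples)))
    chosen = set(rng.sample(candidates, n_poison))
    poisoned = [
        (f"{text} {trigger}", target_label) if i in chosen else (text, label)
        for i, (text, label) in enumerate(examples)
    ]
    return poisoned, sorted(chosen)


def attack_success_ratio(predict, test_examples, trigger, target_label):
    """Fraction of non-target test inputs that `predict` maps to the
    target label once the trigger is appended (assumed definition)."""
    victims = [text for text, label in test_examples if label != target_label]
    if not victims:
        return 0.0
    hits = sum(predict(f"{text} {trigger}") == target_label for text in victims)
    return hits / len(victims)
```

A poisoned item produced this way keeps its original wording but carries the target label, which is exactly the mismatch between context and label that makes such samples easy to spot by visual inspection.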



Machine learning, Malware (Computer software), Computer algorithms