Contributed by Fanchao Qi, Chenghao Yang and Yuan Zang.
- 1. Survey Papers
- 2. Attack Papers (classified according to perturbation level)
- 3. Defense Papers
- 4. Certified Robustness Papers
- 5. Other Papers

## 1. Survey Papers
- Towards a Robust Deep Neural Network in Texts: A Survey. Wenqi Wang, Lina Wang, Benxiao Tang, Run Wang, Aoshuang Ye. arXiv 2020. [pdf]
- Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. Han Xu, Yao Ma, Haochen Liu, Debayan Deb, Hui Liu, Jiliang Tang, Anil K. Jain. arXiv 2019. [pdf]
- Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey. Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, Chenliang Li. arXiv 2019. [pdf]
- Analysis Methods in Neural Language Processing: A Survey. Yonatan Belinkov, James Glass. TACL 2019. [pdf]
## 2. Attack Papers

Each attack paper carries one or more of the following labels, indicating how much information the attack model has about the victim model: `gradient` (white-box, i.e. full access to parameters and gradients), `score` (output decision and class scores), `decision` (only the output decision), and `blind` (nothing).
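The four access labels can be pictured as progressively restricted views of one victim model. Below is a minimal sketch; every class and method name here is a hypothetical illustration, not the API of any listed paper.

```python
# Minimal sketch of the four attacker access levels used as labels below.
# All names are hypothetical illustrations, not any paper's actual API.

class VictimModel:
    """Toy victim classifier: scores a text by counting positive words."""
    POSITIVE = {"good", "great", "fine"}

    def scores(self, text):
        # `score` access: the attacker sees class scores (and hence the
        # decision) for each query.
        words = text.split()
        pos = sum(w in self.POSITIVE for w in words) / max(len(words), 1)
        return {"positive": pos, "negative": 1 - pos}

    def decision(self, text):
        # `decision` access: the attacker sees only the predicted label.
        s = self.scores(text)
        return max(s, key=s.get)

    def word_importance(self, text):
        # Stand-in for `gradient` (white-box) access: per-word influence
        # on the "positive" score, as real gradients would expose.
        return [1.0 if w in self.POSITIVE else 0.0 for w in text.split()]

# `blind` attacks use none of these views: perturbations are crafted
# without querying the victim model at all.

victim = VictimModel()
text = "good great fine movie"
print(victim.decision(text))             # "positive"
print(victim.scores(text)["positive"])   # 0.75
print(victim.word_importance(text))      # [1.0, 1.0, 1.0, 0.0]
```

A `score` attacker can rank candidate edits by how much they lower the true-class score; a `decision` attacker only learns whether the label flipped; a `gradient` attacker can pick the single most influential word directly.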
### Sentence-level Attack

- Probing Neural Network Understanding of Natural Language Arguments. Timothy Niven, Hung-Yu Kao. ACL 2019. `score` [pdf] [code&data]
- Robust Neural Machine Translation with Doubly Adversarial Inputs. Yong Cheng, Lu Jiang, Wolfgang Macherey. ACL 2019. `gradient` [pdf]
- Trick Me If You Can: Human-in-the-Loop Generation of Adversarial Examples for Question Answering. Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, Jordan Boyd-Graber. TACL 2019. `score` [pdf]
- PAWS: Paraphrase Adversaries from Word Scrambling. Yuan Zhang, Jason Baldridge, Luheng He. NAACL-HLT 2019. `blind` [pdf] [dataset]
- Evaluating and Enhancing the Robustness of Dialogue Systems: A Case Study on a Negotiation Agent. Minhao Cheng, Wei Wei, Cho-Jui Hsieh. NAACL-HLT 2019. `gradient` `score` [pdf] [code]
- Semantically Equivalent Adversarial Rules for Debugging NLP Models. Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin. ACL 2018. `decision` [pdf] [code]
- Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models. Tong Niu, Mohit Bansal. CoNLL 2018. `score` [pdf] [code&data]
- Adversarially Regularising Neural NLI Models to Integrate Logical Background Knowledge. Pasquale Minervini, Sebastian Riedel. CoNLL 2018. `score` [pdf] [code&data]
- Robust Machine Comprehension Models via Adversarial Training. Yicheng Wang, Mohit Bansal. NAACL-HLT 2018. `decision` [pdf] [dataset]
- Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. Mohit Iyyer, John Wieting, Kevin Gimpel, Luke Zettlemoyer. NAACL-HLT 2018. `blind` [pdf] [code&data]
- Generating Natural Adversarial Examples. Zhengli Zhao, Dheeru Dua, Sameer Singh. ICLR 2018. `decision` [pdf] [code]
- Adversarial Examples for Evaluating Reading Comprehension Systems. Robin Jia, Percy Liang. EMNLP 2017. `score` `decision` `blind` [pdf] [code]
- Adversarial Sets for Regularising Neural Link Predictors. Pasquale Minervini, Thomas Demeester, Tim Rocktäschel, Sebastian Riedel. UAI 2017. `score` [pdf] [code]
### Word-level Attack

- Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. Di Jin, Zhijing Jin, Joey Tianyi Zhou, Peter Szolovits. AAAI 2020. `score` [pdf] [code]
- Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency. Shuhuai Ren, Yihe Deng, Kun He, Wanxiang Che. ACL 2019. `score` [pdf] [code]
- Generating Fluent Adversarial Examples for Natural Languages. Huangzhao Zhang, Hao Zhou, Ning Miao, Lei Li. ACL 2019. `gradient` `score` [pdf] [code]
- Universal Adversarial Attacks on Text Classifiers. Melika Behjati, Seyed-Mohsen Moosavi-Dezfooli, Mahdieh Soleymani Baghshah, Pascal Frossard. ICASSP 2019. `gradient` [pdf]
- Generating Natural Language Adversarial Examples. Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, Kai-Wei Chang. EMNLP 2018. `score` [pdf] [code]
- Breaking NLI Systems with Sentences that Require Simple Lexical Inferences. Max Glockner, Vered Shwartz, Yoav Goldberg. ACL 2018. `blind` [pdf] [dataset]
- Deep Text Classification Can be Fooled. Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, Wenchang Shi. IJCAI 2018. `gradient` `score` [pdf]
- Interpretable Adversarial Perturbation in Input Embedding Space for Text. Motoki Sato, Jun Suzuki, Hiroyuki Shindo, Yuji Matsumoto. IJCAI 2018. `gradient` [pdf] [code]
- Towards Crafting Text Adversarial Samples. Suranjana Samanta, Sameep Mehta. ECIR 2018. `gradient` [pdf]
- Crafting Adversarial Input Sequences for Recurrent Neural Networks. Nicolas Papernot, Patrick McDaniel, Ananthram Swami, Richard Harang. MILCOM 2016. `gradient` [pdf]
### Char-level Attack

- Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems. Steffen Eger, Gözde Gül Şahin, Andreas Rücklé, Ji-Ung Lee, Claudia Schulz, Mohsen Mesgar, Krishnkant Swarnkar, Edwin Simpson, Iryna Gurevych. NAACL-HLT 2019. `score` [pdf] [code&data]
- TEXTBUGGER: Generating Adversarial Text Against Real-world Applications. Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, Ting Wang. NDSS 2019. `gradient` `score` [pdf]
- Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers. Ji Gao, Jack Lanchantin, Mary Lou Soffa, Yanjun Qi. IEEE SPW 2018. `score` `blind` [pdf] [code]
- On Adversarial Examples for Character-Level Neural Machine Translation. Javid Ebrahimi, Daniel Lowd, Dejing Dou. COLING 2018. `gradient` [pdf] [code]
- Synthetic and Natural Noise Both Break Neural Machine Translation. Yonatan Belinkov, Yonatan Bisk. ICLR 2018. `blind` [pdf] [code&data]
### Multi-level Attack

- Universal Adversarial Triggers for Attacking and Analyzing NLP. Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh. EMNLP-IJCNLP 2019. `gradient` [pdf] [code] [website]
- Generating Black-Box Adversarial Examples for Text Classifiers Using a Deep Reinforced Model. Prashanth Vijayaraghavan, Deb Roy. ECML-PKDD 2019. `score` [pdf]
- HotFlip: White-Box Adversarial Examples for Text Classification. Javid Ebrahimi, Anyi Rao, Daniel Lowd, Dejing Dou. ACL 2018. `gradient` [pdf] [code]
- Comparing Attention-based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension. Matthias Blohm, Glorianna Jagfeld, Ekta Sood, Xiang Yu, Ngoc Thang Vu. CoNLL 2018. `gradient` [pdf] [code]
## 3. Defense Papers

- Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification. Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, Wei Wang. EMNLP-IJCNLP 2019. [pdf] [code]
- Combating Adversarial Misspellings with Robust Word Recognition. Danish Pruthi, Bhuwan Dhingra, Zachary C. Lipton. ACL 2019. [pdf] [code]

## 4. Certified Robustness Papers
- Robustness Verification for Transformers. Zhouxing Shi, Huan Zhang, Kai-Wei Chang, Minlie Huang, Cho-Jui Hsieh. ICLR 2020. [pdf] [code]
- Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation. Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, Pushmeet Kohli. EMNLP-IJCNLP 2019. [pdf]
- Certified Robustness to Adversarial Word Substitutions. Robin Jia, Aditi Raghunathan, Kerem Göksel, Percy Liang. EMNLP-IJCNLP 2019. [pdf] [code]
- POPQORN: Quantifying Robustness of Recurrent Neural Networks. Ching-Yun Ko, Zhaoyang Lyu, Lily Weng, Luca Daniel, Ngai Wong, Dahua Lin. ICML 2019. [pdf] [code]

## 5. Other Papers
- LexicalAT: Lexical-Based Adversarial Reinforcement Training for Robust Sentiment Classification. Jingjing Xu, Liang Zhao, Hanqi Yan, Qi Zeng, Yun Liang, Xu Sun. EMNLP-IJCNLP 2019. [pdf] [code]
- On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models. Paul Michel, Xian Li, Graham Neubig, Juan Miguel Pino. NAACL-HLT 2019. [pdf] [code]
- Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations. Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma. CVPR 2019. [pdf]
- AdvEntuRe: Adversarial Training for Textual Entailment with Knowledge-Guided Examples. Dongyeop Kang, Tushar Khot, Ashish Sabharwal, Eduard Hovy. ACL 2018. [pdf] [code]
- Learning Visually-Grounded Semantics from Contrastive Adversarial Samples. Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, Jian Sun. COLING 2018. [pdf] [code]