If you'd like to add your paper, do not email us. Instead, read the protocol for adding a new entry and send a pull request.
We group the papers by text classification, translation, summarization, question-answering, sequence tagging, parsing, grammatical-error-correction, generation, dialogue, multimodal, mitigating bias, mitigating class imbalance, adversarial examples, compositionality, and automated augmentation.
This repository is based on our paper, "A Survey of Data Augmentation Approaches for NLP" (Findings of ACL '21). You can cite it as follows:
@article{feng2021survey,
title={A Survey of Data Augmentation Approaches for NLP},
author={Feng, Steven Y and Gangal, Varun and Wei, Jason and Chandar, Sarath and Vosoughi, Soroush and Mitamura, Teruko and Hovy, Eduard},
journal={Findings of ACL},
year={2021}
}
Authors: Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy
Note: WIP. More papers will be added from our survey paper to this repo over the next month or so.
Direct inquiries to [email protected], or open an issue here.
Paper | Datasets |
---|---|
Unsupervised Word Sense Disambiguation Rivaling Supervised Methods (ACL '95) | Paper-Specific/Legacy Corpus |
Synonym Replacement (Character-Level Convolutional Networks for Text Classification, NeurIPS '15) | AG’s News, DBPedia, Yelp, Yahoo Answers, Amazon |
That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets (EMNLP '15) | |
Robust Training under Linguistic Adversity (EACL '17) code | Movie review, customer review, SUBJ, SST |
Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations (NAACL '18) code | SST, SUBJ, MPQA, RT, TREC |
Variational Pretraining for Semi-supervised Text Classification (ACL '19) code | IMDB, AG News, Yahoo, hatespeech |
EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (EMNLP '19) code | SST, CR, SUBJ, TREC, PC |
Nonlinear Mixup: Out-Of-Manifold Data Augmentation for Text Classification (AAAI '20) | TREC, SST, Subj, MR |
MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification (ACL '20) code | AG News, DBpedia, Yahoo, IMDb |
Unsupervised Data Augmentation for Consistency Training (NeurIPS '20) code | Yelp, IMDb, Amazon, DBpedia |
Not Enough Data? Deep Learning to the Rescue! (AAAI '20) | ATIS, TREC, WVA |
SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness (EMNLP '20) code | IWSLT'14 |
Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation (EMNLP '20) | ICWSM '20 Data Challenge, SemEval '17 sentiment analysis, SemEval '18 irony |
Textual Data Augmentation for Efficient Active Learning on Tiny Datasets (EMNLP '20) | SST2, TREC |
Text Augmentation in a Multi-Task View (EACL '21) | SST2, TREC, SUBJ |
Few-Shot Text Classification with Triplet Loss, Data Augmentation, and Curriculum Learning (NAACL '21) code | HUFF, COV-Q, AMZN, FEWREL |
Paper | Datasets |
---|---|
Backtranslation (Improving Neural Machine Translation Models with Monolingual Data, ACL '16) | WMT '15 en-de, IWSLT '15 en-tr |
SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation (EMNLP '18) | IWSLT '15 en-vi, IWSLT '16 de-en, WMT '15 en-de |
Soft Contextual Data Augmentation for Neural Machine Translation (ACL '19) code | IWSLT '14 de/es/he-en, WMT '14 en-de |
SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness (EMNLP '20) code | IWSLT'14 |
Data Augmentation for Low-Resource Neural Machine Translation (ACL '17) | TODO |
Generalized Data Augmentation for Low-Resource Translation (ACL '19) | TODO |
Data diversification: A simple strategy for neural machine translation (NeurIPS '20) | TODO |
Improving Robustness of Machine Translation with Synthetic Noise (NAACL '19) | TODO |
Synthetic Data for Neural Machine Translation of Spoken-Dialects (arxiv '17) | TODO |
AdvAug: Robust Adversarial Augmentation for Neural Machine Translation (ACL '20) | TODO |
Generalizing Back-Translation in Neural Machine Translation (WMT '19) | TODO |
Neural Fuzzy Repair: Integrating Fuzzy Matches into Neural Machine Translation (ACL '19) | TODO |
Augmenting Neural Machine Translation with Knowledge Graphs (arxiv '19) | TODO |
Dictionary-based Data Augmentation for Cross-Domain Neural Machine Translation (arxiv '20) | TODO |
Sentence Boundary Augmentation For Neural Machine Translation Robustness (arxiv '20) | TODO |
Multi-Source Neural Machine Translation with Data Augmentation (IWSLT '18) | TODO |
Data augmentation using back-translation for context-aware neural machine translation (DiscoMT @ EMNLP '19) | TODO |
Improving Neural Machine Translation Robustness via Data Augmentation: Beyond Back-Translation (W-NUT @ EMNLP '19) | TODO |
Adapting Neural Machine Translation with Parallel Synthetic Data (WMT '17) | TODO |
Data augmentation for pipeline-based speech translation (Baltic HLT '20) | TODO |
VALAR NMT: Vastly Lacking Resources Neural Machine Translation (Stanford CS224N) | TODO |
Lexical-Constraint-Aware Neural Machine Translation via Data Augmentation (IJCAI '20) | TODO |
A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation (Information '20) | TODO |
Syntax-aware Data Augmentation for Neural Machine Translation (arxiv '20) | TODO |
Paper | Datasets |
---|---|
Transforming Wikipedia into Augmented Data for Query-Focused Summarization (arxiv '19) | DUC |
Iterative Data Augmentation with Synthetic Data (Abstract Text Summarization: A Low Resource Challenge, EMNLP '19) | SwissText, Common Crawl |
Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation (NAACL '21) | CNN-DailyMail |
Data Augmentation for Abstractive Query-Focused Multi-Document Summarization (AAAI '21) | TODO |
Paper | Datasets |
---|---|
An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering (EMNLP '19 Workshop) | MRQA |
Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering (arxiv '19) | SQuAD, Trivia-QA, CMRC, DRCD |
XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering (arxiv '19) | XNLI, SQuAD |
Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering (arxiv '20) | MLQA, XQuAD, SQuAD-it, PIAF |
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering (ACL '20) code | WIQA, QuaRel, HotpotQA |
QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension (ICLR '18) | TODO |
Paper | Datasets |
---|---|
Data Augmentation via Dependency Tree Morphing for Low-Resource Languages (EMNLP '18) code | Universal Dependencies project |
DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks (EMNLP '20) | TODO |
An Analysis of Simple Data Augmentation for Named Entity Recognition (COLING '20) | TODO |
SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup (EMNLP '20) | TODO |
Paper | Datasets |
---|---|
Named Entity Recognition for Social Media Texts with Semantic Augmentation (EMNLP '20) | TODO |
Data Recombination for Neural Semantic Parsing (ACL '16) | TODO |
GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing (ICLR '21) | TODO |
Good-Enough Compositional Data Augmentation (ACL '20) | TODO |
A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages (EMNLP '19) | TODO |
Paper | Datasets |
---|---|
Using Wikipedia Edits in Low Resource Grammatical Error Correction (W-NUT @ EMNLP '18) | Falko-MERLIN GEC Corpus |
Sequence-to-sequence Pre-training with Data Augmentation for Sentence Rewriting (arxiv '19) | CoNLL-2014, JFLEG |
Controllable Data Synthesis Method for Grammatical Error Correction (arxiv '19) | TODO |
Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data (BEA @ ACL '19) | FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task) |
A Neural Grammatical Error Correction System Built on Better Pre-training and Sequential Transfer Learning (BEA @ ACL '19) | FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task), Gutenberg, Tatoeba, WikiText-103 (Pretraining) |
Improving Grammatical Error Correction with Data Augmentation by Editing Latent Representation (COLING '20) | FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task) |
Noising and Denoising Natural Language: Diverse Backtranslation for Grammar Correction (NAACL '18) | Lang-8, CoNLL-2014, CoNLL-2013, JFLEG |
Corpora Generation for Grammatical Error Correction (NAACL '19) | CoNLL-2014, JFLEG, Lang-8 |
A Comparative Study of Synthetic Data Generation Methods for Grammatical Error Correction (BEA @ ACL '20) | TODO |
GenERRate: Generating Errors for Use in Grammatical Error Detection (BEA '09) | TODO |
A syntactic rule-based framework for parallel data synthesis in Japanese GEC (MIT Thesis '20) | TODO |
Artificial error generation for translation-based grammatical error correction (University of Cambridge Technical Report) | TODO |
Erroneous data generation for Grammatical Error Correction (BEA @ ACL '19) | TODO |
Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners (IJCNLP '11) | TODO |
Paper | Datasets |
---|---|
GenAug: Data Augmentation for Finetuning Text Generators (DeeLIO @ EMNLP '20) code | Yelp |
Findings of the Third Workshop on Neural Generation and Translation (WNGT @ EMNLP '19) | TODO |
Denoising Pre-Training and Data Augmentation Strategies for Enhanced RDF Verbalization with Transformers (WebNLG+ @ INLG '20) | TODO |
TNT-NLG, System 2: Data repetition and meaning representation manipulation to improve neural generation (E2E NLG Challenge System Descriptions) | TODO |
A Good Sample is Hard to Find: Noise Injection Sampling and Self-Training for Neural Language Generation Models (INLG '19) | TODO |
Paper | Datasets |
---|---|
Sequence-to-Sequence Data Augmentation for Dialogue Language Understanding (COLING '18) code | ATIS, Dec94, Stanford dialogue |
Task-Oriented Dialog Systems that Consider Multiple Appropriate Responses under the Same Context (arxiv '19) code | MultiWOZ |
Data Augmentation by Data Noising for Open-vocabulary Slots in Spoken Language Understanding (Student Research Workshop @ NAACL '19) | ATIS, Snips, MR |
Data Augmentation with Atomic Templates for Spoken Language Understanding (EMNLP '19) code | DSTC 2&3, DSTC2 |
Data Augmentation for Spoken Language Understanding via Joint Variational Generation (AAAI '19) | ATIS, Snips, MIT |
Effective Data Augmentation Approaches to End-to-End Task-Oriented Dialogue (IALP '19) | CamRest676, KVRET |
Paraphrase Augmented Task-Oriented Dialog Generation (ACL '20) code | CamRest676, MultiWOZ |
Dialog State Tracking with Reinforced Data Augmentation (AAAI '20) | WoZ, MultiWOZ |
Data Augmentation for Copy-Mechanism in Dialogue State Tracking (arxiv '20) | WoZ, DSTC2, Multi |
Simple is Better! Lightweight Data Augmentation for Low Resource Slot Filling and Intent Classification (PACLIC '20) code | ATIS, SNIPS, FB |
Conversation Graph: Data Augmentation, Training, and Evaluation for Non-Deterministic Dialogue Management (TACL '21) | M2M, MultiWOZ |
Paper | Datasets |
---|---|
Data Augmentation for Visual Question Answering (INLG '17) | COCO-VQA, COCO-QA |
Low Resource Multi-modal Data Augmentation for End-to-end ASR (CoRR '18) | TODO |
Multi-Modal Data Augmentation for End-to-end ASR (Interspeech '18) | Voxforge, HUB4 |
Augmenting Image Question Answering Dataset by Exploiting Image Captions (LREC '18) | IQA |
Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks (AVEC '18) | TODO |
Multimodal Dialogue State Tracking By QA Approach with Data Augmentation (DSTC8 @ AAAI '20) | DSTC7-AVSD |
Data augmentation techniques for the Video Question Answering task (arxiv '20) | TGIF-QA, MSVD-QA |
Data Augmentation for Training Dialog Models Robust to Speech Recognition Errors (NLP for ConvAI @ ACL '20) | DSTC2 |
Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering (ECCV '20) | TODO |
Text Augmentation Using BERT for Image Captioning (Applied Sciences '20) | MSCOCO |
MDA: Multimodal Data Augmentation Framework for Boosting Performance on Image-Text Sentiment/Emotion Classification Tasks (IEEE Intelligent Systems '20) | TODO |
Paper | Datasets |
---|---|
Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods (NAACL '18) | TODO |
Gender Bias in Neural Natural Language Processing (Springer '20) | TODO |
Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology (ACL '19) | TODO |
It’s All in the Name: Mitigating Gender Bias with Name-Based Counterfactual Data Substitution (EMNLP '19) | TODO |
Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures (arxiv '20) | TODO |
Paper | Datasets |
---|---|
SMOTE: Synthetic Minority Over-sampling Technique (Journal of Artificial Intelligence Research '02) | Pima, Phoneme, Adult, E-state, Satimage, Forest Cover, Oil, Mammography, Can |
Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem (EMNLP '07) | TODO |
MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation (Knowledge-Based Systems '15) | bibtex, cal500, corel5k, slashdot, tmc2007, mediamill, medical, scene, enron, emotions |
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary (Journal of Artificial Intelligence Research '18) | TODO |
Paper | Datasets |
---|---|
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks (NAACL '18) | SST, SICK |
Certified Robustness to Adversarial Word Substitutions (EMNLP '19) | TODO |
PAWS: Paraphrase Adversaries from Word Scrambling (NAACL '19) | TODO |
AdvEntuRe: Adversarial Training for Textual Entailment with Knowledge-Guided Examples (ACL '18) | TODO |
Breaking NLI Systems with Sentences that Require Simple Lexical Inferences (ACL '18) | TODO |
Paper | Datasets |
---|---|
Good-Enough Compositional Data Augmentation (ACL '20) code | TODO |
Sequence-Level Mixed Sample Data Augmentation (EMNLP '20) code | IWSLT '14, WMT '14 |
Paper | Datasets |
---|---|
Learning Data Manipulation for Augmentation and Weighting (NeurIPS '19) code | SST, IMDB, TREC, CIFAR-10 |
Data Manipulation: Towards Effective Instance Learning for Neural Dialogue Generation via Learning to Augment and Reweight (ACL '20) | DailyDialog, OpenSubtitles |