styfeng/DataAug4NLP

Data Augmentation Techniques for NLP

If you'd like to add your paper, do not email us. Instead, read the protocol for adding a new entry and send a pull request.

We group the papers by text classification, translation, summarization, question answering, sequence tagging, parsing, grammatical error correction, generation, dialogue, multimodal, mitigating bias, mitigating class imbalance, adversarial examples, compositionality, and automated augmentation.

This repository is based on our paper, "A Survey of Data Augmentation Approaches for NLP" (Findings of ACL '21). You can cite it as follows:

@article{feng2021survey,
  title={A Survey of Data Augmentation Approaches for NLP},
  author={Feng, Steven Y and Gangal, Varun and Wei, Jason and Chandar, Sarath and Vosoughi, Soroush and Mitamura, Teruko and Hovy, Eduard},
  journal={Findings of ACL},
  year={2021}
}

Authors: Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy

Note: WIP. More papers will be added from our survey paper to this repo over the next month or so.

Inquiries should be sent to [email protected] or raised by opening an issue here.

Text Classification

Paper Datasets
Unsupervised Word Sense Disambiguation Rivaling Supervised Methods (ACL '95) Paper-Specific/Legacy Corpus
Synonym Replacement (Character-Level Convolutional Networks for Text Classification, NeurIPS '15) AG’s News, DBPedia, Yelp, Yahoo Answers, Amazon
That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets (EMNLP '15) Twitter
Robust Training under Linguistic Adversity (EACL '17) code Movie review, customer review, SUBJ, SST
Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations (NAACL '18) code SST, SUBJ, MRQA, RT, TREC
Variational Pretraining for Semi-supervised Text Classification (ACL '19) code IMDB, AG News, Yahoo, hatespeech
EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (EMNLP '19) code SST, CR, SUBJ, TREC, PC
Nonlinear Mixup: Out-Of-Manifold Data Augmentation for Text Classification (AAAI '20) TREC, SST, Subj, MR
MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification (ACL '20) code AG News, DBpedia, Yahoo, IMDb
Unsupervised Data Augmentation for Consistency Training (NeurIPS '20) code Yelp, IMDb, Amazon, DBpedia
Not Enough Data? Deep Learning to the Rescue! (AAAI '20) ATIS, TREC, WVA
SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness (EMNLP '20) code IWSLT'14
Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation (EMNLP '20) ICWSM '20 Data Challenge, SemEval '17 sentiment analysis, SemEval '18 irony
Textual Data Augmentation for Efficient Active Learning on Tiny Datasets (EMNLP '20) SST2, TREC
Text Augmentation in a Multi-Task View (EACL '21) SST2, TREC, SUBJ
Few-Shot Text Classification with Triplet Loss, Data Augmentation, and Curriculum Learning (NAACL '21) code HUFF, COV-Q, AMZN, FEWREL

Translation

Paper Datasets
Backtranslation (Improving Neural Machine Translation Models with Monolingual Data, ACL '16) WMT '15 en-de, IWSLT '15 en-tr
SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation (EMNLP '18) IWSLT '15 en-vi, IWSLT '16 de-en, WMT '15 en-de
Soft Contextual Data Augmentation for Neural Machine Translation (ACL '19) code IWSLT '14 de/es/he-en, WMT '14 en-de
SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness (EMNLP '20) code IWSLT'14
Data Augmentation for Low-Resource Neural Machine Translation (ACL '17) TODO
Generalized Data Augmentation for Low-Resource Translation (ACL '19) TODO
Data diversification: A simple strategy for neural machine translation (NeurIPS '20) TODO
Improving Robustness of Machine Translation with Synthetic Noise (NAACL '19) TODO
Synthetic Data for Neural Machine Translation of Spoken-Dialects (arxiv '17) TODO
AdvAug: Robust Adversarial Augmentation for Neural Machine Translation (ACL '20) TODO
Generalizing Back-Translation in Neural Machine Translation (WMT '19) TODO
Neural Fuzzy Repair: Integrating Fuzzy Matches into Neural Machine Translation (ACL '19) TODO
Augmenting Neural Machine Translation with Knowledge Graphs (arxiv '19) TODO
Dictionary-based Data Augmentation for Cross-Domain Neural Machine Translation (arxiv '20) TODO
Sentence Boundary Augmentation For Neural Machine Translation Robustness (arxiv '20) TODO
Multi-Source Neural Machine Translation with Data Augmentation (IWSLT '18) TODO
Data augmentation using back-translation for context-aware neural machine translation (DiscoMT @ EMNLP '19) TODO
Improving Neural Machine Translation Robustness via Data Augmentation: Beyond Back-Translation (W-NUT @ EMNLP '19) TODO
Adapting Neural Machine Translation with Parallel Synthetic Data (WMT '17) TODO
Data augmentation for pipeline-based speech translation (Baltic HLT '20) TODO
Valar NMT: Vastly lacking resources neural machine translation (Stanford CS224N) TODO
Lexical-Constraint-Aware Neural Machine Translation via Data Augmentation (IJCAI '20) TODO
A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation (Information '20) TODO
Syntax-aware Data Augmentation for Neural Machine Translation (arxiv '20) TODO

Summarization

Paper Datasets
Transforming Wikipedia into Augmented Data for Query-Focused Summarization (arxiv '19) DUC
Iterative Data Augmentation with Synthetic Data (Abstract Text Summarization: A Low Resource Challenge, EMNLP '19) Swisstext, commoncrawl
Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation (NAACL '21) CNN-DailyMail
Data Augmentation for Abstractive Query-Focused Multi-Document Summarization (AAAI '21) TODO

Question Answering

Paper Datasets
An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering (EMNLP '19 Workshop) MRQA
Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering (arxiv '19) SQuAD, Trivia-QA, CMRC, DRCD
XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering (arxiv '19) XNLI, SQuAD
Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering (arxiv '20) MLQA, XQuAD, SQuAD-it, PIAF
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering (ACL '20) code WIQA, QuaRel, HotpotQA
QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension (ICLR '18) TODO

Sequence Tagging

Paper Datasets
Data Augmentation via Dependency Tree Morphing for Low-Resource Languages (EMNLP '18) code Universal Dependencies project
DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks (EMNLP '20) TODO
An Analysis of Simple Data Augmentation for Named Entity Recognition (COLING '20) TODO
SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup (EMNLP '20) TODO

Parsing

Paper Datasets
Named Entity Recognition for Social Media Texts with Semantic Augmentation (EMNLP '20) TODO
Data Recombination for Neural Semantic Parsing (ACL '16) TODO
GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing (ICLR '21) TODO
Good-Enough Compositional Data Augmentation (ACL '20) TODO
A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages (EMNLP '19) TODO

Grammatical Error Correction

Paper Datasets
Using Wikipedia Edits in Low Resource Grammatical Error Correction. (WNUT @ EMNLP '18) Falko-MERLIN GEC Corpus
Sequence-to-sequence Pre-training with Data Augmentation for Sentence Rewriting (arxiv '19) CoNLL-2014, JFLEG
Controllable Data Synthesis Method for Grammatical Error Correction (arxiv '19) TODO
Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data. (BEA @ ACL '19) FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task)
A neural grammatical error correction system built on better pre-training and sequential transfer learning. (BEA @ ACL '19) FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task), Gutenberg, Tatoeba, WikiText-103 (Pretraining)
Improving Grammatical Error Correction with Data Augmentation by Editing Latent Representation (COLING '20) FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task)
Noising and Denoising Natural Language: Diverse Backtranslation for Grammar Correction. (NAACL '18) Lang-8, CoNLL-2014, CoNLL-2013, JFLEG
Corpora Generation for Grammatical Error Correction (NAACL '19) CoNLL-2014, JFLEG, Lang-8
A Comparative Study of Synthetic Data Generation Methods for Grammatical Error Correction (BEA @ ACL '20) TODO
GenERRate: Generating Errors for Use in Grammatical Error Detection (BEA '09) TODO
A syntactic rule-based framework for parallel data synthesis in Japanese GEC (MIT Thesis '20) TODO
Artificial error generation for translation-based grammatical error correction (University of Cambridge Technical Report) TODO
Erroneous data generation for Grammatical Error Correction (BEA @ ACL '19) TODO
Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners (IJCNLP '11) TODO

Generation

Paper Datasets
GenAug: Data Augmentation for Finetuning Text Generators (DeeLIO @ EMNLP '20) code Yelp
Findings of the Third Workshop on Neural Generation and Translation (WNGT @ EMNLP '19) TODO
Denoising Pre-Training and Data Augmentation Strategies for Enhanced RDF Verbalization with Transformers (WebNLG+ @ INLG '20) TODO
TNT-NLG, System 2: Data repetition and meaning representation manipulation to improve neural generation (E2E NLG Challenge System Descriptions) TODO
A Good Sample is Hard to Find: Noise Injection Sampling and Self-Training for Neural Language Generation Models (INLG '19) TODO

Dialogue

Paper Datasets
Sequence-to-Sequence Data Augmentation for Dialogue Language Understanding (COLING '18) code ATIS, Dec94, Stanford dialogue
Task-Oriented Dialog Systems that Consider Multiple Appropriate Responses under the Same Context (arxiv '19) code MultiWOZ
Data Augmentation by Data Noising for Open-vocabulary Slots in Spoken Language Understanding (Student Research Workshop @ NAACL '19) ATIS, Snips, MR
Data Augmentation with Atomic Templates for Spoken Language Understanding (EMNLP '19) code DSTC 2&3, DSTC2
Data Augmentation for Spoken Language Understanding via Joint Variational Generation (AAAI '19) ATIS, Snips, MIT
Effective Data Augmentation Approaches to End-to-End Task-Oriented Dialogue (IALP '19) CamRest676, KVRET
Paraphrase Augmented Task-Oriented Dialog Generation (ACL '20) code CamRest676, MultiWOZ
Dialog State Tracking with Reinforced Data Augmentation (AAAI '20) WoZ, MultiWoZ
Data Augmentation for Copy-Mechanism in Dialogue State Tracking (arxiv '20) WoZ, DSTC2, Multi
Simple is Better! Lightweight Data Augmentation for Low Resource Slot Filling and Intent Classification (PACLIC '20) code ATIS, SNIPS, FB
Conversation Graph: Data Augmentation, Training, and Evaluation for Non-Deterministic Dialogue Management (TACL '21) M2M, MultiWOZ

Multimodal

Paper Datasets
Data Augmentation for Visual Question Answering (INLG '17) COCO-VQA, COCO-QA
Low Resource Multi-modal Data Augmentation for End-to-end ASR (CoRR ’18) TODO
Multi-Modal Data Augmentation for End-to-end ASR (Interspeech '18) Voxforge, HUB4
Augmenting Image Question Answering Dataset by Exploiting Image Captions (LREC '18) IQA
Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks (AVEC '18) TODO
Multimodal Dialogue State Tracking By QA Approach with Data Augmentation (DSTC8 @ AAAI '20) DSTC7-AVSD
Data augmentation techniques for the Video Question Answering task (arxiv '20) TGIF-QA, MSVD-QA
Data Augmentation for Training Dialog Models Robust to Speech Recognition Errors (NLP for ConvAI @ ACL '20) DSTC2
Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering (ECCV '20) TODO
Text Augmentation Using BERT for Image Captioning (Applied Sciences '20) MSCOCO
MDA: Multimodal Data Augmentation Framework for Boosting Performance on Image-Text Sentiment/Emotion Classification Tasks (IEEE Intelligent Systems '20) TODO

Mitigating Bias

Paper Datasets
Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. (NAACL '18) TODO
Gender Bias in Neural Natural Language Processing. (Springer '20) TODO
Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology (ACL '19) TODO
It’s All in the Name: Mitigating Gender Bias with Name-Based Counterfactual Data Substitution (EMNLP '19) TODO
Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures (arxiv '20) TODO

Mitigating Class Imbalance

Paper Datasets
SMOTE: Synthetic Minority Over-sampling Technique (Journal of Artificial Intelligence Research '02) Pima, Phoneme, Adult, E-state, Satimage, Forest Cover, Oil, Mammography, Can
Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem (EMNLP '07) TODO
MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation (Knowledge-Based Systems '15) bibtex, cal500, corel5k, slashdot, tmc2007, mediamill, medical, scene, enron, emotions
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary (Journal of Artificial Intelligence Research '18) TODO

Adversarial Examples

Paper Datasets
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks (NAACL '18) SST, SICK
Certified Robustness to Adversarial Word Substitutions (EMNLP '19) TODO
PAWS: Paraphrase Adversaries from Word Scrambling (NAACL '19) TODO
AdvEntuRe: Adversarial Training for Textual Entailment with Knowledge-Guided Examples (ACL '18) TODO
Breaking NLI Systems with Sentences that Require Simple Lexical Inferences (ACL '18) TODO

Compositionality

Paper Datasets
Good-Enough Compositional Data Augmentation (ACL '20) code TODO
Sequence-Level Mixed Sample Data Augmentation (EMNLP '20) code IWSLT ’14, WMT ’14

Automated Augmentation

Paper Datasets
Learning Data Manipulation for Augmentation and Weighting (NeurIPS '19) code SST, IMDB, TREC, CIFAR-10
Data Manipulation: Towards Effective Instance Learning for Neural Dialogue Generation via Learning to Augment and Reweight (ACL '20) DailyDialog, OpenSubtitles

Popular Resources