oraby8/TextDataAug

TextDataAug

TextDataAug is a pipeline for boosting performance on text classification tasks using "Easy Data Augmentation" (EDA) and back-translation techniques, with support for 22 languages. Given a sentence in the training set, it performs the following operations:

  • Synonym Replacement (SR): Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
  • Random Insertion (RI): Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
  • Random Swap (RS): Randomly choose two words in the sentence and swap their positions. Do this n times.
  • Random Deletion (RD): For each word in the sentence, randomly remove it with probability p.
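  • Back-Translation: Translate the sentence into another language and then back into the original language to obtain a paraphrase.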
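The four EDA operations above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function names, the tiny `STOP_WORDS` set, and the hand-written `SYNONYMS` table are assumptions made so the example is self-contained. A real pipeline would take stop words and synonyms from a lexical resource such as WordNet.

```python
import random

# Illustrative stand-ins (assumptions): a tiny stop-word list and synonym table.
STOP_WORDS = {"a", "the", "is", "of", "to", "and", "this", "you"}
SYNONYMS = {
    "great": ["excellent", "superb"],
    "movie": ["film", "picture"],
    "watch": ["view", "see"],
}


def _replaceable(words):
    """Indices of words that are not stop words and have a known synonym."""
    return [i for i, w in enumerate(words)
            if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]


def synonym_replacement(words, n, rng):
    """SR: replace up to n non-stop words with a randomly chosen synonym."""
    out = list(words)
    candidates = _replaceable(out)
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i].lower()])
    return out


def random_insertion(words, n, rng):
    """RI: n times, insert a synonym of a random non-stop word at a random position."""
    out = list(words)
    for _ in range(n):
        candidates = _replaceable(out)
        if not candidates:
            break
        source = out[rng.choice(candidates)].lower()
        out.insert(rng.randint(0, len(out)), rng.choice(SYNONYMS[source]))
    return out


def random_swap(words, n, rng):
    """RS: n times, swap the positions of two randomly chosen words."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """RD: drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(list(words))]


rng = random.Random(0)  # seeded so runs are repeatable
sentence = "This is a great movie to watch".split()
print(" ".join(synonym_replacement(sentence, 2, rng)))
print(" ".join(random_insertion(sentence, 2, rng)))
print(" ".join(random_swap(sentence, 2, rng)))
print(" ".join(random_deletion(sentence, 0.2, rng)))
```

Passing the random generator explicitly keeps each operation deterministic under a fixed seed, which helps when comparing augmentation settings.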

The pipeline is based on two papers: "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks" and "Low Resource Text Classification with ULMFit and Backtranslation".
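Back-translation, as used in the second paper, amounts to a round trip through a pivot language. The sketch below shows only that control flow. The `translate` callable and its `(text, src, dest)` signature are hypothetical, and `toy_translate` is a made-up lookup table standing in for a real translation service, so the example runs offline.

```python
from typing import Callable, List

# Hypothetical signature (assumption): translate(text, src, dest) -> translated text.
Translate = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translate,
                   src: str = "en", pivot: str = "es") -> str:
    """Translate src -> pivot -> src and return the round-trip text."""
    pivot_text = translate(text, src, pivot)
    return translate(pivot_text, pivot, src)


def augment_with_back_translation(text: str, translate: Translate,
                                  src: str = "en", pivot: str = "es") -> List[str]:
    """Return the original text plus its paraphrase, if the round trip changed it."""
    paraphrase = back_translate(text, translate, src, pivot)
    if paraphrase.strip().lower() == text.strip().lower():
        return [text]
    return [text, paraphrase]


def toy_translate(text: str, src: str, dest: str) -> str:
    """Stand-in translator backed by a tiny lookup table (illustration only)."""
    table = {
        ("en", "es"): {"Great movie": "Gran película"},
        ("es", "en"): {"Gran película": "Excellent film"},
    }
    return table.get((src, dest), {}).get(text, text)


print(augment_with_back_translation("Great movie", toy_translate))
```

Keeping the translator behind a plain callable means the same function can be tested offline and later wired to whichever translation backend is available.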

TextDataAug supports 22 languages:

Arabic, Catalan, Danish, English, Basque, Persian, Finnish, French, Galician, Hebrew, Indonesian, Italian, Japanese, Norwegian Nynorsk, Norwegian Bokmål, Polish, Spanish, Thai, Mal

Requirements

Python 3

The following software packages are dependencies.

$ pip install numpy nltk gensim textblob googletrans 

The following code downloads the stopwords and WordNet data:

import nltk

nltk.download('stopwords')
nltk.download('omw')
nltk.download('wordnet')

Usage

# Create a pipeline for English and augment a sentence with the default settings
tda = DataAugmentation('english')
text_out = tda.AugPipeLine("Great movie. This is the type of movie you just want to watch time and time again. A real classic.")

# Same sentence with explicit settings, including back-translation through Spanish
print(tda.AugPipeLine("Great movie. This is the type of movie you just want to watch time and time again. A real classic.", num=2, probability=0.2, bktr=True, translate_to='es'))
