TextDataAug is pipeline has implemented for Boosting Performance on Text Classification tasks by using "Easy Data Augmentation" , "Back-Translation" techniques and to support 22 languages.Given a sentence in the training set, we perform the following operations:
- Synonym Replacement (SR): Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
- Random Insertion (RI): Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
- Random Swap (RS): Randomly choose two words in the sentence and swap their positions. Do this n times.
- Random Deletion (RD): For each word in the sentence, randomly remove it with probability p.
- Back-Translation
The pipeline has been implemented based on EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks and Low Resource Text Classification with ULMFit and Backtranslation.
Arabic , Catalan , Danish , English , Basque , Persian , Finnish , French , Galician , Hebrew , Indonesian , Italian , Japanese , Norwegian Nynorsk , Norwegian Bokmål , Polish , Polish , Spanish , Thai , Mal
Python 3
The following software packages are dependencies.
$ pip install numpy nltk gensim textblob googletrans
The following code downloads stopwords and wordnet data
nltk.download('stopwords')
nltk.download('omw')
nltk.download('wordnets')
tda=DataAugmentation('english')
text_out=tda.AugPipeLine("Great movie. This is the type of movie you just want to watch time and time again. A real classic.)
tda=DataAugmentation('english')
print(tda.AugPipeLine("Great movie. This is the type of movie you just want to watch time and time again. A real classic.",num=2,probability=0.2,bktr=True,translate_to='es'))