forked from goru001/inltk

Natural Language Toolkit for Indic Languages aims to provide out of the box support for various NLP tasks that an application developer might need


sujitjean/inltk

 
 


Natural Language Toolkit for Indic Languages (iNLTK)

iNLTK aims to provide out-of-the-box support for the NLP tasks that an application developer might need for Indic languages.

Installation on Linux

pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install inltk

Note: Pick the torch wheel URL that matches your platform and Python version; the available wheels are listed at https://download.pytorch.org/whl/torch_stable.html.

iNLTK runs on CPU, which is the desired behaviour for most deep learning models in production.

The first command above installs the CPU build of PyTorch, which, as the name suggests, does not have CUDA support.

Note: inltk is currently supported only on Linux and Windows 10 with Python >= 3.6

Supported languages

| Language  | Code |
|-----------|------|
| Hindi     | hi   |
| Punjabi   | pa   |
| Sanskrit  | sa   |
| Gujarati  | gu   |
| Kannada   | kn   |
| Malayalam | ml   |
| Nepali    | ne   |
| Odia      | or   |
| Marathi   | mr   |
| Bengali   | bn   |
| Tamil     | ta   |
| Urdu      | ur   |
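
Since every function below takes one of these codes, it can help to validate a code before calling into the library. The following is a minimal sketch, not part of iNLTK: `SUPPORTED_LANGUAGES` and `language_code` are hypothetical names, and the mapping is copied from the table above.

```python
# Language names and codes copied from the "Supported languages" table.
# This dict and the helper below are illustrative, not part of iNLTK.
SUPPORTED_LANGUAGES = {
    "hindi": "hi", "punjabi": "pa", "sanskrit": "sa", "gujarati": "gu",
    "kannada": "kn", "malayalam": "ml", "nepali": "ne", "odia": "or",
    "marathi": "mr", "bengali": "bn", "tamil": "ta", "urdu": "ur",
}


def language_code(name_or_code):
    """Return the two-letter code for a supported language name or code."""
    key = name_or_code.strip().lower()
    if key in SUPPORTED_LANGUAGES:
        return SUPPORTED_LANGUAGES[key]
    if key in SUPPORTED_LANGUAGES.values():
        return key
    raise ValueError(f"Unsupported language: {name_or_code!r}")
```

For example, `language_code("Hindi")` and `language_code("hi")` both return `"hi"`, while an unlisted language such as `"te"` raises `ValueError`.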

Usage

Setup the language

from inltk.inltk import setup

setup('<code-of-language>')  # for Hindi, use setup('hi')

Note: You need to run setup('<code-of-language>') when you use a language for the FIRST TIME ONLY. This will download all the necessary models required to do inference for that language.
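
If an application should run setup only the first time a language is used, one option is to record a marker file per language. This is a sketch under stated assumptions: `ensure_setup`, the marker directory, and the `.done` file convention are hypothetical and not part of iNLTK, which may handle repeated setup calls differently. The `setup_fn` parameter exists only so the helper can be exercised without downloading models.

```python
from pathlib import Path


def ensure_setup(code, marker_dir, setup_fn=None):
    """Run setup_fn(code) unless a marker file for this language exists.

    Returns True if setup ran, False if it was skipped.
    The marker-file convention is an assumption of this sketch.
    """
    if setup_fn is None:
        # Imported lazily so the helper can be tried without iNLTK installed.
        from inltk.inltk import setup as setup_fn
    marker = Path(marker_dir) / f"{code}.done"
    if marker.exists():
        return False
    setup_fn(code)
    # Record success only after setup_fn returned without raising.
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text("ok")
    return True
```

Calling `ensure_setup('hi', '/some/cache/dir')` twice would then run the real `setup('hi')` only on the first call.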

Tokenize

from inltk.inltk import tokenize

tokenize(text, '<code-of-language>')  # where text is a string in <code-of-language>

Get Embedding Vectors

This returns a list of embedding vectors, one 400-dimensional vector for every token in the text.

from inltk.inltk import get_embedding_vectors

vectors = get_embedding_vectors(text, '<code-of-language>')  # where text is a string in <code-of-language>

Example:

>> vectors = get_embedding_vectors('भारत', 'hi')
>> vectors[0].shape
(400,)

>> get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
[array([-0.894777, -0.140635, -0.030086, -0.669998, ...,  0.859898,  1.940608,  0.09252 ,  1.043363], dtype=float32), array([ 0.290839,  1.459981, -0.582347,  0.27822 , ..., -0.736542, -0.259388,  0.086048,  0.736173], dtype=float32), array([ 0.069481, -0.069362,  0.17558 , -0.349333, ...,  0.390819,  0.117293, -0.194081,  2.492722], dtype=float32), array([-0.37837 , -0.549682, -0.497131,  0.161678, ...,  0.048844, -1.090546,  0.154555,  0.925028], dtype=float32), array([ 0.219287,  0.759776,  0.695487,  1.097593, ...,  0.016115, -0.81602 ,  0.333799,  1.162199], dtype=float32), array([-0.31529 , -0.281649, -0.207479,  0.177357, ...,  0.729619, -0.161499, -0.270225,  2.083801], dtype=float32), array([-0.501414,  1.337661, -0.405563,  0.733806, ..., -0.182045, -1.413752,  0.163339,  0.907111], dtype=float32), array([ 0.185258, -0.429729,  0.060273,  0.232177, ..., -0.537831, -0.51664 , -0.249798,  1.872428], dtype=float32)]
>> vectors = get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
>> len(vectors)
8

Links to embedding visualizations on Embedding Projector for all the supported languages are given in the table below.

Predict Next 'n' words

from inltk.inltk import predict_next_words

predict_next_words(text, n, '<code-of-language>')

# text --> string in <code-of-language>
# n --> number of words you want to predict (integer)

Note: You can also pass a fourth parameter, randomness, to predict_next_words. It has a default value of 0.8.

Identify language

Note: If you update the version of iNLTK, you need to run reset_language_identifying_models before identifying language.

from inltk.inltk import identify_language, reset_language_identifying_models

reset_language_identifying_models() # only if you've updated iNLTK version
identify_language(text)

# text --> string in one of the supported languages

Example:

>> identify_language('न्यायदर्शनम् भारतीयदर्शनेषु अन्यतमम्। वैदिकदर्शनेषु ')
'sanskrit'

Remove foreign languages

from inltk.inltk import remove_foreign_languages

remove_foreign_languages(text, '<code-of-language>')

# text --> string in one of the supported languages
# <code-of-language> --> code of the language whose words you want to retain

Example:

>> remove_foreign_languages('विकिपीडिया सभी विषयों ਇੱਕ ਅਲੌਕਿਕ ਨਜ਼ਾਰਾ ਬੱਝਾ ਹੋਇਆ ਸਾਹਮਣੇ ਆ ਖਲੋਂਦਾ ਸੀ पर प्रामाणिक और 维基百科:关于中文维基百科 उपयोग, परिवर्तन 维基百科:关于中文维基百科', 'hi')
['▁विकिपीडिया', '▁सभी', '▁विषयों', '', '<unk>', '', '<unk>', '', '<unk>', '', '<unk>', '', '<unk>', '', '<unk>', '', '<unk>', '', '<unk>', '', '<unk>', '▁पर', '▁प्रामाणिक', '▁और', '', '<unk>', ':', '<unk>', '▁उपयोग', ',', '▁परिवर्तन', '', '<unk>', ':', '<unk>']

Every word that is not in the host language becomes <unk>, and ▁ signifies the space character.

Check out this notebook by Amol Mahajan, where he uses iNLTK to remove foreign characters from the iitb_en_hi_parallel corpus.
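
The returned token list usually needs post-processing before it reads as text again. The sketch below is not part of iNLTK: `clean_tokens` is a hypothetical helper that drops `<unk>` and empty tokens and treats the leading ▁ seen in the sample output as a word-boundary marker.

```python
def clean_tokens(tokens, unk="<unk>", space="▁"):
    """Drop unknown/empty tokens and rebuild a plain string.

    Assumes tokens look like the sample output above: words carry a
    leading "▁" and foreign words appear as "<unk>".
    """
    kept = [t for t in tokens if t and t != unk and t != space]
    return "".join(kept).replace(space, " ").strip()


# First few tokens of the sample output, shortened for illustration.
tokens = ['▁विकिपीडिया', '▁सभी', '▁विषयों', '', '<unk>', '▁पर']
print(clean_tokens(tokens))
```

Punctuation tokens such as `','` carry no ▁ prefix, so they stay attached to the preceding word when the list is joined.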

Get Sentence Encoding

from inltk.inltk import get_sentence_encoding

get_sentence_encoding(text, '<code-of-language>')

Example: 

>> encoding = get_sentence_encoding('मुझे अपने देश से', 'hi')
>> encoding.shape
(400,)

get_sentence_encoding returns a 400-dimensional encoding of the sentence, produced by the ULMFiT LM encoder for <code-of-language>. The encoders were trained in the repositories linked below.

Get Sentence Similarity

from inltk.inltk import get_sentence_similarity

get_sentence_similarity(sentence1, sentence2, '<code-of-language>', cmp=cos_sim)

# sentence1, sentence2 --> strings in <code-of-language>
# cmp --> function used to compare the two encodings; defaults to cosine similarity (cos_sim)

Example: 

>> get_sentence_similarity('मैं इन दोनों श्रेणियों के बीच कुछ भी सामान्य नहीं देखता।', 'मैंने कन्फेक्शनरी स्टोर्स पर सेब और संतरे की कीमतों की तुलना की', 'hi')
0.126698300242424

>> get_sentence_similarity('मैं इन दोनों श्रेणियों के बीच कुछ भी सामान्य नहीं देखता।', 'यहां कोई तुलना नहीं है। आप सेब की तुलना संतरे से कर रहे हैं', 'hi')
0.25467658042907715

get_sentence_similarity returns the similarity between two sentences, computed by applying the comparison function (cosine similarity by default) to the encoding vectors of the two sentences.
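
To make the comparison step concrete, here is a plain-Python sketch of cosine similarity and one possible alternative. Both functions are illustrative and are not iNLTK's own implementation. They assume the comparison function receives the two encoding vectors and returns a single number; `euclidean_similarity` is a hypothetical example of a custom function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_similarity(a, b):
    """Hypothetical alternative: maps Euclidean distance into (0, 1]."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)
```

Under the assumption above, a custom function could be passed as `get_sentence_similarity(sentence1, sentence2, 'hi', cmp=euclidean_similarity)`.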

Get Similar Sentences

from inltk.inltk import get_similar_sentences

get_similar_sentences(sentence, no_of_variants, '<code-of-language>')


Example:

>> get_similar_sentences('मैं आज बहुत खुश हूं', 10, 'hi')
['मैं आजकल बहुत खुश हूं',
 'मैं आज काफ़ी खुश हूं',
 'मैं आज काफी खुश हूं',
 'मैं अब बहुत खुश हूं',
 'मैं आज अत्यधिक खुश हूं',
 'मैं अभी बहुत खुश हूं',
 'मैं आज बहुत हाजिर हूं',
 'मैं वर्तमान बहुत खुश हूं',
 'मैं आज अत्यंत खुश हूं',
 'मैं सदैव बहुत खुश हूं']

get_similar_sentences returns a list of length no_of_variants containing sentences that are similar to sentence.

Repositories containing models used in iNLTK

| Language | Repository | Perplexity of Language model | Wikipedia Articles Dataset | Classification accuracy | Classification Kappa score | Embeddings visualization on Embedding projector |
|---|---|---|---|---|---|---|
| Hindi | NLP for Hindi | ~36 | 55,000 articles | ~79 (News Classification) | ~30 (Movie Review Classification) | Hindi Embeddings projection |
| Punjabi | NLP for Punjabi | ~13 | 44,000 articles | ~89 (News Classification) | ~60 (News Classification) | Punjabi Embeddings projection |
| Sanskrit | NLP for Sanskrit | ~6 | 22,273 articles | ~70 (Shloka Classification) | ~56 (Shloka Classification) | Sanskrit Embeddings projection |
| Gujarati | NLP for Gujarati | ~34 | 31,913 articles | ~91 (News Classification) | ~85 (News Classification) | Gujarati Embeddings projection |
| Kannada | NLP for Kannada | ~70 | 32,997 articles | ~94 (News Classification) | ~90 (News Classification) | Kannada Embeddings projection |
| Malayalam | NLP for Malayalam | ~26 | 12,388 articles | ~94 (News Classification) | ~91 (News Classification) | Malayalam Embeddings projection |
| Nepali | NLP for Nepali | ~32 | 38,757 articles | ~97 (News Classification) | ~96 (News Classification) | Nepali Embeddings projection |
| Odia | NLP for Odia | ~27 | 17,781 articles | ~95 (News Classification) | ~92 (News Classification) | Odia Embeddings Projection |
| Marathi | NLP for Marathi | ~18 | 85,537 articles | ~91 (News Classification) | ~84 (News Classification) | Marathi Embeddings projection |
| Bengali | NLP for Bengali | ~41 | 72,374 articles | ~94 (News Classification) | ~92 (News Classification) | Bengali Embeddings projection |
| Tamil | NLP for Tamil | ~20 | >127,000 articles | ~97 (News Classification) | ~95 (News Classification) | Tamil Embeddings projection |
| Urdu | NLP for Urdu | ~13 | >150,000 articles | ~94 (News Classification) | ~90 (News Classification) | Urdu Embeddings projection |
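
The table reports both accuracy and a Kappa score, and the two can diverge. This README does not say how the numbers were computed, so the sketch below makes two assumptions: that the score is Cohen's kappa, and that the table shows it as a percentage. `accuracy_and_kappa` is a hypothetical helper that returns both values as fractions.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists.

    Not defined when chance agreement equals 1 (a single class everywhere).
    """
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance, from the two label distributions.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# A classifier that always predicts the majority class.
y_true = ["pos"] * 8 + ["neg"] * 2
y_pred = ["pos"] * 10
print(accuracy_and_kappa(y_true, y_pred))  # accuracy 0.8, kappa 0.0
```

The example shows why kappa is worth reporting alongside accuracy: always guessing the majority class scores 80% accuracy but has zero agreement beyond chance.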

Contributing

Add a new language support for iNLTK

If you would like to add support for a language of your choice to iNLTK, please start by checking for or raising an issue here.

Please check out the steps I mentioned here for Telugu to begin with. They should be nearly the same for other languages.

Improving models/Using models for your own research

If you would like to take iNLTK's models and refine them with your own dataset or build your own custom models on top of it, please check out the repositories in the above table for the language of your choice. The repositories above contain links to datasets, pretrained models, classifiers and all of the code for that.

Add new functionality

If you would like a particular functionality in iNLTK, start by checking for or raising an issue here.

What's next (and being worked upon)

Shout out if you want to help :)

  • Add Telugu and Maithili support
  • Add NER support
  • Add Textual Entailment support
  • Add English to iNLTK

What's next - (and NOT being worked upon)

Shout out if you want to lead :)

