
Python - NSW package for Vietnamese: Normalization system to convert numbers, abbreviations, and words that cannot be pronounced into syllables

TruongScotl/Vinorm

Install ViNorm package

pip install vinorm

Usage in a Python script

from vinorm import TTSnorm
# Normalize a sentence containing dates, numbers, and an unknown token ("xmz")
S = TTSnorm("Hàm này được phát triển từ 8/2019. Có phải tháng 12/2020 đã có vaccine phòng ngừa Covid-19 xmz ?")
print(S)

Options

TTSnorm(text, punc=False, unknown=True, lower=True, rule=False)
  • lower: If true, return the normalized text in lowercase
  • rule: If true, normalize with regex rules only, without dictionary checking (this flag is not used together with other flags)
  • punc: If true, do not replace punctuation with periods and commas
  • unknown: If true, replace unknown words: discard undefined words that contain no vowel, and do not spell out words that contain a vowel
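To see what each option changes, it can help to run the same sentence under several flag combinations. The sketch below is only an illustration: compare_options is a hypothetical helper, not part of vinorm, and it assumes the normalizer accepts the keyword arguments listed above. The normalizer is passed in as an argument so the helper can be tried without the package installed.

```python
def compare_options(text, normalizer):
    """Run `normalizer` on `text` under several flag combinations.

    `normalizer` is assumed to accept the keyword arguments documented
    above (punc, unknown, lower, rule), as TTSnorm does.
    """
    variants = {
        "defaults": {},
        "keep_case": {"lower": False},
        "keep_punctuation": {"punc": True},
        # `rule` is passed on its own, since it is not combined with other flags.
        "regex_only": {"rule": True},
    }
    return {name: normalizer(text, **flags) for name, flags in variants.items()}


# Usage with the real package (requires `pip install vinorm`):
#   from vinorm import TTSnorm
#   for name, output in compare_options("Có phải tháng 12/2020 ?", TTSnorm).items():
#       print(name, "->", output)
```

Printing the results side by side makes it easy to check which flag is responsible for a given difference in the output.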

From version 2.0, unknown words are no longer replaced; they are skipped so that espeak can handle them in the phonetization step

  • This version does not parse cases such as "Tổ chức WTO": WTO is not in the dictionary, so it is treated as unknown and kept as-is instead of being spelled out as in version 1.0. The aim is to let espeak handle such words; the drawback is that espeak's output for this case is "ve1kɛɜpte1ɔ7", which is not split into syllables.
  • New entities need to be added to the dictionary
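If syllable-level output is needed for acronyms such as WTO, one possible workaround is to spell them out letter by letter before normalization, similar in spirit to what version 1.0 did. The following is a minimal sketch and not vinorm's implementation: spell_acronyms and the LETTER_NAMES table are hypothetical, and the table covers only a few letters for illustration.

```python
import re

# Illustrative subset of Vietnamese letter names (an assumption, not vinorm's table).
LETTER_NAMES = {
    "A": "a", "B": "bê", "C": "xê", "D": "dê", "G": "giê", "H": "hát",
    "O": "o", "T": "tê", "V": "vê", "W": "vê kép",
}


def spell_acronyms(text, known_words=frozenset()):
    """Spell all-caps tokens letter by letter, leaving known words untouched."""
    def spell(match):
        token = match.group(0)
        # Keep the token if it is known, or if any letter lacks a name in the table.
        if token in known_words or any(ch not in LETTER_NAMES for ch in token):
            return token
        return " ".join(LETTER_NAMES[ch] for ch in token)

    # Only runs of two or more ASCII capitals are treated as acronyms.
    return re.sub(r"\b[A-Z]{2,}\b", spell, text)


print(spell_acronyms("Tổ chức WTO"))  # -> Tổ chức vê kép tê o
```

Tokens listed in known_words are passed through unchanged, so entities that are already in the dictionary are not spelled out.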

For the latest version, see: https://github.com/NoahDrisort/vinorm

For version 1.0, which spells unknown words character by character, check the previous commits

For the Mac version: https://github.com/v-nhandt21/Vinorm/tree/vinorm_mac

For the C++ version: https://github.com/NoahDrisort/vinorm_cpp_version

Update PyPI

python setup.py sdist bdist_wheel
twine upload dist/*
