Resource and Tool for Writing System Identification -- LREC 2024

cisnlp/GlotScript

GlotScript

  • GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.

  • GlotScript-Resource: provides a resource listing the writing systems used for various languages.

GlotScript Resource

What writing system is each language written in?

See the metadata folder.

GlotScript Tool

Detect the script (writing system) of text based on ISO 15924.

Special codes

  • Zinh ("Inherited") is the Unicode script property value for characters that may be used with multiple scripts and that inherit their script from the preceding base character. In some cases, we opted to reassign parts of the Zinh range (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) to a different block.
  • Zyyy ("Common") is the Unicode script property value for characters shared across scripts, such as punctuation and symbols.
  • Zzzz is the code for "uncoded" script (Unicode script property value "Unknown").
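
Because these three codes do not name a concrete writing system, it can be useful to set them aside when reading a score breakdown. The snippet below is a minimal sketch of that idea, not part of GlotScript: `dominant_real_script` and `SPECIAL_CODES` are hypothetical names, and the input is assumed to be a dictionary of per-script shares shaped like the `details` field in the example outputs of this README.

```python
# Special ISO 15924 codes described above (assumed set, taken from this README).
SPECIAL_CODES = {"Zinh", "Zyyy", "Zzzz"}


def dominant_real_script(details):
    """Return (script, share) for the top non-special script, or None.

    `details` is assumed to map ISO 15924 codes to shares of the text.
    """
    real = {code: share for code, share in details.items()
            if code not in SPECIAL_CODES}
    if not real:
        return None  # only Inherited/Common/Unknown characters were seen
    best = max(real, key=real.get)
    return best, real[best]


# Shares copied from the mixed musical-symbol/cuneiform example in this README.
print(dominant_real_script({"Xsux": 0.5, "Zyyy": 0.5}))  # ('Xsux', 0.5)
```

This helper does not rescale the remaining shares; whether to renormalize them after dropping the special codes depends on the application.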

Install from pip

pip3 install GlotScript

Install from git

pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript

Usage: Script Detection

from GlotScript import get_script_predictor
sp = get_script_predictor()

Or import the ready-made predictor directly:

from GlotScript import sp
sp('これは日本人です')
>> ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})
sp('This is Latin')[:2]
>> ('Latn', 1.0)
sp('මේක සිංහල')[0]
>> 'Sinh'
sp('𝄞𝄫  𒊕𒀸')
>> ('Xsux', 0.5, {'details': {'Xsux': 0.5, 'Zyyy': 0.5}, 'tie': True, 'interval': 0.0})
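
The outputs above are tuples of the form (script, score, info), where info carries a `tie` flag. One way to consume such a result is to accept a prediction only when it is untied and scores above a threshold. The sketch below assumes that tuple shape; `confident_script` and its `min_score` parameter are hypothetical and not part of the GlotScript API.

```python
def confident_script(result, min_score=0.9):
    """Return the script code if the prediction is untied and scores at
    least `min_score`; otherwise return None.

    `result` is assumed to be a (script, score, info) tuple shaped like
    the example outputs in this README.
    """
    script, score, info = result
    if info.get("tie") or score < min_score:
        return None
    return script


# Results copied from the examples above, so GlotScript is not needed to run this.
japanese = ("Hira", 0.625,
            {"details": {"Hira": 0.625, "Hani": 0.375}, "tie": False, "interval": 0.25})
mixed = ("Xsux", 0.5,
         {"details": {"Xsux": 0.5, "Zyyy": 0.5}, "tie": True, "interval": 0.0})

print(confident_script(japanese))                 # None (0.625 is below 0.9)
print(confident_script(japanese, min_score=0.5))  # Hira
print(confident_script(mixed, min_score=0.5))     # None (tie)
```

With the library installed, the same helper could presumably be called as `confident_script(sp(text))`.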

Usage: Script Separation

from GlotScript import separate_script
sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
separate_script(sent)
>> {
   "Latn":"Hello Salut     ",
   "Hebr":"     שלום ",
   "Arab":"  سلام    مرحبا",
   "Hani":"   你好   ",
   "Hira":"    こんにちは  "
}
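
In the output above, each per-script string retains whitespace from the original sentence. If only the text segments are needed, the result can be tidied afterwards. The following is a small post-processing sketch under that assumption; `tidy_segments` is a hypothetical helper, and the sample dictionary is illustrative input modeled on the output format above (the whitespace-only `Zyyy` entry is invented for the demonstration).

```python
def tidy_segments(separated):
    """Collapse whitespace runs in each per-script string and drop entries
    that contain nothing but whitespace."""
    tidy = {}
    for script, text in separated.items():
        cleaned = " ".join(text.split())  # split() also strips both ends
        if cleaned:
            tidy[script] = cleaned
    return tidy


# Illustrative input shaped like a separate_script result.
sample = {"Latn": "Hello Salut     ", "Hani": "   你好   ", "Zyyy": "   "}
print(tidy_segments(sample))  # {'Latn': 'Hello Salut', 'Hani': '你好'}
```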

Exploring Unicode Blocks: Related Sources

Citation

If you use any part of this library in your research, please cite it using the following BibTeX entry.

@article{kargaran2023glotscript,
title        = {GlotScript: A Resource and Tool for Low Resource Writing System Identification},
author       = {Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
year         = 2023,
journal      = {arXiv preprint arXiv:2309.13320}
}