- GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.
- GlotScript-Resource: provides a resource displaying the writing systems for various languages.
GlotScript-Resource answers the question: what writing system is each language written in? See the metadata folder.
GlotScript-Tool detects the script (writing system) of a text and reports it as an ISO 15924 code.
- Unicode version: 15.0.0
- The script codes were sourced from the Wikipedia article on ISO 15924.
- Unicode ranges were extracted from the Unicode Character Database.
- Zinh: the Unicode script property value of characters that may be used with multiple scripts and that inherit their script from a preceding base character. In some cases, we opted to integrate parts of the Zinh range (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.
- Zyyy: the Unicode script code for "Common" characters.
- Zzzz: the Unicode script code for "uncoded" script.
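To make the role of these special codes concrete, here is a minimal sketch of a range-based lookup. The `RANGES` table and the `script_of` helper are hypothetical and hand-picked for illustration; they are not GlotScript's actual data or API, which are derived from the full Unicode Character Database.

```python
from bisect import bisect_right

# Hypothetical, hand-picked (start, end, code) ranges, sorted by start.
# The real table covers every script in the Unicode Character Database.
RANGES = [
    (0x0020, 0x0040, "Zyyy"),  # space, ASCII punctuation, digits (Common)
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x0300, 0x036F, "Zinh"),  # combining diacritical marks (Inherited)
    (0x3041, 0x3096, "Hira"),  # Hiragana letters
    (0x4E00, 0x9FFF, "Hani"),  # CJK Unified Ideographs
]
STARTS = [start for start, _, _ in RANGES]


def script_of(ch: str) -> str:
    """Return the script code for one character, or 'Zzzz' if no range matches."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        _, end, code = RANGES[i]
        if cp <= end:
            return code
    # Simplification: anything missing from this toy table is treated as uncoded.
    return "Zzzz"
```

In this sketch a combining acute accent (U+0301) maps to `Zinh`, a space maps to `Zyyy`, and an unassigned code point such as U+0378 falls through to `Zzzz`.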
Install from PyPI:
pip3 install GlotScript
Or install directly from the GitHub repository:
pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript
from GlotScript import get_script_predictor
sp = get_script_predictor()
OR
from GlotScript import sp
sp('これは日本人です')
>> ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})
sp('This is Latin')[:1]
>> ('Latn', 1.0)
sp('මේක සිංහල')[0]
>> 'Sinh'
sp('𝄞𝄫 𒊕𒀸')
>> ('Xsux', 0.5, {'details': {'Xsux': 0.5, 'Zyyy': 0.5}, 'tie': True, 'interval': 0.0})
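The outputs above suggest how the returned tuple can be read: the score is the share of characters in the winning script, `details` lists the share of every script found, `interval` is the gap between the top two shares, and `tie` is set when that gap is zero. The sketch below reproduces that reading under stated assumptions. `toy_script_of` and `predict` are hypothetical names, whitespace is assumed to be ignored, and the handling of empty input is a guess; none of this is taken from GlotScript's implementation.

```python
from collections import Counter


def toy_script_of(ch):
    """Hypothetical stand-in for a full Unicode script lookup."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x3096:
        return "Hira"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Hani"
    if ch.isascii() and ch.isalpha():
        return "Latn"
    return "Zyyy"


def predict(text, script_of=toy_script_of):
    """Return (main_script, score, info), mirroring the output shape shown above."""
    # Assumption: whitespace does not count towards any script.
    counts = Counter(script_of(ch) for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        # Assumption: how the library treats empty input is not documented here.
        return None, 0.0, {"details": {}, "tie": False, "interval": 0.0}
    # Shares per script, ordered from most to least frequent.
    details = {code: n / total for code, n in counts.most_common()}
    ranked = list(details.items())
    best_code, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    info = {"details": details, "tie": best == runner_up, "interval": best - runner_up}
    return best_code, best, info
```

With the toy lookup, `predict('これは日本人です')` counts five Hiragana and three Han characters, giving the same shares (0.625 and 0.375) and interval (0.25) as the first example.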
from GlotScript import separate_script
sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
separate_script(sent)
>> {
"Latn":"Hello Salut ",
"Hebr":" שלום ",
"Arab":" سلام مرحبا",
"Hani":" 你好 ",
"Hira":" こんにちは "
}
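As the example shows, the returned segments can carry leading or trailing spaces. If that matters downstream, a small post-processing step can normalize them. The helper below, `tidy_script_segments`, is a hypothetical name and not part of GlotScript; it only assumes the dictionary shape shown above (script code mapped to text).

```python
def tidy_script_segments(segments):
    """Collapse whitespace in each segment and drop segments that end up empty."""
    tidy = {}
    for code, text in segments.items():
        # split() with no arguments trims the ends and collapses inner runs.
        collapsed = " ".join(text.split())
        if collapsed:
            tidy[code] = collapsed
    return tidy


# Same shape as the separate_script output shown above.
example = {
    "Latn": "Hello Salut ",
    "Hebr": " שלום ",
    "Arab": " سلام مرحبا",
    "Hani": " 你好 ",
    "Hira": " こんにちは ",
}
tidied = tidy_script_segments(example)
```

The key order of the input dictionary is preserved, so the result can be compared directly against the original output.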
- List of Unicode characters - Wikipedia
- Lightweight Plain-Text Editor for macOS - CotEditor
- The Cygwin Terminal – terminal emulator for Cygwin, MSYS, and WSL - mintty
- ISO_15924 Wikipedia
- Unicode Character Database (Blocks) - Unicode
- Unicode Character Database (Scripts) - Unicode
- A free, web-based font editor, focusing on font design hobbyists. - Glyphr-Studio-1
- Kotlin - JetBrains
- UNIX-like reverse engineering framework and command-line toolset - radare2
- FreeOrion Game
- DOMinator - Firefox
- SHSans-derived CJK font family - glow-sans
- Unicode Subset Bitfields - Microsoft
- Stops - FAIR NLLB FB
- Gradient Boosting on Decision Trees - catboost
- Blender
- Unicode Wikipedia
If you use any part of this library in your research, please cite it using the following BibTeX entry.
@article{kargaran2023glotscript,
title = {GlotScript: A Resource and Tool for Low Resource Writing System Identification},
author = {Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
year = 2023,
journal = {arXiv preprint arXiv:2309.13320}
}