IndQNER

IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

3114 sentences
48689 tokens
2476 named entities
18 named entity categories
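
To illustrate the BIO format mentioned above, here is a minimal sketch that groups BIO tags back into entity spans. The example sentence and the tag strings (`B-Allah`, `B-Messenger`, `B-GeographicalLocation`) are assumptions made for illustration; they are not taken from the dataset files, whose exact label spellings may differ.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collect (label, start, end) spans from BIO tags; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Lenient handling: a stray "I-" opens a new span.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example, not drawn from the dataset.
tokens = ["Allah", "mengutus", "Musa", "ke", "Tanah", "Suci"]
tags = ["B-Allah", "O", "B-Messenger", "O",
        "B-GeographicalLocation", "I-GeographicalLocation"]

for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Each `B-` tag opens an entity, following `I-` tags with the same label extend it, and `O` marks tokens outside any entity.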

Named Entity Classes

The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes as follows:

Allah
Allah's Throne
Artifact
Astronomical body
Event
False deity
Holy book
Language
Angel
Person
Messenger
Prophet
Sentient
Afterlife location
Geographical location
Color
Religion
Food
Fruit
The book of Allah

Annotation Stage

Eight annotators contributed to the annotation process, all of them Informatics Engineering students at the State Islamic University Syarif Hidayatullah Jakarta:

Anggita Maharani Gumay Putri
Muhammad Destamal Junas
Naufaldi Hafidhigbal
Nur Kholis Azzam Ubaidillah
Puspitasari
Septiany Nur Anggita
Wilda Nurjannah
William Santoso

Verification Stage

We found many named entity and class candidates during the annotation stage. To verify them, we consulted Quran and Tafseer (content) experts, who are lecturers in the Quran and Tafseer Department of the State Islamic University Syarif Hidayatullah Jakarta:

Dr. Lilik Ummi Kultsum, MA
Dr. Jauhar Azizy, MA
Dr. Eva Nugraha, M.Ag.

Evaluation

We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning). The first model obtained an F1 score of 0.95 and the second one yielded an F1 score of 0.64.
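
The README does not state how the F1 scores were computed. As a point of reference, the sketch below shows exact-match, entity-level F1, a common convention for NER; whether the reported numbers follow this convention is an assumption. The `Span` type, the `span_f1` helper, and the toy gold and predicted sets are hypothetical.

```python
from typing import Set, Tuple

# (sentence index, label, start token, end token); end is exclusive.
Span = Tuple[int, str, int, int]


def span_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching entity spans."""
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: the second prediction has the right boundaries but the wrong class.
gold = {(0, "Allah", 0, 1), (0, "Messenger", 2, 3), (1, "Prophet", 0, 1)}
pred = {(0, "Allah", 0, 1), (0, "Person", 2, 3), (1, "Prophet", 0, 1)}

precision, recall, f1 = span_f1(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this convention a prediction counts as correct only when both its boundaries and its class match the gold annotation, so confusing two closely related classes is penalised in precision and recall alike.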

This dataset is also part of the NusaCrowd project, which aims to collect Natural Language Processing (NLP) datasets for Indonesian languages.

Contact

If you have any questions or feedback, feel free to contact us at [email protected] or [email protected].
