IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:
- 3114 sentences
- 48689 tokens
- 2476 named entities
- 18 named entity categories
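As a rough illustration of the BIO format mentioned above, the sketch below decodes a sequence of BIO tags into entity spans. The tag strings (`B-GeographicalLocation`, `B-Allah`) and the example tokens are hypothetical and are not taken from the dataset files; the actual label names in IndQNER may differ.

```python
from typing import List, Tuple

# (label, start index, end index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a list of BIO tags into (label, start, end) spans."""
    spans: List[Span] = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # The current entity ends at "O", at any "B-" tag,
        # or at an "I-" tag whose label differs from the open entity.
        ends_entity = (
            tag == "O"
            or tag.startswith("B-")
            or (tag.startswith("I-") and tag[2:] != label)
        )
        if ends_entity and label is not None:
            spans.append((label, start, i))
            label, start = None, None
        if tag.startswith("B-"):
            label, start = tag[2:], i
        elif tag.startswith("I-") and label is None:
            # Lenient handling: a stray "I-" tag opens a new entity.
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example; tokens and tag names are for illustration only.
tokens = ["menuju", "Masjidil", "Haram", "karena", "Allah"]
tags = ["O", "B-GeographicalLocation", "I-GeographicalLocation", "O", "B-Allah"]

for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

A multi-token entity is marked with one `B-` tag followed by `I-` tags, so "Masjidil Haram" in this example is recovered as a single span.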
The named entity classes were initially defined by analyzing the existing Quran concepts ontology and were then updated based on information acquired during the annotation process. The final set comprises 20 classes:
- Allah
- Allah's Throne
- Artifact
- Astronomical body
- Event
- False deity
- Holy book
- Language
- Angel
- Person
- Messenger
- Prophet
- Sentient
- Afterlife location
- Geographical location
- Color
- Religion
- Food
- Fruit
- The book of Allah
Eight annotators contributed to the annotation process. They are Informatics Engineering students at the State Islamic University Syarif Hidayatullah Jakarta:
- Anggita Maharani Gumay Putri
- Muhammad Destamal Junas
- Naufaldi Hafidhigbal
- Nur Kholis Azzam Ubaidillah
- Puspitasari
- Septiany Nur Anggita
- Wilda Nurjannah
- William Santoso
We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts, who are lecturers at the Quran and Tafseer Department of the State Islamic University Syarif Hidayatullah Jakarta:
- Dr. Lilik Ummi Kultsum, MA
- Dr. Jauhar Azizy, MA
- Dr. Eva Nugraha, M.Ag.
We evaluated the annotation quality of IndQNER by running experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning). The first model obtained an F1 score of 0.95, and the second yielded an F1 score of 0.64.
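For readers unfamiliar with how an F1 score is typically computed for NER, here is a minimal sketch of exact-match, micro-averaged, entity-level F1. This is an assumption about the metric: the text above does not state whether the reported scores are entity-level or token-level, nor which averaging was used. The function name `entity_f1` and the example spans are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# (label, start index, end index exclusive)
Span = Tuple[str, int, int]


def entity_f1(
    gold: List[Iterable[Span]], pred: List[Iterable[Span]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over per-sentence entity spans.

    A predicted entity counts as correct only if its label and both
    boundaries match a gold entity in the same sentence exactly.
    """
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        g: Set[Span] = set(gold_spans)
        p: Set[Span] = set(pred_spans)
        tp += len(g & p)   # predicted and correct
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical one-sentence example: one entity is predicted correctly,
# one has a wrong boundary (counted as both a false positive and a miss).
gold = [{("Person", 1, 3), ("Allah", 4, 5)}]
pred = [{("Person", 1, 3), ("Allah", 3, 5)}]
precision, recall, f1 = entity_f1(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Under this convention a partially overlapping prediction receives no credit, which is stricter than token-level scoring.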
This dataset is also part of the NusaCrowd project, which aims to collect Natural Language Processing (NLP) datasets for Indonesian languages.
If you have any questions or feedback, feel free to contact us at [email protected] or [email protected]