This repository contains code introduced in the following paper:
Neural Coreference Resolution for Arabic
Abdulrahman Aloraini*, Juntao Yu* and Massimo Poesio (*equal contribution)
In Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC@COLING), 2020.
- The code is written in Python 2; compatibility with Python 3 is not guaranteed.
- Before starting, you need to install all the required packages listed in requirements.txt using `pip install -r requirements.txt`.
- After that, run `setup.sh` to download the fastText embeddings required by the system and to compile the TensorFlow custom kernels (a loading sketch of the embedding file format follows below).
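For reference, fastText word embeddings are distributed as a plain-text .vec file: a header line with the vocabulary size and dimension, then one word per line followed by its vector values. The sketch below is only an illustration of that format; the filename is an assumption, not necessarily what `setup.sh` downloads.

```python
# Minimal sketch of reading a fastText .vec text file into a dict.
# The filename below is an assumption; use whichever file setup.sh downloads.
import io
import numpy as np

def load_fasttext_vec(path):
    embeddings = {}
    with io.open(path, encoding="utf-8") as f:
        header = f.readline().split()   # "<vocab_size> <dimension>"
        dim = int(header[1])
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return embeddings, dim

# vectors, dim = load_fasttext_vec("cc.ar.300.vec")  # hypothetical filename
```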
Pre-trained models can be downloaded from this link. We provide two pre-trained models:
- One (arabic_cleaned_arabert) is for Lee et al. (2018)-style training.
- The second (arabic_cleaned_arabert_e2e_annealing) uses the predicted mention output from Yu et al. (2020); it is also the best model from our paper.
- We include the predicted mentions used in our evaluation for all three datasets (train, dev and test sets).
- In the folder you will also find a file called char_vocab.arabic.txt, which is the vocabulary file for the character-based embeddings used by our pre-trained models.
Put the downloaded models, along with char_vocab.arabic.txt, in the root folder of the code.
Modify the `test_path` and `conll_test_path` in your configuration accordingly:
- the `test_path` is the path to the .jsonlines file; each line of the .jsonlines file must be in the following format (the "pred_mentions" field is optional):

{"clusters": [[[0,0],[5,5]],[[2,3],[7,8]]], "pred_mentions": [[0,0],[2,3],[5,5],[7,9]], "doc_key": "nw", "sentences": [["John", "has", "a", "car", "."], ["He", "washed", "the", "car", "yesterday", "."], ["Really", "?", "it", "was", "raining", "yesterday", "!"]], "speakers": [["sp1", "sp1", "sp1", "sp1", "sp1"], ["sp1", "sp1", "sp1", "sp1", "sp1", "sp1"], ["sp2", "sp2", "sp2", "sp2", "sp2", "sp2", "sp2"]]}

- For "clusters" and "pred_mentions", each mention is given as [start_index, end_index]; the indices are counted at the document level and both are inclusive (see the sketch after this list).
- the `conll_test_path` is the path to the gold data file in CoNLL format; see the CoNLL-2012 shared task page for more details.
- For how to create the json and CoNLL files, please follow the instructions from Lee et al. (2018).
- You can preprocess the Arabic tokens by using `python preprocess_arabic.py test.jsonlines test.cleaned.jsonlines`.
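To make the document-level, inclusive span indexing above concrete, here is a minimal sketch (not a script shipped with this repository) that reads a .jsonlines file and prints the tokens covered by each predicted mention:

```python
# Minimal sketch: recover mention text from document-level, inclusive spans.
# This is an illustration only, not part of the repository.
import io
import json

def mention_tokens(example, start, end):
    # Flatten the sentences into one document-level token list,
    # then slice with an inclusive end index.
    doc_tokens = [tok for sent in example["sentences"] for tok in sent]
    return doc_tokens[start:end + 1]

with io.open("test.cleaned.jsonlines", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        for start, end in example.get("pred_mentions", []):
            print(example["doc_key"], (start, end), mention_tokens(example, start, end))
```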
- Then run `extract_bert_features.sh` to compute the BERT embeddings for the test set (see the illustration after these steps).
- Then use `python evaluate.py config_name` to start your evaluation.
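`extract_bert_features.sh` is the repository's own way of producing the embeddings. Purely as an illustration of what "computing BERT embeddings" means, the sketch below uses the Hugging Face transformers library, which is not part of this repository and is independent of its Python 2 code; the AraBERT checkpoint name is an assumption.

```python
# Illustration only: this repository uses extract_bert_features.sh, not this code.
# Shows what computing BERT embeddings for one tokenized sentence looks like.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "aubmindlab/bert-base-arabert"  # assumed AraBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

sentence = ["John", "has", "a", "car", "."]
inputs = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One vector per word piece from the top layer; hidden_states also holds
# the embedding layer plus every intermediate layer.
print(outputs.last_hidden_state.shape)
print(len(outputs.hidden_states))
```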
- To train your own model you first need to create the character vocabulary by using `python get_char_vocab.py train.jsonlines dev.jsonlines` (a sketch of the idea is given at the end of this section).
- Then you need to run `extract_bert_features.sh` to compute the BERT embeddings for the training, development and test sets.
- Finally, you can start training by using `python train.py config_name`.
The cluster ranking model takes about 40 hours to train (400k steps) on a GTX 1080Ti GPU.
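For reference, the character vocabulary step simply collects every character that occurs in the tokens of the training and development data. `get_char_vocab.py` is the authoritative implementation; the sketch below only illustrates the idea, and the one-character-per-line output format is an assumption.

```python
# Minimal sketch of building a character vocabulary from .jsonlines files.
# get_char_vocab.py in this repository is the authoritative implementation;
# the one-character-per-line output format used here is an assumption.
import io
import json

def collect_chars(paths):
    chars = set()
    for path in paths:
        with io.open(path, encoding="utf-8") as f:
            for line in f:
                example = json.loads(line)
                for sentence in example["sentences"]:
                    for token in sentence:
                        chars.update(token)
    return sorted(chars)

with io.open("char_vocab.arabic.txt", "w", encoding="utf-8") as out:
    for ch in collect_chars(["train.jsonlines", "dev.jsonlines"]):
        out.write(ch + u"\n")
```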