This repository contains DiscoGeM, a crowdsourced corpus of genre-mixed, inter-sentential implicit discourse relations annotated in PDTB3 style.
DiscoGeM contains 6,505 implicit discourse relations from three genres: political speech (Europarl), literature, and encyclopedic text (Wikipedia). It has since been extended with annotations for an additional 300 implicit relations from the Penn Discourse Treebank (newspaper text), to allow for a comparison between the annotation methodologies.
Each instance in DiscoGeM was annotated by 10 crowd workers, using a discourse connective insertion paradigm (see Yung et al., 2019; Scholman et al., 2022). In addition to the annotated dataset, we also make available the dataset with all annotator-level insertions and annotator quality scores.
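As a rough illustration of how the annotator-level data might be used, the sketch below turns the labels given by the crowd workers for one instance into a label distribution and a majority label. The sense labels and the in-memory list are hypothetical placeholders, not the actual file format or column names of this repository.

```python
from collections import Counter


def label_distribution(labels):
    """Return the relative frequency of each label among the annotators."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def majority_label(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical example: ten crowd-worker labels for a single relation instance.
annotations = ["result"] * 6 + ["conjunction"] * 3 + ["concession"]

distribution = label_distribution(annotations)
print(distribution)
print(majority_label(annotations))
```

Keeping the full distribution rather than only the majority label preserves the disagreement among annotators, which is often informative for implicit relations.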
A subset of the data was also annotated using a Question-Answer annotation paradigm (see Pyatkin et al., 2023). These annotations can be found in the folder QADiscourse_annotations.
If you use this resource, please consider citing:
@inproceedings{scholman2022DiscoGeM,
    title = "DiscoGeM: A Crowdsourced Corpus of Genre-Mixed Implicit Discourse Relations",
    author = "Scholman, Merel C. J. and
      Dong, Tianai and
      Yung, Frances and
      Demberg, Vera",
    booktitle = "Proceedings of the Thirteenth International Conference on Language Resources and Evaluation ({LREC}'22)",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association (ELRA)"
}
[1] Pyatkin, V., Yung, F., Scholman, M.C.J., Dagan, I., Tsarfaty, R., & Demberg, V. (2023). Design Choices for Crowdsourcing Implicit Discourse Relations: Revealing the Biases Introduced by Task Design. Transactions of the Association for Computational Linguistics (TACL).
[2] Scholman, M.C.J., Pyatkin, V., Yung, F., Dagan, I., Tsarfaty, R., & Demberg, V. (2022). Design Choices in Crowdsourcing Discourse Relation Annotations: The Effect of Worker Selection and Training. Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC'22), Marseille, France.
[3] Yung, F., Demberg, V., & Scholman, M.C.J. (2019). Crowdsourcing discourse relation annotations by a two-step connective insertion task. Proceedings of the 13th Linguistic Annotation Workshop, Florence, Italy.