This directory contains discourse parses according to enhanced Rhetorical Structure Theory (eRST), in multiple formats. The recommended native format for the discourse parses is the .rs4 XML format in the rstweb directory. The remaining formats are automatically converted from .rs4 by the GUM build bot.
- dependencies: RST dependency representation according to the algorithm described in Li et al. (2014), plus tree-breaking eRST secondary edges. The format resembles the 10 column tab-delimited conllu format. Each row is a single discourse unit, with an ID in column 1, text in column 2 (tokens separated by space) and the parent unit and relation name in columns 7-8. Column 9 is reserved for added, tree-breaking eRST relations. Note that multinuclear vs. satellite-nucleus relations can be distinguished by the suffixes `_m` and `_r` respectively. Other columns give additional information on attachment depth in the constituent tree, the head word of each EDU, POS tags, sentence types, etc., which are encoded in other formats of the corpus.
- disrpt: data in the official [DISRPT shared task](https://github.com/disrpt) format for three tasks: connective detection in files containing `.pdtb.` (plain tokenized `.tok` and treebanked `.conllu` data), EDU segmentation in files containing `.erst.` (plain tokenized `.tok` and treebanked `.conllu` data) and relation classification (`.rels`).
- gdtb: the GUM Discourse Treebank (GDTB) version of GUM discourse relations, following PDTB v3 guidelines, in two formats: original PDTB pipe format with standoff raw text, and DISRPT .rels format. See Liu et al. (2024) for more information.
- lisp_binary: binary branching constituent trees, with the head unit indicated using `SN` (satellite-nucleus), `NS` (nucleus-satellite) or `NN` (multinuclear). Only terminal EDU nodes have text content, with tokens separated by space and surrounded by `text _!..._!`.
- lisp_nary: same as lisp_binary, but trees are not guaranteed to be binary branching: multinuclear nodes may have n children, where n > 1. Corresponds more directly to the source .rs4 data in rstweb/, but cannot be used to train parsers which require binary trees.
- rstweb: source format for the enhanced RST (eRST, Zeldes et al. 2024) annotations, compatible with rstWeb (Zeldes 2016) and RSTTool. The format natively distinguishes multinuclear from satellite-nucleus relations and represents nested hierarchy and n-ary nodes. Can be used to visualize eRST trees (see the website for rstWeb at https://gucorpling.org/rstweb/info/ for examples).
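
As a rough illustration of the `dependencies` layout described above, the following Python sketch reads the ID, text, parent, relation and secondary edge columns of each row and splits off the `_m` / `_r` suffix. It is not part of the GUM tooling: the names `DiscourseUnit`, `split_relation` and `parse_dependencies` are hypothetical, the sample rows are invented for the example, and the use of `_` as a filler value and of `0` / `ROOT` for the root unit is an assumption borrowed from conllu conventions. Column 9 is kept as a raw string, since its internal syntax is not described here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiscourseUnit:
    unit_id: int          # column 1
    tokens: List[str]     # column 2, split on single spaces
    parent: int           # column 7
    relation: str         # column 8, with the _m / _r suffix removed
    kind: Optional[str]   # "multinuclear", "satellite-nucleus" or None
    secondary: str        # column 9, left unparsed


def split_relation(label):
    """Strip the _m / _r suffix and report which relation type it marks."""
    if label.endswith("_m"):
        return label[:-2], "multinuclear"
    if label.endswith("_r"):
        return label[:-2], "satellite-nucleus"
    return label, None  # no suffix, e.g. the assumed ROOT label


def parse_dependencies(text):
    """Parse tab-delimited rows, one discourse unit per row."""
    units = []
    for line in text.splitlines():
        # Assumption: blank lines and '#' comment lines carry no units.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        relation, kind = split_relation(cols[7])
        units.append(
            DiscourseUnit(
                unit_id=int(cols[0]),
                tokens=cols[1].split(" "),
                parent=int(cols[6]),
                relation=relation,
                kind=kind,
                secondary=cols[8],
            )
        )
    return units


# Invented sample rows; columns not discussed above are filled with "_".
SAMPLE = "\n".join(
    "\t".join(row)
    for row in [
        ["1", "The results were clear .", "_", "_", "_", "_", "0", "ROOT", "_", "_"],
        ["2", "because the sample was large", "_", "_", "_", "_", "1", "causal-cause_r", "_", "_"],
        ["3", "and the noise was low", "_", "_", "_", "_", "2", "joint-list_m", "_", "_"],
    ]
)

for unit in parse_dependencies(SAMPLE):
    print(unit.unit_id, unit.parent, unit.relation, unit.kind)
```

Keeping the suffix separate from the relation name makes it easy to group units by relation regardless of whether the edge is multinuclear or satellite-nucleus.
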
- Li, Sujian, Liang Wang, Ziqiang Cao & Wenjie Li (2014) Text-level discourse dependency parsing. In Proceedings of ACL 2014. Baltimore, MD, 25–35.
- Liu, Yang Janet, Tatsuya Aoyama, Wesley Scivetti, Yilun Zhu, Shabnam Behzad, Lauren Elizabeth Levine, Jessica Lin, Devika Tiwari & Amir Zeldes (2024) GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains. In Proceedings of EMNLP 2024. Miami, FL.
- Zeldes, Amir (2016) rstWeb - A Browser-based Annotation Interface for Rhetorical Structure Theory and Discourse Relations. In Proceedings of NAACL 2016 System Demonstrations. San Diego, CA, 1–5.
- Zeldes, Amir, Tatsuya Aoyama, Yang Janet Liu, Siyao Peng, Debopam Das & Luke Gessler (2024) eRST: A Signaled Graph Theory of Discourse Relations and Organization. Computational Linguistics.