new get_text.py script
  * Reddit data can now be reconstructed in top folders without rebuilding the corpus
amir-zeldes committed Feb 2, 2023
1 parent cd17cea commit 40af9bd
Showing 7 changed files with 985 additions and 48 deletions.
21 changes: 14 additions & 7 deletions README.md
@@ -19,9 +19,11 @@ This repository contains release versions of the Georgetown University Multilaye

The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: https://gucorpling.org/gum.

## A note about reddit data
## A note about Reddit data

For one of the twelve text types in this corpus, reddit forum discussions, plain text data is not supplied. To obtain this data, please run `_build/process_reddit.py`, then run `_build/build_gum.py`. This, and all data, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and reddit data is subject to reddit's terms and conditions. See [README_reddit.md](README_reddit.md) for more details.
For one of the twelve text types in this corpus, Reddit forum discussions, plain text data is not supplied, and you will find **underscores** in place of word forms in documents from this data (files named `GUM_reddit_*`). To obtain this data, please run `python get_text.py`, which will allow you to reconstruct the text in these files. This, and all data, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and Reddit data is subject to Reddit's terms and conditions. See [README_reddit.md](README_reddit.md) for more details.

Note that the `get_text.py` script only regenerates the files named `GUM_reddit_*` in each folder, and will not create full versions of the data in `PAULA/` and `annis/`. If you require PAULA XML or searchable ANNIS data containing these documents, you will need to recompile the corpus from the source files under `_build/src/`. To do this, run `_build/process_reddit.py`, then run `_build/build_gum.py`.

## Train / dev / test splits

@@ -78,16 +80,21 @@ For a full list of contributors please see [the corpus website](https://gucorpli

The corpus is downloadable in multiple formats. Not all formats contain all annotations: The most accessible format is probably CoNLL-U dependencies (in `dep/`), but the most complete XML representation is in [PAULA XML](https://www.sfb632.uni-potsdam.de/en/paula.html), and the easiest way to search in the corpus is using [ANNIS](http://corpus-tools.org/annis). Here is [an example query](https://gucorpling.org/annis/#_q=ZW50aXR5IC0-YnJpZGdlIGVudGl0eSAmICMxIC0-aGVhZCBsZW1tYT0ib25lIg&_c=R1VN&cl=5&cr=5&s=0&l=10) for phrases headed by 'one' bridging back to a different, previously mentioned entity. Other formats may be useful for other purposes. See website for more details.

**NB: reddit data is not included in top folders - consult README_reddit.md to add it**
**NB: Reddit data in top folders does not include the base text forms - consult README_reddit.md to add it**

* _build/ - The [GUM build bot](https://gucorpling.org/gum/build.html) and utilities for data merging and validation
* annis/ - The entire merged corpus, with all annotations, as a relANNIS 3.3 corpus dump, importable into [ANNIS](http://corpus-tools.org/annis)
* const/ - Constituent trees with function labels and PTB POS tags in the PTB bracketing format (automatic parser output from gold POS with functions projected from gold dependencies)
* coref/ - Entity and coreference annotation in two formats:
* conll/ - CoNLL shared task tabular format (with Wikification but no bridging or split antecedent annotations)
* tsv/ - WebAnno .tsv format, including entity type, salience and information status annotations, Wikification, bridging, split antecedent and singleton entities
* ontogum/ - alternative version of coreference annotation in CoNLL, tsv and CoNLL-U formats following OntoNotes guidelines (see Zhu et al. 2021)
* tsv/ - WebAnno .tsv format, including entity and information status annotations, Wikification, bridging, split antecedent and singleton entities
* dep/ - Dependency trees using Universal Dependencies, enriched with sentence types, enhanced dependencies, entities, information status, coreference, bridging, Wikification, XML markup, morphological tags and Universal POS tags according to the UD standard
* dep/ - Dependency trees using Universal Dependencies, enriched with metadata, summaries, sentence types, speaker information, enhanced dependencies, entities, information status, salience, centering, coreference, bridging, Wikification, XML markup, morphological tags and Universal POS tags according to the UD standard
* paula/ - The entire merged corpus in standoff [PAULA XML](https://github.com/korpling/paula-xml), with all annotations
* rst/ - Rhetorical Structure Theory analyses in .rs3 format as used by RSTTool and rstWeb, as well as binary and n-ary lisp trees (.dis) and an RST dependency representation (.rsd)
* xml/ - vertical XML representations with 1 token or tag per line and tab delimited lemmas and POS tags (extended VVZ style, vanilla, UPOS and CLAWS5, as well as dependency functions), compatible with the IMS Corpus Workbench (a.k.a. TreeTagger format).
* rst/ - Rhetorical Structure Theory analyses
* rstweb/ - full .rs3 format data as used by RSTTool and rstWeb (recommended)
* lisp_nary/ - n-ary lisp trees (.dis format)
* lisp_binary/ - binarized lisp trees (.dis format)
* dependencies/ - a converted RST dependency representation (.rsd format)
* disrpt/ - plain segmentation and relation-per-line data formats following the DISRPT shared task specification
* xml/ - vertical XML representations with 1 token or tag per line, metadata, summaries and tab delimited lemmas and POS tags (extended VVZ style, vanilla, UPOS and CLAWS5, as well as dependency functions), compatible with the IMS Corpus Workbench (a.k.a. TreeTagger format).
21 changes: 10 additions & 11 deletions README_reddit.md
@@ -1,25 +1,24 @@
# Data from reddit
# Data from Reddit

For one of the text types in this corpus, reddit forum discussions, plain text data is not supplied in this repository. To obtain this data, please follow the instructions below.
For one of the text types in this corpus, Reddit forum discussions, plain text data is not supplied in this repository. To obtain this data, please follow the instructions below.

## Annotations

Documents in the reddit subcorpus are named GUM_reddit_* (e.g. GUM_reddit_superman) and are *not* included in the root folder with all annotation layers. The annotations for the reddit subcorpus can be found together with all other document annotations in `_build/src/`. Token representations in these files are replaced with underscores, while the annotations themselves are included in the files. To compile the corpus including reddit data, you must obtain the underlying texts.
Documents in the Reddit subcorpus are named `GUM_reddit_*` (e.g. GUM_reddit_superman) and are included in the root folder with all annotation layers but with underscores instead of text. To compile the corpus including Reddit data, you must obtain the underlying texts, and either regenerate the files in the top level folders (works for all formats except `PAULA` and `annis`), or rebuild the corpus (see below).
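
As a rough illustration of what the restoration does (a minimal sketch only, assuming a simple tab-separated file with the token form in one column; the real `get_text.py` additionally handles `#Text=` comment lines, multiword tokens, lemma restoration and XML escaping), the underscored token column is refilled character by character from the recovered text:

```python
import io

def restore_tokens(conllu_path, recovered_text, tok_col=1):
    """Minimal sketch: refill an underscored token column from the recovered
    plain text (all token characters concatenated, without whitespace).
    Simplified relative to _build/utils/get_reddit/underscores.py: it skips
    #Text= comment lines, multiword tokens, lemmas and XML escaping."""
    with io.open(conllu_path, encoding="utf-8") as fin:
        lines = fin.read().replace("\r", "").split("\n")
    restored = []
    for line in lines:
        if "\t" in line and not line.startswith("#"):
            fields = line.split("\t")
            width = len(fields[tok_col])          # underscores preserve the token's length
            fields[tok_col] = recovered_text[:width]
            recovered_text = recovered_text[width:]
            restored.append("\t".join(fields))
        else:
            restored.append(line)                 # markup, comments and blank lines
    return "\n".join(restored)
```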

## Obtaining underlying reddit text data
## Obtaining underlying Reddit text data

To recover reddit data, use the API provided by the script `_build/process_reddit.py`. If you have your own credentials for use with the Python reddit API wrapper (praw) and Google bigquery, you should include them in two files, `praw.txt` and `key.json` in `_build/utils/get_reddit/`. For this to work, you must have the praw and bigquery libraries installed for python (e.g. via pip). You can then run `python _build/process_reddit.py` to recover the data, and proceed to the next step, re-building the corpus.
To recover Reddit data, use the API provided by the Python script `get_text.py`, which will restore text in all top-level folders except for `PAULA` and `annis`. If you do not have credentials for the Python Reddit API wrapper (praw) and Google bigquery, the script can attempt to download the data for you from a proxy. Alternatively, you can use your own praw and bigquery credentials by placing them in two files, `praw.txt` and `key.json`. For this to work, you must have the praw and bigquery libraries installed for Python (e.g. via pip).

Alternatively, if you can't use praw/bigquery, the script `_build/process_reddit.py` will offer to download the data for you by proxy. To do this, run the script and confirm that you will only use the data according to the terms and conditions determined by reddit, and for non-commercial purposes. The script will then download the data for you - if the download is successful, you can continue to the next step and re-build the corpus.
If you also require the `PAULA` and `annis` formats, you must rebuild the corpus from `_build/src/`. To do this, run `_build/process_reddit.py`, which again requires either running a proxy download or using your own credentials and placing them in `_build/utils/get_reddit/`. Once the download completes successfully, you will need to rebuild the corpus as explained in the next step.
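
For orientation, the praw part of the recovery amounts to something like the following sketch (credentials and the post ID are placeholders, and the real scripts additionally query Google bigquery):

```python
import praw

# Placeholder credentials; the build scripts read these from praw.txt
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="GUM Reddit text reconstruction")

submission = reddit.submission(id="POST_ID")  # placeholder post ID
texts = [submission.selftext]                 # body of the original post
submission.comments.replace_more(limit=0)     # expand "load more comments" stubs
texts += [comment.body for comment in submission.comments.list()]
```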

## Rebuilding the corpus with reddit data
## Rebuilding the corpus with Reddit data

To compile all projected annotations and produce all formats not included in `_build/src/`, you will need to run the GUM build bot: `python _build/build_gum.py`. This process is described in detail at https://gucorpling.org/gum/build.html, but summarized instructions follow.
To compile all projected annotations and produce all formats not included in `_build/src/`, you will need to run the GUM build bot: `python build_gum.py` in `_build/`. This process is described in detail at https://gucorpling.org/gum/build.html, but summarized instructions follow.

At a minimum, you can run `python _build/build_gum.py` with no options. This will produce basic formats in `_build/target/`, but skip generating fresh constituent parses, CLAWS5 tags and the Universal Dependencies version of the dependency data. To include these you will need:
At a minimum, you can run `build_gum.py` with no options. This will produce basic formats in `_build/target/`, but skip generating fresh constituent parses and CLAWS5 tags. To include these you will need:

* CLAWS5: use option -c and ensure that utils/paths.py points to an executable for the TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). The CLAWS 5 parameter file is already included in utils/treetagger/lib/, and tags are auto-corrected by the build bot based on gold PTB tags.
* Constituent parses: option -p; ensure that paths.py correctly points to your installation of the Stanford Parser/CoreNLP
* Universal Dependencies: option -u; ensure that paths.py points to CoreNLP, and that you have installed udapi and depedit (pip install udapi; pip install depedit). Note that this only works with Python 3.

If you run into problems building the corpus, feel free to report an issue via GitHub or contact us via e-mail.
After the build bot runs, data including `PAULA` and `annis` versions will be generated in the specified `target/` folder. If you run into problems building the corpus, feel free to report an issue via GitHub or contact us via e-mail.
90 changes: 60 additions & 30 deletions _build/utils/get_reddit/underscores.py
@@ -1,4 +1,6 @@
import os, glob, re, io, sys
from collections import defaultdict
from copy import deepcopy

PY3 = sys.version_info[0] == 3

@@ -10,43 +12,67 @@ def deunderscoring(src_folder, textdic):
make_text_const(src_folder + "const" + os.sep, textdic)


def make_text(folder, textdic, tok_col, lemma_col=None, unescape_xml=False):
def make_text(folder, textdic, tok_col, lemma_col=None, unescape_xml=False, docs2lemmas=None, docs2tokens=None):
files_to_process = glob.glob(folder + "GUM_reddit*")
print("o Processing " + str(len(files_to_process)) + " files in " + folder + "...")

lemma_dict = defaultdict(list)
token_dict = defaultdict(list)
docs2tokens_copy = deepcopy(docs2tokens)
docs2lemmas_copy = deepcopy(docs2lemmas)
for f_path in files_to_process:

with io.open(f_path, 'r', encoding='utf-8') as fin:
in_lines = fin.read().replace("\r","").split("\n")

tokens = textdic[os.path.basename(f_path)[:os.path.basename(f_path).find(".")]]
docname = os.path.basename(f_path)[:os.path.basename(f_path).find(".")]
tokens = textdic[docname]
if unescape_xml:
tokens = tokens.replace("&gt;",">").replace("&lt;","<").replace("&amp;","&")
else:
if "&" in tokens and not "&amp;" in tokens and not "_ring" in f_path:
tokens = tokens.replace("&","&amp;")
tokens = tokens.replace(">","&gt;").replace("<","&lt;")
if not PY3:
tokens = tokens.decode("utf8")
tokens = tokens.decode("utf8")

text_tokens = list(tokens)
with io.open(f_path, 'w', encoding='utf-8', newline="\n") as fout:
for i, line in enumerate(in_lines):
if line.startswith('<'):
fout.write(line+"\n")
elif line.startswith("#") and "Text=" in line or "text =" in line:
restored = [line.split("=",1)[0] + "="]
for c in line.split("=",1)[1]:
if c != " ":
restored.append(text_tokens.pop(0))
else:
restored.append(c)
fout.write("".join(restored)+"\n")
elif "\t" in line:
elements = line.split('\t')
elements[tok_col] = tokens[:len(elements[tok_col])]
tokens = tokens[len(elements[tok_col]):]
#if not unescape_xml:
# elements[tok_col] = elements[tok_col].replace("&amp;","&").replace("&","&amp;")
if lemma_col is not None:
if elements[lemma_col] == '_':
if not (elements[tok_col] in ["hearing","hind"] and "_card" in f_path): # Check known goeswith cases
elements[lemma_col] = elements[tok_col]
else:
elements[lemma_col] = "_"
elif elements[lemma_col] == "*LOWER*":
elements[lemma_col] = elements[tok_col].lower()
if not (len(elements) == 10 and len(elements[-1]) >0 and ("." in elements[0] or "-" in elements[0])):
elements[tok_col] = tokens[:len(elements[tok_col])]
token_dict[docname].append(elements[tok_col])
tokens = tokens[len(elements[tok_col]):]
#if not unescape_xml:
# elements[tok_col] = elements[tok_col].replace("&amp;","&").replace("&","&amp;")
if lemma_col is not None:
if elements[lemma_col] == '_':
if not (elements[tok_col] in ["hearing","hind"] and "_card" in f_path): # Check known goeswith cases
elements[lemma_col] = elements[tok_col]
else:
elements[lemma_col] = "_"
elif elements[lemma_col] == "*LOWER*":
elements[lemma_col] = elements[tok_col].lower()
lemma_dict[docname].append(elements[lemma_col])
if docs2lemmas is not None: # Reconstruct lemmas for conllu
if "." not in elements[0] and "-" not in elements[0]:
elements[2] = docs2lemmas_copy[docname].pop(0)
docs2tokens_copy[docname].pop(0)
elif "-" in elements[0]: # Conllu MWT
elements[1] = docs2tokens_copy[docname][0]
elements[1] += docs2tokens_copy[docname][1]
try:
fout.write('\t'.join(elements)+"\n")
except Exception as e:
@@ -58,10 +84,11 @@ def make_text(folder, textdic, tok_col, lemma_col=None, unescape_xml=False):
fout.write("\n")
else:
fout.write(unicode("\n"))
return lemma_dict, token_dict


def make_text_rst(folder, textdic):
files_to_process = glob.glob(folder + "GUM_reddit*.rs3")
def make_text_rst(folder, textdic, unescape_xml=False, extension="rs3", edu_regex=r'(.*<segment[^>]*>)(.*)(</segment>)'):
files_to_process = glob.glob(folder + "GUM_reddit*." + extension)
print("o Processing " + str(len(files_to_process)) + " files in "+folder+"...")

# Delete tokens in .xml files
@@ -70,21 +97,23 @@ def make_text_rst(folder, textdic):
tokens = textdic[os.path.basename(f_path)[:os.path.basename(f_path).find(".")]]
if not PY3:
tokens = tokens.decode("utf8")
if "&" in tokens and not "&amp;" in tokens and not "_ring" in f_path: # Some bigquery entries have no &amp;
tokens = tokens.replace("&","&amp;")
tokens = tokens.replace(">","&gt;").replace("<","&lt;") # Reddit API does not escape lt/gt, but does escape &amp;

if unescape_xml:
tokens = tokens.replace("&gt;",">").replace("&lt;","<").replace("&amp;","&")
else:
if "&" in tokens and not "&amp;" in tokens and not "_ring" in f_path: # Some bigquery entries have no &amp;
tokens = tokens.replace("&", "&amp;")
tokens = tokens.replace(">", "&gt;").replace("<","&lt;") # Reddit API does not escape lt/gt, but does escape &amp;

with io.open(f_path, 'r', encoding='utf-8') as fin:
in_lines = fin.read().replace("\r","").split("\n")

with io.open(f_path, 'w', encoding='utf-8', newline="\n") as fout:
cursor = 0
for i, line in enumerate(in_lines):
if "<segment" not in line:
if re.search(edu_regex,line) is None:
fout.write(line + "\n")
else:
m = re.search(r'(.*<segment[^>]*>)(.*)(</segment>)',line)
m = re.search(edu_regex,line)
pre = m.group(1)
seg = m.group(2)
post = m.group(3)
@@ -113,8 +142,8 @@ def underscoring(src_folder):
make_underscores_const(src_folder + "const" + os.sep)


def make_underscores_rst(folder):
files_to_process = glob.glob(folder + "GUM_reddit*.rs3")
def make_underscores_rst(folder, extension="rs3", edu_regex=r'(.*<segment[^>]*>)(.*)(</segment>)'):
files_to_process = glob.glob(folder + "GUM_reddit*." + extension)
print("o Processing " + str(len(files_to_process)) + " files in "+folder+"...")

# Delete tokens in .xml files
@@ -125,10 +154,10 @@ def make_underscores_rst(folder):

with io.open(f_path, 'w', encoding='utf-8', newline="\n") as fout:
for i, line in enumerate(in_lines):
if "<segment" not in line:
if re.search(edu_regex,line) is None:
fout.write(line + "\n")
else:
m = re.search(r'(.*<segment[^>]*>)(.*)(</segment>)',line)
m = re.search(edu_regex,line)
pre = m.group(1)
seg = m.group(2)
post = m.group(3)
@@ -156,11 +185,12 @@ def make_underscores(folder, tok_col, lemma_col=None):
for i, line in enumerate(in_lines):
if line.startswith('<'):
fout.write(line + "\n")
elif line.startswith("#Text="):
elif line.startswith("#Text=") or line.startswith("# text ="):
underscored_text = line.split("=",1)[0] + "=" + re.sub(r'[^\s]','_',line.split("=",1)[1])
if PY3:
fout.write("#Text=_" + "\n")
fout.write(underscored_text + "\n")
else:
fout.write(unicode("#Text=_" + "\n"))
fout.write(unicode(underscored_text + "\n"))
elif "\t" in line:
#line = line.replace("&amp;","&")
elements = line.split('\t')
264 changes: 264 additions & 0 deletions _build/utils/get_reddit/underscores_disrpt.py
@@ -0,0 +1,264 @@
"""
underscores_disrpt.py
Script to handle data for which the underlying text cannot be posted online (here, the Reddit documents in the DISRPT formats).
Users need to obtain the underlying Reddit text in order to restore token forms in these files.
"""

__author__ = "Amir Zeldes"
__license__ = "Apache 2.0"
__version__ = "2.0.0"

import io, re, os, sys
from glob import glob
from collections import defaultdict
script_dir = os.path.dirname(os.path.realpath(__file__)) + os.sep

PY3 = sys.version_info[0] == 3


def underscore_files(filenames):
def underscore_rel_field(text):
blanked = []
text = text.replace("<*>","❤")
for c in text:
if c!="❤" and c!=" ":
blanked.append("_")
else:
blanked.append(c)
return "".join(blanked).replace("❤","<*>")

if isinstance(filenames,str):
filenames = glob(filenames + "*.*")
for f_path in filenames:
skiplen = 0
with io.open(f_path, 'r', encoding='utf8') as fin:
lines = fin.readlines()

with io.open(f_path, 'w', encoding='utf8', newline="\n") as fout:
output = []
if f_path.endswith(".rels"):
for l, line in enumerate(lines):
line = line.strip()
if "\t" in line and l > 0:
doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label = line.split("\t")
if "GUM" in doc and "reddit" not in doc:
output.append(line)
continue
unit1_txt = underscore_rel_field(unit1_txt)
unit2_txt = underscore_rel_field(unit2_txt)
unit1_sent = underscore_rel_field(unit1_sent)
unit2_sent = underscore_rel_field(unit2_sent)
fields = doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label
line = "\t".join(fields)
output.append(line)
else:
doc = ""
for line in lines:
line = line.strip()
if line.startswith("# newdoc id"):
doc = line.split("=",maxsplit=1)[1].strip()
if "GUM" in doc and "reddit" not in doc:
output.append(line)
continue
if line.startswith("# text"):
m = re.match(r'(# text ?= ?)(.+)',line)
if m is not None:
line = m.group(1) + re.sub(r'[^\s]','_',m.group(2))
output.append(line)
elif "\t" in line:
fields = line.split("\t")
tok_col, lemma_col = fields[1:3]
if lemma_col == tok_col: # Delete lemma if identical to token
fields[2] = '_'
elif tok_col.lower() == lemma_col:
fields[2] = "*LOWER*"
if skiplen < 1:
fields[1] = len(tok_col)*'_'
else:
skiplen -=1
output.append("\t".join(fields))
if "-" in fields[0]: # Multitoken
start, end = fields[0].split("-")
start = int(start)
end = int(end)
skiplen = end - start + 1
else:
output.append(line)
fout.write('\n'.join(output) + "\n")


def restore_docs(path_to_underscores,text_dict):
def restore_range(range_string, underscored, tid_dict):
output = []
tok_ids = []
range_strings = range_string.split(",")
for r in range_strings:
if "-" in r:
s, e = r.split("-")
tok_ids += list(range(int(s),int(e)+1))
else:
tok_ids.append(int(r))

for tok in underscored.split():
if tok == "<*>":
output.append(tok)
else:
tid = tok_ids.pop(0)
output.append(tid_dict[tid])
return " ".join(output)

dep_files = glob(path_to_underscores+os.sep+"*.conllu")
tok_files = glob(path_to_underscores+os.sep+"*.tok")
rel_files = glob(path_to_underscores+os.sep+"*.rels")
skiplen = 0
token_dict = {}
tid2string = defaultdict(dict)
for file_ in dep_files + tok_files + rel_files:
lines = io.open(file_,encoding="utf8").readlines()
underscore_len = 0 # Must match doc_len at end of file processing
doc_len = 0
if file_.endswith(".rels"):
output = []
violation_rows = []
for l, line in enumerate(lines):
line = line.strip()
if l > 0 and "\t" in line:
fields = line.split("\t")
docname = fields[0]
text = text_dict[docname]
if "GUM_" in docname and "reddit" not in docname: # Only Reddit documents need reconstruction in GUM
output.append(line)
continue
doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label = line.split("\t")
underscore_len += unit1_txt.count("_") + unit2_txt.count("_") + unit1_sent.count("_") + unit2_sent.count("_")
if underscore_len == 0:
#sys.stderr.write("! Non-underscored file detected - " + os.path.basename(file_) + "\n")
print("! DISRPT format alreadt restored in " + os.path.basename(file_) + "\n")
sys.exit(0)
unit1_txt = restore_range(unit1_toks, unit1_txt, tid2string[docname])
unit2_txt = restore_range(unit2_toks, unit2_txt, tid2string[docname])
unit1_sent = restore_range(s1_toks, unit1_sent, tid2string[docname])
unit2_sent = restore_range(s2_toks, unit2_sent, tid2string[docname])
plain = unit1_txt + unit2_txt + unit1_sent + unit2_sent
plain = plain.replace("<*>","").replace(" ","")
doc_len += len(plain)
fields = doc, unit1_toks, unit2_toks, unit1_txt, unit2_txt, s1_toks, s2_toks, unit1_sent, unit2_sent, direction, orig_label, label
line = "\t".join(fields)
if doc_len != underscore_len and len(violation_rows) == 0:
violation_rows.append(str(l) + ": " + line)
output.append(line)

else:
tokfile = True if ".tok" in file_ else False
output = []
parse_text = ""
docname = ""
for line in lines:
line = line.strip()
if "# newdoc id " in line:
tid = 0
if parse_text !="":
if not tokfile:
token_dict[docname] = parse_text
parse_text = ""
docname = re.search(r'# newdoc id ?= ?([^\s]+)',line).group(1)
if "GUM" in docname and "reddit" not in docname:
output.append(line)
continue
if docname not in text_dict:
raise IOError("! Text for document name " + docname + " not found.\n Please check that your LDC data contains the file for this document.\n")
if ".tok" in file_:
if docname not in token_dict: # Fetch continuous token string from conllu
parse_conllu = open(os.sep.join([script_dir,"..","..","..","dep",docname + ".conllu"])).read()
toks = [l.split("\t") for l in parse_conllu.split("\n") if "\t" in l]
toks = [l[1] for l in toks if "-" not in l[0] and "." not in l[0]]
toks = "".join(toks)
token_dict[docname] = toks
text = token_dict[docname]
else:
text = text_dict[docname]
doc_len = len(text)
underscore_len = 0

if "GUM" in docname and "reddit" not in docname:
output.append(line)
continue

if line.startswith("# text"):
m = re.match(r'(# ?text ?= ?)(.+)',line)
if m is not None:
i = 0
sent_text = ""
for char in m.group(2).strip():
if char != " ":
try:
sent_text += text[i]
except:
raise IOError("Can't fix")
i+=1
else:
sent_text += " "
line = m.group(1) + sent_text
output.append(line)
elif "\t" in line:
fields = line.split("\t")
if skiplen < 1:
underscore_len += len(fields[1])
fields[1] = text[:len(fields[1])]
if not "-" in fields[0] and not "." in fields[0]:
parse_text += fields[1]
tid += 1
tid2string[docname][tid] = fields[1]
if not tokfile:
if fields[2] == '_' and not "-" in fields[0] and not "." in fields[0]:
fields[2] = fields[1]
elif fields[2] == "*LOWER*":
fields[2] = fields[1].lower()
if skiplen < 1:
text = text[len(fields[1]):]
else:
skiplen -=1
output.append("\t".join(fields))
if "-" in fields[0]: # Multitoken
start, end = fields[0].split("-")
start = int(start)
end = int(end)
skiplen = end - start + 1
else:
output.append(line)

if not doc_len == underscore_len:
if ".rels" in file_:
sys.stderr.write(
"\n! Tried to restore file " + os.path.basename(file_) + " but source text has different length than tokens in shared task file:\n" + \
" Source text in data/: " + str(doc_len) + " non-whitespace characters\n" + \
" Token underscores in " + file_ + ": " + str(underscore_len) + " non-whitespace characters\n" + \
" Violation row: " + violation_rows[0])
else:
sys.stderr.write("\n! Tried to restore document " + docname + " but source text has different length than tokens in shared task file:\n" + \
" Source text in data/: " + str(doc_len) + " non-whitespace characters\n" + \
" Token underscores in " + file_+": " + str(underscore_len) + " non-whitespace characters\n")
with io.open("debug.txt",'w',encoding="utf8") as f:
f.write(text_dict[docname])
f.write("\n\n\n")
f.write(parse_text)
sys.exit(0)

if not tokfile and parse_text != "":
token_dict[docname] = parse_text

with io.open(file_, 'w', encoding='utf8', newline="\n") as fout:
fout.write("\n".join(output) + "\n")

print("o Restored text for DISRPT format in " + \
#str(len(dep_files)) + " .conllu files, " + \
str(len(tok_files)) + " .tok files and "+ str(len(rel_files)) + " .rels files\n")






3 changes: 3 additions & 0 deletions annis/README.md
@@ -0,0 +1,3 @@
# GUM corpus - ANNIS version (without Reddit)

The zip file in this directory can be imported for search and visualization into [ANNIS](https://corpus-tools.org/annis/). Note that this version of the corpus does not include the Reddit subcorpus of GUM. To compile an ANNIS version of the corpus including the Reddit subcorpus, please see [README_reddit.md](https://github.com/amir-zeldes/gum/blob/master/README_reddit.md).
631 changes: 631 additions & 0 deletions get_text.py

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions paula/README.md
@@ -0,0 +1,3 @@
# GUM corpus - PAULA XML version (without Reddit)

The zip file in this directory contains a complete version of all annotations in [PAULA standoff XML](https://github.com/korpling/paula-xml). Note, however, that this version of the corpus does not include the Reddit subcorpus of GUM. To compile a PAULA version of the entire corpus including the Reddit subcorpus, please see [README_reddit.md](https://github.com/amir-zeldes/gum/blob/master/README_reddit.md).
